Case Studies

Podcast Episode: #065 Data Engineering At CERN Case Study
How is CERN doing Data Engineering? They must get huge amounts of data from the Large Hadron Collider. Let’s check it out.
Watch on YouTube \ Listen on Anchor

Slides:

https://en.wikipedia.org/wiki/Large_Hadron_Collider

http://www.lhc-facts.ch/index.php?page=datenverarbeitung

https://www.slideshare.net/SparkSummit/next-cern-accelerator-logging-service-with-jakub-wozniak

https://databricks.com/session/the-architecture-of-the-next-cern-accelerator-logging-service

http://opendata.cern.ch

https://gobblin.apache.org

https://www.slideshare.net/databricks/cerns-next-generation-data-analysis-platform-with-apache-spark-with-enric-tejedor

https://www.slideshare.net/SparkSummit/realtime-detection-of-anomalies-in-the-database-infrastructure-using-apache-spark-with-daniel-lanza-and-prasanth-kothuri

Data Science at Disney

https://medium.com/disney-streaming/delivering-data-in-real-time-via-auto-scaling-kinesis-streams-72a0236b2cd9

Data Science at DLR

https://www.unibw.de/code/events-u/jt-2018-workshops/ws3_bigdata_vortrag_bamler.pdf

Data Science at Drivetribe

https://berlin-2017.flink-forward.org/kb_sessions/drivetribes-kappa-architecture-with-apache-flink/

https://www.slideshare.net/FlinkForward/flink-forward-berlin-2017-aris-kyriakos-koliopoulos-drivetribes-kappa-architecture-with-apache-flink

Data Science at Dropbox

https://blogs.dropbox.com/tech/2019/01/finding-kafkas-throughput-limit-in-dropbox-infrastructure/

Data Science at eBay

https://www.slideshare.net/databricks/moving-ebays-data-warehouse-over-to-apache-spark-spark-as-core-etl-platform-at-ebay-with-kim-curtis-and-brian-knauss

https://www.slideshare.net/databricks/analytical-dbms-to-apache-spark-auto-migration-framework-with-edward-zhang-and-lipeng-zhu

Data Science at Expedia

https://www.slideshare.net/BrandonOBrien/spark-streaming-kafka-best-practices-w-brandon-obrien

https://www.slideshare.net/Naveen1914/brandon-obrien-streamingdata

Data Science at Facebook

https://code.fb.com/core-data/apache-spark-scale-a-60-tb-production-use-case/

Data Science at Google

http://www.unofficialgoogledatascience.com/
https://ai.google/research/teams/ai-fundamentals-applications/
https://cloud.google.com/solutions/big-data/
https://datafloq.com/read/google-applies-big-data-infographic/385

Data Science at Grammarly

https://www.slideshare.net/databricks/building-a-versatile-analytics-pipeline-on-top-of-apache-spark-with-mikhail-chernetsov

Data Science at ING Fraud

https://sf-2017.flink-forward.org/kb_sessions/streaming-models-how-ing-adds-models-at-runtime-to-catch-fraudsters/

Data Science at Instagram

https://www.slideshare.net/SparkSummit/lessons-learned-developing-and-managing-massive-300tb-apache-spark-pipelines-in-production-with-brandon-carl

Data Science at LinkedIn

Podcast Episode: #073 Data Engineering At LinkedIn Case Study
Let’s check out how LinkedIn is processing data :)
Watch on YouTube \ Listen on Anchor

Slides:

https://engineering.linkedin.com/teams/data#0

https://www.slideshare.net/yaelgarten/building-a-healthy-data-ecosystem-around-kafka-and-hadoop-lessons-learned-at-linkedin

https://thirdeye.readthedocs.io/en/latest/about.html

http://samza.apache.org

https://www.slideshare.net/ConfluentInc/more-data-more-problems-scaling-kafkamirroring-pipelines-at-linkedin?ref=https://www.confluent.io/kafka-summit-sf18/more_data_more_problems

https://www.slideshare.net/KhaiTran17/conquering-the-lambda-architecture-in-linkedin-metrics-platform-with-apache-calcite-and-apache-samza

https://www.slideshare.net/Hadoop_Summit/unified-batch-stream-processing-with-apache-samza

http://druid.io/docs/latest/design/index.html

Data Science at Lyft

https://eng.lyft.com/running-apache-airflow-at-lyft-6e53bb8fccff

Data Science at NASA

Podcast Episode: #067 Data Engineering At NASA Case Study
A look into how NASA is doing data engineering.
Watch on YouTube \ Listen on Anchor

Slides:

https://esip.figshare.com/articles/Apache_Science_Data_Analytics_Platform/5786421

http://www.socallinuxexpo.org/sites/default/files/presentations/OnSightCloudArchitecture-scale14x.pdf

https://www.slideshare.net/SparkSummit/spark-at-nasajplchris-mattmann?qid=90968554-288e-454a-b63a-21a45cfc897d&v=&b=&from_search=4

https://en.m.wikipedia.org/wiki/Hierarchical_Data_Format

Data Science at Netflix

Podcast Episode: #062 Data Engineering At Netflix Case Study
How Netflix is doing Data Engineering using their Keystone platform.
Watch on YouTube \ Listen on Anchor

Netflix revolutionized how we watch movies and TV. Currently over 75 million users watch 125 million hours of Netflix content every day!

Netflix's revenue comes from a monthly subscription service. So, the goal for Netflix is to keep you subscribed and to get new subscribers.

To achieve this, Netflix is licensing movies from studios as well as creating its own original movies and TV series.

But offering new content is not everything. It is also very important to keep you watching content that already exists.

To be able to recommend content to you, Netflix collects data from users. And it collects a lot.

Currently, Netflix analyses about 500 billion user events per day. That results in a stunning 1.3 Petabytes every day.
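To get a feeling for these numbers, here is a small back-of-the-envelope calculation. It only uses the two figures quoted above and assumes decimal units (1 petabyte = 10^15 bytes) and an even spread of events over the day, which is a simplification:

```python
# Back-of-the-envelope numbers derived from the two figures quoted above.
EVENTS_PER_DAY = 500e9           # about 500 billion user events per day
BYTES_PER_DAY = 1.3e15           # about 1.3 petabytes per day (assumes 1 PB = 10**15 bytes)
SECONDS_PER_DAY = 24 * 60 * 60

# Assumption: events are spread evenly over the day (real traffic has peaks).
events_per_second = EVENTS_PER_DAY / SECONDS_PER_DAY
bytes_per_event = BYTES_PER_DAY / EVENTS_PER_DAY
gigabytes_per_second = BYTES_PER_DAY / SECONDS_PER_DAY / 1e9

print(f"{events_per_second:,.0f} events per second")
print(f"{bytes_per_event:,.0f} bytes per event")
print(f"{gigabytes_per_second:.1f} GB per second")
```

That works out to roughly 5.8 million events per second on average, with each event weighing in at about 2.6 kilobytes.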

All this data allows Netflix to build recommender systems for you. The recommenders are showing you content that you might like, based on your viewing habits, or what is currently trending.

The Netflix batch processing pipeline

When Netflix started out, they had a very simple batch processing system architecture.

The key components were Chukwa, a scalable data collection system, Amazon S3 and Elastic MapReduce.


Chukwa wrote incoming messages into Hadoop sequence files, stored in Amazon S3. These files could then be analysed by Elastic MapReduce jobs.

Jobs were executed regularly on a daily and hourly basis. As a result, Netflix could learn how people used the services every hour or once a day.
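The idea of hourly and daily batch jobs can be sketched in a few lines of plain Python. This is only an illustration of the aggregation step, not Netflix's actual MapReduce code. The event layout (timestamp plus title) and the sample data are made up:

```python
from collections import Counter
from datetime import datetime

# Hypothetical event records: (ISO timestamp, title). The layout is an assumption.
events = [
    ("2016-01-09T08:05:00", "House of Cards"),
    ("2016-01-09T08:40:00", "House of Cards"),
    ("2016-01-09T09:10:00", "Some Other Show"),
    ("2016-01-10T08:15:00", "Some Other Show"),
]

def aggregate(events, bucket_format):
    """Count plays per (time bucket, title); bucket_format is a strftime pattern."""
    counts = Counter()
    for timestamp, title in events:
        bucket = datetime.fromisoformat(timestamp).strftime(bucket_format)
        counts[(bucket, title)] += 1
    return counts

# The same events, viewed once per hour and once per day.
hourly = aggregate(events, "%Y-%m-%d %H:00")
daily = aggregate(events, "%Y-%m-%d")

for (bucket, title), plays in sorted(daily.items()):
    print(bucket, title, plays)
```

A real batch job would read the stored files instead of an in-memory list, but the principle is the same: collect first, then aggregate over a fixed time window.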

Know what customers want:

Because you are looking at the big picture, you can create new products. Netflix uses insight from big data to create new TV shows and movies.

They created House of Cards based on data. There is a very interesting TED talk about this you should watch:

How to use data to make a hit TV show | Sebastian Wernicke

Batch processing also helps Netflix to know the exact episode of a TV show that gets you hooked. Not only globally but for every country where Netflix is available.

Check out the article from The Verge.

They know exactly what show works in what country and what show does not.

It helps them create shows that work everywhere or select the shows to license in different countries. Germany, for instance, does not have the full library that Americans have :(

We have to put up with only a small portion of TV shows and movies. If you have to select, why not select those that work best?

Batch processing is not enough

As a data platform for generating insight, the Chukwa pipeline was a good start. It is very important to be able to create hourly and daily aggregated views of user behavior.

To this day Netflix is still doing a lot of batch processing jobs.

The only problem is: With batch processing you are basically looking into the past.

For Netflix, and data-driven companies in general, looking into the past is not enough. They want a live view of what is happening.

The trending now feature

One of the newer Netflix features is "Trending Now". To the average user it looks as if "Trending Now" simply means currently most watched.

This is what I get displayed as trending while I am writing this on a Saturday morning at 8:00 in Germany. But it is so much more.

What is currently being watched is only a part of the data that is used to generate "Trending Now".


"Trending now" is created based on two types of data sources: Play events and Impression events.

What messages those two types actually include is not really communicated by Netflix. I did some research on the Netflix Techblog and this is what I found out:

Play events include what title you watched last, where you stopped watching, where you used the 30s rewind and others. Impression events are collected as you browse the Netflix library: scrolling up and down, scrolling left or right, clicking on a movie and so on.

Basically, play events log what you do while you are watching. Impression events capture what you do on Netflix while you are not watching something.
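To make the difference between the two event types more tangible, here is how they could be modelled as simple data structures. Netflix does not publish the actual message formats, so every field name and action value below is a guess for illustration only:

```python
from dataclasses import dataclass
from typing import Optional

# All field names and action values are hypothetical; the real formats are not public.

@dataclass(frozen=True)
class PlayEvent:
    """Something the user does while watching a title."""
    user_id: str
    title: str
    action: str              # e.g. "stop" or "rewind_30s"
    position_seconds: int    # where in the title the action happened

@dataclass(frozen=True)
class ImpressionEvent:
    """Something the user does while browsing the library."""
    user_id: str
    action: str                   # e.g. "scroll_down", "scroll_right", "click"
    title: Optional[str] = None   # only set when the action targets a title

def is_watching(event) -> bool:
    """Play events are logged while watching, impression events while browsing."""
    return isinstance(event, PlayEvent)

stop = PlayEvent(user_id="u1", title="House of Cards", action="stop", position_seconds=1260)
scroll = ImpressionEvent(user_id="u1", action="scroll_down")

print(is_watching(stop), is_watching(scroll))
```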

Netflix real-time streaming architecture

Netflix uses three internet-facing services to exchange data with the client's browser or mobile app. These services are simple Apache Tomcat-based web services.

The service for receiving play events is called "Viewing History". Impression events are collected with the "Beacon" service.

The "Recommender Service" makes recommendations, based on trend data, available to clients.

Messages from the Beacon and Viewing History services are put into Apache Kafka. It acts as a buffer between the data services and the analytics.

Beacon and Viewing History publish messages to Kafka topics. The analytics subscribes to the topics and gets the messages delivered automatically in a first-in, first-out fashion.
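The buffering idea can be shown with a tiny in-memory stand-in. To be clear, this is not the Kafka client API, just a toy class with made-up names (`InMemoryBroker`, `publish`, `poll`) and made-up topic names that mimics the publish and subscribe pattern described above:

```python
from collections import defaultdict, deque

class InMemoryBroker:
    """Toy stand-in for a message broker. This is NOT the Kafka client API."""

    def __init__(self):
        # One first-in, first-out queue per topic.
        self._topics = defaultdict(deque)

    def publish(self, topic, message):
        """Append a message to the end of the topic."""
        self._topics[topic].append(message)

    def poll(self, topic):
        """Return the oldest unread message, or None if the topic is empty."""
        queue = self._topics[topic]
        return queue.popleft() if queue else None

broker = InMemoryBroker()

# The data services publish (topic names are hypothetical).
broker.publish("play-events", {"title": "House of Cards", "action": "stop"})
broker.publish("play-events", {"title": "House of Cards", "action": "rewind_30s"})
broker.publish("impression-events", {"action": "scroll_down"})

# The analytics consumes in the order the messages were published.
first = broker.poll("play-events")
second = broker.poll("play-events")
print(first["action"], second["action"])
```

One difference worth knowing: the toy removes a message once it is read. Kafka keeps messages for a configured retention time and each consumer tracks its own read position, and ordering is guaranteed per partition rather than across a whole topic.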

After the analytics, the workflow is straightforward. The trending data is stored in a Cassandra key-value store. The recommender service has access to Cassandra and makes the data available to the Netflix client.
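The hand-over from analytics to the recommender service could look roughly like the sketch below. A plain Python dictionary stands in for Cassandra, and ranking titles by play count per country is purely my assumption, because the real algorithm is not public. The key format and function names are made up as well:

```python
from collections import Counter

# Stand-in for the Cassandra store: key -> value. A real setup would use a database.
trending_store = {}

def run_analytics(play_events, top_n=2):
    """Toy analytics step: rank titles per country by play count (an assumption)."""
    per_country = {}
    for event in play_events:
        counts = per_country.setdefault(event["country"], Counter())
        counts[event["title"]] += 1
    for country, counts in per_country.items():
        # Hypothetical key format "trending:<country>".
        trending_store[f"trending:{country}"] = [
            title for title, _ in counts.most_common(top_n)
        ]

def recommend(country):
    """Toy recommender: only reads what the analytics step has stored."""
    return trending_store.get(f"trending:{country}", [])

run_analytics([
    {"country": "DE", "title": "Title A"},
    {"country": "DE", "title": "Title A"},
    {"country": "DE", "title": "Title B"},
    {"country": "US", "title": "Title C"},
])

print(recommend("DE"))
print(recommend("US"))
```

The point of the design is the decoupling: the recommender never touches the raw events. It only reads the precomputed result, so it stays fast no matter how heavy the analytics is.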


How the analytics system processes all this data is not known to the public. The algorithms are a trade secret of Netflix.

What is known is the analytics tool they use. Back in Feb 2015 they wrote in the tech blog that they use a custom-made tool.

They also stated that Netflix is going to replace the custom-made analytics tool with Apache Spark Streaming in the future. My guess is that they made the switch to Spark some time ago, because their post is more than a year old.

Data Science at OLX

Podcast Episode: #083 Data Engineering at OLX Case Study
This podcast is a case study about OLX with Senior Data Scientist Alexey Grigorev as guest. It was super fun.
Watch on YouTube \ Listen on Anchor

Slides:

https://www.slideshare.net/mobile/AlexeyGrigorev/image-models-infrastructure-at-olx

Data Science at OTTO

https://www.slideshare.net/SparkSummit/spark-summit-eu-talk-by-sebastian-schroeder-and-ralf-sigmund

Data Science at PayPal

https://www.paypal-engineering.com/tag/data/

Data Science at Pinterest

Podcast Episode: #069 Engineering Culture At Pinterest
In this podcast we look into data platform and processing at Pinterest.
Watch on YouTube \ Listen on Anchor

Slides:

https://www.slideshare.net/ConfluentInc/pinterests-story-of-streaming-hundreds-of-terabytes-of-pins-from-mysql-to-s3hadoop-continuously?ref=https://www.confluent.io/kafka-summit-sf18/pinterests-story-of-streaming-hundreds-of-terabytes

https://www.slideshare.net/ConfluentInc/building-pinterest-realtime-ads-platform-using-kafka-streams?ref=https://www.confluent.io/kafka-summit-sf18/building-pinterest-real-time-ads-platform-using-kafka-streams

https://medium.com/@Pinterest_Engineering/building-a-real-time-user-action-counting-system-for-ads-88a60d9c9a

https://medium.com/pinterest-engineering/goku-building-a-scalable-and-high-performant-time-series-database-system-a8ff5758a181

https://medium.com/pinterest-engineering/building-a-dynamic-and-responsive-pinterest-7d410e99f0a9

https://medium.com/@Pinterest_Engineering/building-pin-stats-25ec8460e924

https://medium.com/@Pinterest_Engineering/improving-hbase-backup-efficiency-at-pinterest-86159da4b954

https://medium.com/@Pinterest_Engineering/pinterest-joins-the-cloud-native-computing-foundation-e3b3e66cb4f

https://medium.com/@Pinterest_Engineering/using-kafka-streams-api-for-predictive-budgeting-9f58d206c996

https://medium.com/@Pinterest_Engineering/auto-scaling-pinterest-df1d2beb4d64

Data Science at Salesforce

https://engineering.salesforce.com/building-a-scalable-event-pipeline-with-heroku-and-salesforce-2549cb20ce06

Data Science at Siemens Mindsphere

Podcast Episode: #059 What Is The Siemens Mindsphere IoT Platform?
The Internet of Things is a huge deal. There are many platforms available. But which one is actually good? Join me on a 50-minute dive into the Siemens Mindsphere online documentation. I have to say I was super unimpressed by what I found. Many limitations, unclear architecture and no pricing available? Not good!
Watch on YouTube \ Listen on Anchor

Data Science at Slack

https://speakerdeck.com/vananth22/streaming-data-pipelines-at-slack

Data Science at Spotify

Podcast Episode: #071 Data Engineering At Spotify Case Study
In this episode we are looking at data engineering at Spotify, my favorite music streaming service. How do they process all that data?
Watch on YouTube \ Listen on Anchor

Slides:

https://labs.spotify.com/2016/02/25/spotifys-event-delivery-the-road-to-the-cloud-part-i/

https://labs.spotify.com/2016/03/03/spotifys-event-delivery-the-road-to-the-cloud-part-ii/

https://labs.spotify.com/2016/03/10/spotifys-event-delivery-the-road-to-the-cloud-part-iii/

https://www.slideshare.net/InfoQ/scaling-the-data-infrastructure-spotify

https://www.datanami.com/2018/05/16/big-data-file-formats-demystified/

https://labs.spotify.com/2017/04/26/reliable-export-of-cloud-pubsub-streams-to-cloud-storage/

https://labs.spotify.com/2017/11/20/autoscaling-pub-sub-consumers/

Data Science at Symantec

https://www.slideshare.net/planetcassandra/symantec-cassandra-data-modelling-techniques-in-action

Data Science at Tinder

https://www.slideshare.net/databricks/scalable-monitoring-using-apache-spark-and-friends-with-utkarsh-bhatnagar

Data Science at Twitter

Podcast Episode: #072 Data Engineering At Twitter Case Study
How is Twitter doing data engineering? Oh man, they have a lot of cool things to share about these tweets.
Watch on YouTube \ Listen on Anchor

Slides:

https://www.slideshare.net/sawjd/real-time-processing-using-twitter-heron-by-karthik-ramasamy

https://www.slideshare.net/sawjd/big-data-day-la-2016-big-data-track-twitter-heron-scale-karthik-ramasamy-engineering-manager-twitter

https://techjury.net/stats-about/twitter/

https://developer.twitter.com/en/docs/tweets/post-and-engage/overview

https://www.slideshare.net/prasadwagle/extracting-insights-from-data-at-twitter

https://blog.twitter.com/engineering/en_us/topics/insights/2018/twitters-kafka-adoption-story.html

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2017/the-infrastructure-behind-twitter-scale.html

https://blog.twitter.com/engineering/en_us/topics/infrastructure/2019/the-start-of-a-journey-into-the-cloud.html

https://www.slideshare.net/billonahill/twitter-heron-in-practice

https://medium.com/@kramasamy/introduction-to-apache-heron-c64f8c7c0956

https://www.youtube.com/watch?v=3QHGhnHx5HQ

https://hbase.apache.org

https://db-engines.com/en/system/Amazon+DynamoDB%3BCassandra%3BGoogle+Cloud+Bigtable%3BHBase

Data Science at Uber

https://eng.uber.com/uber-big-data-platform/

https://eng.uber.com/aresdb/

https://www.uber.com/us/en/uberai/

Data Science at Upwork

https://www.slideshare.net/databricks/how-to-rebuild-an-endtoend-ml-pipeline-with-databricks-and-upwork-with-thanh-tran

Data Science at Woot

https://aws.amazon.com/de/blogs/big-data/our-data-lake-story-how-woot-com-built-a-serverless-data-lake-on-aws/

Data Science at Zalando

Podcast Episode: #087 Data Engineering At Zalando Case Study Talk
I had a great conversation about data engineering for online retailing with Michal Gancarski and Max Schultze. They showed Zalando’s data platform and how they build data pipelines. Super interesting especially for AWS users.
Watch on YouTube