Building a Real-Time Data Pipeline: Apache Kafka at LinkedIn

Amy W. Tang

This talk was given by Joel Koshy (Senior Software Engineer at LinkedIn) at the Hadoop Summit (June 2013).

Building a Real-Time Data Pipeline:
Apache Kafka at LinkedIn
Hadoop Summit 2013
Joel Koshy
June 2013
LinkedIn Corporation ©2013 All Rights Reserved

Network update stream

We have a lot of data.
We want to leverage this data to build products.
Data pipeline

People you may know

System and application metrics/logging

How do we integrate this variety of data
and make it available to all these systems?

Point-to-point pipelines

LinkedIn’s user activity data pipeline (circa 2010)

Four key ideas
1. Central data pipeline
2. Push data cleanliness upstream
3. O(1) ETL
4. Evidence-based correctness

Central data pipeline
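
One way to see the appeal of a central pipeline over point-to-point pipelines is to count connections. The sketch below is an illustration rather than anything shown in the talk: it assumes the worst case where every producing system feeds every consuming system directly, and the function names are hypothetical.

```python
def point_to_point_links(producers: int, consumers: int) -> int:
    """Worst case: every producing system feeds every consuming system directly."""
    return producers * consumers


def central_pipeline_links(producers: int, consumers: int) -> int:
    """Each system connects exactly once, to the central pipeline."""
    return producers + consumers


if __name__ == "__main__":
    # Illustrative system counts only; these are not figures from the talk.
    for n in (5, 10, 50):
        print(n, point_to_point_links(n, n), central_pipeline_links(n, n))
```

Under that assumption the point-to-point count grows with the product of the two sides, while the central design grows with their sum.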

First attempt: don’t re-invent the wheel

Second attempt: re-invent the wheel!

Use a central commit log

What is a commit log?

The log as a messaging system
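
The slides for this part are diagrams, so here is a minimal in-memory sketch of the general idea, not Kafka's implementation: a commit log is an append-only sequence in which each record gets the next offset, and it can act as a messaging system when every consumer tracks its own read position. The class names `CommitLog` and `Consumer` are hypothetical.

```python
from dataclasses import dataclass, field
from typing import Any, List


@dataclass
class CommitLog:
    """Append-only sequence of records; each record is assigned the next offset."""
    records: List[Any] = field(default_factory=list)

    def append(self, record: Any) -> int:
        self.records.append(record)
        return len(self.records) - 1  # offset of the record just written

    def read(self, offset: int, max_records: int = 10) -> List[Any]:
        # Reading never removes anything, so any number of readers can coexist.
        return self.records[offset:offset + max_records]


@dataclass
class Consumer:
    """Keeps its own offset, so consumers progress independently of each other."""
    log: CommitLog
    offset: int = 0

    def poll(self, max_records: int = 10) -> List[Any]:
        batch = self.log.read(self.offset, max_records)
        self.offset += len(batch)
        return batch


if __name__ == "__main__":
    log = CommitLog()
    for event in ("page_view", "search", "profile_view"):
        log.append(event)
    fast, slow = Consumer(log), Consumer(log)
    print(fast.poll())   # reads everything written so far
    print(slow.poll(1))  # reads one record and can catch up later
```

Because reads do not consume records, a slow or newly added consumer sees the same ordered data as a fast one.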

Apache Kafka

Usage at LinkedIn
• 16 brokers in each cluster
• 28 billion messages/day
• Peak rates
  – Writes: 460,000 messages/second
  – Reads: 2,300,000 messages/second
• ~700 topics
• 40-50 live services consuming user-activity data
• Many ad hoc consumers
• Every production service is a producer (for metrics)
• 10k connections/colo
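
As a rough sanity check on these figures, the daily volume can be converted to an average per-second rate and compared with the stated peaks. This is a back-of-the-envelope sketch; it assumes the 28 billion messages/day figure counts messages written, which the slide does not state explicitly.

```python
SECONDS_PER_DAY = 24 * 60 * 60  # 86,400

# Figures taken from the slide.
messages_per_day = 28_000_000_000
peak_writes_per_sec = 460_000
peak_reads_per_sec = 2_300_000

# Assumption: the daily figure counts written messages.
avg_writes_per_sec = messages_per_day / SECONDS_PER_DAY
peak_to_avg_writes = peak_writes_per_sec / avg_writes_per_sec
peak_read_to_write = peak_reads_per_sec / peak_writes_per_sec

print(f"average write rate: {avg_writes_per_sec:,.0f} messages/second")
print(f"peak/average writes: {peak_to_avg_writes:.2f}")
print(f"peak reads per peak write: {peak_read_to_write:.1f}")
```

Under that assumption the average is roughly 324,000 messages/second, the write peak is about 1.4 times the average, and peak reads are 5 times peak writes, which is consistent with each message being read by several consumers.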

Standardize on Avro in data pipeline
{
  "type": "record",
  "name": "URIValidationRequestEvent",
  "namespace": "com.linkedin.event.usv",
  "fields": [
    {
      "name": "header",
      "type": {
        "type": "record",
        "name": "TrackingEventHeader",
        "namespace": "com.linkedin.event",
        "fields": [
          {
            "name": "memberId",
            "type": "int",
            "doc": "The member id of the user initiating the action"
          },
          {
            "name": "timeMs",
            "type": "long",
            "doc": "The time of the event"
          },
          {
            "name": "host",
            "type": "string",
            ...
...
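
To illustrate what a shared schema buys, the sketch below checks a header record against the three header fields visible on the slide. It is a standard-library illustration, not the Avro library's API and not LinkedIn's tooling; the rest of the schema is elided on the slide and is not reconstructed here, and `check_header` and the sample values are hypothetical.

```python
# Only the header fields visible on the slide; the remainder is elided there.
VISIBLE_HEADER_FIELDS = {
    "memberId": "int",
    "timeMs": "long",
    "host": "string",
}

# Avro "int" is a 32-bit signed integer and "long" is a 64-bit signed integer.
INTEGER_RANGES = {
    "int": (-2**31, 2**31 - 1),
    "long": (-2**63, 2**63 - 1),
}


def check_header(header: dict) -> list:
    """Return a list of problems; an empty list means the visible fields conform."""
    problems = []
    for name, avro_type in VISIBLE_HEADER_FIELDS.items():
        if name not in header:
            problems.append(f"missing field: {name}")
            continue
        value = header[name]
        if avro_type == "string":
            if not isinstance(value, str):
                problems.append(f"{name}: expected string")
        else:
            low, high = INTEGER_RANGES[avro_type]
            # bool is a subclass of int in Python, so reject it explicitly.
            if isinstance(value, bool) or not isinstance(value, int):
                problems.append(f"{name}: expected {avro_type}")
            elif not low <= value <= high:
                problems.append(f"{name}: out of range for {avro_type}")
    return problems


if __name__ == "__main__":
    sample = {"memberId": 42, "timeMs": 1372359000000, "host": "app01"}
    print(check_header(sample))              # no problems
    print(check_header({"memberId": "42"}))  # wrong type plus missing fields
```

In a real pipeline this kind of check would come from an Avro library validating records against the full schema at the producer, which is one way to read "push data cleanliness upstream".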

Hadoop data load (Camus)
• Open sourced:
  – https://github.com/linkedin/camus
• One job loads all events
• ~10 minute ETA on average from producer to HDFS
• Hive registration done automatically
• Schema evolution handled transparently

Does it work?
“All published messages must be delivered to all consumers (quickly)”

Audit Trail

Kafka replication (0.8)
• Intra-cluster replication feature
  – Facilitates high availability and durability
• Beta release available
  https://dist.apache.org/repos/dist/release/kafka/
• Rolled out in production at LinkedIn last week

Join us at our user-group meeting tonight @ LinkedIn!
– Thursday, June 27, 7:30pm to 9:30pm
– 2025 Stierlin Ct., Mountain View, CA
– http://www.meetup.com/http-kafka-apache-org/events/125887332/
– Presentations (replication overview and use-case studies) from:
  • RichRelevance
  • Netflix
  • Square
  • LinkedIn
