Data Infrastructure at LinkedIn
Kapil Surlaker

http://www.linkedin.com/in/kapilsurlaker
@kapilsurlaker




Outline

• LinkedIn Products
• Data Ecosystem
• LinkedIn Data Infrastructure Solutions
• Next Play

LinkedIn By The Numbers

• 150M+ users*
• ~4.2B People Searches in 2011**
• >2M companies with LinkedIn Company Pages**
• 16 languages
• 75% of Fortune 100 Companies use LinkedIn to hire***

* As of February 9th 2012
** As of December 31st 2011
*** As of September 30th 2011

Broad Range of Products & Services




User Profiles
                 Large dataset
                Medium writes
                Very high reads
                Freshness <1s




Communications
                 Large dataset
                   High writes
                   High reads
                 Freshness <1s




People You May Know
                        Large dataset
                      Compute intensive
                         High reads
                       Freshness ~hrs




LinkedIn Today
                 Moving dataset
                 High writes
                 High reads
                 Freshness ~mins

Three Paradigms: Simplifying the Data Continuum

Online: activity that should be reflected immediately
• Member Profiles
• Company Profiles
• Connections
• Communications

Nearline: activity that should be reflected soon
• LinkedIn Today
• Profile Standardization
• News
• Recommendations
• Search
• Communications

Offline: activity that can be reflected later
• People You May Know
• Connection Strength
• News
• Recommendations
• Next best idea

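
One way to read the three paradigms is as a mapping from a product's freshness target to a processing tier. The sketch below is only an illustration: the cut-off values and the function name `paradigm_for` are assumptions, since the slides say no more than "immediately", "soon" and "later".

```python
from datetime import timedelta

# Hypothetical cut-offs, chosen only so that the product slides
# (<1s, ~mins, ~hrs) fall into the three paradigms.
ONLINE_MAX = timedelta(seconds=1)
NEARLINE_MAX = timedelta(minutes=30)


def paradigm_for(freshness: timedelta) -> str:
    """Map a freshness target onto online, nearline or offline."""
    if freshness <= ONLINE_MAX:
        return "online"
    if freshness <= NEARLINE_MAX:
        return "nearline"
    return "offline"
```

Under these assumed thresholds, a sub-second target such as User Profiles lands in the online tier, a target of a few minutes such as LinkedIn Today in nearline, and a target of hours such as People You May Know in offline.
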
LinkedIn Product Architecture




LinkedIn Data Infrastructure Solutions
Databus : Timeline-Consistent Change Data Capture




Databus at LinkedIn

[Figure: DB → capture changes → Relay (Event Win) → on-line changes → Databus Client Lib → Consumer 1 … Consumer n. The Relay also sends on-line changes to the Bootstrap service (DB), which serves a consistent snapshot at U to the Databus Client Lib.]

• Transport independent of data source: Oracle, MySQL, …
• Transactional semantics
• In order, at least once delivery
• Tens of relays
• Hundreds of sources
• Low latency - milliseconds

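
With in-order, at-least-once delivery a consumer can receive the same change more than once, so it helps if applying a change is idempotent. A minimal sketch of that idea follows. It is not the Databus client API: `ChangeEvent`, `IdempotentConsumer` and the `scn` field are hypothetical names, and the sketch assumes one event per sequence number.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    scn: int    # assumed: sequence number that increases in commit order
    key: str
    value: str


@dataclass
class IdempotentConsumer:
    state: dict = field(default_factory=dict)
    checkpoint: int = -1    # highest sequence number applied so far

    def on_event(self, event: ChangeEvent) -> bool:
        """Apply the event once; return False for a redelivered duplicate."""
        if event.scn <= self.checkpoint:
            return False    # already applied, safe to ignore
        self.state[event.key] = event.value
        self.checkpoint = event.scn
        return True
```

Because events arrive in order, comparing against a single checkpoint is enough to detect redelivery; persisting the checkpoint together with the state would let a restarted consumer resume from where it stopped.
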
LinkedIn Product Architecture




LinkedIn Data Infrastructure Solutions

Voldemort: Highly-Available Distributed KV Store




Voldemort: Architecture

• Pluggable components
• Tunable consistency / availability
• Key/value model, server side “views”

• 10 clusters, 100+ nodes
• Largest cluster – 10K+ qps
• Avg latency: 3ms
• Hundreds of Stores
• Largest store – 2.8TB+

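
"Tunable consistency / availability" is commonly expressed as a quorum setting: N replicas, R replicas that must answer a read, W replicas that must acknowledge a write. The sketch below shows that trade-off in general terms; the class and method names are hypothetical and it does not reproduce Voldemort's configuration format.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuorumConfig:
    replicas: int          # N
    required_reads: int    # R
    required_writes: int   # W

    def __post_init__(self):
        if not 1 <= self.required_reads <= self.replicas:
            raise ValueError("required_reads must be between 1 and replicas")
        if not 1 <= self.required_writes <= self.replicas:
            raise ValueError("required_writes must be between 1 and replicas")

    def reads_overlap_writes(self) -> bool:
        """True when every read quorum shares a replica with every write quorum."""
        return self.required_reads + self.required_writes > self.replicas

    def write_failures_tolerated(self) -> int:
        """Replicas that may be down while writes still succeed."""
        return self.replicas - self.required_writes
```

For example, `QuorumConfig(3, 2, 2)` favors consistency (reads overlap writes, one node may fail), while `QuorumConfig(3, 1, 1)` favors availability and latency at the cost of possibly stale reads.
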
LinkedIn Product Architecture




LinkedIn Data Infrastructure Solutions
Kafka: High-Volume Low-Latency Messaging System




LinkedIn Product Architecture




Kafka: Architecture

[Figure: the WebTier pushes events (100 MB/sec) to the Broker Tier, which stores Topic 1 … Topic N with sequential writes and serves them with sendfile. Consumers pull events (200 MB/sec) through the Kafka Client Lib as Iterator 1 … Iterator n. Zookeeper handles topic and partition ownership and offset management (topic → offset).]

• At least once delivery
• Very high throughput
• Low latency
• Durability
• Billions of Events, TBs per day
• 50K+ per sec at peak
• Inter and Intra-cluster replication
• End-to-end latency: few seconds

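
The broker and consumer roles in the diagram can be sketched as an append-only log plus a consumer that pulls from an offset and advances it only after processing. This is a simplified in-memory illustration, not the Kafka client API: `TopicLog` and `PullConsumer` are hypothetical names, offsets are list indices, and the offset is kept in the consumer object rather than in Zookeeper.

```python
class TopicLog:
    """Append-only message log for one topic (in memory)."""

    def __init__(self):
        self._messages = []

    def append(self, message: bytes) -> int:
        """Append at the end of the log and return the message's offset."""
        self._messages.append(message)
        return len(self._messages) - 1

    def read(self, offset: int, max_messages: int = 100) -> list:
        """Return up to max_messages starting at offset."""
        return self._messages[offset:offset + max_messages]


class PullConsumer:
    """Pulls batches and commits the offset only after the whole batch is handled."""

    def __init__(self, log: TopicLog, handler):
        self.log = log
        self.handler = handler
        self.committed_offset = 0    # next offset to read

    def poll(self, max_messages: int = 100) -> int:
        batch = self.log.read(self.committed_offset, max_messages)
        for message in batch:
            self.handler(message)    # if this raises, the offset is not advanced
        self.committed_offset += len(batch)
        return len(batch)
```

If the handler fails part-way through a batch, the next `poll()` re-reads the batch from the committed offset, so messages may be handled twice but are never skipped, which is the at-least-once behavior listed above.
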
LinkedIn Product Architecture




LinkedIn Data Infrastructure Solutions
Espresso: Indexed Timeline-Consistent Distributed Data Store

Application View

• Hierarchical data model

• Rich functionality on resources
   – Conditional updates
   – Partial updates
   – Atomic counters

• Rich functionality within resource groups
   – Transactions
   – Secondary index
   – Text search

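
The per-resource operations listed above can be illustrated with a small in-memory store. This is a sketch of the semantics only, not Espresso's API: `DocumentStore`, `ConflictError`, the method names and the version counter are all hypothetical.

```python
import threading


class ConflictError(Exception):
    """Raised when a conditional update sees an unexpected version."""


class DocumentStore:
    def __init__(self):
        self._docs = {}                  # key -> (version, document dict)
        self._lock = threading.Lock()

    def get(self, key):
        """Return (version, copy of document); a missing key is version 0."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            return version, dict(doc)

    def put_if_version(self, key, expected_version, doc):
        """Conditional update: write only if the stored version matches."""
        with self._lock:
            version, _ = self._docs.get(key, (0, {}))
            if version != expected_version:
                raise ConflictError(f"expected {expected_version}, found {version}")
            self._docs[key] = (version + 1, dict(doc))
            return version + 1

    def patch(self, key, fields):
        """Partial update: merge the given fields into the document."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            self._docs[key] = (version + 1, {**doc, **fields})
            return version + 1

    def increment(self, key, field_name, delta=1):
        """Atomic counter: read, add and write back under one lock."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            value = doc.get(field_name, 0) + delta
            self._docs[key] = (version + 1, {**doc, field_name: value})
            return value
```

A caller would read a resource with `get`, modify it, and pass the version it read to `put_if_version`; a `ConflictError` signals that another writer got there first.
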
Partitioning




Espresso Partition Layout: Master, Slave
3 Storage Engine nodes, 2 way replication

[Figure: partition placement across three storage nodes; legend: Master, Slave]

Database (partition → node)
  Partition: P.1   Node: 1
  …
  Partition: P.12  Node: 3

Cluster (per-node partition states)
  Node: 1
  M: P.1 – Active
     …
  S: P.5 – Active
     …

Partitions hosted per node
  Node 1: P.1, P.2, P.3, P.4, P.5, P.6, P.9, P.10
  Node 2: P.5, P.6, P.7, P.8, P.1, P.2, P.11, P.12
  Node 3: P.9, P.10, P.11, P.12, P.3, P.4, P.7, P.8

Cluster Manager

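
A layout like this one, with 12 partitions on 3 nodes and one master plus one slave per partition, can be produced by a simple balanced assignment. The sketch below is an assumption for illustration: it is not the placement algorithm Espresso or its cluster manager uses, it does not reproduce the exact placement in the figure, and `partition_for` and `layout` are hypothetical names.

```python
import zlib


def partition_for(key: str, num_partitions: int) -> int:
    """Route a key to a partition with a stable hash (CRC32)."""
    return zlib.crc32(key.encode("utf-8")) % num_partitions


def layout(num_partitions: int, num_nodes: int) -> dict:
    """Return {partition: (master_node, slave_node)}, all 0-indexed.

    Masters are spread round-robin; the slave is placed on a different
    node, rotating the distance so that slaves are spread out too.
    """
    if num_nodes < 2:
        raise ValueError("2 way replication needs at least 2 nodes")
    result = {}
    for p in range(num_partitions):
        master = p % num_nodes
        distance = 1 + (p // num_nodes) % (num_nodes - 1)   # 1 .. num_nodes-1
        slave = (master + distance) % num_nodes
        result[p] = (master, slave)
    return result
```

With `layout(12, 3)` every node ends up master for 4 partitions and slave for 4 others, which matches the shape of the figure: each node hosts 8 partition replicas.
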
Espresso: System Components




Generic Cluster Manager: Helix

• Generic Distributed State Model
• Centralized Config Management
• Automatic Load Balancing
• Fault tolerance
• Health monitoring
• Cluster expansion and
  rebalancing
• Espresso, Databus and Search
• Open Source Apr 2012
• https://github.com/linkedin/helix
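
A "generic distributed state model" can be pictured as a table of allowed state transitions per partition replica, which the cluster manager walks to move a replica from its current state to a target state. The sketch below assumes a master/slave model with the states OFFLINE, SLAVE and MASTER; the state names, the transition table and `transition_path` are illustrative assumptions, not the Helix API.

```python
from collections import deque

# Assumed transition table for a master/slave state model.
TRANSITIONS = {
    "OFFLINE": {"SLAVE"},
    "SLAVE": {"MASTER", "OFFLINE"},
    "MASTER": {"SLAVE"},
}


def transition_path(current: str, target: str) -> list:
    """Shortest sequence of states from current to target (breadth-first search)."""
    if current not in TRANSITIONS or target not in TRANSITIONS:
        raise ValueError("unknown state")
    queue = deque([[current]])
    seen = {current}
    while queue:
        route = queue.popleft()
        if route[-1] == target:
            return route
        for nxt in sorted(TRANSITIONS[route[-1]]):
            if nxt not in seen:
                seen.add(nxt)
                queue.append(route + [nxt])
    raise ValueError(f"no path from {current} to {target}")
```

Under this table a replica cannot jump from OFFLINE straight to MASTER; it has to pass through SLAVE first, which gives it a chance to catch up before taking writes.
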




Espresso@LinkedIn

• Launched first application Oct 2011
• Open source 2012
• Future
   – Multi-Datacenter support
   – Global secondary indexes
   – Time-partitioned data

LinkedIn Product Architecture




Acknowledgments

 Siddharth Anand, Aditya Auradkar, Chavdar Botev, Vinoth Chandar,
 Shirshanka Das, Dave DeMaagd, Alex Feinberg, John Fung, Phanindra
 Ganti, Mihir Gandhi, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna,
 Brendan Harris, Rajappa Iyer, Swaroop Jagadish, Joel Koshy, Kevin Krawez,
 Jay Kreps, Shi Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor
 Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham
 Sebastian, Oliver Seeliger, Adam Silberstein, Boris Shkolnik, Chinmay
 Soman, Subbu Subramaniam, Roshan Sumbaly, Kapil Surlaker, Sajid
 Topiwala, Cuong Tran, Balaji Varadarajan, Jemiah Westerman, Zach White,
 Victor Ye, David Zhang, and Jason Zhang




Questions?





Más contenido relacionado

La actualidad más candente

[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloudJeff Hung
 
The "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInThe "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInSam Shah
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseDataWorks Summit
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Denodo
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneMongoDB
 
MongoDB Europe 2016 - The Rise of the Data Lake
MongoDB Europe 2016 - The Rise of the Data LakeMongoDB Europe 2016 - The Rise of the Data Lake
MongoDB Europe 2016 - The Rise of the Data LakeMongoDB
 
What's New In MongoDB 3.6
What's New In MongoDB 3.6What's New In MongoDB 3.6
What's New In MongoDB 3.6MongoDB
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewDataWorks Summit/Hadoop Summit
 
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...Databricks
 
MongoDB Europe 2016 - MongoDB Atlas
MongoDB Europe 2016 - MongoDB AtlasMongoDB Europe 2016 - MongoDB Atlas
MongoDB Europe 2016 - MongoDB AtlasMongoDB
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationmarkgrover
 
Webinar: 10-Step Guide to Creating a Single View of your Business
Webinar: 10-Step Guide to Creating a Single View of your BusinessWebinar: 10-Step Guide to Creating a Single View of your Business
Webinar: 10-Step Guide to Creating a Single View of your BusinessMongoDB
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryNeo4j
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsenmarkgrover
 
[DataCon.TW 2018] Metadata Store: Generalized Entity Database for Intelligenc...
[DataCon.TW 2018] Metadata Store: Generalized Entity Database for Intelligenc...[DataCon.TW 2018] Metadata Store: Generalized Entity Database for Intelligenc...
[DataCon.TW 2018] Metadata Store: Generalized Entity Database for Intelligenc...Jeff Hung
 
Data Virtualization and ETL
Data Virtualization and ETLData Virtualization and ETL
Data Virtualization and ETLLily Luo
 
Encompassing Information Integration
Encompassing Information IntegrationEncompassing Information Integration
Encompassing Information Integrationnguyenfilip
 
Ultralight Data Movement for IoT with SDC Edge
Ultralight Data Movement for IoT with SDC EdgeUltralight Data Movement for IoT with SDC Edge
Ultralight Data Movement for IoT with SDC EdgeDataWorks Summit
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discoverymarkgrover
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected BreweryJason Hubbard
 

La actualidad más candente (20)

[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
[DataCon.TW 2017] Data Lake: centralize in on-prem vs. decentralize on cloud
 
The "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedInThe "Big Data" Ecosystem at LinkedIn
The "Big Data" Ecosystem at LinkedIn
 
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop WarehouseData Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
Data Driving Yahoo Mail Growth and Evolution with a 50 PB Hadoop Warehouse
 
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
Enabling Fast Data Strategy: What’s new in Denodo Platform 6.0
 
L’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova GenerazioneL’architettura di Classe Enterprise di Nuova Generazione
L’architettura di Classe Enterprise di Nuova Generazione
 
MongoDB Europe 2016 - The Rise of the Data Lake
MongoDB Europe 2016 - The Rise of the Data LakeMongoDB Europe 2016 - The Rise of the Data Lake
MongoDB Europe 2016 - The Rise of the Data Lake
 
What's New In MongoDB 3.6
What's New In MongoDB 3.6What's New In MongoDB 3.6
What's New In MongoDB 3.6
 
The Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture ViewThe Future of Apache Hadoop an Enterprise Architecture View
The Future of Apache Hadoop an Enterprise Architecture View
 
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
All In - Migrating a Genomics Pipeline from BASH/Hive to Spark (Azure Databri...
 
MongoDB Europe 2016 - MongoDB Atlas
MongoDB Europe 2016 - MongoDB AtlasMongoDB Europe 2016 - MongoDB Atlas
MongoDB Europe 2016 - MongoDB Atlas
 
Amundsen at Brex and Looker integration
Amundsen at Brex and Looker integrationAmundsen at Brex and Looker integration
Amundsen at Brex and Looker integration
 
Webinar: 10-Step Guide to Creating a Single View of your Business
Webinar: 10-Step Guide to Creating a Single View of your BusinessWebinar: 10-Step Guide to Creating a Single View of your Business
Webinar: 10-Step Guide to Creating a Single View of your Business
 
How Lyft Drives Data Discovery
How Lyft Drives Data DiscoveryHow Lyft Drives Data Discovery
How Lyft Drives Data Discovery
 
REA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and AmundsenREA Group's journey with Data Cataloging and Amundsen
REA Group's journey with Data Cataloging and Amundsen
 
[DataCon.TW 2018] Metadata Store: Generalized Entity Database for Intelligenc...
[DataCon.TW 2018] Metadata Store: Generalized Entity Database for Intelligenc...[DataCon.TW 2018] Metadata Store: Generalized Entity Database for Intelligenc...
[DataCon.TW 2018] Metadata Store: Generalized Entity Database for Intelligenc...
 
Data Virtualization and ETL
Data Virtualization and ETLData Virtualization and ETL
Data Virtualization and ETL
 
Encompassing Information Integration
Encompassing Information IntegrationEncompassing Information Integration
Encompassing Information Integration
 
Ultralight Data Movement for IoT with SDC Edge
Ultralight Data Movement for IoT with SDC EdgeUltralight Data Movement for IoT with SDC Edge
Ultralight Data Movement for IoT with SDC Edge
 
Disrupting Data Discovery
Disrupting Data DiscoveryDisrupting Data Discovery
Disrupting Data Discovery
 
IoT Connected Brewery
IoT Connected BreweryIoT Connected Brewery
IoT Connected Brewery
 

Destacado

Resume- William Myers FD2016.1.4
Resume- William Myers FD2016.1.4Resume- William Myers FD2016.1.4
Resume- William Myers FD2016.1.4William Myers
 
Personal branding playbook
Personal branding playbookPersonal branding playbook
Personal branding playbookOnline Business
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationAmy W. Tang
 
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop Shirshanka Das
 
Using Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and AnalyticsUsing Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and AnalyticsPerficient, Inc.
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Shirshanka Das
 
Participatory Design: Bringing Users Into Your Process
Participatory Design: Bringing Users Into Your ProcessParticipatory Design: Bringing Users Into Your Process
Participatory Design: Bringing Users Into Your ProcessDavid Sherwin
 
Unlocking the Experts
Unlocking the ExpertsUnlocking the Experts
Unlocking the ExpertsLinkedIn
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...Edureka!
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShareSlideShare
 
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Carol Smith
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Edureka!
 
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...Shirshanka Das
 
What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn...
What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn...What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn...
What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn...Edureka!
 
Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017NVIDIA
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017Carol Smith
 

Destacado (17)

Resume- William Myers FD2016.1.4
Resume- William Myers FD2016.1.4Resume- William Myers FD2016.1.4
Resume- William Myers FD2016.1.4
 
Personal branding playbook
Personal branding playbookPersonal branding playbook
Personal branding playbook
 
LinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data ApplicationLinkedIn Segmentation & Targeting Platform: A Big Data Application
LinkedIn Segmentation & Targeting Platform: A Big Data Application
 
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
Strata SG 2015: LinkedIn Self Serve Reporting Platform on Hadoop
 
Using Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and AnalyticsUsing Big Data for Improved Healthcare Operations and Analytics
Using Big Data for Improved Healthcare Operations and Analytics
 
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
Apache Gobblin: Bridging Batch and Streaming Data Integration. Big Data Meetu...
 
Participatory Design: Bringing Users Into Your Process
Participatory Design: Bringing Users Into Your ProcessParticipatory Design: Bringing Users Into Your Process
Participatory Design: Bringing Users Into Your Process
 
Unlocking the Experts
Unlocking the ExpertsUnlocking the Experts
Unlocking the Experts
 
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
Introduction To TensorFlow | Deep Learning Using TensorFlow | TensorFlow Tuto...
 
Big data ppt
Big  data pptBig  data ppt
Big data ppt
 
What to Upload to SlideShare
What to Upload to SlideShareWhat to Upload to SlideShare
What to Upload to SlideShare
 
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
Making Great User Experiences, Pittsburgh Scrum MeetUp, Oct 17, 2017
 
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
Big Data Tutorial For Beginners | What Is Big Data | Big Data Tutorial | Hado...
 
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strat...
 
What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn...
What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn...What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn...
What is Artificial Intelligence | Artificial Intelligence Tutorial For Beginn...
 
Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017Top 5 Deep Learning and AI Stories - October 6, 2017
Top 5 Deep Learning and AI Stories - October 6, 2017
 
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
AI and Machine Learning Demystified by Carol Smith at Midwest UX 2017
 

Similar a Data Infrastructure at LinkedIn

Xldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inXldb2011 tue 1005_linked_in
Xldb2011 tue 1005_linked_inliqiang xu
 
Why Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarWhy Every NoSQL Deployment Should Be Paired with Hadoop Webinar
Why Every NoSQL Deployment Should Be Paired with Hadoop WebinarCloudera, Inc.
 
Introduction to Databus
Introduction to DatabusIntroduction to Databus
Introduction to DatabusAmy W. Tang
 
Creating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationCreating a Modern Data Architecture for Digital Transformation
Creating a Modern Data Architecture for Digital TransformationMongoDB
 
Apache Kafka: Past, Present and Future
Apache Kafka: Past, Present and FutureApache Kafka: Past, Present and Future
Apache Kafka: Past, Present and Futureconfluent
 
Meetup realtime datacollection
Meetup realtime datacollectionMeetup realtime datacollection
Meetup realtime datacollectionInder Singh
 
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2
InfiniDB 3 - Speeding Big Data Analytics in Amazon EC2Calpont Corporation
 
Lovett introducing cloud computing nov 2009
Lovett introducing cloud computing nov 2009Lovett introducing cloud computing nov 2009
Lovett introducing cloud computing nov 2009Hilde Lovett
 
Embedded Analytics: The Next Mega-Wave of Innovation
Embedded Analytics: The Next Mega-Wave of InnovationEmbedded Analytics: The Next Mega-Wave of Innovation
Embedded Analytics: The Next Mega-Wave of InnovationInside Analysis
 
Data Infrastructure at LinkedIn

  • 1. Data Infrastructure at LinkedIn. Kapil Surlaker, http://www.linkedin.com/in/kapilsurlaker, @kapilsurlaker
  • 2. Outline: LinkedIn Products; Data Ecosystem; LinkedIn Data Infrastructure Solutions; Next Play
  • 3. LinkedIn By The Numbers: 150M+ users (as of February 9th 2012); ~4.2B people searches in 2011 (as of December 31st 2011); >2M companies with LinkedIn Company Pages (as of December 31st 2011); 16 languages; 75% of Fortune 100 companies use LinkedIn to hire (as of September 30th 2011)
  • 4. Broad Range of Products & Services
  • 5. User Profiles: large dataset; medium writes; very high reads; freshness <1s
  • 6. Communications: large dataset; high writes; high reads; freshness <1s
  • 7. People You May Know: large dataset; compute intensive; high reads; freshness ~hrs
  • 8. LinkedIn Today: moving dataset; high writes; high reads; freshness ~mins
  • 9. Outline: LinkedIn Products; Data Ecosystem; LinkedIn Data Infrastructure Solutions; Next Play
  • 10. Three Paradigms: Simplifying the Data Continuum. Online (activity that should be reflected immediately): Member Profiles, Company Profiles, Connections, Communications. Nearline (activity that should be reflected soon): LinkedIn Today, Profile Standardization, News, Recommendations, Search, Communications. Offline (activity that can be reflected later): People You May Know, Connection Strength, News, Recommendations, Next best idea.
  • 14. LinkedIn Data Infrastructure Solutions. Databus: Timeline-Consistent Change Data Capture
  • 15. Databus at LinkedIn [architecture diagram; labels: DB, Relay (Capture Changes, Event Win), Bootstrap, Client (Databus Client Lib, Consumer 1 … Consumer n); flows: On-line Changes, Consistent Snapshot at U]
  • 16. Databus at LinkedIn [same architecture diagram as slide 15]. Transport independent of data source (Oracle, MySQL, …); transactional semantics; in-order, at-least-once delivery; tens of relays; hundreds of sources; low latency (milliseconds)
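
The "in-order, at-least-once delivery" property on slide 16 can be illustrated with a small sketch. This is not the Databus API: `ChangeEvent`, `Relay`, `Consumer` and the `scn` field (a hypothetical commit sequence number) are names invented here, and the relay is just an in-memory list. The point is only that a consumer which remembers the last sequence number it applied can safely ignore redelivered events.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class ChangeEvent:
    scn: int        # hypothetical commit sequence number, increasing in commit order
    key: str
    value: str


@dataclass
class Relay:
    """In-memory stand-in for a relay: an ordered buffer of captured changes."""
    events: list = field(default_factory=list)

    def capture(self, event: ChangeEvent) -> None:
        # Changes are appended in commit order, so the buffer stays sorted by scn.
        if self.events and event.scn <= self.events[-1].scn:
            raise ValueError("events must arrive in increasing scn order")
        self.events.append(event)

    def read_since(self, scn: int) -> list:
        # Return every event strictly newer than the caller's checkpoint.
        return [e for e in self.events if e.scn > scn]


@dataclass
class Consumer:
    checkpoint: int = 0                      # highest scn applied so far
    state: dict = field(default_factory=dict)

    def on_events(self, batch: list) -> int:
        applied = 0
        for event in batch:
            if event.scn <= self.checkpoint:
                continue                     # duplicate from a redelivery: skip it
            self.state[event.key] = event.value
            self.checkpoint = event.scn
            applied += 1
        return applied


relay = Relay()
relay.capture(ChangeEvent(1, "member:7", "headline=Engineer"))
relay.capture(ChangeEvent(2, "member:9", "headline=Designer"))
relay.capture(ChangeEvent(3, "member:7", "headline=Staff Engineer"))

consumer = Consumer()
first = consumer.on_events(relay.read_since(0))
# Simulate a redelivery: the same batch arrives again after a lost acknowledgement.
second = consumer.on_events(relay.read_since(0))
```

Because the checkpoint only moves forward, replaying the batch changes nothing: the consumer ends in the same state whether an event is delivered once or several times.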
  • 19. LinkedIn Data Infrastructure Solutions. Voldemort: Highly-Available Distributed KV Store
  • 20. Voldemort: Architecture. Pluggable components; tunable consistency / availability; key/value model, server-side "views". 10 clusters, 100+ nodes; largest cluster: 10K+ qps; avg latency: 3ms; hundreds of stores; largest store: 2.8TB+
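
"Tunable consistency / availability" is commonly expressed through three numbers: N replicas per key, R replicas required for a read, and W acknowledgements required for a write. The sketch below is a simplified illustration under that assumption, not Voldemort's client API: `QuorumStore` and its methods are hypothetical, and plain integer versions stand in for the vector clocks a real store would use.

```python
from dataclasses import dataclass, field


@dataclass
class QuorumStore:
    n: int  # replicas per key
    r: int  # replicas that must answer a read
    w: int  # replicas that must acknowledge a write
    replicas: list = field(init=False)

    def __post_init__(self) -> None:
        if not (1 <= self.r <= self.n and 1 <= self.w <= self.n):
            raise ValueError("need 1 <= r, w <= n")
        self.replicas = [{} for _ in range(self.n)]

    def overlaps(self) -> bool:
        # Read and write quorums intersect only when r + w > n.
        return self.r + self.w > self.n

    def put(self, key, value, version, reachable) -> bool:
        """Write to the reachable replicas; succeed only if at least w acknowledge."""
        acks = 0
        for i in reachable:
            self.replicas[i][key] = (version, value)
            acks += 1
        return acks >= self.w

    def get(self, key, reachable):
        """Read from r reachable replicas and return the highest-versioned value."""
        if len(reachable) < self.r:
            raise RuntimeError("not enough replicas for a read quorum")
        answers = [self.replicas[i].get(key) for i in reachable[: self.r]]
        answers = [a for a in answers if a is not None]
        return max(answers, key=lambda a: a[0])[1] if answers else None


# Overlapping quorums: replica 2 misses the second write, yet the read still sees it.
strong = QuorumStore(n=3, r=2, w=2)
strong.put("k", "v1", version=1, reachable=[0, 1, 2])
strong.put("k", "v2", version=2, reachable=[0, 1])
fresh = strong.get("k", reachable=[1, 2])

# Non-overlapping quorums trade freshness for availability.
weak = QuorumStore(n=3, r=1, w=1)
weak.put("k", "v1", version=1, reachable=[0, 1, 2])
weak.put("k", "v2", version=2, reachable=[0])
stale = weak.get("k", reachable=[2])
```

With R + W > N every read quorum shares at least one replica with the latest successful write; lowering R or W keeps the store answering with fewer live replicas at the cost of possibly stale reads.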
  • 22. LinkedIn Data Infrastructure Solutions. Kafka: High-Volume Low-Latency Messaging System
  • 24. Kafka: Architecture [diagram; labels: WebTier (Client Lib, Events, Push), Broker Tier (Kafka; Topic 1, Topic 2, … Topic N; Sequential write; sendfile), Consumers (Pull; Iterator 1 … Iterator n), Zookeeper (Offset Management; Topic, Partition Ownership), 100 MB/sec, 200 MB/sec]
  • 25. Kafka: Architecture [same diagram as slide 24]. At-least-once delivery; very high throughput; low latency; durability. Billions of events, TBs per day; 50K+ per sec at peak; inter- and intra-cluster replication; end-to-end latency: few seconds
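
The labels on slides 24 and 25 (sequential write, pull, offsets, at-least-once delivery) describe a log that brokers only append to and that consumers read by offset. The following is a minimal in-memory sketch of that idea, not Kafka's client API: `PartitionLog` and `PullConsumer` are hypothetical names, and offsets here are message indices purely to keep the example short.

```python
class PartitionLog:
    """Append-only log for one topic partition (in-memory stand-in for a broker file)."""

    def __init__(self) -> None:
        self._messages = []

    def append(self, message: bytes) -> int:
        # Sequential write: new messages only ever go at the end of the log.
        self._messages.append(message)
        return len(self._messages) - 1       # offset of the appended message

    def fetch(self, offset: int, max_messages: int) -> list:
        # Pull: the consumer names the offset it wants to read from.
        return self._messages[offset : offset + max_messages]


class PullConsumer:
    def __init__(self, log: PartitionLog) -> None:
        self.log = log
        self.committed = 0                   # next offset to read after a restart

    def poll(self, handler, max_messages: int = 2) -> int:
        batch = self.log.fetch(self.committed, max_messages)
        for message in batch:
            handler(message)
        # Commit only after processing. A crash before this line means the
        # batch is read again, which is what makes delivery at-least-once.
        self.committed += len(batch)
        return len(batch)


log = PartitionLog()
for payload in (b"page_view", b"search", b"profile_edit"):
    log.append(payload)

seen = []
consumer = PullConsumer(log)
while consumer.poll(seen.append):
    pass

# A second consumer keeps its own offset and re-reads the log independently.
replayed = []
other = PullConsumer(log)
other.poll(replayed.append, max_messages=10)
```

Keeping the offset on the consumer side is the design choice worth noticing: the log itself holds no per-consumer state, so any number of consumers can read it at their own pace.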
  • 27. LinkedIn Data Infrastructure Solutions. Espresso: Indexed Timeline-Consistent Distributed Data Store
  • 28. Application View. Hierarchical data model. Rich functionality on resources: conditional updates, partial updates, atomic counters. Rich functionality within resource groups: transactions, secondary index, text search
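
The three resource-level operations named on slide 28 (conditional updates, partial updates, atomic counters) can be sketched against a toy in-memory store. Everything here is an assumption for illustration: `DocumentStore`, `ConflictError`, the integer version used as the update condition, and the path-like keys are hypothetical and do not reflect Espresso's actual interface.

```python
import threading


class ConflictError(Exception):
    """Raised when a conditional update sees a different version than expected."""


class DocumentStore:
    def __init__(self) -> None:
        self._docs = {}                      # key -> (version, document dict)
        self._lock = threading.Lock()

    def get(self, key):
        version, doc = self._docs[key]
        return version, dict(doc)

    def put(self, key, doc, expected_version=None) -> int:
        """Conditional update: write only if the stored version matches."""
        with self._lock:
            current = self._docs.get(key, (0, {}))[0]
            if expected_version is not None and expected_version != current:
                raise ConflictError(f"expected {expected_version}, found {current}")
            self._docs[key] = (current + 1, dict(doc))
            return current + 1

    def patch(self, key, fields) -> int:
        """Partial update: merge the given fields into the stored document."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            self._docs[key] = (version + 1, {**doc, **fields})
            return version + 1

    def increment(self, key, field, delta=1) -> int:
        """Atomic counter: read-modify-write under the store lock."""
        with self._lock:
            version, doc = self._docs.get(key, (0, {}))
            value = doc.get(field, 0) + delta
            self._docs[key] = (version + 1, {**doc, field: value})
            return value


store = DocumentStore()
v1 = store.put("/Mailbox/7/msg/1", {"subject": "hi", "read": False})
v2 = store.patch("/Mailbox/7/msg/1", {"read": True})

try:
    # The writer still believes the document is at v1, so the write is rejected.
    store.put("/Mailbox/7/msg/1", {"subject": "stale"}, expected_version=v1)
    conflict = False
except ConflictError:
    conflict = True

store.increment("/Mailbox/7", "unread")
unread = store.increment("/Mailbox/7", "unread")
```

The rejected `put` shows why a condition is useful: a client working from an old copy cannot silently overwrite a newer one.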
  • 30. Espresso Partition Layout: Master, Slave. 3 Storage Engine nodes, 2-way replication [diagram: a Database split into partitions P.1 to P.12, spread across Node 1, Node 2 and Node 3, each node holding some master and some slave partitions; the Cluster Manager tracks assignments by partition ("Partition: P.1 Node: 1 … Partition: P.12 Node: 3") and by node ("Node: 1 M: P.1 – Active … S: P.5 – Active …")]
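
A layout like the one on slide 30 (12 partitions, 3 storage nodes, 2-way replication) can be generated by a simple placement rule. The round-robin rule below is an assumption chosen for brevity and is not the placement shown on the slide; `assign_partitions` and `fail_node` are hypothetical helpers.

```python
def assign_partitions(num_partitions: int, nodes: list) -> dict:
    """Map each partition to a (master, slave) pair on two different nodes."""
    if len(nodes) < 2:
        raise ValueError("2-way replication needs at least two nodes")
    layout = {}
    for p in range(1, num_partitions + 1):
        master = nodes[(p - 1) % len(nodes)]
        slave = nodes[p % len(nodes)]        # next node, so never the master's node
        layout[f"P.{p}"] = (master, slave)
    return layout


def fail_node(layout: dict, failed: str) -> dict:
    """Promote the slave wherever the failed node was master."""
    after = {}
    for partition, (master, slave) in layout.items():
        if master == failed:
            after[partition] = (slave, None)     # slave takes over; replica missing
        elif slave == failed:
            after[partition] = (master, None)    # master unaffected; replica missing
        else:
            after[partition] = (master, slave)
    return after


nodes = ["Node 1", "Node 2", "Node 3"]
layout = assign_partitions(12, nodes)
masters_per_node = {
    node: sum(1 for master, _ in layout.values() if master == node) for node in nodes
}
after = fail_node(layout, "Node 1")
```

Spreading masters evenly means a single node failure moves only a third of the write traffic, and every partition still has a live copy to promote.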
  • 32. Generic Cluster Manager: Helix. Generic distributed state model; centralized config management; automatic load balancing; fault tolerance; health monitoring; cluster expansion and rebalancing. Espresso, Databus and Search. Open source Apr 2012: https://github.com/linkedin/helix
  • 33. Espresso@LinkedIn. Launched first application Oct 2011; open source 2012. Future: multi-datacenter support; global secondary indexes; time-partitioned data
  • 35. Acknowledgments: Siddharth Anand, Aditya Auradkar, Chavdar Botev, Vinoth Chandar, Shirshanka Das, Dave DeMaagd, Alex Feinberg, John Fung, Phanindra Ganti, Mihir Gandhi, Lei Gao, Bhaskar Ghosh, Kishore Gopalakrishna, Brendan Harris, Rajappa Iyer, Swaroop Jagadish, Joel Koshy, Kevin Krawez, Jay Kreps, Shi Lu, Sunil Nagaraj, Neha Narkhede, Sasha Pachev, Igor Perisic, Lin Qiao, Tom Quiggle, Jun Rao, Bob Schulman, Abraham Sebastian, Oliver Seeliger, Adam Silberstein, Boris Shkolnik, Chinmay Soman, Subbu Subramaniam, Roshan Sumbaly, Kapil Surlaker, Sajid Topiwala, Cuong Tran, Balaji Varadarajan, Jemiah Westerman, Zach White, Victor Ye, David Zhang, and Jason Zhang