SlideShare una empresa de Scribd logo
1 de 20
Taming the ETL beast
How LinkedIn uses metadata to run
complex ETL flows reliably
Rajappa Iyer
Strata Conference, London, November 12, 2013
`whoami`

 Data Infrastructure @ LinkedIn since 2011
 Prior to that:
– Director of Engineering at Digg
– Enterprise Data Architect at eBay

 www.linkedin.com/in/rajappaiyer/
Outline of talk
 Background and Context – The Why
 Challenges with Data Delivery – The What
 Metadata to the Rescue – The How
 Q&A
LinkedIn: The World’s Largest
Professional Network
Connecting Talent  Opportunity. At scale…

259M+ 2 new
Members Worldwide

Members Per Second

100M+
Monthly Unique Visitors

3M+
Company Pages
Data Driven Products and
Insights
Products for
Members

Data,
Platforms,
Analytics

Products for
Enterprises

(Companies)

(Professionals)

Insights

(Analysts and Data
Scientists)
Products for Members
Products for Enterprises

Hire - Talent Solutions

Sell - Sales Navigator

Market - Marketing Solutions
Examples of Insights
Example of Deeper Insight

Job Migration After Financial Collapse
Data is critical to LinkedIn’s products
It needs to be delivered in a reliable
and timely manner

LinkedIn Confidential ©2013 All Rights Reserved

10
A Simplified Overview of Data Flow
Hadoop
Site
(Member
Facing
Products)

Activity
Data

Kafka

Camus

Member Data

Espresso /
Voldemort /
Oracle

DWH ETL

Product,
Sciences,
Enterprise
Analytics

Changes

Databus

External
Partner Data

Lumos

Ingest
Utilities

Computed Results for Member Facing Products

Teradata

Enterprise
Products

Core Data
Set

Derived
Data Set
Components of typical ETL jobs
 Ingress / Egress of message-oriented data
– Logs and clickstream data

 Ingress / Egress of record-oriented data
– Database data

 Transformations
–
–
–
–
–

Select, project, join
Aggregations
Partitioning
Cleansing and data normalization
Schema conversions – e.g., Nested JSON to
Relational

LinkedIn Confidential ©2013 All Rights Reserved

12
An Example ETL Flow

LinkedIn Confidential ©2013 All Rights Reserved

13
Challenges
 Complex process dependencies
– Some flows are over 30 levels deep
– Flows may span multiple platforms (Hadoop, RDBMS etc.)

 Complex data dependencies
– Multiple flows may consume a data element
– Multiple data elements feed into a single flow
– Can be viewed as “data sync barriers”

 Recovery
– Restartable flows that pick up from last checkpoint
– Catch up mode to compensate for downtime

 Monitoring and Alerting
– Prioritization of “important” flows for ops attention
– Who do you call when things fail?

LinkedIn Confidential ©2013 All Rights Reserved

14
Metadata to the rescue
 What metadata is collected?
– Process dependencies
– Data dependencies
– Execution history and data processing statistics

 How is it used?
– Drives the ETL framework with lots of functionality





Check for data availability
Retries and restarts
Standardized error reporting / alerting
Prioritized view of business critical flows

LinkedIn Confidential ©2013 All Rights Reserved

15
Metadata: Process Dependencies
 Capture process
dependency graph

Workflow F
Start

– Also capture metadata such
as process owners,
importance, SLA etc.

Workunit
W1

on success

Workunit
W2

on success

on failure

Workunit
W3

Workunit
W4

on success

on success

Workunit
W5

 Capture stats for each
execution of a workflow
– Time of execution
– Execution status
– Pointer to error logs

 Alert on delayed processes
– Based on execution history

Stop
Metadata: Data Dependencies
Data Entity
D1

Data Entity
D2

consumes

consumes

Workflow F

produces

Data Entity
D3

 For each flow, capture input
and output data elements
 For each flow execution,
capture stats on data element
 Number of records or
messages processed
 Error counts
 Watermarks
– Can be time based or
sequence based
– This can be per flow as more
than one flow can consume a
data element
Metadata: Data Elements
 Simple catalog of data elements
– Name, physical location, owner etc.

 Data elements can have logical names
– Names resolve to one or more physical entity
– Logical names can represent useful collections
 E.g., data as of a particular interval

 Data element availability can trigger processes
– E.g., kick off hourly process when hourly data is
complete and available
– Enables data driven ETL scheduling

18
Putting it all together
Dashboards
,
Reports

ETL applications

Data
Availability
Status

ETL Framework

Scheduler

Checkpoint
Execution
State

Retry /
Resume

Name
resolver

Execution
History

Data Check

Statistics
(process
and data)

Alerting /
Monitoring

Log Parsers

Data
Lineage

Metadata Management System
LinkedIn Confidential ©2013 All Rights Reserved

19
Questions?

More at data.linkedin.com
Come Work on Challenging Data Infrastructure problems - We’re Hiring

Más contenido relacionado

La actualidad más candente

Informatica and datawarehouse Material
Informatica and datawarehouse MaterialInformatica and datawarehouse Material
Informatica and datawarehouse Materialobieefans
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architectureuncleRhyme
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse ArchitecturesTheju Paul
 
Data Archiving -Ramesh sap bw
Data Archiving -Ramesh sap bwData Archiving -Ramesh sap bw
Data Archiving -Ramesh sap bwramesh rao
 
Basics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesBasics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesValmik Potbhare
 
Data Integration: the Beginner's Guide
Data Integration: the Beginner's GuideData Integration: the Beginner's Guide
Data Integration: the Beginner's GuideLisa Falcone
 
Business analysis in data warehousing
Business analysis in data warehousingBusiness analysis in data warehousing
Business analysis in data warehousingHimanshu
 
Hand Coding ETL Scenarios and Challenges
Hand Coding ETL Scenarios and ChallengesHand Coding ETL Scenarios and Challenges
Hand Coding ETL Scenarios and Challengesmark madsen
 
Data integration ppt-bhawani nandan prasad - iim calcutta
Data integration ppt-bhawani nandan prasad - iim calcuttaData integration ppt-bhawani nandan prasad - iim calcutta
Data integration ppt-bhawani nandan prasad - iim calcuttaBhawani N Prasad
 

La actualidad más candente (20)

Informatica and datawarehouse Material
Informatica and datawarehouse MaterialInformatica and datawarehouse Material
Informatica and datawarehouse Material
 
Unit4
Unit4Unit4
Unit4
 
Data mining
Data miningData mining
Data mining
 
Data warehouse architecture
Data warehouse architectureData warehouse architecture
Data warehouse architecture
 
Big Data Pitfalls
Big Data PitfallsBig Data Pitfalls
Big Data Pitfalls
 
Data Warehouse Architectures
Data Warehouse ArchitecturesData Warehouse Architectures
Data Warehouse Architectures
 
Data Warehouse 101
Data Warehouse 101Data Warehouse 101
Data Warehouse 101
 
The Big Metadata
The Big MetadataThe Big Metadata
The Big Metadata
 
jagadeesh updated
jagadeesh updatedjagadeesh updated
jagadeesh updated
 
Data Warehouse
Data WarehouseData Warehouse
Data Warehouse
 
Data Archiving -Ramesh sap bw
Data Archiving -Ramesh sap bwData Archiving -Ramesh sap bw
Data Archiving -Ramesh sap bw
 
Basics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration TechniquesBasics of Microsoft Business Intelligence and Data Integration Techniques
Basics of Microsoft Business Intelligence and Data Integration Techniques
 
Data Integration: the Beginner's Guide
Data Integration: the Beginner's GuideData Integration: the Beginner's Guide
Data Integration: the Beginner's Guide
 
Aspects of data mart
Aspects of data martAspects of data mart
Aspects of data mart
 
Business analysis in data warehousing
Business analysis in data warehousingBusiness analysis in data warehousing
Business analysis in data warehousing
 
Hand Coding ETL Scenarios and Challenges
Hand Coding ETL Scenarios and ChallengesHand Coding ETL Scenarios and Challenges
Hand Coding ETL Scenarios and Challenges
 
ETL Process
ETL ProcessETL Process
ETL Process
 
data warehousing
data warehousingdata warehousing
data warehousing
 
Data integration ppt-bhawani nandan prasad - iim calcutta
Data integration ppt-bhawani nandan prasad - iim calcuttaData integration ppt-bhawani nandan prasad - iim calcutta
Data integration ppt-bhawani nandan prasad - iim calcutta
 
JOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big DataJOSA TechTalk: Metadata Management
in Big Data
JOSA TechTalk: Metadata Management
in Big Data
 

Destacado

ETL Validator: Flat File Validation
ETL Validator: Flat File ValidationETL Validator: Flat File Validation
ETL Validator: Flat File ValidationDatagaps Inc
 
Managing users & tables using Oracle Enterprise Manage
Managing users & tables using Oracle Enterprise ManageManaging users & tables using Oracle Enterprise Manage
Managing users & tables using Oracle Enterprise ManageNR Computer Learning Center
 
Capacity Management of an ETL System
Capacity Management of an ETL SystemCapacity Management of an ETL System
Capacity Management of an ETL SystemASHOK BHATLA
 
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Partners 2013 LinkedIn Use Cases for Teradata Connectors for HadoopPartners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Partners 2013 LinkedIn Use Cases for Teradata Connectors for HadoopEric Sun
 
ETL Validator: Creating Data Model
ETL Validator: Creating Data ModelETL Validator: Creating Data Model
ETL Validator: Creating Data ModelDatagaps Inc
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Yahoo Developer Network
 
Kettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolKettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolAlex Rayón Jerez
 
Crossref webinar - Maintaining your metadata - latest
Crossref webinar - Maintaining your metadata - latestCrossref webinar - Maintaining your metadata - latest
Crossref webinar - Maintaining your metadata - latestCrossref
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Hortonworks
 
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...Amazon Web Services
 
Seven building blocks for MDM
Seven building blocks for MDMSeven building blocks for MDM
Seven building blocks for MDMKousik Mukherjee
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsKhalid Salama
 
How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...Christopher Bradley
 
State of Digital Transformation 2016. Altimeter Report
State of Digital Transformation 2016. Altimeter ReportState of Digital Transformation 2016. Altimeter Report
State of Digital Transformation 2016. Altimeter ReportDen Reymer
 
Gartner: Top 10 Strategic Technology Trends 2016
Gartner: Top 10 Strategic Technology Trends 2016Gartner: Top 10 Strategic Technology Trends 2016
Gartner: Top 10 Strategic Technology Trends 2016Den Reymer
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modelingvivekjv
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesIvo Andreev
 

Destacado (20)

Manage users & tables in Oracle Database
Manage users & tables in Oracle DatabaseManage users & tables in Oracle Database
Manage users & tables in Oracle Database
 
ETL Validator: Flat File Validation
ETL Validator: Flat File ValidationETL Validator: Flat File Validation
ETL Validator: Flat File Validation
 
Managing users & tables using Oracle Enterprise Manage
Managing users & tables using Oracle Enterprise ManageManaging users & tables using Oracle Enterprise Manage
Managing users & tables using Oracle Enterprise Manage
 
Capacity Management of an ETL System
Capacity Management of an ETL SystemCapacity Management of an ETL System
Capacity Management of an ETL System
 
Oracle Tablespace - Basic
Oracle Tablespace - BasicOracle Tablespace - Basic
Oracle Tablespace - Basic
 
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Partners 2013 LinkedIn Use Cases for Teradata Connectors for HadoopPartners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
Partners 2013 LinkedIn Use Cases for Teradata Connectors for Hadoop
 
ETL Validator: Creating Data Model
ETL Validator: Creating Data ModelETL Validator: Creating Data Model
ETL Validator: Creating Data Model
 
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
Data Applications and Infrastructure at LinkedIn__HadoopSummit2010
 
Kettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration toolKettle: Pentaho Data Integration tool
Kettle: Pentaho Data Integration tool
 
Crossref webinar - Maintaining your metadata - latest
Crossref webinar - Maintaining your metadata - latestCrossref webinar - Maintaining your metadata - latest
Crossref webinar - Maintaining your metadata - latest
 
Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015Data Governance - Atlas 7.12.2015
Data Governance - Atlas 7.12.2015
 
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
(BDT303) Construct Your ETL Pipeline with AWS Data Pipeline, Amazon EMR, and ...
 
Seven building blocks for MDM
Seven building blocks for MDMSeven building blocks for MDM
Seven building blocks for MDM
 
Building the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake AnalyticsBuilding the Data Lake with Azure Data Factory and Data Lake Analytics
Building the Data Lake with Azure Data Factory and Data Lake Analytics
 
How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...How to identify the correct Master Data subject areas & tooling for your MDM...
How to identify the correct Master Data subject areas & tooling for your MDM...
 
State of Digital Transformation 2016. Altimeter Report
State of Digital Transformation 2016. Altimeter ReportState of Digital Transformation 2016. Altimeter Report
State of Digital Transformation 2016. Altimeter Report
 
Introduction to ETL and Data Integration
Introduction to ETL and Data IntegrationIntroduction to ETL and Data Integration
Introduction to ETL and Data Integration
 
Gartner: Top 10 Strategic Technology Trends 2016
Gartner: Top 10 Strategic Technology Trends 2016Gartner: Top 10 Strategic Technology Trends 2016
Gartner: Top 10 Strategic Technology Trends 2016
 
Data Warehouse Modeling
Data Warehouse ModelingData Warehouse Modeling
Data Warehouse Modeling
 
Data Warehouse Design and Best Practices
Data Warehouse Design and Best PracticesData Warehouse Design and Best Practices
Data Warehouse Design and Best Practices
 

Similar a Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably

Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Rittman Analytics
 
FlockData Overview
FlockData OverviewFlockData Overview
FlockData OverviewFlockData
 
Talend MDM
Talend MDMTalend MDM
Talend MDMTalend
 
The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopInside Analysis
 
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...AgileNetwork
 
Reducing Tool Costs
Reducing Tool CostsReducing Tool Costs
Reducing Tool CostsKalido
 
Bi presentation to bkk
Bi presentation to bkkBi presentation to bkk
Bi presentation to bkkguest4e975e2
 
Rev_3 Components of a Data Warehouse
Rev_3 Components of a Data WarehouseRev_3 Components of a Data Warehouse
Rev_3 Components of a Data WarehouseRyan Andhavarapu
 
Thought leadership Oct2015 selfserve
Thought leadership Oct2015 selfserveThought leadership Oct2015 selfserve
Thought leadership Oct2015 selfserveRon Krzoska
 
Industrializing Data Integration
Industrializing Data IntegrationIndustrializing Data Integration
Industrializing Data IntegrationTalend
 
Real time insights for better products, customer experience and resilient pla...
Real time insights for better products, customer experience and resilient pla...Real time insights for better products, customer experience and resilient pla...
Real time insights for better products, customer experience and resilient pla...Balvinder Hira
 
Accelerating SDLC for Large Public Sector Enterprise Applications
Accelerating SDLC for Large Public Sector Enterprise ApplicationsAccelerating SDLC for Large Public Sector Enterprise Applications
Accelerating SDLC for Large Public Sector Enterprise ApplicationsSplunk
 
An Introduction to Data Virtualization in 2018
An Introduction to Data Virtualization in 2018An Introduction to Data Virtualization in 2018
An Introduction to Data Virtualization in 2018Denodo
 
Big Data Case study - caixa bank
Big Data Case study - caixa bankBig Data Case study - caixa bank
Big Data Case study - caixa bankChungsik Yun
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti
 
Informatica Becomes Part of the Business Data Lake Ecosystem
Informatica Becomes Part of the Business Data Lake EcosystemInformatica Becomes Part of the Business Data Lake Ecosystem
Informatica Becomes Part of the Business Data Lake EcosystemCapgemini
 
Salesforce mumbai user group june meetup
Salesforce mumbai user group   june meetupSalesforce mumbai user group   june meetup
Salesforce mumbai user group june meetupRakesh Gupta
 
Oracle Big Data Governance Webcast Charts
Oracle Big Data Governance Webcast ChartsOracle Big Data Governance Webcast Charts
Oracle Big Data Governance Webcast ChartsJeffrey T. Pollock
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An IntroductionDenodo
 

Similar a Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably (20)

Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
Data Integration for Big Data (OOW 2016, Co-Presented With Oracle)
 
FlockData Overview
FlockData OverviewFlockData Overview
FlockData Overview
 
Talend MDM
Talend MDMTalend MDM
Talend MDM
 
The Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of HadoopThe Maturity Model: Taking the Growing Pains Out of Hadoop
The Maturity Model: Taking the Growing Pains Out of Hadoop
 
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...
Agile Mumbai 2022 - Balvinder Kaur & Sushant Joshi | Real-Time Insights and A...
 
Reducing Tool Costs
Reducing Tool CostsReducing Tool Costs
Reducing Tool Costs
 
Bi presentation to bkk
Bi presentation to bkkBi presentation to bkk
Bi presentation to bkk
 
Rev_3 Components of a Data Warehouse
Rev_3 Components of a Data WarehouseRev_3 Components of a Data Warehouse
Rev_3 Components of a Data Warehouse
 
Thought leadership Oct2015 selfserve
Thought leadership Oct2015 selfserveThought leadership Oct2015 selfserve
Thought leadership Oct2015 selfserve
 
Industrializing Data Integration
Industrializing Data IntegrationIndustrializing Data Integration
Industrializing Data Integration
 
Real time insights for better products, customer experience and resilient pla...
Real time insights for better products, customer experience and resilient pla...Real time insights for better products, customer experience and resilient pla...
Real time insights for better products, customer experience and resilient pla...
 
Accelerating SDLC for Large Public Sector Enterprise Applications
Accelerating SDLC for Large Public Sector Enterprise ApplicationsAccelerating SDLC for Large Public Sector Enterprise Applications
Accelerating SDLC for Large Public Sector Enterprise Applications
 
An Introduction to Data Virtualization in 2018
An Introduction to Data Virtualization in 2018An Introduction to Data Virtualization in 2018
An Introduction to Data Virtualization in 2018
 
Big Data Case study - caixa bank
Big Data Case study - caixa bankBig Data Case study - caixa bank
Big Data Case study - caixa bank
 
Contexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to ProductionContexti / Oracle - Big Data : From Pilot to Production
Contexti / Oracle - Big Data : From Pilot to Production
 
Informatica Becomes Part of the Business Data Lake Ecosystem
Informatica Becomes Part of the Business Data Lake EcosystemInformatica Becomes Part of the Business Data Lake Ecosystem
Informatica Becomes Part of the Business Data Lake Ecosystem
 
Salesforce mumbai user group june meetup
Salesforce mumbai user group   june meetupSalesforce mumbai user group   june meetup
Salesforce mumbai user group june meetup
 
Jeeva_Resume
Jeeva_ResumeJeeva_Resume
Jeeva_Resume
 
Oracle Big Data Governance Webcast Charts
Oracle Big Data Governance Webcast ChartsOracle Big Data Governance Webcast Charts
Oracle Big Data Governance Webcast Charts
 
Data Virtualization: An Introduction
Data Virtualization: An IntroductionData Virtualization: An Introduction
Data Virtualization: An Introduction
 

Último

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxLoriGlavin3
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.Curtis Poe
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersNicole Novielli
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rick Flair
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenHervé Boutemy
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????blackmambaettijean
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxLoriGlavin3
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxLoriGlavin3
 

Último (20)

What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxPasskey Providers and Enabling Portability: FIDO Paris Seminar.pptx
Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptx
 
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptxThe Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
The Role of FIDO in a Cyber Secure Netherlands: FIDO Paris Seminar.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.How AI, OpenAI, and ChatGPT impact business and software.
How AI, OpenAI, and ChatGPT impact business and software.
 
A Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software DevelopersA Journey Into the Emotions of Software Developers
A Journey Into the Emotions of Software Developers
 
"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...Rise of the Machines: Known As Drones...
Rise of the Machines: Known As Drones...
 
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICESSALESFORCE EDUCATION CLOUD | FEXLE SERVICES
SALESFORCE EDUCATION CLOUD | FEXLE SERVICES
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
DevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache MavenDevoxxFR 2024 Reproducible Builds with Apache Maven
DevoxxFR 2024 Reproducible Builds with Apache Maven
 
What is Artificial Intelligence?????????
What is Artificial Intelligence?????????What is Artificial Intelligence?????????
What is Artificial Intelligence?????????
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptxA Deep Dive on Passkeys: FIDO Paris Seminar.pptx
A Deep Dive on Passkeys: FIDO Paris Seminar.pptx
 
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptxThe Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
The Fit for Passkeys for Employee and Consumer Sign-ins: FIDO Paris Seminar.pptx
 

Taming the ETL beast: How LinkedIn uses metadata to run complex ETL flows reliably

  • 1. Taming the ETL beast How LinkedIn uses metadata to run complex ETL flows reliably Rajappa Iyer Strata Conference, London, November 12, 2013
  • 2. `whoami`  Data Infrastructure @ LinkedIn since 2011  Prior to that: – Director of Engineering at Digg – Enterprise Data Architect at eBay  www.linkedin.com/in/rajappaiyer/
  • 3. Outline of talk  Background and Context – The Why  Challenges with Data Delivery – The What  Metadata to the Rescue – The How  Q&A
  • 4. LinkedIn: The World’s Largest Professional Network Connecting Talent  Opportunity. At scale… 259M+ 2 new Members Worldwide Members Per Second 100M+ Monthly Unique Visitors 3M+ Company Pages
  • 5. Data Driven Products and Insights Products for Members Data, Platforms, Analytics Products for Enterprises (Companies) (Professionals) Insights (Analysts and Data Scientists)
  • 7. Products for Enterprises Hire - Talent Solutions Sell - Sales Navigator Market - Marketing Solutions
  • 9. Example of Deeper Insight Job Migration After Financial Collapse
  • 10. Data is critical to LinkedIn’s products It needs to be delivered in a reliable and timely manner LinkedIn Confidential ©2013 All Rights Reserved 10
  • 11. A Simplified Overview of Data Flow Hadoop Site (Member Facing Products) Activity Data Kafka Camus Member Data Espresso / Voldemort / Oracle DWH ETL Product, Sciences, Enterprise Analytics Changes Databus External Partner Data Lumos Ingest Utilities Computed Results for Member Facing Products Teradata Enterprise Products Core Data Set Derived Data Set
  • 12. Components of typical ETL jobs  Ingress / Egress of message-oriented data – Logs and clickstream data  Ingress / Egress of record-oriented data – Database data  Transformations – – – – – Select, project, join Aggregations Partitioning Cleansing and data normalization Schema conversions – e.g., Nested JSON to Relational LinkedIn Confidential ©2013 All Rights Reserved 12
  • 13. An Example ETL Flow LinkedIn Confidential ©2013 All Rights Reserved 13
  • 14. Challenges  Complex process dependencies – Some flows are over 30 levels deep – Flows may span multiple platforms (Hadoop, RDBMS etc.)  Complex data dependencies – Multiple flows may consume a data element – Multiple data elements feed into a single flow – Can be viewed as “data sync barriers”  Recovery – Restartable flows that pick up from last checkpoint – Catch up mode to compensate for downtime  Monitoring and Alerting – Prioritization of “important” flows for ops attention – Who do you call when things fail? LinkedIn Confidential ©2013 All Rights Reserved 14
  • 15. Metadata to the rescue  What metadata is collected? – Process dependencies – Data dependencies – Execution history and data processing statistics  How is it used? – Drives the ETL framework with lots of functionality     Check for data availability Retries and restarts Standardized error reporting / alerting Prioritized view of business critical flows LinkedIn Confidential ©2013 All Rights Reserved 15
  • 16. Metadata: Process Dependencies  Capture process dependency graph Workflow F Start – Also capture metadata such as process owners, importance, SLA etc. Workunit W1 on success Workunit W2 on success on failure Workunit W3 Workunit W4 on success on success Workunit W5  Capture stats for each execution of a workflow – Time of execution – Execution status – Pointer to error logs  Alert on delayed processes – Based on execution history Stop
  • 17. Metadata: Data Dependencies Data Entity D1 Data Entity D2 consumes consumes Workflow F produces Data Entity D3  For each flow, capture input and output data elements  For each flow execution, capture stats on data element  Number of records or messages processed  Error counts  Watermarks – Can be time based or sequence based – This can be per flow as more than one flow can consume a data element
  • 18. Metadata: Data Elements  Simple catalog of data elements – Name, physical location, owner etc.  Data elements can have logical names – Names resolve to one or more physical entity – Logical names can represent useful collections  E.g., data as of a particular interval  Data element availability can trigger processes – E.g., kick off hourly process when hourly data is complete and available – Enables data driven ETL scheduling 18
  • 19. Putting it all together Dashboards , Reports ETL applications Data Availability Status ETL Framework Scheduler Checkpoint Execution State Retry / Resume Name resolver Execution History Data Check Statistics (process and data) Alerting / Monitoring Log Parsers Data Lineage Metadata Management System LinkedIn Confidential ©2013 All Rights Reserved 19
  • 20. Questions? More at data.linkedin.com Come Work on Challenging Data Infrastructure problems - We’re Hiring

Notas del editor

  1. Filter, aggregation, partition, normalization, joins