Big Data Ecosystem at LinkedIn
BIG 2015 Workshop at WWW
LinkedIn: Largest Professional Network
• 360M members
• 2 new members per second
Rich Data Driven Products at LinkedIn
• Similar Profiles
• Connections
• News
• Skill Endorsements
How to build Data Products
• Data Ingress
  • Moving data from the online to the offline system
• Data Processing
  • Managing offline processes
• Data Egress
  • Moving results from the offline to the online system
Example Data Product: PYMK
• People You May Know (PYMK): recommends members to connect with
Outline
• Data Ingress
  • Moving data from the online to the offline system
• Data Processing
  • Managing offline processes
• Data Egress
  • Moving results from the offline to the online system
Data Ingress - Types of Data
• Database data: member profiles, connections, …
• Activity data: page views, impressions, etc.
• Application and system metrics
• Service logs
Data Ingress - Point-to-point Pipelines
• O(n^2) data integration complexity
• Fragile, delayed, lossy
• Non-standardized
Data Ingress - Centralized Pipeline
• O(n) data integration complexity
• More reliable
• Standardizable
Data Ingress: Apache Kafka
• Publish-subscribe messaging
• Producers send messages to Brokers
• Consumers read messages from Brokers
• Messages are sent to a topic
  • E.g. PeopleYouMayKnowTopic
• Each topic is broken into one or more ordered partitions of messages (sketch below)
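
As a concrete illustration of the publish-subscribe model above, a minimal sketch using the open source kafka-python client; the broker address and message payload are placeholders, and only the topic name comes from the slide.

from kafka import KafkaProducer, KafkaConsumer

# Producers send messages to brokers under a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("PeopleYouMayKnowTopic", value=b'{"viewer": 123, "candidate": 456}')
producer.flush()

# Consumers read messages from brokers; each topic is split into
# ordered partitions, and each message has an offset within its partition.
consumer = KafkaConsumer("PeopleYouMayKnowTopic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.partition, message.offset, message.value)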
Kafka: Data Evolution and Loading
• Standardized schema for each topic
  • Avro (example schema below)
  • Central schema repository
  • Producers and consumers use the same schema
• Data verification - audits
• ETL to Hadoop
  • Map-only jobs load data from the brokers
Goodhope et al., IEEE Data Eng. 2012
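
To make the standardized-schema idea concrete, a small sketch using the fastavro library; the PageViewEvent record and its fields are hypothetical, not LinkedIn's actual schema.

import io
from fastavro import parse_schema, writer, reader

# A hypothetical Avro schema, as it might live in the central repository.
schema = parse_schema({
    "type": "record",
    "name": "PageViewEvent",
    "fields": [
        {"name": "memberId", "type": "long"},
        {"name": "pageKey",  "type": "string"},
        {"name": "time",     "type": "long"},
    ],
})

# The producer serializes records with the schema...
buf = io.BytesIO()
writer(buf, schema, [{"memberId": 123, "pageKey": "pymk", "time": 1430000000}])

# ...and a consumer (e.g. the Hadoop ETL) decodes with the same schema.
buf.seek(0)
for record in reader(buf):
    print(record)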
Outline
• Data Ingress
  • Moving data from the online to the offline system
• Data Processing
  • Batch processing using Hadoop, Azkaban, Cubert
  • Stream processing using Samza
  • Iterative processing using Spark
• Data Egress
  • Moving results from the offline to the online system
Data Processing: Hadoop
• Ease of programming
  • High-level Map and Reduce functions
• Scalable to very large clusters
• Fault tolerant
  • Speculative execution, automatic restart of failed jobs
• Scripting languages: Pig, Hive, Scalding
Data Processing: Hadoop at LinkedIn
• Used for data products, feature computation, training models, analytics and reporting, troubleshooting, …
• Native MapReduce, Pig, Hive
• Workflows with 100s of Hadoop jobs
• 100s of workflows
• Processing petabytes of data every day
Data Processing Example: PYMK Feature Engineering
• How do people know each other? Triangle closing:
• Prob(Bob knows Carol) ~ the number of common connections (toy computation below)
[Figure: triangle closing - Alice is a common connection of Bob and Carol]
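
A toy version of this signal in plain Python (illustrative only): count common connections for every second-degree pair in a small undirected graph.

from collections import defaultdict
from itertools import combinations

connections = {
    "Alice": {"Bob", "Carol"},
    "Bob":   {"Alice"},
    "Carol": {"Alice"},
}

common = defaultdict(int)
for member, friends in connections.items():
    # every pair of this member's connections is a second-degree pair
    for a, b in combinations(sorted(friends), 2):
        common[(a, b)] += 1

print(dict(common))  # {('Bob', 'Carol'): 1} - Bob and Carol share one connection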
Data Processing in Hadoop Example: PYMK Triangle Closing

-- connections in (source_id, dest_id) format, in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
-- generatePair is a UDF that emits all second-degree pairs (id1, id2)
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) AS (id1, id2);
-- aggregate the second-degree pairs and count common connections
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    FLATTEN(group) AS (source_id, dest_id),
    COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();

How to manage a production Hadoop workflow?
Azkaban: Hadoop Workflow Management
• Configuration
• Dependency management
• Access control
• Scheduling and SLA management
• Monitoring, history
Distributed Machine Learning: ML-ease
• ADMM (Alternating Direction Method of Multipliers) logistic regression for binary response prediction (sketch below)
Agarwal et al. 2014
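
As a rough sketch of the consensus-ADMM idea behind ML-ease (not the production implementation described by Agarwal et al.), each partition fits a local logistic regression with a proximal penalty tying it to a shared consensus vector, which is then averaged and broadcast back:

import numpy as np
from scipy.optimize import minimize

def local_loss(w, X, y, z, u, rho):
    # logistic loss on one partition plus the proximal term toward consensus z
    logits = X @ w
    return np.mean(np.log1p(np.exp(-y * logits))) + (rho / 2) * np.sum((w - z + u) ** 2)

def admm_logreg(partitions, dim, rho=1.0, iters=20):
    n = len(partitions)
    ws = [np.zeros(dim) for _ in range(n)]   # local coefficients
    us = [np.zeros(dim) for _ in range(n)]   # scaled dual variables
    z = np.zeros(dim)                        # consensus coefficients
    for _ in range(iters):
        # local updates: independent fits, one per partition (run in parallel)
        for i, (X, y) in enumerate(partitions):
            ws[i] = minimize(local_loss, ws[i], args=(X, y, z, us[i], rho)).x
        # consensus update: aggregate the local solutions
        z = np.mean([w + u for w, u in zip(ws, us)], axis=0)
        # dual updates push the locals toward agreement
        for i in range(n):
            us[i] += ws[i] - z
    return z

# toy data with labels in {-1, +1}, split into two "partitions"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
print(admm_logreg([(X[:100], y[:100]), (X[100:], y[100:])], dim=3))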
Limitations of Hadoop: Join and Group By
- Two datasets: A = (Salesman, Product), B = (Salesman, Location)
SELECT SomeAggregate() FROM A INNER JOIN B ON A.Salesman = B.Salesman GROUP BY A.Product, B.Location
• Common Hadoop MapReduce/Pig/Hive implementation
  • MapReduce job 1: load the data, then shuffle and reduce to perform the inner join; store the output
  • MapReduce job 2: load that output, shuffle on the group-by keys, and aggregate on the reducers to produce the final result
Limitations of Triangle Closing Using Hadoop
• Large amount of data to shuffle from Mappers to Reducers

-- connections in (source_id, dest_id) format, in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) AS (id1, id2);
-- shuffling all second-degree pairs here: terabytes of data
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    FLATTEN(group) AS (source_id, dest_id),
    COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
Cubert
• An open source project built for analytics needs
• Map side aggregation
• Minimizes intermediate data and shuffling
• Fast and scalable primitives for joins and aggregation
• Partitions data into blocks
• Specialized operators MeshJoin, Cube
• 5-60X faster in our experience
• Developer friendly - script-like language
Vemuri et al. VLDB 2014
Cubert Design
• Language
  • Scripting language
  • Physical - write MR programs
• Execution
  • Data movement: Shuffle, Blockgen, Combine, Pivot
  • Primitives: MeshJoin, Cube
  • Data blocks: partitioning of data by a cost function
Vemuri et al. VLDB 2014
Cubert Script: count Daily/Weekly Stats
JOB "create blocks of the fact table"
MAP {
data = LOAD ("$FactTable", $weekAgo, $today) USING AVRO();
}
// create blocks of one week of data with a cost function
BLOCKGEN data BY ROW 1000000 PARTITIONED ON userId;
STORE data INTO "$output/blocks" USING RUBIX;
END
JOB "compute cubes"
MAP {
data = LOAD "$output/blocks" USING RUBIX;
// create a new column 'todayUserId' for today's records only
data = FROM data GENERATE country, locale, userId, clicks,
CASE(timestamp == $today, userId) AS todayUserId;
}
// creates the three cubes in a single job to count daily, weekly users and clicks
CUBE data BY country, locale INNER userId
AGGREGATES COUNT_DISTINCT(userId) as weeklyUniqueUsers,
COUNT_DISTINCT(todayUserId) as dailyUniqueUsers,
SUM(clicks) as totalClicks;
STORE data INTO "$output/results" USING AVRO();
END
Vemuri et al. VLDB 2014
Cubert Example: Join and Group By
- Two datasets: A = (Salesman, Product), B = (Salesman, Location)
SELECT SomeAggregate() FROM A INNER JOIN B ON A.Salesman = B.Salesman GROUP BY A.Product, B.Location
• Sort A by Product and B by Location
• Divide A and B into specialized blocks sorted by the group-by keys
• Load A's blocks into memory and stream B's blocks to join
• The group-by can be performed immediately after the join
Vemuri et al. VLDB 2014
Cubert Example: Triangle Closing
• Divide connections (src, dest) into blocks
• Duplicate the connection graph: G1, G2
• Sort G1 edges (src, dest) by src
• Sort G2 edges (src, dest) by dest
• MeshJoin G1 and G2 such that G1.dest = G2.src
• Aggregate by (G1.src, G2.dest) to get the number of common connections (illustrated below)
• 50% speedup
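
For intuition only (not Cubert's MeshJoin operator itself), the same join-then-aggregate formulation as a pandas self-join over a tiny edge list:

import pandas as pd

# connection edges stored in both directions, as on the slide
edges = pd.DataFrame({"src": ["B", "C", "A", "A"],
                      "dst": ["A", "A", "B", "C"]})

# joining G1.dest = G2.src yields all two-hop paths
paths = edges.merge(edges, left_on="dst", right_on="src", suffixes=("_1", "_2"))
paths = paths[paths["src_1"] != paths["dst_2"]]  # drop paths back to the start

# counting rows per (G1.src, G2.dest) gives common-connection counts
common = (paths.groupby(["src_1", "dst_2"]).size()
               .rename("common_connections").reset_index())
print(common)  # B and C share one common connection (A)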
Cubert Summary
• Built for analytics needs
• Faster and scalable: 5-60X
• Working well in practice
Vemuri et al. VLDB 2014
Outline
• Ingress
  • Moving data from the online to the offline system
• Offline Processing
  • Batch processing - Hadoop, Azkaban, Cubert
  • Stream processing - Samza
  • Iterative processing - Spark
• Egress
  • Moving results from the offline to the online system
Samza
• Samza: streaming computation
  • Built on top of a messaging layer like Kafka for input/output
• Low latency
• Stateful processing through a local store
• Many use cases at LinkedIn
  • Site-speed monitoring
  • Data standardization
Samza: Site Speed Monitoring
• The LinkedIn homepage is assembled by calling many services
• Each service logs through Kafka what went on with a request ID
Samza: Site Speed Monitoring
• The complete record of a request is scattered across Kafka logs
• Problem: combine these logs to generate a holistic view
Samza: Site Speed Monitoring
• Hadoop/MR: join the logs using the request ID - once a day
  • Too late to troubleshoot any issue
• Samza: join the Kafka logs using the request ID in near real time
Samza: Site Speed Monitoring
• Samza: join the Kafka logs using the request ID in near real time
• Two jobs (pattern sketched below)
  • Partition the Kafka stream by request ID
  • Aggregate all the records for a request ID
Fernandez et al. CIDR 2015
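
Samza jobs themselves are written against its Java StreamTask API; purely to illustrate the first (repartition) stage of the pattern, here is a sketch with plain kafka-python clients, where the topic names and the requestId field are hypothetical:

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("frontend_service_call",           # hypothetical topic
                         bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:
    event = json.loads(message.value)
    # re-keying by request ID routes every event for one request to the
    # same partition; a second job can then aggregate per request ID
    producer.send("service_calls_by_request",
                  key=event["requestId"].encode(),
                  value=message.value)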
Outline
• Ingress
  • Moving data from the online to the offline system
• Offline Processing
  • Batch processing - Hadoop, Azkaban, Cubert
  • Stream processing - Samza
  • Iterative processing - Spark
• Egress
  • Moving results from the offline to the online system
Iterative Processing using Spark
• Limitations of MapReduce
• What is Spark?
• Spark at LinkedIn
Limitations of MapReduce
• Iterative computation is slow
  • Inefficient multi-pass computation
  • Intermediate data is written to the distributed file system
Limitations of MapReduce
• Interactive computation is slow
  • The same data is loaded again and again from the distributed file system
Example: ADMM at LinkedIn
• Intermediate data is stored in the distributed file system (HDFS) - slow
[Figure: ADMM iterations writing intermediate data to HDFS]
Spark
• Extends the programming language with a distributed data structure
  • Resilient Distributed Datasets (RDDs)
  • Can be stored in memory
• Faster iterative computation (sketch below)
• Faster interactive computation
• Clean APIs in Python, Scala, Java
• SQL, streaming, machine learning, and graph processing support
Matei Zaharia et al. NSDI 2012
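
A small PySpark sketch of why in-memory RDDs speed up iteration: the points are cached once and reused by every gradient step instead of being re-read from HDFS on each pass (toy logistic-regression gradient, illustrative only):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="rdd-iteration-sketch")
# each element: (features, label) with label in {-1, +1}; cached in memory
points = sc.parallelize([(np.array([1.0, x]), 1.0 if x > 0 else -1.0)
                         for x in np.linspace(-1, 1, 1000)]).cache()
n = points.count()

w = np.zeros(2)
for _ in range(10):
    # gradient of the logistic loss, computed in parallel over the cached RDD
    grad = points.map(
        lambda p: -p[1] * p[0] / (1.0 + np.exp(p[1] * p[0].dot(w)))
    ).reduce(lambda a, b: a + b)
    w -= 0.1 * grad / n

print(w)
sc.stop()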
Spark at LinkedIn
• ADMM on Spark
• Intermediate data is stored in memory - faster
[Figure: ADMM iterations keeping intermediate data in memory]
Outline
• Data Ingress
  • Moving data from the online to the offline system
• Data Processing
  • Batch processing - Hadoop, Azkaban, Cubert
  • Iterative processing - Spark
  • Stream processing - Samza
• Data Egress
  • Moving results from the offline to the online system
Data Egress - Key/Value
• Key-value store: Voldemort
  • Based on Amazon's Dynamo
  • Distributed
  • Scalable
• Bulk load from Hadoop
• Simple to use
  • store results into 'url' using KeyValue('member_id')
Sumbaly et al. FAST 2012
Data Egress - Streams
• Stream - Kafka
  • Hadoop job acts as the Producer
  • An online service acts as the Consumer
• Simple to use
  • store data into 'url' using Stream("topic=x")
Goodhope et al., IEEE Data Eng. 2012
Conclusion
• Rich primitives for Data Ingress, Processing, and Egress
  • Data Ingress: Kafka, ETL
  • Data Processing
    • Batch processing - Hadoop, Cubert
    • Stream processing - Samza
    • Iterative processing - Spark
  • Data Egress: Voldemort, Kafka
• Allows data scientists to focus on building data products
Future Opportunities
• Models of computation
• Efficient Graph processing
• Distributed Machine Learning
Acknowledgements
Thanks to the data team at LinkedIn: data.linkedin.com
Contact: mtiwari@linkedin.com
@mitultiwari


Speaker Notes

  1. Hi Everyone. I am Mitul Tiwari. Today I am going to talk about Big Data Ecosystem at LinkedIn.
2. LinkedIn is the largest professional network, with more than 360M members, and it's growing fast, with more than 2 members joining per second. What's LinkedIn's mission? LinkedIn's mission is to connect the world's professionals and make them more productive and successful. Members can connect with each other and maintain their professional network on LinkedIn.
3. A rich recommender ecosystem at LinkedIn: connections, news, skills, jobs, companies, groups, search queries, talent, similar profiles, and more.
4. How do we build these data-driven products? Building these data products involves three major steps. First, moving production data from the online to the offline system. Second, processing the data in the offline system using technologies such as Hadoop, Samza, and Spark. And finally, moving the results, or processed data, from the offline to the online serving system.
5. Let's take a concrete data product example: People You May Know at LinkedIn. Production data such as database data and activity data is moved to the offline system. The offline system processes this data to generate PYMK recommendations for each member. This recommendation output is stored in the key-value store Voldemort. The production system queries this store to get PYMK recommendations for a given member and serves them online. Any deployed large-scale recommendation system has to deal with scaling challenges.
6. Let me talk about each of these three steps in more detail, starting with ingress, that is, moving data from the online system to the offline system.
7. There are various types of data in LinkedIn's online production system. Database data contains member information such as profiles and connections; this is persistent data that members have provided. Activity data captures member activities, such as which pages a member viewed or which People You May Know results were shown (impressed) to users. Performance and system metrics of the online serving system are also stored to monitor its health. Finally, each online service generates various kinds of log information, for example, which request parameters were used by the People You May Know backend service while serving results.
8. The initial solution built for data ingress was point-to-point: each production service had many offline clients, and data was transferred directly from a production service to an offline system. Such a solution has many limitations. First, O(N^2) data integration complexity: each online system could be transferring data to every offline system. Second, it is fragile and easy to break; it is very hard to monitor the correctness of the data flow, and because of the O(N^2) complexity a service or data pipeline can easily be overloaded, resulting in delayed or lost data. Finally, this solution is very hard to standardize, and each point-to-point data transfer can end up with its own schema.
9. At LinkedIn we have built a centralized data pipeline. This reduces point-to-point data transfer complexity to O(N), lets us build a more reliable pipeline, and makes the pipeline standardizable: producers and consumers can share a standard schema per topic.
10. At LinkedIn we have built an open source data ingress pipeline called Kafka. Kafka is a publish-subscribe messaging system. Producers of data (such as online serving systems) send data to brokers, and consumers (such as the offline system) read messages from brokers. Messages are sent to a particular topic; for example, PYMK impressions are sent to a topic such as PYMKImpressionTopic. Each topic is broken into one or more ordered partitions of messages.
11. Kafka uses a standardized schema for each topic. We use Avro schemas, which are like JSON schemas with superior serialization and deserialization properties. There is a central repository of the schema for each topic, and both producers and consumers use the same topic schema. Kafka also simplifies data verification, using audits on the number of produced messages versus the number of consumed messages, and facilitates ETL of data to Hadoop using map-only jobs that load data from the brokers. For more details check out this IEEE Data Engineering paper.
12. Once data is available in the offline data processing system, we use various technologies such as Hadoop, Samza, and Spark to process it. Let me start by talking about batch processing technologies based on Hadoop.
13. Hadoop has been very successful at scaling offline computation. Hadoop eases distributed programming by providing simple high-level primitives, the Map and Reduce functions, and is scalable to very large clusters. Hadoop MapReduce provides fault-tolerance features such as speculative execution and automatic restart of failed MapReduce tasks. Many scripting languages like Pig, Hive, and Scalding are built on top of Hadoop to further ease programming.
14. At LinkedIn, Hadoop is in use for building data products, feature computation, training machine learning models, business analytics, troubleshooting by analyzing data, etc. We have workflows with 100s of Hadoop MapReduce jobs, and 100s of such workflows. Daily we process petabytes of data on Hadoop.
15. One good signal is common connections: Bob and Carol are likely to know each other if they share a common connection, and as the number of common connections increases, the likelihood of the two people knowing each other increases.
16. Here is an example of data processing using Hadoop. For PYMK an important feature is triangle closing, that is, finding the second-degree connections and the number of common connections between two members. Here is a Pig script that computes that (walk through the script).
17. Here is the PYMK production Azkaban Hadoop workflow, which involves dozens of Hadoop jobs and dependencies. It looks complicated, but it's trivial to manage such workflows using Azkaban.
18. How to manage production Hadoop workflows: Azkaban handles configuration, dependency management, access control, scheduling, and monitoring.
19. After feature engineering and getting features such as triangle closing and organizational overlap scores for schools and companies, we apply a machine learning model to predict the probability of two people knowing each other. We also incorporate user feedback, both explicit and implicit, to enhance the connection probability. We use past connections as the positive response variable to train our machine learning model.
20. ADMM stands for Alternating Direction Method of Multipliers (Boyd et al. 2011). The basic idea of ADMM is as follows: ADMM treats large-scale logistic regression model fitting as a convex optimization problem with constraints. While minimizing the user-defined loss function, it enforces an extra constraint that the coefficients from all partitions must be equal. To solve this optimization problem, ADMM uses an iterative process. In each iteration it partitions the big data into many small partitions and fits an independent logistic regression to each partition. Then it aggregates the coefficients collected from all partitions, learns the consensus coefficients, and sends them back to all partitions to retrain. After 10-20 iterations, it ends up with a converged solution that is theoretically close to what you would have obtained by training on a single machine.
23. Load one week of data and build an OLAP cube with country and locale as dimensions, counting unique users over the week, unique users for today, and the total number of clicks.
27. Consider what data is necessary to build a particular view of the LinkedIn home page. We provide interesting news via Pulse, timely updates from your connections in the Network Update Stream, potential new connections from People You May Know, advertisements targeted to your background, and much more. Each service publishes its logs to its own specific Kafka topic, named after the service, i.e. <service>_service_call. There are hundreds of these topics, one for each service, and they share the same Avro schema, which allows them to be analyzed together. This schema includes timing information, who called whom, what was returned, etc., as well as the specifics of what each particular service call did. Additionally, log4j-style warnings and errors are routed to Kafka in a separate <service>_log_event topic.
  28. After a request has been satisfied, the complete record of all the work that went into generating it is scattered across the Kafka logs for each service that participated. These individual logs are great tools for evaluating the performance and correctness of the individual services themselves, and are carefully monitored by the service owners. But how can we use these individual elements to gain a larger view of the entire chain of calls that created that page? Such a perspective would allow us to see how the calls are interacting with each other, identify slow services or highlight redundant or unnecessary calls.
29. By creating a unique value, or GUID, for each call at the front end and propagating that value across all subsequent service calls, it's possible to tie them together and define a tree structure of the calls, starting from the front end all the way through to the leaf service events. We call this value the TreeID and have built one of the first production Samza workflows at LinkedIn around it: the Call Graph Assembly (CGA) pipeline. All events involved in building the page now have such a TreeID, making it a powerful key on which to join data in new and fascinating ways. The CGA pipeline consists of two Samza jobs: the first repartitions the events coming from the sundry service-call Kafka topics, creating a new key from their TreeIDs, while the second assembles those repartitioned events into trees corresponding to the original calls from the front end. This two-stage approach looks quite similar to the classic MapReduce approach, where mappers direct records to the correct reducer and the reducers then aggregate them together in some fashion. We expect this will be a common pattern in Samza jobs, particularly those implementing continuous, stream-based versions of work that had previously been done in batch on Hadoop.
  31. That concludes my brief discussion on Stream processing using Samza. Next I am going to talk about iterative processing using Spark.