Big Data Ecosystem at LinkedIn
BIG 2015 Workshop at WWW
LinkedIn: Largest Professional Network
• 360M members
• 2 new members per second
Rich Data Driven Products at LinkedIn
• Similar Profiles
• Connections
• News
• Skill Endorsements
How to build Data Products
• Data Ingress
  • Moving data from the online to the offline system
• Data Processing
  • Managing offline processes
• Data Egress
  • Moving results from the offline to the online system
Example Data Product: PYMK
• People You May Know (PYMK): recommends members to connect with
Outline
• Data Ingress
  • Moving data from the online to the offline system
• Data Processing
  • Managing offline processes
• Data Egress
  • Moving results from the offline to the online system
Data Ingress - Types of Data
• Database data: member profiles, connections, …
• Activity data: page views, impressions, etc.
• Application and system metrics
• Service logs
Data Ingress - Point-to-point Pipelines
• O(n^2) data integration complexity
• Fragile, delayed, lossy
• Non-standardized
Data Ingress - Centralized Pipeline
• O(n) data integration complexity
• More reliable
• Standardizable
Data Ingress: Apache Kafka
• Publish-subscribe messaging
• Producers send messages to Brokers
• Consumers read messages from Brokers
• Messages are sent to a topic
  • E.g. PeopleYouMayKnowTopic
• Each topic is broken into one or more ordered partitions of messages (sketch below)
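
As a concrete illustration of the publish-subscribe model above, a minimal sketch using the open source kafka-python client; the broker address and message payload are placeholders, and only the topic name comes from the slide.

from kafka import KafkaProducer, KafkaConsumer

# Producers send messages to brokers under a topic.
producer = KafkaProducer(bootstrap_servers="localhost:9092")
producer.send("PeopleYouMayKnowTopic", value=b'{"viewer": 123, "candidate": 456}')
producer.flush()

# Consumers read messages from brokers; each topic is split into
# ordered partitions, and each message has an offset within its partition.
consumer = KafkaConsumer("PeopleYouMayKnowTopic",
                         bootstrap_servers="localhost:9092",
                         auto_offset_reset="earliest")
for message in consumer:
    print(message.partition, message.offset, message.value)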
Kafka: Data Evolution and Loading
• Standardized schema for each topic
  • Avro (example schema below)
  • Central schema repository
  • Producers and consumers use the same schema
• Data verification - audits
• ETL to Hadoop
  • Map-only jobs load data from the brokers
Goodhope et al., IEEE Data Eng. 2012
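
To make the standardized-schema idea concrete, a small sketch using the fastavro library; the PageViewEvent record and its fields are hypothetical, not LinkedIn's actual schema.

import io
from fastavro import parse_schema, writer, reader

# A hypothetical Avro schema, as it might live in the central repository.
schema = parse_schema({
    "type": "record",
    "name": "PageViewEvent",
    "fields": [
        {"name": "memberId", "type": "long"},
        {"name": "pageKey",  "type": "string"},
        {"name": "time",     "type": "long"},
    ],
})

# The producer serializes records with the schema...
buf = io.BytesIO()
writer(buf, schema, [{"memberId": 123, "pageKey": "pymk", "time": 1430000000}])

# ...and a consumer (e.g. the Hadoop ETL) decodes with the same schema.
buf.seek(0)
for record in reader(buf):
    print(record)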
Outline
• Data Ingress
  • Moving data from the online to the offline system
• Data Processing
  • Batch processing using Hadoop, Azkaban, Cubert
  • Stream processing using Samza
  • Iterative processing using Spark
• Data Egress
  • Moving results from the offline to the online system
Data Processing: Hadoop
• Ease of programming
  • High-level Map and Reduce functions
• Scalable to very large clusters
• Fault tolerant
  • Speculative execution, automatic restart of failed jobs
• Scripting languages: Pig, Hive, Scalding
Data Processing: Hadoop at LinkedIn
• Used for data products, feature computation, training models, analytics and reporting, troubleshooting, …
• Native MapReduce, Pig, Hive
• Workflows with 100s of Hadoop jobs
• 100s of workflows
• Processing petabytes of data every day
Data Processing Example: PYMK Feature Engineering
• How do people know each other? Triangle closing:
• Prob(Bob knows Carol) ~ the number of common connections (toy computation below)
[Figure: triangle closing - Alice is a common connection of Bob and Carol]
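
A toy version of this signal in plain Python (illustrative only): count common connections for every second-degree pair in a small undirected graph.

from collections import defaultdict
from itertools import combinations

connections = {
    "Alice": {"Bob", "Carol"},
    "Bob":   {"Alice"},
    "Carol": {"Alice"},
}

common = defaultdict(int)
for member, friends in connections.items():
    # every pair of this member's connections is a second-degree pair
    for a, b in combinations(sorted(friends), 2):
        common[(a, b)] += 1

print(dict(common))  # {('Bob', 'Carol'): 1} - Bob and Carol share one connection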
Data Processing in Hadoop Example: PYMK Triangle Closing

-- connections in (source_id, dest_id) format, in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
-- generatePair is a UDF that emits all second-degree pairs (id1, id2)
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) AS (id1, id2);
-- aggregate the second-degree pairs and count common connections
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    FLATTEN(group) AS (source_id, dest_id),
    COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();

How to manage a production Hadoop workflow?
Azkaban: Hadoop Workflow Management
• Configuration
• Dependency management
• Access control
• Scheduling and SLA management
• Monitoring, history
Distributed Machine Learning: ML-ease
• ADMM (Alternating Direction Method of Multipliers) logistic regression for binary response prediction (sketch below)
Agarwal et al. 2014
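
As a rough sketch of the consensus-ADMM idea behind ML-ease (not the production implementation described by Agarwal et al.), each partition fits a local logistic regression with a proximal penalty tying it to a shared consensus vector, which is then averaged and broadcast back:

import numpy as np
from scipy.optimize import minimize

def local_loss(w, X, y, z, u, rho):
    # logistic loss on one partition plus the proximal term toward consensus z
    logits = X @ w
    return np.mean(np.log1p(np.exp(-y * logits))) + (rho / 2) * np.sum((w - z + u) ** 2)

def admm_logreg(partitions, dim, rho=1.0, iters=20):
    n = len(partitions)
    ws = [np.zeros(dim) for _ in range(n)]   # local coefficients
    us = [np.zeros(dim) for _ in range(n)]   # scaled dual variables
    z = np.zeros(dim)                        # consensus coefficients
    for _ in range(iters):
        # local updates: independent fits, one per partition (run in parallel)
        for i, (X, y) in enumerate(partitions):
            ws[i] = minimize(local_loss, ws[i], args=(X, y, z, us[i], rho)).x
        # consensus update: aggregate the local solutions
        z = np.mean([w + u for w, u in zip(ws, us)], axis=0)
        # dual updates push the locals toward agreement
        for i in range(n):
            us[i] += ws[i] - z
    return z

# toy data with labels in {-1, +1}, split into two "partitions"
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = np.sign(X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.normal(size=200))
print(admm_logreg([(X[:100], y[:100]), (X[100:], y[100:])], dim=3))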
Limitations of Hadoop: Join and Group By
- Two datasets: A = (Salesman, Product), B = (Salesman, Location)
SELECT SomeAggregate() FROM A INNER JOIN B ON A.Salesman = B.Salesman GROUP BY A.Product, B.Location
• Common Hadoop MapReduce/Pig/Hive implementation
  • MapReduce job 1: load the data, then shuffle and reduce to perform the inner join; store the output
  • MapReduce job 2: load that output, shuffle on the group-by keys, and aggregate on the reducers to produce the final result
Limitations of Triangle Closing Using Hadoop
• Large amount of data to shuffle from Mappers to Reducers

-- connections in (source_id, dest_id) format, in both directions
connections = LOAD 'connections' USING PigStorage();
group_conn = GROUP connections BY source_id;
pairs = FOREACH group_conn GENERATE
    generatePair(connections.dest_id) AS (id1, id2);
-- shuffling all second-degree pairs here: terabytes of data
common_conn = GROUP pairs BY (id1, id2);
common_conn = FOREACH common_conn GENERATE
    FLATTEN(group) AS (source_id, dest_id),
    COUNT(pairs) AS common_connections;
STORE common_conn INTO 'common_conn' USING PigStorage();
Cubert
• An open source project built for analytics needs
• Map side aggregation
• Minimizes intermediate data and shuffling
• Fast and scalable primitives for joins and aggregation
• Partitions data into blocks
• Specialized operators MeshJoin, Cube
• 5-60X faster in our experience
• Developer friendly - script-like language
Vemuri et al. VLDB 2014
Cubert Design
• Language
  • Scripting language
  • Physical - write MR programs
• Execution
  • Data movement: Shuffle, Blockgen, Combine, Pivot
  • Primitives: MeshJoin, Cube
  • Data blocks: partitioning of data by a cost function
Vemuri et al. VLDB 2014
Cubert Script: count Daily/Weekly Stats
JOB "create blocks of the fact table"
MAP {
data = LOAD ("$FactTable", $weekAgo, $today) USING AVRO();
}
// create blocks of one week of data with a cost function
BLOCKGEN data BY ROW 1000000 PARTITIONED ON userId;
STORE data INTO "$output/blocks" USING RUBIX;
END
JOB "compute cubes"
MAP {
data = LOAD "$output/blocks" USING RUBIX;
// create a new column 'todayUserId' for today's records only
data = FROM data GENERATE country, locale, userId, clicks,
CASE(timestamp == $today, userId) AS todayUserId;
}
// creates the three cubes in a single job to count daily, weekly users and clicks
CUBE data BY country, locale INNER userId
AGGREGATES COUNT_DISTINCT(userId) as weeklyUniqueUsers,
COUNT_DISTINCT(todayUserId) as dailyUniqueUsers,
SUM(clicks) as totalClicks;
STORE data INTO "$output/results" USING AVRO();
END
Vemuri et al. VLDB 2014
Cubert Example: Join and Group By
- Two datasets: A = (Salesman, Product), B = (Salesman, Location)
SELECT SomeAggregate() FROM A INNER JOIN B ON A.Salesman = B.Salesman GROUP BY A.Product, B.Location
• Sort A by Product and B by Location
• Divide A and B into specialized blocks sorted by the group-by keys
• Load A's blocks into memory and stream B's blocks to join
• The group-by can be performed immediately after the join
Vemuri et al. VLDB 2014
Cubert Example: Triangle Closing
• Divide connections (src, dest) into blocks
• Duplicate the connection graph: G1, G2
• Sort G1 edges (src, dest) by src
• Sort G2 edges (src, dest) by dest
• MeshJoin G1 and G2 such that G1.dest = G2.src
• Aggregate by (G1.src, G2.dest) to get the number of common connections (illustrated below)
• 50% speedup
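
For intuition only (not Cubert's MeshJoin operator itself), the same join-then-aggregate formulation as a pandas self-join over a tiny edge list:

import pandas as pd

# connection edges stored in both directions, as on the slide
edges = pd.DataFrame({"src": ["B", "C", "A", "A"],
                      "dst": ["A", "A", "B", "C"]})

# joining G1.dest = G2.src yields all two-hop paths
paths = edges.merge(edges, left_on="dst", right_on="src", suffixes=("_1", "_2"))
paths = paths[paths["src_1"] != paths["dst_2"]]  # drop paths back to the start

# counting rows per (G1.src, G2.dest) gives common-connection counts
common = (paths.groupby(["src_1", "dst_2"]).size()
               .rename("common_connections").reset_index())
print(common)  # B and C share one common connection (A)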
Cubert Summary
• Built for analytics needs
• Faster and scalable: 5-60X
• Working well in practice
Vemuri et al. VLDB 2014
Outline
• Ingress
  • Moving data from the online to the offline system
• Offline Processing
  • Batch processing - Hadoop, Azkaban, Cubert
  • Stream processing - Samza
  • Iterative processing - Spark
• Egress
  • Moving results from the offline to the online system
Samza
• Samza: streaming computation
  • Built on top of a messaging layer like Kafka for input/output
• Low latency
• Stateful processing through a local store
• Many use cases at LinkedIn
  • Site-speed monitoring
  • Data standardization
Samza: Site Speed Monitoring
• The LinkedIn homepage is assembled by calling many services
• Each service logs through Kafka what went on with a request ID
Samza: Site Speed Monitoring
• The complete record of a request is scattered across Kafka logs
• Problem: combine these logs to generate a holistic view
Samza: Site Speed Monitoring
• Hadoop/MR: join the logs using the request ID - once a day
  • Too late to troubleshoot any issue
• Samza: join the Kafka logs using the request ID in near real time
Samza: Site Speed Monitoring
• Samza: join the Kafka logs using the request ID in near real time
• Two jobs (pattern sketched below)
  • Partition the Kafka stream by request ID
  • Aggregate all the records for a request ID
Fernandez et al. CIDR 2015
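
Samza jobs themselves are written against its Java StreamTask API; purely to illustrate the first (repartition) stage of the pattern, here is a sketch with plain kafka-python clients, where the topic names and the requestId field are hypothetical:

import json
from kafka import KafkaConsumer, KafkaProducer

consumer = KafkaConsumer("frontend_service_call",           # hypothetical topic
                         bootstrap_servers="localhost:9092")
producer = KafkaProducer(bootstrap_servers="localhost:9092")

for message in consumer:
    event = json.loads(message.value)
    # re-keying by request ID routes every event for one request to the
    # same partition; a second job can then aggregate per request ID
    producer.send("service_calls_by_request",
                  key=event["requestId"].encode(),
                  value=message.value)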
Outline
• Ingress
  • Moving data from the online to the offline system
• Offline Processing
  • Batch processing - Hadoop, Azkaban, Cubert
  • Stream processing - Samza
  • Iterative processing - Spark
• Egress
  • Moving results from the offline to the online system
Iterative Processing using Spark
• Limitations of MapReduce
• What is Spark?
• Spark at LinkedIn
Limitations of MapReduce
• Iterative computation is slow
  • Inefficient multi-pass computation
  • Intermediate data is written to the distributed file system
Limitations of MapReduce
• Interactive computation is slow
  • The same data is loaded again and again from the distributed file system
Example: ADMM at LinkedIn
• Intermediate data is stored in the distributed file system (HDFS) - slow
[Figure: ADMM iterations writing intermediate data to HDFS]
Spark
• Extends the programming language with a distributed data structure
  • Resilient Distributed Datasets (RDDs)
  • Can be stored in memory
• Faster iterative computation (sketch below)
• Faster interactive computation
• Clean APIs in Python, Scala, Java
• SQL, streaming, machine learning, and graph processing support
Matei Zaharia et al. NSDI 2012
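
A small PySpark sketch of why in-memory RDDs speed up iteration: the points are cached once and reused by every gradient step instead of being re-read from HDFS on each pass (toy logistic-regression gradient, illustrative only):

import numpy as np
from pyspark import SparkContext

sc = SparkContext(appName="rdd-iteration-sketch")
# each element: (features, label) with label in {-1, +1}; cached in memory
points = sc.parallelize([(np.array([1.0, x]), 1.0 if x > 0 else -1.0)
                         for x in np.linspace(-1, 1, 1000)]).cache()
n = points.count()

w = np.zeros(2)
for _ in range(10):
    # gradient of the logistic loss, computed in parallel over the cached RDD
    grad = points.map(
        lambda p: -p[1] * p[0] / (1.0 + np.exp(p[1] * p[0].dot(w)))
    ).reduce(lambda a, b: a + b)
    w -= 0.1 * grad / n

print(w)
sc.stop()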
Spark at LinkedIn
• ADMM on Spark
• Intermediate data is stored in memory - faster
[Figure: ADMM iterations keeping intermediate data in memory]
Outline
• Data Ingress
  • Moving data from the online to the offline system
• Data Processing
  • Batch processing - Hadoop, Azkaban, Cubert
  • Iterative processing - Spark
  • Stream processing - Samza
• Data Egress
  • Moving results from the offline to the online system
Data Egress - Key/Value
• Key-value store: Voldemort
  • Based on Amazon's Dynamo
  • Distributed
  • Scalable
• Bulk load from Hadoop
• Simple to use
  • store results into 'url' using KeyValue('member_id')
Sumbaly et al. FAST 2012
Data Egress - Streams
• Stream - Kafka
  • Hadoop job acts as the Producer
  • An online service acts as the Consumer
• Simple to use
  • store data into 'url' using Stream("topic=x")
Goodhope et al., IEEE Data Eng. 2012
Conclusion
• Rich primitives for Data Ingress, Processing, and Egress
  • Data Ingress: Kafka, ETL
  • Data Processing
    • Batch processing - Hadoop, Cubert
    • Stream processing - Samza
    • Iterative processing - Spark
  • Data Egress: Voldemort, Kafka
• Allows data scientists to focus on building data products
Future Opportunities
• Models of computation
• Efficient Graph processing
• Distributed Machine Learning
Acknowledgements
Thanks to the data team at LinkedIn: data.linkedin.com
Contact: mtiwari@linkedin.com
@mitultiwari


Speaker Notes

  1. Hi Everyone. I am Mitul Tiwari. Today I am going to talk about Big Data Ecosystem at LinkedIn.
2. LinkedIn is the largest professional network, with more than 360M members, and it's growing fast, with more than 2 members joining per second. What's LinkedIn's mission? LinkedIn's mission is to connect the world's professionals and make them more productive and successful. Members can connect with each other and maintain their professional network on LinkedIn.
3. A rich recommender ecosystem at LinkedIn: connections, news, skills, jobs, companies, groups, search queries, talent, similar profiles, and more.
4. How do we build these data-driven products? Building these data products involves three major steps. First, moving production data from the online to the offline system. Second, processing the data in the offline system using technologies such as Hadoop, Samza, and Spark. And finally, moving the results, or processed data, from the offline to the online serving system.
5. Let's take a concrete data product example: People You May Know at LinkedIn. Production data such as database data and activity data is moved to the offline system. The offline system processes this data to generate PYMK recommendations for each member. This recommendation output is stored in the key-value store Voldemort. The production system queries this store to get PYMK recommendations for a given member and serves them online. Any deployed large-scale recommendation system has to deal with scaling challenges.
6. Let me talk about each of these three steps in more detail, starting with ingress, that is, moving data from the online system to the offline system.
7. There are various types of data in LinkedIn's online production system. Database data contains member information such as profiles and connections; this is persistent data that members have provided. Activity data captures member activities, such as which pages a member viewed or which People You May Know results were shown (impressed) to users. Performance and system metrics of the online serving system are also stored to monitor its health. Finally, each online service generates various kinds of log information, for example, which request parameters were used by the People You May Know backend service while serving results.
8. The initial solution built for data ingress was point-to-point: each production service had many offline clients, and data was transferred directly from a production service to an offline system. Such a solution has many limitations. First, O(N^2) data integration complexity: each online system could be transferring data to every offline system. Second, it is fragile and easy to break; it is very hard to monitor the correctness of the data flow, and because of the O(N^2) complexity a service or data pipeline can easily be overloaded, resulting in delayed or lost data. Finally, this solution is very hard to standardize, and each point-to-point data transfer can end up with its own schema.
9. At LinkedIn we have built a centralized data pipeline. This reduces point-to-point data transfer complexity to O(N), lets us build a more reliable pipeline, and makes the pipeline standardizable: producers and consumers can share a standard schema per topic.
10. At LinkedIn we have built an open source data ingress pipeline called Kafka. Kafka is a publish-subscribe messaging system. Producers of data (such as online serving systems) send data to brokers, and consumers (such as the offline system) read messages from brokers. Messages are sent to a particular topic; for example, PYMK impressions are sent to a topic such as PYMKImpressionTopic. Each topic is broken into one or more ordered partitions of messages.
11. Kafka uses a standardized schema for each topic. We use Avro schemas, which are like JSON schemas with superior serialization and deserialization properties. There is a central repository of the schema for each topic, and both producers and consumers use the same topic schema. Kafka also simplifies data verification, using audits on the number of produced messages versus the number of consumed messages, and facilitates ETL of data to Hadoop using map-only jobs that load data from the brokers. For more details check out this IEEE Data Engineering paper.
12. Once data is available in the offline data processing system, we use various technologies such as Hadoop, Samza, and Spark to process it. Let me start by talking about batch processing technologies based on Hadoop.
13. Hadoop has been very successful at scaling offline computation. Hadoop eases distributed programming by providing simple high-level primitives, the Map and Reduce functions, and is scalable to very large clusters. Hadoop MapReduce provides fault-tolerance features such as speculative execution and automatic restart of failed MapReduce tasks. Many scripting languages like Pig, Hive, and Scalding are built on top of Hadoop to further ease programming.
14. At LinkedIn, Hadoop is in use for building data products, feature computation, training machine learning models, business analytics, troubleshooting by analyzing data, etc. We have workflows with 100s of Hadoop MapReduce jobs, and 100s of such workflows. Daily we process petabytes of data on Hadoop.
15. One good signal is common connections: Bob and Carol are likely to know each other if they share a common connection, and as the number of common connections increases, the likelihood of the two people knowing each other increases.
16. Here is an example of data processing using Hadoop. For PYMK an important feature is triangle closing, that is, finding the second-degree connections and the number of common connections between two members. Here is a Pig script that computes that (walk through the script).
17. Here is the PYMK production Azkaban Hadoop workflow, which involves dozens of Hadoop jobs and dependencies. It looks complicated, but it's trivial to manage such workflows using Azkaban.
18. How to manage production Hadoop workflows: Azkaban handles configuration, dependency management, access control, scheduling, and monitoring.
19. After feature engineering and getting features such as triangle closing and organizational overlap scores for schools and companies, we apply a machine learning model to predict the probability of two people knowing each other. We also incorporate user feedback, both explicit and implicit, to enhance the connection probability. We use past connections as the positive response variable to train our machine learning model.
20. ADMM stands for Alternating Direction Method of Multipliers (Boyd et al. 2011). The basic idea of ADMM is as follows: ADMM treats large-scale logistic regression model fitting as a convex optimization problem with constraints. While minimizing the user-defined loss function, it enforces an extra constraint that the coefficients from all partitions must be equal. To solve this optimization problem, ADMM uses an iterative process. In each iteration it partitions the big data into many small partitions and fits an independent logistic regression to each partition. Then it aggregates the coefficients collected from all partitions, learns the consensus coefficients, and sends them back to all partitions to retrain. After 10-20 iterations, it ends up with a converged solution that is theoretically close to what you would have obtained by training on a single machine.
23. Load one week of data and build an OLAP cube with country and locale as dimensions, counting unique users over the week, unique users for today, and the total number of clicks.
27. Consider what data is necessary to build a particular view of the LinkedIn home page. We provide interesting news via Pulse, timely updates from your connections in the Network Update Stream, potential new connections from People You May Know, advertisements targeted to your background, and much more. Each service publishes its logs to its own specific Kafka topic, named after the service, i.e. <service>_service_call. There are hundreds of these topics, one for each service, and they share the same Avro schema, which allows them to be analyzed together. This schema includes timing information, who called whom, what was returned, etc., as well as the specifics of what each particular service call did. Additionally, log4j-style warnings and errors are routed to Kafka in a separate <service>_log_event topic.
  28. After a request has been satisfied, the complete record of all the work that went into generating it is scattered across the Kafka logs for each service that participated. These individual logs are great tools for evaluating the performance and correctness of the individual services themselves, and are carefully monitored by the service owners. But how can we use these individual elements to gain a larger view of the entire chain of calls that created that page? Such a perspective would allow us to see how the calls are interacting with each other, identify slow services or highlight redundant or unnecessary calls.
29. By creating a unique value, or GUID, for each call at the front end and propagating that value across all subsequent service calls, it's possible to tie them together and define a tree structure of the calls, starting from the front end all the way through to the leaf service events. We call this value the TreeID and have built one of the first production Samza workflows at LinkedIn around it: the Call Graph Assembly (CGA) pipeline. All events involved in building the page now have such a TreeID, making it a powerful key on which to join data in new and fascinating ways. The CGA pipeline consists of two Samza jobs: the first repartitions the events coming from the sundry service-call Kafka topics, creating a new key from their TreeIDs, while the second assembles those repartitioned events into trees corresponding to the original calls from the front end. This two-stage approach looks quite similar to the classic MapReduce approach, where mappers direct records to the correct reducer and the reducers then aggregate them together in some fashion. We expect this will be a common pattern in Samza jobs, particularly those implementing continuous, stream-based versions of work that had previously been done in batch on Hadoop.
  31. That concludes my brief discussion on Stream processing using Samza. Next I am going to talk about iterative processing using Spark.