Just when you think you have your Kafka and Hadoop clusters set up and humming and you’re well on your path to democratizing data, you realize that you now have a very different set of challenges to solve. You want to provide unfettered access to data to your data scientists, but at the same time, you need to preserve the privacy of your members, who have entrusted you with their data.
Shirshanka Das and Tushar Shanbhag outline the path LinkedIn has taken to protect member privacy in its scalable distributed data ecosystem built around Kafka and Hadoop.
They also discuss three foundational building blocks for scalable data management that can meet data compliance regulations: a centralized metadata system, a standardized data lifecycle management platform, and a unified data access layer. Some of these systems are open source and can be of use to companies that are in a similar situation. Along the way, they also look to the future—specifically, to the General Data Protection Regulation, which comes into effect in 2018—and outline LinkedIn’s plans for addressing those requirements.
But technology is just part of the solution. Shirshanka and Tushar also share the culture and process change they’ve seen happen at the company and the lessons they’ve learned about sustainable process and governance.
Taming the ever-evolving Compliance Beast : Lessons learnt at LinkedIn [Strata NYC 2017]
1. Taming the Compliance Beast:
Lessons learnt at LinkedIn
Sept 28, 2017
Shirshanka Das, Principal Staff Engineer, LinkedIn
Tushar Shanbhag, Head of Data Products, LinkedIn
@shirshanka, @tusharis
3. LinkedIn’s Vision
OUR VISION
Create economic opportunity for every member of the global workforce
29K schools
10M companies
11B endorsements
500M members
10M jobs
4. The LinkedIn Privacy Paradox
“On one hand, the company has 500+ million members trusting the company to protect highly sensitive data.
On the other hand, one only joins the largest professional network on the Internet because they want to be found!”
Kalinda Raina,
Head of Global Privacy, LinkedIn
MEMBER PRIVACY <> MEMBER DISCOVERY
5. Members First is a Core Value for LinkedIn
MEMBER PRIVACY WHILE DELIVERING MEMBER VALUE
Example
Well-connected: get relevance right.
Few connections: give them inventory.
Member value is proportional to knowledge
Member privacy is paramount for LinkedIn
We strive to maintain this fine balance
6. Data Is the Lifeblood of LinkedIn
MEMBER EXPERIENCES + BUSINESS DECISIONS
Member Data
System of Intelligence
Member Experiences
Business Decisions
7. Data Democracy
ALL THE DATA, ALL THE TIME
We needed data democracy to deliver member value
LinkedIn Data Science
I want to analyze as much data as possible so my models are accurate
I want to discover data that’s needed for my analysis as fast as possible
I want to access that data as quickly as possible for my analysis
8. Data Protection
STORE, PROCESS, DELETE, …
Need to Ensure Member Privacy
LinkedIn Members
I want my personal data to be stored only where needed and not propagated unnecessarily
I want my personal data to only be processed if essential and only if I consent
I want my personal data to be deleted when I close my account or request deletion
9. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
19. Apache Gobblin: Simplifying Data Integration
Sources and sinks shown: SFTP, JDBC, REST, Azure (Blob, Data Lake Storage)
@LinkedIn
Hundreds of TB per day
Thousands of datasets
~30 different source systems
80%+ of data ingest
Open source @ https://gobblin.apache.org/
Stream + Batch
Adopted by LinkedIn, Intel, PayPal, Apple, IBM, Swisscom, Prezi, AppLift, NerdWallet and many more…
20. REQUIREMENTS
Less Data
Legal: Right to Erasure or Right to be Forgotten
“Delete all my personal data without undue delay when it is no
longer necessary / when consent has been withdrawn”
Engineering:
Need the ability to delete some specific subset or all data associated
with a specific LinkedIn member from all our data systems
21. A lot of data, different formats
Challenges
Understand HDFS data: organization, formats, …
Cycle asynchronously, within an SLA, deleting
records, without affecting running jobs
Quarantine exceptional records for manual triage
Can scale to processing hundreds of PB of data
Data Deletion
IMPLICATIONS FOR HADOOP
22. Gobblin: The Logical Pipeline
Source -> Work Units -> Tasks -> Data Publish
Each Task: Extract -> Convert -> Quality -> Write
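The logical pipeline on this slide can be sketched roughly as below. This is an illustrative toy in Python, not Gobblin's actual (Java) API: the names `run_task` and `run_job`, and the idea of passing the convert and quality steps as plain functions, are assumptions made for the example.

```python
from typing import Callable, Iterable, List, Optional

Record = dict


def run_task(work_unit: Iterable[Record],
             convert: Callable[[Record], Optional[Record]],
             quality_ok: Callable[[Record], bool]) -> List[Record]:
    """Run one task over one work unit: extract -> convert -> quality -> write."""
    staged: List[Record] = []
    for record in work_unit:             # extract: pull records from the work unit
        converted = convert(record)      # convert: None means "drop this record"
        if converted is None:
            continue
        if not quality_ok(converted):    # quality: skip records that fail the check
            continue
        staged.append(converted)         # write: stage the record for publishing
    return staged


def run_job(work_units: Iterable[Iterable[Record]],
            convert: Callable[[Record], Optional[Record]],
            quality_ok: Callable[[Record], bool]) -> List[Record]:
    """Run one task per work unit, then publish all staged output together."""
    staged_outputs = [run_task(wu, convert, quality_ok) for wu in work_units]
    return [record for output in staged_outputs for record in output]  # publish
```

Splitting the source into work units is what lets the tasks run independently; publishing happens only after the tasks have staged their output.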
23. Gobblin: Extending for Purge
HDFS -> Work Units -> Tasks (Extract -> Convert -> Quality -> Write) -> Data Publish -> HDFS
Purge check against Member’s Delete Requests: if needs purge then drop, else continue
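The "if needs purge then drop, else continue" step, together with the earlier requirement to quarantine exceptional records, might look like the following minimal sketch. The function name `purge_partition`, the `memberId` field name, and the rule that records without a usable member ID are quarantined are all assumptions for illustration, not Gobblin's implementation.

```python
from typing import Iterable, List, Set, Tuple


def purge_partition(records: Iterable[dict],
                    delete_requests: Set[int],
                    id_field: str = "memberId") -> Tuple[List[dict], List[dict]]:
    """Split records into (kept, quarantined); purged records are dropped."""
    kept: List[dict] = []
    quarantined: List[dict] = []
    for record in records:
        member_id = record.get(id_field)
        if not isinstance(member_id, int):
            # Cannot decide who owns this record: hold it for manual triage.
            quarantined.append(record)
        elif member_id in delete_requests:
            # Needs purge: drop the record by not writing it back.
            continue
        else:
            # No delete request for this member: continue down the pipeline.
            kept.append(record)
    return kept, quarantined
```

Because the output is written as a new copy and only then published, jobs that are still reading the old files are not disturbed while the purge runs.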
24. STATUS AND CHALLENGES
Gobblin: Data Lifecycle Management at Scale
Status
Number of datasets: many thousands
Amount of data scanned for purge: XXX TB/day
Challenges
Immutable Storage Formats + Right to Erasure = Unhappy Disks
“Widespread implementation will surely lead to innovation in these formats!”
25. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
28. Metadata based Search Experience
for Data Scientists
Data Discovery
Where is dataset X?
How did it get created?
Usage : In production since 2014
Users : Data Scientists, Product Engineers
Use Cases: Discovery, Impact Analysis
WhereHows
FIND DATA, NAVIGATE RELATIONSHIPS
Open source @ github.com/linkedin/wherehows
31. More than just Discovery
Use Cases
Which datasets at LinkedIn contain PII or highly
confidential data?
How many contain member-member messages?
How many of them are accessible by team X?
Have all datasets been purged within SLA?
Discovering Violations
ANSWERING HARDER QUESTIONS
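Once each dataset carries metadata such as compliance tags, readers, and a last-purge time, the questions above reduce to simple catalog queries. The sketch below assumes a hypothetical in-memory catalog; `DatasetMetadata` and its fields are invented for illustration and are not the WhereHows data model or API.

```python
from dataclasses import dataclass, field
from datetime import datetime, timedelta
from typing import List, Optional, Set


@dataclass
class DatasetMetadata:
    name: str
    tags: Set[str] = field(default_factory=set)      # e.g. "pii", "member_messages"
    readers: Set[str] = field(default_factory=set)   # teams with read access
    last_purged: Optional[datetime] = None           # None means never purged


def with_tag(catalog: List[DatasetMetadata], tag: str) -> List[str]:
    """Which datasets carry a given tag (for example, contain PII)?"""
    return [d.name for d in catalog if tag in d.tags]


def accessible_by(catalog: List[DatasetMetadata], team: str, tag: str) -> List[str]:
    """Which tagged datasets can a given team read?"""
    return [d.name for d in catalog if tag in d.tags and team in d.readers]


def purge_overdue(catalog: List[DatasetMetadata], now: datetime,
                  sla: timedelta) -> List[str]:
    """Which datasets have not been purged within the SLA (or ever)?"""
    return [d.name for d in catalog
            if d.last_purged is None or now - d.last_purged > sla]
```

For example, `purge_overdue(catalog, datetime.now(), timedelta(days=30))` would list datasets that miss a hypothetical 30-day purge SLA.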
32. Wide + Deep
Metadata
Comprehensive coverage of data systems at LinkedIn
We have > 20 systems!
SQL, NoSQL, Indexes, Blob Stores, …
Deeper understanding of each dataset
Schema is not enough
Need to understand semantics
Discovering Violations
REQUIREMENTS
33. A METADATA REFINERY APPROACH
WhereHows Architecture @ 10,000 ft
ML driven
refinements
34. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
METADATA
37. HARD TO CHANGE ANYTHING UNDERNEATH!
Challenge for Infrastructure Providers
(Pig scripts) -> My Raw Data
Native readers, dependencies on path, format hard-coded
Hard to move to better formats without breaking everyone or copying data twice
38. HARD TO CHANGE ANYTHING UPSTREAM!
Semantic Challenges
Data is unclean (bad data on certain dates)
Data models are in constant flux (split event into multiple)
Have to change data processing logic everywhere!
My Raw Data
39. AN API TO MANAGE EVOLUTION
We need “microservices” for Data
My Data API
My Raw Data
40. A DATA ACCESS LAYER FOR LINKEDIN
We built Dali to solve this
Logical Tables + Views
Logical FileSystem
Abstract away underlying physical details to
allow users to focus solely on the logical
concerns
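The idea of hiding physical details behind a logical name can be illustrated with a small sketch. Everything here is hypothetical (the in-memory `STORAGE`, the `CATALOG` mapping, and `read_dataset`); it is not Dali's API, only a way to show why consumers that read by logical name survive a format migration.

```python
import csv
import io
import json
from typing import Callable, Dict, List, Tuple

# Stand-in for physical storage: path -> file contents.
STORAGE: Dict[str, str] = {
    "/data/v1/page_views.csv": "memberId,page\n1,home\n2,jobs\n",
    "/data/v2/page_views.jsonl": (
        '{"memberId": 1, "page": "home"}\n{"memberId": 2, "page": "jobs"}\n'
    ),
}


def read_csv(text: str) -> List[dict]:
    rows = csv.DictReader(io.StringIO(text))
    return [{"memberId": int(r["memberId"]), "page": r["page"]} for r in rows]


def read_jsonl(text: str) -> List[dict]:
    return [json.loads(line) for line in text.splitlines() if line]


READERS: Dict[str, Callable[[str], List[dict]]] = {"csv": read_csv, "jsonl": read_jsonl}

# Logical name -> (physical path, format). Only the data owner edits this.
CATALOG: Dict[str, Tuple[str, str]] = {"page_views": ("/data/v1/page_views.csv", "csv")}


def read_dataset(name: str) -> List[dict]:
    """Consumers ask for a logical dataset; path and format are resolved here."""
    path, fmt = CATALOG[name]
    return READERS[fmt](STORAGE[path])
```

If the owner repoints `CATALOG["page_views"]` at the JSON-lines copy, callers of `read_dataset("page_views")` get the same rows without changing any code, which is the evolution problem the previous slides describe.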
41. Dali: Implementation Details in Context
Dali FileSystem
Processing Engine
(MR, Spark)
Dali Datasets (Tables+Views)
Dataflow APIs
(MR, Spark,
Scalding)
Query Layers
(Pig, Hive,
Spark)
Dali CLI
Data Catalog
Git + Artifactory
View Def +
UDFs
Dataset
Owner
Data Source
Data Sink
42. Simple to Complex
Different Types
Basic Restrictions
Access to dataset based on business need
Privacy by Default
Analysts shouldn’t get access to raw PII by default
Consent-based Access
Access to certain data elements only available if member has consented to that particular use case
Access Restrictions
REQUIREMENTS
43. STEP 1: DATA + METADATA
Solving for Compliant Access
Raw Dataset: MemberProfile
Schema = {
  int memberId
  String firstName
  String lastName
  Position[] positions
  educationHistory[] educationHistory
  …
}
Columns: MEMBER_ID, NAME, PROFILE DATA
Metadata:
NAME : is_pii
MEMBER_ID : is_pii
44. STEP 2: A MEMBER’S PREFERENCES
Privacy Preferences
45. A BITMAP DATASET: ONE PER MEMBER
Privacy Preferences
Member Privacy
Preferences
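A per-member bitmap can be sketched with one bit per use case, so that a consent check becomes a single bitwise AND. The use-case names and the `has_consented` helper below are hypothetical; the slides do not say which bits LinkedIn actually stores.

```python
from enum import IntFlag
from typing import Dict


class UseCase(IntFlag):
    # Hypothetical use cases, one bit each.
    ADS = 1
    RESEARCH = 2
    RECOMMENDATIONS = 4


# One bitmap per member: member ID -> set of consented use cases.
preferences: Dict[int, UseCase] = {
    101: UseCase.ADS | UseCase.RESEARCH,
    102: UseCase(0),  # member has opted out of everything
}


def has_consented(prefs: Dict[int, UseCase], member_id: int, use_case: UseCase) -> bool:
    """True only if the member's bitmap has the bit for this use case set."""
    bitmap = prefs.get(member_id, UseCase(0))  # unknown member: treat as no consent
    return bool(bitmap & use_case)
```

Treating a missing member as "no consent" is a deliberate default in this sketch, in the spirit of the privacy-by-default requirement above.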
46. Solving for Compliant Access With Dali
Inputs: Raw Dataset, Metadata, Member Privacy Preferences
Processing Logic -> Dali Reader Library (Use Case = X)
Dali Reader responsibility:
Given:
(Dataset, Metadata, UseCase)
Generate:
Dataset and Column-level transformations (obfuscate, null, …)
Auto-join with Member Privacy Preferences (filter out data elements that are not consented to)
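The reader's responsibility can be approximated in a few lines. This is a simplified sketch under stated assumptions, not the Dali reader: `compliant_read` and `obfuscate` are invented names, consent is modeled as a set of use-case strings per member, the policy (hash the member ID, null every other PII column) is just one possible choice, and it filters whole rows where the slide speaks of individual data elements.

```python
import hashlib
from typing import Dict, Iterable, List, Set


def obfuscate(value: object) -> str:
    """Replace a value with a stable, non-reversible token (truncated SHA-256)."""
    return hashlib.sha256(str(value).encode("utf-8")).hexdigest()[:12]


def compliant_read(rows: Iterable[dict],
                   column_tags: Dict[str, Set[str]],
                   consents: Dict[int, Set[str]],
                   use_case: str,
                   id_column: str = "memberId") -> List[dict]:
    """Apply consent filtering and column-level transformations for one use case."""
    result: List[dict] = []
    for row in rows:
        # Join with privacy preferences: skip members who have not consented.
        if use_case not in consents.get(row[id_column], set()):
            continue
        out = {}
        for column, value in row.items():
            if "is_pii" in column_tags.get(column, set()):
                # Assumed policy: keep the ID joinable by hashing it, null other PII.
                out[column] = obfuscate(value) if column == id_column else None
            else:
                out[column] = value
        result.append(out)
    return result
```

The processing logic only names the dataset and the use case; which columns get nulled or obfuscated is driven entirely by the metadata tags, so tightening a policy does not require touching every job.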
47. Solving for Compliant Purging With Dali + Gobblin
Inputs: Raw Dataset, Metadata, Member Privacy Preferences, Member’s Delete Requests
Gobblin Purger + Dali Reader Library (Use Case = Purge)
Output: Purged Dataset
48. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox
DATA LIFECYCLE MANAGEMENT
METADATA
DATA ACCESS LAYER
49. DATA DEMOCRACY <> DATA PROTECTION
More Data
Discover Data
Easy Access
Less Data
Discover Violations
Restricted Access
The Data Paradox : Solved !
METADATA
DATA ACCESS LAYER
DATA LIFECYCLE MANAGEMENT
50. DATA DEMOCRACY + DATA PROTECTION
The Technology Blueprint
METADATA: WhereHows*
DATA ACCESS LAYER: Dali
DATA LIFECYCLE MANAGEMENT: Apache Gobblin*
* Open Source : We can collaborate on these together!
51. Core company value, implemented
by Technology & Process
Privacy By Design
Privacy : Technology + Process
SUSTAINABILITY IS CRITICAL
Product : Security & Privacy Review
Data : Data Model Review
Legal : Regulation change -> Tech requirements
Company-wide : “Horizontal” Initiatives
52. Getting Stricter and more complex
Data Protection
Key Takeaways
THE BEAST IS REAL
Stricter regulations in a digital world
Increasingly more complex to implement
This is an accelerating global trend
53. We’ve established a blueprint to
sustainably address privacy
Learnings at LinkedIn
Key Takeaways
THE BEAST CAN BE TAMED !
Privacy By Design : baked into technology
stack & product development process
Standardization : To solve at scale, certain
parts need to be centralized and standardized
Company-wide : Needs co-ordinated effort
across various functions