Turning Big Data
into Knowledge
September 25, 2019
Kaan Onuk, Luyao Li, Atul Gupte
Hi, welcome!
● Engineer turned Product Manager
● Previously: building FarmVille & the mobile advertising
platform @ Zynga
● Currently: Product Manager on the Data Platform team
building data science, data knowledge, and interactive
analytics platforms
About me
Atul Gupte
Product Manager
What we’ll talk about today
Data landscape
at Uber
Our journey
since 2016
Metadata management
through Databook
Data relationships
through Lineage
We ignite opportunity by setting the world in motion
15M
Trips/Day
700+
Cities
100M
Monthly Users
Data informs every decision at the company
Daily Uber trips powered by ML: Millions
Messages processed by Kafka: 2T
Queries across Hive, Vertica and Presto: 1M
Data ingested into HDFS: 150TB
How Big is our Big Data?
Data Platform Team
Move the world with
global data,
local insights, and
intelligent decisions.
Data Infrastructure
Data Platform
Data Tools
Data Lake
Logging
Stream
Data Modelers
Data Consumers
...
Trips
Users
Data Engineers
Overview of Data at Uber
Data Scientists
Raw
~10,000
Curated
~100
Derived
>100,000
Data Lake Sources Usage
WAU 8,000+
Queries 1M/day
Pipelines Thousands
Metrics Thousands
Experiments Thousands
ML models 10s of thousands
Self Serve & Open Platform
Use Cases
Eng ETA, surge, safety
DS incentives, churn, pickup
Ops driver onboarding, eats
cash, partner data sharing
Compliance ops metrics, city
Challenges compounded by the scale of Data
Data produced by
Mobile users 100s of millions
Events
Trillions/day
What are users looking to do?
What data exists? How does it look?
Who’s using it? What happens if I
change it?
How can I adapt when this data
changes?
Discover Understand Trust
3+ hours
week
8%
Time wasted
every week
$$$M
Cost to company
Tasks requiring
human skill
Unproductive
time sinks
We power data fluency to help Uber
make confident, data-driven decisions
Any and all users
can access and use
datasets with ease
Users trust our data
because it meets
their expectations
Users access
appropriate data,
through compliant
means
Discover Understand Trust
Discover
Late 2016
● Indexed small amount of data
○ Offline analytics systems
● Datasets only - no other data entities
● Catalogued basic information about datasets
Late 2016
Novice Neville
Data Scientists
Software Engineers
ML Researchers
New to Uber
Requires help finding data
Relies on George for basic tasks
Manager Michelle
General Managers
Product Managers
City Operations
CXOs & other executives
Interface w/regulators &
customers
Meet critical deadlines
Deliver reports and insights
Genius George
Data Scientists
Software Engineers
ML Researchers
Built underlying systems
Tribal knowledge champion
De-facto knowledge bank
Late 2016
2017
HQ Non-HQ
Rideshare Eats Freight ATG Elevate
Support
NLP models for
support tickets
Safety
Trip classification
Uber Eats
Restaurant
recommendations
Operations
LTV models
2017
Data Scientists
Software Engineers
ML/AI Researchers
Advanced SQL
Advanced Statistics
Scala/Spark, Python/R
Data Modeling
Inventor Ivan
Marketing Managers
Entry-level Analysts
General Managers
Product Managers
Limited SQL
Spreadsheets
Reliant Rebecca
City Operations
Regional Managers
Advanced SQL
Spreadsheets
Dashboarding
Monitoring Matt
Operations Managers
Data Analysts
Product Analysts
Advanced SQL
Spreadsheets
Limited Statistics
Limited Python/R
Analyst Anna
2017
2018
[Figure: cumulative functionality over time, low vs. high internal quality; high internal quality delivers more rapidly and cheaply later]
● Users care about a variety of data assets
○ Datasets, dashboards, metrics, etc.
● Users want a holistic view of everything that
exists about their data
○ Ownership
○ Schemas
2018
● Data quality and health are key concerns
● Table usage information is valuable
● Operational and regulatory environment is
growing more complex
○ GDPR
○ Access control & audits
2019
2019
● Unified interface highlighting relevant metadata
○ Ownership & usage
○ Schema and stats
○ Quality & health signals
○ Lineage
2019
● Advanced metadata management
○ Automated ingestion
○ Automated classification
○ Simplified controls for data owners
60+
Types of
metadata
Curiosity Knowledge Wisdom
● Manages Data Lineage team under Data Platform
● Earlier: Senior Software Engineer II at Uber
About me
Luyao Li
Tech Lead Manager
Data Lineage
What is data lineage?
● Where the data comes from
● Where it has been
● How it is being transformed
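These three questions map naturally onto a directed graph in which nodes are datasets and edges are the jobs that transform them. The following is a minimal sketch of that idea, not a description of Uber's lineage service; the class, the table names, and the job names are all hypothetical.

```python
from collections import defaultdict, deque


class LineageGraph:
    """Toy lineage graph: edges point from a source dataset to a derived one."""

    def __init__(self):
        self._upstream = defaultdict(set)    # dataset -> datasets it reads from
        self._downstream = defaultdict(set)  # dataset -> datasets derived from it
        self._job = {}                       # (source, target) -> transforming job

    def add_edge(self, source, target, job):
        self._upstream[target].add(source)
        self._downstream[source].add(target)
        self._job[(source, target)] = job

    def _walk(self, start, neighbours):
        # Breadth-first traversal; the seen set also guards against cycles.
        seen, queue = set(), deque([start])
        while queue:
            node = queue.popleft()
            for nxt in neighbours.get(node, ()):
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        return seen

    def origins(self, dataset):
        """Where the data comes from and where it has been."""
        return self._walk(dataset, self._upstream)

    def impacted(self, dataset):
        """Everything that could change if this dataset changes."""
        return self._walk(dataset, self._downstream)

    def transformation(self, source, target):
        """How the data is transformed on one hop."""
        return self._job[(source, target)]


# Hypothetical tables and jobs, for illustration only.
graph = LineageGraph()
graph.add_edge("raw.trip_events", "curated.trips", job="clean_trips")
graph.add_edge("curated.trips", "derived.city_daily_trips", job="aggregate_by_city")
graph.add_edge("curated.trips", "derived.driver_earnings", job="compute_earnings")
```

Calling `graph.impacted("raw.trip_events")` returns every downstream table, which is the kind of lookup that would otherwise take a chain of "please ask the table owner" conversations.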
“I’m no longer
responsible for this
table, please ask team
X”
“This is an upstream
problem, we can’t fix it”
“Please ask the
table owner”
“How do I find
the pipeline
owner?”
Multiple days
Why does it matter?
Applications
● Data Freshness
● Data Chargebacks
● Anomaly Detection
● Compliance
Features
● End-to-end
● Isolated ingestion
● Flexible consumption
● Advanced filtering
● 10,000-foot view
● 1,000-foot view
Lessons learned
High quality data is essential for success
Always be customer obsessed
Magical search has a huge impact on usability
Make big, bold bets
What’s next
● Column-level lineage
● Self-diagnostic and reporting
● Self-serve onboarding
● Recommendation
/ Manages the Metadata Platform team within Big Data
/ Previously - Senior Software Engineer / Tech Lead for
Data Discovery & Data Privacy @ Uber
About me
Kaan Onuk
Engineering Manager
/ What is metadata?
/ Why does metadata matter?
What is
metadata?
Uber’s massive data holds deep
hidden insights.
Metadata helps to surface them.
Metadata drives data
productivity by making data easy
to discover, understand, and
govern.
/ Metadata Sources
/ Metadata Registry
/ Metadata Collection
/ Metadata Storage
/ Data Model
Metadata
Sources
uMetadata
Metadata Registry/Definition
Metadata Collection
Pull model Push model
○ Crawler (periodic)
e.g. sample data, stats
○ Event-based (Event Listeners)
e.g. data quality
○ Automated
e.g. data retention policies
○ Crowdsource
e.g. table descriptions
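The difference between the two collection models can be sketched in a few lines. This is an illustration only, assuming the crawler is a pull-style collector and the event listener a push-style one; the class names, the in-memory store, and the event shape are hypothetical and not taken from the talk.

```python
class Crawler:
    """Pull model: the collector asks each source for its current metadata."""

    def __init__(self, store, fetchers):
        self.store = store
        self.fetchers = fetchers  # entity name -> zero-argument function returning a dict

    def run_once(self):
        # A real crawler would be triggered on a schedule; here we run one pass.
        for entity, fetch in self.fetchers.items():
            self.store.setdefault(entity, {}).update(fetch())


class EventListener:
    """Push model: a source sends an event as soon as something changes."""

    def __init__(self, store):
        self.store = store

    def on_event(self, event):
        self.store.setdefault(event["entity"], {})[event["field"]] = event["value"]


# Hypothetical usage: stats arrive by crawling, a quality result arrives as an event.
store = {}
crawler = Crawler(store, {"trips": lambda: {"row_count": 1200, "sample": [{"trip_id": 1}]}})
crawler.run_once()

listener = EventListener(store)
listener.on_event({"entity": "trips", "field": "quality_check", "value": "passed"})
```

Pulling suits metadata that is cheap to recompute and tolerates some staleness, while pushing suits signals that consumers need promptly.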
Storage
● Hive for analytical queries
and audit purposes
● Kafka to capture
metadata changes
● MySQL for persistent
storage
● Redis for cache to support
low latency & high
throughput
● Search functionality
powers various internal
platforms, including
Databook for data
discovery
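One way these stores could fit together is a cache-aside read path with a change log on the write path. The slide does not describe the actual read and write flow, so the pattern below is an assumption; plain Python containers stand in for MySQL, Redis, and a Kafka topic, and `MetadataService` is a hypothetical name.

```python
class MetadataService:
    """Toy service: persistent store + cache + change log (all in memory)."""

    def __init__(self):
        self.persistent = {}   # stand-in for MySQL
        self.cache = {}        # stand-in for Redis
        self.change_log = []   # stand-in for a Kafka topic of metadata changes
        self.cache_hits = 0

    def put(self, entity, metadata):
        self.persistent[entity] = dict(metadata)
        # Record the change so downstream consumers (audit, search) can react.
        self.change_log.append({"entity": entity, "metadata": dict(metadata)})
        # Drop any cached copy so the next read sees the new value.
        self.cache.pop(entity, None)

    def get(self, entity):
        if entity in self.cache:
            self.cache_hits += 1
            return self.cache[entity]
        value = self.persistent.get(entity)
        if value is not None:
            self.cache[entity] = value
        return value


# Hypothetical usage.
service = MetadataService()
service.put("trips", {"owner": "team-a"})
first = service.get("trips")    # miss: read from the persistent store, then cached
second = service.get("trips")   # hit: served from the cache
service.put("trips", {"owner": "team-b"})
third = service.get("trips")    # miss again, because the write invalidated the cache
```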
Metadata Store: Data Model Requirements
1. Discovery
2. Cluster-specific & agnostic metadata
3. Easy metadata type creation
4. Flexibility on onboarding new entities
Metadata
Store
Data Model
Metadata
Management
(2019)
1. Easy onboarding
2. Derived metadata
through relationships
3. Efficient metadata
retrieval
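"Derived metadata through relationships" can be illustrated with a small entity graph in which a property set on one entity is inferred for everything that depends on it. This is a sketch under stated assumptions: the `Entity` and `MetadataGraph` names, the `contains_pii` key, and the propagation rule are all hypothetical and are not described in the slides.

```python
from dataclasses import dataclass, field


@dataclass
class Entity:
    name: str
    kind: str                                        # e.g. "table", "dashboard", "metric"
    metadata: dict = field(default_factory=dict)
    depends_on: list = field(default_factory=list)   # names of related entities


class MetadataGraph:
    def __init__(self):
        self.entities = {}

    def add(self, entity):
        # A new entity kind is just a new string value; no schema change is needed.
        self.entities[entity.name] = entity

    def derived_flag(self, name, key, _seen=None):
        """True if the entity, or anything it depends on, sets metadata[key] truthy."""
        seen = _seen if _seen is not None else set()
        if name in seen:
            return False
        seen.add(name)
        entity = self.entities[name]
        if entity.metadata.get(key):
            return True
        return any(self.derived_flag(dep, key, seen) for dep in entity.depends_on)


# Hypothetical entities, for illustration only.
g = MetadataGraph()
g.add(Entity("raw.users", "table", {"contains_pii": True}))
g.add(Entity("curated.trips", "table", depends_on=["raw.users"]))
g.add(Entity("dash.city_overview", "dashboard", depends_on=["curated.trips"]))
g.add(Entity("metric.trip_count", "metric"))
```

Here the dashboard never declares `contains_pii` itself; the flag is derived by following its relationships back to `raw.users`.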
Key Takeaways
1. Centralized Metastore: Datasets + Artifacts
2. Metadata Registry: Taxonomy / Metadata scheme
3. Metadata Collection: Choose the right approach
4. Data Model: Leverage metadata relationships
Next Steps
Metadata Management
Innovate
- Personalization to improve discovery
- Graph traversal optimizations
Automate
- Human-in-the-loop AI
Establish trust & accountability
- More integrations with data infra
- Very high QPS & low latency
Accessible but secure foundation
- Fully self-served, ontology-based metadata management
Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this
document may be reproduced or utilized in any form or by any means, electronic or mechanical,
including photocopying, recording, or by any information storage or retrieval systems, without
permission in writing from Uber. This document is intended only for the use of the individual or entity
to whom it is addressed and contains information that is privileged, confidential or otherwise exempt
from disclosure under applicable law. All recipients of this document are notified that the information
contained herein includes proprietary and confidential information of Uber, and recipient may not
make use of, disseminate, or in any way disclose this document or any of the enclosed information
to any person other than employees of addressee to the extent necessary for consultations with
authorized personnel of Uber.
Thank you!
kaan@uber.com

[Strata NYC 2019] Turning big data into knowledge: Managing metadata and data relationships at Uber's scale

MS Copilot expands with MS Graph connectors
 

[Strata NYC 2019] Turning big data into knowledge: Managing metadata and data relationships at Uber's scale

  • 1. Turning Big Data into Knowledge September 25, 2019 Kaan Onuk, Luyao Li, Atul Gupte
  • 3. About me: Atul Gupte, Product Manager ● Engineer turned Product Manager ● Previously: building FarmVille & the mobile advertising platform @ Zynga ● Currently: Product Manager on the Data Platform team building data science, data knowledge, and interactive analytics platforms
  • 4. What we’ll talk about today: Data landscape at Uber; Our journey since 2016; Metadata management through Databook; Data relationships through Lineage
  • 5. We ignite opportunity by setting the world in motion 15M Trips/Day 700+ Cities 100M Monthly Users
  • 6. Data informs every decision at the company
  • 7. How Big is our Big Data? Daily Uber trips powered by ML: Millions; Messages processed by Kafka: 2T; Queries across Hive, Vertica and Presto: 1M; Data ingested into HDFS: 150TB
  • 8. Data Platform Team Move the world with global data, local insights, and intelligent decisions.
  • 9. Data Infrastructure Data Platform DataTools Data Lake Logging Stream Data Modelers Data Consumers ... Trips Users Data Engineers Overview of Data at Uber Data Scientists
  • 10. Challenges compounded by the scale of Data. Sources: Data produced by Mobile users: 100s of millions; Events: Trillions/day. Data Lake: Raw ~10,000; Curated ~100; Derived >100,000. Usage: WAU 8,000+; Queries 1M/day; Pipelines: Thousands; Metrics: Thousands; Experiments: Thousands; ML models: 10s of thousands. Self Serve & Open Platform. Use Cases: Eng ETA, surge, safety DS incentives, churn, pickup Ops driver onboarding, eats cash, partner data sharing Compliance ops metrics, city
  • 11. What are users looking to do? What data exists? How does it look? Who’s using it? What happens if I change it? How can I adapt when this data changes?
  • 15. We power data fluency to help Uber make confident, data-driven decisions Any and all users can access and use datasets with ease Users trust our data because it meets their expectations Users access appropriate data, through compliant means
  • 20. Late 2016 ● Indexed small amount of data ○ Offline analytics systems ● Datasets only - no other data entities ● Catalogued basic information about datasets
  • 21. Late 2016. Novice Neville (Data Scientists, Software Engineers, ML Researchers): New to Uber; Requires help finding data; Relies on George for basic tasks. Manager Michelle (General Managers, Product Managers, City Operations, CXOs & other executives): Interface w/regulators & customers; Meet critical deadlines; Deliver reports and insights. Genius George (Data Scientists, Software Engineers, ML Researchers): Built underlying systems; Tribal knowledge champion; De-facto knowledge bank
  • 23. 2017. HQ, Non-HQ. Rideshare, Eats, Freight, ATG, Elevate. Support: NLP models for support tickets; Safety: Trip classification; Uber Eats: Restaurant recommendations; Operations: LTV models
  • 24. 2017. Inventor Ivan (Data Scientists, Software Engineers, ML/AI Researchers): Advanced SQL, Advanced Statistics, Scala/Spark, Python/R, Data Modeling. Reliant Rebecca (Marketing Managers, Entry-level Analysts, General Managers, Product Managers): Limited SQL, Spreadsheets. Monitoring Matt (City Operations, Regional Managers): Advanced SQL, Spreadsheets, Dashboarding. Analyst Anna (Operations Managers, Data Analysts, Product Analysts): Advanced SQL, Spreadsheets, Limited Statistics, Limited Python/R
  • 25. 2017
  • 27. 2018 ● Users care about a variety of data assets ○ Datasets, dashboards, metrics, etc. ● Users want a holistic view of everything that exists about their data ○ Ownership ○ Schemas
  • 28. 2018
  • 29. 2018 ● Data quality and health are key concerns ● Table usage information is valuable ● Operational and regulatory environment is growing more complex ○ GDPR ○ Access control & audits
  • 30. 2019
  • 31. 2019 ● Unified interface highlighting relevant metadata ○ Ownership & usage ○ Schema and stats ○ Quality & health signals ○ Lineage
  • 32. 2019 ● Advanced metadata management ○ Automated ingestion ○ Automated classification ○ Simplified controls for data owners 60+ Types of metadata
  • 34. About me: Luyao Li, Tech Lead Manager ● Manages Data Lineage team under Data Platform ● Earlier: Senior Software Engineer II at Uber
  • 36. What is data lineage? ● Where is the data from ● Where it’s been ● How it’s being transformed
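The three questions on slide 36 can be made concrete with a small directed graph. The following is only an illustrative sketch: the `LineageGraph` class, the dataset names, and the transform labels are hypothetical and are not taken from the talk or from Uber's lineage service. Each edge records where a dataset comes from and how it was transformed; walking the edges backwards shows where the data has been.

```python
from collections import deque


class LineageGraph:
    """Toy lineage store: edges point from a source dataset to a derived one."""

    def __init__(self):
        self._parents = {}      # derived dataset -> set of direct sources
        self._transforms = {}   # (source, derived) -> transform label

    def add_edge(self, source, target, transform):
        self._parents.setdefault(target, set()).add(source)
        self._transforms[(source, target)] = transform

    def transform(self, source, target):
        """How the data is transformed between two directly linked datasets."""
        return self._transforms[(source, target)]

    def upstream(self, dataset):
        """Every dataset `dataset` derives from, nearest first (breadth-first)."""
        seen, order, queue = {dataset}, [], deque([dataset])
        while queue:
            current = queue.popleft()
            for parent in sorted(self._parents.get(current, ())):
                if parent not in seen:
                    seen.add(parent)
                    order.append(parent)
                    queue.append(parent)
        return order


# Hypothetical datasets and transforms, for illustration only.
graph = LineageGraph()
graph.add_edge("raw.trips", "curated.trips", "dedupe_and_clean")
graph.add_edge("curated.trips", "derived.city_daily_trips", "aggregate_by_city")
graph.add_edge("raw.cities", "derived.city_daily_trips", "join_city_names")

ancestors = graph.upstream("derived.city_daily_trips")
```

With a structure like this, "who do I ask about this table?" becomes a graph lookup rather than a chain of messages between teams.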
  • 37. Why does it matter? “I’m no longer responsible for this table, please ask team X” “This is an upstream problem, we can’t fix it” “Please ask the table owner” “How do I find the pipeline owner?” Multiple days
  • 38. Applications Data Freshness Data Chargebacks Anomaly Detection Compliance
  • 42. Lessons learned High quality data is essential for success Always be customer obsessed Magical search has a huge impact on usability Make big, bold bets
  • 43. What’s next ● Column-level lineage ● Self-diagnostic and reporting ● Self-serve onboarding ● Recommendation
  • 44. About me: Kaan Onuk, Engineering Manager / Manages the Metadata Platform team within Big Data / Previously - Senior Software Engineer / Tech Lead for Data Discovery & Data Privacy @ Uber
  • 45. / What is metadata? / Why does metadata matter?
  • 47. Uber’s massive data holds deep hidden insights. Metadata helps to surface them.
  • 48. Metadata drives data productivity by making data easy to discover, understand, and govern.
  • 49. / Metadata Sources / Metadata Registry / Metadata Collection / Metadata Storage / Data Model
  • 52. Metadata Collection Pull model Push model ○ Crawler (periodic) e.g. sample data, stats ○ Event-based (Event Listeners) e.g. data quality ○ Automated e.g. data retention policies ○ Crowdsource e.g. table descriptions
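The difference between the two collection models on slide 52 can be sketched in a few lines. This is a minimal sketch under stated assumptions: the flattened slide text does not make clear which examples belong to which model, so the code simply treats a periodic crawler as pull and a source-initiated event as push. `MetadataStore`, `crawl`, `EventListener`, and the sample values are all hypothetical names and data.

```python
class MetadataStore:
    """In-memory stand-in for a metadata store (hypothetical)."""

    def __init__(self):
        self.records = {}

    def upsert(self, dataset, key, value, source):
        self.records.setdefault(dataset, {})[key] = {"value": value, "source": source}


def crawl(store, datasets, fetch_stats):
    """Pull: the collector visits each dataset and asks for its metadata.

    In practice this would run on a schedule; here it runs once.
    """
    for name in datasets:
        for key, value in fetch_stats(name).items():
            store.upsert(name, key, value, source="crawler")


class EventListener:
    """Push: the producing system reports a change as soon as it happens."""

    def __init__(self, store):
        self.store = store

    def on_event(self, event):
        self.store.upsert(event["dataset"], event["key"], event["value"], source="event")


store = MetadataStore()

# Pull: crawl one dataset with a fake stats fetcher.
crawl(store, ["trips"], lambda name: {"row_count": 100})

# Push: a data quality check reports its result.
listener = EventListener(store)
listener.on_event({"dataset": "trips", "key": "quality_check", "value": "passed"})
```

The trade-off the sketch makes visible: a crawler is only as fresh as its schedule but needs no cooperation from the source, while events are timely but require the source to emit them.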
  • 53. Storage ● Hive for analytical queries and audit purposes ● Kafka to capture metadata changes ● MySQL for persistent storage ● Redis for cache to support low latency & high throughput ● Search functionality powers various internal platforms, including Databook for data discovery
  • 54. Metadata Store: Data Model Requirements 1. Discovery 2. Cluster-specific & agnostic metadata 3. Easy metadata type creation 4. Flexibility on onboarding new entities
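One common way to pair a persistent store with a cache, as slide 53 does with MySQL and Redis, is the cache-aside pattern. The slides do not say which caching strategy is used, so the following is an assumption-laden sketch: plain dictionaries stand in for both systems, and `CachedMetadataReader` is a hypothetical name.

```python
class CachedMetadataReader:
    """Cache-aside reads: check the cache first, fall back to the database."""

    def __init__(self, database):
        self.database = database  # dict standing in for the persistent store
        self.cache = {}           # dict standing in for the cache
        self.db_reads = 0         # counts how often the slow path is taken

    def get(self, dataset):
        if dataset in self.cache:
            return self.cache[dataset]       # fast path: cache hit
        self.db_reads += 1
        record = self.database.get(dataset)  # slow path: cache miss
        if record is not None:
            self.cache[dataset] = record     # populate the cache for next time
        return record

    def update(self, dataset, record):
        self.database[dataset] = record      # write to the source of truth
        self.cache.pop(dataset, None)        # invalidate any stale cached copy


# Hypothetical record, for illustration only.
reader = CachedMetadataReader({"trips": {"owner": "team-a"}})
first = reader.get("trips")    # miss: read from the database, then cached
second = reader.get("trips")   # hit: served from the cache
reads_before_update = reader.db_reads

reader.update("trips", {"owner": "team-b"})
third = reader.get("trips")    # miss again, because the update invalidated it
```

Invalidating on write rather than updating the cache in place keeps the database as the single source of truth, at the cost of one extra miss after each change.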
  • 56. Metadata Management (2019) 1. Easy onboarding 2. Derived metadata through relationships 3. Efficient metadata retrieval
  • 57. Key Takeaways 1. Centralized Metastore: Datasets + Artifacts 2. Metadata Registry: Taxonomy / Metadata scheme 3. Metadata Collection: Choose the right approach 4. Data Model: Leverage metadata relationships
  • 58. Next Steps Metadata Management Innovate - Personalization to improve discovery - Graph traversal optimizations Automate - Human-in-the-loop AI Establish trust & accountability - More integrations with data infra - Very high qps & low latency Accessible but secure foundation - Fully self-served, ontology-based metadata management
  • 60. Proprietary and confidential © 2018 Uber Technologies, Inc. All rights reserved. No part of this document may be reproduced or utilized in any form or by any means, electronic or mechanical, including photocopying, recording, or by any information storage or retrieval systems, without permission in writing from Uber. This document is intended only for the use of the individual or entity to whom it is addressed and contains information that is privileged, confidential or otherwise exempt from disclosure under applicable law. All recipients of this document are notified that the information contained herein includes proprietary and confidential information of Uber, and recipient may not make use of, disseminate, or in any way disclose this document or any of the enclosed information to any person other than employees of addressee to the extent necessary for consultations with authorized personnel of Uber. Thank you! kaan@uber.com