Join Anselmo for an engaging overview of the new end-to-end data architecture at Expedia Group, taking a journey through cloud and on-prem data lakes, real-time and batch processes and streamlined access for data producers and consumers. Find out how the new architecture unifies a complex mix of data sources and feeds the data science development cycle. Expedia might appear to be a market-leading travel company – in reality, it’s a highly successful technology and data science company.
17. Data Lake
Hive Metastore
On-Premises
CLOUD MIGRATION
Follow the data producers' path
Improve security, scalability and resilience
Promote technology innovation
Separate computing from storage
Hive Metastore
Solid Foundation
18. Data Lake
Hive Metastore
On-Premises
Hive Metastore
DATA REPLICATION { CIRCUS-TRAIN }
Replicates Hive tables between clusters on request. It replicates both the table's data and metadata.
It has a light touch, requiring no direct integration with Hive's core services.
It can copy either entire unpartitioned tables or user-defined sets of partitions on partitioned tables.
It is not event driven and does not know how tables differ between sites.
SOLID FOUNDATION
https://github.com/hotelsdotcom/circus-train
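The replication behaviour described on this slide can be illustrated with a small in-memory sketch. This is not Circus Train's API: `Table` and `replicate` are hypothetical names, and clusters are modelled as plain dictionaries. It only shows the idea of copying metadata plus either all partitions or a user-chosen set, on request and without computing a diff between sites.

```python
from dataclasses import dataclass, field
from typing import Dict, Iterable, Optional


@dataclass
class Table:
    """Hypothetical stand-in for a Hive table: metadata plus files per partition."""
    schema: Dict[str, str]
    # Partition key -> list of data files; an unpartitioned table would use "".
    partitions: Dict[str, list] = field(default_factory=dict)


def replicate(source: Dict[str, Table], replica: Dict[str, Table],
              name: str, partitions: Optional[Iterable[str]] = None) -> None:
    """Copy one table on request: metadata always, data for the chosen partitions.

    With partitions=None every partition is copied. No diff is computed:
    the selected partitions are simply overwritten on the replica.
    """
    src = source[name]
    wanted = list(src.partitions) if partitions is None else list(partitions)
    target = replica.setdefault(name, Table(schema={}))
    target.schema = dict(src.schema)                          # replicate metadata
    for key in wanted:
        target.partitions[key] = list(src.partitions[key])    # replicate data


# Illustrative data only.
on_prem = {"bookings": Table({"id": "bigint", "day": "string"},
                             {"day=2018-03-01": ["f1"], "day=2018-03-02": ["f2"]})}
cloud: Dict[str, Table] = {}
replicate(on_prem, cloud, "bookings", partitions=["day=2018-03-02"])
print(sorted(cloud["bookings"].partitions))
```

Because nothing is event driven, a caller has to decide when to request a replication and which partitions to include.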
20. Data Lake
Hive Metastore
Data Lake #2
Hive Metastore
SOLID FOUNDATION
Data Lake #3
Hive Metastore
DATA FEDERATION { WAGGLE-DANCE }
Waggle Dance is a request-routing Hive metastore proxy that allows tables to be concurrently accessed across multiple Hive deployments.
It was created to tackle the appearance of dataset silos that arose as our large organization gradually migrated from monolithic on-premises clusters to cloud-based platforms.
https://github.com/hotelsdotcom/waggle-dance
21. Data Lake
Hive Metastore
SOLID FOUNDATION
Data Lake #3
Hive Metastore
waggle_dance_federation.yml
primary-meta-store:
  access-control-type: READ_AND_WRITE_ON_DATABASE_WHITELIST
  name: primary
  remote-meta-store-uris: ${ON_PREM_HIVE_METASTORE_URI}
  writable-database-white-list:
  - foo_user_.*
federated-meta-stores:
- name: zed-bar-prod
  access-control-type: READ_ONLY
  remote-meta-store-uris: ${USW2_6623552_PROD_HIVE_METASTORE_URI}
  mapped-databases:
  - foo_transaction
  - bar_stream
  - zed_common
  - opp_charles
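To make the effect of this configuration concrete, here is a minimal Python sketch of the routing decision it implies. It is a simplification under stated assumptions, not Waggle Dance's implementation: `Metastore`, `route` and `can_write` are hypothetical names, and the real proxy's resolution rules are not modelled.

```python
import re
from dataclasses import dataclass, field
from typing import List


@dataclass
class Metastore:
    """Hypothetical model of one entry in the federation config."""
    name: str
    access: str          # "READ_ONLY" or "READ_AND_WRITE_ON_DATABASE_WHITELIST"
    mapped_databases: List[str] = field(default_factory=list)
    writable_whitelist: List[str] = field(default_factory=list)


# Values copied from the YAML on the slide.
PRIMARY = Metastore("primary", "READ_AND_WRITE_ON_DATABASE_WHITELIST",
                    writable_whitelist=[r"foo_user_.*"])
FEDERATED = [Metastore("zed-bar-prod", "READ_ONLY",
                       mapped_databases=["foo_transaction", "bar_stream",
                                         "zed_common", "opp_charles"])]


def route(database: str) -> Metastore:
    """Pick the metastore that should answer a request for `database`."""
    for store in FEDERATED:
        if database in store.mapped_databases:
            return store
    return PRIMARY       # assumption: unmapped databases fall through to the primary


def can_write(database: str) -> bool:
    """Allow writes only where the owning metastore's access type permits them."""
    store = route(database)
    if store.access == "READ_ONLY":
        return False
    return any(re.fullmatch(p, database) for p in store.writable_whitelist)


print(route("bar_stream").name, can_write("foo_user_alice"))
```

In this reading, consumers see one logical metastore while federated cloud databases stay read-only and only the whitelisted user databases on the primary accept writes.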
22. Data Lake
Hive Metastore
SOLID FOUNDATION
On-Premises
Hive Metastore
DATA QUALITY FRAMEWORK
Manage core data-assets like any other product, promoting instrumentation, observability and alerting.
“First to Know” culture and process, measuring how accessible, fresh, complete, accurate, enriched and integrated data-assets are.
#BKG-MART #USR-TABLE #CLK-STREAM
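Two of the measures named above, freshness and completeness, lend themselves to a short sketch. The function names, thresholds and sample rows below are assumptions made for illustration; the slides do not describe how the framework computes or alerts on these signals.

```python
from datetime import datetime, timedelta
from typing import Dict, List, Optional


def quality_report(rows: List[Dict[str, Optional[str]]], required: List[str],
                   last_updated: datetime, now: datetime,
                   max_age: timedelta) -> Dict[str, object]:
    """Compute two illustrative signals for one data asset."""
    # A row counts as complete when every required column is present and not None.
    complete = sum(all(r.get(c) is not None for c in required) for r in rows)
    completeness = complete / len(rows) if rows else 0.0
    fresh = (now - last_updated) <= max_age
    return {"fresh": fresh, "completeness": completeness}


def alerts(report: Dict[str, object], min_completeness: float = 0.99) -> List[str]:
    """Turn a report into alert labels so the owning team is first to know."""
    out = []
    if not report["fresh"]:
        out.append("stale")
    if report["completeness"] < min_completeness:
        out.append("incomplete")
    return out


# Illustrative data only.
rows = [{"id": "1", "partner": "a"}, {"id": "2", "partner": None}]
report = quality_report(rows, ["id", "partner"],
                        last_updated=datetime(2018, 5, 1, 0, 0),
                        now=datetime(2018, 5, 1, 6, 0),
                        max_age=timedelta(hours=1))
print(report, alerts(report))
```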
23. Easy to produce data
Online
Offline
Development
Data Lake
Hive Metastore
On-Premises
Hive Metastore
NRT Streaming Service
Data
Producers
24. [Architecture diagram, as on slide 23]
NRT SERVICE
One way to produce data
Scalability - perf/efficiency (Kafka)
Simplified schema management
Support in all environments
Strive for a fully hands-off service
EASY TO PRODUCE DATA
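One way to picture simplified schema management is an automated compatibility check run before a producer publishes a new schema version. The rules below (no removed fields, no type changes, new fields must be optional) are an assumption chosen for illustration; the slides do not state which rules the NRT service enforces.

```python
from typing import Dict, List, Tuple

# Hypothetical schema shape: field name -> (type name, required?)
Schema = Dict[str, Tuple[str, bool]]


def compatibility_problems(old: Schema, new: Schema) -> List[str]:
    """List the changes that would break consumers of the old schema."""
    problems = []
    for name, (old_type, _) in old.items():
        if name not in new:
            problems.append(f"removed field: {name}")
        elif new[name][0] != old_type:
            problems.append(f"type change on {name}: {old_type} -> {new[name][0]}")
    for name, (_, required) in new.items():
        if name not in old and required:
            problems.append(f"new required field: {name}")
    return problems


# Illustrative schemas only.
v1: Schema = {"booking_id": ("string", True), "amount": ("double", True)}
v2: Schema = {**v1, "currency": ("string", False)}            # adds an optional field
v3: Schema = {"booking_id": ("long", True), "channel": ("string", True)}

print(compatibility_problems(v1, v2))
print(compatibility_problems(v1, v3))
```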
25. [Architecture diagram, as on slide 23]
PRODUCER CONTRACT
Own the data schema
Own produced data (e2e)
Stream events in real time
Obfuscate sensitive information
Document and update data assets
Monitor data in production
EASY TO PRODUCE DATA
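The obfuscation item in the producer contract can be sketched as a step applied before an event leaves the producer. Everything here is an assumption for illustration: the field names, the salted SHA-256 approach, and the `send` callable standing in for whatever client hands events to the streaming service.

```python
import hashlib
import json
from typing import Callable, Dict, Iterable


def obfuscate(event: Dict[str, str], sensitive: Iterable[str], salt: str) -> Dict[str, str]:
    """Replace sensitive values with a salted SHA-256 digest; keep other fields."""
    out = dict(event)
    for name in sensitive:
        if name in out:
            out[name] = hashlib.sha256((salt + out[name]).encode("utf-8")).hexdigest()
    return out


def produce(event: Dict[str, str], sensitive: Iterable[str], salt: str,
            send: Callable[[bytes], None]) -> None:
    """Obfuscate, serialise to JSON and hand the event to a transport."""
    payload = json.dumps(obfuscate(event, sensitive, salt), sort_keys=True)
    send(payload.encode("utf-8"))


# `sent.append` stands in for a real transport in this sketch.
sent = []
produce({"user_email": "a@example.com", "partner": "foo"},
        sensitive=["user_email"], salt="demo-salt", send=sent.append)
print(sent[0])
```

Salted hashing keeps values joinable across events while hiding the raw value; whether that is strong enough depends on the data and the governance rules in force.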
26. Easy to consume data
Online
Offline
Development
Data Lake
Hive Metastore
On-Premises
Hive Metastore
Orchestrator
NRT Streaming Service
Data
Producers
Data Exploration + Pipelines Setup
DS Development
Execute > SQL
27. [Architecture diagram, as on slide 26]
CONSUMER CONTRACT
Consume documented data-assets
Use approved access layers/libs
Report back any data quality issues
Annotate outputs with data-sources
Follow data governance guidelines
Adopt schema changes
*Do not duplicate data-assets*
EASY TO CONSUME DATA
28. [Architecture diagram, as on slide 26]
QUERY ENGINES + TOOLS
Hive, Presto, Spark
EMR (data processing)
Databricks (data science)
Qubole (query, insights)
Athena (operational support)
EASY TO CONSUME DATA
29. [Architecture diagram, as on slide 26]
ANALYTICS API (METRICS/DIMS STORE)
Programmatic access to analytical data with granular ACLs on data-sets, columns and rows.
Metadata, search, breakdown, filter, time series, comparison and forecast on key data-sets (sub-second response time).
EASY TO CONSUME DATA
ANALYTICS API
curl "analytics.eps/bookings?dateField=created_day&date_range=2018-03-01,2018-05-01|2018-01-01,2018-03-01&groupby=partner[top=10,by=foo]&fields=foo,zed,bar&interval=hour"
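The same query can be assembled programmatically, which is closer to how a client library would call the API. This sketch only builds the URL using the parameter names shown on the slide; the `http` scheme is an assumption, and reading the two `|`-separated date ranges as the periods being compared is a guess based on the "comparison" feature listed above.

```python
from urllib.parse import parse_qs, urlencode

# Host as shown on the slide; the scheme is assumed.
BASE_URL = "http://analytics.eps/bookings"

params = {
    "dateField": "created_day",
    # Two ranges separated by "|", presumably the periods to compare.
    "date_range": "2018-03-01,2018-05-01|2018-01-01,2018-03-01",
    "groupby": "partner[top=10,by=foo]",
    "fields": "foo,zed,bar",
    "interval": "hour",
}

# Keep the API's own delimiters readable instead of percent-encoding them.
query = urlencode(params, safe=",|[]=")
url = f"{BASE_URL}?{query}"
print(url)
```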
30. Data Science pushes the envelope
[Architecture diagram, as on slide 26, extended with the components below]
DS Development
Execute > SQL
Batch Model Execution
Prediction + Backtesting
Training Set
Validation Set
Algorithm Training
Model Config
ML Model Store
Data Exploration + Pipelines Setup
31. [Architecture diagram, as on slide 30]
DS DEVELOPMENT CYCLE
Model tuning
Algorithm training
ML model storage
DATA SCIENCE PUSHES THE ENVELOPE
32. [Architecture diagram, as on slide 30]
DATA SCIENCE PUSHES THE ENVELOPE
FEATURES PIPELINE
Training sets
Validation sets
Parameters
Configuration
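A features pipeline has to produce training and validation sets that stay stable between runs. One common way to do that, shown here as an assumption rather than as the approach used at Expedia Group, is to assign each row to a set by hashing a stable key. The function name, key and 20% share are all illustrative.

```python
import hashlib
from typing import Dict, List, Tuple


def split(rows: List[Dict], key: str,
          validation_pct: int = 20) -> Tuple[List[Dict], List[Dict]]:
    """Assign each row to training or validation by hashing a stable key.

    The same key always lands in the same set, so reruns of the pipeline
    reproduce the same split without storing any state.
    """
    training, validation = [], []
    for row in rows:
        digest = hashlib.md5(str(row[key]).encode("utf-8")).hexdigest()
        bucket = int(digest, 16) % 100
        (validation if bucket < validation_pct else training).append(row)
    return training, validation


# Illustrative data only.
rows = [{"booking_id": i} for i in range(1000)]
training_set, validation_set = split(rows, "booking_id")
print(len(training_set), len(validation_set))
```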
33. [Architecture diagram, as on slide 30]
DATA SCIENCE PUSHES THE ENVELOPE
BATCH EXECUTION
Prediction backtesting
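Backtesting means replaying a model over historical data and comparing each prediction with what actually happened. The walk-forward sketch below uses a moving average as a stand-in model and mean absolute error as the score; both choices, and all names, are assumptions for illustration.

```python
from typing import Callable, List, Sequence, Tuple


def backtest(model: Callable[[Sequence[float]], float],
             history: Sequence[float], window: int) -> Tuple[List[float], float]:
    """Walk forward through history, predicting each point from the window before it.

    Returns the predictions and their mean absolute error against the actuals.
    """
    predictions = [model(history[i - window:i]) for i in range(window, len(history))]
    actuals = history[window:]
    mae = sum(abs(p - a) for p, a in zip(predictions, actuals)) / len(predictions)
    return predictions, mae


def moving_average(values: Sequence[float]) -> float:
    """Stand-in model: predict the mean of the recent window."""
    return sum(values) / len(values)


# Illustrative data only.
daily_bookings = [10.0, 12.0, 14.0, 16.0, 18.0]
preds, mae = backtest(moving_average, daily_bookings, window=2)
print(preds, mae)  # [11.0, 13.0, 15.0] 3.0
```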
34. [Architecture diagram, as on slide 30, plus Performance Set]
DATA SCIENCE PUSHES THE ENVELOPE
MODEL PERFORMANCE
Performance evaluation
Observability
Model Performance / Monitoring (chart; values shown: 50k, 23k)
35. [Architecture diagram, as on slide 34, plus EPS API, book, Partner(s) Service, Features Store, ML Service and CI/CD]
DATA SCIENCE PUSHES THE ENVELOPE
ONLINE SERVICE
CI/CD
Online features store
Model serialisation
Model serving
{ Custom, MLeap, TensorFlow, PMML }
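The path from a trained model to an online prediction can be sketched in a few lines. JSON is used here purely as a stand-in for the serialisation formats listed above, the linear model is a toy, and `ModelService` with its dictionary-backed features store is a hypothetical construct, not a description of the production service.

```python
import json
from typing import Dict


def serialise(weights: Dict[str, float], bias: float) -> str:
    """Offline side: turn a trained (toy, linear) model into a portable payload."""
    return json.dumps({"weights": weights, "bias": bias})


class ModelService:
    """Online side: load the payload once, then score requests by entity id."""

    def __init__(self, payload: str, feature_store: Dict[str, Dict[str, float]]):
        model = json.loads(payload)
        self.weights = model["weights"]
        self.bias = model["bias"]
        self.feature_store = feature_store   # stand-in for an online features store

    def predict(self, entity_id: str) -> float:
        features = self.feature_store[entity_id]
        return self.bias + sum(w * features.get(name, 0.0)
                               for name, w in self.weights.items())


# Illustrative values only.
payload = serialise({"nights": 2.0, "guests": 0.5}, bias=1.0)
service = ModelService(payload, {"partner-1": {"nights": 3.0, "guests": 2.0}})
print(service.predict("partner-1"))  # 8.0
```

Keeping the payload format independent of the training code is what lets CI/CD promote a model to the serving layer without redeploying the service itself.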
36. [Architecture diagram, as on slide 35]
37. [Architecture diagram, as on slide 35]
“It Takes a Village … ”
38. IT TAKES A VILLAGE ...
CROSS FUNCTIONAL TEAMS
$
PROMOTE BEST ENGINEERING PRACTICES
CRITICAL EXECUTION PATH
MEASURE OPERATIONAL COSTS
SOLID PLATFORM TO BUILD ON TOP