1. How to build a data warehouse?
Dmytro Popovych, SE @ Tubular
2. Theory vs practice
Quote #441422
While engineers in white lab coats are bolting a beautiful engine onto a perfect wing, a crew of dishevelled misfits led by a mad adventurer flies over them on a contraption made of a minibus, a fence and two industrial heat guns, on their way to a second round of investment.
Beautiful projects don't take off, because they run out of time before they can.
4. About us
• Video intelligence for the cross-platform world
• 30+ video platforms including YouTube, Facebook, Instagram
• 7M creators
• 3B videos
• 2 TB of newly ingested data per day
• 150 TB of data in the warehouse
5. What is a data warehouse?
A central repository of data collected from disparate sources.
[Diagram: analysts, engineers, and services all connect to the data warehouse]
6. Key features
• Ingestion: store raw data extracted from disparate data sources
• Normalisation: clean up / combine the raw data
• Access: help users retrieve data
7. What problems does it solve at Tubular?
• For engineers / analysts:
• data discovery
• prototyping / analysis
• For services:
• data exchange
9. Data Ingestion Problems
• Real-time data:
• tweets, comments, shares, views
• Periodic snapshots:
• dumps of the real-time data
• results of the data analysis
• databases from internal services (in some cases)
10. Real-time data
DATABUS: an event log / message queue
• Powered by Kafka
• Data serialised with Avro
• Keeps all events for the last N days
[Diagram: services #1, #2, … publish events onto the databus; permanent storage and other consumers read from it]
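A minimal sketch (not from the deck) of how a service might publish an event onto the bus with kafka-python; the topic, hosts and field names are hypothetical, and the real pipeline serialises payloads with Avro (shown later), not JSON.

```python
import json
from kafka import KafkaProducer

# Hypothetical databus endpoint and topic name, JSON used only to keep the sketch short.
producer = KafkaProducer(
    bootstrap_servers=["kafka-1:9092", "kafka-2:9092"],
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Fire-and-forget publish; Kafka retains the record for the configured retention window.
producer.send("video-views", {"video_id": "abc123", "views": 1042, "ts": "2016-10-01T12:00:00Z"})
producer.flush()
```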
11. Why did we choose Kafka?
• Stores streams of records in a fault-tolerant way
• Designed to serve multiple consumers per topic
• Can retain the last N days of records
• Proven at very large companies: LinkedIn, Twitter, Uber, Airbnb, …
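To make the "multiple consumers per topic" point concrete, a sketch of an independent consumer group reading the same stream with kafka-python; topic and group names are again hypothetical.

```python
from kafka import KafkaConsumer

# Each consumer group tracks its own offsets, so the permanent-storage writer and an
# analytics job can both read the full "video-views" stream independently.
consumer = KafkaConsumer(
    "video-views",
    bootstrap_servers=["kafka-1:9092"],
    group_id="permanent-storage-writer",
    auto_offset_reset="earliest",   # start from the oldest retained record
)

for record in consumer:
    print(record.topic, record.offset, record.value)   # replace with the real sink logic
```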
12. Why did we choose Avro?
• Strict schema definition
• Safe schema evolution
• Compact (binary serialisation format)
• Cross-technology format (Java, Python, …)
• Has an ecosystem around it (Schema Registry, CLI consumers, …)
• Hadoop-friendly
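A sketch of what a strict Avro schema and a binary round-trip could look like with the fastavro library; the schema itself is invented for illustration.

```python
import io
from fastavro import parse_schema, schemaless_writer, schemaless_reader

# Invented example schema: producers and consumers agree on it up front, and new
# fields with defaults can be added later without breaking existing readers.
schema = parse_schema({
    "type": "record",
    "name": "VideoView",
    "namespace": "tubular.events",
    "fields": [
        {"name": "video_id", "type": "string"},
        {"name": "views", "type": "long"},
        {"name": "platform", "type": "string", "default": "youtube"},
    ],
})

buf = io.BytesIO()
schemaless_writer(buf, schema, {"video_id": "abc123", "views": 1042, "platform": "youtube"})
buf.seek(0)
print(schemaless_reader(buf, schema))   # the compact binary payload decodes back to a dict
```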
13. Periodic snapshots
DATA IMPORT TOOL
• Powered by Spark
• Reads from the databus and from internal services (Service #1 powered by Elastic, Service #2 powered by Cassandra, Service #3 powered by MySQL, …)
• Writes to permanent storage, powered by S3, with data serialised as Parquet
[Diagram: services and the databus feed the data import tool, which lands snapshots in permanent storage]
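A hedged sketch of one import step in PySpark: pull a table out of one of the internal MySQL services over JDBC and land it on S3 as Parquet. Hosts, credentials, table and path names are placeholders, and the MySQL JDBC driver is assumed to be on the Spark classpath.

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("data-import-tool").getOrCreate()

# Placeholder JDBC snapshot of an internal MySQL service database.
creators = (spark.read.format("jdbc")
            .option("url", "jdbc:mysql://mysql-host:3306/creators_db")
            .option("dbtable", "creators")
            .option("user", "reader")
            .option("password", "***")
            .load())

# Land the raw snapshot on S3 as Parquet (bucket and layout are made up).
creators.write.mode("overwrite").parquet("s3a://warehouse/raw/creators/dt=2016-10-01/")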
14. Why did we choose S3?
• No infrastructure to operate or maintain ourselves
• Compatible with Hadoop ecosystem
• Relatively stable & cheap
15. Why did we choose Parquet?
• Column-oriented format (perfect for analytics and partial reads)
• Supports complex data structures
• Compatible with Hadoop ecosystem
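A small illustration, outside the deck, of why column orientation matters: with pyarrow you can read just the columns you need instead of the whole file (the path and column names are assumptions).

```python
import pyarrow.parquet as pq

# Only the requested columns are fetched from the Parquet row groups, so an analyst
# scanning two fields out of dozens touches a fraction of the bytes on disk.
table = pq.read_table("creators.parquet", columns=["creator_id", "follower_count"])
print(table.num_rows, table.schema)
```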
16. Why did we choose Spark?
• Scalable data processing engine
• Faster than Hadoop MapReduce
• Has connectors for all the popular storages: JDBC, Elastic, Cassandra, Kafka
• Has Python bindings
• Built-in support of Parquet
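To make the connector point concrete, a sketch of reading an Elasticsearch index into a DataFrame via the elasticsearch-hadoop connector; the hosts and index name are assumptions, and the connector JAR must be on the Spark classpath (e.g. via spark-submit --packages org.elasticsearch:elasticsearch-spark-20_2.11:<version>).

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("es-import").getOrCreate()

# Hypothetical index read through the es-hadoop Spark SQL data source.
videos = (spark.read.format("org.elasticsearch.spark.sql")
          .option("es.nodes", "elastic-host:9200")
          .load("videos/video"))

videos.write.parquet("s3a://warehouse/raw/videos/")
```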
18. Data Normalisation Problems
• Clean up duplicates
• Partition by year / month / day / hour
• Join various data sources
19. Normalisation of real-time data (example)
[Diagram: the databus feeds Service #1, powered by Elastic, which serves the UI and the permanent storage]
The service joins multiple data streams by sending partial updates to Elastic.
Note! This isn't the only way to implement a real-time join; a more generic solution could be implemented with Apache Samza.
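A sketch of what a partial update on the Elastic side could look like with the official Python client (7.x-style call); the index, document id and fields are hypothetical.

```python
from elasticsearch import Elasticsearch

es = Elasticsearch(["http://elastic-host:9200"])

# The views stream only touches its own counters; another consumer can merge comment
# counts into the same document id, which is what implements the join in practice.
es.update(
    index="videos",
    id="abc123",
    body={"doc": {"views": 1042, "last_seen": "2016-10-01T12:00:00Z"}},
)
```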
20. Why did we choose Elastic?
• Provides real time search and analytics
• Has relatively cheap partial updates
• Easy to scale
21. Normalisation of previously imported data
DATA NORMALISATION TOOL
• Powered by Spark
• Joins various datasets
• Removes duplicates
• Creates partitions by time-range buckets
• Reads from and writes back to permanent storage
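A minimal PySpark sketch of the steps named above: deduplicate, derive the time-bucket columns, and write partitioned Parquet back to permanent storage. Column and path names are placeholders.

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, year, month, dayofmonth

spark = SparkSession.builder.appName("data-normalisation-tool").getOrCreate()

raw = spark.read.parquet("s3a://warehouse/raw/video_views/")

clean = (raw
         .dropDuplicates(["video_id", "event_ts"])      # remove duplicates
         .withColumn("year", year(col("event_ts")))     # derive partition columns
         .withColumn("month", month(col("event_ts")))
         .withColumn("day", dayofmonth(col("event_ts"))))

# The partitioned layout lets later queries prune whole time-range buckets.
clean.write.mode("overwrite").partitionBy("year", "month", "day") \
     .parquet("s3a://warehouse/clean/video_views/")
```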
22. Why did we choose Spark?
• Scalable data processing engine
• Has a built-in SQL API to transform data (perfect for joins and deduplication)
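The same kind of join plus deduplication expressed through the SQL API, continuing the Spark session from the previous sketch; the table and column names are hypothetical.

```python
# Assumes the hypothetical datasets were registered as temp views beforehand, e.g.
# clean.createOrReplaceTempView("video_views"); creators.createOrReplaceTempView("creators")
latest_views = spark.sql("""
    SELECT v.video_id, v.views, c.creator_name
    FROM (
        SELECT *,
               ROW_NUMBER() OVER (PARTITION BY video_id ORDER BY event_ts DESC) AS rn
        FROM video_views
    ) v
    JOIN creators c ON c.creator_id = v.creator_id
    WHERE v.rn = 1   -- keep only the newest record per video: deduplication
""")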
26. Why did we choose Hive Metastore?
• Supported throughout the Hadoop ecosystem
• Simple (a Thrift API on top of MySQL tables)
• Supported by Hue (UI to access metadata)
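A sketch of how the cleaned dataset could be registered in the Hive Metastore so analysts can discover it from Hue or Spark SQL; the database, table and location names are invented.

```python
from pyspark.sql import SparkSession

# enableHiveSupport() points Spark SQL at the shared Hive Metastore.
spark = (SparkSession.builder
         .appName("register-table")
         .enableHiveSupport()
         .getOrCreate())

spark.sql("CREATE DATABASE IF NOT EXISTS warehouse")

spark.sql("""
    CREATE EXTERNAL TABLE IF NOT EXISTS warehouse.video_views (
        video_id STRING,
        views BIGINT,
        event_ts TIMESTAMP
    )
    PARTITIONED BY (year INT, month INT, day INT)
    STORED AS PARQUET
    LOCATION 's3a://warehouse/clean/video_views/'
""")

# Make the existing partition directories visible to the metastore.
spark.sql("MSCK REPAIR TABLE warehouse.video_views")
```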