Traveloka’s Data Journey
Stories and lessons learned on building a scalable data pipeline at Traveloka.
Very Early Days...
Very Early Days
[Diagram: Applications & Services; Summarizer; Internal Dashboard; Report Scripts + Crontab; one DB holding Raw Activity, Key Value, and Time Series data]
Full... Split & Shard!
Raw, KV, and Time Series DB
[Diagram: Applications & Services; Summarizer; Internal Dashboard; Report Scripts + Crontab; Raw Activity (Sharded); Key Value DB (Sharded); Time Series; Summary]
Lesson Learned
1. UNIX principle: “Do One Thing and Do It Well”
2. Split use cases based on SLA & query pattern
3. Scalable tech based on growth estimation
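The slides do not say how keys were assigned to shards. As a minimal sketch of the "split and shard" idea, assuming simple hash-modulo routing (the names `shard_for` and `ShardedKV` are hypothetical, and in-memory dicts stand in for real database shards):

```python
import hashlib


def shard_for(key: str, num_shards: int) -> int:
    """Map a key to a shard index with a stable hash.

    Python's built-in hash() is salted per process, so a digest is used
    to keep the mapping identical across runs and machines.
    """
    digest = hashlib.md5(key.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_shards


class ShardedKV:
    """Toy key-value store; each dict stands in for one database shard."""

    def __init__(self, num_shards: int) -> None:
        self.shards = [dict() for _ in range(num_shards)]

    def put(self, key: str, value) -> None:
        self.shards[shard_for(key, len(self.shards))][key] = value

    def get(self, key: str, default=None):
        return self.shards[shard_for(key, len(self.shards))].get(key, default)


store = ShardedKV(num_shards=4)
store.put("user:42", {"last_search": "CGK-DPS"})
print(store.get("user:42"))
```

With modulo routing, changing the number of shards remaps most keys, which is one reason to pick the shard count from a growth estimate instead of resharding often.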
Throughput?
Kafka comes to the rescue
[Diagram: Applications & Services; Kafka as Datahub; Raw data consumer; Raw Activity (Sharded); Key Value (Sharded); insert; update]
Lesson Learned
1. Use something that can handle higher throughput for cases with high write volume, like tracking
2. Decouple publish and consume
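A minimal sketch of "decouple publish and consume", assuming an in-memory append-only log as a stand-in for a Kafka topic. It does not use the Kafka client API, and `InMemoryLog` and `Consumer` are hypothetical names. The point is only that the publisher never waits for consumers, and each consumer tracks its own read offset.

```python
class InMemoryLog:
    """Append-only log; a stand-in for a Kafka topic in this sketch."""

    def __init__(self) -> None:
        self._records = []

    def publish(self, record) -> int:
        """Append a record and return its offset. Never blocks on consumers."""
        self._records.append(record)
        return len(self._records) - 1

    def read_from(self, offset: int, max_records: int = 100) -> list:
        return self._records[offset:offset + max_records]


class Consumer:
    """Reads the log at its own pace and remembers its own offset."""

    def __init__(self, log: InMemoryLog, handler) -> None:
        self.log = log
        self.handler = handler
        self.offset = 0

    def poll(self) -> int:
        batch = self.log.read_from(self.offset)
        for record in batch:
            self.handler(record)
        self.offset += len(batch)
        return len(batch)


raw_activity = []   # stands in for the raw activity store
key_value = {}      # stands in for the key-value store


def update_key_value(event: dict) -> None:
    key_value[event["user"]] = event["page"]


log = InMemoryLog()
raw_consumer = Consumer(log, raw_activity.append)
kv_consumer = Consumer(log, update_key_value)

# The publisher writes before any consumer has run.
for event in [
    {"user": "u1", "page": "home"},
    {"user": "u2", "page": "search"},
    {"user": "u1", "page": "checkout"},
]:
    log.publish(event)

# Each consumer catches up independently.
raw_consumer.poll()
kv_consumer.poll()
```

Because the two consumers share nothing but the log, one of them can be slow or down without affecting the publisher or the other consumer.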
We need a Data Warehouse and a BI tool, and we need them fast!
[Diagram: Raw Activity (Sharded); Other sources; Python ETL (temporary solution); Star Schema DW on Postgres; Periscope BI Tool]
Lesson Learned
1. Think about the DW from the beginning of the data pipeline
2. BI Tools: Do not reinvent the wheel
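As a small illustration of the star schema named on this slide, here is a fact table joined to dimension tables. The sketch uses Python's built-in sqlite3 module in place of Postgres so that it stays self-contained; the table and column names (`fact_booking`, `dim_date`, `dim_product`) and the sample rows are made up for illustration and are not from the talk.

```python
import sqlite3


def build_demo_warehouse() -> sqlite3.Connection:
    """Create a tiny in-memory star schema: one fact table, two dimensions."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE dim_date (date_key INTEGER PRIMARY KEY, iso_date TEXT);
        CREATE TABLE dim_product (product_key INTEGER PRIMARY KEY, name TEXT);
        CREATE TABLE fact_booking (
            date_key INTEGER REFERENCES dim_date(date_key),
            product_key INTEGER REFERENCES dim_product(product_key),
            amount REAL
        );
        """
    )
    conn.executemany(
        "INSERT INTO dim_date VALUES (?, ?)",
        [(1, "2017-01-01"), (2, "2017-01-02")],
    )
    conn.executemany(
        "INSERT INTO dim_product VALUES (?, ?)",
        [(1, "flight"), (2, "hotel")],
    )
    conn.executemany(
        "INSERT INTO fact_booking VALUES (?, ?, ?)",
        [(1, 1, 100.0), (1, 2, 50.0), (2, 1, 25.0)],
    )
    return conn


def revenue_by_product(conn: sqlite3.Connection) -> list:
    """Typical BI query: aggregate the fact table, grouped by a dimension."""
    return conn.execute(
        """
        SELECT p.name, SUM(f.amount)
        FROM fact_booking AS f
        JOIN dim_product AS p ON p.product_key = f.product_key
        GROUP BY p.name
        ORDER BY p.name
        """
    ).fetchall()


conn = build_demo_warehouse()
print(revenue_by_product(conn))
```

Keeping measures in the fact table and descriptive attributes in dimensions is what lets a BI tool generate this kind of join-and-aggregate query for any dimension.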
“Have” to adopt big data
Postgres couldn’t handle the load!
[Diagram: Raw Activity (Sharded); Other sources; Python ETL (temporary solution); Star Schema DW on Redshift; Periscope BI Tool]
Lesson Learned
1. Choose the specific tech that best fits the use case
Scaling out in MongoDB every so often is not manageable...
[Diagram: Kafka as Datahub; Gobblin as Consumer; Raw Activity on S3]
Lesson Learned
1. MongoDB shard: scalability needs to be tested!
“Have” to adopt big data
[Diagram: Kafka as Datahub; Gobblin as Consumer; Raw Activity on S3; Processing on Spark; Star Schema DW on Redshift]
Lesson Learned
1. Processing has to be easy to scale
2. Scale processing separately for day-to-day jobs and backfill jobs
Near Real Time on Big Data is challenging
[Diagram: Kafka as Datahub; MemSQL for Near Real Time DB]
Lesson Learned
1. Dig into the requirements until they are very specific. For data, that means:
   1) latency SLA
   2) query pattern
   3) accuracy
   4) processing requirement
   5) tools integration
No OPS!!!
Open your mind to any combination of tech!
[Diagram: PubSub as Datahub; DataFlow for Stream Processing; Key Value on DynamoDB]
Lesson Learned
1. Combining cloud providers is possible, but be careful about latency
2. During a research project, always prepare plans B and C, plus a proper buffer in the timeline
3. Autoscale!
More autoscale!
[Diagram: PubSub as Datahub; BigQuery for Near Real Time DB]
Lesson Learned
1. Autoscale = cost monitoring
Caveat
Autoscale != everything solved
e.g. PubSub default quota is 200MB/s (it can be increased, but only by manual request)
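The quota caveat can be turned into a simple capacity check. This is a sketch only: the 200 MB/s figure comes from the slide, while the function names, the growth factor, and the 20% safety margin are assumptions chosen for illustration, not Pub/Sub settings.

```python
PUBSUB_DEFAULT_QUOTA_MB_S = 200.0  # default quota quoted on the slide


def quota_headroom(peak_mb_s: float,
                   quota_mb_s: float = PUBSUB_DEFAULT_QUOTA_MB_S) -> float:
    """Fraction of the quota still unused at peak (negative if over quota)."""
    return 1.0 - peak_mb_s / quota_mb_s


def needs_quota_request(peak_mb_s: float,
                        growth_factor: float,
                        quota_mb_s: float = PUBSUB_DEFAULT_QUOTA_MB_S,
                        safety_margin: float = 0.2) -> bool:
    """True if projected peak traffic would eat into the safety margin.

    growth_factor and safety_margin are illustrative assumptions: the
    expected traffic multiplier and the share of quota kept in reserve.
    """
    projected_peak = peak_mb_s * growth_factor
    return projected_peak > quota_mb_s * (1.0 - safety_margin)


print(quota_headroom(50.0))
print(needs_quota_request(peak_mb_s=100.0, growth_factor=2.0))
```

Since the increase has to be requested manually, a check like this is most useful when run ahead of expected traffic growth, not at the moment the limit is hit.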
More autoscale!
[Diagram: Kafka as Datahub; Gobblin as Consumer; Raw Activity on S3; Processing on Spark; Hive & Presto on Qubole as Query Engine; BI & Exploration Tools]
Lesson Learned
1. Scale as granularly as possible; in this case, scale compute and storage separately
2. Separate BI with a well-defined SLA from the exploration use case
WRAP UP
[Diagram: overall architecture with Streaming and Batch paths. Components: Traveloka App; Android, iOS; Traveloka Services; Kafka; ETL; Data Warehouse; S3 Data Lake; Batch Ingest; NoSQL DB; Ingest (Cloud Pub/Sub); Storage (Cloud Storage); Pipelines (Cloud Dataflow); Analytics (BigQuery); Monitoring; Logging; Query (Hive, Presto); DOMO; Analytics UI; Consumer of Data]
Key Lessons Learned
● Keep scalability in mind -- especially disks filling up :)
● Scale as granularly as possible -- compute, storage
● Scalability needs to be tested (of course!)
● Do one thing and do it well; dig into your requirements -- SLA, query pattern
● Decouple publish and consume -- publisher availability is very important!
● Choose tech that is specific to the use case
● Beware of gotchas! There's no silver bullet...
THE FUTURE
Future Roadmap
● In the past, we saw a problem or need, looked for a technology that could solve it, and plugged it into the existing pipeline.
● It worked well.
● But over time, we ended up having to maintain a lot of different components.
● Multiple clusters:
○ Kafka
○ Spark
○ Hive/Presto
○ Redshift
○ etc
● Multiple data entry points for analysts:
○ BigQuery
○ Hive/Presto
○ Redshift
Our Goal
● Simplify our data architecture.
● A single data entry point for data analysts/scientists, for both streaming and batch data.
● Without compromising what we can do now.
● Reliability, speed, and scale.
● Little or no ops.
● We also want to make migration as simple as possible.
How will we achieve this?
● There are a few options that we are considering right now.
● Some of them introduce new technologies/components.
● Some of them use our existing technology to its maximum potential.
● We are trying exciting (relatively) new technologies:
○ Google BigQuery
○ Google Dataprep on Dataflow
○ AWS Athena
○ AWS Redshift Spectrum
○ etc
Plan to simplify
[Diagram: Collector; Kubernetes Cluster; Cloud Pub/Sub; Cloud Dataflow; BigQuery; Cloud Storage; BigTable; Managed services; REST API; ML Models; BI & Analytics UI]
Plan to simplify
● Seems promising, but…
● It needs to be tested.
● Does it cover all the use cases we need?
● Query migration?
● Costs?
● Maintainability?
● Potential problems?
See you at the next event!
Thank you

Traveloka's data journey — Traveloka data meetup #2
