1. Traveloka’s Journey to No Ops Streaming Analytics
Rendy Bambang Jr., Data Eng Lead - Traveloka
Gaurav Anand, Solutions Engineer - Google
2. How we use the data
● Business Intelligence
● Analytics
● Personalization
● Fraud Detection
● Ads Optimization
● Cross-selling
● A/B Testing
● etc.
3. 6 offices, incl. Singapore · 1,000+ global employees · 400+ engineers
Our technology core has enabled us to scale Traveloka into 6 countries across ASEAN rapidly in less than 2 years.
6. Initial Data Architecture
[Architecture diagram: the Traveloka App (Android, iOS) and Traveloka Services publish events to Kafka. A streaming path feeds an In-Memory Real-Time DW and a NoSQL Realtime DB; a batch path (ETL, Batch Ingest) feeds the S3 Data Lake and the Data Warehouse. Consumers of data query via Hive/Presto and the DOMO Analytics UI.]
7. Key Numbers
● Kafka volume: billions of messages/day
● In-Memory DB: hundreds of GB of in-memory data
● NoSQL DB: 50+ nodes, 20+ TB storage, 50+ use cases
● S3: hundreds of TB
● Spark: 20+ nodes, 200+ cores
● Redshift DW: 20+ nodes, tens of TB
● Team: 8 developers + 3 SysOps/DevOps
8. Problems with Initial Data Architecture
[Same architecture diagram as slide 6 — Traveloka App and Services into Kafka, streaming and batch paths into the In-Memory Real-Time DW, NoSQL Realtime DB, S3 Data Lake, and Data Warehouse, queried via Hive/Presto and DOMO — annotated with the problem areas listed on the next slide.]
9. Problems with Initial Data Architecture
● Debugging Kafka issues required a dedicated on-call rotation
● Data Warehouse throughput issues under high-frequency loads, due to coupled storage and compute
● Team well-being suffered: engineers were paged for infra issues on holiday, even on a honeymoon!
● Scaling issues with the NoSQL DB and In-Memory DB
● Scaling issues with custom-built Java consumers
11. Ideal Solution
● Fully managed infrastructure that frees engineers to solve business problems
● Autoscaling of storage and compute
● Low end-to-end latency with a guaranteed SLA
● Resilience and end-to-end system availability
12. Solution Components
● Google Cloud Pub/Sub (events data ingestion)
● Google Cloud Dataflow (stream processing)
● Google BigQuery (analytics)
● Cross-cloud environment (AWS–GCP)
● AWS DynamoDB (operational datastore)
Note: Although Cloud Datastore was our preferred operational DB, its unavailability in the Singapore region necessitated the use of DynamoDB.
14. Analytics Architecture: Reimagined
[Architecture diagram: the Traveloka App (Android, iOS) and Traveloka Services now ingest events via Cloud Pub/Sub; Cloud Dataflow pipelines write to Cloud Storage and BigQuery for analytics, with Monitoring and Logging throughout. The legacy path — Kafka, ETL, Batch Ingest into the S3 Data Lake, Data Warehouse, and NoSQL DB, queried via Hive/Presto and the DOMO Analytics UI — remains alongside it for consumers of data during migration.]
15. Developed Two Common Dataflow Engines
● Self-service streaming analytics to BigQuery
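To make the "self-service" idea concrete, here is an illustrative sketch (not Traveloka's actual code) of the pattern: a team declares its event type and field mapping in a config, and a shared pipeline turns raw JSON events into rows destined for a BigQuery table, so no new pipeline code is needed per use case. All names in the config are hypothetical.

```python
# Hypothetical config-driven event-to-row mapping, the core of a
# "self-service" streaming-analytics engine: teams supply the config,
# a shared Dataflow job supplies the plumbing.
import json

# Hypothetical per-team config: target table plus field -> type mapping.
CONFIG = {
    "event_type": "flight_search",
    "table": "analytics.flight_search_events",
    "fields": {"user_id": str, "origin": str, "destination": str, "ts": int},
}

def event_to_row(raw_event: str, config: dict):
    """Parse one JSON event and keep only the declared, type-checked fields."""
    event = json.loads(raw_event)
    if event.get("type") != config["event_type"]:
        return None  # not for this table; a real pipeline routes it elsewhere
    row = {}
    for name, typ in config["fields"].items():
        value = event.get(name)
        row[name] = typ(value) if value is not None else None
    return row

row = event_to_row(
    '{"type": "flight_search", "user_id": "u1", "origin": "CGK",'
    ' "destination": "SIN", "ts": 1528851600, "extra": "dropped"}',
    CONFIG,
)
print(row)  # undeclared fields like "extra" are dropped
```

In a real Beam/Dataflow pipeline this function would be the body of a `DoFn` between the Pub/Sub source and the BigQuery sink.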
16. Developed Two Common Dataflow Engines
● Stream processing to DynamoDB, with common features for developers:
○ Combine by key
○ Optimistic concurrency
○ Local-file-based integration tests
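The optimistic-concurrency feature above can be sketched as a read–modify–write loop with a version check, the pattern DynamoDB supports via a `ConditionExpression` on writes. This is a minimal illustration only; an in-memory dict stands in for the table, and all keys and helper names are hypothetical.

```python
# Sketch of optimistic concurrency for aggregate writes: each item
# carries a version number, and a write succeeds only if the version
# has not changed since the item was read. With DynamoDB this would be
# a conditional PutItem; here a dict simulates the table.

class VersionConflict(Exception):
    """Raised when another writer updated the item first."""

table = {}  # item_key -> {"value": ..., "version": int}

def read(key):
    item = table.get(key, {"value": 0, "version": 0})
    return item["value"], item["version"]

def conditional_write(key, new_value, expected_version):
    current = table.get(key, {"value": 0, "version": 0})
    if current["version"] != expected_version:
        raise VersionConflict(key)
    table[key] = {"value": new_value, "version": expected_version + 1}

def add_with_retry(key, delta, max_retries=5):
    # Read-modify-write loop: on conflict, re-read and retry,
    # as a streaming worker would when two workers race on one key.
    for _ in range(max_retries):
        value, version = read(key)
        try:
            conditional_write(key, value + delta, version)
            return
        except VersionConflict:
            continue
    raise RuntimeError("gave up after repeated conflicts")

add_with_retry("bookings:2018-06-01", 3)
add_with_retry("bookings:2018-06-01", 2)
print(table["bookings:2018-06-01"]["value"])  # 5
```

The retry loop is what makes concurrent workers safe without locks: a lost race costs one extra round trip rather than a corrupted aggregate.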
17. Key Facts/Numbers
● End-to-end pipeline latency: seconds
● Volume: hundreds of GB/day
● Team: 2 developers, 0 Ops
● Agility: POC + pilot in 1 month
● Migrated 50+ different stream processing use cases in 1 month
● BigQuery integration with BI tools: thousands of dashboards, hundreds of users
26. Traveloka Data Team Philosophy
● Managed Service
● NoOps
● Self-Service
Focus more on solving complex business problems rather than focusing on
infrastructure
27. What drove us to change?
● Ever increasing scale
● Ever increasing operations burden
● New business needs: Streaming Analytics
One of the most strategic parts of Traveloka's business is a streaming data processing pipeline that powers a number of use cases, including fraud detection, personalization, ads optimization, cross selling, A/B testing, and promotion eligibility.
In this talk, we’ll describe how Traveloka recently migrated this pipeline from a legacy architecture to a multi-cloud solution that includes the Google Cloud Platform (GCP) data analytics platform.
Highlight data engineer, highlight Singapore office
Traveloka is a travel technology company based in Indonesia, Singapore, and India. Its goal is to revolutionize human mobility.
Traveloka vision
The purpose of my talk today is to give you practical feedback regarding Pub/Sub, Dataflow, BigQuery and, more broadly, our usage of Google Cloud Platform.
I'll start by briefly talking about Traveloka and what we do. Then I'll discuss the architecture we used for the past few years and the reasons why we decided to investigate new solutions. Finally, I'll present what we put in place and the lessons learned along the way.
Track Session: As Traveloka grew over time, several problems emerged, including:
Track Session: We did our homework on technology that could support these requirements for our use case.
4 minutes
Highlight component
Role and mapping
similar function from both sides, like BigQuery
Hybrid, not the end state
Track Session: We did our homework on technology that could support these requirements for our use case.
Basic idea:
Run a low-latency, weakly consistent streaming system
alongside a high-latency, strongly consistent batch system,
and somehow merge their results together at the end.
This provided low-latency, correct results,
but at the cost of building, maintaining, and merging the results from two separate systems.
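The merge step described above can be sketched in a few lines: the batch layer periodically produces exact counts up to some watermark, the streaming layer keeps fresh counts for events after it, and a query combines the two views. This is a generic Lambda-architecture illustration, not code from the talk; all metric names and numbers are made up.

```python
# Minimal sketch of the Lambda-style merge: exact-but-stale batch
# results combined with fresh-but-revisable streaming results at
# query time. Maintaining this merge (and both systems feeding it)
# is the cost the unified model aims to remove.

batch_view = {"searches": 1000, "bookings": 120}    # exact, but hours old
streaming_view = {"searches": 42, "bookings": 3}    # fresh, may be revised

def merged_count(metric: str) -> int:
    """Serve a query by combining the batch and speed layers."""
    return batch_view.get(metric, 0) + streaming_view.get(metric, 0)

print(merged_count("searches"))  # 1042
```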
So what we set out to do with Apache Beam was to provide a unified model...
One which could give you the features of both systems…
But even more than that, one which would allow you to trade off the characteristics of each according to use case.
So after you write your pipeline,
This is an approach we laid out in our 2015 VLDB paper on the Dataflow Model. And if you want to learn more in detail, that’s a good place to start.
...whether that’s Dataflow, Apache Spark, Apache Flink, Apache Apex, or any other runner we support.
And then for our part...