"GOJEK, the Southeast Asian super-app, has seen an explosive growth in both users and data over the past three years. Today the technology startup uses big data powered machine learning to inform decision-making in its ride-hailing, lifestyle, logistics, food delivery, and payment products. From selecting the right driver to dispatch, to dynamically setting prices, to serving food recommendations, to forecasting real-world events. Hundreds of millions of orders per month, across 18 products, are all driven by machine learning.
Building production grade machine learning systems at GOJEK wasn't always easy. Data processing and machine learning pipelines were brittle, long running, and had low reproducibility. Models and experiments were difficult to track, which led to downstream problems in production during serving and model evaluation. In this talk we will cover these and other challenges that we faced while trying to scale end-to-end machine learning systems at GOJEK. We will then introduce MLflow and explore the key features that make it useful as part of an ML platform. Finally, we will show how introducing MLflow into the ML life cycle has helped to solve many of the problems we faced while scaling machine learning at GOJEK.
"
3. Our Scale
Operating in 4 countries
and more than 70 cities
80m app downloads
250k+ merchants
4 countries
1m+ drivers
100m+ monthly bookings
Indonesia
Singapore
Thailand
Vietnam
13. Matchmaking: First Cut
Raw Data
Prod
Serving
How are we going
to train models?
Deploy
Process
Data
Airflow
14. Matchmaking: First Cut
Raw Data
Prod
Serving
Build, Test, Deploy
Application
Process Data, Train Model
Airflow
Trigger: API call
Trigger: Daily schedule
Helm deploy to Kubernetes
16. Challenges with this approach
● Inefficient
○ Need to wait hours for pipeline to run before deploying models
○ Can’t deploy serving without trigger from Airflow
17. Challenges with this approach
● Inefficient
● Hard to experiment
○ Do we fork the codebase for each small change?
○ Do we fan-in and fan-out a single pipeline?
○ How do we track model performance over time?
18. Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
Model tracking
by timestamp?
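Timestamps make weak version identifiers: two pipelines can emit models in the same second, and re-running an identical pipeline mints a "new" version of an unchanged model. One alternative is to derive the version from the model bytes themselves. A minimal sketch with a hypothetical helper (not the scheme GOJEK adopted):

```python
import hashlib

def model_version(model_bytes: bytes) -> str:
    # Content-addressed id: identical models always map to the same
    # version, and any change to the bytes yields a new one.
    return hashlib.sha256(model_bytes).hexdigest()[:12]

v1 = model_version(b"serialized-model-weights")
v2 = model_version(b"serialized-model-weights")
assert v1 == v2                                   # re-runs don't mint spurious versions
assert v1 != model_version(b"retrained-weights")  # real changes do
```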
19. Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
● Low reproducibility
○ Pipelines have non-deterministic side inputs (API calls,
fetching data, reading configuration)
○ No standardized way to track artifacts or processes
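A common remedy for non-deterministic side inputs is to resolve them once, up front, and record the resolved values alongside the run, so the run can be replayed from the snapshot alone. A sketch with hypothetical helper names and illustrative values (a tracking tool such as MLflow can store a snapshot like this as run parameters or an artifact):

```python
import hashlib
import json

def snapshot_side_inputs(config: dict, data_uri: str, code_sha: str) -> dict:
    # Freeze everything the pipeline would otherwise fetch at runtime:
    # configuration, the exact data location, and the code revision.
    snapshot = {"config": config, "data_uri": data_uri, "code_sha": code_sha}
    # Fingerprint the snapshot so two runs can be compared for identity.
    payload = json.dumps(snapshot, sort_keys=True).encode()
    snapshot["fingerprint"] = hashlib.sha256(payload).hexdigest()[:12]
    return snapshot

run_a = snapshot_side_inputs({"alpha": 0.5}, "gs://bucket/2019-04-01/", "9fceb02")
run_b = snapshot_side_inputs({"alpha": 0.5}, "gs://bucket/2019-04-01/", "9fceb02")
assert run_a["fingerprint"] == run_b["fingerprint"]  # same inputs => same run identity
```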
20. Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
● Low reproducibility
● Low visibility
Features? Models? Parameters? Metrics?
21. Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
● Low reproducibility
● Low visibility
● Hard to scale
How do we scale to 1000s of models and new markets?
Airflow trains model,
triggers new deploy
through GitLab
Hardcoded deployment targets
22. Challenges with this approach
● Inefficient
● Hard to experiment
● Versioning is broken
● Low reproducibility
● Low visibility
● Hard to scale
● No separation of roles
Raw Data
Prod
Serving
Process data + Train models + Deploy
Responsibility of
Data Engineers,
Software Engineers,
Data Scientists
23. Desired state
● Easy to experiment
● Easy to reproduce results
● Easy to deploy models
● Easy to evaluate performance of features and models
● Capable of scaling to 1000s of models in many regions
25. Tracking
Record and query
experiments: code,
data, config, results
Projects
Packaging format
for reproducible runs
on any platform
Models
General model format
that supports diverse
deployment tools
MLflow Components
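The Projects component above is essentially a directory convention: an MLproject file declares the environment and entry points, which is what makes a run reproducible on any platform. A minimal sketch (project, file, and parameter names are illustrative):

```yaml
name: matchmaking-trainer        # illustrative project name

conda_env: conda.yaml            # pinned environment for reproducible runs

entry_points:
  main:
    parameters:
      alpha: {type: float, default: 0.5}
      l1_ratio: {type: float, default: 0.1}
    command: "python train.py --alpha {alpha} --l1-ratio {l1_ratio}"
```

Anyone with the project directory can then reproduce the run with `mlflow run . -P alpha=0.5`.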
26. • Parameters: key-value
inputs to your code
• Metrics: numeric values
(can update over time)
• Artifacts: arbitrary files,
including models
• Source: which version
of code ran?
Key Concepts in Tracking
28. Approach
1. Decouple based on concerns
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
???
Train
Models
???
29. 1. Decouple based on concerns
2. Implement ML pipeline solution
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
???
Train
Models
???
Approach
30. 1. Decouple based on concerns
2. Implement ML pipeline solution and Continuous Delivery solution
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
???
Train
Models
???
Approach
31. 1. Decouple based on concerns
2. Implement ML pipeline solution and Continuous Delivery solution
3. Add an artifact store between stages for features (Feast)
Feature
Store
Raw Data
Prod
Serving
Deploy
Airflow
Process
Data
*GOJEK/
Feast
Train
Models
???
*http://github.com/gojek/feast
Approach
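The value of the feature store in this layout is that training and serving read the same materialized features instead of each recomputing them. A toy illustration in pure Python (not Feast's actual API):

```python
# Toy feature store: both training and serving resolve features from the
# same materialized values, keyed by entity id and feature set version.
store = {}

def ingest(entity_id: str, version: str, features: dict) -> None:
    store[(entity_id, version)] = features

def get_features(entity_id: str, version: str) -> dict:
    return store[(entity_id, version)]

ingest("driver-42", "v1", {"completed_trips": 1024, "rating": 4.9})
# Training and serving both see the identical row, eliminating
# train/serve skew from duplicated feature computation:
assert get_features("driver-42", "v1") == {"completed_trips": 1024, "rating": 4.9}
```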
32. 1. Decouple based on concerns
2. Implement ML pipeline solution and Continuous Delivery solution
3. Add an artifact store between stages for features (Feast) and models (MLflow)
Model
Store
Feature
Store
Raw Data
Prod
Serving
Airflow
Process
Data
GOJEK/
Feast
Train
Models
Deploy
Approach
33. Advantages: Asynchronous Experimentation
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
1. Time based
2. Instance based
3. Artifact based
import mlflow
import mlflow.sklearn

with mlflow.start_run():
    # train model...
    mlflow.log_param("alpha", alpha)        # parameters: key-value inputs
    mlflow.log_param("l1_ratio", l1_ratio)
    mlflow.log_metric("rmse", rmse)         # metrics: numeric results
    mlflow.log_metric("r2", r2)
    mlflow.sklearn.log_model(lr, "model")   # artifacts: the model itself
34. Advantages: Reproducible & Traceable
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
Track artifacts used to train
models
● features
● pipeline version (git+SHA)
● and other pipeline variables
Track artifacts used to
deploy ML systems
● docker image
● configuration
● model version
● feature data
Track artifacts used to
produce features
● data sources
● jobs
● parameters
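Chaining the three stages' records gives end-to-end lineage: from a live deployment back to the raw data that fed it. A toy sketch (keys and ids are illustrative, not Feast's or MLflow's actual schema):

```python
# One record per stage, each pointing at the artifact it consumed.
feature_run = {"id": "features/2019-04-01", "sources": ["bookings_db"], "job": "daily_agg"}
training_run = {"id": "model/v12", "features": feature_run["id"], "pipeline_sha": "9fceb02"}
deployment = {"id": "deploy/sg-prod", "model": training_run["id"], "image": "serve:1.4.2"}

def trace(deploy: dict, model: dict, features: dict) -> list:
    # Walk back from a deployment to the features that trained its model.
    assert deploy["model"] == model["id"]
    assert model["features"] == features["id"]
    return [deploy["id"], model["id"], features["id"]]

lineage = trace(deployment, training_run, feature_run)
```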
35. Advantages: Governance & Evaluation
Prod
Serving
Feature
Store
Train
Models
Deploy
1. training run parameters
2. deployment configuration
3. model performance
4. feature performance
36. Advantages: Role Separation
Raw Data
Process
Data
Prod
Serving
Feature
Store
Train
Models
Deploy
Data Engineer | Data Scientist | Software Engineer
37. Advantages: Scalability
Driver Allocation System: (3 environments) × (4 markets) × (5 model types) × (10+ live experiments) = 600+ simultaneous deployments
gke-PROD-SG-T1-EXP2323
CD Pipeline
(pull based)
Configuration
Helm Charts
Docker Images
gke-PROD-TH-T2-EXP1006
gke-PROD-ID-T3-EXP3423
gke-PROD-VN-T4-EXP1800
1
New model is
published
2
Monitors all artifacts
for new versions
3
Test and deploy changes to
relevant clusters
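The multiplication on this slide can be sketched directly: the CD pipeline fans out one deployment target per combination of environment, market, model type, and experiment. Environment and experiment names below are assumed for illustration:

```python
from itertools import product

environments = ["dev", "staging", "prod"]        # 3 environments (names assumed)
markets      = ["ID", "SG", "TH", "VN"]          # 4 markets
model_types  = [f"T{i}" for i in range(1, 6)]    # 5 model types
experiments  = [f"EXP{i}" for i in range(10)]    # 10 of the "10+" live experiments

# Each combination is one independently versioned deployment target
# that the pull-based CD pipeline has to keep in sync.
targets = [f"gke-{env}-{mkt}-{mt}-{exp}"
           for env, mkt, mt, exp in product(environments, markets, model_types, experiments)]
assert len(targets) == 3 * 4 * 5 * 10  # 600 simultaneous deployments
```

A pull-based pipeline scales here because each cluster watches the artifact stores for new versions, rather than a central trigger having to know all 600+ targets.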