Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike traditional software projects, they cannot simply be handed over once successfully delivered and deployed: they must be continuously monitored to check that model performance still satisfies all requirements. New data with new statistical characteristics can arrive at any time and break our pipelines or degrade model performance.
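As a minimal illustration of that monitoring requirement, the hypothetical check below compares a basic statistic of a new data batch against a training-time baseline; the column name and threshold are assumptions for the sketch, not part of any actual Runtastic pipeline.

import pandas as pd

def detect_drift(baseline: pd.DataFrame, new_batch: pd.DataFrame,
                 column: str, threshold: float = 0.2) -> bool:
    # Flag drift when the mean of `column` shifts by more than
    # `threshold` baseline standard deviations (illustrative rule).
    shift = abs(new_batch[column].mean() - baseline[column].mean())
    return shift > threshold * baseline[column].std()

# Hypothetical example: the distance distribution of new running
# sessions drifts away from the baseline, so we raise an alert.
baseline = pd.DataFrame({"distance_km": [5.1, 7.3, 10.2, 4.8, 6.5]})
new_batch = pd.DataFrame({"distance_km": [21.0, 18.4, 25.3, 19.9, 22.7]})
if detect_drift(baseline, new_batch, "distance_km"):
    print("Drift detected: re-validate model performance")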
3. EMANUELE VIGLIANISI
Data Engineer at Runtastic since January 2020
Previously Security Testing Researcher in FinTech
emanuele.viglianisi@runtastic.com
4. MICHAEL SHTELMA
Solutions Architect at Databricks since April 2019
Previously Technical Lead Data Foundation at Teradata
michael.shtelma@databricks.com
6. RUNTASTIC BY THE NUMBERS
● WE HAVE 4 FOUNDERS
● WE COME FROM 40 COUNTRIES
● 167M REGISTERED USERS
● WE ARE 10 YEARS OLD
● WE HAVE 3 OFFICES
● 309M APP STORE DOWNLOADS
● WE WERE PROFITABLE AFTER JUST 20 MONTHS
● OUR PRODUCTS ARE AVAILABLE IN 14 LANGUAGES
● WE ARE 270 EMPLOYEES
● 5.3M FOLLOWERS & FANS
● 4.71 APP STORE RATING
7. ADIDAS
TRAINING
● 180+ HD exercise videos with step-by-step instructions
● 25+ standalone workouts to train anytime, anywhere
● Guided video workouts let you exercise along with our fitness experts and your favorite athletes
● Special indoor workouts, suitable for home
● No additional equipment necessary
● Health and nutrition guide to complement your fitness
● Proven quality through development cooperation with Apple and Google
● Top-rated app on the Apple App Store and Google Play
🔗 Download the App Now
8. ADIDAS
RUNNING
● Our original flagship app
● Allows you to track your sports activities using GPS technology
● 90+ available sport types
● Share your sports activities and reach your goals
● Participate in challenges
● Compare yourself with your friends on the Leaderboard
● Listen to Story Runs while you are active
● and use many more features…
🔗 Download the App Now
10. ▪ Global company with over 5,000 customers and 450+ partners
▪ Original creators of popular data and machine learning open source projects
A unified data analytics platform for accelerating innovation across data engineering, data science, and analytics
12. Our Goal
Move the on-premises Analytics Backend to the cloud (Microsoft Azure and Databricks) while ensuring high-quality software.
13. The CI/CD challenge
Question
CI/CD is fundamental in the software development workflow for ensuring high-quality code. Is there a way to integrate CI/CD into Databricks for our Data Engineering pipelines?
14. CI/CD Benefits
Key Points of CI/CD
- Continuous integration (CI) is the practice of automating the integration of code changes from multiple contributors into a single software project. The CI process comprises automated tools that assert the code's correctness before and after integration (tests).
- Continuous delivery (CD) is an approach where teams release quality products frequently and predictably, from the source code repository to production, in an automated fashion.
Our needs
CI/CD lets us automate long and error-prone processes such as:
- Testing the code before every Pull Request merge (see the sketch below)
- Deploying the right code into the right environment (DEV, PRD)
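As a minimal sketch of the kind of pre-merge test we mean, assuming a hypothetical anonymization transformation (the function and field names are illustrative, not our actual pipeline code):

def anonymize(records):
    # Hypothetical transformation: drop direct identifiers before analytics.
    return [{k: v for k, v in r.items() if k != "email"} for r in records]

def test_anonymize_removes_email():
    records = [{"user_id": 1, "email": "a@example.com", "distance_km": 5.0}]
    assert all("email" not in r for r in anonymize(records))

A CI runner such as GitHub Actions would execute tests like this (e.g. via pytest) on every Pull Request and block the merge on failure.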
17. CHALLENGES
What is the cloud challenge?
DATA
- Tests require production-like data (static or dynamic)
- Production-like data is available in the cloud only
- Integration tests therefore have to run in the cloud
CLOUD DEPENDENCIES
ETL pipelines make use of different cloud services:
- Ingest data into the cloud from Azure Event Hubs
- Store it in Azure Data Lake
- Require authorization for accessing the data via Azure Active Directory rules
- Use secrets securely stored in the cloud in Azure Key Vault
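For example, inside a Databricks job a Key Vault secret is typically read through a Key Vault-backed secret scope; the scope and key names below are placeholders:

# Inside a Databricks notebook or job: dbutils is provided by the runtime.
# The scope and key names here are placeholders for illustration.
event_hub_key = dbutils.secrets.get(scope="our-keyvault-scope", key="eventhub-access-key")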
18. The problem we had
AIMING TO IMPLEMENT CI/CD USING:
Option 1: Databricks notebooks
Limitations
- It is difficult to divide the code into different sub-modules/projects
- Versioning is possible, but only one notebook at a time
- No tooling for automatic tests
- No good place for tests
Option 2: Databricks Connect
Limitations
- It does not support Streaming Jobs
- It is not possible to run arbitrary code that is not part of a Spark job on the remote cluster (see the sketch below)
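To make that last limitation concrete: with Databricks Connect, a local script offloads only Spark operations to the remote cluster, while everything else runs on the local machine. A minimal sketch, assuming databricks-connect is already configured:

from pyspark.sql import SparkSession

# With databricks-connect configured, this session is bound to a remote
# Databricks cluster; only the Spark job below executes remotely.
spark = SparkSession.builder.getOrCreate()
print(spark.range(10).count())  # runs on the cluster

# Plain Python like this print stays on the local machine.
print("arbitrary non-Spark code does not run on the cluster")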
20. CICD TEMPLATE
ML teams struggle to combine traditional CI/CD tools with Databricks notebooks
- Benefits of Databricks notebooks
  - Easy to use
  - Scalable
  - Provide access to ML tools such as MLflow for model logging and serving
- Challenges
  - Non-trivial to hook into traditional software development tools such as CI tools or local IDEs
- Result
  - Teams find themselves choosing between using traditional IDE-based workflows but struggling to test and deploy at scale, or using Databricks notebooks or other cloud notebooks but then struggling to ensure testing and deployment reliability via CI/CD pipelines.
21. CICD TEMPLATE
CI/CD Templates gives you the benefits of traditional CI/CD workflows and the scale of Databricks clusters
CI/CD Templates allows you to:
● create a production pipeline from a template in a few steps,
● hook it automatically into GitHub Actions,
● run tests and deployments on Databricks upon git commit or whatever trigger you define, and
● get a test success status directly in GitHub, so you know whether your commit broke the build.
22. A scalable CI/CD pipeline in 5 easy steps
1. Install and customize with a single command (see the sketch below)
2. Create a new GitHub repo containing your Databricks host and token secrets
3. Initialize git in your repo and commit the code
4. Push your new CI/CD Templates project to the repo. Your tests will start running automatically on Databricks. Upon your tests' success or failure you will get a green checkmark or a red X next to your commit status.
5. You're done! You now have a fully scalable CI/CD pipeline.
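For step 1, the project's README at the time described a cookiecutter-based flow roughly like the following; the exact commands may have changed since, so treat this as an assumption rather than the definitive procedure:

pip install cookiecutter
cookiecutter https://github.com/databrickslabs/cicd-templates.git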
23. CI/CD Templates executes tests and deployments directly on Databricks, while storing packages, model logs, and other artifacts in MLflow
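As a minimal sketch of what such artifact tracking can look like (the run name, parameter, and file path are placeholders, and the report file must exist before it is logged):

import mlflow

with mlflow.start_run(run_name="cicd-test-run"):
    mlflow.log_param("pipeline", "anonymization_pipeline")
    mlflow.log_artifact("test_results.xml")  # e.g. a JUnit-style test report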
27. How we are using the CICD template
There are in total 4 environments, each with its own Databricks token:
- DEV: playground for DS/DA/DE (DB token DEV)
- STG: stable code on release-candidate data (DB token STG)
- PRE: release-candidate code on PRD data (DB token PRE)
- PRD: stable code on PRD data (DB token PRD)
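A hypothetical helper illustrating the token-per-environment setup; the DB_TOKEN_<ENV> secret naming scheme is an assumption for this sketch, not our actual configuration:

import os

def databricks_token(env: str) -> str:
    # One token per environment, e.g. stored as CI secrets
    # named DB_TOKEN_DEV, DB_TOKEN_STG, DB_TOKEN_PRE, DB_TOKEN_PRD.
    assert env in {"DEV", "STG", "PRE", "PRD"}, f"unknown environment: {env}"
    return os.environ[f"DB_TOKEN_{env}"]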
31. Run a (test) pipeline
1. Move to the target environment
export DATABRICKS_ENV=DEV
export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
2. Run the pipeline
python3 run_pipeline.py pipelines --pipeline-name anonymization_pipeline
32. Deploy pipelines
1. Move to the target environment
export DATABRICKS_ENV=DEV
export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
2. Deploy the pipelines
from databrickslabs_cicdtemplates import release_cicd_pipeline
release_cicd_pipeline.main(
    'tests/integration',  # folder with the testing pipelines
    'pipelines',          # folder with the pipelines to deploy
    True,                 # run tests before deploying
    env=DATABRICKS_ENV)   # deployment environment
37. Conclusions
Key Takeaways
1. Code and data of ETL pipelines need to be tested, like everything else in Software Engineering. CI/CD is necessary for automating the testing and deployment processes and achieving high-quality software.
2. CI/CD is not easy to implement: Databricks notebooks and Databricks Connect are not enough for complex scenarios.
3. The CI/CD Templates project by Databricks Labs allows us to better organize our code into sub-modules and to implement CI/CD through its easy integration with GitHub Actions.
38. Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
Thank you!