Data & ML projects bring many new complexities beyond the traditional software development lifecycle. Unlike traditional software projects, they cannot simply be handed over once successfully delivered and deployed: they must be continuously monitored to check that model performance still satisfies all requirements. New data with new statistical characteristics can arrive at any time and break our pipelines or degrade model performance.
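As a minimal illustration of that monitoring requirement, the hypothetical check below compares a basic statistic of a new data batch against a training-time baseline; the column name and threshold are assumptions for the sketch, not part of any actual Runtastic pipeline.

import pandas as pd

def detect_drift(baseline: pd.DataFrame, new_batch: pd.DataFrame,
                 column: str, threshold: float = 0.2) -> bool:
    # Flag drift when the mean of `column` shifts by more than
    # `threshold` baseline standard deviations (illustrative rule).
    shift = abs(new_batch[column].mean() - baseline[column].mean())
    return shift > threshold * baseline[column].std()

# Hypothetical example: the distance distribution of new running
# sessions drifts away from the baseline, so we raise an alert.
baseline = pd.DataFrame({"distance_km": [5.1, 7.3, 10.2, 4.8, 6.5]})
new_batch = pd.DataFrame({"distance_km": [21.0, 18.4, 25.3, 19.9, 22.7]})
if detect_drift(baseline, new_batch, "distance_km"):
    print("Drift detected: re-validate model performance")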
3. EMANUELE VIGLIANISI
Data Engineer at Runtastic since January 2020
Previously Security Testing Researcher in FinTech
emanuele.viglianisi@runtastic.com
4. MICHAEL SHTELMA
Solutions Architect at Databricks since April 2019
Previously Technical Lead Data Foundation at Teradata
michael.shtelma@databricks.com
6. RUNTASTIC BY THE NUMBERS
● WE HAVE 4 FOUNDERS
● WE COME FROM 40 COUNTRIES
● 167M REGISTERED USERS
● WE ARE 10 YEARS OLD
● WE HAVE 3 OFFICES
● 309M APP STORE DOWNLOADS
● WE WERE PROFITABLE AFTER JUST 20 MONTHS
● OUR PRODUCTS ARE AVAILABLE IN 14 LANGUAGES
● WE ARE 270 EMPLOYEES
● 5.3M FOLLOWERS & FANS
● 4.71 APP STORE RATING
7. ADIDAS
TRAINING
● 180+ HD exercise videos with step-by-step instructions
● 25+ standalone workouts to train anytime, anywhere
● Guided video workouts let you exercise along with our fitness experts and your favorite athletes
● Special indoor workouts, suitable for home
● No additional equipment necessary
● Health and nutrition guide to complement your fitness
● Proven quality through development cooperation with Apple and Google
● Top-rated app on the Apple App Store and Google Play
🔗 Download the App Now
8. ADIDAS
RUNNING
● Our original flagship app
● Allows you to track your sports activities using GPS technology
● 90+ available sport types
● Share your sports activities and reach your goals
● Participate in challenges
● Compare yourself with your friends on the Leaderboard
● Listen to Story Runs while you are active
● and use many more features…
🔗 Download the App Now
10. ▪ Global company with over 5,000 customers and 450+ partners
▪ Original creators of popular data and machine learning open source projects
A unified data analytics platform for accelerating innovation across data engineering, data science, and analytics
12. Our Goal
Move the on-premises Analytics Backend to the cloud (Microsoft Azure and Databricks) while ensuring high-quality software.
13. The CI/CD challenge
Question
CI/CD is fundamental in the software development workflow for ensuring high-quality code. Is there a way to integrate CI/CD into Databricks for our Data Engineering pipelines?
14. CI/CD Benefits
Key Points of CI/CD
- Continuous integration (CI) is the practice of automating the integration of code changes from multiple contributors into a single software project. The CI process comprises automated tools that assert the code's correctness before and after integration (tests).
- Continuous delivery (CD) is an approach where teams release quality products frequently and predictably, from the source code repository to production, in an automated fashion.
Our needs
CI/CD lets us automate long and error-prone processes such as:
- Testing the code before every Pull Request merge (see the sketch below)
- Deploying the right code into the right environment (DEV, PRD)
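As a minimal sketch of the kind of pre-merge test we mean, assuming a hypothetical anonymization transformation (the function and field names are illustrative, not our actual pipeline code):

def anonymize(records):
    # Hypothetical transformation: drop direct identifiers before analytics.
    return [{k: v for k, v in r.items() if k != "email"} for r in records]

def test_anonymize_removes_email():
    records = [{"user_id": 1, "email": "a@example.com", "distance_km": 5.0}]
    assert all("email" not in r for r in anonymize(records))

A CI runner such as GitHub Actions would execute tests like this (e.g. via pytest) on every Pull Request and block the merge on failure.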
17. CHALLENGES
What is the cloud challenge?
DATA
- Tests require production-like data (static or dynamic)
- Production-like data is available in the cloud only
- Integration tests therefore have to run in the cloud
CLOUD DEPENDENCIES
ETL pipelines make use of different cloud services:
- Ingest data into the cloud from Azure Event Hubs
- Store it in Azure Data Lake
- Require authorization for accessing the data via Azure Active Directory rules
- Use secrets securely stored in the cloud in Azure Key Vault
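For example, inside a Databricks job a Key Vault secret is typically read through a Key Vault-backed secret scope; the scope and key names below are placeholders:

# Inside a Databricks notebook or job: dbutils is provided by the runtime.
# The scope and key names here are placeholders for illustration.
event_hub_key = dbutils.secrets.get(scope="our-keyvault-scope", key="eventhub-access-key")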
18. The problem we had
AIMING TO IMPLEMENT CI/CD USING:
Option 1: Databricks notebooks
Limitations
- It is difficult to divide the code into different sub-modules/projects
- Versioning is possible, but only one notebook at a time
- No tooling for automatic tests
- No good place for tests
Option 2: Databricks Connect
Limitations
- It does not support Streaming Jobs
- It is not possible to run arbitrary code that is not part of a Spark job on the remote cluster (see the sketch below)
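To make that last limitation concrete: with Databricks Connect, a local script offloads only Spark operations to the remote cluster, while everything else runs on the local machine. A minimal sketch, assuming databricks-connect is already configured:

from pyspark.sql import SparkSession

# With databricks-connect configured, this session is bound to a remote
# Databricks cluster; only the Spark job below executes remotely.
spark = SparkSession.builder.getOrCreate()
print(spark.range(10).count())  # runs on the cluster

# Plain Python like this print stays on the local machine.
print("arbitrary non-Spark code does not run on the cluster")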
20. CICD TEMPLATE
ML teams struggle to combine traditional CI/CD tools with Databricks notebooks
- Benefits of Databricks notebooks
  - Easy to use
  - Scalable
  - Provide access to ML tools such as MLflow for model logging and serving
- Challenges
  - Non-trivial to hook into traditional software development tools such as CI tools or local IDEs
- Result
  - Teams find themselves choosing between using traditional IDE-based workflows but struggling to test and deploy at scale, or using Databricks notebooks or other cloud notebooks but then struggling to ensure testing and deployment reliability via CI/CD pipelines.
21. CICD TEMPLATE
CI/CD Templates gives you the benefits of traditional CI/CD workflows and the scale of Databricks clusters
CI/CD Templates allows you to:
● create a production pipeline from a template in a few steps,
● hook it automatically into GitHub Actions,
● run tests and deployments on Databricks upon git commit or whatever trigger you define, and
● get a test success status directly in GitHub, so you know whether your commit broke the build.
22. A scalable CI/CD pipeline in 5 easy steps
1. Install and customize with a single command (see the sketch below)
2. Create a new GitHub repo containing your Databricks host and token secrets
3. Initialize git in your repo and commit the code
4. Push your new CI/CD Templates project to the repo. Your tests will start running automatically on Databricks. Upon your tests' success or failure you will get a green checkmark or a red X next to your commit status.
5. You're done! You now have a fully scalable CI/CD pipeline.
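For step 1, the project's README at the time described a cookiecutter-based flow roughly like the following; the exact commands may have changed since, so treat this as an assumption rather than the definitive procedure:

pip install cookiecutter
cookiecutter https://github.com/databrickslabs/cicd-templates.git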
23. CI/CD Templates executes tests and deployments directly on Databricks, while storing packages, model logs, and other artifacts in MLflow
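As a minimal sketch of what such artifact tracking can look like (the run name, parameter, and file path are placeholders, and the report file must exist before it is logged):

import mlflow

with mlflow.start_run(run_name="cicd-test-run"):
    mlflow.log_param("pipeline", "anonymization_pipeline")
    mlflow.log_artifact("test_results.xml")  # e.g. a JUnit-style test report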
27. How we are using the CICD template
There are in total 4 environments, each with its own Databricks token:
- DEV: playground for DS/DA/DE (DB token DEV)
- STG: stable code on release-candidate data (DB token STG)
- PRE: release-candidate code on PRD data (DB token PRE)
- PRD: stable code on PRD data (DB token PRD)
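A hypothetical helper illustrating the token-per-environment setup; the DB_TOKEN_<ENV> secret naming scheme is an assumption for this sketch, not our actual configuration:

import os

def databricks_token(env: str) -> str:
    # One token per environment, e.g. stored as CI secrets
    # named DB_TOKEN_DEV, DB_TOKEN_STG, DB_TOKEN_PRE, DB_TOKEN_PRD.
    assert env in {"DEV", "STG", "PRE", "PRD"}, f"unknown environment: {env}"
    return os.environ[f"DB_TOKEN_{env}"]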
31. Run a (test) pipeline
1. Move to the target environment
export DATABRICKS_ENV=DEV
export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
2. Run the pipeline
python3 run_pipeline.py pipelines --pipeline-name anonymization_pipeline
32. Deploy pipelines
1. Move to the target environment
export DATABRICKS_ENV=DEV
export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
2. Deploy the pipelines
from databrickslabs_cicdtemplates import release_cicd_pipeline
release_cicd_pipeline.main(
    'tests/integration',  # folder with the testing pipelines
    'pipelines',          # folder with the pipelines to deploy
    True,                 # run tests before deploying
    env=DATABRICKS_ENV)   # deployment environment
37. Conclusions
Key Takeaways
1. Code and data of ETL pipelines need to be tested, like everything else in Software Engineering. CI/CD is necessary for automating the testing and deployment processes and achieving high-quality software.
2. CI/CD is not easy to implement: Databricks notebooks and Databricks Connect are not enough for complex scenarios.
3. The CI/CD Templates project by Databricks Labs allows us to better organize our code into sub-modules and to implement CI/CD through its easy integration with GitHub Actions.
38. Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
Thank you!