Developing ML-enabled data
pipelines on Databricks using
IDE & CI/CD
1
AGENDA
Companies presentation
Challenges
CI/CD Templates
Runtastic Integration
Demo
EMANUELE VIGLIANISI
Data Engineer at Runtastic since January 2020
Previously Security Testing Researcher in FinTech
emanuele.viglianisi@runtastic.com
3
MICHAEL SHTELMA
Solutions Architect at Databricks since April 2019
Previously Technical Lead Data Foundation at Teradata
michael.shtelma@databricks.com
4
Runtastic
5
RUNTASTIC
BY THE NUMBERS
WE HAVE 4 FOUNDERS
WE COME FROM 40 COUNTRIES
167M REGISTERED USERS
WE ARE 10 YEARS OLD
WE HAVE 3 OFFICES
309M APP STORE DOWNLOADS
WE WERE PROFITABLE AFTER JUST 20 MONTHS
OUR PRODUCTS ARE AVAILABLE IN 14 LANGUAGES
WE ARE 270 EMPLOYEES
5.3M FOLLOWERS & FANS
4.71 APP STORE RATING
6
ADIDAS
TRAINING
● 180+ HD exercise videos with step-by-step instructions
● 25+ standalone workouts to work out anytime, anywhere
● Guided video workouts allow you to exercise along with our
fitness experts and your favorite athletes
● Special indoor workouts, suitable for home
● No additional equipment necessary
● Health and nutrition guide to complement your fitness
● Proven quality through development cooperation with Apple
and Google
● Top-rated app on the Apple App Store and Google Play
🔗 Download the App Now
7
🔗 Download the App Now
ADIDAS
RUNNING
● Our original flagship app
● Allows you to track your sports activities using GPS technology
● 90+ available sport types
● Share your sports activities and reach your goals
● Participate in challenges
● Compare yourself with your friends on the Leaderboard
● Listen to Story Runs while you are active
● and use many more features…
8
Databricks
9
▪ Global company with over 5,000 customers and 450+ partners
▪ Original creators of popular data and machine learning open source projects
A unified data analytics platform for accelerating innovation across
data engineering, data science, and analytics
10
Challenges
11
Our Goal
As-Is → New
Move the on-premises Analytics Backend to the cloud (Microsoft Azure and
Databricks) while ensuring high-quality software.
12
The CI/CD challenge
CI/CD is fundamental in the software development workflow for ensuring high-quality
code. Is there a way to integrate CI/CD with Databricks for our data engineering
pipelines?
Question
13
CI/CD Benefits
- Continuous integration (CI) is the practice of automating the integration of code changes from
multiple contributors into a single software project. The CI process consists of automated tools
that verify the code's correctness before and after integration (tests).
- Continuous delivery (CD) is an approach where teams release quality products frequently and
predictably, from source code repository to production, in an automated fashion.
Key Points of CI/CD
CI/CD lets us automate long and error-prone processes such as:
- Testing the code before every pull request merge
- Deploying the right code to the right environment (DEV, PRD)
Our needs
14
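The first of these needs maps naturally onto a CI trigger. As a generic, hedged sketch (not the workflow from this talk; the job and step names are illustrative), a GitHub Actions job that runs the unit tests on every pull request could look like:

```yaml
# Illustrative only: run the unit test suite on every pull request.
name: PR tests
on:
  pull_request:
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
        with:
          python-version: '3.8'
      - name: Run unit tests
        run: |
          pip install -r runtime_requirements.txt pytest
          pytest tests/unit
```

The release workflow shown later in this deck follows the same structure, with a `release` trigger instead of `pull_request`.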
Why we need CI/CD
15
Why we need CI/CD
16
CHALLENGES
What is the cloud challenge?

DATA
- Tests require production-like data (static or dynamic)
- Production-like data is available in the cloud only
- Integration tests must therefore run in the cloud

CLOUD DEPENDENCIES
ETL pipelines make use of different cloud services:
- Ingest data into the cloud from Azure Event Hub
- Store it in Azure Data Lake
- Require authorization for accessing the data using Azure Active Directory rules
- Use secrets securely stored in the cloud using Azure Key Vault
17
The problem we had
AIMING TO IMPLEMENT CI/CD USING

Option 1: Databricks notebooks
Limitations
- It is difficult to divide the code into different sub-modules/projects
- Versioning is possible, but only one notebook at a time
- No tooling for automated tests
- No good place for tests

Option 2: Databricks Connect
Limitations
- It does not support streaming jobs
- It is not possible to run arbitrary code that is not part of a Spark job
on the remote cluster
18
CICD Templates by Databricks Labs
19
CICD TEMPLATE
- Benefits of Databricks notebooks
- Easy to use
- Scalable
- Provides access to ML tools such as mlflow for model logging and serving
- Challenges
- Non-trivial to hook into traditional software development tools such as CI tools or
local IDEs.
- Result
- Teams find themselves choosing between
- using traditional IDE based workflows but struggling to test and deploy at scale
or
- using Databricks notebooks or other cloud notebooks but then struggling to
ensure testing and deployment reliability via CI/CD pipelines.
ML teams struggle to combine traditional CI/CD tools
with Databricks notebooks
20
CICD TEMPLATE
CI/CD Templates allows you to
● create a production pipeline from a template in a few steps,
● automatically hook it into GitHub Actions,
● run tests and deployments on Databricks upon git commit or whatever
trigger you define, and
● get a test success status directly in GitHub, so you know if your
commit broke the build.
CI/CD Templates gives you the benefits of traditional
CI/CD workflows combined with the scale of Databricks clusters.
21
A scalable CI/CD pipeline in 5 easy steps
1. Install and customize with a single command.
2. Create a new GitHub repo containing your Databricks host
and token secrets.
3. Initialize git in your repo and commit the code.
4. Push your new CI/CD Templates project to the repo. Your tests
will start running automatically on Databricks. Upon your tests'
success or failure you will get a green checkmark or red X next to
your commit status.
5. You're done! You now have a fully scalable CI/CD pipeline.
22
CI/CD Templates executes tests and deployments directly on Databricks while
storing packages, model logs, and other artifacts in MLflow.
23
Push Flow
24
Release Flow
25
Runtastic Integration
26
INTEGRATION
How we are using the CI/CD template
There are in total 4 environments, each with its own Databricks token:
DEV: playground for DS/DA/DE.
STG: stable code on release-candidate data.
PRE: release-candidate code on PRD data.
PRD: stable code on PRD data.
27
INTEGRATION
Project structure
/analyticsbackend: main Python module
/pipelines: pipeline configurations
/tests: tests folder, divided into /unit and /integration subfolders
runtime_requirements.txt: libraries installed on every cluster
28
INTEGRATION
Pipeline example
/analyticsbackend
    Anonymizer.py
/pipelines
    /anonymization_pipeline
/tests
    /unit
        /anonymization_udf_test.py
    /integration
        /anonymization_pipeline_test
29
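As a sketch of what a unit test like anonymization_udf_test.py might contain, consider the following. The anonymize_email helper and its SHA-256 scheme are illustrative assumptions, not Runtastic's actual implementation:

```python
# Hypothetical unit test in the spirit of tests/unit/anonymization_udf_test.py.
# The anonymize_email function is an assumed example, not the real pipeline code.
import hashlib


def anonymize_email(email):
    """Replace an e-mail address with a stable, non-reversible token."""
    return hashlib.sha256(email.lower().encode("utf-8")).hexdigest()


def test_anonymize_email_is_deterministic():
    # The same address (case-insensitively) must always map to the same token.
    assert anonymize_email("jane@example.com") == anonymize_email("JANE@example.com")


def test_anonymize_email_hides_the_original():
    # The token must not leak the original local part and has a fixed length.
    token = anonymize_email("jane@example.com")
    assert "jane" not in token
    assert len(token) == 64
```

Unit tests like these run locally or in CI without any cloud dependency; only the integration tests under /tests/integration need a Databricks cluster.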
INTEGRATION
Pipeline structure
/pipelines
    . . .
    /anonymization_pipeline
        /databricks-config_dev.json
        /databricks-config_prd.json
        /databricks-config_pre.json
        /databricks-config_stg.json
        /job_spec_azure_dev.json
        /job_spec_azure_prd.json
        /job_spec_azure_pre.json
        /job_spec_azure_stg.json
        /pipeline_runner.py

databricks-config_*.json: JSON containing input parameters for the pipeline (e.g. paths). One for each environment.
job_spec_azure_*.json: job configuration, one for each environment, containing cluster properties, pool id, etc.
pipeline_runner.py: pipeline entry point.
30
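A hedged sketch (not the actual Runtastic code) of how such an entry point could pick up its environment-specific parameters: it selects the databricks-config_<env>.json that matches the DATABRICKS_ENV variable and hands the parameters to the pipeline body. The config keys such as input_path are assumed for illustration:

```python
# Illustrative pipeline_runner.py sketch: load the config file that matches
# the DATABRICKS_ENV environment variable (DEV/STG/PRE/PRD).
import json
import os
from pathlib import Path


def load_pipeline_config(pipeline_dir, env=None):
    """Load the environment-specific input parameters for a pipeline."""
    env = (env or os.environ.get("DATABRICKS_ENV", "DEV")).lower()
    config_path = Path(pipeline_dir) / f"databricks-config_{env}.json"
    with config_path.open() as f:
        return json.load(f)


def run(config):
    # Placeholder for the real pipeline body, e.g. a Spark job reading from
    # config["input_path"] and writing to config["output_path"].
    print(f"Running pipeline on {config['input_path']}")


# The real entry point would do something like:
#   run(load_pipeline_config(Path(__file__).parent))
```

Keeping the environment selection in one helper means the same pipeline code runs unchanged across all four environments; only the JSON files differ.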
Run a (test) pipeline
1. Move to the target environment
export DATABRICKS_ENV=DEV
export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
2. Run the pipeline
python3 run_pipeline.py pipelines --pipeline-name anonymization_pipeline
31
INTEGRATION
Deploy pipelines
1. Move to the target environment
export DATABRICKS_ENV=DEV
export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
2. Deploy the pipelines
python3 -c "from databrickslabs_cicdtemplates import release_cicd_pipeline; release_cicd_pipeline.main('tests/integration', 'pipelines', True, env='${DATABRICKS_ENV}')"

Arguments: 'tests/integration' is the folder with the testing pipelines, 'pipelines' is the folder with the pipelines to deploy, True means run the tests before deploying, and env is the deployment environment.
32
INTEGRATION
GitHub integration
33
INTEGRATION
GitHub Actions
name: Release workflow
on:
  # Trigger the workflow once you create a new release
  release:
    types:
      - created
jobs:
  build:
    runs-on: ubuntu-latest
    [ . . . ]
    - name: Deploy artifact on PRD and STG environments
      run: |
        export DATABRICKS_TOKEN=${{ secrets.DATABRICKS_TOKEN_PRD }}
        export DATABRICKS_ENV=PRD
        python -c "from databrickslabs_cicdtemplates import release_cicd_pipeline; release_cicd_pipeline.main('tests/integration', 'pipelines', True, env='${DATABRICKS_ENV}')"
34
INTEGRATION
Our git flow
FEATURE BRANCH: some work done, then a PR with the "Test-it" label; once tested and approved, it is merged.
MASTER BRANCH: on push, test and deploy (PRE).
NEW RELEASE: on release, deploy (STG, PRD).
35
Demo
36
Conclusions
1. Code and data of ETL pipelines need to be tested, like everything else in software engineering.
CI/CD is necessary for automating the testing and deployment processes and achieving high-quality
software.
2. CI/CD is not easy to implement: Databricks notebooks and Databricks Connect are not
enough for complex scenarios.
3. The CI/CD template by Databricks Labs allows us to better organize our code into sub-modules and
implement CI/CD through its easy integration with GitHub Actions.
Key Takeaways
37
Feedback
Your feedback is important to us.
Don’t forget to rate
and review the sessions.
Thank you!
38

Más contenido relacionado

La actualidad más candente

DevOps overview 2019-04-13 Nelkinda April Meetup
DevOps overview  2019-04-13 Nelkinda April MeetupDevOps overview  2019-04-13 Nelkinda April Meetup
DevOps overview 2019-04-13 Nelkinda April MeetupShweta Sadawarte
 
Lecture 8 (software Metrics) Unit 3.pptx
Lecture 8 (software Metrics) Unit 3.pptxLecture 8 (software Metrics) Unit 3.pptx
Lecture 8 (software Metrics) Unit 3.pptxironman427662
 
DevSecOps and the CI/CD Pipeline
 DevSecOps and the CI/CD Pipeline DevSecOps and the CI/CD Pipeline
DevSecOps and the CI/CD PipelineJames Wickett
 
Global Software Development powered by Perforce
Global Software Development powered by PerforceGlobal Software Development powered by Perforce
Global Software Development powered by PerforcePerforce
 
Introduction to SOFTWARE ARCHITECTURE
Introduction to SOFTWARE ARCHITECTUREIntroduction to SOFTWARE ARCHITECTURE
Introduction to SOFTWARE ARCHITECTUREIvano Malavolta
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentDatabricks
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information RetrievalBhaskar Mitra
 
End-to-End CI/CD at scale with Infrastructure-as-Code on AWS
End-to-End CI/CD at scale with Infrastructure-as-Code on AWSEnd-to-End CI/CD at scale with Infrastructure-as-Code on AWS
End-to-End CI/CD at scale with Infrastructure-as-Code on AWSBhuvaneswari Subramani
 
Event driven autoscaling with keda
Event driven autoscaling with kedaEvent driven autoscaling with keda
Event driven autoscaling with kedaAdam Hamsik
 
Deploy 22 microservices from scratch in 30 mins with GitOps
Deploy 22 microservices from scratch in 30 mins with GitOpsDeploy 22 microservices from scratch in 30 mins with GitOps
Deploy 22 microservices from scratch in 30 mins with GitOpsOpsta
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformDatabricks
 
End to-End Monitoring for ITSM and DevOps
End to-End Monitoring for ITSM and DevOpsEnd to-End Monitoring for ITSM and DevOps
End to-End Monitoring for ITSM and DevOpseG Innovations
 
Software Project Management( lecture 1)
Software Project Management( lecture 1)Software Project Management( lecture 1)
Software Project Management( lecture 1)Syed Muhammad Hammad
 
Observability for modern applications
Observability for modern applications  Observability for modern applications
Observability for modern applications MoovingON
 
DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)Ravi Tadwalkar
 
GitOps with ArgoCD
GitOps with ArgoCDGitOps with ArgoCD
GitOps with ArgoCDCloudOps2005
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"Databricks
 
FinOps@SC CH-Meetup.pdf
FinOps@SC CH-Meetup.pdfFinOps@SC CH-Meetup.pdf
FinOps@SC CH-Meetup.pdfWuming Zhang
 
Métricas del Software
Métricas del SoftwareMétricas del Software
Métricas del SoftwareArabel Aguilar
 

La actualidad más candente (20)

DevOps overview 2019-04-13 Nelkinda April Meetup
DevOps overview  2019-04-13 Nelkinda April MeetupDevOps overview  2019-04-13 Nelkinda April Meetup
DevOps overview 2019-04-13 Nelkinda April Meetup
 
Lecture 8 (software Metrics) Unit 3.pptx
Lecture 8 (software Metrics) Unit 3.pptxLecture 8 (software Metrics) Unit 3.pptx
Lecture 8 (software Metrics) Unit 3.pptx
 
Past, Present and Future of DevOps Infrastructure
Past, Present and Future of DevOps InfrastructurePast, Present and Future of DevOps Infrastructure
Past, Present and Future of DevOps Infrastructure
 
DevSecOps and the CI/CD Pipeline
 DevSecOps and the CI/CD Pipeline DevSecOps and the CI/CD Pipeline
DevSecOps and the CI/CD Pipeline
 
Global Software Development powered by Perforce
Global Software Development powered by PerforceGlobal Software Development powered by Perforce
Global Software Development powered by Perforce
 
Introduction to SOFTWARE ARCHITECTURE
Introduction to SOFTWARE ARCHITECTUREIntroduction to SOFTWARE ARCHITECTURE
Introduction to SOFTWARE ARCHITECTURE
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Neural Models for Information Retrieval
Neural Models for Information RetrievalNeural Models for Information Retrieval
Neural Models for Information Retrieval
 
End-to-End CI/CD at scale with Infrastructure-as-Code on AWS
End-to-End CI/CD at scale with Infrastructure-as-Code on AWSEnd-to-End CI/CD at scale with Infrastructure-as-Code on AWS
End-to-End CI/CD at scale with Infrastructure-as-Code on AWS
 
Event driven autoscaling with keda
Event driven autoscaling with kedaEvent driven autoscaling with keda
Event driven autoscaling with keda
 
Deploy 22 microservices from scratch in 30 mins with GitOps
Deploy 22 microservices from scratch in 30 mins with GitOpsDeploy 22 microservices from scratch in 30 mins with GitOps
Deploy 22 microservices from scratch in 30 mins with GitOps
 
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML PlatformHow to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
How to Utilize MLflow and Kubernetes to Build an Enterprise ML Platform
 
End to-End Monitoring for ITSM and DevOps
End to-End Monitoring for ITSM and DevOpsEnd to-End Monitoring for ITSM and DevOps
End to-End Monitoring for ITSM and DevOps
 
Software Project Management( lecture 1)
Software Project Management( lecture 1)Software Project Management( lecture 1)
Software Project Management( lecture 1)
 
Observability for modern applications
Observability for modern applications  Observability for modern applications
Observability for modern applications
 
DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)DevOps Approach (Point of View by Ravi Tadwalkar)
DevOps Approach (Point of View by Ravi Tadwalkar)
 
GitOps with ArgoCD
GitOps with ArgoCDGitOps with ArgoCD
GitOps with ArgoCD
 
"Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow""Managing the Complete Machine Learning Lifecycle with MLflow"
"Managing the Complete Machine Learning Lifecycle with MLflow"
 
FinOps@SC CH-Meetup.pdf
FinOps@SC CH-Meetup.pdfFinOps@SC CH-Meetup.pdf
FinOps@SC CH-Meetup.pdf
 
Métricas del Software
Métricas del SoftwareMétricas del Software
Métricas del Software
 

Similar a Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runtastic

CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks
CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on DatabricksCI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks
CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on DatabricksDatabricks
 
Optimize your CI/CD with GitLab and AWS
Optimize your CI/CD with GitLab and AWSOptimize your CI/CD with GitLab and AWS
Optimize your CI/CD with GitLab and AWSDevOps.com
 
AzureDay Kyiv 2016 Release Management
AzureDay Kyiv 2016 Release ManagementAzureDay Kyiv 2016 Release Management
AzureDay Kyiv 2016 Release ManagementSergii Kryshtop
 
Network Source of Truth and Infrastructure as Code revisited
Network Source of Truth and Infrastructure as Code revisitedNetwork Source of Truth and Infrastructure as Code revisited
Network Source of Truth and Infrastructure as Code revisitedNetwork Automation Forum
 
Continuous Delivery: Fly the Friendly CI in Pivotal Cloud Foundry with Concourse
Continuous Delivery: Fly the Friendly CI in Pivotal Cloud Foundry with ConcourseContinuous Delivery: Fly the Friendly CI in Pivotal Cloud Foundry with Concourse
Continuous Delivery: Fly the Friendly CI in Pivotal Cloud Foundry with ConcourseVMware Tanzu
 
StampedeCon 2015 Keynote
StampedeCon 2015 KeynoteStampedeCon 2015 Keynote
StampedeCon 2015 KeynoteKen Owens
 
How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015
How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015
How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015StampedeCon
 
Enterprise-Grade DevOps Solutions for a Start Up Budget
Enterprise-Grade DevOps Solutions for a Start Up BudgetEnterprise-Grade DevOps Solutions for a Start Up Budget
Enterprise-Grade DevOps Solutions for a Start Up BudgetDevOps.com
 
“Data Versioning: Towards Reproducibility in Machine Learning,” a Presentatio...
“Data Versioning: Towards Reproducibility in Machine Learning,” a Presentatio...“Data Versioning: Towards Reproducibility in Machine Learning,” a Presentatio...
“Data Versioning: Towards Reproducibility in Machine Learning,” a Presentatio...Edge AI and Vision Alliance
 
Path To Continuous Test Automation Using CICD Pipeline.pdf
Path To Continuous Test Automation Using CICD Pipeline.pdfPath To Continuous Test Automation Using CICD Pipeline.pdf
Path To Continuous Test Automation Using CICD Pipeline.pdfpCloudy
 
Introduction to Adaptive and 3DEXPERIENCE Cloud
Introduction to Adaptive and 3DEXPERIENCE CloudIntroduction to Adaptive and 3DEXPERIENCE Cloud
Introduction to Adaptive and 3DEXPERIENCE CloudAdaptive Corporation
 
Continuous Integration and Delivery using TeamCity and Jenkins
Continuous Integration and Delivery using TeamCity and JenkinsContinuous Integration and Delivery using TeamCity and Jenkins
Continuous Integration and Delivery using TeamCity and JenkinsMahmoud Ali
 
CICD Pipeline - AWS Azure
CICD Pipeline - AWS AzureCICD Pipeline - AWS Azure
CICD Pipeline - AWS AzureRatan Das
 
Continuous Deployment for Staging and Production Environments
Continuous Deployment for Staging and Production EnvironmentsContinuous Deployment for Staging and Production Environments
Continuous Deployment for Staging and Production EnvironmentsOlyaSurits
 
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment ModelUsing Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment ModelDocker, Inc.
 
Tampere Docker meetup - Happy 5th Birthday Docker
Tampere Docker meetup - Happy 5th Birthday DockerTampere Docker meetup - Happy 5th Birthday Docker
Tampere Docker meetup - Happy 5th Birthday DockerSakari Hoisko
 
DEVNET-1149 Leveraging Rapid Development with PaaS on Cisco Cloud
DEVNET-1149	Leveraging Rapid Development with PaaS on Cisco CloudDEVNET-1149	Leveraging Rapid Development with PaaS on Cisco Cloud
DEVNET-1149 Leveraging Rapid Development with PaaS on Cisco CloudCisco DevNet
 
PureApplication: Devops and Urbancode
PureApplication: Devops and UrbancodePureApplication: Devops and Urbancode
PureApplication: Devops and UrbancodeJohn Hawkins
 
The Need for Speed
The Need for SpeedThe Need for Speed
The Need for SpeedCapgemini
 

Similar a Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runtastic (20)

CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks
CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on DatabricksCI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks
CI/CD Templates: Continuous Delivery of ML-Enabled Data Pipelines on Databricks
 
Optimize your CI/CD with GitLab and AWS
Optimize your CI/CD with GitLab and AWSOptimize your CI/CD with GitLab and AWS
Optimize your CI/CD with GitLab and AWS
 
AzureDay Kyiv 2016 Release Management
AzureDay Kyiv 2016 Release ManagementAzureDay Kyiv 2016 Release Management
AzureDay Kyiv 2016 Release Management
 
Network Source of Truth and Infrastructure as Code revisited
Network Source of Truth and Infrastructure as Code revisitedNetwork Source of Truth and Infrastructure as Code revisited
Network Source of Truth and Infrastructure as Code revisited
 
Continuous Delivery: Fly the Friendly CI in Pivotal Cloud Foundry with Concourse
Continuous Delivery: Fly the Friendly CI in Pivotal Cloud Foundry with ConcourseContinuous Delivery: Fly the Friendly CI in Pivotal Cloud Foundry with Concourse
Continuous Delivery: Fly the Friendly CI in Pivotal Cloud Foundry with Concourse
 
StampedeCon 2015 Keynote
StampedeCon 2015 KeynoteStampedeCon 2015 Keynote
StampedeCon 2015 Keynote
 
How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015
How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015
How Cisco Migrated from MapReduce Jobs to Spark Jobs - StampedeCon 2015
 
Enterprise-Grade DevOps Solutions for a Start Up Budget
Enterprise-Grade DevOps Solutions for a Start Up BudgetEnterprise-Grade DevOps Solutions for a Start Up Budget
Enterprise-Grade DevOps Solutions for a Start Up Budget
 
“Data Versioning: Towards Reproducibility in Machine Learning,” a Presentatio...
“Data Versioning: Towards Reproducibility in Machine Learning,” a Presentatio...“Data Versioning: Towards Reproducibility in Machine Learning,” a Presentatio...
“Data Versioning: Towards Reproducibility in Machine Learning,” a Presentatio...
 
Path To Continuous Test Automation Using CICD Pipeline.pdf
Path To Continuous Test Automation Using CICD Pipeline.pdfPath To Continuous Test Automation Using CICD Pipeline.pdf
Path To Continuous Test Automation Using CICD Pipeline.pdf
 
DevOps: Age Of CI/CD
DevOps: Age Of CI/CDDevOps: Age Of CI/CD
DevOps: Age Of CI/CD
 
Introduction to Adaptive and 3DEXPERIENCE Cloud
Introduction to Adaptive and 3DEXPERIENCE CloudIntroduction to Adaptive and 3DEXPERIENCE Cloud
Introduction to Adaptive and 3DEXPERIENCE Cloud
 
Continuous Integration and Delivery using TeamCity and Jenkins
Continuous Integration and Delivery using TeamCity and JenkinsContinuous Integration and Delivery using TeamCity and Jenkins
Continuous Integration and Delivery using TeamCity and Jenkins
 
CICD Pipeline - AWS Azure
CICD Pipeline - AWS AzureCICD Pipeline - AWS Azure
CICD Pipeline - AWS Azure
 
Continuous Deployment for Staging and Production Environments
Continuous Deployment for Staging and Production EnvironmentsContinuous Deployment for Staging and Production Environments
Continuous Deployment for Staging and Production Environments
 
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment ModelUsing Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
Using Docker Hub at Scale to Support Micro Focus' Delivery and Deployment Model
 
Tampere Docker meetup - Happy 5th Birthday Docker
Tampere Docker meetup - Happy 5th Birthday DockerTampere Docker meetup - Happy 5th Birthday Docker
Tampere Docker meetup - Happy 5th Birthday Docker
 
DEVNET-1149 Leveraging Rapid Development with PaaS on Cisco Cloud
DEVNET-1149	Leveraging Rapid Development with PaaS on Cisco CloudDEVNET-1149	Leveraging Rapid Development with PaaS on Cisco Cloud
DEVNET-1149 Leveraging Rapid Development with PaaS on Cisco Cloud
 
PureApplication: Devops and Urbancode
PureApplication: Devops and UrbancodePureApplication: Devops and Urbancode
PureApplication: Devops and Urbancode
 
The Need for Speed
The Need for SpeedThe Need for Speed
The Need for Speed
 

Más de Databricks

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDatabricks
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Databricks
 
Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Data Lakehouse Symposium | Day 1 | Part 2
Data Lakehouse Symposium | Day 1 | Part 2Databricks
 
Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Data Lakehouse Symposium | Day 2
Data Lakehouse Symposium | Day 2Databricks
 
Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Data Lakehouse Symposium | Day 4
Data Lakehouse Symposium | Day 4Databricks
 
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop5 Critical Steps to Clean Your Data Swamp When Migrating Off of Hadoop
5 Critical Steps to Clean Your Data Swamp When Migrating Off of HadoopDatabricks
 
Democratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDemocratizing Data Quality Through a Centralized Platform
Democratizing Data Quality Through a Centralized PlatformDatabricks
 
Learn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceLearn to Use Databricks for Data Science
Learn to Use Databricks for Data ScienceDatabricks
 
Why APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringWhy APM Is Not the Same As ML Monitoring
Why APM Is Not the Same As ML MonitoringDatabricks
 
The Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixThe Function, the Context, and the Data—Enabling ML Ops at Stitch Fix
The Function, the Context, and the Data—Enabling ML Ops at Stitch FixDatabricks
 
Stage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationStage Level Scheduling Improving Big Data and AI Integration
Stage Level Scheduling Improving Big Data and AI IntegrationDatabricks
 
Simplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchSimplify Data Conversion from Spark to TensorFlow and PyTorch
Simplify Data Conversion from Spark to TensorFlow and PyTorchDatabricks
 
Scaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesScaling your Data Pipelines with Apache Spark on Kubernetes
Scaling your Data Pipelines with Apache Spark on KubernetesDatabricks
 
Scaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesScaling and Unifying SciKit Learn and Apache Spark Pipelines
Scaling and Unifying SciKit Learn and Apache Spark PipelinesDatabricks
 
Sawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsSawtooth Windows for Feature Aggregations
Sawtooth Windows for Feature AggregationsDatabricks
 
Redis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkRedis + Apache Spark = Swiss Army Knife Meets Kitchen Sink
Redis + Apache Spark = Swiss Army Knife Meets Kitchen SinkDatabricks
 
Re-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkRe-imagine Data Monitoring with whylogs and Spark
Re-imagine Data Monitoring with whylogs and SparkDatabricks
 
Raven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesRaven: End-to-end Optimization of ML Prediction Queries
Raven: End-to-end Optimization of ML Prediction QueriesDatabricks
 
Processing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkProcessing Large Datasets for ADAS Applications using Apache Spark
Processing Large Datasets for ADAS Applications using Apache SparkDatabricks
 
Massive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeMassive Data Processing in Adobe Using Delta Lake
Massive Data Processing in Adobe Using Delta LakeDatabricks
 

Más de Databricks (20)

DW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptxDW Migration Webinar-March 2022.pptx
DW Migration Webinar-March 2022.pptx
 
Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1Data Lakehouse Symposium | Day 1 | Part 1
Data Lakehouse Symposium | Day 1 | Part 1
 
Data Lakehouse Symposium | Day 1 | Part 2

Developing ML-enabled Data Pipelines on Databricks using IDE & CI/CD at Runtastic

  • 12. Our Goal: Move the on-premise Analytics Backend to the cloud (Microsoft Azure and Databricks) while ensuring high-quality software. [Diagram: as-is vs. new architecture]
  • 13. The CI/CD challenge: CI/CD is fundamental in a software development workflow for ensuring high-quality code. Question: Is there a way to integrate CI/CD into Databricks for our data engineering pipelines?
  • 14. CI/CD Benefits — Key points of CI/CD:
    - Continuous integration (CI) is the practice of automating the integration of code changes from multiple contributors into a single software project. The CI process consists of automated tools that assert the code's correctness before and after integration (tests).
    - Continuous delivery (CD) is an approach where teams release quality products frequently and predictably from the source code repository to production in an automated fashion.
  Our needs — CI/CD lets us automate long and error-prone deployment processes, such as:
    - testing the code before every pull request is merged
    - deploying the right code into the right environment (DEV, PRD)
  • 15. Why we need CI/CD [diagram]
  • 16. Why we need CI/CD [diagram]
  • 17. CHALLENGES — What is the cloud challenge?
  DATA:
    - Tests require production-like data (static or dynamic)
    - Production-like data is available in the cloud only
    - Integration tests must therefore be performed in the cloud
  CLOUD DEPENDENCIES — ETL pipelines make use of different cloud services:
    - Ingest data into the cloud from Azure Event Hubs
    - Store it in Azure Data Lake
    - Require authorization for accessing the data via Azure Active Directory rules
    - Use secrets securely stored in the cloud in Azure Key Vault
  • 18. INTEGRATION — The problem we had: aiming to implement CI/CD using
  Option 1: Databricks notebooks. Limitations:
    - It is difficult to divide the code into different sub-modules/projects
    - Versioning is possible, but only one notebook at a time
    - No tooling for automatic tests
    - No good place for tests
  Option 2: Databricks Connect. Limitations:
    - It does not support streaming jobs
    - It is not possible to run arbitrary code that is not part of a Spark job on the remote cluster
  • 19. CICD Templates by Databricks Labs
  • 20. CICD TEMPLATE — ML teams struggle to combine traditional CI/CD tools with Databricks notebooks.
  Benefits of Databricks notebooks:
    - Easy to use
    - Scalable
    - Provide access to ML tools such as MLflow for model logging and serving
  Challenges:
    - Non-trivial to hook into traditional software development tools such as CI tools or local IDEs
  Result — teams find themselves choosing between:
    - using traditional IDE-based workflows but struggling to test and deploy at scale, or
    - using Databricks notebooks or other cloud notebooks but then struggling to ensure testing and deployment reliability via CI/CD pipelines.
  • 21. CICD TEMPLATE — CI/CD Templates gives you the benefits of traditional CI/CD workflows and the scale of Databricks clusters. CI/CD Templates allows you to:
    - create a production pipeline from a template in a few steps,
    - that automatically hooks into GitHub Actions,
    - runs tests and deployments on Databricks upon git commit or whatever trigger you define, and
    - gives you a test success status directly in GitHub, so you know if your commit broke the build.
  • 22. A scalable CI/CD pipeline in 5 easy steps:
    1. Install and customize with a single command.
    2. Create a new GitHub repo containing your Databricks host and token secrets.
    3. Initialize git in your repo and commit the code.
    4. Push your new CI/CD Templates project to the repo. Your tests will start running automatically on Databricks. Upon your tests' success or failure you will get a green checkmark or red X next to your commit status.
    5. You're done! You now have a fully scalable CI/CD pipeline.
  • 23. CI/CD Templates executes tests and deployments directly on Databricks while storing packages, logged models, and other artifacts in MLflow.
  • 27. INTEGRATION — How we are using the CICD template. There are in total 4 environments, each with its own Databricks token:
    - DEV: playground for DS/DA/DE
    - PRE: release-candidate code on PRD data
    - STG: stable code on release-candidate data
    - PRD: stable code on PRD data
  • 28. INTEGRATION — Project structure:
    /analyticsbackend — main Python module
    /pipelines — pipeline configurations
    /tests/unit, /tests/integration — tests folder, divided into unit and integration tests
    runtime_requirements.txt — libraries installed on every cluster
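  A unit test under /tests/unit can exercise pure transformation logic from the main module without a cluster. The sketch below is illustrative only; the function anonymize_email is an invented stand-in for logic that would live in the analyticsbackend module, not something shown in the talk.

```python
# Illustrative unit-testable transformation logic (names are hypothetical).
import hashlib


def anonymize_email(email: str) -> str:
    """Replace an e-mail address with a stable SHA-256 pseudonym."""
    return hashlib.sha256(email.strip().lower().encode("utf-8")).hexdigest()


def test_anonymize_email_is_deterministic():
    # Case and surrounding whitespace must not change the pseudonym.
    assert anonymize_email("User@runtastic.com") == anonymize_email(" user@runtastic.com ")


def test_anonymize_email_hides_original():
    # The hex digest alphabet (0-9, a-f) cannot contain the local part.
    assert "user" not in anonymize_email("user@runtastic.com")


if __name__ == "__main__":
    test_anonymize_email_is_deterministic()
    test_anonymize_email_hides_original()
```

  Tests like these run locally or in CI without any cloud dependency, which is exactly what the unit/integration split in the tests folder enables.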
  • 30. INTEGRATION — Pipeline structure (one folder per pipeline, e.g. /pipelines/anonymization_pipeline):
    - databricks-config_dev.json, databricks-config_prd.json, databricks-config_pre.json, databricks-config_stg.json — JSON containing the input parameters for the pipeline (e.g. paths); one for each environment
    - job_spec_azure_dev.json, job_spec_azure_prd.json, job_spec_azure_pre.json, job_spec_azure_stg.json — job configuration, one for each environment, containing cluster properties, pool id, etc.
    - pipeline_runner.py — pipeline entry point
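  As a sketch, a per-environment parameter file such as databricks-config_dev.json might look like the following; the keys and paths are invented for illustration, since the actual schema is defined by each project:

```json
{
  "input_path": "abfss://raw@devdatalake.dfs.core.windows.net/events",
  "output_path": "abfss://curated@devdatalake.dfs.core.windows.net/anonymized",
  "checkpoint_path": "abfss://curated@devdatalake.dfs.core.windows.net/_checkpoints/anonymization"
}
```

  Keeping one such file per environment lets the same pipeline code run against DEV, PRE, STG, or PRD data by swapping only the configuration.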
  • 31. Run a (test) pipeline:
    1. Move to the target environment:
       export DATABRICKS_ENV=DEV
       export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
    2. Run the pipeline:
       python3 run_pipeline.py pipelines --pipeline-name anonymization_pipeline
  • 32. INTEGRATION — Deploy pipelines:
    1. Move to the target environment:
       export DATABRICKS_ENV=DEV
       export DATABRICKS_TOKEN=<DB-DEV-TOKEN>
    2. Deploy the pipelines:
       python -c "from databrickslabs_cicdtemplates import release_cicd_pipeline; release_cicd_pipeline.main('tests/integration', 'pipelines', True, env='${DATABRICKS_ENV}')"
  The arguments are: the folder with the testing pipelines, the folder with the pipelines to deploy, whether to run the tests before deploying, and the deployment environment.
  • 34. INTEGRATION — GitHub Actions:

    name: Release workflow
    on:
      # Trigger the workflow once you create a new release
      release:
        types:
          - created
    jobs:
      build:
        runs-on: ubuntu-latest
        [ . . . ]
          - name: Deploy artifact on PRD and STG environments
            run: |
              export DATABRICKS_TOKEN=${{ secrets.DATABRICKS_TOKEN_PRD }}
              export DATABRICKS_ENV=PRD
              python -c "from databrickslabs_cicdtemplates import release_cicd_pipeline; release_cicd_pipeline.main('tests/integration', 'pipelines', True, env='${DATABRICKS_ENV}')"
  • 35. INTEGRATION — Our git flow [diagram]:
    - Feature branch: some work done; the test-it label triggers the tests
    - Once the PR is tested and approved, it is merged into the master branch
    - Master branch: on push, test + deploy (PRE)
    - New release: on release, deploy (PRD, STG)
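  For the on-push leg of this flow, a companion workflow can target the pre-production environment. A hedged sketch modeled on the release workflow shown earlier; the workflow name, branch, and secret name (DATABRICKS_TOKEN_PRE) are assumptions, not taken from the talk:

```yaml
name: Push workflow
on:
  push:
    branches:
      - master
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      # checkout and Python setup steps elided
      - name: Test and deploy on PRE environment
        run: |
          export DATABRICKS_TOKEN=${{ secrets.DATABRICKS_TOKEN_PRE }}
          export DATABRICKS_ENV=PRE
          python -c "from databrickslabs_cicdtemplates import release_cicd_pipeline; release_cicd_pipeline.main('tests/integration', 'pipelines', True, env='${DATABRICKS_ENV}')"
```

  Splitting triggers this way keeps every merge to master continuously tested against PRE, while production deployments remain gated behind an explicit release.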
  • 37. Conclusions — Key takeaways:
    1. The code and data of ETL pipelines need to be tested like everything else in software engineering. CI/CD is necessary for automating the testing and deployment processes and achieving high-quality software.
    2. CI/CD is not easy to implement: Databricks notebooks and Databricks Connect are not enough for complex scenarios.
    3. The CI/CD template by Databricks Labs allows us to better organize our code into sub-modules and implement CI/CD through its easy integration with GitHub Actions.
  • 38. Feedback — Your feedback is important to us. Don't forget to rate and review the sessions. Thank you!