Abnormal Security ran a successful two-week proof-of-concept for migrating their data platform from AWS EMR and notebooks to Databricks. They are now migrating jobs to Databricks in cost-ranked order over the first quarter, aiming to cut costs by 50% while improving usability, reducing operational overhead, and gaining the ability to scale. Consolidating the platform into a single Databricks environment will let them build their first data lakehouse and take on new use cases as the company grows rapidly.
6. Our Data Platform
• Data Pipelines
  ◦ Airflow
  ◦ PySpark + Conda
  ◦ AWS EMR
  ◦ S3 storage
  ◦ Thrift and Parquet
• Data Science
  ◦ Jupyter notebooks
• Ad-hoc Queries
  ◦ Athena
• Analytics
  ◦ No data warehouse
7. Data Pipelines: Cluster Infrastructure
• YARN on EMR
• Several long-lived clusters
• Jobs submitted via SSH from the CLI/Airflow (see the sketch after this list)
• Monitoring via Ganglia
• 3rd party deps via conda
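A minimal sketch of what this pre-migration setup could look like: an Airflow task SSHes into the EMR master node and runs spark-submit against the long-lived YARN cluster. The DAG ID, connection ID, conda environment, and job path are illustrative assumptions, not Abnormal's actual code.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="legacy_emr_pipeline",        # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit the PySpark job over SSH to the long-lived YARN cluster on EMR.
    run_job = SSHOperator(
        task_id="spark_submit_pipeline",
        ssh_conn_id="emr_master",        # hypothetical Airflow SSH connection
        command=(
            "source activate pipeline-env && "  # conda env with 3rd-party deps
            "spark-submit --master yarn --deploy-mode cluster "
            "/home/hadoop/jobs/daily_pipeline.py"
        ),
    )
```

Each piece of this, the shared SSH entry point, the hand-managed conda environment, the always-on cluster, maps onto one of the challenges listed on the next slides.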
8. YARN on EMR: The Challenges
• Multi-tenancy
• Auto-scaling
• Dependency Management
• Ongoing maintenance
9. Data Science: Notebook Infra
• Jupyter Notebooks
• Big EC2 machine(s), shared by many users at once
10. The Notebook: The Challenges
• Shared machine
• On 24/7
• Fragile dependencies
15. Pre-POC: Requirements
• Identified the top 3 use cases
• Must-have: EMR Replacement
• Nice-to-have: Notebook Replacement
• Nice-to-have: Data Warehouse implementation
16. EMR Replacement: Success Criteria
• Feature Parity
  ◦ Dependency management
  ◦ Troubleshooting ability
• Cluster Management
  ◦ Flow of bring up cluster, run job, tear down (see the sketch after this list)
  ◦ Must not take longer than 15 minutes
  ◦ Must be reliable
• Cost
  ◦ Must be within 10% of current cost
• Technical Support
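One way the 15-minute bring-up/run/tear-down criterion could be measured is with the Databricks Jobs API: a one-off runs/submit creates an ephemeral cluster, runs the job, and tears the cluster down, so timing the whole run times the whole flow. This is a sketch under assumed details; the node type, runtime version, script path, and env-var auth are illustrative.

```python
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def submit_run() -> int:
    """Submit a one-off run on a fresh cluster (bring up, run, tear down)."""
    resp = requests.post(
        f"{HOST}/api/2.0/jobs/runs/submit",
        headers=HEADERS,
        json={
            "run_name": "poc-timing-check",
            "new_cluster": {
                "spark_version": "9.1.x-scala2.12",   # illustrative DBR version
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "spark_python_task": {"python_file": "dbfs:/jobs/smoke_test.py"},
        },
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

def wait_for_run(run_id: int) -> None:
    """Poll until the run reaches a terminal state."""
    while True:
        state = requests.get(
            f"{HOST}/api/2.0/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": run_id},
        ).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return
        time.sleep(30)

start = time.time()
wait_for_run(submit_run())
elapsed_min = (time.time() - start) / 60
assert elapsed_min <= 15, f"Flow took {elapsed_min:.1f} min, over the 15-minute budget"
```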
17. POC Sprint
• Two weeks to set up experiments and gather all data
• Twice-weekly check-ins with Databricks contacts
• Daily standup
• Databricks <-> Abnormal Slack channel
18. POC Sprint: Week 1
• Under the hood, built tooling to:
  ◦ Start clusters
  ◦ Launch jobs from the CLI and Airflow
  ◦ Install dependencies (see the sketch after this list)
  ◦ Package and install code
• Navigated setup issues:
  ◦ Using Conda
  ◦ S3 permissions
  ◦ RDS connectivity
  ◦ Spark configurations
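For the dependency and packaging tooling, one plausible shape is the Databricks Libraries API, which attaches PyPI packages and a wheel of the repo to a running cluster. The cluster ID, package pin, and wheel path below are illustrative assumptions.

```python
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def install_dependencies(cluster_id: str) -> None:
    """Attach third-party packages and the repo's wheel to a running cluster."""
    resp = requests.post(
        f"{HOST}/api/2.0/libraries/install",
        headers=HEADERS,
        json={
            "cluster_id": cluster_id,
            "libraries": [
                {"pypi": {"package": "thriftpy2==0.4.14"}},  # illustrative dep
                # Repo built as a wheel and uploaded to DBFS (hypothetical path)
                {"whl": "dbfs:/wheels/abnormal_pipelines-1.0-py3-none-any.whl"},
            ],
        },
    )
    resp.raise_for_status()
```

Shipping the repo as a wheel instead of syncing a conda environment over SSH is what later shows up as a "tip" on the Tips and Gotchas slide.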
19. POC Sprint: Week 2
• Tested Airflow integration
• Ran a notoriously unstable pipeline
• Brought up a very large cluster (see the sketch after this list)
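The large-cluster test could be as simple as one Clusters API call with autoscaling bounds, in contrast to hand-tuning YARN capacity on EMR. The sizes, node type, and runtime version here are illustrative assumptions.

```python
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers=HEADERS,
    json={
        "cluster_name": "poc-large-cluster",
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "i3.4xlarge",
        # Let Databricks scale workers between a large floor and a larger
        # ceiling, rather than sizing a long-lived YARN cluster up front.
        "autoscale": {"min_workers": 50, "max_workers": 200},
    },
)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])
```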
20. Tips and Gotchas
• Tooling and Dependencies
  ◦ Databricks API client
  ◦ Databricks Runtime for ML
  ◦ Our repo as a wheel
• Airflow
  ◦ Embrace best practices (see the sketch after this list)
  ◦ Take into account migration time
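On the Airflow side, one best practice the slide likely points at is using the official Databricks provider operator rather than hand-rolled SSH submission. A minimal sketch, assuming the apache-airflow-providers-databricks package; the DAG ID, connection ID, wheel path, and job path are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

with DAG(
    dag_id="databricks_pipeline",        # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_job = DatabricksSubmitRunOperator(
        task_id="run_daily_pipeline",
        databricks_conn_id="databricks_default",
        # Ephemeral job cluster: brought up for the run, torn down after.
        new_cluster={
            "spark_version": "9.1.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
        },
        # The repo packaged as a wheel, per the tip above (hypothetical path).
        libraries=[
            {"whl": "dbfs:/wheels/abnormal_pipelines-1.0-py3-none-any.whl"}
        ],
        spark_python_task={"python_file": "dbfs:/jobs/daily_pipeline.py"},
    )
```

The operator polls the run for you and surfaces the Databricks run page URL in the task logs, which covers the "troubleshooting ability" criterion from the success-criteria slide.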
34. Killing 10 birds with 1 stone
• EMR and Notebook replacement just the start
• Having a single environment to rule them all
◦ A single source of data
◦ Storage decoupled from compute
◦ Data Engineers run data pipelines on workspace clusters
◦ ML engineers and Data Scientists use notebooks and SQL Analytics
• First Lakehouse project
◦ Automated metrics using dbt and SQL Analytics