Abnormal Security ran a successful two-week proof-of-concept for migrating their data platform from AWS EMR and notebooks to Databricks. They are now migrating jobs to Databricks in cost-ranked order over the first quarter, aiming to cut costs by 50% while improving usability, reducing operational overhead, and gaining the ability to scale. Consolidating the platform into a single Databricks environment will let them build their first data lakehouse and take on new use cases as the company grows rapidly.
6. Our Data Platform
• Data Pipelines
  ◦ Airflow
  ◦ PySpark + Conda
  ◦ AWS EMR
  ◦ S3 storage
  ◦ Thrift and Parquet
• Data Science
  ◦ Jupyter notebooks
• Ad-hoc Queries
  ◦ Athena
• Analytics
  ◦ No data warehouse
7. Data Pipelines: Cluster Infrastructure
• YARN on EMR
• Several long-lived clusters
• Jobs submitted via SSH from the CLI/Airflow (see the sketch after this list)
• Monitoring via Ganglia
• 3rd party deps via conda
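A minimal sketch of what this pre-migration setup could look like: an Airflow task SSHes into the EMR master node and runs spark-submit against the long-lived YARN cluster. The DAG ID, connection ID, conda environment, and job path are illustrative assumptions, not Abnormal's actual code.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.ssh.operators.ssh import SSHOperator

with DAG(
    dag_id="legacy_emr_pipeline",        # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    # Submit the PySpark job over SSH to the long-lived YARN cluster on EMR.
    run_job = SSHOperator(
        task_id="spark_submit_pipeline",
        ssh_conn_id="emr_master",        # hypothetical Airflow SSH connection
        command=(
            "source activate pipeline-env && "  # conda env with 3rd-party deps
            "spark-submit --master yarn --deploy-mode cluster "
            "/home/hadoop/jobs/daily_pipeline.py"
        ),
    )
```

Each piece of this, the shared SSH entry point, the hand-managed conda environment, the always-on cluster, maps onto one of the challenges listed on the next slides.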
8. YARN on EMR: The Challenges
• Multi-tenancy
• Auto-scaling
• Dependency Management
• Ongoing maintenance
9. Data Science: Notebook Infra
• Jupyter Notebooks
• Big EC2 machine(s), shared by many users at once
10. The Notebook: The Challenges
• Shared machine
• On 24/7
• Fragile dependencies
15. Pre-POC: Requirements
• Identified the top 3 use cases
• Must-have: EMR Replacement
• Nice-to-have: Notebook Replacement
• Nice-to-have: Data Warehouse implementation
16. EMR Replacement: Success Criteria
• Feature Parity
  ◦ Dependency management
  ◦ Troubleshooting ability
• Cluster Management
  ◦ Flow of bring up cluster, run job, tear down (see the sketch after this list)
  ◦ Must not take longer than 15 minutes
  ◦ Must be reliable
• Cost
  ◦ Must be within 10% of current cost
• Technical Support
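One way the 15-minute bring-up/run/tear-down criterion could be measured is with the Databricks Jobs API: a one-off runs/submit creates an ephemeral cluster, runs the job, and tears the cluster down, so timing the whole run times the whole flow. This is a sketch under assumed details; the node type, runtime version, script path, and env-var auth are illustrative.

```python
import os
import time

import requests

HOST = os.environ["DATABRICKS_HOST"]   # e.g. https://<workspace>.cloud.databricks.com
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def submit_run() -> int:
    """Submit a one-off run on a fresh cluster (bring up, run, tear down)."""
    resp = requests.post(
        f"{HOST}/api/2.0/jobs/runs/submit",
        headers=HEADERS,
        json={
            "run_name": "poc-timing-check",
            "new_cluster": {
                "spark_version": "9.1.x-scala2.12",   # illustrative DBR version
                "node_type_id": "i3.xlarge",
                "num_workers": 2,
            },
            "spark_python_task": {"python_file": "dbfs:/jobs/smoke_test.py"},
        },
    )
    resp.raise_for_status()
    return resp.json()["run_id"]

def wait_for_run(run_id: int) -> None:
    """Poll until the run reaches a terminal state."""
    while True:
        state = requests.get(
            f"{HOST}/api/2.0/jobs/runs/get",
            headers=HEADERS,
            params={"run_id": run_id},
        ).json()["state"]
        if state["life_cycle_state"] in ("TERMINATED", "SKIPPED", "INTERNAL_ERROR"):
            return
        time.sleep(30)

start = time.time()
wait_for_run(submit_run())
elapsed_min = (time.time() - start) / 60
assert elapsed_min <= 15, f"Flow took {elapsed_min:.1f} min, over the 15-minute budget"
```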
17. POC Sprint
• Two weeks to set up experiments and gather all data
• Twice-weekly check-ins with Databricks contacts
• Daily standup
• Databricks <-> Abnormal Slack channel
18. POC Sprint: Week 1
• Under the hood, built tooling to:
  ◦ Start clusters
  ◦ Launch jobs from the CLI and Airflow
  ◦ Install dependencies (see the sketch after this list)
  ◦ Package and install code
• Navigated setup issues:
  ◦ Using Conda
  ◦ S3 permissions
  ◦ RDS connectivity
  ◦ Spark configurations
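For the dependency and packaging tooling, one plausible shape is the Databricks Libraries API, which attaches PyPI packages and a wheel of the repo to a running cluster. The cluster ID, package pin, and wheel path below are illustrative assumptions.

```python
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

def install_dependencies(cluster_id: str) -> None:
    """Attach third-party packages and the repo's wheel to a running cluster."""
    resp = requests.post(
        f"{HOST}/api/2.0/libraries/install",
        headers=HEADERS,
        json={
            "cluster_id": cluster_id,
            "libraries": [
                {"pypi": {"package": "thriftpy2==0.4.14"}},  # illustrative dep
                # Repo built as a wheel and uploaded to DBFS (hypothetical path)
                {"whl": "dbfs:/wheels/abnormal_pipelines-1.0-py3-none-any.whl"},
            ],
        },
    )
    resp.raise_for_status()
```

Shipping the repo as a wheel instead of syncing a conda environment over SSH is what later shows up as a "tip" on the Tips and Gotchas slide.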
19. POC Sprint: Week 2
• Tested Airflow integration
• Ran a notoriously unstable pipeline
• Brought up a very large cluster (see the sketch after this list)
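The large-cluster test could be as simple as one Clusters API call with autoscaling bounds, in contrast to hand-tuning YARN capacity on EMR. The sizes, node type, and runtime version here are illustrative assumptions.

```python
import os

import requests

HOST = os.environ["DATABRICKS_HOST"]
HEADERS = {"Authorization": f"Bearer {os.environ['DATABRICKS_TOKEN']}"}

resp = requests.post(
    f"{HOST}/api/2.0/clusters/create",
    headers=HEADERS,
    json={
        "cluster_name": "poc-large-cluster",
        "spark_version": "9.1.x-scala2.12",
        "node_type_id": "i3.4xlarge",
        # Let Databricks scale workers between a large floor and a larger
        # ceiling, rather than sizing a long-lived YARN cluster up front.
        "autoscale": {"min_workers": 50, "max_workers": 200},
    },
)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])
```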
20. Tips and Gotchas
• Tooling and Dependencies
  ◦ Databricks API client
  ◦ Databricks Runtime for ML
  ◦ Our repo as a wheel
• Airflow
  ◦ Embrace best practices (see the sketch after this list)
  ◦ Take into account migration time
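On the Airflow side, one best practice the slide likely points at is using the official Databricks provider operator rather than hand-rolled SSH submission. A minimal sketch, assuming the apache-airflow-providers-databricks package; the DAG ID, connection ID, wheel path, and job path are illustrative.

```python
from datetime import datetime

from airflow import DAG
from airflow.providers.databricks.operators.databricks import (
    DatabricksSubmitRunOperator,
)

with DAG(
    dag_id="databricks_pipeline",        # hypothetical DAG name
    start_date=datetime(2021, 1, 1),
    schedule_interval="@daily",
    catchup=False,
) as dag:
    run_job = DatabricksSubmitRunOperator(
        task_id="run_daily_pipeline",
        databricks_conn_id="databricks_default",
        # Ephemeral job cluster: brought up for the run, torn down after.
        new_cluster={
            "spark_version": "9.1.x-scala2.12",
            "node_type_id": "i3.xlarge",
            "num_workers": 4,
        },
        # The repo packaged as a wheel, per the tip above (hypothetical path).
        libraries=[
            {"whl": "dbfs:/wheels/abnormal_pipelines-1.0-py3-none-any.whl"}
        ],
        spark_python_task={"python_file": "dbfs:/jobs/daily_pipeline.py"},
    )
```

The operator polls the run for you and surfaces the Databricks run page URL in the task logs, which covers the "troubleshooting ability" criterion from the success-criteria slide.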
34. Killing 10 birds with 1 stone
• EMR and Notebook replacement just the start
• Having a single environment to rule them all
◦ A single source of data
◦ Storage decoupled from compute
◦ Data Engineers run data pipelines on workspace clusters
◦ ML engineers and Data Scientists use notebooks and SQL Analytics
• First Lakehouse project
◦ Automated metrics using dbt and SQL Analytics