Delivering Insights from 20M+ Smart Homes with 500M+ Devices

Sameer Vaidya and Raghav Karnam
Data Engineering and Data Science

Universal Peace Mantra
May all beings everywhere be
happy and free and may the
thoughts, words and actions of
my own life contribute in some
way to that happiness and to that
freedom for all

Present Day:
Plume business and product imperatives and
expectations from data teams

Insights from worldwide smart home locations,
device types and behaviors over time
Public examples at: https://plume.com, https://discover.plume.com/wfh-dashboard

Agenda:
Our journey to developer & operations productivity and scale:
▪ Job Clusters
▪ Template Notebooks
▪ Avro/Parquet -> Delta
▪ SQL Analytics
▪ ML Lifecycle
Sameer: Data Engineering, Analytics & BI
Raghav: Data Science & ML Engineering
https://www.plume.com/careers

Challenges with our first generation Spark
processing clusters and Data Warehouse

Poor Dev/Ops productivity, visibility, fragility
due to lack of automation and developer IDE, control over resources and complexity
▪ DevOps-owned AWS IaaS became a bottleneck
▪ Lack of automation created poor utilization in prod and dev
▪ Poor developer productivity: Notebooks integration was complicated and largely unused
• AWS EMR Spark Clusters
• AWS Glue Crawlers
  - metadata management is critical to see all data
  - scheduling is tricky
  - easy to make a mess: creates lots of cruft tables when misconfigured or when extraneous files are in the path
• AWS Athena (serverless)
  - Data scientists couldn’t answer complex questions because long-running queries time out
  - Supporting a Web app: limited queue slots cannot handle unpredictable Web app loads

#1 Developer and operational productivity:
Deploying worldwide E2 workspaces and
empowering developers with Notebooks and self
service clusters

Operate across N regions X [dev + prod] workspaces
Standardization and Automation: users:groups:clusters:buckets:subnets:jobs:databases:tables
• Standardize Namespaces
• Map SAML IDP SSO
• Plan RBAC model
• https://status.databricks.com/

Developer productivity 30-50% up with Notebooks
• Use GitHub Repos
• Interactive dev/debug by uploading jars
• Interactive SQL/Python
• Easily convert to scheduled Jobs
• Combine with IDEs
• Databricks Connect
• Simba JDBC
• Schedule via Airflow

DevOps-less self-service: Developers with Clusters
Databricks clusters reduce operational tickets and enhance productivity
• Databricks Job Clusters
• Standard / High Concurrency
• $$$: needs high utilization
• Lesson: optimized for multiple queries, but individual queries run slower
• Use EC2 Reserved Instances for Driver nodes and Spot instances for all Workers, for long or short running jobs
• Use Service Principals for team ownership of logs / jobs
• Plan dedicated subnet space for expansion
• Use 1 hr idle termination
• Best Practices
• Developers decide cluster size for their jobs; cluster policies put sanity bounds
• Achieve High Availability
• Retry Airflow Job submission for 30 mins
• Plan AWS per-AZ Instance type Availability
• Plan for Databricks API outages during upgrades
• Use Idempotency tokens to avoid multiple runs during API outages
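
The retry and idempotency points above can be sketched as follows. This is a minimal illustration, not the speakers' implementation: it assumes the Databricks Jobs API accepts an `idempotency_token` field on run submission, and `submit_with_retry` plus the `submit` callable (a wrapper around whatever HTTP client or Airflow hook is in use) are hypothetical names.

```python
import time
import uuid
from typing import Callable, Dict


def submit_with_retry(
    submit: Callable[[Dict], int],
    payload: Dict,
    max_wait_s: float = 30 * 60,   # "retry for 30 mins" from the slide
    pause_s: float = 30.0,         # assumed pause between attempts
    sleep: Callable[[float], None] = time.sleep,
    clock: Callable[[], float] = time.monotonic,
) -> int:
    """Retry a job submission until it succeeds or max_wait_s elapses."""
    # One token is generated up front and reused for every attempt, so a
    # request that succeeded server-side but failed client-side does not
    # start a second run when it is retried.
    request = dict(payload, idempotency_token=str(uuid.uuid4()))
    deadline = clock() + max_wait_s
    while True:
        try:
            return submit(request)  # hypothetical: returns the run id
        except ConnectionError:
            # Give up once another pause would pass the deadline.
            if clock() + pause_s > deadline:
                raise
            sleep(pause_s)
```

The key design choice is that the token is created outside the loop; generating a fresh token per attempt would defeat the purpose.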

Segment Usage/Billing by Teams, Projects, Owners
Use cost-center:region:team:env:project:owner AWS Tags in cluster creation APIs

              Jan   ...   Dec
Cost Center   $     $$    $$$
Region        $     $$    $$$
Environment   $     $$    $$$
Team
Owner

With great authority comes great responsibility:
- Usage plan makes owners accountable
- Usage data is available to you
- Customize using Notebooks
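
As a rough sketch of the tagging convention above, the helper below builds the tag portion of a cluster creation request and rejects requests with missing tags. It assumes the `custom_tags` field of the Databricks Clusters API; the function name, the tag values and the validation rule are illustrative assumptions, and a real request also needs fields such as the Spark version and node type.

```python
from typing import Dict

# Tag keys taken from the slide: cost-center:region:team:env:project:owner
TAG_KEYS = ("cost-center", "region", "team", "env", "project", "owner")


def tagged_cluster_spec(tags: Dict[str, str], num_workers: int) -> Dict:
    """Return a partial cluster spec carrying all required billing tags."""
    missing = [key for key in TAG_KEYS if not tags.get(key)]
    if missing:
        # Refuse to create clusters whose cost cannot be attributed.
        raise ValueError(f"missing required tags: {missing}")
    return {
        "num_workers": num_workers,
        "custom_tags": {key: tags[key] for key in TAG_KEYS},
    }


# Hypothetical example values.
spec = tagged_cluster_spec(
    {
        "cost-center": "cc-100",
        "region": "us-west-2",
        "team": "data-eng",
        "env": "dev",
        "project": "insights",
        "owner": "sameer",
    },
    num_workers=4,
)
```

Enforcing the tags at request-build time keeps untagged clusters out of the usage reports that the per-team breakdown depends on.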

#2 Query performance, scale and automated
metadata management:
Migrating from legacy Avro/Parquet to Delta lake

Migrate Glue metadata to Databricks Metastore
Move to Delta ASAP
- Poor performance for poorly partitioned Avro/Parquet
- No Glue Crawlers
Interim support for Legacy Avro/Parquet data:
- Generate DDL from templates
- Jobs to MSCK REPAIR TABLE + scripts to scan S3 and ADD PARTITION
Convert to Delta:
- Migrate Jobs to read/write Delta
- AutoLoader
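
The interim approach for legacy data (DDL from templates, plus scripts that scan S3 and add partitions) might look roughly like the sketch below. The template text, table names and object keys are hypothetical; the talk does not show the actual scripts. Listing the S3 keys is left out, so the functions take keys as plain strings.

```python
from string import Template
from typing import Dict, Iterable, List

# Hypothetical DDL template for a legacy Parquet table.
DDL_TEMPLATE = Template(
    "CREATE TABLE IF NOT EXISTS $table ($columns) "
    "USING PARQUET PARTITIONED BY ($partition_cols) LOCATION '$location'"
)


def partition_of(key: str) -> Dict[str, str]:
    """Extract Hive-style name=value directory segments from an object key."""
    parts: Dict[str, str] = {}
    for segment in key.split("/")[:-1]:  # skip the file name itself
        if "=" in segment:
            name, value = segment.split("=", 1)
            parts[name] = value
    return parts


def add_partition_statements(table: str, keys: Iterable[str]) -> List[str]:
    """Build one ALTER TABLE ... ADD PARTITION per distinct partition."""
    seen = set()
    statements = []
    for key in keys:
        spec = tuple(partition_of(key).items())
        if not spec or spec in seen:
            continue  # unpartitioned file or partition already handled
        seen.add(spec)
        clause = ", ".join(f"{name}='{value}'" for name, value in spec)
        statements.append(
            f"ALTER TABLE {table} ADD IF NOT EXISTS PARTITION ({clause})"
        )
    return statements


ddl = DDL_TEMPLATE.substitute(
    table="legacy.events",
    columns="device_id STRING, dt STRING",
    partition_cols="dt",
    location="s3://example-bucket/events",
)
stmts = add_partition_statements(
    "legacy.events",
    [
        "events/dt=2021-05-01/a.parquet",
        "events/dt=2021-05-01/b.parquet",
        "events/dt=2021-05-02/a.parquet",
        "events/_SUCCESS",
    ],
)
```

Where the directory layout is fully Hive-style, a scheduled `MSCK REPAIR TABLE legacy.events` can replace the explicit statements.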

Parquet -> Delta in-place conversion is optimal on resources
but requires complex coordination
1. Catalog all paths, databases, tables
2. Prepare DDL USING PARQUET & DELTA
3. Convert pipelines to read/write Delta instead of Parquet
4. Coordinate with external consumers
5. Pause and upgrade all pipelines
6. Migrate parquet to delta
7. Resume pipelines
8. Schedule Glue MSCK
9. Recovery Plan
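
For step 6, an in-place conversion is typically issued with Delta Lake's `CONVERT TO DELTA` statement. The sketch below only generates those statements from a catalog of paths (step 1); the catalog contents and function names are hypothetical, and running the statements (for example via `spark.sql`) while pipelines are paused is left to the surrounding job.

```python
from typing import Dict, List, Optional


def convert_to_delta_sql(
    path: str, partition_schema: Optional[Dict[str, str]] = None
) -> str:
    """Build a CONVERT TO DELTA statement for one Parquet directory."""
    sql = f"CONVERT TO DELTA parquet.`{path}`"
    if partition_schema:
        # Partitioned tables must declare partition columns and their types.
        cols = ", ".join(
            f"{name} {dtype}" for name, dtype in partition_schema.items()
        )
        sql += f" PARTITIONED BY ({cols})"
    return sql


def conversion_plan(catalog: Dict[str, Optional[Dict[str, str]]]) -> List[str]:
    """One statement per cataloged path, in a stable (sorted) order."""
    return [convert_to_delta_sql(path, catalog[path]) for path in sorted(catalog)]


# Hypothetical catalog: path -> partition columns (None if unpartitioned).
plan = conversion_plan(
    {
        "s3://example-bucket/events": {"dt": "STRING"},
        "s3://example-bucket/devices": None,
    }
)
```

Generating the full plan up front makes it easier to review with external consumers and to keep alongside the recovery plan.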

#3 Scalable SQL Analytics over large data sets:
Migrating Data Scientists, Analysts and BI
dashboards to consume Databricks SQL Analytics
Endpoints

SQLA Endpoints optimized for BI/Analytics workloads
• Start with a single “general-purpose” endpoint
• 1 hour idle termination
• Rich SQL IDEs supported, e.g. DBeaver
• Can serve Web APIs!
Create dedicated SQL endpoints / clusters for each use case; size clusters per use case / workload

#4. Summary:
Scaling development and operations for BI and
Analytics for worldwide deployments requires:
- Workspace management
- Clusters + Notebooks
- Metadata management
- Migrate BI/adhoc to SQL Endpoints


Present Day:
Plume ML Focus areas and expectations from
Machine learning teams

Challenges with our first generation ML Lifecycle
and MLOps.
Our evolution to increase the productivity of our Data
Scientists.

[Diagram: ML lifecycle stages, owners and gates]
Stages and owners:
- Curate Data (DE)
- Build Model (Data Scientist)
- Deploy ML Model (ML Engineer)
- Integrate Model (SWE)
- Operate Model
Gates between stages:
- Model performance metrics / thresholds
- A/B testing
- Pass/Fail
- Monitor for data drift
#5. ML Lifecycle in Databricks

Models Across Databricks Workspaces

