Why is building a big data platform hard? What are the key aspects involved in providing a "Serverless" experience for data folks? And how does Databricks solve these infrastructure problems to provide that "Serverless" experience?
2. Housekeeping
• Your connection will be muted
• Submit questions via the Q&A panel
• Questions will be answered at the end of the webinar
• Any outstanding questions will be answered in the Databricks Forum
(https://forums.databricks.com)
• Webinar will be recorded and attachments will be made available via
www.databricks.com
3. About Prakash
Prakash Chockalingam
● Product Manager at Databricks
● Works closely with customers
● Deep experience building large-scale
distributed systems and machine learning
infrastructure at Netflix and Yahoo
4. Agenda
• About Databricks
• Challenges involved in building a big data platform
• How Databricks simplifies DevOps with the many 'automatic'
features in Databricks clusters
• Demo of the key features that put Databricks clusters in
autopilot mode
5. About Databricks
VISION: Accelerate innovation by unifying data science, engineering and business.
WHO WE ARE:
• Founded by the creators of Apache Spark
• Contributes 75% of the open source code, 10x more than any other company
• Trained 40k+ Spark users on the Databricks platform
PRODUCT: Unified Analytics Platform powered by Apache Spark
6. Building a big data platform is hard
Availability • Reliability • Throughput • Scalability • Security • Lower cloud cost • Simplicity
7. Cluster Reliability
Commodity hardware: good performance/$
• But a lot of things can go wrong: bad disks, flaky
instances, network errors, etc.
Heterogeneous workloads running user code
• Requires proper isolation and fault tolerance.
Uneven distribution
• Data skew, bursty requests, etc
8. Data Throughput through the Cluster
• Read/write large numbers of records efficiently
• Low latency under high-throughput traffic
• The right tradeoff between reliability and
throughput is required.
[Diagram: storage tiers from memory to instance disk to cloud storage, trading higher throughput / lower reliability against available size]
9. Scalability
Scalability across different dimensions:
• Number of clusters: the platform must be able to handle 100s of cluster requests.
• Number of nodes in a cluster: the platform must be able to handle 100s of nodes in a single cluster.
• Size of data: handle large volumes of data.
10. Cluster Availability
Must be easy to upgrade even at large scale
• Zero-downtime upgrades for clusters so that no
production workloads are affected
• Easy mechanisms to roll back
Instrument monitoring & alerting
• Easily detect & alert on failures
• Track performance & utilization
Fast recovery from failures
11. Simplicity
Easy to use
• The interface must be intuitive and easy for
developers to use
Debuggability
• Developers must be able to easily access
metrics and logs to troubleshoot their code.
12. Lowering cloud cost
Optimum resource utilization
• Fine-grained resource sharing
Elasticity
• Autoscaling resources
Leverage cloud features to optimize costs
• Spot vs on-demand
13. Security
• Firewalls & network ACLs to keep attackers from reaching
data flowing through clusters
• Encryption of data at rest
• Temporary storage
• Permanent storage
• Encryption of data during transit
15. Databricks Clusters
• Resilient to transient cloud failures
• Optimized for high throughput
• High availability
• Scalable to handle thousands of nodes
• Bulletproof security
• Optimized for lowering your cloud costs
17. Cluster in Autopilot mode
Just focus on your data, not the underlying
infrastructure. (#serverless)
18. Cluster in Autopilot mode
• Automatic scaling of compute
• Automatic scaling of instance storage
• Automatic recovery
• Automatic software updates
• Automatic caching
• Automatic start and termination
• Automatic configuration
• Automatic monitoring instrumentation
• Automatic resilience to spot price fluctuations
19. Autoscale compute
• Don't worry about how many machines your workload
requires.
• Compute autoscaling is based on Spark-native task tracking.
• Guarantees maximum utilization (see the sketch below).
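A minimal sketch of creating such a cluster through the Databricks Clusters REST API (api/2.0/clusters/create with an "autoscale" range). The workspace URL, token, instance type, and runtime version are placeholders, not values from the webinar:

import requests

HOST = "https://<your-workspace>.cloud.databricks.com"  # placeholder workspace URL
TOKEN = "<personal-access-token>"                       # placeholder API token
HEADERS = {"Authorization": "Bearer " + TOKEN}

# Cluster that autoscales between 2 and 20 workers; Databricks adds or
# removes workers based on Spark-native task tracking.
cluster_spec = {
    "cluster_name": "autoscaling-demo",
    "spark_version": "<runtime-version>",  # see /api/2.0/clusters/spark-versions
    "node_type_id": "r3.xlarge",           # example AWS instance type
    "autoscale": {"min_workers": 2, "max_workers": 20},
}

resp = requests.post(HOST + "/api/2.0/clusters/create",
                     headers=HEADERS, json=cluster_spec)
resp.raise_for_status()
print("cluster_id:", resp.json()["cluster_id"])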
20. Autoscale local storage
• Spark requires a lot of intermediate disk space.
• Estimating the right disk space up front is very painful.
• Databricks automatically scales local storage based on
Spark's disk space requirements for your job (sketch below).
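Under the same assumptions as the previous sketch, autoscaling local storage is a single flag on the cluster spec (the "enable_elastic_disk" field in the Clusters API):

# With elastic disk enabled, Databricks attaches additional local volumes
# as Spark's shuffle/spill space fills up, instead of the job failing
# with "no space left on device".
cluster_spec["enable_elastic_disk"] = True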
21. Automatic Recovery
• Automatic recovery of cluster nodes
• If cluster nodes fail, they get automatically replaced with new
ones.
• Automatic recovery of cluster failures
• If the whole cluster becomes unresponsive for some reason, then
the cluster will be automatically recovered.
22. Automatic Software Updates
• The latest updates are automatically pushed to each
cluster's sidecar services every 2 weeks.
• Zero downtime for clusters during pushes.
• New Databricks runtime versions are rolled out automatically;
customers choose one when they create clusters (sketch below).
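A sketch of picking a runtime version at cluster-creation time, reusing the placeholder HOST, HEADERS, and cluster_spec from the earlier sketches; the spark-versions endpoint is part of the Clusters API:

# List the runtime versions currently offered in the workspace; newly
# rolled-out Databricks runtimes appear here automatically.
resp = requests.get(HOST + "/api/2.0/clusters/spark-versions", headers=HEADERS)
for version in resp.json()["versions"]:
    print(version["key"], "-", version["name"])

# Pin the cluster to one of the listed versions at creation time.
cluster_spec["spark_version"] = "<key-from-the-list-above>"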
23. Automatic Caching
• Automatically moves Parquet data from cloud storage to
the instances' local storage.
• Blazing-fast throughput for repeatedly read data.
• Completely transparent to the user (sketch below).
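A sketch of what this looks like from a Databricks notebook, assuming the IO cache's documented Spark conf key ("spark.databricks.io.cache.enabled") and a placeholder S3 path; "spark" is the SparkSession a notebook provides:

# Enable the IO cache for this session (it is on by default on
# SSD-backed instance types).
spark.conf.set("spark.databricks.io.cache.enabled", "true")

df = spark.read.parquet("s3a://<bucket>/<path>/")  # placeholder path
df.count()  # first scan reads from cloud storage and populates the cache
df.count()  # repeated scans are served from the instances' local disks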
24. Automatic Termination
• Automatically terminates clusters that are idle.
• Idle time is calculated from fine-grained Spark task
tracking, so it is more accurate (sketch below).
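In the Clusters API this corresponds to the "autotermination_minutes" field; continuing the placeholder spec from the earlier sketches:

# Terminate the cluster after 60 minutes with no Spark activity; idleness
# is judged from Spark task tracking rather than, say, shell activity.
cluster_spec["autotermination_minutes"] = 60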
25. Automatic Start
• Automatically starts clusters when you run commands.
• Auto-start and auto-terminate together eliminate the
need to worry about the underlying infrastructure (sketch below).
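One place auto-start shows up is a scheduled job pinned to an existing cluster: if the cluster was auto-terminated between runs, it is restarted when the run triggers. A sketch against the Jobs API, with placeholder IDs and paths and the same HOST/HEADERS as above:

job_spec = {
    "name": "nightly-etl",                                 # placeholder job name
    "existing_cluster_id": "<cluster-id>",                 # placeholder cluster
    "notebook_task": {"notebook_path": "/Users/me/etl"},   # placeholder notebook
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # run at 2 AM daily
        "timezone_id": "UTC",
    },
}
# If the target cluster is terminated when the schedule fires, Databricks
# starts it, runs the notebook, and auto-termination later stops it again.
requests.post(HOST + "/api/2.0/jobs/create", headers=HEADERS, json=job_spec)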
27. Automatic resilience to spot price hikes
• Leverages spot instances as much as possible and falls back
to on-demand for reliability.
• Combined with autoscaling, this can tremendously
reduce your cloud cost (sketch below).
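On AWS this policy maps to the cluster's "aws_attributes" in the Clusters API; a sketch, again extending the placeholder spec:

# Keep the driver on an on-demand instance for reliability, run the rest
# on spot, and fall back to on-demand capacity when spot instances are
# reclaimed or outbid.
cluster_spec["aws_attributes"] = {
    "first_on_demand": 1,                   # the first node (driver) is on-demand
    "availability": "SPOT_WITH_FALLBACK",
    "spot_bid_price_percent": 100,          # bid up to 100% of the on-demand price
}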
28. Auto Configuration
• Databricks Serverless auto-configures all of these
features out of the box.
• You specify only the minimum parameters you care
about.
30. Try Apache Spark in Databricks
Sign up for a free 14-day trial of Databricks
https://databricks.com/try-databricks
Additional Questions?
Contact us at http://go.databricks.com/contact-databricks