Slide 1
The New Storage Applications:
Lots of Data, New Hardware and
Machine Intelligence
Nisha Talagala
Parallel Machines
INFLOW 2016
Slide 2
Storage Evolution & Application Evolution Combined
[Diagram: storage media evolving from Disk & Tape through Flash and DRAM to Persistent Memory; deployment scale evolving from Local through Clustered to Geographically Distributed; data abstractions evolving from Block Data Management through File and Object to Key-Value; applications evolving from Classic Enterprise Transactions, Business Intelligence, Search, etc. to Advanced Analytics (Machine Learning, Cognitive Functions)]
Slide 3
In this talk
• What are the new data apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage optimized for these apps today?
• Opportunities for the storage stack?
Slide 4
Growing Sources of Data
[Diagram of data sources: Teaching Assistants, Elderly Companions, Service Robots, Personal Social Robots, Smart Cities, Robot Drones, Smart Homes, Intelligent Vehicles, Personal Assistants (bots), Smart Enterprise]
Edited version of slide from Balint Fleischer’s talk: Flash Memory Summit 2016, Santa Clara, CA
Slide 5
The Application Evolution

| | Classic Enterprise Transactions, Business Intelligence | Advanced Analytics |
|---|---|---|
| "Killer" use cases | OLTP, ERP, Email, eCommerce, Messaging, Social Networks, Content Delivery | Discovery of solutions and capabilities; Risk assessment; Improving customer experience; Comprehending sensory data |
| Key functions | RDBMS, BI, Fraud detection, Databases, Social Graphs | SQL and ML Analytics; Streaming; Natural Language Understanding; Object Recognition; Probabilistic Reasoning; Content Analytics |
| Data types | Structured, Transactional | Structured, Unstructured, Transactional, Streaming, Mixed, Graphs, Matrices |
| Storage types | Enterprise scale, standards driven, SAN/NAS, etc. | Cloud scale, open source, File/Object, ??? |

Edited version of slide from Balint Fleischer’s talk: Flash Memory Summit 2016, Santa Clara, CA
Slide 6
A Sample Analytics Stack
[Diagram, top to bottom:]
• Language Bindings, APIs (Python, Scala, Java, etc.)
• Libraries: Machine Learning, Deep Learning, SQL, Graph, CEP, etc.
• Optimizers/Schedulers
• Processing Engine (frequently in memory)
• Data from Repositories (SQL, NoSQL) or Live Streams (Data Lake)
Slide 7
Machine Learning Software Ecosystem – a Partial View
[Diagram, top to bottom:]
• Layered API Providers: Beam (Data Flow), StreamSQL, Keras
• Algorithms and Libraries: Mahout, Samsara, MLlib, FlinkML, Caffe, TensorFlow
• Stream Processing Engines: Flink / Apex, Spark Streaming, Storm / Samza / NiFi
• Batch Processing Engines: Hadoop / Spark, Flink, TensorFlow
• Domain-focused back-end engines: Caffe, Theano, TensorFlow
• Data from Repositories (SQL, NoSQL) or Live Streams (Data Lake)
Slide 8
In this talk
• What are the new apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage optimized for these apps today?
• Opportunities?
Slide 9
How ML/DL Workloads think about Data – Part 1
• Data Sizes
• Incoming datasets can range from MB to TB
• Models are typically small; the largest tend to be deep neural networks, ranging from tens of MB to single-digit GB
• Common Data Types
• Time series and Streams
• Multi-dimensional Arrays, Matrices and Vectors
• DataFrames
• Common distributed patterns
• Data Parallel, periodic synchronization
• Model Parallel
• Network sensitivity varies between algorithms, and straggler performance issues can be significant
• e.g., a 2x performance difference between InfiniBand and 40 Gbit Ethernet for some algorithms like K-Means and SVM
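The data-parallel pattern with periodic synchronization can be sketched in a few lines. This is a hypothetical, serially simulated illustration (not from the talk): each "worker" runs SGD on its own shard of a toy linear-regression problem, and the replicas average their weights every few steps.

```python
# Sketch: data-parallel training with periodic model synchronization.
# Each worker fits a 1-D linear model y = w * x on its own data shard.

def local_sgd_step(w, batch, lr=0.01):
    """One SGD step on squared error for y = w * x."""
    grad = sum(2 * (w * x - y) * x for x, y in batch) / len(batch)
    return w - lr * grad

def data_parallel_train(shards, steps=100, sync_every=10):
    weights = [0.0] * len(shards)  # one model replica per worker
    for step in range(1, steps + 1):
        # each worker takes a local step on its own shard
        weights = [local_sgd_step(w, shard) for w, shard in zip(weights, shards)]
        if step % sync_every == 0:  # periodic synchronization: average replicas
            avg = sum(weights) / len(weights)
            weights = [avg] * len(weights)
    return sum(weights) / len(weights)

# True relationship is y = 3x, split across two workers.
shard_a = [(x, 3.0 * x) for x in (1.0, 2.0, 3.0)]
shard_b = [(x, 3.0 * x) for x in (4.0, 5.0, 6.0)]
w = data_parallel_train([shard_a, shard_b])  # converges close to 3.0
```

A real system would run the workers concurrently and ship gradients or weights over the network, which is where the InfiniBand-vs-Ethernet sensitivity above comes from.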
Slide 10
The Growth of Streaming Data
• Continuous data flows and continuous processing
• Enabled & driven by sensor data, real time information feeds
• Enables a native time component, “event time”
• Allows complex computations that can combine new and old data in
deterministic ways
• Several variants with varied functionality
• True Streams, Micro-Batch (an incremental batch emulation)
• Possible with existing models like SQL, supported natively by models
like Google DataFlow / Apache Beam
• The performance of in-memory streaming enables a convergence
between stream analytics (aggregation) and Complex Event Processing
(CEP)
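The value of a native event-time component can be illustrated with a small sketch (hypothetical, not any specific engine's API): grouping records into tumbling windows by their embedded event timestamps yields the same result regardless of the order in which records arrive.

```python
# Sketch: tumbling event-time windows. Grouping by the embedded event
# timestamp makes the aggregation deterministic under reordering.
from collections import defaultdict

def tumbling_window_sums(records, window_size):
    """records: iterable of (event_time, value) in *arrival* order."""
    sums = defaultdict(float)
    for event_time, value in records:
        window_start = (event_time // window_size) * window_size
        sums[window_start] += value
    return dict(sums)

in_order     = [(0, 1.0), (3, 2.0), (5, 4.0), (9, 8.0)]
out_of_order = [(9, 8.0), (0, 1.0), (5, 4.0), (3, 2.0)]  # late arrivals

a = tumbling_window_sums(in_order, window_size=5)
b = tumbling_window_sums(out_of_order, window_size=5)
# a == b == {0: 3.0, 5: 12.0}: same result either way
```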
Slide 11
Convergence of RDBMS and Analytics
• In-Memory DBs are moving to continuous queries
• Ex: StreamSQL interfaces, Pipeline DB (based on PostgreSQL)
• Stream and batch analytic engines support SQL interfaces
• Ex: SQL support on Spark, Flink
• SQL parsers with pluggable back ends – Apache Calcite
• Good for basic analytics but need extensions to support machine
learning and deep learning
• Joins, sorts, etc. good for feature engineering, data cleansing
• Many core machine & deep learning operations require linear algebra ops
If the idea of a standard database is "durable data, ephemeral queries",
the idea of a streaming database is "durable queries, ephemeral data"
http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html
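The "durable queries, ephemeral data" idea can be sketched as follows (a hypothetical illustration, not PipelineDB's actual interface): a registered continuous query holds only its incremental aggregate state, while the raw rows are discarded once consumed.

```python
# Sketch: a continuous query equivalent to SELECT avg(value) FROM stream.
# Only the aggregate state is durable; rows are not retained.
class ContinuousAvg:
    def __init__(self):
        self.count = 0
        self.total = 0.0

    def on_row(self, value):
        """Fold one incoming row into the running aggregate."""
        self.count += 1
        self.total += value  # the row itself is then discarded

    def result(self):
        return self.total / self.count if self.count else None

q = ContinuousAvg()            # the query is durable...
for v in [10.0, 20.0, 30.0]:   # ...the data is ephemeral
    q.on_row(v)
# q.result() -> 20.0
```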
Slide 12
The Growing Role of the Edge
• Closest to data ingest, lowest latency.
• Benefits to real time processing
• Highly varied connectivity to data centers
• Varied hardware architectures and
resource constraints
• Differs from geographically distributed
data center architecture
• Asymmetry of hardware
• Unpredictable connectivity
• Unpredictable device uptime
[Figure: IoT Reference Model]
Slide 13
How ML/DL Workloads think about Data – Part 2
• The older data gets, the more its “role” changes
• Older data is used for batch/historical analytics and model reboots
• Used for model training (sort of), not for inference
• Guarantees can be “flexible” on older data
• Availability can be reduced (most algorithms can deal with some data loss)
• A few data corruptions don’t really hurt
• Data is evaluated in aggregate and algorithms are tolerant of outliers
• Holes are a fact of real life data – algorithms deal with it
• Quality of service exists but is different
• Random access is very rare
• Heavily patterned access (most operations are some form of array/matrix)
• Shuffle phase in some analytic engines
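The tolerance to holes and occasional corruption described above can be sketched as an aggregate that simply skips unusable records (a hypothetical illustration of the pattern, not any specific library's API):

```python
# Sketch: aggregate evaluation that tolerates holes (None) and a few
# corrupt records (NaN), as ML pipelines typically do.
import math

def tolerant_mean(values):
    """Mean over a dataset that may contain holes or corrupt entries."""
    clean = [v for v in values if v is not None and not math.isnan(v)]
    if not clean:
        return None
    return sum(clean) / len(clean)

data = [1.0, 2.0, None, 4.0, float("nan"), 3.0]  # holes and corruption
m = tolerant_mean(data)  # 2.5, computed over the 4 usable records
```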
Slide 14
Correctness, Determinism, Accuracy and Speed
• More complex evaluation metrics than
traditional transactional workloads
• Correctness is hard to measure
• Even two implementations of the “same
algorithm” can generate different results
• Determinism/Repeatability is not always
present for streaming data
• Ex: Micro-batch processing can produce different results depending on arrival time vs. event time
• The accuracy-to-time tradeoff is non-linear
• Exploratory models can generate massive parallelism for the same data set used repeatedly (hyper-parameter search)
[Charts: error vs. time curves for two implementations of the same SVM algorithm, SVM V1 and SVM V2]
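The arrival-time vs. event-time determinism point above can be made concrete with a small sketch (hypothetical): when records are grouped into micro-batches purely by arrival order, a single late record shifts the per-batch results.

```python
# Sketch: micro-batches formed by arrival order. A late record lands in
# a different batch, so per-batch aggregates differ between orderings.
def microbatch_sums(records, batch_size=2):
    """Sum each fixed-size micro-batch, batched purely by arrival order."""
    return [sum(records[i:i + batch_size])
            for i in range(0, len(records), batch_size)]

on_time = [1, 2, 3, 4]
late    = [1, 3, 2, 4]   # the record "2" arrived late

microbatch_sums(on_time)  # [3, 7]
microbatch_sums(late)     # [4, 6] -- different per-batch results
```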
Slide 15
The Role of Persistence
• For ML functions, most computations today are in-memory
• Data flows from data lake to analytic engine and results flow back
• Persistent checkpoints can generate large write traffic for very long running
computations (streams, large neural network training, etc.)
• Persistent message storage to enforce exactly-once semantics and determinism; this write traffic is latency-sensitive
• For in-memory databases, persistence is part of the core engine
• Log based persistence is common
• Loading & cleaning of data is still a very large fraction of the pipeline time
• Most of this involves manipulating stored data
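Checkpointing state for a long-running computation is commonly done with a write-then-rename pattern, so a crash mid-write cannot corrupt the last good checkpoint. A minimal sketch (hypothetical helper names, not any engine's API); each checkpoint persists a full serialized copy of the state, which is where the large write traffic comes from:

```python
# Sketch: atomic checkpointing via write-to-temp-then-rename.
import json
import os
import tempfile

def write_checkpoint(state, path):
    """Atomically persist `state` (a JSON-serializable dict) to `path`."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())  # force the bytes to stable storage
    os.replace(tmp, path)     # atomic rename on POSIX

def read_checkpoint(path):
    with open(path) as f:
        return json.load(f)

# Usage: round-trip a checkpoint in a temporary directory.
ckpt_path = os.path.join(tempfile.mkdtemp(), "checkpoint.json")
write_checkpoint({"step": 42, "weights": [0.1, 0.2]}, ckpt_path)
restored = read_checkpoint(ckpt_path)
```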
Slide 16
In this talk
• What are the new apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage/memory optimized for these apps today?
• Opportunities?
Slide 17
Abstractions and the Stack
• ML/DL applications use common abstractions that combine linear algebra, tables, streams, etc.
• These are stored as independent entities inside Key-Value pairs, Objects, or Files
• The file system is used as a common namespace
• Information is lost at each level down, along with opportunities to optimize layout, tiering, caching, etc.
[Diagram: a stack of Matrices, Tables, Streams, etc. over Key-Value and Object over File over Block; data copies (or transfers, denoted by red lines) occur frequently between layers, sometimes more than once]
Slide 18
Optimizing Storage: Some Examples
• Time-series-optimized databases
• Examples: BTrDB (FAST 2016) and Gorilla (Facebook, VLDB 2015)
• Streamlined data types, specialized indexing, tiering optimized for access patterns
• API pushdown techniques
• Iguazio.io
• Streams and Spark RDDs as native access APIs
• Lineage
• Alluxio (formerly Tachyon)
• Links data history & compute history, caches intermediate stages in machine learning pipelines
• Memory expansion
• Many studies on DRAM/Persistent Memory/Flash tiering for analytics
Slide 19
Opportunities: Places to Start
• Persistent Memory and Flash offer several opportunities to
improve ML/DL capacity and efficiency
• Fast/Frequent Checkpointing for long running jobs
• Note: will put pressure on write endurance
• Low latency logging for exactly-once semantics
• Memory expansion: DRAM/Persistent Memory/Flash hierarchies
• Exploit the highly predictable access patterns of ML algorithms
• Accelerate data load/save stages of ML/DL pipelines
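The low-latency logging opportunity can be sketched as a small write-ahead log for exactly-once message handling (hypothetical; a real implementation must also make the handler's side effects atomic with the log append). Each message id is fsync'd before processing, and already-logged ids are skipped on replay; the fsync is the latency-critical write that persistent memory or fast flash would accelerate.

```python
# Sketch: a write-ahead log of processed message ids for deduplication.
import os
import tempfile

class MessageLog:
    def __init__(self, path):
        self.path = path
        self.seen = set()
        if os.path.exists(path):  # recover the id set after a restart
            with open(path) as f:
                self.seen = {line.strip() for line in f if line.strip()}
        self.f = open(path, "a")

    def process_once(self, msg_id, handler):
        """Run handler(msg_id) at most once per id, surviving restarts."""
        if msg_id in self.seen:
            return False          # duplicate: skip
        self.f.write(msg_id + "\n")
        self.f.flush()
        os.fsync(self.f.fileno())  # latency-critical persistent write
        self.seen.add(msg_id)
        handler(msg_id)
        return True

# Usage: duplicates are skipped, even across a simulated restart.
log_path = os.path.join(tempfile.mkdtemp(), "wal.log")
handled = []
log = MessageLog(log_path)
log.process_once("m1", handled.append)
log.process_once("m1", handled.append)   # duplicate: skipped
restarted = MessageLog(log_path)          # recovery after restart
restarted.process_once("m1", handled.append)
```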
Slide 20
Opportunities – More Fundamental Shifts
• Role of storage types in analytics optimizers and schedulers –
superficially similar to DB query optimization
• Exploit the more relaxed set of requirements on persistence
• Even correctness can be relaxed
• Example in compute land of flexibility in synchronization (the Hogwild! approach to SGD, plus asynchronous SGD, etc.)
• Leverage Persistent Memory to unify low latency streaming data
requirements and high throughput batch data requirements
• New(er) data types and repeatable access patterns
• Converged systems with analytics and storage management for cross
stack efficiency
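The relaxed-synchronization point can be illustrated with a sketch inspired by Hogwild!-style updates (hypothetical, simulated serially here so the result is deterministic): workers compute gradients from stale parameter reads, and for a simple convex objective the iteration still converges, which is the kind of relaxed correctness the slide refers to.

```python
# Sketch: SGD where each update uses a parameter value several steps old,
# modeling unsynchronized readers in an asynchronous scheme.
def stale_sgd(grad, w0=0.0, lr=0.1, steps=200, staleness=3):
    history = [w0]
    w = w0
    for _ in range(steps):
        # read a parameter value `staleness` updates behind the latest
        stale_w = history[max(0, len(history) - 1 - staleness)]
        w = w - lr * grad(stale_w)  # update from the stale gradient
        history.append(w)
    return w

# Minimize (w - 2)^2, whose gradient is 2 * (w - 2); despite the stale
# reads, the iterate settles near the optimum w = 2.
w = stale_sgd(lambda w: 2.0 * (w - 2.0))
```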
Slide 21
Takeaways
• The use of ML/DL in the enterprise is in its infancy and expanding furiously
• These apps place ever-greater pressure on data management, latency, and throughput requirements
• These apps also introduce another layer of abstraction and
another layer of workload intelligence
• Further away from block and file
• Opportunities exist to significantly improve storage and memory
for these use cases by understanding and exploiting their
priorities and non-priorities for data