1
The New Storage Applications: Lots of Data, New Hardware and Machine Intelligence
Nisha Talagala
Parallel Machines
INFLOW 2016
2
Storage Evolution & Application Evolution Combined
Disk & Tape
Flash
DRAM
Persistent Memory
Geographically Distributed
Clustered
Local
Key-Value
File, Object
Block
Data Management
Classic Enterprise
Transactions
Business Intelligence
Search etc.
Advanced Analytics (Machine Learning, Cognitive Functions)
3
In this talk
• What are the new data apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage optimized for these apps today?
• Opportunities for the storage stack?
4
Growing Sources of Data
Teaching Assistants
Elderly Companions
Service Robots
Personal Social Robots
Smart Cities
Robot Drones
Smart Homes
Intelligent vehicles
Personal Assistants (bots)
Smart Enterprise
Edited version of a slide from Balint Fleischer's talk, Flash Memory Summit 2016, Santa Clara, CA
5
The Application Evolution
"Killer" use cases
• Classic Enterprise: OLTP, ERP, Email
• Transactions, Business Intelligence: eCommerce, Messaging, Social Networks, Content Delivery
• Advanced Analytics: Discovery of solutions and capabilities, Risk Assessment, Improving customer experience, Comprehending sensory data
Key functions
• Classic Enterprise: RDBMS, BI, Fraud detection
• Transactions, Business Intelligence: Databases, Social Graphs, SQL and ML Analytics, Streaming
• Advanced Analytics: Natural Language Understanding, Object Recognition, Probabilistic Reasoning, Content Analytics
Data Types
• Classic Enterprise: Structured, Transactional
• Transactions, Business Intelligence: Structured, Unstructured, Transactional, Streaming
• Advanced Analytics: Mixed, Graphs, Matrices
Storage Types
• Classic Enterprise: Enterprise Scale, Standards driven, SAN/NAS, etc
• Transactions, Business Intelligence: Cloud Scale, Open source, File/Object
• Advanced Analytics: ???
Edited version of a slide from Balint Fleischer's talk, Flash Memory Summit 2016, Santa Clara, CA
6
A Sample Analytics Stack
• Language Bindings, APIs: Python, Scala, Java etc
• Libraries: Machine Learning, Deep Learning, SQL, Graph, CEP etc.
• Optimizers/Schedulers
• Processing Engine (frequently in memory)
• Data from Repositories or Live Streams
• Data Repositories: Data Lake, SQL, NoSQL
• Data Streams
7
Machine Learning Software Ecosystem – a Partial View
• Layered API Providers: Beam (Data Flow), StreamSQL, Keras
• Algorithms and Libraries: Mahout, Samsara, MLlib, FlinkML, Caffe, TensorFlow
• Stream Processing Engine: Flink / Apex, Spark Streaming, Storm / Samza / NiFi
• Batch Processing Engine: Hadoop / Spark, Flink, TensorFlow
• Domain focused back end engines: Caffe, Theano, TensorFlow
• Data from Repositories or Live Streams
• Data Repositories: Data Lake, SQL, NoSQL
• Data Streams
8
In this talk
• What are the new apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage optimized for these apps today?
• Opportunities?
9
How ML/DL Workloads think about Data – Part 1
• Data Sizes
• Incoming datasets can range from MB to TB
• Models are typically small. The largest models tend to be deep neural networks
and range from tens of MB to single-digit GB
• Common Data Types
• Time series and Streams
• Multi-dimensional Arrays, Matrices and Vectors
• DataFrames
• Common distributed patterns
• Data Parallel, periodic synchronization
• Model Parallel
• Network sensitivity varies between algorithms. Straggler
performance issues can be significant
• 2x performance difference between InfiniBand (IB) and 40 Gbit Ethernet for some
algorithms like KMeans and SVM
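The "data parallel, periodic synchronization" pattern listed above can be sketched as follows. This is a minimal single-process illustration, not the method of any particular engine: the model (y = w * x), the four "workers", and the function names `local_sgd` and `data_parallel_train` are all assumptions made for the example. Each worker trains on its own shard, and the models are averaged at the end of every round.

```python
import random

def local_sgd(w, shard, lr, steps):
    """Run a few SGD passes over one worker's shard for the model y = w * x."""
    for _ in range(steps):
        for x, y in shard:
            grad = 2 * (w * x - y) * x   # derivative of the squared error
            w -= lr * grad
    return w

def data_parallel_train(shards, rounds=20, local_steps=5, lr=0.01):
    """Each round: every worker trains locally, then the models are averaged."""
    w = 0.0
    for _ in range(rounds):
        local_models = [local_sgd(w, shard, lr, local_steps) for shard in shards]
        w = sum(local_models) / len(local_models)   # periodic synchronization
    return w

rng = random.Random(0)
data = [(x, 3.0 * x) for x in (rng.uniform(-1, 1) for _ in range(200))]
shards = [data[i::4] for i in range(4)]   # four hypothetical workers
w_final = data_parallel_train(shards)     # approaches the true slope, 3.0
```

In a real cluster the averaging step is where network sensitivity and stragglers show up, since every worker has to wait for the slowest one before the next round starts.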
10
The Growth of Streaming Data
• Continuous data flows and continuous processing
• Enabled & driven by sensor data, real time information feeds
• Enables native time component “event time”
• Allows complex computations that can combine new and old data in
deterministic ways
• Several variants with varied functionality
• True Streams, Micro-Batch (an incremental batch emulation)
• Possible with existing models like SQL, supported natively by models
like Google DataFlow / Apache Beam
• The performance of in-memory streaming enables a convergence
between stream analytics (aggregation) and Complex Event Processing
(CEP)
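As a small illustration of combining new and old data deterministically, the sketch below folds micro-batches into a running aggregate. The `RunningMean` class and the sample readings are hypothetical; the point is only that the incremental result matches what a batch recomputation over all data would give.

```python
from dataclasses import dataclass

@dataclass
class RunningMean:
    count: int = 0
    mean: float = 0.0

    def update(self, batch):
        """Fold one micro-batch into the running state."""
        for value in batch:
            self.count += 1
            self.mean += (value - self.mean) / self.count
        return self.mean

readings = [12.0, 15.0, 11.0, 14.0, 13.0, 16.0, 10.0]
micro_batches = [readings[0:3], readings[3:5], readings[5:7]]

state = RunningMean()
for batch in micro_batches:
    state.update(batch)          # old state plus new data

batch_mean = sum(readings) / len(readings)   # full recomputation for comparison
```

Only the small state object has to be kept between batches, which is one reason in-memory streaming engines can keep such aggregates up to date at low latency.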
11
Convergence of RDBMS and Analytics
• In-Memory DBs are moving to continuous queries
• Ex: StreamSQL interfaces, Pipeline DB (based on PostgreSQL)
• Stream and batch analytic engines support SQL interfaces
• Ex: SQL support on Spark, Flink
• SQL parsers with pluggable back ends – Apache Calcite
• Good for basic analytics but need extensions to support machine
learning and deep learning
• Joins, sorts, etc. good for feature engineering, data cleansing
• Many core machine & deep learning operations require linear algebra ops
If the idea of a standard database is "durable data, ephemeral queries",
the idea of a streaming database is "durable queries, ephemeral data"
http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html
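The "durable queries, ephemeral data" idea can be sketched in a few lines. The `ContinuousCount` class below is a hypothetical stand-in for a standing GROUP BY count; it is not the API of PipelineDB or of any StreamSQL product. The query and its result persist, while each event is discarded after it has been applied.

```python
from collections import defaultdict

class ContinuousCount:
    """A standing 'count per key' query over a stream of events."""
    def __init__(self):
        self.counts = defaultdict(int)   # the durable query result

    def on_event(self, event):
        self.counts[event["key"]] += 1   # the event itself is not stored

    def result(self):
        return dict(self.counts)

query = ContinuousCount()
for event in [{"key": "a"}, {"key": "b"}, {"key": "a"}]:
    query.on_event(event)
snapshot = query.result()
```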
12
The Growing Role of the Edge
• Closest to data ingest, lowest latency.
• Benefits to real time processing
• Highly varied connectivity to data centers
• Varied hardware architectures and
resource constraints
• Differs from geographically distributed
data center architecture
• Asymmetry of hardware
• Unpredictable connectivity
• Unpredictable device uptime
[Figure: IoT Reference Model]
13
How ML/DL Workloads think about Data – Part 2
• The older data gets – the more its “role” changes
• Older data is used for batch/historical analytics and model reboots
• Used for model training (sort of), not for inference
• Guarantees can be “flexible” on older data
• Availability can be reduced (most algorithms can deal with some data loss)
• A few data corruptions don't really hurt :)
• Data is evaluated in aggregate and algorithms are tolerant of outliers
• Holes are a fact of real life data – algorithms deal with it
• Quality of service exists but is different
• Random access is very rare
• Heavily patterned access (most operations are some form of array/matrix)
• Shuffle phase in some analytic engines
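The claim that aggregate evaluation tolerates some loss and corruption can be illustrated with a robust statistic. The sketch below is an assumption-laden toy: synthetic Gaussian data, about 1% of records dropped, five records overwritten with a huge value, and the median chosen as the aggregate. It is not taken from the talk.

```python
import random
import statistics

rng = random.Random(42)
clean = [rng.gauss(50.0, 5.0) for _ in range(10_000)]

# Simulate storage faults: drop about 1% of records and corrupt five of the rest.
damaged = [v for v in clean if rng.random() > 0.01]
for i in range(5):
    damaged[i * 100] = 1e6

clean_median = statistics.median(clean)
damaged_median = statistics.median(damaged)   # stays close to clean_median
```

How much damage is acceptable depends on the algorithm. A plain mean, for example, would be pulled far away by the same five corrupted values, so "flexible" guarantees only make sense once the consuming computation is known.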
14
Correctness, Determinism, Accuracy and Speed
• More complex evaluation metrics than
traditional transactional workloads
• Correctness is hard to measure
• Even two implementations of the “same
algorithm” can generate different results
• Determinism/Repeatability is not always
present for streaming data
• Ex: Micro-batch processing can produce
different results depending on arrival time vs.
event time
• Accuracy to time tradeoff is non-linear
• Exploratory models can generate massive
parallelism for the same data set used
repeatedly (hyper-parameter search)
[Figure: two plots of Error (y-axis) against Time (x-axis), one for SVM V1 and one for SVM V2]
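The micro-batch example above can be made concrete. In this sketch the same four events arrive in two different orders. Grouping by arrival (modeled here, as a simplifying assumption, by a fixed batch size of two) gives different sums, while grouping by event time into 10-second tumbling windows gives the same result for both orders. The function names and the event values are hypothetical.

```python
from collections import defaultdict

# (event_time_seconds, value); the two lists differ only in arrival order.
arrival_a = [(1, 10), (11, 30), (2, 20), (3, 40)]
arrival_b = [(1, 10), (2, 20), (3, 40), (11, 30)]

def micro_batch_sums(events, batch_size=2):
    """Group events by arrival order, as a simple micro-batch engine might."""
    return [sum(v for _, v in events[i:i + batch_size])
            for i in range(0, len(events), batch_size)]

def event_time_sums(events, window=10):
    """Group events into tumbling windows keyed by their event time."""
    sums = defaultdict(int)
    for t, v in events:
        sums[t // window] += v
    return dict(sums)

by_arrival_a = micro_batch_sums(arrival_a)   # depends on arrival order
by_arrival_b = micro_batch_sums(arrival_b)
by_event_a = event_time_sums(arrival_a)      # independent of arrival order
by_event_b = event_time_sums(arrival_b)
```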
15
The Role of Persistence
• For ML functions, most computations today are in-memory
• Data flows from data lake to analytic engine and results flow back
• Persistent checkpoints can generate large write traffic for very long running
computations (streams, large neural network training, etc.)
• Persistent message storage to enforce exactly once semantics and
determinism, latency sensitive write traffic
• For in-memory databases, persistence is part of the core engine
• Log based persistence is common
• Loading & cleaning of data is still a very large fraction of the pipeline time
• Most of this involves manipulating stored data
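A minimal sketch of the persistent checkpoint pattern for a long-running computation follows. The file layout, the JSON state, and the names `save_checkpoint`, `load_checkpoint`, and `train` are assumptions for illustration. The checkpoint is written to a temporary file, flushed to stable storage, and then renamed over the previous one, so a crash leaves either the old or the new checkpoint intact.

```python
import json
import os
import tempfile

def save_checkpoint(path, state):
    """Write state to a temp file, flush it to stable storage, then rename."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "w") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())          # this is where the write traffic lands
    os.replace(tmp_path, path)        # atomic rename within one filesystem

def load_checkpoint(path, default):
    if not os.path.exists(path):
        return default
    with open(path) as f:
        return json.load(f)

def train(path, total_steps, checkpoint_every=10):
    """Resume from the last checkpoint, if any, and run to total_steps."""
    state = load_checkpoint(path, {"step": 0, "weight": 0.0})
    while state["step"] < total_steps:
        state["weight"] += 0.5        # stand-in for one training step
        state["step"] += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(path, state)
    return state
```

With real model state in place of the toy dictionary, every checkpoint rewrites the full model, which is why frequent checkpoints of large networks translate into heavy write traffic.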
16
In this talk
• What are the new apps? – with a heavy focus on Advanced
Analytics, particularly Machine Learning and Deep Learning
• What are their salient characteristics when it comes to storage
and memory?
• How is storage/memory optimized for these apps today?
• Opportunities?
17
Abstractions and the Stack
• ML/DL applications use common
abstractions that combine linear algebra,
tables, streams etc
• These are stored as independent entities
inside Key-Value pairs, Objects or Files
• File system used as common namespace
• Information is lost at each level down,
along with opportunities to optimize
layout, tiering, caching etc
Data copies (or transfers denoted by red
lines) occur frequently, sometimes more
than once!
[Figure: stack layers: Block; File; Key-Value and Object; Matrices, Tables, Streams, etc]
18
Optimizing Storage: Some Examples
• Time series optimized databases
• Examples: BTrDB (FAST 2016) and Gorilla DB (Facebook/VLDB 2015)
• Streamlined data types, specialized indexing, tiering optimized for access
patterns
• API pushdown techniques
• Iguazio.io
• Streams and Spark RDDs as native access APIs
• Lineage
• Alluxio (Formerly Tachyon)
• Link data history & compute history, cache intermediate stages in machine
learning pipelines
• Memory expansion
• Many studies on DRAM/Persistent Memory/Flash tiering for analytics
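One example of a streamlined data type for time series is delta-of-delta timestamp encoding, a technique of the kind described in the Gorilla paper. The sketch below shows only the integer transform, under the assumption of regularly spaced samples; the bit-level packing that produces the actual space savings is omitted, and the function names are hypothetical.

```python
def delta_of_delta_encode(timestamps):
    """Return (first_timestamp, first_delta, list_of_delta_of_deltas)."""
    if len(timestamps) < 2:
        raise ValueError("need at least two timestamps")
    deltas = [b - a for a, b in zip(timestamps, timestamps[1:])]
    dods = [b - a for a, b in zip(deltas, deltas[1:])]
    return timestamps[0], deltas[0], dods

def delta_of_delta_decode(first, first_delta, dods):
    """Rebuild the original timestamps from the encoded form."""
    timestamps = [first, first + first_delta]
    delta = first_delta
    for dod in dods:
        delta += dod
        timestamps.append(timestamps[-1] + delta)
    return timestamps

ts = [1000, 1060, 1120, 1180, 1241, 1301]   # roughly one sample per minute
encoded = delta_of_delta_encode(ts)
decoded = delta_of_delta_decode(*encoded)
```

For regularly sampled series most delta-of-delta values are zero or near zero, so they can be stored in very few bits.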
19
Opportunities: Places to Start
• Persistent Memory and Flash offer several opportunities to
improve ML/DL capacity and efficiency
• Fast/Frequent Checkpointing for long running jobs
• Note: will put pressure on write endurance
• Low latency logging for exactly-once semantics
• Memory expansion: DRAM/Persistent Memory/Flash hierarchies
• exploit the highly predictable access patterns of ML algorithms
• Accelerate data load/save stages of ML/DL pipelines
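The logging requirement behind exactly-once semantics can be sketched as an append-only log of applied message ids. Everything here is hypothetical (the `ExactlyOnceCounter` class, the JSON-lines log format), and the sketch ignores torn writes at the tail of the log. Each message is made durable before it is acknowledged, so redelivered duplicates can be recognized even after a restart.

```python
import json
import os
import tempfile

class ExactlyOnceCounter:
    """Applies each message id at most once, using an append-only log."""
    def __init__(self, log_path):
        self.log_path = log_path
        self.seen = set()
        self.total = 0
        if os.path.exists(log_path):          # replay the log after a restart
            with open(log_path) as f:
                for line in f:
                    record = json.loads(line)
                    self.seen.add(record["id"])
                    self.total += record["value"]

    def process(self, msg_id, value):
        if msg_id in self.seen:               # duplicate delivery: ignore
            return False
        with open(self.log_path, "a") as f:
            f.write(json.dumps({"id": msg_id, "value": value}) + "\n")
            f.flush()
            os.fsync(f.fileno())              # latency-sensitive durable write
        self.seen.add(msg_id)
        self.total += value
        return True

log_path = os.path.join(tempfile.mkdtemp(), "messages.log")
counter = ExactlyOnceCounter(log_path)
counter.process("m1", 5)
counter.process("m2", 7)
counter.process("m1", 5)                      # redelivery of m1
restarted = ExactlyOnceCounter(log_path)      # simulated restart
```

The `fsync` on every message sits on the critical path of each update, which is the latency that persistent memory or low-latency flash could reduce.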
20
Opportunities – More Fundamental Shifts
• Role of storage types in analytics optimizers and schedulers –
superficially similar to DB query optimization
• Exploit the more relaxed set of requirements on persistence
• Even correctness can be relaxed
• Precedent on the compute side: relaxed synchronization (the HogWild!
approach to SGD, plus Asynchronous SGD etc.)
• Leverage Persistent Memory to unify low latency streaming data
requirements and high throughput batch data requirements
• New(er) data types and repeatable access patterns
• Converged systems with analytics and storage management for cross
stack efficiency
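The idea that correctness requirements can be relaxed is easiest to see on the compute side. The sketch below is a deterministic simulation of stale gradients, not the lock-free multi-threaded HogWild! algorithm itself. The objective, the learning rate, and the name `stale_sgd` are assumptions. Each update uses a gradient computed from a parameter value that is a few updates old, and the iteration still converges for a small enough learning rate.

```python
from collections import deque

def stale_sgd(target=3.0, lr=0.1, staleness=2, steps=200):
    """Minimize 0.5 * (w - target)**2 using gradients computed from a
    parameter value that is `staleness` updates old."""
    w = 0.0
    history = deque([w] * (staleness + 1), maxlen=staleness + 1)
    for _ in range(steps):
        stale_w = history[0]              # oldest retained parameter value
        grad = stale_w - target           # gradient evaluated at the stale value
        w -= lr * grad
        history.append(w)
    return w

w_sync = stale_sgd(staleness=0)    # fully synchronized updates
w_async = stale_sgd(staleness=2)   # updates based on out-of-date parameters
```

A storage analogue would be serving slightly stale or incomplete data to a training job in exchange for lower latency or cheaper persistence.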
21
Takeaways
• The use of ML/DL in the enterprise is in its infancy and expanding
furiously
• These apps put ever larger pressure on data management,
latency, and throughput requirements
• These apps also introduce another layer of abstraction and
another layer of workload intelligence
• Further away from block and file
• Opportunities exist to significantly improve storage and memory
for these use cases by understanding and exploiting their
priorities and non-priorities for data
22
Thank You

Más contenido relacionado

La actualidad más candente

Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
Databricks
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Spark Summit
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
Databricks
 

La actualidad más candente (20)

ML Platform Q1 Meetup: End to-end Feature Analysis, Validation and Transforma...
ML Platform Q1 Meetup: End to-end Feature Analysis, Validation and Transforma...ML Platform Q1 Meetup: End to-end Feature Analysis, Validation and Transforma...
ML Platform Q1 Meetup: End to-end Feature Analysis, Validation and Transforma...
 
Data Streaming For Big Data
Data Streaming For Big DataData Streaming For Big Data
Data Streaming For Big Data
 
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
Deep Learning for Natural Language Processing Using Apache Spark and TensorFl...
 
Infrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload DeploymentInfrastructure Agnostic Machine Learning Workload Deployment
Infrastructure Agnostic Machine Learning Workload Deployment
 
Distributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On SparkDistributed Heterogeneous Mixture Learning On Spark
Distributed Heterogeneous Mixture Learning On Spark
 
Scaling up Machine Learning Development
Scaling up Machine Learning DevelopmentScaling up Machine Learning Development
Scaling up Machine Learning Development
 
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
Stream Data Processing at Big Data Landscape by Oleksandr Fedirko
 
Thomas Jensen. Machine Learning
Thomas Jensen. Machine LearningThomas Jensen. Machine Learning
Thomas Jensen. Machine Learning
 
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep... Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
Cloud Computing Was Built for Web Developers—What Does v2 Look Like for Deep...
 
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the EdgeUsing Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
Using Apache Pulsar to Provide Real-Time IoT Analytics on the Edge
 
PEARC 17: Spark On the ARC
PEARC 17: Spark On the ARCPEARC 17: Spark On the ARC
PEARC 17: Spark On the ARC
 
Data warehouse 26 exploiting parallel technologies
Data warehouse  26 exploiting parallel technologiesData warehouse  26 exploiting parallel technologies
Data warehouse 26 exploiting parallel technologies
 
Tuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and ArchitectureTuning ML Models: Scaling, Workflows, and Architecture
Tuning ML Models: Scaling, Workflows, and Architecture
 
DataMass Summit - Machine Learning for Big Data in SQL Server
DataMass Summit - Machine Learning for Big Data  in SQL ServerDataMass Summit - Machine Learning for Big Data  in SQL Server
DataMass Summit - Machine Learning for Big Data in SQL Server
 
Machine Learning Pipelines
Machine Learning PipelinesMachine Learning Pipelines
Machine Learning Pipelines
 
Distributed machine learning
Distributed machine learningDistributed machine learning
Distributed machine learning
 
Cognitive Toolkit - Deep Learning framework from Microsoft
Cognitive Toolkit - Deep Learning framework from MicrosoftCognitive Toolkit - Deep Learning framework from Microsoft
Cognitive Toolkit - Deep Learning framework from Microsoft
 
Data Intensive Applications with Apache Flink
Data Intensive Applications with Apache FlinkData Intensive Applications with Apache Flink
Data Intensive Applications with Apache Flink
 
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
Petabyte Scale Anomaly Detection Using R & Spark by Sridhar Alla and Kiran Mu...
 
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens... ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
ML at the Edge: Building Your Production Pipeline with Apache Spark and Tens...
 

Destacado

Finland presentation
Finland presentationFinland presentation
Finland presentation
hatice ekiz
 
Snapdragon platforms overview feat. MSM7x30 chipset (v7)
Snapdragon platforms overview feat. MSM7x30 chipset (v7)Snapdragon platforms overview feat. MSM7x30 chipset (v7)
Snapdragon platforms overview feat. MSM7x30 chipset (v7)
Maxim Birger (马克斯)
 
Nibelungenlied(World Literature)
Nibelungenlied(World Literature)Nibelungenlied(World Literature)
Nibelungenlied(World Literature)
Sarah Cruz
 

Destacado (20)

Finland presentation
Finland presentationFinland presentation
Finland presentation
 
AXCIOMA, the internals, the component framework for distributed, real-time, a...
AXCIOMA, the internals, the component framework for distributed, real-time, a...AXCIOMA, the internals, the component framework for distributed, real-time, a...
AXCIOMA, the internals, the component framework for distributed, real-time, a...
 
English nibelungenlied
English nibelungenliedEnglish nibelungenlied
English nibelungenlied
 
Arthurian, Germanic & Scandinavian Legends and Folklore
Arthurian, Germanic & Scandinavian Legends and Folklore Arthurian, Germanic & Scandinavian Legends and Folklore
Arthurian, Germanic & Scandinavian Legends and Folklore
 
Snapdragon platforms overview feat. MSM7x30 chipset (v7)
Snapdragon platforms overview feat. MSM7x30 chipset (v7)Snapdragon platforms overview feat. MSM7x30 chipset (v7)
Snapdragon platforms overview feat. MSM7x30 chipset (v7)
 
Nibelungenlied(World Literature)
Nibelungenlied(World Literature)Nibelungenlied(World Literature)
Nibelungenlied(World Literature)
 
Android things intro
Android things introAndroid things intro
Android things intro
 
oVirt – open your virtual datacenter
oVirt – open your virtual datacenteroVirt – open your virtual datacenter
oVirt – open your virtual datacenter
 
Fossasia 16 Integrating oVirt, Foreman and Katello to empower your data-center
Fossasia 16 Integrating oVirt, Foreman and Katello to empower your data-centerFossasia 16 Integrating oVirt, Foreman and Katello to empower your data-center
Fossasia 16 Integrating oVirt, Foreman and Katello to empower your data-center
 
Nibelungenlied
NibelungenliedNibelungenlied
Nibelungenlied
 
Spark CL
Spark CLSpark CL
Spark CL
 
State of Linux Containers for HPC
State of Linux Containers for HPCState of Linux Containers for HPC
State of Linux Containers for HPC
 
Interview preparation workshop
Interview preparation workshopInterview preparation workshop
Interview preparation workshop
 
Embedded Linux Kernel - Build your custom kernel
Embedded Linux Kernel - Build your custom kernelEmbedded Linux Kernel - Build your custom kernel
Embedded Linux Kernel - Build your custom kernel
 
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
ALLUXIO (formerly Tachyon): Unify Data at Memory Speed - Effective using Spar...
 
HPC meets Docker - Using Docker Containers to run HPC worloads
HPC meets Docker - Using Docker Containers to run HPC worloadsHPC meets Docker - Using Docker Containers to run HPC worloads
HPC meets Docker - Using Docker Containers to run HPC worloads
 
Having fun with Raspberry(s) and Apache projects
Having fun with Raspberry(s) and Apache projectsHaving fun with Raspberry(s) and Apache projects
Having fun with Raspberry(s) and Apache projects
 
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like systemAccelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
Accelerate Reed-Solomon coding for Fault-Tolerance in RAID-like system
 
The basics and design of lua table
The basics and design of lua tableThe basics and design of lua table
The basics and design of lua table
 
Designing HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale SystemsDesigning HPC & Deep Learning Middleware for Exascale Systems
Designing HPC & Deep Learning Middleware for Exascale Systems
 

Similar a Nisha talagala keynote_inflow_2016

Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Precisely
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Paige_Roberts
 

Similar a Nisha talagala keynote_inflow_2016 (20)

Storage Challenges for Production Machine Learning
Storage Challenges for Production Machine LearningStorage Challenges for Production Machine Learning
Storage Challenges for Production Machine Learning
 
Msst 2019 v4
Msst 2019 v4Msst 2019 v4
Msst 2019 v4
 
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data ArchitectureADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
ADV Slides: When and How Data Lakes Fit into a Modern Data Architecture
 
Productionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons LearnedProductionizing Hadoop - New Lessons Learned
Productionizing Hadoop - New Lessons Learned
 
Ideas spracklen-final
Ideas spracklen-finalIdeas spracklen-final
Ideas spracklen-final
 
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 How to use Big Data and Data Lake concept in business using Hadoop and Spark... How to use Big Data and Data Lake concept in business using Hadoop and Spark...
How to use Big Data and Data Lake concept in business using Hadoop and Spark...
 
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?Data Lake Acceleration vs. Data Virtualization - What’s the difference?
Data Lake Acceleration vs. Data Virtualization - What’s the difference?
 
Fms invited talk_2018 v5
Fms invited talk_2018 v5Fms invited talk_2018 v5
Fms invited talk_2018 v5
 
Bitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FSBitkom Cray presentation - on HPC affecting big data analytics in FS
Bitkom Cray presentation - on HPC affecting big data analytics in FS
 
What ya gonna do?
What ya gonna do?What ya gonna do?
What ya gonna do?
 
Meta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinarMeta scale kognitio hadoop webinar
Meta scale kognitio hadoop webinar
 
So You Want to Build a Data Lake?
So You Want to Build a Data Lake?So You Want to Build a Data Lake?
So You Want to Build a Data Lake?
 
Operational-Analytics
Operational-AnalyticsOperational-Analytics
Operational-Analytics
 
SpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud ComputingSpringPeople - Introduction to Cloud Computing
SpringPeople - Introduction to Cloud Computing
 
The Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the MassesThe Challenges of Bringing Machine Learning to the Masses
The Challenges of Bringing Machine Learning to the Masses
 
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
Engineering Machine Learning Data Pipelines Series: Streaming New Data as It ...
 
20160331 sa introduction to big data pipelining berlin meetup 0.3
20160331 sa introduction to big data pipelining berlin meetup   0.320160331 sa introduction to big data pipelining berlin meetup   0.3
20160331 sa introduction to big data pipelining berlin meetup 0.3
 
Big data berlin
Big data berlinBig data berlin
Big data berlin
 
Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark Solving Office 365 Big Challenges using Cassandra + Spark
Solving Office 365 Big Challenges using Cassandra + Spark
 
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production FasterPython + MPP Database = Large Scale AI/ML Projects in Production Faster
Python + MPP Database = Large Scale AI/ML Projects in Production Faster
 

Último

Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
nirzagarg
 
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
Top profile Call Girls In Vadodara [ 7014168258 ] Call Me For Genuine Models ...
gajnagarg
 
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
怎样办理圣地亚哥州立大学毕业证(SDSU毕业证书)成绩单学校原版复制
vexqp
 
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In Begusarai [ 7014168258 ] Call Me For Genuine Models...
nirzagarg
 
Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1Lecture_2_Deep_Learning_Overview-newone1
Lecture_2_Deep_Learning_Overview-newone1
ranjankumarbehera14
 
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
Jodhpur Park | Call Girls in Kolkata Phone No 8005736733 Elite Escort Service...
HyderabadDolls
 
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Jual obat aborsi Bandung ( 085657271886 ) Cytote pil telat bulan penggugur ka...
Klinik kandungan
 
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
Top profile Call Girls In Tumkur [ 7014168258 ] Call Me For Genuine Models We...
nirzagarg
 
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
Top profile Call Girls In bhavnagar [ 7014168258 ] Call Me For Genuine Models...
gajnagarg
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
ZurliaSoop
 

Último (20)

Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
Gomti Nagar & best call girls in Lucknow | 9548273370 Independent Escorts & D...
 
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptxRESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
RESEARCH-FINAL-DEFENSE-PPT-TEMPLATE.pptx
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Statistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbersStatistics notes ,it includes mean to index numbers
Statistics notes ,it includes mean to index numbers
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Discover Why Less is More in B2B Research
Discover Why Less is More in B2B ResearchDiscover Why Less is More in B2B Research
Discover Why Less is More in B2B Research
 
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
Nirala Nagar / Cheap Call Girls In Lucknow Phone No 9548273370 Elite Escort S...
 
DATA SUMMIT 24 Building Real-Time Pipelines With FLaNK

Nisha talagala keynote_inflow_2016

  • 1. The New Storage Applications: Lots of Data, New Hardware and Machine Intelligence
      • Nisha Talagala, Parallel Machines
      • INFLOW 2016
  • 2. Storage Evolution & Application Evolution Combined
      • Disk & Tape, Flash, DRAM, Persistent Memory
      • Geographically Distributed, Clustered, Local
      • Key-Value, File, Object, Block
      • Data Management
      • Classic Enterprise Transactions; Business Intelligence, Search, etc.; Advanced Analytics (Machine Learning, Cognitive Functions)
  • 3. In this talk
      • What are the new data apps? (With a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning)
      • What are their salient characteristics when it comes to storage and memory?
      • How is storage optimized for these apps today?
      • Opportunities for the storage stack?
  • 4. Growing Sources of Data
      • Teaching Assistants, Elderly Companions, Service Robots, Personal Social Robots, Smart Cities, Robot Drones, Smart Homes, Intelligent Vehicles, Personal Assistants (bots), Smart Enterprise
      • Edited version of a slide from Balint Fleischer’s talk: Flash Memory Summit 2016, Santa Clara, CA
  • 5. The Application Evolution
      • Stages: Classic Enterprise Transactions, Business Intelligence, Advanced Analytics
      • “Killer” use cases: OLTP, ERP, Email, eCommerce, Messaging, Social Networks, Content Delivery, Discovery of solutions and capabilities, Risk Assessment, Improving customer experience, Comprehending sensory data
      • Key functions: RDBMS, BI, Fraud detection, Databases, Social Graphs, SQL and ML Analytics, Streaming, Natural Language Understanding, Object Recognition, Probabilistic Reasoning, Content Analytics
      • Data Types: Structured, Transactional, Structured, Unstructured, Transactional, Streaming, Mixed, Graphs, Matrices
      • Storage Types: Enterprise Scale, Standards driven, SAN/NAS, etc.; Cloud Scale, Open source, File/Object; ???
      • Edited version of a slide from Balint Fleischer’s talk: Flash Memory Summit 2016, Santa Clara, CA
  • 6. A Sample Analytics Stack
      • Libraries: Machine Learning, Deep Learning, SQL, Graph, CEP, etc.
      • Language Bindings, APIs: Python, Scala, Java, etc.
      • Optimizers/Schedulers
      • Processing Engine: frequently in memory
      • Data from Repositories or Live Streams: Data Repositories (SQL, NoSQL, Data Lake), Data Streams
  • 7. Machine Learning Software Ecosystem: a Partial View
      • Layered API Providers: Beam (Data Flow), StreamSQL, Keras
      • Algorithms and Libraries: Mahout, Samsara, MLlib, FlinkML, Caffe, TensorFlow
      • Stream Processing Engine: Flink / Apex, Spark Streaming, Storm / Samza / NiFi
      • Batch Processing Engine: Hadoop / Spark, Flink, TensorFlow
      • Domain focused back end engines: Caffe, Theano, TensorFlow
      • Data from Repositories or Live Streams: Data Repositories (SQL, NoSQL, Data Lake), Data Streams
  • 8. In this talk
      • What are the new apps? (With a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning)
      • What are their salient characteristics when it comes to storage and memory?
      • How is storage optimized for these apps today?
      • Opportunities?
  • 9. How ML/DL Workloads think about Data, Part 1
      • Data Sizes
          • Incoming datasets can range from MB to TB
          • Models are typically small. The largest models tend to be deep neural networks and range from tens of MB to single-digit GB
      • Common Data Types
          • Time series and Streams
          • Multi-dimensional Arrays, Matrices and Vectors
          • DataFrames
      • Common distributed patterns
          • Data Parallel, periodic synchronization
          • Model Parallel
          • Network sensitivity varies between algorithms. Straggler performance issues can be significant
          • 2x performance difference between InfiniBand (IB) and 40Gbit Ethernet for some algorithms like KMeans and SVM
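The "Data Parallel, periodic synchronization" pattern can be illustrated with a small sketch. This is not code from the talk: the one-parameter model `y = w * x`, the function names, the learning rate and the synchronization interval are all assumptions chosen for illustration. Each simulated worker runs SGD on its own data shard and the workers average their weights every `sync_every` steps.

```python
import random

def local_sgd_step(w, shard, lr):
    """One SGD step on one example drawn from this worker's shard (model: y = w * x)."""
    x, y = random.choice(shard)
    grad = 2.0 * (w * x - y) * x  # derivative of the squared error (w*x - y)^2
    return w - lr * grad

def data_parallel_train(shards, steps=300, sync_every=10, lr=0.1, seed=0):
    """Each worker updates its own copy of w; copies are averaged every `sync_every` steps."""
    random.seed(seed)
    weights = [0.0 for _ in shards]  # one model replica per worker
    for step in range(1, steps + 1):
        weights = [local_sgd_step(w, shard, lr) for w, shard in zip(weights, shards)]
        if step % sync_every == 0:
            # Periodic synchronization: replace every replica with the average.
            avg = sum(weights) / len(weights)
            weights = [avg] * len(weights)
    return sum(weights) / len(weights)

def make_shards(num_workers=4, n=200, true_w=3.0, seed=1):
    """Hypothetical noiseless dataset y = true_w * x, split round-robin across workers."""
    rng = random.Random(seed)
    data = [(x, true_w * x) for x in (rng.uniform(-1, 1) for _ in range(n))]
    return [data[i::num_workers] for i in range(num_workers)]

if __name__ == "__main__":
    print(f"learned w = {data_parallel_train(make_shards()):.3f}")
```

In a real cluster the averaging step is where network bandwidth and stragglers matter, since every worker has to wait for the slowest replica before continuing.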
  • 10. The Growth of Streaming Data
      • Continuous data flows and continuous processing
          • Enabled & driven by sensor data, real time information feeds
      • Enables native time component “event time”
          • Allows complex computations that can combine new and old data in deterministic ways
      • Several variants with varied functionality
          • True Streams, Micro-Batch (an incremental batch emulation)
          • Possible with existing models like SQL, supported natively by models like Google DataFlow / Apache Beam
      • The performance of in-memory streaming enables a convergence between stream analytics (aggregation) and Complex Event Processing (CEP)
  • 11. Convergence of RDBMS and Analytics
      • In-Memory DBs are moving to continuous queries
          • Ex: StreamSQL interfaces, PipelineDB (based on PostgreSQL)
      • Stream and batch analytic engines support SQL interfaces
          • Ex: SQL support on Spark, Flink
          • SQL parsers with pluggable back ends: Apache Calcite
      • Good for basic analytics but need extensions to support machine learning and deep learning
          • Joins, sorts, etc. good for feature engineering, data cleansing
          • Many core machine & deep learning operations require linear algebra ops
      • If the idea of a standard database is "durable data, ephemeral queries" the idea of a streaming database is "durable queries, ephemeral data" (http://www.databasesoup.com/2015/07/pipelinedb-streaming-postgres.html)
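The "durable queries, ephemeral data" idea can be made concrete with a minimal sketch. The class `ContinuousAverage` and its methods are hypothetical names, not the API of PipelineDB or any other system named above: the query is registered once and lives on, while each event is folded into the running aggregate and then dropped.

```python
from collections import defaultdict

class ContinuousAverage:
    """A standing query: average value per key, updated as each event arrives."""

    def __init__(self):
        self._count = defaultdict(int)
        self._total = defaultdict(float)

    def on_event(self, key, value):
        # Fold the event into the running aggregate; the raw event is not stored.
        self._count[key] += 1
        self._total[key] += value

    def result(self):
        # The query result is always available and reflects all events seen so far.
        return {k: self._total[k] / self._count[k] for k in self._count}

if __name__ == "__main__":
    query = ContinuousAverage()
    for key, value in [("a", 1.0), ("b", 4.0), ("a", 3.0)]:
        query.on_event(key, value)
    print(query.result())
```

Only the per-key count and total are retained, so memory use depends on the number of keys rather than the number of events.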
  • 12. The Growing Role of the Edge
      • Closest to data ingest, lowest latency
          • Benefits to real time processing
      • Highly varied connectivity to data centers
      • Varied hardware architectures and resource constraints
      • Differs from geographically distributed data center architecture
          • Asymmetry of hardware
          • Unpredictable connectivity
          • Unpredictable device uptime
      • [Figure: IoT Reference Model]
  • 13. How ML/DL Workloads think about Data, Part 2
      • The older data gets, the more its “role” changes
          • Older data for batch historical analytics and model reboots
          • Used for model training (sort of), not for inference
      • Guarantees can be “flexible” on older data
          • Availability can be reduced (most algorithms can deal with some data loss)
          • A few data corruptions don’t really hurt :)
          • Data is evaluated in aggregate and algorithms are tolerant of outliers
          • Holes are a fact of real life data; algorithms deal with it
      • Quality of service exists but is different
          • Random access is very rare
          • Heavily patterned access (most operations are some form of array/matrix)
          • Shuffle phase in some analytic engines
  • 14. Correctness, Determinism, Accuracy and Speed
      • More complex evaluation metrics than traditional transactional workloads
      • Correctness is hard to measure
          • Even two implementations of the “same algorithm” can generate different results
      • Determinism/Repeatability is not always present for streaming data
          • Ex: Micro-batch processing can produce different results depending on arrival time vs. event time
      • Accuracy to time tradeoff is non-linear
      • Exploratory models can generate massive parallelism for the same data set used repeatedly (hyper-parameter search)
      • [Figure: Error vs. Time plots for two SVM implementations, SVM V1 and SVM V2]
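The arrival-time vs. event-time point can be shown with a toy example. The event records, field names and 10-unit window width below are made up for illustration. Grouping the same three events into windows by arrival time (as a simple micro-batcher would) gives different sums than grouping them by event time, because one event arrives late.

```python
from collections import defaultdict

def window_sums(events, time_field, width):
    """Sum event values per fixed-width window, keyed by window index."""
    sums = defaultdict(int)
    for event in events:
        sums[event[time_field] // width] += event["value"]
    return dict(sums)

# Hypothetical events; the second one is delayed and arrives in the next window.
EVENTS = [
    {"event_time": 1, "arrival_time": 2, "value": 10},
    {"event_time": 4, "arrival_time": 11, "value": 20},
    {"event_time": 12, "arrival_time": 13, "value": 30},
]

if __name__ == "__main__":
    print(window_sums(EVENTS, "event_time", 10))    # {0: 30, 1: 30}
    print(window_sums(EVENTS, "arrival_time", 10))  # {0: 10, 1: 50}
```

The event-time result is the same no matter when the events show up, while the arrival-time result changes with network delay, which is the source of non-repeatability.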
  • 15. The Role of Persistence
      • For ML functions, most computations today are in-memory
          • Data flows from data lake to analytic engine and results flow back
          • Persistent checkpoints can generate large write traffic for very long running computations (streams, large neural network training, etc.)
          • Persistent message storage to enforce exactly-once semantics and determinism; latency sensitive write traffic
      • For in-memory databases, persistence is part of the core engine
          • Log based persistence is common
      • Loading & cleaning of data is still a very large fraction of the pipeline time
          • Most of this involves manipulating stored data
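Log based persistence for an in-memory store can be sketched roughly as follows. `LoggedStore` is a hypothetical class and the JSON-lines log format is an assumption; real engines add log compaction, checksums and handling of torn writes, none of which is shown here. Every update is appended and synced to the log before the in-memory state changes, and a restart rebuilds the state by replaying the log.

```python
import json
import os
import tempfile

class LoggedStore:
    """In-memory key-value store whose updates are first appended to a log file."""

    def __init__(self, path):
        self.path = path
        self.data = {}
        # Recovery: replay every logged update in order to rebuild the state.
        if os.path.exists(path):
            with open(path, "r", encoding="utf-8") as log:
                for line in log:
                    key, value = json.loads(line)
                    self.data[key] = value
        self._log = open(path, "a", encoding="utf-8")

    def put(self, key, value):
        # Write-ahead: persist the update, then apply it in memory.
        self._log.write(json.dumps([key, value]) + "\n")
        self._log.flush()
        os.fsync(self._log.fileno())  # this sync is the latency sensitive write
        self.data[key] = value

    def close(self):
        self._log.close()

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        path = os.path.join(tmp, "store.log")
        store = LoggedStore(path)
        store.put("a", 1)
        store.put("a", 2)
        store.close()
        recovered = LoggedStore(path)  # simulated restart
        print(recovered.data)
        recovered.close()
```

The `fsync` on every `put` is what makes each write durable, and it is also why the latency of the log device directly bounds update latency.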
  • 16. In this talk
      • What are the new apps? (With a heavy focus on Advanced Analytics, particularly Machine Learning and Deep Learning)
      • What are their salient characteristics when it comes to storage and memory?
      • How is storage/memory optimized for these apps today?
      • Opportunities?
  • 17. Abstractions and the Stack
      • ML/DL applications use common abstractions that combine linear algebra, tables, streams, etc.
      • These are stored as independent entities inside Key-Value pairs, Objects or Files
      • File system used as common namespace
      • Information is lost at each level down, along with opportunities to optimize layout, tiering, caching, etc.
      • Data copies (or transfers, denoted by red lines) occur frequently, sometimes more than once!
      • [Figure: stack layers: Matrices, Tables, Streams, etc.; Key-Value and Object; File; Block]
  • 18. Optimizing Storage: Some Examples
      • Time series optimized databases
          • Examples: BTrDB (FAST 2016) and Gorilla DB (Facebook/VLDB 2015)
          • Streamlined data types, specialized indexing, tiering optimized for access patterns
      • API pushdown techniques
          • Iguazio.io
          • Streams and Spark RDDs as native access APIs
      • Lineage
          • Alluxio (formerly Tachyon)
          • Link data history & compute history, cache intermediate stages in machine learning pipelines
      • Memory expansion
          • Many studies on DRAM/Persistent Memory/Flash tiering for analytics
  • 19. Opportunities: Places to Start
      • Persistent Memory and Flash offer several opportunities to improve ML/DL capacity and efficiency
          • Fast/Frequent Checkpointing for long running jobs
              • Note: will put pressure on write endurance
          • Low latency logging for exactly-once semantics
          • Memory expansion: DRAM/Persistent Memory/Flash hierarchies
              • Exploit the highly predictable access patterns of ML algorithms
          • Accelerate data load/save stages of ML/DL pipelines
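Frequent checkpointing for a long running job might look like the sketch below. The toy job (summing step numbers), the JSON checkpoint format and all function names are assumptions for illustration, not a method from the talk. The checkpoint is written to a temporary file and atomically renamed with `os.replace`, so a crash never leaves a half-written checkpoint behind.

```python
import json
import os
import tempfile

def save_checkpoint(state, path):
    """Write the state to a temp file in the same directory, then atomically rename it."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    with os.fdopen(fd, "w", encoding="utf-8") as f:
        json.dump(state, f)
        f.flush()
        os.fsync(f.fileno())
    os.replace(tmp_path, path)

def load_checkpoint(path, default):
    """Return the last saved state, or `default` if no checkpoint exists yet."""
    if not os.path.exists(path):
        return default
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

def run_job(path, total_steps, checkpoint_every, stop_after=None):
    """Toy job that sums step numbers, resuming from the last checkpoint if present."""
    state = load_checkpoint(path, {"step": 0, "total": 0})
    executed = 0
    while state["step"] < total_steps:
        if stop_after is not None and executed >= stop_after:
            return state  # simulated crash: progress since the last checkpoint is lost
        state["step"] += 1
        state["total"] += state["step"]
        executed += 1
        if state["step"] % checkpoint_every == 0:
            save_checkpoint(state, path)
    save_checkpoint(state, path)
    return state

if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        ckpt = os.path.join(tmp, "job.ckpt")
        run_job(ckpt, total_steps=100, checkpoint_every=10, stop_after=35)
        print(run_job(ckpt, total_steps=100, checkpoint_every=10))
```

A smaller `checkpoint_every` reduces the work lost on a crash but multiplies the write traffic, which is the endurance pressure noted in the slide.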
  • 20. Opportunities: More Fundamental Shifts
      • Role of storage types in analytics optimizers and schedulers; superficially similar to DB query optimization
      • Exploit the more relaxed set of requirements on persistence
          • Even correctness can be relaxed
          • Example in compute land for flexibility in synchronization (HogWild! approach to SGD, plus Asynchronous SGD, etc.)
      • Leverage Persistent Memory to unify low latency streaming data requirements and high throughput batch data requirements
      • New(er) data types and repeatable access patterns
      • Converged systems with analytics and storage management for cross stack efficiency
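The relaxed synchronization mentioned for HogWild!-style SGD can be sketched in a few lines. This is a toy illustration in the spirit of that approach, not the published algorithm or its analysis: the sparse one-feature-per-example data, `TRUE_W`, the learning rate and the thread count are all assumptions. Several threads update a shared weight list with no locks, so reads may be stale and writes may occasionally overwrite each other, yet the weights still converge on this problem.

```python
import random
import threading

TRUE_W = [3.0, -2.0, 0.5, 1.5]  # hypothetical target weights

def make_examples(n, seed):
    """Sparse examples: each one touches a single coordinate i with feature value x."""
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        i = rng.randrange(len(TRUE_W))
        x = rng.uniform(0.5, 1.0)
        examples.append((i, x, TRUE_W[i] * x))
    return examples

def worker(weights, examples, lr):
    """SGD over this worker's examples, updating the shared weights without any lock."""
    for i, x, y in examples:
        grad = 2.0 * (weights[i] * x - y) * x  # the read of weights[i] may be stale
        weights[i] -= lr * grad                # unsynchronized read-modify-write

def hogwild_train(num_threads=4, examples_per_thread=2000, lr=0.1):
    weights = [0.0] * len(TRUE_W)  # shared by all threads
    threads = [
        threading.Thread(target=worker,
                         args=(weights, make_examples(examples_per_thread, seed=t), lr))
        for t in range(num_threads)
    ]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return weights

if __name__ == "__main__":
    print([round(w, 3) for w in hogwild_train()])
```

The exact interleaving of updates differs from run to run, which is the kind of relaxed correctness the slide suggests storage could also exploit.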
  • 21. Takeaways
      • The use of ML/DL in enterprise is in its infancy and expanding furiously
      • These apps put ever larger pressure on data management, latency, and throughput requirements
      • These apps also introduce another layer of abstraction and another layer of workload intelligence
          • Further away from block and file
      • Opportunities exist to significantly improve storage and memory for these use cases by understanding and exploiting their priorities and non-priorities for data