2. Brief introduction
• CTO & co-founder of Agile Lab
• Data & tech addict
• Contributor to Spark Notebook
• Spark early adopter
• Certified Cassandra Architect
• Deep Learning enthusiast
3. Who is Agile Lab?
GO BIG (data) or GO HOME
http://www.meetup.com/it-IT/Torino-Scala-Programming-Big-Data-Meetup/
4. What we do
Applications: high scalability
Decision Support Systems: data engineering, data mining and data «meaning»
Big Data Strategies
Training: Reactive, NoSQL, Big Data, Machine learning
7. What is Deep Learning
• Deep learning is just another name for artificial neural networks
• An algorithm is deep if the input is passed through several non-linearities before being output
• Deep learning is discovering the features that best represent the
problem, rather than just a way to combine them
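The "several non-linearities" idea can be illustrated with a minimal sketch in plain Python. The weights below are hypothetical, hand-picked values and nothing here is trained; the only point is that the input goes through one non-linear step (tanh) per layer before it is output.

```python
import math

def dense(inputs, weights, biases):
    """Affine step: one weighted sum per output unit."""
    return [sum(w * x for w, x in zip(row, inputs)) + b
            for row, b in zip(weights, biases)]

def tanh_layer(inputs, weights, biases):
    """Affine step followed by a non-linearity (tanh)."""
    return [math.tanh(z) for z in dense(inputs, weights, biases)]

def forward(inputs, layers):
    """Pass the input through every (weights, biases) layer in order."""
    activation = inputs
    for weights, biases in layers:
        activation = tanh_layer(activation, weights, biases)
    return activation

# Hypothetical weights: 2 inputs -> 3 hidden -> 2 hidden -> 1 output.
LAYERS = [
    ([[0.5, -0.2], [0.1, 0.8], [-0.7, 0.3]], [0.0, 0.1, -0.1]),
    ([[0.4, -0.6, 0.2], [0.3, 0.5, -0.9]], [0.05, -0.05]),
    ([[1.2, -0.7]], [0.0]),
]

# The input crosses three non-linearities before it becomes the output.
output = forward([1.0, 2.0], LAYERS)
```

A real network would learn the weights from data; that learning step is what lets the intermediate layers discover useful features.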
14. Target Environment
Prod Env Dev Env
Training
Data
Cleaning
ETL
Scheduling
ML Pipeline
- Track model performance over time
- Care about SLA
- Continuous tweaks
16. Easy Wins
Training pipeline should run on Spark or Hadoop
Trained model should be represented as Java objects
17. Vision: keep in mind Scaling
High-level dynamic languages are incredibly productive for prototyping and data exploration
Scaling to larger data sets quickly runs into performance limitations
Keep scaling requirements in mind from the beginning
18. Vision: simplify the pipeline
Copy & sample data from Dev Env to Data Scientist Env
Prototype in Python or R
Train model
Predict on validation Data
Translate model to match Prod Env (Java, MapReduce, Spark)
Deploy training pipeline and model
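The prototype half of these steps can be sketched in plain Python. Everything below is an illustrative assumption: the data is synthetic, the model is a one-variable least-squares line, and the JSON export only stands in for the "translate model" step that a Java or Spark job would have to read back.

```python
import json
import random

def sample(rows, fraction, seed=42):
    """Copy a random fraction of the rows (the 'copy & sample' step)."""
    rng = random.Random(seed)
    k = max(2, int(len(rows) * fraction))
    return rng.sample(rows, k)

def fit_line(rows):
    """Ordinary least squares for y = slope * x + intercept."""
    n = len(rows)
    mean_x = sum(x for x, _ in rows) / n
    mean_y = sum(y for _, y in rows) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in rows)
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in rows)
    slope = cov_xy / var_x
    return {"slope": slope, "intercept": mean_y - slope * mean_x}

def predict(model, x):
    return model["slope"] * x + model["intercept"]

def mean_squared_error(model, rows):
    return sum((predict(model, x) - y) ** 2 for x, y in rows) / len(rows)

# Synthetic stand-in for the Dev Env data: y = 3x + 1, no noise.
full_data = [(float(x), 3.0 * x + 1.0) for x in range(100)]

# Copy & sample, then split into training and validation rows.
subset = sample(full_data, fraction=0.2)
split = int(len(subset) * 0.75)
train_rows, validation_rows = subset[:split], subset[split:]

# Train the model and predict on validation data.
model = fit_line(train_rows)
validation_error = mean_squared_error(model, validation_rows)

# Language-neutral export (hypothetical format) for the production side.
exported = json.dumps(model)
```

Every hand-off in this sketch (sampling, export, re-implementation in production) is a place where the prototype and the production pipeline can drift apart.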
19. Easy Wins
Data scientists should work directly on the distributed environment
Data scientists and big data engineers should cooperate on the same platform
21. TensorFlow
Strengths:
- Powered By Google
- Nice UI
Weaknesses:
- Powered By Google
- No support for “inline” matrix operations
- Slow
Opportunities:
- Awesome community
Threats:
- No Scala or Java integration
- No commercial support
22. Theano
Strengths:
- Grand Daddy of deep learning
- RNN and CNN
- Computational graph abstraction
- Python
Weaknesses:
- No support for Hadoop or Spark
- No plug & play nets
Opportunities:
- Great community
Threats:
- No Scala or Java integration
- No commercial support
23. Torch
Strengths:
- GPU support
- Lots of pretrained models and packages
- Easy to use
Weaknesses:
- Lua language
Opportunities:
- Backed by DeepMind and Facebook
Threats:
- No Scala or Java integration
- No commercial support
24. Caffe
Strengths:
- C++ & Python
- Good Performance
- GPU Support
Weaknesses:
- Focused on image processing
Opportunities:
- Backed by Yahoo for Spark integration
- GPU clustering
Threats:
- No commercial support
25. DeepLearning4j
Strengths:
- GPU support
- Java and Scala
- Full DNN set
- Supports Hadoop, Spark & Akka
Weaknesses:
- Not for dummies
Opportunities:
- Commercial support - SkyMind
Threats:
- Not so sexy for data scientists because of Java/Scala
26. H2O
• Easy to use Web UI
• Multi language API
• Run directly on HDFS or S3
• Model is a Java POJO
• Big Data Ready
• Really Fast
• Compressed data
• Regularization
• Grid Search
• GPU support is still on the roadmap
• CNN and RNN are on the roadmap too
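Regularization and grid search can be shown together in a few lines of plain Python. This is a generic sketch, not H2O's API: the model is a one-weight ridge regression with a closed-form solution, the data is synthetic, and the function names are made up for illustration.

```python
def fit_ridge(rows, l2):
    """Closed-form 1-D ridge regression through the origin: y ~ w * x."""
    sxy = sum(x * y for x, y in rows)
    sxx = sum(x * x for x, _ in rows)
    return sxy / (sxx + l2)

def mse(w, rows):
    """Mean squared error of weight w on the given rows."""
    return sum((w * x - y) ** 2 for x, y in rows) / len(rows)

def grid_search(train_rows, validation_rows, l2_grid):
    """Try every candidate and keep the one with the lowest validation error."""
    results = []
    for l2 in l2_grid:
        w = fit_ridge(train_rows, l2)
        results.append((mse(w, validation_rows), l2, w))
    return min(results)

# Synthetic data: the true weight is 2.0; one noisy training point pulls w up.
train = [(1.0, 2.0), (2.0, 4.0), (3.0, 9.0)]
validation = [(1.0, 2.0), (2.0, 4.0), (4.0, 8.0)]

best_error, best_l2, best_w = grid_search(train, validation, [0.0, 1.0, 3.0, 10.0])
```

On this toy data the search selects an L2 penalty of 3.0: the penalty shrinks the weight back toward 2.0, while the largest penalty in the grid shrinks it too far.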
28. H2O – Sparkling Water
• Python, R and Scala API
• Best Kagglers use H2O
• Tons of tools for profiling and tuning
• Leverages Spark
• Best-in-class algorithms, battle tested
• Regularization
• Grid search
31. Spark as middleware
Using Spark as middleware, you can leverage:
• Deeplearning4J
• H2O
• TensorFlow (Arimo extension)
• Caffe (Yahoo extension)
• Spark ML multilayer perceptrons and future implementations
No technology-provider lock-in
32. Our Stack for Enterprise
• Ready for Enterprise and Hadoop World
• Deployable into Java Env
• Notebook (Flow)
• H2O for out of the box algorithms
• DeepLearning4J for advanced DNN and n-dimensional array manipulation
• Good usability for both data scientists and big data engineers
• Enterprise Support along the whole stack