Building a Recommender System for Publications using the Vector Space Model and Python: In recent years it has become very common to have access to a large number of publications on similar or related topics. Recommender systems for publications are needed to locate appropriate articles within a large body of work on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users' search profiles. The first type, content-based recommendation, can efficiently search for material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user's. The content-based recommender uses a Vector Space Model to rank PubMed articles by the similarity of their content. The second mechanism, implemented with Python libraries and frameworks, finds the profile similarity between users and recommends additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommender systems, and discuss the implementation of this PubMed recommendation system with examples.
2. A little about me
• Data Scientist at the Texas Advanced Computing Center (TACC)
• Contact: atrivedi@tacc.utexas.edu
• TACC is an independent research center at UT Austin
• TACC operates one of the largest HIPAA-compliant supercomputing centers
• ~250 faculty, researchers, students and staff
• We provide support for large-scale computing problems
3. Some Basic Observations
There are fundamental differences in data access patterns between Data Intensive Computing and High Performance Computing (HPC)
Today, most ML researchers want or need to work with big data, vectorization, code optimization, etc.
4. Data Intensive Computing
Specialized in dealing effectively with vast quantities of data in distributed environments
Generates high demand for computational resources, e.g. storage capacity, processing power, etc.
5. Data Intensive Computing & Big Data
Big data plays the key role in the popularity and growth of data-intensive computing:
Increases the volume of data
Improves the accuracy of existing algorithms
Helps create better predictive models
Increases the complexity
7. Big Data Analysis requires even more computational resources
Storage is triple the standard data size
Algorithms use large numbers of data points and are memory-intensive
Big data analysis takes much longer:
A typical hard-drive read speed is about 150 MB/sec
At that rate, reading 1 TB takes ~2 hours
Analysis can require processing time proportional to the size of the data
Analysis at a rate of 1 MB/sec would take ~11 days to finish for 1 TB of data
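A quick sanity check of these figures in code (note that the ~11-day estimate corresponds to an analysis rate of about 1 MB/sec):

```python
# Back-of-the-envelope timing for 1 TB of data.
TB = 10**12            # 1 terabyte in bytes
read_speed = 150e6     # ~150 MB/sec typical hard-drive read speed

hours_to_read = TB / read_speed / 3600        # ~1.85 hours to read 1 TB

analysis_rate = 1e6    # analysis throughput of 1 MB/sec
days_to_analyze = TB / analysis_rate / 86400  # ~11.6 days to analyze 1 TB
print(round(hours_to_read, 2), round(days_to_analyze, 1))
```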
8. High Performance Computing (HPC)
Hardware with more computational power per compute node
Computation can be done with multiple nodes
Provides highly efficient numeric processing in distributed environments
HPC has seen a recent growth in shared-memory architectures
10. Combine HPC & Data Intensive Computing
The intersection of these two domains is mainly driven by the use of machine learning (ML)
ML methodologies help extract knowledge from big data
These hybrid environments:
take advantage of data locality
keep the data exchanges over the network at a manageable level
offer high performance through distributed libraries
11. TACC Ecosystem
Stampede – Traditional cluster HPC system
Stockyard and Corral – 25 petabytes of combined disk storage for all data needs
Ranch – 160 petabytes of tape archive storage
Maverick/Rustler/Rodeo – "Niche" systems with GPU clusters, great for data analytics and visualization
Wrangler – A new generation of data-intensive supercomputer
12. TACC Ecosystem Goals
Goal: address the data problem in multiple dimensions
Supports data at large and small scales
Supports data reliability
Supports data security
Supports multiple data types: structured and unstructured
Supports sequential access
Fast for large files
Goal: support a wide range of applications and interfaces
Hadoop (and Mahout) & Spark (and MLlib)
Traditional R, GIS, databases, and other HPC-style workflows
Goal: support the full data lifecycle
Metadata and collection management support
13. Why use TACC Supercomputers?
Need to analyze large datasets quickly
Need a more on-demand, interactive analysis environment
Need to work with databases at high transaction rates
Have a Hadoop or Spark workflow that needs a large HDFS datastore
Have a dataset that many users will compute with or analyze
Need a system with data management capabilities
Have a job that is currently IO-bound
17. Available ML tools/libraries on TACC Supercomputers
Scikit-learn
Caffe
Theano
CUDA/cuDNN
Hadoop: PyHadoop, RHadoop, Mahout
Spark: PySpark, SparkR, MLlib
18. Two Sample ML workflows on TACC Supercomputers
GPU-powered deep learning on MRI images with NVIDIA DIGITS on the Maverick supercomputer
PubMed recommender system on the Wrangler supercomputer
19. Deep Learning on Images
Deep neural networks are computationally quite demanding
Even a modest image resolution produces a large input layer:
256 x 256 RGB pixels imply 196,608 input neurons (256 x 256 x 3)
Many of the floating-point matrix operations involved can be offloaded to GPUs
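The input-size arithmetic is easy to verify:

```python
# A 256 x 256 RGB image flattened into a network's input layer:
# one neuron per pixel per color channel.
width, height, channels = 256, 256, 3
input_neurons = width * height * channels
print(input_neurons)  # 196608
```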
20. Deep Learning on MRI using TACC Supercomputers
Maverick has large GPU clusters
Three major GPU-based deep learning frameworks are available – Theano, Torch and Caffe
We use NVIDIA DIGITS (based on Caffe), a web server providing a convenient web interface for training and testing deep neural networks
For classification of MRI images we use a convolutional DNN to learn the features
We use CUDA 7, cuDNN, Caffe and DIGITS on Maverick to classify our MRI images
Over the course of 30 epochs, our classification accuracy ranges from 74.21% to 82.09%
22. What is a Recommendation System?
A recommender system helps match users with items
It relies on implicit or explicit user feedback or item suggestions
Our recommendation system:
We build a model that recommends PubMed documents to users based on their search profiles
23. Types of Recommender System

Knowledge-based (i.e., search)
  Pros: deterministic recommendations, assured quality, no cold start
  Cons: knowledge-engineering effort to bootstrap, basically static

Content-based
  Pros: no community required, comparison between items possible
  Cons: content descriptions necessary, cold start for new users

Collaborative
  Pros: no knowledge-engineering effort, serendipity of results
  Cons: requires some form of rating feedback, cold start for new users and new items
24. Using the Vector Space Model (VSM) for PubMed
Given:
A set of PubMed documents
N features (unique terms) describing the documents in the set
VSM builds an N-dimensional vector space
Each item/document is represented as a point in the vector space
Information retrieval is based on search:
A query is a point in the vector space
We apply TF-IDF to the tokenized documents to weight them and convert them to vectors
We compute the cosine similarity between the tokenized documents and the query term
We select the top 3 documents matching our query
We weight the query term in the sparse matrix and rank the documents
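The content-based pipeline above can be sketched with scikit-learn; the document texts below are illustrative placeholders, not actual PubMed abstracts:

```python
# TF-IDF weighting plus cosine similarity against a query,
# over a small in-memory corpus of placeholder documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "gene expression profiling in breast cancer",
    "deep learning for medical image segmentation",
    "breast cancer risk factors and gene mutations",
    "protein folding prediction with neural networks",
]
query = "breast cancer gene"

vectorizer = TfidfVectorizer()              # builds the N-dimensional vector space
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
top3 = scores.argsort()[::-1][:3]           # indices of the 3 best-matching documents
for i in top3:
    print(f"{scores[i]:.3f}  {docs[i]}")
```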
25. MPI or Hadoop or Spark?
Which is really more suitable for this ML problem on an HPC system?
26. Message Passing in HPC
The Message Passing Interface (MPI) was one of the key factors that supported the initial growth of cluster computing
MPI helped shape what the HPC world has become today
MPI has supported a substantial majority of all supercomputing work
Scientists and engineers have relied on MPI for decades
MPI works great for data-intensive computing in a GPU cluster
27. Why MPI is not the best tool for ML
A researcher/developer working with MPI needs to manually decompose the common data structures across processors
Every update of a data structure must be recast into a flurry of messages, syncs, and data exchanges
Programming at the transport layer is an awkward fit for numerical application developers
This led to the advent of other techniques
28. Choosing Hadoop over MPI
Hadoop is an open-source implementation of the MapReduce programming model in Java
It has interfaces to other programming languages such as R, Python, etc.
Hadoop includes:
HDFS: a distributed file system based on the Google File System (GFS)
YARN: a resource manager that assigns resources to computational tasks
MapReduce: a library that enables efficient distributed data processing
Mahout: a scalable machine learning and data mining library
Hadoop Streaming: a generic API that allows writing mappers and reducers in any language
Hadoop is a good fit for large single-pass data processing, but has its own limitations
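A minimal sketch of the Hadoop Streaming idea in Python (a toy word count, not the talk's actual job); in a real job the mapper and reducer would be separate scripts reading stdin and writing stdout:

```python
import itertools

def mapper(lines):
    # Emit a (word, 1) pair for every word, as a streaming mapper would print
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key, so equal keys arrive adjacent
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Simulate the shuffle/sort phase locally with sorted()
counts = dict(reducer(sorted(mapper(["to be or", "not to be"]))))
print(counts)
```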
29. Limitations of Hadoop in HPC
Hadoop mandates writing Map/Reduce output to disk after every Map/Reduce stage
In HPC, this disk output could be sped up with caching or SSDs
In general, this renders Hadoop unusable for many ML approaches that require iteration or interactive use
The real issue with Hadoop is its HDFS file system, which is intimately tied to Hadoop cluster scheduling
The large-scale ML community sought in-memory approaches to avoid this problem
30. Spark
For large-scale technical computing, one very promising in-memory approach is Spark
Spark does not impose Hadoop's rigid Map/Reduce structure
Spark can run standalone, without a scheduler like YARN
It has interfaces to other programming languages such as R, Python, etc.
Spark supports HDFS through YARN
MLlib: a scalable machine learning and data mining library
Spark Streaming: enables stream processing of live data streams
31. Our Recommendation Model
We apply collaborative filtering on the weighted/ranked documents
We use Alternating Least Squares (pyspark.mllib.recommendation.ALS) for recommending PubMed documents
MatrixFactorizationModel.recommendProducts(user_id, num_products)
We use collaborative filtering in Scikit-learn and Hadoop as baselines:
We use the python-recsys library along with Scikit-learn
svd.recommend(product_id)
We use Mahout's Alternating Least Squares for Hadoop
A comparative study of our model shows improved performance in Spark
32. Performance Evaluation of the PubMed Recommendation Model
We evaluate our recommendation model using Python Scikit-learn, Apache Mahout and PySpark MLlib on Wrangler
The recommendation model uses Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) for evaluation
The lower the errors, the more accurate the model
The lower the time taken to train/test the model, the better the performance

Algorithm type: recommendation | Dataset: weighted PubMed documents

Python ML library     RMSE      MAE       Model training time  Model test time
Python Scikit-learn   17.96%    16.53%    42 secs              19 secs
Hadoop Mahout         16.02%    14.98%    38 secs              14 secs
PySpark MLlib         15.88%    14.23%    34 secs              11 secs
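The two metrics are straightforward to compute; the ratings below are illustrative, not the experiment's actual predictions:

```python
import math

# Placeholder actual vs. predicted ratings
actual    = [3.0, 4.0, 5.0, 2.0]
predicted = [2.5, 4.0, 4.5, 3.0]

errors = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / len(errors)          # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean square error
print(mae, rmse)
```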