Building a Recommender System for Publications using the Vector Space Model and Python: In recent years it has become very common to have access to a large number of publications on similar or related topics. Recommender systems for publications are needed to locate appropriate articles within a large body of work on the same or similar topics. In this talk, I will describe a recommender system framework for PubMed articles. PubMed is a free search engine that primarily accesses the MEDLINE database of references and abstracts on life-sciences and biomedical topics. The proposed recommender system produces two types of recommendations: (i) content-based recommendations and (ii) recommendations based on similarities with other users' search profiles. The first type, content-based recommendation, can efficiently search for material that is similar in context or topic to the input publication. The second mechanism generates recommendations using the search history of users whose search profiles match the current user's. The content-based recommender uses a Vector Space Model to rank PubMed articles by the similarity of their content. The second mechanism, implemented with Python libraries and frameworks, finds the profile similarity between users and recommends additional publications based on the history of the most similar user. In the talk I will present the background and motivation for these recommender systems, and discuss the implementation of this PubMed recommendation system with examples.
2. A little about me
• Data Scientist at the Texas Advanced Computing Center (TACC)
• Contact: atrivedi@tacc.utexas.edu
• TACC is an independent research center at UT Austin
• TACC operates one of the largest HIPAA-compliant supercomputing centers
• ~250 faculty, researchers, students and staff
• We provide support for large-scale computing problems
3. Some Basic Observations
There are fundamental differences in data access patterns between Data Intensive Computing and High Performance Computing (HPC)
Today, most ML researchers want or need to work with big data, vectorization, code optimization, etc.
4. Data Intensive Computing
Specialized in dealing effectively with vast quantities of data in distributed environments
Generates high demand for computational resources, e.g. storage capacity, processing power, etc.
5. Data Intensive Computing & Big Data
Big data plays the key role in the popularity and growth of data-intensive computing:
Increases the volume of data
Improves the accuracy of existing algorithms
Helps create better predictive models
Increases the complexity
7. Big Data Analysis requires even more computational resources
Storage is triple the standard data size
Algorithms use large numbers of data points and are memory-intensive
Big data analysis takes much longer:
A typical hard-drive read speed is about 150 MB/sec
At that rate, reading 1 TB takes ~2 hours
Analysis can require processing time proportional to the size of the data
Analysis at a rate of 1 MB/sec would take ~11 days to finish for 1 TB of data
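A quick sanity check of these figures in code (note that the ~11-day estimate corresponds to an analysis rate of about 1 MB/sec):

```python
# Back-of-the-envelope timing for 1 TB of data.
TB = 10**12            # 1 terabyte in bytes
read_speed = 150e6     # ~150 MB/sec typical hard-drive read speed

hours_to_read = TB / read_speed / 3600        # ~1.85 hours to read 1 TB

analysis_rate = 1e6    # analysis throughput of 1 MB/sec
days_to_analyze = TB / analysis_rate / 86400  # ~11.6 days to analyze 1 TB
print(round(hours_to_read, 2), round(days_to_analyze, 1))
```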
8. High Performance Computing (HPC)
Hardware with more computational power per compute node
Computation can be done with multiple nodes
Provides highly efficient numeric processing in distributed environments
HPC has seen a recent growth in shared-memory architectures
10. Combine HPC & Data Intensive Computing
The intersection of these two domains is mainly driven by the use of machine learning (ML)
ML methodologies help extract knowledge from big data
These hybrid environments:
take advantage of data locality
keep the data exchanges over the network at a manageable level
offer high performance through distributed libraries
11. TACC Ecosystem
Stampede – Traditional cluster HPC system
Stockyard and Corral – 25 petabytes of combined disk storage for all data needs
Ranch – 160 petabytes of tape archive storage
Maverick/Rustler/Rodeo – "Niche" systems with GPU clusters, great for data analytics and visualization
Wrangler – A new generation of data-intensive supercomputer
12. TACC Ecosystem Goals
Goal: address the data problem in multiple dimensions
Supports data at large and small scales
Supports data reliability
Supports data security
Supports multiple data types: structured and unstructured
Supports sequential access
Fast for large files
Goal: support a wide range of applications and interfaces
Hadoop (and Mahout) & Spark (and MLlib)
Traditional R, GIS, databases, and other HPC-style workflows
Goal: support the full data lifecycle
Metadata and collection management support
13. Why use TACC Supercomputers?
Need to analyze large datasets quickly
Need a more on-demand, interactive analysis environment
Need to work with databases at high transaction rates
Have a Hadoop or Spark workflow that needs a large HDFS datastore
Have a dataset that many users will compute with or analyze
Need a system with data management capabilities
Have a job that is currently IO-bound
17. Available ML tools/libraries on TACC Supercomputers
Scikit-learn
Caffe
Theano
CUDA/cuDNN
Hadoop: PyHadoop, RHadoop, Mahout
Spark: PySpark, SparkR, MLlib
18. Two Sample ML workflows on TACC Supercomputers
GPU-powered deep learning on MRI images with NVIDIA DIGITS on the Maverick supercomputer
PubMed recommender system on the Wrangler supercomputer
19. Deep Learning on Images
Deep neural networks are computationally quite demanding
Even a modest image resolution produces a large input layer:
256 x 256 RGB pixels imply 196,608 input neurons (256 x 256 x 3)
Many of the floating-point matrix operations involved can be offloaded to GPUs
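The input-size arithmetic is easy to verify:

```python
# A 256 x 256 RGB image flattened into a network's input layer:
# one neuron per pixel per color channel.
width, height, channels = 256, 256, 3
input_neurons = width * height * channels
print(input_neurons)  # 196608
```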
20. Deep Learning on MRI using TACC Supercomputers
Maverick has large GPU clusters
Three major GPU-based deep learning frameworks are available – Theano, Torch and Caffe
We use NVIDIA DIGITS (based on Caffe), a web server providing a convenient web interface for training and testing deep neural networks
For classification of MRI images we use a convolutional DNN to learn the features
We use CUDA 7, cuDNN, Caffe and DIGITS on Maverick to classify our MRI images
Over the course of 30 epochs, our classification accuracy ranges from 74.21% to 82.09%
22. What is a Recommendation System?
A recommender system helps match users with items
It relies on implicit or explicit user feedback or item suggestions
Our recommendation system:
We build a model that recommends PubMed documents to users based on their search profiles
23. Types of Recommender System

Knowledge-based (i.e., search)
  Pros: deterministic recommendations, assured quality, no cold start
  Cons: knowledge-engineering effort to bootstrap, basically static

Content-based
  Pros: no community required, comparison between items possible
  Cons: content descriptions necessary, cold start for new users

Collaborative
  Pros: no knowledge-engineering effort, serendipity of results
  Cons: requires some form of rating feedback, cold start for new users and new items
24. Using the Vector Space Model (VSM) for PubMed
Given:
A set of PubMed documents
N features (unique terms) describing the documents in the set
VSM builds an N-dimensional vector space
Each item/document is represented as a point in the vector space
Information retrieval is based on search:
A query is a point in the vector space
We apply TF-IDF to the tokenized documents to weight them and convert them to vectors
We compute the cosine similarity between the tokenized documents and the query term
We select the top 3 documents matching our query
We weight the query term in the sparse matrix and rank the documents
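The content-based pipeline above can be sketched with scikit-learn; the document texts below are illustrative placeholders, not actual PubMed abstracts:

```python
# TF-IDF weighting plus cosine similarity against a query,
# over a small in-memory corpus of placeholder documents.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

docs = [
    "gene expression profiling in breast cancer",
    "deep learning for medical image segmentation",
    "breast cancer risk factors and gene mutations",
    "protein folding prediction with neural networks",
]
query = "breast cancer gene"

vectorizer = TfidfVectorizer()              # builds the N-dimensional vector space
doc_vectors = vectorizer.fit_transform(docs)
query_vector = vectorizer.transform([query])

scores = cosine_similarity(query_vector, doc_vectors).ravel()
top3 = scores.argsort()[::-1][:3]           # indices of the 3 best-matching documents
for i in top3:
    print(f"{scores[i]:.3f}  {docs[i]}")
```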
25. MPI or Hadoop or Spark?
Which is really more suitable for this ML problem on an HPC system?
26. Message Passing in HPC
The Message Passing Interface (MPI) was one of the key factors that supported the initial growth of cluster computing
MPI helped shape what the HPC world has become today
MPI has supported a substantial majority of all supercomputing work
Scientists and engineers have relied on MPI for decades
MPI works great for data-intensive computing in a GPU cluster
27. Why MPI is not the best tool for ML
A researcher/developer working with MPI needs to manually decompose the common data structures across processors
Every update of a data structure must be recast into a flurry of messages, syncs, and data exchanges
Programming at the transport layer is an awkward fit for numerical application developers
This led to the advent of other techniques
28. Choosing Hadoop over MPI
Hadoop is an open-source implementation of the MapReduce programming model in Java
It has interfaces to other programming languages such as R, Python, etc.
Hadoop includes:
HDFS: a distributed file system based on the Google File System (GFS)
YARN: a resource manager that assigns resources to computational tasks
MapReduce: a library that enables efficient distributed data processing
Mahout: a scalable machine learning and data mining library
Hadoop Streaming: a generic API that allows writing mappers and reducers in any language
Hadoop is a good fit for large single-pass data processing, but has its own limitations
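A minimal sketch of the Hadoop Streaming idea in Python (a toy word count, not the talk's actual job); in a real job the mapper and reducer would be separate scripts reading stdin and writing stdout:

```python
import itertools

def mapper(lines):
    # Emit a (word, 1) pair for every word, as a streaming mapper would print
    for line in lines:
        for word in line.split():
            yield word, 1

def reducer(pairs):
    # Hadoop sorts mapper output by key, so equal keys arrive adjacent
    for word, group in itertools.groupby(pairs, key=lambda kv: kv[0]):
        yield word, sum(count for _, count in group)

# Simulate the shuffle/sort phase locally with sorted()
counts = dict(reducer(sorted(mapper(["to be or", "not to be"]))))
print(counts)
```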
29. Limitations of Hadoop in HPC
Hadoop mandates writing Map/Reduce output to disk after every Map/Reduce stage
In HPC, this disk output could be sped up with caching or SSDs
In general, this renders Hadoop unusable for many ML approaches that require iteration or interactive use
The real issue with Hadoop is its HDFS file system, which is intimately tied to Hadoop cluster scheduling
The large-scale ML community sought in-memory approaches to avoid this problem
30. Spark
For large-scale technical computing, one very promising in-memory approach is Spark
Spark does not impose Hadoop's rigid Map/Reduce structure
Spark can run standalone, without a scheduler like YARN
It has interfaces to other programming languages such as R, Python, etc.
Spark supports HDFS through YARN
MLlib: a scalable machine learning and data mining library
Spark Streaming: enables stream processing of live data streams
31. Our Recommendation Model
We apply collaborative filtering on the weighted/ranked documents
We use Alternating Least Squares (pyspark.mllib.recommendation.ALS) for recommending PubMed documents
MatrixFactorizationModel.recommendProducts(user_id, num_products)
We use collaborative filtering in Scikit-learn and Hadoop as baselines:
We use the python-recsys library along with Scikit-learn
svd.recommend(product_id)
We use Mahout's Alternating Least Squares for Hadoop
A comparative study of our model shows improved performance in Spark
32. Performance Evaluation of the PubMed Recommendation Model
We evaluate our recommendation model using Python Scikit-learn, Apache Mahout and PySpark MLlib on Wrangler
The recommendation model uses Root Mean Square Error (RMSE) and Mean Absolute Error (MAE) for evaluation
The lower the errors, the more accurate the model
The lower the time taken to train/test the model, the better the performance

Algorithm type: recommendation | Dataset: weighted PubMed documents

Python ML library     RMSE      MAE       Model training time  Model test time
Python Scikit-learn   17.96%    16.53%    42 secs              19 secs
Hadoop Mahout         16.02%    14.98%    38 secs              14 secs
PySpark MLlib         15.88%    14.23%    34 secs              11 secs
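The two metrics are straightforward to compute; the ratings below are illustrative, not the experiment's actual predictions:

```python
import math

# Placeholder actual vs. predicted ratings
actual    = [3.0, 4.0, 5.0, 2.0]
predicted = [2.5, 4.0, 4.5, 3.0]

errors = [a - p for a, p in zip(actual, predicted)]
mae  = sum(abs(e) for e in errors) / len(errors)          # mean absolute error
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))  # root mean square error
print(mae, rmse)
```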