SlideShare a Scribd company logo
1 of 59
Download to read offline
Discovering Users’ Topics of Interest
in Recommender Systems
Gabriel Moreira - @gspmoreira
Machine Learning SP
Lead Data Scientist DSc. student
Agenda
● Recommender Systems concepts
● Topic Modeling concepts
● Case: Smart Canvas - Corporative Collaboration
● Wrap up
Recommender Systems
An introduction
Life is too short!
Social recommendations
Recommendations by interaction
What should I
watch today?
What do you
like, my son?
"A lot of times, people don’t know what they want
until you show it to them."
Steve Jobs
"We are leaving the Information Age and entering the
Recommendation Age.".
Cris Anderson, "The long tail"
38% of sales
2/3 views
Recommendations are responsible for...
38% of top news
visualization
What else may I recommend?
What can a Recommender Systems do?
2 - Prediction
Given an item, what is its relevance for
each user?
1 - Recommendation
Given a user, produce an ordered list matching the
user needs
How it works
Content-Based Filtering
Advantages
● Does not depend upon other users
● May recommend new and unpopular items
● Recommendations can be easily explained
Drawbacks
● Overspecialization
● May not recommend to new users
May be difficult to extract attributes from audio,
movies or images
Content-Based Filtering
User-Based Collaborative Filtering
Item-Based Collaborative Filtering
Collaborative Filtering
Advantages
● Works to any item kind (ignore attributes)
Drawbacks
● Usually recommends more popular items
● Cold-start
○ Cannot recommend items not already
rated/consumed
○ Needs a minimum amount of users to match
similar users
Hybrid Recommender Systems
Composite
Iterates by a chain of algorithm, aggregating
recommendations.
Weighted
Each algorithm has as a weight and the final
recommendations are defined by weighted averages.
Some approaches
UBCF Example (Java / Mahout)
// Loads user-item ratings
DataModel model = new FileDataModel(new File("input.csv"));
// Defines a similarity metric to compare users (Person's correlation coefficient)
UserSimilarity similarity = new PearsonCorrelationSimilarity(model);
// Threshold the minimum similarity to consider two users similar
UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity,
model);
// Create a User-Based Collaborative Filtering recommender
UserBasedRecommender recommender = new
GenericUserBasedRecommender(model, neighborhood, similarity);
// Return the top 3 recommendations for userId=2
List recommendations = recommender.recommend(2, 3);
User,Item,Rating1,
15,4.0
1,16,5.0
1,17,1.0
1,18,5.0
2,10,1.0
2,11,2.0
2,15,5.0
2,16,4.5
2,17,1.0
2,18,5.0
3,11,2.5
input.csv
User-Based Collaborative Filtering example (Mahout)
1
2
3
4
5
6
https://mahout.apache.org/users/recommender/userbased-5-minutes.html
Frameworks - Recommender Systems
Python
Python / ScalaJava
.NET
Java
Books
Topic Modeling
An introduction
Topic Modeling
A simple way to analyze topics of large text collections (corpus).
What are this documents about?
What are the main topics?
Topic
A Cluster of words that generally occur together and are related.
Example 2D topics visualization with pyLDAviz (topics are the circles in the left, main terms of selected topic are shown in the right)
Topics evolution
Visualization of topics evolution during across time
Topic Modeling applications
● Detect trends in the market and customer challenges of a industry from media
and social networks.
● Marketing SEO: Optimization of search keywords to improve page ranking in
searchers
● Identify users preferences to recommend relevant content, products, jobs, …
● Documents are composed by many topics
● Topics are composed by many words (tokens)
Topic Modeling
Documents Topics Words (Tokens)
Observable Latent Observable
Unsupervised learning technique, which does not require too much effort on
pre-processing (usually just Text Vectorization). In general, the only parameter is
the number of topics (k).
Topics in a document
Topics example from NYT
Topic analysis (LDA) from 1.8 millions of New York Times articles
How it works
Text vectorization
Represent each document as a feature vector in the vector space, where each
position represents a word (token) and the contained value is its relevance in the
document.
● BoW (Bag of words)
● TF-IDF (Term Frequency - Inverse Document Frequency)
Document Term Matrix - Bag of Words
Text vectorization
TF-IDF example with scikit-learn
from sklearn.feature_extraction.text import TfidfVectorizer
vectorizer = TfidfVectorizer(max_df=0.5, max_features=1000,
min_df=2, stop_words='english')
tfidf_corpus = vectorizer.fit_transform(text_corpus)
face person guide lock cat dog sleep micro pool gym
0 1 2 3 4 5 6 7 8 9
D1 0.05 0.25
D2 0.02 0.32 0.45
...
...
tokens
documents
TF-IDF sparse matrix example
Text vectorization
“Did you ever wonder how great it would be if you could write your jmeter tests in ruby ? This projects aims to
do so. If you use it on your project just let me now. On the Architecture Academy you can read how jmeter can
be used to validate your Architecture. definition | architecture validation | academia de arquitetura”
Example
Tokens (unigrams and bigrams) Weight
jmeter 0.466
architecture 0.380
validate 0.243
validation 0.242
definition 0.239
write 0.225
academia arquitetura 0.218
academy 0.216
ruby 0.213
tests 0.209
Relevant keywords (TF-IDF)
Visualization of the average TF-IDF vectors of posts at Google+ for Work (CI&T)
Text vectorization
See the more details about this social and text analytics at http://bit.ly/python4ds_nb
Main techniques
● Latent Dirichlet Allocation (LDA) -> Probabilistic
● Latent Semantic Indexing / Analysis (LSI / LSA) -> Matrix Factorization
● Non-Negative Matrix Factorization (NMF) -> Matrix Factorization
Topic Modeling example (Python / gensim)
from gensim import corpora, models, similarities
documents = ["Human machine interface for lab abc computer applications",
"A survey of user opinion of computer system response time",
"The EPS user interface management system", ...]
stoplist = set('for a of the and to in'.split())
texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents]
dictionary = corpora.Dictionary(texts)
corpus = [dictionary.doc2bow(text) for text in texts]
tfidf = models.TfidfModel(corpus)
corpus_tfidf = tfidf[corpus]
lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2)
corpus_lsi = lsi[corpus_tfidf]
lsi.print_topics(2)
topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + ...
topic #1(1.476): 0.460*"system" + 0.373*"user" + 0.332*"eps" + 0.328*"interface" + 0.320*"response" + ...
Example of topic modeling using LSI technique
Example of the 2 topics discovered in the corpus
1
2
3
4
5
6
7
8
9
10
11
from pyspark.mllib.clustering import LDA, LDAModel
from pyspark.mllib.linalg import Vectors
# Load and parse the data
data = sc.textFile("data/mllib/sample_lda_data.txt")
parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')]))
# Index documents with unique IDs
corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache()
# Cluster the documents into three topics using LDA
ldaModel = LDA.train(corpus, k=3)
Topic Modeling example (PySpark MLLib LDA)
Sometimes simpler is better!
Scikit-learn topic models worked better for us than Spark MLlib LDA distributed
implementation because we have thousands of users with hundreds of posts,
and running scikit-learn on workers for each person was much quicker than
running distributed Spark LDA for each person.
1
2
3
4
5
6
Cosine similarity
This similarity metric between two vectors is the cosine of the angle between them
from sklearn.metrics.pairwise import cosine_similarity
cosine_similarity(tfidf_matrix[0:1], tfidf_matrix)
Example with scikit-learn
Example of relevant keywords for a person and people with similar interests
People similarity
Frameworks - Topic Modeling
Python Java
Stanford Topic Modeling Toolbox (Excel plugin)
http://nlp.stanford.edu/software/tmt/tmt-0.4/
Python / Scala
Smart Canvas©
Corporate Collaboration
http://www.smartcanvas.com/
Powered by Recommender Systems and Topic Modeling techniques
Case
Content recommendations
Discover channel - Content recommendations with personalized explanations, based on user’s topics of
interest (discovered from their contributed content and reads)
Person topics of interest
User profile - Topics of interest of users are from the content that they contribute and are presented as
tags in their profile.
Searching people interested in topics / experts...
User discovered tags are searchable, allowing to find experts or people with specific interests.
Similar people
People recommendation - Recommends people with similar interests, explaining which topics are shared.
How it works
Collaboration Graph Model
Smart Canvas ©
Graph Model: Dashed lines and bold
labels are the relationships inferred by usage of RecSys
and Topic Modeling techniques
Architecture Overview
Jobs Pipeline
Content
Vectorization
People
Topic Modeling
Boards
Topic Modeling
Contents
Recommendation
Boards
Recommendation
Similar ContentGoogle
Cloud
Storage
Google
Cloud
Storage
pickles
pickles
pickles
loads
loads
loads
loads
loads
Stores topics and
recommendations in the graph
1
2
3
4
5
6
loads
content
loads
touchpoints
1. Groups all touchpoints (user interactions) by person
2. For each person
a. Selects contents user has interacted
b. Clusters contents TF-IDF vectors to model (e.g. LDA, NMF, Kmeans) person’ topics of interest
(varying the k from a range (1-5) and selecting the model whose clusters best describe
cohesive topics, penalizing large k)
c. Weights clusters relevance (to the user) by summing touchpoints strength (view, like,
bookmark, …) of each cluster, with a time decay on interaction age (older interactions are less
relevant to current person interest)
d. Returns highly relevant topics vectors for the person,labeled by its three top keywords
3. For each people topic
a. Calculate the cosine similarity (optionally applying Pivoted Unique Pivoted Normalization)
among the topic vector and content TF-IDF vectors
b. Recommends more similar contents to person user topics, which user has not yet
interacted (using Bloom Filter)
People Topic Modeling and Content
Recommendations algorithm
Bloom Filter
● A space-efficient Probabilistic Data Structure used to test whether an element is a
member of a set.
● False positive matches are possible, but false negatives are not.
● An empty Bloom filter is a bit array of m bits, all set to 0.
● Each inserted key is mapped to bits (which are all set to 1) by k different hash functions.
The set {x,y,z} was inserted in Bloom Filter. w is not in the
set, because it hashes to one bit equal to 0
Bloom Filter - Example Using cross_bloomfilter - Pure Python and Java compatible implementation
GitHub: http://bit.ly/cross_bf
Bloom Filter - Complexity Analysis
Ordered List
Space Complexity: O(n)
where n is the number of items in the set
Check for key presence
Time Complexity (binary search): O(log(n))
assuming an ordered list
Example: For a list of 100,000 string keys, with an average length of 18 characters
Key list size: 1,788,900 bytes
Bloom filter size: 137,919 bytes (false positive rate = 0.5%)
Bloom filter size: 18,043 bytes (false positive rate = 5%)
(and you can gzip it!)
Bloom Filter
Space Complexity: O(t * ln(p))
where t is the maximum number of items to be
inserted and p the accepted false positive rate
Check for key presence
Time Complexity: O(k)
where k is a constant representing the number of
hash functions to be applied to a string key
● Distributed Graph Database
● Elastic and linear scalability for a growing data and user base
● Support for ACID and eventual consistency
● Backended by Cassandra or HBase
● Support for global graph data analytics, reporting, and ETL through integration
with Spark, Hadoop, Giraph
● Support for geo and full text search (ElasticSearch, Lucene, Solr)
● Native integration with TinkerPop (Gremlin query language, Gremlin Server, ...)
Apache TinkerPop
Gremlin traversals
g.V().has('CONTENT','id', 'Post:123')
.outE('IS SIMILAR TO')
.has('strength', gte(0.1)).as('e').inV().as('v')
.select('e','v')
.map{['id': it.get().v.values('id').next(),
'strength': it.get().e.values('strength').next()]}
.limit(10)
Querying similar contents
[{‘id’: ‘Post:456’, ‘strength’: 0.672},
{‘id’: ‘Post:789’, ‘strength’: 0.453},
{‘id’: ‘Post:333’, ‘strength’: 0.235},
...]
Example output:
Content Content
IS SIMILAR TO
1
2
3
4
5
6
7
Gremlin traversals
g.V().has('PERSON','id','Person:gabrielpm@ciandt.com')
.out('IS ABOUT').has('number', 1)
.inE('IS RECOMMENDED TO').filter{ it.get().value('strength') >= 0.1}
.order().by('strength',decr).as('tc').outV().as('c')
.select('tc','c')
.map{ ['contentId': it.get().c.value('id'),
'recommendationStrength': it.get().tc.value('strength') *
getTimeDecayFactor(it.get().c.value('updatedOn')) ] }
Querying recommended contents for a person
[{‘contentId’: ‘Post:456’, ‘strength’: 0.672},
{‘contentId’: ‘Post:789’, ‘strength’: 0.453},
{‘contentId’: ‘Post:333’, ‘strength’: 0.235},
...]
Example output:
Person Topic
IS ABOUT
Content
IS RECOMMENDED TO
1
2
3
4
5
6
7
8
Recommender
Systems Topic Modeling
Improvement of
user experience
for greater
engagement
Users’ interests
segmentation
Datasets
summarization
Personalized
recommendations
Wrap up
Thanks! Gabriel Moreira
Lead Data Scientist
We are hiring!
bit.ly/ds4cit

More Related Content

What's hot

Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014StampedeCon
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningJames Ward
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine LearningRahul Jain
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!OSCON Byrum
 
Moving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedMoving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedJonathan Mugan
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine LearningJeff Tanner
 
Introduction to Machine learning ppt
Introduction to Machine learning pptIntroduction to Machine learning ppt
Introduction to Machine learning pptshubhamshirke12
 
Imtiaz khan data_science_analytics
Imtiaz khan data_science_analyticsImtiaz khan data_science_analytics
Imtiaz khan data_science_analyticsimtiaz khan
 
Azure Machine Learning Intro
Azure Machine Learning IntroAzure Machine Learning Intro
Azure Machine Learning IntroDamir Dobric
 
DagdelenSiriwardaneY..
DagdelenSiriwardaneY..DagdelenSiriwardaneY..
DagdelenSiriwardaneY..butest
 
Stock prediction using social network
Stock prediction using social networkStock prediction using social network
Stock prediction using social networkChanon Hongsirikulkit
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learningKnoldus Inc.
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network HamdaAnees
 
What is Machine Learning?
What is Machine Learning?What is Machine Learning?
What is Machine Learning?SwiftKeyComms
 
Programmer information needs after memory failure
Programmer information needs after memory failureProgrammer information needs after memory failure
Programmer information needs after memory failureBhagyashree Deokar
 
Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Nikola Milosevic
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachinePulse
 

What's hot (20)

Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014Making Machine Learning Work in Practice - StampedeCon 2014
Making Machine Learning Work in Practice - StampedeCon 2014
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
ML Basics
ML BasicsML Basics
ML Basics
 
Introduction to Machine Learning
Introduction to Machine LearningIntroduction to Machine Learning
Introduction to Machine Learning
 
Hands on Mahout!
Hands on Mahout!Hands on Mahout!
Hands on Mahout!
 
Machine Learning
Machine LearningMachine Learning
Machine Learning
 
Moving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow ExtendedMoving Your Machine Learning Models to Production with TensorFlow Extended
Moving Your Machine Learning Models to Production with TensorFlow Extended
 
Primer to Machine Learning
Primer to Machine LearningPrimer to Machine Learning
Primer to Machine Learning
 
Introduction to Machine learning ppt
Introduction to Machine learning pptIntroduction to Machine learning ppt
Introduction to Machine learning ppt
 
Cognitive automation
Cognitive automationCognitive automation
Cognitive automation
 
Imtiaz khan data_science_analytics
Imtiaz khan data_science_analyticsImtiaz khan data_science_analytics
Imtiaz khan data_science_analytics
 
Azure Machine Learning Intro
Azure Machine Learning IntroAzure Machine Learning Intro
Azure Machine Learning Intro
 
DagdelenSiriwardaneY..
DagdelenSiriwardaneY..DagdelenSiriwardaneY..
DagdelenSiriwardaneY..
 
Stock prediction using social network
Stock prediction using social networkStock prediction using social network
Stock prediction using social network
 
Introduction to Machine learning
Introduction to Machine learningIntroduction to Machine learning
Introduction to Machine learning
 
Machine Learning Project - Neural Network
Machine Learning Project - Neural Network Machine Learning Project - Neural Network
Machine Learning Project - Neural Network
 
What is Machine Learning?
What is Machine Learning?What is Machine Learning?
What is Machine Learning?
 
Programmer information needs after memory failure
Programmer information needs after memory failureProgrammer information needs after memory failure
Programmer information needs after memory failure
 
Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)Machine learning (ML) and natural language processing (NLP)
Machine learning (ML) and natural language processing (NLP)
 
Machine Learning and Real-World Applications
Machine Learning and Real-World ApplicationsMachine Learning and Real-World Applications
Machine Learning and Real-World Applications
 

Similar to Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine Learning SP 4th edition

ML crash course
ML crash courseML crash course
ML crash coursemikaelhuss
 
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insightsLaboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insightsCarla Marini
 
ClassifyingIssuesFromSRTextAzureML
ClassifyingIssuesFromSRTextAzureMLClassifyingIssuesFromSRTextAzureML
ClassifyingIssuesFromSRTextAzureMLGeorge Simov
 
Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação Gabriel Moreira
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentationrenjan131
 
Embracing OOUX for Better Projects and Happier Teams
Embracing OOUX for Better Projects and Happier TeamsEmbracing OOUX for Better Projects and Happier Teams
Embracing OOUX for Better Projects and Happier TeamsCaroline Sober-James
 
Major_Project_Presentaion_B14.pptx
Major_Project_Presentaion_B14.pptxMajor_Project_Presentaion_B14.pptx
Major_Project_Presentaion_B14.pptxLokeshKumarReddy8
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics DomainDrjabez
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsBen DeMott
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahoutaneeshabakharia
 
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Karen Thompson
 
Bt8901 objective oriented systems1
Bt8901 objective oriented systems1Bt8901 objective oriented systems1
Bt8901 objective oriented systems1Techglyphs
 
Generation of Automatic Code using Design Patterns
Generation of Automatic Code using Design PatternsGeneration of Automatic Code using Design Patterns
Generation of Automatic Code using Design PatternsIRJET Journal
 
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
professional fuzzy type-ahead rummage around in xml  type-ahead search techni...professional fuzzy type-ahead rummage around in xml  type-ahead search techni...
professional fuzzy type-ahead rummage around in xml type-ahead search techni...Kumar Goud
 
Understanding and Conceptualizing interaction - Mary Margarat
Understanding and Conceptualizing interaction  - Mary MargaratUnderstanding and Conceptualizing interaction  - Mary Margarat
Understanding and Conceptualizing interaction - Mary MargaratMary Margarat
 
Structured design: Modular style for modern content
Structured design: Modular style for modern contentStructured design: Modular style for modern content
Structured design: Modular style for modern contentChristopher Hess
 
Recommendation system (1).pptx
Recommendation system (1).pptxRecommendation system (1).pptx
Recommendation system (1).pptxprathammishra28
 

Similar to Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine Learning SP 4th edition (20)

ML crash course
ML crash courseML crash course
ML crash course
 
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insightsLaboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
Laboratorio Master BI&BDA (Modulo Web Data Analytics) : Reddit fashion insights
 
ClassifyingIssuesFromSRTextAzureML
ClassifyingIssuesFromSRTextAzureMLClassifyingIssuesFromSRTextAzureML
ClassifyingIssuesFromSRTextAzureML
 
Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação Sistemas de Recomendação sem Enrolação
Sistemas de Recomendação sem Enrolação
 
Recsys 2016
Recsys 2016Recsys 2016
Recsys 2016
 
Map Reduce amrp presentation
Map Reduce amrp presentationMap Reduce amrp presentation
Map Reduce amrp presentation
 
Embracing OOUX for Better Projects and Happier Teams
Embracing OOUX for Better Projects and Happier TeamsEmbracing OOUX for Better Projects and Happier Teams
Embracing OOUX for Better Projects and Happier Teams
 
Major_Project_Presentaion_B14.pptx
Major_Project_Presentaion_B14.pptxMajor_Project_Presentaion_B14.pptx
Major_Project_Presentaion_B14.pptx
 
Profile Analysis of Users in Data Analytics Domain
Profile Analysis of   Users in Data Analytics DomainProfile Analysis of   Users in Data Analytics Domain
Profile Analysis of Users in Data Analytics Domain
 
The need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementationsThe need for sophistication in modern search engine implementations
The need for sophistication in modern search engine implementations
 
Orchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache MahoutOrchestrating the Intelligent Web with Apache Mahout
Orchestrating the Intelligent Web with Apache Mahout
 
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
Cis 555 Week 4 Assignment 2 Automated Teller Machine (Atm)...
 
Bt8901 objective oriented systems1
Bt8901 objective oriented systems1Bt8901 objective oriented systems1
Bt8901 objective oriented systems1
 
Generation of Automatic Code using Design Patterns
Generation of Automatic Code using Design PatternsGeneration of Automatic Code using Design Patterns
Generation of Automatic Code using Design Patterns
 
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
professional fuzzy type-ahead rummage around in xml  type-ahead search techni...professional fuzzy type-ahead rummage around in xml  type-ahead search techni...
professional fuzzy type-ahead rummage around in xml type-ahead search techni...
 
Understanding and Conceptualizing interaction - Mary Margarat
Understanding and Conceptualizing interaction  - Mary MargaratUnderstanding and Conceptualizing interaction  - Mary Margarat
Understanding and Conceptualizing interaction - Mary Margarat
 
Introduction
IntroductionIntroduction
Introduction
 
Patterns Overview
Patterns OverviewPatterns Overview
Patterns Overview
 
Structured design: Modular style for modern content
Structured design: Modular style for modern contentStructured design: Modular style for modern content
Structured design: Modular style for modern content
 
Recommendation system (1).pptx
Recommendation system (1).pptxRecommendation system (1).pptx
Recommendation system (1).pptx
 

More from Gabriel Moreira

[Phd Thesis Defense] CHAMELEON: A Deep Learning Meta-Architecture for News Re...
[Phd Thesis Defense] CHAMELEON: A Deep Learning Meta-Architecture for News Re...[Phd Thesis Defense] CHAMELEON: A Deep Learning Meta-Architecture for News Re...
[Phd Thesis Defense] CHAMELEON: A Deep Learning Meta-Architecture for News Re...Gabriel Moreira
 
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...Gabriel Moreira
 
Deep Learning for Recommender Systems @ TDC SP 2019
Deep Learning for Recommender Systems @ TDC SP 2019Deep Learning for Recommender Systems @ TDC SP 2019
Deep Learning for Recommender Systems @ TDC SP 2019Gabriel Moreira
 
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...Gabriel Moreira
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data ScienceGabriel Moreira
 
Deep Recommender Systems - PAPIs.io LATAM 2018
Deep Recommender Systems - PAPIs.io LATAM 2018Deep Recommender Systems - PAPIs.io LATAM 2018
Deep Recommender Systems - PAPIs.io LATAM 2018Gabriel Moreira
 
CI&T Tech Summit 2017 - Machine Learning para Sistemas de Recomendação
CI&T Tech Summit 2017 - Machine Learning para Sistemas de RecomendaçãoCI&T Tech Summit 2017 - Machine Learning para Sistemas de Recomendação
CI&T Tech Summit 2017 - Machine Learning para Sistemas de RecomendaçãoGabriel Moreira
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Gabriel Moreira
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsGabriel Moreira
 
Python for Data Science - Python Brasil 11 (2015)
Python for Data Science - Python Brasil 11 (2015)Python for Data Science - Python Brasil 11 (2015)
Python for Data Science - Python Brasil 11 (2015)Gabriel Moreira
 
Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Gabriel Moreira
 
Using Neural Networks and 3D sensors data to model LIBRAS gestures recognitio...
Using Neural Networks and 3D sensors data to model LIBRAS gestures recognitio...Using Neural Networks and 3D sensors data to model LIBRAS gestures recognitio...
Using Neural Networks and 3D sensors data to model LIBRAS gestures recognitio...Gabriel Moreira
 
Developing GeoGames for Education with Kinect and Android for ArcGIS Runtime
Developing GeoGames for Education with Kinect and Android for ArcGIS RuntimeDeveloping GeoGames for Education with Kinect and Android for ArcGIS Runtime
Developing GeoGames for Education with Kinect and Android for ArcGIS RuntimeGabriel Moreira
 
Dojo Imagem de Android - 19/06/2012
Dojo Imagem de Android - 19/06/2012Dojo Imagem de Android - 19/06/2012
Dojo Imagem de Android - 19/06/2012Gabriel Moreira
 
Agile Testing e outros amendoins
Agile Testing e outros amendoinsAgile Testing e outros amendoins
Agile Testing e outros amendoinsGabriel Moreira
 
ArcGIS Runtime For Android
ArcGIS Runtime For AndroidArcGIS Runtime For Android
ArcGIS Runtime For AndroidGabriel Moreira
 
EARLY-FIX: Um Framework para Predição de Manutenção Corretiva de Software uti...
EARLY-FIX: Um Framework para Predição de Manutenção Corretiva de Software uti...EARLY-FIX: Um Framework para Predição de Manutenção Corretiva de Software uti...
EARLY-FIX: Um Framework para Predição de Manutenção Corretiva de Software uti...Gabriel Moreira
 
Continuous Inspection - An effective approch towards Software Quality Product...
Continuous Inspection - An effective approch towards Software Quality Product...Continuous Inspection - An effective approch towards Software Quality Product...
Continuous Inspection - An effective approch towards Software Quality Product...Gabriel Moreira
 
An Investigation Of EXtreme Programming Practices
An Investigation Of EXtreme Programming PracticesAn Investigation Of EXtreme Programming Practices
An Investigation Of EXtreme Programming PracticesGabriel Moreira
 

More from Gabriel Moreira (20)

[Phd Thesis Defense] CHAMELEON: A Deep Learning Meta-Architecture for News Re...
[Phd Thesis Defense] CHAMELEON: A Deep Learning Meta-Architecture for News Re...[Phd Thesis Defense] CHAMELEON: A Deep Learning Meta-Architecture for News Re...
[Phd Thesis Defense] CHAMELEON: A Deep Learning Meta-Architecture for News Re...
 
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
 
Deep Learning for Recommender Systems @ TDC SP 2019
Deep Learning for Recommender Systems @ TDC SP 2019Deep Learning for Recommender Systems @ TDC SP 2019
Deep Learning for Recommender Systems @ TDC SP 2019
 
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
PAPIs LATAM 2019 - Training and deploying ML models with Kubeflow and TensorF...
 
Introduction to Data Science
Introduction to Data ScienceIntroduction to Data Science
Introduction to Data Science
 
Deep Recommender Systems - PAPIs.io LATAM 2018
Deep Recommender Systems - PAPIs.io LATAM 2018Deep Recommender Systems - PAPIs.io LATAM 2018
Deep Recommender Systems - PAPIs.io LATAM 2018
 
CI&T Tech Summit 2017 - Machine Learning para Sistemas de Recomendação
CI&T Tech Summit 2017 - Machine Learning para Sistemas de RecomendaçãoCI&T Tech Summit 2017 - Machine Learning para Sistemas de Recomendação
CI&T Tech Summit 2017 - Machine Learning para Sistemas de Recomendação
 
Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017Feature Engineering - Getting most out of data for predictive models - TDC 2017
Feature Engineering - Getting most out of data for predictive models - TDC 2017
 
Feature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive modelsFeature Engineering - Getting most out of data for predictive models
Feature Engineering - Getting most out of data for predictive models
 
Python for Data Science - Python Brasil 11 (2015)
Python for Data Science - Python Brasil 11 (2015)Python for Data Science - Python Brasil 11 (2015)
Python for Data Science - Python Brasil 11 (2015)
 
Python for Data Science - TDC 2015
Python for Data Science - TDC 2015Python for Data Science - TDC 2015
Python for Data Science - TDC 2015
 
Python for Data Science
Python for Data SciencePython for Data Science
Python for Data Science
 
Using Neural Networks and 3D sensors data to model LIBRAS gestures recognitio...
Using Neural Networks and 3D sensors data to model LIBRAS gestures recognitio...Using Neural Networks and 3D sensors data to model LIBRAS gestures recognitio...
Using Neural Networks and 3D sensors data to model LIBRAS gestures recognitio...
 
Developing GeoGames for Education with Kinect and Android for ArcGIS Runtime
Developing GeoGames for Education with Kinect and Android for ArcGIS RuntimeDeveloping GeoGames for Education with Kinect and Android for ArcGIS Runtime
Developing GeoGames for Education with Kinect and Android for ArcGIS Runtime
 
Dojo Imagem de Android - 19/06/2012
Dojo Imagem de Android - 19/06/2012Dojo Imagem de Android - 19/06/2012
Dojo Imagem de Android - 19/06/2012
 
Agile Testing e outros amendoins
Agile Testing e outros amendoinsAgile Testing e outros amendoins
Agile Testing e outros amendoins
 
ArcGIS Runtime For Android
ArcGIS Runtime For AndroidArcGIS Runtime For Android
ArcGIS Runtime For Android
 
EARLY-FIX: Um Framework para Predição de Manutenção Corretiva de Software uti...
EARLY-FIX: Um Framework para Predição de Manutenção Corretiva de Software uti...EARLY-FIX: Um Framework para Predição de Manutenção Corretiva de Software uti...
EARLY-FIX: Um Framework para Predição de Manutenção Corretiva de Software uti...
 
Continuous Inspection - An effective approch towards Software Quality Product...
Continuous Inspection - An effective approch towards Software Quality Product...Continuous Inspection - An effective approch towards Software Quality Product...
Continuous Inspection - An effective approch towards Software Quality Product...
 
An Investigation Of EXtreme Programming Practices
An Investigation Of EXtreme Programming PracticesAn Investigation Of EXtreme Programming Practices
An Investigation Of EXtreme Programming Practices
 

Recently uploaded

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxLoriGlavin3
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxLoriGlavin3
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024Lorenzo Miniero
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfMounikaPolabathina
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxNavinnSomaal
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxLoriGlavin3
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024BookNet Canada
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersRaghuram Pandurangan
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxLoriGlavin3
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity PlanDatabarracks
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfLoriGlavin3
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningLars Bell
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 

Recently uploaded (20)

Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptxUse of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
Use of FIDO in the Payments and Identity Landscape: FIDO Paris Seminar.pptx
 
The State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptxThe State of Passkeys with FIDO Alliance.pptx
The State of Passkeys with FIDO Alliance.pptx
 
SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024SIP trunking in Janus @ Kamailio World 2024
SIP trunking in Janus @ Kamailio World 2024
 
What is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdfWhat is DBT - The Ultimate Data Build Tool.pdf
What is DBT - The Ultimate Data Build Tool.pdf
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
SAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptxSAP Build Work Zone - Overview L2-L3.pptx
SAP Build Work Zone - Overview L2-L3.pptx
 
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptxDigital Identity is Under Attack: FIDO Paris Seminar.pptx
Digital Identity is Under Attack: FIDO Paris Seminar.pptx
 
DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: Loan Stars - Tech Forum 2024
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Generative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information DevelopersGenerative AI for Technical Writer or Information Developers
Generative AI for Technical Writer or Information Developers
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptxMerck Moving Beyond Passwords: FIDO Paris Seminar.pptx
Merck Moving Beyond Passwords: FIDO Paris Seminar.pptx
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
How to write a Business Continuity Plan
How to write a Business Continuity PlanHow to write a Business Continuity Plan
How to write a Business Continuity Plan
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Moving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdfMoving Beyond Passwords: FIDO Paris Seminar.pdf
Moving Beyond Passwords: FIDO Paris Seminar.pdf
 
DSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine TuningDSPy a system for AI to Write Prompts and Do Fine Tuning
DSPy a system for AI to Write Prompts and Do Fine Tuning
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 

Discovering User's Topics of Interest in Recommender Systems @ Meetup Machine Learning SP 4th edition

  • 1. Discovering Users’ Topics of Interest in Recommender Systems Gabriel Moreira - @gspmoreira Machine Learning SP Lead Data Scientist DSc. student
  • 2. Agenda ● Recommender Systems concepts ● Topic Modeling concepts ● Case: Smart Canvas - Corporative Collaboration ● Wrap up
  • 4. Life is too short!
  • 6. Recommendations by interaction What should I watch today? What do you like, my son?
  • 7. "A lot of times, people don’t know what they want until you show it to them." Steve Jobs
  • 8. "We are leaving the Information Age and entering the Recommendation Age.". Cris Anderson, "The long tail"
  • 9. 38% of sales 2/3 views Recommendations are responsible for... 38% of top news visualization
  • 10. What else may I recommend?
  • 11. What can a Recommender Systems do? 2 - Prediction Given an item, what is its relevance for each user? 1 - Recommendation Given a user, produce an ordered list matching the user needs
  • 14. Advantages ● Does not depend upon other users ● May recommend new and unpopular items ● Recommendations can be easily explained Drawbacks ● Overspecialization ● May not recommend to new users May be difficult to extract attributes from audio, movies or images Content-Based Filtering
  • 17. Collaborative Filtering Advantages ● Works to any item kind (ignore attributes) Drawbacks ● Usually recommends more popular items ● Cold-start ○ Cannot recommend items not already rated/consumed ○ Needs a minimum amount of users to match similar users
  • 18. Hybrid Recommender Systems Composite Iterates by a chain of algorithm, aggregating recommendations. Weighted Each algorithm has as a weight and the final recommendations are defined by weighted averages. Some approaches
  • 19. UBCF Example (Java / Mahout) // Loads user-item ratings DataModel model = new FileDataModel(new File("input.csv")); // Defines a similarity metric to compare users (Person's correlation coefficient) UserSimilarity similarity = new PearsonCorrelationSimilarity(model); // Threshold the minimum similarity to consider two users similar UserNeighborhood neighborhood = new ThresholdUserNeighborhood(0.1, similarity, model); // Create a User-Based Collaborative Filtering recommender UserBasedRecommender recommender = new GenericUserBasedRecommender(model, neighborhood, similarity); // Return the top 3 recommendations for userId=2 List recommendations = recommender.recommend(2, 3); User,Item,Rating1, 15,4.0 1,16,5.0 1,17,1.0 1,18,5.0 2,10,1.0 2,11,2.0 2,15,5.0 2,16,4.5 2,17,1.0 2,18,5.0 3,11,2.5 input.csv User-Based Collaborative Filtering example (Mahout) 1 2 3 4 5 6 https://mahout.apache.org/users/recommender/userbased-5-minutes.html
  • 20. Frameworks - Recommender Systems Python Python / ScalaJava .NET Java
  • 21. Books
  • 23. Topic Modeling A simple way to analyze topics of large text collections (corpus). What are this documents about? What are the main topics?
  • 24. Topic A Cluster of words that generally occur together and are related. Example 2D topics visualization with pyLDAviz (topics are the circles in the left, main terms of selected topic are shown in the right)
  • 25. Topics evolution Visualization of topics evolution during across time
  • 26. Topic Modeling applications ● Detect trends in the market and customer challenges of a industry from media and social networks. ● Marketing SEO: Optimization of search keywords to improve page ranking in searchers ● Identify users preferences to recommend relevant content, products, jobs, …
  • 27. ● Documents are composed by many topics ● Topics are composed by many words (tokens) Topic Modeling Documents Topics Words (Tokens) Observable Latent Observable Unsupervised learning technique, which does not require too much effort on pre-processing (usually just Text Vectorization). In general, the only parameter is the number of topics (k).
  • 28. Topics in a document
  • 29. Topics example from NYT Topic analysis (LDA) from 1.8 millions of New York Times articles
  • 31. Text vectorization Represent each document as a feature vector in the vector space, where each position represents a word (token) and the contained value is its relevance in the document. ● BoW (Bag of words) ● TF-IDF (Term Frequency - Inverse Document Frequency) Document Term Matrix - Bag of Words
  • 33. TF-IDF example with scikit-learn from sklearn.feature_extraction.text import TfidfVectorizer vectorizer = TfidfVectorizer(max_df=0.5, max_features=1000, min_df=2, stop_words='english') tfidf_corpus = vectorizer.fit_transform(text_corpus) face person guide lock cat dog sleep micro pool gym 0 1 2 3 4 5 6 7 8 9 D1 0.05 0.25 D2 0.02 0.32 0.45 ... ... tokens documents TF-IDF sparse matrix example
  • 34. Text vectorization “Did you ever wonder how great it would be if you could write your jmeter tests in ruby ? This projects aims to do so. If you use it on your project just let me now. On the Architecture Academy you can read how jmeter can be used to validate your Architecture. definition | architecture validation | academia de arquitetura” Example Tokens (unigrams and bigrams) Weight jmeter 0.466 architecture 0.380 validate 0.243 validation 0.242 definition 0.239 write 0.225 academia arquitetura 0.218 academy 0.216 ruby 0.213 tests 0.209 Relevant keywords (TF-IDF)
  • 35. Visualization of the average TF-IDF vectors of posts at Google+ for Work (CI&T) Text vectorization See the more details about this social and text analytics at http://bit.ly/python4ds_nb
  • 36. Main techniques ● Latent Dirichlet Allocation (LDA) -> Probabilistic ● Latent Semantic Indexing / Analysis (LSI / LSA) -> Matrix Factorization ● Non-Negative Matrix Factorization (NMF) -> Matrix Factorization
  • 37. Topic Modeling example (Python / gensim) from gensim import corpora, models, similarities documents = ["Human machine interface for lab abc computer applications", "A survey of user opinion of computer system response time", "The EPS user interface management system", ...] stoplist = set('for a of the and to in'.split()) texts = [[word for word in document.lower().split() if word not in stoplist] for document in documents] dictionary = corpora.Dictionary(texts) corpus = [dictionary.doc2bow(text) for text in texts] tfidf = models.TfidfModel(corpus) corpus_tfidf = tfidf[corpus] lsi = models.LsiModel(corpus_tfidf, id2word=dictionary, num_topics=2) corpus_lsi = lsi[corpus_tfidf] lsi.print_topics(2) topic #0(1.594): 0.703*"trees" + 0.538*"graph" + 0.402*"minors" + 0.187*"survey" + ... topic #1(1.476): 0.460*"system" + 0.373*"user" + 0.332*"eps" + 0.328*"interface" + 0.320*"response" + ... Example of topic modeling using LSI technique Example of the 2 topics discovered in the corpus 1 2 3 4 5 6 7 8 9 10 11
  • 38. from pyspark.mllib.clustering import LDA, LDAModel from pyspark.mllib.linalg import Vectors # Load and parse the data data = sc.textFile("data/mllib/sample_lda_data.txt") parsedData = data.map(lambda line: Vectors.dense([float(x) for x in line.strip().split(' ')])) # Index documents with unique IDs corpus = parsedData.zipWithIndex().map(lambda x: [x[1], x[0]]).cache() # Cluster the documents into three topics using LDA ldaModel = LDA.train(corpus, k=3) Topic Modeling example (PySpark MLLib LDA) Sometimes simpler is better! Scikit-learn topic models worked better for us than Spark MLlib LDA distributed implementation because we have thousands of users with hundreds of posts, and running scikit-learn on workers for each person was much quicker than running distributed Spark LDA for each person. 1 2 3 4 5 6
  • 39. Cosine similarity This similarity metric between two vectors is the cosine of the angle between them from sklearn.metrics.pairwise import cosine_similarity cosine_similarity(tfidf_matrix[0:1], tfidf_matrix) Example with scikit-learn
  • 40. Example of relevant keywords for a person and people with similar interests People similarity
  • 41. Frameworks - Topic Modeling Python Java Stanford Topic Modeling Toolbox (Excel plugin) http://nlp.stanford.edu/software/tmt/tmt-0.4/ Python / Scala
  • 42. Smart Canvas© Corporate Collaboration http://www.smartcanvas.com/ Powered by Recommender Systems and Topic Modeling techniques Case
  • 43. Content recommendations Discover channel - Content recommendations with personalized explanations, based on user’s topics of interest (discovered from their contributed content and reads)
  • 44. Person topics of interest User profile - Topics of interest of users are from the content that they contribute and are presented as tags in their profile.
  • 45. Searching people interested in topics / experts... User discovered tags are searchable, allowing to find experts or people with specific interests.
  • 46. Similar people People recommendation - Recommends people with similar interests, explaining which topics are shared.
  • 48. Collaboration Graph Model Smart Canvas © Graph Model: Dashed lines and bold labels are the relationships inferred by usage of RecSys and Topic Modeling techniques
  • 50. Jobs Pipeline Content Vectorization People Topic Modeling Boards Topic Modeling Contents Recommendation Boards Recommendation Similar ContentGoogle Cloud Storage Google Cloud Storage pickles pickles pickles loads loads loads loads loads Stores topics and recommendations in the graph 1 2 3 4 5 6 loads content loads touchpoints
  • 51. 1. Groups all touchpoints (user interactions) by person 2. For each person a. Selects contents user has interacted b. Clusters contents TF-IDF vectors to model (e.g. LDA, NMF, Kmeans) person’ topics of interest (varying the k from a range (1-5) and selecting the model whose clusters best describe cohesive topics, penalizing large k) c. Weights clusters relevance (to the user) by summing touchpoints strength (view, like, bookmark, …) of each cluster, with a time decay on interaction age (older interactions are less relevant to current person interest) d. Returns highly relevant topics vectors for the person,labeled by its three top keywords 3. For each people topic a. Calculate the cosine similarity (optionally applying Pivoted Unique Pivoted Normalization) among the topic vector and content TF-IDF vectors b. Recommends more similar contents to person user topics, which user has not yet interacted (using Bloom Filter) People Topic Modeling and Content Recommendations algorithm
  • 52. Bloom Filter ● A space-efficient Probabilistic Data Structure used to test whether an element is a member of a set. ● False positive matches are possible, but false negatives are not. ● An empty Bloom filter is a bit array of m bits, all set to 0. ● Each inserted key is mapped to bits (which are all set to 1) by k different hash functions. The set {x,y,z} was inserted in Bloom Filter. w is not in the set, because it hashes to one bit equal to 0
  • 53. Bloom Filter - Example Using cross_bloomfilter - Pure Python and Java compatible implementation GitHub: http://bit.ly/cross_bf
  • 54. Bloom Filter - Complexity Analysis Ordered List Space Complexity: O(n) where n is the number of items in the set Check for key presence Time Complexity (binary search): O(log(n)) assuming an ordered list Example: For a list of 100,000 string keys, with an average length of 18 characters Key list size: 1,788,900 bytes Bloom filter size: 137,919 bytes (false positive rate = 0.5%) Bloom filter size: 18,043 bytes (false positive rate = 5%) (and you can gzip it!) Bloom Filter Space Complexity: O(t * ln(p)) where t is the maximum number of items to be inserted and p the accepted false positive rate Check for key presence Time Complexity: O(k) where k is a constant representing the number of hash functions to be applied to a string key
  • 55. ● Distributed Graph Database ● Elastic and linear scalability for a growing data and user base ● Support for ACID and eventual consistency ● Backended by Cassandra or HBase ● Support for global graph data analytics, reporting, and ETL through integration with Spark, Hadoop, Giraph ● Support for geo and full text search (ElasticSearch, Lucene, Solr) ● Native integration with TinkerPop (Gremlin query language, Gremlin Server, ...) Apache TinkerPop
  • 56. Gremlin traversals g.V().has('CONTENT','id', 'Post:123') .outE('IS SIMILAR TO') .has('strength', gte(0.1)).as('e').inV().as('v') .select('e','v') .map{['id': it.get().v.values('id').next(), 'strength': it.get().e.values('strength').next()]} .limit(10) Querying similar contents [{‘id’: ‘Post:456’, ‘strength’: 0.672}, {‘id’: ‘Post:789’, ‘strength’: 0.453}, {‘id’: ‘Post:333’, ‘strength’: 0.235}, ...] Example output: Content Content IS SIMILAR TO 1 2 3 4 5 6 7
  • 57. Gremlin traversals g.V().has('PERSON','id','Person:gabrielpm@ciandt.com') .out('IS ABOUT').has('number', 1) .inE('IS RECOMMENDED TO').filter{ it.get().value('strength') >= 0.1} .order().by('strength',decr).as('tc').outV().as('c') .select('tc','c') .map{ ['contentId': it.get().c.value('id'), 'recommendationStrength': it.get().tc.value('strength') * getTimeDecayFactor(it.get().c.value('updatedOn')) ] } Querying recommended contents for a person [{‘contentId’: ‘Post:456’, ‘strength’: 0.672}, {‘contentId’: ‘Post:789’, ‘strength’: 0.453}, {‘contentId’: ‘Post:333’, ‘strength’: 0.235}, ...] Example output: Person Topic IS ABOUT Content IS RECOMMENDED TO 1 2 3 4 5 6 7 8
  • 58. Recommender Systems Topic Modeling Improvement of user experience for greater engagement Users’ interests segmentation Datasets summarization Personalized recommendations Wrap up
  • 59. Thanks! Gabriel Moreira Lead Data Scientist We are hiring! bit.ly/ds4cit