Do you get the feeling of ‘the cart before the horse’ on hearing buzzwords like social data mining or sentiment analysis? Fundamental text mining methods are the real ‘workhorses’ behind these buzzwords. This presentation aims to explain those fundamentals in plain English.
6. The main idea - Packing our bags : Checks
● Starting R
● Loading required packages
● Check sessionInfo( )
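A minimal sketch of these checks, run from an R session. The slides do not name the packages they load, so the package list here is a placeholder (two packages that ship with R) and should be swapped for whatever the session actually needs.

```r
# Placeholder package list: the slides do not say which packages are required.
required <- c("stats", "utils")

# requireNamespace() returns TRUE/FALSE without attaching the package,
# so it is a safe way to check availability before calling library().
available <- vapply(required, requireNamespace, logical(1), quietly = TRUE)
print(available)

# Attach the packages that are available.
for (pkg in required[available]) {
  library(pkg, character.only = TRUE)
}

# sessionInfo() reports the R version, platform and attached packages.
info <- sessionInfo()
print(info$R.version$version.string)
```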
7. Packing our bags : Datatypes
● Atomic vectors
● Lists
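A small sketch of the two datatypes, using made-up values. The variable names are illustrative only.

```r
# Atomic vector: every element has the same type.
scores <- c(4.5, 2.0, 3.5)
words  <- c("great", "terrible", "ok")

# Mixing types in c() coerces everything to the most general type (character here).
mixed <- c(1, "a", TRUE)

# List: elements may have different types and lengths.
review <- list(id = 1L, text = "Great location", tokens = c("great", "location"))

# [ ] keeps the container type; [[ ]] or $ extracts a single element.
first_score <- scores[1]
tokens      <- review$tokens
sub_list    <- review["text"]
```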
"Let's try our hands"
8. Packing our bags : Functions
● Expressions which are evaluated
● Can be passed around
● Definitions can be nested
Details not covered : argument matching, call by value, environments and lexical scoping, promises, etc.
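The three points about functions can be tried out with a short sketch. The function names (`count_words`, `make_scaler`) and the sample strings are invented for illustration.

```r
# A function is an ordinary object that can be assigned to a name.
count_words <- function(text) {
  length(strsplit(text, "\\s+")[[1]])
}

# Functions can be passed around: here count_words is an argument to sapply().
reviews <- c("great location", "terrible service and disorganised")
word_counts <- sapply(reviews, count_words, USE.NAMES = FALSE)

# Definitions can be nested: make_scaler() defines and returns an inner function.
make_scaler <- function(factor) {
  function(x) x * factor
}
twice   <- make_scaler(2)
doubled <- twice(word_counts)
```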
10. Prep camp : Sentiment Analysis
● Bag of words model
● Simple aggregated score
' terrible service & disorganised '
' OK - some good some bad '
' Great location, fabulous staff '
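A rough sketch of a bag-of-words score for the three example reviews. The word lists are a tiny hypothetical lexicon made up for this example; real lexicons contain thousands of words.

```r
# Hypothetical mini-lexicon, for illustration only.
positive <- c("good", "great", "fabulous")
negative <- c("bad", "terrible", "disorganised")

reviews <- c("terrible service & disorganised",
             "OK - some good some bad",
             "Great location, fabulous staff")

score_review <- function(text) {
  # Bag of words: lowercase, split on non-letters, ignore word order.
  tokens <- strsplit(tolower(text), "[^a-z]+")[[1]]
  tokens <- tokens[tokens != ""]
  # Simple aggregated score: positive word count minus negative word count.
  sum(tokens %in% positive) - sum(tokens %in% negative)
}

scores <- vapply(reviews, score_review, numeric(1), USE.NAMES = FALSE)
print(scores)  # -2  0  2
```

With this lexicon the first review scores negative, the second is neutral because one good and one bad word cancel out, and the third scores positive.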
11. Prep camp : Improvements
● Part of speech ambiguity
● Further exploration ?
● Equal weightage model
● Double negations ?
13. Wandering traveller : Unsupervised Learning
● Entity as point in space
● Can define distance
How to derive this model for text ?
[Figure: entities plotted as points, with axes Feature 1 and Feature 2]
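To make "entity as point in space" concrete, here is a small sketch with two made-up entities and two numeric features. Euclidean distance is just one possible choice of distance.

```r
# Two hypothetical entities described by two numeric features.
entities <- rbind(a = c(feature1 = 1, feature2 = 2),
                  b = c(feature1 = 4, feature2 = 6))

# Euclidean distance computed by hand ...
manual <- sqrt(sum((entities["a", ] - entities["b", ])^2))

# ... and with the built-in dist(), which returns all pairwise distances.
d <- dist(entities, method = "euclidean")
print(manual)  # 5
```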
14. Wandering traveller : Vector Space Model
[Figure: documents (comments, blogs, tweets) as points in a space whose axes are features (word, phrase, theme)]
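One way to sketch the vector space model in base R is a document-term matrix, where each document becomes a vector of word counts. The three short documents are invented, and only single words are used as features (no phrases or themes).

```r
docs <- c(d1 = "great camera great price",
          d2 = "camera with touch screen",
          d3 = "washing machine price")

# Tokenise each document on whitespace and collect the vocabulary.
tokens <- strsplit(docs, "\\s+")
vocab  <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each document.
counts <- vapply(tokens,
                 function(tk) tabulate(match(tk, vocab), nbins = length(vocab)),
                 integer(length(vocab)))

# Document-term matrix: one row per document, one column per word.
dtm <- t(counts)
colnames(dtm) <- vocab
print(dtm)
```

Each row of `dtm` is now a point in a space with one axis per word, so distances between documents can be computed as for any other numeric data.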
15. Wandering traveller : TfIdf and other details
"But how to measure the importance of a word for a doc ?"
● Binary : Is the 'word' in the 'doc' ?
● Tf : How many times is the 'word' in the 'doc' ?
● TfIdf : Penalize the obvious! Down-weight words that appear in most docs
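The three weightings can be compared on a toy count matrix. The counts are made up, and the idf formula used here, log(N / df), is one common variant; others add smoothing terms.

```r
# Toy term-frequency matrix (rows = documents, columns = words); counts are made up.
tf <- rbind(d1 = c(camera = 2, price = 1, the = 3),
            d2 = c(camera = 0, price = 1, the = 2),
            d3 = c(camera = 0, price = 0, the = 4))

# Binary weighting: is the word in the doc?
binary <- (tf > 0) * 1

# Document frequency: number of docs that contain each word.
df <- colSums(tf > 0)

# Inverse document frequency: a word found in every doc gets log(1) = 0.
idf <- log(nrow(tf) / df)

# TfIdf: scale each column of tf by that word's idf.
tfidf <- sweep(tf, 2, idf, FUN = "*")
print(round(tfidf, 3))
```

The word "the" occurs in every document, so its TfIdf weight is zero everywhere, while "camera" occurs in a single document and gets the largest weight.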
16. Wandering traveller : Hierarchical Clustering
● Define distance measure
● Keep Merging based on similarity
[Figure: dendrogram with leaves "Washing Machine", "Washer Dryer" and "Camera"]
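The two steps can be sketched with base R's `dist()` and `hclust()`. The term counts for the three products are invented, and Euclidean distance with average linkage is only one reasonable combination; cosine distance is also common for text.

```r
# Hypothetical term counts for three product descriptions.
dtm <- rbind("Washing Machine" = c(wash = 3, spin = 2, lens = 0, zoom = 0),
             "Washer Dryer"    = c(wash = 2, spin = 2, lens = 0, zoom = 0),
             "Camera"          = c(wash = 0, spin = 0, lens = 3, zoom = 2))

# Step 1: define a distance measure.
d <- dist(dtm, method = "euclidean")

# Step 2: keep merging the closest clusters until only one remains.
hc <- hclust(d, method = "average")

# Cut the tree into two groups.
groups <- cutree(hc, k = 2)
print(groups)
```

Calling `plot(hc)` in an interactive session draws the dendrogram. With these counts the two laundry products merge first and the camera joins last.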
17. Wandering traveller : Improvements
● Stemming, lemmatization : "Cameras" vs. "Camera"
● Latent semantic analysis : "Phone" and "Touch Screen"
19. Seeker : Supervised Learning
● Labels given with features
● Find rule, classify unobserved case
[Figure: labelled points plotted against Feature 1 and Feature 2]
20. Seeker : Naive Bayes Classifier
● Independence of features
● Train the model on training set
● Test accuracy on a holdout sample
            Predicted 0    Predicted 1
Actual 0    F(0, 0)        F(0, 1)
Actual 1    F(1, 0)        F(1, 1)
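A from-scratch sketch of these steps in base R, assuming a multinomial Naive Bayes model with add-one (Laplace) smoothing. The labelled reviews are invented toy data, and the helper names (`tokenise`, `count_words`, `predict_nb`) are made up for this example; in practice a package implementation would normally be used.

```r
# Hypothetical labelled data: 1 = positive review, 0 = negative review.
train_text  <- c("great staff", "fabulous location", "great location",
                 "terrible service", "bad staff", "terrible location")
train_label <- c(1, 1, 1, 0, 0, 0)
test_text   <- c("great service", "bad service", "fabulous staff", "terrible staff")
test_label  <- c(1, 0, 1, 0)

tokenise <- function(x) strsplit(tolower(x), "\\s+")
vocab    <- sort(unique(unlist(tokenise(train_text))))

# Word counts over the vocabulary for one or more texts.
count_words <- function(text) {
  tabulate(match(unlist(tokenise(text)), vocab), nbins = length(vocab))
}

# Train the model on the training set: class priors and per-class word likelihoods.
classes   <- c(0, 1)
log_prior <- log(sapply(classes, function(cl) mean(train_label == cl)))
log_lik   <- sapply(classes, function(cl) {
  counts <- count_words(train_text[train_label == cl])
  # Add-one smoothing avoids log(0) for words unseen in a class.
  log((counts + 1) / (sum(counts) + length(vocab)))
})

predict_nb <- function(text) {
  x <- count_words(text)
  # Independence of features: per-word log-likelihoods simply add up.
  scores <- log_prior + as.vector(x %*% log_lik)
  classes[which.max(scores)]
}

# Test accuracy on the holdout sample and build the confusion matrix.
predicted <- vapply(test_text, predict_nb, numeric(1), USE.NAMES = FALSE)
confusion <- table(Actual = test_label, Predicted = predicted)
accuracy  <- mean(predicted == test_label)
print(confusion)
```

The diagonal of `confusion` counts correct predictions, F(0, 0) and F(1, 1); the off-diagonal cells count the two kinds of error.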
22. Learnings
● How to clean up and preprocess data in text form ?
● How to model the data ?
● How to cluster the data ?
● How to classify the data ?
23. Source of text and applications
Source of text                        Applications
Emails                                Spam detection
Product descriptions / reviews        Sentiment analysis, recommendation
Blogs / informational content         Content recommendations
Web pages / news articles             Topic identification, trending topics
Tweets / comments / social content    Sentiment analysis, named entity recognition