Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model
Hands-on Introduction to Text Mining with Julia
- A Mathematical approach
Abhijith Chandraprabhu
April 19, 2014
Abhijith Chandraprabhu Julia
Hands-on Introduction to Text Mining with Julia - A Mathematical approach
Overview
Introduction
Preprocessing
VSM Model
Query and Performance Modeling
LSI Model
About today's session
We will be dealing with an age-old technique (1999).
Old wine in a New bottle!
The emphasis is on understanding the math behind it.
Vectors, Matrices
Dimension, Space
Projection, Matrix factorization
SVD
Hands-on session.
“For almost a decade the computational linguistics community has
viewed large text collections as a resource to be tapped in order to
produce better text analysis algorithms. In this paper, I have
attempted to suggest a new emphasis: the use of large online text
collections to discover new facts and trends about the world itself.
I suggest that to make progress we do not need fully artificial
intelligent text analysis; rather, a mixture of
computationally-driven and user-guided analysis may open the door
to exciting new results.”
Untangling Text Data Mining, Marti A. Hearst
Text Mining
A type of information retrieval in which we try to extract relevant
information from a huge collection of textual data (documents).
Documents: web pages, biomedical literature, movie reviews,
research articles, etc.
Non-semantic.
Based on the Vector-Space Model.
Applications: web search engines, biomedical information
retrieval, etc.
Document1
It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will take
up much less memory than an Int64 array, and both will be laid out as continuous blocks of memory; Julia can deal
seamlessly and generically with these different immediate types as well as pointer types like String.
Document2
To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis,
I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages
that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation
language for data analysis.
Document3
Only later did I realize what makes Julia different from all the others. Julia breaks down the second wall — the wall
between your high-level code and native assembly. Not only can you write code with the performance of C in Julia,
you can take a peek behind the curtain of any function into its LLVM Intermediate Representation as well as its
generated assembly code — all within the REPL. Check it out.
Document4
Homoiconicity — the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of,
because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make it
tricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like to
appreciate them.
Document5
Graphics. One of the big advantages of R over similar languages is the sophisticated graphics available in ggplot2
or lattice. Work is underway on graphics models for Julia but, again, it is early days still.
Term-Document Matrix
Term                    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Query 1  Query 2
Array                       1      0      0      0      0        1        0
continuous block            1      0      0      0      0        0        0
Julia                       1      3      3      1      1        1        1
types                       1      0      0      0      0        1        0
string                      1      0      0      0      0        0        0
amazing                     0      1      0      0      0        0        0
data analysis               0      2      0      0      0        0        1
statistical computing       0      1      0      0      0        0        1
high-level                  0      0      1      0      0        0        0
performance                 0      0      1      0      0        0        0
LLVM                        0      0      1      0      0        0        0
homoiconicity               0      0      0      1      0        0        0
R                           0      0      0      2      0        0        0
syntax                      0      0      0      4      0        0        0
macros                      0      0      0      1      0        0        0
graphics                    0      0      0      0      3        0        0
advantages                  0      0      0      0      1        0        0
Terminology explained:
Corpus: structured set of texts, obtained by preprocessing raw textual data.
Lexicon: the set of all distinct words/terms in the corpus.
Document, Query: bag of words/terms, represented as a vector.
Keyword, Term: an element of the lexicon.
Term Frequency: frequency of occurrence of a term in a document.
TDM: Term-Document Matrix, a matrix with the term frequencies as entries across all documents.
Major Steps in Text Mining
1. Preprocess the documents to form a corpus.
2. Identify the lexicon from the corpus.
3. Form the Term-Document Matrix (a numeric representation of the text).
4. Apply VSM, LSI, K-Means to measure the proximity between
all the documents and the query.
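Steps 2 and 3 above can be sketched in plain Julia, without any package. This is only an illustration on a made-up toy corpus that is assumed to be preprocessed already; the names `docs`, `lexicon`, `term_index` and `tdm` are hypothetical and are not part of the session's code.

```julia
# Toy corpus: each document is already preprocessed into lowercase terms.
docs = [
    ["julia", "array", "types"],
    ["julia", "data", "analysis", "julia"],
    ["graphics", "julia"],
]

# Step 2: the lexicon is the sorted set of distinct terms in the corpus.
lexicon = sort(unique(vcat(docs...)))

# Step 3: term-document matrix, one row per term and one column per document.
term_index = Dict(term => i for (i, term) in enumerate(lexicon))
tdm = zeros(Int, length(lexicon), length(docs))
for (j, doc) in enumerate(docs), term in doc
    tdm[term_index[term], j] += 1   # count each occurrence of the term
end
```

Each column of `tdm` is then the term-frequency vector of one document, which is the input expected by the models in step 4.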
Preprocessing using the TextAnalysis.jl package
Raw input text is usually a stream of characters. We need to
convert it into a stream of terms (the basic processing units).
The first step is to create Documents, each of which is a
collection of terms.
Four types of documents can be created in Julia: a FileDocument,
which represents a file on disk, and the StringDocument,
TokenDocument and NGramDocument created below.
str = "Julia is a high-level, high-performance dynamic programming language for technical computing"
sd = StringDocument(str)
td = TokenDocument(str)
nd = NGramDocument(str)
Preprocessing
Most of the time we have huge volumes of unstructured data, with
content that carries no useful information for text mining:
HTML tags
Numbers
Stop words
Prepositions, articles, pronouns
In Julia, we can remove these unnecessary elements using the functions
remove_articles!(), remove_indefinite_articles!()
remove_definite_articles!(), remove_pronouns!()
remove_prepositions!(), remove_stop_words!()
Stemming
I have a different opinion
I differ with your opinion
I opine differently
In Information Retrieval, morphological variants of
words/terms that carry the same semantic information add
redundancy.
Stemming linguistically normalizes the lexicon of a corpus.
stem!(Corpus)
stem!(Document) # sd, td or nd
Vector-Space Model
[Figure: document vectors D1, D2, D3 drawn from the origin 0 in a space with term axes T1, T2, T3.]
Documents are vectors in the Term-Document Space.
The elements of a document vector are the weights $w_{ij}$, corresponding to term $i$ and document $j$ (see Weighting Schemes).
The weights are the frequencies of the terms in the documents.
Proximity of documents is calculated by the cosine of the angle between them.
Term-Document Space
Let $D = \{d_1, d_2, \ldots, d_N\}$ be a collection of $N$ documents.
$T = \{t_1, t_2, \ldots, t_m\}$ is the lexicon set.
$w_{ij}$ is the frequency of occurrence of term $t_i$ in document $d_j$.
Document: $d_j = [w_{1j}, w_{2j}, \ldots, w_{mj}]^T \in \mathbb{R}^{m}$.
$\mathrm{TDM} = [d_1 \; d_2 \; \cdots \; d_N] \in \mathbb{R}^{m \times N}$.
Weighting Schemes
A weighting scheme associates each occurrence of a term with a
weight that represents its relevance with respect to the meaning
of the document it appears in.
Longer documents do not always carry more informative
content (or more relevant content with respect to a query).
Binary Scheme: $w_{ij} = 1$ if $t_i$ occurs in $d_j$, and $0$ otherwise.
Term Frequency (TF) Scheme: $w_{ij} = f_{ij}$, i.e. the number of times $t_i$ occurs in document $d_j$.
Term Frequency - Inverse Document Frequency (TF-IDF) Scheme:
Term frequency: $tf_{ij} = \dfrac{f_{ij}}{\max_k f_{kj}}$, where the maximum runs over all terms $t_k$ in document $d_j$.
Inverse document frequency: $idf_i = \log \dfrac{N}{df_i}$, where $N$ is the number of documents and $df_i$ is the number of documents in which the term $t_i$ occurs.
$w_{ij} = tf_{ij} \times idf_i$.
m = DocumentTermMatrix(crps)
tf_idf(m)
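To make the TF-IDF formulas concrete, here is a minimal sketch that applies them directly to a small term-by-document count matrix. The helper name `tfidf_weights` and the toy matrix `F` are assumptions made for illustration; this is not the implementation behind `tf_idf` in TextAnalysis.jl, which may normalize differently.

```julia
# Hypothetical helper: TF-IDF weights from a term-by-document count matrix F
# (rows are terms t_i, columns are documents d_j).
function tfidf_weights(F::AbstractMatrix{<:Real})
    m, N = size(F)                           # m terms, N documents
    tf = F ./ max.(maximum(F, dims=1), 1)    # f_ij / max_k f_kj (guard for empty documents)
    df = vec(sum(F .> 0, dims=2))            # df_i: number of documents containing term i
    idf = log.(N ./ max.(df, 1))             # log(N / df_i) (guard for unused terms)
    return tf .* idf                         # w_ij = tf_ij * idf_i
end

F = [1 2 1;    # term occurring in every document
     1 0 0;    # term occurring only in document 1
     0 2 0]    # term occurring only in document 2
W = tfidf_weights(F)
```

A term that occurs in every document gets $idf_i = \log 1 = 0$, so its whole row of `W` is zero: such a term does not help to tell documents apart.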
Query Matching
Finding the relevant documents for a query $q$.
The cosine similarity measure is used:
$\cos(\theta) = \dfrac{q^T d_j}{\|q\|_2 \, \|d_j\|_2}$, where $\theta$ is the angle between the query $q$ and document $d_j$.
The documents for which $\cos(\theta) > tol$ are considered
relevant, where $tol$ is a predefined tolerance.
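Query matching as described above can be sketched as follows. The helper names `cosine` and `relevant_docs`, and the toy matrix, are hypothetical; the sketch assumes that documents are the columns of `A` and that neither the query nor any document is the zero vector.

```julia
using LinearAlgebra

# Cosine of the angle between query q and document vector d.
cosine(q, d) = dot(q, d) / (norm(q) * norm(d))

# Hypothetical helper: indices of the columns of A whose cosine with q exceeds tol.
relevant_docs(A, q, tol) = [j for j in axes(A, 2) if cosine(q, view(A, :, j)) > tol]

A = [1.0 0.0 1.0;
     1.0 0.0 0.0;
     0.0 2.0 1.0]          # 3 terms, 3 documents
q = [1.0, 1.0, 0.0]        # query using the first two terms
hits = relevant_docs(A, q, 0.4)
```

Here document 2 shares no term with the query, so its cosine is 0 and it is never returned for a positive tolerance.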
Performance Modeling
The tolerance tol decides the number of documents returned.
With a low tol, more documents are returned,
but the chance of returning irrelevant documents increases.
Ideally, we want a high number of documents returned, with
the majority of them relevant.
Precision: $P = \dfrac{D_r}{D_t}$, where $D_r$ is the number of relevant documents retrieved and $D_t$ is the total number of documents retrieved.
Recall: $R = \dfrac{D_r}{N_r}$, where $N_r$ is the total number of relevant documents in the database.
VSMModel(QueryNum::Int,A::Array{Float64,2},NumQueries::Int
This function returns $D_r$, $D_t$ and $N_r$ for a specific query
using the Vector-Space Model.
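As a small worked example of the two measures, the sketch below computes precision and recall from sets of document indices. The helper `precision_recall` and the index sets are made up for illustration and are not taken from the session's `VSMModel` code.

```julia
# Hypothetical helper: precision and recall from sets of document indices.
function precision_recall(retrieved, relevant)
    Dr = length(intersect(retrieved, relevant))  # relevant documents retrieved
    Dt = length(retrieved)                       # total documents retrieved
    Nr = length(relevant)                        # relevant documents in the database
    return (P = Dr / Dt, R = Dr / Nr)
end

# Four documents retrieved, three documents actually relevant, two in common.
pr = precision_recall([1, 3, 4, 7], [1, 2, 3])
```

With these sets, `pr.P` is 2/4 and `pr.R` is 2/3. Note that an empty `retrieved` set would give a division of zero by zero, which this sketch does not guard against.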
[Figure: four panels (Query 10, Query 13, Query 6, Query 23), each plotting the number of documents Dr, Dt and Nr against the tolerance, for tolerances from 0.2 to 0.8.]
Latent Semantic Indexing
There exists an underlying latent semantic structure in the data.
We can identify this structure through SVD.
Project the data onto two lower-dimensional spaces:
the term space and the document space.
Dimension reduction is achieved through truncated SVD.
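One common way to turn these ideas into query matching is sketched below: the query and the documents are projected onto the first k left singular vectors, and cosines are computed in that reduced space. The helper `lsi_scores` is hypothetical and is only one possible formulation, not necessarily the one used in the session's `SVDModel`; it assumes that documents are the columns of `A`.

```julia
using LinearAlgebra

# Hypothetical helper: cosine scores between a query and all documents
# in a rank-k LSI space.
function lsi_scores(A::AbstractMatrix, q::AbstractVector, k::Integer)
    F = svd(A)                                   # A = U * Diagonal(S) * Vt
    Uk = F.U[:, 1:k]                             # first k left singular vectors
    docs_k = Diagonal(F.S[1:k]) * F.Vt[1:k, :]   # document coordinates, equal to Uk' * A
    q_k = Uk' * q                                # query projected into the same space
    return [dot(q_k, d) / (norm(q_k) * norm(d)) for d in eachcol(docs_k)]
end

A = [1.0 0.0 1.0;
     1.0 0.0 0.0;
     0.0 2.0 1.0]
scores = lsi_scores(A, [1.0, 1.0, 0.0], 2)
```

When k equals the rank of `A`, nothing is discarded and the scores coincide with the plain Vector-Space Model cosines; smaller k gives the dimension-reduced comparison.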
Singular Value Decomposition
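Theorem (SVD)
For any matrix $A \in \mathbb{R}^{m \times n}$ with $m \ge n$, there exist two
orthogonal matrices $U = (u_1, \ldots, u_m) \in \mathbb{R}^{m \times m}$ and
$V = (v_1, \ldots, v_n) \in \mathbb{R}^{n \times n}$ such that
$$A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T,$$
where $\Sigma \in \mathbb{R}^{n \times n}$ is a diagonal matrix, $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, with
$\sigma_1 \ge \sigma_2 \ge \cdots \ge \sigma_n \ge 0$. The values $\sigma_1, \ldots, \sigma_n$ are called the singular values
of $A$. The columns of $U$ and $V$ are called the left and right singular
vectors of $A$, respectively.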
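The theorem can be checked numerically with Julia's standard `LinearAlgebra` library. The matrix below is an arbitrary example chosen for illustration; `full = true` is passed so that `U` is the square m-by-m factor used in the statement above.

```julia
using LinearAlgebra

A = [1.0 0.0;
     1.0 1.0;
     0.0 2.0]                 # m = 3, n = 2

F = svd(A; full = true)       # full = true returns the square m-by-m factor U
U, σ, V = F.U, F.S, F.V

# Rebuild the m-by-n block matrix [Σ; 0] from the singular values.
Σ0 = zeros(size(A))
for i in eachindex(σ)
    Σ0[i, i] = σ[i]
end

# Should be zero up to rounding error if A = U * [Σ; 0] * V'.
reconstruction_error = norm(A - U * Σ0 * V')
```

The singular values in `σ` come back sorted in decreasing order, matching $\sigma_1 \ge \cdots \ge \sigma_n \ge 0$.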
Low Rank Matrix Approximation using SVD
$$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T \approx \sum_{i=1}^{k} \sigma_i u_i v_i^T =: A_k, \quad k < r$$
The above approximation is based on the Eckart-Young
theorem.
It helps in the removal of noise, in solving ill-conditioned problems,
and mainly in the dimension reduction of data.
Use the function below to examine the effect of rank reduction.
Why is SVD good enough to decompose the TDM?
svdRedRank(A,k)
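The source of `svdRedRank` is not shown in the slides, so the following is only a sketch of what a rank-k truncation might look like, under the hypothetical name `rank_k_approx`. It keeps the k largest singular triplets and checks the Eckart-Young property that the 2-norm error equals the first discarded singular value.

```julia
using LinearAlgebra

# Hypothetical helper: rank-k approximation of A built from the k largest
# singular triplets (the best rank-k approximation in the 2-norm, by Eckart-Young).
function rank_k_approx(A::AbstractMatrix, k::Integer)
    F = svd(A)
    return F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
end

A = [3.0 0.0 0.0;
     0.0 2.0 0.0;
     0.0 0.0 1.0;
     0.0 0.0 0.0]             # singular values 3, 2, 1
A2 = rank_k_approx(A, 2)

# The 2-norm error of the rank-k approximation equals the (k+1)-th singular value.
err = opnorm(A - A2)
```

For this matrix the discarded singular value is 1, so `err` is 1 up to rounding; on a real term-document matrix the same comparison shows how much is lost for a given k.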
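[Figure: block diagram of the truncated SVD, $A \approx U S V$, with $A$ of size $m \times n$, $U$ of size $m \times k$, $S$ of size $k \times k$ and $V$ of size $k \times n$, where $k \le n$.]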
VSMModel(A::Array{Float64,2},nq::Int)
SVDModel(A::Array{Float64,2},nq::Int,rank::Int)
The above functions return the Recall and Precision using
the Vector-Space Model and the LSI model.
The first nq vectors of matrix A are the queries and the rest are the
document vectors.
For LSI (SVDModel), we also need to pass in the rank as a
parameter.
Use the functions shown below to view Precision vs. Recall for
VSM and LSI.
plotNew_RecPrec()
plotAdd_RecPrec()
[Figure: Precision vs. Recall for VSM, LSI and KM; x-axis: Recall (10 to 90), y-axis: Precision (30 to 70).]
Thank you.
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfEnterprise Knowledge
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...HostedbyConfluent
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slidespraypatel2
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxOnBoard
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 

Último (20)

Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
IAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI SolutionsIAC 2024 - IA Fast Track to Search Focused AI Solutions
IAC 2024 - IA Fast Track to Search Focused AI Solutions
 
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for PartnersEnhancing Worker Digital Experience: A Hands-on Workshop for Partners
Enhancing Worker Digital Experience: A Hands-on Workshop for Partners
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdfThe Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
The Role of Taxonomy and Ontology in Semantic Layers - Heather Hedden.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
Transforming Data Streams with Kafka Connect: An Introduction to Single Messa...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Slack Application Development 101 Slides
Slack Application Development 101 SlidesSlack Application Development 101 Slides
Slack Application Development 101 Slides
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Maximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptxMaximizing Board Effectiveness 2024 Webinar.pptx
Maximizing Board Effectiveness 2024 Webinar.pptx
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 

Julia text mining_inmobi

  • 1. Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model Hands-on Introduction to Text Mining with Julia - A Mathematical approach Abhijith Chandraprabhu April 19, 2014 Abhijith Chandraprabhu Julia Hands-on Introduction to Text Mining with Julia - A Mathematical approach
  • 2. Overview
      Introduction
      Preprocessing
      VSM Model
      Query and Performance Modeling
      LSI Model
  • 3. About today's session
      We will be dealing with an age-old technique (1999).
      Old wine in a new bottle!
      The emphasis is on understanding the math behind it:
        Vectors, Matrices
        Dimension, Space
        Projection, Matrix factorization
        SVD
      Hands-on session.
  • 4. “For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, I have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. I suggest that to make progress we do not need fully artificial intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.”
      Untangling Text Data Mining, Marti A. Hearst
  • 5. Text Mining
      A type of information retrieval in which we try to extract relevant information from a huge collection of textual data (documents).
      Documents: web pages, biomedical literature, movie reviews, research articles, etc.
      Non-semantic-based vector-space model.
      Applications: web search engines, biomedical information retrieval, etc.
  • 6. Example documents
      Document 1: It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will take up much less memory than an Int64 array, and both will be laid out as continuous blocks of memory; Julia can deal seamlessly and generically with these different immediate types as well as pointer types like String.
      Document 2: To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis, I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation language for data analysis.
      Document 3: Only later did I realize what makes Julia different from all the others. Julia breaks down the second wall — the wall between your high-level code and native assembly. Not only can you write code with the performance of C in Julia, you can take a peek behind the curtain of any function into its LLVM Intermediate Representation as well as its generated assembly code — all within the REPL. Check it out.
      Document 4: Homoiconicity — the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of, because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make it tricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like to appreciate them.
      Document 5: Graphics. One of the big advantages of R over similar languages is the sophisticated graphics available in ggplot2 or lattice. Work is underway on graphics models for Julia but, again, it is early days still.
  • 7. Term-Document Matrix

      Term                    Doc 1  Doc 2  Doc 3  Doc 4  Doc 5  Query 1  Query 2
      Array                     1      0      0      0      0       1        0
      continuous block          1      0      0      0      0       0        0
      Julia                     1      3      3      1      1       1        1
      types                     1      0      0      0      0       1        0
      string                    1      0      0      0      0       0        0
      amazing                   0      1      0      0      0       0        0
      data analysis             0      2      0      0      0       0        1
      statistical computing     0      1      0      0      0       0        1
      high-level                0      0      1      0      0       0        0
      performance               0      0      1      0      0       0        0
      LLVM                      0      0      1      0      0       0        0
      homoiconicity             0      0      0      1      0       0        0
      R                         0      0      0      2      0       0        0
      syntax                    0      0      0      4      0       0        0
      macros                    0      0      0      1      0       0        0
      graphics                  0      0      0      0      3       0        0
      advantages                0      0      0      0      1       0        0
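A matrix like the one above can be assembled by counting terms per document. The following is a minimal sketch in plain Julia, assuming the documents are already tokenized; `term_document_matrix` and the toy `docs` are hypothetical names used only for illustration and are not part of TextAnalysis.jl or the original session code.

```julia
# Hypothetical helper: build a term-document matrix of raw term frequencies.
# Rows correspond to lexicon terms, columns to documents.
function term_document_matrix(docs::Vector{Vector{String}})
    lexicon = sort(unique(reduce(vcat, docs)))            # distinct terms
    index = Dict(t => i for (i, t) in enumerate(lexicon)) # term -> row number
    tdm = zeros(Int, length(lexicon), length(docs))
    for (j, doc) in enumerate(docs), term in doc
        tdm[index[term], j] += 1                          # count one occurrence
    end
    return lexicon, tdm
end

# Toy tokenized documents (illustrative data, not the slide's corpus).
docs = [["julia", "array", "types"],
        ["julia", "data", "analysis", "julia"],
        ["graphics", "julia", "graphics"]]

lexicon, tdm = term_document_matrix(docs)
```

Each column of `tdm` is then the bag-of-words vector of one document, and a query can be appended as one more column in the same way.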
  • 8. Terminologies explained
      Corpus: structured set of text, obtained by preprocessing raw textual data.
      Lexicon: the collection of all distinct words/terms in the corpus.
      Document, Query: bag of words/terms, represented as a vector.
      Keyword, term: elements of the lexicon.
      Term Frequency: frequency of occurrence of a term in a document.
      TDM: a matrix with term frequencies as entries across all documents.
  • 9. Major Steps in Text Mining
      1. Preprocess the documents to form a corpus.
      2. Identify the lexicon from the corpus.
      3. Form the Term-Document matrix (numericized text).
      4. Apply VSM, LSI, K-Means to measure the proximity between all the documents and the query.
  • 10. Preprocessing using the TextAnalysis.jl package
      Raw input text is usually a stream of characters.
      We need to convert it into a stream of terms (the basic processing units).
      The first step is to create Documents, each of which is a collection of terms.
      Four types of documents can be created in Julia: a FileDocument, which represents a file on disk, and the three types constructed below.
        str = "Julia is a high-level, high-performance dynamic programming language for technical computing"
        sd = StringDocument(str)
        td = TokenDocument(str)
        nd = NGramDocument(str)
  • 11. Preprocessing
      Most of the time we have huge volumes of unstructured data, much of whose content carries no useful information for text mining:
        HTML tags
        Numbers
        Stop words
        Prepositions, articles, pronouns
      In Julia, we can remove these unnecessary elements using the functions
        remove_articles!(), remove_indefinite_articles!()
        remove_definite_articles!(), remove_pronouns!()
        remove_prepositions!(), remove_stop_words!()
  • 12. Stemming
      "I have a different opinion"
      "I differ with your opinion"
      "I opine differently"
      In information retrieval, morphological variants of words/terms that carry the same semantic information add redundancy.
      Stemming linguistically normalizes the lexicon of a corpus.
        stem!(Corpus)
        stem!(Document)   # sd, td or nd
  • 13. Vector-Space Model
      [Figure: documents D1, D2 and D3 drawn as vectors from the origin in a space whose axes are the terms T1, T2 and T3.]
      Documents are vectors in the Term-Document space.
      The elements of the vector are the weights $w_{ij}$, corresponding to term i and document j (see Weighting Schemes).
      In the simplest case, the weights are the frequencies of the terms in the documents.
      Proximity of documents is calculated by the cosine of the angle between them.
  • 14. Term-Document Space
      Let $D = \{d_1, d_2, \ldots, d_n\}$ be a collection of $n$ documents.
      $T = \{t_1, t_2, t_3, \ldots, t_m\}$ is the lexicon set.
      $w_{ij}$ is the frequency of occurrence of the term $t_i$ in document $d_j$.
      Document: $d_j = [w_{1j}, w_{2j}, w_{3j}, \ldots, w_{mj}]^T \in \mathbb{R}^m$.
      $\mathrm{TDM} = [d_1 \; d_2 \; d_3 \; \ldots \; d_n] \in \mathbb{R}^{m \times n}$.
  • 15. Weighting Schemes
      A weighting scheme associates each occurrence of a term with a weight that represents its relevance to the meaning of the document it appears in.
      Longer documents do not always carry more informative content (or more relevant content with respect to a query).
  • 16. Weighting Schemes (continued)
      Binary scheme: $w_{ij} = 1$ if $t_i$ occurs in $d_j$, and $0$ otherwise.
      Term Frequency (TF) scheme: $w_{ij} = f_{ij}$, i.e. the number of times $t_i$ occurs in document $d_j$.
      Term Frequency - Inverse Document Frequency (TF-IDF) scheme:
        Term frequency: $tf_{ij} = \dfrac{f_{ij}}{\max_k f_{kj}}$, where the maximum is taken over all terms in $d_j$.
        Inverse document frequency: $idf_i = \log \dfrac{N}{df_i}$, where $N$ is the number of documents and $df_i$ is the number of documents in which the term $t_i$ occurs.
        $w_{ij} = tf_{ij} \times idf_i$.
      In TextAnalysis.jl:
        m = DocumentTermMatrix(crps)
        tf_idf(m)
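The TF-IDF definitions above can be written out directly on a raw frequency matrix. This is a minimal sketch that follows the slide's formulas with terms as rows and documents as columns; `tfidf` is a hypothetical helper name (it is not the `tf_idf` function of TextAnalysis.jl), the logarithm is assumed to be natural, and every document and every term is assumed to have at least one nonzero count.

```julia
# Hypothetical helper implementing w_ij = tf_ij * idf_i from the slide.
# F is an m × N matrix of raw counts f_ij (m terms, N documents).
function tfidf(F::AbstractMatrix{<:Real})
    N = size(F, 2)                       # number of documents
    tf = F ./ maximum(F, dims=1)         # tf_ij = f_ij / max_k f_kj (per column)
    df = vec(sum(F .> 0, dims=2))        # df_i = number of documents containing t_i
    idf = log.(N ./ df)                  # idf_i = log(N / df_i)
    return tf .* idf                     # scale row i of tf by idf_i
end

# Toy counts (illustrative only): 3 terms × 3 documents.
F = [1 0 2;
     1 1 1;
     0 2 0]
W = tfidf(F)
```

Note that a term occurring in every document (the second row here) gets $idf_i = \log 1 = 0$, so it contributes nothing to any document vector.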
  • 17. Query Matching
      Finding the relevant documents for a query $q$.
      The cosine similarity measure is used:
        $\cos(\theta) = \dfrac{q^T d_j}{\|q\|_2 \, \|d_j\|_2}$,
      where $\theta$ is the angle between the query $q$ and document $d_j$.
      The documents for which $\cos(\theta) > tol$ are considered relevant, where $tol$ is the predefined tolerance.
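The matching rule above can be sketched in a few lines using Julia's standard `LinearAlgebra` library. The names `cosine` and `match_query`, the toy matrix `D` and the tolerance value are assumptions made for illustration; they are not the session's own functions.

```julia
using LinearAlgebra

# Cosine of the angle between a query vector q and a document vector d.
cosine(q, d) = dot(q, d) / (norm(q) * norm(d))

# Hypothetical helper: score every column of D against q and
# return the indices of documents whose score exceeds tol.
function match_query(q, D; tol=0.5)
    scores = [cosine(q, D[:, j]) for j in 1:size(D, 2)]
    return findall(>(tol), scores), scores
end

# Toy term-document matrix (3 terms × 3 documents) and a query.
D = [1.0 0.0 1.0;
     1.0 1.0 0.0;
     0.0 1.0 0.0]
q = [1.0, 1.0, 0.0]

relevant, scores = match_query(q, D; tol=0.6)
```

Lowering `tol` returns more documents, which is the trade-off examined under Performance Modeling.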
  • 18. Performance Modeling
      The tolerance $tol$ decides the number of documents returned.
      With a low $tol$, more documents are returned, but the chance that the returned documents are irrelevant increases.
      Ideally we want a high number of documents returned, with the majority of them relevant.
      Precision: $P = \dfrac{D_r}{D_t}$, where $D_r$ is the number of relevant documents retrieved and $D_t$ is the total number of documents retrieved.
      Recall: $R = \dfrac{D_r}{N_r}$, where $N_r$ is the total number of relevant documents in the database.
        VSMModel(QueryNum::Int,A::Array{Float64,2},NumQueries::Int)
      This function is used to obtain $D_r$, $D_t$ and $N_r$ for any specific query using the Vector Space model.
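Given the set of documents retrieved for a query and the set known to be relevant, precision and recall follow directly from the definitions above. A minimal sketch, assuming both sets are given as vectors of document indices; `precision_recall` is a hypothetical helper, not the session's `VSMModel`.

```julia
# Hypothetical helper: precision P = Dr / Dt and recall R = Dr / Nr.
function precision_recall(retrieved, relevant)
    Dr = length(intersect(retrieved, relevant))  # relevant documents retrieved
    Dt = length(retrieved)                       # all documents retrieved
    Nr = length(relevant)                        # all relevant documents
    return Dr / Dt, Dr / Nr
end

# Illustrative indices: 4 documents retrieved, 3 documents truly relevant.
P, R = precision_recall([1, 3, 4, 7], [1, 2, 3])
```

If nothing is retrieved, `Dt` is zero and the precision evaluates to `NaN`, so a caller may want to treat that case separately.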
  • 19. [Figure: four panels (Query 10, Query 13, Query 6, Query 23), each plotting the number of documents Dr, Dt and Nr against the tolerance, for tolerances from 0.2 to 0.8.]
  • 20. Latent Semantic Indexing
      There exists an underlying latent semantic structure in the data.
      We can identify this structure through the SVD.
      Project the data onto two lower-dimensional spaces: the term space and the document space.
      Dimension reduction is achieved through the truncated SVD.
  • 21. Singular Value Decomposition
      Theorem (SVD): For any matrix $A \in \mathbb{R}^{m \times n}$ with $m > n$, there exist two orthogonal matrices $U = (u_1, \ldots, u_m) \in \mathbb{R}^{m \times m}$ and $V = (v_1, \ldots, v_n) \in \mathbb{R}^{n \times n}$ such that
        $A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T$,
      where $\Sigma \in \mathbb{R}^{n \times n}$ is a diagonal matrix, i.e. $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, with $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_n \geq 0$.
      $\sigma_1, \ldots, \sigma_n$ are called the singular values of $A$.
      The columns of $U$ and $V$ are called the left and right singular vectors of $A$, respectively.
  • 22. Low Rank Matrix Approximation using SVD
        $A = \sum_{i=1}^{n} \sigma_i u_i v_i^T \approx \sum_{i=1}^{k} \sigma_i u_i v_i^T =: A_k, \quad k < r$,
      where $r$ is the rank of $A$.
      The above approximation is based on the Eckart-Young theorem.
      It helps in the removal of noise, in solving ill-conditioned problems, and mainly in dimension reduction of data.
      Using the function below, examine the effect of rank reduction. Why is SVD good enough to decompose the TDM?
        svdRedRank(A,k)
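The rank-k truncation can be tried out with the `svd` function from Julia's standard `LinearAlgebra` library. This is a minimal sketch; `lowrank` is a hypothetical stand-in written for illustration and is not the session's `svdRedRank`, whose implementation is not shown in the slides.

```julia
using LinearAlgebra

# Hypothetical helper: best rank-k approximation A_k = sum_{i=1}^k σ_i u_i v_i^T.
function lowrank(A::AbstractMatrix, k::Integer)
    F = svd(A)                                   # thin SVD: A = U * Diagonal(S) * Vt
    return F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
end

# Toy 4 × 3 matrix of rank 3 (illustrative data).
A = [1.0 0.0 1.0;
     1.0 1.0 0.0;
     0.0 1.0 0.0;
     0.0 0.0 2.0]

A2 = lowrank(A, 2)
err = opnorm(A - A2)       # spectral-norm error of the rank-2 approximation
sigma = svdvals(A)         # singular values, in decreasing order
```

By the Eckart-Young theorem the spectral-norm error `err` equals the first discarded singular value, here `sigma[3]`, which gives a quick check on how much is lost for a given `k`.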
  • 23. [Figure: block diagram of the truncated SVD, $A_{m \times n} \approx U_{m \times k} \, S_{k \times k} \, V^T_{k \times n}$ with $k \leq n$.]
  • 24. Comparing VSM and LSI
        VSMModel(A::Array{Float64,2},nq::Int)
        SVDModel(A::Array{Float64,2},nq::Int,rank::Int)
      The above functions return the Recall and Precision using the Vector Space model and the LSI model.
      The first nq vectors of matrix A are the queries and the rest are the document vectors.
      For LSI (SVDModel), we also need to pass in the rank as a parameter.
      Use the functions shown below to view Precision vs. Recall for VSM and LSI.
        plotNew_RecPrec()
        plotAdd_RecPrec()
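To make the difference between the two models concrete, the sketch below scores a query against documents in the rank-k space. It assumes one common LSI formulation, in which document j is represented by column j of $\Sigma_k V_k^T$ and the query by $U_k^T q$; the slides do not show how `SVDModel` is implemented, so `lsi_scores` is a hypothetical illustration rather than the session's code.

```julia
using LinearAlgebra

# Hypothetical helper: cosine scores between a query and all documents
# after projecting both onto the first k left singular vectors of D.
function lsi_scores(D::AbstractMatrix, q::AbstractVector, k::Integer)
    F = svd(D)
    Uk = F.U[:, 1:k]
    docs_k = Diagonal(F.S[1:k]) * F.Vt[1:k, :]   # k × n document coordinates
    q_k = Uk' * q                                # k-dimensional query coordinates
    return [dot(q_k, docs_k[:, j]) / (norm(q_k) * norm(docs_k[:, j]))
            for j in 1:size(D, 2)]
end

# Toy term-document matrix (3 terms × 3 documents) and a query.
D = [1.0 0.0 1.0;
     1.0 1.0 0.0;
     0.0 1.0 0.0]
q = [1.0, 1.0, 0.0]

full = lsi_scores(D, q, 3)      # k = full rank: same cosines as the plain VSM
reduced = lsi_scores(D, q, 2)   # k = 2: scores in the reduced space
```

With `k` equal to the full rank the projection is an orthogonal change of basis, so the scores coincide with the vector space model; smaller `k` changes the scores, and the Precision vs. Recall comparison measures whether that change helps.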
  • 25. [Figure: Precision vs. Recall curves for VSM, LSI and KM.]
  • 26. Thank you.