The document is a presentation on text mining with Julia. It introduces vector space models (VSM), latent semantic indexing (LSI) using singular value decomposition (SVD), and their implementation in Julia. The major steps are preprocessing text data, creating a term-document matrix, applying VSM and LSI to measure document similarity, and evaluating performance using precision and recall as the tolerance varies. LSI projects documents into lower dimensions using SVD to remove noise and better capture the latent semantic structure of the data.
Julia text mining_inmobi
1. Introduction Preprocessing VSM Model Query and Performance Modeling LSI Model
Hands-on Introduction to Text Mining with Julia
- A Mathematical approach
Abhijith Chandraprabhu
April 19, 2014
Overview
Introduction
Preprocessing
VSM Model
Query and Performance Modeling
LSI Model
About today's session
We will be dealing with an age-old technique (1999).
Old wine in a New bottle!
The emphasis is on understanding the math behind it.
Vectors, Matrices
Dimension, Space
Projection, Matrix factorization
SVD
Hands-on session.
“For almost a decade the computational linguistics community has
viewed large text collections as a resource to be tapped in order to
produce better text analysis algorithms. In this paper, I have
attempted to suggest a new emphasis: the use of large online text
collections to discover new facts and trends about the world itself.
I suggest that to make progress we do not need fully artificial
intelligent text analysis; rather, a mixture of
computationally-driven and user-guided analysis may open the door
to exciting new results.”
Untangling Text Data Mining, Marti A. Hearst
Text Mining
A type of information retrieval, where we try to extract relevant
information from a huge collection of textual data (documents).
Documents: web pages, biomedical literature, movie reviews,
research articles, etc.
Non-semantic.
Based on the vector-space model.
Applications: web search engines, biomedical information
retrieval, etc.
Document1
It’s also important to point out that even though these arrays are generic, they’re not boxed: an Int8 array will take
up much less memory than an Int64 array, and both will be laid out as continuous blocks of memory; Julia can deal
seamlessly and generically with these different immediate types as well as pointer types like String.
Document2
To celebrate some of the amazing work that’s already been done to make Julia usable for day-to-day data analysis,
I’d like to give a brief overview of the state of statistical programming in Julia. There are now several packages
that, taken as a whole, suggest that Julia may really live up to its potential and become the next generation
language for data analysis.
Document3
Only later did I realize what makes Julia different from all the others. Julia breaks down the second wall — the wall
between your high-level code and native assembly. Not only can you write code with the performance of C in Julia,
you can take a peek behind the curtain of any function into its LLVM Intermediate Representation as well as its
generated assembly code — all within the REPL. Check it out.
Document4
Homoiconicity — the code can be operated on by other parts of the code. Again, R kind of has this too! Kind of,
because I’m unaware of a good explanation for how to use it productively, and R’s syntax and scoping rules make it
tricky to pull off. But I’m still excited to see it in Julia, because I’ve heard good things about macros and I’d like to
appreciate them.
Document5
Graphics. One of the big advantages of R over similar languages is the sophisticated graphics available in ggplot2
or lattice. Work is underway on graphics models for Julia but, again, it is early days still.
Terminology explained:
Corpus: Structured set of texts, obtained by preprocessing raw textual data.
Lexicon: The collection of all distinct words/terms in the corpus.
Document, Query: Bag of words/terms, represented as a vector.
Keyword, Term: Elements of the lexicon.
Term Frequency: Frequency of occurrence of a term in a document.
TDM (Term-Document Matrix): A matrix with the term frequencies as entries, across all documents.
Major Steps in Text Mining
1. Preprocess the documents to form a corpus.
2. Identify the lexicon from the corpus.
3. Form the Term-Document matrix (Numericized Text).
4. Apply VSM, LSI, K-Means to measure the proximity between
all the documents and the query.
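Steps 1 to 3 can be sketched in plain Julia without any package. This is only an illustration under stated assumptions: the toy sentences, the tiny stop-word list and the names `tokenize`, `lexicon`, `term_index` and `tdm` are all hypothetical and are not taken from the slides or from TextAnalysis.jl.

```julia
# Toy corpus (hypothetical sentences, not the documents shown earlier)
docs = ["Julia is fast and Julia is dynamic",
        "R has good graphics",
        "Julia has macros"]

# Hypothetical stop-word list, for illustration only
stop_words = Set(["is", "and", "has"])

# Step 1: preprocess each document into a list of lowercase terms
tokenize(s) = filter(w -> !(w in stop_words), String.(split(lowercase(s))))
corpus = [tokenize(d) for d in docs]

# Step 2: the lexicon is the sorted set of distinct terms in the corpus
lexicon = sort(unique(vcat(corpus...)))

# Step 3: term-document matrix, rows = terms, columns = documents
term_index = Dict(t => i for (i, t) in enumerate(lexicon))
tdm = zeros(Int, length(lexicon), length(corpus))
for (j, doc) in enumerate(corpus), term in doc
    tdm[term_index[term], j] += 1
end
```

Each column of `tdm` is then one document vector, which is the input that step 4 works on.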
Preprocessing using the TextAnalysis.jl package
Raw input text is usually a stream of characters. We need to
convert this into a stream of terms (basic processing units).
The first step is to create Documents, each of which is a
collection of terms.
Four types of documents can be created in Julia:
a FileDocument, which represents a file on disk, and the three types created from a string below.
using TextAnalysis
str = "Julia is a high-level, high-performance dynamic programming language for technical computing"
sd = StringDocument(str)
td = TokenDocument(str)
nd = NGramDocument(str)
Preprocessing
Most of the time we have huge volumes of unstructured data, with
content that carries no useful information for text mining:
HTML tags
Numbers
Stop words
Prepositions, articles, pronouns
In Julia, we can remove these unnecessary elements using the functions
remove_articles!(), remove_indefinite_articles!()
remove_definite_articles!(), remove_pronouns!()
remove_prepositions!(), remove_stop_words!()
Stemming
I have a different opinion
I differ with your opinion
I opine differently
In information retrieval, morphological variants of
words/terms that carry the same semantic information add
redundancy.
Stemming linguistically normalizes the lexicon of a corpus.
stem!(Corpus)
stem!(Document) #sd, td or nd
Vector-Space Model
[Figure: documents D1, D2, D3 drawn as vectors from the origin in the space spanned by the term axes T1, T2, T3.]
Documents are vectors in the term-document space.
The elements of the vector are the weights $w_{ij}$, corresponding to term $i$ and document $j$ (refer to Weighting Schemes).
The weights are the frequencies of the terms in the documents.
Proximity of documents is calculated by the cosine of the angle between them.
Term-Document Space
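Let $D = \{d_1, d_2, \ldots, d_n\}$ be a collection of $n$ documents.
$T = \{t_1, t_2, t_3, \ldots, t_m\}$ is the lexicon set.
$w_{ij}$ is the frequency of occurrence of the term $t_i$ in document $d_j$.
Document, $d_j = [w_{1j}, w_{2j}, w_{3j}, \ldots, w_{mj}]^T \in \mathbb{R}^{m}$.
$TDM = [d_1 \; d_2 \; d_3 \; \ldots \; d_n] \in \mathbb{R}^{m \times n}$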
Weighting Schemes
A weighting scheme associates each occurrence of a term with a weight that
represents its relevance with respect to the meaning of the
document it appears in.
Longer documents do not always carry more informative
content (or more relevant content with respect to a query).
Binary scheme: $w_{ij} = 1$ if $t_i$ occurs in $d_j$, and $0$ otherwise.
Term Frequency (TF) scheme: $w_{ij} = f_{ij}$, i.e. the number of times $t_i$ occurs in document $d_j$.
Term Frequency - Inverse Document Frequency (TF-IDF) scheme:
Term frequency, $tf_{ij} = \dfrac{f_{ij}}{\max_k f_{kj}}$, where the maximum is taken over all terms $t_k$ in document $d_j$.
Inverse document frequency, $idf_i = \log \dfrac{N}{df_i}$, where $N$ is the number of documents and $df_i$ is the number of documents in which the term $t_i$ occurs.
$w_{ij} = tf_{ij} \times idf_i$.
m = DocumentTermMatrix(crps)
tf_idf(m)
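The TF-IDF formulas above can also be written out by hand, which may help when checking what a library call returns. The sketch below is a minimal version under assumptions: the function name `tfidf` and the count matrix `f` are hypothetical, the natural logarithm is used because the slides do not state the base, and every term is assumed to occur in at least one document and every document to be non-empty.

```julia
# Hypothetical matrix of raw counts f[i, j]: term i (row) in document j (column)
f = [2 0 1;
     1 1 0;
     0 3 1]

function tfidf(f::AbstractMatrix{<:Real})
    n = size(f, 2)                        # N, the number of documents
    tf = f ./ maximum(f, dims=1)          # divide by the largest count in each document
    df = vec(sum(f .> 0, dims=2))         # number of documents containing each term
    idf = log.(n ./ df)                   # natural log; the base is an assumption
    return tf .* idf                      # scale row i by idf[i]
end

w = tfidf(f)
```

Note that this differs from the row/column layout of a document-term matrix, where documents are rows; here documents are columns, matching $d_j$ in the text.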
Query Matching
Finding the relevant documents for a query $q$.
The cosine similarity measure is used:
$\cos(\theta) = \dfrac{q^T d_j}{\|q\|_2 \, \|d_j\|_2}$, where $\theta$ is the angle between the query $q$ and document $d_j$.
The documents for which $\cos(\theta) > tol$ are considered
relevant, where $tol$ is the predefined tolerance.
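As a minimal sketch of this matching rule, the snippet below computes the cosine between a query and every column of a small matrix and keeps the columns above the tolerance. The names `cosine` and `relevant_docs` and the example data are hypothetical; documents are assumed to be stored as columns and to be non-zero.

```julia
using LinearAlgebra

# Cosine of the angle between query q and document d
cosine(q, d) = dot(q, d) / (norm(q) * norm(d))

# Indices of the columns of A whose cosine with q exceeds tol
relevant_docs(A, q, tol) = [j for j in axes(A, 2) if cosine(q, view(A, :, j)) > tol]

# Three documents over three terms (columns are documents)
A = [1.0 0.0 1.0;
     0.0 1.0 1.0;
     0.0 1.0 0.0]
q = [1.0, 0.0, 0.0]

hits_loose = relevant_docs(A, q, 0.5)   # documents 1 and 3
hits_tight = relevant_docs(A, q, 0.8)   # only document 1
```

Raising the tolerance from 0.5 to 0.8 drops document 3, whose cosine with the query is $1/\sqrt{2}$.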
Performance Modeling
The tolerance $tol$ decides the number of documents returned.
With a low $tol$, more documents are returned,
but the chance of the returned documents being irrelevant increases.
Ideally we need a high number of documents returned, with the
majority of the returned documents being relevant.
Precision, $P = \dfrac{D_r}{D_t}$, where $D_r$ is the number of relevant documents retrieved and $D_t$ is the total number of documents retrieved.
Recall, $R = \dfrac{D_r}{N_r}$, where $N_r$ is the total number of relevant documents in the database.
VSMModel(QueryNum::Int,A::Array{Float64,2},NumQueries::Int
This function is used to obtain $D_r$, $D_t$ and $N_r$ for any specific query
using the vector space model.
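Given the set of retrieved documents and the set of truly relevant documents for one query, precision and recall follow directly from the definitions. The helper below is a hypothetical sketch, not the VSMModel function from the session; it assumes at least one relevant document exists ($N_r > 0$) and, by convention, reports a precision of 0 when nothing is retrieved.

```julia
# retrieved: ids of documents returned for a query
# relevant:  ids of documents known to be relevant to that query
function precision_recall(retrieved, relevant)
    Dr = length(intersect(retrieved, relevant))  # relevant documents retrieved
    Dt = length(retrieved)                       # total documents retrieved
    Nr = length(relevant)                        # relevant documents in the database
    P = Dt == 0 ? 0.0 : Dr / Dt                  # empty result: precision set to 0 (assumption)
    R = Dr / Nr
    return P, R
end

P, R = precision_recall([1, 3, 4, 7], [1, 2, 3])
```

In this example two of the four retrieved documents are relevant, and two of the three relevant documents were found.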
[Figure: four panels (Query 10, Query 13, Query 6, Query 23), each plotting the number of documents Dr, Dt and Nr against the tolerance, with the tolerance axis running from 0.2 to 0.8.]
Latent Semantic Indexing
There exists an underlying latent semantic structure in the data.
We can identify this structure through SVD.
Project the data onto two lower-dimensional spaces.
These are the term space and the document space.
Dimension reduction is achieved through truncated SVD.
Singular Value Decomposition
Theorem (SVD)
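For any matrix $A \in \mathbb{R}^{m \times n}$, with $m \ge n$, there exist two
orthogonal matrices $U = (u_1, \ldots, u_m) \in \mathbb{R}^{m \times m}$ and
$V = (v_1, \ldots, v_n) \in \mathbb{R}^{n \times n}$ such that
$A = U \begin{pmatrix} \Sigma \\ 0 \end{pmatrix} V^T$,
where $\Sigma \in \mathbb{R}^{n \times n}$ is a diagonal matrix, i.e., $\Sigma = \mathrm{diag}(\sigma_1, \ldots, \sigma_n)$, with
$\sigma_1 \ge \sigma_2 \ge \ldots \ge \sigma_n \ge 0$. $\sigma_1, \ldots, \sigma_n$ are called the singular values
of $A$. The columns of $U$ and $V$ are called the left and right singular
vectors of $A$, respectively.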
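The factorization can be checked numerically with `svd` from Julia's `LinearAlgebra` standard library. The matrix below is an arbitrary example. Note that `svd` returns the thin factorization by default, so `F.U` is $m \times n$ and the zero block of the theorem is dropped.

```julia
using LinearAlgebra

A = [1.0 2.0;
     3.0 4.0;
     5.0 6.0]                      # m = 3, n = 2

F = svd(A)                         # thin SVD: F.U is 3×2, F.S has 2 entries, F.Vt is 2×2

# A is recovered from its factors up to rounding error
reconstruction_error = norm(A - F.U * Diagonal(F.S) * F.Vt)

# Singular values come back in non-increasing order
is_sorted = issorted(F.S, rev=true)

# The columns of U are orthonormal
gram = F.U' * F.U
```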
Low Rank Matrix Approximation using SVD
$A = \sum_{i=1}^{n} \sigma_i u_i v_i^T \approx \sum_{i=1}^{k} \sigma_i u_i v_i^T =: A_k, \quad k < r$, where $r$ is the rank of $A$.
The above approximation is based on the Eckart-Young
theorem.
It helps in the removal of noise, in solving ill-conditioned problems,
and mainly in dimension reduction of data.
Using the function below, examine the effect of rank reduction.
Why is SVD good enough to decompose the TDM?
svdRedRank(A,k)
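A rank-$k$ approximation can be sketched by keeping only the first $k$ singular triplets. The helper `rank_k_approx` below is hypothetical and is not the `svdRedRank` function used in the session, whose implementation is not shown. The example matrix is chosen so that its singular values are 4, 2 and 0.5, and the last lines illustrate the Eckart-Young result that the spectral-norm error of $A_k$ equals $\sigma_{k+1}$.

```julia
using LinearAlgebra

# Hypothetical helper: best rank-k approximation of A via truncated SVD
function rank_k_approx(A::AbstractMatrix, k::Integer)
    F = svd(A)
    return F.U[:, 1:k] * Diagonal(F.S[1:k]) * F.Vt[1:k, :]
end

A = [3.0 1.0 0.0;
     1.0 3.0 0.0;
     0.0 0.0 0.5;
     0.0 0.0 0.0]

A1 = rank_k_approx(A, 1)           # keep only the largest singular triplet
sigma = svdvals(A)                 # singular values of A in non-increasing order
err = opnorm(A - A1)               # spectral norm of the approximation error
```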
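[Figure: block diagram of the truncated SVD, $A$ ($m \times n$) $\approx$ $U$ ($m \times k$) $\cdot$ $S$ ($k \times k$) $\cdot$ $V^T$ ($k \times n$), with $k \le n$.]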
VSMModel(A::Array{Float64,2},nq::Int)
SVDModel(A::Array{Float64,2},nq::Int,rank::Int)
The above functions return the recall and precision using
the vector space model and the LSI model, respectively.
The first nq vectors of matrix A are the queries and the rest are the
document vectors.
For LSI (SVDModel), we also need to pass in the rank as a
parameter.
Use the functions shown below to view precision vs recall for
VSM and LSI.
plotNew_RecPrec()
plotAdd_RecPrec()
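To give a feel for what an LSI-style match can look like, here is one common formulation: project queries and documents onto the first $k$ left singular vectors of the document matrix and compare them by cosine in that reduced space. This is a sketch under assumptions, not the SVDModel function from the session. The name `lsi_cosines` is hypothetical, the vectors of `A` are assumed to be its columns, and no projected vector is assumed to be zero.

```julia
using LinearAlgebra

# Hypothetical helper: cosines between queries and documents in a rank-k space
function lsi_cosines(A::AbstractMatrix, nq::Integer, k::Integer)
    Q = A[:, 1:nq]                 # first nq columns are the queries
    D = A[:, nq+1:end]             # remaining columns are the documents
    Uk = svd(D).U[:, 1:k]          # first k left singular vectors of the documents
    Qk = Uk' * Q                   # queries in k dimensions
    Dk = Uk' * D                   # documents in k dimensions
    Qn = Qk ./ sqrt.(sum(abs2, Qk, dims=1))   # normalise columns to unit length
    Dn = Dk ./ sqrt.(sum(abs2, Dk, dims=1))
    return Qn' * Dn                # entry (i, j): cosine of query i and document j
end

# Rows: car, automobile, engine, flower. Column 1 is the query "car";
# columns 2-4 are documents "car engine", "automobile engine", "flower flower".
A = [1.0 1.0 0.0 0.0;
     0.0 0.0 1.0 0.0;
     0.0 1.0 1.0 0.0;
     0.0 0.0 0.0 2.0]

C = lsi_cosines(A, 1, 2)
```

In the full term space the query "car" shares no term with the "automobile engine" document, so their cosine is 0. In the rank-2 space the two become aligned because both co-occur with "engine", which is the kind of latent structure LSI is meant to capture.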
[Figure: Precision vs Recall for VSM, LSI and KM; x-axis: Recall (10 to 90), y-axis: Precision (30 to 70).]
Thank you.