Text mining seeks to extract useful information from unstructured text documents. It involves preprocessing the text, identifying features, and applying techniques from data mining, machine learning, and natural language processing to discover patterns. The core operations of text mining include analyzing distributions of concepts and identifying frequent concept sets and associations between concepts. Text mining systems also aim to analyze document collections over time to identify trends, ephemeral relationships, and anomalous patterns.
2. Definition : Text Mining
Text mining refers generally to the process of extracting interesting
information and knowledge from unstructured text.
Text Mining can be defined as a knowledge-intensive process in which
a user interacts with a document collection over time by using a suite of
analysis tools.
And
Text Mining seeks to extract useful information from data sources
(document collections) through the identification and exploration of
interesting patterns.
3. Text Mining
Text mining (TM) seeks to extract useful information from a collection of
documents.
It is similar to data mining (DM), but the data sources are unstructured or
semi-structured documents.
The TM methods involve:
- Basic preprocessing / TM operations, such as the identification and extraction of representative features (this can be done in several phases)
- Advanced text mining operations, involving the identification of complex patterns (e.g., relationships between previously identified concepts)
TM exploits techniques and methodologies from data mining, machine learning, information retrieval, and corpus-based computational linguistics.
4. Similarity and difference between
Data and Text Mining
Both types of systems rely on:
Preprocessing routines
Pattern-discovery algorithms
Presentation-layer elements such as visualization tools
Pre-processing operations:
Data mining assumes the data is stored in a structured format, so preprocessing focuses on scrubbing and normalizing data and on creating extensive numbers of table joins.
In text mining, preprocessing operations center on the identification and extraction of representative features for natural-language documents, to transform unstructured data stored in document collections into a more explicitly structured intermediate format.
5. Finding Frequent Patterns
Finding “Nuggets”
                   Novel                          Non-Novel
Non-textual data   General Data Mining /          Database Queries
                   Exploratory Data Analysis
Textual data       Computational Linguistics      Information Retrieval
7. Text Mining Process
Text preprocessing
  Syntactic/semantic text analysis
Features Generation
  Bag of words
Features Selection
  Simple counting
  Statistics
Text/Data Mining
  Classification (supervised learning)
  Clustering (unsupervised learning)
Analyzing results
  Mapping/Visualization
  Result interpretation
8. Text Mining Tasks
TM
  Text Analysis Tools
    Feature extraction
      Name extraction
      Term extraction
      Abbreviation extraction
      Relationship extraction
    Categorization
    Summarization
    Clustering
      Hierarchical clustering
      Binary relational clustering
  Web Searching Tools
    Text search engine
    Net Question Solution
    Web Crawler
9. Handling Text Data
Modeling semi-structured data
Information Retrieval (IR) from unstructured documents
• Locates and ranks relevant documents
• Keyword based (Boolean matching)
• Similarity based
Text mining
• Classify documents
• Cluster documents
• Find patterns or trends across documents
10. Challenges in Text Mining
The data collection is “free text”: not well organized (semi-structured or unstructured).
There is no uniform access over all sources; each source has separate storage and algebra. Examples: email, databases, applications, the web.
A quintuple heterogeneity: semantic, linguistic, structure, format, and size of the unit of information.
Learning techniques for processing text typically need annotated training data.
XML as the common model allows:
  Manipulating data with standards
  Mining becomes more like data mining
RDF is emerging as a complementary model.
The more structure you can exploit, the better you can do mining.
11. Types of Text Data Mining
Keyword-based association analysis
Automatic document classification
Similarity detection
Cluster documents by a common author
Cluster documents containing information from a common source
Link analysis: unusual correlation between entities
Sequence analysis: predicting a recurring event
Anomaly detection: find information that violates usual patterns
Hypertext analysis
Patterns in anchors/links
Anchor text correlations with linked objects
12. Documents and Document Collections
A document collection is a grouping of text-based documents.
It can be either static or dynamic (growing over time).
A document is a unit of discrete textual data within a collection, usually representing some real-world document, such as a business report, memorandum, email, research paper, or news story.
A document can be a member of different document collections (e.g., legal affairs and computing equipment, if it falls under both).
13. Document Structure
Text documents can be:
unstructured
  i.e., free-style text (though from a linguistic perspective they are really structured objects)
weakly structured
  Adhering to some pre-specified format, like most scientific papers, business reports, legal memoranda, news stories, etc.
semi-structured
  Exploiting heavy document templating or style sheets.
14. Weakly Structured and Semi-structured Documents
Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure are referred to as free-format or weakly structured documents (such as most scientific research papers, business reports, and news stories).
Documents with extensive and consistent format elements, in which field-type metadata can be more easily inferred, are described as semi-structured documents (such as some e-mail, HTML web pages, and PDF files).
15. Document Representation and Features
The irregular and implicitly structured representation is transformed into an explicitly structured representation.
We can distinguish:
- Feature-based representation
- Relational representation
In a feature-based representation, documents are represented by a set of features.
16. Document Features
Although many potential features can be employed to represent documents, the following four types are most commonly used:
Characters
Words
Terms
Concepts
High Feature Dimensionality (HFD)
  Problems relating to HFD are typically of much greater magnitude in TM systems than in classic DM systems.
Feature Sparsity
  Only a small percentage of all possible features for a document collection as a whole appears in any single document.
17. Representational Model of a Document
An essential task for most text mining systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole.
We refer to such a set of features as the representational model of a document.
Commonly Used Document Features:
Characters,
Words,
Terms, and
Concepts
18. Character-level Representation
Without positional information
  Often of very limited utility in TM applications
With positional information
  Somewhat more useful and common (e.g., bigrams or trigrams)
Disadvantage:
  Character-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized.
Word-level Representation
Without positional information
  Often of very limited utility in TM applications
With positional information
  Somewhat more useful and common (e.g., bigrams or trigrams)
Disadvantage:
  Word-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized.
19. Term-level Representation
Normalized terms come out of a term-extraction methodology: sequences of one or more tokenized and lemmatized word forms associated with part-of-speech tags.
Concept-level Representation
Concepts are features generated for a document by means of manual, statistical, rule-based, or hybrid categorization methodologies.
20. General Architecture of Text Mining Systems
Abstract Level
A text mining system takes raw documents as input and generates various types of output, such as:
Patterns
Maps of connections
Trends
21. General Architecture of Text Mining Systems
Functional Level
TM systems follow the general model provided by some classic DM
applications and are thus divisible into 4 main areas
• Preprocessing Tasks
• Core mining operations
• Presentation layer components and browsing functionality
• Refinement techniques
25. Core Text Mining Operations
Core mining operations in text mining systems are the algorithms, and the queries built on them, for discovering patterns in document collections.
Core Text Mining Operations
• Distributions
• Frequent and Near Frequent Sets
• Associations
• Isolating Interesting Patterns
• Analyzing Document Collections over Time
Using Background Knowledge for Text Mining
Text Mining Query Languages
26. Core Text Mining Operations
Core text mining operations consist of various mechanisms for discovering patterns of concepts within a document collection.
The three types of patterns in text mining
Distributions (and proportions)
Frequent and near frequent sets
Associations
Symbols
D : a collection of documents
K : a set of concepts
k : a concept
27. Distributions
Definition 1. Concept Selection
Selecting some sub collection of documents that is labeled by one or
more given concepts
D/K
Subset of documents in D labeled with all of the concepts in K
Definition 2. Concept Proportion
The proportion of a set of documents labeled with a particular
concept
f(D , K) = |D/K| / |D|
The fraction of documents in D labeled with all of the concepts in
K
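To make Definitions 1 and 2 concrete, here is a minimal Python sketch; it assumes (a simplification not in the original slides) that each document is represented as the set of concepts it is labeled with:

def select(D, K):
    # D/K: the subset of documents in D labeled with all of the concepts in K
    return [doc for doc in D if set(K) <= doc]

def proportion(D, K):
    # f(D, K) = |D/K| / |D|
    return len(select(D, K)) / len(D) if D else 0.0

D = [{"finance", "merger"}, {"finance"}, {"sports"}, {"finance", "merger"}]
print(proportion(D, {"finance"}))            # 0.75
print(proportion(D, {"finance", "merger"}))  # 0.5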
28. Distributions
Definition 3. Conditional Concept Proportion
The proportion of a set of documents labeled with a concept that are
also labeled with another concept
f(D , K1|K2) = f(D/K2 , K1)
The proportion of all those documents in D labeled with K2 that
are also labeled with K1
Definition 4. Concept Proportion Distribution
The proportion of documents in some collection that are labeled with
each of a number of selected concepts
F_K(D, x)
The proportion of documents in D labeled with x, for any x in K
29. Distributions
Definition 5. Conditional Concept Proportion Distribution
The proportion of those documents in D labeled with all the concepts in K' that are also labeled with concept x (with x in K)
F_K(D, x | K') = F_K(D/K', x)
Definition 6. Average Concept Proportion
Given a collection of documents D, a concept k, and an internal node
in the hierarchy n, an average concept proportion is the average
value of f(D,k | k’), where k’ ranges over all immediate children of n.
a(D, k | n) = Avg_{k' a child of n} { f(D, k | k') }
30. Distributions
Definition 7. Average Concept Distribution
Given a collection of documents D and two internal nodes in the
hierarchy n and n’, average concept distribution is the distribution
that, for any x that is a child of n, averages x’s proportions over all
children of n’
An(D,x | n’ ) = Avg {k’ is a child of n’} {Fn(D,x | k’)}
31. Frequent and Near Frequent Sets
Frequent Concept Sets
A set of concepts represented in the document collection with co-
occurrences at or above a minimal support level (given as a threshold
parameter s; i.e., all the concepts of the frequent concept set appear
together in at least s documents)
Support
The number (or percent) of documents containing the given rule –
that is, the co-occurrence frequency
Confidence
The percentage of the time that the rule is true
32. Frequent and Near Frequent Sets
Algorithm 1 : The Apriori Algorithm (Agrawal and Srikant 1994)
Discovery methods for frequent concept sets in text mining.
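A compact, hedged sketch of an Apriori-style frequent concept set search (documents as sets of concept labels, s an absolute minimum support; a toy illustration, not the published algorithm verbatim):

from itertools import combinations

def apriori(docs, s):
    items = {c for doc in docs for c in doc}
    level = [frozenset([c]) for c in items
             if sum(c in doc for doc in docs) >= s]
    frequent = list(level)
    k = 2
    while level:
        # Candidate k-sets: unions of frequent (k-1)-sets, pruned so that
        # every (k-1)-subset is itself frequent (the Apriori property).
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        prev = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        level = [c for c in candidates if sum(c <= doc for doc in docs) >= s]
        frequent += level
        k += 1
    return frequent

docs = [{"oil", "opec", "price"}, {"oil", "price"}, {"opec", "oil"}, {"gold"}]
print(apriori(docs, s=2))   # frequent singletons and pairs such as {oil, price}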
33. Frequent and Near Frequent Sets
Algorithm for Frequent Set Generation
Frequent sets are generated in relation to some support level.
Support (i.e., the frequency of co-occurrence) is by convention often expressed as the variable σ; frequent sets are therefore sometimes also referred to as σ-covers, or σ-cover sets.
34. Frequent and Near Frequent Sets
Near Frequent Concept Sets
An undirected relation between two frequent sets of concepts.
This relation can be quantified by measuring the degree of overlap, for example, on the basis of the number of documents that include all the concepts of the two concept sets.
35. Associations
Associations
Directed relations between concepts or sets of concepts
Association Rule
An expression of the form A => B, where A and B are sets of features
An association rule A => B indicates that transactions that involve A tend also to involve B
  A is the left-hand side (LHS)
  B is the right-hand side (RHS)
Confidence of association rule A => B (A, B: frequent concept sets)
  The percentage of documents that include all the concepts in B within the subset of those documents that include all the concepts in A
Support of association rule A => B (A, B: frequent concept sets)
  The percentage of documents that include all the concepts in A and B
36. Associations
Discovering Association Rules
The problem of finding all the association rules with a confidence and support greater than the user-specified minconf (minimum confidence level) and minsup (minimum support level) thresholds.
Two steps of discovering associations:
1. Find all frequent concept sets X (i.e., all combinations of concepts with a support greater than minsup).
2. Test whether X−B => B holds with the required confidence.
Example: X = {w,x,y,z}, B = {y,z}, X−B = {w,x}
X−B => B
{w,x} => {y,z}
Confidence of the association rule {w,x} => {y,z}:
confidence = support({w,x,y,z}) / support({w,x})
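A tiny worked example of this computation in Python (documents as concept sets; the values are chosen purely for illustration):

def support(docs, S):
    return sum(set(S) <= doc for doc in docs) / len(docs)

def confidence(docs, lhs, rhs):
    return support(docs, set(lhs) | set(rhs)) / support(docs, lhs)

docs = [{"w", "x", "y", "z"}, {"w", "x", "y", "z"}, {"w", "x"}, {"y", "z"}]
print(support(docs, {"w", "x", "y", "z"}))       # 0.5
print(confidence(docs, {"w", "x"}, {"y", "z"}))  # 0.5 / 0.75 = 0.667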
37. Associations
Maximal Associations (M-associations)
Relations between concepts in which associations are identified in terms of their relevance to one concept and their lack of relevance to another (e.g., concept X most often appears in association with concept Y).
A simple algorithm for generating associations is given in (Rajman and Besancon 1998).
38. Associations
Definition 8. Alone with Respect to Maximal Associations
For a transaction t, a category g_i, and a concept set X ⊆ g_i, one says that X is alone in t if t ∩ g_i = X; that is, X is alone in t if X is the largest subset of g_i that is in t.
X is maximal in t …
t M-supports X …
For a document collection D, the M-support of X in D is the number of transactions t ∈ D that M-support X.
39. Associations
The M-support for the maximal association:
If D(X, g(Y)) is the subset of the document collection D consisting of all the transactions that M-support X and contain at least one element of g(Y), then the M-confidence of the rule is computed with respect to D(X, g(Y)).
40. Isolating Interesting Patterns
Interestingness with Respect to Distributions and Proportions
Measures for quantifying the distance between an investigated distribution and another (reference) distribution
=> Sum-of-squares to measure the distance between two models:
D(P' || P) = Σ_x (p'(x) − p(x))²
41. Isolating Interesting Patterns
Definition 9. Concept Distribution Distance
Given two concept distributions P'_K(x) and P_K(x), the distance between them is
D(P'_K || P_K) = Σ_x (P'_K(x) − P_K(x))²
Definition 10. Concept Proportion Distance
The value of the difference between the two distributions at a particular point x
d(P'_K || P_K)(x) = P'_K(x) − P_K(x)
42. Analyzing Document Collections
over Time
Incremental Algorithms
Algorithms processing truly dynamic document collections that add,
modify, or delete documents over time
Trend Analysis
The term generally used to describe the analysis of concept
distribution behavior across multiple document subsets over time
A two-phase process
First phase
Phrases are created as frequent sequences of words using the
sequential patterns mining algorithms first mooted for mining
structured databases
Second phase
A user can query the system to obtain all phrases whose trend
matches a specified pattern.
43. Analyzing Document Collections
over Time
Ephemeral Associations
A direct or inverse relation between the probability distributions of
given topics (concepts) over a fixed time span
Direct Ephemeral Associations
One very frequently occurring or “peak” topic during a period
seems to influence either the emergence or disappearance of other
topics
Inverse Ephemeral Associations
Momentary negative influence between one topic and another
Deviation Detection
The identification of anomalous instances that do not fit a defined
“standard case” in large amounts of data.
44. Analyzing Document Collections
over Time
Context Phrases and Context Relationships
Definition 11. Context Phrase
A subset of documents in a document collection that is either
labeled with all, or at least one, of the concepts in a specified set
of concepts.
If D is a collection of documents and C is a set of concepts,
D/A(C) is the subset of documents in D labeled with all the
concepts in C, and D/O(C) is the subset of documents in D labeled
with at least one of the concepts in C. Both D/A(C) and D/O(C)
are referred to as context phrases.
45. Analyzing Document Collections
over Time
Context Phrases and Context Relationships
Definition 12. Context Relationships
The relationship within a set of concepts found in the document
collection in relation to a separately specified concept ( the
context or the context concept)
If D is a collection of documents, c1 and c2 are individual concepts, and P is a context phrase, R(D, c1, c2 | P) is the number of documents in D/P that include both c1 and c2. Formally, R(D, c1, c2 | P) = |(D/A({c1,c2}))/P|.
46. Analyzing Document Collections
over Time
The Context Graph
Definition 13. Context Graph
A graphic representation of the relationships between a set of concepts as reflected in a corpus with respect to a given context.
A context graph consists of a set of vertices (nodes) and edges: the vertices of the graph represent concepts, and weighted edges denote the affinity between the concepts.
If D is a collection of documents, C is a set of concepts, and P is a context phrase, the concept graph of D, C, P is a weighted graph G = (C, E), with nodes in C and a set of edges E = { {c1,c2} | R(D, c1, c2 | P) > 0 }. For each edge {c1,c2} ∈ E, one defines the weight of the edge as w{c1,c2} = R(D, c1, c2 | P).
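A sketch of Definition 13 in Python; the context phrase P is approximated here by a plain predicate in_context selecting the documents of D/P (an assumption for illustration):

from itertools import combinations

def context_graph(docs, concepts, in_context):
    subset = [doc for doc in docs if in_context(doc)]   # D/P
    edges = {}
    for c1, c2 in combinations(sorted(concepts), 2):
        # R(D, c1, c2 | P): documents in D/P containing both c1 and c2
        r = sum({c1, c2} <= doc for doc in subset)
        if r > 0:
            edges[(c1, c2)] = r        # weight w{c1,c2} = R(D, c1, c2 | P)
    return edges

docs = [{"mideast", "oil", "opec"}, {"mideast", "oil"}, {"gold", "oil"}]
print(context_graph(docs, {"oil", "opec", "gold"}, lambda d: "mideast" in d))
# {('oil', 'opec'): 1}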
47. Analyzing Document Collections
over Time
Example of a context graph in the context of P: three nodes, Concept1 (C1), Concept2 (C2), and Concept3 (C3), with edge weights R(D, c1, c2 | P) = 10 and R(D, c1, c3 | P) = 15.
48. Analyzing Document Collections
over Time
Definition 14. Temporal Selection (“Time Interval”)
If D is a collection of documents and I is a time range, date range, or both, D_I is the subset of documents in D whose time stamp, date stamp, or both, is within I. The resulting selection is sometimes referred to as the time interval.
Definition 15. Temporal Context Relationship
If D is a collection of documents, c1 and c2 are individual concepts, P is a context phrase, and I is a time interval, then R_I(D, c1, c2 | P) is the number of documents in D_I in which c1 and c2 co-occur in the context of P – that is, R_I(D, c1, c2 | P) is the number of documents in D_I/P that include both c1 and c2.
Definition 16. Temporal Context Graph
If D is a collection of documents, C is a set of concepts, P is a context phrase, and I is a time range, the temporal concept graph of D, C, P, I is a weighted graph G = (C, E_I) with nodes in C and a set of edges E_I = { {c1,c2} | R_I(D, c1, c2 | P) > 0 }. For each edge {c1,c2} ∈ E_I, one defines the weight of the edge by w_I{c1,c2} = R_I(D, c1, c2 | P).
49. Analyzing Document Collections
over Time
The Trend Graph
A representation that builds on the temporal context graph as
informed by the general approaches found in trend analysis
New Edges
Edges that did not exist in the previous graph
Increased Edges
Edges that have a relatively higher weight in relation to the
previous interval
Decreased Edges
  Edges that have a relatively lower weight than in the previous interval
Stable Edges
Edges that have about the same weight as the corresponding edge
in the previous interval
50. Analyzing Document Collections
over Time
The Borders Incremental Text Mining Algorithm
The Borders algorithm can be used to update search-pattern results incrementally.
Definition 17. Border Set
X is a border set if it is not a frequent set, but any proper subset Y ⊂ X is a frequent set.
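A direct check of Definition 17 in Python, assuming a support(X) function that returns the number of documents containing all concepts in X and s_star as min_sup (a sketch, not the Borders algorithm itself):

from itertools import combinations

def is_border_set(X, support, s_star):
    # X is a border set iff X is not frequent but every nonempty proper
    # subset of X is frequent.
    if support(X) >= s_star:
        return False
    return all(support(set(sub)) >= s_star
               for k in range(1, len(X))
               for sub in combinations(X, k))

docs = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "c"}, {"b", "c"}, {"b", "c"}]
sup = lambda X: sum(set(X) <= d for d in docs)
print(is_border_set({"a", "b", "c"}, sup, s_star=2))   # True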
51. Analyzing Document Collections
over Time
The Borders Incremental Text Mining Algorithm
Concept set A = {A1, …, Am}
Relations over A:
  R_old : old relation
  R_inc : increment
  R_new : new combined relation
s(X/R) : support of concept set X in the relation R
s* : minimum support threshold (min_sup)
Property 1: if X is a new frequent set in R_new, then there is a subset Y ⊆ X such that Y is a promoted border.
Property 2: if X is a new k-sized frequent set in R_new, then each subset Y ⊂ X of size k−1 is one of the following: (a) a promoted border, (b) a new frequent set, or (c) an old frequent set with additional support in R_inc.
52. Analyzing Document Collections
over Time
The Borders Incremental Text Mining Algorithm
Stage 1: Finding Promoted Borders and Generating Candidates.
Stage 2: Processing Candidates
54. Text Mining Preprocessing Techniques
Effective text mining operations are predicated on sophisticated data preprocessing methodologies.
Text mining is dependent on the various preprocessing techniques that infer or extract structured representations from raw unstructured data sources, or do both.
Different preprocessing techniques are used to create structured document representations from raw textual data (they structure documents – and, by extension, document collections).
There are two ways of categorizing the totality of preparatory document structuring techniques – according to:
  Their task, and
  The algorithms and formal frameworks that they use.
55. Pre-processing Techniques
Task-Oriented Approaches
General-purpose NLP tasks
  Tokenization and zoning
  Part-of-speech tagging and stemming
  Shallow and deep syntactic parsing
Problem-dependent tasks
  Text categorization
  Information extraction
57. General Purpose NLP Tasks
Tokenization
Tokenization is the process of breaking a stream of text up into words,
phrases, symbols, or other meaningful elements called tokens.
The list of tokens becomes input for further processing such as parsing
or text mining.
Part-of-speech Tagging
POS tagging is the annotation of words with the appropriate POS tags
based on the context in which they appear.
POS tags divide words into categories based on the role they play in the
sentence in which they appear.
POS tags provide information about the semantic content of a word.
POS taggers at some stage of their processing perform morphological
analysis of words. An additional output of a POS tagger is a sequence of
stems (“lemmas”) of the input words.
58. Syntactical parsing
Syntactical parsing components perform a full syntactical analysis of sentences
according to a certain grammar theory. The basic division is between the
constituency and dependency grammars.
Constituency grammars describe the syntactical structure of sentences in
terms of recursively built phrases – sequences of syntactically grouped
elements.
Dependency grammars do not recognize the constituents as separate linguistic units but focus instead on the direct relations between words.
Shallow parsing
Shallow parsing compromises speed and robustness of processing by
sacrificing depth of analysis.
Instead of providing a complete analysis (a parse) of a whole sentence, shallow
parsers produce only parts that are easy and unambiguous.
For the purposes of information extraction, shallow parsing is usually sufficient
and preferable to full analysis because of its far greater speed and robustness.
59. Problem Dependent Task
Text Categorization
Text categorization (Text Classification) tasks tag each document with a
small number of concepts or keywords.
The set of all possible concepts or keywords is usually manually
prepared, closed, and comparatively small. The hierarchy relation
between the keywords is also prepared manually.
Information Extraction
Information retrieval returns documents that match a given query but still
requires the user to read through these documents to locate the relevant
information.
IE aims at pinpointing the relevant information and presenting it in a structured format – typically a tabular format.
60. Types of Problems
Text mining operates in very high dimensions; in many situations, processing is nonetheless effective and efficient because of the sparseness characteristic of most documents and most practical applications.
The types of problems that can be solved with the text mining approach to data representation and learning methods are:
Document Classification
Information Retrieval
Clustering and Organizing Documents
Information Extraction
Prediction and Evaluation
61. Document Classification
Documents are organized into folders, one folder for each topic. A new
document is presented, and the objective is to place this document in the
appropriate folders.
Document classification or document categorization is a problem in library
science, information science and computer science. The task is to assign a
document to one or more classes or categories.
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external
mechanism, usually human feedback, which provides the necessary
information for the correct classification of documents
semi-supervised document classification, a mixture between supervised
and unsupervised classification: some documents or parts of documents
are labeled by external assistance
unsupervised document classification is entirely executed without
reference to external information
63. Classification Techniques
Decision trees
K-nearest neighbors
  Training examples are points in a vector space
  Compute the distance between the new instance and all training instances; the k closest vote for the class
Naïve Bayes classifier
  Classify using probabilities, assuming independence among terms
  P(x_i | C) is estimated as the relative frequency of examples having value x_i as a feature in class C
  P(C | X_i, X_j, X_k) ∝ P(C) · P(X_i | C) · P(X_j | C) · P(X_k | C)
Neural networks, support vector machines, …
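A minimal Naïve Bayes sketch over binary word features with Laplace smoothing (a toy illustration under assumed data, not a production classifier):

from collections import Counter
from math import log

def train(docs, labels):
    prior = {c: labels.count(c) / len(labels) for c in set(labels)}
    word_counts = {c: Counter() for c in prior}
    class_docs = Counter(labels)
    for doc, c in zip(docs, labels):
        word_counts[c].update(set(doc))        # presence, not raw frequency
    return prior, word_counts, class_docs

def predict(words, prior, word_counts, class_docs):
    def score(c):
        s = log(prior[c])
        for w in set(words):
            # P(x_i | C): smoothed relative frequency of the feature in class C
            s += log((word_counts[c][w] + 1) / (class_docs[c] + 2))
        return s
    return max(prior, key=score)

docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "agenda"]]
labels = ["spam", "spam", "ham"]
model = train(docs, labels)
print(predict(["cheap", "meeting"], *model))   # spam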
64. Information Retrieval
Given
  A source of textual documents
  A user query (text based)
Find
  A ranked set of documents that are relevant to the query
(Diagram: documents source + query -> IR system -> ranked documents)
65. Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need (query)
from within large collections (usually stored on computers).
Basic assumptions
  Collection: a fixed set of documents
  Goal: retrieve documents with information that is relevant to the user’s information need and helps him complete a task
(Figure: retrieving matched documents)
66. Information Retrieval
Basic Information Retrieval (IR) process
Browsing or Navigation system
  The user skims the document collection by jumping from one document to another via hypertext or hypermedia links until a relevant document is found
Classical IR system: Question Answering System
  Query: a question in natural language
  Answer: directly extracted from the text of the document collection
Text Based Information Retrieval
Information Item (document)
Text format (written/spoken) or has textual description
Information Need (query)
Usually in text format
68. Clustering and Organizing Documents: Clustering
Given
  A source of textual documents
  A similarity measure (e.g., how many words are common in these documents)
Find
  Several clusters of documents that are relevant to each other
(Diagram: documents source + similarity measure -> clustering system -> clusters of documents)
69. Clustering and Organizing Documents
The clustering process is equivalent to assigning the labels needed for text categorization. Although there are many ways to cluster documents, clustering is not quite as powerful a process as assigning answers (i.e., known correct labels) to documents.
(Figure: organizing documents into groups)
70. Information Extraction
Definition
The automatic extraction of structured information from
unstructured documents.
Information extraction is the process of scanning text for information relevant to some interest.
Extract:
Entities, Relations, Events
Overall Goals:
Making information more accessible to people
Making information more machine-processable
71. Information Extraction
Why IE?
Need for efficient processing of texts in specialized domains
Focus on relevant parts, ignore the rest
Typical applications:
  Gleaning business, government, and military intelligence
  WWW searches (more specific than keyword search)
  Scientific literature searches
72. Information Extraction
Information extraction is a subfield of text mining that attempts to move
text mining onto an equal footing with the structured world of data
mining.
The objective is to take an unstructured document and automatically fill
in the values of a spreadsheet.
(Figure: extracting information from the document)
73. Prediction and Evaluation
Central to prediction is the measurement of error. For topic assignment, we can determine whether a program’s answer is right or wrong.
The classical measures of accuracy are applicable, but not all errors are evaluated equally.
That is why measures of accuracy such as “recall” and “precision” are especially important to document analysis.
74. Performance Measure
The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure.
The quality of the retrieved collection can be assessed by the two following measures:
Precision: percentage of retrieved documents that are in fact relevant to the
query (i.e., “correct” responses)
Recall: percentage of documents that are relevant to the query and were, in
fact, retrieved
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall    = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
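In Python, with the retrieved and relevant sets given as sets of document IDs:

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

retrieved, relevant = {1, 2, 3, 4}, {2, 4, 5}
print(precision(retrieved, relevant))   # 2/4 = 0.5
print(recall(retrieved, relevant))      # 2/3 = 0.667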
75. From Textual Information to Numerical
Vectors: Introduction
To mine text, we need to process it into a form that data mining procedures can use. As noted earlier, this involves generating features in a spreadsheet format.
Classical data mining looks at highly structured data; the spreadsheet model is the embodiment of a representation that is supportive of predictive modeling.
Predictive text mining is simpler and more restrictive than open-ended data mining.
76. From Textual Information to Numerical
Vectors: Introduction
Text is unstructured – very far from the spreadsheet model that we need in order to process data for prediction. The transformation of data to the spreadsheet model is a methodical and carefully organized procedure for filling in the cells of the spreadsheet.
We have to determine the nature of each column in the spreadsheet. Some features are easy to obtain (e.g., the words in a text); some are difficult (e.g., the grammatical function of a word in a sentence). The following sections describe the kinds of features generated from text.
77. Collecting Documents
The first step of text mining is collecting the data.
A web page retrieval application for an intranet implicitly specifies the relevant documents to be the web pages on the intranet.
If the documents are identified, then they can be obtained; the main issue is to cleanse the samples and ensure high quality.
For a web application comprising a number of autonomous websites, one may deploy a software tool such as a web crawler to collect the documents.
78. Collecting Documents
Other applications attach a logging process to an input data stream for a length of time (e.g., an email audit logs the incoming and outgoing messages at a mail server for a period of time).
For R&D work in text mining, we need generic data – a corpus. A widely used example is the Reuters corpus (RCV1).
In the early days (the 1960s and 1970s), one million words was considered a large collection; the Brown corpus consists of 500 samples of about 2,000 words each of American English text.
79. Collecting Documents
A European corpus was modeled on the Brown corpus, for British English.
In the 1970s and 80s, more resources became available, often government sponsored. Some widely used corpora include the Penn Treebank (a collection of manually parsed sentences from the Wall Street Journal).
Another resource is the World Wide Web: web crawlers can build collections of pages from a particular site, such as Yahoo. Given the size of the web, such collections require cleaning before use.
80. Document Standardization
When documents are collected, they may arrive in different formats: some documents may be collected in a word-processor format, others as simple ASCII text. To process these documents, we have to convert them to a standard format.
The standard format is XML, the Extensible Markup Language.
81. Document Standardization-XML
XML is a standard way to insert tags into text to identify its parts; each document is marked off from the corpus through XML tags such as:
<Date>
<Subject>
<Topic>
<Text>
<Body>
<Header>
82. XML – An Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Diya</to>
<from>Surya</from>
<heading>Reminder</heading>
<body>Happy Birth Day</body>
</note>
83. XML
The main reason to identify the parts is to allow selection of those parts that will be used to generate features. The selected parts of a document are concatenated into strings, separated by tags.
Document Standardization
The advantage of data standardization is that mining tools can be applied without having to consider the pedigree of each document.
84. Tokenization
With the document collection in XML format, we can examine the data.
Tokenization breaks the character stream into words – TOKENS.
Each token is an instance of a type, so the number of tokens is higher than the number of types: if “the” occurs twice in a sentence, those are 2 tokens, each referring to an occurrence of one type.
Space and tab characters are not tokens but white space.
A comma or colon between letters is a token (e.g., USA,INDIA); between numbers it is a delimiter (e.g., 121,135).
An apostrophe has a number of uses – delimiter or part of a token (e.g., D’Angelo); when it is followed by a terminator, it is an internal quote (Tess’.).
85. Tokenization – Pseudocode
A dash is a terminator unless the token is preceded or followed by another dash (e.g., 522-3333).
Without identifying tokens, it is difficult to imagine extracting higher-level information from a document.
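Since the pseudocode itself did not survive, here is a hedged regular-expression sketch of the rules above (the exact rule set is an assumption; real tokenizers use more elaborate rules):

import re

TOKEN = re.compile(r"""
    \d+(?:-\d+)*                 # digit runs, dash-joined: 522-3333
  | [A-Za-z]+(?:'[A-Za-z]+)*     # words with internal apostrophes: D'Angelo
  | [^\sA-Za-z0-9]               # remaining punctuation as single tokens
""", re.VERBOSE)

def tokenize(text):
    # A comma between digits is a mere delimiter: 121,135 -> 121 135
    text = re.sub(r"(?<=\d),(?=\d)", " ", text)
    return TOKEN.findall(text)

print(tokenize("D'Angelo called 522-3333 in the USA,INDIA; totals were 121,135."))
# ["D'Angelo", 'called', '522-3333', 'in', 'the', 'USA', ',', 'INDIA',
#  ';', 'totals', 'were', '121', '135', '.']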
86. Lemmatization
Once a character stream has been segmented into a sequence of tokens, the next step is to convert each token to a standard form – stemming or lemmatization (whether to do so is application dependent).
Stemming reduces the number of distinct types in a corpus and increases the frequency of occurrence of individual types.
English speakers agree that the nouns Book and Books are 2 forms of the same word, and it is often advantageous to eliminate this kind of variation.
Normalization that regularizes such grammatical variants is called inflectional stemming.
87. Stemming to a Root
Grammatical variants (singular/plural, present/past): it is often advantageous to eliminate this kind of variation before further processing.
When normalization is confined to regular grammatical variants such as singular/plural and present/past, the process is called inflectional stemming.
The intent of stemming to a root is to reach a root form with no inflectional or derivational prefixes or suffixes; the end result of this aggressive stemming is to reduce the number of types in the text.
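A small illustration of the difference using NLTK (assumes nltk is installed and, for the lemmatizer, that the wordnet data has been downloaded; any comparable stemmer would do):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["books", "running", "studies"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
# books   -> book  / book
# running -> run   / running   (the lemmatizer defaults to noun readings)
# studies -> studi / study     (aggressive stemming can overshoot the root)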
89. Vector Generation for prediction
Consider the problem of categorizing documents. The characteristic features are the tokens or words the documents contain. Without deep analysis, we can choose to describe each document by features that represent its most frequent tokens.
The collection of features is called a dictionary. The tokens or words in the dictionary form the basis for creating a spreadsheet of numeric data corresponding to the document collection: each row is a document, and each column is a feature.
90. Vector Generation for prediction
Each cell in the spreadsheet is a measurement of a feature for a document. In the basic model of data, we simply check the presence or absence of words. Checking for words is efficient because, rather than checking each word in the dictionary, we build a hash table. Large samples of digital documents are readily available, giving confidence about the variations and combinations of words that occur.
If prediction is our goal, then we need one more column for the correct answer. In preparing data for learning, this information is available from the document labels. (Labels are binary answers, also called the class.)
Instead of generating a global dictionary, we can consider only the words in the class that we are trying to predict. If this class is far smaller than the negative class – which is typical – the local dictionary is far smaller than the global dictionary.
91. Another reduction in dictionary size is to compile a list of stop-words and remove them from the dictionary. Stop-words almost never have any predictive capability, e.g., articles such as “a” and “the” and pronouns such as “it” and “they.”
Frequency information on the word counts can be quite useful in reducing the dictionary size and improving predictive performance: the most frequent words are often stop-words and can be deleted.
An alternative approach to local dictionary generation is to generate a global dictionary from all documents in the collection. Special feature selection routines then attempt to select the subset of words that has the greatest potential for prediction (independent selection methods).
If we have 100 topics to categorize, then we have 100 problems to solve; our choices are 100 small dictionaries or 1 global dictionary.
92. The vectors implied by the spreadsheet model can be regenerated to correspond to the smaller dictionary.
Instead of placing every variant of a word in the dictionary, we can follow the practice of a printed dictionary and avoid storing every variation of a word (no singular/plural, past/present); verbs are stored in stemmed form. This adds a layer of complexity to text processing, but performance is gained and size is reduced.
A universal procedure that trims words to their root form can, however, blur differences in meaning: exit and exiting have different meanings in the context of programming.
A small dictionary can capture the best words easily. The use of tokens and stemming are examples of procedures that help produce smaller dictionaries, improving the manageability of learning and accuracy. In this way, a document can be converted to a spreadsheet.
93. Each column is a feature; each row is a document. Our model of data for predictive text mining is a spreadsheet populated by ones and zeros, whose cells represent the presence or absence of dictionary words in the document collection.
For higher accuracy, additional transformations can be applied:
  Word pairs and collocations
  Frequency
  Tf-idf
Word pairs and collocations serve to increase the size of the dictionary and can improve the performance of prediction.
Instead of 0’s and 1’s in the cells, the frequency of a word can be used (if the word “the” occurs 10 times, the count of “the” is used). Counts can give better results than binary values in the cells; the additional frequency information can yield simpler, more compact solutions than the binary data model.
94. Frequencies are helpful in prediction but add complexity to solutions. A compromise that works well is a three-value system, 0/1/2:
  Word did not occur – 0
  Word occurred once – 1
  Word occurred 2 or more times – 2
This captures much of the added value of frequency without adding much complexity to the model.
Another variant is zeroing the values below a threshold, requiring a token to reach a minimum frequency before being considered of any use; this reduces the complexity of the spreadsheet used in the data mining algorithms. Other methods to reduce complexity are chi-square, mutual information, odds ratio, etc.
The next step beyond counting frequency is to modify the count by the perceived importance of each word.
95. Tf-idf computes weightings or scores for words: positive values that capture more than the absence or presence of the words.
In Eq. (a), the weight assigned to word j is its term frequency modified by a scale factor for the importance of the word. The scale factor is the inverse document frequency (Eq. (b)), which simply checks the number of documents containing the word, df(j), and reverses the scaling:
tf-idf(j) = tf(j) * idf(j)      (a)
idf(j) = log(N / df(j))         (b)
When a word appears in many documents, its scale factor is lowered, perhaps to zero; if a word is unique, appearing in few documents, the scale factor zooms upward, and the word appears important.
Alternatives to this tf-idf formulation exist, but the motivation is the same. The result is a positive score that replaces the simple frequency or binary (T/F) entry in the cell of our spreadsheet.
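A direct sketch of Eqs. (a) and (b) over a toy collection (documents as token lists; natural log used here, since the base is left unspecified above):

from math import log

def tf_idf(docs):
    N, df = len(docs), {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    # tf-idf(j) = tf(j) * idf(j), with idf(j) = log(N / df(j))
    return [{w: doc.count(w) * log(N / df[w]) for w in set(doc)} for doc in docs]

docs = [["oil", "price", "oil"], ["price", "rise"], ["gold", "price"]]
for vec in tf_idf(docs):
    print(vec)
# "price" appears in every document, so idf = log(3/3) = 0 and its weight
# drops to zero, while rarer words such as "oil" score higher.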
96. Another variant is to weight the tokens from different parts of the document differently.
Which data transformation method is BEST? There is no universal answer: the best predictive accuracy depends on matching these methods to the data, and the best variation for one method may not be the best for another. Test them ALL.
We describe data as populating a spreadsheet, but most cells are 0 – each document contains only a small subset of the dictionary words. In text classification, a corpus may have thousands of words, while each individual document has few unique tokens, so most of the spreadsheet row for that document is 0. Rather than store all the 0’s, it is better to represent the spreadsheet as a set of sparse vectors (each row is a list of pairs, where one element of the pair is the column and the other is the corresponding nonzero value). By not storing the zeros, we save a great deal of memory.
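A sketch of this sparse row representation as (column, nonzero value) pairs:

def to_sparse(row):
    return [(j, v) for j, v in enumerate(row) if v != 0]

def to_dense(pairs, ncols):
    row = [0] * ncols
    for j, v in pairs:
        row[j] = v
    return row

row = [0, 0, 1, 0, 2, 0, 0, 1]            # one spreadsheet row (document)
sparse = to_sparse(row)                   # [(2, 1), (4, 2), (7, 1)]
print(sparse, to_dense(sparse, len(row)) == row)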
97. Multi Word Features
So far, features are associated with single words (tokens delimited by white space).
This simple scenario can be extended to include pairs of words, e.g., “bon” and “vivant”: instead of separating them, we can treat “bon vivant” as a single feature.
Why stop at pairs? Why not consider multiword features? Unlike word pairs, the words need not be consecutive. E.g., with “Don Smith” as a feature, we can ignore his middle name, Leroy, which may reappear in some references to the person. In this case we have to accommodate the many references to a noun that involve a number of adjectives, with the desired adjective not adjacent to the noun: e.g., we want to accept the phrase “broken and dirty vase” as an instance of “broken vase”.
98. A multiword feature is x words occurring within a maximum window of size y (y >= x, naturally).
How are such features extracted from text? If we use frequency methods, we look for combinations of words that are relatively frequent; a straightforward implementation considers simple combinations of x words in a window of size y. Measuring the value of a potential multiword feature is done by computing the correlation between the words it contains; measures based on mutual information or the likelihood ratio are used.
An algorithm can generate multiword features, but a straightforward implementation consumes a lot of memory. Multiword features are not found too often in a document collection, but they are highly predictive.
One association measure for a candidate multiword feature T:
AM(T) = size(T) * freq(T) * log10(freq(T)) / Σ_{word_i ∈ T} freq(word_i)
100. Labels for Right Answers:
For prediction, an extra column is added to the spreadsheet. This last column contains the labels and looks no different from the others: it is a 0 or 1 indicating the right answer as either true or false. In the sparse vector format, labels are appended to each vector separately as either a one (positive class) or a zero (negative class).
Feature Selection by Attribute Ranking:
In addition to frequency-based approaches, feature selection can be done in a number of ways. One approach is to select a set of features for each category, forming a local dictionary for the category, by independently ranking the feature attributes according to their predictive abilities for the category under consideration.
The predictive ability of an attribute can be measured by a quantity expressing how it is correlated with the label. Assume n documents; let x_i denote the presence or absence of attribute j in document i, and let y_i denote the label of document i (the last column).
101. A commonly used ranking score is the information gain criterion, defined below.
The quantity L(j) is the number of bits required to encode the label and the attribute j minus the number of bits required to encode the attribute. The quantities needed to compute L(j) can be easily estimated using the estimators that follow.
IG(j) = L − L(j)
L = Σ_{c=0,1} Pr(y = c) · log2(1 / Pr(y = c))
L(j) = Σ_{v=0,1} Pr(x_j = v) · Σ_{c=0,1} Pr(y = c | x_j = v) · log2(1 / Pr(y = c | x_j = v))
Estimators:
pr(x_j = v) = freq(x_j = v) / n
pr(y = c | x_j = v) = (freq(x_j = v, label = c) + 1) / (freq(x_j = v) + 2)
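A sketch of this score for one binary attribute j, using the smoothed estimators (the label column y and attribute column xj as 0/1 lists of length n):

from math import log2

def info_gain(xj, y):
    n = len(y)
    def H(ps):   # sum of p * log2(1/p), skipping zero probabilities
        return sum(p * log2(1 / p) for p in ps if p > 0)
    p1 = sum(y) / n
    L = H([p1, 1 - p1])                    # bits to encode the label alone
    Lj = 0.0
    for v in (0, 1):
        nv = sum(1 for x in xj if x == v)
        pv = nv / n                        # pr(x_j = v)
        n1 = sum(1 for x, c in zip(xj, y) if x == v and c == 1)
        q1 = (n1 + 1) / (nv + 2)           # smoothed pr(y = 1 | x_j = v)
        Lj += pv * H([q1, 1 - q1])
    return L - Lj

print(info_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # ~0.19: predictive attribute
print(info_gain([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.0: uninformative attribute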
102. Sentence Boundary Determination
If the XML markup for a corpus doesn’t mark sentence boundaries, it is necessary to mark the sentences ourselves. In particular, it is necessary to determine when a period is part of a token and when it is not.
More sophisticated linguistic parsing algorithms often require a complete sentence as input, and extraction algorithms operate on text one sentence at a time; for these algorithms to work well, sentences must be identified clearly.
Sentence boundary determination is the problem of deciding which instances of a period followed by white space are sentence delimiters and which are not (characters such as ? and ! are assumed to always end sentences); it is thus a classification problem. With accuracy measurement and adjustment, such an algorithm’s performance can be improved.
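A naive sketch of such a classifier: every period followed by white space is a boundary unless it ends a known abbreviation (the abbreviation list and rules here are assumptions for illustration; real systems learn these decisions):

import re

ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e", "etc"}   # toy list

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+", text):
        words = text[start:m.start()].split()
        token = words[-1].rstrip(".").lower() if words else ""
        if text[m.start()] == "." and token in ABBREVIATIONS:
            continue            # this period is part of a token, not a delimiter
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late! Was it raining?"))
# ['Dr. Smith arrived.', 'He was late!', 'Was it raining?']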