Text mining seeks to extract useful information from unstructured text documents. It involves preprocessing the text, identifying features, and applying techniques from data mining, machine learning, and natural language processing to discover patterns. The core operations of text mining include analyzing distributions of concepts and identifying frequent concept sets and associations between concepts. Text mining systems also aim to analyze document collections over time to identify trends, ephemeral relationships, and anomalous patterns.
2. Definition : Text Mining
Text mining refers generally to the process of extracting interesting
information and knowledge from unstructured text.
Text Mining can be defined as a knowledge-intensive process in which
a user interacts with a document collection over time by using a suite of
analysis tools.
And
Text Mining seeks to extract useful information from data sources
(document collections) through the identification and exploration of
interesting patterns.
3. Text Mining
Text mining (TM) seeks to extract useful information from a collection of
documents.
It is similar to data mining (DM), but the data sources are unstructured or
semi-structured documents.
The TM methods involve:
- Basic preprocessing / TM operations, such as the identification and extraction of representative features (this can be done in several phases)
- Advanced text mining operations, involving the identification of complex patterns (e.g., relationships between previously identified concepts)
TM exploits techniques and methodologies from data mining, machine learning, information retrieval, and corpus-based computational linguistics.
4. Similarity and difference between
Data and Text Mining
Both types of systems rely on:
Preprocessing routines
Pattern-discovery algorithms
Presentation-layer elements such as visualization tools
Pre-processing operations:
Data mining assumes the data is stored in a structured format, so preprocessing focuses on scrubbing and normalizing data and on creating extensive numbers of table joins.
In text mining, preprocessing operations center on the identification and extraction of representative features for natural-language documents, to transform unstructured data stored in document collections into a more explicitly structured intermediate format.
5. Finding Frequent Patterns
Finding “Nuggets”
                   Novel                          Non-Novel
Non-textual data   General Data Mining /          Database Queries
                   Exploratory Data Analysis
Textual data       Computational Linguistics      Information Retrieval
7. Text Mining Process
Text preprocessing
  Syntactic/semantic text analysis
Features Generation
  Bag of words
Features Selection
  Simple counting
  Statistics
Text/Data Mining
  Classification (supervised learning)
  Clustering (unsupervised learning)
Analyzing results
  Mapping/Visualization
  Result interpretation
8. Text Mining Tasks
TM
  Text Analysis Tools
    Feature extraction
      Name extraction
      Term extraction
      Abbreviation extraction
      Relationship extraction
    Categorization
    Summarization
    Clustering
      Hierarchical clustering
      Binary relational clustering
  Web Searching Tools
    Text search engine
    Net Question Solution
    Web Crawler
9. Handling Text Data
Modeling semi-structured data
Information Retrieval (IR) from unstructured documents
• Locates and ranks relevant documents
• Keyword based (Boolean matching)
• Similarity based
Text mining
• Classify documents
• Cluster documents
• Find patterns or trends across documents
10. Challenges in Text Mining
The data collection is “free text”: not well organized (semi-structured or unstructured).
There is no uniform access over all sources; each source has separate storage and algebra. Examples: email, databases, applications, the web.
A quintuple heterogeneity: semantic, linguistic, structure, format, and size of the unit of information.
Learning techniques for processing text typically need annotated training data.
XML as the common model allows:
  Manipulating data with standards
  Mining becomes more like data mining
RDF is emerging as a complementary model.
The more structure you can exploit, the better you can do mining.
11. Types of Text Data Mining
Keyword-based association analysis
Automatic document classification
Similarity detection
Cluster documents by a common author
Cluster documents containing information from a common source
Link analysis: unusual correlation between entities
Sequence analysis: predicting a recurring event
Anomaly detection: find information that violates usual patterns
Hypertext analysis
Patterns in anchors/links
Anchor text correlations with linked objects
12. Documents and Document Collections
A document collection is a grouping of text-based documents.
It can be either static or dynamic (growing over time).
A document is a unit of discrete textual data within a collection, usually representing some real-world document, such as a business report, memorandum, email, research paper, or news story.
A document can be a member of different document collections (e.g., legal affairs and computing equipment, if it falls under both).
13. Document Structure
Text documents can be:
unstructured
  i.e., free-style text (though from a linguistic perspective they are really structured objects)
weakly structured
  Adhering to some pre-specified format, like most scientific papers, business reports, legal memoranda, news stories, etc.
semi-structured
  Exploiting heavy document templating or style sheets.
14. Weakly Structured and Semi-structured Documents
Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure are referred to as free-format or weakly structured documents (such as most scientific research papers, business reports, and news stories).
Documents with extensive and consistent format elements, in which field-type metadata can be more easily inferred, are described as semi-structured documents (such as some e-mail, HTML web pages, and PDF files).
15. Document Representation and Features
The irregular and implicitly structured representation is transformed into an explicitly structured representation.
We can distinguish:
- Feature-based representation
- Relational representation
In a feature-based representation, documents are represented by a set of features.
16. Document Features
Although many potential features can be employed to represent documents, the following four types are most commonly used:
Characters
Words
Terms
Concepts
High Feature Dimensionality (HFD)
  Problems relating to HFD are typically of much greater magnitude in TM systems than in classic DM systems.
Feature Sparsity
  Only a small percentage of all possible features for a document collection as a whole appears in any single document.
17. Representational Model of a Document
An essential task for most text mining systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole.
We refer to such a set of features as the representational model of a document.
Commonly Used Document Features:
Characters,
Words,
Terms, and
Concepts
18. Character-level Representation
Without positional information
  Often of very limited utility in TM applications
With positional information
  Somewhat more useful and common (e.g., bigrams or trigrams)
Disadvantage:
  Character-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized.
Word-level Representation
Without positional information
  Often of very limited utility in TM applications
With positional information
  Somewhat more useful and common (e.g., bigrams or trigrams)
Disadvantage:
  Word-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized.
19. Term-level Representation
Normalized terms come out of a term-extraction methodology: sequences of one or more tokenized and lemmatized word forms associated with part-of-speech tags.
Concept-level Representation
Concepts are features generated for a document by means of manual, statistical, rule-based, or hybrid categorization methodologies.
20. General Architecture of Text Mining Systems
Abstract Level
A text mining system takes raw documents as input and generates various types of output, such as:
Patterns
Maps of connections
Trends
21. General Architecture of Text Mining Systems
Functional Level
TM systems follow the general model provided by some classic DM
applications and are thus divisible into 4 main areas
• Preprocessing Tasks
• Core mining operations
• Presentation layer components and browsing functionality
• Refinement techniques
25. Core Text Mining Operations
Core mining operations in text mining systems are the algorithms, and the queries built on them, for discovering patterns in document collections.
Core Text Mining Operations
• Distributions
• Frequent and Near Frequent Sets
• Associations
• Isolating Interesting Patterns
• Analyzing Document Collections over Time
Using Background Knowledge for Text Mining
Text Mining Query Languages
26. Core Text Mining Operations
Core text mining operations consist of various mechanisms for discovering patterns of concepts within a document collection.
The three types of patterns in text mining
Distributions (and proportions)
Frequent and near frequent sets
Associations
Symbols
D : a collection of documents
K : a set of concepts
k : a concept
27. Distributions
Definition 1. Concept Selection
Selecting some sub collection of documents that is labeled by one or
more given concepts
D/K
Subset of documents in D labeled with all of the concepts in K
Definition 2. Concept Proportion
The proportion of a set of documents labeled with a particular
concept
f(D , K) = |D/K| / |D|
The fraction of documents in D labeled with all of the concepts in
K
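To make Definitions 1 and 2 concrete, here is a minimal Python sketch; it assumes (a simplification not in the original slides) that each document is represented as the set of concepts it is labeled with:

def select(D, K):
    # D/K: the subset of documents in D labeled with all of the concepts in K
    return [doc for doc in D if set(K) <= doc]

def proportion(D, K):
    # f(D, K) = |D/K| / |D|
    return len(select(D, K)) / len(D) if D else 0.0

D = [{"finance", "merger"}, {"finance"}, {"sports"}, {"finance", "merger"}]
print(proportion(D, {"finance"}))            # 0.75
print(proportion(D, {"finance", "merger"}))  # 0.5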
28. Distributions
Definition 3. Conditional Concept Proportion
The proportion of a set of documents labeled with a concept that are
also labeled with another concept
f(D , K1|K2) = f(D/K2 , K1)
The proportion of all those documents in D labeled with K2 that
are also labeled with K1
Definition 4. Concept Proportion Distribution
The proportion of documents in some collection that are labeled with
each of a number of selected concepts
F_K(D, x)
The proportion of documents in D labeled with x, for any x in K
29. Distributions
Definition 5. Conditional Concept Proportion Distribution
The proportion of those documents in D labeled with all the concepts in K' that are also labeled with concept x (with x in K)
F_K(D, x | K') = F_K(D/K', x)
Definition 6. Average Concept Proportion
Given a collection of documents D, a concept k, and an internal node
in the hierarchy n, an average concept proportion is the average
value of f(D,k | k’), where k’ ranges over all immediate children of n.
a(D, k | n) = Avg_{k' a child of n} { f(D, k | k') }
30. Distributions
Definition 7. Average Concept Distribution
Given a collection of documents D and two internal nodes in the
hierarchy n and n’, average concept distribution is the distribution
that, for any x that is a child of n, averages x’s proportions over all
children of n’
An(D,x | n’ ) = Avg {k’ is a child of n’} {Fn(D,x | k’)}
31. Frequent and Near Frequent Sets
Frequent Concept Sets
A set of concepts represented in the document collection with co-
occurrences at or above a minimal support level (given as a threshold
parameter s; i.e., all the concepts of the frequent concept set appear
together in at least s documents)
Support
The number (or percent) of documents containing the given rule –
that is, the co-occurrence frequency
Confidence
The percentage of the time that the rule is true
32. Frequent and Near Frequent Sets
Algorithm 1 : The Apriori Algorithm (Agrawal and Srikant 1994)
Discovery methods for frequent concept sets in text mining.
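A compact, hedged sketch of an Apriori-style frequent concept set search (documents as sets of concept labels, s an absolute minimum support; a toy illustration, not the published algorithm verbatim):

from itertools import combinations

def apriori(docs, s):
    items = {c for doc in docs for c in doc}
    level = [frozenset([c]) for c in items
             if sum(c in doc for doc in docs) >= s]
    frequent = list(level)
    k = 2
    while level:
        # Candidate k-sets: unions of frequent (k-1)-sets, pruned so that
        # every (k-1)-subset is itself frequent (the Apriori property).
        candidates = {a | b for a, b in combinations(level, 2) if len(a | b) == k}
        prev = set(level)
        candidates = {c for c in candidates
                      if all(frozenset(sub) in prev for sub in combinations(c, k - 1))}
        level = [c for c in candidates if sum(c <= doc for doc in docs) >= s]
        frequent += level
        k += 1
    return frequent

docs = [{"oil", "opec", "price"}, {"oil", "price"}, {"opec", "oil"}, {"gold"}]
print(apriori(docs, s=2))   # frequent singletons and pairs such as {oil, price}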
33. Frequent and Near Frequent Sets
Algorithm for Frequent Set Generation
Frequent sets are generated in relation to some support level.
Support (i.e., the frequency of co-occurrence) is by convention often expressed as the variable σ; frequent sets are therefore sometimes also referred to as σ-covers, or σ-cover sets.
34. Frequent and Near Frequent Sets
Near Frequent Concept Sets
An undirected relation between two frequent sets of concepts.
This relation can be quantified by measuring the degree of overlap, for example, on the basis of the number of documents that include all the concepts of the two concept sets.
35. Associations
Associations
Directed relations between concepts or sets of concepts
Association Rule
An expression of the form A => B, where A and B are sets of features
An association rule A => B indicates that transactions that involve A tend also to involve B
  A is the left-hand side (LHS)
  B is the right-hand side (RHS)
Confidence of association rule A => B (A, B: frequent concept sets)
  The percentage of documents that include all the concepts in B within the subset of those documents that include all the concepts in A
Support of association rule A => B (A, B: frequent concept sets)
  The percentage of documents that include all the concepts in A and B
36. Associations
Discovering Association Rules
The problem of finding all the association rules with a confidence and support greater than the user-specified minconf (minimum confidence level) and minsup (minimum support level) thresholds.
Two steps of discovering associations:
1. Find all frequent concept sets X (i.e., all combinations of concepts with a support greater than minsup).
2. Test whether X−B => B holds with the required confidence.
Example: X = {w,x,y,z}, B = {y,z}, X−B = {w,x}
X−B => B
{w,x} => {y,z}
Confidence of the association rule {w,x} => {y,z}:
confidence = support({w,x,y,z}) / support({w,x})
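A tiny worked example of this computation in Python (documents as concept sets; the values are chosen purely for illustration):

def support(docs, S):
    return sum(set(S) <= doc for doc in docs) / len(docs)

def confidence(docs, lhs, rhs):
    return support(docs, set(lhs) | set(rhs)) / support(docs, lhs)

docs = [{"w", "x", "y", "z"}, {"w", "x", "y", "z"}, {"w", "x"}, {"y", "z"}]
print(support(docs, {"w", "x", "y", "z"}))       # 0.5
print(confidence(docs, {"w", "x"}, {"y", "z"}))  # 0.5 / 0.75 = 0.667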
37. Associations
Maximal Associations (M-associations)
Relations between concepts in which associations are identified in terms of their relevance to one concept and their lack of relevance to another (e.g., concept X most often appears in association with concept Y).
A simple algorithm for generating associations is given in (Rajman and Besancon 1998).
38. Associations
Definition 8. Alone with Respect to Maximal Associations
For a transaction t, a category g_i, and a concept set X ⊆ g_i, one says that X is alone in t if t ∩ g_i = X; that is, X is alone in t if X is the largest subset of g_i that is in t.
X is maximal in t …
t M-supports X …
For a document collection D, the M-support of X in D is the number of transactions t ∈ D that M-support X.
39. Associations
The M-support for the maximal association:
If D(X, g(Y)) is the subset of the document collection D consisting of all the transactions that M-support X and contain at least one element of g(Y), then the M-confidence of the rule is computed with respect to D(X, g(Y)).
40. Isolating Interesting Patterns
Interestingness with Respect to Distributions and Proportions
Measures for quantifying the distance between an investigated distribution and another (reference) distribution
=> Sum-of-squares to measure the distance between two models:
D(P' || P) = Σ_x (p'(x) − p(x))²
41. Isolating Interesting Patterns
Definition 9. Concept Distribution Distance
Given two concept distributions P'_K(x) and P_K(x), the distance between them is
D(P'_K || P_K) = Σ_x (P'_K(x) − P_K(x))²
Definition 10. Concept Proportion Distance
The value of the difference between the two distributions at a particular point x
d(P'_K || P_K)(x) = P'_K(x) − P_K(x)
42. Analyzing Document Collections
over Time
Incremental Algorithms
Algorithms processing truly dynamic document collections that add,
modify, or delete documents over time
Trend Analysis
The term generally used to describe the analysis of concept
distribution behavior across multiple document subsets over time
A two-phase process
First phase
Phrases are created as frequent sequences of words using the
sequential patterns mining algorithms first mooted for mining
structured databases
Second phase
A user can query the system to obtain all phrases whose trend
matches a specified pattern.
43. Analyzing Document Collections
over Time
Ephemeral Associations
A direct or inverse relation between the probability distributions of
given topics (concepts) over a fixed time span
Direct Ephemeral Associations
One very frequently occurring or “peak” topic during a period
seems to influence either the emergence or disappearance of other
topics
Inverse Ephemeral Associations
Momentary negative influence between one topic and another
Deviation Detection
The identification of anomalous instances that do not fit a defined
“standard case” in large amounts of data.
44. Analyzing Document Collections
over Time
Context Phrases and Context Relationships
Definition 11. Context Phrase
A subset of documents in a document collection that is either
labeled with all, or at least one, of the concepts in a specified set
of concepts.
If D is a collection of documents and C is a set of concepts,
D/A(C) is the subset of documents in D labeled with all the
concepts in C, and D/O(C) is the subset of documents in D labeled
with at least one of the concepts in C. Both D/A(C) and D/O(C)
are referred to as context phrases.
45. Analyzing Document Collections
over Time
Context Phrases and Context Relationships
Definition 12. Context Relationships
The relationship within a set of concepts found in the document
collection in relation to a separately specified concept ( the
context or the context concept)
If D is a collection of documents, c1 and c2 are individual concepts, and P is a context phrase, R(D, c1, c2 | P) is the number of documents in D/P that include both c1 and c2. Formally, R(D, c1, c2 | P) = |(D/A({c1,c2}))/P|.
46. Analyzing Document Collections
over Time
The Context Graph
Definition 13. Context Graph
A graphic representation of the relationships between a set of concepts as reflected in a corpus with respect to a given context.
A context graph consists of a set of vertices (nodes) and edges: the vertices of the graph represent concepts, and weighted edges denote the affinity between the concepts.
If D is a collection of documents, C is a set of concepts, and P is a context phrase, the concept graph of D, C, P is a weighted graph G = (C, E), with nodes in C and a set of edges E = { {c1,c2} | R(D, c1, c2 | P) > 0 }. For each edge {c1,c2} ∈ E, one defines the weight of the edge as w{c1,c2} = R(D, c1, c2 | P).
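A sketch of Definition 13 in Python; the context phrase P is approximated here by a plain predicate in_context selecting the documents of D/P (an assumption for illustration):

from itertools import combinations

def context_graph(docs, concepts, in_context):
    subset = [doc for doc in docs if in_context(doc)]   # D/P
    edges = {}
    for c1, c2 in combinations(sorted(concepts), 2):
        # R(D, c1, c2 | P): documents in D/P containing both c1 and c2
        r = sum({c1, c2} <= doc for doc in subset)
        if r > 0:
            edges[(c1, c2)] = r        # weight w{c1,c2} = R(D, c1, c2 | P)
    return edges

docs = [{"mideast", "oil", "opec"}, {"mideast", "oil"}, {"gold", "oil"}]
print(context_graph(docs, {"oil", "opec", "gold"}, lambda d: "mideast" in d))
# {('oil', 'opec'): 1}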
47. Analyzing Document Collections
over Time
Example of a context graph in the context of P: three nodes, Concept1 (C1), Concept2 (C2), and Concept3 (C3), with edge weights R(D, c1, c2 | P) = 10 and R(D, c1, c3 | P) = 15.
48. Analyzing Document Collections
over Time
Definition 14. Temporal Selection (“Time Interval”)
If D is a collection of documents and I is a time range, date range, or both, D_I is the subset of documents in D whose time stamp, date stamp, or both, is within I. The resulting selection is sometimes referred to as the time interval.
Definition 15. Temporal Context Relationship
If D is a collection of documents, c1 and c2 are individual concepts, P is a context phrase, and I is a time interval, then R_I(D, c1, c2 | P) is the number of documents in D_I in which c1 and c2 co-occur in the context of P – that is, R_I(D, c1, c2 | P) is the number of documents in D_I/P that include both c1 and c2.
Definition 16. Temporal Context Graph
If D is a collection of documents, C is a set of concepts, P is a context phrase, and I is a time range, the temporal concept graph of D, C, P, I is a weighted graph G = (C, E_I) with nodes in C and a set of edges E_I = { {c1,c2} | R_I(D, c1, c2 | P) > 0 }. For each edge {c1,c2} ∈ E_I, one defines the weight of the edge by w_I{c1,c2} = R_I(D, c1, c2 | P).
49. Analyzing Document Collections
over Time
The Trend Graph
A representation that builds on the temporal context graph as
informed by the general approaches found in trend analysis
New Edges
Edges that did not exist in the previous graph
Increased Edges
Edges that have a relatively higher weight in relation to the
previous interval
Decreased Edges
  Edges that have a relatively lower weight than in the previous interval
Stable Edges
Edges that have about the same weight as the corresponding edge
in the previous interval
50. Analyzing Document Collections
over Time
The Borders Incremental Text Mining Algorithm
The Borders algorithm can be used to update search-pattern results incrementally.
Definition 17. Border Set
X is a border set if it is not a frequent set, but any proper subset Y ⊂ X is a frequent set.
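A direct check of Definition 17 in Python, assuming a support(X) function that returns the number of documents containing all concepts in X and s_star as min_sup (a sketch, not the Borders algorithm itself):

from itertools import combinations

def is_border_set(X, support, s_star):
    # X is a border set iff X is not frequent but every nonempty proper
    # subset of X is frequent.
    if support(X) >= s_star:
        return False
    return all(support(set(sub)) >= s_star
               for k in range(1, len(X))
               for sub in combinations(X, k))

docs = [{"a", "b"}, {"a", "b"}, {"a", "c"}, {"a", "c"}, {"b", "c"}, {"b", "c"}]
sup = lambda X: sum(set(X) <= d for d in docs)
print(is_border_set({"a", "b", "c"}, sup, s_star=2))   # True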
51. Analyzing Document Collections
over Time
The Borders Incremental Text Mining Algorithm
Concept set A = {A1, …, Am}
Relations over A:
  R_old : old relation
  R_inc : increment
  R_new : new combined relation
s(X/R) : support of concept set X in the relation R
s* : minimum support threshold (min_sup)
Property 1: if X is a new frequent set in R_new, then there is a subset Y ⊆ X such that Y is a promoted border.
Property 2: if X is a new k-sized frequent set in R_new, then each subset Y ⊂ X of size k−1 is one of the following: (a) a promoted border, (b) a new frequent set, or (c) an old frequent set with additional support in R_inc.
52. Analyzing Document Collections
over Time
The Borders Incremental Text Mining Algorithm
Stage 1: Finding Promoted Borders and Generating Candidates.
Stage 2: Processing Candidates
54. Text Mining Preprocessing Techniques
Effective text mining operations are predicated on sophisticated data preprocessing methodologies.
Text mining is dependent on the various preprocessing techniques that infer or extract structured representations from raw unstructured data sources, or do both.
Different preprocessing techniques are used to create structured document representations from raw textual data (they structure documents – and, by extension, document collections).
There are two ways of categorizing the totality of preparatory document structuring techniques – according to:
  Their task, and
  The algorithms and formal frameworks that they use.
55. Pre-processing Techniques
Task-Oriented Approaches
General-purpose NLP tasks
  Tokenization and zoning
  Part-of-speech tagging and stemming
  Shallow and deep syntactic parsing
Problem-dependent tasks
  Text categorization
  Information extraction
57. General Purpose NLP Tasks
Tokenization
Tokenization is the process of breaking a stream of text up into words,
phrases, symbols, or other meaningful elements called tokens.
The list of tokens becomes input for further processing such as parsing
or text mining.
Part-of-speech Tagging
POS tagging is the annotation of words with the appropriate POS tags
based on the context in which they appear.
POS tags divide words into categories based on the role they play in the
sentence in which they appear.
POS tags provide information about the semantic content of a word.
POS taggers at some stage of their processing perform morphological
analysis of words. An additional output of a POS tagger is a sequence of
stems (“lemmas”) of the input words.
58. Syntactical parsing
Syntactical parsing components perform a full syntactical analysis of sentences
according to a certain grammar theory. The basic division is between the
constituency and dependency grammars.
Constituency grammars describe the syntactical structure of sentences in
terms of recursively built phrases – sequences of syntactically grouped
elements.
Dependency grammars do not recognize the constituents as separate linguistic units but focus instead on the direct relations between words.
Shallow parsing
Shallow parsing compromises speed and robustness of processing by
sacrificing depth of analysis.
Instead of providing a complete analysis (a parse) of a whole sentence, shallow
parsers produce only parts that are easy and unambiguous.
For the purposes of information extraction, shallow parsing is usually sufficient
and preferable to full analysis because of its far greater speed and robustness.
59. Problem Dependent Task
Text Categorization
Text categorization (Text Classification) tasks tag each document with a
small number of concepts or keywords.
The set of all possible concepts or keywords is usually manually
prepared, closed, and comparatively small. The hierarchy relation
between the keywords is also prepared manually.
Information Extraction
Information retrieval returns documents that match a given query but still
requires the user to read through these documents to locate the relevant
information.
IE aims at pinpointing the relevant information and presenting it in a structured format – typically a tabular format.
60. Types of Problems
Text mining operates in very high dimensions; in many situations, processing is nonetheless effective and efficient because of the sparseness characteristic of most documents and most practical applications.
The types of problems that can be solved with the text mining approach to data representation and learning methods are:
Document Classification
Information Retrieval
Clustering and Organizing Documents
Information Extraction
Prediction and Evaluation
61. Document Classification
Documents are organized into folders, one folder for each topic. A new
document is presented, and the objective is to place this document in the
appropriate folders.
Document classification or document categorization is a problem in library
science, information science and computer science. The task is to assign a
document to one or more classes or categories.
Document classification tasks can be divided into three kinds
supervised document classification is performed by an external
mechanism, usually human feedback, which provides the necessary
information for the correct classification of documents
semi-supervised document classification, a mixture between supervised
and unsupervised classification: some documents or parts of documents
are labeled by external assistance
unsupervised document classification is entirely executed without
reference to external information
63. Classification Techniques
Decision trees
K-nearest neighbors
  Training examples are points in a vector space
  Compute the distance between the new instance and all training instances; the k closest vote for the class
Naïve Bayes classifier
  Classify using probabilities, assuming independence among terms
  P(x_i | C) is estimated as the relative frequency of examples having value x_i as a feature in class C
  P(C | X_i, X_j, X_k) ∝ P(C) · P(X_i | C) · P(X_j | C) · P(X_k | C)
Neural networks, support vector machines, …
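A minimal Naïve Bayes sketch over binary word features with Laplace smoothing (a toy illustration under assumed data, not a production classifier):

from collections import Counter
from math import log

def train(docs, labels):
    prior = {c: labels.count(c) / len(labels) for c in set(labels)}
    word_counts = {c: Counter() for c in prior}
    class_docs = Counter(labels)
    for doc, c in zip(docs, labels):
        word_counts[c].update(set(doc))        # presence, not raw frequency
    return prior, word_counts, class_docs

def predict(words, prior, word_counts, class_docs):
    def score(c):
        s = log(prior[c])
        for w in set(words):
            # P(x_i | C): smoothed relative frequency of the feature in class C
            s += log((word_counts[c][w] + 1) / (class_docs[c] + 2))
        return s
    return max(prior, key=score)

docs = [["cheap", "pills"], ["cheap", "offer"], ["meeting", "agenda"]]
labels = ["spam", "spam", "ham"]
model = train(docs, labels)
print(predict(["cheap", "meeting"], *model))   # spam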
64. Information Retrieval
Given
  A source of textual documents
  A user query (text based)
Find
  A ranked set of documents that are relevant to the query
(Diagram: documents source + query -> IR system -> ranked documents)
65. Information Retrieval
Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need (query)
from within large collections (usually stored on computers).
Basic assumptions
  Collection: a fixed set of documents
  Goal: retrieve documents with information that is relevant to the user’s information need and helps him complete a task
(Figure: retrieving matched documents)
66. Information Retrieval
Basic Information Retrieval (IR) process
Browsing or Navigation system
  The user skims the document collection by jumping from one document to another via hypertext or hypermedia links until a relevant document is found
Classical IR system: Question Answering System
  Query: a question in natural language
  Answer: directly extracted from the text of the document collection
Text Based Information Retrieval
Information Item (document)
Text format (written/spoken) or has textual description
Information Need (query)
Usually in text format
68. Clustering and Organizing Documents: Clustering
Given
  A source of textual documents
  A similarity measure (e.g., how many words are common in these documents)
Find
  Several clusters of documents that are relevant to each other
(Diagram: documents source + similarity measure -> clustering system -> clusters of documents)
69. Clustering and Organizing Documents
The clustering process is equivalent to assigning the labels needed for text categorization. Although there are many ways to cluster documents, clustering is not quite as powerful a process as assigning answers (i.e., known correct labels) to documents.
(Figure: organizing documents into groups)
70. Information Extraction
Definition
The automatic extraction of structured information from
unstructured documents.
Information extraction is the process of scanning text for information relevant to some interest.
Extract:
Entities, Relations, Events
Overall Goals:
Making information more accessible to people
Making information more machine-processable
71. Information Extraction
Why IE?
Need for efficient processing of texts in specialized domains
Focus on relevant parts, ignore the rest
Typical applications:
  Gleaning business, government, and military intelligence
  WWW searches (more specific than keyword search)
  Scientific literature searches
72. Information Extraction
Information extraction is a subfield of text mining that attempts to move
text mining onto an equal footing with the structured world of data
mining.
The objective is to take an unstructured document and automatically fill
in the values of a spreadsheet.
(Figure: extracting information from the document)
73. Prediction and Evaluation
Central to prediction is the measurement of error. For topic assignment, we can determine whether a program’s answer is right or wrong.
The classical measures of accuracy are applicable, but not all errors are evaluated equally.
That is why measures of accuracy such as “recall” and “precision” are especially important to document analysis.
74. Performance Measure
The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure.
The quality of the retrieved collection can be assessed by the two following measures:
Precision: percentage of retrieved documents that are in fact relevant to the
query (i.e., “correct” responses)
Recall: percentage of documents that are relevant to the query and were, in
fact, retrieved
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall    = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
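In Python, with the retrieved and relevant sets given as sets of document IDs:

def precision(retrieved, relevant):
    return len(retrieved & relevant) / len(retrieved)

def recall(retrieved, relevant):
    return len(retrieved & relevant) / len(relevant)

retrieved, relevant = {1, 2, 3, 4}, {2, 4, 5}
print(precision(retrieved, relevant))   # 2/4 = 0.5
print(recall(retrieved, relevant))      # 2/3 = 0.667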
75. From Textual Information to Numerical
Vectors: Introduction
To mine text, we need to process it into a form that data mining procedures can use. As noted earlier, this involves generating features in a spreadsheet format.
Classical data mining looks at highly structured data; the spreadsheet model is the embodiment of a representation that is supportive of predictive modeling.
Predictive text mining is simpler and more restrictive than open-ended data mining.
76. From Textual Information to Numerical
Vectors: Introduction
Text is unstructured – very far from the spreadsheet model that we need in order to process data for prediction. The transformation of data to the spreadsheet model is a methodical and carefully organized procedure for filling in the cells of the spreadsheet.
We have to determine the nature of each column in the spreadsheet. Some features are easy to obtain (e.g., the words in a text); some are difficult (e.g., the grammatical function of a word in a sentence). The following sections describe the kinds of features generated from text.
77. Collecting Documents
The first step of text mining is collecting the data.
A web page retrieval application for an intranet implicitly specifies the relevant documents to be the web pages on the intranet.
If the documents are identified, then they can be obtained; the main issue is to cleanse the samples and ensure high quality.
For a web application comprising a number of autonomous websites, one may deploy a software tool such as a web crawler to collect the documents.
78. Collecting Documents
Other applications attach a logging process to an input data stream for a length of time (e.g., an email audit logs the incoming and outgoing messages at a mail server for a period of time).
For R&D work in text mining, we need generic data – a corpus. A widely used example is the Reuters corpus (RCV1).
In the early days (the 1960s and 1970s), one million words was considered a large collection; the Brown corpus consists of 500 samples of about 2,000 words each of American English text.
79. Collecting Documents
A European corpus was modeled on the Brown corpus, for British English.
In the 1970s and 80s, more resources became available, often government sponsored. Some widely used corpora include the Penn Treebank (a collection of manually parsed sentences from the Wall Street Journal).
Another resource is the World Wide Web: web crawlers can build collections of pages from a particular site, such as Yahoo. Given the size of the web, such collections require cleaning before use.
80. Document Standardization
When documents are collected, they may arrive in different formats: some documents may be collected in a word-processor format, others as simple ASCII text. To process these documents, we have to convert them to a standard format.
The standard format is XML, the Extensible Markup Language.
81. Document Standardization-XML
XML is a standard way to insert tags into text to identify its parts; each document is marked off from the corpus through XML tags such as:
<Date>
<Subject>
<Topic>
<Text>
<Body>
<Header>
82. XML – An Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Diya</to>
<from>Surya</from>
<heading>Reminder</heading>
<body>Happy Birth Day</body>
</note>
83. XML
The main reason to identify the parts is to allow selection of those parts that will be used to generate features. The selected parts of a document are concatenated into strings, separated by tags.
Document Standardization
The advantage of data standardization is that mining tools can be applied without having to consider the pedigree of each document.
84. Tokenization
With the document collection in XML format, we can examine the data.
Tokenization breaks the character stream into words – TOKENS.
Each token is an instance of a type, so the number of tokens is higher than the number of types: if “the” occurs twice in a sentence, those are 2 tokens, each referring to an occurrence of one type.
Space and tab characters are not tokens but white space.
A comma or colon between letters is a token (e.g., USA,INDIA); between numbers it is a delimiter (e.g., 121,135).
An apostrophe has a number of uses – delimiter or part of a token (e.g., D’Angelo); when it is followed by a terminator, it is an internal quote (Tess’.).
85. Tokenization – Pseudocode
A dash is a terminator unless the token is preceded or followed by another dash (e.g., 522-3333).
Without identifying tokens, it is difficult to imagine extracting higher-level information from a document.
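Since the pseudocode itself did not survive, here is a hedged regular-expression sketch of the rules above (the exact rule set is an assumption; real tokenizers use more elaborate rules):

import re

TOKEN = re.compile(r"""
    \d+(?:-\d+)*                 # digit runs, dash-joined: 522-3333
  | [A-Za-z]+(?:'[A-Za-z]+)*     # words with internal apostrophes: D'Angelo
  | [^\sA-Za-z0-9]               # remaining punctuation as single tokens
""", re.VERBOSE)

def tokenize(text):
    # A comma between digits is a mere delimiter: 121,135 -> 121 135
    text = re.sub(r"(?<=\d),(?=\d)", " ", text)
    return TOKEN.findall(text)

print(tokenize("D'Angelo called 522-3333 in the USA,INDIA; totals were 121,135."))
# ["D'Angelo", 'called', '522-3333', 'in', 'the', 'USA', ',', 'INDIA',
#  ';', 'totals', 'were', '121', '135', '.']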
86. Lemmatization
Once a character stream has been segmented into a sequence of tokens, the next step is to convert each token to a standard form – stemming or lemmatization (whether to do so is application dependent).
Stemming reduces the number of distinct types in a corpus and increases the frequency of occurrence of individual types.
English speakers agree that the nouns Book and Books are 2 forms of the same word, and it is often advantageous to eliminate this kind of variation.
Normalization that regularizes such grammatical variants is called inflectional stemming.
87. Stemming to a Root
Grammatical variants (singular/plural, present/past): it is often advantageous to eliminate this kind of variation before further processing.
When normalization is confined to regular grammatical variants such as singular/plural and present/past, the process is called inflectional stemming.
The intent of stemming to a root is to reach a root form with no inflectional or derivational prefixes or suffixes; the end result of this aggressive stemming is to reduce the number of types in the text.
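A small illustration of the difference using NLTK (assumes nltk is installed and, for the lemmatizer, that the wordnet data has been downloaded; any comparable stemmer would do):

from nltk.stem import PorterStemmer, WordNetLemmatizer

stemmer, lemmatizer = PorterStemmer(), WordNetLemmatizer()
for word in ["books", "running", "studies"]:
    print(word, "->", stemmer.stem(word), "/", lemmatizer.lemmatize(word))
# books   -> book  / book
# running -> run   / running   (the lemmatizer defaults to noun readings)
# studies -> studi / study     (aggressive stemming can overshoot the root)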
89. Vector Generation for prediction
Consider the problem of categorizing documents. The characteristic features are the tokens or words the documents contain. Without deep analysis, we can choose to describe each document by features that represent its most frequent tokens.
The collection of features is called a dictionary. The tokens or words in the dictionary form the basis for creating a spreadsheet of numeric data corresponding to the document collection: each row is a document, and each column is a feature.
90. Vector Generation for prediction
Each cell in the spreadsheet is a measurement of a feature for a document. In the basic model of data, we simply check the presence or absence of words. Checking for words is efficient because, rather than checking each word in the dictionary, we build a hash table. Large samples of digital documents are readily available, giving confidence about the variations and combinations of words that occur.
If prediction is our goal, then we need one more column for the correct answer. In preparing data for learning, this information is available from the document labels. (Labels are binary answers, also called the class.)
Instead of generating a global dictionary, we can consider only the words in the class that we are trying to predict. If this class is far smaller than the negative class – which is typical – the local dictionary is far smaller than the global dictionary.
91. Another reduction in dictionary size is to compile a list of stop-words and remove them from the dictionary. Stop-words almost never have any predictive capability, e.g., articles such as “a” and “the” and pronouns such as “it” and “they.”
Frequency information on the word counts can be quite useful in reducing the dictionary size and improving predictive performance: the most frequent words are often stop-words and can be deleted.
An alternative approach to local dictionary generation is to generate a global dictionary from all documents in the collection. Special feature selection routines then attempt to select the subset of words that has the greatest potential for prediction (independent selection methods).
If we have 100 topics to categorize, then we have 100 problems to solve; our choices are 100 small dictionaries or 1 global dictionary.
92. The vectors implied by the spreadsheet model can be regenerated to correspond to the smaller dictionary.
Instead of placing every variant of a word in the dictionary, we can follow the practice of a printed dictionary and avoid storing every variation of a word (no singular/plural, past/present); verbs are stored in stemmed form. This adds a layer of complexity to text processing, but performance is gained and size is reduced.
A universal procedure that trims words to their root form can, however, blur differences in meaning: exit and exiting have different meanings in the context of programming.
A small dictionary can capture the best words easily. The use of tokens and stemming are examples of procedures that help produce smaller dictionaries, improving the manageability of learning and accuracy. In this way, a document can be converted to a spreadsheet.
93. Each column is a feature; each row is a document. Our model of data for predictive text mining is a spreadsheet populated by ones and zeros, whose cells represent the presence or absence of dictionary words in the document collection.
For higher accuracy, additional transformations can be applied:
  Word pairs and collocations
  Frequency
  Tf-idf
Word pairs and collocations serve to increase the size of the dictionary and can improve the performance of prediction.
Instead of 0’s and 1’s in the cells, the frequency of a word can be used (if the word “the” occurs 10 times, the count of “the” is used). Counts can give better results than binary values in the cells; the additional frequency information can yield simpler, more compact solutions than the binary data model.
94. Frequencies are helpful in prediction but add complexity to solutions. A compromise that works well is a three-value system, 0/1/2:
  Word did not occur – 0
  Word occurred once – 1
  Word occurred 2 or more times – 2
This captures much of the added value of frequency without adding much complexity to the model.
Another variant is zeroing the values below a threshold, requiring a token to reach a minimum frequency before being considered of any use; this reduces the complexity of the spreadsheet used in the data mining algorithms. Other methods to reduce complexity are chi-square, mutual information, odds ratio, etc.
The next step beyond counting frequency is to modify the count by the perceived importance of each word.
95. Tf-idf computes weightings or scores for words: positive values that capture more than the absence or presence of the words.
In Eq. (a), the weight assigned to word j is its term frequency modified by a scale factor for the importance of the word. The scale factor is the inverse document frequency (Eq. (b)), which simply checks the number of documents containing the word, df(j), and reverses the scaling:
tf-idf(j) = tf(j) * idf(j)      (a)
idf(j) = log(N / df(j))         (b)
When a word appears in many documents, its scale factor is lowered, perhaps to zero; if a word is unique, appearing in few documents, the scale factor zooms upward, and the word appears important.
Alternatives to this tf-idf formulation exist, but the motivation is the same. The result is a positive score that replaces the simple frequency or binary (T/F) entry in the cell of our spreadsheet.
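A direct sketch of Eqs. (a) and (b) over a toy collection (documents as token lists; natural log used here, since the base is left unspecified above):

from math import log

def tf_idf(docs):
    N, df = len(docs), {}
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    # tf-idf(j) = tf(j) * idf(j), with idf(j) = log(N / df(j))
    return [{w: doc.count(w) * log(N / df[w]) for w in set(doc)} for doc in docs]

docs = [["oil", "price", "oil"], ["price", "rise"], ["gold", "price"]]
for vec in tf_idf(docs):
    print(vec)
# "price" appears in every document, so idf = log(3/3) = 0 and its weight
# drops to zero, while rarer words such as "oil" score higher.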
96. Another variant is to weight the tokens from different parts of the document differently.
Which data transformation method is BEST? There is no universal answer: the best predictive accuracy depends on matching these methods to the data, and the best variation for one method may not be the best for another. Test them ALL.
We describe data as populating a spreadsheet, but most cells are 0 – each document contains only a small subset of the dictionary words. In text classification, a corpus may have thousands of words, while each individual document has few unique tokens, so most of the spreadsheet row for that document is 0. Rather than store all the 0’s, it is better to represent the spreadsheet as a set of sparse vectors (each row is a list of pairs, where one element of the pair is the column and the other is the corresponding nonzero value). By not storing the zeros, we save a great deal of memory.
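A sketch of this sparse row representation as (column, nonzero value) pairs:

def to_sparse(row):
    return [(j, v) for j, v in enumerate(row) if v != 0]

def to_dense(pairs, ncols):
    row = [0] * ncols
    for j, v in pairs:
        row[j] = v
    return row

row = [0, 0, 1, 0, 2, 0, 0, 1]            # one spreadsheet row (document)
sparse = to_sparse(row)                   # [(2, 1), (4, 2), (7, 1)]
print(sparse, to_dense(sparse, len(row)) == row)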
97. Multi Word Features
So far, features are associated with single words (tokens delimited by white space).
This simple scenario can be extended to include pairs of words, e.g., “bon” and “vivant”: instead of separating them, we can treat “bon vivant” as a single feature.
Why stop at pairs? Why not consider multiword features? Unlike word pairs, the words need not be consecutive. E.g., with “Don Smith” as a feature, we can ignore his middle name, Leroy, which may reappear in some references to the person. In this case we have to accommodate the many references to a noun that involve a number of adjectives, with the desired adjective not adjacent to the noun: e.g., we want to accept the phrase “broken and dirty vase” as an instance of “broken vase”.
98. A multiword feature is x words occurring within a maximum window of size y (y >= x, naturally).
How are such features extracted from text? If we use frequency methods, we look for combinations of words that are relatively frequent; a straightforward implementation considers simple combinations of x words in a window of size y. Measuring the value of a potential multiword feature is done by computing the correlation between the words it contains; measures based on mutual information or the likelihood ratio are used.
An algorithm can generate multiword features, but a straightforward implementation consumes a lot of memory. Multiword features are not found too often in a document collection, but they are highly predictive.
One association measure for a candidate multiword feature T:
AM(T) = size(T) * freq(T) * log10(freq(T)) / Σ_{word_i ∈ T} freq(word_i)
100. Labels for Right Answers:
For prediction, an extra column is added to the spreadsheet. This last column contains the labels and looks no different from the others: it is a 0 or 1 indicating the right answer as either true or false. In the sparse vector format, labels are appended to each vector separately as either a one (positive class) or a zero (negative class).
Feature Selection by Attribute Ranking:
In addition to frequency-based approaches, feature selection can be done in a number of ways. One approach is to select a set of features for each category, forming a local dictionary for the category, by independently ranking the feature attributes according to their predictive abilities for the category under consideration.
The predictive ability of an attribute can be measured by a quantity expressing how it is correlated with the label. Assume n documents; let x_i denote the presence or absence of attribute j in document i, and let y_i denote the label of document i (the last column).
101. A commonly used ranking score is the information gain criterion, defined below.
The quantity L(j) is the number of bits required to encode the label and the attribute j minus the number of bits required to encode the attribute. The quantities needed to compute L(j) can be easily estimated using the estimators that follow.
IG(j) = L − L(j)
L = Σ_{c=0,1} Pr(y = c) · log2(1 / Pr(y = c))
L(j) = Σ_{v=0,1} Pr(x_j = v) · Σ_{c=0,1} Pr(y = c | x_j = v) · log2(1 / Pr(y = c | x_j = v))
Estimators:
pr(x_j = v) = freq(x_j = v) / n
pr(y = c | x_j = v) = (freq(x_j = v, label = c) + 1) / (freq(x_j = v) + 2)
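A sketch of this score for one binary attribute j, using the smoothed estimators (the label column y and attribute column xj as 0/1 lists of length n):

from math import log2

def info_gain(xj, y):
    n = len(y)
    def H(ps):   # sum of p * log2(1/p), skipping zero probabilities
        return sum(p * log2(1 / p) for p in ps if p > 0)
    p1 = sum(y) / n
    L = H([p1, 1 - p1])                    # bits to encode the label alone
    Lj = 0.0
    for v in (0, 1):
        nv = sum(1 for x in xj if x == v)
        pv = nv / n                        # pr(x_j = v)
        n1 = sum(1 for x, c in zip(xj, y) if x == v and c == 1)
        q1 = (n1 + 1) / (nv + 2)           # smoothed pr(y = 1 | x_j = v)
        Lj += pv * H([q1, 1 - q1])
    return L - Lj

print(info_gain([1, 1, 0, 0], [1, 1, 0, 0]))  # ~0.19: predictive attribute
print(info_gain([1, 0, 1, 0], [1, 1, 0, 0]))  # 0.0: uninformative attribute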
102. Sentence Boundary Determination
If the XML markup for a corpus doesn’t mark sentence boundaries, it is necessary to mark the sentences ourselves. In particular, it is necessary to determine when a period is part of a token and when it is not.
More sophisticated linguistic parsing algorithms often require a complete sentence as input, and extraction algorithms operate on text one sentence at a time; for these algorithms to work well, sentences must be identified clearly.
Sentence boundary determination is the problem of deciding which instances of a period followed by white space are sentence delimiters and which are not (characters such as ? and ! are assumed to always end sentences); it is thus a classification problem. With accuracy measurement and adjustment, such an algorithm’s performance can be improved.
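A naive sketch of such a classifier: every period followed by white space is a boundary unless it ends a known abbreviation (the abbreviation list and rules here are assumptions for illustration; real systems learn these decisions):

import re

ABBREVIATIONS = {"dr", "mr", "mrs", "e.g", "i.e", "etc"}   # toy list

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+", text):
        words = text[start:m.start()].split()
        token = words[-1].rstrip(".").lower() if words else ""
        if text[m.start()] == "." and token in ABBREVIATIONS:
            continue            # this period is part of a token, not a delimiter
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He was late! Was it raining?"))
# ['Dr. Smith arrived.', 'He was late!', 'Was it raining?']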