Part-I
TEXT DATA MINING
Definition : Text Mining
 Text mining refers generally to the process of extracting interesting
information and knowledge from unstructured text.
 Text Mining can be defined as a knowledge-intensive process in which
a user interacts with a document collection over time by using a suite of
analysis tools.
And
 Text Mining seeks to extract useful information from data sources
(document collections) through the identification and exploration of
interesting patterns.
Text Mining
 Text mining (TM) seeks to extract useful information from a collection of
documents.
 It is similar to data mining (DM), but the data sources are unstructured or
semi-structured documents.
 The TM methods involve :
- Basic pre-processing / TM operations, such as identification /
extraction of representative features (this can be done in several phases)
- Advanced text mining operations, involving identification of
complex patterns (e.g. relationships between previously identified concepts)
 TM exploits techniques / methodologies from
data mining, machine learning, information retrieval, corpus-based
computational linguistics
Similarity and difference between
Data and Text Mining
 Both types of systems rely on:
 Preprocessing routines
 Pattern-discovery algorithms
 Presentation-layer elements such as visualization tools
 Pre-processing Operations:
 In data mining, data are assumed to be stored in a structured format,
so preprocessing focuses on scrubbing and normalizing data and on
creating extensive numbers of table joins
 In Text Mining preprocessing operations center on
 Identification & Extraction of representative features for NL documents,
to transform unstructured data stored in document collections into a more
explicitly structured intermediate format
Knowledge-discovery activities by data type and goal:

                   Finding Patterns            Finding “Nuggets”
                                               (Novel)                     (Non-Novel)
Non-textual data   General Data Mining         Exploratory Data Analysis   Database Queries
Textual data       Computational Linguistics   Text Mining                 Information Retrieval
Text Mining Pipeline
Unstructured Text (implicit knowledge) → Structured content (explicit knowledge)
Text Mining Process
 Text preprocessing
  - Syntactic/semantic text analysis
 Feature generation
  - Bag of words
 Feature selection
  - Simple counting
  - Statistics
 Text/data mining
  - Classification (supervised learning)
  - Clustering (unsupervised learning)
 Analyzing results
  - Mapping/visualization
  - Result interpretation
Text Mining Tasks
TM
 Text Analysis Tools
  - Feature extraction
    - Name extraction
    - Term extraction
    - Abbreviation extraction
    - Relationship extraction
  - Categorization
  - Summarization
  - Clustering
    - Hierarchical clustering
    - Binary relational clustering
 Web Searching Tools
  - Text search engine
  - Net Question Solution
  - Web Crawler
Handling Text Data
 Modeling semi-structured data
 Information Retrieval (IR) from unstructured documents
• Locates relevant documents and ranks them
• Keyword based (Boolean matching)
• Similarity based
 Text mining
• Classify documents
• Cluster documents
• Find patterns or trends across documents
Challenges in Text Mining
 Data collection is “free text” and is not well-organized (semi-structured or
unstructured)
 No uniform access over all sources, each source has separate storage
and algebra, examples: email, databases, applications, web
 A quintuple heterogeneity: semantic, linguistic, structure, format, size
of unit information
 Learning techniques for processing text typically need annotated
training data
 XML as the common model allows:
 Manipulating data with standards
 Mining to become more like ordinary data mining
 RDF emerging as a complementary model
 The more structure you can explore the better you can do mining
Types of Text Data Mining
 Keyword-based association analysis
 Automatic document classification
 Similarity detection
 Cluster documents by a common author
 Cluster documents containing information from a common source
 Link analysis: unusual correlation between entities
 Sequence analysis: predicting a recurring event
 Anomaly detection: find information that violates usual patterns
 Hypertext analysis
 Patterns in anchors/links
 Anchor text correlations with linked objects
Documents and Document Collections
 A document collection is a grouping of text-based documents.
 It can be either static or dynamic (growing over time).
 A document is a unit of discrete textual data within a collection,
usually representing some real-world document, such as a business
report, memorandum, email, research paper, or news story.
 A document can be a member of different document collections (e.g.
legal affairs and computing equipment, if it falls under both).
Document Structure
Text documents can be :
 unstructured
i.e. free-style text (but from a linguistic perspective they are
really structured objects)
 weakly structured
Adhering to some pre-specified format, like most scientific
papers, business reports, legal memoranda, news stories etc.
 semi-structured
Exploiting heavy document templating or style sheets.
Weakly structured and Semi structured
Documents
 Documents that have relatively little in the way of strong
typographical, layout, or markup indicators to denote structure are
referred to as free-format or weakly structured documents (such as most
scientific research papers, business reports, and news stories).
 Documents with extensive and consistent format elements, in which
field-type metadata can be more easily inferred, are described as
semi-structured documents (such as some e-mail, HTML web pages, and PDF
files).
Document Representation and Features
 Irregular and implicitly structured representation is transformed into an
explicitly structured representation.
 We can distinguish:
- Feature based representation,
- Relational representation.
 In a feature-based representation, documents are represented by a set
of features.
Document Features
Although many potential features can be employed to represent docs, the
following four types are most commonly used
 Characters
 Words
 Terms
 Concepts
High Feature Dimensionality ( HFD)
 Problems relating to HFD are typically of much greater magnitude in
TM systems than in classic DM systems.
Feature Sparsity
 Only a small percentage of all possible features for a document
collection as a whole appears in any single document.
Representational Model of a Document
 An essential task for most text mining systems is the identification of
a simplified subset of document features that can be used to represent a
particular document as a whole.
 We refer to such a set of features as the representational model of a
document.
 Commonly Used Document Features:
 Characters,
 Words,
 Terms, and
 Concepts
Character level Representation
 Without Positional Information
Are often of very limited utility in TM applications
 With Positional Information
Are somewhat more useful and common (e.g. bigrams or trigrams)
 Disadvantage:
Character-based representations can often be unwieldy for some types of text
processing techniques because the feature space for a document is fairly
unoptimized.
Word-level Representation
 Without Positional Information
Are often of very limited utility in TM applications
 With Positional Information
Are somewhat more useful and common (e.g. bigrams or trigrams)
 Disadvantage:
Word-based representations can often be unwieldy for some types of text
processing techniques because the feature space for a document is fairly
unoptimized.
Term-level Representation
 Normalized terms come out of a term-extraction methodology:
sequences of one or more tokenized and lemmatized word forms
associated with their part-of-speech tags.
Concept-level Representation
 Concepts are features generated for a document by means of manual,
statistical, rule-based, or hybrid categorization methodology.
General Architecture of Text Mining Systems
Abstract Level
A text mining system takes raw documents as input and generates various types
of output, such as:
 Patterns
 Maps of connections
 Trends
(Diagram: input documents → text mining system → output patterns, connections, trends)
General Architecture of Text Mining Systems
Functional Level
TM systems follow the general model provided by some classic DM
applications and are thus divisible into 4 main areas
• Preprocessing Tasks
• Core mining operations
• Presentation layer components and browsing functionality
• Refinement techniques
System Architecture for
Generic Text Mining System
System Architecture for
Domain-oriented Text Mining System
System Architecture for an Advanced Text Mining
System with background knowledge base
Core Text Mining Operations
 Core mining operations in text mining systems are the algorithms that
support the creation of queries for discovering patterns in document
collections.
 Core Text Mining Operations
• Distributions
• Frequent and Near Frequent Sets
• Associations
• Isolating Interesting Patterns
• Analyzing Document Collections over Time
 Using Background Knowledge for Text Mining
 Text Mining Query Languages
Core Text Mining Operations
 Core text mining operations consist of various mechanisms for
discovering patterns of concepts within a document collection.
 The three types of patterns in text mining
 Distributions (and proportions)
 Frequent and near frequent sets
 Associations
 Symbols
 D : a collection of documents
 K : a set of concepts
 k : a concept
Distributions
 Definition 1. Concept Selection
 Selecting some sub collection of documents that is labeled by one or
more given concepts
 D/K
 Subset of documents in D labeled with all of the concepts in K
 Definition 2. Concept Proportion
 The proportion of a set of documents labeled with a particular
concept
 f(D , K) = |D/K| / |D|
 The fraction of documents in D labeled with all of the concepts in
K
Distributions
 Definition 3. Conditional Concept Proportion
 The proportion of a set of documents labeled with a concept that are
also labeled with another concept
 f(D , K1|K2) = f(D/K2 , K1)
 The proportion of all those documents in D labeled with K2 that
are also labeled with K1
 Definition 4. Concept Proportion Distribution
 The proportion of documents in some collection that are labeled with
each of a number of selected concepts
 FK(D , x)
 The proportion of documents in D labeled with x for any x in K
Distributions
 Definition 5. Conditional Concept Proportion Distribution
 The proportion of those documents in D labeled with all the concepts
in K’ that are also labeled with concept x (with x in K)
 FK(D, x | K’) = FK(D/K’, x)
 Definition 6. Average Concept Proportion
 Given a collection of documents D, a concept k, and an internal node
in the hierarchy n, an average concept proportion is the average
value of f(D,k | k’), where k’ ranges over all immediate children of n.
 a(D,k | n) = Avg {k’ is a child of n} {f(D,k | k’)}
Distributions
 Definition 7. Average Concept Distribution
 Given a collection of documents D and two internal nodes in the
hierarchy n and n’, average concept distribution is the distribution
that, for any x that is a child of n, averages x’s proportions over all
children of n’
 An(D,x | n’ ) = Avg {k’ is a child of n’} {Fn(D,x | k’)}
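To make these proportion definitions concrete, here is a minimal Python sketch; the collection, concept labels, and helper names are illustrative, not from the slides.

# A document collection D modeled as a list of concept-label sets.
D = [
    {"merger", "banking"},
    {"merger", "technology"},
    {"banking"},
    {"merger", "banking", "europe"},
]

def select(D, K):
    """D/K: the subset of documents in D labeled with all concepts in K."""
    return [d for d in D if K <= d]

def proportion(D, K):
    """f(D, K) = |D/K| / |D|"""
    return len(select(D, K)) / len(D)

def cond_proportion(D, K1, K2):
    """f(D, K1 | K2) = f(D/K2, K1)"""
    return proportion(select(D, K2), K1)

print(proportion(D, {"merger"}))                    # 0.75
print(cond_proportion(D, {"banking"}, {"merger"}))  # 0.666...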
Frequent and Near Frequent Sets
 Frequent Concept Sets
 A set of concepts represented in the document collection with co-
occurrences at or above a minimal support level (given as a threshold
parameter s; i.e., all the concepts of the frequent concept set appear
together in at least s documents)
Support
 The number (or percent) of documents containing the given rule –
that is, the co-occurrence frequency
Confidence
 The percentage of the time that the rule is true
Frequent and Near Frequent Sets
Algorithm 1 : The Apriori Algorithm (Agrawal and Srikant 1994)
 Discovery methods for frequent concept sets in text mining.
Frequent and Near Frequent Sets
Algorithm for Frequent Set Generation
Frequent sets are generated in relation to some support level. Because
support (i.e., the frequency of co-occurrence) is by convention often
expressed as the variable σ, frequent sets are sometimes also referred to as
σ-covers or σ-cover sets.
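A minimal Apriori-style sketch of frequent concept set generation, assuming documents are given as sets of concept labels; the function and variable names are illustrative.

from itertools import combinations

def frequent_concept_sets(docs, sigma):
    """Level-wise (Apriori-style) generation of sigma-cover sets.
    docs: list of sets of concept labels; sigma: minimum number of
    documents in which all concepts of a set must co-occur."""
    support = lambda K: sum(1 for d in docs if K <= d)
    items = {c for d in docs for c in d}
    # Level 1: frequent single concepts.
    level = [frozenset([c]) for c in items if support({c}) >= sigma]
    frequent = list(level)
    k = 2
    while level:
        # Candidate k-sets are unions of frequent (k-1)-sets.
        candidates = {a | b for a, b in combinations(level, 2)
                      if len(a | b) == k}
        level = [K for K in candidates if support(K) >= sigma]
        frequent.extend(level)
        k += 1
    return frequent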
Frequent and Near Frequent Sets
 Near Frequent Concept Sets
 An undirected relation between two frequent sets of concepts
 This relation can be quantified by measuring the degree of
overlapping, for example, on the basis of the number of documents
that include all the concepts of the two concept sets.
Associations
Associations
 Directed relations between concepts or sets of concepts
Associations Rule
 An expression of the from A => B, where A and B are sets of
features
 An association rule A => B indicates that transactions that involve
A tend also to involve B.
 A is the left-hand side (LHS)
 B is the right-hand side (RHS)
Confidence of Association Rule A => B (A, B : frequent concept sets)
 The percentage of documents that include all the concepts in B within
the subset of those documents that include all the concepts in A
Support of Association Rule A => B (A, B : frequent concept sets)
 The percentage of documents that include all the concepts in A and B
Associations
Discovering Association Rules
 The Problem of finding all the association rules with a confidence
and support greater than the user-identified values minconf (the
minimum confidence level) and minsup (the minimum support level)
thresholds
Two step of discovering associations
 Find all frequent concept sets X (i.e., all combinations of concepts
with a support greater than minsup).
 Test whether X-B => B holds with the required confidence
 X = {w,x,y,z}, B = {y,z} , X-B = {w,x}
 X-B => B
{w,x} => {y,z}
 Confidence of association rule {w,x} => {y,z}
confidence = support({w,x,y,z}) / support({w,x})
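In code, the two-step procedure reduces to support counting; a minimal sketch, with documents again modeled as sets of concepts (names illustrative):

def support(docs, K):
    """Number of documents containing all concepts in K."""
    return sum(1 for d in docs if K <= d)

def confidence(docs, A, B):
    """Confidence of A => B: support(A union B) / support(A)."""
    return support(docs, A | B) / support(docs, A)

# For X = {w, x, y, z} and B = {y, z}:
#   confidence(docs, {"w", "x"}, {"y", "z"})
#     = support({"w","x","y","z"}) / support({"w","x"})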
Associations
Maximal Associations (M-association)
 Relations between concepts in which associations are identified in terms
of their relevance to one concept and their lack of relevance to another
(e.g., concept X most often appears in association with concept Y).
Simple Algorithm for Generating Associations
(Rajman and Besancon 1998)
Associations
 Definition 8. Alone with Respect to Maximal Associations
 For a transaction t, a category gi, and a concept set X ⊆ gi, one would
say that X is alone in t if t ∩ gi = X,
i.e., X is alone in t if X is the largest subset of gi that is in t
 X is maximal in t …
 t M-supports X …
 For a document collection D, the M-support of X in D is the
number of transactions t ∈ D that M-support X.
Associations
 The M-support for the maximal association
 If D(X,g(Y)) is the subset of the document collection D consisting
of all the transactions that M-support X and contain at least one
element of g(Y), then the M-confidence of the rule
Isolating Interesting Patterns
Interestingness with Respect to Distributions and Proportions
 Measures for quantifying the distance between an investigated
distribution and another distribution
=> Sum-of-squares to measure the distance between two models:
D(P’ || P) = ∑x (p’(x) – p(x))²
Isolating Interesting Patterns
 Definition 9. Concept Distribution Distance
 Given two concept distributions P’K(x) and PK(x), the distance D(P’K
|| PK) between them:
D(P’K || PK) = ∑x (P’K(x) – PK(x))²
 Definition 10. Concept Proportion Distance
 The value of the difference between two distributions at a particular
point
d(P’K || PK) = P’K(x) – PK(x)
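Both quantities are direct to compute; a minimal sketch with distributions given as {concept: proportion} dictionaries (names illustrative):

def distribution_distance(P1, P0):
    """Sum-of-squares distance D(P' || P) between two concept
    proportion distributions."""
    concepts = set(P1) | set(P0)
    return sum((P1.get(x, 0.0) - P0.get(x, 0.0)) ** 2 for x in concepts)

def proportion_distance(P1, P0, x):
    """d(P' || P) evaluated at a particular concept x."""
    return P1.get(x, 0.0) - P0.get(x, 0.0)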
Analyzing Document Collections
over Time
Incremental Algorithms
 Algorithms processing truly dynamic document collections that add,
modify, or delete documents over time
Trend Analysis
 The term generally used to describe the analysis of concept
distribution behavior across multiple document subsets over time
 A two-phase process
First phase
 Phrases are created as frequent sequences of words using the
sequential patterns mining algorithms first mooted for mining
structured databases
Second phase
 A user can query the system to obtain all phrases whose trend
matches a specified pattern.
Analyzing Document Collections
over Time
Ephemeral Associations
 A direct or inverse relation between the probability distributions of
given topics (concepts) over a fixed time span
Direct Ephemeral Associations
 One very frequently occurring or “peak” topic during a period
seems to influence either the emergence or disappearance of other
topics
Inverse Ephemeral Associations
 Momentary negative influence between one topic and another
Deviation Detection
 The identification of anomalous instances that do not fit a defined
“standard case” in large amounts of data.
Analyzing Document Collections
over Time
Context Phrases and Context Relationships
 Definition 11. Context Phrase
 A subset of documents in a document collection that is either
labeled with all, or at least one, of the concepts in a specified set
of concepts.
 If D is a collection of documents and C is a set of concepts,
D/A(C) is the subset of documents in D labeled with all the
concepts in C, and D/O(C) is the subset of documents in D labeled
with at least one of the concepts in C. Both D/A(C) and D/O(C)
are referred to as context phrases.
Analyzing Document Collections
over Time
Context Phrases and Context Relationships
 Definition 12. Context Relationships
 The relationship within a set of concepts found in the document
collection in relation to a separately specified concept ( the
context or the context concept)
 If D is a collection of documents, c1 and c2 are individual concepts,
and P is a context phrase, R(D, c1, c2 | P) is the number of
documents in D/P that include both c1 and c2. Formally,
R(D, c1, c2 | P) = |D/A({c1, c2}) ∩ D/P|.
Analyzing Document Collections
over Time
The Context Graph
 Definition 13. Context Graph
 A graphic representation of the relationships between a set of
concepts as reflected in a corpus with respect to a given context.
 A context graph consists of a set of vertices (nodes) and edges.
 The vertices of the graph represent concepts.
 Weighted edges denote the affinity between the concepts.
 If D is a collection of documents, C is a set of concepts, and P is a
context phrase, the context graph of D, C, P is a weighted graph
G = (C, E), with nodes in C and a set of edges E = {{c1, c2} | R(D,
c1, c2 | P) > 0}. For each edge {c1, c2} ∈ E, one defines the weight
of the edge, w{c1,c2} = R(D, c1, c2 | P).
Analyzing Document Collections
over Time
 Example of a context graph in the context of P: three concept nodes
Concept1 (C1), Concept2 (C2), and Concept3 (C3), with weighted edges
R(D, c1, c2 | P) = 10 and R(D, c1, c3 | P) = 15.
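A minimal sketch that builds such a weighted edge set from a document collection; the context is supplied as a predicate selecting D/P, and all names are illustrative.

from itertools import combinations

def context_graph(docs, concepts, in_context):
    """Return {(c1, c2): weight} with weight = R(D, c1, c2 | P),
    keeping only edges with R(...) > 0."""
    scope = [d for d in docs if in_context(d)]
    edges = {}
    for c1, c2 in combinations(sorted(concepts), 2):
        w = sum(1 for d in scope if c1 in d and c2 in d)
        if w > 0:
            edges[(c1, c2)] = w
    return edges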
Analyzing Document Collections
over Time
 Definition 14. Temporal Selection (“Time Interval”)
 If D is a collection of documents and I is a time range, date range, or both, DI is the
subset of documents in D whose time stamp, date stamp, or both, is within I. The
resulting selection is sometimes referred to as the time interval.
 Definition 15. Temporal Context Relationship
 If D is a collection of documents, c1 and c2 are individual concepts, P is a context
phrase, and I is the time interval, then RI(D, c1, c2 | P) is the number of documents in DI
in which c1 and c2 co-occur in the context of P – that is, RI(D, c1, c2 | P) is the
number of documents in DI/P that include both c1 and c2.
 Definition 16. Temporal Context Graph
 If D is a collection of documents, C is a set of concepts, P is a context phrase, and I is
the time range, the temporal context graph of D, C, P, I is a weighted graph G = (C, EI)
with nodes in C and a set of edges EI, where EI = {{c1, c2} | RI(D, c1, c2 | P) > 0}. For
each edge {c1, c2} ∈ EI, one defines the weight of the edge by wI{c1,c2} = RI(D, c1, c2 | P).
Analyzing Document Collections
over Time
The Trend Graph
A representation that builds on the temporal context graph as
informed by the general approaches found in trend analysis
New Edges
 Edges that did not exist in the previous graph
Increased Edges
 Edges that have a relatively higher weight in relation to the
previous interval
Decreased Edges
 Edges that have a relatively lower weight than in the previous
interval.
Stable Edges
 Edges that have about the same weight as the corresponding edge
in the previous interval
Analyzing Document Collections
over Time
 The Borders Incremental Text Mining Algorithm
 The Borders algorithm can be used to update search pattern results
incrementally.
 Definition 17. Border Set
 X is a border set if it is not a frequent set, but every proper subset
Y ⊂ X is a frequent set.
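A direct, if brute-force, check of Definition 17; is_frequent is assumed to be backed by support counts such as those computed earlier (illustrative sketch).

from itertools import combinations

def is_border_set(X, is_frequent):
    """X is a border set iff X is not frequent but every proper
    subset of X is frequent."""
    if is_frequent(X):
        return False
    return all(is_frequent(frozenset(S))
               for r in range(len(X))
               for S in combinations(X, r))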
Analyzing Document Collections
over Time
The Borders Incremental Text Mining Algorithm
 Concept Set A = {A1, …, Am}
 Relations over A:
 Rold : old relation
 Rinc : increment
 Rnew : new combined relation
 s(X/R) : support of concept set X in the relation R
 s* : minimum support threshold (min_sup)
 Property 1: if X is a new frequent set in Rnew, then there is a subset Y ⊆ X
such that Y is a promoted border
 Property 2: if X is a new k-sized frequent set in Rnew, then each subset Y ⊂ X
of size k−1 is one of the following: (a) a promoted border, (b) a frequent set,
or (c) an old frequent set with additional support in Rinc.
Analyzing Document Collections
over Time
The Borders Incremental Text Mining Algorithm
 Stage 1: Finding Promoted Borders and Generating Candidates.
 Stage 2: Processing Candidates
Analyzing Document Collections
over Time
 The Borders Incremental Text Mining Algorithm
Text Mining Preprocessing Techniques
 Effective text mining operations are predicated on sophisticated data
preprocessing methodologies.
 Text mining is dependent on the various preprocessing techniques that
infer or extract structured representations from raw unstructured data
sources, or do both.
 Different preprocessing techniques are used to create structured document
representations from raw textual data. (structure documents – and, by
extension, document collections.)
 Two ways of categorizing the totality of preparatory document structuring
techniques - According to
 Their task and
 The algorithms and formal frameworks that they use.
Pre-processing Techniques
 Task Oriented Approaches
 General purpose NLP tasks
 Tokenization and zoning
 Part-of-speech Tagging and Stemming
 Shallow and deep syntactic parsing
 Problem Dependent task
 Text Categorization
 Information Extraction
Taxonomy of Text Pre-Processing Tasks
General Purpose NLP Tasks
 Tokenization
 Tokenization is the process of breaking a stream of text up into words,
phrases, symbols, or other meaningful elements called tokens.
 The list of tokens becomes input for further processing such as parsing
or text mining.
 Part-of-speech Tagging
 POS tagging is the annotation of words with the appropriate POS tags
based on the context in which they appear.
 POS tags divide words into categories based on the role they play in the
sentence in which they appear.
 POS tags provide information about the semantic content of a word.
 POS taggers at some stage of their processing perform morphological
analysis of words. An additional output of a POS tagger is a sequence of
stems (“lemmas”) of the input words.
 Syntactical parsing
 Syntactical parsing components perform a full syntactical analysis of sentences
according to a certain grammar theory. The basic division is between the
constituency and dependency grammars.
 Constituency grammars describe the syntactical structure of sentences in
terms of recursively built phrases – sequences of syntactically grouped
elements.
 Dependency grammars do not recognize the constituents as separate
linguistic units but focus instead on the direct relations between words.
 Shallow parsing
 Shallow parsing gains speed and robustness of processing by
sacrificing depth of analysis.
 Instead of providing a complete analysis (a parse) of a whole sentence, shallow
parsers produce only parts that are easy and unambiguous.
 For the purposes of information extraction, shallow parsing is usually sufficient
and preferable to full analysis because of its far greater speed and robustness.
Problem Dependent Task
 Text Categorization
 Text categorization (Text Classification) tasks tag each document with a
small number of concepts or keywords.
 The set of all possible concepts or keywords is usually manually
prepared, closed, and comparatively small. The hierarchy relation
between the keywords is also prepared manually.
 Information Extraction
 Information retrieval returns documents that match a given query but still
requires the user to read through these documents to locate the relevant
information.
 IE aims at pinpointing the relevant information and presenting it in a
structured format – typically a tabular format.
Types of Problems
 Text mining operates in very high dimensions; in many situations,
processing is nevertheless effective and efficient because of the sparseness
characteristic of most documents in most practical applications.
 The types of problems that can be solved with a text mining approach to
data representation and learning methods are:
Document Classification
Information Retrieval
Clustering and Organizing Documents
Information Extraction
Prediction and Evaluation
Document Classification
 Documents are organized into folders, one folder for each topic. A new
document is presented, and the objective is to place this document in the
appropriate folders.
 Document classification or document categorization is a problem in library
science, information science and computer science. The task is to assign a
document to one or more classes or categories.
 Document classification tasks can be divided into three kinds
 supervised document classification is performed by an external
mechanism, usually human feedback, which provides the necessary
information for the correct classification of documents
 semi-supervised document classification, a mixture between supervised
and unsupervised classification: some documents or parts of documents
are labeled by external assistance
 unsupervised document classification is entirely executed without
reference to external information
Classification Schema
Classification Techniques
 Decision Trees
 K-nearest neighbors
 Training examples are points in a vector space
 Compute distance between new instance and all training instances
and the k-closest vote for the class
 Naïve Bayes Classifier (see the sketch after this list)
 Classify using probabilities, assuming independence among
terms
 P(xi | C) is estimated as the relative frequency of examples having
value xi as a feature in class C
 P(C | Xi, Xj, Xk) ∝ P(C) · P(Xi | C) · P(Xj | C) · P(Xk | C)
 Neural networks, support vector machines,…
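A minimal naïve Bayes sketch in log space with add-one (Laplace) smoothing; it is illustrative, not a reference implementation, and all names are assumptions.

import math
from collections import Counter, defaultdict

def train_naive_bayes(docs, labels):
    """Estimate class priors P(C) and per-class token counts for
    P(xi | C) from (token-list, label) pairs."""
    prior = Counter(labels)
    cond = defaultdict(Counter)          # cond[c][token] = count
    for tokens, c in zip(docs, labels):
        cond[c].update(tokens)
    vocab = {t for counts in cond.values() for t in counts}
    return prior, cond, vocab

def classify(tokens, prior, cond, vocab):
    """argmax over C of P(C) * product of P(xi | C), in log space."""
    n = sum(prior.values())
    def score(c):
        total = sum(cond[c].values())
        s = math.log(prior[c] / n)
        for t in tokens:
            s += math.log((cond[c][t] + 1) / (total + len(vocab)))
        return s
    return max(prior, key=score)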
Information Retrieval
Given
 A source of textual documents
 A user query (text based), e.g., “spam”
Find
 A set (ranked) of documents that are relevant to the query
(Diagram: documents source + query → IR system → ranked documents)
Information Retrieval
 Information Retrieval (IR) is finding material (usually documents) of an
unstructured nature (usually text) that satisfies an information need (query)
from within large collections (usually stored on computers).
 Basic assumptions
Collection: Fixed set of documents
Goal: Retrieve documents with information that is relevant to user’s
information need and helps him complete a task
 Retrieving Matched document:
Information Retrieval
Basic Information Retrieval (IR) process
Browsing or Navigation system
 The user skims the document collection by jumping from one document to
another via hypertext or hypermedia links until a relevant document is
found
Classical IR system: Question Answering System
 Query: Question in Natural Language
 Answer: Directly extracted from text of document collection
Text Based Information Retrieval
 Information Item (document)
 Text format (written/spoken) or has textual description
 Information Need (query)
 Usually in text format
Classical IR System Process
Clustering and Organizing Documents:
Clustering
Given
 A source of textual
documents
 Similarity measure
(e.g., how many
words are common in
these documents)
Find
 Several clusters of
documents that are
relevant to each other
(Diagram: documents source + similarity measure → clustering system →
clusters of documents)
Clustering and Organizing Documents
The clustering process is equivalent to assigning the labels needed for
text categorization.
Because there are many ways to cluster documents, clustering is not quite as
powerful a process as assigning answers (i.e., known correct labels) to
documents.
Organizing documents into groups:
Information Extraction
Definition
 The automatic extraction of structured information from
unstructured documents.
 Information Extraction is the process of scanning text for relevant
information to some interest
 Extract:
Entities, Relations, Events
Overall Goals:
 Making information more accessible to people
 Making information more machine-processable
Information Extraction
Why IE?
 Need for efficient processing of texts in specialized domains
 Focus on relevant parts, ignore the rest
 Typical applications:
 Gleaning business, government, and military intelligence
 WWW searches (more specific than keywords)
 Scientific literature searches
Information Extraction
Information extraction is a subfield of text mining that attempts to move
text mining onto an equal footing with the structured world of data
mining.
The objective is to take an unstructured document and automatically fill
in the values of a spreadsheet.
Extracting Information from the document:
Prediction and Evaluation
 Key to evaluating prediction is the measurement of error. For topic
assignment, we can determine whether a program’s answer is right or wrong.
 The classical measures of accuracy will be applicable, but not all errors
will be evaluated equally.
 That’s why measures of accuracy such as “recall” and “precision” are
especially important to document analysis.
Performance Measure
 The set of retrieved documents can be formed by collecting the top ranking
documents according to a similarity measure
 The quality of a collection can be compared by the two following measures
 Precision: percentage of retrieved documents that are in fact relevant to the
query (i.e., “correct” responses)
 Recall: percentage of documents that are relevant to the query and were, in
fact, retrieved
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
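Both measures follow directly from set intersection; a minimal sketch over sets of document ids:

def precision_recall(retrieved, relevant):
    """Precision and recall of a retrieved set against a relevant set."""
    hits = len(set(retrieved) & set(relevant))
    precision = hits / len(retrieved) if retrieved else 0.0
    recall = hits / len(relevant) if relevant else 0.0
    return precision, recall

# precision_recall({1, 2, 3, 4}, {2, 4, 5}) -> (0.5, 0.666...)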
From Textual Information to Numerical
Vectors: Introduction
 To mine text, we need to process it into a form that data mining
procedures can use.
 As noted earlier, this involves generating features in a spreadsheet
format.
 Classical data mining looks at highly structured data.
 The spreadsheet model is the embodiment of a representation that is
supportive of predictive modeling.
 Predictive text mining is simpler and more restrictive than open-ended
data mining.
From Textual Information to Numerical
Vectors: Introduction
 Text is unstructured and very far from the spreadsheet model that we
need to process data for prediction.
 The transformation of data to the spreadsheet model is a methodical and
carefully organized procedure for filling in cells in the spreadsheet.
 We have to determine the nature of each column in the spreadsheet.
 Some features are easy to obtain, some are difficult
(the words in a text are easy; the grammatical function of a word in a
sentence is hard).
 Next we examine the kinds of features that can be generated from text.
Collecting Documents
 Text mining starts with collecting data.
 A web-page retrieval application for an intranet implicitly specifies the
relevant documents to be the web pages on the intranet.
 If the documents are identified, then they can be obtained;
the main issue is to cleanse the samples and ensure high quality.
 For a web application comprising a number of autonomous websites, one
may deploy a software tool such as a web crawler to collect the documents.
Collecting Documents
 Other applications have a logging process attached to an input data
stream for a length of time (e.g., an email audit logging the incoming and
outgoing messages at a mail server for a period of time).
 For text mining R&D work, we need generic data – a corpus.
 A widely used example is the Reuters corpus (RCV1).
 In the early days (1960s and 1970s), 1 million words was considered large;
the Brown corpus, for example, consists of 500 samples of about 2,000 words
each of American English text.
Collecting Documents
 A European corpus was modeled on the Brown corpus for British English.
 In the 1970s and ’80s, more resources became available, often
government sponsored.
 Some widely used corpora: the Penn Treebank (a collection of
manually parsed sentences from the Wall Street Journal).
 Another resource is the World Wide Web.
 Web crawlers can build collections of pages from a particular site such
as Yahoo.
 Given the size of the web, such collections require cleaning before use.
Document Standardization
 When documents are collected, they may come in different
formats.
 Some documents may be collected in Word format, others as simple
ASCII text. To process these documents we have to convert
them to a standard format.
 Standard Format –XML
 XML is Extensible Markup Language
Document Standardization-XML
 A standard way to insert tags into text to identify its parts.
 Each document is marked off from the corpus through XML.
 XML will have tags
 <Date>
 <Subject>
 <Topic>
 <Text>
 <Body>
 <Header>
XML – An Example
<?xml version="1.0" encoding="ISO-8859-1"?>
<note>
<to>Diya</to>
<from>Surya</from>
<heading>Reminder</heading>
<body>Happy Birth Day</body>
</note>
XML
 The main reason to identify the parts is to allow selection of those
parts that are used to generate features.
 The selected parts of a document are concatenated into strings,
separated by tags.
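A minimal sketch of such part selection, using Python's standard xml.etree.ElementTree on the note example above; the choice of which tags to keep is an assumption.

import xml.etree.ElementTree as ET

note = """<note>
  <to>Diya</to><from>Surya</from>
  <heading>Reminder</heading>
  <body>Happy Birth Day</body>
</note>"""

root = ET.fromstring(note)
# Keep only the parts used for feature generation and
# concatenate them into one string.
text = " ".join(root.findtext(tag, default="")
                for tag in ("heading", "body"))
print(text)   # "Reminder Happy Birth Day"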
Document Standardization
Advantage of data standardization is mining tools can be applied
without having to consider the pedigree of document.
Tokenization
 Once documents are collected in XML format, examine the data.
 Break the character stream into words – TOKENS.
 Each token is an instance of a type; the number of tokens is higher than
the number of types.
 If “the” occurs twice in a sentence, that is 2 tokens referring to
occurrences of one type.
 The space and tab characters are not tokens but white space.
 A comma or colon between word characters is a token (e.g., USA,INDIA);
between digits it acts as a delimiter (e.g., 121,135).
 An apostrophe has a number of uses (delimiter or part of a token),
e.g., D’Angelo.
 When followed by a terminator, it is an internal quote (e.g., Tess’.).
Tokenization –Pseudo code
 A dash is a delimiter unless it is preceded or followed by digits, as in
522-3333, where it is part of the token.
 Without identifying tokens, it is difficult to imagine extracting higher-
level information from a document.
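A simplified tokenizer sketch reflecting one plausible reading of these rules; it is not the slide's original pseudo-code, and treatments of a comma between digits vary (here it is kept inside the number token).

import re

# Numbers may contain internal commas/dashes (121,135 or 522-3333);
# words may contain internal apostrophes (D'Angelo); any other
# non-space punctuation character is a token of its own.
TOKEN = re.compile(r"\d+(?:[,\-]\d+)*|[A-Za-z]+(?:'[A-Za-z]+)*|[^\sA-Za-z\d]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("D'Angelo paid 121,135 at 522-3333, in the USA,INDIA."))
# ["D'Angelo", 'paid', '121,135', 'at', '522-3333', ',', 'in',
#  'the', 'USA', ',', 'INDIA', '.']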
Lemmatization
 Once a character stream has been segmented into a sequence of tokens,
 convert each token to a standard form – stemming or lemmatization
(application dependent).
 This reduces the number of distinct types in a corpus and increases
the frequency of occurrence of individual types.
 English speakers agree that the nouns “book” and “books” are 2 forms of
the same word – it is advantageous to eliminate this kind of variation.
 Normalization that regularizes grammatical variants is called
inflectional stemming.
Stemming to a Root
 Grammatical variants (singular/plural present/past)
 It is always advantageous to eliminate this kind of variation before further
processing
 When normalization is confined to regular grammatical variants such as
singular/plural and present/past, the process is called Inflectional stemming
 The intent of these stemmers is to reach a root form with no inflectional
or derivational prefixes or suffixes; the end result is aggressive stemming.
 Example: stemming reduces the number of types in a text.
Stemming Pseudo code
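The slide's original pseudo-code is not reproduced here; the following is a minimal inflectional-stemming sketch in the same spirit (real stemmers such as Porter's handle far more cases and exceptions).

def inflectional_stem(word):
    """Strip a few common inflectional suffixes; purely illustrative."""
    w = word.lower()
    for suffix, repl in (("ies", "y"), ("sses", "ss"), ("ing", ""),
                         ("ed", ""), ("s", "")):
        if w.endswith(suffix) and len(w) - len(suffix) >= 2:
            return w[: -len(suffix)] + repl
    return w

print([inflectional_stem(w) for w in ["books", "studies", "walked", "walking"]])
# ['book', 'study', 'walk', 'walk']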
Vector Generation for prediction
 Consider the problem of categorizing documents.
 The characteristic features are the tokens or words the documents contain.
 Without deep analysis, we can choose to describe each document by
features that represent the most frequent tokens.
 The collected set of features is called a dictionary.
 The tokens or words in the dictionary form the basis for creating a
spreadsheet of numeric data corresponding to the document collection.
 Each row is a document; each column is a feature.
Vector Generation for prediction
 Each cell in the spreadsheet is a measurement of a feature for a document.
 In the basic model of the data, we simply check the presence or absence of
words.
 Checking for words is efficient because we do not scan the whole dictionary
for each word; we build a hash table. Large samples of digital documents are
readily available, giving confidence about the variations and combinations of
words that occur.
 If prediction is our goal, then we need one more column for the correct
answer.
 In preparing data for learning, information is available from the document
labels (labels are binary answers, also called the class).
 Instead of generating a global dictionary, we consider only words in the
class that we are trying to predict.
 If this class is far smaller than the negative class, which is typical, the
local dictionary is far smaller than the global dictionary.
 Another reduction in dictionary size is to compile a list of stop-words and
remove them from the dictionary.
 Stop-words almost never have any predictive capability, such as the
articles “a” and “the” and pronouns such as “it” and “they”.
 Frequency information on word counts can be quite useful in reducing
the dictionary size and improving predictive performance.
 The most frequent words are often stop-words and can be deleted.
 An alternative approach to local dictionary generation is to generate a
global dictionary from all documents in the collection. Special feature
selection routines then attempt to select the subset of words that has the
greatest potential for prediction (independent selection methods).
 If we have 100 topics to categorize, then we have 100 problems to solve;
our choices are 100 small dictionaries or 1 global dictionary.
 The vectors implied by the spreadsheet model are then regenerated to
correspond to the smaller dictionary.
 Instead of placing every variation of a word in the dictionary, follow the
practice of a printed dictionary and avoid storing every variation of a word
(no singular/plural or past/present forms).
 Verbs are stored in stemmed form.
 This adds a layer of complexity to processing the text, but performance is
gained and dictionary size is reduced.
 A universal procedure is to trim words to their root form, which can
conflate differences in meaning (“exit”/“exiting” have different meanings in
the context of programming).
 A small dictionary can capture the best words easily.
 The use of tokens and stemming are examples of procedures that help with
smaller dictionaries and improve the manageability of learning and accuracy.
 A document can thus be converted to a spreadsheet:
 each column is a feature; each row is a document.
 The model of data for predictive text mining is a spreadsheet populated
by ones and zeros;
 the cells represent the presence or absence of dictionary words in a
document collection. For higher accuracy, additional transformations can
be applied:
 Word pairs and collocations
 Frequency
 Tf-idf
 Word pairs and collocations serve to increase the size of the dictionary
and improve the performance of prediction.
 Instead of 0’s and 1’s in the cells, the frequency of the word can be used
(if the word “the” occurs 10 times, the count 10 is used).
 Counts give better results than binary values in the cells.
 Binary data lead to compact solutions; the additional frequency
information can nevertheless yield simpler solutions.
 Frequencies are helpful in prediction but add complexity to solutions.
 A compromise that works well is a three-value system: 0/1/2.
 Word did not occur – 0
 Word occurred once – 1
 Word occurred 2 or more times – 2
 This captures much of the added value of frequency without adding much
complexity to the model.
 Another variant is zeroing values below a threshold, requiring a minimum
frequency before a token is considered of any use.
 This reduces the complexity of the spreadsheet used in data mining
algorithms.
 Other methods to reduce complexity are chi-square, mutual information,
odds ratio, etc.
 The next step beyond counting frequency is to modify the count by the
perceived importance of that word.
 Tf-idf: compute weightings or scores of words.
 The values are positive numbers so that we capture the absence or presence
of the words.
 In Eq. (a) we see that the weight assigned to word j is its term frequency
modified by a scale factor for the importance of the word. The scale factor
is the inverse document frequency (Eq. (b)), which simply checks the number
of documents containing the word, df(j), and reverses the scaling:
 tf-idf(j) = tf(j) · idf(j) — Eq. (a)
 idf(j) = log(N / df(j)) — Eq. (b)
 When a word appears in many documents, the scale factor is lowered, perhaps
to zero; if a word is unique and appears in few documents, the scale factor
zooms upward and the word appears important.
 Alternatives to this tf-idf formulation exist, but the motivation is the
same: the result is a positive score that replaces the simple frequency or
binary (true/false) entry in the spreadsheet cell.
 Another variant is to weight tokens from different parts of the document
differently.
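A minimal tf-idf sketch following Eqs. (a) and (b); the data and names are illustrative.

import math

def tf_idf(doc_term_counts):
    """tf-idf(j) = tf(j) * log(N / df(j)) per document.
    doc_term_counts: list of {term: raw count} dicts, one per document."""
    N = len(doc_term_counts)
    df = {}
    for counts in doc_term_counts:
        for term in counts:
            df[term] = df.get(term, 0) + 1
    return [{t: tf * math.log(N / df[t]) for t, tf in counts.items()}
            for counts in doc_term_counts]

docs = [{"merger": 3, "the": 10}, {"merger": 1, "bank": 2}, {"the": 7}]
print(tf_idf(docs)[0])
# Terms appearing in many documents get a small scale factor;
# a term unique to one document gets the largest one.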
 Which data transformation methods are best?
 There is no universal answer.
 The best predictive accuracy depends on matching these methods
to the data.
 The best variation for one problem may not be the best for another –
test them ALL.
 We described the data as populating a spreadsheet whose cells are mostly 0:
 each document uses only a small subset of the dictionary words.
 In text classification, a corpus may have thousands of dictionary words,
while an individual document has few unique tokens, so most of the
spreadsheet row for that document is 0. Rather than store all the 0’s, it is
better to represent the spreadsheet as a set of sparse vectors (each row is a
list of pairs; one element of a pair is the column and the other the
corresponding nonzero value). By not storing the zeros, we greatly reduce
memory use.
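A minimal sketch of the sparse-vector representation:

def to_sparse(row):
    """Represent a spreadsheet row as (column, value) pairs,
    storing only the nonzero cells."""
    return [(j, v) for j, v in enumerate(row) if v != 0]

print(to_sparse([0, 0, 3, 0, 1, 0]))   # [(2, 3), (4, 1)]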
Multi Word Features
 So far, features are associated with single words (tokens delimited by
white space).
 This simple scenario can be extended to include pairs of words, e.g., “bon”
and “vivant”: instead of separating them, we could treat “bonvivant” as a
single feature.
 Why stop at pairs? Why not consider multiword features?
 Unlike word pairs, the words need not be consecutive.
 E.g., with “Don Smith” as a feature, we can ignore his middle name “Leroy”
that may appear in some references to the person.
 We also have to accommodate references to a noun that involve a number of
adjectives, with the desired adjective not adjacent to the noun; e.g., we
want to accept the phrase “broken and dirty vase” as an instance of
“broken vase”.
 A multiword feature is x words occurring within a maximum window of size y
(y ≥ x, naturally).
 How are such features extracted from text? Specialized methods exist.
 If we use frequency methods, we look for combinations of words that are
relatively frequent.
 A straightforward implementation considers every combination of x words in
a window of size y.
 The value of a potential multiword feature is measured by the correlation
between its words, using measures such as mutual information or the
likelihood ratio.
 An algorithm for generating multiword features follows; a straightforward
implementation consumes a lot of memory.
 Multiword features are not found too often in a document collection, but
they are highly predictive.
 One association measure (AM) for a candidate multiword feature T is:
AM(T) = log10(size(T) · freq(T)²) / Σ (over wordi ∈ T) freq(wordi)
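A direct transcription of the reconstructed formula; the function and argument names are assumptions.

import math

def am_score(phrase_freq, word_freqs):
    """AM(T) = log10(size(T) * freq(T)^2) / sum of freq(wordi).
    phrase_freq: frequency of the multiword feature T;
    word_freqs: frequencies of its individual words."""
    size_t = len(word_freqs)
    return math.log10(size_t * phrase_freq ** 2) / sum(word_freqs)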
Labels for Right Answers:
 For prediction, an extra column is added to the spreadsheet.
 The last column contains the labels and looks no different from the others:
 it is a 0 or 1 indicating the right answer as either true or false.
 In the sparse vector format, labels are appended to each vector separately
as either a one (positive class) or a zero (negative class).
 Feature Selection by Attribute Ranking:
 In addition to frequency-based approaches, feature selection can be done
in a number of ways.
 Select a set of features for each category to form a local dictionary for
the category.
 Rank feature attributes independently according to their predictive
abilities for the category under consideration.
 The predictive ability of an attribute can be measured by a quantity
expressing how well it correlates with the label.
 Let n be the number of documents; let xj denote the presence (1) or
absence (0) of attribute j in a document, and y the label in the last column.
 A commonly used ranking score is the information gain criterion, given
below.
 The quantity L(j) is the number of bits required to encode the label and
the attribute j minus the number of bits required to encode the attribute
(i.e., the bits needed to encode the label given attribute j).
 The quantities needed to compute L(j) can be easily estimated using the
estimators below.
IG(j) = L − L(j)

L = Σ (over c ∈ labels) Pr(y = c) · log2(1 / Pr(y = c))

L(j) = Σ (over v ∈ {0, 1}) Pr(xj = v) · Σ (over c ∈ labels) Pr(y = c | xj = v) · log2(1 / Pr(y = c | xj = v))

The probabilities are estimated with smoothed frequency counts:

pr(xj = v) = (freq(xj = v) + 1) / (n + 2)

pr(y = c | xj = v) = (freq(xj = v, label = c) + 1) / (freq(xj = v) + 2)
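A minimal sketch computing IG(j) for one binary attribute with the smoothed estimators above; x_j and y are 0/1 lists of length n, and all names are illustrative.

import math

def information_gain(x_j, y, n):
    """IG(j) = L - L(j) for a binary attribute x_j and labels y."""
    def H(probs):   # expected bits: sum of p * log2(1/p)
        return sum(p * math.log2(1.0 / p) for p in probs if p > 0)

    labels = sorted(set(y))
    L = H([sum(1 for yi in y if yi == c) / n for c in labels])
    Lj = 0.0
    for v in (0, 1):
        fv = sum(1 for xi in x_j if xi == v)
        p_v = (fv + 1) / (n + 2)                       # pr(x_j = v)
        cond = [(sum(1 for xi, yi in zip(x_j, y)
                     if xi == v and yi == c) + 1)
                / (fv + 2) for c in labels]            # pr(y = c | x_j = v)
        Lj += p_v * H(cond)
    return L - Lj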
Sentence Boundary Determination
 If the XML markup for a corpus doesn’t mark sentence boundaries, it is
necessary to identify them.
 It is necessary to determine when a period is part of a token and when it
is not.
 More sophisticated linguistic parsing algorithms often require a complete
sentence as input.
 Extraction algorithms operate on text one sentence at a time.
 These algorithms perform best when sentences are identified clearly.
 Sentence boundary determination is the problem of deciding which
instances of a period followed by white space are sentence delimiters and
which are not (we assume “?” and “!” are unambiguous) – a classification
problem.
 With an accurate algorithm and some adjustments, good performance can be
achieved.
Thank you!
Más contenido relacionado

La actualidad más candente

Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysisDataminingTools Inc
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data MiningValerii Klymchuk
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrievalKU Leuven
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxShivaVemula2
 
Text analytics in social media
Text analytics in social mediaText analytics in social media
Text analytics in social mediaJeremiah Fadugba
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalSudarsun Santhiappan
 
Data mining query language
Data mining query languageData mining query language
Data mining query languageGowriLatha1
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data miningKrish_ver2
 
Association rule mining and Apriori algorithm
Association rule mining and Apriori algorithmAssociation rule mining and Apriori algorithm
Association rule mining and Apriori algorithmhina firdaus
 
Data Mining: Applying data mining
Data Mining: Applying data miningData Mining: Applying data mining
Data Mining: Applying data miningDataminingTools Inc
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Seerat Malik
 
Register allocation and assignment
Register allocation and assignmentRegister allocation and assignment
Register allocation and assignmentKarthi Keyan
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data miningDataminingTools Inc
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.pptbutest
 

La actualidad más candente (20)

Text mining
Text miningText mining
Text mining
 
Data Mining: clustering and analysis
Data Mining: clustering and analysisData Mining: clustering and analysis
Data Mining: clustering and analysis
 
04 Classification in Data Mining
04 Classification in Data Mining04 Classification in Data Mining
04 Classification in Data Mining
 
Tdm information retrieval
Tdm information retrievalTdm information retrieval
Tdm information retrieval
 
Data mining primitives
Data mining primitivesData mining primitives
Data mining primitives
 
IRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptxIRS-Cataloging and Indexing-2.1.pptx
IRS-Cataloging and Indexing-2.1.pptx
 
Text analytics in social media
Text analytics in social mediaText analytics in social media
Text analytics in social media
 
Latent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information RetrievalLatent Semantic Indexing For Information Retrieval
Latent Semantic Indexing For Information Retrieval
 
Data mining query language
Data mining query languageData mining query language
Data mining query language
 
Information Extraction
Information ExtractionInformation Extraction
Information Extraction
 
4.2 spatial data mining
4.2 spatial data mining4.2 spatial data mining
4.2 spatial data mining
 
Association rule mining and Apriori algorithm
Association rule mining and Apriori algorithmAssociation rule mining and Apriori algorithm
Association rule mining and Apriori algorithm
 
Decision tree
Decision treeDecision tree
Decision tree
 
Data Mining: Applying data mining
Data Mining: Applying data miningData Mining: Applying data mining
Data Mining: Applying data mining
 
Data Mining: What is Data Mining?
Data Mining: What is Data Mining?Data Mining: What is Data Mining?
Data Mining: What is Data Mining?
 
Register allocation and assignment
Register allocation and assignmentRegister allocation and assignment
Register allocation and assignment
 
Data Mining: Application and trends in data mining
Data Mining: Application and trends in data miningData Mining: Application and trends in data mining
Data Mining: Application and trends in data mining
 
MachineLearning.ppt
MachineLearning.pptMachineLearning.ppt
MachineLearning.ppt
 
Pattern recognition
Pattern recognitionPattern recognition
Pattern recognition
 
Data Mining: Association Rules Basics
Data Mining: Association Rules BasicsData Mining: Association Rules Basics
Data Mining: Association Rules Basics
 

Similar a Text Data Mining Techniques and Processes Explained

Survey on Text Classification
Survey on Text ClassificationSurvey on Text Classification
Survey on Text ClassificationAM Publications
 
Survey of Machine Learning Techniques in Textual Document Classification
Survey of Machine Learning Techniques in Textual Document ClassificationSurvey of Machine Learning Techniques in Textual Document Classification
Survey of Machine Learning Techniques in Textual Document ClassificationIOSR Journals
 
1 _text_mining_v0a
1  _text_mining_v0a1  _text_mining_v0a
1 _text_mining_v0asaira gilani
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibEl Habib NFAOUI
 
Ppt programming by alyssa marie paral
Ppt programming by alyssa marie paralPpt programming by alyssa marie paral
Ppt programming by alyssa marie paralalyssamarieparal
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining Rupak Roy
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1Dave King
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewINFOGAIN PUBLICATION
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibEl Habib NFAOUI
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search enginesunyil96
 
A Lightweight Approach To Semantic Annotation Of Research Papers
A Lightweight Approach To Semantic Annotation Of Research PapersA Lightweight Approach To Semantic Annotation Of Research Papers
A Lightweight Approach To Semantic Annotation Of Research PapersScott Bou
 
Literature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesLiterature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesHammad Afzal
 
Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Heimo Hänninen
 
data structures and its importance
 data structures and its importance  data structures and its importance
data structures and its importance Anaya Zafar
 
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningA Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningIJSRD
 
6.domain extraction from research papers
6.domain extraction from research papers6.domain extraction from research papers
6.domain extraction from research papersEditorJST
 

Similar a Text Data Mining Techniques and Processes Explained (20)

Survey on Text Classification
Survey on Text ClassificationSurvey on Text Classification
Survey on Text Classification
 
Survey of Machine Learning Techniques in Textual Document Classification
Survey of Machine Learning Techniques in Textual Document ClassificationSurvey of Machine Learning Techniques in Textual Document Classification
Survey of Machine Learning Techniques in Textual Document Classification
 
1 _text_mining_v0a
1  _text_mining_v0a1  _text_mining_v0a
1 _text_mining_v0a
 
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habibConceptual foundations of text mining and preprocessing steps nfaoui el_habib
Conceptual foundations of text mining and preprocessing steps nfaoui el_habib
 
Ppt programming by alyssa marie paral
Ppt programming by alyssa marie paralPpt programming by alyssa marie paral
Ppt programming by alyssa marie paral
 
Introduction to Text Mining
Introduction to Text Mining Introduction to Text Mining
Introduction to Text Mining
 
Text mining and analytics v6 - p1
Text mining and analytics   v6 - p1Text mining and analytics   v6 - p1
Text mining and analytics v6 - p1
 
Text Mining at Feature Level: A Review
Text Mining at Feature Level: A ReviewText Mining at Feature Level: A Review
Text Mining at Feature Level: A Review
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
Text mining
Text miningText mining
Text mining
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search engines
 
A Lightweight Approach To Semantic Annotation Of Research Papers
A Lightweight Approach To Semantic Annotation Of Research PapersA Lightweight Approach To Semantic Annotation Of Research Papers
A Lightweight Approach To Semantic Annotation Of Research Papers
 
Text Mining.pptx
Text Mining.pptxText Mining.pptx
Text Mining.pptx
 
Literature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resourcesLiterature Based Framework for Semantic Descriptions of e-Science resources
Literature Based Framework for Semantic Descriptions of e-Science resources
 
Heterogeneous data annotation
Heterogeneous data annotationHeterogeneous data annotation
Heterogeneous data annotation
 
Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?Semantic technology in nutshell 2013. Semantic! are you a linguist?
Semantic technology in nutshell 2013. Semantic! are you a linguist?
 
data structures and its importance
 data structures and its importance  data structures and its importance
data structures and its importance
 
A Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text miningA Novel Approach for Keyword extraction in learning objects using text mining
A Novel Approach for Keyword extraction in learning objects using text mining
 
Bibliographic metadata (including citation)
Bibliographic metadata (including citation)Bibliographic metadata (including citation)
Bibliographic metadata (including citation)
 
6.domain extraction from research papers
6.domain extraction from research papers6.domain extraction from research papers
6.domain extraction from research papers
 

Text Data Mining Techniques and Processes Explained

  • 10. Challenges in Text Mining  Data collections are “free text” and not well-organized (semi-structured or unstructured)  No uniform access over all sources; each source has its own storage and algebra (examples: email, databases, applications, the web)  A quintuple heterogeneity: semantic, linguistic, structure, format, size of the unit of information  Learning techniques for processing text typically need annotated training data  XML as the common model allows:  Manipulating data with standards  Mining becomes more like data mining  RDF is emerging as a complementary model  The more structure you can exploit, the better the mining results
  • 11. Types of Text Data Mining  Keyword-based association analysis  Automatic document classification  Similarity detection  Cluster documents by a common author  Cluster documents containing information from a common source  Link analysis: unusual correlation between entities  Sequence analysis: predicting a recurring event  Anomaly detection: find information that violates usual patterns  Hypertext analysis  Patterns in anchors/links  Anchor text correlations with linked objects
  • 12. Documents and Document Collections  A document collection is a grouping of text-based documents.  It can be either static or dynamic (growing over time).  A document is a unit of discrete textual data within a collection, usually representing some real-world document such as a business report, memorandum, email, research paper, or news story.  A document can be a member of different document collections (e.g., legal affairs and computing equipment, if it falls under both).
  • 13. Document Structure Text documents can be :  unstructured i.e. free-style text (but from a linguistic perspective they are really structured objects)  weakly structured Adhering to some pre-specified format, like most scientific papers, business reports, legal memoranda, news stories etc.  semi-structured Exploiting heavy document templating or style sheets.
  • 14. Weakly Structured and Semi-Structured Documents  Documents that have relatively little in the way of strong typographical, layout, or markup indicators to denote structure are referred to as free-format or weakly structured documents (such as most scientific research papers, business reports, and news stories)  Documents with extensive and consistent format elements, in which field-type metadata can be more easily inferred, are described as semi-structured documents (such as some e-mail, HTML web pages, and PDF files)
  • 15. Document Representation and Features  The irregular and implicitly structured representation is transformed into an explicitly structured representation.  We can distinguish: - feature-based representation, - relational representation.  In a feature-based representation, documents are represented by a set of features.
  • 16. Document Features Although many potential features can be employed to represent documents, the following four types are most commonly used:  Characters  Words  Terms  Concepts High Feature Dimensionality (HFD)  Problems relating to HFD are typically of much greater magnitude in TM systems than in classic DM systems. Feature Sparsity  Only a small percentage of all possible features for a document collection as a whole appears in any single document.
  • 17. Representational Model of a Document  An essential task for most text mining systems is the identification of a simplified subset of document features that can be used to represent a particular document as a whole.  We refer to such a set of features as the representational model of a document.  Commonly used document features:  Characters,  Words,  Terms, and  Concepts
  • 18. Character-level Representation  Without positional information: often of very limited utility in TM applications  With positional information: somewhat more useful and common (e.g., bigrams or trigrams of characters)  Disadvantage: character-based representations can often be unwieldy for some types of text processing techniques because the feature space for a document is fairly unoptimized Word-level Representation  Without positional information: often of limited utility  With positional information: more useful and common (e.g., bigrams or trigrams of words)  Disadvantage: word-based representations can likewise be unwieldy where the feature space remains unoptimized
  • 19. Term-level Representation  Normalized terms come out of a term-extraction methodology  A term is a sequence of one or more tokenized and lemmatized word forms associated with part-of-speech tags. Concept-level Representation  Concepts are features generated for a document by means of a manual, statistical, rule-based, or hybrid categorization methodology.
  • 20. General Architecture of Text Mining Systems Abstract Level A text mining system takes raw documents as input and generates various types of output, such as:  Patterns  Maps of connections  Trends (Schematic: documents in; patterns, connections, and trends out)
  • 21. General Architecture of Text Mining Systems Functional Level TM systems follow the general model provided by some classic DM applications and are thus divisible into 4 main areas • Preprocessing Tasks • Core mining operations • Presentation layer components and browsing functionality • Refinement techniques
  • 22. System Architecture for Generic Text Mining System
  • 24. System Architecture for an Advanced Text Mining System with background knowledge base
  • 25. Core Text Mining Operations  Core mining operations in text mining systems are the algorithms that support the creation of queries for discovering patterns in document collections.  Core Text Mining Operations • Distributions • Frequent and Near Frequent Sets • Associations • Isolating Interesting Patterns • Analyzing Document Collections over Time  Using Background Knowledge for Text Mining  Text Mining Query Languages
  • 26. Core Text Mining Operations  Core text mining operations consist of various mechanisms for discovering patterns of concepts within a document collection.  The three types of patterns in text mining:  Distributions (and proportions)  Frequent and near frequent sets  Associations  Symbols  D : a collection of documents  K : a set of concepts  k : a concept
  • 27. Distributions  Definition 1. Concept Selection  Selecting some subcollection of documents that is labeled with one or more given concepts  D/K  The subset of documents in D labeled with all of the concepts in K  Definition 2. Concept Proportion  The proportion of a set of documents labeled with a particular concept  f(D , K) = |D/K| / |D|  The fraction of documents in D labeled with all of the concepts in K
  • 28. Distributions  Definition 3. Conditional Concept Proportion  The proportion of a set of documents labeled with a concept that are also labeled with another concept  f(D , K1|K2) = f(D/K2 , K1)  The proportion of all those documents in D labeled with K2 that are also labeled with K1  Definition 4. Concept Proportion Distribution  The proportion of documents in some collection that are labeled with each of a number of selected concepts  FK(D , x)  The proportion of documents in D labeled with x for any x in K
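To make Definitions 1–4 concrete, here is a minimal Python sketch, assuming each document is represented simply as the set of concepts it is labeled with (the function and variable names are illustrative):

```python
def select(D, K):
    """D/K: the sub-collection of documents labeled with all concepts in K."""
    return [doc for doc in D if K <= doc]

def proportion(D, K):
    """f(D, K) = |D/K| / |D| (Definition 2)."""
    return len(select(D, K)) / len(D) if D else 0.0

def conditional_proportion(D, K1, K2):
    """f(D, K1 | K2) = f(D/K2, K1) (Definition 3)."""
    return proportion(select(D, K2), K1)

docs = [{"merger", "bank"}, {"merger"}, {"bank", "loan"}]
print(proportion(docs, {"merger"}))                        # 2/3
print(conditional_proportion(docs, {"bank"}, {"merger"}))  # 1/2
```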
  • 29. Distributions  Definition 5. Conditional Concept Proportion Distribution  The proportion of those documents in D labeled with all the concepts in K’ that are also labeled with concept x (with x in K)  FK(D, x | K’) = FK(D/K’, x)  Definition 6. Average Concept Proportion  Given a collection of documents D, a concept k, and an internal node in the hierarchy n, an average concept proportion is the average value of f(D, k | k’), where k’ ranges over all immediate children of n.  a(D, k | n) = Avg {k’ is a child of n} {f(D, k | k’)}
  • 30. Distributions  Definition 7. Average Concept Distribution  Given a collection of documents D and two internal nodes in the hierarchy n and n’, average concept distribution is the distribution that, for any x that is a child of n, averages x’s proportions over all children of n’  An(D,x | n’ ) = Avg {k’ is a child of n’} {Fn(D,x | k’)}
  • 31. Frequent and Near Frequent Sets  Frequent Concept Sets  A set of concepts represented in the document collection with co-occurrences at or above a minimal support level (given as a threshold parameter s; i.e., all the concepts of the frequent concept set appear together in at least s documents) Support  The number (or percent) of documents containing the given rule – that is, the co-occurrence frequency Confidence  The percentage of the time that the rule is true
  • 32. Frequent and Near Frequent Sets Algorithm 1 : The Apriori Algorithm (Agrawal and Srikant 1994)  Discovery methods for frequent concept sets in text mining.
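A minimal sketch in the spirit of the Apriori approach, assuming documents are sets of concept labels; the prune step of the full algorithm is omitted for brevity:

```python
from itertools import combinations

def frequent_concept_sets(docs, min_support):
    """Return all concept sets co-occurring in at least min_support docs."""
    items = {c for doc in docs for c in doc}
    frequent = {}
    candidates = [frozenset([c]) for c in items]
    while candidates:
        counts = {s: sum(1 for d in docs if s <= d) for s in candidates}
        survivors = [s for s, n in counts.items() if n >= min_support]
        frequent.update({s: counts[s] for s in survivors})
        # join step: merge pairs of frequent k-sets into (k+1)-set candidates
        candidates = list({a | b for a, b in combinations(survivors, 2)
                           if len(a | b) == len(a) + 1})
    return frequent

docs = [{"merger", "bank", "loan"}, {"merger", "bank"}, {"bank", "loan"}]
print(frequent_concept_sets(docs, min_support=2))
```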
  • 33. Frequent and Near Frequent Sets Algorithm for Frequent Set Generation Frequent sets are generated in relation to some support level. Support (i.e., the frequency of co-occurrence) is by convention often expressed as the variable σ; frequent sets are therefore sometimes also referred to as σ-covers or σ-cover sets.
  • 34. Frequent and Near Frequent Sets  Near Frequent Concept Sets  An undirected relation between two frequent sets of concepts  This relation can be quantified by measuring the degree of overlapping, for example, on the basis of the number of documents that include all the concepts of the two concept sets.
  • 35. Associations Associations  Directed relations between concepts or sets of concepts Association Rule  An expression of the form A => B, where A and B are sets of features  An association rule A => B indicates that transactions that involve A tend also to involve B.  A is the left-hand side (LHS)  B is the right-hand side (RHS) Confidence of Association Rule A => B (A, B : frequent concept sets)  The percentage of documents that include all the concepts in B within the subset of those documents that include all the concepts in A Support of Association Rule A => B (A, B : frequent concept sets)  The percentage of documents that include all the concepts in A and B
  • 36. Associations Discovering Association Rules  The problem of finding all the association rules with a confidence and support greater than the user-specified minconf (minimum confidence level) and minsup (minimum support level) thresholds Two steps of discovering associations  Find all frequent concept sets X (i.e., all combinations of concepts with a support greater than minsup).  Test whether X-B => B holds with the required confidence  X = {w,x,y,z}, B = {y,z}, X-B = {w,x}  X-B => B: {w,x} => {y,z}  Confidence of association rule {w,x} => {y,z}: confidence = support({w,x,y,z}) / support({w,x})
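The two steps above translate directly into code. A sketch, assuming `support` maps frozensets of concepts to their document counts (e.g., as produced by an Apriori-style pass):

```python
from itertools import combinations

def association_rules(support, minconf):
    """Emit every rule lhs => rhs whose confidence reaches minconf."""
    rules = []
    for X in support:
        for r in range(1, len(X)):
            for lhs in map(frozenset, combinations(X, r)):
                if lhs not in support:
                    continue
                conf = support[X] / support[lhs]   # confidence of lhs => rhs
                if conf >= minconf:
                    rules.append((set(lhs), set(X - lhs), conf))
    return rules

support = {frozenset({"w", "x"}): 10, frozenset({"w", "x", "y", "z"}): 7}
print(association_rules(support, minconf=0.6))
# includes ({'w', 'x'}, {'y', 'z'}, 0.7)
```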
  • 37. Associations Maximal Associations (M-associations)  Relations between concepts in which associations are identified in terms of their relevance to one concept and their lack of relevance to another  For example, concept X most often appears in association with concept Y. Simple Algorithm for Generating Associations (Rajman and Besancon 1998)
  • 38. Associations  Definition 8. Alone with Respect to Maximal Associations  For a transaction t, a category g, and a concept-set X ⊆ gi, one would say that X is alone in t if t ∩ gi = X  X is alone in t if X is the largest subset of gi that is in t  X is maximal in t …  t M-supports X …  For a document collection D, the M-support of X in D is the number of transactions t ∈ D that M-support X.
  • 39. Associations  The M-support for a maximal association rule is defined analogously:  if D(X, g(Y)) is the subset of the document collection D consisting of all the transactions that M-support X and contain at least one element of g(Y), then the M-confidence of the rule is defined over D(X, g(Y)) (the formulas appeared as figures in the original slides).
  • 40. Isolating Interesting Patterns Interestingness with Respect to Distributions and Proportions  Measures for quantifying the distance between an investigated distribution and a reference distribution, e.g., a sum-of-squares distance between two models: D(P’ || P) = ∑x (p’(x) – p(x))²
  • 41. Isolating Interesting Patterns  Definition 9. Concept Distribution Distance  Given two concept distributions P’K(x) and PK(x), the distance between them is D(P’K || PK) = ∑x (P’K(x) – PK(x))²  Definition 10. Concept Proportion Distance  The value of the difference between the two distributions at a particular point x: d(P’K || PK) = P’K(x) – PK(x)
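Definitions 9 and 10 translate directly; a small sketch assuming distributions are dictionaries mapping concepts to proportions:

```python
def distribution_distance(P1, P2):
    """D(P1 || P2) = sum over x of (P1(x) - P2(x))^2 (Definition 9)."""
    keys = set(P1) | set(P2)
    return sum((P1.get(x, 0.0) - P2.get(x, 0.0)) ** 2 for x in keys)

def proportion_distance(P1, P2, x):
    """d(P1 || P2) at a single concept x (Definition 10)."""
    return P1.get(x, 0.0) - P2.get(x, 0.0)

P_now = {"merger": 0.4, "loan": 0.1}
P_ref = {"merger": 0.2, "loan": 0.3}
print(distribution_distance(P_now, P_ref))   # 0.08
```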
  • 42. Analyzing Document Collections over Time Incremental Algorithms  Algorithms processing truly dynamic document collections that add, modify, or delete documents over time Trend Analysis  The term generally used to describe the analysis of concept distribution behavior across multiple document subsets over time  A two-phase process First phase  Phrases are created as frequent sequences of words using the sequential patterns mining algorithms first mooted for mining structured databases Second phase  A user can query the system to obtain all phrases whose trend matches a specified pattern.
  • 43. Analyzing Document Collections over Time Ephemeral Associations  A direct or inverse relation between the probability distributions of given topics (concepts) over a fixed time span Direct Ephemeral Associations  One very frequently occurring or “peak” topic during a period seems to influence either the emergence or disappearance of other topics Inverse Ephemeral Associations  Momentary negative influence between one topic and another Deviation Detection  The identification of anomalous instances that do not fit a defined “standard case” in large amounts of data.
  • 44. Analyzing Document Collections over Time Context Phrases and Context Relationships  Definition 11. Context Phrase  A subset of documents in a document collection that is either labeled with all, or at least one, of the concepts in a specified set of concepts.  If D is a collection of documents and C is a set of concepts, D/A(C) is the subset of documents in D labeled with all the concepts in C, and D/O(C) is the subset of documents in D labeled with at least one of the concepts in C. Both D/A(C) and D/O(C) are referred to as context phrases.
  • 45. Analyzing Document Collections over Time Context Phrases and Context Relationships  Definition 12. Context Relationships  The relationship within a set of concepts found in the document collection in relation to a separately specified concept (the context or the context concept)  If D is a collection of documents, c1 and c2 are individual concepts, and P is a context phrase, R(D, c1, c2 | P) is the number of documents in D/P that include both c1 and c2. Formally, R(D, c1, c2 | P) = |(D/A({c1, c2}))/P|.
  • 46. Analyzing Document Collections over Time The Context Graph  Definition 13. Context Graph  A graphic representation of the relationship between a set of concepts as reflected in a corpus with respect to a given context.  A context graph consists of a set of vertices (nodes) and edges.  The vertices of the graph represent concepts  Weighted edges denote the affinity between the concepts.  If D is a collection of documents, C is a set of concepts, and P is a context phrase, the concept graph of D, C, P is a weighted graph G = (C, E), with nodes in C and a set of edges E = {{c1, c2} | R(D, c1, c2 | P) > 0}. For each edge {c1, c2} ∈ E, one defines the weight of the edge w{c1, c2} = R(D, c1, c2 | P).
  • 47. Analyzing Document Collections over Time  Example of a context graph in the context of P: three nodes Concept1 (C1), Concept2 (C2), and Concept3 (C3), with edge weights R(D, c1, c2 | P) = 10 and R(D, c1, c3 | P) = 15
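A minimal sketch of Definition 13, assuming documents are sets of concepts and the context phrase P is read as D/A(P), i.e., a document must contain all the concepts of P:

```python
from itertools import combinations
from collections import Counter

def context_graph(D, C, P):
    """Edges {c1, c2} weighted by R(D, c1, c2 | P) for concepts in C."""
    in_context = [d for d in D if P <= d]          # the sub-collection D/P
    weights = Counter()
    for d in in_context:
        present = sorted(C & d)                    # concepts of C in this doc
        for c1, c2 in combinations(present, 2):
            weights[(c1, c2)] += 1                 # R(D, c1, c2 | P)
    return dict(weights)                           # only edges with weight > 0

D = [{"P", "c1", "c2"}, {"P", "c1", "c3"}, {"c1", "c2"}]
print(context_graph(D, {"c1", "c2", "c3"}, {"P"}))
# {('c1', 'c2'): 1, ('c1', 'c3'): 1}  - the third document lacks the context
```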
  • 48. Analyzing Document Collections over Time  Definition 14. Temporal Selection (“Time Interval”)  If D is a collection of documents and I is a time range, date range, or both, DI is the subset of documents in D whose time stamp, date stamp, or both, is within I. The resulting selection is sometimes referred to as the time interval.  Definition 15. Temporal Context Relationship  If D is a collection of documents, c1 and c2 are individual concepts, P is a context phrase, and I is the time interval, then RI(D, c1, c2 | P) is the number of documents in DI in which c1 and c2 co-occur in the context of P – that is, RI(D, c1, c2 | P) is the number of documents in DI/P that include both c1 and c2.  Definition 16. Temporal Context Graph  If D is a collection of documents, C is a set of concepts, P is a context phrase, and I is the time range, the temporal concept graph of D, C, P, I is a weighted graph G = (C, EI) with nodes in C and a set of edges EI = {{c1, c2} | RI(D, c1, c2 | P) > 0}. For each edge {c1, c2} ∈ EI, one defines the weight of the edge wI{c1, c2} = RI(D, c1, c2 | P).
  • 49. Analyzing Document Collections over Time The Trend Graph A representation that builds on the temporal context graph as informed by the general approaches found in trend analysis New Edges  Edges that did not exist in the previous graph Increased Edges  Edges with a relatively higher weight than in the previous interval Decreased Edges  Edges with a relatively lower weight than in the previous interval Stable Edges  Edges with about the same weight as the corresponding edge in the previous interval
  • 50. Analyzing Document Collections over Time  The Borders Incremental Text Mining Algorithm  The Borders algorithm can be used to update search pattern results incrementally.  Definition 17. Border Set  X is a border set if it is not a frequent set but every proper subset Y ⊂ X is a frequent set
  • 51. Analyzing Document Collections over Time The Borders Incremental Text Mining Algorithm  Concept set A = {A1, …, Am}  Relations over A:  Rold : old relation  Rinc : increment  Rnew : new combined relation  s(X/R) : support of concept set X in the relation R  s* : minimum support threshold (min_sup)  Property 1: if X is a new frequent set in Rnew, then there is a subset Y ⊆ X such that Y is a promoted border  Property 2: if X is a new k-sized frequent set in Rnew, then each subset Y ⊂ X of size k−1 is one of the following: (a) a promoted border, (b) a frequent set, or (c) an old frequent set with additional support in Rinc.
  • 52. Analyzing Document Collections over Time The Borders Incremental Text Mining Algorithm  Stage 1: Finding Promoted Borders and Generating Candidates.  Stage 2: Processing Candidates
  • 53. Analyzing Document Collections over Time  The Borders Incremental Text Mining Algorithm
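The border-set test of Definition 17 is easy to state in code. A sketch, assuming `frequent` is the set of all frequent concept sets (as frozensets):

```python
from itertools import combinations

def is_border_set(X, frequent):
    """X is a border set if X is not frequent but every proper subset is."""
    X = frozenset(X)
    if X in frequent:
        return False
    return all(frozenset(sub) in frequent
               for r in range(1, len(X))
               for sub in combinations(X, r))

frequent = {frozenset({"a"}), frozenset({"b"}), frozenset({"c"}),
            frozenset({"a", "b"})}
print(is_border_set({"a", "c"}, frequent))   # True: {a} and {c} are frequent
print(is_border_set({"a", "b"}, frequent))   # False: it is itself frequent
```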
  • 54. Text Mining Preprocessing Techniques  Effective text mining operations are predicated on sophisticated data preprocessing methodologies.  Text mining is strongly dependent on the various preprocessing techniques that infer or extract structured representations from raw unstructured data sources, or do both.  Different preprocessing techniques are used to create structured document representations from raw textual data (structuring documents and, by extension, document collections).  The totality of preparatory document-structuring techniques can be categorized in two ways - according to:  their task and  the algorithms and formal frameworks that they use.
  • 55. Pre-processing Techniques  Task-oriented approaches  General-purpose NLP tasks  Tokenization and zoning  Part-of-speech tagging and stemming  Shallow and deep syntactic parsing  Problem-dependent tasks  Text categorization  Information extraction
  • 56. Taxonomy of Text Pre-Processing Tasks
  • 57. General Purpose NLP Tasks  Tokenization  Tokenization is the process of breaking a stream of text up into words, phrases, symbols, or other meaningful elements called tokens.  The list of tokens becomes input for further processing such as parsing or text mining.  Part-of-speech Tagging  POS tagging is the annotation of words with the appropriate POS tags based on the context in which they appear.  POS tags divide words into categories based on the role they play in the sentence in which they appear.  POS tags provide information about the semantic content of a word.  POS taggers at some stage of their processing perform morphological analysis of words. An additional output of a POS tagger is a sequence of stems (“lemmas”) of the input words.
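As a short illustration of tokenization and POS tagging together, a sketch using NLTK (an assumed toolkit, not prescribed by the slides; the tokenizer and tagger models must be downloaded before first use):

```python
import nltk

# Assumes the NLTK models are installed, e.g. via
# nltk.download("punkt") and nltk.download("averaged_perceptron_tagger").
text = "The books were well documented."
tokens = nltk.word_tokenize(text)   # tokenization
tagged = nltk.pos_tag(tokens)       # part-of-speech tagging
print(tagged)   # e.g. [('The', 'DT'), ('books', 'NNS'), ('were', 'VBD'), ...]
```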
  • 58.  Syntactical parsing  Syntactical parsing components perform a full syntactical analysis of sentences according to a certain grammar theory. The basic division is between constituency and dependency grammars.  Constituency grammars describe the syntactical structure of sentences in terms of recursively built phrases – sequences of syntactically grouped elements.  Dependency grammars do not recognize constituents as separate linguistic units but focus instead on the direct relations between words.  Shallow parsing  Shallow parsing trades depth of analysis for speed and robustness of processing.  Instead of providing a complete analysis (a parse) of a whole sentence, shallow parsers produce only the parts that are easy and unambiguous.  For the purposes of information extraction, shallow parsing is usually sufficient and preferable to full analysis because of its far greater speed and robustness.
  • 59. Problem Dependent Task  Text Categorization  Text categorization (text classification) tasks tag each document with a small number of concepts or keywords.  The set of all possible concepts or keywords is usually manually prepared, closed, and comparatively small. The hierarchy relation between the keywords is also prepared manually.  Information Extraction  Information retrieval returns documents that match a given query but still requires the user to read through these documents to locate the relevant information.  IE aims at pinpointing the relevant information and presenting it in a structured format – typically a tabular format.
  • 60. Types of Problems  Text mining operates in very high dimensions; in many situations, processing is nonetheless effective and efficient because of the sparseness characteristic of most documents and most practical applications.  The types of problems that can be solved with the text mining approach to data representation and learning methods are Document Classification Information Retrieval Clustering and Organizing Documents Information Extraction Prediction and Evaluation
  • 61. Document Classification  Documents are organized into folders, one folder for each topic. A new document is presented, and the objective is to place this document in the appropriate folders.  Document classification or document categorization is a problem in library science, information science and computer science. The task is to assign a document to one or more classes or categories.  Document classification tasks can be divided into three kinds  supervised document classification is performed by an external mechanism, usually human feedback, which provides the necessary information for the correct classification of documents  semi-supervised document classification, a mixture between supervised and unsupervised classification: some documents or parts of documents are labeled by external assistance  unsupervised document classification is entirely executed without reference to external information
  • 63. Classification Techniques  Decision Trees  K-nearest neighbors  Training examples are points in a vector space  Compute the distance between the new instance and all training instances; the k closest vote for the class  Naïve Bayes Classifier  Classify using probabilities, assuming independence among terms  P(xi | C) is estimated as the relative frequency of examples having value xi as a feature in class C  P(C | Xi, Xj, Xk) ∝ P(C) · P(Xi | C) · P(Xj | C) · P(Xk | C)  Neural networks, support vector machines, …
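A tiny Naïve Bayes sketch over binary word features, assuming documents are sets of words and using Laplace smoothing for the estimates:

```python
from collections import Counter
from math import log

def train(docs, labels):
    """Estimate class priors and per-class word probabilities."""
    vocab = {w for d in docs for w in d}
    prior, cond = {}, {}
    for c in set(labels):
        in_c = [d for d, l in zip(docs, labels) if l == c]
        prior[c] = len(in_c) / len(docs)
        counts = Counter(w for d in in_c for w in d)
        # Laplace-smoothed P(word present | class)
        cond[c] = {w: (counts[w] + 1) / (len(in_c) + 2) for w in vocab}
    return prior, cond

def classify(doc, prior, cond):
    """Pick the class maximizing log P(C) + sum of log P(xi | C)."""
    def score(c):
        return log(prior[c]) + sum(log(cond[c][w]) for w in doc if w in cond[c])
    return max(prior, key=score)

prior, cond = train([{"cheap", "pills"}, {"meeting", "agenda"}], ["spam", "ham"])
print(classify({"cheap", "pills"}, prior, cond))   # spam
```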
  • 64. Information Retrieval (schematic: a text query, e.g., “spam”, and a documents source feed an IR system, which returns ranked documents) Given  A source of textual documents  A user query (text based) Find  A ranked set of documents that are relevant to the query
  • 65. Information Retrieval  Information Retrieval (IR) is finding material (usually documents) of an unstructured nature (usually text) that satisfies an information need (query) from within large collections (usually stored on computers).  Basic assumptions Collection: a fixed set of documents Goal: retrieve documents whose information is relevant to the user’s information need and helps the user complete a task
  • 66. Information Retrieval Basic Information Retrieval (IR) process Browsing or navigation system  The user skims the document collection, jumping from one document to another via hypertext or hypermedia links until a relevant document is found Classical IR system Question answering system  Query: a question in natural language  Answer: directly extracted from the text of the document collection Text-based information retrieval  Information item (document)  In text format (written/spoken) or with a textual description  Information need (query)  Usually in text format
  • 68. Clustering and Organizing Documents: Clustering (schematic: a documents source and a similarity measure feed a clustering system, which outputs groups of documents) Given  A source of textual documents  A similarity measure (e.g., how many words are common to the documents) Find  Several clusters of documents that are relevant to each other
  • 69. Clustering and Organizing Documents The clustering process is equivalent to assigning the labels needed for text categorization. Although there are many ways to cluster documents, clustering is not quite as powerful a process as assigning answers (i.e., known correct labels) to documents.
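A minimal greedy sketch of the idea, assuming documents are sets of words and the similarity measure is the count of shared words; real systems use k-means or hierarchical clustering over weighted vectors:

```python
def cluster(docs, min_shared=2):
    """Assign each document to the first cluster whose seed shares at
    least min_shared words with it; otherwise start a new cluster."""
    clusters = []                          # each cluster: (seed set, members)
    for i, doc in enumerate(docs):
        for seed, members in clusters:
            if len(seed & doc) >= min_shared:
                members.append(i)
                break
        else:
            clusters.append((set(doc), [i]))
    return [members for _, members in clusters]

docs = [{"bank", "loan", "rate"}, {"loan", "rate", "credit"}, {"game", "score"}]
print(cluster(docs))   # [[0, 1], [2]]
```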
  • 70. Information Extraction Definition  The automatic extraction of structured information from unstructured documents.  Information extraction is the process of scanning text for information relevant to some interest  Extract: entities, relations, events Overall goals:  Making information more accessible to people  Making information more machine-processable
  • 71. Information Extraction Why IE?  Need for efficient processing of texts in specialized domains  Focus on the relevant parts, ignore the rest  Typical applications:  Gleaning business, government, or military intelligence  WWW searches (more specific than keywords)  Scientific literature searches
  • 72. Information Extraction Information extraction is a subfield of text mining that attempts to move text mining onto an equal footing with the structured world of data mining. The objective is to take an unstructured document and automatically fill in the values of a spreadsheet.
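A toy sketch of the spreadsheet-filling idea: scan free text for one illustrative pattern and emit one table row per match. The pattern and field names are assumptions, not a general IE system:

```python
import re

# "price of <product> is $<amount>" - an illustrative extraction pattern
PATTERN = re.compile(r"price of (?P<product>[\w ]+?) is \$(?P<amount>\d+(?:\.\d+)?)")

def extract(text):
    """Return one dict (spreadsheet row) per matched fact."""
    return [m.groupdict() for m in PATTERN.finditer(text)]

text = "The price of wheat is $3.50 today; the price of corn is $2.10."
print(extract(text))
# [{'product': 'wheat', 'amount': '3.50'}, {'product': 'corn', 'amount': '2.10'}]
```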
  • 73. Prediction and Evaluation  The key to prediction is the measurement of error. For topic assignment, we can determine whether a program’s answer is right or wrong.  The classical measures of accuracy are applicable, but not all errors are evaluated equally.  That is why measures of accuracy such as “recall” and “precision” are especially important to document analysis.
  • 74. Performance Measure  The set of retrieved documents can be formed by collecting the top-ranking documents according to a similarity measure  The quality of a collection can be compared by the two following measures  Precision: the percentage of retrieved documents that are in fact relevant to the query (i.e., “correct” responses)  Recall: the percentage of documents that are relevant to the query and were, in fact, retrieved
precision = |{Relevant} ∩ {Retrieved}| / |{Retrieved}|
recall = |{Relevant} ∩ {Retrieved}| / |{Relevant}|
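The two formulas translate directly, assuming the relevant and retrieved results are sets of document ids:

```python
def precision(relevant, retrieved):
    """Fraction of retrieved documents that are relevant."""
    return len(relevant & retrieved) / len(retrieved) if retrieved else 0.0

def recall(relevant, retrieved):
    """Fraction of relevant documents that were retrieved."""
    return len(relevant & retrieved) / len(relevant) if relevant else 0.0

rel, ret = {1, 2, 3, 4}, {3, 4, 5}
print(precision(rel, ret), recall(rel, ret))   # 0.666..., 0.5
```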
  • 75. From Textual Information to Numerical Vectors: Introduction  To mine text, we need to convert it into a form that data mining procedures can use.  As noted earlier, this involves generating features in a spreadsheet format  Classical data mining looks at highly structured data  The spreadsheet model is the embodiment of a representation that is supportive of predictive modeling  Predictive text mining is simpler and more restrictive than open-ended data mining
  • 76. From Textual Information to Numerical Vectors: Introduction  Text is unstructured and thus very far from the spreadsheet model that we need for predictive processing.  Transforming the data to the spreadsheet model is a methodical and carefully organized procedure for filling in the cells of a spreadsheet.  We have to determine the nature of each column in the spreadsheet.  Some features are easy to obtain (e.g., a word in a text); others are difficult (e.g., the grammatical function of a word in a sentence).  We next consider the kinds of features generated from text
  • 77. Collecting Documents  The first step of text mining is collecting the data  A web page retrieval application for an intranet implicitly specifies the relevant documents to be the web pages on the intranet  If the documents are identified, then they can be obtained  The main issue is to cleanse the samples and ensure high quality  For a web application comprising a number of autonomous websites, one may deploy a software tool such as a web crawler to collect the documents
  • 78. Collecting Documents  Other applications may attach a logging process to an input data stream for a length of time (e.g., an email audit logging the incoming and outgoing messages at a mail server for a period of time)  For R&D work in text mining, we need generic data – a corpus  A widely used example is the Reuters corpus (RCV1)  In the early days (1960s and 1970s), a collection of 1 million words was considered large  The Brown corpus, for example, consists of 500 samples of about 2,000 words each of American English text
  • 79. Collecting Documents  A European corpus was modeled on the Brown corpus for British English  In the 1970s and 80s more resources became available, often government sponsored  Some widely used corpora include the Penn Treebank (a collection of manually parsed sentences from the Wall Street Journal)  Another resource is the World Wide Web  Web crawlers can build collections of pages from a particular site such as Yahoo  Given the size of the web, such collections require cleaning before use
  • 80. Document Standardization  When documents are collected, they may arrive in different formats  Some documents may be collected in word-processor format, others as simple ASCII text. To process these documents, we have to convert them to a standard format  Standard format – XML  XML is the Extensible Markup Language
  • 81. Document Standardization-XML  A standard way to insert tags into text to identify its parts.  Each document is marked off from the corpus through XML  Typical XML tags include:  <Date>  <Subject>  <Topic>  <Text>  <Body>  <Header>
  • 82. XML – An Example <?xml version="1.0" encoding="ISO-8859-1"?> <note> <to>Diya</to> <from>Surya</from> <heading>Reminder</heading> <body>Happy Birth Day</body> </note>
  • 83. XML  The main reason to identify the parts is to allow selection of those parts that will be used to generate features  The selected parts of a document are concatenated into strings, separated by tags Document Standardization The advantage of standardization is that mining tools can be applied without having to consider the pedigree of each document.
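A small sketch of selecting tagged parts with Python's standard library; the tag names follow the example above:

```python
import xml.etree.ElementTree as ET

doc = """<note><to>Diya</to><from>Surya</from>
<heading>Reminder</heading><body>Happy Birth Day</body></note>"""
root = ET.fromstring(doc)

# keep only the parts chosen for feature generation, e.g. heading and body
selected = " ".join(root.find(tag).text for tag in ("heading", "body"))
print(selected)   # Reminder Happy Birth Day
```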
  • 84. Tokenization  With the documents collected in XML format, we can examine the data  The character stream is broken into words – TOKENS  Each token is an instance of a type; the number of tokens is thus higher than the number of types (two tokens of “the” in a sentence are two occurrences of the same type)  Space and tab characters are not tokens but white space  Commas and colons between letters are tokens, e.g., USA,INDIA  Between digits they are delimiters (121,135)  The apostrophe has a number of uses (delimiter or part of a token), e.g., D’Angelo  When followed by a terminator, it is an internal quote (Tess’.)
  • 85. Tokenization – Pseudo code  A dash is a terminator when the token is preceded or followed by another dash (522-3333)  Without identifying tokens, it is difficult to imagine extracting higher-level information from a document
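A runnable sketch consistent with the delimiter rules above (the slides' exact pseudo-code is not reproduced; this regex treats words with internal apostrophes, numbers, and individual punctuation marks as tokens):

```python
import re

# words (with an internal apostrophe, as in D'Angelo), numbers, and
# single punctuation marks each become separate tokens; white space never does
TOKEN = re.compile(r"[A-Za-z]+(?:'[A-Za-z]+)?|\d+|[^\w\s]")

def tokenize(text):
    return TOKEN.findall(text)

print(tokenize("Call 522-3333; D'Angelo said: USA,INDIA."))
# ['Call', '522', '-', '3333', ';', "D'Angelo", 'said', ':',
#  'USA', ',', 'INDIA', '.']
```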
  • 86. Lemmatization  Once a character stream has been segmented into a sequence of tokens, the next step is to convert each token to a standard form – stemming or lemmatization (whether to do so is application dependent)  This reduces the number of distinct types in the corpus and increases the frequency of occurrence of individual types  English speakers agree that the nouns “book” and “books” are two forms of the same word, so it is advantageous to eliminate this kind of variation  Normalization that regularizes grammatical variants is called inflectional stemming
  • 87. Stemming to a Root  Grammatical variants (singular/plural, present/past)  It is often advantageous to eliminate this kind of variation before further processing  When normalization is confined to regular grammatical variants such as singular/plural and present/past, the process is called inflectional stemming  The intent of root stemmers is to reach a root form with no inflectional or derivational prefixes or suffixes; the end result is aggressive stemming Example  Reducing the number of types in a text
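A deliberately naive inflectional-stemming sketch with just three suffix rules; production systems use dictionary-backed lemmatizers or a Porter-style stemmer instead:

```python
def inflectional_stem(word):
    """Strip a few regular grammatical suffixes (illustrative rules only)."""
    w = word.lower()
    if w.endswith("ies") and len(w) > 4:
        return w[:-3] + "y"          # bodies -> body
    if w.endswith("s") and not w.endswith("ss"):
        return w[:-1]                # books -> book (but class stays class)
    if w.endswith("ed") and len(w) > 4:
        return w[:-2]                # walked -> walk
    return w

print([inflectional_stem(w) for w in ["Books", "bodies", "walked", "class"]])
# ['book', 'body', 'walk', 'class']
```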
  • 89. Vector Generation for Prediction  Consider the problem of categorizing documents  The characteristic features are the tokens or words the documents contain  Without deep analysis, we can choose to describe each document by features that represent the most frequent tokens  The collection of features is called a dictionary  The tokens or words in the dictionary form the basis for creating a spreadsheet of numeric data corresponding to the document collection  Each row is a document; each column is a feature
  • 90. Vector Generation for Prediction  Each cell in the spreadsheet is a measurement of a feature for a document.  In the basic model of the data, we simply check the presence or absence of words.  Checking for words is simple because we do not scan the dictionary for each word; we build a hash table. Large samples of digital documents are readily available, giving confidence about the variations and combinations of words that occur.  If prediction is our goal, then we need one more column for the correct answer.  In preparing data for learning, this information is available from the document labels. (Labels are binary answers, also called the class.)  Instead of generating a global dictionary, we can consider only the words in the class that we are trying to predict.  If this class is far smaller than the negative class, which is typical, the local dictionary is far smaller than the global dictionary
  • 91.  Another reduction in dictionary size is to compile a list of stopwords and remove them from the dictionary.  Stopwords almost never have any predictive capability; examples are the articles “a” and “the” and the pronouns “it” and “they”.  Frequency information on the word counts can be quite useful in reducing the dictionary size and improving predictive performance  The most frequent words are often stopwords and can be deleted.  An alternative approach to local dictionary generation is to generate a global dictionary from all documents in the collection. Special feature-selection routines then attempt to select the subset of words with the greatest potential for prediction (independent selection methods)  If we have 100 topics to categorize, then we have 100 problems to solve. Our choices are 100 small dictionaries or 1 global dictionary.
  • 92.  The vectors implied by the spreadsheet model are regenerated to correspond to the smaller dictionary  Instead of placing every word form in the dictionary, we can follow the practice of a printed dictionary and avoid storing every variation of a word (no separate singular/plural or past/present entries)  Verbs are stored in stemmed form  This adds a layer of complexity to processing the text, but performance is gained and size is reduced  Universally trimming words to their root form can, however, lose differences in meaning (e.g., “exit” and “exiting” have different meanings in the context of programming)  A small dictionary can capture the best words easily.  The use of tokens and stemming are examples of procedures that help build smaller dictionaries, improving the manageability of learning and its accuracy  A document can thus be converted to a spreadsheet.
  • 93.  Each column is a feature; each row is a document.  The model of data for predictive text mining is a spreadsheet populated by ones and zeros: the cells represent the presence or absence of dictionary words in a document collection. For higher accuracy, additional transformations can be applied:  Word pairs and collocations  Frequency  Tf-idf  Word pairs and collocations: these increase the size of the dictionary and can improve the performance of prediction  Instead of 0s and 1s in the cells, the frequency of the word can be used (if the word “the” occurs 10 times, the count of “the” is used)  Counts can give better results than binary values in the cells; where they lead to the same solution as the binary data model, the additional frequency information can yield a simpler solution.
  • 94.  Frequencies are helpful in prediction but add complexity to solutions.  A compromise that works is a three-value system: 0/1/2  Word did not occur – 0  Word occurred once – 1  Word occurred 2 or more times – 2  This captures much of the added value of frequency without adding much complexity to the model.  Another variant is zeroing values below a threshold, requiring a token to have a minimum frequency before being considered of any use.  This reduces the complexity of the spreadsheet used in data mining algorithms  Other methods to reduce complexity are chi-square, mutual information, odds ratio, etc.  The next step beyond counting frequency is to modify the count by the perceived importance of the word.
  • 95.  Tf-idf: computes weightings or scores for words  It produces positive values that capture more than the absence or presence of a word.  In Eq. (a), the weight assigned to word j is its term frequency modified by a scale factor for the importance of the word. The scale factor is the inverse document frequency (Eq. (b)), which simply checks the number of documents containing the word, df(j), and reverses the scaling.
tf-idf(j) = tf(j) * idf(j)    (a)
idf(j) = log(N / df(j))    (b)
 When a word appears in many documents, its scale factor is lowered, perhaps to zero; if a word is unique or appears in only a few documents, the scale factor zooms upward and the word appears more important.  Alternatives to this tf-idf formulation exist, but the motivation is the same. The result is a positive score that replaces the simple frequency or binary (T/F) entry in the cell of the spreadsheet.
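Equations (a) and (b) implemented directly, assuming `docs` is a list of token lists and using the natural logarithm:

```python
from math import log

def tfidf(docs):
    """Return one {word: tf-idf score} vector per document."""
    N = len(docs)
    df = {}                                  # df(j): documents containing j
    for doc in docs:
        for w in set(doc):
            df[w] = df.get(w, 0) + 1
    vectors = []
    for doc in docs:
        tf = {w: doc.count(w) for w in set(doc)}   # tf(j): raw count
        vectors.append({w: tf[w] * log(N / df[w]) for w in tf})
    return vectors

docs = [["the", "cat", "sat"], ["the", "dog", "sat"], ["the", "the", "mat"]]
print(tfidf(docs)[0])   # 'the' scores 0; rarer words score higher
```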
  • 96.  Another variant is to weight tokens from different parts of the document differently.  Which data transformation method is best? There is no universal answer.  The best predictive accuracy depends on matching these methods to the data; the best variation for one problem may not be the best for another, so test them all.  We have described the data as populating a spreadsheet in which most cells are 0:  each document contains only a small subset of the dictionary words.  In text classification, a corpus may contain many thousands of words, while each individual document has few unique tokens.  Most of the spreadsheet row for a document is therefore 0. Rather than store all the zeros, it is better to represent the spreadsheet as a set of sparse vectors (each row is a list of pairs; one element of the pair is the column and the other the corresponding nonzero value). By not storing the zeros, we greatly reduce memory use.
  • 97. Multiword Features  So far, features have been associated with single words (tokens delimited by white space)  This simple scenario can be extended to include pairs of words, e.g., “bon” and “vivant”: instead of separating them, we could make the pair a single feature, “bon vivant”.  Why stop at pairs? Why not consider multiword features?  Unlike word pairs, the words of a multiword feature need not be consecutive.  E.g., with “Don Smith” as a feature, we can ignore his middle name, Leroy, which may appear in some references to the person.  Similarly, we may have to accommodate references to a noun that involve a number of adjectives, with the desired adjective not adjacent to the noun; e.g., we want to accept the phrase “broken and dirty vase” as an instance of “broken vase”
  • 98.  A multiword feature is x words occurring within a maximum window of size y (y >= x, naturally)  How are such features extracted from text? With specialized methods:  if we use frequency-based methods, we look for combinations of words that are relatively frequent  A straightforward implementation considers every combination of x words within a window of size y  The value of a potential multiword feature is measured by the correlation between its words; measures based on mutual information or the likelihood ratio are used  A straightforward implementation of the multiword-generation algorithm consumes a lot of memory  Multiword features are not found often in a document collection, but they are highly predictive  One such association measure for a candidate multiword T is
AM(T) = size(T) * log10(freq(T)) * freq(T) / Σ wordᵢ∈T freq(wordᵢ)
  • 100. Labels for Right Answers:  For prediction, an extra column is added to the spreadsheet  The last column contains the labels and looks no different from the others:  each entry is a 0 or 1 indicating whether the right answer is true or false  In the sparse vector format, the label is appended to each vector separately as either a one (positive class) or a zero (negative class)  Feature Selection by Attribute Ranking:  In addition to frequency-based approaches, feature selection can be done in a number of ways.  One can select a set of features for each category to form a local dictionary for that category,  independently ranking the feature attributes according to their predictive ability for the category under consideration.  The predictive ability of an attribute can be measured by a quantity expressing how well it is correlated with the label  Let n be the number of documents, let xj ∈ {0, 1} denote the presence or absence of attribute j in a document, and let y denote the label of the document in the last column
  • 101.  A commonly used ranking score is the information gain criterion:
IG(j) = L − L(j)
 The quantity L(j) is the number of bits required to encode the label and the attribute j minus the number of bits required to encode the attribute:
L = Σ c∈{0,1} Pr(y = c) · log2(1 / Pr(y = c))
L(j) = Σ v∈{0,1} Pr(xj = v) · Σ c∈{0,1} Pr(y = c | xj = v) · log2(1 / Pr(y = c | xj = v))
 The quantities needed to compute L(j) can be easily estimated using the smoothed estimators
Pr(xj = v) = (freq(xj = v) + 1) / (n + 2)
Pr(y = c | xj = v) = (freq(xj = v, label = c) + 1) / (freq(xj = v) + 2)
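The information-gain score with the smoothed estimators above, sketched for a single attribute column `xs` (0/1 values) and label column `ys`:

```python
from math import log2

def info_gain(xs, ys):
    """IG(j) = L - L(j) for one 0/1 attribute column against 0/1 labels."""
    n = len(xs)
    def H(probs):                            # number of bits to encode
        return sum(p * log2(1 / p) for p in probs if p > 0)
    L = H([ys.count(c) / n for c in (0, 1)])
    Lj = 0.0
    for v in (0, 1):
        px = (xs.count(v) + 1) / (n + 2)                     # Pr(xj = v)
        labels_v = [y for x, y in zip(xs, ys) if x == v]
        cond = [(labels_v.count(c) + 1) / (xs.count(v) + 2)  # Pr(y=c | xj=v)
                for c in (0, 1)]
        Lj += px * H(cond)
    return L - Lj

print(info_gain([1, 1, 0, 0], [1, 1, 0, 0]))   # a highly predictive attribute
print(info_gain([1, 0, 1, 0], [1, 1, 0, 0]))   # an uninformative attribute: 0.0
```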
  • 102. Sentence Boundary Determination  If the XML markup for a corpus does not mark sentence boundaries, it is necessary to mark them ourselves  In particular, it is necessary to determine when a period is part of a token and when it is not  More sophisticated linguistic parsing algorithms often require a complete sentence as input  Extraction algorithms operate on text a sentence at a time  These algorithms are optimal only when sentences are identified clearly  Sentence boundary determination is the problem of deciding which instances of a period followed by white space are sentence delimiters and which are not (the characters ? and ! can be assumed to always end a sentence), so it is a classification problem  With an accurate algorithm and some adjustments, good performance can be achieved
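A naive sketch of this classification view: each period followed by white space is treated as a boundary unless it follows a known abbreviation or a single capital letter; the abbreviation list is an illustrative assumption:

```python
import re

ABBREV = {"dr", "mr", "mrs", "st", "etc"}   # illustrative, not exhaustive

def split_sentences(text):
    sentences, start = [], 0
    for m in re.finditer(r"[.?!]\s+", text):
        words = text[start:m.start()].split()
        prev = words[-1].lower() if words else ""
        # classify this period: part of a token, or a sentence delimiter?
        if m.group()[0] == "." and (prev in ABBREV or len(prev) == 1):
            continue
        sentences.append(text[start:m.end()].strip())
        start = m.end()
    if start < len(text):
        sentences.append(text[start:].strip())
    return sentences

print(split_sentences("Dr. Smith arrived. He sat down."))
# ['Dr. Smith arrived.', 'He sat down.']
```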
  • 104. Thank you!