SlideShare una empresa de Scribd logo
1 de 4
Descargar para leer sin conexión
International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5



           A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques
                                    1
                                        R. Sagayam, 2 S.Srinivasan, 3 S. Roshni
                                              1, 2, 3
                                                  Department Of Computer Science
                                                 Govt. Arts College (Autonomous)
                                                              Salem-7
                                                        2
                                                          Periyar University
                                                           Salem-636011


Abstract
Text mining is the analysis of data contained in natural              And relevance of the documents, or find patterns and trends
language text. Text mining works by transposing words and             across multiple documents. Thus, text mining has become
phrases in unstructured data into numerical values which              an increasingly popular and essential theme in data mining.
can then be linked with structured data in a database and
analyzed with traditional data mining techniques. Text                 2. Information Retrieval
information retrieval and data mining has thus become                  Information retrieval is a field that has been developing in
increasingly important. In this paper a survey of Text                parallel with database systems for many years. Unlike the
mining have been presented.                                           field of database systems, which has focused on query and
                                                                      transaction processing of structured data, information
Keyword: Information Retrieval, Information Extraction                retrieval is concerned with the organization and retrieval of
and Indexing Techniques                                               information from a large number of text-based documents.
                                                                      Since information retrieval and database systems each
                                                                      handle different kinds of data, some database system
1. Introduction
                                                                      problems are usually not present in information retrieval
Text mining is a variation on a field called data mining, that
                                                                      systems, such as concurrency control, recovery, transaction
tries to find interesting patterns from large databases. Text
                                                                      management, and update. Also, some common information
databases are rapidly growing due to the increasing amount
                                                                      retrieval problems are usually not encountered in traditional
of information available in electronic form, such as
                                                                      database systems, such as unstructured documents,
electronic publications, various kinds of electronic
                                                                      approximate search based on keywords, and the notion of
documents, e-mail, and the World Wide Web . Nowadays
                                                                      relevance. Due to the abundance of text information,
most of the information in government, industry, business,
                                                                      information retrieval has found many applications. There
and other institutions are stored electronically, in the form
                                                                      exist many information retrieval systems, such as on-line
of text databases. Data stored in most text databases are
                                                                      library catalog systems, on-line document management
semi structured data in that they are neither completely
                                                                      systems, and the more recently developed Web search
unstructured nor completely structured. For example, a
                                                                      engines. A typical information retrieval problem is to locate
document may contain a few structured fields, such as title,
                                                                      relevant documents in a document collection based on a
authors, publication date, category, and so on, but also
                                                                      user’s query, which is often some keywords describing an
contain some largely unstructured text components, such as
                                                                      information need, although it could also be an example
abstract and contents. There have been a great deal of
                                                                      relevant document. In such a search problem, a user takes
studies on the modeling and implementation of semi
                                                                      the initiative to “pull” the relevant information out from the
structured data in recent database research. Moreover,
                                                                      collection; this is most appropriate when a user has some ad
information retrieval techniques, such as text indexing
                                                                      hoc information need, such as finding information to buy a
methods, have been developed to handle unstructured
                                                                      used car. When a user has a long-term information need , a
documents. Traditional information retrieval techniques
                                                                      retrieval system may also take the initiative to “push” any
become inadequate for the increasingly vast amounts of text
                                                                      newly arrived information item to a user if the item is
data. Typically, only a small fraction of the many available
                                                                      judged as being relevant to the user’s information need.
documents will be relevant to a given individual user.
                                                                      Such an information access process is called information
Without knowing what could be in the documents, it
                                                                      filtering, and the corresponding systems are often called
                                                                      filtering systems or recommender systems. From a technical
Is difficult to formulate effective queries for analyzing and
                                                                      viewpoint, however, search and filtering share many
extracting useful information from the data. Users need
                                                                      common techniques. Below we briefly discuss the major
tools to compare different documents, rank the importance
                                                                      techniques in information retrieval with a focus on search
                                                                      techniques.


Issn 2250-3005(online)                                           September| 2012                                  Page 1443
International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5



2.1. MEASURES FOR TEXT RETRIEVAL                                         measure. term table consists of a set of term records, each
The set of documents relevant to a query be denoted as                   containing two fields: term id and posting list, where
{Relevant}, and the set of documents retrieved be denoted as             posting list specifies a list of document identifiers in which
{Retrieved}. The set of documents that are both relevant                 the term appears. With such organization, it is easy to
and retrieved is denoted as {Relevant }∩{Retrieved}, as                  answer queries like “Find all of the documents associated
shown in the Venn diagram of Figure 1. There are two basic               with a given set of terms,” or “Find all of the terms
measures for assessing the quality of text retrieval.                    associated with a given set of documents.” For example, to
Precision: This is the percentage of retrieved documents that            find all of the documents associated with a set of terms, we
are in fact relevant to the query . It is formally defined as            can first find a list of document identifiers in term table for
     Precision=‫{׀‬Relevant}∩{Retrieved}‫{׀/׀‬Retrieved}‫׀‬                    each term, and then intersect them to obtain the set of
Recall: This is the percentage of documents that are relevant            relevant documents. Inverted indices are widely used in
to the query and were, in fact, retrieved. It is formally                industry. They are easy to implement. The posting lists
defined as                                                               could be rather long, making the storage requirement quite
      Recall=‫{׀‬Relevant}∩{Retrieved}‫{׀/׀‬Relevant} ‫׀‬                      large. They are easy to implement, but are not satisfactory at
An information retrieval system often needs to trade off                 handling synonymy (where two very different words can
recall for precision or vice versa. One commonly used trade-             have the same meaning) and polysemy (where an individual
off is the F-score, which is defined as the harmonic mean of             word may have many meanings). A signature file is a file
recall and precision                                                     that stores a signature record for each document in the
F-score =recall×precision/(recall+ precision)/2                          database. Each signature has a fixed size of b bits
The harmonic mean discourages a system that sacrifices one               representing terms. A simple encoding scheme goes as
measure for another too drastically                                      follows. Each bit of a document signature is initialized to 0.
                                                                         A bit is set to 1 if the term it represents appears in the
                                                                         document. A signature S1 matches another signature S2 if
                                                                         each bit that is set in signature S2 is also set in S1. Since
                                                                         there are usually more terms than available bits, multiple
                                                                         terms may be mapped into the same bit. Such multiple-to
                                                                         one mappings make the search expensive because a
                                                                         document that matches the signature of a query does not
                                                                         necessarily contain the set of keywords of the query. The
                                                                         document has to be retrieved, parsed, stemmed, and
                                                                         checked. Improvements can be made by first performing
                                                                         frequency analysis, stemming, and by filtering stop words,
 Figure 1. Relationship between the set of relevant                      and then using a hashing technique and superimposed
documents and the set of retrieved documents.                            coding technique to encode the list of terms into bit
Precision, recall, and F-score are the basic measures of a               representation. Nevertheless, the problem of multiple-to-one
retrieved set of documents. These three measures are not                 mappings still exists, which is the major disadvantage of
directly useful for comparing two ranked lists of documents              this approach. Readers can refer to [2] for more detailed
because they are not sensitive to the internal ranking of the            discussion of indexing techniques, including how to
documents in a retrieved set. In order to measure the quality            compress an index.
of a ranked list of documents, it is common to compute an
average of precisions at all the ranks where a new relevant
                                                                         4. Query Processing Techniques
document is returned. It is also common to plot a graph of
                                                                         Once an inverted index is created for a document collection,
precisions at many different levels of recall; a higher curve
                                                                         a retrieval system can answer a keyword query quickly by
represents a better-quality information retrieval system. For
                                                                         looking up which documents contain the query keywords.
more details about these measures, readers may consult an
                                                                         Specifically, we will maintain a score accumulator for each
information retrieval textbook, such as [3].
                                                                         document and update these accumulators as we go through
                                                                         each query term. For each query term, we will fetch all of
3. Text Indexing Techniques                                              the documents that match the term and increase their scores.
There are several popular text retrieval indexing techniques,            More sophisticated query processing techniques are
including inverted indices and signature files. An inverted              discussed in [2].When examples of relevant documents are
index is an index structure that maintains two hash indexed              available, the system can learn from such examples to
or B+-tree indexed tables: document table and term table,                improve retrieval performance. This is called relevance
where document table consists of a set of document records,              feedback and has proven to be effective in improving
each containing two fields: doc id and posting list, where               retrieval performance. When we do not have such relevant
posting list is a list of terms (or pointers to terms) that occur        examples, a system can assume the top few retrieved
in the document, sorted according to some relevance                      documents in some initial retrieval results to be relevant and
Issn 2250-3005(online)                                              September| 2012                                   Page 1444
International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5



extract more related keywords to expand a query. Such                  version of the Porter stemming algorithm with a few
feedback is called pseudo-feedback or blind feedback and is            changes towards the end in which we have omitted some
essentially a process of mining useful keywords from the               cases.
top retrieved documents. Pseudo-feedback also often leads              e.g. apply – applied – applies
to improved retrieval performance. One major limitation of             print – printing – prints – printed
many existing retrieval methods is that they are based on              In both the cases, all words of the first example will be
exact keyword matching. However, due to the complexity of              treated as ‘apply’ and all words of the second example will
natural languages, keyword based retrieval can encounter               be treated as ‘print’.
two major difficulties. The first is the synonymy problem:
two words with identical or similar meanings may have very             5.2 Domain dictionary
different surface forms. For example, a user’s query may               In order to develop tools of this sort, it is essential to
use the word “automobile,” but a relevant document may                 provide them with a knowledge base. A collective set of all
use “vehicle” instead of “automobile.” The second is the               the ‘feature terms’ is the Domain dictionary (our source was
polysemy problem: the same keyword, such as mining, or                 www.webopedia.com). The structure of the Domain
Java, may mean different things in different contexts.                 dictionary which we implemented consisted of three levels
                                                                       in the hierarchy. Namely, Parent Category, Sub-category
5. Information Extraction                                              and word. Parent categories define the main category under
The general purpose of Knowledge Discovery is to “extract              which any sub-category or word falls. A parent category
implicit, previously unknown, and potentially useful                   will be unique on its level in the hierarchy. Sub-categories
information from data”. Information Extraction IE mainly               will belong to a certain parent category and each sub-
deals with identifying words or feature terms from within a            category will consist of all the words associated with it. As
textual file. Feature terms can be defined as those which are          an example, consider the following
directly related to the domain.
                                                                       Table 1.Structure of the Domain Dictionary




                                                                       Table 1 is an example that shows how we identify words
                                                                       which belong to the Parent Category ‘Hardware’ and Sub-
Figure 2. A layered model of the Text Mining Application
                                                                       category ‘Input Devices’.
These are the terms which can be recognized by the tool. In
order to perform this function optimally, we had to look into
                                                                       5.3 Exclusion List
few more aspects which are as follows:
                                                                       A lot of words in a text file can be treated as unwanted
                                                                       noise. To eliminate these, we devised a separate file which
5.1 Stemming
                                                                       includes all such words. These include words such as the, a,
Stemming refers to identifying the root of a certain word.
                                                                       an, if, off, on etc.
There are basically two types of stemming techniques, one
is inflectional and other is derivational. Derivational
stemming can create a new word from an existing word,                  6. Research Directions
sometimes by simply changing grammatical category (for                 With abundant literature published in research into frequent
example, changing a noun to a verb) [Wikipedia]. The type              pattern mining, one may wonder whether we have solved
of stemming we were able to implement is called                        most of the critical problems related to frequent pattern
Inflectional Stemming. A commonly used algorithms is the               mining so that the solutions provided are good enough for
‘Porter’s Algorithm’ for stemming. When the normalization              most of the data mining tasks. However, based on our view,
is confined to regularizing grammatical variants such as               there are still several critical research problems that need to
singular/plural or past/present, it is referred to inflectional        be solved before frequent pattern mining can become a
stemming [10].To minimize the effects of inflection and                cornerstone approach in data mining applications. First, the
morphological variations of Words (stemming), our                      most focused and extensively studied topic in frequent
approach has pre-processed each word using a provided                  pattern mining is perhaps scalable mining methods. The set

Issn 2250-3005(online)                                            September| 2012                                   Page 1445
International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5



of frequent patterns derived by most of the current pattern            textual sources (e.g.: the World Wide Web). Discovering
mining methods is too huge for effective usage. There are              such hidden know ledge is an essential requirement for
proposals on reduction of such a huge set, including closed            many corporations, due to its wide spectrum of applications.
patterns, maximal patterns, approximate patterns, condensed            In this short survey, the notion of text mining have been
pattern bases, representative patterns, clustered patterns, and        introduced and several techniques available have been
discriminative frequent patterns, as introduced in the                 presented. Due to its novelty, there are many potential
previous sections. However, it is still not clear what kind of         research areas in the field of Text Mining, which includes
patterns will give us satisfactory pattern sets in both                finding better intermediate forms for representing the
compactness and representative quality for a particular                outputs of information extraction, an XML document may
application, and whether we can mine such patterns directly            be a good choice. Mining texts in different languages is a
and efficiently. Much research is still needed to substantially        major problem, since text mining tools should be able to
reduce the size of derived pattern sets and enhance the                work with many languages and multilingual documents.
quality of retained patterns. Frequent pattern mining: current         Integrating a domain knowledge base with a text mining
status and future directions. Second, although we have                 engine would boost its efficiency, especially in the
efficient methods for mining precise and complete set of               information retrieval and information extraction phases.
frequent patterns, approximate frequent patterns could be
the best choice in many applications. For example, in the              References
analysis of DNA or protein sequences, one would like to                [1] M.       A.      Hearst.     What      is    text    mining?
find long sequence patterns that approximately match the                    http://www.sims.berkeley.edu/˜hearst/text-mining.html, Oct.
sequences in biological entities, similar to BLAST. Much                    2003.
research is still needed to make such mining more effective            [2] Ian H. Witten, Alistair Moffat, and Timothy C. Bell.
than the currently available tools in bioinformatics. Third, to             Managing Gigabytes: Compressing and Indexing Documents
                                                                            and Images. Morgan Kaufmann Publishers, San Francisco,
make frequent pattern mining an essential task in data
                                                                            CA, 1999.
mining, much research is needed to further develop pattern-            [3] R.Baeza-Yates and B.Ribeiro-Neto, Modern Information
based mining methods. For example, classification is an                     Retrieval addition-Wesley,Boston,1999
essential task in data mining. Fourth, we need mechanisms              [4] Jochen Dorre, Peter Gersti, Roland Seiffert (1999), Text
for deep understanding and interpretation of patterns, e.g.,                Mining: Finding Nuggets in Mountains of Textual Data,
semantic annotation for frequent patterns, and contextual                   ACM KDD 1999 in San Diego, CA, USA.
analysis of frequent patterns. The main research work on               [5] Ah-Hwee Tan, (1999), Text Mining: The state of art and the
pattern analysis has been focused on pattern composition                    challenges, In proceedings, PAKDD'99 Workshop on
(e.g., the set of items in item-set patterns) and frequency.                Knowledge discovery from Advanced Databases (KDAD'99),
                                                                            Beijing, pp. 71-76, April 1999.
The semantic of a frequent pattern includes deeper
                                                                       [6] Danial Tkach, (1998), Text Mining Technology Turning
information: what is the meaning of the pattern; what are the               Information Into Knowledge A white paper from IBM .
synonym patterns; and what are the typical transactions that           [7]. Helena Ahonen, Oskari Heinonen, Mika Klemettinen, A.
this pattern resides? In many cases, frequent patterns are                  Inkeri Verkamo, (1997), Applying Data Mining Techniques
mined from certain data sets which also contain structural                  in Text Analysis, Report C-1997-23, Department of Computer
information. Finally, applications often raise new research                 Science, University of Helsinki, 1997
issues and bring deep insight on the strength and weakness             [8]. Mark Dixon,(1997), An Overview of Document Mining
of an existing solution. This is also true for frequent pattern             Technology,
mining. On one side, it is important to go to the core part of              http://www.geocities.com/ResearchTriangle/Thinktank/1997/
                                                                            mark/writings/dixm 97_dm.ps
pattern mining algorithms, and analyze the theoretical                 [9] Juan José García Adeva and Rafael Calvo, “Mining Text with
properties of different solutions. On the other side, although              Pimiento”, University of Sydney
we only cover a small subset of applications in this article,          [10] Text Mining Application Programming by Manu Konchadi.
frequent pattern mining has claimed a broad spectrum of                     Published by Charles River Media. ISBN: 1584504609
applications and demonstrated its strength at solving some             [11] Automated Concept Extraction From Plain Text. Boris,
problems. Much work is needed to explore new applications                   GARAGe Michigan State University, East Lansing MI 48824.
of frequent pattern mining. For example, bioinformatics has            [12] Dr. Antoine Spinakis; Asanoula Chatzimakri, “Comparison
raised a lot of challenging problems, and we believe                        Study of Text Mining Tools”.
frequent pattern mining may contribute a good deal to it               [13] Raymond J. Mooney and Razvan Bunescu, “Mining
                                                                            Knowledge from Text Using Information Extraction”,
with further research efforts.                                              University of Texas at Austin
                                                                       [14] Tova, Milo, “Active views for Electronic commerce”. Paper
7. Conclusion                                                               number: EUROPE64.
Most of knowledge hidden in electronic media of an
organization is encapsulated in documents. Acquiring this
knowledge implies effective querying of the documents as
well as the combination of information pieces from different
Issn 2250-3005(online)                                            September| 2012                                    Page 1446

Más contenido relacionado

La actualidad más candente

Information Storage and Retrieval system (ISRS)
Information Storage and Retrieval system (ISRS)Information Storage and Retrieval system (ISRS)
Information Storage and Retrieval system (ISRS)Sumit Kumar Gupta
 
Correlation Analysis of Forensic Metadata for Digital Evidence
Correlation Analysis of Forensic Metadata for Digital EvidenceCorrelation Analysis of Forensic Metadata for Digital Evidence
Correlation Analysis of Forensic Metadata for Digital EvidenceIJCSIS Research Publications
 
Information retrieval
Information retrievalInformation retrieval
Information retrievalhplap
 
Databases and Ontologies: Where do we go from here?
Databases and Ontologies:  Where do we go from here?Databases and Ontologies:  Where do we go from here?
Databases and Ontologies: Where do we go from here?Maryann Martone
 
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...cseij
 
Big data from small data: A deep survey of the neuroscience landscape data via
Big data from small data:  A deep survey of the neuroscience landscape data viaBig data from small data:  A deep survey of the neuroscience landscape data via
Big data from small data: A deep survey of the neuroscience landscape data viaNeuroscience Information Framework
 
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of ...
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of ...Enhanced Performance of Search Engine with Multitype Feature Co-Selection of ...
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of ...IJASCSE
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibEl Habib NFAOUI
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemMaryann Martone
 
Indexing techniques for advanced database systems
Indexing techniques for advanced database systemsIndexing techniques for advanced database systems
Indexing techniques for advanced database systemsMohammed Muqeet
 
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGYINTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGYcscpconf
 
Research Data Management and Librarians
Research Data Management and LibrariansResearch Data Management and Librarians
Research Data Management and LibrariansJohann van Wyk
 
Context Driven Technique for Document Classification
Context Driven Technique for Document ClassificationContext Driven Technique for Document Classification
Context Driven Technique for Document ClassificationIDES Editor
 
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...University of Bari (Italy)
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebIOSR Journals
 
Metadata: Towards Machine-Enabled Intelligence
Metadata: Towards Machine-Enabled IntelligenceMetadata: Towards Machine-Enabled Intelligence
Metadata: Towards Machine-Enabled Intelligencedannyijwest
 
Urm concept for sharing information inside of communities
Urm concept for sharing information inside of communitiesUrm concept for sharing information inside of communities
Urm concept for sharing information inside of communitiesKarel Charvat
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spacesMounia Lalmas-Roelleke
 

La actualidad más candente (20)

Information Storage and Retrieval system (ISRS)
Information Storage and Retrieval system (ISRS)Information Storage and Retrieval system (ISRS)
Information Storage and Retrieval system (ISRS)
 
Correlation Analysis of Forensic Metadata for Digital Evidence
Correlation Analysis of Forensic Metadata for Digital EvidenceCorrelation Analysis of Forensic Metadata for Digital Evidence
Correlation Analysis of Forensic Metadata for Digital Evidence
 
Information retrieval
Information retrievalInformation retrieval
Information retrieval
 
Databases and Ontologies: Where do we go from here?
Databases and Ontologies:  Where do we go from here?Databases and Ontologies:  Where do we go from here?
Databases and Ontologies: Where do we go from here?
 
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
AN ELABORATION OF TEXT CATEGORIZATION AND AUTOMATIC TEXT CLASSIFICATION THROU...
 
Big data from small data: A deep survey of the neuroscience landscape data via
Big data from small data:  A deep survey of the neuroscience landscape data viaBig data from small data:  A deep survey of the neuroscience landscape data via
Big data from small data: A deep survey of the neuroscience landscape data via
 
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of ...
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of ...Enhanced Performance of Search Engine with Multitype Feature Co-Selection of ...
Enhanced Performance of Search Engine with Multitype Feature Co-Selection of ...
 
Web_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_HabibWeb_Mining_Overview_Nfaoui_El_Habib
Web_Mining_Overview_Nfaoui_El_Habib
 
Data-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystemData-knowledge transition zones within the biomedical research ecosystem
Data-knowledge transition zones within the biomedical research ecosystem
 
Text Indexing and Retrieval
Text Indexing and RetrievalText Indexing and Retrieval
Text Indexing and Retrieval
 
Indexing techniques for advanced database systems
Indexing techniques for advanced database systemsIndexing techniques for advanced database systems
Indexing techniques for advanced database systems
 
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGYINTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
INTELLIGENT INFORMATION RETRIEVAL WITHIN DIGITAL LIBRARY USING DOMAIN ONTOLOGY
 
Research Data Management and Librarians
Research Data Management and LibrariansResearch Data Management and Librarians
Research Data Management and Librarians
 
DICTIONARY
DICTIONARYDICTIONARY
DICTIONARY
 
Context Driven Technique for Document Classification
Context Driven Technique for Document ClassificationContext Driven Technique for Document Classification
Context Driven Technique for Document Classification
 
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
A Domain Based Approach to Information Retrieval in Digital Libraries - Rotel...
 
Context Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic WebContext Based Web Indexing For Semantic Web
Context Based Web Indexing For Semantic Web
 
Metadata: Towards Machine-Enabled Intelligence
Metadata: Towards Machine-Enabled IntelligenceMetadata: Towards Machine-Enabled Intelligence
Metadata: Towards Machine-Enabled Intelligence
 
Urm concept for sharing information inside of communities
Urm concept for sharing information inside of communitiesUrm concept for sharing information inside of communities
Urm concept for sharing information inside of communities
 
Aggregation for searching complex information spaces
Aggregation for searching complex information spacesAggregation for searching complex information spaces
Aggregation for searching complex information spaces
 

Similar a IJCER (www.ijceronline.com) International Journal of computational Engineering research

An Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataAn Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataMelinda Watson
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrievalcaptainmactavish1996
 
Text databases and information retrieval
Text databases and information retrievalText databases and information retrieval
Text databases and information retrievalunyil96
 
accelerating-data-driven
accelerating-data-drivenaccelerating-data-driven
accelerating-data-drivenJoshua Chudy
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search enginesunyil96
 
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...ijceronline
 
"Analysis of Different Text Classification Algorithms: An Assessment "
"Analysis of Different Text Classification Algorithms: An Assessment ""Analysis of Different Text Classification Algorithms: An Assessment "
"Analysis of Different Text Classification Algorithms: An Assessment "ijtsrd
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachAIRCC Publishing Corporation
 
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHINFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHijcsit
 
An Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using ClusteringAn Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using ClusteringKelly Lipiec
 
Data mining a tool for knowledge management
Data mining a tool for knowledge managementData mining a tool for knowledge management
Data mining a tool for knowledge managementKishor Satpathy
 
Introduction to Information Retrieval.pptx
Introduction to Information Retrieval.pptxIntroduction to Information Retrieval.pptx
Introduction to Information Retrieval.pptxPhilimonTsige1
 
Identification of User Aware Rare Sequential Pattern in Document Stream An Ov...
Identification of User Aware Rare Sequential Pattern in Document Stream An Ov...Identification of User Aware Rare Sequential Pattern in Document Stream An Ov...
Identification of User Aware Rare Sequential Pattern in Document Stream An Ov...ijtsrd
 

Similar a IJCER (www.ijceronline.com) International Journal of computational Engineering research (20)

Text mining
Text miningText mining
Text mining
 
An Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured DataAn Improved Annotation Based Summary Generation For Unstructured Data
An Improved Annotation Based Summary Generation For Unstructured Data
 
Chapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and RetrievalChapter 1: Introduction to Information Storage and Retrieval
Chapter 1: Introduction to Information Storage and Retrieval
 
CS8080 IRT UNIT I NOTES.pdf
CS8080 IRT UNIT I  NOTES.pdfCS8080 IRT UNIT I  NOTES.pdf
CS8080 IRT UNIT I NOTES.pdf
 
CS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdfCS8080_IRT__UNIT_I_NOTES.pdf
CS8080_IRT__UNIT_I_NOTES.pdf
 
Text databases and information retrieval
Text databases and information retrievalText databases and information retrieval
Text databases and information retrieval
 
accelerating-data-driven
accelerating-data-drivenaccelerating-data-driven
accelerating-data-driven
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
Inverted files for text search engines
Inverted files for text search enginesInverted files for text search engines
Inverted files for text search engines
 
B0410206010
B0410206010B0410206010
B0410206010
 
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
 
"Analysis of Different Text Classification Algorithms: An Assessment "
"Analysis of Different Text Classification Algorithms: An Assessment ""Analysis of Different Text Classification Algorithms: An Assessment "
"Analysis of Different Text Classification Algorithms: An Assessment "
 
Information Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis ApproachInformation Retrieval based on Cluster Analysis Approach
Information Retrieval based on Cluster Analysis Approach
 
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACHINFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
INFORMATION RETRIEVAL BASED ON CLUSTER ANALYSIS APPROACH
 
An Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using ClusteringAn Improved Mining Of Biomedical Data From Web Documents Using Clustering
An Improved Mining Of Biomedical Data From Web Documents Using Clustering
 
Data mining a tool for knowledge management
Data mining a tool for knowledge managementData mining a tool for knowledge management
Data mining a tool for knowledge management
 
Web Mining
Web MiningWeb Mining
Web Mining
 
Text mining
Text miningText mining
Text mining
 
Introduction to Information Retrieval.pptx
Introduction to Information Retrieval.pptxIntroduction to Information Retrieval.pptx
Introduction to Information Retrieval.pptx
 
Identification of User Aware Rare Sequential Pattern in Document Stream An Ov...
Identification of User Aware Rare Sequential Pattern in Document Stream An Ov...Identification of User Aware Rare Sequential Pattern in Document Stream An Ov...
Identification of User Aware Rare Sequential Pattern in Document Stream An Ov...
 

Último

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024The Digital Insurer
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsMaria Levchenko
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024The Digital Insurer
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slidevu2urc
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century educationjfdjdjcjdnsjd
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 

Último (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024Axa Assurance Maroc - Insurer Innovation Award 2024
Axa Assurance Maroc - Insurer Innovation Award 2024
 
Handwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed textsHandwritten Text Recognition for manuscripts and early printed texts
Handwritten Text Recognition for manuscripts and early printed texts
 
Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024Partners Life - Insurer Innovation Award 2024
Partners Life - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
presentation ICT roal in 21st century education
presentation ICT roal in 21st century educationpresentation ICT roal in 21st century education
presentation ICT roal in 21st century education
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 

IJCER (www.ijceronline.com) International Journal of computational Engineering research

  • 1. International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5 A Survey of Text Mining: Retrieval, Extraction and Indexing Techniques 1 R. Sagayam, 2 S.Srinivasan, 3 S. Roshni 1, 2, 3 Department Of Computer Science Govt. Arts College (Autonomous) Salem-7 2 Periyar University Salem-636011 Abstract Text mining is the analysis of data contained in natural And relevance of the documents, or find patterns and trends language text. Text mining works by transposing words and across multiple documents. Thus, text mining has become phrases in unstructured data into numerical values which an increasingly popular and essential theme in data mining. can then be linked with structured data in a database and analyzed with traditional data mining techniques. Text 2. Information Retrieval information retrieval and data mining has thus become Information retrieval is a field that has been developing in increasingly important. In this paper a survey of Text parallel with database systems for many years. Unlike the mining have been presented. field of database systems, which has focused on query and transaction processing of structured data, information Keyword: Information Retrieval, Information Extraction retrieval is concerned with the organization and retrieval of and Indexing Techniques information from a large number of text-based documents. Since information retrieval and database systems each handle different kinds of data, some database system 1. Introduction problems are usually not present in information retrieval Text mining is a variation on a field called data mining, that systems, such as concurrency control, recovery, transaction tries to find interesting patterns from large databases. Text management, and update. Also, some common information databases are rapidly growing due to the increasing amount retrieval problems are usually not encountered in traditional of information available in electronic form, such as database systems, such as unstructured documents, electronic publications, various kinds of electronic approximate search based on keywords, and the notion of documents, e-mail, and the World Wide Web . Nowadays relevance. Due to the abundance of text information, most of the information in government, industry, business, information retrieval has found many applications. There and other institutions are stored electronically, in the form exist many information retrieval systems, such as on-line of text databases. Data stored in most text databases are library catalog systems, on-line document management semi structured data in that they are neither completely systems, and the more recently developed Web search unstructured nor completely structured. For example, a engines. A typical information retrieval problem is to locate document may contain a few structured fields, such as title, relevant documents in a document collection based on a authors, publication date, category, and so on, but also user’s query, which is often some keywords describing an contain some largely unstructured text components, such as information need, although it could also be an example abstract and contents. There have been a great deal of relevant document. In such a search problem, a user takes studies on the modeling and implementation of semi the initiative to “pull” the relevant information out from the structured data in recent database research. Moreover, collection; this is most appropriate when a user has some ad information retrieval techniques, such as text indexing hoc information need, such as finding information to buy a methods, have been developed to handle unstructured used car. When a user has a long-term information need , a documents. Traditional information retrieval techniques retrieval system may also take the initiative to “push” any become inadequate for the increasingly vast amounts of text newly arrived information item to a user if the item is data. Typically, only a small fraction of the many available judged as being relevant to the user’s information need. documents will be relevant to a given individual user. Such an information access process is called information Without knowing what could be in the documents, it filtering, and the corresponding systems are often called filtering systems or recommender systems. From a technical Is difficult to formulate effective queries for analyzing and viewpoint, however, search and filtering share many extracting useful information from the data. Users need common techniques. Below we briefly discuss the major tools to compare different documents, rank the importance techniques in information retrieval with a focus on search techniques. Issn 2250-3005(online) September| 2012 Page 1443
  • 2. International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5 2.1. MEASURES FOR TEXT RETRIEVAL measure. term table consists of a set of term records, each The set of documents relevant to a query be denoted as containing two fields: term id and posting list, where {Relevant}, and the set of documents retrieved be denoted as posting list specifies a list of document identifiers in which {Retrieved}. The set of documents that are both relevant the term appears. With such organization, it is easy to and retrieved is denoted as {Relevant }∩{Retrieved}, as answer queries like “Find all of the documents associated shown in the Venn diagram of Figure 1. There are two basic with a given set of terms,” or “Find all of the terms measures for assessing the quality of text retrieval. associated with a given set of documents.” For example, to Precision: This is the percentage of retrieved documents that find all of the documents associated with a set of terms, we are in fact relevant to the query . It is formally defined as can first find a list of document identifiers in term table for Precision=‫{׀‬Relevant}∩{Retrieved}‫{׀/׀‬Retrieved}‫׀‬ each term, and then intersect them to obtain the set of Recall: This is the percentage of documents that are relevant relevant documents. Inverted indices are widely used in to the query and were, in fact, retrieved. It is formally industry. They are easy to implement. The posting lists defined as could be rather long, making the storage requirement quite Recall=‫{׀‬Relevant}∩{Retrieved}‫{׀/׀‬Relevant} ‫׀‬ large. They are easy to implement, but are not satisfactory at An information retrieval system often needs to trade off handling synonymy (where two very different words can recall for precision or vice versa. One commonly used trade- have the same meaning) and polysemy (where an individual off is the F-score, which is defined as the harmonic mean of word may have many meanings). A signature file is a file recall and precision that stores a signature record for each document in the F-score =recall×precision/(recall+ precision)/2 database. Each signature has a fixed size of b bits The harmonic mean discourages a system that sacrifices one representing terms. A simple encoding scheme goes as measure for another too drastically follows. Each bit of a document signature is initialized to 0. A bit is set to 1 if the term it represents appears in the document. A signature S1 matches another signature S2 if each bit that is set in signature S2 is also set in S1. Since there are usually more terms than available bits, multiple terms may be mapped into the same bit. Such multiple-to one mappings make the search expensive because a document that matches the signature of a query does not necessarily contain the set of keywords of the query. The document has to be retrieved, parsed, stemmed, and checked. Improvements can be made by first performing frequency analysis, stemming, and by filtering stop words, Figure 1. Relationship between the set of relevant and then using a hashing technique and superimposed documents and the set of retrieved documents. coding technique to encode the list of terms into bit Precision, recall, and F-score are the basic measures of a representation. Nevertheless, the problem of multiple-to-one retrieved set of documents. These three measures are not mappings still exists, which is the major disadvantage of directly useful for comparing two ranked lists of documents this approach. Readers can refer to [2] for more detailed because they are not sensitive to the internal ranking of the discussion of indexing techniques, including how to documents in a retrieved set. In order to measure the quality compress an index. of a ranked list of documents, it is common to compute an average of precisions at all the ranks where a new relevant 4. Query Processing Techniques document is returned. It is also common to plot a graph of Once an inverted index is created for a document collection, precisions at many different levels of recall; a higher curve a retrieval system can answer a keyword query quickly by represents a better-quality information retrieval system. For looking up which documents contain the query keywords. more details about these measures, readers may consult an Specifically, we will maintain a score accumulator for each information retrieval textbook, such as [3]. document and update these accumulators as we go through each query term. For each query term, we will fetch all of 3. Text Indexing Techniques the documents that match the term and increase their scores. There are several popular text retrieval indexing techniques, More sophisticated query processing techniques are including inverted indices and signature files. An inverted discussed in [2].When examples of relevant documents are index is an index structure that maintains two hash indexed available, the system can learn from such examples to or B+-tree indexed tables: document table and term table, improve retrieval performance. This is called relevance where document table consists of a set of document records, feedback and has proven to be effective in improving each containing two fields: doc id and posting list, where retrieval performance. When we do not have such relevant posting list is a list of terms (or pointers to terms) that occur examples, a system can assume the top few retrieved in the document, sorted according to some relevance documents in some initial retrieval results to be relevant and Issn 2250-3005(online) September| 2012 Page 1444
  • 3. International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5 extract more related keywords to expand a query. Such version of the Porter stemming algorithm with a few feedback is called pseudo-feedback or blind feedback and is changes towards the end in which we have omitted some essentially a process of mining useful keywords from the cases. top retrieved documents. Pseudo-feedback also often leads e.g. apply – applied – applies to improved retrieval performance. One major limitation of print – printing – prints – printed many existing retrieval methods is that they are based on In both the cases, all words of the first example will be exact keyword matching. However, due to the complexity of treated as ‘apply’ and all words of the second example will natural languages, keyword based retrieval can encounter be treated as ‘print’. two major difficulties. The first is the synonymy problem: two words with identical or similar meanings may have very 5.2 Domain dictionary different surface forms. For example, a user’s query may In order to develop tools of this sort, it is essential to use the word “automobile,” but a relevant document may provide them with a knowledge base. A collective set of all use “vehicle” instead of “automobile.” The second is the the ‘feature terms’ is the Domain dictionary (our source was polysemy problem: the same keyword, such as mining, or www.webopedia.com). The structure of the Domain Java, may mean different things in different contexts. dictionary which we implemented consisted of three levels in the hierarchy. Namely, Parent Category, Sub-category 5. Information Extraction and word. Parent categories define the main category under The general purpose of Knowledge Discovery is to “extract which any sub-category or word falls. A parent category implicit, previously unknown, and potentially useful will be unique on its level in the hierarchy. Sub-categories information from data”. Information Extraction IE mainly will belong to a certain parent category and each sub- deals with identifying words or feature terms from within a category will consist of all the words associated with it. As textual file. Feature terms can be defined as those which are an example, consider the following directly related to the domain. Table 1.Structure of the Domain Dictionary Table 1 is an example that shows how we identify words which belong to the Parent Category ‘Hardware’ and Sub- Figure 2. A layered model of the Text Mining Application category ‘Input Devices’. These are the terms which can be recognized by the tool. In order to perform this function optimally, we had to look into 5.3 Exclusion List few more aspects which are as follows: A lot of words in a text file can be treated as unwanted noise. To eliminate these, we devised a separate file which 5.1 Stemming includes all such words. These include words such as the, a, Stemming refers to identifying the root of a certain word. an, if, off, on etc. There are basically two types of stemming techniques, one is inflectional and other is derivational. Derivational stemming can create a new word from an existing word, 6. Research Directions sometimes by simply changing grammatical category (for With abundant literature published in research into frequent example, changing a noun to a verb) [Wikipedia]. The type pattern mining, one may wonder whether we have solved of stemming we were able to implement is called most of the critical problems related to frequent pattern Inflectional Stemming. A commonly used algorithms is the mining so that the solutions provided are good enough for ‘Porter’s Algorithm’ for stemming. When the normalization most of the data mining tasks. However, based on our view, is confined to regularizing grammatical variants such as there are still several critical research problems that need to singular/plural or past/present, it is referred to inflectional be solved before frequent pattern mining can become a stemming [10].To minimize the effects of inflection and cornerstone approach in data mining applications. First, the morphological variations of Words (stemming), our most focused and extensively studied topic in frequent approach has pre-processed each word using a provided pattern mining is perhaps scalable mining methods. The set Issn 2250-3005(online) September| 2012 Page 1445
  • 4. International Journal Of Computational Engineering Research (ijceronline.com) Vol. 2 Issue. 5 of frequent patterns derived by most of the current pattern textual sources (e.g.: the World Wide Web). Discovering mining methods is too huge for effective usage. There are such hidden know ledge is an essential requirement for proposals on reduction of such a huge set, including closed many corporations, due to its wide spectrum of applications. patterns, maximal patterns, approximate patterns, condensed In this short survey, the notion of text mining have been pattern bases, representative patterns, clustered patterns, and introduced and several techniques available have been discriminative frequent patterns, as introduced in the presented. Due to its novelty, there are many potential previous sections. However, it is still not clear what kind of research areas in the field of Text Mining, which includes patterns will give us satisfactory pattern sets in both finding better intermediate forms for representing the compactness and representative quality for a particular outputs of information extraction, an XML document may application, and whether we can mine such patterns directly be a good choice. Mining texts in different languages is a and efficiently. Much research is still needed to substantially major problem, since text mining tools should be able to reduce the size of derived pattern sets and enhance the work with many languages and multilingual documents. quality of retained patterns. Frequent pattern mining: current Integrating a domain knowledge base with a text mining status and future directions. Second, although we have engine would boost its efficiency, especially in the efficient methods for mining precise and complete set of information retrieval and information extraction phases. frequent patterns, approximate frequent patterns could be the best choice in many applications. For example, in the References analysis of DNA or protein sequences, one would like to [1] M. A. Hearst. What is text mining? find long sequence patterns that approximately match the http://www.sims.berkeley.edu/˜hearst/text-mining.html, Oct. sequences in biological entities, similar to BLAST. Much 2003. research is still needed to make such mining more effective [2] Ian H. Witten, Alistair Moffat, and Timothy C. Bell. than the currently available tools in bioinformatics. Third, to Managing Gigabytes: Compressing and Indexing Documents and Images. Morgan Kaufmann Publishers, San Francisco, make frequent pattern mining an essential task in data CA, 1999. mining, much research is needed to further develop pattern- [3] R.Baeza-Yates and B.Ribeiro-Neto, Modern Information based mining methods. For example, classification is an Retrieval addition-Wesley,Boston,1999 essential task in data mining. Fourth, we need mechanisms [4] Jochen Dorre, Peter Gersti, Roland Seiffert (1999), Text for deep understanding and interpretation of patterns, e.g., Mining: Finding Nuggets in Mountains of Textual Data, semantic annotation for frequent patterns, and contextual ACM KDD 1999 in San Diego, CA, USA. analysis of frequent patterns. The main research work on [5] Ah-Hwee Tan, (1999), Text Mining: The state of art and the pattern analysis has been focused on pattern composition challenges, In proceedings, PAKDD'99 Workshop on (e.g., the set of items in item-set patterns) and frequency. Knowledge discovery from Advanced Databases (KDAD'99), Beijing, pp. 71-76, April 1999. The semantic of a frequent pattern includes deeper [6] Danial Tkach, (1998), Text Mining Technology Turning information: what is the meaning of the pattern; what are the Information Into Knowledge A white paper from IBM . synonym patterns; and what are the typical transactions that [7]. Helena Ahonen, Oskari Heinonen, Mika Klemettinen, A. this pattern resides? In many cases, frequent patterns are Inkeri Verkamo, (1997), Applying Data Mining Techniques mined from certain data sets which also contain structural in Text Analysis, Report C-1997-23, Department of Computer information. Finally, applications often raise new research Science, University of Helsinki, 1997 issues and bring deep insight on the strength and weakness [8]. Mark Dixon,(1997), An Overview of Document Mining of an existing solution. This is also true for frequent pattern Technology, mining. On one side, it is important to go to the core part of http://www.geocities.com/ResearchTriangle/Thinktank/1997/ mark/writings/dixm 97_dm.ps pattern mining algorithms, and analyze the theoretical [9] Juan José García Adeva and Rafael Calvo, “Mining Text with properties of different solutions. On the other side, although Pimiento”, University of Sydney we only cover a small subset of applications in this article, [10] Text Mining Application Programming by Manu Konchadi. frequent pattern mining has claimed a broad spectrum of Published by Charles River Media. ISBN: 1584504609 applications and demonstrated its strength at solving some [11] Automated Concept Extraction From Plain Text. Boris, problems. Much work is needed to explore new applications GARAGe Michigan State University, East Lansing MI 48824. of frequent pattern mining. For example, bioinformatics has [12] Dr. Antoine Spinakis; Asanoula Chatzimakri, “Comparison raised a lot of challenging problems, and we believe Study of Text Mining Tools”. frequent pattern mining may contribute a good deal to it [13] Raymond J. Mooney and Razvan Bunescu, “Mining Knowledge from Text Using Information Extraction”, with further research efforts. University of Texas at Austin [14] Tova, Milo, “Active views for Electronic commerce”. Paper 7. Conclusion number: EUROPE64. Most of knowledge hidden in electronic media of an organization is encapsulated in documents. Acquiring this knowledge implies effective querying of the documents as well as the combination of information pieces from different Issn 2250-3005(online) September| 2012 Page 1446