SlideShare una empresa de Scribd logo
1 de 6
Descargar para leer sin conexión
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 24
USING DATA MINING METHODS KNOWLEDGE DISCOVERY FOR
TEXT MINING
D.M.Kulkarni1
, S.K.Shirgave2
1, 2
IT Department Dkte’s TEI Ichalkaranji (Maharashtra), India
Abstract
Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and
update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining
methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have
often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many
experiments do not support this hypothesis. Proposed work presents an innovative and effective pattern discovery technique which
includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered
patterns for finding relevant and interesting information.
Keywords:-Text mining, text classification, pattern mining, pattern evolving, information filtering.
----------------------------------------------------------------------***--------------------------------------------------------------------
1. INTRODUCTION
Knowledge discovery is a process of nontrivial extraction of
information from large databases, information that is unknown
and useful for user. Data mining is the first and essential step
in the process of knowledge discovery. Various data mining
methods are available such as association rule mining,
sequential pattern mining, closed pattern mining and frequent
item set mining to perform different knowledge discovery
tasks. Effective use of discovered patterns is a research issue.
Proposed system is implemented using different data mining
methods for knowledge discovery.
Text mining is a method of retrieving useful information from
a large amount of digital text data. It is therefore crucial that a
good text mining model should retrieve the information
according to the user requirement. Traditional Information
Retrieval (IR) has same objective of automatically retrieving
as many relevant documents as possible, whilst filtering out
irrelevant documents at the same time. However, IR-based
systems do not provide users with what they really need.
Many text mining methods have been developed for retrieving
useful information for users. Most text mining methods use
keyword based approaches, whereas others choose the phrase
method to construct a text representation for a set of
documents. The phrase-based approaches perform better than
the keyword-based as it is considered that more information is
carried by a phrase than by a single term. New studies have
been focusing on finding better text representatives from a
textual data collection. One solution is to use data mining
methods, such as sequential pattern mining for Text mining.
Such data mining-based methods use concepts of closed
sequential patterns and non-closed patterns to decrease the
feature set size by removing noisy patterns. New method,
Pattern Discovery Model for the purpose of effectively using
discovered patterns is proposed. Proposed system is evaluated
the measures of patterns using pattern deploying process as
well as finds patterns from the negative training examples
using pattern Evolving process.
2. LITERATURE SURVEY
The main process of text-related machine learning tasks is
document indexing, which maps a document into a feature
space representing the semantics of the document. Many types
of text representations have been proposed in the past. A well
known method for text mining is the bag of words that uses
keywords (terms) as elements in the vector of the feature.
Weighting scheme tf*idf (TFIDF) is used for text
representation [1]. In addition to TFIDF, entropy weighting
scheme is used, which improves performance by an average of
30 percent. The problem of bag of word approach is selection
of a limited number of features amongst a huge set of words or
terms in order to increase the system’s efficiency and avoid
over fitting. In order to reduce the number of features, many
dimensionality reduction approaches are available, such as
Information Gain, Mutual Information, Chi-Square, Odds
ratio. Some research works have used phrases rather than
individual words.Using single words in keyword-based
representation pose the semantic ambiguity problem. To solve
this problem, the use of multiple words (i.e. phrases) as
features therefore is proposed [2, 3]. In general, phrases carry
more specific content than single words. For instance,
“engine” and “search engine”. Another reason for using
phrase-based representation is that the simple keyword-based
representation of content is usually inadequate because single
words are rarely specific enough for accurate discrimination
[4]. To identify groups of words that create meaningful
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 25
phrases is a better method, especially for phrases indicating
important concepts in the text. The traditional term clustering
methods are used to provide significantly improved text
representation
3. PROPOSED SYSTEM
Proposed system highlights on a software upgrade-based
approach to increase efficiency of pattern discovery using
different data mining Algorithms with pattern deploying and
pattern Evolving method. System use data set from RCV1
(Reuters Corpus Volume 1) which contains training set and
test set. Documents in both the set are either positive or
negative.”Positive “means document is relevant to the topic
otherwise “negative”. Documents are in XML format. System
uses sequential closed frequent patterns as well as non
sequential closed pattern for finding concept from data set.
Modules in the proposed system are as follows
• Data transform
• Pattern discovery
• Pattern deploy
• Pattern Evolving
• Evaluation
3.1 Data Transform
Data transform is preprocessing of document. It consists of
removal of irrelevant data from documents.
Fig 1: Data Transform
Data transform module consists of following steps as shown in
figure 1.
• Remove Stop Words
In this step non informative words removed from document,
• Stemming
Stemming process to reduce derived word to its root form
using Porter algorithm
• Feature Selection
This step assigns value to each term using a weighting scheme
and removes low frequency terms.
3.2 Pattern Discovery
This module discovers patterns from preprocessed documents.
Sequential closed frequent patterns as well as non sequential
closed patterns are extracted using algorithms Sequential
closed pattern mining and non-sequential closed pattern
mining.
3.3 Pattern Deploy
Processing of discovered patterns is carried in this module.
These discovered patterns are organized in specific format
using pattern deploying method (PDM) and pattern deploying
with support (PDS) Algorithms. PDM organizes discovered
patterns in <term, frequency> form by combining all
discovered pattern vectors. PDS gives same output as PDM
with support of each term.
3.4 Pattern Evolving
This module removed the non meaningful patterns using
deploy pattern Evolving (DPE) and Individual Pattern
Evolving (IPE) Algorithms. This module finds patterns from
negative document. This module identifies and removes
ambiguous patterns i.e. patterns which are present in positive
as well as negative documents.
3.5 Evaluation of Pattern Generated after Evolving
Method
This module is regarding evaluation. This compares output of
system without deploy and Evolve method with system using
deploy and Evolve method. For checking performance of
proposed system this module calculates precision, recall and
f1-measures.
4. EXPERIMENTAL DATASET
Several standard benchmark datasets such as Reuter’s
corpora, OHSUMED[5] and 20 Newsgroups [6] collection are
available for experimental purposes. The most frequently used
one is the Reuters dataset. Several versions of Reuter’s
corpora have been released. Reuters-21578 dataset is
considered for experiment because it contains a reasonable
number of documents with relevance judgment both in the
training and test examples. Table 1 shows summary of Reuters
data collections
Table 1: Summary of Reuter’s data collections
Version #docs #trainings #tests #topics Release
year
Reuters-
22173
22173 14,704 6,746 135 1993
Retuers-
21578
21578 9,603 3,299 90 1996
RCV1 806,791 5,127 37,556 100 2000
IJRET: International Journal of Research in Engineering and Technology
__________________________________________________________________________________________
Volume: 03 Issue: 01 | Jan-2014, Available @
Retuers-21578 includes 21,578 documents and 90 topics and
released in 1996. Documents from data set are formatted using
a structured XML scheme.
5. IMPLEMENTATION
System starts from one of the RCV1 topics and retrieves the
related information with regard to the training set. Each
document is preprocessed with word stemming and
words removal and transformed into a set of transactions
based on its nature of document structure. System selects one
of the pattern discovery algorithms to extract patterns.
Discovered patterns are deployed using one of the deploying
methods, and then pattern evolving process is used to refine
patterns. A concept representing the context of the topic is
eventually generated. Each document in the test set is assessed
by the Test module and the relevant documents to topic are
shown as an output. The result of data transform is a set of
transactions and each transaction consists of a vector of
stemmed terms. The next step is to find frequent patterns using
pattern discovery algorithms. Data mining approaches
including association rule mining, frequent sequential pattern
mining, closed pattern mining, and item set mining are
adopted and applied to the text mining tasks. By splitting each
document into several transactions (i.e., par
mining methods are used to find frequent patterns from the
textual documents. Two pattern discovery methods which
have been implemented in the experiments are briefed as
follows:
- SCPM: Finding sequential closed patterns using the
algorithm SPMining. (Figure.2)
-NSPM: Finding non-sequential patterns using the algorithm.
Fig 2: Algorithm for Sequential closed Pattern mining
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319
__________________________________________________________________________________________
, Available @ http://www.ijret.org
21578 includes 21,578 documents and 90 topics and
from data set are formatted using
starts from one of the RCV1 topics and retrieves the
to the training set. Each
document is preprocessed with word stemming and stops
removal and transformed into a set of transactions
based on its nature of document structure. System selects one
of the pattern discovery algorithms to extract patterns.
Discovered patterns are deployed using one of the deploying
pattern evolving process is used to refine
patterns. A concept representing the context of the topic is
eventually generated. Each document in the test set is assessed
t module and the relevant documents to topic are
The result of data transform is a set of
transactions and each transaction consists of a vector of
stemmed terms. The next step is to find frequent patterns using
thms. Data mining approaches
including association rule mining, frequent sequential pattern
mining, closed pattern mining, and item set mining are
adopted and applied to the text mining tasks. By splitting each
document into several transactions (i.e., paragraphs), these
mining methods are used to find frequent patterns from the
textual documents. Two pattern discovery methods which
have been implemented in the experiments are briefed as
Finding sequential closed patterns using the
sequential patterns using the algorithm.
Sequential closed Pattern mining
PDM uses sequential or non sequential closed pattern and
gives document vectors as output.PDS used sequential or
sequential closed patterns and gives document vectors with
support as output.
Fig 3: Algorithm for Pattern deploy Method
The PDM is used with the attempt to address the problem
caused by the inappropriate evaluation of patterns, discovered
using data mining methods. Data mining methods, such as
SPM and NSPM, utilize discovered patterns directly without
any modification and thus encounter the problem of lacking
frequency on specific patterns. Processing of discovered
patterns is carried in this module. These discovered patterns
are organized in a specific format. There are two choices for
pattern deploying. One is using patter
(PDM Figure 3) and other pattern deploying with support
algorithms. PDM organizes discovered patterns in <term,
support> form by combining all discovered pattern vectors.
PDS gives same output as PDM with support of each term.
After patterns deploy, the concept of topic is built by merging
patterns of all documents. While the concept is established, the
relevance estimation of each document in the test dataset is
conducted using the document evaluating function as shown
eq. (1) in Test process. Documents in the dataset are ranked
according to their relevance scores .After testing; system’s
performance is evaluated using the metrics such as precision,
recall and f1-measures.
algorithm Figure 4) is used by this
vectors from PDM or PDS and removes the non meaningful
patterns. Output of DPE is normalized document vectors. Here
patterns from negative documents are identified and noisy
(ambiguous) patterns i.e. Patterns which are present in
Positive as well as negative documents, are filtered. Result of
pattern evolving is patterns in <term, support> form by
combining all deployed pattern vectors. The concept of topic
is built by merging patterns of all documents While the
concept is established, the relevance estimation of each
document in the test dataset is conducted using the document
evaluating function as shown eq.(1) in Test process.
Documents in the dataset are ranked according to their
relevance scores. After testing system’s perf
eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
26
PDM uses sequential or non sequential closed pattern and
gives document vectors as output.PDS used sequential or non
sequential closed patterns and gives document vectors with
Algorithm for Pattern deploy Method
The PDM is used with the attempt to address the problem
caused by the inappropriate evaluation of patterns, discovered
using data mining methods. Data mining methods, such as
SPM and NSPM, utilize discovered patterns directly without
hus encounter the problem of lacking
frequency on specific patterns. Processing of discovered
patterns is carried in this module. These discovered patterns
are organized in a specific format. There are two choices for
pattern deploying. One is using pattern deploying method
) and other pattern deploying with support
algorithms. PDM organizes discovered patterns in <term,
support> form by combining all discovered pattern vectors.
PDS gives same output as PDM with support of each term.
erns deploy, the concept of topic is built by merging
patterns of all documents. While the concept is established, the
relevance estimation of each document in the test dataset is
conducted using the document evaluating function as shown
ocess. Documents in the dataset are ranked
according to their relevance scores .After testing; system’s
performance is evaluated using the metrics such as precision,
. Deploy pattern Evolving (DPE
4) is used by this module. It takes document
vectors from PDM or PDS and removes the non meaningful
patterns. Output of DPE is normalized document vectors. Here
patterns from negative documents are identified and noisy
(ambiguous) patterns i.e. Patterns which are present in
Positive as well as negative documents, are filtered. Result of
pattern evolving is patterns in <term, support> form by
combining all deployed pattern vectors. The concept of topic
is built by merging patterns of all documents While the
lished, the relevance estimation of each
document in the test dataset is conducted using the document
evaluating function as shown eq.(1) in Test process.
Documents in the dataset are ranked according to their
relevance scores. After testing system’s performance is
IJRET: International Journal of Research in Engineering and Technology
__________________________________________________________________________________________
Volume: 03 Issue: 01 | Jan-2014, Available @
evaluated using the metrics such as precision, recall and f1
measures.
For checking performance of the system this module
calculates precision, recall and f1-measure metrics. The
precision is the fraction of retrieved documents that
relevant to the topic, and the recall is the fraction of relevant
documents that have been retrieved. For a binary classification
problem the judgment can be defined within a contingency
table as depicted in Table 2.
Table 2: Contingency table
human judgment
System
judgment
Yes
Yes TP
No FN
According to the definition in Table (2), the precision and
recall are calculated using following equations. Where TP
(True positives) is the number of documents the system
correctly identifies as positives; FP (False Positives) is the
number of documents the system falsely identifies as
positives; FN (False Negatives) is the number of relevant
documents the system fails to identify. The precision of first K
returned documents top-K is calculated. The precision of top
K returned documents refers to the relative value of relevant
documents in the first K returned documents.
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319
__________________________________________________________________________________________
, Available @ http://www.ijret.org
evaluated using the metrics such as precision, recall and f1-
For checking performance of the system this module
measure metrics. The
precision is the fraction of retrieved documents that are
relevant to the topic, and the recall is the fraction of relevant
documents that have been retrieved. For a binary classification
problem the judgment can be defined within a contingency
Contingency table
No
FP
TN
2), the precision and
recall are calculated using following equations. Where TP
(True positives) is the number of documents the system
positives; FP (False Positives) is the
number of documents the system falsely identifies as
positives; FN (False Negatives) is the number of relevant
documents the system fails to identify. The precision of first K
he precision of top-
K returned documents refers to the relative value of relevant
documents in the first K returned documents.
Fig 4: Algorithm for pattern
The value of K use in the experiments is 20.Another metric
F1-measure is calculated using following equation. To
evaluate performance of system precision, recall and f1
measure of three processes is compared.
6. RESULTS OBTAINED
Following Table 3 shows pattern obtained after pattern
discovery method for topic ship
obtained after pattern Evolving method for topic ship.
Table 3:-1-term, 2
1-term
2-term
3-term
eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
27
Algorithm for pattern evolving method
The value of K use in the experiments is 20.Another metric
calculated using following equation. To
evaluate performance of system precision, recall and f1-
measure of three processes is compared.
. RESULTS OBTAINED
shows pattern obtained after pattern
discovery method for topic ship. Table 4 shows patterns
obtained after pattern Evolving method for topic ship.
term, 2-term, 3-term patterns
Patterns
pct
offer
river
ship
strike
seamen
sector
redund
offer pct
offer river
strike pai
pai seamen
strike seamen
pct river offer
pai strike seamen
ship sourc capac
industri ship japan
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 28
Table 4:-Patterns after pattern Evolving
Document no Term Support
23 River 1.0
43 Ship 0.25
54 Seamen 0.25
62 Missil 0.25
63 Yard 1.0
81 Industry 0.25
62 Sourc 0.25
98 Shell 1.0
98 Strike 1.0
128 Protect 1.0
7. SYSTEM EVALUATION
After Test process, the system is evaluated using three
performance metrics precision (eq.2), recall (eq.3) and F1-
measure (eq.4).Using these metrics, different methods are
compared to check the most appropriate method which gives
maximum relevant documents to topic. Reuters-21578 dataset
consist of 90 topics. Comparison of precision, recall and f1-
measure for topic ship by considering top-k documents with
highest relevance score is as shown in figure 5. It can be
observed that if value of k in top-k is chosen as 20 then system
gives maximum values for precision, recall and f1-measure.
Fig 5:-Precision, recall, f1-measure for topic ship
Maximum number of documents relevant to topic ship are
obtained at k=20. To evaluate performance of system,
performance of different methods is compared using precision,
recall and f1-measure. Comparison of precision and recall for
methods Pattern discovery, Pattern deploy and Pattern
Evolving (for topic ship is as shown in figure 6.
Fig 6:-SCPM, PDM and DPE for topic ship
It can be observed that maximum values for precision, recall
and f1-measure are obtained from DPE. DPE gives maximum
number of documents from test set that are relevant to topic
ship. DPE gives better results than sequential closed pattern
mining (SCPM) method. So, it can be concluded that DPE and
PDM are superior to SCPM.
CONCLUSIONS
Many text mining methods have been proposed; main
drawback of these methods is terms with higher tf*idf are not
useful for finding concept of topic. Many data mining methods
have been proposed for fulfilling various knowledge discovery
tasks. These methods include association rule mining, frequent
item set mining, sequential pattern mining, maximum pattern
mining and closed pattern mining. All frequent patterns are
not useful. Hence, use of these patterns derived from data
mining methods leads to ineffective performance. Knowledge
discovery with PDM and DPE have been proposed to
overcome the above mentioned drawbacks. An effective
knowledge discovery system is implemented using three main
steps: (1) discovering useful patterns by sequential closed
pattern mining algorithm and non sequential closed pattern
mining algorithm. (2) Using discovered patterns by pattern
deploying using PDS and PDM. (3) Adjusting user profiles by
applying pattern evolution using DPE. Numerous experiments
within an information filtering domain are conducted. Reuters-
21578 dataset is used by the system. Three performance
metrics precision, recall and f1-measures are used to evaluate
performance of system. The results show that the implemented
system using pattern deploy and pattern Evolving is superior
to SCPM data mining-based method.
IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308
__________________________________________________________________________________________
Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 29
REFERENCES
[1]. L. P. Jing, H. K. Huang, and H. B. Shi. “Improved feature
selection approach tf*idf in text mining.” International
Conference on Machine Learning and Cybernetics, 2002.
[2]. H. Ahonen-Myka. Discovery of frequent word sequences
in text. In Proceedings of Pattern Detection and Discovery,
pages 180–189, 2002.34, 61
[3]. E. Brill and P. Resnik. “A rule-based approach to
prepositional phrase attachment disambiguation”. In
Proceedings of the 15th International Conference on
Computational Linguistics (COLING), pages 1198–1204,
1994. 34
[4]. H. Ahonen, O. Heinonen, M Klemettinen, and A. I.
Verkamo. “Mining in the phrasal frontier”. In Proceedings of
PKDD, pages 343–350, 1997. 34, 39, 62
[5]. W. Hersh, C. Buckley, T. Leone, and D. Hickman.
“Ohsumed: an interactive retrieval evaluation and new large
text collection for research”. In Proceedings of the 17th ACM
International Conference on Research and Development in
Information Retrieval, pages 192–201, 1994.
[6]. K. Lang. News weeder: Learning to filter net news. In
Proceedings of ICML, pages 331–339, 1995.

Más contenido relacionado

La actualidad más candente

IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...IRJET Journal
 
A Survey on Classification of Feature Selection Strategies
A Survey on Classification of Feature Selection StrategiesA Survey on Classification of Feature Selection Strategies
A Survey on Classification of Feature Selection Strategiesijtsrd
 
Bj32809815
Bj32809815Bj32809815
Bj32809815IJMER
 
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETSURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETEditor IJMTER
 
An experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestAn experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestIJDKP
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...IJMER
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...ijsc
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...theijes
 
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTAN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTIJDKP
 
Survey on Text Classification
Survey on Text ClassificationSurvey on Text Classification
Survey on Text ClassificationAM Publications
 
A Novel approach for Document Clustering using Concept Extraction
A Novel approach for Document Clustering using Concept ExtractionA Novel approach for Document Clustering using Concept Extraction
A Novel approach for Document Clustering using Concept ExtractionAM Publications
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTIJERA Editor
 
Correlation of artificial neural network classification and nfrs attribute fi...
Correlation of artificial neural network classification and nfrs attribute fi...Correlation of artificial neural network classification and nfrs attribute fi...
Correlation of artificial neural network classification and nfrs attribute fi...eSAT Journals
 
Neighborhood search methods with moth optimization algorithm as a wrapper met...
Neighborhood search methods with moth optimization algorithm as a wrapper met...Neighborhood search methods with moth optimization algorithm as a wrapper met...
Neighborhood search methods with moth optimization algorithm as a wrapper met...IJECEIAES
 
An Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text ClassificationAn Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text ClassificationIJCSIS Research Publications
 

La actualidad más candente (20)

IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
IRJET- Classifying Twitter Data in Multiple Classes based on Sentiment Class ...
 
A Survey on Classification of Feature Selection Strategies
A Survey on Classification of Feature Selection StrategiesA Survey on Classification of Feature Selection Strategies
A Survey on Classification of Feature Selection Strategies
 
Bj32809815
Bj32809815Bj32809815
Bj32809815
 
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASETSURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
SURVEY ON CLASSIFICATION ALGORITHMS USING BIG DATASET
 
T0 numtq0n tk=
T0 numtq0n tk=T0 numtq0n tk=
T0 numtq0n tk=
 
An experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forestAn experimental study on hypothyroid using rotation forest
An experimental study on hypothyroid using rotation forest
 
A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...A Threshold fuzzy entropy based feature selection method applied in various b...
A Threshold fuzzy entropy based feature selection method applied in various b...
 
Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...Document Classification Using Expectation Maximization with Semi Supervised L...
Document Classification Using Expectation Maximization with Semi Supervised L...
 
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
A Survey and Comparative Study of Filter and Wrapper Feature Selection Techni...
 
N045038690
N045038690N045038690
N045038690
 
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXTAN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
AN EFFICIENT FEATURE SELECTION MODEL FOR IGBO TEXT
 
Survey on Text Classification
Survey on Text ClassificationSurvey on Text Classification
Survey on Text Classification
 
D1803012022
D1803012022D1803012022
D1803012022
 
A Novel approach for Document Clustering using Concept Extraction
A Novel approach for Document Clustering using Concept ExtractionA Novel approach for Document Clustering using Concept Extraction
A Novel approach for Document Clustering using Concept Extraction
 
50120140501018
5012014050101850120140501018
50120140501018
 
M43016571
M43016571M43016571
M43016571
 
Analysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOTAnalysis on different Data mining Techniques and algorithms used in IOT
Analysis on different Data mining Techniques and algorithms used in IOT
 
Correlation of artificial neural network classification and nfrs attribute fi...
Correlation of artificial neural network classification and nfrs attribute fi...Correlation of artificial neural network classification and nfrs attribute fi...
Correlation of artificial neural network classification and nfrs attribute fi...
 
Neighborhood search methods with moth optimization algorithm as a wrapper met...
Neighborhood search methods with moth optimization algorithm as a wrapper met...Neighborhood search methods with moth optimization algorithm as a wrapper met...
Neighborhood search methods with moth optimization algorithm as a wrapper met...
 
An Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text ClassificationAn Evaluation of Preprocessing Techniques for Text Classification
An Evaluation of Preprocessing Techniques for Text Classification
 

Destacado

The feel of silk bagian kedua
The feel of silk bagian keduaThe feel of silk bagian kedua
The feel of silk bagian keduaBerti Subagijo
 
Cone fender
Cone fender Cone fender
Cone fender beast0004
 
Performance measures for internet server by using m mm queueing model
Performance measures for internet server by using m mm queueing modelPerformance measures for internet server by using m mm queueing model
Performance measures for internet server by using m mm queueing modeleSAT Publishing House
 
State space vector based advanced direct power control of matrix converter as...
State space vector based advanced direct power control of matrix converter as...State space vector based advanced direct power control of matrix converter as...
State space vector based advanced direct power control of matrix converter as...eSAT Publishing House
 
+++Notiuni de-electrotehnica-si-de-matematica (1)
+++Notiuni de-electrotehnica-si-de-matematica (1)+++Notiuni de-electrotehnica-si-de-matematica (1)
+++Notiuni de-electrotehnica-si-de-matematica (1)Ille Marius Dan
 
Efficient data compression of ecg signal using discrete wavelet transform
Efficient data compression of ecg signal using discrete wavelet transformEfficient data compression of ecg signal using discrete wavelet transform
Efficient data compression of ecg signal using discrete wavelet transformeSAT Publishing House
 
Tex Mobile advertising Europe
Tex Mobile advertising Europe Tex Mobile advertising Europe
Tex Mobile advertising Europe Reema Thomas
 
Geweke Hospitality Brochure
Geweke Hospitality BrochureGeweke Hospitality Brochure
Geweke Hospitality BrochureAlicia Russell
 
explicacion del proyecto "todos para uno"
explicacion del proyecto "todos para uno"explicacion del proyecto "todos para uno"
explicacion del proyecto "todos para uno"Maira Alejandra Herrera
 
Patent Data Standard Public
Patent Data Standard PublicPatent Data Standard Public
Patent Data Standard PublicLihua Gao
 
Fun with my small note animal tissue new
Fun with my small note animal tissue newFun with my small note animal tissue new
Fun with my small note animal tissue newCECE SUTIA
 
Classification on multi label dataset using rule mining technique
Classification on multi label dataset using rule mining techniqueClassification on multi label dataset using rule mining technique
Classification on multi label dataset using rule mining techniqueeSAT Publishing House
 
Thought for the day
Thought for the dayThought for the day
Thought for the daycalvin_ams
 
Andrewgoodwinsmusicvideoprinciples
AndrewgoodwinsmusicvideoprinciplesAndrewgoodwinsmusicvideoprinciples
Andrewgoodwinsmusicvideoprinciplesmiloomahoney
 
A fast atmospheric correction algorithm applied to landsat tm images
A fast atmospheric correction algorithm applied to landsat tm imagesA fast atmospheric correction algorithm applied to landsat tm images
A fast atmospheric correction algorithm applied to landsat tm imagesweslyj
 

Destacado (20)

The feel of silk bagian kedua
The feel of silk bagian keduaThe feel of silk bagian kedua
The feel of silk bagian kedua
 
Cone fender
Cone fender Cone fender
Cone fender
 
Performance measures for internet server by using m mm queueing model
Performance measures for internet server by using m mm queueing modelPerformance measures for internet server by using m mm queueing model
Performance measures for internet server by using m mm queueing model
 
Split block domination in graphs
Split block domination in graphsSplit block domination in graphs
Split block domination in graphs
 
State space vector based advanced direct power control of matrix converter as...
State space vector based advanced direct power control of matrix converter as...State space vector based advanced direct power control of matrix converter as...
State space vector based advanced direct power control of matrix converter as...
 
+++Notiuni de-electrotehnica-si-de-matematica (1)
+++Notiuni de-electrotehnica-si-de-matematica (1)+++Notiuni de-electrotehnica-si-de-matematica (1)
+++Notiuni de-electrotehnica-si-de-matematica (1)
 
Efficient data compression of ecg signal using discrete wavelet transform
Efficient data compression of ecg signal using discrete wavelet transformEfficient data compression of ecg signal using discrete wavelet transform
Efficient data compression of ecg signal using discrete wavelet transform
 
Tex Mobile advertising Europe
Tex Mobile advertising Europe Tex Mobile advertising Europe
Tex Mobile advertising Europe
 
Geweke Hospitality Brochure
Geweke Hospitality BrochureGeweke Hospitality Brochure
Geweke Hospitality Brochure
 
History of life part 2
History of life part 2History of life part 2
History of life part 2
 
explicacion del proyecto "todos para uno"
explicacion del proyecto "todos para uno"explicacion del proyecto "todos para uno"
explicacion del proyecto "todos para uno"
 
Patent Data Standard Public
Patent Data Standard PublicPatent Data Standard Public
Patent Data Standard Public
 
Fun with my small note animal tissue new
Fun with my small note animal tissue newFun with my small note animal tissue new
Fun with my small note animal tissue new
 
Classification on multi label dataset using rule mining technique
Classification on multi label dataset using rule mining techniqueClassification on multi label dataset using rule mining technique
Classification on multi label dataset using rule mining technique
 
this is it
this is itthis is it
this is it
 
Thought for the day
Thought for the dayThought for the day
Thought for the day
 
Convenient voting machine
Convenient voting machineConvenient voting machine
Convenient voting machine
 
Individual fucas
Individual fucasIndividual fucas
Individual fucas
 
Andrewgoodwinsmusicvideoprinciples
AndrewgoodwinsmusicvideoprinciplesAndrewgoodwinsmusicvideoprinciples
Andrewgoodwinsmusicvideoprinciples
 
A fast atmospheric correction algorithm applied to landsat tm images
A fast atmospheric correction algorithm applied to landsat tm imagesA fast atmospheric correction algorithm applied to landsat tm images
A fast atmospheric correction algorithm applied to landsat tm images
 

Similar a Using data mining methods knowledge discovery for text mining

A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueeSAT Journals
 
Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...IAESIJAI
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHIJDKP
 
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMCANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMIRJET Journal
 
An efficient information retrieval ontology system based indexing for context
An efficient information retrieval ontology system based indexing for contextAn efficient information retrieval ontology system based indexing for context
An efficient information retrieval ontology system based indexing for contexteSAT Journals
 
A template based algorithm for automatic summarization and dialogue managemen...
A template based algorithm for automatic summarization and dialogue managemen...A template based algorithm for automatic summarization and dialogue managemen...
A template based algorithm for automatic summarization and dialogue managemen...eSAT Journals
 
Arabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodArabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodijcsit
 
Generic Algorithm based Data Retrieval Technique in Data Mining
Generic Algorithm based Data Retrieval Technique in Data MiningGeneric Algorithm based Data Retrieval Technique in Data Mining
Generic Algorithm based Data Retrieval Technique in Data MiningAM Publications,India
 
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemKnowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemIRJET Journal
 
6.domain extraction from research papers
6.domain extraction from research papers6.domain extraction from research papers
6.domain extraction from research papersEditorJST
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection methodIJSRD
 
Performance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information RetrievalPerformance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information Retrievalidescitation
 
journal of applied science and engineering
journal of applied science and engineeringjournal of applied science and engineering
journal of applied science and engineeringrikaseorika
 
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...ijceronline
 
Elevating forensic investigation system for file clustering
Elevating forensic investigation system for file clusteringElevating forensic investigation system for file clustering
Elevating forensic investigation system for file clusteringeSAT Publishing House
 
Elevating forensic investigation system for file clustering
Elevating forensic investigation system for file clusteringElevating forensic investigation system for file clustering
Elevating forensic investigation system for file clusteringeSAT Journals
 
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET Journal
 

Similar a Using data mining methods knowledge discovery for text mining (20)

A novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching techniqueA novel approach for text extraction using effective pattern matching technique
A novel approach for text extraction using effective pattern matching technique
 
Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...Machine learning for text document classification-efficient classification ap...
Machine learning for text document classification-efficient classification ap...
 
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACHTEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
TEXT CLUSTERING USING INCREMENTAL FREQUENT PATTERN MINING APPROACH
 
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEMCANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
CANDIDATE SET KEY DOCUMENT RETRIEVAL SYSTEM
 
An efficient information retrieval ontology system based indexing for context
An efficient information retrieval ontology system based indexing for contextAn efficient information retrieval ontology system based indexing for context
An efficient information retrieval ontology system based indexing for context
 
A template based algorithm for automatic summarization and dialogue managemen...
A template based algorithm for automatic summarization and dialogue managemen...A template based algorithm for automatic summarization and dialogue managemen...
A template based algorithm for automatic summarization and dialogue managemen...
 
Arabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation methodArabic text categorization algorithm using vector evaluation method
Arabic text categorization algorithm using vector evaluation method
 
Generic Algorithm based Data Retrieval Technique in Data Mining
Generic Algorithm based Data Retrieval Technique in Data MiningGeneric Algorithm based Data Retrieval Technique in Data Mining
Generic Algorithm based Data Retrieval Technique in Data Mining
 
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering SystemKnowledge Graph and Similarity Based Retrieval Method for Query Answering System
Knowledge Graph and Similarity Based Retrieval Method for Query Answering System
 
A0210110
A0210110A0210110
A0210110
 
6.domain extraction from research papers
6.domain extraction from research papers6.domain extraction from research papers
6.domain extraction from research papers
 
Introduction to feature subset selection method
Introduction to feature subset selection methodIntroduction to feature subset selection method
Introduction to feature subset selection method
 
Performance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information RetrievalPerformance Evaluation of Query Processing Techniques in Information Retrieval
Performance Evaluation of Query Processing Techniques in Information Retrieval
 
journal of applied science and engineering
journal of applied science and engineeringjournal of applied science and engineering
journal of applied science and engineering
 
E43022023
E43022023E43022023
E43022023
 
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
Survey on Existing Text Mining Frameworks and A Proposed Idealistic Framework...
 
B0410206010
B0410206010B0410206010
B0410206010
 
Elevating forensic investigation system for file clustering
Elevating forensic investigation system for file clusteringElevating forensic investigation system for file clustering
Elevating forensic investigation system for file clustering
 
Elevating forensic investigation system for file clustering
Elevating forensic investigation system for file clusteringElevating forensic investigation system for file clustering
Elevating forensic investigation system for file clustering
 
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
IRJET- Towards Efficient Framework for Semantic Query Search Engine in Large-...
 

Más de eSAT Publishing House

Likely impacts of hudhud on the environment of visakhapatnam
Likely impacts of hudhud on the environment of visakhapatnamLikely impacts of hudhud on the environment of visakhapatnam
Likely impacts of hudhud on the environment of visakhapatnameSAT Publishing House
 
Impact of flood disaster in a drought prone area – case study of alampur vill...
Impact of flood disaster in a drought prone area – case study of alampur vill...Impact of flood disaster in a drought prone area – case study of alampur vill...
Impact of flood disaster in a drought prone area – case study of alampur vill...eSAT Publishing House
 
Hudhud cyclone – a severe disaster in visakhapatnam
Hudhud cyclone – a severe disaster in visakhapatnamHudhud cyclone – a severe disaster in visakhapatnam
Hudhud cyclone – a severe disaster in visakhapatnameSAT Publishing House
 
Groundwater investigation using geophysical methods a case study of pydibhim...
Groundwater investigation using geophysical methods  a case study of pydibhim...Groundwater investigation using geophysical methods  a case study of pydibhim...
Groundwater investigation using geophysical methods a case study of pydibhim...eSAT Publishing House
 
Flood related disasters concerned to urban flooding in bangalore, india
Flood related disasters concerned to urban flooding in bangalore, indiaFlood related disasters concerned to urban flooding in bangalore, india
Flood related disasters concerned to urban flooding in bangalore, indiaeSAT Publishing House
 
Enhancing post disaster recovery by optimal infrastructure capacity building
Enhancing post disaster recovery by optimal infrastructure capacity buildingEnhancing post disaster recovery by optimal infrastructure capacity building
Enhancing post disaster recovery by optimal infrastructure capacity buildingeSAT Publishing House
 
Effect of lintel and lintel band on the global performance of reinforced conc...
Effect of lintel and lintel band on the global performance of reinforced conc...Effect of lintel and lintel band on the global performance of reinforced conc...
Effect of lintel and lintel band on the global performance of reinforced conc...eSAT Publishing House
 
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...eSAT Publishing House
 
Wind damage to buildings, infrastrucuture and landscape elements along the be...
Wind damage to buildings, infrastrucuture and landscape elements along the be...Wind damage to buildings, infrastrucuture and landscape elements along the be...
Wind damage to buildings, infrastrucuture and landscape elements along the be...eSAT Publishing House
 
Shear strength of rc deep beam panels – a review
Shear strength of rc deep beam panels – a reviewShear strength of rc deep beam panels – a review
Shear strength of rc deep beam panels – a revieweSAT Publishing House
 
Role of voluntary teams of professional engineers in dissater management – ex...
Role of voluntary teams of professional engineers in dissater management – ex...Role of voluntary teams of professional engineers in dissater management – ex...
Role of voluntary teams of professional engineers in dissater management – ex...eSAT Publishing House
 
Risk analysis and environmental hazard management
Risk analysis and environmental hazard managementRisk analysis and environmental hazard management
Risk analysis and environmental hazard managementeSAT Publishing House
 
Review study on performance of seismically tested repaired shear walls
Review study on performance of seismically tested repaired shear wallsReview study on performance of seismically tested repaired shear walls
Review study on performance of seismically tested repaired shear wallseSAT Publishing House
 
Monitoring and assessment of air quality with reference to dust particles (pm...
Monitoring and assessment of air quality with reference to dust particles (pm...Monitoring and assessment of air quality with reference to dust particles (pm...
Monitoring and assessment of air quality with reference to dust particles (pm...eSAT Publishing House
 
Low cost wireless sensor networks and smartphone applications for disaster ma...
Low cost wireless sensor networks and smartphone applications for disaster ma...Low cost wireless sensor networks and smartphone applications for disaster ma...
Low cost wireless sensor networks and smartphone applications for disaster ma...eSAT Publishing House
 
Coastal zones – seismic vulnerability an analysis from east coast of india
Coastal zones – seismic vulnerability an analysis from east coast of indiaCoastal zones – seismic vulnerability an analysis from east coast of india
Coastal zones – seismic vulnerability an analysis from east coast of indiaeSAT Publishing House
 
Can fracture mechanics predict damage due disaster of structures
Can fracture mechanics predict damage due disaster of structuresCan fracture mechanics predict damage due disaster of structures
Can fracture mechanics predict damage due disaster of structureseSAT Publishing House
 
Assessment of seismic susceptibility of rc buildings
Assessment of seismic susceptibility of rc buildingsAssessment of seismic susceptibility of rc buildings
Assessment of seismic susceptibility of rc buildingseSAT Publishing House
 
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...eSAT Publishing House
 
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...eSAT Publishing House
 

Más de eSAT Publishing House (20)

Likely impacts of hudhud on the environment of visakhapatnam
Likely impacts of hudhud on the environment of visakhapatnamLikely impacts of hudhud on the environment of visakhapatnam
Likely impacts of hudhud on the environment of visakhapatnam
 
Impact of flood disaster in a drought prone area – case study of alampur vill...
Impact of flood disaster in a drought prone area – case study of alampur vill...Impact of flood disaster in a drought prone area – case study of alampur vill...
Impact of flood disaster in a drought prone area – case study of alampur vill...
 
Hudhud cyclone – a severe disaster in visakhapatnam
Hudhud cyclone – a severe disaster in visakhapatnamHudhud cyclone – a severe disaster in visakhapatnam
Hudhud cyclone – a severe disaster in visakhapatnam
 
Groundwater investigation using geophysical methods a case study of pydibhim...
Groundwater investigation using geophysical methods  a case study of pydibhim...Groundwater investigation using geophysical methods  a case study of pydibhim...
Groundwater investigation using geophysical methods a case study of pydibhim...
 
Flood related disasters concerned to urban flooding in bangalore, india
Flood related disasters concerned to urban flooding in bangalore, indiaFlood related disasters concerned to urban flooding in bangalore, india
Flood related disasters concerned to urban flooding in bangalore, india
 
Enhancing post disaster recovery by optimal infrastructure capacity building
Enhancing post disaster recovery by optimal infrastructure capacity buildingEnhancing post disaster recovery by optimal infrastructure capacity building
Enhancing post disaster recovery by optimal infrastructure capacity building
 
Effect of lintel and lintel band on the global performance of reinforced conc...
Effect of lintel and lintel band on the global performance of reinforced conc...Effect of lintel and lintel band on the global performance of reinforced conc...
Effect of lintel and lintel band on the global performance of reinforced conc...
 
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
Wind damage to trees in the gitam university campus at visakhapatnam by cyclo...
 
Wind damage to buildings, infrastrucuture and landscape elements along the be...
Wind damage to buildings, infrastrucuture and landscape elements along the be...Wind damage to buildings, infrastrucuture and landscape elements along the be...
Wind damage to buildings, infrastrucuture and landscape elements along the be...
 
Shear strength of rc deep beam panels – a review
Shear strength of rc deep beam panels – a reviewShear strength of rc deep beam panels – a review
Shear strength of rc deep beam panels – a review
 
Role of voluntary teams of professional engineers in dissater management – ex...
Role of voluntary teams of professional engineers in dissater management – ex...Role of voluntary teams of professional engineers in dissater management – ex...
Role of voluntary teams of professional engineers in dissater management – ex...
 
Risk analysis and environmental hazard management
Risk analysis and environmental hazard managementRisk analysis and environmental hazard management
Risk analysis and environmental hazard management
 
Review study on performance of seismically tested repaired shear walls
Review study on performance of seismically tested repaired shear wallsReview study on performance of seismically tested repaired shear walls
Review study on performance of seismically tested repaired shear walls
 
Monitoring and assessment of air quality with reference to dust particles (pm...
Monitoring and assessment of air quality with reference to dust particles (pm...Monitoring and assessment of air quality with reference to dust particles (pm...
Monitoring and assessment of air quality with reference to dust particles (pm...
 
Low cost wireless sensor networks and smartphone applications for disaster ma...
Low cost wireless sensor networks and smartphone applications for disaster ma...Low cost wireless sensor networks and smartphone applications for disaster ma...
Low cost wireless sensor networks and smartphone applications for disaster ma...
 
Coastal zones – seismic vulnerability an analysis from east coast of india
Coastal zones – seismic vulnerability an analysis from east coast of indiaCoastal zones – seismic vulnerability an analysis from east coast of india
Coastal zones – seismic vulnerability an analysis from east coast of india
 
Can fracture mechanics predict damage due disaster of structures
Can fracture mechanics predict damage due disaster of structuresCan fracture mechanics predict damage due disaster of structures
Can fracture mechanics predict damage due disaster of structures
 
Assessment of seismic susceptibility of rc buildings
Assessment of seismic susceptibility of rc buildingsAssessment of seismic susceptibility of rc buildings
Assessment of seismic susceptibility of rc buildings
 
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
A geophysical insight of earthquake occurred on 21 st may 2014 off paradip, b...
 
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
Effect of hudhud cyclone on the development of visakhapatnam as smart and gre...
 

Último

NIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfNIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfMohonDas
 
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdfWave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdfErik Friis-Madsen
 
Artificial organ courses Hussein L1-C2.pptx
Artificial organ courses Hussein  L1-C2.pptxArtificial organ courses Hussein  L1-C2.pptx
Artificial organ courses Hussein L1-C2.pptxHusseinMishbak
 
The Journey of Process Safety Management: Past, Present, and Future Trends
The Journey of Process Safety Management: Past, Present, and Future TrendsThe Journey of Process Safety Management: Past, Present, and Future Trends
The Journey of Process Safety Management: Past, Present, and Future Trendssoginsider
 
Field Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
Field Report on present condition of Ward 1 and Ward 2 of Pabna MunicipalityField Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
Field Report on present condition of Ward 1 and Ward 2 of Pabna MunicipalityMorshed Ahmed Rahath
 
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptxssuser886c55
 
Flutter GDE session GDSC ZHCET AMU, aligarh
Flutter GDE session GDSC ZHCET AMU, aligarhFlutter GDE session GDSC ZHCET AMU, aligarh
Flutter GDE session GDSC ZHCET AMU, aligarhjamesbond00714
 
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...marijomiljkovic1
 
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024California Asphalt Pavement Association
 
This chapter gives an outline of the security.
This chapter gives an outline of the security.This chapter gives an outline of the security.
This chapter gives an outline of the security.RoshniIsrani1
 
عناصر نباتية PDF.pdfbotanical elements..
عناصر نباتية PDF.pdfbotanical elements..عناصر نباتية PDF.pdfbotanical elements..
عناصر نباتية PDF.pdfbotanical elements..mennamohamed200y
 
Final PPT.ppt about human detection and counting
Final PPT.ppt  about human detection and countingFinal PPT.ppt  about human detection and counting
Final PPT.ppt about human detection and countingArbazAhmad25
 
Evaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesEvaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesMark Billinghurst
 
Governors ppt.pdf .
Governors ppt.pdf                              .Governors ppt.pdf                              .
Governors ppt.pdf .happycocoman
 
Research paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
Research paper publications: Meaning of Q1 Q2 Q3 Q4 JournalResearch paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
Research paper publications: Meaning of Q1 Q2 Q3 Q4 JournalDr. Manjunatha. P
 
A brief about Jeypore Sub-station Presentation
A brief about Jeypore Sub-station PresentationA brief about Jeypore Sub-station Presentation
A brief about Jeypore Sub-station PresentationJeyporess2021
 
Introduction to Data Structures .
Introduction to Data Structures        .Introduction to Data Structures        .
Introduction to Data Structures .Ashutosh Satapathy
 

Último (20)

FOREST FIRE USING IoT-A Visual to UG students
FOREST FIRE USING IoT-A Visual to UG studentsFOREST FIRE USING IoT-A Visual to UG students
FOREST FIRE USING IoT-A Visual to UG students
 
NIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdfNIPORT Home Economics Questions Solution 2024.pdf
NIPORT Home Economics Questions Solution 2024.pdf
 
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdfWave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
Wave Energy Technologies Overtopping 1 - Tom Thorpe.pdf
 
Artificial organ courses Hussein L1-C2.pptx
Artificial organ courses Hussein  L1-C2.pptxArtificial organ courses Hussein  L1-C2.pptx
Artificial organ courses Hussein L1-C2.pptx
 
The Journey of Process Safety Management: Past, Present, and Future Trends
The Journey of Process Safety Management: Past, Present, and Future TrendsThe Journey of Process Safety Management: Past, Present, and Future Trends
The Journey of Process Safety Management: Past, Present, and Future Trends
 
Field Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
Field Report on present condition of Ward 1 and Ward 2 of Pabna MunicipalityField Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
Field Report on present condition of Ward 1 and Ward 2 of Pabna Municipality
 
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
0950_Rodriguez_200520_Work_done-GEOGalicia_ELAB-converted.pptx
 
Flutter GDE session GDSC ZHCET AMU, aligarh
Flutter GDE session GDSC ZHCET AMU, aligarhFlutter GDE session GDSC ZHCET AMU, aligarh
Flutter GDE session GDSC ZHCET AMU, aligarh
 
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
EJECTOR REFRIGERATION CYCLE WITH THE INJECTION OF A HIGH DENSITY FLUID INTO A...
 
Hydraulic Loading System - Neometrix Defence
Hydraulic Loading System - Neometrix DefenceHydraulic Loading System - Neometrix Defence
Hydraulic Loading System - Neometrix Defence
 
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
Caltrans District 8 Update for the CalAPA Spring Asphalt Conference 2024
 
This chapter gives an outline of the security.
This chapter gives an outline of the security.This chapter gives an outline of the security.
This chapter gives an outline of the security.
 
عناصر نباتية PDF.pdfbotanical elements..
عناصر نباتية PDF.pdfbotanical elements..عناصر نباتية PDF.pdfbotanical elements..
عناصر نباتية PDF.pdfbotanical elements..
 
Caltrans view on recycling of in-place asphalt pavements
Caltrans view on recycling of in-place asphalt pavementsCaltrans view on recycling of in-place asphalt pavements
Caltrans view on recycling of in-place asphalt pavements
 
Final PPT.ppt about human detection and counting
Final PPT.ppt  about human detection and countingFinal PPT.ppt  about human detection and counting
Final PPT.ppt about human detection and counting
 
Evaluation Methods for Social XR Experiences
Evaluation Methods for Social XR ExperiencesEvaluation Methods for Social XR Experiences
Evaluation Methods for Social XR Experiences
 
Governors ppt.pdf .
Governors ppt.pdf                              .Governors ppt.pdf                              .
Governors ppt.pdf .
 
Research paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
Research paper publications: Meaning of Q1 Q2 Q3 Q4 JournalResearch paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
Research paper publications: Meaning of Q1 Q2 Q3 Q4 Journal
 
A brief about Jeypore Sub-station Presentation
A brief about Jeypore Sub-station PresentationA brief about Jeypore Sub-station Presentation
A brief about Jeypore Sub-station Presentation
 
Introduction to Data Structures .
Introduction to Data Structures        .Introduction to Data Structures        .
Introduction to Data Structures .
 

Using data mining methods knowledge discovery for text mining

  • 1. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 24 USING DATA MINING METHODS KNOWLEDGE DISCOVERY FOR TEXT MINING D.M.Kulkarni1 , S.K.Shirgave2 1, 2 IT Department Dkte’s TEI Ichalkaranji (Maharashtra), India Abstract Many data mining techniques have been proposed for mining useful patterns in text documents. However, how to effectively use and update discovered patterns is still an open research issue, especially in the domain of text mining. Since most existing text mining methods adopted term-based approaches, they all suffer from the problems of polysemy and synonymy. Over the years, people have often held the hypothesis that pattern (or phrase)-based approaches should perform better than the term-based ones, but many experiments do not support this hypothesis. Proposed work presents an innovative and effective pattern discovery technique which includes the processes of pattern deploying and pattern evolving, to improve the effectiveness of using and updating discovered patterns for finding relevant and interesting information. Keywords:-Text mining, text classification, pattern mining, pattern evolving, information filtering. ----------------------------------------------------------------------***-------------------------------------------------------------------- 1. INTRODUCTION Knowledge discovery is a process of nontrivial extraction of information from large databases, information that is unknown and useful for user. Data mining is the first and essential step in the process of knowledge discovery. Various data mining methods are available such as association rule mining, sequential pattern mining, closed pattern mining and frequent item set mining to perform different knowledge discovery tasks. Effective use of discovered patterns is a research issue. Proposed system is implemented using different data mining methods for knowledge discovery. Text mining is a method of retrieving useful information from a large amount of digital text data. It is therefore crucial that a good text mining model should retrieve the information according to the user requirement. Traditional Information Retrieval (IR) has same objective of automatically retrieving as many relevant documents as possible, whilst filtering out irrelevant documents at the same time. However, IR-based systems do not provide users with what they really need. Many text mining methods have been developed for retrieving useful information for users. Most text mining methods use keyword based approaches, whereas others choose the phrase method to construct a text representation for a set of documents. The phrase-based approaches perform better than the keyword-based as it is considered that more information is carried by a phrase than by a single term. New studies have been focusing on finding better text representatives from a textual data collection. One solution is to use data mining methods, such as sequential pattern mining for Text mining. Such data mining-based methods use concepts of closed sequential patterns and non-closed patterns to decrease the feature set size by removing noisy patterns. New method, Pattern Discovery Model for the purpose of effectively using discovered patterns is proposed. Proposed system is evaluated the measures of patterns using pattern deploying process as well as finds patterns from the negative training examples using pattern Evolving process. 2. LITERATURE SURVEY The main process of text-related machine learning tasks is document indexing, which maps a document into a feature space representing the semantics of the document. Many types of text representations have been proposed in the past. A well known method for text mining is the bag of words that uses keywords (terms) as elements in the vector of the feature. Weighting scheme tf*idf (TFIDF) is used for text representation [1]. In addition to TFIDF, entropy weighting scheme is used, which improves performance by an average of 30 percent. The problem of bag of word approach is selection of a limited number of features amongst a huge set of words or terms in order to increase the system’s efficiency and avoid over fitting. In order to reduce the number of features, many dimensionality reduction approaches are available, such as Information Gain, Mutual Information, Chi-Square, Odds ratio. Some research works have used phrases rather than individual words.Using single words in keyword-based representation pose the semantic ambiguity problem. To solve this problem, the use of multiple words (i.e. phrases) as features therefore is proposed [2, 3]. In general, phrases carry more specific content than single words. For instance, “engine” and “search engine”. Another reason for using phrase-based representation is that the simple keyword-based representation of content is usually inadequate because single words are rarely specific enough for accurate discrimination [4]. To identify groups of words that create meaningful
  • 2. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 25 phrases is a better method, especially for phrases indicating important concepts in the text. The traditional term clustering methods are used to provide significantly improved text representation 3. PROPOSED SYSTEM Proposed system highlights on a software upgrade-based approach to increase efficiency of pattern discovery using different data mining Algorithms with pattern deploying and pattern Evolving method. System use data set from RCV1 (Reuters Corpus Volume 1) which contains training set and test set. Documents in both the set are either positive or negative.”Positive “means document is relevant to the topic otherwise “negative”. Documents are in XML format. System uses sequential closed frequent patterns as well as non sequential closed pattern for finding concept from data set. Modules in the proposed system are as follows • Data transform • Pattern discovery • Pattern deploy • Pattern Evolving • Evaluation 3.1 Data Transform Data transform is preprocessing of document. It consists of removal of irrelevant data from documents. Fig 1: Data Transform Data transform module consists of following steps as shown in figure 1. • Remove Stop Words In this step non informative words removed from document, • Stemming Stemming process to reduce derived word to its root form using Porter algorithm • Feature Selection This step assigns value to each term using a weighting scheme and removes low frequency terms. 3.2 Pattern Discovery This module discovers patterns from preprocessed documents. Sequential closed frequent patterns as well as non sequential closed patterns are extracted using algorithms Sequential closed pattern mining and non-sequential closed pattern mining. 3.3 Pattern Deploy Processing of discovered patterns is carried in this module. These discovered patterns are organized in specific format using pattern deploying method (PDM) and pattern deploying with support (PDS) Algorithms. PDM organizes discovered patterns in <term, frequency> form by combining all discovered pattern vectors. PDS gives same output as PDM with support of each term. 3.4 Pattern Evolving This module removed the non meaningful patterns using deploy pattern Evolving (DPE) and Individual Pattern Evolving (IPE) Algorithms. This module finds patterns from negative document. This module identifies and removes ambiguous patterns i.e. patterns which are present in positive as well as negative documents. 3.5 Evaluation of Pattern Generated after Evolving Method This module is regarding evaluation. This compares output of system without deploy and Evolve method with system using deploy and Evolve method. For checking performance of proposed system this module calculates precision, recall and f1-measures. 4. EXPERIMENTAL DATASET Several standard benchmark datasets such as Reuter’s corpora, OHSUMED[5] and 20 Newsgroups [6] collection are available for experimental purposes. The most frequently used one is the Reuters dataset. Several versions of Reuter’s corpora have been released. Reuters-21578 dataset is considered for experiment because it contains a reasonable number of documents with relevance judgment both in the training and test examples. Table 1 shows summary of Reuters data collections Table 1: Summary of Reuter’s data collections Version #docs #trainings #tests #topics Release year Reuters- 22173 22173 14,704 6,746 135 1993 Retuers- 21578 21578 9,603 3,299 90 1996 RCV1 806,791 5,127 37,556 100 2000
  • 3. IJRET: International Journal of Research in Engineering and Technology __________________________________________________________________________________________ Volume: 03 Issue: 01 | Jan-2014, Available @ Retuers-21578 includes 21,578 documents and 90 topics and released in 1996. Documents from data set are formatted using a structured XML scheme. 5. IMPLEMENTATION System starts from one of the RCV1 topics and retrieves the related information with regard to the training set. Each document is preprocessed with word stemming and words removal and transformed into a set of transactions based on its nature of document structure. System selects one of the pattern discovery algorithms to extract patterns. Discovered patterns are deployed using one of the deploying methods, and then pattern evolving process is used to refine patterns. A concept representing the context of the topic is eventually generated. Each document in the test set is assessed by the Test module and the relevant documents to topic are shown as an output. The result of data transform is a set of transactions and each transaction consists of a vector of stemmed terms. The next step is to find frequent patterns using pattern discovery algorithms. Data mining approaches including association rule mining, frequent sequential pattern mining, closed pattern mining, and item set mining are adopted and applied to the text mining tasks. By splitting each document into several transactions (i.e., par mining methods are used to find frequent patterns from the textual documents. Two pattern discovery methods which have been implemented in the experiments are briefed as follows: - SCPM: Finding sequential closed patterns using the algorithm SPMining. (Figure.2) -NSPM: Finding non-sequential patterns using the algorithm. Fig 2: Algorithm for Sequential closed Pattern mining IJRET: International Journal of Research in Engineering and Technology eISSN: 2319 __________________________________________________________________________________________ , Available @ http://www.ijret.org 21578 includes 21,578 documents and 90 topics and from data set are formatted using starts from one of the RCV1 topics and retrieves the to the training set. Each document is preprocessed with word stemming and stops removal and transformed into a set of transactions based on its nature of document structure. System selects one of the pattern discovery algorithms to extract patterns. Discovered patterns are deployed using one of the deploying pattern evolving process is used to refine patterns. A concept representing the context of the topic is eventually generated. Each document in the test set is assessed t module and the relevant documents to topic are The result of data transform is a set of transactions and each transaction consists of a vector of stemmed terms. The next step is to find frequent patterns using thms. Data mining approaches including association rule mining, frequent sequential pattern mining, closed pattern mining, and item set mining are adopted and applied to the text mining tasks. By splitting each document into several transactions (i.e., paragraphs), these mining methods are used to find frequent patterns from the textual documents. Two pattern discovery methods which have been implemented in the experiments are briefed as Finding sequential closed patterns using the sequential patterns using the algorithm. Sequential closed Pattern mining PDM uses sequential or non sequential closed pattern and gives document vectors as output.PDS used sequential or sequential closed patterns and gives document vectors with support as output. Fig 3: Algorithm for Pattern deploy Method The PDM is used with the attempt to address the problem caused by the inappropriate evaluation of patterns, discovered using data mining methods. Data mining methods, such as SPM and NSPM, utilize discovered patterns directly without any modification and thus encounter the problem of lacking frequency on specific patterns. Processing of discovered patterns is carried in this module. These discovered patterns are organized in a specific format. There are two choices for pattern deploying. One is using patter (PDM Figure 3) and other pattern deploying with support algorithms. PDM organizes discovered patterns in <term, support> form by combining all discovered pattern vectors. PDS gives same output as PDM with support of each term. After patterns deploy, the concept of topic is built by merging patterns of all documents. While the concept is established, the relevance estimation of each document in the test dataset is conducted using the document evaluating function as shown eq. (1) in Test process. Documents in the dataset are ranked according to their relevance scores .After testing; system’s performance is evaluated using the metrics such as precision, recall and f1-measures. algorithm Figure 4) is used by this vectors from PDM or PDS and removes the non meaningful patterns. Output of DPE is normalized document vectors. Here patterns from negative documents are identified and noisy (ambiguous) patterns i.e. Patterns which are present in Positive as well as negative documents, are filtered. Result of pattern evolving is patterns in <term, support> form by combining all deployed pattern vectors. The concept of topic is built by merging patterns of all documents While the concept is established, the relevance estimation of each document in the test dataset is conducted using the document evaluating function as shown eq.(1) in Test process. Documents in the dataset are ranked according to their relevance scores. After testing system’s perf eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ 26 PDM uses sequential or non sequential closed pattern and gives document vectors as output.PDS used sequential or non sequential closed patterns and gives document vectors with Algorithm for Pattern deploy Method The PDM is used with the attempt to address the problem caused by the inappropriate evaluation of patterns, discovered using data mining methods. Data mining methods, such as SPM and NSPM, utilize discovered patterns directly without hus encounter the problem of lacking frequency on specific patterns. Processing of discovered patterns is carried in this module. These discovered patterns are organized in a specific format. There are two choices for pattern deploying. One is using pattern deploying method ) and other pattern deploying with support algorithms. PDM organizes discovered patterns in <term, support> form by combining all discovered pattern vectors. PDS gives same output as PDM with support of each term. erns deploy, the concept of topic is built by merging patterns of all documents. While the concept is established, the relevance estimation of each document in the test dataset is conducted using the document evaluating function as shown ocess. Documents in the dataset are ranked according to their relevance scores .After testing; system’s performance is evaluated using the metrics such as precision, . Deploy pattern Evolving (DPE 4) is used by this module. It takes document vectors from PDM or PDS and removes the non meaningful patterns. Output of DPE is normalized document vectors. Here patterns from negative documents are identified and noisy (ambiguous) patterns i.e. Patterns which are present in Positive as well as negative documents, are filtered. Result of pattern evolving is patterns in <term, support> form by combining all deployed pattern vectors. The concept of topic is built by merging patterns of all documents While the lished, the relevance estimation of each document in the test dataset is conducted using the document evaluating function as shown eq.(1) in Test process. Documents in the dataset are ranked according to their relevance scores. After testing system’s performance is
  • 4. IJRET: International Journal of Research in Engineering and Technology __________________________________________________________________________________________ Volume: 03 Issue: 01 | Jan-2014, Available @ evaluated using the metrics such as precision, recall and f1 measures. For checking performance of the system this module calculates precision, recall and f1-measure metrics. The precision is the fraction of retrieved documents that relevant to the topic, and the recall is the fraction of relevant documents that have been retrieved. For a binary classification problem the judgment can be defined within a contingency table as depicted in Table 2. Table 2: Contingency table human judgment System judgment Yes Yes TP No FN According to the definition in Table (2), the precision and recall are calculated using following equations. Where TP (True positives) is the number of documents the system correctly identifies as positives; FP (False Positives) is the number of documents the system falsely identifies as positives; FN (False Negatives) is the number of relevant documents the system fails to identify. The precision of first K returned documents top-K is calculated. The precision of top K returned documents refers to the relative value of relevant documents in the first K returned documents. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319 __________________________________________________________________________________________ , Available @ http://www.ijret.org evaluated using the metrics such as precision, recall and f1- For checking performance of the system this module measure metrics. The precision is the fraction of retrieved documents that are relevant to the topic, and the recall is the fraction of relevant documents that have been retrieved. For a binary classification problem the judgment can be defined within a contingency Contingency table No FP TN 2), the precision and recall are calculated using following equations. Where TP (True positives) is the number of documents the system positives; FP (False Positives) is the number of documents the system falsely identifies as positives; FN (False Negatives) is the number of relevant documents the system fails to identify. The precision of first K he precision of top- K returned documents refers to the relative value of relevant documents in the first K returned documents. Fig 4: Algorithm for pattern The value of K use in the experiments is 20.Another metric F1-measure is calculated using following equation. To evaluate performance of system precision, recall and f1 measure of three processes is compared. 6. RESULTS OBTAINED Following Table 3 shows pattern obtained after pattern discovery method for topic ship obtained after pattern Evolving method for topic ship. Table 3:-1-term, 2 1-term 2-term 3-term eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ 27 Algorithm for pattern evolving method The value of K use in the experiments is 20.Another metric calculated using following equation. To evaluate performance of system precision, recall and f1- measure of three processes is compared. . RESULTS OBTAINED shows pattern obtained after pattern discovery method for topic ship. Table 4 shows patterns obtained after pattern Evolving method for topic ship. term, 2-term, 3-term patterns Patterns pct offer river ship strike seamen sector redund offer pct offer river strike pai pai seamen strike seamen pct river offer pai strike seamen ship sourc capac industri ship japan
  • 5. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 28 Table 4:-Patterns after pattern Evolving Document no Term Support 23 River 1.0 43 Ship 0.25 54 Seamen 0.25 62 Missil 0.25 63 Yard 1.0 81 Industry 0.25 62 Sourc 0.25 98 Shell 1.0 98 Strike 1.0 128 Protect 1.0 7. SYSTEM EVALUATION After Test process, the system is evaluated using three performance metrics precision (eq.2), recall (eq.3) and F1- measure (eq.4).Using these metrics, different methods are compared to check the most appropriate method which gives maximum relevant documents to topic. Reuters-21578 dataset consist of 90 topics. Comparison of precision, recall and f1- measure for topic ship by considering top-k documents with highest relevance score is as shown in figure 5. It can be observed that if value of k in top-k is chosen as 20 then system gives maximum values for precision, recall and f1-measure. Fig 5:-Precision, recall, f1-measure for topic ship Maximum number of documents relevant to topic ship are obtained at k=20. To evaluate performance of system, performance of different methods is compared using precision, recall and f1-measure. Comparison of precision and recall for methods Pattern discovery, Pattern deploy and Pattern Evolving (for topic ship is as shown in figure 6. Fig 6:-SCPM, PDM and DPE for topic ship It can be observed that maximum values for precision, recall and f1-measure are obtained from DPE. DPE gives maximum number of documents from test set that are relevant to topic ship. DPE gives better results than sequential closed pattern mining (SCPM) method. So, it can be concluded that DPE and PDM are superior to SCPM. CONCLUSIONS Many text mining methods have been proposed; main drawback of these methods is terms with higher tf*idf are not useful for finding concept of topic. Many data mining methods have been proposed for fulfilling various knowledge discovery tasks. These methods include association rule mining, frequent item set mining, sequential pattern mining, maximum pattern mining and closed pattern mining. All frequent patterns are not useful. Hence, use of these patterns derived from data mining methods leads to ineffective performance. Knowledge discovery with PDM and DPE have been proposed to overcome the above mentioned drawbacks. An effective knowledge discovery system is implemented using three main steps: (1) discovering useful patterns by sequential closed pattern mining algorithm and non sequential closed pattern mining algorithm. (2) Using discovered patterns by pattern deploying using PDS and PDM. (3) Adjusting user profiles by applying pattern evolution using DPE. Numerous experiments within an information filtering domain are conducted. Reuters- 21578 dataset is used by the system. Three performance metrics precision, recall and f1-measures are used to evaluate performance of system. The results show that the implemented system using pattern deploy and pattern Evolving is superior to SCPM data mining-based method.
  • 6. IJRET: International Journal of Research in Engineering and Technology eISSN: 2319-1163 | pISSN: 2321-7308 __________________________________________________________________________________________ Volume: 03 Issue: 01 | Jan-2014, Available @ http://www.ijret.org 29 REFERENCES [1]. L. P. Jing, H. K. Huang, and H. B. Shi. “Improved feature selection approach tf*idf in text mining.” International Conference on Machine Learning and Cybernetics, 2002. [2]. H. Ahonen-Myka. Discovery of frequent word sequences in text. In Proceedings of Pattern Detection and Discovery, pages 180–189, 2002.34, 61 [3]. E. Brill and P. Resnik. “A rule-based approach to prepositional phrase attachment disambiguation”. In Proceedings of the 15th International Conference on Computational Linguistics (COLING), pages 1198–1204, 1994. 34 [4]. H. Ahonen, O. Heinonen, M Klemettinen, and A. I. Verkamo. “Mining in the phrasal frontier”. In Proceedings of PKDD, pages 343–350, 1997. 34, 39, 62 [5]. W. Hersh, C. Buckley, T. Leone, and D. Hickman. “Ohsumed: an interactive retrieval evaluation and new large text collection for research”. In Proceedings of the 17th ACM International Conference on Research and Development in Information Retrieval, pages 192–201, 1994. [6]. K. Lang. News weeder: Learning to filter net news. In Proceedings of ICML, pages 331–339, 1995.