Category & Training Texts Selection for
Scientific Article Categorization in
an Expert Search System
By
Gan Keng Hoon*, Chua San Thai,
Khoh Zhuo Yan, Goh Kau Yang
School of Computer Sciences,
Universiti Sains Malaysia
Motivation
Scientific articles are produced as the results of research.
Organizing scientific articles into subject areas or topics
helps with discovery, navigation, etc.
Microsoft Academic
Google Scholar
Takahiro Komamizu, Toshiyuki Amagasa, Hiroyuki Kitagawa (2015),
"Facet-value extraction scheme from textual contents in XML data".
Scope
Application oriented research
Expert Search System
DBLP Dataset
School of Computer Sciences, USM
Goal
Improving the categorization of scientific articles
For:
Capturing experts' expertise based on their publications.
Enabling category filtering during search.
Existing Approaches
Labelled Scientific Articles
Supervised learning methods to train and test
Feature Selection
Bag of Words, N-gram, POS, Term Frequency, TF-IDF
This research
Train with labelled texts from scientific-related domains
Test with scientific articles
Research Justification
Avoid the use of a large number of labelled training texts.
Focus on differentiating good sources of training texts.
Use a reasonably small number of training texts to build
the subject category model.
Process of category model construction in the
scientific article domain.
Feature Selection
Feature Term Generation
The n-gram technique is used to generate potential term candidates from the training text. E.g.
D = “Search engine is an artificial intelligence system.”
2-gram words: Array ([0] => Search engine [1] => engine is [2] => is an [3] => an artificial [4] =>
artificial intelligence [5] => intelligence system)
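The n-gram step above can be sketched in Python as follows. This is only a minimal illustration, assuming whitespace tokenization with surrounding punctuation stripped; the helper name `word_ngrams` is hypothetical and does not come from the slides.

```python
def word_ngrams(text, n=2):
    """Return the list of word n-grams in `text` (hypothetical helper)."""
    # Assumption: tokens are whitespace-separated; surrounding punctuation is stripped.
    tokens = [t.strip(".,;:!?\"'()") for t in text.split()]
    tokens = [t for t in tokens if t]
    # Slide a window of n tokens over the token list.
    return [" ".join(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


D = "Search engine is an artificial intelligence system."
bigrams = word_ngrams(D, 2)
print(bigrams)
```

For the example sentence this yields the same six 2-gram candidates listed above. A text shorter than n words produces no candidates.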
Feature Selection by TF-IDF
Term Frequency-Inverse Document Frequency (TF-IDF) is a common method for keyword
weighting: the TF-IDF value of each candidate term is computed, and the terms with the top N
TF-IDF values are selected as features. This method penalizes a term that occurs across many
different training texts. The TF-IDF values are computed as
$$\text{TF-IDF}_{Term_i} = TF_{Term_i} \times \log \frac{N_D}{DF_{Term_i}}$$
where $TF_{Term_i}$ is the frequency of the term $Term_i$, $DF_{Term_i}$ is the number of
documents containing $Term_i$, and $N_D$ is the total number of documents.
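As a rough sketch of this weighting (TF multiplied by the log of N_D over DF, followed by top-N selection), the Python below may help. It rests on several assumptions that the slides do not state: TF is the raw count of a term within one training text, the logarithm is natural, and ties are broken alphabetically. The function names and the sample texts are hypothetical.

```python
import math
from collections import Counter


def tf_idf(texts):
    """Compute a TF-IDF score per term for each training text.

    `texts` is a list of term lists (e.g. word n-grams).
    Assumptions: TF is the raw count within one text; log is natural.
    """
    n_docs = len(texts)
    # DF: number of texts that contain each term at least once.
    df = Counter(term for terms in texts for term in set(terms))
    scores = []
    for terms in texts:
        tf = Counter(terms)
        scores.append({t: tf[t] * math.log(n_docs / df[t]) for t in tf})
    return scores


def top_n_features(score, n):
    """Keep the n highest-scoring terms (ties broken alphabetically)."""
    ranked = sorted(score.items(), key=lambda kv: (-kv[1], kv[0]))
    return dict(ranked[:n])


# Hypothetical training texts, already converted to candidate terms.
texts = [
    ["search engine", "search engine", "information retrieval"],
    ["neural network", "information retrieval"],
]
scores = tf_idf(texts)
features = [top_n_features(s, 1) for s in scores]
print(features)
```

In this toy input, "information retrieval" appears in both texts, so its weight drops to zero and it is never selected, which is the penalizing effect described above.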
Transfer Training Approach
Intuition
If the training texts are representative enough to cover the concept of a
category, then the training set can be obtained from any source that shares
similar concepts or semantics.
Criteria
The two text sources share the same or partially similar categories.
The categories must bear the same concept or meaning.
The training source must be comprehensive enough to cover a category’s concept.
The training source must be available, even if the testing source is not.
This approach is particularly useful when unseen texts are not
readily available.
Training and Testing Category Model
The training of a category model, CM, is defined by the function $CM_{Build}$. For each category,
$Cat$, the function takes a set of documents, $D_{Cat}$, i.e. the training texts, and maps them to a set of
features, $F_{Cat}$.
$CM_{Build}: D_{Cat} \rightarrow F_{Cat}$
The testing of a category model is defined by the function $CM_{Sim}$. For each new document, $D_{new}$,
the function maps the document to a set of most relevant categories, $Cat$.
$CM_{Sim}: D_{new} \rightarrow Cat$
Feature Similarity Scoring
The scoring technique is based on the cosine similarity measure of the Vector Space Model. The
feature sets of the category model are viewed as vectors in a vector space, in which each term has its
own axis. The similarity between a category and a document, $Sim_F$, is calculated from the
angle between their vectors as follows.
$$Sim_F = \frac{F_{Cat} \cdot F_{D_{new}}}{\lVert F_{Cat} \rVert \, \lVert F_{D_{new}} \rVert}$$
where $F_{Cat}$ is the feature vector of a category and $F_{D_{new}}$
is the feature vector of a new document.
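The scoring step can be illustrated with a short Python sketch. It assumes that feature vectors are stored as sparse dictionaries mapping a term to its weight, and that a zero-length vector scores 0; the names `cosine_similarity` and `rank_categories`, the two categories, and all weights are made up for illustration and are not taken from the slides.

```python
import math


def cosine_similarity(f_cat, f_doc):
    """Cosine of the angle between two sparse feature vectors (term -> weight)."""
    dot = sum(w * f_doc.get(t, 0.0) for t, w in f_cat.items())
    norm_cat = math.sqrt(sum(w * w for w in f_cat.values()))
    norm_doc = math.sqrt(sum(w * w for w in f_doc.values()))
    if norm_cat == 0.0 or norm_doc == 0.0:
        return 0.0  # Assumption: an empty vector matches nothing.
    return dot / (norm_cat * norm_doc)


def rank_categories(category_model, f_doc, top_k=1):
    """Return the top_k (category, score) pairs for a new document."""
    scored = [(cat, cosine_similarity(f, f_doc)) for cat, f in category_model.items()]
    # Highest score first; ties broken by category name.
    scored.sort(key=lambda pair: (-pair[1], pair[0]))
    return scored[:top_k]


# Hypothetical category model and new document; weights are illustrative only.
category_model = {
    "Information Retrieval": {"search engine": 2.0, "query": 1.0},
    "Machine Learning": {"neural network": 2.0, "training": 1.0},
}
f_new = {"search engine": 1.0, "training": 1.0}
ranking = rank_categories(category_model, f_new, top_k=2)
print(ranking)
```

Here the new document shares its heavier-weighted term with the first category, so that category ranks first. Returning the top k pairs rather than a single label matches the idea of mapping a document to a set of most relevant categories.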
Evaluation Settings
Performance Metric
Whether a scientific article is correctly assigned to a category.
Evaluated by expert judgement.
Training Texts
Title and abstract are used.
Tasks
Common categories (30 general categories) vs. common + specific categories
(30 general categories + 12 domain-specific categories)
Automated vs. manual selection of training texts
Evaluation Results
Expert | Common categories + Automated training texts (%) | Common and specific categories + Automated training texts (%) | Common and specific categories + Manual training texts (%)
Expert 1 | 62.50 | 68.75 | 81.25
Expert 2 | 46.67 | 46.67 | 53.33
Expert 3 | 33.33 | 33.33 | 66.67
Expert 4 | 33.33 | 41.67 | 41.67
Expert 5 | 43.75 | 37.50 | 28.13
(Average) | (43.92) | (45.59) | (54.21)
Conclusion
Possibility
It is possible to train a category model using training texts from one source and
apply it to a different source.
Challenge
Selecting the training texts, as they can influence the accuracy of the trained
model.
Limitation
Selecting the categories: the selected set is too small to cover the
research areas of a domain (e.g. Computer Science).
Thank You
For more of our work, please visit ir.cs.usm.my
Email me at khgan@usm.my

Vadodara 💋 Call Girl 7737669865 Call Girls in Vadodara Escort service book now
 
7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt7. Epi of Chronic respiratory diseases.ppt
7. Epi of Chronic respiratory diseases.ppt
 
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
Jual Obat Aborsi Surabaya ( Asli No.1 ) 085657271886 Obat Penggugur Kandungan...
 
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
5CL-ADBA,5cladba, Chinese supplier, safety is guaranteed
 
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...Fun all Day Call Girls in Jaipur   9332606886  High Profile Call Girls You Ca...
Fun all Day Call Girls in Jaipur 9332606886 High Profile Call Girls You Ca...
 
Kings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about themKings of Saudi Arabia, information about them
Kings of Saudi Arabia, information about them
 
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
TrafficWave Generator Will Instantly drive targeted and engaging traffic back...
 
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
Top profile Call Girls In Satna [ 7014168258 ] Call Me For Genuine Models We ...
 
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...Reconciling Conflicting Data Curation Actions:  Transparency Through Argument...
Reconciling Conflicting Data Curation Actions: Transparency Through Argument...
 
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
High Profile Call Girls Service in Jalore { 9332606886 } VVIP NISHA Call Girl...
 
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
Top profile Call Girls In Bihar Sharif [ 7014168258 ] Call Me For Genuine Mod...
 

Category & Training Texts Selection for Scientific Article Categorization in an Expert Search System

  • 1. Category & Training Texts Selection for Scientific Article Categorization in an Expert Search System. By Gan Keng Hoon*, Chua San Thai, Khoh Zhuo Yan, Goh Kau Yang. School of Computer Sciences, Universiti Sains Malaysia.
  • 2. Motivation: Scientific articles are produced as results of research. Organizing scientific articles into subject areas or topics helps in discovery, navigation, etc.
  • 5. Motivation: Takahiro Komamizu, Toshiyuki Amagasa, Hiroyuki Kitagawa (2015), "Facet-value extraction scheme from textual contents in XML data".
  • 6. Scope: Application-oriented research; Expert Search System; DBLP dataset; School of Computer Sciences, USM. Goal: Improving the categorization of scientific articles, in order to capture experts' expertise based on their publications and to enable category filtering during search.
  • 7. Existing Approaches: Labelled scientific articles; supervised learning method to train and test. Feature Selection: bag of words, n-gram, POS, term frequency, TF-IDF. This research: train with labelled scientific-related domain texts; test with scientific articles.
  • 8. Research Justification: Avoid the use of a large number of labelled training texts. Focus on differentiating good sources of training texts. Use a reasonably small number of training texts to build the subject category model.
  • 9. Process of category model construction on scientific article domain.
  • 10. Feature Selection. Feature Term Generation: The n-gram technique is used to generate potential term candidates from the training text. E.g. D = “Search engine is an artificial intelligence system.” 2-gram word: Array ([0] => Search engine [1] => engine is [2] => is an [3] => an artificial [4] => artificial intelligence [5] => intelligence system). Features Selection by TF-IDF: Term Frequency Inverse Document Frequency (TF-IDF) is a common method for keyword weighting: the TF-IDF value of each term is computed, and the terms with the top N TF-IDF values are selected as features. This method penalizes a term when it occurs in different training texts. The TF-IDF values are computed as $\text{TF-IDF}_{Term_i} = \text{TF}_{Term_i} \times \log \dfrac{N_D}{\text{DF}_{Term_i}}$, where $\text{TF}_{Term_i}$ is the term frequency of $Term_i$, $\text{DF}_{Term_i}$ is the number of documents containing the term $Term_i$, and $N_D$ is the total number of documents.
  • 11. Transfer Training Approach. Intuition: If the training texts are representative enough to cover the concept of a category, then the training set can be obtained from any source that shares similar concepts or semantics. Criteria: The two text sources share the same or partially similar categories. The categories must bear the same concept or meaning. The training source must be comprehensive enough to cover a category’s concept. The training source must be available, even if the testing source is not. This approach is particularly useful when the resources of unseen texts are not readily available.
  • 12. Training and Testing Category Model. The training of the category model, CM, can be defined using the function $CM_{Build}$. For each category, $Cat$, the function takes in a set of documents, $D_{Cat}$, i.e. training texts, and maps them to a set of features, $F_{Cat}$: $CM_{Build}: D_{Cat} \rightarrow F_{Cat}$. The testing of the category model is defined using the function $CM_{Sim}$. For each new document, $D_{new}$, the function maps the document to a set of most relevant categories, $Cat$: $CM_{Sim}: D_{new} \rightarrow Cat$. Feature Similarity Scoring: The scoring technique is based on the Vector Space Model cosine similarity measure. The feature sets of the category models are viewed as vectors in a vector space, where each term has its own axis. The similarity of a category and a document, $Sim_F$, can be calculated by comparing the deviation angle between the vectors as follows: $Sim_F = \dfrac{F_{Cat} \cdot F_{D_{new}}}{\lVert F_{Cat} \rVert \, \lVert F_{D_{new}} \rVert}$, where $F_{Cat}$ is the feature vector of a category and $F_{D_{new}}$ is the feature vector of a new document.
  • 13. Evaluation Settings. Performance Metric: whether a scientific article is correctly assigned to a category or not, evaluated by expert judgement. Training Texts: title and abstract are used. Tasks: (1) common categories (30 general categories) vs. common + specific categories (30 general + 12 domain-specific categories); (2) automated selection of training texts vs. manual selection.
  • 14. Evaluation Results (%)
      (A) Common categories + automated training texts
      (B) Common and specific categories + automated training texts
      (C) Common and specific categories + manual training texts

                    (A)       (B)       (C)
      Expert 1      62.50     68.75     81.25
      Expert 2      46.67     46.67     53.33
      Expert 3      33.33     33.33     66.67
      Expert 4      33.33     41.67     41.67
      Expert 5      43.75     37.50     28.13
      (Average)    (43.92)   (45.59)   (54.21)
  • 15. Conclusion. Possibility: A category model can be trained using training texts from one source and applied to a different source. Challenge: Selection of training texts, as they can influence the accuracy of the trained model. Limitation: Selection of categories; the selected set is too small to cover the research areas of the domain (e.g. Computer Science).
  • 16. Thank You. For more of our work, please visit ir.cs.usm.my or email me at khgan@usm.my
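
The feature selection and scoring steps described in slides 10 and 12 can be sketched in Python as follows. This is only an illustration under stated assumptions, not the authors' implementation: the function names (`word_ngrams`, `tfidf_features`, `cosine`, `rank_categories`) and the toy training texts are hypothetical, the training texts of each category are treated as a single document, the natural logarithm is used because the slides do not state the log base, and a new document is represented by its raw n-gram counts.

```python
import math
from collections import Counter


def word_ngrams(text, n=2):
    """Split text into lowercased words and return its word n-grams."""
    words = [w.strip(".,;:!?\"'()").lower() for w in text.split()]
    words = [w for w in words if w]
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]


def tfidf_features(docs, n=2, top_n=5):
    """For each document, weight n-grams by TF * ln(N_D / DF), keep the top_n."""
    grams = [Counter(word_ngrams(d, n)) for d in docs]
    n_docs = len(docs)
    # Iterating a Counter yields each distinct term once, so this counts
    # the number of documents that contain the term (document frequency).
    df = Counter(term for g in grams for term in g)
    features = []
    for g in grams:
        weights = {t: tf * math.log(n_docs / df[t]) for t, tf in g.items()}
        # Sort by descending weight; ties are broken alphabetically.
        top = sorted(weights.items(), key=lambda kv: (-kv[1], kv[0]))[:top_n]
        features.append(dict(top))
    return features


def cosine(u, v):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(w * v.get(t, 0.0) for t, w in u.items())
    norm_u = math.sqrt(sum(w * w for w in u.values()))
    norm_v = math.sqrt(sum(w * w for w in v.values()))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


def rank_categories(category_texts, new_doc, n=2, top_n=5):
    """Build one feature vector per category and rank categories for new_doc."""
    names = list(category_texts)
    models = tfidf_features([category_texts[c] for c in names], n, top_n)
    doc_vec = dict(Counter(word_ngrams(new_doc, n)))
    scores = [(name, cosine(model, doc_vec)) for name, model in zip(names, models)]
    return sorted(scores, key=lambda s: -s[1])


# Toy training texts (invented for illustration only).
training = {
    "information retrieval":
        "Search engine ranks documents. A search engine indexes web documents.",
    "artificial intelligence":
        "Artificial intelligence system learns. "
        "An artificial intelligence system plans actions.",
}
ranked = rank_categories(training, "Search engine is an artificial intelligence system.")
for name, score in ranked:
    print(f"{name}: {score:.3f}")
```

With only two categories, any n-gram that appears in both training texts receives a weight of zero, which mirrors the penalty on terms occurring in different training texts described in slide 10.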