CSCI 6505 Course Project Report

             Student Name: Yuan An
                 Email: yuana@cs.dal.ca

          Student Name: Suihong Liang
                 Email: abacus@cs.dal.ca

                 Date: 4 Dec 2000



Topic: Construct a topic-based search engine for a given
website using a machine learning approach (instance-based
                  learning).
1. DESCRIPTION OF PROBLEM:

Many organizations and individuals maintain websites to post information for the public.
As the number of files on a website grows, it becomes useful to index all the files for
search purposes. Commercial search engines such as Yahoo! and Google provide the
ability to search for related web pages across websites around the world. It is also useful
for a single website to offer search over its own pages. The simplest techniques for
indexing HTML files are to count the number of hits of given keywords in an HTML file,
or to examine the HEAD part of the HTML file for related information. We take a
machine learning approach to classify HTML files into topics; specifically, we use an
instance-based learning algorithm, k-nearest neighbors, to do this work.

Our original intention was to classify an HTML file related to computer science into its
most relevant topic, and then to index the HTML files of a given website to provide
topic-based search for that website. Because of the limited time for this project, we did
not use the classifier to index any specific website. For the experiment, we downloaded
test HTML files related to the topics 'artificial intelligence', 'programming language',
'operating system', 'database', 'graphics' and 'software engineering', and used them as
training and test data. We extracted a vocabulary containing 2,901 words related to
computer science from an online technology dictionary
(http://www.oasismanagement.com/TECHNOLOGY/GLOSSARY/index.html).
We then built a classifier using an instance-based learning algorithm. The details are
discussed in the following sections.

2. CORE IDEA:

2.1 Instance-based learning approach:
Instance-based learning methods such as k-nearest neighbor are straightforward
approaches to approximating real-valued or discrete-valued target functions. Learning in
these methods consists simply of storing the presented training data. When a new query
instance is encountered, a set of similar instances is retrieved from memory and used to
classify the new query instance. Since the Weka package implements almost all standard
machine learning algorithms in Java, including k-nearest neighbor, we use this package
in our implementation.
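The stored-instances idea can be made concrete with a minimal plain-Java sketch of k-nearest neighbors over word-count vectors. This is only an illustration of what Weka's implementation provides, not the project's actual code; class and method names here are hypothetical.

```java
import java.util.*;

// Minimal k-nearest-neighbor classifier over numeric attribute vectors.
// "Learning" is just storing the training instances; classification is
// a majority vote among the k instances closest to the query.
public class Knn {
    private final List<double[]> vectors = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    // Store one training instance with its topic label.
    public void add(double[] vector, String label) {
        vectors.add(vector);
        labels.add(label);
    }

    // Classify a query by majority vote among its k nearest neighbors.
    public String classify(double[] query, int k) {
        // Order stored instances by Euclidean distance to the query.
        Integer[] order = new Integer[vectors.size()];
        for (int i = 0; i < order.length; i++) order[i] = i;
        Arrays.sort(order, Comparator.comparingDouble(
                (Integer i) -> distance(vectors.get(i), query)));

        // Count votes among the k closest instances.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < order.length; i++)
            votes.merge(labels.get(order[i]), 1, Integer::sum);
        return Collections.max(votes.entrySet(),
                Map.Entry.comparingByValue()).getKey();
    }

    private static double distance(double[] a, double[] b) {
        double sum = 0;
        for (int i = 0; i < a.length; i++) sum += (a[i] - b[i]) * (a[i] - b[i]);
        return Math.sqrt(sum);
    }
}
```

Note that all work happens at query time: there is no training phase beyond storage, which is exactly the trade-off discussed in Section 6.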

2.2 Representation of HTML files:
Since this project is for the purpose of studying machine learning, we do not focus on
document representation. We use the simplest method to represent an instance, i.e., an
HTML file: a vector of words serves as the attributes of each instance. The vector is
extracted from an online technology dictionary and contains 2,901 words from the
computer science domain. We defined several categories of HTML file: 'artificial
intelligence', 'programming language', 'operating system', 'database', 'graphics' and
'software engineering'. First, we collect a set of training data to train the classifier. Then,
we use the trained classifier to index all HTML files of a given website.
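The word-vector representation can be sketched as follows: each document becomes a fixed-length vector of counts over the vocabulary. This is an illustrative sketch with a toy vocabulary, not the project's actual makeInstance() code.

```java
import java.util.*;

// Sketch of turning a plain-text document into a fixed-length attribute
// vector over a given vocabulary (the report uses a 2,901-word list;
// a toy vocabulary stands in here).
public class DocumentVector {
    public static int[] makeInstance(String text, List<String> vocabulary) {
        // Count how often each token occurs in the document.
        Map<String, Integer> counts = new HashMap<>();
        for (String token : text.toLowerCase().split("\\W+"))
            counts.merge(token, 1, Integer::sum);

        // Attribute i is the count of vocabulary word i.
        int[] vector = new int[vocabulary.size()];
        for (int i = 0; i < vector.length; i++)
            vector[i] = counts.getOrDefault(vocabulary.get(i), 0);
        return vector;
    }
}
```

Words outside the vocabulary are simply ignored, which keeps every instance the same length regardless of document size.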
2.3 Classification:
Our implementation keeps a keyword vector containing 2,901 words related to computer
science. When a new HTML file comes in, the system first converts it to a text file by
discarding all HTML tags and comments as well as the HEAD part of the file. Then the
system converts the text into an instance by calling the makeInstance() method. Finally,
the trained classifier is invoked to classify the new instance using the k-nearest neighbors
algorithm.
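The HTML-to-text step can be approximated with a few regular expressions, as sketched below. This is a simplification for illustration; the tokenizer and parser classes listed in Section 4 handle cases a regex cannot.

```java
// Simplified sketch of the HTML-to-text step: drop the HEAD part,
// comments and tags, keeping only the visible text.
public class HtmlToText {
    public static String strip(String html) {
        return html
            .replaceAll("(?is)<head.*?</head>", " ")  // discard the HEAD part
            .replaceAll("(?s)<!--.*?-->", " ")        // discard comments
            .replaceAll("(?s)<[^>]*>", " ")           // discard remaining tags
            .replaceAll("\\s+", " ")                  // collapse whitespace
            .trim();
    }
}
```

The surviving text is what gets tokenized and counted against the 2,901-word vocabulary.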

2.4 Indexing:
Our system contains a crawler that walks the directory tree of a given website's URL.
Whenever the crawler encounters an HTML file along its path, it calls the trained
classifier to assign the file to its corresponding category. The crawler writes the pair of
classification label and URL of the file into a TreeMap; we use a TreeMap so that entries
stay sorted for later ranking. After crawling, the crawler writes the TreeMap to a text file
for use by the search interface.
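The crawler's bookkeeping can be sketched as a TreeMap from topic label to the URLs classified under it, serialized one pair per line. The class name and the tab-separated line format here are illustrative assumptions, not the project's actual code.

```java
import java.util.*;

// Sketch of the index the crawler builds: a TreeMap keeps the
// (topic label, URL) pairs sorted by label, leaving room for ranking.
public class TopicIndex {
    private final TreeMap<String, List<String>> byLabel = new TreeMap<>();

    // Record that the page at `url` was classified under `label`.
    public void add(String label, String url) {
        byLabel.computeIfAbsent(label, k -> new ArrayList<>()).add(url);
    }

    // Serialize the index as one "label<TAB>url" line per page,
    // the form a searcher can later scan.
    public String toIndexFile() {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<String, List<String>> e : byLabel.entrySet())
            for (String url : e.getValue())
                sb.append(e.getKey()).append('\t').append(url).append('\n');
        return sb.toString();
    }
}
```

Because a TreeMap iterates its keys in sorted order, the written file groups all pages of one topic together with no extra sorting pass.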


3. MAIN COMPONENTS AND INTERACTION DIAGRAM:

The project consists of the following components: (1) a command-line utility for indexing
all HTML files under a website's home directory into topics; it crawls all subdirectories
of the home directory automatically; (2) server-side CGI or Java servlets for replying to
users' queries; (3) a user interface displayed in the browser.

3.1 Description of modules:
1. HTML file classifier:
This module trains a classifier from scratch, updates the classifier with additional
training data, and classifies new documents. The function for converting an HTML file
into a text file also lives in this module. The Weka package is imported here, and its
implementation of the k-nearest neighbors algorithm and other helper utilities are used.
2. Crawler or indexer:
This module crawls the directory tree of a given website to index all HTML files residing
in the site. The crawler is a command-line utility run by the webmaster after updating the
website. It takes the home URL as its starting point, loads the trained classifier, and then
crawls all subdirectories using a breadth-first search strategy. Whenever it encounters an
HTML file, it classifies the file into its corresponding category and stores the pair of
label and address in a map. After crawling, it writes the map to a text file for use by the
search interface.
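The breadth-first traversal described above can be sketched with a FIFO queue. For testability the directory tree is modelled as a map from a directory to its entries (entries ending in '/' are subdirectories); the real crawler walks the filesystem the same way. Names are illustrative.

```java
import java.util.*;

// Sketch of the crawler's breadth-first traversal: start from the home
// directory, queue subdirectories, and collect every HTML file found.
// (The real crawler would classify each file as it goes.)
public class Crawler {
    public static List<String> findHtmlFiles(Map<String, List<String>> tree,
                                             String root) {
        List<String> found = new ArrayList<>();
        Deque<String> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            String dir = queue.poll();  // FIFO queue => breadth-first order
            for (String entry : tree.getOrDefault(dir, Collections.emptyList())) {
                if (entry.endsWith("/")) queue.add(entry);           // subdirectory
                else if (entry.endsWith(".html")) found.add(entry);  // page to index
            }
        }
        return found;
    }
}
```

Using a stack instead of a queue here would turn the same loop into depth-first search; breadth-first simply visits a directory level completely before descending.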
3. Server-side searcher:
This module replies with search results to users who submit a query. Since all HTML
files have been indexed and the index information has been written to a text file, the
server-side searcher simply scans the index file, finds the matching records, and replies
to the user. This can be implemented with any of several server-side techniques, such as
CGI or servlets.
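Scanning the index file reduces to matching the label field of each line against the query. The sketch below assumes a tab-separated "label, URL" line format; both the format and the names are illustrative assumptions.

```java
import java.util.*;

// Sketch of the server-side searcher: scan the index file's
// "label<TAB>url" lines and return the URLs whose label matches.
public class TopicSearcher {
    public static List<String> search(List<String> indexLines, String topic) {
        List<String> hits = new ArrayList<>();
        for (String line : indexLines) {
            String[] parts = line.split("\t", 2);  // label, then URL
            if (parts.length == 2 && parts[0].equalsIgnoreCase(topic))
                hits.add(parts[1]);
        }
        return hits;
    }
}
```

A linear scan is adequate for a single site's index; a larger index would call for loading the file into a map keyed by label instead.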
3.2 Interaction diagram:




    HTML classifier                               Crawler:
    module:                                       1. load classifier
    1. build classifier.                          from disk.
    2. update classifier.                         2. crawling along
    3. classify new              Classifier       directory tree.
       document.                 stored in        3. classify
    4. Transfer files.           disk.            encountered files.




                            Searcher:
                            1. accepts user’s            Indexed
                               query.                    information
                            2. Searches on               file in disk.
                               indexed file.
                            3. Replies results.




                                 User’s
                                 browser
4. IMPLEMENTATION:

In this section, we list all Java classes used in this project.

The following classes are used to convert HTML files into text files:

1. public interface HTMLContent.
2. public class HTMLContentList: extends ArrayList.
3. public class HTMLTag: stores a tag name and optional attribute list.
4. public class HTMLText: stores the text of an HTML file.
5. public class HTMLToken: stores tokens of an HTML file.
6. public class HTMLTokenizer: parses an HTML file into tokens.
7. public class HTMLTokenList: extends ArrayList.
8. public class Parser: takes an HTMLTokenList as input and converts it into an HTMLContentList.
9. public class HTMLAttribute: stores an attribute of an HTML tag.
10. public class HTMLAttributeList: extends ArrayList to store all attributes of a tag.

The following classes are used for indexing, classifying and searching:

11. public class HTMLIndex: extends HashMap, implementing two methods: (1)
  addString(), which takes a class label and a title/filename as arguments and creates a
  mapping between each label and the respective file; (2) writeFile(), which streams the
  index content to a file.
12. public class HTMLIndexer: a command-line utility that traverses the directories from
  a given root path.
13. public class HTMLClassifier: the k-nearest neighbors classifier, implementing these
  methods: (1) HTMLClassifier(), the constructor, which builds the classifier from
  scratch or loads it from a file; (2) updateModel(), which trains the classifier on
  training data; (3) classifyMessage(), which classifies a new instance; (4)
  makeInstance(), which builds a new instance; (5) htmlToText(), which converts an
  HTML file into a text file.
14. public interface Searcher: a search engine that returns the matching records in the
  index file.
15. public class HTMLSearch: implements the interface Searcher.
16. public class SearchServlet: wraps the Searcher with an appropriate interface to handle
  a POST request with a string argument named 'search'. The result is returned on the
  output stream.
5. SAMPLE RESULTS:

Our implementation combines a keyword search engine and a topic-based search engine.
The prototype was tested on our own website. The user interface consists of a text field
and two submit buttons (see Figure 1): one labeled 'keyword' and the other labeled
'topic'.

                                   Figure 1

To search for relevant documents by keyword, the user types the keywords into the text
field and clicks the button labeled 'keyword' (see Figure 1). If any documents match the
keywords, the names of the matching documents are returned as hyperlinks, together
with the number of keyword hits in each document (see Figure 2). If no document
matches, the result is 'no pages found'.
Figure 2
To search for relevant documents by topic, the user types the topic into the text field and
clicks the button labeled 'topic' (see Figure 1). This implementation accepts only the
following topics: 'artificial intelligence', 'programming language', 'operating system',
'database', 'graphics' and 'software engineering'. If any documents carry a matching
label, the names of the matching documents are returned as hyperlinks, together with the
keyword hit counts in each document (see Figure 3). If no document matches, the result
is 'no pages found'.




                                      Figure 3

6. DISCUSSION:

In this project, we implemented a topic-based search engine for a given website. The key
point of such a search engine is to train a classifier that assigns HTML files to their
corresponding categories. We used the k-nearest neighbors algorithm as implemented in
the WEKA machine learning package to train a classifier for files related to computer
science. Since this is a course project focused on machine learning, the document
representation and the collection of training data are kept simple. Several open problems
could be addressed further:
(1) The k-nearest neighbors algorithm must store the training data somewhere. When a
    new instance comes in, the k nearest neighbors are retrieved and compared to decide
    its classification. With a large amount of training data this is clearly inefficient, so a
    more efficient classifier could be developed in the future using a more scalable
    machine learning algorithm.
(2) We represent documents using a vector extracted from an online technology
    dictionary. This is a fairly simple representation and could be improved considerably.
(3) Our implementation cannot rank the relevant pages for topic search. In keyword
    search, we compute the keyword hit counts in each relevant page, but for topic search
    we did not devise a ranking strategy. Such a ranking strategy is nonetheless desirable
    in a search engine.

The MYnstrel Free Press Volume 2: Economic Struggles, Meet Jazz
 
MICHAEL JACKSON.doc
MICHAEL JACKSON.docMICHAEL JACKSON.doc
MICHAEL JACKSON.doc
 
Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1Social Networks: Twitter Facebook SL - Slide 1
Social Networks: Twitter Facebook SL - Slide 1
 
Facebook
Facebook Facebook
Facebook
 
Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...Executive Summary Hare Chevrolet is a General Motors dealership ...
Executive Summary Hare Chevrolet is a General Motors dealership ...
 
Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...Welcome to the Dougherty County Public Library's Facebook and ...
Welcome to the Dougherty County Public Library's Facebook and ...
 
NEWS ANNOUNCEMENT
NEWS ANNOUNCEMENTNEWS ANNOUNCEMENT
NEWS ANNOUNCEMENT
 
C-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.docC-2100 Ultra Zoom.doc
C-2100 Ultra Zoom.doc
 
MAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.docMAC Printing on ITS Printers.doc.doc
MAC Printing on ITS Printers.doc.doc
 
Mac OS X Guide.doc
Mac OS X Guide.docMac OS X Guide.doc
Mac OS X Guide.doc
 
hier
hierhier
hier
 
WEB DESIGN!
WEB DESIGN!WEB DESIGN!
WEB DESIGN!
 

CSCI 6505 Project: Construct a search engine using a machine learning approach

  • 1. CSCI 6505 Course Project Report. Student Name: Yuan An, Email: yuana@cs.dal.ca. Student Name: Suihong Liang, Email: abacus@cs.dal.ca. Date: 4 Dec 2000. Topic: Construct a topic-based search engine using a machine learning approach (instance-based learning method) for a given website.
  • 2. 1. DESCRIPTION OF PROBLEM: Many organizations and people have their own websites to post information for the public. As the number of files on a website grows, it becomes useful to index all the files for search purposes. Many commercial search engines, such as Yahoo! and Google, provide the ability to search for related web pages across websites around the world. It is also useful for a single website to provide the ability to search for related web pages within the site itself. The simplest techniques for indexing HTML files are to count the number of hits of given keywords in an HTML file, or to look at the HEAD part of the HTML file for related information. We use a machine learning approach to classify HTML files into topics; specifically, we use an instance-based learning algorithm, k-nearest neighbors, to do this work. Our original intention was to classify an HTML file related to computer science topics into the most relevant topic, and then index the HTML files of a given website to provide topic-based search for that website. Because of the limited time for this project, we did not use the classifier to index any specific website. For the experiment, we downloaded some test HTML files related to the topics 'artificial intelligence', 'programming language', 'operating system', 'database', 'graphics', and 'software engineering', and used them as training and test data. We extracted a vocabulary containing 2,901 words related to computer science from an online technology dictionary ( http://www.oasismanagement.com/TECHNOLOGY/GLOSSARY/index.html ). Then we built a classifier using an instance-based learning algorithm. The details are discussed in the following sections. 2. CORE IDEA: 2.1 Instance-based learning approach: Instance-based learning methods such as k-nearest neighbor are straightforward approaches to approximating real-valued or discrete-valued target functions.
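The k-nearest-neighbors idea described above can be illustrated with a minimal, self-contained sketch. The report itself relies on Weka's implementation; the class below is our own illustration (names like KnnSketch are hypothetical), showing only the core steps: store the training vectors, then vote among the k closest ones at query time.

```java
import java.util.*;

// Minimal k-nearest-neighbor classifier over numeric feature vectors.
// Illustrative only -- the project uses Weka's implementation instead.
public class KnnSketch {

    // Euclidean distance between two equal-length vectors.
    static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Returns the majority label among the k training vectors nearest to query.
    static String classify(double[][] train, String[] labels, double[] query, int k) {
        Integer[] idx = new Integer[train.length];
        for (int i = 0; i < idx.length; i++) idx[i] = i;
        // Sort training indices by distance to the query instance.
        Arrays.sort(idx, Comparator.comparingDouble(i -> distance(train[i], query)));
        // Tally votes among the k nearest neighbors.
        Map<String, Integer> votes = new HashMap<>();
        for (int i = 0; i < k && i < idx.length; i++)
            votes.merge(labels[idx[i]], 1, Integer::sum);
        return Collections.max(votes.entrySet(), Map.Entry.comparingByValue()).getKey();
    }

    public static void main(String[] args) {
        // Toy two-feature instances: counts of the words "compiler" and "kernel".
        double[][] train = { {5, 0}, {4, 1}, {0, 6}, {1, 5} };
        String[] labels = { "programming language", "programming language",
                            "operating system", "operating system" };
        System.out.println(classify(train, labels, new double[]{4, 0}, 3));
        // prints "programming language"
    }
}
```

Note how "learning" here is nothing but storing the training arrays; all the work happens at classification time, which is the trade-off the discussion section returns to.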
Learning in this algorithm consists simply of storing the presented training data. When a new query instance is encountered, a set of similar instances is retrieved from memory and used to classify the new query. Since the Weka package implements almost all common machine learning algorithms in Java, including k-nearest neighbor, we use this package in our project. 2.2 Representation of HTML files: Since this project is for studying machine learning, we do not focus on document representation. We use the simplest method to represent an instance, i.e., an HTML file: we define a vector of words as the attributes of instances. The vector is extracted from an online technology dictionary and contains 2,901 words from the computer science domain. We define several categories of HTML files, such as 'artificial intelligence', 'programming language', 'operating system', 'database', 'graphics', and 'software engineering'. First, we collect a set of training data to train the classifier. Then we use the trained classifier to index all HTML files of a given website.
  • 3. 2.3 Classification: In our implementation, there is a keyword vector containing 2,901 words related to computer science. When a new HTML file comes in, our system first converts it into a text file by discarding all HTML tags and comments, as well as the HEAD part of the file. Then the system converts the text into an instance by calling the makeInstance() method. Finally, the trained classifier is called to classify the new instance using the k-nearest neighbors algorithm. 2.4 Indexing: There is a crawler in our system that crawls the directory tree of a given website's URL. When the crawler encounters an HTML file along the path, it calls the trained classifier to classify the file into the corresponding category. The crawler writes the pair of classification label and URL of the file into a TreeMap; we use a TreeMap to keep entries ordered for later ranking. After crawling, the crawler writes the TreeMap to a text file for user search. 3. MAIN COMPONENTS AND INTERACTION DIAGRAM: The project consists of the following components: (1) a command-line utility for indexing all HTML files into topics for a given home directory of a website — it crawls all subdirectories of the home directory automatically; (2) server-side CGI or Java servlets for replying to users' queries; (3) a user interface displayed in the browser. 3.1 Description of modules: 1. HTML file classifier: this module trains a classifier from scratch, updates the classifier with more training data, and classifies new documents. The function for converting an HTML file into a text file is also in this module. The Weka package is imported here, and its implementation of the k-nearest neighbors algorithm and other helper utilities are used. 2. Crawler/Indexer: this module crawls the directory tree of a given website to index all HTML files residing in the website. The crawler is a command-line utility used by the webmaster after updating the website.
This crawler takes the home URL as its starting point and loads the trained classifier; then it crawls all subdirectories using a breadth-first search strategy. Whenever it encounters a new HTML file, it classifies the file into the corresponding category and stores the pair of label and address in a map. After crawling, it writes the map to a text file for user search. 3. Server-side Searcher: this module replies with search results to users who submit a query. Since all HTML files have been indexed and the index information has been written to a text file, the server-side searcher just searches the index file, finds the matched records, and replies to the user. There are several server-side techniques for this, such as CGI and servlets.
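The crawler's breadth-first traversal can be sketched as below. The classifier call is stubbed out (in the system it is the trained Weka k-NN classifier), and the class and method names are illustrative, not the report's actual code; the map is keyed by file path here so that multiple files under the same label are not overwritten.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.Map;
import java.util.TreeMap;

// Sketch of the crawler's breadth-first traversal of a site's directory tree.
public class CrawlerSketch {

    // Placeholder for the trained classifier described in the report.
    static String classify(File htmlFile) {
        return "unclassified";
    }

    // Walks all subdirectories breadth-first, classifying each HTML file
    // and recording (path -> label) entries in a sorted map.
    static Map<String, String> crawl(File root) {
        Map<String, String> index = new TreeMap<>();
        Deque<File> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            File dir = queue.removeFirst();          // FIFO order = breadth-first
            File[] entries = dir.listFiles();
            if (entries == null) continue;           // not a readable directory
            for (File f : entries) {
                if (f.isDirectory()) queue.addLast(f);
                else if (f.getName().endsWith(".html"))
                    index.put(f.getPath(), classify(f));
            }
        }
        return index;
    }
}
```

After the traversal, writing the map's entries line by line yields the index file that the server-side searcher scans.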
  • 4. 3.2 Interaction diagram: the three modules communicate through two files on disk. The HTML classifier module (1. builds the classifier; 2. updates the classifier; 3. classifies new documents; 4. converts files) stores the trained classifier on disk. The crawler (1. loads the classifier from disk; 2. crawls along the directory tree; 3. classifies encountered files) writes the indexed information to a file on disk. The searcher (1. accepts the user's query; 2. searches the indexed file; 3. replies with results) reads the indexed file and serves the user's browser.
  • 5. 4. IMPLEMENTATION: In this section, we list all Java classes used in this project. The following classes are used to convert an HTML file into a text file: 1. public interface HTMLContent. 2. public class HTMLContentList: extends ArrayList. 3. public class HTMLTag: stores a name and an optional attribute list. 4. public class HTMLText: stores the text of an HTML file. 5. public class HTMLToken: stores tokens of an HTML file. 6. public class HTMLTokenizer: parses the HTML file into tokens. 7. public class HTMLTokenList: extends ArrayList. 8. public class Parser: takes an HTMLTokenList as input and converts it into an HTMLContentList. 9. public class HTMLAttribute: stores an attribute of an HTML tag. 10. public class HTMLAttributeList: extends ArrayList to store all attributes of a tag. The following classes are used for indexing, classifying, and searching: 11. public class HTMLIndex: extends HashMap and implements two methods: (1) addString(), which takes class-label and title/filename arguments and creates a mapping between each label and the respective file; (2) writeFile(), which streams the index content to a file. 12. public class HTMLIndexer: a command-line utility that traverses the directories from a given root path. 13. public class HTMLClassifier: the k-nearest neighbors classifier, implementing these methods: (1) HTMLClassifier(), a constructor to build the classifier from scratch or load it from a file; (2) updateModel(), to train the classifier on training data; (3) classifyMessage(), to classify a new instance; (4) makeInstance(), to make a new instance; (5) htmlToText(), to convert an HTML file into a text file. 14. public interface Searcher: a search engine that returns the matched records in the index file. 15. public class HTMLSearch: implements the interface Searcher. 16. public class SearchServlet: wraps the Searcher with an appropriate interface to handle a POST request with a string argument named 'search'; the result is returned on the output stream.
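The HTMLIndex idea (item 11) can be sketched roughly as follows. The report's class extends HashMap and writes the index with a writeFile() method; this stand-in maps each topic label to the list of files classified under it and streams a simple "label&lt;TAB&gt;file" line format, which is our assumption — the report does not specify the on-disk layout.

```java
import java.io.PrintWriter;
import java.io.StringWriter;
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Rough stand-in for HTMLIndex: label -> files classified under it.
public class IndexSketch extends TreeMap<String, List<String>> {

    // Record that 'file' was classified under 'label' (cf. addString()).
    public void addString(String label, String file) {
        computeIfAbsent(label, k -> new ArrayList<>()).add(file);
    }

    // Stream one "label<TAB>file" line per entry for the searcher to scan
    // (cf. writeFile(); the exact format is an assumption).
    public void writeIndex(PrintWriter out) {
        for (Map.Entry<String, List<String>> entry : entrySet())
            for (String file : entry.getValue())
                out.println(entry.getKey() + "\t" + file);
    }

    public static void main(String[] args) {
        IndexSketch index = new IndexSketch();
        index.addString("database", "/docs/sql-intro.html");
        index.addString("database", "/docs/normal-forms.html");
        StringWriter sw = new StringWriter();
        index.writeIndex(new PrintWriter(sw, true));
        System.out.print(sw);
    }
}
```

A flat text index like this is trivial for the server-side searcher to grep through, at the cost of re-reading the whole file per query.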
  • 6. 5. SAMPLE RESULTS: Our implementation combines a keyword search engine and a topic-specific search engine. The prototype was tested only on our own website. The user interface is a text field and two submit buttons (see Figure 1): one button labeled 'keyword' and another labeled 'topic'. Figure 1. When a user wants to search for relevant documents by keyword, he just types the keywords in the text field and clicks the button labeled 'keyword' (see Figure 1). If any documents match the keywords, the matched document names are returned as hyperlinks, along with the number of hits of the keywords in the corresponding documents (see Figure 2). If no document matches, the result is 'no pages found'.
  • 7. Figure 2. When a user wants to search for relevant documents by topic, he just types the topic in the text field and clicks the button labeled 'topic' (see Figure 1). This implementation accepts only the following topic searches: 'artificial intelligence', 'programming language', 'operating system', 'database', 'graphics', and 'software engineering'. If any documents match the label, the matched document names are returned as hyperlinks, along with the number of hits of the keywords in the corresponding documents (see Figure 3). If no document matches, the result is 'no pages found'. Figure 3. 6. DISCUSSION: In this project, we implemented a topic-based search engine for a given website. The key point of such a search engine is to train a classifier that classifies HTML files into the corresponding categories. We used the k-nearest neighbors algorithm, as implemented in the WEKA machine learning package, to train a classifier for files related to computer science. Since this is a course project focused on machine learning, the document representation and the collection of training data are quite simple. There are several open problems that could be addressed further: (1) The k-nearest neighbors algorithm needs to store the training data somewhere. When a new instance comes in, the k nearest neighbors are retrieved and compared to
  • 8. decide the classification of the new instance. Obviously, if there are many training examples, this is not efficient, so in the future we may develop a more efficient classifier using another machine learning algorithm. (2) We represent documents using a vector extracted from an online technology dictionary. This is a fairly simple representation and can be improved. (3) Our implementation has no ability to rank the relevant pages for users in topic search. In keyword search, we simply count the hits of the keywords in each relevant page, but in topic search we did not come up with any ranking strategy. However, such a ranking strategy is desirable in a search engine.
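The hit-count ranking used in keyword search can be sketched as below; the page texts, tokenization, and class name are illustrative assumptions, not the report's code. Pages are sorted by descending keyword hit count, and pages with no hits are dropped (the 'no pages found' case when the result list is empty).

```java
import java.util.*;

// Sketch of ranking pages by how often the query keywords occur in them.
public class HitRank {

    // Count occurrences of any query keyword in the page text.
    static int hits(String text, List<String> keywords) {
        int count = 0;
        for (String token : text.toLowerCase().split("[^a-z0-9]+"))
            if (keywords.contains(token)) count++;
        return count;
    }

    // Returns page names sorted by descending hit count, dropping
    // pages with no hits at all.
    static List<String> rank(Map<String, String> pages, List<String> keywords) {
        List<String> result = new ArrayList<>();
        for (Map.Entry<String, String> e : pages.entrySet())
            if (hits(e.getValue(), keywords) > 0) result.add(e.getKey());
        result.sort(Comparator.comparingInt(
                (String name) -> hits(pages.get(name), keywords)).reversed());
        return result;
    }

    public static void main(String[] args) {
        Map<String, String> pages = new LinkedHashMap<>();
        pages.put("a.html", "database database query");
        pages.put("b.html", "one database mention");
        pages.put("c.html", "nothing relevant");
        System.out.println(rank(pages, List.of("database")));
        // prints [a.html, b.html]
    }
}
```

For topic search, a comparable ranking signal would have to come from the classifier itself (e.g. the distance to the nearest neighbors), which is one way the open problem in (3) might be approached.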