CSCI 6505 Project: Construct a search engine using an ML approach
CSCI 6505 Course Project Report
Student Name: Yuan An
Email: yuana@cs.dal.ca
Student Name: Suihong Liang
Email: abacus@cs.dal.ca
Date: 4 Dec 2000
Topic: Construct a topic-based search engine using a
machine learning approach (instance-based learning
method) for a given website.
1. DESCRIPTION OF PROBLEM:
Many organizations and individuals maintain their own websites to publish information. As
the number of files on a website grows, it becomes convenient to index all the files so
that they can be searched. Commercial search engines such as Yahoo! and Google let users
search for related web pages across websites around the world. It is also useful for a
single website to offer search over its own pages. The simplest technique for indexing
HTML files is to count the number of times given keywords occur in an HTML file, or to
look in the HEAD part of the HTML file for related information. We instead use a machine
learning approach to classify each HTML file into a topic; specifically, we use an
instance-based learning algorithm, k-nearest neighbors, to do this work.
Our original intention was to classify HTML files related to computer science into their
most closely related topics, and then to index the HTML files of a given website so that
the site offers topic-based search. Because of the limited time available for this
project, we did not use the classifier to index any specific website. For the experiment,
we downloaded HTML files related to the following topics: 'artificial intelligent',
'programming language', 'operating system', 'database', 'graphics', and 'software
engineering', and used them as training data and test data. We extracted a vocabulary of
2,901 words related to computer science from an online technology dictionary
(http://www.oasismanagement.com/TECHNOLOGY/GLOSSARY/index.html).
We then built a classifier using an instance-based learning algorithm. The details are
discussed in the following sections.
2. CORE IDEA:
2.1 Instance-based learning approach:
Instance-based learning methods such as k-nearest neighbor are straightforward
approaches to approximating real-valued or discrete-valued target functions. Learning in
this algorithm consists simply of storing the presented training data. When a new query
instance is encountered, a set of similar instances is retrieved from memory and
used to classify the new query instance. The Weka package implements many machine
learning algorithms in Java, including k-nearest neighbor, so we use this package to
implement our project.
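The store-then-vote idea described above can be sketched in plain Java. This is a minimal illustration only, not the Weka classifier used in the project: the class name KnnSketch, the Euclidean distance measure, and the tie-breaking rule are all assumptions made for the example.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical illustration of k-nearest neighbors; not the project's Weka-based classifier.
class KnnSketch {
    private final List<double[]> vectors = new ArrayList<>();
    private final List<String> labels = new ArrayList<>();

    // "Learning" consists only of storing the training instance.
    public void add(double[] vector, String label) {
        vectors.add(vector.clone());
        labels.add(label);
    }

    // Euclidean distance between two vectors of equal length (assumed distance measure).
    public static double distance(double[] a, double[] b) {
        double sum = 0.0;
        for (int i = 0; i < a.length; i++) {
            double d = a[i] - b[i];
            sum += d * d;
        }
        return Math.sqrt(sum);
    }

    // Majority vote among the k stored instances closest to the query.
    // Ties go to the label that first reaches the top count when scanning
    // neighbors from nearest to farthest (an assumption of this sketch).
    public String classify(double[] query, int k) {
        if (k < 1) throw new IllegalArgumentException("k must be at least 1");
        if (vectors.isEmpty()) throw new IllegalStateException("no training data");
        List<Integer> order = new ArrayList<>();
        for (int i = 0; i < vectors.size(); i++) order.add(i);
        order.sort(Comparator.comparingDouble(i -> distance(query, vectors.get(i))));

        Map<String, Integer> votes = new HashMap<>();
        String best = null;
        int bestCount = 0;
        int limit = Math.min(k, order.size());
        for (int n = 0; n < limit; n++) {
            String label = labels.get(order.get(n));
            int count = votes.merge(label, 1, Integer::sum);
            if (count > bestCount) {
                bestCount = count;
                best = label;
            }
        }
        return best;
    }
}
```

Note that every call to classify() compares the query against all stored instances, which is why the cost of a query grows with the size of the training set.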
2.2 Representation of HTML files:
Since the purpose of our project is to study machine learning, we do not focus on the
representation of documents. We use the simplest method to represent an instance, i.e.,
an HTML file: we define a vector of words as the attributes of each instance. The vector
is extracted from an online technology dictionary and contains 2,901 words related to the
computer science domain. We define several categories of HTML files: 'artificial
intelligent', 'programming language', 'operating system', 'database', 'graphics', and
'software engineering'. First, we collect a set of training data to train the classifier.
Then, we use the trained classifier to index all HTML files of a given website.
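One possible way to turn a document into such a word vector is sketched below. The report does not say whether each attribute is a count or a presence flag, nor how text is tokenized, so the class VocabularyVectorSketch, the use of occurrence counts, and the rule of lowercasing and splitting on non-alphanumeric characters are assumptions; multi-word dictionary entries are not handled.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch: maps text to a count vector over a fixed vocabulary.
class VocabularyVectorSketch {
    // Vocabulary word -> position of its attribute in the vector.
    private final Map<String, Integer> indexOf = new LinkedHashMap<>();

    public VocabularyVectorSketch(List<String> vocabulary) {
        for (String word : vocabulary) {
            // Duplicate words keep the position assigned at first occurrence.
            indexOf.putIfAbsent(word.toLowerCase(Locale.ROOT), indexOf.size());
        }
    }

    public int size() {
        return indexOf.size();
    }

    // Counts how often each vocabulary word occurs in the text (assumed weighting).
    public double[] toVector(String text) {
        double[] vector = new double[indexOf.size()];
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            Integer position = indexOf.get(token);
            if (position != null) {
                vector[position]++;
            }
        }
        return vector;
    }
}
```

With the full dictionary, each document would become a vector of 2,901 attributes, most of them zero for any single page.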
2.3 Classification:
Our implementation uses a keyword vector containing 2,901 words related to computer
science. When a new HTML file comes in, the system first converts it into a text file
by discarding all HTML tags and comments as well as the HEAD part of the file. The system
then turns the result into an instance by calling the makeInstance() method. Finally, the
trained classifier is called to classify the new instance using the k-nearest neighbors
algorithm.
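The HTML-to-text step can be approximated with a few regular expressions, as in the sketch below. The project itself uses its own tokenizer and parser classes, so this regex-based version, the class name HtmlToTextSketch, and the whitespace normalization are assumptions made for illustration; character entities such as &amp;amp; are left undecoded.

```java
import java.util.regex.Pattern;

// Hypothetical regex-based sketch; the project uses a tokenizer and parser instead.
class HtmlToTextSketch {
    private static final Pattern COMMENT = Pattern.compile("<!--.*?-->", Pattern.DOTALL);
    private static final Pattern HEAD =
            Pattern.compile("<head\\b.*?</head\\s*>", Pattern.DOTALL | Pattern.CASE_INSENSITIVE);
    private static final Pattern TAG = Pattern.compile("<[^>]*>");
    private static final Pattern SPACE = Pattern.compile("\\s+");

    // Drops comments, the HEAD part, and all remaining tags, then collapses whitespace.
    public static String htmlToText(String html) {
        String s = COMMENT.matcher(html).replaceAll(" ");
        s = HEAD.matcher(s).replaceAll(" ");
        s = TAG.matcher(s).replaceAll(" ");
        return SPACE.matcher(s).replaceAll(" ").trim();
    }
}
```

Comments are removed first so that a commented-out tag cannot confuse the later steps.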
2.4 Indexing:
Our system includes a crawler that crawls the directory tree under a given website's
URL. Whenever the crawler encounters an HTML file along the path, it calls the trained
classifier to assign the file to a category. The crawler writes the pair of
classification label and URL of the file into a TreeMap; we use a TreeMap so that rank
information can be added later. After crawling, the crawler writes the TreeMap to a text
file, which is then used to answer users' searches.
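A minimal sketch of such an index follows. It is not the project's HTMLIndex class: the name TopicIndexSketch, the choice to keep a list of URLs per label, and the tab-separated "label, URL" line format of the text file are assumptions, since the report does not specify the file layout.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of a label -> URLs index backed by a TreeMap.
class TopicIndexSketch {
    private final TreeMap<String, List<String>> urlsByLabel = new TreeMap<>();

    // Records that the file at the given URL was classified under the given label.
    public void add(String label, String url) {
        urlsByLabel.computeIfAbsent(label, key -> new ArrayList<>()).add(url);
    }

    // Returns the URLs stored for a label, or an empty list if there are none.
    public List<String> lookup(String label) {
        List<String> urls = urlsByLabel.get(label);
        return urls == null ? Collections.<String>emptyList() : Collections.unmodifiableList(urls);
    }

    // One "label<TAB>url" line per indexed file (assumed format);
    // TreeMap iteration keeps the labels in sorted order.
    public List<String> toLines() {
        List<String> lines = new ArrayList<>();
        for (Map.Entry<String, List<String>> entry : urlsByLabel.entrySet()) {
            for (String url : entry.getValue()) {
                lines.add(entry.getKey() + "\t" + url);
            }
        }
        return lines;
    }

    // Writes the index to a text file for the searcher to read later.
    public void writeFile(Path path) {
        try {
            Files.write(path, toLines(), StandardCharsets.UTF_8);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```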
3. MAIN COMPONENTS AND INTERACTIVE DIAGRAM:
The project consists of the following components: (1) a command-line utility that indexes
all HTML files under a given home directory of a website into various topics; it crawls
all subdirectories of the home directory automatically; (2) server-side CGI or Java
Servlets for answering users' queries; (3) a user interface displayed in the browser.
3.1 Description of modules:
1. HTML file classifier:
This module is used to train a classifier from scratch, update the classifier with more
training data, and classify new documents. The function for converting an HTML file into
a text file is also in this module. The Weka package is imported in this part; we use its
implementation of the k-nearest neighbors algorithm and other helper utilities.
2. Crawler or Indexer:
This module crawls the directory tree of a given website to index all HTML files that
reside on the website. The crawler is a command-line utility that the webmaster runs
after updating the website. It takes the home URL as its starting point, loads the
trained classifier, and then crawls all subdirectories using a breadth-first search
strategy. Whenever it encounters a new HTML file, it classifies the file into the
corresponding category and stores the pair of label and address in a map. After crawling,
it writes the map to a text file used for users' searches.
3. Server-side Searcher:
This module returns search results to users who submit a query. Since all HTML files have
been indexed and the index information has been written to a text file, the server-side
searcher simply searches the index file, finds the matching records, and returns them to
the user. Several server-side techniques can be used for this, such as CGI and servlets.
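The breadth-first traversal performed by the crawler module can be sketched as follows. This version walks a local directory tree with java.io.File; the class name BfsCrawlerSketch, the .html/.htm extension test, and the sorting of directory entries for a repeatable order are assumptions, and the classification call is omitted. Symbolic links that form cycles are not handled.

```java
import java.io.File;
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Deque;
import java.util.List;
import java.util.Locale;

// Hypothetical sketch of a breadth-first walk over a website's directory tree.
class BfsCrawlerSketch {
    // Returns the HTML files under root, shallower directories first.
    public static List<File> findHtmlFiles(File root) {
        List<File> found = new ArrayList<>();
        Deque<File> queue = new ArrayDeque<>();
        queue.add(root);
        while (!queue.isEmpty()) {
            File dir = queue.remove();           // FIFO order gives breadth-first search
            File[] entries = dir.listFiles();
            if (entries == null) {
                continue;                        // not a directory, or not readable
            }
            Arrays.sort(entries);                // listFiles() order is unspecified
            for (File entry : entries) {
                if (entry.isDirectory()) {
                    queue.add(entry);            // visit after the current level
                } else if (isHtml(entry.getName())) {
                    found.add(entry);            // a real crawler would classify here
                }
            }
        }
        return found;
    }

    // Assumed rule for recognizing HTML files by extension.
    public static boolean isHtml(String name) {
        String lower = name.toLowerCase(Locale.ROOT);
        return lower.endsWith(".html") || lower.endsWith(".htm");
    }
}
```

A queue is what makes the walk breadth-first; replacing it with a stack would give a depth-first crawl instead.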
3.2 Interactive diagram:
HTML classifier module:
  1. Build classifier.
  2. Update classifier.
  3. Classify new document.
  4. Transfer files.
Classifier stored on disk (between the HTML classifier module and the crawler).
Crawler:
  1. Load classifier from disk.
  2. Crawl along the directory tree.
  3. Classify encountered files.
Indexed information file on disk (between the crawler and the searcher).
Searcher:
  1. Accepts the user's query.
  2. Searches the indexed file.
  3. Replies with results.
User's browser (connected to the searcher).
4. IMPLEMENTATION:
In this section, we list all Java classes used in this project.
The following classes are used to convert an HTML file into a text file:
1. public interface HTMLContent.
2. public class HTMLContentList: extends ArrayList.
3. public class HTMLTag: stores a name and an optional attribute list.
4. public class HTMLText: stores the text of an HTML file.
5. public class HTMLToken: stores the tokens of an HTML file.
6. public class HTMLTokenizer: parses the HTML file into tokens.
7. public class HTMLTokenList: extends ArrayList.
8. public class Parser: takes an HTMLTokenList as input and converts it into an
HTMLContentList.
9. public class HTMLAttribute: stores an attribute of an HTML tag.
10. public class HTMLAttributeList: extends ArrayList to store all attributes of a tag.
The following classes are used for indexing, classifying and searching:
11. public class HTMLIndex: extends HashMap and implements two methods: (1)
addString(), which takes class label and title/filename arguments and creates a mapping
between each label and the respective file, (2) writeFile(), which streams the index
content to a file.
12. public class HTMLIndexer: a command-line utility that traverses the directories from
a given root path.
13. public class HTMLClassifier: a k-nearest neighbors classifier implementing the
following methods: (1) HTMLClassifier(), a constructor that builds the classifier from
scratch or loads it from a file, (2) updateModel(), which trains the classifier using
training data, (3) classifyMessage(), which classifies a new instance, (4)
makeInstance(), which makes a new instance, (5) htmlToText(), which converts an HTML
file into a text file.
14. public interface Searcher: a search engine that returns the matching records in the
index file.
15. public class HTMLSearch: implements the interface Searcher.
16. public class SearchServlet: wraps the Searcher with an appropriate interface to handle
a POST request with a string argument named 'search'. The result is returned on the
output stream.
5. SAMPLE RESULTS:
Our implementation combines a keyword search engine and a topic-specific search engine.
The prototype has only been tested on our own website. The user interface consists of a
text field and two submit buttons (see Figure 1): one labeled 'keyword' and the other
labeled 'topic'.
Figure 1
When a user wants to search for relevant documents by keyword, he or she types the
keywords in the text field and clicks the button labeled 'keyword' (see Figure 1). If
any documents match the keywords, the names of the matching documents are returned as
hyperlinks, together with the number of times the keywords occur in each document (see
Figure 2). If no document matches, the result is 'no pages found'.
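Counting keyword hits per document might look like the sketch below. It is an illustration under stated assumptions, not the project's HTMLSearch class: the name KeywordSearchSketch, the whole-word and case-insensitive matching, the single-word keyword, and the ordering of results by descending hit count are all choices made for the example.

```java
import java.util.AbstractMap;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.Map;

// Hypothetical sketch of keyword search with hit counts.
class KeywordSearchSketch {
    // Number of whole-word, case-insensitive occurrences of a single keyword.
    public static int countHits(String text, String keyword) {
        String needle = keyword.toLowerCase(Locale.ROOT);
        int hits = 0;
        for (String token : text.toLowerCase(Locale.ROOT).split("[^a-z0-9]+")) {
            if (token.equals(needle)) {
                hits++;
            }
        }
        return hits;
    }

    // pages maps a document name to its text. Returns "name (hits)" entries,
    // most hits first; an empty list corresponds to 'no pages found'.
    public static List<String> search(Map<String, String> pages, String keyword) {
        List<Map.Entry<String, Integer>> matches = new ArrayList<>();
        for (Map.Entry<String, String> page : pages.entrySet()) {
            int hits = countHits(page.getValue(), keyword);
            if (hits > 0) {
                matches.add(new AbstractMap.SimpleEntry<String, Integer>(page.getKey(), hits));
            }
        }
        // Stable sort: documents with equal hit counts keep their input order.
        matches.sort((a, b) -> Integer.compare(b.getValue(), a.getValue()));
        List<String> results = new ArrayList<>();
        for (Map.Entry<String, Integer> match : matches) {
            results.add(match.getKey() + " (" + match.getValue() + ")");
        }
        return results;
    }
}
```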
Figure 2
When a user wants to search for relevant documents by topic, he or she types the topic
in the text field and clicks the button labeled 'topic' (see Figure 1). This
implementation accepts only the following topics: 'artificial intelligent',
'programming language', 'operating system', 'database', 'graphics', 'software engineering'.
If any documents have a matching label, the names of the matching documents are returned
as hyperlinks, together with the number of times the keywords occur in each document
(see Figure 3). If no document matches, the result is 'no pages found'.
Figure 3
6. DISCUSSION:
In this project, we implemented a topic-based search engine for a given website. The key
point of such a search engine is to train a classifier that classifies HTML files into
their corresponding categories. We used the k-nearest neighbors algorithm as implemented
in WEKA, a package dedicated to machine learning algorithms, to train a classifier for
files related to computer science. Since this is a course project focused on machine
learning, the representation of documents and the collection of training data are kept
quite simple. Several open problems remain that could be addressed further:
(1) The k-nearest neighbors algorithm needs to store the training data. When a new
instance comes in, its k nearest neighbors are retrieved and compared to decide the
classification of the new instance. With a large amount of training data, this becomes
inefficient, because each query is compared against the stored instances. In the future,
we may therefore develop a more efficient classifier using a more efficient machine
learning algorithm.
(2) We represent the documents using a vector extracted from an online technology
dictionary. This is a fairly simple representation that can be improved.
(3) Our implementation cannot rank the relevant pages returned by a topic search. In
keyword search, we simply count the occurrences of the keywords in each relevant page,
but for topic search we did not come up with any ranking strategy. Such a ranking
strategy is nevertheless desirable in a search engine.