Web Recommender Project Final Report
Wei Chen, Yue (Jenny) Cui
Motivation
People use the web to browse information. One problem is that there is too much information
on the web, and it usually takes time to find what one is looking for. It would therefore be
helpful to make the web-browsing experience more convenient, fast, and accurate. An existing
solution to this problem is the search engine: in a typical scenario, a user types in a query, and
the search engine returns relevant pages. Using a search engine to retrieve relevant pages is
not fully automatic; it requires effort from the user to formulate and type in a query.
Our goal is to develop a tool to automatically generate queries for the user when s/he is
reading a web page. Then we can use this query to recommend relevant web pages to the user.
Problem Statement
What is a Web Recommender?
A web recommender is a web-browsing tool which recommends relevant web pages to the user
while s/he is reading a page.
Why is it important?
A web recommender provides a convenient way to browse the web: it automatically
recommends relevant information, so the user spends less effort formulating and typing
queries. At the same time, it retains the benefits of state-of-the-art search engines.
Why is it hard?
Making queries from a web page is a keyword summarization problem, which is still an active
research topic. Also, search engines are not perfect: they can return dead links and irrelevant
pages. Furthermore, it is often hard to define what it means to be relevant, since that depends
on the reading goal. All of these are issues in web recommendation, and we do not attempt to
conquer all of them. In this particular project, we focus on the first issue: extracting queries
from a web page.
Link to Vision Statement
Goals for this project (solution)
We have three goals for this project:
(1) Provide a software framework for Web Recommendation
(2) Provide basic recommendation algorithms
(3) Propose an evaluation prototype
The first goal defines the basic functionality of the software. The second goal provides three
kinds of services: it offers a basic service to the web recommender, it offers baselines for
future research on web recommendation, and it can serve as a tutorial for teaching people
how to develop their own algorithms on top of our software framework.
Link to Vision Statement
Link to Domain Model
Requirements
Functional Requirements
(1) Given a web page as input, the system should be able to find a list of relevant web pages.
(2) The system should provide three recommendation algorithms.
a. Baseline algorithm: uses simple string processing techniques
b. HTML-Structure-based algorithm: uses HTML structure features
c. Semantics-based algorithm: uses NLP techniques (named entity recognizer) to
extract features
(3) The system should provide a simple GUI for evaluation.
Non-functional Requirements
Recommendation results should be retrievable within 5 seconds.
Link to Requirement Analysis
Design
Our design has three components: a general software framework design, algorithm design, and
evaluation task design.
Software Framework Design
Class Diagram:
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/ClassDiagramFinal
The main algorithm of WebRecommender is implemented in the method recommend(). The util
package provides tools for HTML parsing, basic text processing and NLP tools that are needed
for the recommendation algorithm. QueryFilter is used for key-term selection.
QueryFormulator can be used for combining multiple queries.
The sequence diagram illustrates an example message flow:
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/SequenceDiagram
Algorithm Design
We designed three algorithms: baseline algorithm, HTML-structure-based algorithm and
semantics-based algorithm. The algorithms are described below.
Baseline Algorithm
1. Strip off HTML tags (e.g. </html>)
2. Remove non-word tokens (e.g. “/**/”)
3. Remove stop words (e.g. “the”)
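The three steps above can be sketched as a short function. This is a minimal sketch, not the project's actual code: the function name, the regular expressions, and the small stop-word list are our own assumptions (the real system presumably uses a larger stop-word list).

```python
import re

# Small illustrative stop-word list; the real system presumably uses a larger one.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

def baseline_query(html, k=6):
    """Sketch of the baseline algorithm: strip tags, drop non-word
    tokens and stop words, keep the first k remaining terms."""
    text = re.sub(r"<[^>]+>", " ", html)                          # 1. strip HTML tags
    tokens = text.split()
    words = [t for t in tokens if re.fullmatch(r"[A-Za-z]+", t)]  # 2. remove non-word tokens
    query = [w for w in words if w.lower() not in STOP_WORDS]     # 3. remove stop words
    return query[:k]
```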
HTML Structure-based Algorithm
1. Parse HTML page
2. Extract text content from node <title> and <a>
3. Remove stop words (e.g. “the”)
4. Select the 10 most frequent words
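A sketch of these four steps, using the standard-library html.parser module as a stand-in for whatever HTML parser the util package actually provides (class and function names are our own). Note that counting is case-sensitive here, which is consistent with Table 1, where both "entropy" and "Entropy" appear in the output.

```python
import re
from collections import Counter
from html.parser import HTMLParser

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on"}

class TitleAnchorExtractor(HTMLParser):
    """Collects the text that appears inside <title> and <a> nodes."""
    def __init__(self):
        super().__init__()
        self.depth = 0          # > 0 while inside a <title> or <a> element
        self.text = []
    def handle_starttag(self, tag, attrs):
        if tag in ("title", "a"):
            self.depth += 1
    def handle_endtag(self, tag):
        if tag in ("title", "a") and self.depth > 0:
            self.depth -= 1
    def handle_data(self, data):
        if self.depth > 0:
            self.text.append(data)

def structure_query(html, k=10):
    parser = TitleAnchorExtractor()
    parser.feed(parser.unescape(html) if False else html)  # 1-2. parse, take <title>/<a> text
    words = re.findall(r"[A-Za-z]+", " ".join(parser.text))
    words = [w for w in words if w.lower() not in STOP_WORDS]    # 3. remove stop words
    return [w for w, _ in Counter(words).most_common(k)]         # 4. k most frequent words
```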
Semantics-based Algorithm
1. Strip off HTML tags (e.g. </html>)
2. Tag the page using Stanford named entity tagger
3. Remove non-word tokens (e.g. “/**/”)
4. Remove stop words (e.g. “the”)
5. Select named entities with highest frequency (top 5)
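A sketch of the pipeline above. The real system uses the Stanford named entity tagger; since that is an external Java tool, the tagger here is a deliberately naive stand-in (capitalized tokens only), and the function names are our own. The point is the shape of the pipeline, not the quality of the tagging.

```python
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "of", "to", "in", "is"}

def toy_ner(tokens):
    """Stand-in for the Stanford named entity tagger: naively treats
    capitalized tokens as entity mentions. Illustration only."""
    return [t for t in tokens if t[:1].isupper()]

def semantic_query(html, k=5):
    text = re.sub(r"<[^>]+>", " ", html)                       # 1. strip HTML tags
    tokens = re.findall(r"[A-Za-z]+", text)                    # 3. remove non-word tokens
    tokens = [t for t in tokens if t.lower() not in STOP_WORDS]  # 4. remove stop words
    entities = toy_ner(tokens)                                 # 2. tag named entities (stub)
    return [e for e, _ in Counter(entities).most_common(k)]    # 5. top-k entities by frequency
```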
Example Query Comparison
Input page: http://en.wikipedia.org/wiki/Entropy
Table 1. Example query comparison
Algorithm        Output Query
Baseline         Entropy, free, encyclopedia, Jump, search, article
HTML-Structure   ISBN, edit, entropy, thermodynamics, Entropy, energy, system, law, heat, thermodynamic
Semantic         ISBN, University, Press, Boltzmann, John
Evaluation Design
Evaluation Form
We designed an evaluation form which consists of three fields: input page, recommended page,
and relevance score. We ask our evaluators to score each recommended page. The relevance
score has two values: 1 means "relevant" and 0 means "irrelevant". The form also contains two
fields that are hidden from the evaluator: the algorithm that produced the recommended page
and the rank of the page. These two fields are used only in the analysis.
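One way to capture this form in code is a small record type that separates the evaluator-visible fields from the hidden analysis fields. This is a sketch, not the project's implementation; the class and field names are our own.

```python
from dataclasses import dataclass, asdict

# Fields an evaluator is allowed to see; `algorithm` and `rank` stay hidden.
VISIBLE_FIELDS = ("input_page", "recommended_page", "relevance")

@dataclass
class EvaluationRecord:
    """One row of the evaluation form."""
    input_page: str
    recommended_page: str
    relevance: int           # 1 = relevant, 0 = irrelevant
    algorithm: str = ""      # hidden: which algorithm produced the page
    rank: int = 0            # hidden: position in the recommendation list

def evaluator_view(record):
    """Project a record down to the fields an evaluator may see."""
    return {k: v for k, v in asdict(record).items() if k in VISIBLE_FIELDS}
```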
Evaluation Criteria
We used the modified Average Precision to aggregate relevance scores. The standard average
precision is calculated as the sum of precision at each position divided by the total number of
relevant pages. In our modified version, we replace the number of relevant pages in the
denominator with the total number of retrieved pages.
ModifiedAveP = ( sum over r = 1..N of P(r) * rel(r) ) / N

where P(r) is the precision at rank r, rel(r) is 1 if the page at rank r is relevant (0 otherwise),
and N is the total number of retrieved pages.
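The metric can be computed with a short function over a ranked list of 0/1 relevance judgments (a minimal sketch; the function name is ours):

```python
def modified_average_precision(relevance):
    """Modified average precision over a ranked list of 0/1 judgments:
    the sum of precision-at-r over relevant positions, divided by the
    number of retrieved pages N (standard AP divides by the number of
    relevant pages instead)."""
    n = len(relevance)
    total = 0.0
    hits = 0
    for r, rel in enumerate(relevance, start=1):
        if rel:
            hits += 1
            total += hits / r        # P(r), counted only where rel(r) = 1
    return total / n if n else 0.0
```

For example, the ranking [1, 0, 1, 0, 0] gives (1/1 + 2/3) / 5 = 1/3, whereas standard AP would divide the same sum by 2 relevant pages.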
An example of the calculation of modified average precision is shown in our final project
presentation:
link to Final Presentation
Test Data Selection
Our criterion for test data selection is that it has to span multiple dimensions. The dimensions
we considered include:
1. Popular vs. Unpopular (e.g., “Harry Potter” vs. “Wei Chen”)
2. Ambiguous vs. Unambiguous (e.g., “Entropy” vs. “Sushi”)
3. New vs. Old (e.g., “Waterboarding” vs. “Entropy”)
4. Procedural vs. Conceptual (e.g., “How to” vs. “Entropy”)
5. Technological vs. Mass media (e.g., “Entropy” vs. “Harry Potter”)
Based on the test data selection criteria, we selected 5 input pages from 5 topics:
1. “Harry Potter” http://en.wikipedia.org/wiki/Harry_potter
2. “Waterboarding” http://en.wikipedia.org/wiki/Waterboarding
3. “Wei Chen@CMU homepage” http://www.cs.cmu.edu/~weichen/
4. “Entropy (thermodynamics)” http://en.wikipedia.org/wiki/Entropy
5. “How to make Sushi” http://www.wikihow.com/Make-Sushi
Evaluation GUI
Link to GUI Demo
Our evaluation GUI is composed of three functional areas: the top panel, where the user types
in the URL of the input web page; the left panel, where the URLs of the recommended web
pages are displayed; and the content panel, which displays the web page the user selects. The
top panel includes an Internet address bar and the recommend button. When the user types in
the URL of a web page and presses the Enter key, the input web page is shown in the large
content panel. When the user clicks the recommend button, the URLs of the recommended
web pages are displayed in the left panel of the GUI.
Evaluation and Results
One important question we want to answer in this project is how well each of our algorithms
performs. We therefore need to design an experiment that measures user satisfaction fairly.
Our first hypothesis is that the performance of our algorithms will differ across different kinds
of topics, although at the design stage we were not sure how large the variation would be.
Our second hypothesis is that users will disagree on how useful the recommended web pages
are, because when a user changes his goal, his evaluation criteria change with it. To avoid
non-standard criteria, we limit our evaluation criterion to the relevance of the recommended
pages, and in our ReadMe file we specify the definition of relevance for each of the topics. By
doing this we believe we can measure user satisfaction with each of our algorithms.
Experimental Design
link to example evaluation form
We have three algorithms: the baseline, semantic, and structure algorithms. We chose 5 topics
for our experiment. It is important that the web pages our algorithms recommend contain the
information the user needs, but it is equally important that they appear at the top of the list of
recommended pages. Combining each algorithm with the 5 topics gives a total of 15 categories
(e.g., (baseline, topic1) is one category). We use the top five recommended web pages from
each algorithm, so each rater evaluates 75 recommended web pages in total. Whenever a rater
thinks a recommended web page is relevant, he or she enters a one in the score column of the
evaluation form.
Participants
We have a total of 5 participants: three female and two male. All of the raters hold at least a
master's degree in computer science. One of them is a native English speaker; the other four
are not.
Experimental procedure
All the raters read the Readme file which gives out the definition of relevancy for each topic
before they do the evaluation.
Results
Link to a presentation of evaluation results
Our results show that our first hypothesis is correct: topics that are popular and have more
resources on the web receive better scores. The topic “Harry Potter” has the highest relevance
score, and all three of our algorithms recommended satisfying web pages. We think the reason
is that there are so many web pages about Harry Potter that it is easy to find relevant ones.
The topic “Waterboarding” has the highest number of invalid web pages. We think the reason
is that waterboarding is a typical news topic: most of the time there are few web pages about
it, but once it becomes a news headline, many resources are added to the web, and after it
leaves the headlines, many of those resources are probably deleted again. That could cause
the invalid links. The topic “How to make Sushi” has the lowest relevance score. We think the
reason is that it is about a specific procedure, which makes the definition of relevance more
strict.
Among our three algorithms, the structure algorithm has the best performance in this
experiment. The difference between the baseline and structure algorithms is significant
(p < 0.001). The difference between the baseline and semantic algorithms is not significant.
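The report does not state which significance test produced the p-value; a paired comparison of the two algorithms' scores on the same inputs is one standard choice. The sketch below computes a paired t statistic over the five per-topic scores from Table 2, purely to illustrate the mechanics (the reported p-value was presumably computed over the full set of per-page ratings, not these five aggregates).

```python
import math

def paired_t_statistic(a, b):
    """t statistic for a paired comparison of two score lists
    (positive when `a` outperforms `b` on average)."""
    diffs = [x - y for x, y in zip(a, b)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)  # sample variance
    return mean / math.sqrt(var / n)

# Per-topic scores from Table 2, in the order: Entropy, Harry Potter,
# Waterboarding, Wei Chen, How to make Sushi.
structure = [1.0,   0.982,  0.8564, 0.713,  0.7444]
baseline  = [0.519, 0.9032, 0.507,  0.7444, 0.1274]
```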
The structure algorithm performs best on the topic “entropy”, with a relevance score of 1.
This is a promising result: if the target users of the web recommender are people in academia,
they would use it to find technical information. For example, we could combine the web
recommender with Wikipedia, so that users get more comprehensive information on the
topics they are interested in. The structure algorithm also performs very well on the topic
“How to make Sushi”, whereas the baseline and semantic algorithms have their worst
performance there. We think the reason is that the structure algorithm uses key terms
extracted from anchor tags; these anchor tags point to other relevant web pages, so the key
terms extracted from them are more relevant than the key terms extracted from other parts
of the web page.
Error Analysis
Table 2. Key terms and scores for all categories of topic and algorithm

Entropy
  Baseline:  Entropy free encyclopedia Jump search article (score: 0.519)
  Semantic:  ISBN University Press Boltzmann John (score: 0.6926)
  Structure: ISBN edit entropy thermodynamics Entropy energy system law heat thermodynamic (score: 1)
Harry Potter
  Baseline:  Harry Potter free encyclopedia Jump search (score: 0.9032)
  Semantic:  Harry Potter Voldemort BBC Rowling (score: 0.9686)
  Structure: Potter Harry Rowling Witch Deathly Goblet Magic witchcraft Film Hallows (score: 0.982)
Waterboarding
  Baseline:  Waterboarding free encyclopedia Jump search Cambodia Khmer (score: 0.507)
  Semantic:  CIA United York Bush States (score: 0.1738)
  Structure: Torture News York Waterboarding Times Press CIA ISBN torture Washington (score: 0.8564)
Wei Chen
  Baseline:  Wei Chen graduate student Language Technologies Carnegie Mellon research advisor (score: 0.7444)
  Semantic:  Chen Wei University NMF Johns (score: 0.457)
  Structure: States Natural Language Mental Fahlman Word Jack Lingual AAAI Wei (score: 0.713)
How to make Sushi
  Baseline:  Make 10 steps wikiHow Manual Edit RSS Create account log prepared (score: 0.1274)
  Semantic:  RL Commons Article Creative Nicole (score: 0)
  Structure: Sushi Make edit Ads Roll wikiHow Article make Rice Show (score: 0.7444)
Overall, the semantic algorithm's performance is not as good as we expected; we expected it
to be at least as good as the structure algorithm. The semantic algorithm scores zero on the
topic “How to make Sushi”. Looking into the causes, we find that the named entity recognizer
we use can only identify names of persons and organizations, so it misses the important
keyword “Sushi”. Looking at the other topics, we find that the semantic algorithm does
identify important named entities that are relevant to the topic and useful to the algorithm,
but using only these named entities is not sufficient. We think that if we combined the
keywords in the title of the web page with the named entities we extract as input to our
query, we would get much better results in the future.
For the topic “entropy”, both the semantic and structure algorithms score better than the
baseline algorithm. We think the reason is that the baseline algorithm picks up some “noise”
key terms, which hurt its performance and make it return some irrelevant web pages. It is also
a promising sign that the semantic and structure algorithms can make a difference in the
recommendation results.
All three algorithms perform very well on the topic “Harry Potter”. We think there are two
reasons for this: first, the definition of relevance for a popular topic is much broader; anything
about Harry Potter is considered relevant, no matter whether it is about the book, the movie,
the author, or the actors. Second, there are so many web pages about Harry Potter on the
web that it is easier to find 5 relevant ones among them.
The reason for the error pages for the topic “Waterboarding” is that some of the links are
invalid. Because “Waterboarding” is a time-sensitive topic, the content of the recommended
web pages could have been deleted by the time of evaluation. Looking at the links, we see
that they are usually links to user-generated content such as forum pages.
For the topic “Wei Chen”, there are very few relevant web pages on the web. The structure
algorithm returns only 4 web pages as its result, but two of them are relevant. So we think
the major reason for the error pages is the scarcity of relevant web pages on the web.
We expected the topic “How to make Sushi” to be a difficult case for the web recommender.
One problem is caused by the keyword “make”: the algorithm returned pages about how to
make something other than sushi. The other problem is that much of the content on this topic
is user generated, so some of the recommended pages are from forums and were invalid at
the time of evaluation.
Conclusion
This experiment gave us a lot of feedback about the algorithms used in the Web Recommender.
We now know how topics affect the recommendation results of each algorithm. We can also
conclude from the experiment that our algorithms make a significant difference in the
recommendation results, and we can probably predict for which kinds of topics the web
recommender will be most useful.
Software Engineering Techniques used in this project
We followed a standard software engineering process in this project: requirements analysis,
design, implementation, and evaluation. We used an iterative development process in the
design, implementation, and evaluation phases. Table 3 summarizes the iterations in each
phase, along with the main changes we went through.
Table 3. Highlights of the software engineering process

Iteration 1
  Design:         1. Initial design of framework
                  2. Composite-pattern-based evaluation design
  Implementation: 1. Initial implementation of framework
                  2. Implemented evaluation component based on composite pattern
  Evaluation:     1. Pilot study
                  2. Weighted average relevance score

Iteration 2
  Design:         1. Added query formulator and query filter
                  2. Simplified evaluation design
  Implementation: 1. Implemented query formulator and query filter
                  2. Implemented simplified version of evaluation GUI
  Evaluation:     1. 5 raters, 5 input pages
                  2. Modified average precision
What changed over the semester?
As Table 3 shows, we made changes in each of the development phases. Major changes are
documented in several meeting notes:
Changes in Main Framework:
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes02-02-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes02-11-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes02-18-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes03-04-2009
Changes in Evaluation Component:
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes04-06-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes04-20-2009
http://seit1.lti.cs.cmu.edu/projects/webrecommender/wiki/MeetingNotes04-22-2009
Our evaluation GUI went through several rounds of changes.
Stage 1: Planned to use a relational database to store and retrieve evaluation results.
Stage 2: Discarded the relational database idea. Used the composite pattern to implement
aggregation of evaluation scores. Link to composite pattern based design
Stage 3: Discarded the composite pattern. Simplified and implemented the evaluation GUI.
Link to the GUI Demo
Stage 4: The GUI was found to be slow. Used Excel files to store and calculate evaluation
scores. link to example evaluation form
What would we change if we did the project over again?
1. We would improve our risk analysis: the tricky thing about risks is that they are
unexpected. We did not anticipate that speed would be a problem for our GUI.
2. Evaluation took more time than we expected. We would allow more time for
evaluation, because we need time for a pilot study before conducting the experiment.
Then we could perform a detailed and systematic analysis of the algorithms and
improve them based on that analysis.
3. We would improve our time management: We should start evaluation early so that we
can improve our algorithms based on evaluation results.
Acknowledgements
We owe many thanks to Dr. Nyberg, Dr. Tomasic, Shilpa, and Hideki for their valuable
comments and suggestions on our project throughout the semester. We thank our raters for
performing the evaluation task, and our classmates for many helpful discussions.