INFORMATION EXTRACTION AND INTEGRATION BASED ON A CROWDSOURCING PLATFORM IN THE REAL WORLD
Pham Nguyen Son Tung1, Tran Minh Triet1, Nguyen Pham Hoang Anh1, Nguyen Ngoc Dung1, Nguyen Thi My Hue1
1 Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
{pnstung, tmtriet}@fit.hcmus.edu.vn; {1241003, 1241014, 1241047}@student.hcmus.edu.vn
ABSTRACT
With the increasing amount of information on the Internet, accessing data from different
online sources is becoming more difficult. A user may struggle to find and gather information
related to a single entity from various websites. Thus, extracting and integrating information from
different websites is an essential requirement for Internet users. This motivates us to propose a
method to efficiently extract and integrate information from different websites. Although DOM tree
analysis is a common method for information extraction from websites, it may produce unexpected
results that require manual correction or refinement, or complicated machine learning methods. In
our proposed method, we take advantage of the emerging trend of Crowdsourcing: a Crowdsourcing
platform provides crowd assistance that improves the accuracy of DOM tree analysis combined with
the K-means algorithm. Our proposed system helps users extract information from a website more
quickly and accurately. Experimental results show that our method can extract data correctly at a
rate of up to 98%.
Keywords. web information extraction, web information integration, crowdsourcing, web wrapper.
1. INTRODUCTION
According to the statistics of Internet Live Stats (an elaboration of data by the International
Telecommunication Union (ITU) and the United Nations Population Division), the number of Internet
users has grown over 20 years from 14,161,570 to 2,925,249,355 (cf. Figure 1). With the
development of the Internet, a huge amount of information is being produced by websites worldwide. As
a result, an Internet user may be overloaded with information related to a given entity/object
on the Internet, coming from many different sources in many different forms.
For example, when a user wants to buy the book “Harry Potter and the Sorcerer’s Stone,” he or
she may be confused because there are too many information sources from various online bookseller
websites. Furthermore, it should be noted that a book has many attributes/facets, which are
represented with different formats and labels on each web page, such as book name, author, price,
summary, publisher, language, etc.
Figure 1: The number of Internet users worldwide from 1993 to May 2014
Generally, people pay attention only to a few main pieces of information about a book. Thus, the
demand for information extraction and integration has become an interesting topic for the
scientific research community. Many related papers on information extraction and integration have
been presented at well-known conferences such as the International Conference on World Wide Web
(WWW), the International Conference on Web Engineering (ICWE), etc.
The DOM tree method is a common and efficient approach for an information extraction system.
However, a system that relies only on this method may extract information from a website and
produce unexpected results, and it would be time-consuming to manually correct, or to train a
system to eliminate, the wrong results of a DOM tree method.
As the community of Internet users grows larger, the Internet will certainly become a greater
labor market in the future. Internet users from diverse cultures and nationalities can
become amateurs or experts in a certain field. Crowdsourcing is a new trend that allows us to take
advantage of this huge pool of workers. Successful applications built on this model, such as
Amazon Mechanical Turk and CrowdFlower, open a new opportunity for supervised labeling of
information. Crowdsourcing takes advantage of human resources and knowledge to carry out
data-labeling tasks, which do not require computing qualifications. Therefore, we apply a
Crowdsourcing platform in our system to improve the extraction and integration of information
from websites.
The rest of our paper is organized as follows. In section 2, we review several recent studies
related to information extraction. DOM tree methods for information extraction and Crowdsourcing
are then briefly discussed in section 3. Our proposed system and architecture are presented in section
4. Section 5 shows the experimental results of our system and method. Finally, conclusions are
presented in section 6.
2. RECENT RESEARCH RELATED TO INFORMATION EXTRACTION
2.1. Manual method
By observing a website and its source code, a developer searches for typical templates of the
website and writes an extraction script, in some programming language, corresponding to these
standard templates. The necessary data is then extracted according to the previously determined
scenario. However, this method cannot scale to a large number of websites [1].
2.2. Building wrapper method
Most websites are generated by agencies from data according to users’ requirements; these are
also known as hidden web pages. This means that a special tool is needed to extract information
from such sites, and this is usually done by a wrapper. A wrapper can be seen as a procedure
designed to extract the contents of an information source, a program, or a set of extraction
rules. This method was proposed by Nicholas Kushmerick in 1997 [2]. The wrapper is trained to
extract the necessary data based on sets of extraction rules learned from samples. Among recent
studies, we can mention the research on creating automatic wrappers for some large sites by
Nilesh Dalvi et al. in 2011 [3], the learning wrapper built by Tak-Lam Wong in 2012 [4], and the
unsupervised learning wrapper of Chia-Hui Chang et al. in 2013 [5].
2.3. DOM tree method
Information extraction problems are often approached based on the type of data being extracted.
Data is generally divided into three types: unstructured data, structured data, and semi-structured
data. Websites are typically semi-structured data, whose structural components are displayed
via HTML tags. Based on the structure of the HTML tags, a DOM tree is constructed to determine
how the data is organized and to extract information according to the desired structure; this
method solves the problem of overlapping information. However, it can only extract within a single
web page and cannot reuse the website structure or other components. In addition, the collected
data may be inaccurate if the website structure changes. Outstanding research that could be
mentioned includes M. Álvarez et al. with “Reusing web contents: a DOM approach” [6],
R. Novotny et al. [7], and M. Shaker et al. [8].
3. DEFINITION OF INFORMATION EXTRACTION ON DOM TREE AND BASIC
CONCEPTS OF CROWDSOURCING
3.1. DOM tree structure analysis
According to the W3C, the DOM (Document Object Model) [9] is an application
programming interface (API) for valid HTML and well-formed XML documents. The DOM is
divided into three levels as follows:
• Core DOM: interface for any type of document.
• XML DOM: interface for XML documents.
• HTML DOM: interface for HTML documents.
Figure 2: An example DOM tree of a web page
Creating a DOM tree is a necessary step in the information extraction algorithm. We use
IHTMLDOMNode to traverse each node in the browser and the website, and from these we build a DOM tree
showing the structure of the currently displayed web page. From it, we know the overall
structure of a website, including all elements of the website, which element precedes which, and
which element contains which. To extract the necessary information at a node of the DOM tree, we
need to clearly specify the path from the root of the tree to the node whose information is to be
extracted. This path is called an XPath or extraction sample [10]. It runs from the root of the
DOM tree to the node containing the extracted content.
The DOM tree is built from the HTML tags of a website: its root node is the <HTML>
tag, the tags nested inside follow, and a leaf node is a node that contains extracted content.
Information extraction on a DOM tree is, in fact, browsing HTML tags to get the information that
these pairs of tags contain.
Information extraction from the DOM tree in Figure 2 proceeds as follows: browse the nodes of the
DOM tree in turn until a leaf node is encountered; the value at the leaf node is the extracted
information. For example, to extract the book’s name, we browse the DOM tree as follows:
TABLE → TR → TD → “Frozen”
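This traversal can be sketched with Python’s standard library; the HTML fragment below is an illustrative stand-in for the page of Figure 2, not the paper’s actual dataset, and a real system would walk the browser DOM (e.g. via IHTMLDOMNode) rather than parse a string.

```python
import xml.etree.ElementTree as ET

# Illustrative page fragment: a table whose leaf node holds the book's name.
doc = ET.fromstring(
    "<html><body><table><tr><td>Frozen</td></tr></table></body></html>"
)

# Browse the tree along TABLE -> TR -> TD and read the value at the leaf node.
td = doc.find("./body/table/tr/td")
print(td.text)  # -> Frozen
```

The path passed to `find` plays the role of the extraction sample: it pins down exactly one leaf node whose text is the extracted information.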
3.2. The Crowdsourcing platforms
The term “Crowdsourcing” first appeared in the article “The Rise of Crowdsourcing” by Jeff
Howe, published in the leading technology magazine Wired in 2006 [11]. Crowdsourcing combines
the technology and business of outsourcing with the social aspect of open source.
Individuals or enterprises (also called Requesters) use human intelligence to perform tasks that
computers cannot, such as identifying objects in a picture or writing documents describing
products. Requesters find labor not only in companies or organizations but also in the crowd.
Through an unsolved problem, Requesters place their faith in the majority and seek out hidden
talent to produce the best solution at an acceptable cost.
Figure 3. Illustration of work on MTurk
(picture taken from http://docs.aws.amazon.com in May 2014)
Some projects use Crowdsourcing successfully, such as the iStockphoto website for exchanging and
sharing photos. Threadless is a community website of online designers. Facebook also uses
Crowdsourcing to create different language versions, making Facebook suitable for various nations.
Even more outstanding is Amazon Mechanical Turk (MTurk), which offers a lot of crowdsourced work.
Amazon calls the tasks that a Requester offers HITs, and the people who perform these tasks are
known as Workers.
4. PROPOSED SOLUTION AND SYSTEM ARCHITECTURE
4.1. System Architecture
Our system is divided into three main stages. In the first stage, starting from the input data, we
use the Crowdsourcing platform combined with the K-means algorithm to generate the extraction
keywords. In the second stage, the Crowdsourcing platform is used again for crowd-assisted
labeling. In the third stage, from the labeled data, we derive the rule sets that are used
to extract information.
Figure 4. System Architecture Diagram
4.2. Extracting keywords from the input dataset
According to the empirical studies of Jun Zhu et al. [12], about 35% of websites are currently
list pages, and the rest are detail pages. We have already handled the first form successfully
in [13]. So in this paper we are interested in the second form: websites that present a single
object (detail pages).
4.2.1. Choosing keywords using the Crowdsourcing platform for the first time
Keywords are the pieces of information about the object that users are interested in. In Figure 4
above, the object is a book, and users are interested in attributes such as title, work address,
year of publication, price, shipping costs, date of publication, publisher, etc.; this information
that users are interested in forms the keywords we need to collect. To accomplish this task, we
have developed a system that asks workers simple questions and collects the keywords from their
answers.
The keyword-collection algorithm:
Step 1: Let D be the set of input sample sites: D = {D1, D2, ..., Dn}.
Step 2: Decompose all the keywords in each Di into a set of words W: W = {w1, w2, ..., wm}.
Step 3: For each w ∈ W, count the frequency H(w) of w.
Step 4: Let W* be the set of candidate keywords: W* = {w ∈ W : H(w) ≥ θ}, where the threshold θ is chosen experimentally.
Step 5: Once the candidate set is selected, use Crowdsourcing for the first time to let users select the keywords for each field.
Step 6: Apply the K-means algorithm to the users’ answers to collect the final set of keywords.
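As a rough sketch of the steps above, the code below counts keyword frequencies, filters candidates by a threshold, and runs a simple one-dimensional K-means with K = 2 on crowd vote counts; all keyword names, the threshold value, and the vote numbers are invented for illustration, and the paper's actual clustering features may differ.

```python
from collections import Counter

# Hypothetical keywords decomposed from each sample site D_i (Step 2).
site_keywords = [
    ["title", "author", "price", "publisher"],
    ["title", "price", "author", "language"],
    ["title", "price", "summary", "author"],
]

# Step 3: count the frequency H(w) of every keyword w.
H = Counter(w for kws in site_keywords for w in kws)

# Step 4: keep candidates whose frequency meets a threshold (value chosen arbitrarily here).
theta = 2
candidates = {w for w, h in H.items() if h >= theta}

# Step 6 (simplified): 1-D K-means with K = 2 on hypothetical crowd vote counts,
# keeping the cluster with the higher centroid.
votes = {"title": 9, "author": 8, "price": 7, "language": 2}

def kmeans_1d(values, iters=10):
    """Two-centroid K-means on a list of numbers."""
    c_lo, c_hi = min(values), max(values)
    for _ in range(iters):
        lo = [v for v in values if abs(v - c_lo) <= abs(v - c_hi)]
        hi = [v for v in values if abs(v - c_lo) > abs(v - c_hi)]
        c_lo = sum(lo) / len(lo) if lo else c_lo
        c_hi = sum(hi) / len(hi) if hi else c_hi
    return c_lo, c_hi

c_lo, c_hi = kmeans_1d(list(votes.values()))
selected = [w for w, v in votes.items() if abs(v - c_hi) < abs(v - c_lo)]
print(sorted(selected))  # keywords in the high-vote cluster
```

Clustering the votes rather than applying a fixed cutoff lets the split between accepted and rejected keywords adapt to each field.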
4.3. Assigning labels to data using the Crowdsourcing platform for the second time
In this stage, questions are generated for each keyword obtained in the previous stage; workers
answer them so that the system can collect the best rules, namely the XPath extraction rules for
each keyword. Previously, we used a greedy algorithm to reduce the number of questions and tasks
a worker must perform. However, the work for each worker was still quite large. In this
improvement, we refine the system further to increase the efficiency of the workers’ effort as
well as of data acquisition.
When generating questions, instead of only YES/NO questions, the system asks the worker to click
on the region of data on the website related to the keyword; the worker then confirms whether the
data obtained at that moment is right or not. With this flexibility, the number of questions is
greatly reduced, to 1/4 for every worker. To complete a labeling task for a keyword, the worker
must complete three steps, as in the picture below:
1. Read the question to be answered.
2. Click on the region that contains the data value in question.
3. Confirm that the data is correct.
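One plausible way to turn the node a worker clicks on into an indexed XPath rule of the form used by the system (e.g. /html[1]/body[1]/.../a[1]) is sketched below with Python’s standard library; the document and the clicked node are hypothetical, and a full implementation would also append the text-node index (e.g. /#text[1]).

```python
import xml.etree.ElementTree as ET

# Hypothetical page; the worker is assumed to have clicked the first <a> element.
doc = ET.fromstring(
    "<html><body><div><a>Frozen</a><a>Other</a></div></body></html>"
)

# ElementTree has no parent pointers, so build a child -> parent map first.
parent = {c: p for p in doc.iter() for c in p}

def xpath_of(node):
    """Return the absolute indexed XPath from the root to the given node."""
    parts = []
    while node is not None:
        p = parent.get(node)
        siblings = [c for c in p if c.tag == node.tag] if p is not None else [node]
        parts.append(f"{node.tag}[{siblings.index(node) + 1}]")
        node = p
    return "/" + "/".join(reversed(parts))

clicked = doc.find("./body/div/a")  # stand-in for the node the worker clicked
print(xpath_of(clicked))  # -> /html[1]/body[1]/div[1]/a[1]
```

Because sibling indices are recorded at every level, the rule re-selects exactly the clicked node on later visits as long as the page structure is unchanged.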
Figure 5. The worker interface of the system.
4.4. Extracting data with the generated rule set
After the data labeling is finished, a set of rules is created; this rule set is used to extract
data from websites according to users’ requirements.
Table 1. XPath rules for extracting the book title
xPath Rules
1 /html[1]/head[1]/title[1]/#text[1]
2 /html[1]/body[1]/…/span[1]/a[1]/#text[1]
3 /html[1]/body[1]/…/div[1]/a[1]/#text[1]
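One plausible way to apply an ordered rule set such as Table 1 is to try each rule in turn and return the first match; the sketch below uses ElementTree’s limited XPath subset with an invented page and invented rules rather than the paper’s absolute paths.

```python
import xml.etree.ElementTree as ET

# Hypothetical rules, in priority order (ElementTree path syntax, not full XPath).
rules = ["./head/title", "./body/span/a", "./body/div/a"]

def extract(doc, rules):
    """Return the text of the first rule that matches a non-empty node."""
    for rule in rules:
        node = doc.find(rule)
        if node is not None and node.text:
            return node.text
    return None

doc = ET.fromstring("<html><body><div><a>Frozen</a></div></body></html>")
print(extract(doc, rules))  # -> Frozen
```

Keeping several rules per keyword makes extraction robust across sites whose pages place the same attribute under different paths.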
5. EXPERIMENTAL RESULTS
5.1. Results of the K-means algorithm
To build a keyword repository for each field, we conducted experiments on site data from three
different areas: Movie, Music, and E-commerce (book selling); for each field, we used five sites
to collect keywords. The K-means algorithm [14] is applied with K = 2 to cluster the data and
select the keywords chosen by the most users. The collected data is presented in the table below:
Table 2. The number of input keywords and output keywords

Fields   Number of input websites   Number of candidate keywords   Number of result keywords
Movie    5                          20                             10
Music    5                          22                             12
Book     5                          23                             10
5.2 Comparison of results between Greedy and Labeling data
In this section, we display a result from the experiments with our approach. Our experiments
mainly focus on the impact of greedy and labeling data method based knowledge of the crowd to our
system. We conducted a test on three categories of websites under three different fields to carry out
data extraction under two tables, as following:
Table 3. The overall results and comparison between Greedy and labeling data by workers

Fields   Number of     Website                     Number of questions   Number of human answers
         input pages                               Greedy   Click        Greedy   Click
Movie    5             www.imdb.com                1760     20           88       7
Music    5             www.last.fm                 500      24           25       5
Book     5             www.betterworldbooks.com    680      20           34       5
Table 4. Comparison of results by dataset size and runtime

Fields   Dataset    Rate     Runtime
Movie    7 × 10^4   87.75%   01:45:17
Music    3 × 10^4   91.23%   00:06:04
Book     5 × 10^4   97.52%   00:22:05
6. CONCLUSION
In this paper, we presented our method for building a system that extracts and integrates
information from websites with the following characteristics: DOM tree analysis combined with the
K-means algorithm to extract keywords, and a “right or wrong” question-answering system built on
a Crowdsourcing platform. The resulting sets of rules are updated to conform with the data
extraction, and we apply crowd-based data labeling to reduce the number of worker tasks as well
as to save cost.
REFERENCES
1. Bing Liu, "Web Data Mining-Exploring Hyperlinks, Contents, and Usage Data," in
http://www.cs.uic.edu/~liub/WebMiningBook.html, December, 2006.
2. N. Kushmerick, D. S. Weld and R. Doorenbos, "Wrapper induction for information
extraction," in IJCAI, 1997.
3. N. Dalvi, R. Kumar and M. Soliman, "Automatic wrappers for large scale Web extraction," in
Proc. of the VLDB Endowment, 2011.
4. T.-L. Wong, "Learning to adapt cross language information extraction wrapper," in Applied
Intelligence, Volume 36, Issue 4, pp 918-931, June 2012.
5. C.-H. Chang, Y.-L. Lin, K.-C. Lin and M. Kayed, "Page-Level Wrapper Verification for
Unsupervised Web Data Extraction," in Web Information Systems Engineering – Lecture
Notes in Computer Science Volume 8180, pp 454-467, 2013.
6. Luis Álvarez Sabucedo, Luis E. Anido-Rifón, Juan M. Santos-Gago: Reusing web contents: a
DOM approach. Softw., Pract. Exper. 39(3): 299-314 (2009).
7. R. Novotny, P. Vojtas, and D. Maruscak, “Information Extraction from Web Pages,”
Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence
and Intelligent Agent Technology, vol. 3, 2009, pp. 121-124.
8. M. Shaker, H. Ibrahim, A. Mustapha, and L.N. Abdullah, “Information extraction from web
tables,” Proceedings of the 11th International Conference on Information Integration and
Web-based Applications & Services - iiWAS ’09, New York, New York, USA: ACM Press,
2009, pp. 470-476.
9. W. W. W. Consortium, "Document Object Model (DOM)," in http://www.w3.org/DOM/,
January 19, 2005.
10. M. Okada, N. Ishii, I. Torii, “Information extraction using XPath,” Knowledge-Based and
Intelligent Information and Engineering Systems, Lecture Notes in Computer Science Volume
6278, 2010, pp. 104-112.
11. J. Howe, "The Rise of Crowdsourcing," in Wired Magazine, June 2006.
12. J. Zhu, Z. Nie, J.-R. Wen, B. Zhang and W.-Y. Ma, "Simultaneous record detection and
attribute labeling in web data extraction," in Proceedings of the 12th ACM SIGKDD
international conference on Knowledge discovery and data mining, 2006.
13. Khanh Nguyen, Huy Nguyen, Nam Nguyen, Cuong Do, Triet Tran. “System for training and
executing WebBot to extract information from websites”, ITCFIT, 2010.
14. JA Hartigan, MA Wong, “A k-means clustering algorithm,” Applied statistics, 1979.