INFORMATION EXTRACTION AND INTEGRATION BASED
CROWDSOURCING PLATFORM IN REAL-WORLD
Pham Nguyen Son Tung1, Tran Minh Triet1, Nguyen Pham Hoang Anh1, Nguyen Ngoc Dung1, Nguyen Thi My Hue1
1 Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
{pnstung, tmtriet}@fit.hcmus.edu.vn; {1241003, 1241014, 1241047}@student.hcmus.edu.vn
ABSTRACT
With the increasing amount of information on the Internet, accessing data from different
online sources is becoming more difficult. A user may struggle to find and gather information
related to a single entity from various websites. Thus, extracting and integrating information from
different websites is an essential requirement for Internet users. This motivates us to
propose a method to efficiently extract and integrate information from different websites. Although
DOM tree analysis is a common method for information extraction from websites, it may
produce unexpected results that require manual correction or refinement, or complicated
machine-learning methods. In our proposed method, we take advantage of the new trend of Crowdsourcing,
using crowd assistance through a Crowdsourcing platform to improve the accuracy of
DOM tree analysis with the K-means algorithm. Our proposed system helps users extract
information from a website more quickly and accurately. Experimental results show that our method
can extract data correctly at a rate of up to 98%.
Keywords. web information extraction, web information integration, crowdsourcing, web wrapper.
1. INTRODUCTION
According to statistics from Internet Live Stats (an elaboration of data by the International
Telecommunication Union (ITU) and the United Nations Population Division), the number of Internet
users has grown over 20 years from 14,161,570 to 2,925,249,355 (cf. Figure 1). With the
development of the Internet, the amount of information coming from websites worldwide has become
huge. As a result, an Internet user may be overloaded with information related to a given
entity/object on the Internet, coming from many different sources in many different forms.
For example, when a user wants to buy the book “Harry Potter and the Sorcerer’s Stone,” he or
she may be confused because there are so many information sources from various online bookseller
websites. Furthermore, it should be noted that a book has many attributes/facets, which are
represented with different formats and labels on each web page, such as book name, author, price,
summary, publisher, language, etc.
Figure 1: Number of Internet users worldwide from 1993 to May 2014
Generally, people pay attention to only a few pieces of main information about a book. Thus, the
demand for information extraction and integration has become an interesting topic in the
scientific research community. Many related papers on information extraction and integration have
been presented at well-known conferences such as the International World Wide Web Conference
(WWW), the International Conference on Web Engineering (ICWE), etc.
The DOM tree method is a common and efficient approach for an information extraction system.
However, a system that relies on this method alone may extract information from a website and
produce unexpected results, and it is time-consuming to manually correct them or to train the
system to eliminate the wrong results of a DOM tree method.
As the community of Internet users keeps growing, the Internet will certainly become an ever
greater labor market in the future. Internet users from diverse cultures and nationalities can
become amateurs or experts in a certain field. Crowdsourcing is a new trend that allows us to take
advantage of this huge pool of workers. Successful applications of this platform, such as
Amazon Mechanical Turk and CrowdFlower, open a new opportunity for supervised labeling:
they harness human resources and knowledge to carry out data-labeling tasks that
do not require computer skills. Therefore, we apply a Crowdsourcing platform in
our system to improve the extraction and integration of information from websites.
The rest of our paper is organized as follows. In section 2, we review several recent studies
related to information extraction. DOM tree methods for information extraction and Crowdsourcing
are then briefly discussed in section 3. Our proposed system and architecture are presented in section
4. Section 5 shows the experimental results of our system and method. Finally, conclusions are
presented in section 6.
2. RECENT RESEARCH RELATED TO INFORMATION EXTRACTION
2.1. Manual method
Observing a website and its source code, the developer searches for typical templates of the
website and writes an extraction script in some language corresponding to these standard templates.
The necessary data is then extracted according to the previously determined scenario. However,
this method cannot scale to a large number of websites [1].
2.2. Building wrapper method
Most websites are generated by agencies from databases according to users’ requirements;
they are also known as hidden web pages. This means a special tool is needed to
extract information from such sites, which is usually done by a wrapper. A wrapper can be seen as a
procedure designed to extract the contents of an information source, a program, or a set of extraction
rules. This approach was proposed by Nicholas Kushmerick in 1997 [2]. The wrapper is trained to
extract the necessary data based on extraction rules learned from samples. Among recent
studies, we can mention the automatic wrapper for large sites by Nilesh Dalvi et al. in 2011 [3],
the learning wrapper by Tak-Lam Wong in 2012 [4], and the unsupervised learning wrapper by
Chia-Hui Chang et al. in 2013 [5].
2.3. DOM tree-based method
Information extraction problems are often approached by extracting information from data.
Data is generally divided into three types: unstructured data, structured data, and semi-structured
data. Websites are a typical form of semi-structured data, whose structural components are
displayed on the web via HTML tags. Based on the structure of these HTML tags, a DOM tree is
constructed to determine how the data is organized and then to extract information in the
desired structure; this method solves the problem of overlapping information. However, it can only
extract within a single web page and cannot reuse the website structure or other components. In
addition, the collected data can become inaccurate if the website structure changes. Outstanding
studies include M. Álvarez et al. with “Reusing web contents: a DOM approach” [6],
R. Novotny et al. [7], and M. Shaker et al. [8].
3. DEFINITION OF INFORMATION EXTRACTION ON DOM TREE AND BASIC
CONCEPTS OF CROWDSOURCING
3.1. DOM tree structure analysis
According to the W3C, the DOM (Document Object Model) [9] is an application
programming interface (API) for valid HTML and well-formed XML documents. The DOM is
divided into three levels as follows:
• Core DOM: interface for any type of document.
• XML DOM: interface for XML documents.
• HTML DOM: interface for HTML documents.
Figure 2: An example DOM tree of a web page (detail page of a book)
Creating a DOM tree is a necessary step in the information extraction algorithm. We use
IHTMLDOMNode to visit each node that the browser builds for the website, constructing a DOM tree
that represents the structure of the currently displayed web page. From this tree we know the overall
structure of a website: all of its elements, which element precedes which, and which
element contains which. To extract the necessary information at a node of the DOM tree, we must
clearly specify the path from the root of the tree to that node. This path is
called an XPath or extraction sample [10]; it runs from the root of the DOM tree to the
node that holds the content to extract.
The DOM tree is built from the HTML tags of a website: its root node is the <HTML>
tag, inner nodes are the nested tags, and a leaf node holds the content to extract. Information
extraction on a DOM tree is, in fact, a traversal of the HTML tags to obtain the information that
each pair of tags contains.
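This traversal of tag pairs can be sketched in a few lines. The sketch below is illustrative only: it uses Python's standard xml.etree.ElementTree on a tiny, invented well-formed page rather than the IHTMLDOMNode browser interface our system uses.

```python
import xml.etree.ElementTree as ET

def enumerate_paths(node, prefix=""):
    # Walk every node of the tree and yield the root-to-node path
    # (the "extraction sample") for each leaf that carries text.
    path = prefix + "/" + node.tag
    if node.text and node.text.strip():
        yield path, node.text.strip()
    for child in node:
        yield from enumerate_paths(child, path)

# A tiny, invented page standing in for a real book detail page.
page = "<html><body><div><a>Frozen</a></div><span>$12.99</span></body></html>"
for path, text in enumerate_paths(ET.fromstring(page)):
    print(path, "->", text)
# /html/body/div/a -> Frozen
# /html/body/span -> $12.99
```

Each yielded path is exactly the kind of root-to-leaf extraction sample the text describes.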
Information is extracted from the DOM tree in Figure 2 as follows: traverse the nodes of the
tree until a leaf node is reached; the value at the leaf is the extracted information. For example,
to extract the book’s name, we traverse the DOM tree as follows:
TABLE → TR → TD → Frozen
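Following such a path down to the leaf can be sketched as below; a minimal illustration assuming a well-formed miniature page and Python's standard xml.etree.ElementTree rather than the browser DOM interfaces discussed above.

```python
import xml.etree.ElementTree as ET

def extract_leaf(page_source, path):
    # Parse the page and return the text at the leaf node reached by
    # following `path` from the root, or None when the path matches nothing.
    node = ET.fromstring(page_source).find(path)
    return node.text if node is not None else None

# A miniature page mirroring the TABLE -> TR -> TD path of Figure 2.
page = "<html><body><table><tr><td>Frozen</td></tr></table></body></html>"
print(extract_leaf(page, "./body/table/tr/td"))  # Frozen
```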
3.2. The Crowdsourcing platforms
The term “Crowdsourcing” first appeared in Jeff Howe’s article “The Rise of Crowdsourcing,”
published in the leading technology magazine Wired in 2006 [11]. Crowdsourcing combines the
technology and business model of outsourcing with the social aspect of open source.
Individuals or enterprises (also called Requesters) use human intelligence to perform tasks that
computers cannot, such as identifying objects in a picture or writing documents describing products.
Requesters find labor not only in companies or organizations but also in the crowd. For an unsolved
problem, Requesters place their faith in the majority, discovering hidden talent to produce the best
solution at an acceptable cost.
Figure 3. An illustration of how work is performed on MTurk
(picture taken from http://docs.aws.amazon.com in May 2014)
Some projects have used Crowdsourcing successfully: iStockphoto, a website for exchanging and
sharing photos; Threadless, a community website of online designers; and Facebook, which uses
Crowdsourcing to create versions in different languages, making it suitable for many nations. Most
notable is Amazon Mechanical Turk (MTurk), which offers a large amount of Crowdsourcing work.
Amazon calls the tasks a Requester offers HITs, and the people who perform these tasks are known as
Workers.
4. PROPOSED SOLUTION AND SYSTEM ARCHITECTURE
4.1. System Architecture
Our system is divided into three main stages. In the first stage, starting from the input data,
we use the Crowdsourcing platform combined with the K-means algorithm to generate the extraction
keywords. The second stage uses the Crowdsourcing platform again for
crowd-assisted labeling. In the third stage, from the labeled data, we derive the rule sets used
to extract information.
Figure 4. System Architecture Diagram
4.2. Extracting keywords from the input dataset
According to the empirical study of Jun Zhu et al. [12], about 35% of web pages are currently
list pages, and the rest are detail pages. We have already handled the first form in [13],
so in this paper we focus on the second form: pages that describe a single object
(detail pages).
4.2.1 Choosing keywords using the Crowdsourcing platform for the first time
Keywords are the pieces of information about the object that users are interested in. In Figure 4
above, the object is a book; users are interested in attributes such as title, author, year of
publication, price, shipping cost, publication date, publisher, and so on. These attributes are the
keywords we need to collect. To accomplish this task, we developed a system that poses simple
questions to workers and collects the keywords from their answers.
The keyword-collection algorithm:
Step 1: Let D be the set of input sample sites: D = {D1, D2, ..., Dn}.
Step 2: Decompose all the keywords of each Di into a set of words W.
Step 3: For each w in W, count its frequency H(w).
Step 4: Let W* be the set of candidate keywords: W* = {w in W | H(w) ≥ θ}, where the threshold θ is chosen experimentally.
Step 5: Having selected the candidate set, we use Crowdsourcing for the first time so that users select the keywords for each field.
Step 6: Apply the K-means algorithm to the users’ answers to obtain the final set of keywords.
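The six steps above can be sketched as follows. This is a simplified illustration, not our production code: the documents, votes, and threshold theta are invented, and the one-dimensional K-means below is a toy stand-in for the full algorithm of [14].

```python
from collections import Counter

def candidate_keywords(docs, theta):
    # Steps 1-4: pool the words of every sample page Di and keep those
    # whose frequency H(w) reaches the threshold theta.
    freq = Counter(w for doc in docs for w in doc)
    return {w for w, h in freq.items() if h >= theta}

def kmeans_1d(points, iters=20):
    # Step 6 (sketch): K-means with K = 2 over per-keyword vote counts,
    # splitting keywords into a low-vote and a high-vote cluster.
    centers = [min(points), max(points)]
    for _ in range(iters):
        clusters = [[], []]
        for p in points:
            clusters[abs(p - centers[0]) > abs(p - centers[1])].append(p)
        centers = [sum(c) / len(c) if c else centers[j]
                   for j, c in enumerate(clusters)]
    return clusters

# Invented sample data: three pages' keyword lists, and workers' votes (Step 5).
docs = [["title", "author", "price"], ["title", "price", "isbn"], ["title", "author"]]
votes = {"title": 9, "author": 8, "price": 7, "summary": 1}

print(candidate_keywords(docs, theta=2))           # title, author, price
low, high = kmeans_1d(list(votes.values()))
print([w for w, v in votes.items() if v in high])  # ['title', 'author', 'price']
```

The high-vote cluster plays the role of the final keyword set; in our system this clustering is run per field on the collected worker answers.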
4.3. Assigning label data using the Crowdsourcing platform for the second time
In this stage, questions are generated from each keyword obtained in the previous stage, and
workers answer them so that the system can collect the best rules: the xPath extraction rules for
each keyword. Previously, we used a Greedy algorithm to reduce the number of questions and tasks
each worker must perform; however, the workload per worker was still quite large. In this work, we
improve the system further to make the worker’s job more efficient and the data acquisition faster.
When generating questions, instead of only YES/NO questions, the system asks the worker to
click on the region of the website that contains data related to the keyword; the worker then
confirms whether the highlighted data is correct. With this flexibility, the number of questions
per worker is reduced to about one quarter. To complete a labeling task for one keyword, the worker
performs three steps, as in the picture below:
1. Read the question to be answered.
2. Click on the region that contains the data value in question.
3. Confirm that the data is correct.
Figure 5. The worker interface of our system.
4.4. Generating extraction rules from the labeled data
After the labeling is finished, a set of rules is created; it is then used to extract data
from websites according to users’ requirements.
Table 1. xPath rules for extracting the book title
No.   xPath rule
1     /html[1]/head[1]/title[1]/#text[1]
2     /html[1]/body[1]/…/span[1]/a[1]/#text[1]
3     /html[1]/body[1]/…/div[1]/a[1]/#text[1]
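Applying the rule set of Table 1 amounts to trying each crowd-confirmed rule in order until one matches. The sketch below simplifies the paper's absolute /html[1]/... paths into relative paths that Python's standard xml.etree.ElementTree understands; the page and rules shown are illustrative stand-ins, not our exact rules.

```python
import xml.etree.ElementTree as ET

# Simplified stand-ins for the xPath rules of Table 1.
RULES = ["./head/title", "./body/span/a", "./body/div/a"]

def extract_title(page_source, rules=RULES):
    # Try each rule in turn; the first one matching non-empty text wins.
    root = ET.fromstring(page_source)
    for rule in rules:
        node = root.find(rule)
        if node is not None and node.text:
            return node.text
    return None

page = ("<html><head><title>Harry Potter and the Sorcerer's Stone</title></head>"
        "<body></body></html>")
print(extract_title(page))  # Harry Potter and the Sorcerer's Stone
```

Ordering the rules this way lets one rule set cover pages that mark up the title differently, which is why several rules are collected per keyword.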
5. EXPERIMENTAL RESULTS
5.1 Results of K-means algorithm
To build a keyword repository for each field, we ran the system on website data from three
different domains: Movie, Music, and E-commerce (book selling); for each field, we used five
sites to collect keywords. The K-means algorithm [14] is applied with K = 2 to cluster the data
and keep the keywords users chose most. The collected data is presented in the table below:
Table 2. The number of input and output keywords
Field   Number of input websites   Number of candidate keywords   Number of result keywords
Movie   5                          20                             10
Music   5                          22                             12
Book    5                          23                             10
5.2 Comparison between the Greedy and data-labeling methods
In this section, we present the results of our experiments, which mainly focus on the impact of
the Greedy method and the crowd-knowledge-based data-labeling method on our system. We tested
three categories of websites in three different fields and carried out data extraction, with the
results shown in the two tables below:
Table 3. The overall results and comparison between Greedy and labeling data by workers
Field   Input pages   Website                    Number of questions   Number of human answers
                                                 Greedy   Click        Greedy   Click
Movie   5             www.imdb.com               1760     20           88       7
Music   5             www.last.fm                500      24           25       5
Book    5             www.betterworldbooks.com   680      20           34       5
Table 4. Comparison of results on each dataset, with runtime
Field   Dataset size   Correct-extraction rate   Runtime
Movie   7 × 10^4       87.75%                    01:45:17
Music   3 × 10^4       91.23%                    00:06:04
Book    5 × 10^4       97.52%                    00:22:05
6. CONCLUSION
In this paper, we present our method for building systems to extract and integrate information
from websites with the following characteristics: DOM tree applies K-means algorithm to extract
keywords and build the system to answer questions in "right or wrong" platform based on
Crowdsourcing. The findings from these sets of rules will be updated to conform with the data
extraction, and we apply labeling data by knowledge of the crowd to reduce the number worker
perform as well as cost savings.
REFERENCES
1. Bing Liu, "Web Data Mining-Exploring Hyperlinks, Contents, and Usage Data," in
http://www.cs.uic.edu/~liub/WebMiningBook.html, December, 2006.
2. N. Kushmerick, D. S. Weld and R. Doorenbos, "Wrapper induction for information
extraction," in IJCAI, 1997.
3. N. Dalvi, R. Kumar and M. Soliman, "Automatic wrappers for large scale Web extraction," in
Proc. of the VLDB Endowment, 2011.
4. T.-L. Wong, "Learning to adapt cross language information extraction wrapper," in Applied
Intelligence, Volume 36, Issue 4, pp 918-931, June 2012.
5. C.-H. Chang, Y.-L. Lin, K.-C. Lin and M. Kayed, "Page-Level Wrapper Verification for
Unsupervised Web Data Extraction," in Web Information Systems Engineering – Lecture
Notes in Computer Science Volume 8180, pp 454-467, 2013.
6. Luis Álvarez Sabucedo, Luis E. Anido-Rifón, Juan M. Santos-Gago: Reusing web contents: a
DOM approach. Softw., Pract. Exper. 39(3): 299-314 (2009).
7. R. Novotny, P. Vojtas, and D. Maruscak, “Information Extraction from Web Pages,”
Proceedings of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence
and Intelligent Agent Technology, vol. 3, 2009, pp. 121-124.
8. M. Shaker, H. Ibrahim, A. Mustapha, and L.N. Abdullah, “Information extraction from web
tables,” Proceedings of the 11th International Conference on Information Integration and
Web-based Applications & Services - iiWAS ’09, New York, New York, USA: ACM Press,
2009, pp. 470-476.
9. W. W. W. Consortium, "Document Object Model (DOM)," in http://www.w3.org/DOM/,
January 19, 2005.
10. M. Okada, N. Ishii, I. Torii, “Information extraction using XPath,” Knowledge-Based and
Intelligent Information and Engineering Systems, Lecture Notes in Computer Science Volume
6278, 2010, pp. 104-112.
11. J. Howe, "The Rise of Crowdsourcing," in Wired Magazine, June 2006.
12. J. Zhu, Z. Nie, J.-R. Wen, B. Zhang and W.-Y. Ma, "Simultaneous record detection and
attribute labeling in web data extraction," in Proceedings of the 12th ACM SIGKDD
international conference on Knowledge discovery and data mining, 2006.
13. Khanh Nguyen, Huy Nguyen, Nam Nguyen, Cuong Do, Triet Tran. “System for training and
executing WebBot to extract information from websites”, ITCFIT, 2010.
14. JA Hartigan, MA Wong, “A k-means clustering algorithm,” Applied statistics, 1979.

 
A Multimodal Approach to Incremental User Profile Building
A Multimodal Approach to Incremental User Profile Building A Multimodal Approach to Incremental User Profile Building
A Multimodal Approach to Incremental User Profile Building
 
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
ANALYTICAL IMPLEMENTATION OF WEB STRUCTURE MINING USING DATA ANALYSIS IN ONLI...
 
Pf3426712675
Pf3426712675Pf3426712675
Pf3426712675
 
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
 
Implementation of Folksonomy Based Tag Cloud Model for Information Retrieval ...
Implementation of Folksonomy Based Tag Cloud Model for Information Retrieval ...Implementation of Folksonomy Based Tag Cloud Model for Information Retrieval ...
Implementation of Folksonomy Based Tag Cloud Model for Information Retrieval ...
 
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
IMPLEMENTATION OF FOLKSONOMY BASED TAG CLOUD MODEL FOR INFORMATION RETRIEVAL ...
 
320 324
320 324320 324
320 324
 
Extraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web EngineeringExtraction and Retrieval of Web based Content in Web Engineering
Extraction and Retrieval of Web based Content in Web Engineering
 
Group 3
Group 3Group 3
Group 3
 
Semantic Web Mining of Un-structured Data: Challenges and Opportunities
Semantic Web Mining of Un-structured Data: Challenges and OpportunitiesSemantic Web Mining of Un-structured Data: Challenges and Opportunities
Semantic Web Mining of Un-structured Data: Challenges and Opportunities
 

Más de David Nguyen

Compressed js with NodeJS & GruntJS
Compressed js with NodeJS & GruntJSCompressed js with NodeJS & GruntJS
Compressed js with NodeJS & GruntJSDavid Nguyen
 
jQuery Super Basic
jQuery Super BasicjQuery Super Basic
jQuery Super BasicDavid Nguyen
 
Javascript native OOP - 3 layers
Javascript native OOP - 3 layers Javascript native OOP - 3 layers
Javascript native OOP - 3 layers David Nguyen
 
MVC4 – knockout.js – bootstrap – step by step – part 1
MVC4 – knockout.js – bootstrap – step by step – part 1MVC4 – knockout.js – bootstrap – step by step – part 1
MVC4 – knockout.js – bootstrap – step by step – part 1David Nguyen
 
Chứng minh số node của Heap chiều cao h
Chứng minh số node của Heap chiều cao hChứng minh số node của Heap chiều cao h
Chứng minh số node của Heap chiều cao hDavid Nguyen
 
Hướng dẫn sử dụng Mind Manager 8
Hướng dẫn sử dụng Mind Manager 8 Hướng dẫn sử dụng Mind Manager 8
Hướng dẫn sử dụng Mind Manager 8 David Nguyen
 
KTMT Lý Thuyết Tổng Quát
KTMT Lý Thuyết Tổng QuátKTMT Lý Thuyết Tổng Quát
KTMT Lý Thuyết Tổng QuátDavid Nguyen
 
KTMT Số Nguyên - Số Chấm Động
KTMT Số Nguyên - Số Chấm ĐộngKTMT Số Nguyên - Số Chấm Động
KTMT Số Nguyên - Số Chấm ĐộngDavid Nguyen
 

Más de David Nguyen (13)

Compressed js with NodeJS & GruntJS
Compressed js with NodeJS & GruntJSCompressed js with NodeJS & GruntJS
Compressed js with NodeJS & GruntJS
 
jQuery Super Basic
jQuery Super BasicjQuery Super Basic
jQuery Super Basic
 
Javascript native OOP - 3 layers
Javascript native OOP - 3 layers Javascript native OOP - 3 layers
Javascript native OOP - 3 layers
 
MVC4 – knockout.js – bootstrap – step by step – part 1
MVC4 – knockout.js – bootstrap – step by step – part 1MVC4 – knockout.js – bootstrap – step by step – part 1
MVC4 – knockout.js – bootstrap – step by step – part 1
 
Facebook API
Facebook APIFacebook API
Facebook API
 
Quick sort
Quick sortQuick sort
Quick sort
 
Merge sort
Merge sortMerge sort
Merge sort
 
Heap Sort
Heap SortHeap Sort
Heap Sort
 
Chứng minh số node của Heap chiều cao h
Chứng minh số node của Heap chiều cao hChứng minh số node của Heap chiều cao h
Chứng minh số node của Heap chiều cao h
 
Hướng dẫn sử dụng Mind Manager 8
Hướng dẫn sử dụng Mind Manager 8 Hướng dẫn sử dụng Mind Manager 8
Hướng dẫn sử dụng Mind Manager 8
 
KTMT Lý Thuyết Tổng Quát
KTMT Lý Thuyết Tổng QuátKTMT Lý Thuyết Tổng Quát
KTMT Lý Thuyết Tổng Quát
 
KTMT Số Nguyên - Số Chấm Động
KTMT Số Nguyên - Số Chấm ĐộngKTMT Số Nguyên - Số Chấm Động
KTMT Số Nguyên - Số Chấm Động
 
Mô Hình MVC 3.0
Mô Hình MVC 3.0Mô Hình MVC 3.0
Mô Hình MVC 3.0
 

ACOMP_2014_submission_70

INFORMATION EXTRACTION AND INTEGRATION BASED CROWDSOURCING PLATFORM IN REAL-WORLD

Pham Nguyen Son Tung1, Tran Minh Triet1, Nguyen Pham Hoang Anh1, Nguyen Ngoc Dung1, Nguyen Thi My Hue1
1 Faculty of Information Technology, University of Science, Ho Chi Minh City, Vietnam
{pnstung, tmtriet}@fit.hcmus.edu.vn; {1241003, 1241014, 1241047}@student.hcmus.edu.vn

ABSTRACT

With the increasing amount of information on the Internet, accessing data from different online sources is becoming more difficult. A user may struggle to find and gather information related to a single entity from various websites. Extracting and integrating information from different websites is therefore one of the essential requirements for Internet users, and it motivates us to propose a method to efficiently extract and integrate such information. Although DOM tree analysis is a common method for extracting information from websites, it may produce unexpected results that require manual correction or refinement, or complicated machine learning methods. Our proposed method takes advantage of the new trend of Crowdsourcing: crowd assistance obtained through a Crowdsourcing platform improves the accuracy of DOM tree analysis combined with the K-means algorithm. Our proposed system helps users extract information from a website more quickly and accurately. Experimental results show that our method can extract data correctly at a rate of up to 98%.

Keywords: web information extraction, web information integration, crowdsourcing, web wrapper.

1. INTRODUCTION

According to statistics from Internet Live Stats (an elaboration of data by the International Telecommunication Union (ITU) and the United Nations Population Division), the number of Internet users has increased over 20 years from 14,161,570 to 2,925,249,355 (cf. Figure 1). With the development of the Internet, the amount of information from websites worldwide has grown enormously.
As a result, an Internet user may be overloaded with information about a given entity or object, coming from many different sources and presented in many different ways. For example, a user who wants to buy the book "Harry Potter and the Sorcerer's Stone" may be confused because there are too many information sources from various online bookseller websites. Furthermore, a book has many attributes or facets, represented in different formats and with different labels on each web page, such as book name, author, price, summary, publisher, language, etc.

Figure 1: Number of Internet users worldwide from 1993 to May 2014

In general, people pay attention to only a few pieces of main information about a book. The demand for information extraction and integration has thus become an interesting topic for the scientific research community. Many papers on information extraction and integration have been presented at well-known conferences such as the International Conference on World Wide Web (WWW) and the International Conference on Web Engineering (ICWE).

The DOM tree method is a common and efficient approach for an information extraction system. However, a system that relies only on this method may extract information from a website and produce unexpected results, and it would be time-consuming to manually correct those results or to train the system to eliminate them. As the community of Internet users grows larger, the Internet will certainly become a greater labor market in the future. Internet users from diverse cultures and nationalities can become amateurs or experts in a certain field. Crowdsourcing is a new trend that allows us to take advantage of this huge pool of workers. Successful applications built on this model, such as Amazon Mechanical Turk and CrowdFlower, open a new opportunity for supervised labeling of information: they use human resources and knowledge to carry out data-labeling tasks that do not require computer qualifications. We therefore apply a Crowdsourcing platform in our system to improve the method to extract and integrate information from websites.

The rest of this paper is organized as follows. In Section 2, we review several recent studies related to information extraction.
DOM tree methods for information extraction and Crowdsourcing are then briefly discussed in Section 3. Our proposed system and architecture are presented in Section 4. Section 5 shows the experimental results of our system and method. Finally, conclusions are presented in Section 6.
2. RECENT RESEARCH RELATED TO INFORMATION EXTRACTION

2.1. Manual method

By observing a website and its source code, a developer identifies typical templates of the website and writes an extraction script in some scripting language corresponding to these standard templates. The necessary data is then extracted according to the previously determined scenario. However, this method cannot scale to a large number of websites [1].

2.2. Wrapper-based method

Most websites are generated by agencies from databases according to users' requirements; such pages are also known as hidden web pages. This means that a special tool is necessary to extract information from such sites, which is usually done by a wrapper. A wrapper can be seen as a procedure designed to extract the contents of an information source, a program, or a set of extraction rules. This method was proposed by Nicholas Kushmerick in 1997 [2]. The wrapper is trained to extract the necessary data based on sets of extraction rules learned from samples. Among recent studies, we can mention the automatic wrappers for large-scale websites by Nilesh Dalvi et al. in 2011 [3], the learning wrapper built by Tak-Lam Wong in 2012 [4], and the unsupervised learning wrapper of Chia-Hui Chang et al. in 2013 [5].

2.3. DOM tree based method

Information extraction problems are often approached through the data to be extracted. Data is generally divided into three types: unstructured, structured, and semi-structured. Websites are a typical form of semi-structured data: the structural components of a site are displayed on the web via HTML tags.
Based on the structure of the HTML tags, a DOM tree is constructed to determine how the data is organized, and information is then extracted according to the desired structure. This method solves the problem of overlapping information. However, it can only extract within a single web page and cannot reuse the structure of a website or its other components. In addition, the collected data can be inaccurate if the website structure changes. Notable studies include M. Álvarez et al. with "Reusing web contents: a DOM approach" [6], R. Novotny et al. [7], and M. Shaker et al. [8].

3. INFORMATION EXTRACTION ON DOM TREES AND BASIC CONCEPTS OF CROWDSOURCING

3.1. DOM tree structure analysis

According to the W3C, the DOM (Document Object Model) [9] is an application programming interface (API) for valid HTML and well-formed XML documents. The DOM is divided into three levels as follows:
• Core DOM: interface for any document type.
• XML DOM: interface for XML documents.
• HTML DOM: interface for HTML documents.

Figure 2: An example DOM tree of a web page

Creating a DOM tree is a necessary step in the information extraction algorithm. We use IHTMLDOMNode to traverse each node of the web page in the browser and build a DOM tree representing the structure of the currently displayed page. From this tree we know the overall structure of the website: all of its elements, which element precedes which, and which element contains which. To extract the necessary information at a node of the DOM tree, we must clearly specify the path from the root of the tree to that node. This path is called an XPath, or extraction sample [10]; it runs from the root of the DOM tree to the node containing the content to be extracted. The DOM tree is built from the HTML tags of a website: its root node is the <HTML> tag, the inner nodes are the nested tags, and a leaf node contains the content to be extracted. Information extraction on a DOM tree is, in fact, traversing the HTML tags to get the information that these pairs of tags contain: we visit the nodes of the DOM tree in turn until we encounter a leaf node, whose value is the extracted information. For example, to extract the book's name in Figure 2, we traverse the DOM tree as follows: TABLE → TR → TD → "Frozen".

3.2. The Crowdsourcing platforms

The term "Crowdsourcing" first appeared in the article "The Rise of Crowdsourcing" by Jeff Howe, published in 2006 in the leading technology magazine Wired [11]. Crowdsourcing combines the technology and business of outsourcing with the social aspect of open source.
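The DOM-tree traversal of Section 3.1 (TABLE → TR → TD → "Frozen") can be sketched with Python's standard XML parser. The markup below is a hypothetical, well-formed fragment mirroring the DOM tree of Figure 2; real HTML pages would need a tolerant parser such as lxml.

```python
import xml.etree.ElementTree as ET

# Hypothetical, well-formed page fragment (assumption: not from the paper's dataset).
PAGE = """
<html>
  <body>
    <table>
      <tr><td>Frozen</td><td>12.99</td></tr>
    </table>
  </body>
</html>
"""

root = ET.fromstring(PAGE)              # root node is the <html> tag
# The extraction sample: the path from the root down to the leaf node.
leaf = root.find("./body/table/tr/td")
print(leaf.text)                        # -> Frozen
```

The `find` path plays the role of the XPath extraction sample: it walks tag by tag from the root and stops at the leaf whose text is the extracted value.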
Individuals or enterprises (also called Requesters) use human intelligence to perform tasks that computers cannot, such as identifying objects in a picture or writing documents that describe products. Requesters find labor not only in companies or organizations but also in the crowd. For an unsolved problem, Requesters place their faith in the majority and seek out hidden talent to obtain the best solution at an acceptable cost.

Figure 3: An illustration of work on MTurk (taken from http://docs.aws.amazon.com in May 2014)

Some projects use Crowdsourcing successfully, such as the iStockphoto website for exchanging and sharing photos, and Threadless, a community website of online designers. Facebook also uses Crowdsourcing to create versions in different languages, making it suitable for various nations. Most notable is Amazon Mechanical Turk (MTurk), which offers a great deal of work as Crowdsourcing. Amazon calls the tasks that a Requester offers HITs, and the staff who perform these tasks are known as Workers.

4. PROPOSED SOLUTION AND SYSTEM ARCHITECTURE

4.1. System Architecture

Our system is divided into three main stages. In the first stage, starting from the input data, we use the Crowdsourcing platform combined with the K-means algorithm to generate the extracted keywords. In the second stage, the Crowdsourcing platform is used again for crowd-assisted labeling. In the third stage, from the labeled data, we derive the rule sets that are used to extract information.

Figure 4: System Architecture Diagram

4.2. Extracting keywords from the input dataset

According to the empirical studies of Jun Zhu et al. [12], about 35% of current websites are list pages and the rest are detail pages. Since we have already successfully handled the first form in [13], in this paper we focus on the second form: websites describing a single object (detail pages).

4.2.1. Choosing keywords using the Crowdsourcing platform for the first time

Keywords are the pieces of information about the object that users are interested in. In Figure 4 above, the object is a book; users are interested in attributes such as the title, year of publication, price, shipping cost, publication date, publisher, and so on. The attributes that users are interested in are the keywords that we need to collect. To accomplish this task, we have built a system that asks workers simple questions and collects their answers. The keyword-selection algorithm is as follows:

Step 1: Let D be the set of input sample sites: D = {D1, D2, ..., Dn}.
Step 2: Decompose all the words in each Di into a set of words W.
Step 3: For each word w in W, count its frequency H(w).
Step 4: Let W* be the set of candidate keywords: the words of W whose frequency H(w) reaches a threshold chosen experimentally.
Step 5: Once the candidate set is selected, use Crowdsourcing for the first time to let users select the keywords for each field.
Step 6: Apply the K-means algorithm to the users' answers to obtain the final set of keywords.

4.3. Assigning labels to data using the Crowdsourcing platform for the second time

In this stage, for each keyword obtained in the previous stage, questions are generated that workers must answer, so that the system can collect the best rules: the XPath rules for extracting each keyword. Previously, we used a greedy algorithm to reduce the number of questions and tasks that each worker must perform; even with this improvement, the workload per worker was still quite large. We therefore improve the system further to make the worker's job, as well as data acquisition, more efficient. When generating questions, instead of only YES/NO questions, the system asks the worker to click on the region of the page that contains the data related to the keyword, and then to confirm whether the captured data is right or not. With this flexibility, the number of questions is greatly diminished, to about a quarter per worker. To complete a labeling task for a keyword, a worker performs three steps, as in the picture below:

1. Read the question to be answered.
2. Click on the region that contains the data value in question.
3. Confirm that the data is correct.
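Steps 1 through 4 of the keyword-selection algorithm in Section 4.2.1 can be sketched as follows; the tokenized input pages and the threshold value are illustrative assumptions, not the paper's data.

```python
from collections import Counter

def candidate_keywords(sites, threshold):
    """Steps 1-4 of Section 4.2.1: count the frequency H(w) of every word w
    across the input sample sites D and keep the frequent ones as W*."""
    counts = Counter()                    # H(w) for each word w in W
    for words in sites:                   # D = {D1, ..., Dn}
        counts.update(words)              # W: words decomposed from each Di
    # W*: candidate keywords whose frequency reaches the threshold
    return {w for w, h in counts.items() if h >= threshold}

# Hypothetical word sets decomposed from three sample book pages:
D = [
    ["title", "author", "price", "publisher"],
    ["title", "price", "isbn"],
    ["title", "author", "price"],
]
print(candidate_keywords(D, threshold=2))  # frequent words: title, author, price
```

Steps 5 and 6 would then hand W* to the crowd and cluster the workers' choices with K-means to obtain the final keyword set.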
Figure 5: The interface of the system for workers

4.4. Generating extraction rules from the labeled data

After the data labeling is finished, a set of rules is created; these rules are used to extract data from websites according to users' requirements.

Table 1. XPath rules for extracting the book title
Rule | XPath
1 | /html[1]/head[1]/title[1]/#text[1]
2 | /html[1]/body[1]/…/span[1]/a[1]/#text[1]
3 | /html[1]/body[1]/…/div[1]/a[1]/#text[1]

5. EXPERIMENTAL RESULTS

5.1. Results of the K-means algorithm

To build keyword repositories for each field, we ran experiments on website data from three different areas: Movie, Music, and E-commerce (book selling); for each field, we used five websites to collect keywords. The K-means algorithm [14] with K = 2 is applied to cluster the data and select the keywords that users chose most. The collected data is presented in the table below:
Table 2. The number of input keywords and output keywords
Field | Input websites | Candidate keywords | Result keywords
Movie | 5 | 20 | 10
Music | 5 | 22 | 12
Book | 5 | 23 | 10

5.2. Comparison between the greedy and labeling-data approaches

In this section, we present results from experiments with our approach. Our experiments mainly focus on the impact on our system of the greedy method and of the labeling-data method based on the knowledge of the crowd. We tested websites in three different fields and carried out data extraction, with results in the two tables below:

Table 3. Overall results and comparison between the greedy method and labeling data by workers
Field | Input pages | Website | Questions (Greedy) | Questions (Click) | Human answers (Greedy) | Human answers (Click)
Movie | 5 | www.imdb.com | 1760 | 20 | 88 | 7
Music | 5 | www.last.fm | 500 | 24 | 25 | 5
Book | 5 | www.betterworldbooks.com | 680 | 20 | 34 | 5

Table 4. Results on the datasets and runtime
Field | Dataset size | Accuracy | Runtime
Movie | 7 x 10^4 | 87.75% | 01:45:17
Music | 3 x 10^4 | 91.23% | 00:06:04
Book | 5 x 10^4 | 97.52% | 00:22:05

6. CONCLUSION

In this paper, we presented our method for building a system that extracts and integrates information from websites with the following characteristics: the K-means algorithm is applied to DOM trees to extract keywords, and a "right or wrong" question-answering system is built on a Crowdsourcing platform. The resulting rule sets are updated to conform with the extracted data, and labeling data with the knowledge of the crowd reduces the number of tasks each worker performs as well as the cost.
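The two-cluster split used in Section 5.1 (K-means with K = 2 over workers' keyword choices) can be sketched in one dimension as follows; the vote counts below are hypothetical, not the paper's data.

```python
def kmeans_1d(values, iters=20):
    """A minimal one-dimensional K-means with K = 2: split vote counts into a
    low cluster (index 0) and a high cluster (index 1)."""
    centers = [min(values), max(values)]      # initial cluster centers
    assign = []
    for _ in range(iters):
        # assign each value to the nearest center
        assign = [0 if abs(v - centers[0]) <= abs(v - centers[1]) else 1
                  for v in values]
        # recompute each center as the mean of its cluster
        for k in (0, 1):
            cluster = [v for v, a in zip(values, assign) if a == k]
            if cluster:
                centers[k] = sum(cluster) / len(cluster)
    return centers, assign

# Hypothetical worker vote counts for candidate keywords of the Book field:
votes = {"title": 19, "price": 17, "author": 15, "font": 2, "menu": 1}
centers, assign = kmeans_1d(list(votes.values()))
chosen = [w for w, a in zip(votes, assign) if a == 1]
print(chosen)  # -> ['title', 'price', 'author']
```

Keywords falling in the high-vote cluster are kept as the field's result keywords; the rest are discarded, which matches the candidate-to-result reduction shown in Table 2.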
REFERENCES

1. B. Liu, Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data, http://www.cs.uic.edu/~liub/WebMiningBook.html, December 2006.
2. N. Kushmerick, D. S. Weld and R. Doorenbos, "Wrapper induction for information extraction," in IJCAI, 1997.
3. N. Dalvi, R. Kumar and M. Soliman, "Automatic wrappers for large scale Web extraction," in Proc. of the VLDB Endowment, 2011.
4. T.-L. Wong, "Learning to adapt cross language information extraction wrapper," Applied Intelligence, vol. 36, no. 4, pp. 918-931, June 2012.
5. C.-H. Chang, Y.-L. Lin, K.-C. Lin and M. Kayed, "Page-Level Wrapper Verification for Unsupervised Web Data Extraction," in Web Information Systems Engineering, Lecture Notes in Computer Science, vol. 8180, pp. 454-467, 2013.
6. L. Álvarez Sabucedo, L. E. Anido-Rifón and J. M. Santos-Gago, "Reusing web contents: a DOM approach," Software: Practice and Experience, vol. 39, no. 3, pp. 299-314, 2009.
7. R. Novotny, P. Vojtas and D. Maruscak, "Information Extraction from Web Pages," in Proc. of the 2009 IEEE/WIC/ACM International Joint Conference on Web Intelligence and Intelligent Agent Technology, vol. 3, pp. 121-124, 2009.
8. M. Shaker, H. Ibrahim, A. Mustapha and L. N. Abdullah, "Information extraction from web tables," in Proc. of the 11th International Conference on Information Integration and Web-based Applications & Services (iiWAS '09), New York, NY, USA: ACM Press, pp. 470-476, 2009.
9. World Wide Web Consortium, "Document Object Model (DOM)," http://www.w3.org/DOM/, January 19, 2005.
10. M. Okada, N. Ishii and I. Torii, "Information extraction using XPath," in Knowledge-Based and Intelligent Information and Engineering Systems, Lecture Notes in Computer Science, vol. 6278, pp. 104-112, 2010.
11. J. Howe, "The Rise of Crowdsourcing," Wired Magazine, June 2006.
12. J. Zhu, Z. Nie, J.-R. Wen, B. Zhang and W.-Y. Ma, "Simultaneous record detection and attribute labeling in web data extraction," in Proc. of the 12th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 2006.
13. K. Nguyen, H. Nguyen, N. Nguyen, C. Do and T. Tran, "System for training and executing WebBot to extract information from websites," ITCFIT, 2010.
14. J. A. Hartigan and M. A. Wong, "A k-means clustering algorithm," Applied Statistics, 1979.