Extracting Structured Records from Wikipedia

Aniruddha Despande
University of Texas at Arlington

Shivkumar Chandrashekhar
University of Texas at Arlington
ABSTRACT

Wikipedia is a web-based collaborative knowledge-sharing portal comprising articles contributed by authors from all over the world, but its search capabilities are limited to title and full-text search. There is growing interest in querying over the structure implicit in unstructured documents, and this paper explores fundamental ideas for achieving this objective using Wikipedia as a document source. We suggest that semantic information can be extracted from Wikipedia by identifying associations between prominent textual keywords and their neighboring contextual text blocks. This association discovery can be accomplished by combining primitive pattern or regular expression matching with a token frequency determination algorithm, applied to every Wikipedia page, to heuristically promote certain structural entities as context headers and hierarchically wrap surrounding elements under the identified segment titles. The extracted results can be maintained in a domain-neutral semantic schema customized for Wikipedia, which could improve its search capabilities through an efficient interface extended over a relational data source realizing this modeled schema. Experimental results and the implemented prototype indicate that this approach achieves good accuracy in entity associations and high recall in the extraction of diverse Wikipedia structural types.

Categories and Subject Descriptors

H.2 [Information Systems]: Database Management; H.3.3 [Information Systems]: Information Storage and Retrieval – Information Extraction and Retrieval

General Terms

Algorithms, Experimentation, Standardization

Keywords

Information Extraction, Wikipedia

1. INTRODUCTION

The need to amalgamate the web's structured information and knowledge to enable semantically rich queries is a widely accepted necessity. This is the goal behind much of the modern research into information extraction and integration, with standards like the Semantic Web being advertised as the future vision of the current web. We investigate one plausible approach to enabling the expression of semantically rich queries over one such web document source, the content-rich, multi-lingual Wikipedia portal (http://www.wikipedia.org) in particular.

Wikipedia is the largest collaborative knowledge-sharing web encyclopedia. It is one of the most frequently accessed websites on the internet, undergoes frequent revisions, and is available in around 250 languages, with English alone estimated to have around 2 million pages and around 800,000 registered users. However, to find information in the vast number of articles on Wikipedia, users have to rely on a combination of keyword search and browsing. These mechanisms are effective but incapable of supporting complex aggregate queries over the potentially rich set of structures embedded in Wikipedia text. For example, consider the pages about the states Ohio, Illinois and Texas in Wikipedia. Information about the total area, total population and % water can be explicitly read off these pages.

Table 1: Portions from info boxes found on Wikipedia pages

State      Total Area         Total Population   % Water
Ohio       44,825 sq miles    11,353,140         8.7
Illinois   57,918 sq miles    12,831,970         4.0
Texas      268,820 sq miles   20,851,820         2.5

However, this form of presentation prevents us from expressing SQL queries that order the states by increasing population density (total population / total area), or similar operations that process the cumulative knowledge contained in these pages. The ability to query this set of structures is highly desirable. Such data can practically be found in different kinds of Wikipedia structures, namely info boxes, wiki tables, lists, images, etc.
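Once such info box values are available as relational tuples (the data model in Section 5 stores them as (Title, Property, PValue) rows), an ordering of this kind reduces to a short query. The snippet below is only a rough sketch of the idea: the connection parameters, table name and column names are illustrative (loosely following Table 3) rather than the prototype's actual schema.

    require 'mysql2'

    # Connect to the relational store holding the extracted tuples
    # (hypothetical credentials and database name).
    client = Mysql2::Client.new(host: 'localhost', username: 'wiki',
                                database: 'wikirecords')

    rows = client.query(
      "SELECT Title, Property, PValue FROM infoboxes " \
      "WHERE Property IN ('Total Population', 'Total Area')"
    )

    # Collect the two properties per state and compute population density.
    facts = Hash.new { |h, k| h[k] = {} }
    rows.each { |r| facts[r['Title']][r['Property']] = r['PValue'].gsub(/[^\d.]/, '').to_f }

    facts.select  { |_, p| p['Total Population'] && p['Total Area'].to_f > 0 }
         .map     { |title, p| [title, p['Total Population'] / p['Total Area']] }
         .sort_by { |_, density| density }
         .each    { |title, density| puts format('%-10s %8.1f people per sq mile', title, density) }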
In this paper, we present a scheme to extract such structured information from a given Wikipedia page, irrespective of its inherent structural make-up. Wikipedia pages differ from one another in the compositional make-up of the structural types they are built from. Wikipedia allows authors a wide variety of rich representations to choose from, creating a healthy diversity in the content representation formats used across pages. This heterogeneity in the structural make-up of different pages (even within the same domain) presents a challenge to textual extraction and content association.

The rest of the paper is organized as follows. A brief description of the motivation for the project is presented in Section 2. In Section 3 we formalize the problem definition. Section 4 surveys related work. Section 5 describes the data/query model and the project's architectural components. The algorithm design and details are described in Section 6. Section 7 touches upon the implementation details, and we present our evaluation results in Section 8. We conclude and provide directions for future work in Section 9.
2. MOTIVATION

The incentive for this project can be summarized as an attempt to leverage the rich structures inherent in Wikipedia content for successful information extraction and association, and to augment its traditional search options with a mechanism that supports rich relational operations over the Wikipedia information base. Our motivation thus arises from the need to formulate a mechanism to recognize the inherent structures occurring on Wikipedia pages, to design and develop an extraction framework that mines for structural segments in Wikipedia text, and to construct a comprehensive interface supporting effective query formulation over the Wikipedia textual corpus, thereby realizing an integrated model that enables analytical queries on the Wikipedia knowledge base.
3. PROBLEM DEFINITION

The problem addressed in this project can be described as a research initiative into extraction strategies and the design and implementation of an efficient and capable extraction framework to identify and retrieve structured information from Wikipedia and accurately establish associations among the extracted items, so as to preserve them in a relational data source. This requires dealing with various Wikipedia content representation types such as text segments, info boxes, wiki tables, images, paragraphs, links, etc., and the creation of a database schema adequate to encompass such diverse extracted tuples. The provision for rich SQL queries is a simple addition to this system and hence will not be fully explored; instead, support for querying is realized by presenting a simple querying interface. The problem definition can thus be summed up as:

1. Research into extraction strategies, and design and implementation of a retrieval framework to identify and extract structured information from Wikipedia.

2. Extraction of various Wikipedia content types including text segments, info boxes, free text, references and images.

3. Design of a database schema to accommodate the diverse data fields extracted from Wikipedia.

4. Provision of basic querying primitives over the extracted information.
4. RELATED WORK

We performed a wide literature survey to learn about similar extraction initiatives. The idea of bringing semantics into Wikipedia is not new, and several studies on this topic have been carried out in the last few years. Semantic extraction and relationships were discussed in [6]; the authors analyze relevant measures for inferring the semantic relationships between page categories of Wikipedia. DBpedia [5] is a community-based effort that uses manual and automatic data extraction to construct an ontology from Wikipedia templates. In [7], the authors aim to construct knowledge bases focused on the task of organizing and facilitating retrieval within individual Wikipedia document collections. Our approach resembles the idea presented in [8], which introduces the notion of using a relational system as the basis for a workbench for extracting and querying structure from unstructured data in Wikipedia. Their paper focuses on incrementally evolving the understanding of the data in the context of the relational workbench, while our approach relies on an element-term frequency measure to determine an element's relative weight in a given page and models our associations as gravitating towards the less frequently used tokens. The paper [2] presents an approach to mining information relating people, places, organizations and events from Wikipedia and linking them on a time scale, while the authors of [1] explore the possibility of automatically identifying "common sense" statements from unrestricted natural language text found in Wikipedia and mapping them to RDF. Their system works on the hypothesis that common sense knowledge is often expressed in a subject-predicate form, and their work focuses on the challenge of automatically identifying such generic statements.

5. EXTRACTION FRAMEWORK

This section provides details of the data model used in the system and the architectural composition of the extraction framework devised. The conceptual view paradigm widely adopted in relational databases as an abstract representation or model of the real world does not apply to our case; the idea of identifying entities, tabulating them, and establishing relationships across them for a source like Wikipedia is largely irrelevant and certainly not scalable.

5.1 Data/Query Model

Our data model for accommodating Wikipedia data has been designed to be domain independent. The database schema has been modeled to resemble RDF tuples and is primarily designed to scale with increasing content size as well as to deal with wide heterogeneity or diversity in the extracted data.

Table 2: pics table

Id   Title     Image Tag   pic_url
11   Alabama   Flag Al     http://wikimedia.org/AlFlg.jpg

Table 3: infoboxes table

Id   Title     Property   PValue
11   Alabama   Governor   Robert R. Riley

Table 4: twikis table

Id   Title     Content         Tag
11   Alabama   It is seen ..   Law_and_government
The data schema reflects the structural types encountered on Wikipedia as tables in the relational source. This data model allows a new representation type to be incorporated easily whenever it is encountered in a new Wikipedia page. Records are grouped independently of their parent domain or title document. The original page can be reconstructed from the segregated tuples by joining across the 'Title' attribute. The title of a Wikipedia page is a distinguishing attribute and is hence chosen to be a key in our schema. The 'ID' field is an auto-incrementing column which behaves as the primary key within the database tables; however, all operations are based on the logically coherent 'Title' field. Since all tuples in a given table correspond to the same or similar Wikipedia content types, the extracted tuples are uniform, allowing for faster indexing options. However, the data model suffers from the traditional weakness of an RDF-oriented schema of requiring many joins to reconstruct the data. In this paper we emphasize the extraction of the tuples and believe that the extracted data in these tuples can easily be migrated to a more rigidly normalized data store, and hence we choose to ignore the limitations of RDF. We have consciously chosen to record only the location addresses of images rather than their binary content, both to preserve precious database space and to account for updates of images on the actual Wikipedia website. The database tables contain additional attributes, including auto-number identifiers and timestamps, introduced primarily for housekeeping purposes. This data model clearly favors ease of insertion and quick retrieval; however, it does not support quick updates of the linked tuples. We believe that updates are relatively infrequent in our system, and the response times for updates can be improved by treating an update as a delete operation followed by a fresh insertion, which also incidentally helps flush stale tuples. The data model readily adapts to evolving Wikipedia structural types; however, the addition of a new type is a one-time manual affair. The anatomy of a few of the tables is presented above and, as mentioned earlier, they resemble key-value form.
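To make the table anatomy concrete, the following is a minimal sketch of how the three tables of Tables 2-4 could be declared in the Ruby on Rails stack described in Section 7. It is an illustration only: the migration syntax is modern Rails, the column names follow the tables above, and the 'id' plus timestamp columns stand in for the auto-number identifiers and housekeeping attributes mentioned earlier; the paper does not reproduce the prototype's actual DDL.

    require 'active_record'

    # Sketch of the three extraction tables (Tables 2-4); not the project's
    # original migration.
    class CreateExtractionTables < ActiveRecord::Migration[7.0]
      def change
        create_table :pics do |t|            # Table 2
          t.string :title, null: false       # Wikipedia page title (logical key)
          t.string :image_tag
          t.string :pic_url                  # location address only, not binary content
          t.timestamps                       # housekeeping timestamps
        end

        create_table :infoboxes do |t|       # Table 3
          t.string :title, null: false
          t.string :property                 # e.g. "Governor"
          t.string :pvalue                   # e.g. "Robert R. Riley"
          t.timestamps
        end

        create_table :twikis do |t|          # Table 4
          t.string :title, null: false
          t.text   :content                  # extracted text segment
          t.string :tag                      # e.g. "Law_and_government"
          t.timestamps
        end

        # 'Title' is the logically coherent key, so index it in every table.
        add_index :pics,      :title
        add_index :infoboxes, :title
        add_index :twikis,    :title
      end
    end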
Figure 1: System Architecture
5.2 Architecture

The project adopts a simple architecture, presented in Figure 1. The 'Wiki Parser' is an extraction engine built by extending a general HTML parser. The HTML parser is a simple interface in our system that includes a small crawler segment to selectively pick out Wikipedia pages of interest, extract the HTML source from these pages and convert it into an internal object representation. The template set is a collection of Wikipedia HTML/CSS templates or classes and regular expressions built over them to easily identify frequently occurring content headers in Wikipedia pages. In addition to these templates, a frequency determination module augments the token identification process. The 'Wiki Parser' uses the token frequency identification component together with template matching to isolate the main tokens in a given Wikipedia page. The 'Wiki Parser' is capable of handling diverse element types including images, lists, tables, sections, free text and headers. It then iteratively associates surrounding context with these identified tokens to determine key-value pairs to be included in a system hash table. The hash tables are mapped onto the system's relational database using the structural-type-to-table mapping explained in the data model. The database mapping also generates XML records of the extracted knowledge for direct consumption by specific applications.

The Wikipedia template set is a pattern list which can be incrementally improved to account for different Wikipedia pages, or even for web pages on the general web. The templates also act as a noise filter to selectively weed out incomplete tags or structures existing in a Wikipedia page. The user interface is an AJAX-enabled web front end which allows users to express queries over Wikipedia and displays the query outcome visually. The user interface works with the system database to serve user queries using the extracted information; however, it is also equipped to display extracted tuples returned by the 'Wiki Parser' during the online mode of operation.
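As an illustration of the front end of this pipeline, the sketch below fetches one Wikipedia page, applies a small template set of CSS class names, and tallies how often each class occurs, which is the raw input for the frequency determination module. The paper does not name the HTML parser it extends; Nokogiri, open-uri and the class names listed here are assumptions made purely for illustration.

    require 'open-uri'
    require 'nokogiri'

    # Illustrative template set: CSS classes that often mark headers or content.
    TEMPLATE_SET = %w[mw-headline infobox toctext reference].freeze

    # Crawler segment: fetch a page of interest and build the internal
    # object representation (here, simply a parsed DOM).
    html = URI.open('https://en.wikipedia.org/wiki/Ohio').read
    doc  = Nokogiri::HTML(html)

    # Tally how often each templated class occurs on the page; these counts
    # feed the frequency determination module described above.
    class_counts = Hash.new(0)
    doc.css(TEMPLATE_SET.map { |c| ".#{c}" }.join(', ')).each do |node|
      node['class'].to_s.split.each { |c| class_counts[c] += 1 }
    end

    class_counts.sort_by { |_, n| n }.each { |c, n| puts "#{c}: #{n}" }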
6. ALGORITHM DESIGN

The extraction algorithm is a two-pass algorithm over an input web document: it identifies structural tokens and their hierarchies in the first phase, and performs the appropriate token-to-text association matching in the subsequent phase. We present the extraction challenge in sub-section 6.1, the algorithm details in 6.2, and the analysis in sub-section 6.3.

6.1 Extraction Challenge

The main challenge in extraction is to identify the section title or key elements in a given HTML page. Wikipedia offers a wide choice of elements that can be promoted as section titles. For example, the segment headers on the Wikipedia page about the state 'New York' could be demarcated with the CSS class 'mw-headline' and the surrounding text could appear in the CSS class 'reference'. This observation, however, does not hold consistently across all Wikipedia pages; there could be pages where the section titles are tagged with the CSS class 'reference' and the associated text uses the CSS class 'toctext'. Hence it is not trivial to identify which elements are keys within a web page and which are their corresponding values. This is further complicated by the free use of JavaScript between structural types. Moreover, the segment text headers may not always be chosen from a Wikipedia-provided type that maps to an equivalent CSS class. Since Wikipedia allows authors to insert their own HTML formatting, some authors tag their headers using standard HTML tags like <h3>, others may choose a different tag like <h4>, while the rest may opt for the Wikipedia-provided format classes. This adds a layer of ambiguity to the problem of accurately selecting the key element fields. To overcome this contextual uncertainty, we use a statistical measure to predict or promote certain fields as headers. The details of this approach are presented in the following sub-section.
6.2 Algorithm Details

This section describes the internal workings of the statistical measures used for segment header identification. This algorithmic procedure is implemented in the Token Frequency Identification module described as part of the system architecture. The essence of textual extraction is to identify which text belongs under which header. Unlike XML, HTML content exhibits an ambiguity similar to that of associating context bodies with context headers in traditional natural language processing. Our approach augments simple pattern-determining regular expressions with an element frequency score which is computed for every structural element. For example, if the CSS class 'reference' is used three times in an HTML page and the CSS class 'text' appears seven times, then we can conclude with high probability that, since the CSS class 'reference' is used sparingly, it corresponds to a higher-weight element. The algorithmic procedure performs two passes over a given document's element list. The first pass computes a frequency of occurrence for every unique structural element and promotes the less frequent ones as possible segment definers. The second pass uses a combination of the Wikipedia templates and the determined frequencies in a proximity calculation to identify the most proximal match between a more frequent element and a probable context header identified in the first pass. A crude verification is then performed using predefined regular expressions to validate the grouping produced by this two-pass token association algorithm.
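A condensed sketch of this two-pass procedure is given below. It keeps only the core idea: pass one promotes the less frequent element classes as probable context headers, and pass two attaches each remaining element to the most proximal preceding header. The element representation, the below-average frequency cutoff, and the omission of the template-based proximity calculation and regex verification are simplifications, not the prototype's exact logic.

    # Each element is a [css_class, text] pair taken from the parsed page
    # in document order (illustrative representation).
    def associate(elements)
      # Pass 1: frequency of every unique structural element; classes used
      # less often than average are promoted as probable segment definers.
      freq = Hash.new(0)
      elements.each { |css_class, _| freq[css_class] += 1 }
      mean = freq.values.sum.to_f / freq.size
      header_classes = freq.select { |_, n| n < mean }.keys

      # Pass 2: attach each more frequent element to the most proximal
      # preceding header, yielding key-value pairs for the system hash table.
      associations = Hash.new { |h, k| h[k] = [] }
      current_header = nil
      elements.each do |css_class, text|
        if header_classes.include?(css_class)
          current_header = text
        elsif current_header
          associations[current_header] << text
        end
      end
      associations
    end

    page = [['mw-headline', 'Geography'], ['reference', 'Ohio is bounded by ...'],
            ['reference', 'The state lies ...'],
            ['mw-headline', 'Economy'], ['reference', 'Ohio ranks ...']]
    associate(page).each { |header, texts| puts "#{header}: #{texts.join(' ')}" }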
The algorithm displays very good accuracy when working with text segments and images; however, it occasionally makes erroneous predictions when working with wiki tables. The algorithm seeks a statistical score as the determining factor on which to base its associations, in order to overcome the ambiguity that the context presents. This approach could be augmented with a context-based certifier that determines a similarity score between a context heading and its associated value, and thus verifies the statistically computed association.

6.3 Algorithm Analysis

This two-pass algorithm displays a reasonable execution trace for a moderate-size input document set. We provide some formal notation below to aid the analysis of our algorithm and compute its asymptotic time complexity. Let D denote the input document set {d1, d2, ..., dn}, and let Ki = {k1, k2, ..., kj} denote the token set identified for document di. Let P(e) denote the token identification time per element and P(a) the association time per token. The total time is then

T = Σ_{i=1..n} [ |di| · P(e) + |Ki| · P(a) ]

where |di| is the number of structural elements in document di and |Ki| is the number of tokens identified in it. The term P(e) is the key determining factor in this expression, and we use the heuristic-based regular expression matching to reduce the unit time per token identification, as an attempt to speed up the algorithm's execution.

7. IMPLEMENTATION

The project has been implemented as a web application by making the query interface web-enabled. The application has been built using the Ruby language on the Rails web framework. Ruby is an interpreted scripting language designed for quick and easy object-oriented programming. 'Rails' is an open-source web application framework developed to make web programming efficient and simple. Development in Ruby on Rails explicitly follows the Model View Controller (MVC) pattern, an industry-proven implementation architecture; broadly, MVC is a design pattern that describes a recurring problem and a solution that is never exactly the same for every recurrence. The implementation relies on regular expressions to perform template matching and content identification. The extracted information, or tuples, are maintained in a MySQL database which serves as the relational data source.

Figure 2: Implementation Snapshot 1
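As a small illustration of how the model layer of such an MVC application might persist one tuple produced by the association step, consider the sketch below. The model class, connection parameters and column names mirror the infoboxes table of Section 5.1 and are assumptions for illustration; they are not code from the prototype.

    require 'active_record'
    require 'mysql2'

    # Hypothetical connection to the MySQL store of Section 5.1.
    ActiveRecord::Base.establish_connection(
      adapter: 'mysql2', host: 'localhost',
      username: 'wiki', database: 'wikirecords'
    )

    # Rails infers the 'infoboxes' table from the class name.
    class Infobox < ActiveRecord::Base; end

    # Persist one key-value pair extracted by the Wiki Parser.
    Infobox.create!(title: 'Alabama', property: 'Governor', pvalue: 'Robert R. Riley')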
The implementation is intentionally object-oriented to provide ease of extensibility and scaling. The querying interface follows a query-builder paradigm, has AJAX support, and has been built using the Google Web Toolkit (GWT). GWT is an open-source Java software development framework that allows web developers to create AJAX applications in Java.

The querying interface provides support for an online querying mode, wherein the user's query is served by real-time extraction of the Wikipedia page. For analytical or complex queries we recommend the offline mode of querying, which works on the pre-extracted data and results in faster response times. The implementation supports preserving extracted Wikipedia information not only in the relational MySQL data source but also as flat XML files.

Figure 3: Sample XML Output
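Since Figure 3 is not reproduced here, the sketch below shows one way such flat XML records could be emitted from the extracted tuples. The element and attribute names are invented for illustration; the actual layout of the records is the one shown in Figure 3.

    require 'nokogiri'

    # Serialize the extracted tuples for one page into a flat XML file
    # (element names are illustrative, not the prototype's format).
    builder = Nokogiri::XML::Builder.new(encoding: 'UTF-8') do |xml|
      xml.record(title: 'Alabama') do
        xml.infobox('Robert R. Riley', property: 'Governor')
        xml.pic('http://wikimedia.org/AlFlg.jpg', tag: 'Flag Al')
        xml.twiki('It is seen ..', tag: 'Law_and_government')
      end
    end

    File.write('Alabama.xml', builder.to_xml)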
8. EVALUATION RESULTS

The implemented system was evaluated by performing extraction over 100 articles from Wikipedia. The extraction of these articles yielded over 4087 images, 3627 text boxes, and 2866 info box property values. The extraction covered a number of diverse subject domains including geography, politics, famous personalities and tourist attractions. The results indicate that the text and image extraction are domain-neutral and representation-independent. The cleanliness of the text extractions was one of the promising aspects of this project. The extraction results indicate a very high degree of recall for the text and image extraction segments. The association results were verified by validating them against the real-world associations occurring on the source Wikipedia pages. The frequency-based estimation technique achieves an accuracy of over 90% for associating text, images and info boxes with keywords, and an accuracy of around 75% for deeply nested wiki tables.

The system was specifically evaluated to check the reliability of the associations for pages with unseen content definitions. The system yielded acceptable results for around 18 of the 21 different representation types tested. The database schema has been found to be flexible enough for the different content types occurring on Wikipedia pages. The extraction algorithm was also tested under a heavy contiguous load of around 60 web pages, which took 6 minutes on a 1 GHz machine, and thus exhibits good efficiency.

A functional evaluation of the system was also performed to test the integrated working of the system and the inter-connections between its components, and it showed a steady working state of the constructed prototype.

9. CONCLUSION

The work we have performed is still in the early stages of research, but we believe it offers an innovative way of contributing to the construction of ontologies from web pages. Likewise, we believe the prospect of using this methodology to help generate semantic extensions to Wikipedia is both exciting and useful. We summarize our work as including the design, development and implementation of an extraction engine to retrieve structured information from Wikipedia; a data model to map the extracted knowledge into a relational store; XML representations to publish the acquired information; a scheme to handle data diversity during extraction and data preservation; a statistical inference mechanism to decipher contextual ambiguity; and a capable querying interface to present these features.

We envision future work on this topic in terms of incrementally augmenting the statistical score computation using domain analysis or through active regression-based learning approaches. A parallel stream of research could involve identifying and managing inter-article relationships in Wikipedia by observing document references obtained from our extraction framework. Implementation-specific effort can be channeled into the design and construction of web services publishing the extracted data, or into interactively engaging with web agents and receiving queries from them. Focus can also be directed at enriching the extracted Wikipedia knowledge base with information from external data sets by devising ontologies to support such objectives.

10. ACKNOWLEDGMENTS

We would like to thank our instructor Dr. Chengkai Li for his guidance and support in helping us develop the necessary skills in information extraction and accomplish this project in a timely manner.

11. REFERENCES

[1] Suh, S., Halpin, H., and Klein, E. Extracting Common Sense Knowledge from Wikipedia. ISWC Workshop, Athens, Georgia, November 2006.

[2] Bhole, A., Fortuna, B., Grobelnik, M., and Mladenic, D. Extracting Named Entities and Relating Them over Time Based on Wikipedia. Informatica, 2007, 463-468.

[3] Cafarella, M., Etzioni, O., and Suciu, D. Structured Queries Over Web Text. IEEE Computer Society Technical Committee on Data Engineering Bulletin, 2006.

[4] Cafarella, M., Re, C., Suciu, D., Etzioni, O., and Banko, M. Structured Querying of Web Text: A Technical Challenge. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, January 2007.
[5] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. DBpedia: A Nucleus for a Web of Open Data. 6th International Semantic Web Conference (ISWC 2007), Busan, Korea, November 2007.

[6] Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. Extracting Semantic Relationships between Wikipedia Categories. SemWiki2006, 2006.

[7] Milne, D., Witten, I., and Nichols, D. Extracting Corpus-Specific Knowledge Bases from Wikipedia. CIKM'07, Lisbon, Portugal, November 2007.

[8] Chu, E., Baid, A., Chen, T., Doan, A., and Naughton, J. A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. VLDB'07, Vienna, Austria, September 2007.