Extracting Structured Records from Wikipedia

Aniruddha Despande, University of Texas at Arlington
Shivkumar Chandrashekhar, University of Texas at Arlington
ABSTRACT
The Wikipedia is a web-based collaborative knowledge sharing portal comprising articles contributed by authors all over the world, but its search capabilities are limited to title and full-text search only. There is a growing interest in querying over the structure implicit in unstructured documents, and this paper explores fundamental ideas to achieve this objective using Wikipedia as a document source. We suggest that semantic information can be extracted from Wikipedia by identifying associations between prominent textual keywords and their neighboring contextual text blocks. This association discovery can be accomplished by combining primitive pattern or regular expression matching with a token frequency determination algorithm that, for every Wikipedia page, heuristically promotes certain structural entities as context headers and hierarchically wraps the surrounding elements under these identified segment titles. The extracted results can be maintained in a domain-neutral semantic schema customized for Wikipedia, which could improve its search capabilities through an efficient interface extended over a relational data source realizing this modeled schema. Experimental results and the implemented prototype indicate that this approach is successful in achieving good accuracy in entity associations and a high recall in the extraction of diverse Wikipedia structural types.

Categories and Subject Descriptors
H.2 [Information Systems]: Database Management; H.3.3 [Information Systems]: Information Storage and Retrieval – Information Extraction and Retrieval

General Terms
Algorithms, Experimentation, Standardization

Keywords
Information Extraction, Wikipedia

1. INTRODUCTION
The need to amalgamate the web's structured information and knowledge to enable semantically rich queries is a widely accepted necessity. This is the goal behind most modern-day research into information extraction and integration, with standards like the Semantic Web being advertised as the future vision of the current-day web. We investigate one plausible approach to enabling the expression of semantically rich queries over one such web document source, the content-rich, multi-lingual Wikipedia portal (http://www.wikipedia.org) in particular.

The Wikipedia is the largest collaborative knowledge sharing web encyclopedia. It is one of the most frequently accessed websites on the internet, undergoes frequent revisions, and is available in around 250 languages, with English alone estimated to possess around 2 million pages and around 800,000 registered users. However, to find information in the vast number of articles on Wikipedia, users have to rely on a combination of keyword search and browsing. These mechanisms are effective but incapable of supporting complex aggregate queries over the potentially rich set of structures embedded in Wikipedia text. For example, consider the pages about the states Ohio, Illinois and Texas in Wikipedia. Information about the total area, total population and % water can be explicitly inferred from these pages.

Table 1: Portions from info boxes found on Wikipedia pages

    State       Total Area           Total Population    % Water
    Ohio        44,825 sq miles      11,353,140          8.7
    Illinois    57,918 sq miles      12,831,970          4.0
    Texas       268,820 sq miles     20,851,820          2.5

However, keyword search and browsing alone do not let us express SQL queries to order the states by increasing population density (total population / total area), or similar operations that process the cumulative knowledge contained in these pages. The ability to query this set of structures is highly desirable. Such data can practically be found in different kinds of Wikipedia structures, namely info boxes, wiki tables, lists, images, etc. In this paper, we present a scheme to extract such structured information from a given Wikipedia page, irrespective of its inherent structural make-up.
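Once such values are extracted into a relational store, the density ordering above becomes a one-line query. The sketch below is illustrative only: the state_infoboxes table, its column names and the connection settings are hypothetical stand-ins for extracted infobox values, not the schema introduced later in Section 5.1.

    require "mysql2"   # assumes the mysql2 gem and a local MySQL instance

    # Hypothetical table of extracted infobox values:
    #   state_infoboxes(title, total_area_sq_mi, total_population, pct_water)
    client = Mysql2::Client.new(host: "localhost", username: "wiki", database: "wikidb")

    sql = <<~SQL
      SELECT title,
             total_population / total_area_sq_mi AS density
      FROM state_infoboxes
      ORDER BY density ASC
    SQL

    client.query(sql).each do |row|
      puts format("%-10s %.1f people / sq mile", row["title"], row["density"])
    end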
Wikipedia pages differ from one another in the compositional make-up of the structural types they are built from. Wikipedia allows authors a wide variety of rich representations to choose from, thus creating a healthy diversity in the content representation formats used across pages. This heterogeneity in the structural make-up of different pages (even from the same domain) presents a challenge to textual extraction and content association.

The rest of the paper is organized as follows. A brief description of the motivation of the project is presented in Section 2. In Section 3 we formalize the problem definition. Section 4 surveys related work. Section 5 describes the data/query model and the project's architectural components. The algorithm design and details are described in Section 6. Section 7 touches upon the implementation details, and we present our evaluation results in Section 8. We conclude and provide directions for future work in Section 9.
2. MOTIVATION
The incentive for this project can be summarized as an attempt to leverage the rich structures inherent in Wikipedia content for successful information extraction and association, and to augment its traditional search options with a mechanism that supports rich relational operations over the Wikipedia information base. Thus our motivation arises from the need to formulate a mechanism to recognize the inherent structures occurring on Wikipedia pages, to design and develop an extraction framework which mines for structural segments in the Wikipedia text, and to construct a comprehensive interface that supports effective query formulation over the Wikipedia textual corpus, thereby realizing an integrated model enabling analytical queries on the Wikipedia knowledge base.

3. PROBLEM DEFINITION
The problem addressed in this project can be described as a research initiative into extraction strategies and the design and implementation of an efficient and capable extraction framework to identify and retrieve structured information from Wikipedia and to accurately establish associations among the extracted items, so as to preserve them in a relational data source. This requires dealing with various Wikipedia content representation types such as text segments, info boxes, wiki tables, images, paragraphs and links, and the creation of a database schema adequate to encompass such diverse extracted tuples. The provision for rich SQL queries is a simple addition to this system, and hence will not be fully explored; rather, support for querying is realized by presenting a simplistic querying interface. The problem definition can thus be summed up as:

1. Research into extraction strategies, and design and implementation of a retrieval framework to identify and extract structured information from Wikipedia.
2. Extraction of various Wikipedia content types including text segments, info boxes, free text, references and images.
3. Design of a database schema to accommodate the diverse data fields extracted from Wikipedia.
4. Provision of basic querying primitives over the extracted information.
4. RELATED WORK
We performed a wide literature survey to learn about similar extraction initiatives. The idea of bringing semantics into Wikipedia is not new, and several studies on this topic have been carried out in the last few years. Semantic extraction and relationships were discussed in [6], where the authors analyze relevant measures for inferring the semantic relationships between Wikipedia page categories. DBpedia [5] is a community-based effort that uses manual and automatic data extraction to construct an ontology from Wikipedia templates. In [7], the authors aim to construct knowledge bases that are focused on the task of organizing and facilitating retrieval within individual Wikipedia document collections. Our approach resembles the idea presented in [8], which proposes a relational system as the basis of a workbench for extracting and querying structure from unstructured data in Wikipedia. Their paper focuses on incrementally evolving the understanding of the data in the context of the relational workbench, while our approach relies on an element-term frequency measure to determine an element's relative weight in a given page and models associations as gravitating towards the lesser-used tokens. The paper [2] presents an approach to mining information relating people, places, organizations and events from Wikipedia and linking them on a time scale, while the authors in [1] explore the possibility of automatically identifying "common sense" statements in unrestricted natural language text found in Wikipedia and mapping them to RDF. Their system works on the hypothesis that common sense knowledge is often expressed in a subject-predicate form, and their work focuses on the challenge of automatically identifying such generic statements.

5. EXTRACTION FRAMEWORK
This section provides details of the data model used in the system and the architectural composition of the extraction framework devised. The conceptual view paradigm widely adopted in relational databases as an abstract representation or model of the real world does not apply to our case: the idea of identifying entities and tabulating them, or of establishing relationships across them in a Wikipedia source, is largely irrelevant here and certainly not scalable.
5.1 Data/Query Model
Our data model for accommodating Wikipedia data has been designed to be domain independent. The database schema has been modeled to resemble RDF tuples, and is primarily designed to scale with increasing content size as well as to deal with the wide heterogeneity, or diversity, of the extracted data.

Table 2: pics table

    Id    Title      Image Tag    pic_url
    11    Alabama    Flag Al      http://wikimedia.org/AlFlg.jpg

Table 3: infoboxes table

    Id    Title      Property     PValue
    11    Alabama    Governor     Robert R. Riley

Table 4: twikis table

    Id    Title      Content       Tag
    11    Alabama    It is seen .. Law_and_government

The data schema reflects the structural types encountered on Wikipedia as tables in the relational source. This data model allows for easy incorporation of a new representation type
whenever it is encountered in a new Wikipedia page. Records are grouped independently of their parent domain or title document. The original page can be reconstructed from the segregated tuples by joining across the 'Title' attribute. The title of a Wikipedia page is a distinguishing attribute and is hence chosen as a key in our schema. The 'Id' field is an auto-incrementing column which behaves as the primary key within the database tables; however, all operations are based on the logically coherent 'Title' field. Since all tuples in a given table correspond to the same or similar Wikipedia content types, the extracted tuples are uniform, allowing for faster indexing options. However, the data model suffers from the traditional weakness of an RDF-oriented schema of requiring too many joins to reconstruct the data. In this paper we emphasize the extraction of the tuples, believe that the extracted data in these tuples can easily be migrated to a more rigidly normalized data store, and hence choose to ignore this limitation of RDF. We have consciously chosen to record only the location addresses of images rather than their binary content, both to preserve database space and to account for updates of images on the actual Wikipedia website. The database tables contain additional attributes, including auto-number identifiers and timestamps, introduced primarily for housekeeping purposes. This data model clearly favors ease of insertion and quick retrieval, but does not support quick updates of the linked tuples. We believe that updates are relatively few in our system, and the response times for updates can be improved by treating an update as a delete operation followed by a fresh insertion, which also helps in flushing stale tuples. The data model adapts to evolving Wikipedia structural types; however, the addition of a new type is a one-time manual affair. The anatomy of a few of the tables is presented above and, as mentioned earlier, they resemble key-value form.
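To make the join-by-'Title' reconstruction concrete, the sketch below pulls every tuple of one article back out of the per-type tables of Tables 2-4. The exact column spellings and connection settings are assumptions made for illustration.

    require "mysql2"   # assumes the mysql2 gem and the MySQL store described above

    client = Mysql2::Client.new(host: "localhost", username: "wiki", database: "wikidb")

    # Reassemble one article from the per-type tables (pics, infoboxes, twikis)
    # by selecting on the shared logical key 'Title'.
    %w[pics infoboxes twikis].each do |table|
      client.query("SELECT * FROM #{table} WHERE Title = 'Alabama'").each do |row|
        p row
      end
    end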
5.2 Architecture
The project adopts a simple architecture, presented in Figure 1. The 'Wiki Parser' is an extraction engine built by extending a general HTML parser. The HTML parser is a simple interface in our system that includes a small crawler segment to selectively pick out Wikipedia pages of interest, extract the HTML source from these pages, and convert it into an internal object representation. The template set is a collection of Wikipedia HTML/CSS templates, or classes, and regular expressions built over them to easily identify frequently occurring content headers in Wikipedia pages. In addition to these templates, a frequency determination module augments the token identification process. The 'Wiki Parser' uses the token frequency identification component in association with the template matching to isolate the main tokens in a given Wikipedia page. The 'Wiki Parser' is capable of handling diverse element types including images, lists, tables, sections, free text and headers. It then iteratively associates surrounding context with these identified tokens to determine key-value pairs to be included in a system hash table. The hash tables are mapped onto the system relational database using the structural-type-to-table mapping explained in the data model. The database mapping also generates XML records of the extracted knowledge for direct consumption by specific applications.

Figure 1: System Architecture

The Wikipedia template set is a pattern list which can be incrementally improved to account for different Wikipedia pages, or even web pages on the general web. The templates also act as a noise filter to selectively weed out incomplete tags or structures existing in the Wikipedia page. The user interface is an AJAX-enabled web front end which allows users to express queries over Wikipedia and displays the query outcome visually. The user interface works with the system database to serve user queries using extracted information; however, it is also equipped to display extracted tuples returned by the 'Wiki Parser' during the online mode of operation.
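The end-to-end flow described above can be summarized in a short skeleton. It is only a sketch of the described components, not the project's actual code; the class name, the fetch URL, the db interface and the type-to-table mapping are assumptions made for illustration.

    require "net/http"
    require "uri"

    # Structural-type-to-table mapping applied when flushing the hash table to MySQL.
    TYPE_TO_TABLE = { image: "pics", infobox: "infoboxes", text_segment: "twikis" }

    class WikiParser
      # Crawler segment: fetch the raw HTML of a page of interest.
      def fetch(title)
        Net::HTTP.get(URI("http://en.wikipedia.org/wiki/#{title}"))
      end

      # Template matching plus token frequency identification would populate a
      # hash table of {structural_type => [[key, value], ...]} entries per page.
      def extract(title)
        html = fetch(title)
        records = Hash.new { |h, k| h[k] = [] }
        # ... template matching and frequency scoring fill `records` from html ...
        records
      end

      # Map each structural type onto its relational table; `db` stands for any
      # object exposing an insert(table, title, key, value) method (an assumption).
      def store(title, records, db)
        records.each do |type, pairs|
          table = TYPE_TO_TABLE.fetch(type)
          pairs.each { |key, value| db.insert(table, title, key, value) }
        end
      end
    end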
6. ALGORITHM DESIGN
The extraction algorithm is a two-pass algorithm over an input web document which identifies structural tokens and their hierarchies in the first phase and performs the appropriate token-to-text matching associations in the subsequent phase. We present the extraction challenge in sub-section 6.1, the algorithm details in sub-section 6.2, and the analysis in sub-section 6.3.

6.1 Extraction Challenge
The main challenge in extraction is to identify the section title, or key, elements in a given HTML page. Wikipedia offers a wide choice of elements that can be promoted as section titles. For example, the segment headers on a Wikipedia page about the state 'New York' could be demarcated with the CSS class 'mw-headline' while the surrounding text appears in the CSS class 'reference'. This observation, however, does not hold consistently across all Wikipedia pages. There could be pages where the section titles are tagged using the CSS class 'reference' and their associated text uses the CSS class 'toctext'. Hence it is not trivial to identify which elements are keys within a web page and which are their corresponding values. The situation is further complicated by the free usage of JavaScript between structural types. Also, the segment text headers may not always be marked with a Wikipedia-provided type that maps to an equivalent CSS class. Since Wikipedia allows authors to insert their own HTML formatting, some authors choose to tag their headers using standard HTML tags like <h3>, others may choose a different tag like <h4>, while the rest may opt for the Wikipedia-provided format classes. This adds a layer of ambiguity to the problem of accurately selecting the key element fields. To overcome this contextual uncertainty, we use a statistical measure to predict, or promote, certain fields as headers. The details of this approach are presented in the following sub-section.

6.2 Algorithm Details
This section describes the internal workings of the statistical measures used for segment header identification. This algorithmic procedure is implemented in the Token Frequency Identification module described as part of the system architecture. The essence of textual extraction is to identify what text belongs under which header. Unlike XML, HTML exhibits an ambiguity close to that of associating context-body with context-headers in traditional natural language processing. Our approach augments simple pattern-determining regular expressions with an element frequency score which is computed for every structural element. For example, if the CSS class 'reference' is used three times in an HTML page and the CSS class 'text' appears seven times, then we can conclude with high probability that, since the CSS class 'reference' is sparingly used, it corresponds to a higher-weight element. The algorithmic procedure performs two passes over a given document element list. The first pass computes a frequency of occurrence for every unique structural element and promotes the less frequent ones as possible segment definers. The second pass uses a combination of the Wikipedia templates and the determined frequencies in a proximity calculation to identify the most proximal match between a more frequent element and a probable context header identified in the first pass. A crude verification is performed using predefined regular expressions to validate the grouping done by this two-pass token association algorithm.

The algorithm displays very good accuracy when working with text segments and images, but encounters occasional erroneous predictions when working with wiki tables. The algorithm seeks a statistical score as the determining factor on which to base its associations, to overcome the ambiguity that the context presents. This approach can be augmented with a context-based certifier that determines a similarity score between the context heading and its associated value, and thus verifies the statistical association computed.
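A compact sketch of the two-pass procedure is given below. It follows the description above under simplifying assumptions: elements are reduced to (css_class, text, position) triples, the 'template' check is a plain regex list, and the proximity calculation is simple positional distance; the real module may differ in all of these details.

    Element = Struct.new(:css_class, :text, :position)

    # Known Wikipedia header classes from the template set (illustrative subset).
    HEADER_TEMPLATES = [/mw-headline/, /toctext/].freeze

    # Pass 1: count class frequencies and promote the rarer classes (or known
    # template classes) as candidate segment definers, i.e. context headers.
    def promote_headers(elements)
      freq = elements.group_by(&:css_class).transform_values(&:size)
      cutoff = freq.values.sum / freq.size.to_f          # mean frequency
      elements.select do |el|
        freq[el.css_class] <= cutoff || HEADER_TEMPLATES.any? { |t| el.css_class =~ t }
      end
    end

    # Pass 2: associate every remaining, more frequent element with the most
    # proximal promoted header that precedes it, yielding key-value pairs.
    def associate(elements)
      headers = promote_headers(elements)
      pairs = Hash.new { |h, k| h[k] = [] }
      (elements - headers).each do |el|
        key = headers.select { |h| h.position < el.position }
                     .min_by { |h| el.position - h.position }
        pairs[key.text] << el.text if key
      end
      pairs
    end

Feeding this the parsed element list of the 'New York' page would, for instance, group paragraph text under the nearest preceding 'mw-headline' span.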
6.3 Algorithm Analysis
This two-pass algorithm displays a reasonable execution trace for a moderately sized input document set. We provide some formal notation below to aid the analysis of our algorithm and to compute its asymptotic time complexity. Let D denote the input document set {d1, d2, d3, ..., dn}, and let K = {k1, k2, ..., kj} denote the token set of each document. Let P(e) denote the token identification time and P(a) the association time per token. The time complexity is then

    T = Σ_{i=1..n} [ d_i · P(e) + Σ_{m=1..j} k_m · P(a) ]

The term P(e) is the key determining factor in this expression, and we use heuristic-based regular expression matching to reduce the unit time per token identification, in an attempt to speed up the algorithm's execution.

7. IMPLEMENTATION
The project implementation has been modeled as a web application by making the query interface web-enabled. The application has been built using the Ruby language on the Rails web framework. Ruby is an interpreted scripting language designed for quick and easy object-oriented programming. Rails is an open-source web application framework developed to make web programming efficient and simple. Development in Ruby on Rails explicitly follows the Model View Controller (MVC) architecture, an industry-proven design pattern, i.e., a description of a recurring problem and its solution where the solution is never exactly the same for every recurrence. The implementation relies on regular expressions to perform template matching and content identification. The extracted information, or tuples, are maintained in a MySQL database which serves as the relational data source.

Figure 2: Implementation Snapshot 1
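As a flavor of the model layer, a Rails-style ActiveRecord model can sit directly over the tuple tables of Section 5.1 and serve queries out of MySQL. Everything named below (the connection settings, the Infobox class, the lower-case column names) is an assumption for illustration, not the project's actual code.

    require "active_record"   # assumes the activerecord and mysql2 gems

    ActiveRecord::Base.establish_connection(
      adapter: "mysql2", host: "localhost", username: "wiki", database: "wikidb"
    )

    # Hypothetical model over the infoboxes table of Section 5.1
    # (column names assumed to be title, property, pvalue).
    class Infobox < ActiveRecord::Base
      self.table_name = "infoboxes"
    end

    # List every extracted infobox property for one article.
    Infobox.where(title: "Alabama").each do |row|
      puts "#{row.property}: #{row.pvalue}"
    end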
The implementation is intentionally object-oriented to provide ease of extensibility and scaling. The querying interface follows a query-builder paradigm, has AJAX support, and has been built using the Google Web Toolkit (GWT). GWT is an open-source Java software development framework that allows web developers to create AJAX applications in Java.

The querying interface supports an online querying mode, wherein the user's query is served by real-time extraction of the Wikipedia page. For analytical or complex queries we recommend the offline mode of querying, which works on the pre-extracted data and results in faster response times. The implementation supports preserving extracted Wikipedia information not only in the relational MySQL data source but also as flat XML files.
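The XML export can be produced directly from the extracted key-value tuples using Ruby's bundled REXML library. The record layout below is a guess made for illustration only; the actual format is the one shown in Figure 3.

    require "rexml/document"   # bundled with Ruby

    # Hypothetical extracted tuples for one page, in the key-value form of Section 5.1.
    tuples = [
      { type: "infobox", property: "Governor", value: "Robert R. Riley" },
      { type: "pic",     property: "Flag Al",  value: "http://wikimedia.org/AlFlg.jpg" }
    ]

    doc  = REXML::Document.new
    page = doc.add_element("page", "title" => "Alabama")
    tuples.each do |t|
      record = page.add_element("record", "type" => t[:type])
      record.add_element("property").text = t[:property]
      record.add_element("value").text    = t[:value]
    end

    doc.write($stdout, 2)   # emit the flat XML content with 2-space indentation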
Figure 3: Sample XML Output

8. EVALUATION RESULTS
The implemented system was evaluated by extracting over 100 articles from Wikipedia. The extraction of these articles yielded over 4,087 images, 3,627 text boxes, and 2,866 info box property values. The extraction covered a number of diverse subject domains including geography, politics, famous personalities and tourist attractions. The results indicate that the text and image extraction are domain-neutral and representation-independent. The cleanliness of the text extractions was one of the promising aspects of this project, and the extraction results indicate a very high recall in the text and image extraction segments. The association results were verified by validating them against the real-world associations occurring on the source Wikipedia pages. The frequency-based estimation technique achieves an accuracy of over 90% for associating text, images and info boxes with keywords, and an accuracy of around 75% for deeply nested wiki tables.

The system was specifically evaluated to check the reliability of the associations for pages with unseen content definitions. The system yielded acceptable results for around 18 of the 21 different representation types tested. The database schema was found to be flexible enough for the different content types occurring on Wikipedia pages. The extraction algorithm was also tested under a heavy contiguous load of around 60 web pages, which took 6 minutes on a 1 GHz machine, and thus exhibits good efficiency.

A functional evaluation of the system was also performed to test the integrated working of the system and the inter-connections between its components, which showed a steady working state of the constructed prototype.

9. CONCLUSION
The work we have performed is still in the early stages of research, but we believe it offers an innovative way of contributing to the construction of ontologies from web pages. Likewise, we believe the prospect of using this methodology to help generate semantic extensions of Wikipedia is both exciting and useful. Our work comprises the design, development and implementation of an extraction engine to retrieve structured information from Wikipedia; a data model to map the extracted knowledge into a relational store; XML representations to publish the acquired information; a scheme to handle data diversity during extraction and data preservation; a statistical inference mechanism to decipher contextual ambiguity; and a capable querying interface that presents these features.

We envision future work on this topic in terms of incrementally augmenting the statistical score computation using domain analysis or active regression-based learning approaches. A parallel stream of research can involve identifying and managing inter-article relationships in Wikipedia by observing the document references obtained from our extraction framework. Implementation-specific effort can be channeled into the design and construction of web services that publish the extracted data, or that interactively engage with web agents and receive queries from them. Focus can also be directed at enriching the extracted Wikipedia knowledge base with information from external data sets by devising ontologies to support such objectives.
10. ACKNOWLEDGMENTS
We would like to thank our instructor, Dr. Chengkai Li, for his guidance and support in helping us develop the necessary skills in information extraction and complete this project on time.

11. REFERENCES
[1] Suh, S., Halpin, H., and Klein, E. Extracting Common Sense Knowledge from Wikipedia. ISWC Workshop, Athens, Georgia, November 2006.
[2] Bhole, A., Fortuna, B., Grobelnik, M., and Mladenic, D. Extracting Named Entities and Relating Them over Time Based on Wikipedia. Informatica, 2007, 463-468.
[3] Cafarella, M., Etzioni, O., and Suciu, D. Structured Queries Over Web Text. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2006.
[4] Cafarella, M., Re, C., Suciu, D., Etzioni, O., and Banko, M. Structured Querying of Web Text: A Technical Challenge. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, January 2007.
[5] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. DBpedia: A Nucleus for a Web of Open Data. 6th International Semantic Web Conference (ISWC 2007), Busan, Korea, November 2007.
[6] Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. Extracting Semantic Relationships between Wikipedia Categories. SemWiki 2006, 2006.
[7] Milne, D., Witten, I., and Nichols, D. Extracting Corpus-Specific Knowledge Bases from Wikipedia. CIKM '07, Lisbon, Portugal, November 2007.
[8] Chu, E., Baid, A., Chen, T., Doan, A., and Naughton, J. A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. VLDB '07, Vienna, Austria, September 2007.

Más contenido relacionado

Destacado

New And Improved Cap Presentation
New And Improved Cap PresentationNew And Improved Cap Presentation
New And Improved Cap Presentationdbigue
 
Meet the Coordinator of Learning Technology
Meet the Coordinator of Learning TechnologyMeet the Coordinator of Learning Technology
Meet the Coordinator of Learning TechnologyDarcy Goshorn
 
Narraciones extraordinarias
Narraciones extraordinariasNarraciones extraordinarias
Narraciones extraordinariasdanialar
 

Destacado (6)

New And Improved Cap Presentation
New And Improved Cap PresentationNew And Improved Cap Presentation
New And Improved Cap Presentation
 
Meet the Coordinator of Learning Technology
Meet the Coordinator of Learning TechnologyMeet the Coordinator of Learning Technology
Meet the Coordinator of Learning Technology
 
Sales Kickoff Plan
Sales Kickoff PlanSales Kickoff Plan
Sales Kickoff Plan
 
Campaign Presentation
Campaign PresentationCampaign Presentation
Campaign Presentation
 
Technology Keynote
Technology KeynoteTechnology Keynote
Technology Keynote
 
Narraciones extraordinarias
Narraciones extraordinariasNarraciones extraordinarias
Narraciones extraordinarias
 

Similar a Extracting Structured Records from Wikipedia

Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintSw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintokeee
 
Allen Institute Neurowiki Presentation
Allen Institute Neurowiki PresentationAllen Institute Neurowiki Presentation
Allen Institute Neurowiki PresentationWilliam Smith
 
WikiOnt: An Ontology for Describing and Exchanging Wiki Articles
WikiOnt: An Ontology for Describing and Exchanging Wiki ArticlesWikiOnt: An Ontology for Describing and Exchanging Wiki Articles
WikiOnt: An Ontology for Describing and Exchanging Wiki ArticlesJohn Breslin
 
SMWCon 2012 Linked Data Visualizations
SMWCon 2012 Linked Data VisualizationsSMWCon 2012 Linked Data Visualizations
SMWCon 2012 Linked Data VisualizationsWilliam Smith
 
Relevance of clasification and indexing
Relevance of clasification and indexingRelevance of clasification and indexing
Relevance of clasification and indexingVaralakshmiRSR
 
7 things you should know about wikis
7 things you should know about wikis7 things you should know about wikis
7 things you should know about wikisAykut Özmen
 
SWAN/SIOC: Aligning Scientific Discourse Representation and Social Semantics
SWAN/SIOC: Aligning Scientific Discourse Representation and Social SemanticsSWAN/SIOC: Aligning Scientific Discourse Representation and Social Semantics
SWAN/SIOC: Aligning Scientific Discourse Representation and Social SemanticsJohn Breslin
 
Semantic Search on Heterogeneous Wiki Systems - Short
Semantic Search on Heterogeneous Wiki Systems - ShortSemantic Search on Heterogeneous Wiki Systems - Short
Semantic Search on Heterogeneous Wiki Systems - ShortFabrizio Orlandi
 
Semantic Web Tools For Agricultural Materials
Semantic Web Tools For Agricultural MaterialsSemantic Web Tools For Agricultural Materials
Semantic Web Tools For Agricultural MaterialsGerard Sylvester
 
From Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
From Bibliometrics to Cybermetrics - a book chapter by Nicola de BellisFrom Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
From Bibliometrics to Cybermetrics - a book chapter by Nicola de BellisXanat V. Meza
 
Wnl 122 towards social sementic by samhati soor
Wnl 122 towards social sementic by samhati soorWnl 122 towards social sementic by samhati soor
Wnl 122 towards social sementic by samhati soorKishor Satpathy
 
De-centralized but global: Redesigning biodiversity data aggregation for impr...
De-centralized but global: Redesigning biodiversity data aggregation for impr...De-centralized but global: Redesigning biodiversity data aggregation for impr...
De-centralized but global: Redesigning biodiversity data aggregation for impr...taxonbytes
 
Analyzing wikipedia haley
Analyzing wikipedia haleyAnalyzing wikipedia haley
Analyzing wikipedia haleyRajiv Kumar
 
Wikis in the Workplace: Enhancing Collaboration and Knowledge Management
Wikis in the Workplace: Enhancing Collaboration and Knowledge ManagementWikis in the Workplace: Enhancing Collaboration and Knowledge Management
Wikis in the Workplace: Enhancing Collaboration and Knowledge ManagementMary Jenkins
 
Web24dev Icrisat 2
Web24dev Icrisat 2Web24dev Icrisat 2
Web24dev Icrisat 2pritpalkaur
 

Similar a Extracting Structured Records from Wikipedia (20)

Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprintSw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
Sw 3 bizer etal-d bpedia-crystallization-point-jws-preprint
 
Allen Institute Neurowiki Presentation
Allen Institute Neurowiki PresentationAllen Institute Neurowiki Presentation
Allen Institute Neurowiki Presentation
 
WikiOnt: An Ontology for Describing and Exchanging Wiki Articles
WikiOnt: An Ontology for Describing and Exchanging Wiki ArticlesWikiOnt: An Ontology for Describing and Exchanging Wiki Articles
WikiOnt: An Ontology for Describing and Exchanging Wiki Articles
 
SMWCon 2012 Linked Data Visualizations
SMWCon 2012 Linked Data VisualizationsSMWCon 2012 Linked Data Visualizations
SMWCon 2012 Linked Data Visualizations
 
Relevance of clasification and indexing
Relevance of clasification and indexingRelevance of clasification and indexing
Relevance of clasification and indexing
 
Biodiversity Informatics on the Semantic Web
Biodiversity Informatics on the Semantic WebBiodiversity Informatics on the Semantic Web
Biodiversity Informatics on the Semantic Web
 
Hahn "Wikidata as a hub to library linked data re-use"
Hahn "Wikidata as a hub to library linked data re-use"Hahn "Wikidata as a hub to library linked data re-use"
Hahn "Wikidata as a hub to library linked data re-use"
 
BabelNet 3.0
BabelNet 3.0BabelNet 3.0
BabelNet 3.0
 
7 things you should know about wikis
7 things you should know about wikis7 things you should know about wikis
7 things you should know about wikis
 
Resources, resources, resources: the three rs of the Web
Resources, resources, resources: the three rs of the WebResources, resources, resources: the three rs of the Web
Resources, resources, resources: the three rs of the Web
 
SWAN/SIOC: Aligning Scientific Discourse Representation and Social Semantics
SWAN/SIOC: Aligning Scientific Discourse Representation and Social SemanticsSWAN/SIOC: Aligning Scientific Discourse Representation and Social Semantics
SWAN/SIOC: Aligning Scientific Discourse Representation and Social Semantics
 
Semantic Search on Heterogeneous Wiki Systems - Short
Semantic Search on Heterogeneous Wiki Systems - ShortSemantic Search on Heterogeneous Wiki Systems - Short
Semantic Search on Heterogeneous Wiki Systems - Short
 
Semantic Web Tools For Agricultural Materials
Semantic Web Tools For Agricultural MaterialsSemantic Web Tools For Agricultural Materials
Semantic Web Tools For Agricultural Materials
 
From Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
From Bibliometrics to Cybermetrics - a book chapter by Nicola de BellisFrom Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
From Bibliometrics to Cybermetrics - a book chapter by Nicola de Bellis
 
Wnl 122 towards social sementic by samhati soor
Wnl 122 towards social sementic by samhati soorWnl 122 towards social sementic by samhati soor
Wnl 122 towards social sementic by samhati soor
 
De-centralized but global: Redesigning biodiversity data aggregation for impr...
De-centralized but global: Redesigning biodiversity data aggregation for impr...De-centralized but global: Redesigning biodiversity data aggregation for impr...
De-centralized but global: Redesigning biodiversity data aggregation for impr...
 
Analyzing wikipedia haley
Analyzing wikipedia haleyAnalyzing wikipedia haley
Analyzing wikipedia haley
 
Wikis in the Workplace: Enhancing Collaboration and Knowledge Management
Wikis in the Workplace: Enhancing Collaboration and Knowledge ManagementWikis in the Workplace: Enhancing Collaboration and Knowledge Management
Wikis in the Workplace: Enhancing Collaboration and Knowledge Management
 
Web24dev Icrisat 2
Web24dev Icrisat 2Web24dev Icrisat 2
Web24dev Icrisat 2
 
New old(1)
New old(1)New old(1)
New old(1)
 

Último

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Paola De la Torre
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...gurkirankumar98700
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking MenDelhi Call girls
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonetsnaman860154
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024The Digital Insurer
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreternaman860154
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 

Último (20)

Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101Salesforce Community Group Quito, Salesforce 101
Salesforce Community Group Quito, Salesforce 101
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
Kalyanpur ) Call Girls in Lucknow Finest Escorts Service 🍸 8923113531 🎰 Avail...
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptx
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men08448380779 Call Girls In Friends Colony Women Seeking Men
08448380779 Call Girls In Friends Colony Women Seeking Men
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)
 
How to convert PDF to text with Nanonets
How to convert PDF to text with NanonetsHow to convert PDF to text with Nanonets
How to convert PDF to text with Nanonets
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024Finology Group – Insurtech Innovation Award 2024
Finology Group – Insurtech Innovation Award 2024
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Presentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreterPresentation on how to chat with PDF using ChatGPT code interpreter
Presentation on how to chat with PDF using ChatGPT code interpreter
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 

Extracting Structured Records from Wikipedia

Aniruddha Despande, University of Texas at Arlington
Shivkumar Chandrashekhar, University of Texas at Arlington

ABSTRACT
The Wikipedia is a web-based collaborative knowledge sharing portal comprising articles contributed by authors all over the world, but its search capabilities are limited to title and full-text search only. There is a growing interest in querying over the structure implicit in unstructured documents, and this paper explores fundamental ideas to achieve this objective using Wikipedia as a document source. We suggest that semantic information can be extracted from Wikipedia by identifying associations among prominent textual keywords and their neighboring contextual text blocks. This association discovery can be accomplished using a combination of primitive pattern or regular expression matching along with a token frequency determination algorithm for every Wikipedia page, to heuristically promote certain structural entities as context headers and hierarchically wrap surrounding elements with such identified segment titles. The extracted results can be maintained in a domain-neutral semantic schema customized for Wikipedia, which could improve its search capabilities through the use of an efficient interface extended over a relational data source realizing this modeled schema. Experimental results and the implemented prototype indicate that this notion is successful in achieving good accuracy in entity associations and a high recall in the extraction of diverse Wikipedia structural types.

Categories and Subject Descriptors
H.2 [Information Systems]: Database Management; H.3.3 [Information Systems]: Information Storage and Retrieval - Information Extraction and Retrieval

General Terms
Algorithms, Experimentation, Standardization

Keywords
Information Extraction, Wikipedia

1. INTRODUCTION
The need to amalgamate the web's structured information and knowledge to enable semantically rich queries is a widely accepted necessity. This is the goal behind most of the modern-day research into information extraction and integration, with standards like the Semantic Web being advertised as the future vision of the current-day web. We investigate one plausible approach to achieve the expression of semantically rich queries over one such web document source, in particular the content-rich, multi-lingual Wikipedia portal (http://www.wikipedia.org).

The Wikipedia is the largest collaborative knowledge sharing web encyclopedia. It is one of the most frequently accessed websites on the internet, undergoes frequent revisions, and is available in around 250 languages, with English alone estimated to possess around 2 million pages and around 800,000 registered users. However, to find information in the vast collection of articles on Wikipedia, users have to rely on a combination of keyword search and browsing. These mechanisms are effective but incapable of supporting complex aggregate queries over the potentially rich set of structures embedded in Wikipedia text. For example, consider the pages about the states Ohio, Illinois and Texas in Wikipedia. Information about the total area, total population and % water can be explicitly inferred from these pages.

Table 1: Portions from info boxes found on Wikipedia pages

              Total Area         Total Population    % Water
  Ohio        44,825 sq miles    11,353,140          8.7
  Illinois    57,918 sq miles    12,831,970          4.0
  Texas       268,820 sq miles   20,851,820          2.5

However, this information is locked inside page text and info boxes, which restricts us from expressing SQL queries to order the states in increasing sequence of their population densities (total population / total area) or similar operations that process the cumulative knowledge contained in these pages. The ability to query this set of structures is highly desirable. Such data can be practically found in different kinds of Wikipedia structures, namely info boxes, wiki tables, lists, images, etc. In this paper, we present a scheme to extract such structured information from a given Wikipedia page, irrespective of its inherent structural make-up.

Wikipedia pages differ from one another in the compositional make-up of the structural types they are built from. Wikipedia allows authors a wide variety of rich representations to choose from, thus creating a healthy diversity in the content representation formats used across pages. This heterogeneity across the structural make-up of different pages (even from the same domain) presents a challenge to textual extraction and content association.

The rest of the paper is organized as follows. A brief description of the motivation of the project is presented in Section 2. In Section 3 we formalize the problem definition. Section 4 surveys related work. Section 5 describes the data/query model and the project's architectural components. The algorithm design and details are described in Section 6. Section 7 touches upon the implementation details, and we present our evaluation results in Section 8. We conclude and provide directions for future work in Section 9.
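To make the kind of aggregate operation above concrete, the following is a minimal Ruby sketch (illustrative only, not part of the system described later) that orders the three states of Table 1 by population density once their info box values are available as structured records. The attribute names are our own; the numbers are taken directly from Table 1.

```ruby
# Illustrative only: the info box values of Table 1, imagined as structured
# records instead of free text spread over three separate Wikipedia pages.
states = [
  { title: 'Ohio',     area_sq_mi: 44_825,  population: 11_353_140, pct_water: 8.7 },
  { title: 'Illinois', area_sq_mi: 57_918,  population: 12_831_970, pct_water: 4.0 },
  { title: 'Texas',    area_sq_mi: 268_820, population: 20_851_820, pct_water: 2.5 }
]

# Order the states by increasing population density (people per square mile),
# an aggregate view that keyword search and browsing alone cannot provide.
states
  .map     { |s| s.merge(density: s[:population].to_f / s[:area_sq_mi]) }
  .sort_by { |s| s[:density] }
  .each    { |s| printf("%-9s %7.1f people per sq mile\n", s[:title], s[:density]) }
```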
2. MOTIVATION
The incentive for this project can be summarized as an attempt to leverage the rich structures inherent in Wikipedia content for successful information extraction and association, and to augment its traditional search options with a mechanism that supports rich relational operations over the Wikipedia information base. Thus our motivation arises from the need to formulate a mechanism to recognize the inherent structures occurring on Wikipedia pages, to design and develop an extraction framework which mines for structural segments in the Wikipedia text, and to construct a comprehensive interface to support effective query formulation over the Wikipedia textual corpus, thereby realizing an integrated model enabling analytical queries on the Wikipedia knowledge base.

3. PROBLEM DEFINITION
The problem being addressed in this project can be adequately described as a research initiative into extraction strategies and the design and implementation of an efficient and capable extraction framework to identify and retrieve structured information from Wikipedia, accurately establish associations among the retrieved items, and preserve them in a relational data source. This requirement involves dealing with various Wikipedia content representation types such as text segments, info boxes, wiki tables, images, paragraphs, links, etc., and the creation of a database schema capable of encompassing such diverse extracted tuples. The provision for rich SQL queries is a simple addition to this system, and hence will not be fully explored; rather, support for querying is realized by presenting a simplistic querying interface. The problem definition can thus be summed up as:

1. Research into extraction strategies, and the design and implementation of a retrieval framework to identify and extract structured information from Wikipedia.
2. Extraction of various Wikipedia content types including text segments, info boxes, free text, references and images.
3. Design of a database schema to accommodate diverse data fields extracted from Wikipedia.
4. Provision of basic querying primitives over the extracted information.

4. RELATED WORK
We performed a wide literature survey to learn about similar extraction initiatives. The idea of bringing semantics into Wikipedia is not new, and several studies on this topic have been carried out in the last few years. Semantic extraction and relationships were discussed in [6], where the authors analyze relevant measures for inferring the semantic relationships between page categories of Wikipedia. DBpedia [5] is a community-based effort that uses manual and automatic data extraction to construct an ontology from Wikipedia templates. In paper [7], the authors aim to construct knowledge bases that are focused on the task of organizing and facilitating retrieval within individual Wikipedia document collections. Our approach resembles the idea presented in [8], where a relational system is used as the basis for a workbench for extracting and querying structure from unstructured data in Wikipedia. Their paper focuses on incrementally evolving the understanding of the data in the context of the relational workbench, while our approach relies on an element-term frequency measure to determine each element's relative weight in a given page and models associations as gravitating towards the less frequently used tokens. Paper [2] presents an approach to mining information relating people, places, organizations and events from Wikipedia and linking them on a time scale, while the authors of paper [1] explore the possibility of automatically identifying "common sense" statements in the unrestricted natural language text found in Wikipedia and mapping them to RDF. Their system works on the hypothesis that common sense knowledge is often expressed in a subject-predicate form, and their work focuses on the challenge of automatically identifying such generic statements.

5. EXTRACTION FRAMEWORK
This section provides details of the data model used in the system and the architectural composition of the extraction framework devised. The conceptual view paradigm widely adopted in relational databases as an abstract representation or model of the real world does not apply to our case: the idea of identifying entities, tabulating them, or establishing relationships across them in a Wikipedia source is largely inapplicable and certainly not scalable.

5.1 Data/ Query Model
Our data model to accommodate Wikipedia data has been designed to be domain independent. The database schema has been modeled to resemble RDF tuples, and is primarily designed to scale with increasing content size as well as to deal with the wide heterogeneity and diversity of the extracted data.

Table 2: pics table

  Id   Title     Image   Tag   pic_url
  11   Alabama   Flag    Al    http://wikimedia.org/AlFlg.jpg

Table 3: infoboxes table

  Id   Title     Property   PValue
  11   Alabama   Governor   Robert R. Riley

Table 4: twikis table

  Id   Title     Content         Tag
  11   Alabama   It is seen ..   Law_and_government

The data schema reflects the structural types encountered on Wikipedia as tables in the relational source.
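For concreteness, the sketch below shows one possible MySQL rendering of Tables 2, 3 and 4. The column names come from the tables above; the data types, lengths and the housekeeping timestamp column are assumptions of ours, not the project's actual DDL.

```ruby
# A possible MySQL rendering of Tables 2-4 (types and the housekeeping column
# are assumptions; only the column names come from the paper). Each structural
# type maps to its own key-value table, rejoined through the shared title.
SCHEMA_SQL = <<~SQL
  CREATE TABLE pics (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,   -- Wikipedia page title
    image      VARCHAR(255),            -- image name
    tag        VARCHAR(255),            -- context tag
    pic_url    VARCHAR(1024),           -- location address only, not binary content
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );

  CREATE TABLE infoboxes (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    property   VARCHAR(255),            -- e.g. 'Governor'
    pvalue     TEXT,                    -- e.g. 'Robert R. Riley'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );

  CREATE TABLE twikis (
    id         INT AUTO_INCREMENT PRIMARY KEY,
    title      VARCHAR(255) NOT NULL,
    content    TEXT,                    -- extracted text segment
    tag        VARCHAR(255),            -- segment header, e.g. 'Law_and_government'
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP
  );
SQL
```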
This data model allows for the easy incorporation of a new representation type whenever it is encountered in a new Wikipedia page. Records are grouped independent of their parent domain or title document. The original page can be constructed back from the segregated tuples by joining across the 'Title' attribute. The title of a Wikipedia page is a distinguishing attribute and is hence chosen to be a key in our schema. The 'ID' field is an auto-incrementing column which behaves as the primary key within the database tables; however, all operations are based on the logically coherent 'Title' field. Since all tuples in a given table correspond to the same or similar Wikipedia content types, the extracted tuples are uniform, hence allowing for faster indexing options. However, the data model suffers from the traditional weakness of an RDF-oriented schema: too many joins are required to reconstruct the data. In this paper, we emphasize the extraction of the tuples and believe that the extracted data in these tuples can easily be migrated to a more rigidly normalized data store, and hence choose to ignore the limitations of RDF. We have consciously chosen to record only the location addresses of images rather than the binary content, to preserve precious database space as well as to account for updates of images on the actual Wikipedia website. The database tables contain additional attributes, including auto-number identifiers and timestamps, introduced primarily for housekeeping purposes. This data model clearly favors ease of insertion and quick retrieval; however, it does not support quick updates of the linked tuples. We believe that updates are relatively few in our system, and the response times for updates can be improved by treating an update as a delete operation followed by a fresh insertion, which also inadvertently helps in flushing stale tuples. The data model actively adapts to evolving Wikipedia structural types; however, the addition of a new type is a one-time manual affair. The anatomy of a few of the tables is presented in the tables above and, as mentioned earlier, they resemble a key-value form.

Figure 1: System Architecture

5.2 Architecture
The project adopts a simple architecture, presented in Figure 1. The 'Wiki Parser' is an extraction engine built by extending a general HTML parser. The HTML parser is a simple interface in our system that includes a small crawler segment to selectively pick out Wikipedia pages of interest, extract the HTML source from these pages, and convert it into an internal object representation. The template set is a collection of Wikipedia HTML/CSS templates or classes and regular expressions built over them to easily identify frequently occurring content headers in Wikipedia pages. In addition to these templates, a frequency determination module augments the token identification process. The 'Wiki Parser' uses the token frequency identification component in association with the template matching to isolate the main tokens in a given Wikipedia page. The 'Wiki Parser' is capable of handling diverse element types including images, lists, tables, sections, free text and text headers. It then iteratively associates surrounding context to these identified tokens to determine key-value pairs to be included in a system hash table. The hash tables are mapped onto the system relational database using the structural-type-to-table mapping explained in the data model. The database mapping also generates XML records of the extracted knowledge for direct consumption by specific applications.

The Wikipedia template set is a pattern list which can be incrementally improved to account for different Wikipedia pages or even web pages on the general web. The templates also act as a noise filter to selectively weed out incomplete tags or structures existing in the Wikipedia page. The user interface is an AJAX-enabled web front end which allows users to express queries upon Wikipedia and displays the query outcome visually. The user interface works with the system database to serve user queries using extracted information. However, the user interface is also equipped to display extracted tuples returned by the 'Wiki Parser' during the online mode of functioning.
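To illustrate how the template set and the HTML parser front end might fit together, here is a small sketch. It assumes the Nokogiri HTML parser and open-uri for fetching; the paper only speaks of "a general HTML parser", so the library choice and the helper below are ours. The CSS class and tag names are examples taken from Section 6.

```ruby
require 'open-uri'
require 'nokogiri'  # assumed parsing library; the paper only mentions "a general HTML parser"

# A toy template set: CSS classes and tags that frequently mark content headers
# on Wikipedia pages. Per Section 5.2 this list is incrementally extensible, and
# regular expressions over the raw HTML could be added alongside it.
HEADER_TEMPLATES = ['.mw-headline', 'h3', 'h4'].freeze

# Minimal "Wiki Parser" front end: fetch a page of interest, convert it into an
# internal object representation, and pick out candidate context headers.
def candidate_headers(url)
  doc = Nokogiri::HTML(URI.open(url).read)
  HEADER_TEMPLATES.flat_map do |selector|
    doc.css(selector).map { |node| node.text.strip }
  end.reject(&:empty?).uniq
end

# Example call (hypothetical page of interest):
# candidate_headers('https://en.wikipedia.org/wiki/Ohio')
```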
6. ALGORITHM DESIGN
The extraction algorithm is a two-pass algorithm over an input web document: it identifies structural tokens and their hierarchies in the first phase and performs the appropriate token-to-content matching and association in the subsequent phase. We present the extraction challenge in sub-section 6.1, the algorithm details in 6.2, and the analysis in sub-section 6.3.

6.1 Extraction Challenge
The main challenge to extraction is to identify the section title or key elements in a given HTML page. It can be seen that Wikipedia offers a wide choice of elements to be promoted as section titles. For example, the segment headers on a Wikipedia page about the state 'New York' could be demarcated with the CSS class 'mw-headline' and the surrounding text could appear in the CSS class 'reference'. This observation, however, does not hold consistently across all Wikipedia pages. There could be pages where the section titles are tagged using the CSS class 'reference' and their associated text uses the CSS class 'toctext'. Hence it is not trivial to identify which elements are keys within a web page and what their corresponding values are. This is further complicated by the free usage of JavaScript between structural types. Also, the segment text headers may not always be chosen using a Wikipedia-provided type which maps to an equivalent CSS class. Since Wikipedia provides support for authors to insert their own HTML formatting, some authors choose to tag their headers using standard HTML tags such as <h3>, others may choose a different tag such as <h4>, while the rest may opt for the Wikipedia-provided format classes. This adds a layer of ambiguity to the problem of accurately selecting the key element fields. To overcome this contextual uncertainty, we use a statistical measure to predict or promote certain fields as headers. The details of this approach are presented in the following sub-section.

6.2 Algorithm Details
This section provides the internal workings of the statistical measures used for segment header identification. This algorithmic procedure is implemented in the Token Frequency Identification module explained as part of the system architecture. The essence of textual extraction is to identify what text belongs under which header. Unlike XML, HTML exhibits an ambiguity in associating context body with context headers that closely resembles the ambiguity found in traditional natural language processing. Our approach augments the use of simple pattern-determining regular expressions with an element frequency score which is computed for every structural element. For example, if the CSS class 'reference' is used three times in an HTML page and the CSS class 'text' appears seven times, then we can conclude with high probability that, since the class 'reference' is used sparingly, it corresponds to a higher-weight element. The algorithmic procedure performs two passes over a given document element list. It computes a frequency of occurrence for every unique structural element and promotes the less frequent ones as possible segment definers. The second pass uses a combination of the Wikipedia templates and the determined frequencies in a proximity calculation that identifies, for each of the more frequent elements, the most proximal probable context header found in the first pass. A crude verification is performed using predefined regular expressions to validate the grouping done by this two-pass token association algorithm. The algorithm displays very good accuracy when working with text segments and images, but encounters occasional erroneous predictions when working with wiki tables. The algorithm seeks to obtain a statistical score as the determining factor on which to base its associations, in order to overcome the ambiguity that the context presents. This algorithmic approach can be augmented by a context-based certifier that determines a similarity score between the context heading and its associated value, and thus verifies the statistically computed association.

6.3 Algorithm Analysis
This two-pass algorithm displays a reasonable execution trace for a moderate-size input document set. We provide some formal notation below to aid the analysis of our algorithm and compute its asymptotic time complexity. Let D = {d_1, d_2, d_3, ..., d_n} denote the input document set, and let K_i = {k_1, k_2, ..., k_j} denote the token set identified in document d_i, with |d_i| the number of structural elements in d_i. Let P(e) denote the token identification time and P(a) represent the association time per token. The time complexity is given as

    \sum_{i=1}^{n} \left[ |d_i| \cdot P(e) + \sum_{j=1}^{|K_i|} P(a) \right]

The term P(e) is the key determining factor in this expression, and we use the heuristic-based regular expression matching to reduce the unit time per token identification, as an attempt to speed up the algorithm execution.
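A compressed Ruby sketch of the two-pass procedure follows. It is a simplification built on our own assumptions: a flat element list, the CSS class name as the structural signature, and a mean-frequency threshold for promotion; the template matching and the regular expression verification steps described above are omitted.

```ruby
# Each parsed element carries its CSS class (structural signature), its text,
# and its position in the document's element list.
Element = Struct.new(:css_class, :text, :position)

# Pass 1: count how often each structural signature occurs and promote the
# less frequent ones as probable segment definers (context headers).
def promote_headers(elements)
  freq = Hash.new(0)
  elements.each { |e| freq[e.css_class] += 1 }
  mean = freq.values.sum.to_f / freq.size   # threshold choice is our assumption
  elements.select { |e| freq[e.css_class] < mean }
end

# Pass 2: associate every remaining (more frequent) element with the most
# proximal promoted header preceding it, yielding header => values pairs.
def associate(elements)
  headers = promote_headers(elements)
  pairs = Hash.new { |h, k| h[k] = [] }
  elements.each do |e|
    next if headers.include?(e)
    header = headers.select { |h| h.position < e.position }.max_by(&:position)
    pairs[header ? header.text : 'UNASSOCIATED'] << e.text
  end
  pairs
end

# Example: one sparsely used 'reference' header followed by two 'text' blocks.
page = [Element.new('reference', 'Law and government', 0),
        Element.new('text', 'It is seen ..', 1),
        Element.new('text', 'The governor ..', 2)]
associate(page)  # => {"Law and government"=>["It is seen ..", "The governor .."]}
```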
7. IMPLEMENTATION
The project has been implemented as a web application by making the query interface web-enabled. The application has been built using the Ruby language on the Rails web hosting framework. Ruby is an interpreted scripting language specially designed for quick and easy object-oriented programming. The 'Rails' platform is an open-source web application framework developed to make web programming efficient and simple. Development in Ruby on Rails explicitly follows the Model View Controller (MVC) pattern, an industry-proven implementation architecture; the MVC architecture can be broadly defined as a design pattern that describes a recurring problem and its solution, where the solution is never exactly the same for every recurrence. The implementation relies on the use of regular expressions to perform template matching and content identification. The extracted information or tuples are maintained in a MySQL database which serves as a relational data source.

Figure 2: Implementation Snapshot 1
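The mapping from the in-memory key-value pairs onto the relational tables could then look roughly as follows. The mysql2 client, the connection parameters and the insert logic are our assumptions, shown only to make the structural-type-to-table mapping concrete; only text segments (the twikis table) are covered, with info boxes and images handled analogously.

```ruby
require 'mysql2'  # assumed client library; the paper only states that MySQL is used

# Persist text-segment key-value pairs into the 'twikis' table of Section 5.1.
# Info box properties and image records would go into 'infoboxes' and 'pics'.
def persist_text_segments(client, title, pairs)
  pairs.each do |tag, contents|
    contents.each do |content|
      client.query(
        "INSERT INTO twikis (title, tag, content) VALUES " \
        "('#{client.escape(title)}', '#{client.escape(tag)}', '#{client.escape(content)}')"
      )
    end
  end
end

# client = Mysql2::Client.new(host: 'localhost', username: 'wiki', database: 'wikidb')
# persist_text_segments(client, 'Alabama', 'Law_and_government' => ['It is seen ..'])
```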
The implementation is intentionally made object-oriented to provide ease of extensibility and scaling. The querying interface follows a query builder paradigm, has AJAX support, and has been built using the Google Web Toolkit (GWT). GWT is an open-source Java software development framework allowing web developers to create AJAX applications in Java. The querying interface provides support for an online querying mode, wherein the user's query is served by real-time extraction of the Wikipedia page. For analytical or complex queries we recommend the offline mode of querying, which works on the pre-extracted data and results in faster response times. The implementation supports preserving extracted Wikipedia information not only in the relational MySQL data source but also as flat XML files.

Figure 3: Sample XML Output
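Since the system also writes the extracted tuples out as flat XML files (Figure 3 shows a sample), here is a minimal serialization sketch using REXML from the Ruby standard distribution. The element and attribute names are illustrative; the transcript does not reproduce the actual layout of Figure 3.

```ruby
require 'rexml/document'  # XML support shipped with the Ruby standard distribution

# Serialize extracted tuples into a flat XML record for one Wikipedia page.
# The element and attribute names below are illustrative, not the system's.
def to_xml(title, tuples)
  doc  = REXML::Document.new
  page = doc.add_element('page', 'title' => title)
  tuples.each do |t|
    segment = page.add_element('segment', 'type' => t[:type].to_s, 'tag' => t[:tag])
    segment.text = t[:content]
  end
  out = +''
  doc.write(out, 2)  # pretty-print with a two-space indent
  out
end

puts to_xml('Alabama', [{ type: :text_segment, tag: 'Law_and_government', content: 'It is seen ..' }])
```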
8. EVALUATION RESULTS
The implemented system was evaluated by performing extraction over 100 articles from Wikipedia. The extraction of these articles resulted in over 4,087 images, 3,627 text boxes, and 2,866 info box property values. The extraction focused on a number of diverse subject domains including geography, politics, famous personalities and tourist attractions. The results indicate that the text and image extraction are domain-neutral and representation independent. The cleanliness of the text extractions was one of the promising aspects of this project. The extraction results indicate a very high degree of recall in the text and image extraction segments. The association results were proven by validating against the real-world associations occurring on the source Wikipedia pages. The frequency-based estimation technique achieves an accuracy of over 90% for associating text, images and info boxes with keywords. It displays an accuracy of around 75% for deeply nested wiki tables.

The system was specifically evaluated to check the reliability of the associations for pages with unseen content definitions. The system proved to yield acceptable results for around 18 of the 21 different representation types tested, and the database schema has been found to be flexible enough for the different content types occurring on Wikipedia pages. The extraction algorithm was also tested with a heavy contiguous load of around 60 web pages, which took 6 minutes on a 1 GHz machine, and thus exhibits good efficiency.

A functional evaluation of the system was also performed to test the integrated working of the system and the inter-connections between its various components, which depicted a steady working state of the constructed prototype.

9. CONCLUSION
The work we have performed is still in the early stages of research, but we believe it offers an innovative way of contributing to the construction of ontologies from web pages. Likewise, we believe the prospect of using this methodology to help generate semantic extensions to Wikipedia is both exciting and useful. Our work includes the design, development and implementation of an extraction engine to retrieve structured information from Wikipedia; a data model to map the extracted knowledge into a relational store; XML representations to publish the acquired information; a scheme to handle data diversity during extraction and data preservation; a statistical inference mechanism to decipher contextual ambiguity; and a capable querying interface to present these features.

We envision future work on this topic in terms of incrementally augmenting the statistical score computation using domain analysis or through active regression-based learning approaches. A parallel stream of research can involve identifying and managing inter-article relationships in Wikipedia by observing document references obtained from our extraction framework. Implementation-specific effort can be channeled into the design and construction of web services publishing the extracted data, or into interactively engaging with web agents and receiving queries from them. Focus can also be directed at enriching the extracted Wikipedia knowledge base with information from external data sets by devising ontologies to support such objectives.

10. ACKNOWLEDGMENTS
We would like to thank our instructor Dr. Chengkai Li for his guidance and support in helping us develop the necessary skills in information extraction and accomplish this project in a timely manner.

11. REFERENCES
[1] Suh, S., Halpin, H., and Klein, E. Extracting Common Sense Knowledge from Wikipedia. ISWC Workshop, Athens, Georgia, November 2006.
[2] Bhole, A., Fortuna, B., Grobelnik, M., and Mladenic, D. Extracting Named Entities and Relating Them over Time Based on Wikipedia. Informatica, 2007, 463-468.
[3] Cafarella, M., Etzioni, O., and Suciu, D. Structured Queries Over Web Text. Bulletin of the IEEE Computer Society Technical Committee on Data Engineering, 2006.
[4] Cafarella, M., Re, C., Suciu, D., Etzioni, O., and Banko, M. Structured Querying of Web Text: A Technical Challenge. 3rd Biennial Conference on Innovative Data Systems Research (CIDR), Asilomar, California, January 2007.
[5] Auer, S., Bizer, C., Kobilarov, G., Lehmann, J., Cyganiak, R., and Ives, Z. DBpedia: A Nucleus for a Web of Open Data. 6th International Semantic Web Conference (ISWC 2007), Busan, Korea, November 2007.
[6] Chernov, S., Iofciu, T., Nejdl, W., and Zhou, X. Extracting Semantic Relationships between Wikipedia Categories. SemWiki2006, 2006.
[7] Milne, D., Witten, I., and Nichols, D. Extracting Corpus-Specific Knowledge Bases from Wikipedia. CIKM'07, Lisbon, Portugal, November 2007.
[8] Chu, E., Baid, A., Chen, T., Doan, A., and Naughton, J. A Relational Approach to Incrementally Extracting and Querying Structure in Unstructured Data. VLDB'07, Vienna, Austria, September 2007.