Automatically Generating Wikipedia Articles: A Structure-Aware Approach

Christina Sauper and Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{csauper,regina}@csail.mit.edu


Abstract

In this paper, we investigate an approach for creating a comprehensive textual overview of a subject composed of information drawn from the Internet. We use the high-level structure of human-authored texts to automatically induce a domain-specific template for the topic structure of a new overview. The algorithmic innovation of our work is a method to learn topic-specific extractors for content selection jointly for the entire template. We augment the standard perceptron algorithm with a global integer linear programming formulation to optimize both local fit of information into each topic and global coherence across the entire overview. The results of our evaluation confirm the benefits of incorporating structural information into the content selection process.

1 Introduction

In this paper, we consider the task of automatically creating a multi-paragraph overview article that provides a comprehensive summary of a subject of interest. Examples of such overviews include actor biographies from IMDB and disease synopses from Wikipedia. Producing these texts by hand is a labor-intensive task, especially when relevant information is scattered throughout a wide range of Internet sources. Our goal is to automate this process. We aim to create an overview of a subject – e.g., 3-M Syndrome – by intelligently combining relevant excerpts from across the Internet.

As a starting point, we can employ methods developed for multi-document summarization. However, our task poses additional technical challenges with respect to content planning. Generating a well-rounded overview article requires proactive strategies to gather relevant material, such as searching the Internet. Moreover, the challenge of maintaining output readability is magnified when creating a longer document that discusses multiple topics.

In our approach, we explore how the high-level structure of human-authored documents can be used to produce well-formed comprehensive overview articles. We select relevant material for an article using a domain-specific automatically generated content template. For example, a template for articles about diseases might contain diagnosis, causes, symptoms, and treatment. Our system induces these templates by analyzing patterns in the structure of human-authored documents in the domain of interest. Then, it produces a new article by selecting content from the Internet for each part of this template. An example of our system's output¹ is shown in Figure 1.

¹ This system output was added to Wikipedia at http://en.wikipedia.org/wiki/3-M_syndrome on June 26, 2008. The page's history provides examples of changes performed by human editors to articles created by our system.

The algorithmic innovation of our work is a method for learning topic-specific extractors for content selection jointly across the entire template. Learning a single topic-specific extractor can be easily achieved in a standard classification framework. However, the choices for different topics in a template are mutually dependent; for example, in a multi-topic article, there is potential for redundancy across topics. Simultaneously learning content selection for all topics enables us to explicitly model these inter-topic connections.

We formulate this task as a structured classification problem. We estimate the parameters of our model using the perceptron algorithm augmented with an integer linear programming (ILP) formulation, run over a training set of example articles in the given domain.
Diagnosis . . . No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the
GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations
have been identified in an affected family member in a research or clinical laboratory.
Causes Three M syndrome is thought to be inherited as an autosomal recessive genetic trait. Human traits, including the classic
genetic diseases, are the product of the interaction of two genes, one received from the father and one from the mother. In recessive
disorders, the condition does not occur unless an individual inherits the same defective gene for the same trait from each parent. . . .
Symptoms . . . Many of the symptoms and physical features associated with the disorder are apparent at birth (congenital). In
some cases, individuals who carry a single copy of the disease gene (heterozygotes) may exhibit mild symptoms associated with
Three M syndrome.
Treatment . . . Genetic counseling will be of benefit for affected individuals and their families. Family members of affected indi-
viduals should also receive regular clinical evaluations to detect any symptoms and physical characteristics that may be potentially
associated with Three M syndrome or heterozygosity for the disorder. Other treatment for Three M syndrome is symptomatic and
supportive.

              Figure 1: A fragment from the automatically created article for 3-M Syndrome.


The key features of this structure-aware approach are twofold:

• Automatic template creation: Templates are automatically induced from human-authored documents. This ensures that the overview article will have the breadth expected in a comprehensive summary, with content drawn from a wide variety of Internet sources.

• Joint parameter estimation for content selection: Parameters are learned jointly for all topics in the template. This procedure optimizes both local relevance of information for each topic and global coherence across the entire article.

We evaluate our approach by creating articles in two domains: Actors and Diseases. For a data set, we use Wikipedia, which contains articles similar to those we wish to produce in terms of length and breadth. An advantage of this data set is that Wikipedia articles explicitly delineate topical sections, facilitating structural analysis. The results of our evaluation confirm the benefits of structure-aware content selection over approaches that do not explicitly model topical structure.

2 Related Work

Concept-to-text generation and text-to-text generation take very different approaches to content selection. In traditional concept-to-text generation, a content planner provides a detailed template for what information should be included in the output and how this information should be organized (Reiter and Dale, 2000). In text-to-text generation, such templates for information organization are not available; sentences are selected based on their salience properties (Mani and Maybury, 1999). While this strategy is robust and portable across domains, output summaries often suffer from coherence and coverage problems.

In between these two approaches is work on domain-specific text-to-text generation. Instances of these tasks are biography generation in summarization and answering definition requests in question-answering. In contrast to a generic summarizer, these applications aim to characterize the types of information that are essential in a given domain. This characterization varies greatly in granularity. For instance, some approaches coarsely discriminate between biographical and non-biographical information (Zhou et al., 2004; Biadsy et al., 2008), while others go beyond binary distinction by identifying atomic events – e.g., occupation and marital status – that are typically included in a biography (Weischedel et al., 2004; Filatova and Prager, 2005; Filatova et al., 2006). Commonly, such templates are specified manually and are hard-coded for a particular domain (Fujii and Ishikawa, 2004; Weischedel et al., 2004).

Our work is related to these approaches; however, content selection in our work is driven by domain-specific automatically induced templates. As our experiments demonstrate, patterns observed in domain-specific training data provide sufficient constraints for topic organization, which is crucial for a comprehensive text.

Our work also relates to a large body of recent work that uses Wikipedia material. Instances of this work include information extraction, ontology induction and resource acquisition (Wu and Weld, 2007; Biadsy et al., 2008; Nastase, 2008; Nastase and Strube, 2008). Our focus is on a different task — generation of new overview articles that follow the structure of Wikipedia articles.
3 Method

The goal of our system is to produce a comprehensive overview article given a title – e.g., Cancer. We assume that relevant information on the subject is available on the Internet but scattered among several pages interspersed with noise.

We are provided with a training corpus consisting of n documents d1 . . . dn in the same domain – e.g., Diseases. Each document di has a title and a set of delineated sections² si1 . . . sim. The number of sections m varies between documents. Each section sij also has a corresponding heading hij – e.g., Treatment.

² In data sets where such mark-up is not available, one can employ topical segmentation algorithms as an additional preprocessing step.

Our overview article creation process consists of three parts. First, a preprocessing step creates a template and searches for a number of candidate excerpts from the Internet. Next, parameters must be trained for the content selection algorithm using our training data set. Finally, a complete article may be created by combining a selection of candidate excerpts.

1. Preprocessing (Section 3.1) Our preprocessing step leverages previous work in topic segmentation and query reformulation to prepare a template and a set of candidate excerpts for content selection. Template generation must occur once per domain, whereas search occurs every time an article is generated, in both learning and application.

   (a) Template Induction To create a content template, we cluster all section headings hi1 . . . him for all documents di. Each cluster is labeled with the most common heading hij within the cluster. The largest k clusters are selected to become topics t1 . . . tk, which form the domain-specific content template.

   (b) Search For each document that we wish to create, we retrieve from the Internet a set of r excerpts ej1 . . . ejr for each topic tj from the template. We define appropriate search queries using the requested document title and topics tj.

2. Learning Content Selection (Section 3.2) For each topic tj, we learn the corresponding topic-specific parameters wj to determine the quality of a given excerpt. Using the perceptron framework augmented with an ILP formulation for global optimization, the system is trained to select the best excerpt for each document di and each topic tj. For training, we assume the best excerpt is the original human-authored text sij.

3. Application (Section 3.2) Given the title of a requested document, we select several excerpts from the candidate vectors returned by the search procedure (1b) to create a comprehensive overview article. We perform the decoding procedure jointly using learned parameters w1 . . . wk and the same ILP formulation for global optimization as in training. The result is a new document with k excerpts, one for each topic.

3.1 Preprocessing

Template Induction A content template specifies the topical structure of documents in one domain. For instance, the template for articles about actors consists of four topics t1 . . . t4: biography, early life, career, and personal life. Using this template to create the biography of a new actor will ensure that its information coverage is consistent with existing human-authored documents.

We aim to derive these templates by discovering common patterns in the organization of documents in a domain of interest. There has been a sizable amount of research on structure induction, ranging from linear segmentation (Hearst, 1994) to content modeling (Barzilay and Lee, 2004). At the core of these methods is the assumption that fragments of text conveying similar information have similar word distribution patterns. Therefore, often a simple segment clustering across domain texts can identify strong patterns in content structure (Barzilay and Elhadad, 2003). Clusters containing fragments from many documents are indicative of topics that are essential for a comprehensive summary. Given the simplicity and robustness of this approach, we utilize it for template induction.

We cluster all section headings hi1 . . . him from all documents di using a repeated bisectioning algorithm (Zhao et al., 2005). As a similarity function, we use cosine similarity weighted with TF*IDF. We eliminate any clusters with low internal similarity (i.e., smaller than 0.5), as we assume these are "miscellaneous" clusters that will not yield unified topics.

We determine the average number of sections k over all documents in our training set, then select the k largest section clusters as topics. We order these topics as t1 . . . tk using a majority ordering algorithm (Cohen et al., 1998). This algorithm finds a total order among clusters that is consistent with a maximal number of pairwise relationships observed in our data set.

Each topic tj is identified by the most frequent heading found within the cluster – e.g., Causes. This set of topics forms the content template for a domain.
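To make the template-induction step concrete, the following is a minimal Python sketch, assuming scikit-learn (version 1.1 or later) is available; its BisectingKMeans stands in for the repeated bisectioning clustering of Zhao et al. (2005), the helper name induce_template is ours, and the majority ordering step is omitted.

    # Sketch of template induction: cluster section headings under
    # TF*IDF-weighted cosine similarity, drop low-cohesion clusters,
    # and keep the k largest clusters as topics, each labeled by its
    # most frequent heading.
    from collections import Counter
    import numpy as np
    from sklearn.feature_extraction.text import TfidfVectorizer
    from sklearn.cluster import BisectingKMeans
    from sklearn.metrics.pairwise import cosine_similarity

    def induce_template(headings, k, n_clusters=50, min_internal_sim=0.5):
        """headings: section-heading strings from all training documents."""
        vectors = TfidfVectorizer().fit_transform(headings)
        labels = BisectingKMeans(n_clusters=n_clusters).fit_predict(vectors)

        clusters = []
        for c in set(labels):
            idx = np.where(labels == c)[0]
            if len(idx) > 1:
                sims = cosine_similarity(vectors[idx])
                # mean pairwise similarity serves as internal similarity
                if np.mean(sims[np.triu_indices(len(idx), 1)]) < min_internal_sim:
                    continue  # "miscellaneous" cluster; discard
            clusters.append(idx)

        # keep the k largest surviving clusters; label each by its most
        # common heading (e.g., "Causes")
        clusters.sort(key=len, reverse=True)
        return [Counter(headings[i] for i in idx).most_common(1)[0][0]
                for idx in clusters[:k]]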
Search To retrieve relevant excerpts, we must define appropriate search queries for each topic t1 . . . tk. Query reformulation is an active area of research (Agichtein et al., 2001). We have experimented with several of these methods for drawing search queries from representative words in the body text of each topic; however, we find that the best performance is provided by deriving queries from a conjunction of the document title and topic – e.g., "3-M syndrome" diagnosis.

Using these queries, we search using Yahoo! and retrieve the first ten result pages for each topic. From each of these pages, we extract all possible excerpts consisting of chunks of text between standardized boundary indicators (such as <p> tags). In our experiments, an average of six excerpts is extracted from each page. For each topic tj of each document we wish to create, the total number of excerpts r found on the Internet may differ. We label the excerpts ej1 . . . ejr.
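A minimal sketch of this search step follows, assuming the requests and BeautifulSoup libraries are available; search_engine is a hypothetical stand-in for the Yahoo! result retrieval used in our experiments, and splitting pages at <p> tags approximates the boundary-indicator extraction described above.

    # Sketch of the search step: build one query per topic from the
    # document title, fetch the top result pages, and split each page
    # into candidate excerpts at paragraph boundaries.
    import requests
    from bs4 import BeautifulSoup

    def gather_excerpts(title, topics, search_engine, pages_per_topic=10):
        candidates = {}
        for topic in topics:
            query = f'"{title}" {topic.lower()}'  # e.g., "3-M syndrome" diagnosis
            excerpts = []
            # search_engine is assumed to return a list of result URLs
            for url in search_engine(query, n=pages_per_topic):
                html = requests.get(url, timeout=10).text
                soup = BeautifulSoup(html, "html.parser")
                # <p> tags serve as the standardized excerpt boundaries
                excerpts += [p.get_text(" ", strip=True)
                             for p in soup.find_all("p")]
            candidates[topic] = [e for e in excerpts if e]
        return candidates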
3.2 Selection Model

Our selection model takes the content template t1 . . . tk and the candidate excerpts ej1 . . . ejr for each topic tj produced in the previous steps. It then selects a series of k excerpts, one from each topic, to create a coherent summary.

One possible approach is to perform individual selections from each set of excerpts ej1 . . . ejr and then combine the results. This strategy is commonly used in multi-document summarization (Barzilay et al., 1999; Goldstein et al., 2000; Radev et al., 2000), where the combination step eliminates the redundancy across selected excerpts. However, separating the two steps may not be optimal for this task — the balance between coverage and redundancy is harder to achieve when a multi-paragraph summary is generated. In addition, a more discriminative selection strategy is needed when candidate excerpts are drawn directly from the web, as they may be contaminated with noise.

We propose a novel joint training algorithm that learns selection criteria for all the topics simultaneously. This approach enables us to maximize both local fit and global coherence. We implement this algorithm using the perceptron framework, as it can be easily modified for structured prediction while preserving convergence guarantees (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007).

In this section, we first describe the structure and decoding procedure of our model. We then present an algorithm to jointly learn the parameters of all topic models.

3.2.1 Model Structure

The model inputs are as follows:

• The title of the desired document
• t1 . . . tk — topics from the content template
• ej1 . . . ejr — candidate excerpts for each topic tj

In addition, we define feature and parameter vectors:

• φ(ejl) — feature vector for the lth candidate excerpt for topic tj
• w1 . . . wk — parameter vectors, one for each of the topics t1 . . . tk

Our model constructs a new article by following these two steps:

Ranking First, we attempt to rank candidate excerpts based on how representative they are of each individual topic. For each topic tj, we induce a ranking of the excerpts ej1 . . . ejr by mapping each excerpt ejl to a score:

    scorej(ejl) = φ(ejl) · wj

Candidates for each topic are ranked from highest to lowest score. After this procedure, the position l of excerpt ejl within the topic-specific candidate vector is the excerpt's rank.

Optimizing the Global Objective To avoid redundancy between topics, we formulate an optimization problem using excerpt rankings to create the final article. Given k topics, we would like to select one excerpt ejl for each topic tj such that the rank is minimized; that is, scorej(ejl) is high.

To select the optimal excerpts, we employ integer linear programming (ILP). This framework is commonly used in generation and summarization applications where the selection process is driven by multiple constraints (Marciniak and Strube, 2005; Clarke and Lapata, 2007).

We represent excerpts included in the output using a set of indicator variables, xjl. For each excerpt ejl, the corresponding indicator variable xjl = 1 if the excerpt is included in the final document, and xjl = 0 otherwise.

Our objective is to minimize the ranks of the excerpts selected for the final document:

    min Σj=1..k Σl=1..r  l · xjl

We augment this formulation with two types of constraints.

Exclusivity Constraints We want to ensure that exactly one indicator xjl is nonzero for each topic tj. These constraints are formulated as follows:

    Σl=1..r xjl = 1    ∀j ∈ {1 . . . k}

Redundancy Constraints We also want to prevent redundancy across topics. We define sim(ejl, ej′l′) as the cosine similarity between excerpts ejl from topic tj and ej′l′ from topic tj′. We introduce constraints that ensure no pair of excerpts has similarity above 0.5:

    (xjl + xj′l′) · sim(ejl, ej′l′) ≤ 1    ∀j, j′ = 1 . . . k,  ∀l, l′ = 1 . . . r

If excerpts ejl and ej′l′ have cosine similarity sim(ejl, ej′l′) > 0.5, only one excerpt may be selected for the final document – i.e., either xjl or xj′l′ may be 1, but not both. Conversely, if sim(ejl, ej′l′) ≤ 0.5, both excerpts may be selected.

Solving the ILP Solving an integer linear program is NP-hard (Cormen et al., 1992); however, in practice there exist several strategies for solving certain ILPs efficiently. In our study, we employed lp_solve,³ an efficient mixed integer programming solver which implements the branch-and-bound algorithm. On a larger scale, there are several alternatives for approximating the ILP results, such as a dynamic programming approximation to the knapsack problem (McDonald, 2007).

³ http://lpsolve.sourceforge.net/5.5/
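This decoding ILP can be written down directly. The following sketch uses the PuLP modeling library with its bundled CBC solver in place of lp_solve, and the function name decode is ours; ranks are zero-based in the code, so the objective uses l + 1. Note that for sim(a, b) ≤ 0.5 the redundancy constraint is never binding, so the sketch only adds constraints for pairs above the threshold.

    # Sketch of the decoding ILP in PuLP. ranked[j] holds topic j's
    # candidate excerpts, best first; sim(a, b) is cosine similarity.
    import pulp

    def decode(ranked, sim, threshold=0.5):
        k = len(ranked)
        prob = pulp.LpProblem("excerpt_selection", pulp.LpMinimize)
        x = [[pulp.LpVariable(f"x_{j}_{l}", cat="Binary")
              for l in range(len(ranked[j]))] for j in range(k)]

        # objective: minimize the summed ranks of the selected excerpts
        prob += pulp.lpSum((l + 1) * x[j][l]
                           for j in range(k) for l in range(len(x[j])))

        # exclusivity: exactly one excerpt per topic
        for j in range(k):
            prob += pulp.lpSum(x[j]) == 1

        # redundancy: two excerpts from different topics that are too
        # similar may not both be chosen
        for j in range(k):
            for jp in range(j + 1, k):
                for l, a in enumerate(ranked[j]):
                    for m, b in enumerate(ranked[jp]):
                        if sim(a, b) > threshold:
                            prob += x[j][l] + x[jp][m] <= 1

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        return [ranked[j][l] for j in range(k)
                for l in range(len(x[j])) if x[j][l].value() > 0.5]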
    Feature              Value
    UNI wordi            count of word occurrences
    POS wordi            first position of word in excerpt
    BI wordi wordi+1     count of bigram occurrences
    SENT                 count of all sentences
    EXCL                 count of exclamations
    QUES                 count of questions
    WORD                 count of all words
    NAME                 count of title mentions
    DATE                 count of dates
    PROP                 count of proper nouns
    PRON                 count of pronouns
    NUM                  count of numbers
    FIRST word1          1 (defined as the first unigram in the excerpt)
    FIRST word1 word2    1 (defined as the first bigram in the excerpt)
    SIMS                 count of similar excerpts (cosine similarity > 0.5)

Table 1: Features employed in the ranking model.

Features As shown in Table 1, most of the features we select in our model have been employed in previous work on summarization (Mani and Maybury, 1999). All features except the SIMS feature are defined for individual excerpts in isolation. For each excerpt ejl, the value of the SIMS feature is the count of excerpts ejl′ in the same topic tj for which sim(ejl, ejl′) > 0.5. This feature quantifies the degree of repetition within a topic, often indicative of an excerpt's accuracy and relevance.
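As an illustration of the feature map φ, the sketch below computes a representative subset of Table 1 from surface text alone, assuming naive whitespace tokenization; DATE, PROP, and PRON would require a part-of-speech tagger, and SIMS is computed over the whole candidate set rather than per excerpt, so those are omitted here.

    # Sketch of phi for a single excerpt: a sparse feature dictionary
    # covering the surface-count features of Table 1.
    import re
    from collections import Counter

    def phi(excerpt, title):
        tokens = excerpt.lower().split()
        feats = Counter()
        for pos, tok in enumerate(tokens):
            feats[f"UNI_{tok}"] += 1                  # word occurrence counts
            if f"POS_{tok}" not in feats:
                feats[f"POS_{tok}"] = pos + 1         # first position of word
        for a, b in zip(tokens, tokens[1:]):
            feats[f"BI_{a}_{b}"] += 1                 # bigram counts
        feats["SENT"] = len(re.findall(r"[.!?]+", excerpt))
        feats["EXCL"] = excerpt.count("!")
        feats["QUES"] = excerpt.count("?")
        feats["WORD"] = len(tokens)
        feats["NAME"] = excerpt.lower().count(title.lower())  # title mentions
        feats["NUM"] = sum(tok.isdigit() for tok in tokens)
        if tokens:
            feats[f"FIRST_{tokens[0]}"] = 1           # first-unigram indicator
        if len(tokens) > 1:
            feats[f"FIRST_{tokens[0]}_{tokens[1]}"] = 1  # first-bigram indicator
        return feats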
3.2.2 Model Training

Generating Training Data For training, we are given n original documents d1 . . . dn, a content template consisting of topics t1 . . . tk, and a set of candidate excerpts eij1 . . . eijr for each document di and topic tj. For each section of each document, we add the gold excerpt sij to the corresponding vector of candidate excerpts eij1 . . . eijr. This excerpt represents the target for our training algorithm. Note that the algorithm does not require annotated ranking data; only knowledge of this "optimal" excerpt is required. However, if the excerpts provided in the training data have low quality, noise is introduced into the system.

Training Procedure Our algorithm is a modification of the perceptron ranking algorithm (Collins, 2002), which allows for joint learning across several ranking problems (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007). Pseudocode for this algorithm is provided in Figure 2.

    Input:
        d1 . . . dn: A set of n documents, each containing k sections si1 . . . sik
        eij1 . . . eijr: Sets of candidate excerpts for each topic tj and document di
    Define:
        Rank(eij1 . . . eijr, wj):
            As described in Section 3.2.1:
            Calculates scorej(eijl) for all excerpts for document di and
            topic tj, using parameters wj. Orders the list of excerpts by
            scorej(eijl) from highest to lowest.
        Optimize(ei11 . . . eikr):
            As described in Section 3.2.1:
            Finds the optimal selection of excerpts to form a final article,
            given ranked lists of excerpts for each topic t1 . . . tk.
            Returns a list of k excerpts, one for each topic.
        φ(eijl):
            Returns the feature vector representing excerpt eijl
    Initialization:
    1   For j = 1 . . . k
    2       Set parameters wj = 0
    Training:
    3   Repeat until convergence or while iter < itermax:
    4       For i = 1 . . . n
    5           For j = 1 . . . k
    6               Rank(eij1 . . . eijr, wj)
    7           x1 . . . xk = Optimize(ei11 . . . eikr)
    8           For j = 1 . . . k
    9               If sim(xj, sij) < 0.8
    10                  wj = wj + φ(sij) − φ(xj)
    11      iter = iter + 1
    12  Return parameters w1 . . . wk

Figure 2: An algorithm for learning several ranking problems with a joint decoding mechanism.

First, we define Rank(eij1 . . . eijr, wj), which ranks all excerpts from the candidate excerpt vector eij1 . . . eijr for document di and topic tj. Excerpts are ordered by scorej(eijl) using the current parameter values. We also define Optimize(eij1 . . . eijr), which finds the optimal selection of excerpts (one per topic) given ranked lists of excerpts eij1 . . . eijr for each document di and topic tj. These functions follow the ranking and optimization procedures described in Section 3.2.1. The algorithm maintains k parameter vectors w1 . . . wk, one associated with each topic tj desired in the final article. During initialization, all parameter vectors are set to zeros (line 2).

To learn the optimal parameters, this algorithm iterates over the training set until the parameters converge or a maximum number of iterations is reached (line 3). For each document in the training set (line 4), the following steps occur: First, candidate excerpts for each topic are ranked (lines 5-6). Next, decoding through ILP optimization is performed over all ranked lists of candidate excerpts, selecting one excerpt for each topic (line 7). Finally, the parameters are updated in a joint fashion. For each topic (line 8), if the selected excerpt is not similar enough to the gold excerpt (line 9), the parameters for that topic are updated using a standard perceptron update rule (line 10). When convergence is reached or the maximum iteration count is exceeded, the learned parameter values are returned (line 12).

The use of ILP during each step of training sets this algorithm apart from previous work. In prior research, ILP was used as a postprocessing step to remove redundancy and make other global decisions about parameters (McDonald, 2007; Marciniak and Strube, 2005; Clarke and Lapata, 2007). However, in our training, we intertwine the complete decoding procedure with the parameter updates. Our joint learning approach finds per-topic parameter values that are maximally suited for the global decoding procedure for content selection.
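The algorithm of Figure 2 translates almost line for line into Python. In the sketch below, feats maps an excerpt to a feature dictionary (for instance, the phi sketch above with the title bound in), decode is the ILP sketch, sim is a cosine-similarity function over excerpt texts, and gold[i][j] is the human-authored excerpt sij; all of these bindings are our assumptions rather than part of the original pseudocode.

    # Joint perceptron training with ILP decoding inside the loop
    # (Figure 2). Weight vectors are sparse dictionaries; Counter's
    # update/subtract keep negative weights, unlike its +/- operators.
    from collections import Counter

    def dot(w, f):
        return sum(w.get(name, 0.0) * val for name, val in f.items())

    def train(candidates, gold, k, feats, decode, sim, max_iter=50):
        w = [Counter() for _ in range(k)]            # lines 1-2: w_j = 0
        for _ in range(max_iter):                    # line 3
            changed = False
            for i in range(len(candidates)):         # line 4: each document d_i
                # lines 5-6: rank each topic's candidates under current w_j
                ranked = [sorted(candidates[i][j],
                                 key=lambda e, j=j: dot(w[j], feats(e)),
                                 reverse=True)
                          for j in range(k)]
                x = decode(ranked, sim)              # line 7: joint ILP decoding
                for j in range(k):                   # lines 8-10: update
                    if sim(x[j], gold[i][j]) < 0.8:
                        w[j].update(feats(gold[i][j]))   # w_j += phi(s_ij)
                        w[j].subtract(feats(x[j]))       # w_j -= phi(x_j)
                        changed = True
            if not changed:                          # convergence: no updates
                break
        return w                                     # line 12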
4 Experimental Setup

We evaluate our method by observing the quality of automatically created articles in different domains. We compute the similarity of a large number of articles produced by our system and several baselines to the original human-authored articles using ROUGE, a standard metric for summary quality. In addition, we perform an analysis of editor reaction to system-produced articles submitted to Wikipedia.

Data For evaluation, we consider two domains: American Film Actors and Diseases. These domains have been commonly used in prior work on summarization (Weischedel et al., 2004; Zhou et al., 2004; Filatova and Prager, 2005; Demner-Fushman and Lin, 2007; Biadsy et al., 2008). Our text corpus consists of articles drawn from the corresponding categories in Wikipedia. There are 2,150 articles in American Film Actors and 523 articles in Diseases. For each domain, we randomly select 90% of articles for training and test on the remaining 10%. Human-authored articles in both domains contain an average of four topics, and each topic contains an average of 193 words. In order to model the real-world scenario where Wikipedia articles are not always available (as for new or specialized topics), we specifically exclude Wikipedia sources during our search procedure (Section 3.1) for evaluation.

Baselines Our first baseline, Search, relies solely on search engine ranking for content selection. Using the article title as a query – e.g., Bacillary Angiomatosis – this method selects the web page that is ranked first by the search engine. From this page we select the first k paragraphs, where k is defined in the same way as in our full model. If there are fewer than k paragraphs on the page, all paragraphs are selected, but no other sources are used. This yields a document of comparable size to the output of our system. Despite its simplicity, this baseline is not naive: extracting material from a single document guarantees that the output is coherent, and a page highly ranked by a search engine may readily contain a comprehensive overview of the subject.

Our second baseline, No Template, does not use a template to specify desired topics; therefore, there are no constraints on content selection. Instead, we follow a simplified form of previous work on biography creation, where a classifier is trained to distinguish biographical text (Zhou et al., 2004; Biadsy et al., 2008).

In this case, we train a classifier to distinguish domain-specific text. Positive training data is drawn from all topics in the given domain corpus. To find negative training data, we perform the search procedure as in our full model (see Section 3.1) using only the article titles as search queries. Any excerpts which have very low similarity to the original articles are used as negative examples. During the decoding procedure, we use the same search procedure. We then classify each excerpt as relevant or irrelevant and select the k non-redundant excerpts with the highest relevance confidence scores.

Our third baseline, Disjoint, uses the ranking perceptron framework as in our full system; however, rather than perform an optimization step during training and decoding, we simply select the highest-ranked excerpt for each topic. This equates to standard linear classification for each section individually.

In addition to these baselines, we compare against an Oracle system. For each topic present in the human-authored article, the Oracle selects the excerpt from our full model's candidate excerpts with the highest cosine similarity to the human-authored text. This excerpt is the optimal automatic selection from the results available, and therefore represents an upper bound on our excerpt selection task. Some articles contain additional topics beyond those in the template; in these cases, the Oracle system produces a longer article than our algorithm.

Table 2 shows the average number of excerpts selected and sources used in articles created by our full model and each baseline.

                        Avg. Excerpts    Avg. Sources
    Amer. Film Actors
    Search                   2.3              1
    No Template              4                4.0
    Disjoint                 4                2.1
    Full Model               4                3.4
    Oracle                   4.3              4.3
    Diseases
    Search                   3.1              1
    No Template              4                2.5
    Disjoint                 4                3.0
    Full Model               4                3.2
    Oracle                   5.8              3.9

Table 2: Average number of excerpts selected and sources used in article creation for test articles.

Automatic Evaluation To assess the quality of the resulting overview articles, we compare them with the original human-authored articles. We use ROUGE, an evaluation metric employed at the Document Understanding Conferences (DUC), which assumes that proximity to human-authored text is an indicator of summary quality. We use the publicly available ROUGE toolkit (Lin, 2004) to compute recall, precision, and F-score for ROUGE-1. We use the Wilcoxon Signed Rank Test to determine statistical significance.
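For intuition, ROUGE-1 reduces to clipped unigram overlap between system and reference text. The sketch below shows that computation under naive whitespace tokenization; the numbers reported in this paper come from the official toolkit (Lin, 2004), not from this simplification.

    # Illustrative ROUGE-1: clipped unigram overlap between a system
    # article and the human-authored reference.
    from collections import Counter

    def rouge_1(system_text, reference_text):
        sys_counts = Counter(system_text.lower().split())
        ref_counts = Counter(reference_text.lower().split())
        overlap = sum((sys_counts & ref_counts).values())  # clipped matches
        recall = overlap / max(sum(ref_counts.values()), 1)
        precision = overlap / max(sum(sys_counts.values()), 1)
        f_score = (2 * recall * precision / (recall + precision)
                   if recall + precision else 0.0)
        return recall, precision, f_score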
Analysis of Human Edits In addition to our automatic evaluation, we perform a study of reactions to system-produced articles by the general public. To achieve this goal, we insert automatically created articles⁴ into Wikipedia itself and examine the feedback of Wikipedia editors. Selection of specific articles is constrained by the need to find topics which are currently of "stub" status that have enough information available on the Internet to construct a valid article. After a period of time, we analyzed the edits made to the articles to determine the overall editor reaction. We report results on 15 articles in the Diseases category.⁵

⁴ In addition to the summary itself, we also include proper citations to the sources from which the material is extracted.
⁵ We are continually submitting new articles; however, we report results on those that have at least a six-month history at the time of writing.

Since Wikipedia is a live resource, we do not repeat this procedure for our baseline systems. Adding articles from systems which have previously demonstrated poor quality would be improper, especially in Diseases. Therefore, we present this analysis as an additional observation rather than a rigorous technical study.

5 Results

Automatic Evaluation The results of this evaluation are shown in Table 3. Our full model outperforms all of the baselines. By surpassing the Disjoint baseline, we demonstrate the benefits of joint classification. Furthermore, the high performance of both our full model and the Disjoint baseline relative to the other baselines shows the importance of structure-aware content selection. The Oracle system, which represents an upper bound on our system's capabilities, performs well.

                     Recall    Precision    F-score
    Amer. Film Actors
    Search             0.09       0.37       0.13 ∗
    No Template        0.33       0.50       0.39 ∗
    Disjoint           0.45       0.32       0.36 ∗
    Full Model         0.46       0.40       0.41
    Oracle             0.48       0.64       0.54 ∗
    Diseases
    Search             0.31       0.37       0.32 †
    No Template        0.32       0.27       0.28 ∗
    Disjoint           0.33       0.40       0.35 ∗
    Full Model         0.36       0.39       0.37
    Oracle             0.59       0.37       0.44 ∗

Table 3: Results of ROUGE-1 evaluation. ∗ Significant with respect to our full model for p ≤ 0.05. † Significant with respect to our full model for p ≤ 0.10.

The remaining baselines have different flaws: articles produced by the No Template baseline tend to focus on a single topic extensively at the expense of breadth, because there are no constraints to ensure diverse topic selection. On the other hand, performance of the Search baseline varies dramatically. This is expected; this baseline relies heavily on both the search engine and individual web pages. The search engine must correctly rank relevant pages, and the web pages must provide the important material first.

Analysis of Human Edits The results of our observation of editing patterns are shown in Table 4. These articles have resided on Wikipedia for a period of time ranging from 5 to 11 months. All of them have been edited, and no articles were removed due to lack of quality. Moreover, ten automatically created articles have been promoted by human editors from stubs to regular Wikipedia entries based on the quality and coverage of the material. Information was removed in three cases for being irrelevant: one entire section and two smaller pieces. The most common changes were small edits to formatting and the introduction of links to other Wikipedia articles in the body text.

    Type                     Count
    Total articles             15
      Promoted articles        10
    Edit types
      Intra-wiki links         36
      Formatting               25
      Grammar                  20
      Minor topic edits         2
      Major topic changes       1
    Total edits                85

Table 4: Distribution of edits on Wikipedia.

6 Conclusion

In this paper, we investigated an approach for creating a multi-paragraph overview article by selecting relevant material from the web and organizing it into a single coherent text. Our algorithm yields significant gains over a structure-agnostic approach. Moreover, our results demonstrate the benefits of structured classification, which outperforms independently trained topical classifiers. Overall, the results of our evaluation combined with our analysis of human edits confirm that the proposed method can effectively produce comprehensive overview articles.

This work opens several directions for future research. Diseases and American Film Actors exhibit fairly consistent article structures, which are successfully captured by a simple template creation process. However, with categories that exhibit structural variability, more sophisticated statistical approaches may be required to produce accurate templates. Moreover, a promising direction is to consider hierarchical discourse formalisms such as RST (Mann and Thompson, 1988) to supplement our template-based approach.

Acknowledgments

The authors acknowledge the support of the NSF (CAREER grant IIS-0448168, grant IIS-0835445, and grant IIS-0835652) and NIH (grant V54LM008748). Thanks to Mike Collins, Julia Hirschberg, and members of the MIT NLP group for their helpful suggestions and comments. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.
References

Eugene Agichtein, Steve Lawrence, and Luis Gravano. 2001. Learning search engine specific query transformations for question answering. In Proceedings of WWW, pages 169–178.

Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of EMNLP, pages 25–32.

Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL, pages 113–120.

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of ACL, pages 550–557.

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using Wikipedia. In Proceedings of ACL/HLT, pages 807–815.

James Clarke and Mirella Lapata. 2007. Modelling compression with discourse constraints. In Proceedings of EMNLP-CoNLL, pages 1–11.

William W. Cohen, Robert E. Schapire, and Yoram Singer. 1998. Learning to order things. In Proceedings of NIPS, pages 451–457.

Michael Collins. 2002. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings of ACL, pages 489–496.

Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. 1992. Introduction to Algorithms. The MIT Press.

Hal Daumé III and Daniel Marcu. 2005. A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of HLT/EMNLP, pages 97–104.

Dina Demner-Fushman and Jimmy Lin. 2007. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics, 33(1):63–103.

Elena Filatova and John M. Prager. 2005. Tell me what you do and I'll tell you what you are: Learning occupation-related activities for biographies. In Proceedings of HLT/EMNLP, pages 113–120.

Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006. Automatic creation of domain templates. In Proceedings of ACL, pages 207–214.

Atsushi Fujii and Tetsuya Ishikawa. 2004. Summarizing encyclopedic term descriptions on the web. In Proceedings of COLING, page 645.

Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In Proceedings of NAACL-ANLP, pages 40–48.

Marti A. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of ACL, pages 9–16.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of ACL, pages 74–81.

Inderjeet Mani and Mark T. Maybury. 1999. Advances in Automatic Text Summarization. The MIT Press.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Tomasz Marciniak and Michael Strube. 2005. Beyond the pipeline: Discrete optimization in NLP. In Proceedings of CoNLL, pages 136–143.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proceedings of ECIR, pages 557–564.

Vivi Nastase. 2008. Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of EMNLP, pages 763–772.

Vivi Nastase and Michael Strube. 2008. Decoding Wikipedia categories for knowledge acquisition. In Proceedings of AAAI, pages 1219–1224.

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: sentence extraction, utility-based evaluation, and user studies. In Proceedings of ANLP/NAACL, pages 21–29.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, Cambridge.

Benjamin Snyder and Regina Barzilay. 2007. Multiple aspect ranking using the good grief algorithm. In Proceedings of HLT-NAACL, pages 300–307.

Ralph M. Weischedel, Jinxi Xu, and Ana Licuanan. 2004. A hybrid approach to answering biographical questions. In New Directions in Question Answering, pages 59–70.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of CIKM, pages 41–50.

Ying Zhao, George Karypis, and Usama Fayyad. 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141–168.

L. Zhou, M. Ticrea, and Eduard Hovy. 2004. Multi-document biography summarization. In Proceedings of EMNLP, pages 434–441.

Technology R&D Theme 1: Differential Networks
 

Similar to Automatically Generating Wikipedia Articles: A Structure-Aware Approach

The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)theijes
 
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...IJCI JOURNAL
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISrathnaarul
 
Dr31564567
Dr31564567Dr31564567
Dr31564567IJMER
 
IRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET Journal
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach IJCSIS Research Publications
 
Automated News Categorization Using Machine Learning Techniques
Automated News Categorization Using Machine Learning TechniquesAutomated News Categorization Using Machine Learning Techniques
Automated News Categorization Using Machine Learning TechniquesDrjabez
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.docbutest
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.docbutest
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.docbutest
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignAnubhav Jain
 
Exploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining ApplicationsExploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining ApplicationsIRJET Journal
 
Towards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational DatabaseTowards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational Databaseijbuiiir1
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text miningIRJET Journal
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...ijceronline
 
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-MeansIRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-MeansIRJET Journal
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningIOSR Journals
 
A report of the work done in this project is available here
A report of the work done in this project is available hereA report of the work done in this project is available here
A report of the work done in this project is available herebutest
 
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS AIRCC Publishing Corporation
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusAIRCC Publishing Corporation
 

Similar to Automatically Generating Wikipedia Articles: A Structure-Aware Approach (20)

The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)The International Journal of Engineering and Science (IJES)
The International Journal of Engineering and Science (IJES)
 
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
EMPLOYING THE CATEGORIES OF WIKIPEDIA IN THE TASK OF AUTOMATIC DOCUMENTS CLUS...
 
NE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSISNE7012- SOCIAL NETWORK ANALYSIS
NE7012- SOCIAL NETWORK ANALYSIS
 
Dr31564567
Dr31564567Dr31564567
Dr31564567
 
IRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect InformationIRJET - Deep Collaborrative Filtering with Aspect Information
IRJET - Deep Collaborrative Filtering with Aspect Information
 
Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach Improved Text Mining for Bulk Data Using Deep Learning Approach
Improved Text Mining for Bulk Data Using Deep Learning Approach
 
Automated News Categorization Using Machine Learning Techniques
Automated News Categorization Using Machine Learning TechniquesAutomated News Categorization Using Machine Learning Techniques
Automated News Categorization Using Machine Learning Techniques
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
 
View the Microsoft Word document.doc
View the Microsoft Word document.docView the Microsoft Word document.doc
View the Microsoft Word document.doc
 
Applications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials DesignApplications of Natural Language Processing to Materials Design
Applications of Natural Language Processing to Materials Design
 
Exploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining ApplicationsExploiting Wikipedia and Twitter for Text Mining Applications
Exploiting Wikipedia and Twitter for Text Mining Applications
 
Towards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational DatabaseTowards Ontology Development Based on Relational Database
Towards Ontology Development Based on Relational Database
 
Topic detecton by clustering and text mining
Topic detecton by clustering and text miningTopic detecton by clustering and text mining
Topic detecton by clustering and text mining
 
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...IJCER (www.ijceronline.com) International Journal of computational Engineerin...
IJCER (www.ijceronline.com) International Journal of computational Engineerin...
 
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-MeansIRJET- Concept Extraction from Ambiguous Text Document using K-Means
IRJET- Concept Extraction from Ambiguous Text Document using K-Means
 
Classification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern MiningClassification of News and Research Articles Using Text Pattern Mining
Classification of News and Research Articles Using Text Pattern Mining
 
A report of the work done in this project is available here
A report of the work done in this project is available hereA report of the work done in this project is available here
A report of the work done in this project is available here
 
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
A SEMANTIC RETRIEVAL SYSTEM FOR EXTRACTING RELATIONSHIPS FROM BIOLOGICAL CORPUS
 
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological CorpusA Semantic Retrieval System for Extracting Relationships from Biological Corpus
A Semantic Retrieval System for Extracting Relationships from Biological Corpus
 

More from George Ang

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...George Ang
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarizationGeorge Ang
 
Huffman coding
Huffman codingHuffman coding
Huffman codingGeorge Ang
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textGeorge Ang
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿George Ang
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势George Ang
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程George Ang
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qqGeorge Ang
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道George Ang
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化George Ang
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间George Ang
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨George Ang
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站George Ang
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程George Ang
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagementGeorge Ang
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享George Ang
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍George Ang
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享George Ang
 

More from George Ang (20)

Wrapper induction construct wrappers automatically to extract information f...
Wrapper induction   construct wrappers automatically to extract information f...Wrapper induction   construct wrappers automatically to extract information f...
Wrapper induction construct wrappers automatically to extract information f...
 
Opinion mining and summarization
Opinion mining and summarizationOpinion mining and summarization
Opinion mining and summarization
 
Huffman coding
Huffman codingHuffman coding
Huffman coding
 
Do not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar textDo not crawl in the dust 
different ur ls similar text
Do not crawl in the dust 
different ur ls similar text
 
大规模数据处理的那些事儿
大规模数据处理的那些事儿大规模数据处理的那些事儿
大规模数据处理的那些事儿
 
腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势腾讯大讲堂02 休闲游戏发展的文化趋势
腾讯大讲堂02 休闲游戏发展的文化趋势
 
腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程腾讯大讲堂03 qq邮箱成长历程
腾讯大讲堂03 qq邮箱成长历程
 
腾讯大讲堂04 im qq
腾讯大讲堂04 im qq腾讯大讲堂04 im qq
腾讯大讲堂04 im qq
 
腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道腾讯大讲堂05 面向对象应对之道
腾讯大讲堂05 面向对象应对之道
 
腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化腾讯大讲堂06 qq邮箱性能优化
腾讯大讲堂06 qq邮箱性能优化
 
腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间腾讯大讲堂07 qq空间
腾讯大讲堂07 qq空间
 
腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨腾讯大讲堂08 可扩展web架构探讨
腾讯大讲堂08 可扩展web架构探讨
 
腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站腾讯大讲堂09 如何建设高性能网站
腾讯大讲堂09 如何建设高性能网站
 
腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程腾讯大讲堂01 移动qq产品发展历程
腾讯大讲堂01 移动qq产品发展历程
 
腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement腾讯大讲堂10 customer engagement
腾讯大讲堂10 customer engagement
 
腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享腾讯大讲堂11 拍拍ce工作经验分享
腾讯大讲堂11 拍拍ce工作经验分享
 
腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍腾讯大讲堂14 qq直播(qq live) 介绍
腾讯大讲堂14 qq直播(qq live) 介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
腾讯大讲堂15 市场研究及数据分析理念及方法概要介绍
 
腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享腾讯大讲堂16 产品经理工作心得分享
腾讯大讲堂16 产品经理工作心得分享
 

Recently uploaded

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Scott Keck-Warren
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupFlorian Wilhelm
 

Recently uploaded (20)

"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko"Debugging python applications inside k8s environment", Andrii Soldatenko
"Debugging python applications inside k8s environment", Andrii Soldatenko
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024Advanced Test Driven-Development @ php[tek] 2024
Advanced Test Driven-Development @ php[tek] 2024
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Streamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project SetupStreamlining Python Development: A Guide to a Modern Project Setup
Streamlining Python Development: A Guide to a Modern Project Setup
 

Automatically Generating Wikipedia Articles: A Structure-Aware Approach

Our goal is to automate this process. We aim to create an overview of a subject – e.g., 3-M Syndrome – by intelligently combining relevant excerpts from across the Internet.

As a starting point, we can employ methods developed for multi-document summarization. However, our task poses additional technical challenges with respect to content planning: generating a well-rounded overview article requires proactive strategies to gather relevant material. Parameters are estimated with the perceptron algorithm augmented with an integer linear programming (ILP) formulation, run over a training set of example articles in the given domain.

[1] This system output was added to Wikipedia at http://en.wikipedia.org/wiki/3-M_syndrome on June 26, 2008. The page's history provides examples of changes performed by human editors to articles created by our system.

The key features of this structure-aware approach are twofold:
• Automatic template creation: Templates are automatically induced from human-authored documents. This ensures that the overview article will have the breadth expected in a comprehensive summary, with content drawn from a wide variety of Internet sources.

• Joint parameter estimation for content selection: Parameters are learned jointly for all topics in the template. This procedure optimizes both local relevance of information for each topic and global coherence across the entire article.

    Diagnosis . . . No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified in an affected family member in a research or clinical laboratory.

    Causes Three M syndrome is thought to be inherited as an autosomal recessive genetic trait. Human traits, including the classic genetic diseases, are the product of the interaction of two genes, one received from the father and one from the mother. In recessive disorders, the condition does not occur unless an individual inherits the same defective gene for the same trait from each parent. . . .

    Symptoms . . . Many of the symptoms and physical features associated with the disorder are apparent at birth (congenital). In some cases, individuals who carry a single copy of the disease gene (heterozygotes) may exhibit mild symptoms associated with Three M syndrome.

    Treatment . . . Genetic counseling will be of benefit for affected individuals and their families. Family members of affected individuals should also receive regular clinical evaluations to detect any symptoms and physical characteristics that may be potentially associated with Three M syndrome or heterozygosity for the disorder. Other treatment for Three M syndrome is symptomatic and supportive.

Figure 1: A fragment from the automatically created article for 3-M Syndrome.

We evaluate our approach by creating articles in two domains: Actors and Diseases. For a data set, we use Wikipedia, which contains articles similar to those we wish to produce in terms of length and breadth. An advantage of this data set is that Wikipedia articles explicitly delineate topical sections, facilitating structural analysis. The results of our evaluation confirm the benefits of structure-aware content selection over approaches that do not explicitly model topical structure.

2 Related Work

Concept-to-text generation and text-to-text generation take very different approaches to content selection. In traditional concept-to-text generation, a content planner provides a detailed template for what information should be included in the output and how this information should be organized (Reiter and Dale, 2000). In text-to-text generation, such templates for information organization are not available; sentences are selected based on their salience properties (Mani and Maybury, 1999). While this strategy is robust and portable across domains, output summaries often suffer from coherence and coverage problems.

In between these two approaches is work on domain-specific text-to-text generation. Instances of these tasks are biography generation in summarization and answering definition requests in question-answering. In contrast to a generic summarizer, these applications aim to characterize the types of information that are essential in a given domain. This characterization varies greatly in granularity. For instance, some approaches coarsely discriminate between biographical and non-biographical information (Zhou et al., 2004; Biadsy et al., 2008), while others go beyond binary distinction by identifying atomic events – e.g., occupation and marital status – that are typically included in a biography (Weischedel et al., 2004; Filatova and Prager, 2005; Filatova et al., 2006). Commonly, such templates are specified manually and are hard-coded for a particular domain (Fujii and Ishikawa, 2004; Weischedel et al., 2004).

Our work is related to these approaches; however, content selection in our work is driven by domain-specific, automatically induced templates. As our experiments demonstrate, patterns observed in domain-specific training data provide sufficient constraints for topic organization, which is crucial for a comprehensive text.

Our work also relates to a large body of recent work that uses Wikipedia material. Instances of this work include information extraction, ontology induction, and resource acquisition (Wu and Weld, 2007; Biadsy et al., 2008; Nastase, 2008; Nastase and Strube, 2008). Our focus is on a different task – generation of new overview articles that follow the structure of Wikipedia articles.
3 Method

The goal of our system is to produce a comprehensive overview article given a title – e.g., Cancer. We assume that relevant information on the subject is available on the Internet but scattered among several pages interspersed with noise.

We are provided with a training corpus consisting of n documents d1 . . . dn in the same domain – e.g., Diseases. Each document di has a title and a set of delineated sections[2] si1 . . . sim. The number of sections m varies between documents. Each section sij also has a corresponding heading hij – e.g., Treatment.

Our overview article creation process consists of three parts. First, a preprocessing step creates a template and searches for a number of candidate excerpts from the Internet. Next, parameters must be trained for the content selection algorithm using our training data set. Finally, a complete article may be created by combining a selection of candidate excerpts.

1. Preprocessing (Section 3.1) Our preprocessing step leverages previous work in topic segmentation and query reformulation to prepare a template and a set of candidate excerpts for content selection. Template generation must occur once per domain, whereas search occurs every time an article is generated, in both learning and application.

(a) Template Induction To create a content template, we cluster all section headings hi1 . . . him for all documents di. Each cluster is labeled with the most common heading hij within the cluster. The largest k clusters are selected to become topics t1 . . . tk, which form the domain-specific content template.

(b) Search For each document that we wish to create, we retrieve from the Internet a set of r excerpts ej1 . . . ejr for each topic tj from the template. We define appropriate search queries using the requested document title and topics tj.

2. Learning Content Selection (Section 3.2) For each topic tj, we learn the corresponding topic-specific parameters wj to determine the quality of a given excerpt. Using the perceptron framework augmented with an ILP formulation for global optimization, the system is trained to select the best excerpt for each document di and each topic tj. For training, we assume the best excerpt is the original human-authored text sij.

3. Application (Section 3.2) Given the title of a requested document, we select several excerpts from the candidate vectors returned by the search procedure (1b) to create a comprehensive overview article. We perform the decoding procedure jointly using learned parameters w1 . . . wk and the same ILP formulation for global optimization as in training. The result is a new document with k excerpts, one for each topic.

[2] In data sets where such mark-up is not available, one can employ topical segmentation algorithms as an additional preprocessing step.
ejr for ics that are essential for a comprehensive sum- each topic tj from the template. We de- mary. Given the simplicity and robustness of this fine appropriate search queries using the approach, we utilize it for template induction. requested document title and topics tj . We cluster all section headings hi1 . . . him from all documents di using a repeated bisectioning 2. Learning Content Selection (Section 3.2) algorithm (Zhao et al., 2005). As a similarity For each topic tj , we learn the corresponding function, we use cosine similarity weighted with topic-specific parameters wj to determine the TF*IDF. We eliminate any clusters with low in- 2 ternal similarity (i.e., smaller than 0.5), as we as- In data sets where such mark-up is not available, one can employ topical segmentation algorithms as an additional pre- sume these are “miscellaneous” clusters that will processing step. not yield unified topics.
  • 4. We determine the average number of sections is needed when candidate excerpts are drawn di- k over all documents in our training set, then se- rectly from the web, as they may be contaminated lect the k largest section clusters as topics. We or- with noise. der these topics as t1 . . . tk using a majority order- We propose a novel joint training algorithm that ing algorithm (Cohen et al., 1998). This algorithm learns selection criteria for all the topics simulta- finds a total order among clusters that is consistent neously. This approach enables us to maximize with a maximal number of pairwise relationships both local fit and global coherence. We implement observed in our data set. this algorithm using the perceptron framework, as Each topic tj is identified by the most frequent it can be easily modified for structured prediction heading found within the cluster – e.g., Causes. while preserving convergence guarantees (Daum´ e This set of topics forms the content template for a III and Marcu, 2005; Snyder and Barzilay, 2007). domain. In this section, we first describe the structure Search To retrieve relevant excerpts, we must and decoding procedure of our model. We then define appropriate search queries for each topic present an algorithm to jointly learn the parame- t1 . . . tk . Query reformulation is an active area of ters of all topic models. research (Agichtein et al., 2001). We have exper- 3.2.1 Model Structure imented with several of these methods for draw- The model inputs are as follows: ing search queries from representative words in the body text of each topic; however, we find that the • The title of the desired document best performance is provided by deriving queries • t1 . . . tk — topics from the content template from a conjunction of the document title and topic • ej1 . . . ejr — candidate excerpts for each – e.g., “3-M syndrome” diagnosis. topic tj Using these queries, we search using Yahoo! and retrieve the first ten result pages for each topic. In addition, we define feature and parameter From each of these pages, we extract all possible vectors: excerpts consisting of chunks of text between stan- • φ(ejl ) — feature vector for the lth candidate dardized boundary indicators (such as <p> tags). excerpt for topic tj In our experiments, there are an average of 6 ex- • w1 . . . wk — parameter vectors, one for each cerpts taken from each page. For each topic tj of of the topics t1 . . . tk each document we wish to create, the total number of excerpts r found on the Internet may differ. We Our model constructs a new article by following label the excerpts ej1 . . . ejr . these two steps: Ranking First, we attempt to rank candidate 3.2 Selection Model excerpts based on how representative they are of Our selection model takes the content template each individual topic. For each topic tj , we induce t1 . . . tk and the candidate excerpts ej1 . . . ejr for a ranking of the excerpts ej1 . . . ejr by mapping each topic tj produced in the previous steps. It each excerpt ejl to a score: then selects a series of k excerpts, one from each topic, to create a coherent summary. scorej (ejl ) = φ(ejl ) · wj One possible approach is to perform individ- Candidates for each topic are ranked from high- ual selections from each set of excerpts ej1 . . . ejr est to lowest score. After this procedure, the posi- and then combine the results. This strategy is tion l of excerpt ejl within the topic-specific can- commonly used in multi-document summariza- didate vector is the excerpt’s rank. 
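The query construction and excerpt extraction just described reduce to a few lines of code. The following standard-library sketch is illustrative only: the regular-expression boundary and the minimum-length filter are our assumptions, not details given in the paper.

    import re

    TAG = re.compile(r"<[^>]+>")              # strips residual inline markup
    BOUNDARY = re.compile(r"<p[^>]*>", re.I)  # standardized excerpt boundary


    def topic_query(title, topic):
        # Conjunction of document title and topic,
        # e.g. '"3-M syndrome" diagnosis'.
        return f'"{title}" {topic.lower()}'


    def extract_excerpts(html, min_words=10):
        # Split a retrieved page into candidate excerpts at <p> boundaries,
        # strip remaining tags, and drop very short fragments.
        chunks = (TAG.sub(" ", c) for c in BOUNDARY.split(html))
        return [" ".join(c.split()) for c in chunks
                if len(c.split()) >= min_words]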
tion (Barzilay et al., 1999; Goldstein et al., 2000; Optimizing the Global Objective To avoid re- Radev et al., 2000), where the combination step dundancy between topics, we formulate an opti- eliminates the redundancy across selected ex- mization problem using excerpt rankings to create cerpts. However, separating the two steps may not the final article. Given k topics, we would like to be optimal for this task — the balance between select one excerpt ejl for each topic tj , such that coverage and redundancy is harder to achieve the rank is minimized; that is, scorej (ejl ) is high. when a multi-paragraph summary is generated. In To select the optimal excerpts, we employ inte- addition, a more discriminative selection strategy ger linear programming (ILP). This framework is
Optimizing the Global Objective To avoid redundancy between topics, we formulate an optimization problem using excerpt rankings to create the final article. Given k topics, we would like to select one excerpt ejl for each topic tj such that its rank is minimized; that is, scorej(ejl) is high. To select the optimal excerpts, we employ integer linear programming (ILP). This framework is commonly used in generation and summarization applications where the selection process is driven by multiple constraints (Marciniak and Strube, 2005; Clarke and Lapata, 2007).

We represent excerpts included in the output using a set of indicator variables xjl. For each excerpt ejl, the corresponding indicator variable xjl = 1 if the excerpt is included in the final document, and xjl = 0 otherwise.

Our objective is to minimize the ranks of the excerpts selected for the final document:

    min Σ_{j=1}^{k} Σ_{l=1}^{r} l · xjl

We augment this formulation with two types of constraints.

Exclusivity Constraints We want to ensure that exactly one indicator xjl is nonzero for each topic tj. These constraints are formulated as follows:

    Σ_{l=1}^{r} xjl = 1    ∀ j ∈ {1 . . . k}

Redundancy Constraints We also want to prevent redundancy across topics. We define sim(ejl, ej'l') as the cosine similarity between excerpts ejl from topic tj and ej'l' from topic tj'. We introduce constraints that ensure no pair of excerpts has similarity above 0.5:

    (xjl + xj'l') · sim(ejl, ej'l') ≤ 1    ∀ j, j' = 1 . . . k,  ∀ l, l' = 1 . . . r

If excerpts ejl and ej'l' have cosine similarity sim(ejl, ej'l') > 0.5, only one excerpt may be selected for the final document – i.e., either xjl or xj'l' may be 1, but not both. Conversely, if sim(ejl, ej'l') ≤ 0.5, both excerpts may be selected.
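To make the decoding step concrete, here is a minimal sketch of the objective and both constraint families in PuLP, a freely available ILP modeling library (the paper itself used lp_solve; the function names are ours):

    import pulp


    def decode(ranked, sim):
        """ranked[j]: candidate excerpts for topic j, best rank first;
        sim(a, b): cosine similarity between two excerpts."""
        k = len(ranked)
        prob = pulp.LpProblem("excerpt_selection", pulp.LpMinimize)
        x = {(j, l): pulp.LpVariable(f"x_{j}_{l}", cat="Binary")
             for j in range(k) for l in range(len(ranked[j]))}

        # Objective: minimize summed (1-based) ranks of selected excerpts.
        prob += pulp.lpSum((l + 1) * x[j, l] for (j, l) in x)

        # Exclusivity: exactly one excerpt per topic.
        for j in range(k):
            prob += pulp.lpSum(x[j, l] for l in range(len(ranked[j]))) == 1

        # Redundancy: two excerpts with cosine similarity > 0.5 cannot
        # both appear in the final article.
        for a in x:
            for b in x:
                if a < b and sim(ranked[a[0]][a[1]], ranked[b[0]][b[1]]) > 0.5:
                    prob += x[a] + x[b] <= 1

        prob.solve(pulp.PULP_CBC_CMD(msg=False))
        article = []
        for j in range(k):
            for l in range(len(ranked[j])):
                if x[j, l].value() > 0.5:   # indicator set to 1 by the solver
                    article.append(ranked[j][l])
                    break
        return article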
Solving the ILP Solving an integer linear program is NP-hard (Cormen et al., 1992); however, in practice there exist several strategies for solving certain ILPs efficiently. In our study, we employed lp_solve,[3] an efficient mixed integer programming solver which implements the Branch-and-Bound algorithm. On a larger scale, there are several alternatives for approximating the ILP results, such as a dynamic programming approximation to the knapsack problem (McDonald, 2007).

[3] http://lpsolve.sourceforge.net/5.5/

Features As shown in Table 1, most of the features we select in our model have been employed in previous work on summarization (Mani and Maybury, 1999). All features except the SIMS feature are defined for individual excerpts in isolation. For each excerpt ejl, the value of the SIMS feature is the count of excerpts ejl' in the same topic tj for which sim(ejl, ejl') > 0.5. This feature quantifies the degree of repetition within a topic, often indicative of an excerpt's accuracy and relevance.

    Feature                Value
    UNI word_i             count of word occurrences
    POS word_i             first position of word in excerpt
    BI word_i word_i+1     count of bigram occurrences
    SENT                   count of all sentences
    EXCL                   count of exclamations
    QUES                   count of questions
    WORD                   count of all words
    NAME                   count of title mentions
    DATE                   count of dates
    PROP                   count of proper nouns
    PRON                   count of pronouns
    NUM                    count of numbers
    FIRST word_1           1 (first unigram in the excerpt)
    FIRST word_1 word_2    1 (first bigram in the excerpt)
    SIMS                   count of similar excerpts (cosine similarity > 0.5)

Table 1: Features employed in the ranking model.
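As an illustration, a handful of the surface features in Table 1 can be computed with the standard library alone; the lexicalized UNI/POS/BI features and the DATE/PROP/PRON counts, which require tokenizers and taggers, are omitted. The helper name and the sim argument are our assumptions.

    import re


    def basic_features(excerpt, title, topic_candidates, sim):
        """Compute a few Table 1 features for one excerpt.
        topic_candidates: other excerpts retrieved for the same topic;
        sim: cosine similarity between two excerpt strings."""
        words = excerpt.split()
        feats = {
            "SENT": len(re.findall(r"[.!?]+", excerpt)),  # sentence proxy
            "EXCL": excerpt.count("!"),
            "QUES": excerpt.count("?"),
            "WORD": len(words),
            "NAME": excerpt.lower().count(title.lower()), # title mentions
            "NUM": sum(w.isdigit() for w in words),
            # SIMS: near-duplicate candidates within the same topic.
            "SIMS": sum(sim(excerpt, e) > 0.5
                        for e in topic_candidates if e != excerpt),
        }
        if words:                             # FIRST-unigram indicator
            feats["FIRST " + words[0].lower()] = 1
        return feats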
3.2.2 Model Training

Generating Training Data For training, we are given n original documents d1 . . . dn, a content template consisting of topics t1 . . . tk, and a set of candidate excerpts eij1 . . . eijr for each document di and topic tj. For each section of each document, we add the gold excerpt sij to the corresponding vector of candidate excerpts eij1 . . . eijr. This excerpt represents the target for our training algorithm. Note that the algorithm does not require annotated ranking data; only knowledge of this "optimal" excerpt is required. However, if the excerpts provided in the training data have low quality, noise is introduced into the system.

Training Procedure Our algorithm is a modification of the perceptron ranking algorithm (Collins, 2002), which allows for joint learning across several ranking problems (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007). Pseudocode for this algorithm is provided in Figure 2.

First, we define Rank(eij1 . . . eijr, wj), which ranks all excerpts from the candidate excerpt vector eij1 . . . eijr for document di and topic tj. Excerpts are ordered by scorej(ejl) using the current parameter values. We also define Optimize(eij1 . . . eijr), which finds the optimal selection of excerpts (one per topic) given ranked lists of excerpts eij1 . . . eijr for each document di and topic tj. These functions follow the ranking and optimization procedures described in Section 3.2.1.

The algorithm maintains k parameter vectors w1 . . . wk, one associated with each topic tj desired in the final article. During initialization, all parameter vectors are set to zeros (line 2). To learn the optimal parameters, this algorithm iterates over the training set until the parameters converge or a maximum number of iterations is reached (line 3). For each document in the training set (line 4), the following steps occur: First, candidate excerpts for each topic are ranked (lines 5-6). Next, decoding through ILP optimization is performed over all ranked lists of candidate excerpts, selecting one excerpt for each topic (line 7). Finally, the parameters are updated in a joint fashion. For each topic (line 8), if the selected excerpt is not similar enough to the gold excerpt (line 9), the parameters for that topic are updated using a standard perceptron update rule (line 10). When convergence is reached or the maximum iteration count is exceeded, the learned parameter values are returned (line 12).

    Input:
      d1 . . . dn: A set of n documents, each containing k sections si1 . . . sik
      eij1 . . . eijr: Sets of candidate excerpts for each topic tj and document di

    Define:
      Rank(eij1 . . . eijr, wj):
        As described in Section 3.2.1: calculates scorej(eijl) for all excerpts
        for document di and topic tj, using parameters wj, and orders the list
        of excerpts by scorej(eijl) from highest to lowest.
      Optimize(ei11 . . . eikr):
        As described in Section 3.2.1: finds the optimal selection of excerpts
        to form a final article, given ranked lists of excerpts for each topic
        t1 . . . tk. Returns a list of k excerpts, one for each topic.
      φ(eijl):
        Returns the feature vector representing excerpt eijl

    Initialization:
     1  For j = 1 . . . k
     2    Set parameters wj = 0
    Training:
     3  Repeat until convergence or while iter < itermax:
     4    For i = 1 . . . n
     5      For j = 1 . . . k
     6        Rank(eij1 . . . eijr, wj)
     7      x1 . . . xk = Optimize(ei11 . . . eikr)
     8      For j = 1 . . . k
     9        If sim(xj, sij) < 0.8
    10          wj = wj + φ(sij) − φ(xj)
    11      iter = iter + 1
    12  Return parameters w1 . . . wk

Figure 2: An algorithm for learning several ranking problems with a joint decoding mechanism.
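The following Python sketch mirrors Figure 2 under the same assumptions as the earlier snippets: phi maps an excerpt to a numpy vector, sim is cosine similarity, and decode is the ILP of Section 3.2.1 (e.g., the PuLP sketch above). It is an illustrative reimplementation, not the authors' released code.

    import numpy as np


    def train(docs, k, dim, phi, sim, decode, max_iter=50, threshold=0.8):
        """docs: list of (gold, cands) pairs, where gold[j] is the
        human-authored section for topic j and cands[j] its candidate
        excerpts (with gold[j] included, as described above)."""
        w = [np.zeros(dim) for _ in range(k)]            # lines 1-2
        for _ in range(max_iter):                        # line 3
            updated = False
            for gold, cands in docs:                     # line 4
                # Lines 5-6: rank candidates per topic by current score.
                ranked = [sorted(cands[j], key=lambda e: -(phi(e) @ w[j]))
                          for j in range(k)]
                # Line 7: joint ILP decoding, one excerpt per topic.
                selected = decode(ranked, sim)
                for j in range(k):                       # line 8
                    if sim(selected[j], gold[j]) < threshold:  # line 9
                        # Line 10: perceptron update for topic j.
                        w[j] = w[j] + phi(gold[j]) - phi(selected[j])
                        updated = True
            if not updated:   # convergence: no updates in a full pass
                break
        return w                                         # line 12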
The use of ILP during each step of training sets this algorithm apart from previous work. In prior research, ILP was used as a postprocessing step to remove redundancy and to make other global decisions about parameters (McDonald, 2007; Marciniak and Strube, 2005; Clarke and Lapata, 2007). In our training, by contrast, we intertwine the complete decoding procedure with the parameter updates. Our joint learning approach finds per-topic parameter values that are maximally suited for the global decoding procedure for content selection.

4 Experimental Setup

We evaluate our method by observing the quality of automatically created articles in different domains. We compute the similarity of a large number of articles produced by our system and several baselines to the original human-authored articles using ROUGE, a standard metric for summary quality. In addition, we perform an analysis of editor reaction to system-produced articles submitted to Wikipedia.

Data For evaluation, we consider two domains: American Film Actors and Diseases. These domains have been commonly used in prior work on summarization (Weischedel et al., 2004; Zhou et al., 2004; Filatova and Prager, 2005; Demner-Fushman and Lin, 2007; Biadsy et al., 2008). Our text corpus consists of articles drawn from the corresponding categories in Wikipedia. There are 2,150 articles in American Film Actors and 523 articles in Diseases. For each domain, we randomly select 90% of articles for training and test on the remaining 10%. Human-authored articles in both domains contain an average of four topics, and each topic contains an average of 193 words. In order to model the real-world scenario where Wikipedia articles are not always available (as for new or specialized topics), we specifically exclude Wikipedia sources during our search procedure (Section 3.1) for evaluation.
Baselines Our first baseline, Search, relies solely on search engine ranking for content selection. Using the article title as a query – e.g., Bacillary Angiomatosis – this method selects the web page that is ranked first by the search engine. From this page we select the first k paragraphs, where k is defined in the same way as in our full model. If there are fewer than k paragraphs on the page, all paragraphs are selected, but no other sources are used. This yields a document of comparable size to the output of our system. Despite its simplicity, this baseline is not naive: extracting material from a single document guarantees that the output is coherent, and a page highly ranked by a search engine may readily contain a comprehensive overview of the subject.

Our second baseline, No Template, does not use a template to specify desired topics; therefore, there are no constraints on content selection. Instead, we follow a simplified form of previous work on biography creation, where a classifier is trained to distinguish biographical text (Zhou et al., 2004; Biadsy et al., 2008). In this case, we train a classifier to distinguish domain-specific text. Positive training data is drawn from all topics in the given domain corpus. To find negative training data, we perform the search procedure as in our full model (see Section 3.1) using only the article titles as search queries. Any excerpts which have very low similarity to the original articles are used as negative examples. During the decoding procedure, we use the same search procedure. We then classify each excerpt as relevant or irrelevant and select the k non-redundant excerpts with the highest relevance confidence scores.

Our third baseline, Disjoint, uses the ranking perceptron framework as in our full system; however, rather than performing an optimization step during training and decoding, we simply select the highest-ranked excerpt for each topic. This equates to standard linear classification for each section individually.
In addition to these baselines, we compare against an Oracle system. For each topic present in the human-authored article, the Oracle selects the excerpt from our full model's candidate excerpts with the highest cosine similarity to the human-authored text. This excerpt is the optimal automatic selection from the results available, and therefore represents an upper bound on our excerpt selection task. Some articles contain additional topics beyond those in the template; in these cases, the Oracle system produces a longer article than our algorithm.

Table 2 shows the average number of excerpts selected and sources used in articles created by our full model and each baseline.

                        Avg. Excerpts    Avg. Sources
    Amer. Film Actors
      Search            2.3              1
      No Template       4                4.0
      Disjoint          4                2.1
      Full Model        4                3.4
      Oracle            4.3              4.3
    Diseases
      Search            3.1              1
      No Template       4                2.5
      Disjoint          4                3.0
      Full Model        4                3.2
      Oracle            5.8              3.9

Table 2: Average number of excerpts selected and sources used in article creation for test articles.

Automatic Evaluation To assess the quality of the resulting overview articles, we compare them with the original human-authored articles. We use ROUGE, an evaluation metric employed at the Document Understanding Conferences (DUC), which assumes that proximity to human-authored text is an indicator of summary quality. We use the publicly available ROUGE toolkit (Lin, 2004) to compute recall, precision, and F-score for ROUGE-1, and the Wilcoxon Signed Rank Test to determine statistical significance.
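For intuition, ROUGE-1 recall, precision, and F-score reduce to clipped unigram overlap; a bare-bones version is sketched below. The toolkit of Lin (2004) additionally handles stemming, stopword removal, and multiple references, none of which is shown here.

    from collections import Counter


    def rouge_1(candidate, reference):
        """Clipped unigram-overlap ROUGE-1 (illustrative only)."""
        cand = Counter(candidate.lower().split())
        ref = Counter(reference.lower().split())
        overlap = sum(min(n, ref[w]) for w, n in cand.items())
        recall = overlap / max(sum(ref.values()), 1)
        precision = overlap / max(sum(cand.values()), 1)
        f = (2 * precision * recall / (precision + recall)
             if precision + recall else 0.0)
        return recall, precision, f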
Since Wikipedia is a live resource, we do not repeat this insertion procedure for our baseline systems. Adding articles from systems which have previously demonstrated poor quality would be improper, especially in Diseases. Therefore, we present this analysis as an additional observation rather than as a rigorous technical study.

5 Results

Automatic Evaluation The results of this evaluation are shown in Table 3. Our full model outperforms all of the baselines. By surpassing the Disjoint baseline, we demonstrate the benefits of joint classification. Furthermore, the high performance of both our full model and the Disjoint baseline relative to the other baselines shows the importance of structure-aware content selection. The Oracle system, which represents an upper bound on our system's capabilities, performs well; its selection rule is sketched below.

                     Recall   Precision   F-score
  Amer. Film Actors
    Search             0.09      0.37      0.13 ∗
    No Template        0.33      0.50      0.39 ∗
    Disjoint           0.45      0.32      0.36 ∗
    Full Model         0.46      0.40      0.41
    Oracle             0.48      0.64      0.54 ∗
  Diseases
    Search             0.31      0.37      0.32 †
    No Template        0.32      0.27      0.28 ∗
    Disjoint           0.33      0.40      0.35 ∗
    Full Model         0.36      0.39      0.37
    Oracle             0.59      0.37      0.44 ∗

Table 3: Results of ROUGE-1 evaluation. ∗ Significant with respect to our full model for p ≤ 0.05; † significant with respect to our full model for p ≤ 0.10.

The remaining baselines have different flaws. Articles produced by the No Template baseline tend to focus extensively on a single topic at the expense of breadth, because there are no constraints to ensure diverse topic selection. On the other hand, performance of the Search baseline varies dramatically. This is expected: the baseline relies heavily on both the search engine and individual web pages. The search engine must correctly rank relevant pages, and the web pages must present the important material first.
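The Oracle's selection rule is simple enough to state in a few lines. The sketch below assumes a tf-idf representation for the cosine computation; the paper does not specify the vectorizer, so that choice is illustrative.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

def oracle_select(candidates, human_section):
    """Return the candidate excerpt with the highest cosine similarity
    to the human-authored section (upper bound on excerpt selection)."""
    matrix = TfidfVectorizer().fit_transform(candidates + [human_section])
    similarities = cosine_similarity(matrix[:-1], matrix[-1])
    return candidates[int(np.argmax(similarities))]
```

For each template topic, such a rule would be applied to the full model's candidate pool for that topic against the corresponding human-authored section.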
Analysis of Human Edits The results of our observation of editing patterns are shown in Table 4. These articles have resided on Wikipedia for a period of time ranging from 5 to 11 months. All of them have been edited, and no articles were removed due to lack of quality. Moreover, ten automatically created articles have been promoted by human editors from stubs to regular Wikipedia entries based on the quality and coverage of the material. Information was removed for being irrelevant in three cases: one entire section and two smaller pieces. The most common changes were small edits to formatting and the introduction of links to other Wikipedia articles in the body text.

  Type                    Count
  Total articles            15
  Promoted articles         10
  Edit types
    Intra-wiki links        36
    Formatting              25
    Grammar                 20
    Minor topic edits        2
    Major topic changes      1
  Total edits               85

Table 4: Distribution of edits on Wikipedia.

6 Conclusion

In this paper, we investigated an approach for creating a multi-paragraph overview article by selecting relevant material from the web and organizing it into a single coherent text. Our algorithm yields significant gains over a structure-agnostic approach. Moreover, our results demonstrate the benefits of structured classification, which outperforms independently trained topical classifiers. Overall, the results of our evaluation, combined with our analysis of human edits, confirm that the proposed method can effectively produce comprehensive overview articles.

This work opens several directions for future research. Diseases and American Film Actors exhibit fairly consistent article structures, which are successfully captured by a simple template creation process. However, for categories that exhibit structural variability, more sophisticated statistical approaches may be required to produce accurate templates. Moreover, a promising direction is to consider hierarchical discourse formalisms such as RST (Mann and Thompson, 1988) to supplement our template-based approach.

Acknowledgments

The authors acknowledge the support of the NSF (CAREER grant IIS-0448168, grant IIS-0835445, and grant IIS-0835652) and NIH (grant V54LM008748). Thanks to Mike Collins, Julia Hirschberg, and members of the MIT NLP group for their helpful suggestions and comments. Any opinions, findings, conclusions, or recommendations expressed in this paper are those of the authors, and do not necessarily reflect the views of the funding organizations.
References

Eugene Agichtein, Steve Lawrence, and Luis Gravano. 2001. Learning search engine specific query transformations for question answering. In Proceedings of WWW, pages 169–178.

Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual comparable corpora. In Proceedings of EMNLP, pages 25–32.

Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic content models, with applications to generation and summarization. In Proceedings of HLT-NAACL, pages 113–120.

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information fusion in the context of multi-document summarization. In Proceedings of ACL, pages 550–557.

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised approach to biography production using Wikipedia. In Proceedings of ACL/HLT, pages 807–815.

James Clarke and Mirella Lapata. 2007. Modelling compression with discourse constraints. In Proceedings of EMNLP-CoNLL, pages 1–11.

William W. Cohen, Robert E. Schapire, and Yoram Singer. 1998. Learning to order things. In Proceedings of NIPS, pages 451–457.

Michael Collins. 2002. Ranking algorithms for named-entity extraction: Boosting and the voted perceptron. In Proceedings of ACL, pages 489–496.

Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. 1992. Introduction to Algorithms. The MIT Press.

Hal Daumé III and Daniel Marcu. 2005. A large-scale exploration of effective global features for a joint entity detection and tracking model. In Proceedings of HLT/EMNLP, pages 97–104.

Dina Demner-Fushman and Jimmy Lin. 2007. Answering clinical questions with knowledge-based and statistical techniques. Computational Linguistics, 33(1):63–103.

Elena Filatova and John M. Prager. 2005. Tell me what you do and I'll tell you what you are: Learning occupation-related activities for biographies. In Proceedings of HLT/EMNLP, pages 113–120.
Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006. Automatic creation of domain templates. In Proceedings of ACL, pages 207–214.

Atsushi Fujii and Tetsuya Ishikawa. 2004. Summarizing encyclopedic term descriptions on the web. In Proceedings of COLING, page 645.

Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000. Multi-document summarization by sentence extraction. In Proceedings of NAACL-ANLP, pages 40–48.

Marti A. Hearst. 1994. Multi-paragraph segmentation of expository text. In Proceedings of ACL, pages 9–16.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries. In Proceedings of ACL, pages 74–81.

Inderjeet Mani and Mark T. Maybury. 1999. Advances in Automatic Text Summarization. The MIT Press.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory: Toward a functional theory of text organization. Text, 8(3):243–281.

Tomasz Marciniak and Michael Strube. 2005. Beyond the pipeline: Discrete optimization in NLP. In Proceedings of CoNLL, pages 136–143.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document summarization. In Proceedings of ECIR, pages 557–564.

Vivi Nastase and Michael Strube. 2008. Decoding Wikipedia categories for knowledge acquisition. In Proceedings of AAAI, pages 1219–1224.

Vivi Nastase. 2008. Topic-driven multi-document summarization with encyclopedic knowledge and spreading activation. In Proceedings of EMNLP, pages 763–772.

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000. Centroid-based summarization of multiple documents: Sentence extraction, utility-based evaluation, and user studies. In Proceedings of ANLP/NAACL, pages 21–29.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation Systems. Cambridge University Press, Cambridge.

Benjamin Snyder and Regina Barzilay. 2007. Multiple aspect ranking using the good grief algorithm. In Proceedings of HLT-NAACL, pages 300–307.

Ralph M. Weischedel, Jinxi Xu, and Ana Licuanan. 2004. A hybrid approach to answering biographical questions. In New Directions in Question Answering, pages 59–70.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying Wikipedia. In Proceedings of CIKM, pages 41–50.

Ying Zhao, George Karypis, and Usama Fayyad. 2005. Hierarchical clustering algorithms for document datasets. Data Mining and Knowledge Discovery, 10(2):141–168.

L. Zhou, M. Ticrea, and Eduard Hovy. 2004. Multi-document biography summarization. In Proceedings of EMNLP, pages 434–441.