Automatically Generating Wikipedia Articles: A Structure-Aware Approach
Christina Sauper and Regina Barzilay
Computer Science and Artificial Intelligence Laboratory
Massachusetts Institute of Technology
{csauper,regina}@csail.mit.edu
Abstract

In this paper, we investigate an approach for creating a comprehensive textual overview of a subject composed of information drawn from the Internet. We use the high-level structure of human-authored texts to automatically induce a domain-specific template for the topic structure of a new overview. The algorithmic innovation of our work is a method to learn topic-specific extractors for content selection jointly for the entire template. We augment the standard perceptron algorithm with a global integer linear programming formulation to optimize both local fit of information into each topic and global coherence across the entire overview. The results of our evaluation confirm the benefits of incorporating structural information into the content selection process.

1 Introduction

In this paper, we consider the task of automatically creating a multi-paragraph overview article that provides a comprehensive summary of a subject of interest. Examples of such overviews include actor biographies from IMDB and disease synopses from Wikipedia. Producing these texts by hand is a labor-intensive task, especially when relevant information is scattered throughout a wide range of Internet sources. Our goal is to automate this process. We aim to create an overview of a subject – e.g., 3-M Syndrome – by intelligently combining relevant excerpts from across the Internet.

As a starting point, we can employ methods developed for multi-document summarization. However, our task poses additional technical challenges with respect to content planning. Generating a well-rounded overview article requires proactive strategies to gather relevant material, such as searching the Internet. Moreover, the challenge of maintaining output readability is magnified when creating a longer document that discusses multiple topics.

In our approach, we explore how the high-level structure of human-authored documents can be used to produce well-formed comprehensive overview articles. We select relevant material for an article using a domain-specific automatically generated content template. For example, a template for articles about diseases might contain diagnosis, causes, symptoms, and treatment. Our system induces these templates by analyzing patterns in the structure of human-authored documents in the domain of interest. Then, it produces a new article by selecting content from the Internet for each part of this template. An example of our system’s output¹ is shown in Figure 1.

The algorithmic innovation of our work is a method for learning topic-specific extractors for content selection jointly across the entire template. Learning a single topic-specific extractor can be easily achieved in a standard classification framework. However, the choices for different topics in a template are mutually dependent; for example, in a multi-topic article, there is potential for redundancy across topics. Simultaneously learning content selection for all topics enables us to explicitly model these inter-topic connections.

We formulate this task as a structured classification problem. We estimate the parameters of our model using the perceptron algorithm augmented with an integer linear programming (ILP) formulation, run over a training set of example articles in the given domain.

The key features of this structure-aware approach are twofold:

¹ This system output was added to Wikipedia at http://en.wikipedia.org/wiki/3-M_syndrome on June 26, 2008. The page’s history provides examples of changes performed by human editors to articles created by our system.
Diagnosis . . . No laboratories offering molecular genetic testing for prenatal diagnosis of 3-M syndrome are listed in the GeneTests Laboratory Directory. However, prenatal testing may be available for families in which the disease-causing mutations have been identified in an affected family member in a research or clinical laboratory.

Causes Three M syndrome is thought to be inherited as an autosomal recessive genetic trait. Human traits, including the classic genetic diseases, are the product of the interaction of two genes, one received from the father and one from the mother. In recessive disorders, the condition does not occur unless an individual inherits the same defective gene for the same trait from each parent. . . .

Symptoms . . . Many of the symptoms and physical features associated with the disorder are apparent at birth (congenital). In some cases, individuals who carry a single copy of the disease gene (heterozygotes) may exhibit mild symptoms associated with Three M syndrome.

Treatment . . . Genetic counseling will be of benefit for affected individuals and their families. Family members of affected individuals should also receive regular clinical evaluations to detect any symptoms and physical characteristics that may be potentially associated with Three M syndrome or heterozygosity for the disorder. Other treatment for Three M syndrome is symptomatic and supportive.

Figure 1: A fragment from the automatically created article for 3-M Syndrome.
• Automatic template creation: Templates are automatically induced from human-authored documents. This ensures that the overview article will have the breadth expected in a comprehensive summary, with content drawn from a wide variety of Internet sources.

• Joint parameter estimation for content selection: Parameters are learned jointly for all topics in the template. This procedure optimizes both local relevance of information for each topic and global coherence across the entire article.

We evaluate our approach by creating articles in two domains: Actors and Diseases. For a data set, we use Wikipedia, which contains articles similar to those we wish to produce in terms of length and breadth. An advantage of this data set is that Wikipedia articles explicitly delineate topical sections, facilitating structural analysis. The results of our evaluation confirm the benefits of structure-aware content selection over approaches that do not explicitly model topical structure.

2 Related Work

Concept-to-text generation and text-to-text generation take very different approaches to content selection. In traditional concept-to-text generation, a content planner provides a detailed template for what information should be included in the output and how this information should be organized (Reiter and Dale, 2000). In text-to-text generation, such templates for information organization are not available; sentences are selected based on their salience properties (Mani and Maybury, 1999). While this strategy is robust and portable across domains, output summaries often suffer from coherence and coverage problems.

In between these two approaches is work on domain-specific text-to-text generation. Instances of these tasks are biography generation in summarization and answering definition requests in question-answering. In contrast to a generic summarizer, these applications aim to characterize the types of information that are essential in a given domain. This characterization varies greatly in granularity. For instance, some approaches coarsely discriminate between biographical and non-biographical information (Zhou et al., 2004; Biadsy et al., 2008), while others go beyond binary distinction by identifying atomic events – e.g., occupation and marital status – that are typically included in a biography (Weischedel et al., 2004; Filatova and Prager, 2005; Filatova et al., 2006). Commonly, such templates are specified manually and are hard-coded for a particular domain (Fujii and Ishikawa, 2004; Weischedel et al., 2004).

Our work is related to these approaches; however, content selection in our work is driven by domain-specific automatically induced templates. As our experiments demonstrate, patterns observed in domain-specific training data provide sufficient constraints for topic organization, which is crucial for a comprehensive text.

Our work also relates to a large body of recent work that uses Wikipedia material. Instances of this work include information extraction, ontology induction and resource acquisition (Wu and Weld, 2007; Biadsy et al., 2008; Nastase, 2008; Nastase and Strube, 2008). Our focus is on a different task: generation of new overview articles that follow the structure of Wikipedia articles.
3 Method

The goal of our system is to produce a comprehensive overview article given a title – e.g., Cancer. We assume that relevant information on the subject is available on the Internet but scattered among several pages interspersed with noise.

We are provided with a training corpus consisting of n documents d_1 ... d_n in the same domain – e.g., Diseases. Each document d_i has a title and a set of delineated sections² s_i1 ... s_im. The number of sections m varies between documents. Each section s_ij also has a corresponding heading h_ij – e.g., Treatment.

² In data sets where such mark-up is not available, one can employ topical segmentation algorithms as an additional preprocessing step.

Our overview article creation process consists of three parts. First, a preprocessing step creates a template and searches for a number of candidate excerpts from the Internet. Next, parameters must be trained for the content selection algorithm using our training data set. Finally, a complete article may be created by combining a selection of candidate excerpts.

1. Preprocessing (Section 3.1) Our preprocessing step leverages previous work in topic segmentation and query reformulation to prepare a template and a set of candidate excerpts for content selection. Template generation must occur once per domain, whereas search occurs every time an article is generated in both learning and application.

   (a) Template Induction To create a content template, we cluster all section headings h_i1 ... h_im for all documents d_i. Each cluster is labeled with the most common heading h_ij within the cluster. The largest k clusters are selected to become topics t_1 ... t_k, which form the domain-specific content template.

   (b) Search For each document that we wish to create, we retrieve from the Internet a set of r excerpts e_j1 ... e_jr for each topic t_j from the template. We define appropriate search queries using the requested document title and topics t_j.

2. Learning Content Selection (Section 3.2) For each topic t_j, we learn the corresponding topic-specific parameters w_j to determine the quality of a given excerpt. Using the perceptron framework augmented with an ILP formulation for global optimization, the system is trained to select the best excerpt for each document d_i and each topic t_j. For training, we assume the best excerpt is the original human-authored text s_ij.

3. Application (Section 3.2) Given the title of a requested document, we select several excerpts from the candidate vectors returned by the search procedure (1b) to create a comprehensive overview article. We perform the decoding procedure jointly using learned parameters w_1 ... w_k and the same ILP formulation for global optimization as in training. The result is a new document with k excerpts, one for each topic.

3.1 Preprocessing

Template Induction A content template specifies the topical structure of documents in one domain. For instance, the template for articles about actors consists of four topics t_1 ... t_4: biography, early life, career, and personal life. Using this template to create the biography of a new actor will ensure that its information coverage is consistent with existing human-authored documents.

We aim to derive these templates by discovering common patterns in the organization of documents in a domain of interest. There has been a sizable amount of research on structure induction ranging from linear segmentation (Hearst, 1994) to content modeling (Barzilay and Lee, 2004). At the core of these methods is the assumption that fragments of text conveying similar information have similar word distribution patterns. Therefore, often a simple segment clustering across domain texts can identify strong patterns in content structure (Barzilay and Elhadad, 2003). Clusters containing fragments from many documents are indicative of topics that are essential for a comprehensive summary. Given the simplicity and robustness of this approach, we utilize it for template induction.

We cluster all section headings h_i1 ... h_im from all documents d_i using a repeated bisectioning algorithm (Zhao et al., 2005). As a similarity function, we use cosine similarity weighted with TF*IDF. We eliminate any clusters with low internal similarity (i.e., smaller than 0.5), as we assume these are “miscellaneous” clusters that will not yield unified topics.
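The filtering and labeling part of template induction can be illustrated with a minimal Python sketch. It assumes the heading clusters have already been produced by some clustering step (the repeated bisectioning algorithm itself is not implemented here), and the function names, the smoothed IDF weighting, and the toy headings are all hypothetical choices made for illustration rather than details taken from the system described in this paper.

```python
import math
from collections import Counter
from itertools import combinations

def tfidf_vectors(headings):
    """One TF*IDF vector per heading; the IDF smoothing is an assumption."""
    docs = [Counter(h.lower().split()) for h in headings]
    n = len(docs)
    df = Counter(w for d in docs for w in d)  # document frequency per word
    return [{w: tf * (math.log((1 + n) / (1 + df[w])) + 1) for w, tf in d.items()}
            for d in docs]

def cosine(u, v):
    dot = sum(x * v.get(w, 0.0) for w, x in u.items())
    norm = (math.sqrt(sum(x * x for x in u.values()))
            * math.sqrt(sum(x * x for x in v.values())))
    return dot / norm if norm else 0.0

def internal_similarity(vectors):
    """Average pairwise cosine similarity inside one cluster."""
    pairs = list(combinations(vectors, 2))
    return sum(cosine(u, v) for u, v in pairs) / len(pairs) if pairs else 1.0

def build_template(clusters, k, min_sim=0.5):
    """clusters: lists of section headings from a clustering step (assumed given)."""
    flat = [h for c in clusters for h in c]
    vectors = iter(tfidf_vectors(flat))
    kept = []
    for c in clusters:
        vecs = [next(vectors) for _ in c]
        if internal_similarity(vecs) >= min_sim:  # drop "miscellaneous" clusters
            kept.append(c)
    kept.sort(key=len, reverse=True)              # largest clusters first
    # Label each of the k largest clusters with its most common heading.
    return [Counter(c).most_common(1)[0][0] for c in kept[:k]]

clusters = [
    ["Treatment", "Treatment", "Treatment options"],
    ["Symptoms", "Symptoms"],
    ["History", "See also", "Epidemiology"],
]
template = build_template(clusters, k=2)
print(template)
```

In this toy input the third cluster has no shared vocabulary, so its internal similarity falls below the 0.5 threshold and it is discarded before the largest clusters are chosen.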
We determine the average number of sections k over all documents in our training set, then select the k largest section clusters as topics. We order these topics as t_1 ... t_k using a majority ordering algorithm (Cohen et al., 1998). This algorithm finds a total order among clusters that is consistent with a maximal number of pairwise relationships observed in our data set.

Each topic t_j is identified by the most frequent heading found within the cluster – e.g., Causes. This set of topics forms the content template for a domain.

Search To retrieve relevant excerpts, we must define appropriate search queries for each topic t_1 ... t_k. Query reformulation is an active area of research (Agichtein et al., 2001). We have experimented with several of these methods for drawing search queries from representative words in the body text of each topic; however, we find that the best performance is provided by deriving queries from a conjunction of the document title and topic – e.g., “3-M syndrome” diagnosis.

Using these queries, we search using Yahoo! and retrieve the first ten result pages for each topic. From each of these pages, we extract all possible excerpts consisting of chunks of text between standardized boundary indicators (such as <p> tags). In our experiments, there are an average of 6 excerpts taken from each page. For each topic t_j of each document we wish to create, the total number of excerpts r found on the Internet may differ. We label the excerpts e_j1 ... e_jr.

3.2 Selection Model

Our selection model takes the content template t_1 ... t_k and the candidate excerpts e_j1 ... e_jr for each topic t_j produced in the previous steps. It then selects a series of k excerpts, one from each topic, to create a coherent summary.

One possible approach is to perform individual selections from each set of excerpts e_j1 ... e_jr and then combine the results. This strategy is commonly used in multi-document summarization (Barzilay et al., 1999; Goldstein et al., 2000; Radev et al., 2000), where the combination step eliminates the redundancy across selected excerpts. However, separating the two steps may not be optimal for this task: the balance between coverage and redundancy is harder to achieve when a multi-paragraph summary is generated. In addition, a more discriminative selection strategy is needed when candidate excerpts are drawn directly from the web, as they may be contaminated with noise.

We propose a novel joint training algorithm that learns selection criteria for all the topics simultaneously. This approach enables us to maximize both local fit and global coherence. We implement this algorithm using the perceptron framework, as it can be easily modified for structured prediction while preserving convergence guarantees (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007).

In this section, we first describe the structure and decoding procedure of our model. We then present an algorithm to jointly learn the parameters of all topic models.

3.2.1 Model Structure

The model inputs are as follows:

• The title of the desired document
• t_1 ... t_k: topics from the content template
• e_j1 ... e_jr: candidate excerpts for each topic t_j

In addition, we define feature and parameter vectors:

• φ(e_jl): feature vector for the l-th candidate excerpt for topic t_j
• w_1 ... w_k: parameter vectors, one for each of the topics t_1 ... t_k

Our model constructs a new article by following these two steps:

Ranking First, we attempt to rank candidate excerpts based on how representative they are of each individual topic. For each topic t_j, we induce a ranking of the excerpts e_j1 ... e_jr by mapping each excerpt e_jl to a score:

\[ \mathrm{score}_j(e_{jl}) = \phi(e_{jl}) \cdot w_j \]

Candidates for each topic are ranked from highest to lowest score. After this procedure, the position l of excerpt e_jl within the topic-specific candidate vector is the excerpt’s rank.

Optimizing the Global Objective To avoid redundancy between topics, we formulate an optimization problem using excerpt rankings to create the final article. Given k topics, we would like to select one excerpt e_jl for each topic t_j, such that the rank is minimized; that is, score_j(e_jl) is high.

To select the optimal excerpts, we employ integer linear programming (ILP). This framework is
commonly used in generation and summarization applications where the selection process is driven by multiple constraints (Marciniak and Strube, 2005; Clarke and Lapata, 2007).

We represent excerpts included in the output using a set of indicator variables, x_jl. For each excerpt e_jl, the corresponding indicator variable x_jl = 1 if the excerpt is included in the final document, and x_jl = 0 otherwise.

Our objective is to minimize the ranks of the excerpts selected for the final document:

\[ \min \sum_{j=1}^{k} \sum_{l=1}^{r} l \cdot x_{jl} \]

We augment this formulation with two types of constraints.

Exclusivity Constraints We want to ensure that exactly one indicator x_jl is nonzero for each topic t_j. These constraints are formulated as follows:

\[ \sum_{l=1}^{r} x_{jl} = 1 \quad \forall j \in \{1 \ldots k\} \]

Redundancy Constraints We also want to prevent redundancy across topics. We define sim(e_jl, e_j'l') as the cosine similarity between excerpts e_jl from topic t_j and e_j'l' from topic t_j'. We introduce constraints that ensure no pair of excerpts has similarity above 0.5:

\[ (x_{jl} + x_{j'l'}) \cdot \mathrm{sim}(e_{jl}, e_{j'l'}) \le 1 \quad \forall j, j' = 1 \ldots k \quad \forall l, l' = 1 \ldots r \]

If excerpts e_jl and e_j'l' have cosine similarity sim(e_jl, e_j'l') > 0.5, only one excerpt may be selected for the final document – i.e., either x_jl or x_j'l' may be 1, but not both. Conversely, if sim(e_jl, e_j'l') ≤ 0.5, both excerpts may be selected.

Solving the ILP Solving an integer linear program is NP-hard (Cormen et al., 1992); however, in practice there exist several strategies for solving certain ILPs efficiently. In our study, we employed lp_solve,³ an efficient mixed integer programming solver which implements the Branch-and-Bound algorithm. On a larger scale, there are several alternatives to approximate the ILP results, such as a dynamic programming approximation to the knapsack problem (McDonald, 2007).

    Feature                  Value
    UNI word_i               count of word occurrences
    POS word_i               first position of word in excerpt
    BI word_i word_i+1       count of bigram occurrences
    SENT                     count of all sentences
    EXCL                     count of exclamations
    QUES                     count of questions
    WORD                     count of all words
    NAME                     count of title mentions
    DATE                     count of dates
    PROP                     count of proper nouns
    PRON                     count of pronouns
    NUM                      count of numbers
    FIRST word_1             1 *
    FIRST word_1 word_2      1 †
    SIMS                     count of similar excerpts ‡

Table 1: Features employed in the ranking model.
* Defined as the first unigram in the excerpt.
† Defined as the first bigram in the excerpt.
‡ Defined as excerpts with cosine similarity > 0.5

Features As shown in Table 1, most of the features we select in our model have been employed in previous work on summarization (Mani and Maybury, 1999). All features except the SIMS feature are defined for individual excerpts in isolation. For each excerpt e_jl, the value of the SIMS feature is the count of excerpts e_jl' in the same topic t_j for which sim(e_jl, e_jl') > 0.5. This feature quantifies the degree of repetition within a topic, often indicative of an excerpt’s accuracy and relevance.

3.2.2 Model Training

Generating Training Data For training, we are given n original documents d_1 ... d_n, a content template consisting of topics t_1 ... t_k, and a set of candidate excerpts e_ij1 ... e_ijr for each document d_i and topic t_j. For each section of each document, we add the gold excerpt s_ij to the corresponding vector of candidate excerpts e_ij1 ... e_ijr. This excerpt represents the target for our training algorithm. Note that the algorithm does not require annotated ranking data; only knowledge of this “optimal” excerpt is required. However, if the excerpts provided in the training data have low quality, noise is introduced into the system.

Training Procedure Our algorithm is a modification of the perceptron ranking algorithm (Collins, 2002), which allows for joint learning across several ranking problems (Daumé III and Marcu, 2005; Snyder and Barzilay, 2007). Pseudocode for this algorithm is provided in Figure 2.
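As a small illustration of the selection problem from Section 3.2.1 (rank-minimizing objective, exclusivity and redundancy constraints), the following Python sketch enumerates every one-excerpt-per-topic combination by brute force. This is a stand-in for an ILP solver, not the lp_solve setup used in our study: it is exponential in the number of topics and only suitable for tiny examples. The function names, the raw word-count cosine, and the toy excerpts are assumptions made for illustration.

```python
import math
from collections import Counter
from itertools import combinations, product

def cosine(a, b):
    """Cosine similarity over raw word counts (a toy stand-in for sim)."""
    ca, cb = Counter(a.lower().split()), Counter(b.lower().split())
    dot = sum(ca[w] * cb[w] for w in ca)
    norm = (math.sqrt(sum(v * v for v in ca.values()))
            * math.sqrt(sum(v * v for v in cb.values())))
    return dot / norm if norm else 0.0

def select_excerpts(ranked, sim):
    """ranked[j] lists the candidate excerpts of topic j, best-ranked first."""
    best, best_cost = None, None
    # Choosing exactly one index per topic enforces the exclusivity constraints.
    for choice in product(*(range(len(c)) for c in ranked)):
        picked = [ranked[j][l] for j, l in enumerate(choice)]
        # Redundancy: with both indicators at 1, (1 + 1) * sim must stay <= 1.
        if any(2 * sim(a, b) > 1 for a, b in combinations(picked, 2)):
            continue
        cost = sum(l + 1 for l in choice)  # ranks are 1-based
        if best_cost is None or cost < best_cost:
            best, best_cost = picked, cost
    return best, best_cost  # (None, None) if no feasible selection exists

ranked = [
    ["inherited as a recessive genetic trait", "diagnosed by genetic testing"],
    ["inherited as a recessive genetic trait", "treatment is supportive care"],
]
selection, cost = select_excerpts(ranked, cosine)
print(selection, cost)
```

Here the same excerpt is ranked first for both topics, so the redundancy constraint rules out taking it twice and the search settles on the cheapest non-redundant pair.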
First, we define Rank(e_ij1 ... e_ijr, w_j), which ranks all excerpts from the candidate excerpt vector e_ij1 ... e_ijr for document d_i and topic t_j. Excerpts are ordered by score_j(e_jl) using the current parameter values. We also define Optimize(e_ij1 ... e_ijr), which finds the optimal selection of excerpts (one per topic) given ranked lists of excerpts e_ij1 ... e_ijr for each document d_i and topic t_j. These functions follow the ranking and optimization procedures described in Section 3.2.1. The algorithm maintains k parameter vectors w_1 ... w_k, one associated with each topic t_j desired in the final article. During initialization, all parameter vectors are set to zeros (line 2).

To learn the optimal parameters, this algorithm iterates over the training set until the parameters converge or a maximum number of iterations is reached (line 3). For each document in the training set (line 4), the following steps occur: First, candidate excerpts for each topic are ranked (lines 5-6). Next, decoding through ILP optimization is performed over all ranked lists of candidate excerpts, selecting one excerpt for each topic (line 7). Finally, the parameters are updated in a joint fashion. For each topic (line 8), if the selected excerpt is not similar enough to the gold excerpt (line 9), the parameters for that topic are updated using a standard perceptron update rule (line 10). When convergence is reached or the maximum iteration count is exceeded, the learned parameter values are returned (line 12).

³ http://lpsolve.sourceforge.net/5.5/

    Input:
      d_1 ... d_n: A set of n documents, each containing k sections s_i1 ... s_ik
      e_ij1 ... e_ijr: Sets of candidate excerpts for each topic t_j and document d_i
    Define:
      Rank(e_ij1 ... e_ijr, w_j):
        As described in Section 3.2.1:
        Calculates score_j(e_ijl) for all excerpts for document d_i and topic t_j, using parameters w_j.
        Orders the list of excerpts by score_j(e_ijl) from highest to lowest.
      Optimize(e_i11 ... e_ikr):
        As described in Section 3.2.1:
        Finds the optimal selection of excerpts to form a final article, given ranked lists of excerpts for each topic t_1 ... t_k.
        Returns a list of k excerpts, one for each topic.
      φ(e_ijl):
        Returns the feature vector representing excerpt e_ijl
    Initialization:
      1   For j = 1 ... k
      2     Set parameters w_j = 0
    Training:
      3   Repeat until convergence or while iter < iter_max:
      4     For i = 1 ... n
      5       For j = 1 ... k
      6         Rank(e_ij1 ... e_ijr, w_j)
      7       x_1 ... x_k = Optimize(e_i11 ... e_ikr)
      8       For j = 1 ... k
      9         If sim(x_j, s_ij) < 0.8
      10          w_j = w_j + φ(s_ij) − φ(x_j)
      11    iter = iter + 1
      12  Return parameters w_1 ... w_k

Figure 2: An algorithm for learning several ranking problems with a joint decoding mechanism.
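The training loop of Figure 2 can be sketched in a few lines of Python. This is a toy rendering under several assumptions, not our implementation: the feature map uses unigram counts only, the similarity is a Jaccard word overlap rather than cosine similarity, and the default `optimize` simply takes the top-ranked excerpt per topic (any decoder with the same signature, such as an ILP-based one, could be passed in instead). All function and variable names are hypothetical.

```python
from collections import Counter

def phi(excerpt):
    """Toy feature map (assumption): unigram counts only."""
    return Counter(excerpt.split())

def score(excerpt, w):
    return sum(w.get(f, 0.0) * v for f, v in phi(excerpt).items())

def rank(cands, w):
    # sorted() is stable, so tied excerpts keep their original order.
    return sorted(cands, key=lambda e: score(e, w), reverse=True)

def top_choice(ranked_lists):
    """Stand-in for Optimize: the best-ranked excerpt of every topic."""
    return [r[0] for r in ranked_lists]

def overlap(a, b):
    """Toy similarity (assumption): Jaccard overlap of word sets."""
    sa, sb = set(a.split()), set(b.split())
    return len(sa & sb) / len(sa | sb)

def train(docs, k, optimize=top_choice, sim=overlap, max_iter=10):
    """docs[i] = (candidates, gold); candidates[j] already contains gold[j]."""
    w = [dict() for _ in range(k)]                               # lines 1-2
    for _ in range(max_iter):                                    # line 3
        changed = False
        for cands, gold in docs:                                 # line 4
            ranked = [rank(cands[j], w[j]) for j in range(k)]    # lines 5-6
            chosen = optimize(ranked)                            # line 7
            for j in range(k):                                   # line 8
                if sim(chosen[j], gold[j]) < 0.8:                # line 9
                    for f, v in phi(gold[j]).items():            # line 10: + phi(s_ij)
                        w[j][f] = w[j].get(f, 0.0) + v
                    for f, v in phi(chosen[j]).items():          # line 10: - phi(x_j)
                        w[j][f] = w[j].get(f, 0.0) - v
                    changed = True
        if not changed:  # no update in a full pass: treat as converged
            break
    return w

docs = [(
    [["buy cheap pills now", "caused by a gene mutation"],
     ["click here for more", "treated with supportive care"]],
    ["caused by a gene mutation", "treated with supportive care"],
)]
w = train(docs, k=2)
print(rank(docs[0][0][0], w[0])[0])
```

After one round of updates the gold excerpts outrank the noisy candidates, so the second pass makes no changes and the loop stops.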
The use of ILP during each step of training sets this algorithm apart from previous work. In prior research, ILP was used as a postprocessing step to remove redundancy and make other global decisions about parameters (McDonald, 2007; Marciniak and Strube, 2005; Clarke and Lapata, 2007). However, in our training, we intertwine the complete decoding procedure with the parameter updates. Our joint learning approach finds per-topic parameter values that are maximally suited for the global decoding procedure for content selection.

4 Experimental Setup

We evaluate our method by observing the quality of automatically created articles in different domains. We compute the similarity of a large number of articles produced by our system and several baselines to the original human-authored articles using ROUGE, a standard metric for summary quality. In addition, we perform an analysis of editor reaction to system-produced articles submitted to Wikipedia.

Data For evaluation, we consider two domains: American Film Actors and Diseases. These domains have been commonly used in prior work on summarization (Weischedel et al., 2004; Zhou et al., 2004; Filatova and Prager, 2005; Demner-Fushman and Lin, 2007; Biadsy et al., 2008). Our text corpus consists of articles drawn from the corresponding categories in Wikipedia. There are 2,150 articles in American Film Actors and 523 articles in Diseases. For each domain, we randomly select 90% of articles for training and test on the remaining 10%. Human-authored articles in both domains contain an average of four topics, and each topic contains an average of 193 words. In order to model the real-world scenario where Wikipedia articles are not always available (as for new or specialized topics), we specifically exclude Wikipedia sources during our search procedure (Section 3.1) for evaluation.

                         Avg. Excerpts   Avg. Sources
    Amer. Film Actors
      Search                  2.3             1
      No Template             4               4.0
      Disjoint                4               2.1
      Full Model              4               3.4
      Oracle                  4.3             4.3
    Diseases
      Search                  3.1             1
      No Template             4               2.5
      Disjoint                4               3.0
      Full Model              4               3.2
      Oracle                  5.8             3.9

Table 2: Average number of excerpts selected and sources used in article creation for test articles.

Baselines Our first baseline, Search, relies solely on search engine ranking for content selection. Using the article title as a query – e.g., Bacillary Angiomatosis – this method selects the web page that is ranked first by the search engine. From this page we select the first k paragraphs, where k is defined in the same way as in our full model. If there are fewer than k paragraphs on the page, all paragraphs are selected, but no other sources are used. This yields a document of comparable size with the output of our system. Despite its simplicity, this baseline is not naive: extracting material from a single document guarantees that the output is coherent, and a page highly ranked by a search engine may readily contain a comprehensive overview of the subject.

Our second baseline, No Template, does not use a template to specify desired topics; therefore, there are no constraints on content selection. Instead, we follow a simplified form of previous work on biography creation, where a classifier is trained to distinguish biographical text (Zhou et al., 2004; Biadsy et al., 2008).

In this case, we train a classifier to distinguish domain-specific text. Positive training data is drawn from all topics in the given domain corpus. To find negative training data, we perform the search procedure as in our full model (see Section 3.1) using only the article titles as search queries. Any excerpts which have very low similarity to the original articles are used as negative examples. During the decoding procedure, we use the same search procedure. We then classify each excerpt as relevant or irrelevant and select the k non-redundant excerpts with the highest relevance confidence scores.

Our third baseline, Disjoint, uses the ranking perceptron framework as in our full system; however, rather than perform an optimization step during training and decoding, we simply select the highest-ranked excerpt for each topic. This equates to standard linear classification for each section individually.

In addition to these baselines, we compare against an Oracle system. For each topic present in the human-authored article, the Oracle selects the excerpt from our full model’s candidate excerpts with the highest cosine similarity to the human-authored text. This excerpt is the optimal automatic selection from the results available, and therefore represents an upper bound on our excerpt selection task. Some articles contain additional topics beyond those in the template; in these cases, the Oracle system produces a longer article than our algorithm.

Table 2 shows the average number of excerpts selected and sources used in articles created by our full model and each baseline.

Automatic Evaluation To assess the quality of the resulting overview articles, we compare them with the original human-authored articles. We use ROUGE, an evaluation metric employed at the Document Understanding Conferences (DUC), which assumes that proximity to human-authored text is an indicator of summary quality. We use the publicly available ROUGE toolkit (Lin, 2004) to compute recall, precision, and F-score for ROUGE-1. We use the Wilcoxon Signed Rank Test to determine statistical significance.

Analysis of Human Edits In addition to our automatic evaluation, we perform a study of reactions to system-produced articles by the general public. To achieve this goal, we insert automatically created articles⁴ into Wikipedia itself and examine the feedback of Wikipedia editors. Selection of specific articles is constrained by the need to find topics which are currently of “stub” status that have enough information available on the Internet to construct a valid article. After a period of time, we analyzed the edits made to the articles to determine the overall editor reaction. We report results on 15 articles in the Diseases category⁵.

⁴ In addition to the summary itself, we also include proper citations to the sources from which the material is extracted.
⁵ We are continually submitting new articles; however, we report results on those that have at least a 6 month history at time of writing.
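For readers unfamiliar with the ROUGE-1 scores used in the automatic evaluation, the following Python sketch computes unigram-overlap recall, precision, and F-score between a candidate text and a reference. It is a simplified approximation in the spirit of ROUGE-1, not the ROUGE toolkit itself: tokenization is plain whitespace splitting, and no stemming or stopword handling is applied. The function name and example sentences are hypothetical.

```python
from collections import Counter

def rouge1(candidate, reference):
    """Unigram-overlap recall, precision and F-score (simplified ROUGE-1)."""
    cand = Counter(candidate.lower().split())
    ref = Counter(reference.lower().split())
    overlap = sum((cand & ref).values())  # clipped unigram matches
    recall = overlap / sum(ref.values()) if ref else 0.0
    precision = overlap / sum(cand.values()) if cand else 0.0
    total = recall + precision
    f_score = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_score

reference = "the disorder is inherited as a recessive trait"
candidate = "the disorder is a recessive genetic disease"
recall, precision, f_score = rouge1(candidate, reference)
print(round(recall, 3), round(precision, 3), round(f_score, 3))
```

In this example five of the eight reference words appear in the seven-word candidate, giving a recall of 5/8 and a precision of 5/7.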
                     Recall  Precision  F-score
Amer. Film Actors
  Search              0.09     0.37      0.13 ∗
  No Template         0.33     0.50      0.39 ∗
  Disjoint            0.45     0.32      0.36 ∗
  Full Model          0.46     0.40      0.41
  Oracle              0.48     0.64      0.54 ∗
Diseases
  Search              0.31     0.37      0.32 †
  No Template         0.32     0.27      0.28 ∗
  Disjoint            0.33     0.40      0.35 ∗
  Full Model          0.36     0.39      0.37
  Oracle              0.59     0.37      0.44 ∗

Table 3: Results of ROUGE-1 evaluation.
∗ Significant with respect to our full model for p ≤ 0.05.
† Significant with respect to our full model for p ≤ 0.10.

Type                    Count
Total articles            15
Promoted articles         10
Edit types
  Intra-wiki links        36
  Formatting              25
  Grammar                 20
  Minor topic edits        2
  Major topic changes      1
Total edits               85

Table 4: Distribution of edits on Wikipedia.

Since Wikipedia is a live resource, we do not repeat this procedure for our
baseline systems. Adding articles from systems which have previously
demonstrated poor quality would be improper, especially in Diseases.
Therefore, we present this analysis as an additional observation rather than
a rigorous technical study.

5 Results

Automatic Evaluation. The results of this evaluation are shown in Table 3.
Our full model outperforms all of the baselines. By surpassing the Disjoint
baseline, we demonstrate the benefits of joint classification. Furthermore,
the high performance of both our full model and the Disjoint baseline
relative to the other baselines shows the importance of structure-aware
content selection. The Oracle system, which represents an upper bound on our
system’s capabilities, performs well.

The remaining baselines have different flaws: Articles produced by the No
Template baseline tend to focus on a single topic extensively at the expense
of breadth, because there are no constraints to ensure diverse topic
selection. On the other hand, performance of the Search baseline varies
dramatically. This is expected; this baseline relies heavily on both the
search engine and individual web pages. The search engine must correctly rank
relevant pages, and the web pages must provide the important material first.

Analysis of Human Edits. The results of our observation of editing patterns
are shown in Table 4. These articles have resided on Wikipedia for a period
of time ranging from 5-11 months. All of them have been edited, and no
articles were removed due to lack of quality. Moreover, ten automatically
created articles have been promoted by human editors from stubs to regular
Wikipedia entries based on the quality and coverage of the material.
Information was removed in three cases for being irrelevant, one entire
section and two smaller pieces. The most common changes were small edits to
formatting and introduction of links to other Wikipedia articles in the body
text.

6 Conclusion

In this paper, we investigated an approach for creating a multi-paragraph
overview article by selecting relevant material from the web and organizing
it into a single coherent text. Our algorithm yields significant gains over a
structure-agnostic approach. Moreover, our results demonstrate the benefits
of structured classification, which outperforms independently trained topical
classifiers. Overall, the results of our evaluation combined with our
analysis of human edits confirm that the proposed method can effectively
produce comprehensive overview articles.

This work opens several directions for future research. Diseases and American
Film Actors exhibit fairly consistent article structures, which are
successfully captured by a simple template creation process. However, with
categories that exhibit structural variability, more sophisticated
statistical approaches may be required to produce accurate templates.
Moreover, a promising direction is to consider hierarchical discourse
formalisms such as RST (Mann and Thompson, 1988) to supplement our
template-based approach.

Acknowledgments

The authors acknowledge the support of the NSF (CAREER grant IIS-0448168,
grant IIS-0835445, and grant IIS-0835652) and NIH (grant V54LM008748). Thanks
to Mike Collins, Julia Hirschberg, and members of the MIT NLP group for their
helpful suggestions and comments. Any opinions, findings, conclusions, or
recommendations expressed in this paper are those of the authors, and do not
necessarily reflect the views of the funding organizations.
References

Eugene Agichtein, Steve Lawrence, and Luis Gravano. 2001. Learning search
engine specific query transformations for question answering. In Proceedings
of WWW, pages 169–178.

Regina Barzilay and Noemie Elhadad. 2003. Sentence alignment for monolingual
comparable corpora. In Proceedings of EMNLP, pages 25–32.

Regina Barzilay and Lillian Lee. 2004. Catching the drift: Probabilistic
content models, with applications to generation and summarization. In
Proceedings of HLT-NAACL, pages 113–120.

Regina Barzilay, Kathleen R. McKeown, and Michael Elhadad. 1999. Information
fusion in the context of multi-document summarization. In Proceedings of ACL,
pages 550–557.

Fadi Biadsy, Julia Hirschberg, and Elena Filatova. 2008. An unsupervised
approach to biography production using wikipedia. In Proceedings of ACL/HLT,
pages 807–815.

James Clarke and Mirella Lapata. 2007. Modelling compression with discourse
constraints. In Proceedings of EMNLP-CoNLL, pages 1–11.

William W. Cohen, Robert E. Schapire, and Yoram Singer. 1998. Learning to
order things. In Proceedings of NIPS, pages 451–457.

Michael Collins. 2002. Ranking algorithms for named-entity extraction:
Boosting and the voted perceptron. In Proceedings of ACL, pages 489–496.

Thomas H. Cormen, Charles E. Leiserson, and Ronald L. Rivest. 1992.
Introduction to Algorithms. The MIT Press.

Hal Daumé III and Daniel Marcu. 2005. A large-scale exploration of effective
global features for a joint entity detection and tracking model. In
Proceedings of HLT/EMNLP, pages 97–104.

Dina Demner-Fushman and Jimmy Lin. 2007. Answering clinical questions with
knowledge-based and statistical techniques. Computational Linguistics,
33(1):63–103.

Elena Filatova and John M. Prager. 2005. Tell me what you do and I’ll tell
you what you are: Learning occupation-related activities for biographies. In
Proceedings of HLT/EMNLP, pages 113–120.

Elena Filatova, Vasileios Hatzivassiloglou, and Kathleen McKeown. 2006.
Automatic creation of domain templates. In Proceedings of ACL, pages 207–214.

Atsushi Fujii and Tetsuya Ishikawa. 2004. Summarizing encyclopedic term
descriptions on the web. In Proceedings of COLING, page 645.

Jade Goldstein, Vibhu Mittal, Jaime Carbonell, and Mark Kantrowitz. 2000.
Multi-document summarization by sentence extraction. In Proceedings of
NAACL-ANLP, pages 40–48.

Marti A. Hearst. 1994. Multi-paragraph segmentation of expository text. In
Proceedings of ACL, pages 9–16.

Chin-Yew Lin. 2004. ROUGE: A package for automatic evaluation of summaries.
In Proceedings of ACL, pages 74–81.

Inderjeet Mani and Mark T. Maybury. 1999. Advances in Automatic Text
Summarization. The MIT Press.

William C. Mann and Sandra A. Thompson. 1988. Rhetorical structure theory:
Toward a functional theory of text organization. Text, 8(3):243–281.

Tomasz Marciniak and Michael Strube. 2005. Beyond the pipeline: Discrete
optimization in NLP. In Proceedings of CoNLL, pages 136–143.

Ryan McDonald. 2007. A study of global inference algorithms in multi-document
summarization. In Proceedings of ECIR, pages 557–564.

Vivi Nastase and Michael Strube. 2008. Decoding wikipedia categories for
knowledge acquisition. In Proceedings of AAAI, pages 1219–1224.

Vivi Nastase. 2008. Topic-driven multi-document summarization with
encyclopedic knowledge and spreading activation. In Proceedings of EMNLP,
pages 763–772.

Dragomir R. Radev, Hongyan Jing, and Malgorzata Budzikowska. 2000.
Centroid-based summarization of multiple documents: sentence extraction,
utility-based evaluation, and user studies. In Proceedings of ANLP/NAACL,
pages 21–29.

Ehud Reiter and Robert Dale. 2000. Building Natural Language Generation
Systems. Cambridge University Press, Cambridge.

Benjamin Snyder and Regina Barzilay. 2007. Multiple aspect ranking using the
good grief algorithm. In Proceedings of HLT-NAACL, pages 300–307.

Ralph M. Weischedel, Jinxi Xu, and Ana Licuanan. 2004. A hybrid approach to
answering biographical questions. In New Directions in Question Answering,
pages 59–70.

Fei Wu and Daniel S. Weld. 2007. Autonomously semantifying wikipedia. In
Proceedings of CIKM, pages 41–50.

Ying Zhao, George Karypis, and Usama Fayyad. 2005. Hierarchical clustering
algorithms for document datasets. Data Mining and Knowledge Discovery,
10(2):141–168.

L. Zhou, M. Ticrea, and Eduard Hovy. 2004. Multi-document biography
summarization. In Proceedings of EMNLP, pages 434–441.