Automatic Text Summarization
           Katja Filippova
    filippova@eml-research.de

         EML Research gGmbH
            TU Darmstadt




                                Text Summarization – 25.02.2009 – p. 1
Text summarization
• A summary is a text that is produced from one or more
  texts, that contains a significant portion of the information in
  the original text(s), and that is no longer than half of the
  original text(s) (Hovy, 2003)


                              • information retrieval
                              • stock market prediction
                              • generation of abstracts
                              • online news summarization
                              • ...



Overview
• Introduction
   • classification of summarization systems
   • abstraction vs. extraction
• Text cohesion and coherence for summarization
   • graph based methods
   • discourse structure based methods
• Document Understanding Conference
   • tasks
   • an example
• Research directions
   • sentence fusion and compression
   • integrating world knowledge
Text summarization: types

 • A summary is a text that is produced from one or more
   texts, that contains a significant portion of the information in
   the original text(s), and that is no longer than half of the
   original text(s) (Hovy, 2003)
 • Indicative
  « indicates types of information
  « “alerts”
 • Informative
  « includes quantitative/qualitative information
  « “informs”
 • Critic/evaluative
  « evaluates the content of the document
Text summarization: types

INDICATIVE
 • The work of Consumer Advice Centres is examined. The
   information sources used to support this work are reviewed.
   The recent closure of many CACs has seriously affected the
   availability of consumer information and advice. The
   contribution that public libraries can make in enhancing the
   availability of consumer information and advice both to the
   public and other agencies involved in consumer information
   and advice, is discussed.




Text summarization: types

INFORMATIVE
 • An examination of the work of Consumer Advice Centres
   and of the information sources and support activities that
   public libraries can offer. CACs have dealt with pre-shopping
   advice, education on consumers’ rights and complaints
   about goods and services, advising the client and often
   obtaining expert assessment. They have drawn on a wide
   range of information sources including case records, trade
   literature, contact files and external links. The recent closure
   of many CACs has seriously affected the availability of
   consumer information and advice. Libraries can cooperate
   closely with advice agencies through local coordinating
   committees, shared premises, joint publicity, referral and the
   sharing of professional expertise.
Text summarization: types

 • Source: single-document vs. multi-document
  « research paper
  « proceedings of a conference
 • Content: generic vs. query-based vs. user-focused
  « equal coverage of all major topics
  « based on a question “what are the causes of the war?”
  « users interested in chemistry
 • Form: extract vs. abstract
  « fragments from the document
  « newly re-written text


Extraction vs. abstraction

How should a text summarization system proceed?

 • read the documents




 • understand them – build
   a semantic representation


 • generate a summary from
   this representation


Extraction vs. abstraction
 • unfortunately, a rich semantic representation is not
   possible yet
 • to date, most summarization systems are extractive

 • usually, extraction units are sentences

 • low cost solution: could work without ontologies,
   complex representations, etc.
 • extractive summaries are usually incoherent

 • trade-off between non-redundancy and completeness




Extraction vs. abstraction

Three sentences from related documents (Oct. 27 2008):
 • The Syrian foreign minister today condemned the killing of
   eight civilians in a US raid as an act of “criminal and terrorist
   aggression”. (The Guardian)
 • Syria accused the United States on Monday of carrying out
   a “terrorist aggression” after a deadly raid near its border
   with Iraq which it said killed eight civilians. (Reuters)
 • Lebanese President Michel Suleiman on Monday contacted
   his Syrian counterpart Bashar Assad to denounce
   “Sunday’s American aggression” against the Syrian village
   of Abu Kamal near the border with Iraq, local Elnashra
   website reported. (Aljazeera)

Extraction vs. abstraction
 • extractive summaries are not coherent: sentences pulled
   from different documents each make sense on their own
   but sound awkward when put together
 • unresolved pronouns may distort the meaning

 • beginning with a sentence which starts with However, ... is
   not a good idea
 • there is a striking difference with human generated texts –
   pronouns and connectives are in the right place, the flow of
   discourse makes sense
 • How could one use this property of natural discourse for
   summarization?
Text coherence vs. text cohesion
 • John enjoys playing the piano. John wants to become a
  famous piano player. John works hard and works hard every
  day. Working hard is necessary to become a famous piano
  player.
 • John enjoys playing the piano. However, he woke up early
  yesterday. But the day before yesterday the weather was
  wonderful, because rain and snow started immediately and
  continued the whole day through. By the way, his teacher
  did the same.
 • John enjoys playing the piano and wants to become famous.
  He works hard and does it every day because it is
  necessary for his goal.

Text coherence vs. text cohesion
 • Text coherence represents the overall structure of a
  multi-sentence text in terms of macro-level relations
  between clauses or sentences (Halliday & Hasan, 1996).
  « Rhetorical Structure Theory (Mann & Thompson, 1988)
  « Discourse Representation Theory (Kamp, 1981)
  « Discourse Lexicalized Tree Adjoining Grammar (Forbes,
     2001)
 • John enjoys playing the piano. [John wants to become a
   famous piano player.] (that’s why) [John works hard and
   works hard every day.] Working hard is necessary to
   become a famous piano player.


Text coherence vs. text cohesion
 • Text cohesion involves relations between words, word
  senses, or referring expressions, which determine how
  tightly connected the text is (Halliday & Hasan, 1996).
  « anaphora, ellipsis, connectives
  « synonymy and other lexical relations
 • John enjoys playing the piano. However, he woke up early
  yesterday. But the day before yesterday the weather was
  wonderful, because rain and snow started immediately and
  continued the whole day through. By the way, his teacher
  did the same.



Coherence based summarization
• earlier systems considered technical documents and aimed
  at identifying important information by assigning weights to
  sentences (Luhn, 1958; Edmundson, 1969)
• several weighted features were used:
 « word (stem) frequency
 « presence of cue words (e.g., as a result, significant)
   which signal important content
 « sentence position
 « document structure
• feature weights were tuned manually
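
The weighted-feature idea above can be sketched as follows. This is a minimal illustration, not a reconstruction of Luhn's or Edmundson's systems: the cue-word list, the stopword list, the feature definitions and the weights are all assumptions chosen for the example, and the document-structure feature is left out.

```python
import re
from collections import Counter

# Illustrative values only: cue words, stopwords and weights are assumptions.
CUE_WORDS = {"significant", "result", "conclude", "important"}
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "is", "are",
             "that", "this", "as", "we"}
WEIGHTS = {"frequency": 1.0, "cue": 1.0, "position": 1.0}  # would be hand-tuned


def tokenize(sentence):
    """Lowercase a sentence and return its alphabetic tokens."""
    return re.findall(r"[a-z]+", sentence.lower())


def score_sentences(sentences):
    """Return one score per sentence as a weighted sum of three features."""
    content = [[t for t in tokenize(s) if t not in STOPWORDS] for s in sentences]
    freq = Counter(t for toks in content for t in toks)
    top = max(freq.values(), default=1)
    scores = []
    for i, toks in enumerate(content):
        # Average relative frequency of the sentence's content words.
        f_freq = sum(freq[t] / top for t in toks) / len(toks) if toks else 0.0
        # 1 if the sentence contains any cue word, else 0.
        f_cue = 1.0 if CUE_WORDS & set(toks) else 0.0
        # Earlier sentences score higher; the first sentence gets 1.0.
        f_pos = 1.0 - i / len(sentences)
        scores.append(WEIGHTS["frequency"] * f_freq
                      + WEIGHTS["cue"] * f_cue
                      + WEIGHTS["position"] * f_pos)
    return scores
```

Sorting the sentences by these scores and keeping the best few gives a simple extract; changing WEIGHTS by hand corresponds to the manual tuning mentioned above.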



Coherence based summarization
• Rhetorical Structure Theory (Mann & Thompson, 1988)
   • elaboration
   • example
   • contrast
   • background
   • motivation
   • etc.

   [Figure: RST tree over the spans “I am optimistic” / said Mr. Smith /
   as the market plunged, with the relation labels Attribution and
   Circumstance (from Sporleder & Lapata, 2005)]
Coherence based summarization
• one could use discourse structure for summarization
  (Marcu, 2000)
• however, this is not done often:
   • there are few discourse parsers and they are not very
     precise
   • it is debated whether a tree representation is
     sufficient for discourse (Wolf & Gibson, 2005)
   • it is not obvious how to classify rhetorical relations
   • some relations are argued to be anaphoric rather than
     discourse relations (Webber et al., 2003)



Cohesion based summarization
 • it is common to represent a text as a graph, where nodes
   are sentences and edges are some relations between them
   (e.g., discourse relations or just similarity)
 • a common graph connectivity assumption is that the nodes
   which are connected to many other nodes are likely to carry
   salient information
 • it is also assumed that nodes whose removal affects the
   structure of the document are important (Skorochodko, 1972
   from Mani, 2001)
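
The two graph assumptions above can be illustrated with a small sketch. The edge criterion (at least two shared words, with no stopword filtering) and all function names are assumptions made for this example; any sentence relation could be used instead.

```python
import re
from itertools import combinations


def words(sentence):
    """Set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def build_graph(sentences, min_shared=2):
    """Connect two sentences when they share at least `min_shared` words."""
    graph = {i: set() for i in range(len(sentences))}
    for i, j in combinations(range(len(sentences)), 2):
        if len(words(sentences[i]) & words(sentences[j])) >= min_shared:
            graph[i].add(j)
            graph[j].add(i)
    return graph


def by_degree(graph):
    """Sentence indices ordered from most to least connected."""
    return sorted(graph, key=lambda n: (-len(graph[n]), n))


def components(graph, removed=None):
    """Count connected components, optionally ignoring one node."""
    seen, count = {removed}, 0
    for start in graph:
        if start in seen:
            continue
        count += 1
        stack = [start]
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            stack.extend(graph[node] - seen)
    return count
```

by_degree reflects the first assumption (well-connected nodes are salient). For the second, a node is structurally important if components(graph, removed=node) is larger than components(graph), i.e. removing it splits the graph.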




Cohesion based summarization
 • modern approaches extend this idea and use PageRank
   (Brin & Page, 1998) to find salient nodes in such a graph
   (Erkan & Radev, 2004; Mihalcea & Tarau, 2004)
 • similar sentences are connected (bag-of-words similarity)
 • a similarity threshold is used
 • the top N sentences by PageRank score are extracted
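
The three steps above can be sketched in a few lines. This is a simplified illustration in the spirit of LexRank and TextRank, not either published system: the tokenizer, the threshold of 0.2, the damping factor of 0.85, the fixed iteration count and the unweighted edges are assumptions.

```python
import math
import re
from collections import Counter


def bow(sentence):
    """Bag-of-words vector as a token -> count mapping."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two bag-of-words vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


def rank_sentences(sentences, threshold=0.2, damping=0.85, iterations=50):
    """PageRank scores on an unweighted, thresholded similarity graph."""
    n = len(sentences)
    if n == 0:
        return []
    vecs = [bow(s) for s in sentences]
    neighbours = [
        [j for j in range(n) if j != i and cosine(vecs[i], vecs[j]) >= threshold]
        for i in range(n)
    ]
    scores = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i in range(n):
            if neighbours[i]:
                share = damping * scores[i] / len(neighbours[i])
                for j in neighbours[i]:
                    new[j] += share
            else:
                # A sentence with no neighbours spreads its score uniformly.
                for j in range(n):
                    new[j] += damping * scores[i] / n
        scores = new
    return scores


def top_n(sentences, n=2, **kwargs):
    """Extract the n best-ranked sentences, kept in document order."""
    scores = rank_sentences(sentences, **kwargs)
    best = sorted(range(len(sentences)), key=lambda i: -scores[i])[:n]
    return [sentences[i] for i in sorted(best)]
```

The threshold controls how dense the graph is; as noted for cohesion-based methods in general, it has to be tuned.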




Coherence vs. cohesion based TS
  • Coherence:
      + transparent; coherence of the output can be improved
      – annotation of relations is still a challenge; preprocessing
        difficulties
  • Cohesion:
      + intuitively appealing; low-cost; even unsupervised
      – requires WSD*, anaphora resolution; hard to pin down;
        tuned thresholds

* word sense disambiguation




DUC competitions

 • Document Understanding Conferences (2000-2007)
 • from 2008 Text Analysis Conference (TAC)

 • provide participants with
    - a task
    - data
    - manual and automatic evaluation
 • increasing challenge in tasks: from generic single-document
   summarization to multi-document update summary (2008)




DUC competitions

Sample topic:   D0740I


round-the-world balloon flight


Report on the planning, attempts and first
successful balloon circumnavigation of the earth
by Bertrand Piccard and his crew.




DUC competitions
 <DOC>
<DOCNO> APW19981112.0453 </DOCNO>
<DOCTYPE> NEWS STORY </DOCTYPE>
<DATE_TIME> 11/12/1998 08:21:00 </DATE_TIME>
<HEADER> w1942 &Cx1f; wstm- r i &Cx13; &Cx11; BC-Switzerland-BalloonQu
11-12 0355 </HEADER>
<BODY>
<SLUG> BC-Switzerland-Balloon Quest </SLUG> <HEADLINE> Swiss challenger
prepares third attempt at global record </HEADLINE> &UR; AP Photos GEV
101-102 &QL; <TEXT> GENEVA (AP) _ Swiss balloon pilot Bertrand Piccard
and his new teammate, British flight engineer Tony Brown, said Thursday
they will be ready later this month for a new attempt to fly nonstop
round the world. Their new Breitling Orbiter 3 balloon will take off
from Chateau d’Oex, in the Swiss Alps, as soon after Nov. 25 as weather
conditions are favorable, they said. It will be Piccard’s third attempt
to become the first to pilot a balloon around the world. In February
the Swiss pilot, along with British flight engineer Andy Elson and
The EML NLP group at DUC 2007




Preprocessing: Annotation

 • Sentence splitting
 • Tokenization
 • PoS tagging
 • Chunking
 • Named entity recognition




Preprocessing: Problems

 • Sentence splitting
   <sentence>At Pine Ridge, a scrolling marquee
   at Big Bat’s Texaco expressed both joy over
   Clinton’s visit and wariness of all the
   official attention: “Welcome President
   Clinton.</sentence> <sentence>Remember our
   treaties,” the sign read.
 • and cleaning
    <sentence>PINE RIDGE, S.D.</sentence>
   <sentence>(AP) - President Clinton turned the
   attention of his national poverty tour today
   to arguably the poorest, most forgotten U.S.
   citizens of them all: American
   Indians.</sentence>
Preprocessing: Document filtering

 • Match topic with document extracts
 • Pick the top 5 matching documents
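
A minimal sketch of such a filtering step is given below. The slides do not say how the match is computed or what a document extract is, so the word-overlap measure, the use of the first 500 characters as the extract, the stopword list and the function names are all assumptions.

```python
import re

# Illustrative stopword list (assumption).
STOPWORDS = {"the", "a", "an", "of", "to", "and", "in", "on", "by", "his"}


def content_words(text):
    """Lowercase alphabetic tokens of a text, without stopwords."""
    return {t for t in re.findall(r"[a-z]+", text.lower()) if t not in STOPWORDS}


def filter_documents(topic, documents, k=5, extract_chars=500):
    """Indices of the k documents whose opening extract best matches the topic."""
    topic_words = content_words(topic)

    def overlap(i):
        return len(topic_words & content_words(documents[i][:extract_chars]))

    ranked = sorted(range(len(documents)), key=lambda i: (-overlap(i), i))
    return ranked[:k]
```

Ties are broken by document order, so the result is deterministic for a given input.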




Semantic analysis


 • Filter topic
 • Connect topic words with words in
   document sentences
 • Compute sentence scores
      matching words
      matching word sequences


« ranked list of sentences
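
One way to turn matching words and matching word sequences into a sentence score is sketched below. The exact scoring used by the system is not given on the slide; here word sequences are approximated by two-word sequences, the bonus weight is arbitrary, and the topic filtering step is omitted. All names are hypothetical.

```python
import re


def tokens(text):
    """Lowercase alphabetic tokens of a text."""
    return re.findall(r"[a-z]+", text.lower())


def bigrams(toks):
    """Set of adjacent token pairs."""
    return set(zip(toks, toks[1:]))


def sentence_score(topic, sentence, sequence_bonus=2.0):
    """Shared words, plus a bonus for each shared two-word sequence."""
    t, s = tokens(topic), tokens(sentence)
    shared_words = len(set(t) & set(s))
    shared_sequences = len(bigrams(t) & bigrams(s))
    return shared_words + sequence_bonus * shared_sequences


def rank(topic, sentences):
    """Sentences ordered from best to worst match with the topic."""
    return sorted(sentences, key=lambda s: -sentence_score(topic, s))
```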


Extractive summary generation




 • Rerank sentences
 • Select the top non-redundant sentences (250 word limit)
 • Re-arrange sentences
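
The selection and re-arrangement steps can be sketched as a greedy loop. Only the 250 word limit comes from the slide; the Jaccard overlap test, its 0.5 threshold, and re-arranging by original position are assumptions for this example.

```python
import re


def word_set(sentence):
    """Set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def build_summary(ranked, word_limit=250, max_overlap=0.5):
    """Greedy selection; `ranked` holds (position, sentence) pairs, best first."""
    chosen, used = [], 0
    for position, sentence in ranked:
        length = len(sentence.split())
        if used + length > word_limit:
            continue  # would exceed the length limit
        if any(jaccard(word_set(sentence), word_set(s)) > max_overlap
               for _, s in chosen):
            continue  # too similar to a sentence already selected
        chosen.append((position, sentence))
        used += length
    # Re-arrange: here simply by original position (an assumption).
    return [s for _, s in sorted(chosen)]
```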
A good summary

Round-the-world balloon flight: Report on the planning, attempts
and first successful balloon circumnavigation of the earth by
Bertrand Piccard and his crew.

Swiss balloon pilot Bertrand Piccard announced Wednesday
that he has chosen Brian Jones as his teammate for his next
attempt at circling the world in a balloon. Jones, 52, replaces
fellow British flight engineer Tony Brown. Achieving what
promoters called the last great milestone of aviation, Bertrand
Piccard and Brian Jones joined legends like the Wright Brothers
and Charles Lindbergh with Saturday’s completion of the first
manned round-the-world balloon flight. At 4:54 a.m. EST
Saturday, the two balloonists crossed the line of longitude from
which they had departed on March 1 at Chateau D’Oex,
Switzerland, ...
A bad summary

Angelina Jolie: What have been the most recent significant
events in the life and career of actress Angelina Jolie?

Angelina Jolie’s win for best supporting actress for her role in
“Girl, Interrupted” came 21 years after father Jon Voight was
awarded best actor for “Coming Home.“ ANGELINA JOLIE’S
LIFE ON THE EDGE After all, her career is in overdrive. But
Jolie cautions that she’s still a serious actress. It’s not like I’m
suddenly a better actress because I have awards or this box
office clout,” she says. “I am secure in the fact that I do have
something to offer as an actress,”Jolie says. ‘...



Evaluation
• automatic evaluation with ROUGE (Lin, 2004)

• manual evaluation with respect to
 « responsiveness
 « linguistic quality
   1. grammaticality
   2. non-redundancy
   3. referential clarity
   4. focus
   5. structure and coherence
• our system scored above the average, top 5 for
  non-redundancy and coherence (recall the document
  filtering stage)
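
For the automatic evaluation, the core of ROUGE-N is n-gram recall against human reference summaries. The sketch below shows only this basic quantity; the official ROUGE package offers further options (stemming, stopword removal, other variants) that are not reproduced here, and the tokenizer is an assumption.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Multiset of word n-grams in a text."""
    toks = re.findall(r"[a-z0-9]+", text.lower())
    return Counter(tuple(toks[i:i + n]) for i in range(len(toks) - n + 1))


def rouge_n_recall(candidate, references, n=2):
    """Clipped n-gram matches divided by the number of reference n-grams."""
    cand = ngrams(candidate, n)
    matched = total = 0
    for ref in references:
        ref_counts = ngrams(ref, n)
        total += sum(ref_counts.values())
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0
```

Because the measure is recall-oriented, a longer candidate can only gain matches, which is one reason for the fixed summary length limit in the evaluation.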
Research directions
 • as in information retrieval, query expansion is expected to
   improve recall
  « WordNet (Fellbaum, 1998) for similarity
  « Wikipedia for relatedness (Strube & Ponzetto, 2006)
  « paraphrases
 • coreference resolution is needed for preprocessing,
   otherwise, e.g., pronouns are filtered as stopwords
 • relevance vs. redundancy issue: in multi-document
   summarization (MDS), how can we ensure non-redundancy
   of the summary? (Carbonell & Goldstein, 1998)
 • sentence ordering for extractive MDS (Barzilay & Lapata,
   2005)
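
For the relevance vs. redundancy issue, the Maximal Marginal Relevance (MMR) criterion of Carbonell & Goldstein (1998) selects the next item by trading relevance to the query against similarity to what has already been selected. In the usual notation, Q is the query, R the set of ranked candidates, S the set already selected, Sim_1 and Sim_2 are similarity measures, and lambda in [0, 1] sets the trade-off:

```latex
\mathrm{MMR} = \arg\max_{D_i \in R \setminus S}
  \Big[ \lambda \, \mathrm{Sim}_1(D_i, Q)
        - (1 - \lambda) \max_{D_j \in S} \mathrm{Sim}_2(D_i, D_j) \Big]
```

With lambda = 1 this reduces to plain relevance ranking; smaller values penalize candidates that resemble sentences already in the summary.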
Directions of research

 • abstractive summarization is a distant goal but there are
  ways to go beyond sentence extraction
  « sentence compression
  « sentence fusion




Sentence compression

This is true, regardless of the opinion that some people have of Syria, and of
their unhappiness at Syria’s presence in Lebanon.

  • summarization on the sentence level

  • in principle, a compression can be different from the input
    (different wording and structure)
  • to date, most systems use word deletion only

  • a compression corpus is now available online:
    http://homepages.inf.ed.ac.uk/s0460084/data
  • the performance can be evaluated automatically
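
Word-deletion compression and one possible automatic measure can be sketched as follows. The slide does not name the evaluation metric; the F1 over kept token positions against a gold compression used here is an assumption, as are the function names.

```python
def compress(tokens, keep):
    """Word-deletion compression: keep[i] says whether tokens[i] survives."""
    return [t for t, k in zip(tokens, keep) if k]


def compression_rate(tokens, keep):
    """Fraction of the input tokens that are kept."""
    return sum(keep) / len(tokens) if tokens else 0.0


def kept_token_f1(system_keep, gold_keep):
    """F1 over the token positions kept by the system vs. a gold compression."""
    system = {i for i, k in enumerate(system_keep) if k}
    gold = {i for i, k in enumerate(gold_keep) if k}
    overlap = len(system & gold)
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(system), overlap / len(gold)
    return 2 * precision * recall / (precision + recall)
```

A system that only deletes words is fully described by its keep mask, which is what makes this kind of position-level comparison possible.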



Sentence fusion
 1 John Smith, born November 15 1900, studied chemistry and physics at
   the University of London.
 2 From 1917 Mr. Smith studied at the University of London and in 1921 he
   graduated with distinction.
« Mr. Smith studied chemistry and physics at the University of London
  from 1917.

 • pieces of related sentences are used to generate a novel
   sentence
 • can be seen as a middle ground between extractive and
   abstractive summarization
 • addresses the incompleteness-redundancy problem

Thank you!




             (FOR YOUR ATTENTION)




References
• R. Barzilay & M. Lapata, 2005: Modeling local coherence:
  An entity-based approach
• S. Brin & L. Page, 1998: The anatomy of a large-scale
  hypertextual web search engine
• J. G. Carbonell & J. Goldstein, 1998: The use of MMR,
  diversity-based reranking for reordering documents and
  producing summaries
• H. P. Edmundson, 1969: New methods in automatic
  extracting
• G. Erkan & D. Radev, 2004: LexRank: Graph-based lexical
  centrality as salience in text summarization
• C. Fellbaum, 1998: WordNet: An electronic lexical database

• K. Forbes, E. Miltsakaki, R. Prasad, A. Sarkar, A. Joshi, B.
  L. Webber, 2001: DLTAG system – discourse parsing with a
  Lexicalized Tree Adjoining Grammar
• M. Halliday & R. Hasan, 1996: Cohesion in text
• E. H. Hovy, 2003: Text summarization
• H. Kamp, 1981: A theory of truth and semantic
  representation
• C.-Y. Lin, 2004: Automatic evaluation of summaries using
  N-gram co-occurrence statistics
• H. P. Luhn, 1958: The automatic creation of literature
  abstracts
• I. Mani, 2001: Automatic summarization
• W. C. Mann & S. A. Thompson, 1988: Rhetorical structure
  theory. Towards a functional theory of text organization
• D. Marcu, 2000: The theory and practice of discourse
  parsing and summarization
• R. Mihalcea & P. Tarau, 2004: TextRank: Bringing order
  into text
• E. Skorochodko, 1972: Adaptive method of automatic
  abstracting and indexing
• C. Sporleder & M. Lapata, 2005: Discourse chunking and its
  application to sentence compression
• M. Strube & S. P. Ponzetto, 2006: WikiRelate! Computing
  semantic relatedness using Wikipedia
• B. L. Webber, M. Stone, A. Joshi, A. Knott, 2003: Anaphora
  and discourse structure
• F. Wolf & E. Gibson, 2005: Representing discourse
  coherence: A corpus-based study

Automatic Text Summarization Techniques

  • 1. Automatic Text Summarization Katja Filippova filippova@eml-research.de EML Research gGmbH TU Darmstadt Text Summarization – 25.02.2009 – p. 1
  • 2. Text summarization • A summary is a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s) (Hovy, 2003) • information retrieval • stock market prediction • generation of abstracts • online news summarization • ...
  • 3. Overview • Introduction • classification of summarization systems • abstraction vs. extraction • Text cohesion and coherence for summarization • graph based methods • discourse structure based methods • Document Understanding Conference • tasks • an example • Research directions • sentence fusion and compression • integrating world knowledge
  • 6. Text summarization: types • A summary is a text that is produced from one or more texts, that contains a significant portion of the information in the original text(s), and that is no longer than half of the original text(s) (Hovy, 2003) • Indicative « indicates types of information « “alerts” • Informative « includes quantitative/qualitative information « “informs” • Critical/evaluative « evaluates the content of the document
  • 7. Text summarization: types INDICATIVE • The work of Consumer Advice Centres is examined. The information sources used to support this work are reviewed. The recent closure of many CACs has seriously affected the availability of consumer information and advice. The contribution that public libraries can make in enhancing the availability of consumer information and advice both to the public and other agencies involved in consumer information and advice, is discussed.
  • 8. Text summarization: types INFORMATIVE • An examination of the work of Consumer Advice Centres and of the information sources and support activities that public libraries can offer. CACs have dealt with pre-shopping advice, education on consumers’ rights and complaints about goods and services, advising the client and often obtaining expert assessment. They have drawn on a wide range of information sources including case records, trade literature, contact files and external links. The recent closure of many CACs has seriously affected the availability of consumer information and advice. Libraries can cooperate closely with advice agencies through local coordinating committees, shared premises, joint publicity, referral and the sharing of professional expertise.
  • 11. Text summarization: types • Source: single-document vs. multi-document « research paper « proceedings of a conference • Content: generic vs. query-based vs. user-focused « equal coverage of all major topics « based on a question “what are the causes of the war?” « users interested in chemistry • Form: extract vs. abstract « fragments from the document « newly re-written text
  • 12. Extraction vs. abstraction How should a text summarization system proceed? • read the documents • understand them – build a semantic representation • generate a summary from this representation
  • 13. Extraction vs. abstraction • unfortunately, a rich semantic representation is not possible yet • to date, most summarization systems are extractive • usually, the extraction units are sentences • low-cost solution: could work without ontologies, complex representations, etc. • extractive summaries are usually incoherent • trade-off between non-redundancy and completeness
  • 14. Extraction vs. abstraction Three sentences from related documents (Oct. 27 2008): • The Syrian foreign minister today condemned the killing of eight civilians in a US raid as an act of "criminal and terrorist aggression". (The Guardian) • Syria accused the United States on Monday of carrying out a "terrorist aggression" after a deadly raid near its border with Iraq which it said killed eight civilians. (Reuters) • Lebanese President Michel Suleiman on Monday contacted his Syrian counterpart Bashar Assad to denounce "Sunday’s American aggression" against the Syrian village of Abu Kamal near the border with Iraq, local Elnashra website reported. (Aljazeera)
  • 22. Extraction vs. abstraction • extractive summaries are not coherent – sentences pulled out of different documents each make sense on their own but sound awkward when put together • unresolved pronouns may distort the meaning • beginning with a sentence which starts with However, ... is not a good idea • there is a striking difference from human-generated texts – pronouns and connectives are in the right place, the flow of discourse makes sense • How could one use this property of natural discourse for summarization?
  • 27. Text coherence vs. text cohesion • John enjoys playing the piano. John wants to become a famous piano player. John works hard and works hard every day. Working hard is necessary to become a famous piano player. • John enjoys playing the piano. However, he woke up early yesterday. But the day before yesterday the weather was wonderful, because rain and snow started immediately and continued the whole day through. By the way, his teacher did the same. • John enjoys playing the piano and wants to become famous. He works hard and does it every day because it is necessary for his goal.
  • 28. Text coherence vs. text cohesion • Text coherence represents the overall structure of a multi-sentence text in terms of macro-level relations between clauses or sentences (Halliday & Hasan, 1996). « Rhetorical Structure Theory (Mann & Thompson, 1988) « Discourse Representation Theory (Kamp, 1981) « Discourse Lexicalized Tree Adjoining Grammar (Forbes et al., 2001) • John enjoys playing the piano. [John wants to become a famous piano player.] (that’s why) [John works hard and works hard every day.] Working hard is necessary to become a famous piano player.
  • 29. Text coherence vs. text cohesion • Text cohesion involves relations between words, word senses, or referring expressions, which determine how tightly connected the text is (Halliday & Hasan, 1996). « anaphora, ellipsis, connectives « synonymy and other lexical relations • John enjoys playing the piano. However, he woke up early yesterday. But the day before yesterday the weather was wonderful, because rain and snow started immediately and continued the whole day through. By the way, his teacher did the same.
  • 30. Coherence based summarization • earlier systems considered technical documents and aimed at identifying important information by assigning weights to sentences (Luhn, 1958; Edmundson, 1969) • several weighted features were used: « word (stem) frequency « presence of cue words (e.g., as a result, significant) which signal important content « sentence position « document structure • feature weights were tuned manually
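The weighted-feature scoring described on this slide can be sketched roughly as follows. This is a minimal illustration rather than the method of Luhn (1958) or Edmundson (1969): the stopword list, the cue words, the equal weights, and the function names are all assumptions, raw words stand in for stems, and the document-structure feature is left out.

```python
import re
from collections import Counter

# Illustrative assumptions, not values taken from Luhn (1958) or Edmundson (1969).
STOPWORDS = {"the", "a", "an", "of", "in", "is", "are", "and", "to", "as"}
CUE_WORDS = {"result", "significant"}
WEIGHTS = {"frequency": 1.0, "cue": 1.0, "position": 1.0}  # stand-in for manual tuning


def tokenize(sentence):
    """Lowercase a sentence and split it into alphabetic tokens (no stemming)."""
    return re.findall(r"[a-z]+", sentence.lower())


def score_sentences(sentences):
    """Score each sentence as a weighted sum of three simple features."""
    tokens = [tokenize(s) for s in sentences]
    freq = Counter(w for sent in tokens for w in sent if w not in STOPWORDS)
    top = max(freq.values(), default=1)
    last = len(sentences) - 1
    scores = []
    for i, sent in enumerate(tokens):
        content = [w for w in sent if w not in STOPWORDS]
        # Average document frequency of the content words, scaled to [0, 1].
        f_freq = sum(freq[w] for w in content) / (top * len(content)) if content else 0.0
        # 1 if the sentence contains at least one cue word.
        f_cue = 1.0 if CUE_WORDS.intersection(sent) else 0.0
        # 1 for the first and the last sentence of the document.
        f_pos = 1.0 if i in (0, last) else 0.0
        scores.append(
            WEIGHTS["frequency"] * f_freq
            + WEIGHTS["cue"] * f_cue
            + WEIGHTS["position"] * f_pos
        )
    return scores


def summarize(sentences, k=2):
    """Return the k best-scoring sentences in their original order."""
    scores = score_sentences(sentences)
    best = sorted(range(len(sentences)), key=lambda i: scores[i], reverse=True)[:k]
    return [sentences[i] for i in sorted(best)]
```

Calling `summarize(sentences, k=2)` on a list of sentence strings returns an extract of two sentences; changing the entries of `WEIGHTS` by hand mimics the manual tuning mentioned on the slide.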
  • 31. Coherence based summarization • Rhetorical Structure Theory (Mann & Thompson, 1988) • elaboration • example • contrast • background • motivation • etc. • [figure: RST tree with the relations Circumstance and Attribution] "I am optimistic" said Mr. Smith as the market plunged. (from Sporleder & Lapata, 2005)
  • 32. Coherence based summarization • one could use discourse structure for summarization (Marcu, 2000) • however, this is not done often: • there are few discourse parsers and they are not very precise • it is debated whether a tree representation is sufficient for discourse (Wolf & Gibson, 2005) • it is not obvious how to classify rhetorical relations • some relations are argued to be anaphoric rather than discourse relations (Webber et al., 2003) Text Summarization – 25.02.2009 – p. 15
  • 33. Cohesion based summarization • it is common to represent a text as a graph, where nodes are sentences and edges are some relations between them (e.g., discourse relations or just similarity) • a common graph connectivity assumption is that the nodes which are connected to many other nodes are likely to carry salient information • it is also assumed that nodes whose removal affects the structure of the document are important (Skorochodko, 1972 from Mani, 2001) Text Summarization – 25.02.2009 – p. 16
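The two graph assumptions above (well-connected nodes are salient; nodes whose removal changes the structure are important) can be illustrated with a small sketch. The edge criterion used here, at least `min_shared` words in common, is an assumption made for the example; any relation between sentences could define the edges.

```python
import re


def words(sentence):
    """Return the set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def build_graph(sentences, min_shared=1):
    """Connect two sentences if they share at least `min_shared` words."""
    bags = [words(s) for s in sentences]
    adj = {i: set() for i in range(len(sentences))}
    for i in range(len(sentences)):
        for j in range(i + 1, len(sentences)):
            if len(bags[i] & bags[j]) >= min_shared:
                adj[i].add(j)
                adj[j].add(i)
    return adj


def degree_ranking(adj):
    """Nodes sorted by number of neighbours, most connected first."""
    return sorted(adj, key=lambda i: (-len(adj[i]), i))


def count_components(adj, removed=None):
    """Count connected components, optionally ignoring one node."""
    seen = set() if removed is None else {removed}
    components = 0
    for start in adj:
        if start in seen:
            continue
        components += 1
        stack = [start]
        seen.add(start)
        while stack:
            node = stack.pop()
            for neighbour in adj[node]:
                if neighbour not in seen:
                    seen.add(neighbour)
                    stack.append(neighbour)
    return components


def cut_nodes(adj):
    """Nodes whose removal splits the graph into more components."""
    base = count_components(adj)
    return [i for i in adj if count_components(adj, removed=i) > base]
```

For a chain of three sentences where only neighbouring sentences share a word, the middle sentence is both the most connected node and the only cut node.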
  • 37. Cohesion based summarization • modern approaches extend this idea and use PageRank (Brin & Page, 1998) to find salient nodes in such a graph (Erkan & Radev, 2004; Mihalcea & Tarau, 2004) • similar sentences are connected (bag-of-words similarity) • a similarity threshold is used • the N top-ranked sentences are extracted Text Summarization – 25.02.2009 – p. 17
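The pipeline on this slide (bag-of-words similarity, a threshold, PageRank, top N) can be sketched as below. This is a simplified illustration in the spirit of LexRank and TextRank, not a reimplementation of either: raw term counts are used instead of TF-IDF, edges are unweighted, and the threshold, damping factor and iteration count are assumed values.

```python
import math
import re
from collections import Counter


def bag(sentence):
    """Bag-of-words representation: token -> count."""
    return Counter(re.findall(r"[a-z]+", sentence.lower()))


def cosine(a, b):
    """Cosine similarity between two bags of words."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def pagerank(adj, damping=0.85, iterations=50):
    """Power iteration over an unweighted graph given as adjacency lists."""
    n = len(adj)
    if n == 0:
        return []
    rank = [1.0 / n] * n
    for _ in range(iterations):
        new = [(1.0 - damping) / n] * n
        for i, neighbours in enumerate(adj):
            if neighbours:
                share = damping * rank[i] / len(neighbours)
                for j in neighbours:
                    new[j] += share
            else:
                # A node without edges spreads its rank evenly.
                for j in range(n):
                    new[j] += damping * rank[i] / n
        rank = new
    return rank


def summarize(sentences, threshold=0.1, top_n=2):
    """Extract the top_n ranked sentences, returned in document order."""
    bags = [bag(s) for s in sentences]
    n = len(sentences)
    adj = [
        [j for j in range(n) if j != i and cosine(bags[i], bags[j]) > threshold]
        for i in range(n)
    ]
    rank = pagerank(adj)
    best = sorted(range(n), key=lambda i: -rank[i])[:top_n]
    return [sentences[i] for i in sorted(best)]
```

A sentence that shares no words with the rest of the document ends up with no edges and therefore the lowest rank, so it is the first to be dropped.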
  • 38. Coherence vs. cohesion based TS • Coherence: + transparent; coherence of the output can be improved – annotation of relations is still a challenge; preprocessing difficulties • Cohesion: + intuitively appealing; low-cost; even unsupervised – requires WSD*, anaphora resolution; hard to pin down; tuned thresholds * word sense disambiguation Text Summarization – 25.02.2009 – p. 18
  • 39. DUC competitions • Document Understanding Conferences (2000-2007) • since 2008: Text Analysis Conference (TAC) • provide participants with - a task - data - manual and automatic evaluation • increasing challenge in tasks: from generic single-document summarization to multi-document update summarization (2008) Text Summarization – 25.02.2009 – p. 19
  • 40. DUC competitions Sample topic: D0740I round-the-world balloon flight Report on the planning, attempts and first successful balloon circumnavigation of the earth by Bertrand Piccard and his crew. Text Summarization – 25.02.2009 – p. 20
  • 41. DUC competitions <DOC> <DOCNO> APW19981112.0453 </DOCNO> <DOCTYPE> NEWS STORY </DOCTYPE> <DATE_TIME> 11/12/1998 08:21:00 </DATE_TIME> <HEADER> w1942 &Cx1f; wstm- r i &Cx13; &Cx11; BC-Switzerland-BalloonQu 11-12 0355 </HEADER> <BODY> <SLUG> BC-Switzerland-Balloon Quest </SLUG> <HEADLINE> Swiss challenger prepares third attempt at global record </HEADLINE> &UR; AP Photos GEV 101-102 &QL; <TEXT> GENEVA (AP) _ Swiss balloon pilot Bertrand Piccard and his new teammate, British flight engineer Tony Brown, said Thursday they will be ready later this month for a new attempt to fly nonstop round the world. Their new Breitling Orbiter 3 balloon will take off from Chateau d’Oex, in the Swiss Alps, as soon after Nov. 25 as weather conditions are favorable, they said. It will be Piccard’s third attempt to become the first to pilot a balloon around the world. In February the Swiss pilot, along with British flight engineer Andy Elson and Text Summarization – 25.02.2009 – p. 20
  • 42. The EML NLP group at DUC 2007 Text Summarization – 25.02.2009 – p. 21
  • 43. Preprocessing: Annotation • Sentence splitting • Tokenization • PoS tagging • Chunking • Named Entities recognition Text Summarization – 25.02.2009 – p. 22
  • 45. Preprocessing: Problems • Sentence splitting <sentence>At Pine Ridge, a scrolling marquee at Big Bat’s Texaco expressed both joy over Clinton’s visit and wariness of all the official attention: “Welcome President Clinton.</sentence> <sentence>Remember our treaties,” the sign read. • and cleaning <sentence>PINE RIDGE, S.D.</sentence> <sentence>(AP) - President Clinton turned the attention of his national poverty tour today to arguably the poorest, most forgotten U.S. citizens of them all: American Indians.</sentence> Text Summarization – 25.02.2009 – p. 23
  • 46. Preprocessing: Document filtering • Match topic with document extracts • Pick the top 5 matching documents Text Summarization – 25.02.2009 – p. 24
  • 47. Semantic analysis • Filter topic • Connect topic words with words in document sentences • Compute sentence scores matching words matching word sequences « ranked list of sentences Text Summarization – 25.02.2009 – p. 25
  • 48. Extractive summary generation • Rerank sentences • Select the top non-redundant sentences (250 word limit) • Re-arrange sentences Text Summarization – 25.02.2009 – p. 26
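The selection step above can be sketched as a greedy pass over the ranked list. This is an illustrative sketch rather than the EML system: Jaccard word overlap as the redundancy test and the `max_overlap` value are assumptions; only the 250-word limit comes from the slide.

```python
import re


def words(sentence):
    """Return the set of lowercase alphabetic tokens in a sentence."""
    return set(re.findall(r"[a-z]+", sentence.lower()))


def jaccard(a, b):
    """Jaccard overlap between two word sets."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def select_sentences(ranked, word_limit=250, max_overlap=0.5):
    """Greedily keep ranked sentences that fit the budget and are not redundant."""
    chosen = []
    used = 0
    for sentence in ranked:
        length = len(sentence.split())
        if used + length > word_limit:
            continue  # would exceed the word budget
        if any(jaccard(words(sentence), words(c)) > max_overlap for c in chosen):
            continue  # too similar to a sentence already selected
        chosen.append(sentence)
        used += length
    return chosen
```

Re-arranging the selected sentences (the last step on the slide) is not covered here; sorting them by their position in the source documents would be one simple option.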
  • 49. A good summary Round-the-world balloon flight: Report on the planning, attempts and first successful balloon circumnavigation of the earth by Bertrand Piccard and his crew. Swiss balloon pilot Bertrand Piccard announced Wednesday that he has chosen Brian Jones as his teammate for his next attempt at circling the world in a balloon. Jones, 52, replaces fellow British flight engineer Tony Brown. Achieving what promoters called the last great milestone of aviation, Bertrand Piccard and Brian Jones joined legends like the Wright Brothers and Charles Lindbergh with Saturday’s completion of the first manned round-the-world balloon flight. At 4:54 a.m. EST Saturday, the two balloonists crossed the line of longitude from which they had departed on March 1 at Chateau D’Oex, Switzerland, ... Text Summarization – 25.02.2009 – p. 27
  • 50. A bad summary Angelina Jolie: What have been the most recent significant events in the life and career of actress Angelina Jolie? Angelina Jolie’s win for best supporting actress for her role in “Girl, Interrupted” came 21 years after father Jon Voight was awarded best actor for “Coming Home.“ ANGELINA JOLIE’S LIFE ON THE EDGE After all, her career is in overdrive. But Jolie cautions that she’s still a serious actress. It’s not like I’m suddenly a better actress because I have awards or this box office clout,” she says. “I am secure in the fact that I do have something to offer as an actress,”Jolie says. ‘... Text Summarization – 25.02.2009 – p. 28
  • 51. Evaluation • automatic evaluation with ROUGE (Lin, 2004) • manual evaluation with respect to « responsiveness « linguistic quality 1. grammaticality 2. non-redundancy 3. referential clarity 4. focus 5. structure and coherence • our system scored above average and ranked in the top 5 for non-redundancy and coherence (recall the document filtering stage) Text Summarization – 25.02.2009 – p. 29
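The idea behind the automatic evaluation can be illustrated with ROUGE-N recall: the fraction of reference n-grams that also occur in the candidate summary, with counts clipped. The sketch below is a simplified stand-in for the official ROUGE toolkit; the regex tokenization is an assumption, and stemming, stopword removal and the other ROUGE variants are omitted.

```python
import re
from collections import Counter


def ngrams(text, n):
    """Count the n-grams (as token tuples) in a text."""
    tokens = re.findall(r"\w+", text.lower())
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def rouge_n_recall(candidate, references, n=2):
    """Clipped n-gram matches divided by total n-grams in the references."""
    cand = ngrams(candidate, n)
    matched = 0
    total = 0
    for reference in references:
        ref_counts = ngrams(reference, n)
        total += sum(ref_counts.values())
        # Each reference n-gram can be matched at most as often as it
        # occurs in the candidate.
        matched += sum(min(count, cand[gram]) for gram, count in ref_counts.items())
    return matched / total if total else 0.0
```

For the candidate "the balloon circled the earth" and the single reference "the balloon flew around the earth", four of the six reference unigrams and two of the five reference bigrams are matched.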
  • 55. Research directions • as in information retrieval, query expansion is expected to improve recall « WordNet (Fellbaum, 1998) for similarity « Wikipedia for relatedness (Strube & Ponzetto, 2006) « paraphrases • coreference resolution is needed in preprocessing; otherwise pronouns, for example, are filtered out as stopwords • relevance vs. redundancy issue: in multi-document summarization (MDS), how can we ensure non-redundancy of the summary? (Carbonell & Goldstein, 1998) • sentence ordering for extractive MDS (Barzilay & Lapata, 2005) Text Summarization – 25.02.2009 – p. 30
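One answer to the relevance vs. redundancy question is Maximal Marginal Relevance (MMR), the method of the cited Carbonell & Goldstein paper: each step picks the candidate that is most relevant to the query and least similar to what has already been selected. The sketch below leaves both similarity functions to the caller; the function names and the default trade-off `lam` are assumptions made for the example.

```python
def mmr_select(candidates, query_sim, pair_sim, k, lam=0.7):
    """Select k items by Maximal Marginal Relevance.

    query_sim(c)   -> relevance of candidate c to the query or topic
    pair_sim(a, b) -> similarity between two candidates
    lam            -> 1.0 means relevance only, lower values penalize redundancy
    """
    selected = []
    remaining = list(candidates)
    while remaining and len(selected) < k:

        def mmr(c):
            # Redundancy is the similarity to the closest selected item.
            redundancy = max((pair_sim(c, s) for s in selected), default=0.0)
            return lam * query_sim(c) - (1.0 - lam) * redundancy

        best = max(remaining, key=mmr)
        selected.append(best)
        remaining.remove(best)
    return selected
```

With `lam=1.0` the selection reduces to plain relevance ranking; lowering `lam` makes a near-duplicate of an already selected sentence lose out to a less relevant but novel one.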
  • 56. Directions of research • abstractive summarization is a distant goal but there are ways to go beyond sentence extraction « sentence compression « sentence fusion Text Summarization – 25.02.2009 – p. 31
  • 59. Sentence compression This is true, regardless of the opinion that some people have of Syria, and of their unhappiness at Syria’s presence in Lebanon. • summarization on the sentence level • in principle, a compression can differ from the input in both wording and structure • to date, most systems use word deletion only • a compression corpus is now available online http://homepages.inf.ed.ac.uk/s0460084/data • the performance can be evaluated automatically Text Summarization – 25.02.2009 – p. 32
  • 61. Sentence fusion 1 John Smith, born November 15 1900, studied chemistry and physics at the University of London. 2 From 1917 Mr. Smith studied at the University of London and in 1921 he graduated with distinction. « Mr. Smith studied chemistry and physics at the University of London from 1917. • pieces of related sentences are used to generate a novel sentence • can be seen as a middle ground between extractive and abstractive summarization • addresses the incompleteness-redundancy problem Text Summarization – 25.02.2009 – p. 33
  • 62. Thank you! (FOR YOUR ATTENTION) Text Summarization – 25.02.2009 – p. 34
  • 63. References • R. Barzilay & M. Lapata, 2005: Modeling local coherence: An entity-based approach • S. Brin & L. Page, 1998: The anatomy of a large-scale hypertextual web search engine • J. G. Carbonell & J. Goldstein, 1998: The use of MMR, diversity-based reranking for reordering documents and producing summaries • H. P. Edmundson, 1969: New methods in automatic extracting • G. Erkan & D. Radev, 2004: LexRank: Graph-based lexical centrality as salience in text summarization • C. Fellbaum, 1998: WordNet: An electronic lexical database Text Summarization – 25.02.2009 – p. 35
  • 64. References • K. Forbes, E. Miltsakaki, R. Prasad, A. Sarkar, A. Joshi, B. L. Webber, 2001: DLTAG system – discourse parsing with a Lexicalized Tree Adjoining Grammar • M. Halliday & R. Hasan, 1996: Cohesion in text • E. H. Hovy, 2003: Text summarization • H. Kamp, 1981: A theory of truth and semantic representation • C.-Y. Lin, 2004: Automatic evaluation of summaries using N-gram co-occurrence statistics • H. P. Luhn, 1958: The automatic creation of literature abstracts • I. Mani, 2001: Automatic summarization Text Summarization – 25.02.2009 – p. 36
  • 65. References • W. C. Mann & S. A. Thompson, 1988: Rhetorical structure theory. Towards a functional theory of text organization • D. Marcu, 2000: The theory and practice of discourse parsing and summarization • R. Mihalcea & P. Tarau, 2004: TextRank: Bringing order into text • E. Skorochodko, 1972: Adaptive method of automatic abstracting and indexing • C. Sporleder & M. Lapata, 2005: Discourse chunking and its application to sentence compression • M. Strube & S. P. Ponzetto, 2006: WikiRelate! Computing semantic relatedness using Wikipedia Text Summarization – 25.02.2009 – p. 37
  • 66. References • B. L. Webber, M. Stone, A. Joshi, A. Knott, 2003: Anaphora and discourse structure • F. Wolf & E. Gibson, 2005: Representing discourse coherence: A corpus-based study Text Summarization – 25.02.2009 – p. 38