
[DSC Europe 22] Transformers for extractive text summarization - Anton Guldinskii



In NLP there are many approaches to the text summarization task. In my presentation I will cover one of them, based on Transformers. We will look in detail at the architecture we applied and the techniques we used for labeling, preprocessing and postprocessing. I will also speak about the results of using this approach in a real production project.


  1. Transformers for extractive text summarization. Anton Guldinskii, Senior Data Scientist. © 2022 EPAM Systems, Inc.
  2. Agenda
     • Task formulation
     • Metrics
     • Solutions + some custom techniques
     • Results
  3. Task formulation
     • Develop a tool that will create an annotation for texts, available to a user inside Data Lake
     • A user can use the annotation to decide whether the text is valuable to them and request access
     • Summarization task: create a subset (a summary) that represents the most important or relevant information within the original content
  4. Extractive vs abstractive summarization
  5. Extractive vs abstractive summarization
     • Potential problems of the abstractive approach:
       • Repeated phrases
       • Hallucination problem
       • Imaginary words
       • Premature end-of-sentence
       • Coreference issues
       • Misleading rephrasing
     • Examples:
       In: She is the daughter of Alistair Crane . . . who secretly built . . .
       Out: She is the daughter of Alistair Crane. She secretly built . . .
       In: . . . Zenica (Cyrillic: “Зеница”) is . . .
       Out: . . . Zenica (Cyrillic: “gratulationеница”) is . . .
       In: I’m your employee, to serve on your company.
       Out: I’m your company, to serve on your company
       In: Tobacco smokers may also experience . . .
       Out: anthropology smokers may also experience . . .
       In: By the way, my favorite football team is Manchester United.
       Out: By the way, my favorite football team is.
       In: . . . postal service was in no way responsible . . .
       Out: . . . postal service was responsible . . .
  6. Extractive summarization
     • The task mainly consists of:
       • Construction of an intermediate representation of the input text
       • Scoring the sentences based on the representation: frequency-driven approaches, Latent Semantic Analysis, graph methods, machine learning
       • Selection of a summary comprising a number of sentences, while avoiding semantic redundancy
  7. ROUGE metrics
     • Recall-Oriented Understudy for Gisting Evaluation
     • Token-based metric
     • Reference summary: the cat was under the bed
     • Candidate summary: the cat was found under the bed
     • ROUGE recall = number_of_overlapping_words / total_words_in_reference_summary = 6 / 6 = 1
     • ROUGE precision = number_of_overlapping_words / total_words_in_candidate_summary = 6 / 7 = 0.86
     • ROUGE F1 = 0.92
  8. ROUGE metrics
     • Text: After taking too many borrows the company went bankrupt.
     • Reference summary: The business ceased to exist because of many debts.
     • ROUGE ≈ 0
     • There are other, more sophisticated metrics (BERTScore), but due to data specificity we chose ROUGE
  9. Labeling
     • We had a pretty standard dataset for this task: texts + abstractive/extractive reference summaries
     • Greedy selection algorithm: text + reference summary → indexes of the most suitable sentences in the body text
  10. Greedy selection algorithm
      1. We take the set of n-grams in the reference summary
      2. We take the set of n-grams for each sentence in the text
      3. We iteratively append to our labeled summary the sentence with the biggest ROUGE
      • Text: I came into the room. The cat was under the bed. It was hungry and meowed for food.
      • Reference summary: The cat was found under the bed. It was hungry.
      • R = {the, cat, was, under, the, bed, it, hungry}
      • c1 = {i, came, into, the, room}
      • c2 = {the, cat, was, under, the, bed}
      • c3 = {it, was, hungry, and, meowed, for, food}
      • Iteration 1: Rouge(R, c2)
      • Iteration 2: Rouge(R, c2 + c3)
      • Extractive summary: The cat was under the bed. It was hungry and meowed for food.
  11. Dataset
      • As our financial dataset was skewed (in the majority of samples only the first sentences were part of the summary), we mixed it with the Daily Mail dataset
      • This allowed us to avoid the model predicting only the first sentences
      • Final dataset = 80% financial dataset + 20% Daily Mail
  12. Text Summarization with Pretrained Encoders (PreSumm)
      [Diagram: on the left, a pretrained encoder used for sequence/token classification; on the right, a pretrained encoder followed by a randomly initialized top Transformer that outputs scores for sentences Sent1 to Sent5]
  13. PreSumm: top Transformer
      • BERTSUM extends BERT by inserting multiple [CLS] symbols to learn sentence representations and using interval segmentation embeddings (illustrated in red and green) to distinguish multiple sentences.
  14. Expanding max sequence length for the Transformer
      • By default it’s 512 tokens
      • We ended up with 1024
      [Diagram: pretrained position embedding (512) followed by a copied position embedding (512)]
  15. To rank is not enough
      • We also had to address potential semantic redundancy and limit the length of the predicted summary.
  16. Trigram blocking
      • Given a selected summary S and a candidate sentence c, we skip c if there is a trigram that occurs in both c and S.
      • S = “Emergency service initiated the resolving of consequences of yesterdays hurricane”
      • c = “Additional measures were taken to mitigate the vast consequences of yesterdays hurricane”
  17. Limiting length of final candidates
      [Plot: Ratio versus length of original text]
      • Ratio = length_of_original_text / length_of_reference_summary
      • Usually some fixed number of sentences is used
      • We used a curve fitted to our training data
  18. Experiments with different Transformer encoders
      • We tried BERT, RoBERTa, ELECTRA and XLNet
      • ELECTRA-small was chosen for productionization, outperforming even mid-sized models like BERT-base and RoBERTa-base
      • XLNet was slightly better than ELECTRA
      Source          ROUGE-1 F
      LEAD-3          55.28
      ELECTRA-small   78.56
      BERT-base       74.31
      RoBERTa-base    74.51
      XLNet           79.13
  19. Q&A: if you have questions, feel free to contact me by email
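
The trigram-blocking rule from slide 16 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function names `trigrams` and `select_with_trigram_blocking`, the `max_sentences` parameter, and the lowercase whitespace tokenization are hypothetical choices, not the production code from the talk. The third sentence in the example is invented purely to show a candidate that is not blocked.

```python
def trigrams(text):
    """Return the set of word trigrams of a sentence (lowercased, split on whitespace)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + 3]) for i in range(len(tokens) - 2)}


def select_with_trigram_blocking(ranked_sentences, max_sentences):
    """Walk sentences in rank order, skipping any that share a trigram with the summary so far."""
    selected = []
    seen = set()  # all trigrams of the sentences already selected
    for sentence in ranked_sentences:
        current = trigrams(sentence)
        if current & seen:
            continue  # at least one overlapping trigram: skip this candidate
        selected.append(sentence)
        seen |= current
        if len(selected) == max_sentences:
            break
    return selected


s = "Emergency service initiated the resolving of consequences of yesterdays hurricane"
c = "Additional measures were taken to mitigate the vast consequences of yesterdays hurricane"
extra = "Power was restored to most homes by noon"  # hypothetical extra candidate

summary = select_with_trigram_blocking([s, c, extra], max_sentences=2)
print(summary)
```

Here `c` is skipped because it shares the trigrams "consequences of yesterdays" and "of yesterdays hurricane" with `s`, so the next-ranked sentence is taken instead.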

Editor's notes

  • Today we will look at an interesting project that we did with one of our external customers.
    Let's look at the agenda. First we will look at the business case and what we had to do in terms of data science.
    Then we will switch to the metrics that we used for evaluation.
    Then we will look at our solution: the model that we used and some custom techniques and tricks that we applied specifically for our case and our data.
    Finally we will look at the results that we got.
  • For this project our customer is a major financial analytics company that has, inside its infrastructure, a service called Data Lake. Data Lake is a kind of database consisting of a collection of texts (financial reports, transcriptions of meetings, etc.), and by default all texts inside Data Lake have restricted access. So if a user inside the company wants to look at a text, they cannot do so until they have applied for access to that particular text and received permission from its owner.
    Our task was to develop a tool that automatically creates annotations for all texts in the database, so that any user can use an annotation to understand whether the text is valuable, interesting or relevant for them.
    To put it in data science terms, the task can be formulated as a text summarization task: we have to create a subset of the original text that best represents its meaning and central points.
    On the picture you see a screenshot from the real Data Lake: in green you see the title of a text, and the synopsis is a summary that was created by our solution.
  • Let's look at what the text summarization task is in a nutshell, and what approaches to it exist.
    There are mainly two approaches: extractive and abstractive.
    In extractive summarization we take a subset of sentences (or some other units) from the original text and form the extractive summary from them. Let's imagine we take sentence 3 and sentence 5 from this text; they constitute our extractive summary.
    In abstractive summarization, on the contrary, we do not use words from the original text directly but generate new text, rewriting and reformulating it.
    It may seem that abstractive summarization is more promising because it can create more coherent texts that are easier to understand. Well, yes and no.
  • Even current state-of-the-art abstractive summarization systems have problems that prevent their use in production projects.
    One is repeated phrases: sometimes these systems fall into a kind of loop and start to repeat themselves.
    Sometimes there is a hallucination problem: the abstractive model mentions a fact that was not present in the original text.
    Sometimes it is imaginary words; it's clear why that is not safe in production.
    Sometimes it is an accidental interruption of a sentence.
    Sometimes it is a coreference issue like here: the pronoun "who" refers to another word in the original sentence.
    Sometimes it is misleading rephrasing, for example when the model skips a negation and completely changes the meaning of a sentence.
  • That's why we decided to use the extractive approach. You can look at the task of extractive summarization as mainly consisting of three steps.
    The first is the construction of an intermediate representation of the text: we want to get some numerical representation of the text, an embedding, let's say.
    The second is scoring the sentences based on these representations: we want to rank our sentences from the most relevant to the least relevant. For that purpose, frequency-driven approaches, latent semantic analysis, graph methods and machine learning can be used.
    Finally, when we have our sentences ranked, we want to choose only some number of relevant sentences, and we also want to avoid the problem of semantic redundancy: we do not want our sentences to tell about the same thing.
    This is how the problem of extractive summarization looks in the most general terms.
  • A couple of words about metrics. There are plenty of metrics for summarization in the field, but we used one of the simplest, the ROUGE score. ROUGE stands for Recall-Oriented Understudy for Gisting Evaluation. Let's look at how it is calculated.
    To calculate the ROUGE score you need a ground-truth summary, called the reference summary, and a candidate summary, which is a summary generated by some system, in this case by our summarization model.
    Let's imagine we have our reference summary, which is an abstractive summary written by a human, and we have our candidate summary.
    First we calculate ROUGE recall: we take the number of overlapping words between the candidate summary and the reference summary and divide it by the total number of words in the reference summary. For this toy example it is 6 / 6 = 1, because all the words from the reference summary are present in the candidate summary.
    ROUGE precision is calculated a little differently: in the numerator we have the number of overlapping words, and we divide it by the total number of words in the candidate summary. In this case it's 6 / 7, because the word "found" is not present in the reference.
    Usually the F1 score, the harmonic mean of recall and precision, is taken as the final metric for summarization. But it's worth mentioning that ROUGE is far from a perfect metric, and there are edge cases like this.
  • The text here is a single sentence and the reference summary is again one sentence. If you look at them, they say basically the same thing, so they are almost equal in meaning, but the ROUGE between them will be almost zero because they have almost no words in common beyond the article "the". So we have to bear in mind that the ROUGE metric always has its, let's say, ceiling for each particular dataset.
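
The ROUGE-1 arithmetic for the cat example can be reproduced with a short Python sketch. It assumes lowercase whitespace tokenization and clipped word counts; the function name `rouge_1` is illustrative and this is not the scoring library used in the project.

```python
from collections import Counter


def rouge_1(reference, candidate):
    """Unigram ROUGE: recall, precision and F1 from clipped word overlap."""
    ref_counts = Counter(reference.lower().split())
    cand_counts = Counter(candidate.lower().split())
    # Counter intersection keeps the minimum count per word (clipped overlap).
    overlap = sum((ref_counts & cand_counts).values())
    recall = overlap / max(sum(ref_counts.values()), 1)
    precision = overlap / max(sum(cand_counts.values()), 1)
    f1 = 2 * precision * recall / (precision + recall) if overlap else 0.0
    return {"recall": recall, "precision": precision, "f1": f1}


scores = rouge_1(
    reference="the cat was under the bed",
    candidate="the cat was found under the bed",
)
print({name: round(value, 2) for name, value in scores.items()})
# recall 6/6 = 1.0, precision 6/7 ≈ 0.86, F1 = 12/13 ≈ 0.92
```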
  • Let's switch to our problem. The Data Lake team provided us with a dataset of texts and abstractive reference summaries written by humans. It's a pretty standard setup for a summarization dataset, because usually it is some texts plus abstractive or extractive reference summaries.
    Our task was to use these texts and abstractive reference summaries to get the indexes of the most suitable sentences in the body text. To get them we used a greedy selection algorithm.
  • Let's look at this greedy selection algorithm in detail. Imagine we have a reference summary and a text, and we want to extract the sentences that will form our extractive summary, our ground-truth summary.
    As the first step we take the set of n-grams from the reference summary; in our case these are unigrams, so it's just all the unique words from the reference summary.
    Next we take the set of n-grams for each sentence in the text, for each sentence separately.
    Then we start iterating. On each iteration we compute the ROUGE between the reference and each candidate sentence, and we pick only the candidate sentence with the largest ROUGE. On the first iteration it will be the second candidate, because the ROUGE between it and the reference is the largest. On the second iteration we exclude all the already chosen sentences and iterate over those that are left, but now the second argument is candidate 2 combined with one of the remaining sentences. In this case the third sentence gives us the higher score. This is how we formed our extractive summary, which we can then transform into zeros and ones and use for training and modeling.
  • It's worth mentioning that the data provided to us by the Data Lake team was skewed: in the majority of texts only the first sentences were part of the summary.
    We had to mix it with another dataset to partially eliminate the skewness, and we chose the open Daily Mail dataset of newspaper articles. The final dataset used in our solution consisted of 80% of our financial dataset from Data Lake and 20% of Daily Mail. This allowed us to avoid the model predicting only the first sentences, and we came up with this ratio after some experiments.
  • The title of the slide, "Text Summarization with Pretrained Encoders", is actually the name of the paper that we used in our solution.
    On the left you see how Transformers are usually used for simple NLP tasks like text classification or token classification. We have a pretrained Transformer encoder (BERT) and our input text; we put the text through the embedding layer and then through the Transformer encoder.
    As output we get contextualized embeddings. In these embeddings all the attention weights are already applied, so all tokens are connected to each other to some degree defined by semantics.
    Then we use a classification layer that predicts labels; depending on the task we want to solve, it's either one label for the text or one label for each token.
    Let's look at how it works when we use Transformers for summarization. The beginning of the pipeline is basically the same: we take the input, put it through the embedding layer, feed it into BERT, and get the contextualized embeddings for each token.
    The difference is that we then feed the output embeddings into another Transformer, which is randomly initialized and typically much smaller than the first one. From this top Transformer encoder we get contextualized embeddings for each sentence in the body text.
    Then we again apply a fully connected layer, which gives us a score for each individual sentence, and we use these scores to rank the sentences. The most suitable sentences get the highest scores. Let's look inside this setup in more detail.
  • This is how the whole architecture looks with the original BERT for text classification, as we saw on the left side of the previous slide.
    BERT by default inserts special tokens: CLS at the beginning of the whole sequence and SEP at the end of each sentence. Then for each word we get a token embedding, a segment embedding (one embedding for the whole sequence) and a position embedding. We have to feed position information into the Transformer because it doesn't process tokens sequentially like recurrent neural networks do.
    The CLS token in BERT gathers information about the whole input sequence, so its embedding can be used as a representation of the whole input.
    What is the difference when we use it for summarization?
    We place CLS tokens at the beginning of each sentence. Then we again get the token embeddings and the segment embeddings, and this time there are two types of segment embeddings: for odd sentences we have the E_A embedding and for even sentences we have E_B.
    Here each CLS output embedding is an embedding for the whole sentence following it: the first CLS output gathers information about the first sentence, the second about the second sentence, and so on.
    Then we take only the CLS embeddings and feed them into the top Transformer, where the self-attention mechanism is applied again, and as output we get contextualized embeddings of sentences that capture how each sentence is connected semantically with the others.
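
The input layout described in these notes (a CLS token before every sentence, a SEP after it, and segment ids alternating between A and B) can be sketched without any library. Whitespace splitting stands in for BERT's real WordPiece tokenizer, and the function name `build_bertsum_input` is hypothetical; the sketch only shows how the tokens, segment ids and CLS positions line up.

```python
def build_bertsum_input(sentences):
    """Lay out sentences as [CLS] sent [SEP] blocks with alternating segment ids."""
    tokens, segment_ids, cls_positions = [], [], []
    for index, sentence in enumerate(sentences):
        block = ["[CLS]"] + sentence.lower().split() + ["[SEP]"]
        cls_positions.append(len(tokens))  # where this sentence's [CLS] sits
        tokens.extend(block)
        # Segment 0 (E_A) for the 1st, 3rd, ... sentence; segment 1 (E_B) for the 2nd, 4th, ...
        segment_ids.extend([index % 2] * len(block))
    return tokens, segment_ids, cls_positions


tokens, segment_ids, cls_positions = build_bertsum_input(
    ["the cat was under the bed", "it was hungry"]
)
print(tokens)
print(segment_ids)
print(cls_positions)
```

The encoder outputs at `cls_positions` are the sentence vectors that would be passed on to the top Transformer and then to the scoring layer.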