In the last couple of years, we have seen enormous breakthroughs in automated Open Domain Restricted Context Question Answering, also known as Reading Comprehension, where the task is to find the answer to a question within a single document or paragraph. A potentially more useful task is to find the answer to a question in a corpus representing an entire body of knowledge, also known as Open Domain Open Context Question Answering.
To do this, we adapted the BERTserini architecture (Yang et al., 2019), using it to answer questions about clinical content from our corpus of 5,000+ medical textbooks. The BERTserini pipeline consists of two components: a BERT model fine-tuned for Question Answering, and an Anserini (Yang, Fang, and Lin, 2017) IR pipeline for Passage Retrieval. Anserini, in turn, consists of pluggable components for different kinds of query expansion and result reranking. Given a question, Anserini retrieves candidate passages, from which the BERT model extracts candidate answers. The best answer is determined using a combination of passage retrieval and answer span scores.
We evaluated this system using a locally developed dataset of medical passages, questions, and answers. We adapted the BERT Question Answering component to our content using a combination of fine-tuning on third-party SQuAD data and additional pre-training on our medical content. However, when we replaced the gold passages with passages retrieved by the Anserini pipeline, performance dropped significantly, indicating that the relevance of the retrieved passages was a limiting factor.
The presentation will describe the actions taken to improve the relevance of passages returned by the Anserini pipeline.
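For concreteness, a minimal sketch of the score combination the abstract refers to. The linear interpolation and the weight mu follow the BERTserini paper; the function names and the default value of mu are ours.

# Minimal sketch of BERTserini-style answer scoring: the retriever score and
# the reader span score are combined by linear interpolation. The weight mu
# is tuned on a development set in the BERTserini paper; 0.5 is a placeholder.
def combined_score(retrieval_score: float, span_score: float, mu: float = 0.5) -> float:
    return (1.0 - mu) * retrieval_score + mu * span_score

def best_answer(candidates, mu=0.5):
    # candidates: list of (answer_text, retrieval_score, span_score) triples
    return max(candidates, key=lambda c: combined_score(c[1], c[2], mu))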
Question Answering as Search - the Anserini Pipeline and Other Stories
1. August 13, 2020
Sujit Pal, Elsevier Labs
Question Answering
as Search
The Anserini Pipeline and other stories
THE SEARCH RELEVANCE CONFERENCE
2. About Me
• Work at Elsevier Labs
• (Mostly self-taught) data scientist
• Ex-search guy, Lucene and Solr mainly
• Some NLP, traditional ML and Deep Learning, some Computer Vision
• Started looking at Question Answering in 2019
• Specifically the BERTserini project from Jimmy Lin’s lab.
3. Agenda
• Types of QA systems
• BERTserini Pipeline
• Experiments and Results
4. Types of QA Systems
We will just cover the subset where the objective, given a question, is to get answer spans from passages in a text corpus.
5. Types of QA systems
• Traditional QA pipeline
• 2-stage Retriever-Reader systems
• Dense Retriever and Reader
• Language model based
• Jurafsky and Martin, IBM Watson, YodaQA
• Choose keywords from question
• Predict Question type (who, what, when, …)
• Rank passages by answer type and question keywords
• Extract answer based on pattern matching and question type
6. Types of QA systems
• Traditional QA pipeline
• 2-stage Retriever-Reader systems
• Dense Retriever and Reader
• Language model based
• DrQA (2017), BERTserini (2019)
• Retriever is unsupervised
• Reader is supervised Reading Comprehension model
Reading Wikipedia to Answer Open-Domain Questions (Chen et al., 2017)
End-to-End Open-Domain Question Answering with BERTserini (Yang et al., 2019)
7. Types of QA systems
• Traditional QA pipeline
• 2-stage Retriever-Reader systems
• Dense Retriever and Reader
• Language model based
• ORQA (2019), REALM (2020)
• Train retriever and reader end-to-end using question answer pairs.
• Answer ranked by vector similarity between learned embeddings (question and answer).
Latent Retrieval for Weakly Supervised Open Domain Question Answering (Lee et al., 2019)
Retrieval-Augmented Language Model Pre-Training (Guu et al., 2020)
8. Types of QA systems
• Traditional QA pipeline
• 2-stage Retriever-Reader systems
• Dense Retriever and Reader
• Language model based
• GPT-2, GPT-3, T5 (2019 - 2020)
• Fine-tuned Language Model
• No corpus; the LM stores world knowledge implicitly
Exploring the Limits of Transfer Learning with a Unified Text-to-Text Transformer (Raffel et al., 2019)
13. BERT Reader changes
• Replaced the BERT-base model fine-tuned on SQuAD 1.1 with a SciBERT model fine-tuned on SQuAD 1.1 (a fine-tuning sketch follows this slide).
• Also tried…
− Fine-tuning other BERT models: BERT-large, BioBERT.
− Fine-tuning using the SQuAD 2.0 dataset.
− Additional pre-training of the model using ClinicalKey content.
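As an illustration of the fine-tuning step above, a minimal sketch using the Hugging Face transformers and datasets libraries. The checkpoint name, hyperparameters, and the simplified preprocessing (no sliding window over long contexts) are assumptions, not the exact Elsevier Labs configuration.

from datasets import load_dataset
from transformers import (AutoModelForQuestionAnswering, AutoTokenizer,
                          Trainer, TrainingArguments)

model_name = "allenai/scibert_scivocab_uncased"  # SciBERT checkpoint (assumption)
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForQuestionAnswering.from_pretrained(model_name)
squad = load_dataset("squad")  # SQuAD 1.1

def preprocess(examples):
    # Tokenize (question, context) pairs; long contexts are simply truncated
    # here, so answers falling outside the window default to the [CLS] token.
    enc = tokenizer(examples["question"], examples["context"],
                    truncation="only_second", max_length=384,
                    padding="max_length", return_offsets_mapping=True)
    starts, ends = [], []
    for i, answer in enumerate(examples["answers"]):
        a_start = answer["answer_start"][0]
        a_end = a_start + len(answer["text"][0])
        seq_ids = enc.sequence_ids(i)
        tok_start = tok_end = 0  # default: [CLS]
        for t, (o_start, o_end) in enumerate(enc["offset_mapping"][i]):
            if seq_ids[t] != 1:
                continue  # skip question and special tokens
            if o_start <= a_start < o_end:
                tok_start = t
            if o_start < a_end <= o_end:
                tok_end = t
        starts.append(tok_start)
        ends.append(tok_end)
    enc["start_positions"] = starts
    enc["end_positions"] = ends
    enc.pop("offset_mapping")
    return enc

train = squad["train"].map(preprocess, batched=True,
                           remove_columns=squad["train"].column_names)
args = TrainingArguments("scibert-squad", num_train_epochs=2,
                         learning_rate=3e-5, per_device_train_batch_size=8)
Trainer(model=model, args=args, train_dataset=train).train()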
14. Anserini Retriever changes
• Replaced the Lucene index with a Solr-based index.
• Moved batch-oriented Anserini functionality to a Solr plugin for interactive use.
− Open source, available on GitHub: https://github.com/elsevierlabs-os/anserini-solr-plugin
− Code could be cleaner; it was developed as part of a POC.
15. anserini-solr-plugin
• Input: HTTP GET request specifying query, sim, qtype, and rtype (an example request is sketched below).
• Similarity (sim): query likelihood (ql) or BM25 (bm25, default).
• Query Rewriting (qtype): Bag of Words (bow), Sequential Dependency Model (sdm); added edismax and raw.
• Result Reranking (rtype): RM3 (rm3), Axiomatic (ax), Identity (no reranking); added external (delegate to an external rerank service).
• Output: HTTP Response.
(Slide shows the rewritten query and the reranking query from an example response.)
https://github.com/elsevierlabs-os/anserini-solr-plugin
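A hypothetical example of calling the plugin from Python; the handler path and core name are assumptions and should be checked against the plugin's README.

import requests

# Hypothetical request to the anserini-solr-plugin; the handler path
# "/anserini" and the core name "clinicalkey" are assumptions.
resp = requests.get(
    "http://localhost:8983/solr/clinicalkey/anserini",
    params={
        "query": "what causes black hairy tongue",
        "sim": "bm25",   # or "ql" for query likelihood
        "qtype": "bow",  # or "sdm", "edismax", "raw"
        "rtype": "rm3",  # or "ax", "identity", "external"
    },
)
print(resp.json())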
17. Initial Setup
• Index paragraphs from ClinicalKey books
• Use BM25 + BoW + RM3
• Scoring:
• Use k=1, look at the top answer only
• Scoring metrics: EM (exact match) and F1 between gold and predicted answers (sketched below)
(Paper says paragraphs and these settings work best.)
(We hope to use the top answer for display without further post-processing.)
(SQuAD metrics.)
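A minimal sketch of these SQuAD-style metrics, following the standard SQuAD evaluation script (lowercase, strip punctuation and articles, token-level overlap), simplified to a single gold answer:

import re
import string
from collections import Counter

def normalize(s):
    # Lowercase, drop punctuation and articles, collapse whitespace.
    s = "".join(ch for ch in s.lower() if ch not in string.punctuation)
    s = re.sub(r"\b(a|an|the)\b", " ", s)
    return " ".join(s.split())

def exact_match(pred, gold):
    return float(normalize(pred) == normalize(gold))

def f1(pred, gold):
    pred_toks, gold_toks = normalize(pred).split(), normalize(gold).split()
    overlap = sum((Counter(pred_toks) & Counter(gold_toks)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(pred_toks), overlap / len(gold_toks)
    return 2 * precision * recall / (precision + recall)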
18. How well does BERTserini work on our data?
• 100 questions from nursing text, classified as “Remembering” in Bloom’s taxonomy.
• Run these questions against the pipeline and manually inspect each answer.
• ~60 “reasonable” answers.
− Answer span correct, but…
− Passage answers question
What causes a condition known as black hairy tongue?

Hairy tongue is a condition in which the patient has an increased accumulation of keratin on the filiform papillae that results in a white, “hairy” appearance. This may be the result of either an increase in keratin production or a decrease in normal desquamation. Unless otherwise pigmented, the elongated filiform papillae are white (Fig. 1.58). In the condition known as black hairy tongue, the papillae are a brown-to-black color because of chromogenic bacteria (Fig. 1.59). Tobacco and certain foods may also discolor the papillae. Although the cause is unknown, hydrogen peroxide, bismuth subsalicylates for upset stomach, alcohol, or chemical rinses have been suggested to stimulate the elongation of the filiform papillae that results in the appearance of hairy tongue.

Oral Pathology for the Dental Hygienist: Introduction to Preliminary Diagnosis of Oral Lesions (PII: B9780323400626000013, ISBN: 978-0-323-40062-6)
19. Some more good results
What is a cause of tooth mobility?

Periodontal probing is used to assess attachment levels to the tooth and is a prime indicator of health. Radiographic bone loss around a tooth does not indicate the presence of a disease state but is a reflection of past or present periodontal disease. Occlusal trauma may cause an increase in tooth mobility but does not cause marginal bone loss in the absence of periodontal disease.

Contemporary Implant Dentistry: An Implant Is Not a Tooth: A Comparison of Periodontal Indices (PII: B9780323043731500484, ISBN: 978-0-323-04373-1)
What is the cause of stridor?

Stridor is a term used to describe a high-pitched sound caused by partial obstruction of the airway. Stridor can have an inspiratory, expiratory, or biphasic pattern (both inspiratory and expiratory). An inspiratory pattern suggests an upper airway cause (e.g., epiglottitis). An expiratory pattern suggests a lower airway etiology (e.g., tracheomalacia). A biphasic pattern suggests a glottic or subglottic obstruction (e.g., subglottic hemangioma). Imaging evaluation of the child with stridor is commonly performed with neck and/or chest radiographs, depending on the pattern of stridor and associated clinical findings.

Emergency Radiology: The Requisites: Imaging Evaluation of Common Pediatric Emergencies (PII: B9780323376402000066, ISBN: 978-0-323-37640-2)
20. As well as some fails
What special considerations must be observed when a patient has epiglottitis?

What special considerations related to her transplant need to be in place for this patient during critical care resuscitation?

Advanced Critical Care Nursing: Bone Marrow Transplantation (PII: B9781416032199100397, ISBN: 978-1-4160-3219-9)

(Meaningless answer.)

What conditions are treated by methotrexate?

The combination of PUVA and methotrexate successfully treated five patients with erythrodermic psoriasis and two with pustular psoriasis. According to the authors, annual methotrexate doses could be reduced by 50% by adding PUVA to the regimen.

Treatment of Skin Disease: Comprehensive Therapeutic Strategies: Psoriasis (PII: B978070206912300210X, ISBN: 978-0-7020-6912-3)

(Surely a better answer exists?)
21. Reader Experiments
• Results of evaluating various Reader configurations (no Retriever) against the SQuAD dataset.
• Encouraging results for the reading comprehension task, i.e., when the appropriate passage is provided.
Parameters EM F1
BERT-base uncased + SQuAD 1.1 75.86 82.41
BERT-base uncased + SQuAD 2.0 74.03 77.30
SciBERT + SQuAD 1.1 79.10 87.26
Human (SQuAD v2) 86.83 89.45
22. MedSQuAD dataset
• SQuAD contains (question, passage, answer) triples.
− Task is Reading Comprehension, i.e., find the most appropriate span in the passage to return as an answer to the question.
• Nursing content = (question, answer) pairs.
• MedSQuAD dataset (format sketched below)
− Good answers from nursing questions + top passages; select the best passage manually
− Passages in ClinicalKey + automatic question generation; select triples manually
− Approximately 300 (question, passage, answer) triples
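For illustration, one such triple laid out in the standard SQuAD 1.1 JSON format; the values are invented, and whether MedSQuAD uses exactly this layout is an assumption.

# One (question, passage, answer) triple in SQuAD 1.1 JSON layout.
# The values are invented for illustration.
triple = {
    "title": "Example nursing topic",
    "paragraphs": [{
        "context": "Stridor is a high-pitched sound caused by partial "
                   "obstruction of the airway.",
        "qas": [{
            "id": "medsquad-0001",
            "question": "What is the cause of stridor?",
            "answers": [{
                "text": "partial obstruction of the airway",
                "answer_start": 42,  # character offset within context
            }],
        }],
    }],
}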
23. Retriever Experiments
• Adding the default retriever backend (the loop is sketched below)
− Parse the question into an appropriate query (BM25 + BoW worked best)
− Rerank (RM3 worked best) and return the top 50 result passages
− Reader generates an answer from each of the top 50 passages
− Return the top (k=1) answer by combined segment and span score
• Scores drop by 40+ points!
(Reader not getting the “right” passages?)
Parameters EM F1
Baseline (no retriever) 65.11 76.03
Anserini retriever (BM25+BoW+RM3, 50 results, k=1) 23.02 30.50
(Passage reranking? (Nogueira and Cho, 2020))
Passage Re-ranking with BERT (Nogueira and Cho, 2020)
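A minimal sketch of the retrieve-then-read loop described above; retrieve_passages and read_answer are hypothetical stand-ins for the Solr-backed Anserini retriever and the fine-tuned BERT reader.

# Minimal sketch of the two-stage loop; retrieve_passages and read_answer
# are hypothetical stand-ins for the retriever and reader components.
def answer(question, retrieve_passages, read_answer, k=50, mu=0.5):
    candidates = []
    for passage in retrieve_passages(question, num_results=k):
        # Reader returns the best answer span and its score for this passage.
        span, span_score = read_answer(question, passage.text)
        # Combine retriever and reader scores (cf. the interpolation earlier).
        score = (1.0 - mu) * passage.score + mu * span_score
        candidates.append((span, score))
    # k=1: return only the single best answer across all passages.
    return max(candidates, key=lambda c: c[1])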
24. BERT Based Reranker (Unsupervised)
• BERT-as-a-service (BaaS) wraps a BERT-base-uncased model and returns embeddings from its last layer [-1].
• Query embedding is produced from the query using BaaS.
• Passage embeddings are produced for the top 50 passages returned by the query, using BaaS.
• Cosine similarity is computed between the query vector and the passage vectors, and passages are reranked by similarity descending (sketch below).
Parameters EM F1
BM25+BoW+RM3, 50 records 8.27 11.45
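A minimal sketch of this reranker using the bert-serving-client API, assuming a bert-serving-server instance is already running with a BERT-base-uncased model:

import numpy as np
from bert_serving.client import BertClient

def rerank(question, passages):
    # Encode the question and all candidate passages in one batch.
    bc = BertClient()
    vecs = bc.encode([question] + passages)  # shape: (1 + n, dim)
    q, p = vecs[0], vecs[1:]
    # Cosine similarity between the question vector and each passage vector.
    sims = (p @ q) / (np.linalg.norm(p, axis=1) * np.linalg.norm(q))
    order = np.argsort(-sims)  # similarity descending
    return [(passages[i], float(sims[i])) for i in order]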
25. Query Sentence Relevance Reranker
• Model predicts relevance (0/1) between the query and a single sentence.
• Trained on the TREC Microblog dataset (120,000 query-sentence pairs).
• Classifier fine-tuned from BERT-base-uncased for 2 epochs, Adam optimizer with learning rate 2e-5, and batch size 32; F1-score: 0.86.
• Passage is scored as (# relevant sentences) / (# sentences) and reranked by score descending (sketch below).
Parameters EM F1
BM25+BoW+RM3, 100 records 13.35 19.19
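A minimal sketch of this scoring rule; is_relevant is a hypothetical stand-in for the fine-tuned query-sentence classifier, and the sentence splitting is deliberately naive:

def passage_score(question, passage, is_relevant):
    # Fraction of the passage's sentences the classifier marks as relevant.
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    if not sentences:
        return 0.0
    relevant = sum(is_relevant(question, s) for s in sentences)
    return relevant / len(sentences)

def rerank(question, passages, is_relevant):
    return sorted(passages,
                  key=lambda p: passage_score(question, p, is_relevant),
                  reverse=True)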
26. Passage Relevance Reranker
• Model predicts relevance (a number between 0 and 1) between passage and question.
• Trained using SQuAD 1.1 (passage, question) pairs with negative sampling.
• Regression model fine-tuned from BERT-base-uncased for 2 epochs, batch size 8, Adam optimizer with learning rate 2e-5; RMSE 0.3.
• Passages are ranked by relevance score descending.
Parameters EM F1
BM25+BoW+RM3, 50 records 8.99 15.69
27. Siamese BERT Reranker
• Pretrained models from https://github.com/UKPLab/sentence-transformers produce embeddings for the question and passage text.
• Passage score is the mean or maximum similarity between the question vector and all sentences in the passage.
• Passages ranked by score descending (sketch below).
Parameters EM F1
BM25+BoW+RM3, 50 records, max similarity, model: bert_base_nli_mean_tokens 16.91 24.95
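A minimal sketch using the sentence-transformers library with the bert-base-nli-mean-tokens checkpoint; the naive sentence splitting is ours, and max is used as the reduction (swap in a mean for the average variant):

import numpy as np
from sentence_transformers import SentenceTransformer

model = SentenceTransformer("bert-base-nli-mean-tokens")

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def passage_score(question, passage):
    # Max similarity between the question vector and each sentence vector.
    sentences = [s.strip() for s in passage.split(".") if s.strip()]
    q_vec = model.encode([question])[0]
    s_vecs = model.encode(sentences)
    return max(cosine(q_vec, v) for v in s_vecs)

def rerank(question, passages):
    return sorted(passages, key=lambda p: passage_score(question, p), reverse=True)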
28. Reranker Scores (all together)
Parameters EM F1
BERTserini (Paragraph, k=100) 38.6 46.1
BERTserini (Paragraph, k=29) 36.6 44.0
BERTserini (Article, k=5) 19.1 25.9
Anserini (BM25+BoW+RM3, SciBERT, MedSQuAD, Paragraph, k=1) (baseline) 23.02 30.50
RM3 replaced with BERT reranker 8.27 11.45
RM3 replaced with Query Sentence Relevance reranker 13.35 19.19
RM3 replaced with Passage Relevance reranker 8.99 15.69
RM3 replaced with Siamese BERT reranker 16.91 24.95
(First three rows are from the BERTserini paper; the remaining rows are our results.)
29. Conclusions
• We couldn’t beat Anserini with any of our Passage Rerankers.
• Our results (paragraph, k=1) are comparable to those reported in the paper.
• However, response time is unacceptable because of the slow Reader component (techniques such as model distillation might help somewhat).
• Quality of the answer snippet at k=1 is not always acceptable, and too risky for production use.
• We also want to use the selected passage and the source metadata as additional provenance for the answer.
30. Thank you
I am @sujitpal on relevancy.slack.com if you have questions.