This thesis presents the design and architecture of an Active Learning system for Question Answering on Multiparty Dialogue. The goal of the system is to collect a robust Question Answering dataset and to improve performance on Question Answering challenges over Multiparty Dialogue. The system provides an interactive web-based user interface that allows users to challenge it with their own questions about a short passage of dialogue between multiple characters in a TV series. It uses a state-of-the-art Machine Learning model to predict answers to users' questions. At the same time, the system learns from users' responses and performs online updates on the model. It uses probability functions to guide users toward contributing the data needed most for model improvement. The system is designed to handle heavy internet traffic by storing data efficiently and by carefully synchronizing shared resources in the web system. It has shown promising results in guiding users to contribute high-quality data useful for model training.
2. Content
● Question Answering
● Data Collection
● Active Learning
● User Interaction
● System Architecture
● Results & Future Work
3. Question Answering
Question Answering is a Computer Science discipline focused on building
automated systems that can answer questions posed by humans in
natural language.
5. Question Answering Data Sets
Dataset   | Text Source                   | QA Source
Quasar-T  | Search Engine (Google / Bing) | Trivia
SearchQA  | Search Engine (Google / Bing) | Jeopardy!
SQuAD     | Wikipedia Articles            | Annotation
6. Why Dialogue?
● Natural
● Machine User Interaction
● Availability
○ Transcripts
○ Texting
● Little previous work
Source: Statistics Brain
7. Question Answering in Dialogue
● TV Series Friends
● 10 Seasons
● 236 Episodes
● 3,000+ Scenes
● Datasets from Character Mining
○ JSON formatted data
○ Tokenized
○ Season - Episode - Scene - Utterance
○ Plots available for 44% of scenes
8. Classification on Question Types
● Based on type of answer:
○ Categorical (Multichoice) - Binary (Polar)
○ Continuous (Span of text)
● Based on Inference
○ Explicit
○ Implicit
● Based on answerability (newly introduced in SQuAD 2.0):
○ Unanswerable
○ Answerable
11. Explicit vs Implicit
● The contextual similarity between question and answer
● The amount of inference needed to resolve
● Q1: Explicit; Abundant Similarity; Little Inference
● Q2: Implicit; Little / No Similarity; Substantial Inference
13. Annotation
● Annotation Phases:
○ Experimental Phase - Small Data Chunk
○ Production Phase - All Data
● Tasks per phase:
○ Question & Answer Generation
○ Verification - Inter-Annotator Agreement
[Diagram: Experimental phase → revision on template → stable, high inter-annotator agreement → Production phase]
14. Challenges on Annotation
● Ambiguous Pronouns:
○ Example: In a scene featuring both Chandler and Joey: "Is he excited about the date?"
● Exact wording from the original text
● Low inter-annotator agreement measured
● Attempted Resolution:
○ Update instructions
○ Integrate Plots in Scene
○ Reduce the number of Questions
16. Results from Annotation
● Second Round:
○ Added Plot
○ Updated Instructions
● Third Round: Reduced the number of questions
● Random guess would give 50%!
● Could not obtain high-quality data
17. Change in Path
[Diagram: original path: Dialogue QA → Continuous and Binary tracks, each following Annotation → Model Dev → Analysis; revised path: Active Learning → System Dev → Online Production → Analysis]
18. Active Learning
● Active Learning is a sub-branch of Machine Learning in which the learning
system interactively queries the user to obtain the desired data.
● The goal of our system is to:
○ Collect the data the model needs most for improvement
○ Improve the model by applying these data
● What we offer:
○ Answer queries from users
○ Learn from users
● What users provide:
○ Annotations on the data
19. Baseline Model
● BERT (Bidirectional Encoder
Representations from Transformers)
from Google AI
● Contextual vs. Context-Free
○ "bank" in "bank account"
○ "bank" in "river bank"
[Diagram: Pre-trained Network → Contextual Representation → Downstream Model → Output]
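To illustrate contextual representations, here is a minimal sketch comparing the vector for "bank" in the two contexts above. It uses the Hugging Face transformers port of BERT, which is an assumption for illustration; the thesis builds on Google AI's original release.

```python
# Sketch: the same word gets different contextual vectors under BERT.
import torch
from transformers import BertModel, BertTokenizer

tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
model = BertModel.from_pretrained("bert-base-uncased")

def bank_vector(sentence):
    """Return the contextual vector of the token 'bank' in the sentence."""
    inputs = tokenizer(sentence, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]   # (seq_len, 768)
    position = inputs["input_ids"][0].tolist().index(
        tokenizer.convert_tokens_to_ids("bank"))
    return hidden[position]

v1 = bank_vector("she opened a bank account")
v2 = bank_vector("he sat on the river bank")
# A context-free embedding would give similarity 1.0 here.
print(torch.cosine_similarity(v1, v2, dim=0).item())
```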
20. Baseline Model
● Unprecedented results on SQuAD
● Power of Bidirectional Flow
○ Versus Left->Right; Right->Left
○ Allows learning a word from all
of its context
● Masked language-model training
25. User Guidance
● Which Scene the user needs to work on
○ Ensure all scenes are evenly annotated
● Which Type of question the user needs to work on
○ Type we have least data on
○ Type the model performs worst on
● User Experience: Too Monotonous?
26. User Guidance
● Scene Selection
○ Randomly select from the least annotated scenes
● Type Selection
○ Use a probability function to control random selection
27. User Guidance
● A constant c is used to linearly scale the probabilities
● It controls the degree of discrepancy between question types (see the sketch below)
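The slide's probability function is not reproduced in this transcript, so the following is a sketch of one plausible reading: each question type is weighted by the model's error rate on that type, and the constant c interpolates between uniform and error-proportional selection. The function name and weighting scheme are assumptions, not the thesis's exact formula.

```python
import random

def select_question_type(error_rates, c):
    """Pick a question type, biased toward types the model gets wrong.

    error_rates: dict mapping type name -> error rate on the dev set.
    c: assumed constant in [0, 1]; c = 0 yields uniform selection,
       c = 1 weights types purely by their error rates.
    """
    types = list(error_rates)
    uniform = 1.0 / len(types)
    total = sum(error_rates.values())
    weights = [(1 - c) * uniform + c * error_rates[t] / total for t in types]
    return random.choices(types, weights=weights, k=1)[0]

# The model is weakest on implicit questions, so with a large c they
# are requested from users most often.
print(select_question_type({"explicit": 0.10, "implicit": 0.40}, c=0.8))
```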
28. User Guidance
● Train - Train the model
● Dev - Obtain statistics for guidance
● Test - Evaluate performance
● Test statistics are never shown to the system
32. Controller - Security
● The server needs to know which question the user is changing
● A guessable dummy id creates a loophole
● It would allow a malicious user to change the responses of others
● Sessions are anonymous and unauthenticated
post-correction:
question-id: 1
question-id:
3/26-s1-e1-c1-1
33. Controller - Security
● Solution - Hashing + Salt
● Passwords should not be stored in plain text
● Salt mitigates brute-force attacks
● The hash also prevents secret disclosure:
○ Prevents users from knowing how we compute the hash
● The hash itself is returned to the user
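As a sketch of the idea (the exact inputs the system hashes are not shown on the slide, so the token scheme below is an assumption), a salted hash turns a guessable question id into an unguessable token:

```python
import hashlib
import secrets

SALT = secrets.token_hex(16)  # server-side secret, never sent to clients

def question_token(question_id: str) -> str:
    """Derive an unguessable token for a question id (assumed scheme).

    The client only ever sees the token; without the salt, a malicious
    user cannot forge tokens for other users' questions.
    """
    return hashlib.sha256((SALT + question_id).encode()).hexdigest()

print(question_token("3/26-s1-e1-c1-1"))
```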
34. Django Object-Relational Mapping (ORM)
● Mapping between Database Language and Programming Language
● SQL <-> Python
● Apply structural changes to the Database
● Query the Database in the Programming Language
● Widely used in industry; reduces errors
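For illustration, a minimal sketch of what the mapping looks like in Django; the model and field names are hypothetical, not the system's actual schema:

```python
from django.db import models

class UserResponse(models.Model):
    # Hypothetical model: each instance maps to one row in a SQL table.
    question_hash = models.CharField(max_length=64)
    answer_text = models.TextField()
    created_at = models.DateTimeField(auto_now_add=True)

# Python instead of SQL: roughly
#   SELECT * FROM userresponse WHERE question_hash = 'ab12...';
responses = UserResponse.objects.filter(question_hash="ab12...")
```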
36. Optimization on DB
● Indexing on fields that need to be queried
○ hash in User Response
○ count in Scene
● Delayed database writes:
[Diagram: Receive Request → Handle Request → Return Response → Database IO (writes deferred until after the response is returned)]
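In Django, the indexing decision is a one-line change on the model field; a sketch against the hypothetical UserResponse model above:

```python
from django.db import models

class UserResponse(models.Model):
    # db_index=True creates a database index, so lookups by hash avoid
    # a full table scan when answering user requests.
    question_hash = models.CharField(max_length=64, db_index=True)
    answer_text = models.TextField()
```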
37. Concurrency on DB
● Two users could work on the same question type / scene
● Both could increment the count at the same time
● Pessimistic Row-Level Locking
○ A lock must be acquired before any write
○ Prevents dirty writes
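Django's ORM exposes pessimistic row-level locking through select_for_update(); a sketch with a hypothetical Scene model:

```python
from django.db import models, transaction

class Scene(models.Model):
    count = models.IntegerField(default=0)  # hypothetical model

def increment_scene_count(scene_id):
    # SELECT ... FOR UPDATE holds a row-level lock until the transaction
    # commits, so two concurrent increments serialize instead of both
    # reading the same stale count (a dirty write).
    with transaction.atomic():
        scene = Scene.objects.select_for_update().get(pk=scene_id)
        scene.count += 1
        scene.save()
```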
39. BERT Service - Predict
● Workers
○ Dedicated Model
○ Dedicated Local Space for compute
● Worker Array - Size N
● Mutex Array - Size N
● Semaphore - N available
● Acquire Semaphore first
● Then acquire mutex
● Exception handling ensures no deadlock
[Diagram: N workers (W), each paired with a mutex (M), behind a counting semaphore]
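A minimal sketch of the acquire order described above; the names and the placeholder run_model are assumptions:

```python
import threading

N = 5
WORKERS = [f"worker-{i}" for i in range(N)]     # stand-ins for BERT workers
MUTEXES = [threading.Lock() for _ in range(N)]
SEMAPHORE = threading.Semaphore(N)              # counts free workers

def run_model(worker, request):
    return f"{worker} answered: {request}"      # placeholder for prediction

def predict(request):
    SEMAPHORE.acquire()        # first the semaphore: wait for a free worker
    try:
        for i, mutex in enumerate(MUTEXES):
            if mutex.acquire(blocking=False):   # then a mutex: claim a worker
                try:
                    return run_model(WORKERS[i], request)
                finally:
                    mutex.release()
    finally:
        SEMAPHORE.release()    # try/finally releases even on exceptions,
                               # so no deadlock

print(predict("Is Joey excited about the date?"))
```

Holding the semaphore guarantees at least one of the N mutexes is free, so the scan always finds a worker.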
40. BERT Service - Train
● Query DB for new responses
● Check batch size
● Train with batch
● Populate new worker array
● Change pointers
[Diagram: the BERT Service swaps the pointer from the old worker array to the newly trained worker array]
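A sketch of that update loop; the batch-size threshold and the helper functions passed in are assumptions:

```python
BATCH_SIZE = 32  # assumed threshold, not stated on the slide

class BertService:
    def __init__(self, model, n_workers, make_worker):
        self.model = model
        self.n_workers = n_workers
        self.make_worker = make_worker  # hypothetical worker factory
        self.workers = [make_worker(model) for _ in range(n_workers)]

    def training_step(self, fetch_new_responses, train):
        batch = fetch_new_responses()         # query DB for new responses
        if len(batch) < BATCH_SIZE:           # check batch size
            return
        new_model = train(self.model, batch)  # train with batch
        new_workers = [self.make_worker(new_model)
                       for _ in range(self.n_workers)]  # new worker array
        # Change pointers: in-flight requests finish on the old array
        # while new requests see the retrained workers.
        self.model, self.workers = new_model, new_workers
```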
41. Snapshot
● Keep track of model progress
● Cron Jobs
● Use the latest worker to test against
○ dev dataset
○ test dataset
● Record:
○ Respective performance
○ Counts
○ User-Model F1
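A sketch of what one snapshot might record; the field names and the injected evaluate function are assumptions:

```python
import datetime

def take_snapshot(worker, dev_set, test_set, counts, evaluate):
    """Record one progress point for the latest worker (run as a cron job)."""
    return {
        "time": datetime.datetime.utcnow().isoformat(),
        "dev_f1": evaluate(worker, dev_set),    # guidance statistics
        "test_f1": evaluate(worker, test_set),  # held-out performance
        "counts": dict(counts),                 # responses collected so far
    }
```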
42. Production
● Advertised through email to students in the department
● Collected data for 7 days
● Will continue to run online in the future
43. Result - System Performance
● Measured by the average over 100 requests
● Predict interface measured on 100 randomly selected scenes with test questions
● Performance measured in the deployment environment
44. Results - Data Collection
● Collected 151 responses
● Responses concentrated on the model's weak types (72.18% vs 50.64%)
● No evaluation improvement yet
● Collected responses amount to only 1.76% of the training data
45. Result - User-Model F1
● The model cannot learn from its own predictions
● Denotes the inverse of the similarity between the model's response and the user's input
46. Future Work
● Funding
● Current major limitation: number of responses
● More advertising through:
○ The NLP community
○ The community of Friends fans