Kyoshiro SUGIYAMA , AHC-Lab. , NAIST
An Investigation of
Machine Translation Evaluation Metrics
in Cross-lingual Question Answering
Kyoshiro Sugiyama, Masahiro Mizukami, Graham Neubig,
Koichiro Yoshino, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura
NAIST, Japan
Question answering (QA)
One of the techniques for information retrieval
Input: a question → Output: an answer
[Figure: for the question "Where is the capital of Japan?", the system retrieves the answer "Tokyo." from an information source.]
QA using knowledge bases
Convert the question sentence into a query
Low ambiguity
Restricted to the language of the knowledge base
→ Cross-lingual QA is necessary
[Figure: "Where is the capital of Japan?" is converted into the query Type.Location ⊓ Country.Japan.CapitalCity; the knowledge base returns Location.City.Tokyo, and the system answers "Tokyo."]
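The conversion above can be pictured with a toy sketch; the mini knowledge base, the `parse_question` pattern, and the query tuples below are invented for illustration (the actual system uses Freebase-style logical forms such as Type.Location ⊓ Country.Japan.CapitalCity):

```python
# Hypothetical toy knowledge base: (entity, relation) -> answer.
KB = {
    ("Japan", "capital"): "Tokyo",
    ("France", "capital"): "Paris",
}

def parse_question(question):
    # Extremely naive "semantic parsing": pattern-match one question shape,
    # e.g. "Where is the capital of Japan?" -> ("Japan", "capital").
    words = question.rstrip("?").split()
    if "capital" in words and "of" in words:
        entity = words[words.index("of") + 1]
        return (entity, "capital")
    return None

def answer(question):
    # Low ambiguity: the query either matches a KB fact or it does not.
    query = parse_question(question)
    return KB.get(query, "unknown")

print(answer("Where is the capital of Japan?"))  # Tokyo
```

The sketch also shows the linguistic restriction: the pattern and the KB entries are English-only, so a Japanese question would never match.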
Cross-lingual QA (CLQA)
Question sentence (in any language) vs. information source: linguistic difference
[Figure: the Japanese question "日本の首都はどこ?" ("Where is the capital of Japan?") must be mapped to the query Type.Location ⊓ Country.Japan.CapitalCity; the knowledge base returns Location.City.Tokyo ("東京" / Tokyo).]
Creating such a mapping directly: high cost, and not re-usable in other languages
CLQA using machine translation
Machine translation (MT) can be used to perform CLQA
Easy, low-cost, and usable in many languages
However, QA accuracy depends on MT quality
[Figure: a question in any language ("日本の首都はどこ?") is machine-translated into English ("Where is the capital of Japan?"), answered by the existing QA system ("Tokyo"), and the answer is machine-translated back ("東京").]
Purpose of our work
To clarify how translation affects QA accuracy
Which MT metrics are suitable for the CLQA task?
→ Creation of QA data sets using various translation systems
→ Evaluation of translation quality and QA accuracy
What kinds of translation results influence QA accuracy?
→ Case study (manual analysis of the QA results)
QA system
SEMPRE framework [Berant et al., 13]
Three steps of query generation:
1. Alignment: convert entities in the question sentence into "logical forms"
2. Bridging: generate predicates compatible with neighboring predicates
3. Scoring: evaluate candidate queries using a scoring function
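The scoring step can be sketched as a log-linear model over candidate logical forms; the feature names, weights, and candidates below are hypothetical, not SEMPRE's actual feature set:

```python
import math

def score(features, weights):
    # Log-linear score: dot product of a sparse feature vector and weights.
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def softmax_pick(candidates, weights):
    # Normalize candidate scores into probabilities; return the best candidate.
    scores = {c: score(f, weights) for c, f in candidates.items()}
    z = sum(math.exp(s) for s in scores.values())
    probs = {c: math.exp(s) / z for c, s in scores.items()}
    return max(probs, key=probs.get), probs

# Invented weights and candidate logical forms for illustration only.
weights = {"alignment_match": 2.0, "bridging_used": -0.5}
candidates = {
    "Type.Location ⊓ Country.Japan.CapitalCity": {"alignment_match": 1.0},
    "Type.Person ⊓ BornIn.Japan": {"alignment_match": 0.2, "bridging_used": 1.0},
}
best, probs = softmax_pick(candidates, weights)
```

Under these toy weights, the candidate whose entities align well with the question outscores the one that relied on bridging.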
Data set creation
Free917 (OR set): Training (512 pairs), Dev. (129 pairs), Test (276 pairs)
Manual translation into Japanese → JA set
Translation into English → HT, GT, YT, Mo, and Tra sets
Translation method
Manual Translation (“HT” set): Professional humans
Commercial MT systems
Google Translate (“GT” set)
Yahoo! Translate (“YT” set)
Moses (“Mo” set): Phrase-based MT system
Travatar (“Tra” set): Tree-to-String based MT system
Experiments
Evaluation of the translation quality of the created data sets
(reference: the questions in the OR set)
Evaluation of QA accuracy using the created data sets (same QA model for all sets)
→ Investigation of the correlation between the two
Metrics for evaluation of translation quality
BLEU+1: Evaluates local n-grams
1-WER: Evaluates whole word order strictly
RIBES: Evaluates rank correlation of word order
NIST: Evaluates local word order and correctness of infrequent words
Acceptability: Human evaluation
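Two of these metrics are simple enough to sketch directly. Below is a minimal single-reference implementation of BLEU+1 (add-one smoothed sentence BLEU) and 1-WER (one minus the normalized word edit distance), assuming whitespace tokenization; it is a simplified sketch, not the exact scorer used in the experiments:

```python
import math
from collections import Counter

def ngrams(tokens, n):
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def bleu_plus1(hyp, ref, max_n=4):
    # Sentence-level BLEU with add-one smoothed n-gram precisions
    # (a simplified single-reference sketch of BLEU+1).
    hyp, ref = hyp.split(), ref.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        h_counts = Counter(ngrams(hyp, n))
        r_counts = Counter(ngrams(ref, n))
        match = sum(min(c, r_counts[g]) for g, c in h_counts.items())
        total = max(len(hyp) - n + 1, 0)
        # add-one smoothing keeps a missing n-gram order from zeroing the score
        log_prec += math.log((match + 1) / (total + 1))
    bp = min(1.0, math.exp(1 - len(ref) / max(len(hyp), 1)))  # brevity penalty
    return bp * math.exp(log_prec / max_n)

def one_minus_wer(hyp, ref):
    # 1-WER: one minus the word error rate (token-level Levenshtein distance
    # normalized by reference length).
    hyp, ref = hyp.split(), ref.split()
    prev = list(range(len(ref) + 1))
    for i, hw in enumerate(hyp, 1):
        cur = [i]
        for j, rw in enumerate(ref, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (hw != rw)))
        prev = cur
    return 1.0 - prev[-1] / len(ref)
```

An identical hypothesis scores 1.0 under both; dropping words lowers BLEU+1 through the brevity penalty and 1-WER through insertions needed to recover the reference.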
Translation quality
QA accuracy
Translation quality and QA accuracy
Sentence-level analysis
47% of the questions in the OR set are not answered correctly
→ These questions might be difficult to answer even with a correct translation
Dividing the questions into two groups:
Correct group (141 × 5 = 705 questions): translated from the 141 questions answered correctly in the OR set
Incorrect group (123 × 5 = 615 questions): translated from the remaining 123 questions in the OR set
Sentence-level correlation
Metric          R² (correct group)   R² (incorrect group)
BLEU+1          0.900                0.007
1-WER           0.690                0.092
RIBES           0.418                0.311
NIST            0.942                0.210
Acceptability   0.890                0.547
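The R² values in the table come from fitting translation scores against QA accuracies; a minimal coefficient-of-determination sketch (the data points below are invented, not the paper's numbers):

```python
def r_squared(xs, ys):
    # Coefficient of determination for a simple least-squares linear fit:
    # R^2 = covariance(x, y)^2 / (var(x) * var(y)).
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return (sxy * sxy) / (sxx * syy)

# Perfectly linear illustrative data -> R^2 of exactly 1.0.
print(r_squared([1, 2, 3, 4], [2, 4, 6, 8]))
```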
Sentence-level correlation
(Same table as the previous slide.)
Very little correlation in the incorrect group for any metric:
if the reference cannot be answered correctly, the sentences are not suitable, even as negative samples
NIST has the highest correlation in the correct group
→ importance of content words
Sample 1
Sample 2
Lack of the question-type word
Sample 3
All questions were answered correctly even though they are grammatically incorrect.
Conclusion
NIST score has the highest correlation with QA accuracy
→ NIST is sensitive to changes in content words
If the reference cannot be answered correctly, there is very little correlation between translation quality and QA accuracy
→ Answerable references should be used
Three factors cause changes in QA results: content words, question types, and syntax

Application of GIS in Landslide Disaster Response.pptx
 
proposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeegerproposal kumeneger edited.docx A kumeeger
proposal kumeneger edited.docx A kumeeger
 
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
Testing and Development Challenges for Complex Cyber-Physical Systems: Insigh...
 
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRRINDIAN GCP GUIDELINE. for Regulatory  affair 1st sem CRR
INDIAN GCP GUIDELINE. for Regulatory affair 1st sem CRR
 

An Investigation of Machine Translation Evaluation Metrics in Cross-lingual Question Answering

  • 1. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST An Investigation of Machine Translation Evaluation Metrics in Cross-lingual Question Answering Kyoshiro Sugiyama, Masahiro Mizukami, Graham Neubig, Koichiro Yoshino, Sakriani Sakti, Tomoki Toda, Satoshi Nakamura NAIST, Japan
  • 2. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Question answering (QA) One of the techniques for information retrieval Input: Question → Output: Answer Information Source Where is the capital of Japan? Tokyo. Retrieval Retrieval Result 2/22
  • 3. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST QA using knowledge bases Convert question sentence into a query Low ambiguity Linguistic restriction of knowledge base → Cross-lingual QA is necessary Where is the capital of Japan? Tokyo. Type.Location ⊓Country.Japan.CapitalCity Knowledge base Location.City.Tokyo QA system using knowledge base Query Response 3/22
  • 4. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Cross-lingual QA (CLQA) Question sentence (Linguistic difference) Information source 日本の首都は どこ? 東京 Type.Location ⊓Country.Japan.CapitalCity Knowledge base Location.City.Tokyo QA system using knowledge base Query Response To create mapping: High cost and not re-usable in other languages 4/22 Any language Any language
  • 5. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST CLQA using machine translation Machine translation (MT) can be used to perform CLQA Easy, low cost and usable in many languages QA accuracy depends on MT quality 日本の首都はどこ? Where is the capital of Japan? Existing QA system Tokyo Machine Translation 東京 Machine Translation 5/22 Any language Any language
  • 6. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Purpose of our work To make clear how translation affects QA accuracy Which MT metrics are suitable for the CLQA task? → Creation of QA dataset using various translation systems → Evaluation of the translation quality and QA accuracy What kind of translation results influence QA accuracy? → Case study (manual analysis of the QA results) 6/22
  • 7. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST QA system SEMPRE framework [Berant et al., 13] 3 steps of query generation: Alignment Convert entities in the question sentence into “logical forms” Bridging Generate predicates compatible with neighboring predicates Scoring Evaluate candidates using scoring function 7/22 Scoring
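The three-step pipeline above amounts to generating many candidate logical forms and keeping the highest-scoring one under a learned scoring function. A toy sketch of that candidate-ranking idea (the candidate structures, feature names, and weights below are illustrative, not SEMPRE's actual API):

```python
# Toy candidate ranking: each candidate query carries simple features,
# and a linear model over those features picks the best candidate.
# Feature names and weights are made up for illustration.
candidates = [
    {"query": "Type.Location ⊓ Country.Japan.CapitalCity",
     "features": {"alignment_match": 1.0, "bridged_predicates": 1.0}},
    {"query": "Type.Person ⊓ Country.Japan",
     "features": {"alignment_match": 0.2, "bridged_predicates": 0.0}},
]
weights = {"alignment_match": 2.0, "bridged_predicates": 1.5}

def score(candidate):
    """Linear score: sum of feature values times their weights."""
    return sum(weights.get(f, 0.0) * v
               for f, v in candidate["features"].items())

best = max(candidates, key=score)
print(best["query"])  # → Type.Location ⊓ Country.Japan.CapitalCity
```

Training the real system amounts to optimizing the weights of such a scoring function so the correct query ranks first.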
  • 8. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Data set creation 8/22 Training (512 pairs) Dev. (129 pairs) Test (276 pairs) (OR set) Free917 JA set HT set GT set YT set Mo set Tra set Manual translation into Japanese Translation into English
  • 9. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Translation method Manual Translation (“HT” set): Professional humans Commercial MT systems Google Translate (“GT” set) Yahoo! Translate (“YT” set) Moses (“Mo” set): Phrase-based MT system Travatar (“Tra” set): Tree-to-String based MT system 9/22
  • 10. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Experiments Evaluation of translation quality of created data sets Reference is the questions in the OR set QA accuracy evaluation using created data sets Using same model → Investigation of correlation between them 10/22
  • 11. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Metrics for evaluation of translation quality 11/22 BLEU+1: Evaluates local n-grams 1-WER: Evaluates whole word order strictly RIBES: Evaluates rank correlation of word order NIST: Evaluates local word order and correctness of infrequent words Acceptability: Human evaluation
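Of these metrics, 1-WER is the simplest to illustrate: one minus the word-level edit distance divided by the reference length, so larger is better. A minimal sketch (not the exact scorer used in the experiments):

```python
def word_error_rate(reference, hypothesis):
    """Word-level edit distance (Levenshtein) divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    # DP table: d[i][j] = edits needed to turn ref[:i] into hyp[:j]
    d = [[0] * (len(hyp) + 1) for _ in range(len(ref) + 1)]
    for i in range(len(ref) + 1):
        d[i][0] = i
    for j in range(len(hyp) + 1):
        d[0][j] = j
    for i in range(1, len(ref) + 1):
        for j in range(1, len(hyp) + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[len(ref)][len(hyp)] / len(ref)

# 1-WER as on the slide: larger is better
print(1 - word_error_rate("where is the capital of japan",
                          "where is capital of japan"))  # ≈ 0.833
```

Because every reordering or dropped word counts as an edit, 1-WER penalizes word-order changes more strictly than n-gram metrics such as BLEU.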
  • 12. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Translation quality 12/22
  • 13. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST QA accuracy 13/22
  • 14. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Translation quality and QA accuracy 14/22
  • 15. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Translation quality and QA accuracy 15/22
  • 16. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Sentence-level analysis 47% of questions in the OR set are not answered correctly → These questions might be difficult to answer even with the correct translation result Dividing questions into two groups Correct group (141*5=705 questions): Translated from 141 questions answered correctly in OR set Incorrect group (123*5=615 questions): Translated from remaining 123 questions in OR set 16/22
  • 17. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Sentence-level correlation Metrics R² (correct group) R² (incorrect group) BLEU+1 0.900 0.007 1-WER 0.690 0.092 RIBES 0.418 0.311 NIST 0.942 0.210 Acceptability 0.890 0.547 17/22
  • 18. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Sentence-level correlation Metrics R² (correct group) R² (incorrect group) BLEU+1 0.900 0.007 1-WER 0.690 0.092 RIBES 0.418 0.311 NIST 0.942 0.210 Acceptability 0.890 0.547 Very little correlation NIST has the highest correlation → Importance of content words If the reference cannot be answered correctly, the sentences are not suitable, even for negative samples 18/22
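The R² figures in the table are coefficients of determination between metric scores and QA accuracy; for a simple linear fit this equals the squared Pearson correlation. A self-contained sketch with made-up data (the score values below are illustrative, not the paper's):

```python
def r_squared(xs, ys):
    """Coefficient of determination for a simple linear fit of ys on xs
    (equal to the squared Pearson correlation coefficient)."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return (cov * cov) / (var_x * var_y)

# Illustrative only: metric scores vs. QA accuracy for five hypothetical systems
nist_scores = [5.1, 6.0, 6.8, 7.2, 8.9]
qa_accuracy = [0.30, 0.35, 0.40, 0.43, 0.52]
print(round(r_squared(nist_scores, qa_accuracy), 3))
```

An R² near 1 (as for NIST in the correct group) means the metric score almost fully predicts QA accuracy; values near 0 (as in the incorrect group) mean the metric tells us almost nothing about whether the question will be answered.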
  • 19. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Sample 1 19/22
  • 20. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Sample 2 20/22 Lack of the question type-word
  • 21. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Sample 3 21/22 All questions were answered correctly though they are grammatically incorrect.
  • 22. Kyoshiro SUGIYAMA , AHC-Lab. , NAIST Conclusion NIST score has the highest correlation NIST is sensitive to the change of content words If reference cannot be answered correctly, there is very little correlation between translation quality and QA accuracy Answerable references should be used 3 factors which cause change of QA results: content words, question types and syntax 22/22

Editor's notes

  1. Thank you for the kind introduction. Hello everyone, I’m Kyoshiro Sugiyama, a master’s student at NAIST in Japan. In this presentation, I’d like to talk about translation quality with regard to cross-lingual question-answering tasks. Our investigation clarifies which evaluation metrics are appropriate for question answering tasks, and which kinds of mistranslation affect the accuracy of question answering.
  2. # (Move through the intro a little quickly) First, I’d like to talk about the focus of our study: question answering, QA. QA is a type of information retrieval technique. The input is a question sentence and the output is an answer to that question. For example, when I ask the system “Where is the capital of Japan?,” anim. 1: the system will retrieve the information from an information source, anim. 2: get a retrieval result and then output “Tokyo” according to the retrieval result.
  3. # (Move through the intro a little quickly) Recently, there has been much work on question answering systems using knowledge bases. Such systems convert a question sentence into a logical expression for a knowledge base and reply with an answer using the response from the knowledge base. By using knowledge bases, we can reduce the ambiguity of the answers. However, knowledge bases are only constructed for a few major languages, such as English. Therefore, it is often necessary to perform cross-lingual QA.
  4. # (Move through the intro a little quickly) Cross-lingual QA, CLQA, is QA in which the language of the question differs from the language of the information source. anim. 1: In CLQA using knowledge bases, creating a mapping from question sentences to queries is costly, because data collection and annotation by humans are necessary. Also, we must do this for each language that we want to use.
  5. # (Move through the intro a little quickly) Machine translation is one reasonable answer to this problem. If a question sentence is translated into the language of the information source, it is answerable by a monolingual QA system, and we can get an answer by re-translating the system’s answer. This approach is easy to implement, low cost because it can use existing systems, and usable in any language that has linguistic resources. And it is clear that better translations improve QA accuracy. However, existing MT systems are optimized for human consumption, and it is not clear whether an MT system that is better for human consumption is also better for CLQA.
  6. The purpose of our work is to clarify how translation affects QA accuracy, and to do so we performed two experiments. First, we investigated the relationship between translation quality and QA accuracy: we created data sets using various translation systems and performed question answering using an existing QA system. Then, we manually analyzed in detail what kinds of translation results cause changes in QA results.
  7. In our experiments, we used the question answering framework of Berant et al., SEMPRE. In this framework, the system answers in 3 steps. anim. 1: First, in the alignment step, the system converts entities in the question sentence into “logical forms,” such as “Type.University” or “BarackObama.” anim. 2: Second, in the bridging step, the system generates logical predicates compatible with neighboring predicates, and merges adjacent logical forms two by two, until only one logical form remains, and the remaining one becomes a query. These two steps are not deterministic, so the system generates many candidates anim. 3: and in the scoring step, the system determines the best candidate according to the scoring function. Training of the system is optimization of this scoring function.
  8. Next, I’d like to talk about how we created our data sets. We used the Free917 data set as the original data set. Free917 is a data set for question answering using Freebase, a large-scale structured knowledge base that is open to the public for free. It consists of 917 pairs of question sentences and correct queries, separated into three sets: training, development and test. In this presentation, I call this original test set “the OR set.” anim. 1: For our experiments, we made translated test sets. We manually translated each question sentence included in the OR set into Japanese. I call this Japanese test set “the JA set.” anim. 2: And then we created translations of the JA set into English using five different methods.
  9. This slide shows the methods we used. First, we asked a professional translation company to manually translate the questions from Japanese to English. We call this manually translated set “the HT set”. To create the GT and YT sets, we used the Google and Yahoo! translation engines. Also, we used the phrase-based Moses toolkit and the tree-to-string Travatar toolkit to create the Mo and Tra sets.
  10. Now we have 6 test sets, the original set OR and 5 translated sets. We evaluated their translation quality using various metrics. In the evaluation, the reference is the OR set. Also, using the created data sets, we performed question answering using SEMPRE. The QA system is trained using the training and dev. sets of Free917.
  11. We used the 5 metrics on this slide. BLEU+1 is a smoothed version of BLEU, which is the most widely used metric for measuring machine translation quality. Word error rate, WER, evaluates the edit distance between reference and translation; it evaluates word order more strictly than BLEU. We used the value of 1-WER to adjust the axis direction so that larger values are better. RIBES is a metric based on the rank correlation coefficient of word order between translation and reference, and thus focuses on whether the translation achieved correct ordering. NIST is a metric based on n-gram precision and each n-gram’s weight: less frequent words are given more importance than more frequent words. Finally, acceptability is a 5-grade manual evaluation metric that combines aspects of both fluency and adequacy.
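To make the BLEU+1 description concrete, one common formulation of the smoothing adds one to the higher-order n-gram counts so that a sentence with no 4-gram match still gets a nonzero score. A minimal sketch of that idea (the exact smoothing used by the evaluation toolkit in the experiments may differ; sentences are assumed non-empty):

```python
import math
from collections import Counter

def bleu_plus_one(reference, hypothesis, max_n=4):
    """Sentence-level BLEU with add-one smoothing on n-grams of order >= 2."""
    ref, hyp = reference.split(), hypothesis.split()
    log_prec = 0.0
    for n in range(1, max_n + 1):
        ref_ngrams = Counter(tuple(ref[i:i + n]) for i in range(len(ref) - n + 1))
        hyp_ngrams = Counter(tuple(hyp[i:i + n]) for i in range(len(hyp) - n + 1))
        # Clipped matches: a hypothesis n-gram counts at most as often as in the reference
        matches = sum(min(c, ref_ngrams[g]) for g, c in hyp_ngrams.items())
        total = sum(hyp_ngrams.values())
        smooth = 0 if n == 1 else 1  # BLEU+1: smooth only higher-order n-grams
        if matches + smooth == 0:
            return 0.0  # no unigram overlap at all
        log_prec += math.log((matches + smooth) / (total + smooth)) / max_n
    bp = min(1.0, math.exp(1 - len(ref) / len(hyp)))  # brevity penalty
    return bp * math.exp(log_prec)
```

Because the unigram term is left unsmoothed, the score still collapses when content words are mistranslated, but a single missing 4-gram no longer zeroes out the whole sentence score.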
  12. This slide shows the result of the translation quality evaluation. anim. 1: As you can see, the HT set, created by manual translation, has the highest quality on all metrics. This indicates that manual translation is still more accurate than machine translation for this language pair and task. # (The rest could perhaps be omitted) GT is the 2nd best on NIST and WER, while YT is higher than GT on acceptability and RIBES. This confirms previous reports that RIBES correlates well with human judgments of acceptability for Japanese-English translation tasks.
  13. Next, this graph shows question answering accuracy of each data set. I’ll talk about a few interesting points in the next few slides.
  14. We can see that YT has lower accuracy than GT, although it is the best MT system on acceptability. This result indicates that translations which are good for CLQA are clearly different from translations which are good for humans.
  15. On the other hand, GT has the highest QA accuracy. anim. 2: This is reflected well by the NIST score. Also, the shapes of these two graphs are very similar, so there is probably a high correlation between NIST score and QA accuracy.
  16. Next, we performed a sentence-level analysis of the QA results. First, we note that about 47% of questions in the OR set couldn’t be answered correctly. These questions might be difficult to answer even with the correct translation, so we divided all translated questions into 2 groups. One is the “correct” group, which consists of questions translated from those answered correctly in the OR set. 141 questions in the OR set were answered correctly, so the correct group has 705 questions. The other is the “incorrect” group, which consists of translations of the remaining questions. 123 questions were not answered correctly, so the incorrect group has 615 questions.
  17. This table shows the sentence-level correlation between question answering accuracy and evaluation score of each group and metric.
  18. As you can see, the NIST metric has the highest correlation in the correct group. This indicates the importance of content words, which are weighted more heavily by NIST’s frequency weighting. anim. 1: Also, as you can see, all metrics except acceptability show very little correlation in the incorrect group. anim. 2: This indicates that if the reference cannot be answered correctly, the sentences are not suitable for optimizing an MT system, even as negative samples.
  19. Next, I’d like to show you some examples of question answering results. In the first sample, the phrase “interstate 579” has been translated in various ways. anim. 1: Only OR and Tra contain the phrase “interstate 579” and have been answered correctly. The output logical forms of the other translations lack the entity of the highway “interstate 579”, mistaking it for another entity. anim. 2: For example, the phrase “interstate highway 579” is instead aligned to the entity of the music album “interstate highway.” Likewise in GT, the phrase “has been” is aligned to the music album “has been,” and in Mo, “highway 579” is aligned to another highway. This indicates that changes to content words, to the point that they no longer match entities in the entity lexicon, are a very important problem. To ameliorate this problem, it may be possible to modify the translation system to consider the named entity lexicon as a feature in the translation process.
  20. In the second example, the translations that contain all of the content words are still answered incorrectly. The common point of these questions is the lack of a question type-word, such as “how many.” These words are frequent, so even the NIST score cannot evaluate them adequately, indicating that other measures may be necessary.
  21. In the third example, all questions are answered correctly, though they are grammatically incorrect. This example indicates that, at least for the relatively simple questions in Free917, achieving correct word ordering plays only a secondary role in achieving high QA accuracy.
  22. This is our conclusion, thank you very much. ------------------------- If I have time -------------------------- This is the conclusion of our presentation. First, the NIST score, which is sensitive to changes of content words, has the highest correlation with question answering accuracy. This indicates the importance of content words. Therefore, for cross-lingual question answering tasks, we recommend using NIST or other metrics that put weight on content words. Second, if references cannot be answered correctly, there is very little correlation between translation quality and QA accuracy. So, references that are actually answerable by the QA system should be used. Finally, we identified 3 factors that cause changes in QA results: content words, question type-words and syntax. That’s all, thank you for listening.
  23. On the other hand, in the 4th example, OR and HT are grammatically correct but were answered incorrectly.