Tomoya Mizumoto†, Mamoru Komachi†, Masaaki Nagata‡, Yuji Matsumoto†
† Nara Institute of Science and Technology
‡ NTT Communication Science Laboratories

Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners

2011.11.09 IJCNLP

Background
- The number of Japanese language learners has increased
  - 3.65 million people in 133 countries and regions
  - Only 50,000 Japanese language teachers overseas
- High demand for good instructors for writers of Japanese as a Second Language (JSL)

Recent error correction for language learners
- NLP research has begun to pay attention to second language learning
- Most previous research deals with restricted types of learners’ errors
  - E.g. research on JSL learners’ errors mainly focuses on Japanese case particles
- Real JSL learners’ writing contains various errors
  - Spelling errors
  - Collocation errors

Error correction using SMT
- Correcting ESL learners’ errors using statistical machine translation (SMT) has been proposed [Brockett et al., 2006]
  - Advantage: it doesn’t require expert knowledge
  - Learns a correction model from learners’ and corrected corpora
- Not easy to acquire large-scale learners’ corpora
- Japanese sentences are not segmented into words
  - JSL learners’ sentences are hard to tokenize

Purpose of our study
1. Solve the knowledge acquisition bottleneck
   - Create a large-scale learners’ corpus from the error revision logs of a language learning SNS
2. Solve the problem of word segmentation errors caused by erroneous input, using SMT techniques with the extracted learners’ corpus

SNS sites that help language learners
- smart.fm
  - Helps learners practice language learning
- Livemocha
  - Offers courses of grammar instruction, reading comprehension exercises and practice
- Lang-8
  - Multilingual language learning and language exchange SNS
  - Soon after a learner writes a passage in the language being learned, native speakers of that language correct the errors in it

Lang-8 data
- Sentences of JSL learners: 925,588
- Corrected sentences: 1,288,934
- Example of a corrected sentence from Lang-8 (a pair of a learner’s sentence and its corrected sentence)
    Learner: ビデオゲームをやまシた。
    Correct: ビデオゲームをやりまシした。

Language              English     Japanese   Mandarin   Korean
Number of sentences   1,069,549   925,588    136,203    93,955

Types of correction
- Correction by insertion, deletion and substitution
    Learner: ビデオゲームをやまシた。
    Correct: ビデオゲームをやりまシした。
- Correction with a comment
    Learner: 銭湯に行った。
    Correct: 銭湯に行った。 いつ行ったかがあるほうがいい
    (the trailing text いつ行ったかがあるほうがいい is a comment)
- Some “corrected” sentences only have the word “GOOD” appended at the end
    Learner: 銭湯に行った。
    Correct: 銭湯に行った。 GOOD
- Removing comments
  - Number of sentence pairs: 1,288,934 → 849,894

Comparison of Japanese learners’ corpora

Corpus                                  Data size
Our Lang-8 corpus                       849,894 sentences (448 MB)
Teramura error data (1990)              4,601 sentences (420 KB)
Ohso database (1998)                    756 files (15 MB)
JSL learners parallel database (2000)   1,500 JSL learners’ writings

Error correction using SMT

\[ \hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f) = \operatorname*{arg\,max}_{e} P(e)\,P(f \mid e) \]

SMT
- e: target sentences
- f: source sentences
- P(e): probability of the language model
- P(f|e): probability of the translation model

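The step from P(e|f) to P(e)P(f|e) in the formula on this slide follows from Bayes' rule. A short derivation, using only the symbols defined on the slide:

```latex
\begin{align*}
\hat{e} &= \operatorname*{arg\,max}_{e} P(e \mid f) \\
        &= \operatorname*{arg\,max}_{e} \frac{P(e)\,P(f \mid e)}{P(f)} \\
        &= \operatorname*{arg\,max}_{e} P(e)\,P(f \mid e)
\end{align*}
```

The denominator P(f) does not depend on e, so dropping it does not change which e attains the maximum.
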
Error correction using SMT

\[ \hat{e} = \operatorname*{arg\,max}_{e} P(e \mid f) = \operatorname*{arg\,max}_{e} P(e)\,P(f \mid e) \]

SMT
- e: target sentences
- f: source sentences
- P(e): probability of the language model
- P(f|e): probability of the translation model
Error correction
- e: corrected sentences
- f: Japanese learners’ sentences
- P(e): probability of the language model; can be learned from a monolingual corpus of the language to be learned
- P(f|e): probability of the translation model; can be learned from the sentence-aligned learners’ corpus

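To make the noisy-channel formulation concrete, here is a minimal Python sketch that picks the candidate correction e maximizing P(e)P(f|e). The probability tables, the candidate list and the function names `score` and `correct` are hypothetical and chosen only for illustration; a real SMT decoder estimates these models from corpora and searches a far larger candidate space.

```python
import math

# Toy, hand-picked probabilities (assumptions for illustration only).
# LANGUAGE_MODEL[e] plays the role of P(e);
# TRANSLATION_MODEL[(f, e)] plays the role of P(f|e).
LANGUAGE_MODEL = {
    "私は英語が好き": 0.020,
    "私わ英語が好き": 0.0001,
}
TRANSLATION_MODEL = {
    ("私わ英語が好き", "私は英語が好き"): 0.30,
    ("私わ英語が好き", "私わ英語が好き"): 0.60,
}


def score(e: str, f: str) -> float:
    """Return log P(e) + log P(f|e), or -inf if either probability is unknown."""
    p_e = LANGUAGE_MODEL.get(e, 0.0)
    p_f_given_e = TRANSLATION_MODEL.get((f, e), 0.0)
    if p_e == 0.0 or p_f_given_e == 0.0:
        return float("-inf")
    return math.log(p_e) + math.log(p_f_given_e)


def correct(f: str, candidates: list[str]) -> str:
    """Pick the candidate e that maximizes P(e) * P(f|e)."""
    return max(candidates, key=lambda e: score(e, f))


best = correct("私わ英語が好き", ["私は英語が好き", "私わ英語が好き"])
print(best)
```

In this toy setting the translation model alone prefers leaving the learner's sentence unchanged, and it is the language model that tips the product toward the corrected sentence.
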
Difficulty of handling the JSL learners’ sentences
- Word segmentation is usually performed as a pre-processing step
- JSL learners’ sentences contain many errors and a lot of hiragana (phonetic characters)
  - Hard to tokenize with a traditional morphological analyzer

Difficulty of handling the JSL learners’ sentences
- E.g.
    Learner: でもじょずじゃりません
    Correct: でもじょうずじゃありません
  After tokenization:
    Learner: でも じ ょずじゃりません
    Correct: でも じょうず じゃ ありません

Character-wise model
- Sentences are segmented character by character
  - e.g. でもじょずじゃりません → で も じ ょ ず じ ゃ り ま せ ん
- Not affected by word segmentation errors
- Expected to be more robust
    Learner: で も じ ょ  ず じ ゃ   り ま せ ん
    Correct: で も じ ょ う ず じ ゃ あ り ま せ ん

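Character-wise segmentation needs no morphological analyzer. A minimal Python sketch (the function name `char_segment` is hypothetical) that reproduces the segmentation shown on this slide:

```python
def char_segment(sentence: str) -> str:
    """Split a sentence into characters separated by single spaces."""
    # Drop any existing whitespace first so that it is not treated as a token.
    return " ".join(ch for ch in sentence if not ch.isspace())


print(char_segment("でもじょずじゃりません"))
```

Because every character becomes its own token, a misspelled word can no longer trigger a wrong word boundary; the cost is longer token sequences for the translation and language models.
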
Experiment
- Carried out experiments to see
  1. the effect of corpus size
  2. the effect of the granularity of tokenization

Experimental setting
- Methods
  - Baseline: word-wise model
  - Proposed method: character-wise model
  - Language models: 3-gram and 5-gram
- Data
  - Extracted from the revision logs of Lang-8
    - 849,894 sentences
  - Test: 500 sentences
    - Re-annotated the 500 sentences to make a gold standard

Evaluation metrics
- BLEU
  - BLEU has been adopted for the automatic assessment of ESL errors [Park and Levy, 2011]
  - We follow their use of BLEU in the error correction task for JSL learners
- JSL learners’ sentences are hard to tokenize with a morphological analyzer
  - Character-based BLEU is used instead

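As an illustration of what character-based BLEU measures, the sketch below computes a corpus-level BLEU over character n-grams. It assumes the usual BLEU ingredients (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty), a single reference per sentence and no smoothing; the slides do not state which settings or tool were used, and the names `char_ngrams` and `char_bleu` are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text: str, n: int) -> Counter:
    """Count character n-grams, ignoring whitespace."""
    chars = [ch for ch in text if not ch.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_bleu(hypotheses: list[str], references: list[str], max_n: int = 4) -> float:
    """Corpus-level BLEU over characters with one reference per hypothesis."""
    matched = [0] * max_n  # clipped n-gram matches per order
    total = [0] * max_n    # hypothesis n-grams per order
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += sum(1 for ch in hyp if not ch.isspace())
        ref_len += sum(1 for ch in ref if not ch.isspace())
        for n in range(1, max_n + 1):
            hyp_counts = char_ngrams(hyp, n)
            ref_counts = char_ngrams(ref, n)
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0  # no smoothing: an n-gram order with no match gives BLEU 0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity_penalty = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity_penalty * math.exp(log_precision)


reference = "でもじょうずじゃありません"
print(round(100 * char_bleu(["でもじょずじゃりません"], [reference]), 2))
```

Working on characters sidesteps the tokenization problem entirely: the metric never has to decide where words begin and end in an erroneous learner sentence.
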
The larger the corpus, the higher the BLEU
- Character-wise model: character 5-gram
[Figure: BLEU (y-axis, 81 to 81.9) against the learning data size of the TM (x-axis: 0.1M, 0.15M, 0.3M, 0.85M sentences)]
The difference is not statistically significant

Character-wise models are better than the word-wise model
- TM training corpus: 0.3M sentences
- The character 5-gram model achieves the best result

Model   Word 3-gram   Character 3-gram   Character 5-gram
BLEU    80.72         81.63              81.81

Both the 0.1M and 0.3M models corrected
Learner: またど もう ありがとう
(Thanks, Matadomou (OOV))
Correct: またど うも ありがとう
(Thank you again)
Learner: TRUTH わ 美しいです
(TRUTH wa beautiful)
Correct: TRUTH は 美しいです
(TRUTH is beautiful)

Only the 0.3M model corrected
Learner: 学生な るたら 学校に行ける
(The learner made an error in conjunction form)
Correct: 学生な ったら 学校に行ける
(Becoming a student, I can go to school)
0.1M: 学生な るため 学校に行ける
(I can go to school to be a student)
0.3M: 学生な ったら 学校に行ける
(Becoming a student, I can go to school)

Conclusions
- Made use of a large-scale corpus built from the revision logs of a language learning SNS
- Adopted SMT approaches to alleviate the problem of erroneous input from learners
- The character-wise model outperforms the word-wise model
- Future work: apply the method using SMT techniques with the extracted learners’ corpus to error correction for learners of English as a second language

Handling the comment
- Conduct the following three pre-processing steps
  1. If the corrected sentence contains only “GOOD” or “OK”, we don’t include it in the corpus
  2. If the edit distance between the learner’s sentence and the corrected sentence is larger than 5, we simply drop the sentence pair from the corpus
  3. If the corrected sentence ends with “OK” or “GOOD”, we remove that word and retain the sentence pair

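The three pre-processing steps can be sketched in Python as follows. This is an illustration under stated assumptions, not the authors' script: the edit distance is taken to be character-level Levenshtein distance, the steps are applied in the order listed on the slide, and the names `edit_distance` and `clean_pair` are hypothetical.

```python
from typing import Optional, Tuple

COMMENT_WORDS = ("GOOD", "OK")
MAX_EDIT_DISTANCE = 5  # threshold stated on the slide


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            current.append(min(
                previous[j] + 1,               # deletion
                current[j - 1] + 1,            # insertion
                previous[j - 1] + (ca != cb),  # substitution or match
            ))
        previous = current
    return previous[-1]


def clean_pair(learner: str, corrected: str) -> Optional[Tuple[str, str]]:
    """Apply the three steps in order; return None when the pair is dropped."""
    # Step 1: the "correction" is nothing but GOOD / OK.
    if corrected.strip() in COMMENT_WORDS:
        return None
    # Step 2: too many edits, so the correction probably carries a comment.
    if edit_distance(learner, corrected) > MAX_EDIT_DISTANCE:
        return None
    # Step 3: strip a trailing GOOD / OK and keep the pair.
    stripped = corrected.rstrip()
    for word in COMMENT_WORDS:
        if stripped.endswith(word):
            stripped = stripped[: -len(word)].rstrip()
            break
    return learner, stripped


print(clean_pair("銭湯に行った。", "銭湯に行った。 GOOD"))
print(clean_pair("銭湯に行った。", "銭湯に行った。 いつ行ったかがあるほうがいい"))
```

With these assumptions, the “GOOD” example from the earlier slide survives step 2 (the appended text costs exactly 5 edits) and is cleaned by step 3, while the pair with the long Japanese comment is dropped by step 2.
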
Future work
- Apply the method using SMT techniques with the extracted learners’ corpus to error correction of English as a second language
- Apply factored language and translation models that incorporate POS information of the words on the target side, while the learners’ input is processed by a character-wise model

Approach for correcting unrestricted errors
- EM-based unsupervised approach to perform whole-sentence grammar correction [Park and Levy, 2011]
  - Types of error must be pre-determined
  - Requires expert knowledge of L2 teaching
- Error correction using SMT [Brockett et al., 2006]
  - Advantage: it doesn’t require expert knowledge
  - Learns a correction model from learners’ corpora
  - Not easy to acquire large-scale learners’ corpora

Statistical machine translation
[Diagram: English sentence → Translation Model → candidate Japanese sentences → Language Model → Japanese sentence]
- Parallel corpus (English, Japanese), e.g. I like English / 私は英語が好き
  - TM is learned from the sentence-aligned parallel corpus
- Japanese corpus
  - LM is learned from the Japanese monolingual corpus

Japanese error correction
[Diagram: learner’s sentence → Translation Model → candidate correct sentences → Language Model → correct sentence]
- Learners’ corpus (learner, correct), e.g. 私わ英語が好き / 私は英語が好き
  - TM is learned from the sentence-aligned learners’ corpus
- Japanese corpus
  - LM is learned from the Japanese monolingual corpus

Evaluation metrics
- Character-based BLEU
- Recall and precision based on the longest common subsequence (LCS)
- F-measure: harmonic mean of R and P

\[ \mathrm{recall}\,(R) = \frac{N_{\mathrm{LCS}}}{N_{\mathrm{SYSTEM}}}, \qquad \mathrm{precision}\,(P) = \frac{N_{\mathrm{LCS}}}{N_{\mathrm{CORRECT}}} \]

- N_CORRECT: number of characters contained in the corrected answers
- N_SYSTEM: number of characters contained in the system results
- N_LCS: number of characters contained in the LCS of the corrected answers and the system results
- e.g. correct: 私 は 学 生 で す
       system:  私 は 学 生 だ
  N_CORRECT = 6, N_SYSTEM = 5, N_LCS = 4, so R = 4/5 and P = 4/6
[Park and Levy, 2011]

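The worked example on this slide can be checked with a short Python sketch. It uses the definitions exactly as written on the slide (N_SYSTEM in the recall denominator, N_CORRECT in the precision denominator); the function names `lcs_length` and `lcs_scores` are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two character sequences."""
    previous = [0] * (len(b) + 1)
    for ca in a:
        current = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                current.append(previous[j - 1] + 1)
            else:
                current.append(max(previous[j], current[j - 1]))
        previous = current
    return previous[-1]


def lcs_scores(correct: str, system: str) -> tuple[float, float, float]:
    """Return (R, P, F) using the definitions on this slide."""
    correct_chars = "".join(correct.split())  # remove the spaces between characters
    system_chars = "".join(system.split())
    n_lcs = lcs_length(correct_chars, system_chars)
    if n_lcs == 0:
        return 0.0, 0.0, 0.0
    recall = n_lcs / len(system_chars)      # N_LCS / N_SYSTEM, as on the slide
    precision = n_lcs / len(correct_chars)  # N_LCS / N_CORRECT, as on the slide
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


r, p, f = lcs_scores("私 は 学 生 で す", "私 は 学 生 だ")
print(r, p, f)
```

The F-measure is symmetric in R and P, so it is unaffected by which of the two denominators is assigned to recall and which to precision.
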
Experimental results - granularity of tokenization -
- Training corpus: L1 = ALL
- Test corpus: L1 = English
- TM size: 0.3M sentences

            W       C3      C5
Recall      90.43   90.89   90.83
Precision   91.75   92.34   92.43
F-measure   91.09   91.61   91.62
BLEU        80.72   81.63   81.81
(W: word-wise 3-gram; C3: character-wise 3-gram; C5: character-wise 5-gram)

Purpose of our study
1. Solve the knowledge acquisition bottleneck
   - Create a large-scale learners’ corpus from the error revision logs of a language learning SNS
2. Propose a method using SMT techniques with the extracted learners’ corpus
3. Solve the problem of word segmentation errors caused by erroneous input

Experiment
- Carried out experiments to see
  1. the effect of the granularity of tokenization
  2. the effect of corpus size
  3. the difference between first languages (L1)

Experimental data
- Training data
  - Extracted from the revision logs of Lang-8
  - Prepared three L1 models
    - L1 = ALL: 849,894 sentences
    - L1 = English: 320,655 sentences
    - L1 = Mandarin: 186,807 sentences
- Test data
  - Extracted 500 sentences each from L1 = English and L1 = Mandarin
  - Re-annotated the 500 sentences to make a gold standard

Experimental results - granularity of tokenization -
- Training corpus: L1 = ALL
- Test corpus: L1 = English

        W       C3      C5
BLEU    80.72   81.63   81.81

Experimental results - corpus size -
- Training corpus: L1 = ALL model
- Test corpus: L1 = English model
[Figure: BLEU (y-axis, 81 to 81.9) against the learning data size of the TM (x-axis: 0.1M, 0.15M, 0.3M, 0.85M sentences)]

Experimental results - L1 model -
- TM size: 0.18M sentences

                       L1 of test data
L1 of training data    English   Mandarin
English                81.48     85.73
Mandarin               80.83     85.89
All                    81.21     85.53

Difficulty of handling the JSL learners’ sentences
- Word segmentation is usually performed as a pre-processing step
- JSL learners’ sentences contain many errors and a lot of hiragana (phonetic characters)
  - Hard to tokenize with a traditional morphological analyzer
- e.g.
    Learner: でもじょずじゃりません
    Correct: でもじょうずじゃありません
  After tokenization:
    Learner: でも じ ょずじゃりません
    Correct: でも じょうず じゃ ありません

Statistical machine translation
[Diagram: the input “went to America.” passes through the Translation Model and the Language Model, producing the Japanese sentence 私はアメリカに行った]
- Parallel corpus (English, Japanese), e.g. I like English / 私は英語が好き
- Japanese corpus, e.g. 私は英語が好き ・・・

Statistical machine translation
[Diagram: the learner’s input 私わアメリカに行った passes through the Translation Model and the Language Model, producing the correct sentence 私はアメリカに行った]
- Learner corpus (learner, correct), e.g. 私わ英語が好き / 私は英語が好き
- Japanese corpus, e.g. 私は英語が好き ・・・


Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners @IJCNLP2011

  • 1. Tomoya Mizumoto†, Mamoru Komachi†, Masaaki Nagata‡, Yuji Matsumoto† † Nara Institute of Science and Technology ‡ NTT Communication Science Laboratories  Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners 1 2011.11.09 IJCNLP
  • 2. Background !  Number of Japanese language learners has increased !  3.65 million people in 133 countries and regions !  Only 50,000 Japanese language teachers overseas !  High demand to find good instructors for writers of Japanese as a Second Language (JSL) 2
  • 3. Recent error correction for language learners !  NLP research has begun to pay attention to second language learning !  Most previous research deals with restricted type of learners’ errors !  E.g. research for JSL learners’ error !  Mainly focus on Japanese case particles !  Real JSL learners’ writing contains various errors !  Spelling errors !  Collocation errors 3
  • 4. Error correction using SMT !  Proposed to correct ESL learners’ errors using statistical machine translation !  Advantage is that it doesn’t require expert knowledge !  Learns a correction model from learners’ and corrected corpora !  Not easy to acquire large scale learners’ corpora !  Japanese sentences is not segmented into words !  JSL learners’ sentences are hard to tokenize 4 [Brockett et al., 2006]
  • 5. Purpose of our study 5 1.  Solve the knowledge acquisition bottleneck !  Create a large-scale learners’ corpus from error revision logs of language learning SNS 2.  Solve the problem of word segmentation errors caused by erroneous input using SMT techniques with extracted learners’ corpus
  • 6. SNS sites that helps language learners 6 !  smart.fm !  Helps learners’ practice language learning !  Livemocha !  Offers course of grammar instructions, reading comprehension exercises and practice !  Lang-8 !  Multi-lingual language learning and language exchange SNS !  Soon after the learners write a passage in a learning language, native speakers of the language correct errors in it
  • 7. !  Sentence of JSL learners: 925,588 !  Corrected sentences: 1,288,934 !  Example of corrected sentence from Lang-8 Lang-8 data 7 Learner: ビデオゲームをやまシた。 Correct: ビデオゲームをやりまシした。 Pairs of learners sentence and corrected sentence Language English Japanese Mandarin Korean Number of sentences 1,069,549 925,588 136,203 93,955
  • 8. Types of correction 8 !  Correction by insertion, deletion and substitution !  Correction with a comment !  Exist “corrected” sentences to which only the word “GOOD” is appended at the end !  Removing comments !  Number of sentence pair: 1,288,934 → 849,894 Learner: ビデオゲームをやまシた。 Correct: ビデオゲームをやりまシした。 Learner: 銭湯に行った。 Correct: 銭湯に行った。 いつ行ったかがあるほうがいい Comment Learner: 銭湯に行った。 Correct: 銭湯に行った。 GOOD
  • 9. Comparison of Japanese learners’ corpora
      Corpus                                  Data size
      Our Lang-8 corpus                       849,894 sentences (448 MB)
      Teramura error Data (1990)              4,601 sentences (420 KB)
      Ohso Database (1998)                    756 files (15 MB)
      JSL learners parallel database (2000)   1,500 JSL learners’ writings
  • 10. Error correction using SMT
      $$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\,P(f \mid e)$$
      • SMT
          • e: target sentences
          • f: source sentences
          • P(e): probability of the language model
          • P(f|e): probability of the translation model
  • 11. Error correction using SMT
      $$\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\,P(f \mid e)$$
      • SMT
          • e: target sentences
          • f: source sentences
          • P(e): probability of the language model
          • P(f|e): probability of the translation model
      • Error correction
          • e: corrected sentences
          • f: Japanese learners’ sentences
          • P(e): probability of the language model; can be learned from a monolingual corpus of the language to be learned
          • P(f|e): probability of the translation model; can be learned from the sentence-aligned learners’ corpus
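The second equality in the slide's formula is an application of Bayes' rule. Written out as a short derivation (using only the symbols defined on the slide), the denominator P(f) is constant with respect to e and therefore drops out of the argmax:

```latex
\begin{align}
\hat{e} &= \arg\max_{e} P(e \mid f) \\
        &= \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)} \\
        &= \arg\max_{e} P(e)\,P(f \mid e)
\end{align}
```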
  • 12. Difficulty of handling the JSL learners’ sentences
      • Word segmentation is usually performed as a pre-processing step
      • JSL learners’ sentences contain many errors and much hiragana (phonetic characters)
          • Hard to tokenize with a traditional morphological analyzer
  • 13. Difficulty of handling the JSL learners’ sentences
      • E.g.
          Learner: でもじょずじゃりません
          Correct: でもじょうずじゃありません
      • After tokenization
          Learner: でも じ ょずじゃりません
          Correct: でも じょうず じゃ ありません
  • 14. Character-wise model
      • Sentences are segmented character by character
          • e.g. でもじょずじゃりません → で も じ ょ ず じ ゃ り ま せ ん
      • Not affected by word segmentation errors
      • Expected to be more robust
          Learner: で も じ ょ 　 ず じ ゃ 　 り ま せ ん
          Correct: で も じ ょ う ず じ ゃ あ り ま せ ん
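The character-wise segmentation described on this slide can be sketched in a few lines of Python. This is a minimal illustration, not the authors' preprocessing script; the function name `to_characters` is hypothetical, and dropping existing whitespace is an assumption.

```python
def to_characters(sentence: str) -> str:
    """Split a sentence into space-separated characters.

    Assumption: any whitespace already in the input (including the
    full-width space) is discarded, so only real characters become tokens.
    """
    return " ".join(ch for ch in sentence if not ch.isspace())


if __name__ == "__main__":
    # Learner sentence taken from the slide.
    print(to_characters("でもじょずじゃりません"))
```

Because every character is its own token, the result does not depend on a morphological analyzer, which is the robustness argument the slide makes.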
  • 15. Experiment
      • Carried out an experiment to see
          1. the effect of corpus size
          2. the effect of granularity of tokenization
  • 16. Experimental setting
      • Methods
          • Baseline: word-wise model
          • Proposed method: character-wise model
              • Language model: 3-gram
              • Language model: 5-gram
      • Data
          • Extracted from revision logs of Lang-8
              • 849,894 sentences
          • Test: 500 sentences
              • Re-annotated 500 sentences to make a gold standard
  • 17. Evaluation metrics
      • BLEU
          • BLEU was adopted for automatic assessment of ESL errors [Park and Levy, 2011]
          • We followed their use of BLEU in the error correction task for JSL learners
      • JSL learners’ sentences are hard to tokenize with a morphological analyzer
          • Character-based BLEU
  • 18. The larger the corpus, the higher the BLEU
      • Character-wise model: character 5-gram
      [Figure: BLEU (y-axis, 81.0 to 81.9) against learning data size of the TM (x-axis: 0.1M, 0.15M, 0.3M, 0.85M)]
      • The difference is not statistically significant
  • 19. Character-wise models are better than the word-wise model
      • TM training corpus: 0.3M sentences
      Model              BLEU
      Word 3-gram        80.72
      Character 3-gram   81.63
      Character 5-gram   81.81 (achieves the best result)
  • 20. Both the 0.1M and 0.3M models corrected
      Learner: またど もう ありがとう (Thanks, Mantadomou (OOV))
      Correct: またど うも ありがとう (Thank you again)
      Learner: TRUTH わ 美しいです (TRUTH wa beautiful)
      Correct: TRUTH は 美しいです (TRUTH is beautiful)
  • 21. Only the 0.3M model corrected
      Learner: 学生な るたら 学校に行ける (The learner made an error in conjugation form)
      Correct: 学生な ったら 学校に行ける (Becoming a student, I can go to school)
      0.1M:    学生な るため 学校に行ける (I can go to school to be a student)
      0.3M:    学生な ったら 学校に行ける (Becoming a student, I can go to school)
  • 22. Conclusions
      • Made use of a large-scale corpus from the revision logs of a language learning SNS
      • Adopted SMT approaches to alleviate the problem of erroneous input from learners
          • The character-wise model outperforms the word-wise model
      • Apply the method using SMT techniques with an extracted learners’ corpus to error correction for learners of English as a second language
  • 23. Handling the comments
      • Conduct the following three pre-processing steps
          1. If the corrected sentence contains only “GOOD” or “OK”, we don’t include it in the corpus
          2. If the edit distance between the learner’s sentence and the corrected sentence is larger than 5, we simply drop the sentence pair from the corpus
          3. If the corrected sentence ends with “OK” or “GOOD”, we remove that word and retain the sentence pair
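The three pre-processing steps above could be sketched as follows. This is only an illustration under stated assumptions, not the authors' code: the names `edit_distance` and `clean_pairs` are hypothetical, the edit distance is assumed to be a character-level Levenshtein distance, the markers are matched case-sensitively, and the steps are applied in the order the slide lists them.

```python
from typing import Iterable, Iterator, Tuple

MARKERS = ("GOOD", "OK")   # annotation words named on the slide
MAX_EDIT_DISTANCE = 5      # threshold named on the slide


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed unit of comparison)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def clean_pairs(pairs: Iterable[Tuple[str, str]]) -> Iterator[Tuple[str, str]]:
    """Yield (learner, corrected) pairs that survive the three steps."""
    for learner, corrected in pairs:
        corrected = corrected.strip()
        # Step 1: the "correction" is nothing but GOOD / OK.
        if corrected in MARKERS:
            continue
        # Step 2: too different from the learner's sentence (likely a comment).
        if edit_distance(learner, corrected) > MAX_EDIT_DISTANCE:
            continue
        # Step 3: strip a trailing GOOD / OK and keep the pair.
        for marker in MARKERS:
            if corrected.endswith(marker):
                corrected = corrected[: -len(marker)].rstrip()
                break
        yield learner, corrected


if __name__ == "__main__":
    sample = [
        ("銭湯に行った。", "銭湯に行った。 GOOD"),
        ("銭湯に行った。", "GOOD"),
        ("銭湯に行った。", "銭湯に行った。 いつ行ったかがあるほうがいい"),
        ("私わ英語が好き", "私は英語が好き"),
    ]
    for pair in clean_pairs(sample):
        print(pair)
```

With this ordering, a sentence followed by a long free-text comment is dropped at step 2, while a sentence followed only by a short "GOOD" stays under the threshold and is cleaned at step 3.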
  • 24. Future work
      • Apply the method using SMT techniques with an extracted learners’ corpus to error correction of English as a second language
      • Apply factored language and translation models incorporating the POS information of the words on the target side, while learners’ input is processed by a character-wise model
  • 25. Approach for correcting unrestricted errors
      • EM-based unsupervised approach to perform whole-sentence grammar correction [Park and Levy, 2011]
          • Types of error must be pre-determined
          • Requires expert knowledge of L2 teaching
      • Error correction using SMT [Brockett et al., 2006]
          • Advantage: it doesn’t require expert knowledge
          • Learns a correction model from learners’ corpora
          • Not easy to acquire large-scale learners’ corpora
  • 26. Statistical machine translation
      [Diagram: English sentence → Translation Model → Japanese sentences → Language Model → Japanese sentence]
      • Parallel corpus (English ー Japanese), e.g. I like English ー 私は英語が好き
          • The TM is learned from a sentence-aligned parallel corpus
      • Japanese corpus
          • The LM is learned from a Japanese monolingual corpus
  • 27. Japanese error correction
      [Diagram: Learner’s sentence → Translation Model → Correct sentences → Language Model → Correct sentence]
      • Learners’ corpus (Learner ー Correct), e.g. 私わ英語が好き ー 私は英語が好き
          • The TM is learned from a sentence-aligned learners’ corpus
      • Japanese corpus
          • The LM is learned from a Japanese monolingual corpus
  • 28. Evaluation metrics [Park and Levy, 2011]
      • Character-based BLEU
      • Recall and precision based on the longest common subsequence (LCS)
          $$\mathrm{recall}\,(R) = \frac{N_{LCS}}{N_{SYSTEM}}, \qquad \mathrm{precision}\,(P) = \frac{N_{LCS}}{N_{CORRECT}}$$
          • $N_{CORRECT}$: number of characters contained in the corrected answers
          • $N_{SYSTEM}$: number of characters contained in the system results
          • $N_{LCS}$: number of characters contained in the LCS of the corrected answers and the system results
      • F-measure: harmonic mean of R and P
      • e.g. correct: 私 は 学 生 で す, system: 私 は 学 生 だ
          $$N_{CORRECT} = 6,\quad N_{SYSTEM} = 5,\quad N_{LCS} = 4,\quad R = \frac{4}{5},\quad P = \frac{4}{6}$$
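The LCS-based scores on this slide can be reproduced with a small Python sketch. It follows the slide's own definitions (R divides by the system length, P by the corrected length), which is the reverse of the more common convention, so treat the assignment of the two names as the slide's choice. The function names `lcs_length` and `lcs_scores` are hypothetical.

```python
from typing import Tuple


def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two character strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_scores(correct: str, system: str) -> Tuple[float, float, float]:
    """Return (R, P, F) using the denominators given on the slide."""
    n_lcs = lcs_length(correct, system)
    recall = n_lcs / len(system) if system else 0.0       # N_LCS / N_SYSTEM
    precision = n_lcs / len(correct) if correct else 0.0  # N_LCS / N_CORRECT
    total = recall + precision
    f_measure = 2 * recall * precision / total if total else 0.0
    return recall, precision, f_measure


if __name__ == "__main__":
    # Example from the slide: N_CORRECT = 6, N_SYSTEM = 5, N_LCS = 4.
    r, p, f = lcs_scores("私は学生です", "私は学生だ")
    print(f"R={r:.3f} P={p:.3f} F={f:.3f}")
```

For the slide's example this gives R = 4/5 and P = 4/6, and the harmonic mean works out to 8/11.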
  • 29. Experimental results - granularity of tokenization -
      • Training corpus: L1 = ALL
      • Test corpus: L1 = English
      • TM size: 0.3M sentences
      (W: word 3-gram, C3: character 3-gram, C5: character 5-gram)
                  W       C3      C5
      Recall      90.43   90.89   90.83
      Precision   91.75   92.34   92.43
      F-measure   91.09   91.61   91.62
      BLEU        80.72   81.63   81.81
  • 30. Purpose of our study
      1. Solve the knowledge acquisition bottleneck
          • Create a large-scale learners’ corpus from error revision logs of a language learning SNS
      2. Propose a method using SMT techniques with the extracted learners’ corpus
      3. Solve the problem of word segmentation errors caused by erroneous input
  • 31. Experiment
      • Carried out an experiment to see
          1. the effect of granularity of tokenization
          2. the effect of corpus size
          3. the effect of the learners’ first language (L1)
  • 32. Experimental data
      • Training data
          • Extracted from revision logs of Lang-8
          • Prepared three L1 models
              • L1 = ALL: 849,894 sentences
              • L1 = English: 320,655 sentences
              • L1 = Mandarin: 186,807 sentences
      • Test data
          • Extracted 500 sentences each from L1 = English and L1 = Mandarin
          • Re-annotated 500 sentences to make a gold standard
  • 33. Experimental results - granularity of tokenization -
      • Training corpus: L1 = ALL
      • Test corpus: L1 = English
      W       C3      C5
      80.72   81.63   81.81
  • 34. Experimental results - corpus size -
      • Training corpus: L1 = ALL model
      • Test corpus: L1 = English model
      [Figure: BLEU (y-axis, 81.0 to 81.9) against learning data size of the TM (x-axis: 0.1M, 0.15M, 0.3M, 0.85M)]
  • 35. Experimental results - L1 model -
      • TM size: 0.18M sentences
      L1 of training data   L1 of test data: English   L1 of test data: Mandarin
      English               81.48                      85.73
      Mandarin              80.83                      85.89
      all                   81.21                      85.53
  • 36. Difficulty of handling the JSL learners’ sentences
      • Word segmentation is usually performed as a pre-processing step
      • JSL learners’ sentences contain many errors and much hiragana (phonetic characters)
          • Hard to tokenize with a traditional morphological analyzer
      • e.g.
          Learner: でもじょずじゃりません
          Correct: でもじょうずじゃありません
      • After tokenization
          Learner: でも じ ょずじゃりません
          Correct: でも じょうず じゃ ありません
  • 37. Statistical machine translation
      [Diagram: input “went to America.” → Translation Model → Japanese sentences → Language Model → output 私はアメリカに行った]
      • Parallel corpus (English ー Japanese), e.g. I like English ー 私は英語が好き
      • Japanese corpus, e.g. 私は英語が好き ・・・
  • 38. Statistical machine translation
      [Diagram: input 私わアメリカに行った → Translation Model → Correct sentences → Language Model → output 私はアメリカに行った]
      • Learner corpus (Learner ー Correct), e.g. 私わ英語が好き ー 私は英語が好き
      • Japanese corpus, e.g. 私は英語が好き ・・・