Mining Revision Log of Language Learning SNS for Automated Japanese Error Correction of Second Language Learners @IJCNLP2011

Tomoya Mizumoto†, Mamoru Komachi†,
Masaaki Nagata‡, Yuji Matsumoto†

† Nara Institute of Science and Technology
‡ NTT Communication Science Laboratories　

2011.11.09 IJCNLP

Background

!  Number of Japanese language learners has
increased
!  3.65 million people in 133 countries and regions
!  Only 50,000 Japanese language teachers overseas
!  High demand to find good instructors for writers of
Japanese as a Second Language (JSL)
2

Recent error correction for language
learners

!  NLP research has begun to pay attention to
second language learning
!  Most previous research deals with restricted types
of learners’ errors
!  E.g. research on JSL learners’ errors
!  Mainly focuses on Japanese case particles
!  Real JSL learners’ writing contains various errors
!  Spelling errors
!  Collocation errors
3

Error correction using SMT
!  Proposed to correct ESL learners’ errors using
statistical machine translation
!  Advantage is that it doesn’t require expert
knowledge
!  Learns a correction model from learners’ and
corrected corpora
!  Not easy to acquire large scale learners’ corpora
!  Japanese sentences are not segmented into words
!  JSL learners’ sentences are hard to tokenize
4
[Brockett et al., 2006]

Purpose of our study

5
1.  Solve the knowledge acquisition bottleneck
!  Create a large-scale learners’ corpus from error
revision logs of language learning SNS
2.  Solve the problem of word segmentation errors
caused by erroneous input using SMT
techniques with extracted learners’ corpus

SNS sites that help language learners

6
!  smart.fm
!  Helps learners practice language learning
!  Livemocha
!  Offers courses with grammar instruction, reading
comprehension exercises and practice
!  Lang-8
!  Multi-lingual language learning and language
exchange SNS
!  Soon after learners write a passage in the language
they are learning, native speakers of that language
correct the errors in it

Lang-8 data

7
!  Sentences of JSL learners: 925,588
!  Corrected sentences: 1,288,934
!  Number of sentences by language
Language    Number of sentences
English     1,069,549
Japanese    925,588
Mandarin    136,203
Korean      93,955
!  Example of corrected sentence from Lang-8
(pairs of learner’s sentence and corrected sentence)
Learner: ビデオゲームをやまシた。
Correct: ビデオゲームをやりまシした。

Types of correction

8
!  Correction by insertion, deletion and substitution
Learner: ビデオゲームをやまシた。
Correct: ビデオゲームをやりまシした。
!  Correction with a comment
Learner: 銭湯に行った。
Correct: 銭湯に行った。　いつ行ったかがあるほうがいい
(Comment: "It would be better to say when you went")
!  Some “corrected” sentences only have the
word “GOOD” appended at the end
Learner: 銭湯に行った。
Correct: 銭湯に行った。　GOOD
!  Removing comments
!  Number of sentence pairs: 1,288,934 → 849,894

Comparison of Japanese learners’
corpora

Corpus                                   Data size
Our Lang-8 corpus                        849,894 sentences, 448 MB
Teramura error Data (1990)               4,601 sentences, 420 KB
Ohso Database (1998)                     756 files, 15 MB
JSL learners parallel database (2000)    1,500 JSL learners’ writings
9


SMT

10
$$
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\,P(f \mid e)
$$
e: target sentences
f: source sentences
P(e): probability of the language model
P(f|e): probability of the translation model
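
The second equality in the SMT formula is an application of Bayes' rule. As a worked step (standard noisy-channel reasoning, not spelled out on the slide): the denominator P(f) does not depend on e, so it can be dropped inside the argmax.

```latex
\hat{e}
  = \arg\max_{e} P(e \mid f)
  = \arg\max_{e} \frac{P(e)\,P(f \mid e)}{P(f)}
  = \arg\max_{e} P(e)\,P(f \mid e)
```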


11
$$
\hat{e} = \arg\max_{e} P(e \mid f) = \arg\max_{e} P(e)\,P(f \mid e)
$$
SMT
e: target sentences
f: source sentences
Error correction
e: corrected sentences
f: Japanese learners’ sentences
P(f|e): can be learned from the
sentence-aligned learners’ corpus
P(e): can be learned from a monolingual
corpus of the language to be learned

Difficulty of handling the JSL learners’
sentences

!  Word segmentation is usually performed as a
pre-processing step
!  JSL learners’ sentences contain many errors and
hiragana (phonetic characters)
!  Hard to tokenize with a traditional morphological
analyzer
12

Difficulty of handling the JSL learners’ sentences

13
!  E.g.
Learner: でもじょずじゃりません
Correct: でもじょうずじゃありません
tokenize

Character-wise model

!  Character-wise segmented
!  e.g. でもじょずじゃりません
→ で も じ ょ ず じ ゃ り ま せ ん
!  Not affected by word segmentation errors
!  Expected to be more robust
14
Learner: でもじょ　ずじゃ　りません
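
Character-wise segmentation can be sketched in a few lines. This is a minimal illustration, assuming every Unicode character is one token and that any existing whitespace (including full-width spaces) is discarded; the function name `segment_characters` is hypothetical and not taken from the slides.

```python
def segment_characters(sentence: str) -> str:
    """Return the sentence as space-separated characters.

    Assumption: one Unicode character = one token. Existing whitespace,
    including the full-width space U+3000, is dropped before splitting.
    """
    return " ".join(ch for ch in sentence if not ch.isspace())


if __name__ == "__main__":
    learner = "でもじょずじゃりません"
    correct = "でもじょうずじゃありません"
    # Both sides of a sentence pair are segmented the same way before
    # they are handed to the SMT toolkit.
    print(segment_characters(learner))
    print(segment_characters(correct))
```

Because no morphological analyzer is involved, a misspelled learner sentence and its correction are always split consistently, which is the robustness the slide refers to.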

Experiment

15
!  Carried out an experiment to see
1.  the effect of corpus size
2.  the effect of granularity of tokenization

Experimental setting

16
!  Methods
!  Baseline: Word-wise model
!  Proposed method: Character-wise model
!  Language model: 3-gram
!  Language model: 5-gram
!  Data
!  Extracted from revision logs of Lang-8
!  849,894 sentences
!  Test: 500 sentences
!  Re-annotated 500 sentences to make a gold standard

Evaluation metrics

!  BLEU
!  BLEU was adopted for automatic assessment of ESL
errors [Park and Levy, 2011].
!  We follow their use of BLEU in the error correction
task for JSL learners
!  JSL learners’ sentences are hard to tokenize with a
morphological analyzer
!  Character-based BLEU
17
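
Character-based BLEU can be illustrated with the sketch below. It assumes the usual BLEU recipe (clipped n-gram precision up to 4-grams, geometric mean, brevity penalty, one reference per sentence) applied to characters instead of words; the slides do not state the n-gram order or the tool used, so `max_n=4` and the function names are assumptions.

```python
import math
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams, ignoring whitespace."""
    chars = [c for c in text if not c.isspace()]
    return Counter(tuple(chars[i:i + n]) for i in range(len(chars) - n + 1))


def char_bleu(hypotheses, references, max_n=4):
    """Corpus-level BLEU over characters, one reference per hypothesis."""
    matched = [0] * max_n
    total = [0] * max_n
    hyp_len = ref_len = 0
    for hyp, ref in zip(hypotheses, references):
        hyp_len += sum(1 for c in hyp if not c.isspace())
        ref_len += sum(1 for c in ref if not c.isspace())
        for n in range(1, max_n + 1):
            hyp_counts = char_ngrams(hyp, n)
            ref_counts = char_ngrams(ref, n)
            # Clipped counts: an n-gram is credited at most as often
            # as it occurs in the reference.
            matched[n - 1] += sum(min(c, ref_counts[g]) for g, c in hyp_counts.items())
            total[n - 1] += sum(hyp_counts.values())
    if hyp_len == 0 or min(matched) == 0:
        return 0.0
    log_precision = sum(math.log(m / t) for m, t in zip(matched, total)) / max_n
    brevity = 1.0 if hyp_len > ref_len else math.exp(1 - ref_len / hyp_len)
    return brevity * math.exp(log_precision)


if __name__ == "__main__":
    system = ["私は学生だ"]
    gold = ["私は学生です"]
    print(round(100 * char_bleu(system, gold), 2))
```

The function returns a value between 0 and 1; the scores reported on the slides appear to be on a 0 to 100 scale, hence the multiplication by 100 in the usage lines.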

The larger the corpus, the higher the BLEU

!  Character-wise model: Character 5-gram
18
[Figure: BLEU (y-axis, 81 to 81.9) plotted against the learning data size of the TM (x-axis: 0.1M, 0.15M, 0.3M, 0.85M sentences)]
The difference is not statistically significant

Character-wise models are better than
word-wise model

!  TM Training corpus: 0.3M sentences
19
Model               BLEU
Word 3-gram         80.72
Character 3-gram    81.63
Character 5-gram    81.81
!  Character 5-gram achieves the best result

Both 0.1M and 0.3M models corrected

20
Learner：またどもうありがとう
(Thanks, Matadomou (OOV))
Correct：またどうもありがとう
(Thank you again)
Learner： TRUTH わ美しいです
(TRUTH wa beautiful)
Correct： TRUTH は美しいです
(TRUTH is beautiful)

0.3M model corrected
21
Learner：学生なるたら学校に行ける
(The learner made an error in conjunction form)
Correct：学生なったら学校に行ける
(Becoming a student, I can go to school)
0.1M ：学生なるため学校に行ける
(I can go to school to be a student)
0.3M ：学生なったら学校に行ける
(Becoming a student, I can go to school)

Conclusions

!  Made use of a large-scale corpus from the revision
logs of a language learning SNS
!  Adopted SMT approaches to alleviate the
problem of erroneous input from learners
!  Character-wise model outperforms the word-wise
model
!  Apply the SMT-based method with the
extracted learners’ corpus to error correction for
learners of English as a second language
22

Handling the comment

23
!  Conduct the following three pre-processing steps
1.  If the corrected sentence contains only “GOOD”
or “OK”, we don’t include it in the corpus
2.  If the edit distance between the learner’s sentence
and the corrected sentence is larger than 5, we
simply drop the sentence pair from the corpus
3.  If the corrected sentence ends with “OK” or
“GOOD”, we remove that word and retain the
sentence pair.
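
The three pre-processing steps can be sketched as a single filter. This is an illustration under stated assumptions: the edit distance is taken to be character-level Levenshtein distance, the steps run in the order listed on the slide, the markers are matched case-sensitively, and the names `edit_distance`, `clean_pair`, `MARKERS` and `MAX_DISTANCE` are hypothetical.

```python
MARKERS = ("GOOD", "OK")   # comment words named on the slide
MAX_DISTANCE = 5           # threshold named on the slide


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance (assumed granularity)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]


def clean_pair(learner: str, corrected: str):
    """Return a cleaned (learner, corrected) pair, or None if dropped."""
    corrected = corrected.strip()
    # Step 1: the "correction" consists only of a marker word.
    if corrected in MARKERS:
        return None
    # Step 2: too many edits, likely a comment rather than a correction.
    if edit_distance(learner, corrected) > MAX_DISTANCE:
        return None
    # Step 3: strip a trailing marker and keep the pair.
    for marker in MARKERS:
        if corrected.endswith(marker):
            corrected = corrected[:-len(marker)].strip()
            break
    return learner, corrected


if __name__ == "__main__":
    print(clean_pair("銭湯に行った。", "銭湯に行った。　GOOD"))
    print(clean_pair("銭湯に行った。", "銭湯に行った。　いつ行ったかがあるほうがいい"))
```

With this ordering a trailing “GOOD” still counts toward the distance in step 2; stripping markers before measuring the distance would be an equally plausible reading of the slide.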

Future work

24
!  Apply the SMT-based method with the
extracted learners’ corpus to error correction of
English as a second language
!  Apply factored language and translation models
incorporating the POS information of the words
on the target side, while learners’ input is
processed by a character-wise model

Approach for correcting unrestricted
errors

!  EM-based unsupervised approach to perform
whole sentence grammar correction
!  Types of error must be pre-determined
!  Requires expert knowledge of L2 teaching
!  Error correction using SMT
!  Advantage is that it doesn’t require expert
knowledge
!  Learns a correction model from learners’ corpora
!  Not easy to acquire large scale learners’ corpora
25
[Brockett et al., 2006]

Statistical machine translation

26
[Figure: SMT pipeline. English sentence → Translation Model → candidate Japanese sentences → Language Model → Japanese sentence]
Parallel Corpus (English ー Japanese): I like English ー 私は英語が好き ・・・
TM is learned from sentence-aligned parallel corpus
LM is learned from Japanese monolingual corpus

Japanese error correction

27
[Figure: error correction pipeline. Learner’s sentence → Translation Model → candidate correct sentences → Language Model → correct sentence]
Learners’ Corpus (Learner ー Correct): 私わ英語が好き ー 私は英語が好き ・・・
TM is learned from sentence-aligned learners’ corpus
LM is learned from Japanese monolingual corpus

Evaluation metrics

!  Character-based BLEU
!  Recall and precision based on LCS (longest common subsequence)
!  F-measure: harmonic mean of R and P
28
$$
\mathrm{recall}(R) = \frac{N_{LCS}}{N_{SYSTEM}}, \qquad \mathrm{precision}(P) = \frac{N_{LCS}}{N_{CORRECT}}
$$
N_{CORRECT}: number of characters contained in the corrected answers
N_{SYSTEM}: number of characters contained in the system results
N_{LCS}: number of characters contained in the LCS of the corrected answers and system results
!  e.g. correct: 私は学生です
system: 私は学生だ
$$
N_{CORRECT} = 6, \quad N_{SYSTEM} = 5, \quad N_{LCS} = 4, \quad R = \frac{4}{5}, \quad P = \frac{4}{6}
$$
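
The LCS-based scores can be checked with a short sketch. It follows the definitions as they appear on this slide (recall divides by the number of characters in the system result, precision by the number of characters in the corrected answer), which is the reverse of the more common convention; the function names are hypothetical.

```python
def lcs_length(a: str, b: str) -> int:
    """Length of the longest common subsequence of two strings."""
    prev = [0] * (len(b) + 1)
    for ca in a:
        curr = [0]
        for j, cb in enumerate(b, start=1):
            if ca == cb:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_scores(correct: str, system: str):
    """Return (recall, precision, f_measure) as defined on the slide."""
    if not correct or not system:
        return 0.0, 0.0, 0.0
    n_lcs = lcs_length(correct, system)
    recall = n_lcs / len(system)      # slide: N_LCS / N_SYSTEM
    precision = n_lcs / len(correct)  # slide: N_LCS / N_CORRECT
    if recall + precision == 0:
        return 0.0, 0.0, 0.0
    f_measure = 2 * recall * precision / (recall + precision)
    return recall, precision, f_measure


if __name__ == "__main__":
    r, p, f = lcs_scores("私は学生です", "私は学生だ")
    print(r, p, f)
```

For the slide's example the LCS is 私は学生 (4 characters), so R = 4/5 and P = 4/6; the F-measure is symmetric in R and P, so it does not depend on which of the two conventions is used.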

Experimental results
- granularity of tokenization -

!  Training corpus: L1= ALL
!  Test corpus: L1= English
!  TM size: 0.3M sentences
29
(W: word 3-gram, C3: character 3-gram, C5: character 5-gram)
             W        C3       C5
Recall       90.43    90.89    90.83
Precision    91.75    92.34    92.43
F-measure    91.09    91.61    91.62
BLEU         80.72    81.63    81.81

Purpose of our study

30
1.  Solve the knowledge acquisition bottleneck
!  Create a large-scale learners’ corpus from error
revision logs of language learning SNS
2.  Propose a method using SMT techniques with
extracted learners’ corpus
3.  Solve the problem of word segmentation errors
caused by erroneous input

Experiment

31
!  Carried out an experiment to see
1.  the effect of granularity of tokenization
2.  the effect of corpus size
3.  the difference of first language (L1)

Experimental data

!  Training data
!  Extracted from revision logs of Lang-8
!  Prepare three L1 models
!  L1= ALL: 849,894 sentences
!  L1= English: 320,655 sentences
!  L1= Mandarin: 186,807 sentences
!  Test data
!  Extracted 500 sentences each from L1= English and
L1= Mandarin
!  Re-annotated 500 sentences to make a gold standard
32

Experimental results
- granularity of tokenization -

!  Training corpus: L1= ALL
!  Test corpus: L1= English
33
        W        C3       C5
BLEU    80.72    81.63    81.81

Experimental results
- corpus size -

!  Training corpus: L1= ALL model
!  Test corpus: L1= English model
34
[Figure: BLEU (y-axis, 81 to 81.9) plotted against the learning data size of the TM (x-axis: 0.1M, 0.15M, 0.3M, 0.85M)]

Experimental results
- L1 model -

!  TM size: 0.18M sentences
35
                       L1 of test data
L1 of training data    English    Mandarin
English                81.48      85.73
Mandarin               80.83      85.89
all                    81.21      85.53



37
[Figure: SMT example. went to America. → Translation Model → candidate Japanese sentences → Language Model → 私はアメリカに行った]
Parallel Corpus (English ー Japanese): I like English ー 私は英語が好き ・・・
Japanese Corpus: 私は英語が好き ・・・


38
[Figure: error correction example. 私わアメリカに行った → Translation Model → candidate correct sentences → Language Model → 私はアメリカに行った]
Learner Corpus (Learner ー Correct): 私わ英語が好き ー 私は英語が好き ・・・
Japanese Corpus: 私は英語が好き ・・・
