Human Interface Laboratory
Investigating an Effective Character-level
Embedding in Korean Sentence Classification
2019. 9. 13 @PACLIC 33
Won Ik Cho, Seok Min Kim, Nam Soo Kim
Contents
• Why character-level embedding?
 Case of Korean writing system
 Previous approaches
• Task description
• Experiment
 Feature engineering
 Implementation
• Result & Analysis
• Done and afterward
1
Why character-level embedding?
• Word level vs. Subword-level vs. (sub-)character-level
 In English (using alphabet)
• hello (word level)
• hel ##lo (subword level ~ word piece)
• h e l l o (character level)
 In Korean (using Hangul)
• 반갑다 (pan-kap-ta, word level)
• 반가- / -ㅂ- / -다 (morpheme level)
• 반갑 다 (subword level ~ word piece)
• 반 갑 다 (character level ~ word piece)
• ㅂㅏㄴㄱㅏㅂㄷㅏ# (Jamo level)
2
* Jamo: letters of Korean Hangul, where Ja denotes consonants and Mo denotes vowels
Why character-level embedding?
• On Korean (sub-)character-level (Jamo) system
Example: 반 (pan), a single character block (see the decomposition sketch below)
First sound (cho-seng): ㅂ
Second sound (cung-seng): ㅏ
Third sound (cong-seng): ㄴ
Structure: {Syllable: CV(C)}
# First sounds (C): 19
# Second sounds (V): 21
# Third sounds (C): 27 + ‘ ‘ (empty)
Total: 19 * 21 * 28 = 11,172 possible characters!
3
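As a side note, this block structure can be recovered directly from Unicode: precomposed Hangul syllables occupy U+AC00–U+D7A3, indexed as (first × 21 + second) × 28 + third. A minimal Python sketch for illustration only (the slides later use the Hangul toolkit for this step):

```python
# Minimal sketch: decompose a precomposed Hangul syllable into its
# first/second/third sound indices (cho/cung/cong) via Unicode arithmetic.
CHO, JUNG, JONG = 19, 21, 28  # first sounds, second sounds, third sounds (incl. empty)

def decompose(syllable: str):
    code = ord(syllable) - 0xAC00          # offset within the Hangul Syllables block
    if not 0 <= code < CHO * JUNG * JONG:  # 11,172 possible blocks
        return None                        # not a precomposed syllable (e.g., ^^, ㅠ)
    cho, rest = divmod(code, JUNG * JONG)
    jung, jong = divmod(rest, JONG)
    return cho, jung, jong                 # jong == 0 means the empty third sound

print(decompose("반"))  # (7, 0, 4) -> ㅂ + ㅏ + ㄴ
```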
Why character-level embedding?
• Research questions
 To what extent can we consider (sub-)character-level embedding styles in Korean NLP?
 Which embedding best fits sentence classification tasks (binary, multi-class)?
 How and why does performance differ between the embeddings?
4
Why character-level embedding?
• Previous approaches
 Zhang et al. (2017)
• Which encoding is the best for text classification in C/E/J/K?
• Mistake in the article! (the claim that "character-level features far outperform the word-level features ...") Fortunately, this does not deter the goal of our study
• (Figure: test errors reported in the paper)
• We use no morphological analysis, only decomposition of the blocks!
5
Why character-level embedding?
• Previous approaches
 Shin et al. (2017): Jamo (sub-character)-level padding
 Cho et al. (2018c): Jamo-level + solely used Jamos
 Cho et al. (2018a)-Sparse: About 2.5K frequently used characters in a conversation-style corpus
 Cho et al. (2018a)-Dense: Utilizing subword representations
in dense embedding (fastText) ← Pretrained!
 Song et al. (2018): Multi-hot encoding
6
Why character-level embedding?
• On Korean (sub-)character-level embedding
 Sparse embedding
• One-hot or multi-hot representation
• Narrow but long sequence for Jamo-level features
• Wide but shorter sequence for character-level features
 Dense embedding
• word2vec-style representation using skip-gram
• Narrow and short sequences, especially for character-level features
– The inventory of Jamos is too small to train them as ‘words’
– But for characters, roughly 2,500 tokens appear in real-life usage
» A kind of subword/word piece!
7
Task description
• Two classification tasks
 Sentiment analysis (NSMC): binary
• aim to cover lexical semantics
 Intention identification (3i4K): 7-class
• aim to cover syntax-semantics
 Why classification?
• Easy/fast to train and featurize
• Results are clearly analyzable (straightforward metrics such as F1 and accuracy)
• Featurization methodology can be extended to other tasks
– Role labeling, entity recognition, translation, generation etc.
8
Task description
• Sentiment analysis
 Naver sentiment movie corpus (NSMC)
• Widely used benchmark for evaluation of Korean LMs
• Annotation follows Maas et al. (2011)
• Positive label for reviews with score > 8 and negative for < 5
– Neutral reviews are removed; thus BINARY classification
• Contains various non-Jamo symbols
– e.g., ^^, @@, ...
• 150K/50K samples for training/test, respectively (see the loading sketch below)
• https://github.com/e9t/nsmc
9
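For reference, a minimal loading sketch for NSMC, assuming the tab-separated id / document / label layout of the files in the repository linked above (file and column names are taken from that repository, not from the slides):

```python
import pandas as pd

# Minimal sketch: load the NSMC splits (tab-separated; label 0 = negative, 1 = positive).
train = pd.read_csv("ratings_train.txt", sep="\t").dropna(subset=["document"])
test = pd.read_csv("ratings_test.txt", sep="\t").dropna(subset=["document"])

print(len(train), len(test))           # roughly 150K / 50K reviews
print(train["label"].value_counts())   # approximately balanced positive/negative
```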
Task description
• Intention identification
 Intonation-aided intention identification for Korean (3i4K)
• Recently distributed open data for speech act classification (Cho et al., 2018b)
• Total seven categories
– Fragments, statement, question, command, rhetorical question
(RQ), rhetorical command (RC), intonation-dependent utterances
• Contains only full Hangul characters (no sole sub-characters nor non-letters)
• https://github.com/warnikchow/3i4k
10
Experiment
• Feature engineering
 Sequence length
• NSMC: 420 (for Jamo-level) and 140 (for character-level)
• 3i4K: 240 (for Jamo-level) and 80 (for character-level)
 Sequence width
• Shin et al. (2017): 67 = 19 + 21 + 27 (‘ ‘ zero-padded; see the sketch after this slide)
• Cho et al. (2018c): 118 = Shin et al. (2017) + 51 (solely-used
Jamos)
– e.g., ㅜ, ㅠ, ㅡ, ㅋ, ...
• Cho et al. (2018a) – Sparse: 2,534
• Cho et al. (2018a) – Dense: 100
– length-1 subwords only!
• Song et al. (2018): 67 (specifically, 2- or 3-hot)
11
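To make the length/width combinations above concrete, here is a minimal sketch of a sparse Jamo-level feature matrix in the Shin et al. (2017) setting (width 67 = 19 + 21 + 27 with the empty third sound zero-padded, NSMC length 420). The column offsets and the handling of non-syllable symbols are assumptions for illustration, not the authors' exact featurizer:

```python
import numpy as np

# Minimal sketch: one-hot Jamo-level features (Shin et al., 2017 style).
# Columns 0-18: first sounds, 19-39: second sounds, 40-66: third sounds;
# the empty third sound is left zero-padded.
WIDTH, MAX_LEN = 67, 420   # NSMC Jamo-level setting above

def jamo_onehot(sentence: str) -> np.ndarray:
    feats = np.zeros((MAX_LEN, WIDTH), dtype=np.float32)
    pos = 0
    for ch in sentence:
        code = ord(ch) - 0xAC00
        if not 0 <= code < 11172 or pos + 3 > MAX_LEN:
            continue                        # skip non-syllable symbols in this sketch
        cho, rest = divmod(code, 21 * 28)
        jung, jong = divmod(rest, 28)
        feats[pos, cho] = 1.0               # first sound
        feats[pos + 1, 19 + jung] = 1.0     # second sound
        if jong:                            # third sound (0 = empty, zero-padded)
            feats[pos + 2, 40 + jong - 1] = 1.0
        pos += 3
    return feats

print(jamo_onehot("반갑다").shape)  # (420, 67); 3 characters -> 9 Jamo slots used
```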
Experiment
• Implementation
 Bidirectional long short-term memory (BiLSTM; Schuster and Paliwal, 1997)
• Representative recurrent neural network (RNN) model
• Strong at representing sequential information
 Self-attentive embedding (SA; Lin et al., 2017)
• Different from self-attention, but frequently utilized in sentence
embedding
• Utilizes a context vector to make up an attention weight layer
12
• BiLSTM Architecture
 Input dimension: (L, D)
 RNN hidden layer width: 64 (= 32 × 2, bidirectional; see the Keras sketch below)
 The width of FCN connected to the last hidden layer: 128 (Activation: ReLU)
 Output layer width: N (Activation: softmax)
• SA Architecture (Figure)
 Input dimension identical
 Context vector width: 64
(Activation: ReLU, Dropout 0.3)
 Additional MLPs and Dropouts
after a weighted sum
13
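A minimal Keras sketch of the BiLSTM variant described above: input (L, D), a bidirectional LSTM with 32 units per direction (64 in total), a 128-unit ReLU layer on the final hidden state, and an N-way softmax output. Using only the final hidden state and omitting dropout here are assumptions for illustration, not the authors' exact model:

```python
from tensorflow import keras
from tensorflow.keras import layers

# Minimal sketch of the BiLSTM classifier on this slide.
# L = sequence length, D = feature width, N = number of classes.
def build_bilstm(L: int, D: int, N: int) -> keras.Model:
    inputs = keras.Input(shape=(L, D))
    hidden = layers.Bidirectional(layers.LSTM(32))(inputs)   # 32 x 2 = 64
    hidden = layers.Dense(128, activation="relu")(hidden)    # FC on the last hidden state
    outputs = layers.Dense(N, activation="softmax")(hidden)
    return keras.Model(inputs, outputs)

model = build_bilstm(L=420, D=67, N=2)  # e.g., NSMC with Jamo-level features
model.summary()
```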
Experiment
• Implementation
 Python 3, Keras, Hangul toolkit, fastText
• Keras (Chollet et al., 2015) for NN training
– TensorFlow backend; very concise implementation
• Hangul toolkit for decomposing the characters
– Decomposes characters into sub-character sequence (length x 3)
• fastText (Bojanowski et al., 2016) for dense character-level
embeddings
– Dense character vector obtained from a drama script (2M lines)
– Appropriate for colloquial expressions
 Optimizer: Adam (learning rate 5e-4; see the sketch after this slide)
 Loss function: Categorical cross-entropy
 Batch size: 64 for NSMC, 16 for 3i4K
 Device: Nvidia Tesla M40 24GB
14
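A hedged sketch of this training setup: Adam at learning rate 5e-4, categorical cross-entropy, and the stated batch sizes, plus fastText skip-gram training for the 100-dimensional dense character vectors. The placeholder data, epoch count, and corpus file name are illustrative assumptions, not the authors' settings:

```python
import numpy as np
import fasttext
from tensorflow import keras
from tensorflow.keras import layers

# Rebuild the BiLSTM sketch so this snippet stands alone (see the earlier sketch).
L, D, N = 420, 67, 2
inputs = keras.Input(shape=(L, D))
x = layers.Dense(128, activation="relu")(layers.Bidirectional(layers.LSTM(32))(inputs))
model = keras.Model(inputs, layers.Dense(N, activation="softmax")(x))

# Training configuration from this slide; epoch count is an assumption.
model.compile(optimizer=keras.optimizers.Adam(learning_rate=5e-4),   # Adam 5e-4
              loss="categorical_crossentropy", metrics=["accuracy"])
x_train = np.random.rand(64, L, D).astype("float32")                 # placeholder features
y_train = keras.utils.to_categorical(np.random.randint(0, N, 64), N)
model.fit(x_train, y_train, batch_size=64, epochs=1)                 # 64 for NSMC, 16 for 3i4K

# Dense character-level vectors via fastText skip-gram (dimension 100).
# 'drama_script_chars.txt' is a hypothetical file: one sentence per line,
# characters separated by spaces.
char_model = fasttext.train_unsupervised("drama_script_chars.txt",
                                         model="skipgram", dim=100)
print(char_model.get_word_vector("반").shape)  # (100,)
```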
Result & Analysis
• Result
 Only accuracy is reported for NSMC (positive/negative ratio 5:5)
• Why lower than the results reported in the literature (about 0.88)?
– Because no non-letter tokens were utilized...
» And the data is very sensitive to non-letter expressions (emojis and solely used sub-characters, e.g., ㅠㅠ, ㅋㅋ)
 F1 score: harmonic mean of precision and recall (see the sketch below)
15
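For clarity, the metrics mentioned above as computed with scikit-learn (a generic sketch, not the authors' evaluation code): accuracy for the balanced binary NSMC task, and F1 (the harmonic mean of precision and recall, averaged over classes) presumably for the 7-class 3i4K task:

```python
from sklearn.metrics import accuracy_score, f1_score

# Toy illustration of the two metrics on dummy labels.
y_true = [0, 1, 1, 0, 1]
y_pred = [0, 1, 0, 0, 1]
print(accuracy_score(y_true, y_pred))             # 0.8
print(f1_score(y_true, y_pred, average="macro"))  # per-class harmonic mean of
                                                  # precision and recall, then averaged
```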
Result & Analysis
• Analysis 1: Performance comparison
 Dense character-level embedding outperforms sparse ones
• Injection of distributional information (word2vec-like) into the tokens
• Some characters function as a case marker or a short ‘word’
 One-hot/Multi-hot
character-level?
• No additional information
• Would be powerful if the dataset were bigger and more balanced
 Low performance of Jamo-level features in NSMC?
• Decomposition meaningful in syntax-semantic task (rather than
lexical semantics)?
 Using self-attention greatly improves Jamo-level embeddings
16
Result & Analysis
• Analysis 2: Using self-attentive embedding
 The benefit of SA is most pronounced for Jamo-level features
 It is least pronounced for one-hot character-level encoding
 Why?
• Decomposability of the blocks
– How the sub-character information is projected onto embedding
• e.g., 이상한 (i-sang-han, strange)
– In morpheme: 이상하 (i-sang-ha, the root) + -ㄴ (-n, a particle)
– Presence and role of the morphemes are pointed out
– Not guaranteed in block-preserving networks
– Strengthens syntax-semantic analysis?
17
Result & Analysis
• Analysis 3: Decomposability vs. Local semantics
 Disadvantage of character-level embeddings:
• Characters cannot be decomposed, even for multi-hot encoding
 Then, where does their outperformance come from?
• It seems to originate in the preservation of letter clusters
– which stably indicates where token separation takes place, e.g., for 반갑다 (pan-kap-ta, hello),
» (Jamo-level) ㅂ/ㅏ/ㄴ/ㄱ/ㅏ/ㅂ/ㄷ/ㅏ/<empty>
» (character-level) <ㅂㅏㄴ><ㄱㅏㅂ><ㄷㅏ>
– The tendency may differ if 1) a sub-character-level word piece model (byte pair encoding) is implemented, or 2) sub-character properties (1st & 3rd sounds) are additionally attached to tokens
18
Result & Analysis
• Analysis 4: Computation efficiency
 Computation for NSMC models
• Jamo-level
– Moderate parameter
size, but slow in
training
• Dense/Multi-hot
– Smaller parameter size (in case of SA)
– Faster training time
– Equal to or better performance
19
Discussion
• Primary goal of the paper:
 To search for a Jamo/character-level embedding that best fits
with the given Korean NLP task
• The utility of the comparison result
 Can it also be applied to Japanese/Chinese NLP?
• Japanese: morae (e.g., the small tsu) roughly match the third sound of Korean
• Chinese/Japanese: Hanzi or Kanji can be further decomposed into smaller glyph components (Nguyen et al., 2017)
– e.g., 鯨 "whale" into 魚 "fish" and 京 "capital city"
• Many South/Southeast Asian languages
– Composition of consonants and vowels
– Maybe decomposing these properties is better than ...?
20
Done & Afterward
• Reviewed five (sub-)character-level embeddings for a character-rich and agglutinative language (Korean)
 Dense and multi-hot character-level representations perform best
• for the dense one, distributional information probably matters
 Multi-hot has potential to be utilized beyond the given tasks
• conciseness & computational efficiency
 Sub-character-level features are useful in tasks that require morphological decomposition
• they have potential to be improved via a word piece approach or information attachment
 The overall tendency is useful for text processing in other character-rich languages with conjunct forms in their writing systems
21
Reference (order of appearance)
• Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification.
In Advances in neural information processing systems, pages 649–657.
• Haebin Shin, Min-Gwan Seo, and Hyeongjin Byeon. 2017. Korean alphabet level convolution neural network for
text classification. In Proceedings of Korea Computer Congress 2017 [in Korean], pages 587–589.
• Yong Woo Cho, Gyu Su Han, and Hyuk Jun Lee. 2018c. Character-level bi-directional LSTM-CNN model for movie rating prediction. In Proceedings of Korea Computer Congress 2018 [in Korean], pages 1009–1011.
• Won Ik Cho, Sung Jun Cheon, Woo Hyun Kang, Ji Won Kim, and Nam Soo Kim. 2018a. Real-time automatic
word segmentation for user-generated text. arXiv preprint arXiv:1810.13113.
• Won Ik Cho, Hyeon Seung Lee, Ji Won Yoon, Seok Min Kim, and Nam Soo Kim. 2018b. Speech intention
understanding in a head-final language: A disambiguation utilizing intonation-dependency. arXiv preprint
arXiv:1811.04231.
• Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal
Processing, 45(11):2673–2681.
• Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio. 2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
• Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with
subword information. arXiv preprint arXiv:1607.04606.
• Francois Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
• Viet Nguyen, Julian Brooke, and Timothy Baldwin. 2017. Sub-character neural language modelling in Japanese. In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 148–153.
22
Thank you!
