1. Human Interface Laboratory
Investigating an Effective Character-level Embedding in Korean Sentence Classification
2019. 9. 13 @ PACLIC 33
Won Ik Cho, Seok Min Kim, Nam Soo Kim
2. Contents
• Why character-level embedding?
Case of Korean writing system
Previous approaches
• Task description
• Experiment
Feature engineering
Implementation
• Result & Analysis
• Done and afterward
3. Why character-level embedding?
• Word level vs. Subword-level vs. (sub-)character-level
In English (using alphabet)
• hello (word level)
• hel ##lo (subword level ~ word piece)
• h e l l o (character level)
In Korean (using Hangul)
• 반갑다 (pan-kap-ta, word level)
• 반가- / -ㅂ- / -다 (morpheme level)
• 반갑 다 (subword level ~ word piece)
• 반 갑 다 (character level ~ word piece)
• ㅂㅏㄴㄱㅏㅂㄷㅏ# (Jamo level)
* Jamo: letters of Korean Hangul, where Ja denotes consonants and Mo denotes vowels
4. Why character-level embedding?
• On Korean (sub-)character-level (Jamo) system
반: a character (pan), composed of
First sound (cho-seng): ㅂ
Second sound (cung-seng): ㅏ
Third sound (cong-seng): ㄴ
Structure: {Syllable: CV(C)}
# First sounds (C): 19
# Second sounds (V): 21
# Third sounds (C): 27 + ' ' (empty coda)
Total: 19 * 21 * 28 = 11,172 characters!
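To make this arithmetic concrete, here is a minimal Python sketch (not part of the original slides) that decomposes precomposed Hangul syllables into first/second/third sounds with plain Unicode arithmetic; the example word 반갑다 and the helper name are purely illustrative.

```python
# Minimal sketch: decompose precomposed Hangul syllables (U+AC00..U+D7A3)
# into first/second/third sounds via Unicode arithmetic.
# 11,172 = 19 (first) * 21 (second) * 28 (third, incl. empty coda).

CHO = list("ㄱㄲㄴㄷㄸㄹㅁㅂㅃㅅㅆㅇㅈㅉㅊㅋㅌㅍㅎ")                      # 19 first sounds
JUNG = list("ㅏㅐㅑㅒㅓㅔㅕㅖㅗㅘㅙㅚㅛㅜㅝㅞㅟㅠㅡㅢㅣ")                  # 21 second sounds
JONG = [""] + list("ㄱㄲㄳㄴㄵㄶㄷㄹㄺㄻㄼㄽㄾㄿㅀㅁㅂㅄㅅㅆㅇㅈㅊㅋㅌㅍㅎ")  # 27 third sounds + empty

def decompose(syllable: str):
    """Return the (first, second, third) Jamos of one precomposed syllable."""
    code = ord(syllable) - 0xAC00
    if not 0 <= code < 11172:
        return (syllable,)              # pass non-Hangul symbols through unchanged
    cho, rest = divmod(code, 21 * 28)
    jung, jong = divmod(rest, 28)
    return CHO[cho], JUNG[jung], JONG[jong]

if __name__ == "__main__":
    for ch in "반갑다":
        print(ch, decompose(ch))
    # 반 ('ㅂ', 'ㅏ', 'ㄴ'), 갑 ('ㄱ', 'ㅏ', 'ㅂ'), 다 ('ㄷ', 'ㅏ', '')
```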
5. Why character-level embedding?
• Research questions
To what extent can we conceive of (sub-)character-level embedding styles in Korean NLP?
What kind of embedding best fits the sentence classification tasks (binary, multi-class)?
How and why does the performance differ between the embeddings?
6. Why character-level embedding?
• Previous approaches
Zhang et al. (2017)
• Which encoding is the best for text classification in C/E/J/K?
[Figure: test errors per encoding, reported by Zhang et al.]
Mistake in the article! ("Character-level features far outperform the word-level features ..."); fortunately, this does not deter the goal of our study.
We use no morphological analysis, but only decomposition of the blocks!
7. Why character-level embedding?
• Previous approaches
Shin et al. (2017): Jamo (sub-character)-level padding
Cho et al. (2018c): Jamo level + solely used Jamos
Cho et al. (2018a)-Sparse: about 2,500 frequently used characters in a conversation-style corpus
Cho et al. (2018a)-Dense: utilizing subword representations in dense embedding (fastText) ← pretrained!
Song et al. (2018): multi-hot encoding
8. Why character-level embedding?
• On Korean (sub-)character-level embedding
Sparse embedding
• One-hot or multi-hot representation
• Narrow but long sequence for Jamo-level features
• Wide but shorter sequence for character-level features
Dense embedding
• word2vec-style representation utilizing skip-gram
• Narrow and short sequence, especially for character-level features (see the sketch below)
– The number of Jamos is too small to train them as 'words'
– But for characters, around 2,500 tokens are in real-life use
» A kind of subword/word piece!
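As a rough illustration of how such dense character vectors can be obtained, the sketch below trains a fastText skip-gram model (via the official fasttext package) over a corpus in which each syllable character is space-separated into its own token. The corpus file name, the preprocessing helper, and the hyperparameters are assumptions for illustration, not the authors' actual pipeline.

```python
# Minimal sketch (not the authors' exact pipeline): obtain 100-dim dense
# character vectors by training a fastText skip-gram model over a corpus
# where each Hangul syllable character is treated as a token.
import fasttext  # pip install fasttext

def space_separate_chars(src_path: str, dst_path: str) -> None:
    """Rewrite a raw corpus so that every character becomes a whitespace-separated token."""
    with open(src_path, encoding="utf-8") as fin, \
         open(dst_path, "w", encoding="utf-8") as fout:
        for line in fin:
            fout.write(" ".join(line.strip()) + "\n")

# 'script.txt' is a hypothetical colloquial corpus (e.g., drama script lines).
space_separate_chars("script.txt", "script_chars.txt")
model = fasttext.train_unsupervised("script_chars.txt", model="skipgram", dim=100)
vec = model.get_word_vector("반")   # 100-dim dense vector for a single character
```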
9. Task description
• Two classification tasks
Sentiment analysis (NSMC): binary
• aims to cover lexical semantics
Intention identification (3i4K): 7-class
• aims to cover syntax-semantics
Why classification?
• Easy/fast to train and featurize
• Results are clearly analyzable (straightforward metrics such as F1 and accuracy)
• Featurization methodology can be extended to other tasks
– Role labeling, entity recognition, translation, generation etc.
10. Task description
• Sentiment analysis
Naver sentiment movie corpus (NSMC)
• Widely used benchmark for evaluation of Korean LMs
• Annotation follows Maas et al. (2011)
• Positive label for reviews with score > 8 and negative for < 5
– Neutral reviews are removed; thus BINARY classification
• Contains various non-Jamo symbols
– e.g., ^^, @@, ...
• 150K/50K for training/test each
• https://github.com/e9t/nsmc
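A minimal loading sketch, assuming the tab-separated format (id / document / label columns) distributed in the NSMC repository; the variable names are illustrative, not part of the original slides.

```python
# Minimal sketch: load the NSMC splits, assuming the tab-separated
# ratings_train.txt / ratings_test.txt files (id, document, label columns).
import pandas as pd

train = pd.read_csv("ratings_train.txt", sep="\t")   # 150K reviews
test = pd.read_csv("ratings_test.txt", sep="\t")     # 50K reviews

print(train.shape, train["label"].value_counts())    # binary labels: 0 (neg) / 1 (pos)
texts = train["document"].astype(str).tolist()
labels = train["label"].tolist()
```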
11. Task description
• Intention identification
Intonation-aided intention identification for Korean (3i4K)
• Recently distributed open data for speech act classification (Cho et al., 2018b)
• Total seven categories
– Fragments, statement, question, command, rhetorical question
(RQ), rhetorical command (RC), intonation-dependent utterances
• Contains only full Hangul characters (no sole sub-characters nor non-letters)
• https://github.com/warnikchow/3i4k
12. Experiment
• Feature engineering
Sequence length
• NSMC: 420 (for Jamo-level) and 140 (for character-level)
• 3i4K: 240 (for Jamo-level) and 80 (for character-level)
Sequence width
• Shin et al. (2017): 67 = 19 + 21 + 27 (' ' zero-padded; a minimal sketch of this layout follows the list)
• Cho et al. (2018c): 118 = Shin et al. (2017) + 51 (solely-used Jamos)
– e.g., ㅜ, ㅠ, ㅡ, ㅋ, ...
• Cho et al. (2018a) – Sparse: 2,534
• Cho et al. (2018a) – Dense: 100
– length-1 subwords only!
• Song et al. (2018): 67 (specifically, 2- or 3-hot)
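A minimal sketch of one plausible realization of the Shin et al. (2017)-style sparse features: every syllable contributes three time steps of 67-dim one-hot vectors (19 first, 21 second, 27 third sounds), with the empty coda left as a zero vector. The exact index layout and the handling of non-Hangul symbols here are assumptions, not the authors' code.

```python
# Minimal sketch (assumed layout): each syllable yields 3 time steps;
# each step is a 67-dim one-hot vector (19 first + 21 second + 27 third sounds),
# with the empty coda zero-padded.
import numpy as np

WIDTH = 19 + 21 + 27        # 67

def jamo_onehot_sequence(sentence: str, max_len: int = 420) -> np.ndarray:
    feats = np.zeros((max_len, WIDTH), dtype=np.float32)
    t = 0
    for ch in sentence:
        code = ord(ch) - 0xAC00
        if not 0 <= code < 11172 or t + 3 > max_len:
            continue                      # this sketch simply skips non-Hangul symbols
        cho, rest = divmod(code, 21 * 28)
        jung, jong = divmod(rest, 28)
        feats[t, cho] = 1.0               # first sound
        feats[t + 1, 19 + jung] = 1.0     # second sound
        if jong > 0:                      # empty coda stays a zero vector
            feats[t + 2, 40 + jong - 1] = 1.0
        t += 3
    return feats

X = np.stack([jamo_onehot_sequence(s) for s in ["반갑다", "이상한 영화"]])
print(X.shape)   # (2, 420, 67)
```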
13. Experiment
• Implementation
Bidirectional long short-term memory (BiLSTM; Schuster and Paliwal, 1997)
• Representative recurrent neural network (RNN) model
• Strong at representing sequential information
Self-attentive embedding (SA; Lin et al., 2017)
• Different from self-attention, but frequently utilized in sentence
embedding
• Utilizes a context vector to make up an attention weight layer
14. • BiLSTM Architecture
Input dimension: (L, D)
RNN hidden layer width: 64 (= 32 × 2, bidirectional)
Width of the FCN connected to the last hidden layer: 128 (activation: ReLU)
Output layer width: N (activation: softmax)
• SA Architecture (figure)
Input dimension identical
Context vector width: 64 (activation: ReLU, dropout 0.3)
Additional MLPs and dropouts after the weighted sum
(a minimal code sketch of both architectures follows)
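A minimal Keras (TensorFlow 2) sketch of the two architectures, following the widths given above (BiLSTM 32 × 2 = 64, FCN 128 with ReLU, context width 64 with ReLU and dropout 0.3, softmax output); the single-hop attention wiring and the MLP stack after the weighted sum are assumptions rather than the authors' exact implementation.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_bilstm(seq_len: int, feat_dim: int, n_classes: int) -> models.Model:
    # BiLSTM baseline: last hidden state -> FC(128, ReLU) -> softmax
    inp = layers.Input(shape=(seq_len, feat_dim))
    h = layers.Bidirectional(layers.LSTM(32))(inp)      # 32 per direction = 64 total
    h = layers.Dense(128, activation="relu")(h)
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inp, out)

def build_bilstm_sa(seq_len: int, feat_dim: int, n_classes: int) -> models.Model:
    # BiLSTM + self-attentive embedding (single-hop, Lin et al. (2017)-style)
    inp = layers.Input(shape=(seq_len, feat_dim))
    seq = layers.Bidirectional(layers.LSTM(32, return_sequences=True))(inp)
    ctx = layers.Dense(64, activation="relu")(seq)      # context projection, width 64
    ctx = layers.Dropout(0.3)(ctx)
    score = layers.Dense(1)(ctx)                        # (batch, T, 1) attention logits
    attn = layers.Softmax(axis=1)(score)                # attention weights over time steps
    # weighted sum of hidden states -> additional MLP -> softmax
    pooled = layers.Lambda(lambda x: tf.reduce_sum(x[0] * x[1], axis=1))([seq, attn])
    h = layers.Dense(128, activation="relu")(pooled)
    h = layers.Dropout(0.3)(h)
    out = layers.Dense(n_classes, activation="softmax")(h)
    return models.Model(inp, out)
```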
15. Experiment
• Implementation
Python 3, Keras, Hangul toolkit, fastText
• Keras (Chollet et al., 2015) for NN training
– TensorFlow backend; very concise implementation
• Hangul toolkit for decomposing the characters
– Decomposes characters into sub-character sequences (length × 3)
• fastText (Bojanowski et al., 2016) for dense character-level
embeddings
– Dense character vector obtained from a drama script (2M lines)
– Appropriate for colloquial expressions
Optimizer: Adam 5e-4
Loss function: Categorical cross-entropy
Batch size: 64 for NSMC, 16 for 3i4K
Device: Nvidia Tesla M40 24GB
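Putting the pieces together, a hedged sketch of the training configuration listed above for the Jamo-level NSMC setting (Adam at 5e-4, categorical cross-entropy, batch size 64); it reuses the jamo_onehot_sequence, build_bilstm_sa, texts, and labels names from the earlier sketches, and the epoch count and validation split are illustrative.

```python
import numpy as np
from tensorflow.keras.optimizers import Adam
from tensorflow.keras.utils import to_categorical

# Jamo-level NSMC example: sequences of length 420, width 67, 2 classes.
model = build_bilstm_sa(seq_len=420, feat_dim=67, n_classes=2)   # from the earlier sketch
model.compile(optimizer=Adam(learning_rate=5e-4),
              loss="categorical_crossentropy",
              metrics=["accuracy"])

X_train = np.stack([jamo_onehot_sequence(s) for s in texts])     # features from the earlier sketches
Y_train = to_categorical(labels, num_classes=2)                  # one-hot binary labels
model.fit(X_train, Y_train, batch_size=64, epochs=10, validation_split=0.1)
```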
16. Result & Analysis
• Result
Only accuracy is reported for NSMC (positive/negative ratio is 5:5)
• Why lower than the results reported in the literature (about 0.88)?
– Since no non-letter tokens were utilized...
» and the data is very sensitive to non-letter expressions (emojis and solely used sub-characters, e.g., ㅠㅠ, ㅋㅋ)
F1 score: harmonic mean of precision and recall (2PR / (P + R))
17. Result & Analysis
• Analysis 1: Performance comparison
Dense character-level embeddings outperform sparse ones
• Injection of distributional (word2vec-like) information into the tokens
• Some characters function as case markers or short 'words'
One-hot/multi-hot character-level?
• No additional information
• Could be powerful with a bigger and more balanced dataset
Low performance of Jamo-level features in NSMC?
• Decomposition may be meaningful in syntax-semantic tasks (rather than lexical semantics)?
Using self-attention greatly improves Jamo-level embeddings
18. Result & Analysis
• Analysis 2: Using self-attentive embedding
Most emphasized in Jamo-level feature
Least emphasized in One-hot character-level encoding
Why?
• Decomposability of the blocks
– How the sub-character information is projected onto the embedding
• e.g., 이상한 (i-sang-han, strange)
– In morphemes: 이상하 (i-sang-ha, the root) + -ㄴ (-n, a particle)
– The presence and role of the morphemes can be pointed out
– Not guaranteed in block-preserving networks
– Strengthens syntax-semantic analysis?
19. Result & Analysis
• Analysis 3: Decomposability vs. Local semantics
Disadvantage of character-level embeddings:
• characters cannot be decomposed, even for multi-hot
Then where does the outperformance come from?
• It seems to originate in the preservation of the cluster of letters
– which stably indicates where token separation takes place, e.g., for 반갑다 (pan-kap-ta, hello):
» (Jamo level) ㅂ/ㅏ/ㄴ/ㄱ/ㅏ/ㅂ/ㄷ/ㅏ/<empty>
» (character level) <ㅂㅏㄴ><ㄱㅏㅂ><ㄷㅏ>
– The tendency will differ if 1) a sub-character-level word piece model (byte pair encoding) is implemented or 2) the sub-character property (1st & 3rd sounds) is additionally attached to the tokens
20. Result & Analysis
• Analysis 4: Computation efficiency
Computation for NSMC models
• Jamo level: moderate parameter size, but slow in training
• Dense/multi-hot: smaller parameter size (in the case of SA), faster training time, equal or better performance
21. Discussion
• Primary goal of the paper:
To search for a Jamo/character-level embedding that best fits
with the given Korean NLP task
• The utility of the comparison result
Can it also be applied to Japanese/Chinese NLP?
• Japanese: morae (e.g., the small tsu) that roughly match the third sound of Korean
• Chinese/Japanese: Hanzi or Kanji can further be decomposed into smaller glyph forms (Nguyen et al., 2017)
– e.g., 鯨 'whale' into 魚 'fish' and 京 'capital city'
• Many South/Southeast Asian languages
– Composition of consonants and vowels
– Maybe decomposing the properties is better than ...?
22. Done & Afterward
• Reviewed five (sub-)character-level embeddings for a character-rich and agglutinative language (Korean)
Dense and multi-hot character-level representations perform best
• for the dense one, distributional information may matter
Multi-hot has the potential to be utilized beyond the given tasks
• conciseness & computational efficiency
Sub-character-level features are useful in tasks that require morphological decomposition
• they have the potential to be improved via a word piece approach or information attachment
The overall tendency is useful for the text processing of other character-rich languages with conjunct forms in the writing system
23. Reference (order of appearance)
• Xiang Zhang, Junbo Zhao, and Yann LeCun. 2015. Character-level convolutional networks for text classification.
In Advances in neural information processing systems, pages 649–657.
• Haebin Shin, Min-Gwan Seo, and Hyeongjin Byeon. 2017. Korean alphabet level convolution neural network for
text classification. In Proceedings of Korea Computer Congress 2017 [in Korean], pages 587–589.
• Yong Woo Cho, Gyu Su Han, and Hyuk Jun Lee. 2018c. Character level bi-directional lstm-cnn model for movie
rating prediction. In Proceedings of Korea Computer Congress 2018 [in Korean], pages 1009–1011.
• Won Ik Cho, Sung Jun Cheon, Woo Hyun Kang, Ji Won Kim, and Nam Soo Kim. 2018a. Real-time automatic
word segmentation for user-generated text. arXiv preprint arXiv:1810.13113.
• Won Ik Cho, Hyeon Seung Lee, Ji Won Yoon, Seok Min Kim, and Nam Soo Kim. 2018b. Speech intention
understanding in a head-final language: A disambiguation utilizing intonation-dependency. arXiv preprint
arXiv:1811.04231.
• Mike Schuster and Kuldip K Paliwal. 1997. Bidirectional recurrent neural networks. IEEE Transactions on Signal
Processing, 45(11):2673–2681.
• Zhouhan Lin, Minwei Feng, Cicero Nogueira dos Santos, Mo Yu, Bing Xiang, Bowen Zhou, and Yoshua Bengio.
2017. A structured self-attentive sentence embedding. arXiv preprint arXiv:1703.03130.
• Piotr Bojanowski, Edouard Grave, Armand Joulin, and Tomas Mikolov. 2016. Enriching word vectors with
subword information. arXiv preprint arXiv:1607.04606.
• Francois Chollet et al. 2015. Keras. https://github.com/fchollet/keras.
• Viet Nguyen, Julian Brooke, and Timothy Baldwin. 2017. Sub-character neural language modelling in Japanese.
In Proceedings of the First Workshop on Subword and Character Level Models in NLP, pages 148–153.
overview:
gender bias in NLP – various problems
translation: a real-world problem – examples, e.g., Turkish, Korean..?
How is it treated in previous works?
Why should it be guaranteed?
problem statement: with a KR-EN example
why was it not investigated in previous works?
why is it appropriate for investigating gender bias?
what examples are observed?
construction:
what is to be considered?
formality (걔 vs. 그 사람, casual vs. formal 'that person')
politeness (-어 vs. -어요, informal vs. polite sentence ending)
lexicon sentiment polarity (positive & negative & occupation)
+ things to be considered in ... (not to threaten the fairness)
- Measure?
how the measure is defined, and proved to be bounded (and to have an optimum when the condition fits the ideal case)
concept of Vbias and Sbias – how they are aggregated into the measure << disadvantage?
how the usage is justified despite the disadvantages
the strong points?
- Experiment?
how the EEC is used in evaluation, and how the arithmetic averaging is justified
the result: GT > NP > KT?
- Analysis?
quantitative analysis – Vbias and Sbias, significant with style-related features
qualitative analysis – observed with the case of occupation words
Done: TGBI for KR-EN, with an EEC
Afterward: how can Sbias be considered more explicitly? what if among contexts? how about with other target/source languages?