Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language Models for the Ukrainian Language
1. Statistical and Neural Language Models for the Ukrainian Language
Anastasiia Khaburska
Master Student
Ukrainian Catholic University
Ciklum
a.khaburska@ucu.edu.ua
anastasykhab@gmail.com
Igor Tytyk
Supervisor
NLP Freelance
igor.tytyk@gmail.com
Lviv 2020
Acknowledgements
Co-Supervisor Artem Chernodub
Computational resources Ciklum, Grammarly
Brown-Uk, Lang-Uk,
Dmitriy Chaplinsky, Vasyl Starko
Ukrainian Catholic University
2. Content
1. Introduction
2. Language Modelling review
3. Language Modelling for the Ukrainian language
4. Results
5. Contribution & Future work
3. Introduction
Language Models
The objective of language modelling is to learn a probability distribution over sequences of linguistic units.
CORPUS: a collection of texts.
Computing conditional probabilities of a linguistic unit given its context: P(linguistic unit | context)
Linguistic units seen by the model compose the model's DICTIONARY.
Probability of a sequence:
P( “Захист дипломних робіт на факультеті прикладних наук.” )
(“Defence of diploma theses at the Faculty of Applied Sciences.”)
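The probability of a sequence is usually factorized with the chain rule; in standard notation (a conventional formulation consistent with the P(linguistic unit | context) definition above, not copied from the slide):

P(u_1, u_2, \dots, u_n) = \prod_{i=1}^{n} P(u_i \mid u_1, \dots, u_{i-1})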
4. Introduction
Language Modelling. Use cases
Grammar correction
P(" Хлопець відкрив двері . " ) = 0.32 10-9
P(" Хлопець відчинив двері . " ) = 4.20 10-9
Machine translation
“Who are you?”
P("Хто є ти?") = 1.32 × 10^-9
P("Хто ти є?") = 5.85 × 10^-9
(two candidate translations; the more natural word order receives the higher probability)
Natural language generation
Text summarization
Speech recognition
5. Introduction
Motivation. Research gap
A lot has been done for the English Language
● statistical and neural approaches
Ukrainian language modelling is still lacking
● Published scientific literature
● Publicly available resources
● Established benchmarks
6. Introduction
Goals
To set a number of BASELINES in the task of building LMs
for the Ukrainian language
1. Compose training and evaluation datasets.
2. Explore existing methods for language modeling.
3. Train and evaluate LMs for the Ukrainian language.
4. Suggest a benchmark for Ukrainian language models.
7. Language Modelling review
Corpora
In linguistics and natural language processing (NLP), a CORPUS refers to a collection of texts.
                  Penn Treebank   1 Billion Word   WikiText-2   WikiText-103
Tokens (train)    929 K           829 M            2 M          103 M
Tokens (valid.)   73 K            7 M              217 K        217 K
Tokens (test)     82 K            159 K            245 K        245 K
Vocabulary size   10 000          793 471          33 278       267 735
OOV rate          4.8%            0.28%            2.6%         0.4%

OOV rate = percentage of tokens replaced by an <unk> token
8. Language Modelling review
Models
Statistical approaches. N-gram models

Example corpus:
Сьогодні !
Сьогодні на факультеті прикладних наук…
Чекаємо на Вас Сьогодні о 20.00.
Сьогодні на проспекті Шевченка ...
…

ARPA file (n-grams with their numbers of occurrences):
2-gram
START Сьогодні 3
Вас Сьогодні 1
…
3-gram
START Сьогодні на 2
START Сьогодні ! 1
Вас Сьогодні о 1
...

Token probability estimation:
p(на | START Сьогодні) = ⅔
p(! | START Сьогодні) = ⅓
p(о | START Сьогодні) = 0, so the model backs off: p(о | Сьогодні) = ¼

Smoothing techniques (KenLM)
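A minimal Python sketch (not the thesis code) of how such maximum-likelihood estimates are obtained from n-gram counts, reproducing the probabilities above; the smoothing and backoff weighting implemented in KenLM are omitted here:

from collections import Counter

# Toy corpus from the slide; START marks the sentence beginning.
sentences = [
    ["START", "Сьогодні", "!"],
    ["START", "Сьогодні", "на", "факультеті", "прикладних", "наук"],
    ["START", "Чекаємо", "на", "Вас", "Сьогодні", "о", "20.00"],
    ["START", "Сьогодні", "на", "проспекті", "Шевченка"],
]

bigrams, trigrams = Counter(), Counter()
for s in sentences:
    bigrams.update(zip(s, s[1:]))
    trigrams.update(zip(s, s[1:], s[2:]))

def p(token, context):
    """Maximum-likelihood P(token | context) for a 1- or 2-token context."""
    if len(context) == 2:
        num = trigrams[(*context, token)]
        den = bigrams[tuple(context)]
    else:
        num = bigrams[(*context, token)]
        den = sum(c for (w, _), c in bigrams.items() if w == context[0])
    return num / den if den else 0.0

print(p("на", ("START", "Сьогодні")))   # 2/3
print(p("!", ("START", "Сьогодні")))    # 1/3
print(p("о", ("START", "Сьогодні")))    # 0.0 -> a real model backs off
print(p("о", ("Сьогодні",)))            # 1/4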
9. Language Modelling review
Models
Neural LMs. Recurrent neural networks (RNN).

Dictionary of tokens: Сьогодні, факультеті, проспекті, ..., END
Embedding: each token is mapped to a vector (dimension of embedding × size of dictionary).
Sequence processing: given the input "START Сьогодні на", the network predicts the next token.

Example: Сьогодні на [?]
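A minimal PyTorch sketch of this embedding, LSTM, softmax architecture; the layer sizes and token ids below are illustrative assumptions, not the thesis configuration:

import torch
import torch.nn as nn

class RNNLanguageModel(nn.Module):
    """Embedding -> LSTM -> linear projection over the dictionary."""
    def __init__(self, vocab_size: int, embed_dim: int = 300, hidden: int = 500, layers: int = 2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)          # token -> vector
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers, batch_first=True)
        self.decoder = nn.Linear(hidden, vocab_size)                  # scores for every token

    def forward(self, token_ids):                                     # (batch, seq_len)
        hidden_states, _ = self.lstm(self.embedding(token_ids))
        return self.decoder(hidden_states)                            # (batch, seq_len, vocab)

# "START Сьогодні на" -> distribution over the next token
model = RNNLanguageModel(vocab_size=300_000)
logits = model(torch.tensor([[0, 17, 42]]))            # hypothetical token ids
next_token_probs = torch.softmax(logits[0, -1], dim=-1)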
10. Language Modelling review
Tokenization Levels
Word level:
Програму було започатковано у _#year_ р .
Character level:
^П р о г р а м у ^б у л о ^з а п о ч а т к о в а н о ^у _#year_ ^р .
Subword level (BPE - Byte Pair Encoding):
^Про гра му ^було ^за ^почат ков ано ^у _#year_ ^р .
Subword level (BCN - Bags of Character N-grams):
Each word gets a separate vector = the sum of its character n-gram vectors
Example n=3: Програму {‘^пр’, ‘про’, ‘рог’, ‘огр’, ‘гра’, ‘рам’, ‘аму’, ‘му_’ }
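A small Python sketch (for illustration) that reproduces the character-trigram bag shown above for "Програму"; '^' marks the word start and '_' the word end, as on the slide:

def char_ngrams(word: str, n: int = 3) -> list:
    """Bag of character n-grams with '^' marking the word start and '_' the end."""
    padded = "^" + word.lower() + "_"
    return [padded[i:i + n] for i in range(len(padded) - n + 1)]

print(char_ngrams("Програму"))
# ['^пр', 'про', 'рог', 'огр', 'гра', 'рам', 'аму', 'му_']

The word vector is then the sum of the vectors of these n-grams (BCN / FastText style).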
11. Language Modelling review
Evaluation
A better model is one that assigns a higher probability to the word that actually occurs.
PERPLEXITY: an intrinsic metric for the evaluation of word-level models.
The inverse probability of the text, normalized by the number of words.
Minimizing perplexity is the same as maximizing probability.
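The standard definition behind this slide, for a text of N words w_1 ... w_N:

\mathrm{PPL}(w_1, \dots, w_N) = P(w_1, \dots, w_N)^{-1/N} = \exp\Bigl(-\tfrac{1}{N}\sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1})\Bigr)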
12. Language Modelling for the Ukrainian language
Data
● OOV rate = percentage of tokens replaced by an <unk> token
              Korrespondent + Ukrainian fiction   Brown Ukrainian Corpus
Tokens        262 598 163                         779 000
Vocab. size   300 000                             300 000
OOV rate      1.96%                               3.84%
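A hedged Python sketch of how an OOV rate of this kind can be computed; it assumes the vocabulary is the 300 000 most frequent training tokens (the exact procedure used in the thesis may differ):

from collections import Counter

def oov_rate(train_tokens, eval_tokens, vocab_size=300_000):
    """Share of evaluation tokens that would be replaced by an <unk> token."""
    vocab = {w for w, _ in Counter(train_tokens).most_common(vocab_size)}
    unknown = sum(1 for t in eval_tokens if t not in vocab)
    return unknown / len(eval_tokens)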
13. Language Modelling for the Ukrainian language
Preprocessing
Token        Example
_#year_      1990
_#date_      01.01.2001
_#time_      00:00
_#foreign_   Some Words
_#media_     Most of the media sources (news, magazines, newspapers)
_#number_    1234
_#float_     10,5
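A minimal sketch of how such placeholder substitution could be implemented with regular expressions; the patterns below are illustrative assumptions, not the exact rules used in the thesis:

import re

# Order matters: more specific patterns (dates, times) go before generic numbers.
RULES = [
    (re.compile(r"\b\d{2}\.\d{2}(\.\d{2,4})?\b"), "_#date_"),
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "_#time_"),
    (re.compile(r"\b(1[89]|20)\d{2}\b"), "_#year_"),
    (re.compile(r"\b\d+[.,]\d+\b"), "_#float_"),
    (re.compile(r"\b\d+\b"), "_#number_"),
]

def normalize(text: str) -> str:
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text

print(normalize("Програму було започатковано у 1990 р."))
# Програму було започатковано у _#year_ р.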
15. Language Modelling for the Ukrainian language
Word level: KenLM, LSTM
● OOV replaced with _#unknown_ token
● Vocabulary 300 000 tokens
Model            Parameters                 Training time         Size of file   Perplexity
KenLM (N-gram)   6-gram, Pruning: 000011    ARPA file: 11m 51s    18.2 G         749
LSTM             Embed: 300, layers 3x500   1 epoch: 25h 30m      2.82 G         258
LSTM             Embed: 300, layers 2x500   1 epoch: 23h 50m      2.79 G         275

LSTM models trained for 8 epochs.
16. Language Modelling for the Ukrainian language
FastText (by Facebook AI Research)
FastText provides 300-dimensional embeddings for 157 languages (including Ukrainian), computed using bags of character n-grams.
We took 300 000 pretrained vectors.
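A sketch (not the thesis pipeline) of loading pretrained vectors from the distributed text file, assumed here to be the published cc.uk.300.vec, and freezing them in a PyTorch embedding layer:

import numpy as np
import torch
import torch.nn as nn

def load_fasttext_vectors(path: str, limit: int = 300_000):
    """Read the first `limit` vectors from a fastText .vec text file."""
    words, vectors = [], []
    with open(path, encoding="utf-8") as f:
        next(f)                                    # header line: "<count> <dim>"
        for line in f:
            parts = line.rstrip().split(" ")
            words.append(parts[0])
            vectors.append(np.asarray(parts[1:], dtype=np.float32))
            if len(words) >= limit:
                break
    return words, np.stack(vectors)

words, matrix = load_fasttext_vectors("cc.uk.300.vec")
embedding = nn.Embedding.from_pretrained(torch.from_numpy(matrix), freeze=True)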
17. Language Modelling for the Ukrainian language
Word level: KenLM, LSTM, FastText
Model                      Parameters                 Training time         Size of file   Perplexity
KenLM (N-gram)             6-gram, Pruning: 000011    ARPA file: 11m 51s    18.2 G         749
LSTM                       Embed: 300, layers 3x500   1 epoch: 25h 30m      2.82 G         258
LSTM                       Embed: 300, layers 2x500   1 epoch: 23h 50m      2.79 G         275
LSTM + FastText (frozen)   Embed: 300, layers 2x500   1 epoch: 16h 20m      2.1 G          234

LSTM models trained for 8 epochs.
● Vocabulary 300 000 tokens
18. Language Modelling for the Ukrainian language
Subword level: LSTM
Model   Parameters                  Training time (1 epoch)   Size of model file   Perplexity
LSTM    Embed: 500, layers 3x1024   6h 30m                     0.61 G               426
LSTM    Embed: 500, layers 3x2048   17h 00m                    1.59 G               354

Models trained for 10 epochs.
● Vocabulary 20 000 tokens
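The slides do not name the BPE tooling; as an illustration, a 20 000-token subword vocabulary could be built with SentencePiece roughly as follows (file names are hypothetical):

import sentencepiece as spm

# Train a BPE model on the training corpus (one sentence per line).
spm.SentencePieceTrainer.train(
    input="train.txt",            # hypothetical path to the training text
    model_prefix="uk_bpe",
    vocab_size=20_000,
    model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="uk_bpe.model")
print(sp.encode("Програму було започатковано у 1990 р.", out_type=str))
# the exact segmentation depends on the trained model

Note that SentencePiece marks word starts with '▁' rather than the '^' used on the tokenization slide; the idea is the same.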
19. Contribution & Future work
Contribution
● Composed, preprocessed and described a dataset (262M tokens) sufficient for training
neural LMs for the Ukrainian language. As a benchmark evaluation corpus, we
propose to use publicly available BrUk (779K tokens).
● Set a number of baselines
○ word-level N-gram Kneser-Ney (6-gram, perplexity 749)
○ word-level NN LSTM (perplexity 258)
○ LSTM with pretrained FastText embeddings (perplexity 234).
○ subword-level LSTM (perplexity 354)
● Contributed to the BrUk GitHub project and submitted the preprocessed training set to the LangUk page.
20. Contribution & Future work
Future work
● Incorporate data from other domains.
● Experiment with the Mogrifier LSTM extension.
● Measure language model performance against a diverse hand-crafted set of linguistic tasks.
22. Review question
1. The final training dataset was not made available, so it is impossible to verify its quality. It is mentioned that an effort to publish it is underway, but it had not been completed by the time this review was written.
23. Review question
2. There are issues with perplexity evaluation of the created language models:
○ There is no comparison with similar (in architecture and size) models for English.
State-of-the-art results for English LMs are at least an order of magnitude better. This may
indicate either a flaw of the models, or of the evaluation, or something else. It would also be
nice to perform additional evaluation and analysis on some common (regardless of
language differences) dataset. Such a dataset could be obtained, for example, by
automatically translating an existing English dataset that has a permissive license.
○ It is well known that perplexity on its own is a metric that doesn’t provide sufficient
information about the quality of the model. Usually, additional extrinsic evaluation or manual
error analysis is performed. This was not done in this work.
25. Review question
3. Besides the perplexity analysis, I missed a comprehensive overall comparison of the produced models, with their pros and cons.
26. Review question
4. And some minor details:
○ This sentence is not factually correct: “As a rule, for data-science tasks, the available data
is split into training, validation and test partitions according to the proportion of 90%, 5%
and 5% respectively.” Such a rule, if it exists, may only be applicable to language modeling
projects, as a normal approach will use a totally different split (60, 20, 20 being the usual
one).
○ The removal of sentences longer than 60 tokens from the training set will limit its usefulness. I believe that this was not necessary and could have been addressed with special handling during training.
28. Introduction
Motivation. Research gap
● A lot has been done for the English Language
○ statistical and neural approaches
● Ukrainian language modelling is still lacking
○ Published scientific literature
○ Established benchmarks
○ Publicly available resources
http://nlpprogress.com/english/language_modeling.html
29. Introduction
Language Models
The objective of Language Modelling is to learn a probability distribution over sequences of linguistic units.
Empirical distribution P of a language; learned distribution Q.
S: a sequence of linguistic units.
u_i: the i-th unit (characters, words or phrases).
Conditional probability of a linguistic unit given its context, for example P(u_i | u_1, ..., u_{i-1}).
Linguistic units seen by the model compose the model's dictionary U.
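In standard notation consistent with these definitions (a reconstruction of the usual formulation, not copied from the slide):

Q(S) = \prod_{i=1}^{n} Q(u_i \mid u_1, \dots, u_{i-1}), \qquad S = (u_1, \dots, u_n),\; u_i \in U

and the learned distribution Q is fit to the empirical distribution P by minimizing the cross-entropy

H(P, Q) = -\,\mathbb{E}_{S \sim P}\bigl[\log Q(S)\bigr]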
30. Language Modelling review
Models
Statistical approaches. N-gram models
● ARPA model: contains n-grams and their numbers of occurrences
● Token probability estimation and probability of a sentence
● Smoothing techniques (KenLM)

Neural LMs. Recurrent neural networks (RNN)
● Dictionary (U) of tokens
● Embedding: each token is mapped to a vector
   ○ size of Embedding (m)
   ○ size of Vocabulary (v)
● Sequence processing: a softmax over the output gives a token probability estimate for each token in the vocabulary
31. Language Modelling review
Models
Statistical approaches. N-gram models
ARPA model: contains n-grams and their numbers of occurrences; token probability estimation; probability of a sentence; smoothing techniques (KenLM).

Example corpus:
Сьогодні !
Сьогодні на факультеті прикладних наук…
Чекаємо на Вас Сьогодні о 20.00.
Сьогодні на проспекті Шевченка ...
...

2-gram
START Сьогодні 3
Вас Сьогодні 1
...
3-gram
START Сьогодні на 2
START Сьогодні ! 1
Вас Сьогодні о 1
...

p(на | START Сьогодні) = ⅔
p(! | START Сьогодні) = ⅓
p(о | START Сьогодні) = 0, back off: p(о | Сьогодні) = ¼
32. Language Modelling review
Models
Neural LMs. Recurrent neural networks (RNN).
Dictionary (U) of tokens: Сьогодні, факультеті, проспекті, ..., END
Embedding: each token is mapped to a vector
● size of Embedding (m)
● size of Vocabulary (v)
Sequence processing: a softmax over the output gives a token probability estimate for each token in the vocabulary.

Example: Сьогодні на [?]
Input: START Сьогодні на
33. Language Modelling review
Evaluation
Empirical distribution P of a language; learned distribution Q.
Cross-entropy between P and Q.
Word-level LMs: Perplexity (after applying the chain rule).
Character-level LMs: Bits Per Character (BPC).
Relation between word-level and character-level LM evaluation:
Al-Rfou, Rami et al. (2018). “Character-Level Language Modeling with Deeper Self-Attention”. CoRR abs/1808.04444. arXiv: 1808.04444
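In standard notation (a reconstruction of the usual definitions, not copied from the slide), for a test text of N words w_1 ... w_N or M characters c_1 ... c_M:

H(P, Q) = -\frac{1}{N}\sum_{i=1}^{N} \log_2 Q(w_i \mid w_1, \dots, w_{i-1}), \qquad \mathrm{PPL} = 2^{H(P,Q)}

\mathrm{BPC} = -\frac{1}{M}\sum_{j=1}^{M} \log_2 Q(c_j \mid c_1, \dots, c_{j-1})

A word-level perplexity can then be approximately related to BPC as PPL ≈ 2^(BPC · M/N), where M/N is the average number of characters per word (cf. Al-Rfou et al., 2018).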
34. Language Modelling review
Corpora
● OOV rate = percentage of tokens replaced by an <unk> token
● In linguistics and natural language processing (NLP), a CORPUS refers to a collection of texts.
                    Penn Treebank   1 Billion Word       WikiText-2   WikiText-103
Tokens (train)      929 590         829 250 940 (- 1%)   2 088 628    103 227 021
Tokens (valid.)     73 761          ≈ 7 823 242          217 646      217 646
Tokens (test)       82 431          159 658              245 569      245 569
Articles (train)    -               -                    600          28 475
Articles (valid.)   -               -                    60           60
Articles (test)     -               -                    60           60
Vocabulary size     10 000          793 471              33 278       267 735
OOV rate            4.8%            0.28%                2.6%         0.4%
35. Language Modelling for the Ukrainian language
Data
● OOV rate = percentage of tokens replaced by an <unk> token
              Brown Ukrainian Corpus   Korrespondent + Ukrainian fiction
Tokens        779 000                  262 598 163
Sentences     39 900                   14 335 495
Vocab. size   300 000                  300 000
OOV rate      3.84%                    1.96%
36. Language Modelling for the Ukrainian language
Preprocessing
Original text                                   Replacement
1990, 1990-ому, 90-х, 1990/2000, ...            _#year_, _#year_-ому, _#year_-х, _#year_/_#year_, ...
01.01, 01.01.2001, 01.01.01                     _#date_
0:00                                            _#time_
One, One-Two, One Two Three, ...                _#foreign_
Most of the media sources
(news, magazines, newspapers)                   _#media_
1 / 2 / 3 / 4 + noun                            один/одного/одному/перший/першому..., два..., три..., чотири...
1234, 10 000 000                                _#number_
10:40                                           _#number_:_#number_
10.00, 10,5                                     _#float_
39. Language Modelling for the Ukrainian language
Subword level: LSTM
Model   Parameters                  Training time (1 epoch)   Size of model file   Perplexity
LSTM    Embed: 300, layers 2x300    2h 05m                     0.15 G               704
LSTM    Embed: 500, layers 3x1024   6h 30m                     0.61 G               426
LSTM    Embed: 500, layers 3x2048   17h 00m                    1.59 G               354

Models trained for 10 epochs.
● Vocabulary 20 000 tokens
40. Statistical and Neural Language Models for the Ukrainian Language
Anastasiia Khaburska
Master Student
Ukrainian Catholic University
Ciklum
a.khaburska@ucu.edu.ua
anastasykhab@gmail.com
Igor Tytyk
Supervisor
NLP Freelance
igor.tytyk@gmail.com
Acknowledgements
Co-Supervisor Artem Chernodub
Computational resources Ciklum,
Grammarly
Brown-Uk, Lang-Uk,
Dmitriy Chaplinsky, Vasyl Starko
Ukrainian Catholic University
Lviv 2020
42. Background
Language Models
Count-based approaches (N-grams with smoothing)
● Kneser-Ney smoothed 5-gram models (Kneser and Ney, 1995)
Neural methods
● based on simple RNNs (Bengio et al., 2003; Mikolov et al., 2010)
● character-aware neural language models (Kim et al., 2015)
● based on LSTMs (Jozefowicz et al., 2016)
● gated convolutional networks (Dauphin et al., 2017)
● self-attentional networks (Al-Rfou et al., 2018)
43. Background
Research gap
What do we have for the Ukrainian language?
Corpora
● Ukrainian Brown Corpus: a well-balanced, carefully edited corpus of original Ukrainian texts published between 2010 and 2018, comprising 9 domains.
● Uber-Text Corpus: more than 6 GB of Ukrainian texts, but unfortunately, for legal reasons, split into sentences, shuffled randomly, and stripped of punctuation.
Models
● Lang-Uk's NER model and word embeddings
● Ukrainian FastText embeddings