Statistical and neural language
models for the Ukrainian Language
Anastasiia Khaburska
Master Student
Ukrainian Catholic University
Ciklum
a.khaburska@ucu.edu.ua
anastasykhab@gmail.com
Igor Tytyk
Supervisor
NLP Freelance
igor.tytyk@gmail.com
Lviv 2020
Acknowledgements
Co-Supervisor Artem Chernodub
Computational resources Ciklum, Grammarly
Brown-Uk, Lang-Uk,
Dmitriy Chaplinsky, Vasyl Starko
Ukrainian Catholic University
Content
1. Introduction
2. Language Modelling review
3. Language Modelling for the Ukrainian language
4. Results
5. Contribution & Future work
Introduction
Language Models
The objective of Language Modeling is to learn a probability distribution over sequences of linguistic units.
CORPUS: a collection of texts.
Conditional probability of a linguistic unit given its context: P(linguistic unit | context)
Linguistic units seen by the model compose the model's DICTIONARY.
Probability of a sequence:
P( “Захист дипломних робіт на факультеті прикладних наук.” )
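The slide's formula images did not survive extraction; the chain-rule decomposition it refers to is, in standard notation:

```latex
P(S) = P(u_1, u_2, \dots, u_n) = \prod_{i=1}^{n} P(u_i \mid u_1, \dots, u_{i-1})
```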
Introduction
Language Modelling. Use cases
Grammar correction
P(" Хлопець відкрив двері . ") = 0.32 × 10⁻⁹
P(" Хлопець відчинив двері . ") = 4.20 × 10⁻⁹
Machine translation
“Who are You?”
P(" Хто є ти ? ") = 1.32 × 10⁻⁹
P(" Хто ти є ? ") = 5.85 × 10⁻⁹
Natural language generation
Text summarization
Speech recognition
Introduction
Motivation. Research gap
A lot has been done for the English language
● statistical and neural approaches
Ukrainian language modelling is still lacking:
● published scientific literature
● publicly available resources
● established benchmarks
Introduction
Goals
To set a number of BASELINES in the task of building LMs
for the Ukrainian language
1. Compose training and evaluation datasets.
2. Explore existing methods for language modeling.
3. Train and evaluate LMs for the Ukrainian language.
4. Suggest a benchmark for Ukrainian language models.
Language Modelling review
Corpora
In linguistics and natural language processing (NLP), CORPUS refers to a collection of texts.
Corpus          | Penn Treebank | 1 Billion Word | WikiText-2 | WikiText-103
Tokens (Train)  | 929 K         | 820 M          | 2 M        | 103 M
Tokens (Valid.) | 73 K          | 7 M            | 217 K      | 217 K
Tokens (Test)   | 82 K          | 159 K          | 245 K      | 245 K
Vocabulary size | 10 000        | 793 471        | 33 278     | 267 735
OOV rate        | 4.8%          | 0.28%          | 2.6%       | 0.4%
OOV rate = percentage of tokens replaced by an <unk> token
Language Modelling review
Models
Statistical approaches: N-gram models
Example:
Сьогодні !
Сьогодні на факультеті прикладних наук…
Чекаємо на Вас Сьогодні о 20.00.
Сьогодні на проспекті Шевченка ...
…
ARPA file (n-grams with occurrence counts):
2-gram
START Сьогодні 3
Вас Сьогодні 1
…
3-gram
START Сьогодні на 2
START Сьогодні ! 1
Вас Сьогодні о 1
...
Token probability estimation:
p(на | START Сьогодні) = ⅔
p(! | START Сьогодні) = ⅓
p(о | START Сьогодні) = 0   p(о | Сьогодні) = ¼
Smoothing techniques (KenLM)
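To make the counting above concrete, here is a toy sketch (illustrative Python, not the thesis code) that reproduces p(на | START Сьогодні) = ⅔ from raw counts; KenLM's smoothing then redistributes probability mass to unseen continuations such as p(о | START Сьогодні):

```python
from collections import Counter

# Toy corpus from the slide (START marks the beginning of a sentence).
corpus = [
    "Сьогодні !",
    "Сьогодні на факультеті прикладних наук",
    "Чекаємо на Вас Сьогодні о 20.00 .",
    "Сьогодні на проспекті Шевченка ...",
]

trigrams = Counter()
for line in corpus:
    tokens = ["START"] + line.split()
    trigrams.update(zip(tokens, tokens[1:], tokens[2:]))

def p(token, ctx):
    """Maximum-likelihood estimate p(token | ctx) for a two-token context."""
    total = sum(c for ngram, c in trigrams.items() if ngram[:2] == ctx)
    return trigrams[(*ctx, token)] / total if total else 0.0

print(p("на", ("START", "Сьогодні")))  # 2/3, as on the slide
print(p("о", ("START", "Сьогодні")))   # 0.0 without smoothing
```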
Language Modelling review
Models
Neural LMs. Recurrent neural networks (RNN).
Dictionary of tokens.
Embedding: token → vector
Sequence processing:
Сьогодні на [?]
Dictionary:
Сьогодні
факультеті
проспекті
.
.
.
END
START Сьогодні на
[Diagram: embedding matrix of size (dictionary size × embedding dimension)]
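A minimal sketch of such a recurrent LM in PyTorch (the framework is an assumption for illustration; sizes mirror the word-level experiments later in the deck):

```python
import torch
import torch.nn as nn

class LSTMLanguageModel(nn.Module):
    """Embedding -> LSTM -> linear projection to logits over the dictionary."""
    def __init__(self, vocab_size, embed_dim=300, hidden_dim=500, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, num_layers, batch_first=True)
        self.proj = nn.Linear(hidden_dim, vocab_size)

    def forward(self, token_ids, state=None):
        x = self.embedding(token_ids)        # (batch, seq, embed_dim)
        out, state = self.lstm(x, state)     # (batch, seq, hidden_dim)
        return self.proj(out), state         # logits: (batch, seq, vocab_size)

# Next-token distribution for "START Сьогодні на [?]" (toy ids 0, 1, 2):
model = LSTMLanguageModel(vocab_size=1000)    # the thesis vocabulary is 300 000
logits, _ = model(torch.tensor([[0, 1, 2]]))
probs = torch.softmax(logits[0, -1], dim=-1)  # one probability per dictionary entry
```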
Language Modelling review
Tokenization Levels
Word level:
Програму було започатковано у _#year_ р .
Character level:
^П р о г р а м у ^б у л о ^з а п о ч а т к о в а н о ^у _#year_ ^р .
Subword level (BPE - Byte Pair Encoding):
^Про гра му ^було ^за ^почат ков ано ^у _#year_ ^р .
Subword level (BCN - Bags of Character N-grams):
Each word's vector = sum of its character n-gram vectors
Example, n=3: Програму → {‘^пр’, ‘про’, ‘рог’, ‘огр’, ‘гра’, ‘рам’, ‘аму’, ‘му_’}
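The bag of character n-grams from the example can be computed directly (a minimal sketch using the slide's '^' and '_' boundary markers):

```python
def char_ngrams(word, n=3):
    """Bag of character n-grams with boundary markers, as in BCN/FastText."""
    marked = f"^{word}_"
    return {marked[i:i + n] for i in range(len(marked) - n + 1)}

print(char_ngrams("Програму"))
# {'^Пр', 'Про', 'рог', 'огр', 'гра', 'рам', 'аму', 'му_'}
# (the slide shows the same set, lowercased)
```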
Language Modelling review
Evaluation
A better model is one that assigns a higher probability to the word that actually occurs.
PERPLEXITY: an intrinsic metric for the evaluation of word-level models,
the inverse probability of the text normalized by the number of words.
Minimizing perplexity is the same as maximizing probability.
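Written out (the standard definition; the slide's formula image is missing):

```latex
\mathrm{PP}(W) = P(w_1 w_2 \dots w_N)^{-1/N}
             = \exp\left( -\frac{1}{N} \sum_{i=1}^{N} \log P(w_i \mid w_1, \dots, w_{i-1}) \right)
```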
Language Modelling for the Ukrainian language
Data
● OOV rate = percentage of tokens replaced by an <unk> token

Corpus      | Korrespondent + Ukrainian fiction | Brown Ukrainian Corpus
Tokens      | 262 598 163                       | 779 000
Vocab. size | 300 000                           | 300 000
OOV rate    | 1.96%                             | 3.84%
Language Modelling for the Ukrainian language
Preprocessing
_#year_    ← 1990
_#date_    ← 01.01.2001
_#time_    ← 00:00
_#foreign_ ← Some Words
_#media_   ← most media sources (news, magazines, newspapers)
_#number_  ← 1234
_#float_   ← 10,5
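A minimal sketch of such rule-based normalization (the regexes are assumptions; the thesis rules are richer and also cover _#foreign_, _#media_ and spelled-out small numbers):

```python
import re

# Rule order matters: more specific patterns run first. Real rules need more
# context, e.g. to tell the date 01.01 from the float 10.00.
RULES = [
    (re.compile(r"\b\d{2}\.\d{2}(\.\d{2,4})?\b"), "_#date_"),
    (re.compile(r"\b\d{1,2}:\d{2}\b"), "_#time_"),
    (re.compile(r"\b(1[0-9]|20)\d{2}\b"), "_#year_"),
    (re.compile(r"\b\d+[.,]\d+\b"), "_#float_"),
    (re.compile(r"\b\d+\b"), "_#number_"),
]

def normalize(text):
    for pattern, token in RULES:
        text = pattern.sub(token, text)
    return text

print(normalize("Програму було започатковано у 1990 р ."))
# Програму було започатковано у _#year_ р .
```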
Language Modelling for the Ukrainian language
Data
[Figure: number of tokens per data source]
Language Modelling for the Ukrainian language
Word level: KenLM, LSTM
● OOV replaced with _#unknown_ token
● Vocabulary: 300 000 tokens
● LSTMs trained for 8 epochs

Model        | Parameters                  | Training time      | Size of file | Perplexity
KenLM N-gram | 6-gram, pruning 0 0 0 0 1 1 | ARPA file: 11m:51s | 18.2G        | 749
LSTM         | Embed: 300, Layers 3x500    | 1 epoch: 25h:30m   | 2.82G        | 258
LSTM         | Embed: 300, Layers 2x500    | 1 epoch: 23h:50m   | 2.79G        | 275
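For context, building and querying the KenLM model from this table looks roughly as follows (standard KenLM usage; file names are placeholders):

```python
# Shell (KenLM toolkit): train a 6-gram ARPA model with pruning "000011",
# i.e. drop singleton 5-grams and 6-grams:
#   lmplz -o 6 --prune 0 0 0 0 1 1 < train.txt > model.arpa

import kenlm  # Python wrapper for KenLM

model = kenlm.Model("model.arpa")                 # placeholder path
sentence = "Захист дипломних робіт на факультеті прикладних наук ."
print(model.score(sentence, bos=True, eos=True))  # log10 probability
print(model.perplexity(sentence))                 # per-token perplexity
```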
Language Modelling for the Ukrainian language
FastText (by Facebook AI Research)
FastText provides 300-dimensional embeddings for 157 languages (including
Ukrainian), computed using bags of character n-grams.
We used 300 000 pretrained vectors.
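Loading these vectors with the official fasttext package looks like this (standard fastText usage; the 300 000 vectors used in the thesis are a subset of this model):

```python
import fasttext
import fasttext.util

# Download the official pretrained Ukrainian model (cc.uk.300) and load it.
fasttext.util.download_model("uk", if_exists="ignore")
ft = fasttext.load_model("cc.uk.300.bin")

vec = ft.get_word_vector("факультеті")  # 300-dimensional vector
print(vec.shape)                         # (300,)
```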
Language Modelling for the Ukrainian language
Word level: KenLM, LSTM, FastText
● Vocabulary: 300 000 tokens
● LSTMs trained for 8 epochs

Model                   | Parameters                  | Training time      | Size of file | Perplexity
KenLM N-gram            | 6-gram, pruning 0 0 0 0 1 1 | ARPA file: 11m:51s | 18.2G        | 749
LSTM                    | Embed: 300, Layers 3x500    | 1 epoch: 25h:30m   | 2.82G        | 258
LSTM                    | Embed: 300, Layers 2x500    | 1 epoch: 23h:50m   | 2.79G        | 275
LSTM (FastText, frozen) | Embed: 300, Layers 2x500    | 1 epoch: 16h:20m   | 2.1G         | 234
Language Modelling for the Ukrainian language
Subword level: LSTM
● Vocabulary: 20 000 tokens
● Trained for 10 epochs

Model | Parameters                | Training time (1 epoch) | Size of model file | Perplexity
LSTM  | Embed: 500, Layers 3x1024 | 6h:30m                  | 0.61G              | 426
LSTM  | Embed: 500, Layers 3x2048 | 17h:00m                 | 1.59G              | 354
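A hypothetical way to reproduce such a 20 000-token subword vocabulary with SentencePiece BPE (the slides do not name the BPE implementation that was used):

```python
import sentencepiece as spm

# Train a 20 000-token BPE vocabulary on the training corpus (placeholder path).
spm.SentencePieceTrainer.train(
    input="train.txt", model_prefix="uk_bpe",
    vocab_size=20_000, model_type="bpe",
)

sp = spm.SentencePieceProcessor(model_file="uk_bpe.model")
print(sp.encode("Програму було започатковано у 1990 р .", out_type=str))
```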
Contribution & Future work
Contribution
● Composed, preprocessed and described a dataset (262M tokens) sufficient for training
neural LMs for the Ukrainian language. As a benchmark evaluation corpus, we
propose to use the publicly available BrUk (779K tokens).
● Set a number of baselines:
○ word-level N-gram Kneser-Ney (6-gram, perplexity 749)
○ word-level NN LSTM (perplexity 258)
○ LSTM with pretrained FastText embeddings (perplexity 234)
○ subword-level LSTM (perplexity 354)
● Contributed to the BrUk GitHub project and submitted the preprocessed training set to the LangUk page.
Contribution & Future work
Future work
● Incorporate data from other domains.
● Experiment with the Mogrifier LSTM extension.
● Measure language models’ performance against a diverse hand-crafted set
of linguistic tasks.
Questions?
Review question
1. The final training dataset was not made available, so it is impossible to
verify its quality. It is mentioned that the effort is underway to publish it, but
it was not complete by the time this review was written.
Review question
2. There are issues with perplexity evaluation of the created language models:
○ There is no comparison with similar (in architecture and size) models for English.
State-of-the-art results for English LMs are at least an order of magnitude better. This may
indicate either a flaw of the models, or of the evaluation, or something else. It would also be
nice to perform additional evaluation and analysis on some common (regardless of
language differences) dataset. Such a dataset could be obtained, for example, by
automatically translating an existing English dataset that has a permissive license.
○ It is well known that perplexity on its own is a metric that doesn’t provide sufficient
information about the quality of the model. Usually, additional extrinsic evaluation or manual
error analysis is performed. This was not done in this work.
Language Modelling review
Corpora

Corpus          | Korrespondent + Ukrainian fiction | 1 Billion Word
Tokens          | 262M                              | 820M
Vocabulary size | 300 000                           | 793 471
OOV rate        | 1.96%                             | 0.28%
<unk> threshold | occurrences < 15                  | occurrences < 3
Review question
3. Besides the perplexity analysis, I missed a comprehensive overall analysis of
the differences between the produced models, their pros and cons.
Review question
4. And some minor details:
○ This sentence is not factually correct: “As a rule, for data-science tasks, the available data
is split into training, validation and test partitions according to the proportion of 90%, 5%
and 5% respectively.” Such a rule, if it exists, may only be applicable to language modeling
projects, as a normal approach would use a totally different split (60, 20, 20 being the usual
one).
○ Removal of sentences longer than 60 tokens from the training set will limit its usefulness.
I believe that was not necessary and could be addressed with special handling during
training.
Questions?
Introduction
Motivation. Research gap
● A lot has been done for the English language
○ statistical and neural approaches
● Ukrainian language modelling is still lacking
○ Published scientific literature
○ Established benchmarks
○ Publicly available resources
http://nlpprogress.com/english/language_modeling.html
Introduction
Language Models
The objective of Language Modeling is to learn a probability distribution:
the empirical distribution P of a language and the learned distribution Q.
S: sequences of linguistic units; u_i: the i-th unit (characters, words or phrases).
Conditional probability of a unit given its context: P(u_i | u_1, …, u_{i-1})
Linguistic units seen by the model compose the model’s dictionary U.
Language Modelling review
Models
Statistical approaches: N-gram models
● ARPA model: contains n-grams and their numbers of occurrences
● Token probability estimation → probability of a sentence
● Smoothing techniques (KenLM)
Neural LMs. Recurrent neural networks (RNN).
● Dictionary (U) of tokens
● Embedding: token → vector; size of Embedding (m), size of Vocabulary (v)
● Sequence processing: softmax gives a probability estimate for each token in the vocabulary
Language Modelling review
Evaluation
Empirical distribution P of a language, learned distribution Q.
Cross-entropy between P and Q, after applying a chain rule, yields:
● word-level LMs: Perplexity
● character-level LMs: Bits Per Character
Al-Rfou, Rami et al. (2018). “Character-Level Language Modeling with Deeper Self-Attention”. CoRR abs/1808.04444. arXiv: 1808.04444
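The quantities named above, written out (standard definitions; the slide's formula images were lost):

```latex
H(P, Q) = -\frac{1}{N} \sum_{i=1}^{N} \log_2 Q(w_i \mid w_1, \dots, w_{i-1}),
\qquad \mathrm{Perplexity} = 2^{H(P, Q)}
```

For character-level LMs, the same average negative log-likelihood, computed per character and measured in bits, is the Bits Per Character (BPC).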
Language Modelling review
Corpora
● OOV rate = percentage of tokens replaced by an <unk> token
● In linguistics and natural language processing (NLP), CORPUS refers to a collection of texts.

Corpus            | Penn Treebank | 1 Billion Word   | WikiText-2 | WikiText-103
Tokens (Train)    | 929 590       | 829 250 940      | 2 088 628  | 103 227 021
Tokens (Valid.)   | 73 761        | ≈ 7 823 242 (1%) | 217 646    | 217 646
Tokens (Test)     | 82 431        | 159 658          | 245 569    | 245 569
Articles (Train)  | -             | -                | 600        | 28 475
Articles (Valid.) | -             | -                | 60         | 60
Articles (Test)   | -             | -                | 60         | 60
Vocabulary size   | 10 000        | 793 471          | 33 278     | 267 735
OOV rate          | 4.8%          | 0.28%            | 2.6%       | 0.4%
Language Modelling for the Ukrainian language
Data
● OOV rate = percentage of tokens replaced by an <unk> token

Corpus      | Brown Ukrainian Corpus | Korrespondent + Ukrainian fiction
Tokens      | 779 000                | 262 598 163
Sentences   | 39 900                 | 14 335 495
Vocab. size | 300 000                | 300 000
OOV rate    | 3.84%                  | 1.96%
Language Modelling for the Ukrainian language
Preprocessing
1990, 1990-ому, 90-х, 1990/2000, ... → _#year_, _#year_-ому, _#year_-х, _#year_/_#year_, ...
01.01, 01.01.2001, 01.01.01 → _#date_
0:00 → _#time_
One, One-Two, One Two Three, ... → _#foreign_
Most media sources (news, magazines, newspapers) → _#media_
1/2/3/4 + noun → spelled-out forms: один/одного/одному/перший/першому..., два..., три..., чотири...
1234; 10 000 000 → _#number_
10:40 → _#number_:_#number_
10.00; 10,5 → _#float_
Language Modelling for the Ukrainian language
KenLM

N-gram | Pruning     | Training time (minutes) | Size of ARPA file | Perplexity (incl. OOV) | Perplexity (excl. OOV)
3-gram | -           | 03:55                   | 6.4G              | 814.93                 | 697.82
6-gram | 0 1 1 1 1 1 | 07:24                   | 5.0G              | 871.97                 | 749.91
6-gram | 0 0 1 1 1 1 | 08:45                   | 6.0G              | 800.39                 | 686.53
6-gram | 0 0 0 0 1 1 | 11:51                   | 18.2G             | 748.81                 | 641.11
6-gram | 0 0 0 0 0 1 | 23:25                   | 29.4G             | 742.42                 | 635.54
6-gram | -           | 40:05                   | 41.1G             | 736.83                 | 630.69
Language Modelling for the Ukrainian language
Zipf plot
[Figure: Zipf plot of token frequencies in the corpus]
Language Modelling for the Ukrainian language
Subword level: LSTM
● Vocabulary: 20 000 tokens
● Trained for 10 epochs

Model | Parameters                | Training time (1 epoch) | Size of model file | Perplexity
LSTM  | Embed: 300, Layers 2x300  | 2h:05m                  | 0.15G              | 704
LSTM  | Embed: 500, Layers 3x1024 | 6h:30m                  | 0.61G              | 426
LSTM  | Embed: 500, Layers 3x2048 | 17h:00m                 | 1.59G              | 354
Count-based approaches (N-grams with smoothing)
● Kneser-Ney smoothed 5-gram models (Kneser and Ney, 1995)
Neural methods
● feed-forward and simple recurrent NNs (Bengio et al., 2003; Mikolov et al., 2010)
● character-aware neural language models (Kim et al., 2015)
● LSTM-based models (Jozefowicz et al., 2016)
● gated convolutional networks (Dauphin et al., 2017)
● self-attentional networks (Al-Rfou et al., 2018)
Background
Language Models
What do we have for the Ukrainian language?
Corpora
● The Ukrainian Brown Corpus is a well-balanced, carefully edited corpus of original
Ukrainian texts published between 2010 and 2018, comprising 9 domains.
● Uber-Text Corpus: more than 6 GB of Ukrainian texts, but unfortunately,
for legal reasons, split into sentences and shuffled randomly, with
punctuation removed.
Models
● Lang-Uk’s NER model and Word Embeddings
● Ukrainian FastText embeddings
Background
Research Gap

Más contenido relacionado

La actualidad más candente

Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...Marcin Junczys-Dowmunt
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Vsevolod Dyomkin
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLPSatyam Saxena
 
A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...Ilia Karpov
 
Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languageshs0041
 
Open vocabulary problem
Open vocabulary problemOpen vocabulary problem
Open vocabulary problemJaeHo Jang
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022NU_I_TODALAB
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...Francisco Manuel Rangel Pardo
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information RetrievalNik Spirin
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector spaceAbdullah Khan Zehady
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine ReadingIsabelle Augenstein
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionFlorian Leitner
 
Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Francisco Manuel Rangel Pardo
 
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
 Neural Network Language Models for Candidate Scoring in Multi-System Machine... Neural Network Language Models for Candidate Scoring in Multi-System Machine...
Neural Network Language Models for Candidate Scoring in Multi-System Machine...Matīss ‎‎‎‎‎‎‎  
 

La actualidad más candente (20)

Searching for the Best Machine Translation Combination
Searching for the Best Machine Translation CombinationSearching for the Best Machine Translation Combination
Searching for the Best Machine Translation Combination
 
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
Automatic Grammatical Error Correction for ESL-Learners by SMT - Getting it r...
 
1909 paclic
1909 paclic1909 paclic
1909 paclic
 
Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)Crash Course in Natural Language Processing (2016)
Crash Course in Natural Language Processing (2016)
 
Anthiil Inside workshop on NLP
Anthiil Inside workshop on NLPAnthiil Inside workshop on NLP
Anthiil Inside workshop on NLP
 
AINL 2016: Malykh
AINL 2016: MalykhAINL 2016: Malykh
AINL 2016: Malykh
 
NLP Project Full Cycle
NLP Project Full CycleNLP Project Full Cycle
NLP Project Full Cycle
 
A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...A general method applicable to the search for anglicisms in russian social ne...
A general method applicable to the search for anglicisms in russian social ne...
 
Word Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented LanguagesWord Segmentation and Lexical Normalization for Unsegmented Languages
Word Segmentation and Lexical Normalization for Unsegmented Languages
 
Open vocabulary problem
Open vocabulary problemOpen vocabulary problem
Open vocabulary problem
 
The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022The VoiceMOS Challenge 2022
The VoiceMOS Challenge 2022
 
A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...A Low Dimensionality Representation for Language Variety Identification (CICL...
A Low Dimensionality Representation for Language Variety Identification (CICL...
 
Language Models for Information Retrieval
Language Models for Information RetrievalLanguage Models for Information Retrieval
Language Models for Information Retrieval
 
Word representations in vector space
Word representations in vector spaceWord representations in vector space
Word representations in vector space
 
Weakly Supervised Machine Reading
Weakly Supervised Machine ReadingWeakly Supervised Machine Reading
Weakly Supervised Machine Reading
 
CoLing 2016
CoLing 2016CoLing 2016
CoLing 2016
 
OUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information ExtractionOUTDATED Text Mining 5/5: Information Extraction
OUTDATED Text Mining 5/5: Information Extraction
 
AINL 2016: Yagunova
AINL 2016: YagunovaAINL 2016: Yagunova
AINL 2016: Yagunova
 
Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...Language Variety Identification using Distributed Representations of Words an...
Language Variety Identification using Distributed Representations of Words an...
 
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
 Neural Network Language Models for Candidate Scoring in Multi-System Machine... Neural Network Language Models for Candidate Scoring in Multi-System Machine...
Neural Network Language Models for Candidate Scoring in Multi-System Machine...
 

Similar a Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language Models for the Ukrainian Language

2-Chapter Two-N-gram Language Models.ppt
2-Chapter Two-N-gram Language Models.ppt2-Chapter Two-N-gram Language Models.ppt
2-Chapter Two-N-gram Language Models.pptmilkesa13
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...Lifeng (Aaron) Han
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Lifeng (Aaron) Han
 
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools Lifeng (Aaron) Han
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT Lifeng (Aaron) Han
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLifeng (Aaron) Han
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelHemantha Kulathilake
 
Methodology of MT Post-Editors Training
Methodology of MT Post-Editors TrainingMethodology of MT Post-Editors Training
Methodology of MT Post-Editors TrainingJakub Absolon
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processingMinh Pham
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)Abdullah al Mamun
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationChamani Shiranthika
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...alessio_ferrari
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersYoung Seok Kim
 
Poster Tweet-Norm 2013
Poster Tweet-Norm 2013Poster Tweet-Norm 2013
Poster Tweet-Norm 2013pruiz_
 
nlp-01.pptxvvvffffffvvvvvfeddeeddffffffffff
nlp-01.pptxvvvffffffvvvvvfeddeeddffffffffffnlp-01.pptxvvvffffffvvvvvfeddeeddffffffffff
nlp-01.pptxvvvffffffvvvvvfeddeeddffffffffffSushantVyas1
 

Similar a Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language Models for the Ukrainian Language (20)

2-Chapter Two-N-gram Language Models.ppt
2-Chapter Two-N-gram Language Models.ppt2-Chapter Two-N-gram Language Models.ppt
2-Chapter Two-N-gram Language Models.ppt
 
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
pptphrase-tagset-mapping-for-french-and-english-treebanks-and-its-application...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
MT SUMMIT PPT: Language-independent Model for Machine Translation Evaluation ...
 
Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...Pptphrase tagset mapping for french and english treebanks and its application...
Pptphrase tagset mapping for french and english treebanks and its application...
 
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
CUHK intern PPT. Machine Translation Evaluation: Methods and Tools
 
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT LEPOR: an augmented machine translation evaluation metric - Thesis PPT
LEPOR: an augmented machine translation evaluation metric - Thesis PPT
 
Lepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metricLepor: augmented automatic MT evaluation metric
Lepor: augmented automatic MT evaluation metric
 
ACL 2018 Recap
ACL 2018 RecapACL 2018 Recap
ACL 2018 Recap
 
NLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language ModelNLP_KASHK:Evaluating Language Model
NLP_KASHK:Evaluating Language Model
 
haenelt.ppt
haenelt.ppthaenelt.ppt
haenelt.ppt
 
Methodology of MT Post-Editors Training
Methodology of MT Post-Editors TrainingMethodology of MT Post-Editors Training
Methodology of MT Post-Editors Training
 
Introduction to natural language processing
Introduction to natural language processingIntroduction to natural language processing
Introduction to natural language processing
 
Natural Language Processing (NLP)
Natural Language Processing (NLP)Natural Language Processing (NLP)
Natural Language Processing (NLP)
 
Integration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translationIntegration of speech recognition with computer assisted translation
Integration of speech recognition with computer assisted translation
 
Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...Natural language processing for requirements engineering: ICSE 2021 Technical...
Natural language processing for requirements engineering: ICSE 2021 Technical...
 
GPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask LearnersGPT-2: Language Models are Unsupervised Multitask Learners
GPT-2: Language Models are Unsupervised Multitask Learners
 
2211 APSIPA
2211 APSIPA2211 APSIPA
2211 APSIPA
 
Poster Tweet-Norm 2013
Poster Tweet-Norm 2013Poster Tweet-Norm 2013
Poster Tweet-Norm 2013
 
nlp-01.pptxvvvffffffvvvvvfeddeeddffffffffff
nlp-01.pptxvvvffffffvvvvvfeddeeddffffffffffnlp-01.pptxvvvffffffvvvvvfeddeeddffffffffff
nlp-01.pptxvvvffffffvvvvvfeddeeddffffffffff
 

Más de Lviv Data Science Summer School

Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...Lviv Data Science Summer School
 
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...Lviv Data Science Summer School
 
Master defence 2020 - Nazariy Perepichka - Parameterizing of Human Speech Gen...
Master defence 2020 - Nazariy Perepichka - Parameterizing of Human Speech Gen...Master defence 2020 - Nazariy Perepichka - Parameterizing of Human Speech Gen...
Master defence 2020 - Nazariy Perepichka - Parameterizing of Human Speech Gen...Lviv Data Science Summer School
 
Master defence 2020 - Serhii Tiutiunnyk - Context-based Question-answering Sy...
Master defence 2020 - Serhii Tiutiunnyk - Context-based Question-answering Sy...Master defence 2020 - Serhii Tiutiunnyk - Context-based Question-answering Sy...
Master defence 2020 - Serhii Tiutiunnyk - Context-based Question-answering Sy...Lviv Data Science Summer School
 
Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
 Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata ItemsLviv Data Science Summer School
 
Master defence 2020 - Dmytro Babenko - Determining Sentiment and Important Pr...
Master defence 2020 - Dmytro Babenko - Determining Sentiment and Important Pr...Master defence 2020 - Dmytro Babenko - Determining Sentiment and Important Pr...
Master defence 2020 - Dmytro Babenko - Determining Sentiment and Important Pr...Lviv Data Science Summer School
 
Master defence 2020 - Oleh Lukianykhin - Reinforcement Learning for Voltage C...
Master defence 2020 - Oleh Lukianykhin - Reinforcement Learning for Voltage C...Master defence 2020 - Oleh Lukianykhin - Reinforcement Learning for Voltage C...
Master defence 2020 - Oleh Lukianykhin - Reinforcement Learning for Voltage C...Lviv Data Science Summer School
 
Master defence 2020 - Borys Olshanetskyi -Context Independent Speaker Classif...
Master defence 2020 - Borys Olshanetskyi -Context Independent Speaker Classif...Master defence 2020 - Borys Olshanetskyi -Context Independent Speaker Classif...
Master defence 2020 - Borys Olshanetskyi -Context Independent Speaker Classif...Lviv Data Science Summer School
 
Master defence 2020 - Philipp Kofman - Efficient Generation of Complex Data D...
Master defence 2020 - Philipp Kofman - Efficient Generation of Complex Data D...Master defence 2020 - Philipp Kofman - Efficient Generation of Complex Data D...
Master defence 2020 - Philipp Kofman - Efficient Generation of Complex Data D...Lviv Data Science Summer School
 
Master defence 2020 - Anastasiia Kasprova - Customer Lifetime Value for Retai...
Master defence 2020 - Anastasiia Kasprova - Customer Lifetime Value for Retai...Master defence 2020 - Anastasiia Kasprova - Customer Lifetime Value for Retai...
Master defence 2020 - Anastasiia Kasprova - Customer Lifetime Value for Retai...Lviv Data Science Summer School
 
Master defence 2020 - Dmitri Glusco - Replica Exchange For Multiple-Environme...
Master defence 2020 - Dmitri Glusco - Replica Exchange For Multiple-Environme...Master defence 2020 - Dmitri Glusco - Replica Exchange For Multiple-Environme...
Master defence 2020 - Dmitri Glusco - Replica Exchange For Multiple-Environme...Lviv Data Science Summer School
 
Master defence 2020 - Ivan Prodaiko - Person Re-identification in a Top-view ...
Master defence 2020 - Ivan Prodaiko - Person Re-identification in a Top-view ...Master defence 2020 - Ivan Prodaiko - Person Re-identification in a Top-view ...
Master defence 2020 - Ivan Prodaiko - Person Re-identification in a Top-view ...Lviv Data Science Summer School
 
Master defence 2020 - Yevhen Pozdniakov - Changing Clothing on People Images...
Master defence 2020 - Yevhen Pozdniakov -  Changing Clothing on People Images...Master defence 2020 - Yevhen Pozdniakov -  Changing Clothing on People Images...
Master defence 2020 - Yevhen Pozdniakov - Changing Clothing on People Images...Lviv Data Science Summer School
 
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar... Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...Lviv Data Science Summer School
 
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...Lviv Data Science Summer School
 
Master defence 2020 - Roman Riazantsev - 3D Reconstruction of Video Sign Lan...
Master defence 2020 -  Roman Riazantsev - 3D Reconstruction of Video Sign Lan...Master defence 2020 -  Roman Riazantsev - 3D Reconstruction of Video Sign Lan...
Master defence 2020 - Roman Riazantsev - 3D Reconstruction of Video Sign Lan...Lviv Data Science Summer School
 
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...Lviv Data Science Summer School
 
Master defence 2020 -Roman Moiseiev - Stock Market Prediction Utilizing Centr...
Master defence 2020 -Roman Moiseiev - Stock Market Prediction Utilizing Centr...Master defence 2020 -Roman Moiseiev - Stock Market Prediction Utilizing Centr...
Master defence 2020 -Roman Moiseiev - Stock Market Prediction Utilizing Centr...Lviv Data Science Summer School
 
Master defence 2020 - Maksym Opirskyi -Topological Approach to Wikipedia Arti...
Master defence 2020 - Maksym Opirskyi -Topological Approach to Wikipedia Arti...Master defence 2020 - Maksym Opirskyi -Topological Approach to Wikipedia Arti...
Master defence 2020 - Maksym Opirskyi -Topological Approach to Wikipedia Arti...Lviv Data Science Summer School
 
Master defence 2020 - Oleksandr Smyrnov - A Multifactorial Optimization of Pe...
Master defence 2020 - Oleksandr Smyrnov - A Multifactorial Optimization of Pe...Master defence 2020 - Oleksandr Smyrnov - A Multifactorial Optimization of Pe...
Master defence 2020 - Oleksandr Smyrnov - A Multifactorial Optimization of Pe...Lviv Data Science Summer School
 

Más de Lviv Data Science Summer School (20)

Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
 
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
Master defence 2020 - Andrew Kurochkin - Meme Generation for Social Media Aud...
 
Master defence 2020 - Nazariy Perepichka - Parameterizing of Human Speech Gen...
Master defence 2020 - Nazariy Perepichka - Parameterizing of Human Speech Gen...Master defence 2020 - Nazariy Perepichka - Parameterizing of Human Speech Gen...
Master defence 2020 - Nazariy Perepichka - Parameterizing of Human Speech Gen...
 
Master defence 2020 - Serhii Tiutiunnyk - Context-based Question-answering Sy...
Master defence 2020 - Serhii Tiutiunnyk - Context-based Question-answering Sy...Master defence 2020 - Serhii Tiutiunnyk - Context-based Question-answering Sy...
Master defence 2020 - Serhii Tiutiunnyk - Context-based Question-answering Sy...
 
Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
 Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
Master defence 2020 - Kateryna Liubonko - Matching Red Links to Wikidata Items
 
Master defence 2020 - Dmytro Babenko - Determining Sentiment and Important Pr...
Master defence 2020 - Dmytro Babenko - Determining Sentiment and Important Pr...Master defence 2020 - Dmytro Babenko - Determining Sentiment and Important Pr...
Master defence 2020 - Dmytro Babenko - Determining Sentiment and Important Pr...
 
Master defence 2020 - Oleh Lukianykhin - Reinforcement Learning for Voltage C...
Master defence 2020 - Oleh Lukianykhin - Reinforcement Learning for Voltage C...Master defence 2020 - Oleh Lukianykhin - Reinforcement Learning for Voltage C...
Master defence 2020 - Oleh Lukianykhin - Reinforcement Learning for Voltage C...
 
Master defence 2020 - Borys Olshanetskyi -Context Independent Speaker Classif...
Master defence 2020 - Borys Olshanetskyi -Context Independent Speaker Classif...Master defence 2020 - Borys Olshanetskyi -Context Independent Speaker Classif...
Master defence 2020 - Borys Olshanetskyi -Context Independent Speaker Classif...
 
Master defence 2020 - Philipp Kofman - Efficient Generation of Complex Data D...
Master defence 2020 - Philipp Kofman - Efficient Generation of Complex Data D...Master defence 2020 - Philipp Kofman - Efficient Generation of Complex Data D...
Master defence 2020 - Philipp Kofman - Efficient Generation of Complex Data D...
 
Master defence 2020 - Anastasiia Kasprova - Customer Lifetime Value for Retai...
Master defence 2020 - Anastasiia Kasprova - Customer Lifetime Value for Retai...Master defence 2020 - Anastasiia Kasprova - Customer Lifetime Value for Retai...
Master defence 2020 - Anastasiia Kasprova - Customer Lifetime Value for Retai...
 
Master defence 2020 - Dmitri Glusco - Replica Exchange For Multiple-Environme...
Master defence 2020 - Dmitri Glusco - Replica Exchange For Multiple-Environme...Master defence 2020 - Dmitri Glusco - Replica Exchange For Multiple-Environme...
Master defence 2020 - Dmitri Glusco - Replica Exchange For Multiple-Environme...
 
Master defence 2020 - Ivan Prodaiko - Person Re-identification in a Top-view ...
Master defence 2020 - Ivan Prodaiko - Person Re-identification in a Top-view ...Master defence 2020 - Ivan Prodaiko - Person Re-identification in a Top-view ...
Master defence 2020 - Ivan Prodaiko - Person Re-identification in a Top-view ...
 
Master defence 2020 - Yevhen Pozdniakov - Changing Clothing on People Images...
Master defence 2020 - Yevhen Pozdniakov -  Changing Clothing on People Images...Master defence 2020 - Yevhen Pozdniakov -  Changing Clothing on People Images...
Master defence 2020 - Yevhen Pozdniakov - Changing Clothing on People Images...
 
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar... Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
Master defence 2020 - Oleh Onyshchak - Image Recommendation for Wikipedia Ar...
 
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
Master defence 2020 - Oleh Misko - Ensembling and Transfer Learning for Multi...
 
Master defence 2020 - Roman Riazantsev - 3D Reconstruction of Video Sign Lan...
Master defence 2020 -  Roman Riazantsev - 3D Reconstruction of Video Sign Lan...Master defence 2020 -  Roman Riazantsev - 3D Reconstruction of Video Sign Lan...
Master defence 2020 - Roman Riazantsev - 3D Reconstruction of Video Sign Lan...
 
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...
Master defence 2020 - Vadym Korshunov - Region-Selected Image Generation with...
 
Master defence 2020 -Roman Moiseiev - Stock Market Prediction Utilizing Centr...
Master defence 2020 -Roman Moiseiev - Stock Market Prediction Utilizing Centr...Master defence 2020 -Roman Moiseiev - Stock Market Prediction Utilizing Centr...
Master defence 2020 -Roman Moiseiev - Stock Market Prediction Utilizing Centr...
 
Master defence 2020 - Maksym Opirskyi -Topological Approach to Wikipedia Arti...
Master defence 2020 - Maksym Opirskyi -Topological Approach to Wikipedia Arti...Master defence 2020 - Maksym Opirskyi -Topological Approach to Wikipedia Arti...
Master defence 2020 - Maksym Opirskyi -Topological Approach to Wikipedia Arti...
 
Master defence 2020 - Oleksandr Smyrnov - A Multifactorial Optimization of Pe...
Master defence 2020 - Oleksandr Smyrnov - A Multifactorial Optimization of Pe...Master defence 2020 - Oleksandr Smyrnov - A Multifactorial Optimization of Pe...
Master defence 2020 - Oleksandr Smyrnov - A Multifactorial Optimization of Pe...
 

Último

GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)Areesha Ahmad
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhousejana861314
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxpradhanghanshyam7136
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)PraveenaKalaiselvan1
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksSérgio Sacani
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxjana861314
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptxanandsmhk
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptxRajatChauhan518211
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...Sérgio Sacani
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfSumit Kumar yadav
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxAleenaTreesaSaji
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINsankalpkumarsahoo174
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPirithiRaju
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...Sérgio Sacani
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bSérgio Sacani
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxkessiyaTpeter
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Sérgio Sacani
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfSumit Kumar yadav
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PPRINCE C P
 

Último (20)

GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Orientation, design and principles of polyhouse
Orientation, design and principles of polyhouseOrientation, design and principles of polyhouse
Orientation, design and principles of polyhouse
 
Cultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptxCultivation of KODO MILLET . made by Ghanshyam pptx
Cultivation of KODO MILLET . made by Ghanshyam pptx
 
Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)Recombinant DNA technology (Immunological screening)
Recombinant DNA technology (Immunological screening)
 
The Philosophy of Science
The Philosophy of ScienceThe Philosophy of Science
The Philosophy of Science
 
Formation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disksFormation of low mass protostars and their circumstellar disks
Formation of low mass protostars and their circumstellar disks
 
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptxBroad bean, Lima Bean, Jack bean, Ullucus.pptx
Broad bean, Lima Bean, Jack bean, Ullucus.pptx
 
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptxUnlocking  the Potential: Deep dive into ocean of Ceramic Magnets.pptx
Unlocking the Potential: Deep dive into ocean of Ceramic Magnets.pptx
 
Green chemistry and Sustainable development.pptx
Green chemistry  and Sustainable development.pptxGreen chemistry  and Sustainable development.pptx
Green chemistry and Sustainable development.pptx
 
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
PossibleEoarcheanRecordsoftheGeomagneticFieldPreservedintheIsuaSupracrustalBe...
 
Botany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdfBotany 4th semester series (krishna).pdf
Botany 4th semester series (krishna).pdf
 
GFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptxGFP in rDNA Technology (Biotechnology).pptx
GFP in rDNA Technology (Biotechnology).pptx
 
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATINChromatin Structure | EUCHROMATIN | HETEROCHROMATIN
Chromatin Structure | EUCHROMATIN | HETEROCHROMATIN
 
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdfPests of cotton_Sucking_Pests_Dr.UPR.pdf
Pests of cotton_Sucking_Pests_Dr.UPR.pdf
 
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
All-domain Anomaly Resolution Office U.S. Department of Defense (U) Case: “Eg...
 
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43bNightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
Nightside clouds and disequilibrium chemistry on the hot Jupiter WASP-43b
 
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptxSOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
SOLUBLE PATTERN RECOGNITION RECEPTORS.pptx
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
Zoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdfZoology 4th semester series (krishna).pdf
Zoology 4th semester series (krishna).pdf
 
VIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C PVIRUSES structure and classification ppt by Dr.Prince C P
VIRUSES structure and classification ppt by Dr.Prince C P
 

Master defence 2020 - Anastasiia Khaburska - Statistical and Neural Language Models for the Ukrainian Language

  • 1. Statistical and neural language models for the Ukrainian Language Anastasiia Khaburska Master Student Ukrainian Catholic University Ciklum a.khaburska@ucu.edu.ua anastasykhab@gmail.com Igor Tytyk Supervisor NLP Freelance igor.tytyk@gmail.com Lviv 2020 Acknowledgements Co-Supervisor Artem Chernodub Computational resources Ciklum, Grammarly Brown-Uk, Lang-Uk, Dmitriy Chaplinsky, Vasyl Starko Ukrainian Catholic University
  • 2. Content 1. Introduction 2. Language Modelling review 3. Language Modelling for the Ukrainian language 4. Results 5. Contribution & Future work
  • 3. Introduction Language Models The objective of Language Modeling is to learn a probability distribution . CORPUS –— a collection of texts. Computing conditional probabilities from the context : P(linguistic unit |context) P( “Захист дипломних робіт на факультеті прикладних наук.” ) Linguistic units, seen by the model, compose model’s DICTIONARY. Probability of a sequence : P( “Захист дипломних робіт на факультеті прикладних наук.” )
  • 4. Introduction Language Modelling. Use cases Grammar correction P(" Хлопець відкрив двері . " ) = 0.32 10-9 P(" Хлопець відчинив двері . " ) = 4.20 10-9 Machine translation “ Who are You? “ P(" Хто є ти ? ") = 1.32 10-9 P(" Хто ти є ? ") = 5.85 10-9 Natural language generation Text summarization Speech recognition
  • 5. Introduction Motivation. Research gap A lot has been done for the English Language ● statistical and neural approaches Ukrainian language modelling is still lacking ● Published scientific literature ● Publicly available resources ● Established benchmarks
  • 6. Introduction Goals To set a number of BASELINES in the task of building LMs for the Ukrainian language 1. Compose training and evaluation datasets. 2. Explore existing methods for language modeling. 3. Train and Evaluate LMs for the Ukrainian language. 4. Suggest a benchmark for Ukrainian language models.
  • 7. Language Modelling review Corpora In linguistics and natural language processing (NLP), CORPUS refers to a collection of texts. Penn Treebank 1 Billion Word WikiText-2 WikiText-103 Tokens 929 K 73 K 82 K 820M 7 M 159 K 2 M 217 K 245 K 103 M 217 K 245 K Train Valid. Test Vocabulary size 10 000 793 471 33 278 267 735 OOV rate 4.8% 0.28% 2.6% 0.4% OOV rate = percentage of tokens replaced by an <unk> token
  • 8. Language Modelling review Models Example: Сьогодні ! Сьогодні на факультеті прикладних наук… Чекаємо на Вас Cьогодні о 20.00. Сьогодні на проспекті Шевченка ... … ARPA file 2-gram occurrences START Сьогодні 3 Вас Cьогодні 1 … 3-gram START Сьогодні на 2 START Сьогодні ! 1 Вас Cьогодні о 1 ... Token probability estimation: p(на | START Сьогодні) = ⅔ p(! | START Сьогодні) = ⅓ p(о | START Сьогодні) = 0 p(о |Сьогодні) = ¼ Smoothing techniques (KenLM) Statistical approaches . N-gram models
  • 9. Language Modelling review Models Dictionary of tokens. Embedding: token vector ( ) Sequence processing: Сьогодні на [?] Dictionary: Сьогодні факультеті проспекті . . . END START Сьогоні на Neural LMs. Recurrent neural networks (RNN). Embeddings Tokens dimension of embedding size of dictionary
  • 10. Language Modelling review Tokenization Levels Word level: Програму було започатковано у _#year_ р . Character level: ^П р о г р а м у ^б у л о ^з а п о ч а т к о в а н о ^у _#year_ ^р . Subword level (BPE - Bytes Pair Encoding): ^Про гра му ^було ^за ^почат ков ано ^у _#year_ ^р . Subword level (BCN - Bags of Character N-grams): Each word separate vector = sum of n-grams Example n=3: Програму {‘^пр’, ‘про’, ‘рог’, ‘огр’, ‘гра’, ‘рам’, ‘аму’, ‘му_’ }
  • 11. Language Modelling review Evaluation Better model is one which assigns a higher probability to the word that actually occurs. PERPLEXITY –— an intrinsic metric for the evaluation of word-level models. Inverse probability of the text normalized by the number of words. Minimizing perplexity is the same as maximizing probability.
  • 12. Language Modelling for the Ukrainian language. Data (OOV rate = percentage of tokens replaced by an <unk> token)
    Korrespondent + Ukrainian fiction: 262 598 163 tokens; vocabulary size 300 000; OOV rate 1.96%
    Brown Ukrainian Corpus: 779 000 tokens; vocabulary size 300 000; OOV rate 3.84%
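A sketch of how such an OOV rate can be computed; the file names and whitespace tokenization are assumptions:

```python
from collections import Counter

def oov_rate(train_path, test_path, vocab_size=300_000):
    """Share of test tokens outside the top-`vocab_size` training
    vocabulary, i.e. tokens that would be replaced by <unk>."""
    with open(train_path, encoding="utf-8") as f:
        counts = Counter(tok for line in f for tok in line.split())
    vocab = {tok for tok, _ in counts.most_common(vocab_size)}
    total = oov = 0
    with open(test_path, encoding="utf-8") as f:
        for line in f:
            for tok in line.split():
                total += 1
                oov += tok not in vocab
    return oov / total
```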
  • 13. Language Modelling for the Ukrainian language. Preprocessing
    1990 → _#year_
    01.01.2001 → _#date_
    00:00 → _#time_
    foreign-script words (e.g. “Some Words”) → _#foreign_
    most media sources (news, magazines, newspapers) → _#media_
    1234 → _#number_
    10,5 → _#float_
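A rough illustration of this normalization; the exact patterns used in the thesis are not published here, so the regexes below are assumptions:

```python
import re

# Order matters: match the most specific patterns first.
RULES = [
    (re.compile(r"\b\d{2}\.\d{2}\.\d{4}\b"), "_#date_"),
    (re.compile(r"\b\d{1,2}:\d{2}\b"),       "_#time_"),
    (re.compile(r"\b(19|20)\d{2}\b"),        "_#year_"),
    (re.compile(r"\b\d+[.,]\d+\b"),          "_#float_"),
    (re.compile(r"\b\d+\b"),                 "_#number_"),
]

def normalize(text: str) -> str:
    for pattern, placeholder in RULES:
        text = pattern.sub(placeholder, text)
    return text

print(normalize("01.01.2001 о 10:40 було 10,5 та 1234"))
# -> _#date_ о _#time_ було _#float_ та _#number_
```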
  • 14. Language Modelling for the Ukrainian language. Data (figure: number of tokens per corpus)
  • 15. Language Modelling for the Ukrainian language. Word level: KenLM, LSTM (OOV replaced with the _#unknown_ token; vocabulary 300 000 tokens)
    KenLM 6-gram, pruning 000011: ARPA build 11m 51s; file 18.2G; perplexity 749
    LSTM, embed 300, layers 3x500: 25h 30m per epoch (8 epochs); file 2.82G; perplexity 258
    LSTM, embed 300, layers 2x500: 23h 50m per epoch (8 epochs); file 2.79G; perplexity 275
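KenLM ships Python bindings, so a model like the 6-gram above can be queried as follows (a sketch; the paths and the lmplz invocation in the comment are assumed to mirror the reported setup):

```python
import kenlm

# The ARPA file would be built beforehand with KenLM's lmplz, e.g.:
#   lmplz -o 6 --prune 0 0 0 0 1 1 < train.tokenized.txt > model.arpa
model = kenlm.Model("model.arpa")

sentence = "Захист дипломних робіт на факультеті прикладних наук ."
print(model.score(sentence, bos=True, eos=True))  # total log10 probability
print(model.perplexity(sentence))                 # per-token perplexity
```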
  • 16. Language Modelling for the Ukrainian language. FastText (by Facebook AI Research) provides 300-dimensional embeddings for 157 languages (including Ukrainian), computed using bags of character n-grams. We took 300 000 pretrained vectors; see the sketch below.
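One way to use such frozen ("freezed") pretrained vectors in a PyTorch LM; the dump file name is hypothetical:

```python
import numpy as np
import torch
import torch.nn as nn

# vectors: float32 array of shape (300000, 300), one row per vocabulary
# token (loading from the released FastText files is omitted here).
vectors = np.load("fasttext_uk_300k.npy")  # hypothetical pre-extracted dump

embedding = nn.Embedding.from_pretrained(
    torch.from_numpy(vectors),  # (vocab_size, 300)
    freeze=True,                # frozen: weights are not updated in training
)
token_ids = torch.tensor([[17, 42, 7]])  # a batch of token indices
emb = embedding(token_ids)               # -> (1, 3, 300)
```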
  • 17. Language Modelling for the Ukrainian language. Word level: KenLM, LSTM, FastText (vocabulary 300 000 tokens)
    KenLM 6-gram, pruning 000011: ARPA build 11m 51s; file 18.2G; perplexity 749
    LSTM, embed 300, layers 3x500: 25h 30m per epoch (8 epochs); file 2.82G; perplexity 258
    LSTM, embed 300, layers 2x500: 23h 50m per epoch (8 epochs); file 2.79G; perplexity 275
    LSTM with frozen FastText embeddings, embed 300, layers 2x500: 16h 20m per epoch; file 2.1G; perplexity 234
  • 18. Language Modelling for the Ukrainian language. Subword level: LSTM (vocabulary 20 000 tokens; 10 epochs)
    LSTM, embed 500, layers 3x1024: 6h 30m per epoch; model file 0.61G; perplexity 426
    LSTM, embed 500, layers 3x2048: 17h 00m per epoch; model file 1.59G; perplexity 354
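The 20 000-token BPE vocabulary could be built, for example, with the sentencepiece library (a sketch; the thesis's actual tooling is not specified on this slide, and the paths are placeholders):

```python
import sentencepiece as spm

# Train a 20k-token BPE model on the raw training corpus.
spm.SentencePieceTrainer.train(
    input="train.txt", model_prefix="uk_bpe", vocab_size=20000, model_type="bpe"
)

sp = spm.SentencePieceProcessor(model_file="uk_bpe.model")
print(sp.encode("Програму було започатковано", out_type=str))
```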
  • 19. Contribution & Future work. Contribution ● Composed, preprocessed and described a dataset (262M tokens) sufficient for training neural LMs for the Ukrainian language. As a benchmark evaluation corpus, we propose the publicly available BrUk (779K tokens). ● Set a number of baselines ○ word-level Kneser-Ney N-gram (6-gram, perplexity 749) ○ word-level LSTM (perplexity 258) ○ LSTM with pretrained FastText embeddings (perplexity 234) ○ subword-level LSTM (perplexity 354) ● Contributed to the BrUk GitHub project and sent the preprocessed training set to the Lang-Uk page.
  • 20. Contribution & Future work. Future work ● Incorporate data from other domains. ● Experiment with the Mogrifier LSTM extension. ● Measure language-model performance against a diverse hand-crafted set of linguistic tasks.
  • 22. Review question 1. The final training dataset has not been made available, so its quality cannot be verified. An effort to publish it is mentioned as underway, but it was not complete at the time this review was written.
  • 23. Review question 2. There are issues with the perplexity evaluation of the created language models: ○ There is no comparison with similar (in architecture and size) models for English. State-of-the-art results for English LMs are at least an order of magnitude better. This may indicate a flaw in the models, in the evaluation, or something else. It would also be useful to perform additional evaluation and analysis on a common (language-independent) dataset; such a dataset could be obtained, for example, by automatically translating an existing English dataset with a permissive license. ○ It is well known that perplexity on its own does not provide sufficient information about the quality of a model. Usually, additional extrinsic evaluation or manual error analysis is performed; this was not done in this work.
  • 24. Language Modelling review. Corpora
    Korrespondent + Ukrainian fiction: 262M tokens; vocabulary size 300 000; OOV rate 1.96%; tokens with < 15 occurrences replaced by <unk>
    1 Billion Word: 820M tokens; vocabulary size 793 471; OOV rate 0.28%; tokens with < 3 occurrences replaced by <unk>
  • 25. Review question 3. Beyond the perplexity analysis, I missed a comprehensive overall comparison of the produced models, with their pros and cons.
  • 26. Review question 4. And some minor details: ○ This sentence is not factually correct: “As a rule, for data-science tasks, the available data is split into training, validation and test partitions according to the proportion of 90%, 5% and 5% respectively.” Such a rule, if it exists, may only be applicable to language-modelling projects; a typical approach would use a quite different split (60/20/20 being the usual one). ○ Removing sentences longer than 60 tokens from the training set limits its usefulness. I believe that was not necessary and could have been addressed with special handling during training.
  • 28. Introduction Motivation. Research gap ● A lot has been done for the English Language ○ statistical and neural approaches ● Ukrainian language modelling is still lacking ○ Published scientific literature ○ Established benchmarks ○ Publicly available resources http://nlpprogress.com/english/language_modeling.html
  • 29. Introduction. Language Models. The objective of Language Modelling is to learn a probability distribution. Notation: P is the empirical distribution of a language; Q is the learned distribution; S is the set of sequences of linguistic units; u_i is the i-th unit (characters, words or phrases). The model computes conditional probabilities of a unit given its context, Q(u_i | u_1 … u_{i-1}). Linguistic units seen by the model compose the model's dictionary U.
  • 30. Language Modelling review. Models. Statistical approaches, N-gram models: an ARPA model contains n-grams with their numbers of occurrences; the probability of a sentence is composed from per-token probability estimates, with smoothing techniques applied (KenLM). Neural LMs, recurrent neural networks (RNN): a dictionary U of tokens; an embedding maps each token to a vector; the network processes the sequence and a softmax over the vocabulary yields a probability estimate for every token; the key sizes are the embedding dimension (m) and the vocabulary size (v).
  • 31. Language Modelling review. Models. Statistical approaches, N-gram models. An ARPA model contains n-grams and their numbers of occurrences; sentence probability is built from token probability estimates, with smoothing techniques (KenLM). Example corpus: “Сьогодні !”; “Сьогодні на факультеті прикладних наук…”; “Чекаємо на Вас Сьогодні о 20.00.”; “Сьогодні на проспекті Шевченка ...”. 2-gram counts: START Сьогодні 3; Вас Сьогодні 1. 3-gram counts: START Сьогодні на 2; START Сьогодні ! 1; Вас Сьогодні о 1. Estimates: p(на | START Сьогодні) = 2/3; p(! | START Сьогодні) = 1/3; p(о | START Сьогодні) = 0, while p(о | Сьогодні) = 1/4 (reproduced in the sketch below).
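These estimates can be reproduced with plain counts; a toy maximum-likelihood sketch in Python, without smoothing (the corpus literals come from the slide):

```python
from collections import Counter

sentences = [
    "START Сьогодні !",
    "START Сьогодні на факультеті прикладних наук",
    "START Чекаємо на Вас Сьогодні о 20.00",
    "START Сьогодні на проспекті Шевченка",
]
bigrams, trigrams = Counter(), Counter()
for s in sentences:
    toks = s.split()
    bigrams.update(zip(toks, toks[1:]))
    trigrams.update(zip(toks, toks[1:], toks[2:]))

# p(на | START Сьогодні) = c(START Сьогодні на) / c(START Сьогодні) = 2/3
p = trigrams[("START", "Сьогодні", "на")] / bigrams[("START", "Сьогодні")]
print(p)  # 0.666...
```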
  • 32. Language Modelling review. Models. Neural LMs, recurrent neural networks (RNN). A dictionary U of tokens; an embedding maps each token to a vector; the network processes the sequence token by token, and a softmax over the vocabulary yields a probability estimate for each token; the key sizes are the embedding dimension (m) and the vocabulary size (v). Example: given “START Сьогодні на”, predict [?] over the dictionary (Сьогодні, факультеті, проспекті, …, END); a minimal sketch follows.
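A minimal sketch of this architecture in PyTorch (embedding, stacked LSTM, softmax over the dictionary), with sizes matching the word-level models above; all names are illustrative, not the thesis's actual code:

```python
import torch
import torch.nn as nn

class LstmLM(nn.Module):
    """Token embeddings -> stacked LSTM -> logits over the dictionary."""
    def __init__(self, vocab_size=300_000, embed_dim=300,
                 hidden=500, layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden, num_layers=layers,
                            batch_first=True)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, token_ids, state=None):
        emb = self.embed(token_ids)          # (batch, seq, embed_dim)
        hidden_seq, state = self.lstm(emb, state)
        return self.out(hidden_seq), state   # logits: (batch, seq, vocab)

model = LstmLM()
logits, _ = model(torch.tensor([[1, 2, 3]]))      # "START Сьогодні на"
probs = torch.softmax(logits[:, -1], dim=-1)      # p(next token | context)
```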
  • 33. Language Modelling review. Evaluation. Cross-entropy between the empirical distribution P of a language and the learned distribution Q. Word-level LMs report perplexity (obtained after applying the chain rule); character-level LMs report Bits Per Character (BPC). Comparing word-level and character-level LMs: Al-Rfou, Rami et al. (2018). “Character-Level Language Modeling with Deeper Self-Attention”. CoRR abs/1808.04444. arXiv: 1808.04444.
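In symbols, under the usual bits-based conventions (the original slide's formulas were images, so this is a reconstruction):

```latex
% Cross-entropy of the learned distribution Q against the empirical P,
% and the two derived metrics named on this slide.
\[
H(P, Q) = -\sum_{s \in S} P(s)\,\log_2 Q(s)
\]
\[
\text{word level:}\quad \mathrm{PPL} = 2^{H(P,Q)}
\qquad
\text{character level:}\quad \mathrm{BPC} = H(P, Q)
\]
```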
  • 34. Language Modelling review. Corpora (OOV rate = percentage of tokens replaced by an <unk> token)
    Penn Treebank: tokens 929 590 / 73 761 / 82 431 (train/valid/test); vocabulary 10 000; OOV rate 4.8%
    1 Billion Word: tokens 829 250 940 / ≈ 1% (7 823 242) / 159 658; vocabulary 793 471; OOV rate 0.28%
    WikiText-2: tokens 2 088 628 / 217 646 / 245 569; articles 600 / 60 / 60; vocabulary 33 278; OOV rate 2.6%
    WikiText-103: tokens 103 227 021 / 217 646 / 245 569; articles 28 475 / 60 / 60; vocabulary 267 735; OOV rate 0.4%
  • 35. Language Modelling for the Ukrainian language. Data (OOV rate = percentage of tokens replaced by an <unk> token)
    Brown Ukrainian Corpus: 779 000 tokens; 39 900 sentences; vocabulary size 300 000; OOV rate 3.84%
    Korrespondent + Ukrainian fiction: 262 598 163 tokens; 14 335 495 sentences; vocabulary size 300 000; OOV rate 1.96%
  • 36. Language Modelling for the Ukrainian language. Preprocessing
    1990 / 1990-ому / 90-х / 1990/2000 … → _#year_ / _#year_-ому / _#year_-х / _#year_/_#year_ …
    01.01 / 01.01.2001 / 01.01.01 → _#date_
    0:00 → _#time_
    One / One-Two / One Two Three … → _#foreign_
    most media sources (news, magazines, newspapers) → _#media_
    один/одного/одному/перший/першому… → 1 noun; два… → 2 noun; три… → 3 noun; чотири… → 4 noun
    1234 / 10 000 000 → _#number_; 10:40 → _#number_:_#number_
    10.00 / 10,5 → _#float_
  • 37. Language Modelling for the Ukrainian language. KenLM
    N-gram / pruning / training time (minutes) / ARPA file size / perplexity incl. OOV / perplexity excl. OOV
    3-gram / – / 03:55 / 6.4G / 814.93 / 697.82
    6-gram / 0 1 1 1 1 1 / 07:24 / 5.0G / 871.97 / 749.91
    6-gram / 0 0 1 1 1 1 / 08:45 / 6.0G / 800.39 / 686.53
    6-gram / 0 0 0 0 1 1 / 11:51 / 18.2G / 748.81 / 641.11
    6-gram / 0 0 0 0 0 1 / 23:25 / 29.4G / 742.42 / 635.54
    6-gram / – / 40:05 / 41.1G / 736.83 / 630.69
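The incl./excl. OOV distinction can be reproduced from KenLM's per-token scores, which flag OOV tokens; a sketch with an assumed model path:

```python
import kenlm

model = kenlm.Model("model.arpa")

def perplexities(sentence):
    """Perplexity over all tokens vs. over in-vocabulary tokens only."""
    scores = list(model.full_scores(sentence))  # (log10 prob, ngram len, is_oov)
    incl = 10 ** (-sum(lp for lp, _, _ in scores) / len(scores))
    known = [lp for lp, _, oov in scores if not oov]
    excl = 10 ** (-sum(known) / len(known))
    return incl, excl
```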
  • 38. Language Modelling for the Ukrainian language. Zipf plot (figure: token rank vs. token frequency); a plotting sketch follows.
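A plot like this one can be regenerated from token counts; a matplotlib sketch with a hypothetical corpus file:

```python
from collections import Counter
import matplotlib.pyplot as plt

with open("train.txt", encoding="utf-8") as f:
    counts = Counter(tok for line in f for tok in line.split())

freqs = sorted(counts.values(), reverse=True)
plt.loglog(range(1, len(freqs) + 1), freqs)  # rank vs. frequency, log-log axes
plt.xlabel("token rank")
plt.ylabel("token frequency")
plt.title("Zipf plot")
plt.show()
```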
  • 39. Language Modelling for the Ukrainian language. Subword level: LSTM (vocabulary 20 000 tokens; 10 epochs)
    LSTM, embed 300, layers 2x300: 2h 05m per epoch; model file 0.15G; perplexity 704
    LSTM, embed 500, layers 3x1024: 6h 30m per epoch; model file 0.61G; perplexity 426
    LSTM, embed 500, layers 3x2048: 17h 00m per epoch; model file 1.59G; perplexity 354
  • 40. Statistical and neural language models for the Ukrainian Language Anastasiia Khaburska Master Student Ukrainian Catholic University Ciklum a.khaburska@ucu.edu.ua anastasykhab@gmail.com Igor Tytyk Supervisor NLP Freelance igor.tytyk@gmail.com Acknowledgements Co-Supervisor Artem Chernodub Computational resources Ciklum, Grammarly Brown-Uk, Lang-Uk, Dmitriy Chaplinsky, Vasyl Starko Ukrainian Catholic University Lviv 2020
  • 42. Background. Language Models. Count-based approaches (N-grams with smoothing): ● Kneser-Ney smoothed 5-gram models (Kneser and Ney, 1995). Neural methods: ● simple RNNs (Bengio et al., 2003; Mikolov et al., 2010) ● character-aware neural language models (Kim et al., 2015) ● LSTM-based models (Jozefowicz et al., 2016) ● gated convolutional networks (Dauphin et al., 2017) ● self-attentional networks (Al-Rfou et al., 2018).
  • 43. Background. Research Gap. What do we have for the Ukrainian language? Corpora: ● the Ukrainian Brown Corpus, a well-balanced and carefully edited corpus of original Ukrainian texts published between 2010 and 2018, comprising 9 domains ● the Uber-Text Corpus: more than 6 GB of Ukrainian texts, but unfortunately, for legal reasons, split into sentences that were shuffled randomly and stripped of punctuation. Models: ● Lang-Uk's NER model and word embeddings ● Ukrainian FastText embeddings.