From Phonology to Syntax: Unsupervised Linguistic Typology at Different Levels with Language Embeddings
Johannes Bjerva, Isabelle Augenstein
Department of Computer Science, University of Copenhagen
{bjerva | augenstein}@di.ku.dk
Task-specific Language Embeddings
Data
• World Atlas of Language Structures (WALS)
• Encodes typological features for ~2500 languages
• Features divided into categories:
Phonology
Morphology
Syntax
…
Results
Key findings
Language representations can predict typological features
Task-specific representations are better at predicting task-related features
Models
Phonological and Morphological tasks
• Sequence-to-sequence bi-LSTM with attention
Syntactic tasks
• Sequence-labelling bi-LSTM
Language embeddings are pre-trained on multilingual language modelling and then fine-tuned
Computational Typology
• Problem: of the 7,000 languages in the world, only 100 are fully covered in typological databases.
• Approach: Predicting typological features with unsupervised language representations, fine-tuned for specific NLP tasks.
• Research questions:
1. What typological properties are encoded in task-specific language embeddings?
2. Do the encoded properties change with fine-tuning?
3. How are language similarities encoded?
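The approach above can be sketched as a nearest-neighbour classifier over language embeddings. This is only an illustration, not the authors' implementation: the choice of k-nearest neighbours with cosine similarity, the function names, and the toy 3-dimensional vectors and labels are all assumptions made for the example.

```python
import math
from collections import Counter

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def predict_feature(query, train_embeddings, train_labels, k=3):
    """Predict a typological feature value by majority vote among the
    k training languages whose embeddings are most similar to `query`."""
    ranked = sorted(
        train_embeddings,
        key=lambda lang: cosine(query, train_embeddings[lang]),
        reverse=True,
    )
    votes = Counter(train_labels[lang] for lang in ranked[:k])
    return votes.most_common(1)[0][0]

# Toy, made-up data: placeholder language names, vectors and feature values.
embeddings = {
    "lang_a": [0.9, 0.1, 0.0],
    "lang_b": [0.8, 0.2, 0.1],
    "lang_c": [0.0, 0.9, 0.4],
    "lang_d": [0.1, 0.8, 0.5],
}
labels = {"lang_a": "value_1", "lang_b": "value_1",
          "lang_c": "value_2", "lang_d": "value_2"}

prediction = predict_feature([0.85, 0.15, 0.05], embeddings, labels, k=3)
print(prediction)
```

Comparing predictions made from pre-trained embeddings with those made from fine-tuned embeddings is one way to probe whether the encoded properties change with fine-tuning.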
[Figure: sequence-to-sequence model diagram. Labels: LSTM Encoder, Attention, LSTM Decoder, hidden states h1 … hn; input characters "s a p a n" (Embed); morphological tags "N;LGSPEC1;2P;SG;PST" (One-hot); Concat.; output "sapandınız".]
NAACL-HLT 2018 Submission ***. Confidential Review Copy. DO NOT DISTRIBUTE.
2010) simultaneously for different languages (Tsvetkov et al., 2016; Östling and Tiedemann, 2017). In these recurrent multilingual language models with long short-term memory cells (LSTM, Hochreiter and Schmidhuber, 1997), languages are embedded into an n-dimensional space. In order for multilingual parameter sharing to be successful in this setting, the neural network is encouraged to use the language embeddings to encode features of language. Other work has explored learning language embeddings in the context of neural machine translation (Malaviya et al., 2017). In this work, we explore the embeddings trained by Östling and Tiedemann (2017), both in their original state, and by further tuning them for PoS tagging.
2.3 Typological data
In the experiments for RQ3, we attempt to predict typological features. We extract the features we aim to predict from WALS (Dryer and Haspelmath, 2013). We consider features which are encoded for all four Uralic languages in our sample.
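The selection of features that are encoded for every language in the sample can be sketched as a set intersection. The helper name, the dictionary layout, and the feature IDs and values below are hypothetical placeholders, not real WALS entries.

```python
def features_covered_by_all(wals, languages):
    """Return the sorted feature IDs that have a value for every language."""
    covered = None
    for lang in languages:
        feats = {f for f, v in wals.get(lang, {}).items() if v is not None}
        covered = feats if covered is None else covered & feats
    return sorted(covered or [])

# Made-up records: placeholder feature IDs and values, not real WALS data.
wals = {
    "Finnish":    {"feat_1": "a", "feat_2": "b", "feat_3": "c"},
    "Estonian":   {"feat_1": "a", "feat_2": "b"},
    "Hungarian":  {"feat_1": "d", "feat_2": None, "feat_3": "c"},
    "North Sámi": {"feat_1": "a", "feat_3": "c"},
}
sample = ["Finnish", "Estonian", "Hungarian", "North Sámi"]
shared = features_covered_by_all(wals, sample)
print(shared)
```

Only `feat_1` survives here, because every other placeholder feature is missing or empty for at least one language.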
2.4 Language Embeddings as Chomskyan Parameter Vectors
3 Morphology
3.1 Morphological inflection
Unimorph
3.2 Morphological Experiments
We train a sequence-to-sequence model based on the system developed by Östling and Bjerva (2017). The neural architecture is modified so as to include an embedded language representation. During training, the errors are also back-propagated into this embedding, meaning that the encoded representation will be fine-tuned as the task is learned. The system architecture is depicted in Figure ??.
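The fine-tuning described above, where task errors are back-propagated into the language embedding, can be illustrated with a toy example. Here a small linear scorer stands in for the sequence-to-sequence model, and every number, name and dimension is an assumption made for illustration. The point is only that the gradient of the loss reaches the embedding, so the embedding moves as the task is learned.

```python
def forward(weights, x, lang_vec):
    """Linear scorer over the concatenation [x ; lang_vec]."""
    z = x + lang_vec
    return sum(w * v for w, v in zip(weights, z))

def sgd_step(weights, x, lang_vec, target, lr=0.1):
    """One gradient step on 0.5 * (prediction - target)**2.
    Returns the updated weights and the updated language embedding."""
    z = x + lang_vec
    err = forward(weights, x, lang_vec) - target
    new_weights = [w - lr * err * v for w, v in zip(weights, z)]
    # Gradient w.r.t. the embedding: err times the weights that read it.
    lang_weights = weights[len(x):]
    new_lang = [l - lr * err * w for l, w in zip(lang_vec, lang_weights)]
    return new_weights, new_lang

weights = [0.5, -0.3, 0.2, 0.1]   # two input dims + two embedding dims
x = [1.0, 2.0]                    # made-up task input
lang_vec = [0.0, 0.0]             # hypothetical pre-trained embedding
target = 1.0

loss_before = 0.5 * (forward(weights, x, lang_vec) - target) ** 2
for _ in range(20):
    weights, lang_vec = sgd_step(weights, x, lang_vec, target)
loss_after = 0.5 * (forward(weights, x, lang_vec) - target) ** 2
print(lang_vec, loss_before, loss_after)
```

After training, `lang_vec` is no longer the zero vector it started as, which is the sense in which the representation has been fine-tuned for the task.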
4 Phonology
4.1 Grapheme-to-phoneme
g2p data
4.2 Phonological Experiments
We train a sequence-to-sequence model identical to the morphological system, using grapheme-to-phoneme data.
5 Morphosyntax
5.1 Part-of-speech tagging
We use PoS annotations from version 2 of the Universal Dependencies (Nivre et al., 2016). We focus on the four Uralic languages present in the UD, namely Finnish (based on the Turku Dependency Treebank, Pyysalo et al., 2015), Estonian (Muischnek et al., 2016), Hungarian (based on the Hungarian Dependency Treebank, Vincze et al., 2010), and North Sámi (Sheyanova and Tyers, 2017). As we are mainly interested in observing the language embeddings, we down-sample all training sets to 1500 sentences (approximate number of sentences in the Hungarian data), so as to minimise any size-based effects.
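The down-sampling step can be sketched as follows. The text does not say how the 1500 sentences are chosen, so uniform random sampling with a fixed seed is an assumption, as are the function name and the placeholder corpora.

```python
import random

def downsample(sentences, max_sentences=1500, seed=0):
    """Return at most `max_sentences` sentences, sampled without replacement.
    Smaller sets are returned unchanged (as a new list)."""
    if len(sentences) <= max_sentences:
        return list(sentences)
    rng = random.Random(seed)
    return rng.sample(sentences, max_sentences)

# Placeholder corpora: the sizes are invented for illustration only.
corpora = {
    "large_treebank": [f"sentence {i}" for i in range(5000)],
    "small_treebank": [f"sentence {i}" for i in range(1200)],
}
sampled = {name: downsample(sents) for name, sents in corpora.items()}
print({name: len(sents) for name, sents in sampled.items()})
```

Capping every training set at the same size keeps differences between languages from being explained by the amount of training data alone.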
6 Method and experiments
We approach the task of PoS tagging using a fairly standard bi-directional LSTM architecture, based on Plank et al. (2016). The system is implemented using DyNet (Neubig et al., 2017). We train using the Adam optimisation algorithm (Kingma and Ba, 2014) over a maximum of 10 epochs, using early stopping. We make two modifications to the bi-LSTM architecture of Plank et al. (2016). First of all, we do not use any atomic embedded word representations, but rather use only character-based word representations. This choice was made so as to encourage the model not to rely on language-specific vocabulary. Additionally, we concatenate a pre-trained language embedding to each word representation. That is to say, in the original bi-LSTM formulation of Plank et al. (2016), each word $w$ is represented as $\vec{w} + \mathrm{LSTM}_c(w)$, where $\vec{w}$ is an embedded word representation, and $\mathrm{LSTM}_c(w)$ denotes the final states of a character bi-LSTM running over the characters in a word. In our formulation, each word $w$ in language $l$ is represented as $\mathrm{LSTM}_c(w) + \vec{l}$, where $\mathrm{LSTM}_c(w)$ is defined as before, and $\vec{l}$ is an embedded language representation. We use a two-layer deep bi-LSTM, with 100 units in each layer. The character embeddings used also have 100 dimensions. We update the language representations, $\vec{l}$, during training. The language representations are 64-dimensional, and are initialised using the language embeddings from Östling and Tiedemann (2017). All PoS tagging results reported are the average of five runs, each with different initialisation seeds, so as to minimise ran-
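The word representation described above, a character-based vector with a language embedding concatenated to it (reading "+" as concatenation, as the text states), can be sketched as below. This is not the DyNet system: `standin_char_encoder` is a hypothetical placeholder for the character bi-LSTM, its 100-dimensional output size is an assumption, the language embeddings are random rather than pre-trained, and the language codes and example sentence are illustrative. Only the 64-dimensional language embedding size comes from the text.

```python
import random

LANG_DIM = 64        # stated in the text
CHAR_REPR_DIM = 100  # assumption: size of the stand-in character-level vector

def standin_char_encoder(word, dim=CHAR_REPR_DIM):
    """Placeholder for LSTM_c(w): a deterministic pseudo-random vector per word.
    A real system would return the final states of a character bi-LSTM."""
    rng = random.Random(word)
    return [rng.uniform(-1.0, 1.0) for _ in range(dim)]

def init_language_embeddings(languages, dim=LANG_DIM, seed=0):
    """Stand-in for loading pre-trained language embeddings (random here)."""
    rng = random.Random(seed)
    return {lang: [rng.gauss(0.0, 0.1) for _ in range(dim)] for lang in languages}

def word_representation(word, lang, lang_embeddings):
    """Concatenate the character-based vector with the language embedding."""
    return standin_char_encoder(word) + lang_embeddings[lang]

lang_embeddings = init_language_embeddings(["fi", "et", "hu", "sme"])
sentence = ["talossa", "on", "kissa"]
inputs = [word_representation(w, "fi", lang_embeddings) for w in sentence]
print(len(inputs), len(inputs[0]))
```

Every word in a sentence receives the same language vector, so the only language-specific signal the tagger sees beyond the characters themselves is the embedding.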
NLP Tasks
• Phonological tasks:
Grapheme-to-Phoneme (102 languages)
ASJP Phonological reconstruction (824 languages)
• Morphological tasks:
Morphological inflection (29 languages)
• Syntactic tasks:
Part-of-speech tagging (27 languages)
We gratefully acknowledge the support of the NVIDIA Corporation with the donation of the
Titan Xp GPU used for this research.
Presented at the University of Copenhagen
Grand Opening of the Science AI Centre (12 April 2018)