Juman++ v2:
A Practical and Modern
Morphological Analyzer
Arseny Tolmachev
Kyoto University
Kurohashi-Kawahara lab
2018-03-14
#jumanpp    github.com/ku-nlp/jumanpp
What is Juman++
Example: 外国人参政権 ("foreigners' suffrage") [Morita+ EMNLP2015]
[Figure: lattice with two candidate segmentations and their RNNLM probabilities]
  外国 (foreign) | 人参 (carrot) | 政権 (regime):  p(外国,人参,政権) = 0.252 × 10⁻⁸
  外国 (foreign) | 人 (person) | 参政 (vote) | 権 (right):  p(外国,人,参政,権) = 0.761 × 10⁻⁷
Main idea: use a Recurrent Neural Network Language Model to consider semantic plausibility in addition to the usual model score.
Why Juman++? Accuracy!
Setting
• Dataset for training the base model and for evaluation: Kyoto University Text Corpus (NEWS), Kyoto University Web Document Leads Corpus (WEB)
• Dataset for training the RNNLM: 10 million raw sentences crawled from the web
Example:
  Juman++:  感想 (impression) | や (and) | ご (honorific prefix) | 要望 (request)
  JUMAN:    感想 | やご | 要望    (やご = a larva of a dragonfly)
[Chart: word segmentation & POS tagging F1 for JUMAN, MeCab, and Juman++; Juman++ scores highest]
[Chart: number of fatal errors in 1,000 sentences for JUMAN vs. Juman++; Juman++ makes far fewer]
Why not Juman++? Speed!
Analyzer     Analysis speed (sentences/second)
MeCab        53,000
KyTea         2,000
JUMAN         8,800
Juman++          16
Why not Juman++ V1? Speed!
Analyzer      Analysis speed (sentences/second)
MeCab         53,000
KyTea          2,000
JUMAN          8,800
Juman++ V1        16
Juman++ V2     4,800    (250x faster than V1)
Why Juman++ V2? Accuracy!!!
[Chart: F1 on Kyoto (Seg + POS) and Leads (Seg + POS) for JUMAN, KyTea, MeCab*, J++ V1, and J++ V2*; J++ V2 scores highest on both]
* = hyper-parameters optimized by 10-fold cross-validation
All systems use the same Jumandic and are trained on the concatenation of the Kyoto/KWDLC corpora.
What is Juman++ v2 (at its core)?
A dictionary-independent, thread-safe library (not just a binary program) for morphological analysis, using lattice-based segmentation, optimized for n-best output.
Reasons of speedup
These range from algorithmic to (micro)architectural:
• Dictionary representation
• No unnecessary computations
• Reduced search space
• Cache-friendly data structures
  • Lattice => struct of arrays
• Code generation
  • Mostly branch-free feature extraction
• Weight prefetching
• RNN-specific: vectorization, batching
Juman++ Model
For an input sentence:
• Build a lattice
• Assign a score to each path through the lattice:

  s = Σ_i w_i φ_i + α (s_RNN + β)

(linear model: weights w_i, features φ_i; s_RNN is the RNN LM score)
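As a minimal sketch of this score (hypothetical names; the real feature extraction and scoring code in Juman++ is generated and branch-free, so this is only an illustration):

#include <cstdint>
#include <vector>

// Sketch of the path score s = sum_i w_i * phi_i + alpha * (s_rnn + beta).
// `features` holds the hashed n-gram feature values of one lattice path;
// the weight table size is a power of two, so `hash & (size - 1)` maps a
// feature hash to a weight index (as in the "Score +=" line later on).
float PathScore(const std::vector<uint64_t>& features,
                const std::vector<float>& weights,
                float rnnScore, float alpha, float beta) {
  float linear = 0.f;
  const uint64_t mask = weights.size() - 1;  // weights.size() is 2^k
  for (uint64_t f : features) {
    linear += weights[f & mask];
  }
  return linear + alpha * (rnnScore + beta);
}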
Linear Model Features
Created from n-gram feature templates (1-, 2-, 3-grams) that use dictionary info, surface characters, and character type.
67 n-gram feature templates for Jumandic, for example (see the sketch below):
• BIGRAM (POS)(POS)    (品詞 = POS)
• BIGRAM (POS-subPOS)(POS-subPOS)    (細分類 = sub-POS)
• TRIGRAM (POS)(POS)(POS)
...
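As a rough illustration of how one such template turns into a single feature, a hedged sketch (the mixing function and names are assumptions, not the actual generated extraction code): the template id is hashed together with the POS values of three adjacent nodes.

#include <cstdint>

// Sketch: turn one n-gram feature template into a single hashed feature.
// Component values are already integers (pointers into dictionary columns).
inline uint64_t MixHash(uint64_t h, uint64_t v) {
  // Simple mixing step; the real implementation differs.
  return h ^ (v + 0x9e3779b97f4a7c15ull + (h << 6) + (h >> 2));
}

// TRIGRAM (POS)(POS)(POS): template id mixed with POS of the t2, t1, t0 nodes.
inline uint64_t TrigramPosFeature(uint64_t templateId, uint64_t posT2,
                                  uint64_t posT1, uint64_t posT0) {
  uint64_t h = templateId;
  h = MixHash(h, posT2);
  h = MixHash(h, posT1);
  h = MixHash(h, posT0);
  return h;
}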
Dictionary as column database
Dictionary field values become pointers.
Field data is deduplicated, and each field (e.g. surface, length, ...) is stored separately as its own column (a sketch follows).
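A hedged sketch of the column layout (field names and types are illustrative, not the real on-disk format): every distinct value of a field is stored once in that field's pool, and a dictionary entry is just a row of integer "pointers" into the pools.

#include <cstdint>
#include <string>
#include <unordered_map>
#include <vector>

// Sketch of a column-oriented dictionary: each field is a deduplicated
// value pool, and an entry stores only indices into those pools.
struct Column {
  std::vector<std::string> values;                  // deduplicated field values
  std::unordered_map<std::string, uint32_t> index;  // value -> id

  uint32_t Intern(const std::string& v) {
    auto it = index.find(v);
    if (it != index.end()) return it->second;
    uint32_t id = static_cast<uint32_t>(values.size());
    values.push_back(v);
    index.emplace(v, id);
    return id;
  }
};

struct Dictionary {
  Column surface, pos, subpos;  // one column per field
  struct Entry { uint32_t surface, pos, subpos; };
  std::vector<Entry> entries;   // rows of "pointers" into the columns

  void Add(const std::string& s, const std::string& p, const std::string& sp) {
    entries.push_back({surface.Intern(s), pos.Intern(p), subpos.Intern(sp)});
  }
};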
Dictionary: Details
• Dictionary is mmap-able
  • Loading it from disk is almost free and cacheable by the OS (see the sketch below)
• What the dictionary gives us:
  • Handle strings as integers (data pointers)
  • Use data pointers as primitive features
  • Compute n-gram features by hashing components together
• Dictionary + smart hashing implementation = 8x speedup
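A minimal POSIX sketch of why mmap-ing helps (illustrative only, not the actual Juman++ loader): the dictionary image is mapped read-only, so pages are loaded lazily and served from the OS page cache on later runs.

#include <cstddef>
#include <fcntl.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

// Sketch: map a prebuilt dictionary image read-only. Pages are faulted in
// lazily and shared across processes via the page cache, so "loading" costs
// almost nothing after the file has been read once.
const void* MapDictionary(const char* path, size_t* size) {
  int fd = open(path, O_RDONLY);
  if (fd < 0) return nullptr;
  struct stat st;
  if (fstat(fd, &st) != 0) { close(fd); return nullptr; }
  void* data = mmap(nullptr, static_cast<size_t>(st.st_size), PROT_READ,
                    MAP_SHARED, fd, 0);
  close(fd);  // the mapping stays valid after closing the descriptor
  if (data == MAP_FAILED) return nullptr;
  *size = static_cast<size_t>(st.st_size);
  return data;
}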
Dic/model size on Jumandic
             KyTea    MeCab    Juman++ V1    Juman++ V2
Dictionary   -        311M     445M          158M
Model        200M     7.7M     135M          16M

The raw dictionary (CSV with expanded inflections) is 256MB.
KyTea doesn't store all dictionary information (it uses only surface, POS, and sub-POS).
Note: 44% of the V2 dictionary is the DARTS trie.
Quiz: Where is the bottleneck?
1. Dictionary lookup/lattice construction
• Trie lookup, dictionary access
2. Feature computation
• Hashing, many conditionals
3. Score computation
• Score += weights[feature & (length - 1)];
4. Output
• Formatting, string concatenation
Quiz answer: 3, score computation. The array access (not the sum) was taking ~80% of all time, due to L2 cache/dTLB misses.
Latency numbers every programmer should know

Action                  Time      Comments
L1 cache reference      0.5 ns
Branch mispredict       5 ns
L2 cache reference      7 ns      14x L1 cache
Main memory reference   100 ns    20x L2 cache, 200x L1 cache
Cache misses are really slow!
Random memory access is almost guaranteed to be a
cache miss!
Direction: Reduce number of memory accesses
-> reduce number of score computations
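One mitigation (the "weight prefetching" item from the speedup list) is to issue prefetches for the weight cells of upcoming features while the current ones are being summed. A hedged sketch using the GCC/Clang builtin, with an assumed prefetch distance; this is not the actual Juman++ code:

#include <cstdint>
#include <vector>

// Sketch: hide part of the random-access latency of `weights[...]` by
// prefetching a few features ahead of the accumulation point.
float ScoreWithPrefetch(const std::vector<uint64_t>& features,
                        const std::vector<float>& weights) {
  const uint64_t mask = weights.size() - 1;  // power-of-two weight table
  const size_t ahead = 8;                    // prefetch distance (tuned empirically)
  float score = 0.f;
  for (size_t i = 0; i < features.size(); ++i) {
    if (i + ahead < features.size()) {
      __builtin_prefetch(&weights[features[i + ahead] & mask], /*rw=*/0, /*locality=*/1);
    }
    score += weights[features[i] & mask];
  }
  return score;
}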
Lattice and terms
[Figure: one lattice boundary and the nodes around it]
• Right nodes: begin on the boundary (also called t0)
• Left nodes: end on the boundary (also called t1)
• Context nodes: t2
Beams in Juman++
Each t1 node keeps only a small number of top-scoring paths coming from t2.
Using beams is crucial for correct use of high-order features.
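A hedged sketch of this per-node (local) beam, with assumed names and a simple partial sort rather than whatever the real implementation does: each t1 node keeps only its top-scoring incoming paths.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch: a path hypothesis ending at a t1 node, keeping enough history
// (the previous node) for trigram features to be evaluated later.
struct BeamEntry {
  float score;
  uint32_t prevNode;  // t2 node this path came from
};

// Keep only the `beamSize` best incoming paths of one t1 node.
void ShrinkBeam(std::vector<BeamEntry>& beam, size_t beamSize) {
  if (beam.size() <= beamSize) return;
  std::partial_sort(beam.begin(), beam.begin() + beamSize, beam.end(),
                    [](const BeamEntry& a, const BeamEntry& b) {
                      return a.score > b.score;
                    });
  beam.resize(beamSize);
}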
New in V2: Search space trimming
There can be a lot of paths going through the lattice, and most of the candidate paths can't be even remotely correct.
Global Beam: Left
Instead of considering all paths ending on a boundary, consider only the top k paths ending on the current boundary.
Global Beam: Right
Idea: don't consider nodes which don't make sense in context.
Use the top m left paths to compute the scores of the right nodes. Compute the connections to the remaining (k - m) left paths only for the top l right nodes (m = 1, l = 1 in the figure).
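A hedged sketch of the left part of this idea (names and the ranking data structure are assumptions): all paths ending on the current boundary are ranked once, and only the global top-k survive to be connected to right nodes.

#include <algorithm>
#include <cstdint>
#include <vector>

// Sketch of the left global beam: rank every path that ends on the
// boundary and keep only the global top-k (k = left beam size).
struct EndingPath {
  float score;
  uint32_t leftNode;  // t1 node the path ends with
  uint32_t beamSlot;  // which local beam entry of that node
};

std::vector<EndingPath> LeftGlobalBeam(std::vector<EndingPath> paths, size_t k) {
  if (paths.size() > k) {
    std::partial_sort(paths.begin(), paths.begin() + k, paths.end(),
                      [](const EndingPath& a, const EndingPath& b) {
                        return a.score > b.score;
                      });
    paths.resize(k);
  }
  return paths;
}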
Effects of global beams
• Almost every sentence has places where there are 20-30 left/right nodes
• Juman++ v2 default settings:
  • Local beam size = 5
  • Left beam size = 6
  • Right beam size = 1/5
• Far fewer candidates are considered: in total, ~4x speedup
• Accuracy considerations (the ranking procedure should at least keep sensible paths in the right beam):
  • Use surface characters after the current token as features
  • During training, use smaller global beam sizes
Seg+POS F1 with different global beam parameters
[Chart: the model is trained with one set of global beam parameters and tested with another; ρ is the size of both the left and right beams; a baseline curve uses no global beam]
Juman++ V2 RNNLM
• The V1 RNNLM implementation was non-batched and non-vectorizable
• The V2 RNNLM:
  • Uses Eigen for vectorization
  • Batches all computations (see the sketch below)
  • Deduplicates paths with identical states
• In the end, only the paths that remain in the EOS beam are evaluated; essentially, the RNN is used to rerank paths
• Opinion: evaluating the whole lattice is a waste of energy
• Help wanted for an LSTM language model implementation (currently Mikolov's RNNLM)
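A hedged sketch of the batching idea only (shapes, names, and the use of a plain matrix product are assumptions; a real scorer would also handle normalization and state updates): the hidden states of the deduplicated surviving paths are gathered into one matrix, so candidate words are scored with a single vectorized product instead of per-path loops.

#include <Eigen/Dense>
#include <vector>

// Sketch: score candidate words for P deduplicated paths in one product.
// H = hidden size, V = vocabulary size.
Eigen::MatrixXf BatchedOutputScores(
    const std::vector<Eigen::VectorXf>& uniqueStates,  // P states, each H-dim
    const Eigen::MatrixXf& outputEmbeddings) {         // V x H
  const Eigen::Index H = outputEmbeddings.cols();
  const Eigen::Index P = static_cast<Eigen::Index>(uniqueStates.size());
  Eigen::MatrixXf states(H, P);
  for (Eigen::Index i = 0; i < P; ++i) states.col(i) = uniqueStates[i];
  // One vectorized product: (V x H) * (H x P) -> V x P score matrix.
  return outputEmbeddings * states;
}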
Future work
• Improve robustness on informal Japanese (web)
• Create Jumanpp-Unidic
• Use it to bootstrap correct readings in our
annotated corpora:
• Kyoto Corpus
• Kyoto Web Document Leads Corpus
Conclusion
• Fast trigram-feature/RNNLM-based morphological analyzer
• Usable as a library in multithreaded environments
• Not hardwired to Jumandic: can be used with different dictionaries/standards
• SOTA F1 on Jumandic
• Smallest model size (when used without the RNN)
• Can use surface features (character/character type)
• Can train with partially annotated data
github.com/ku-nlp/jumanpp
#jumanpp

