Improving Distributional Similarity
with Lessons Learned from Word
Embeddings
Omer Levy, Yoav Goldberg, Ido Dagan
arXivTimes, 2017/1/24
Presenter: Hironsan
Twitter : Hironsan13
GitHub : Hironsan
Transactions of the Association for Computational Linguistics, 2015
Abstract
• This paper investigates whether the performance gains of word embeddings come largely from hyperparameter tuning rather than from the algorithms themselves
• Four methods are evaluated under a wide range of hyperparameter combinations
• On word similarity tasks, count-based methods match the performance of prediction-based methods
• On analogy tasks, prediction-based methods still come out ahead
Word Similarity & Relatedness
• How similar is pizza to pasta?
• How related is pizza to Italy?
• Representing words as vectors lets us compute their similarity (a small example follows)
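To make this concrete, here is a minimal sketch (mine, not from the slides) of computing word similarity with cosine similarity; the toy vectors are made up purely for illustration:

import numpy as np

def cosine_similarity(u, v):
    # cos(u, v) = (u . v) / (||u|| * ||v||)
    return np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v))

# toy 3-dimensional vectors, purely illustrative
pizza = np.array([0.9, 0.1, 0.3])
pasta = np.array([0.8, 0.2, 0.4])
italy = np.array([0.5, 0.7, 0.1])

print(cosine_similarity(pizza, pasta))  # higher: similar words
print(cosine_similarity(pizza, italy))  # lower, but still related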
Approaches for Representing Words
Word Embeddings (Predict)
• Inspired by deep learning
• word2vec (Mikolov et al., 2013)
• GloVe (Pennington et al., 2014)
Underlying Theory: The Distributional Hypothesis
(Harris, ’54)
“Similar words appear in similar contexts”
Distributional Semantics (Count)
• Used since the 90’s
• Sparse word-context PMI/PPMI matrix
• Decomposed with SVD
Approaches for Representing Words
Both approaches:
• rely on the distributional hypothesis
• use the same data
• are mathematically related
• “Neural Word Embedding as Implicit Matrix Factorization” (NIPS 2014)
• So which approach is actually better?
• “Don’t Count, Predict!” (Baroni et al., ACL 2014)
• As the title says, it argues for the predict-based approach. But is that really the case?
The Contributions of Word Embeddings
Novel Algorithms
(objective + training method)
• Skip Grams + Negative Sampling
• CBOW + Hierarchical Softmax
• Noise Contrastive Estimation
• GloVe
• …
Which of these is more important for improving performance?
New Hyperparameters
(preprocessing, smoothing, etc.)
• Subsampling
• Dynamic Context Windows
• Context Distribution Smoothing
• Adding Context Vectors
• …
Contributions
1) Identifying the existence of new hyperparameters
• Not always mentioned in papers
2) Adapting the hyperparameters across algorithms
• Must understand the mathematical relation between
algorithms
3) Comparing algorithms across all hyperparameter
settings
• Over 5,000 experiments
Comparing Methods
The following four methods are compared while varying their hyperparameters:
Count-based
• PPMI (Positive Pointwise Mutual Information)
• SVD (Singular Value Decomposition)
Prediction-based
• SGNS (Skip-Gram with Negative Sampling)
• GloVe (Global Vectors)
Hyperparameters
Preprocessing Hyperparameters
• Window Size(win)
• Dynamic Context Window (dyn)
• Subsampling (sub)
• Deleting Rare Words (del)
Association Metric Hyperparameters
• Shifted PMI (neg)
• Context Distribution Smoothing (cds)
Post-processing Hyperparameters
• Adding Context Vectors (w+c)
• Eigenvalue Weighting (eig)
• Vector Normalization (nrm)
Dynamic Context Windows (dyn)
• Context words closer to the target word are given more weight
• word2vec does this stochastically: it samples the window size uniformly from 1 to win for each token, so a context word at distance d is used with probability (win − d + 1) / win
• GloVe does it deterministically, weighting a co-occurrence at distance d by 1/d
(a small sketch of both schemes follows)
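A rough sketch (my own, not from the slides) of the two weighting schemes, assuming a symmetric window of size win and a context word at distance d from the target:

def word2vec_dyn_weight(d, win):
    # word2vec samples the actual window size uniformly from 1..win,
    # so a word at distance d is used with probability (win - d + 1) / win
    return (win - d + 1) / win

def glove_dyn_weight(d):
    # GloVe weights a co-occurrence at distance d deterministically by 1/d
    return 1.0 / d

for d in range(1, 6):
    print(d, word2vec_dyn_weight(d, win=5), glove_dyn_weight(d))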
Subsampling (sub)
• Subsampling is used to thin out very frequent words
• A word whose relative frequency f exceeds a threshold t is removed with probability p = 1 − √(t / f)
• f is the word’s frequency in the corpus
• Typical values of t range from 10⁻³ to 10⁻⁵
(sketch below)
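A small sketch of the subsampling rule described above (word2vec’s formulation); the corpus handling and threshold are illustrative:

import random

def subsample(tokens, t=1e-4):
    # f[w] = relative frequency of word w in the corpus
    total = len(tokens)
    freq = {}
    for w in tokens:
        freq[w] = freq.get(w, 0) + 1
    f = {w: c / total for w, c in freq.items()}

    kept = []
    for w in tokens:
        # discard w with probability p = 1 - sqrt(t / f[w]); p is 0 when f[w] <= t
        p_discard = max(0.0, 1.0 - (t / f[w]) ** 0.5)
        if random.random() >= p_discard:
            kept.append(w)
    return kept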
Deleting Rare Words (del)
• Rare words are removed from the corpus
• They are deleted before the context windows are created (sketch below)
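A corresponding sketch for rare-word deletion, applied before context windows are extracted (min_count is an assumed name, not from the paper):

from collections import Counter

def delete_rare_words(tokens, min_count=5):
    counts = Counter(tokens)
    # drop rare words first; the surviving words then sit closer together
    # when context windows are built afterwards
    return [w for w in tokens if counts[w] >= min_count]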
Transferable Hyperparameters
Preprocessing Hyperparameters
• Window Size(win)
• Dynamic Context Window (dyn)
• Subsampling (sub)
• Deleting Rare Words (del)
Association Metric Hyperparameters
• Shifted PMI (neg)
• Context Distribution Smoothing (cds)
Post-processing Hyperparameters
• Adding Context Vectors (w+c)
• Eigenvalue Weighting (eig)
• Vector Normalization (nrm)
Shifted PMI (neg)
• Shift PMI by log k, where k is the negative-sampling parameter of SGNS (Levy and Goldberg, 2014):
SPPMI(w, c) = max(PMI(w, c) − log k, 0)
• In other words, SGNS’s negative-sampling parameter k is carried over to the PMI association metric (sketch below)
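A minimal sketch (my own, assuming a dense word-context co-occurrence matrix counts; the real matrices are sparse) of Shifted Positive PMI:

import numpy as np

def sppmi(counts, k=5):
    # counts[i, j] = number of times word i co-occurs with context j
    total = counts.sum()
    p_w = counts.sum(axis=1, keepdims=True) / total   # P(w)
    p_c = counts.sum(axis=0, keepdims=True) / total   # P(c)
    p_wc = counts / total                              # P(w, c)
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c))
    pmi[~np.isfinite(pmi)] = 0.0
    # SPPMI(w, c) = max(PMI(w, c) - log k, 0)
    return np.maximum(pmi - np.log(k), 0.0)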
Context Distribution Smoothing (cds)
• SGNS draws its negative samples from a smoothed version of the unigram distribution P (raised to the power 0.75), not from P itself
• The same smoothing can be applied to the context distribution inside PMI
• This dampens the influence of rare words and is consistently effective (sketch below)
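The same dense-matrix sketch with context distribution smoothing: the context probabilities are computed from counts raised to the power alpha = 0.75 and renormalized:

import numpy as np

def smoothed_ppmi(counts, alpha=0.75):
    total = counts.sum()
    p_wc = counts / total
    p_w = counts.sum(axis=1, keepdims=True) / total
    # smoothed context distribution: P_alpha(c) = #(c)^alpha / sum_c #(c)^alpha
    c_counts = counts.sum(axis=0, keepdims=True)
    p_c_alpha = c_counts ** alpha / (c_counts ** alpha).sum()
    with np.errstate(divide="ignore", invalid="ignore"):
        pmi = np.log(p_wc / (p_w * p_c_alpha))
    pmi[~np.isfinite(pmi)] = 0.0
    return np.maximum(pmi, 0.0)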
Transferable Hyperparameters
Preprocessing Hyperparameters
• Window Size(win)
• Dynamic Context Window (dyn)
• Subsampling (sub)
• Deleting Rare Words (del)
Association Metric Hyperparameters
• Shifted PMI (neg)
• Context Distribution Smoothing (cds)
Post-processing Hyperparameters
• Adding Context Vectors (w+c)
• Eigenvalue Weighting (eig)
• Vector Normalization (nrm)
Adding Context Vectors (w+c)
• SGNS produces word vectors w
• SGNS also produces context vectors c
• So do GloVe and SVD
• Represent each word as w + c instead of w alone (Pennington et al., 2014)
• Until now this had only been applied to GloVe (sketch below)
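A trivial sketch of the w+c post-processing step, assuming W and C are row-aligned matrices of word and context vectors (shapes are illustrative):

import numpy as np

def add_context_vectors(W, C):
    # represent each word by the sum of its word vector and its context vector
    return W + C

# illustrative shapes: vocabulary of 10,000 words, 300 dimensions
W = np.random.randn(10000, 300)
C = np.random.randn(10000, 300)
vectors = add_context_vectors(W, C)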
Eigenvalue Weighting (eig)
• Changes how the singular values obtained from the SVD are weighted when building the vectors
• Plain SVD gives W = U_d · Σ_d and C = V_d, but this factorization is not necessarily the best one for similarity tasks
• A symmetric weighting, W = U_d · Σ_d^0.5 and C = V_d · Σ_d^0.5, is better suited to semantic tasks (sketch below)
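A sketch of building SVD vectors with a tunable eigenvalue-weighting exponent, where M would be the (shifted, smoothed) PPMI matrix; eig = 0.5 is the symmetric variant, and the treatment of C here is simplified:

import numpy as np

def svd_vectors(M, dim=300, eig=0.5):
    # truncated SVD: M ~ U_d * diag(S_d) * Vt_d
    U, S, Vt = np.linalg.svd(M, full_matrices=False)
    U, S, Vt = U[:, :dim], S[:dim], Vt[:dim, :]
    # eig = 1.0 -> classic W = U_d * S_d; eig = 0.0 -> W = U_d;
    # eig = 0.5 -> the symmetric weighting that suits similarity tasks
    W = U * (S ** eig)
    C = Vt.T * (S ** eig)  # C given the same exponent for simplicity
    return W, C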
Vector Normalization (nrm)
• The resulting word vectors are normalized
• Each vector is scaled to unit length (sketch below)
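Finally, row-wise L2 normalization, so that dot products between word vectors become cosine similarities:

import numpy as np

def normalize_rows(W, eps=1e-8):
    # scale every word vector to unit L2 length
    norms = np.linalg.norm(W, axis=1, keepdims=True)
    return W / (norms + eps)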
Transferable Hyperparameters
• By matching up the corresponding hyperparameters across algorithms, the methods are compared under conditions that are as equal as possible
Comparing Algorithms
Systematic Experiments
• 9 Hyperparameters
• 4 Word Representation Algorithms
• PPMI (Sparse & Explicit)
• SVD(PPMI)
• SGNS
• GloVe
• 8 Benchmarks
• 6 Word Similarity Tasks
• 2 Analogy Tasks
Hyperparameter Settings
Classic Vanilla Setting
(commonly used for baselines)
• Preprocessing
• <None>
• Postprocessing
• <None>
• Association Metric
• Vanilla PMI/PPMI
Recommended word2vec Setting
(tuned for SGNS)
• Preprocessing
• Dynamic Context Window
• Subsampling
• Postprocessing
• <None>
• Association Metric
• Shifted PMI/PPMI
• Context Distribution Smoothing
Experiments: Hyperparameter Tuning
[Bar chart: Spearman’s correlation on the WordSim-353 Relatedness benchmark for PPMI (Sparse Vectors) vs. SGNS (Embeddings) across the vanilla, word2vec-recommended, and optimal hyperparameter settings. PPMI: 0.54 / 0.587 / 0.688; SGNS: 0.623 / 0.697 / 0.681 (y-axis 0.3–0.7). With fully tuned hyperparameters, PPMI is on par with SGNS.]
Overall Results
• Tuning hyperparameters is sometimes more important than the choice of algorithm
• On word similarity tasks, count-based methods perform on par with prediction-based methods
• On analogy tasks, prediction-based methods remain superior
Conclusions
Conclusions: Distributional Similarity
Two kinds of contributions affect the performance of word embeddings:
• Novel Algorithms
• New Hyperparameters
What really matters for performance?
• Hyperparameters (mostly)
• Algorithms
References
• Omer Levy and Yoav Goldberg, 2014, Neural word embeddings as
implicit matrix factorization
• Tomas Mikolov, Kai Chen, Greg Corrado, Jeffrey Dean, 2013,
Efficient Estimation of Word Representations in Vector Space
• Jeffrey Pennington, Richard Socher, Christopher D. Manning, 2014,
GloVe: Global Vectors for Word Representation
If you are interested in
Machine Learning, Natural Language Processing, Computer Vision,
Follow arXivTimes @ https://twitter.com/arxivtimes

Speaker Notes

  1. …the more relevant it is. word2vec does just that, by randomly sampling the size of the context window around each token. What you see here are the probabilities that each specific context word will be included in the training data. GloVe also does something similar, but in a deterministic manner, and with a slightly different distribution. There are of course other ways to do this, And they were all, apparently, Applied to traditional algorithms about a decade ago.
  2. What I’m going to do, is show you an example of how previous work compared algorithms, And how things change when you compare apples to apples.
  3. On the other hand, we have the recommended setting of word2vec, Which was tuned for SGNS… …and does quite a bit more. Now, when we just look at these 2 settings, side by side, It’s easier to see that part of word2vec’s contribution is this collection of hyperparameters.
  4. BUT this isn’t the end of the story. What if the word2vec setting isn’t the best setting for this task? After all, it was tuned for a certain type of analogies. The right thing to do is to allow each method... …to tune every hyperparameter. As you can all see, tuning hyperparameters can take us well beyond conventional settings, And also give us a more elaborate and realistic comparison of different algorithms. <> Now, it’s important to stress that these two settings, Can be, and usually are, Very different in practice. In fact, each task and each algorithm usually require different settings, Which is why we used cross-validation to tune hyperparameters, when we compared the different algorithms.
  5. Now, this is only a little taste of our experiments, And you’ll have to read the paper to get the entire picture, But I will try to point out three major observations: First, tuning hyperparameters often has a stronger positive effect than switching to a fancier algorithm. Second, tuning hyperparameters might be even more effective than adding more data. And third, previous claims that one algorithm is better than another, are not 100% true.
  6. So, in conclusion.
  7. And besides, SGNS is so memory-efficient and easy-to-use, It’s become my personal favorite.