AINL 2016: Malykh

  1. Robust Word Vectors for Russian Language. V. Malykh, Laboratory of Neural Systems and Deep Learning, Moscow Institute of Physics and Technology (State University), http://www.mipt.ru/. Artificial Intelligence and Natural Language Conference (AINL), 2016.
  2. Outline: 1. Word Vectors. 2. Our Approach (Description, LSTM, BME Representation, Architecture). 3. Experiments (1st Corpus, 2nd Corpus).
  3. Word Vectors. Vector representations of words are a basic technique for letting computers work with words in a more convenient manner than with simple categorical features. Figure adapted from deeplearning4j.com.
  4. Existing approaches. Two main word-level approaches: local context (e.g. Word2Vec [Mikolov2013]) and co-occurrence matrix decomposition (e.g. GloVe [Pennington2014]). Drawbacks: the model needs to recognize each word exactly; out-of-vocabulary words cannot be handled.
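The out-of-vocabulary drawback of word-level models can be illustrated with a toy lookup table (the words and vectors below are made up for illustration):

```python
# Word-level models store one vector per vocabulary word, so a lookup
# succeeds only for words seen verbatim during training.
embeddings = {
    "cat": [0.2, 0.7],
    "dog": [0.3, 0.6],
}

def embed(word):
    """Exact-match lookup: a typo or unseen inflection yields no vector."""
    return embeddings.get(word)  # None for out-of-vocabulary words

embed("cat")   # a vector
embed("caat")  # None: a single typo breaks the lookup
```

This brittleness under misspellings is exactly what the character-based approach in these slides is designed to avoid.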
  5. Existing approaches (2). Character-level approach: read the letters and try to predict the word they represent, e.g. [Ling2015]. Drawback: out-of-vocabulary words are still a problem.
  7. Our Approach: Description. In our architecture we do not use any co-occurrence matrices. There is no notion of a vocabulary in the model. Context is handled by memorization in the weights of a neural net. We use recurrent layers to achieve this property.
  12. Long Short-Term Memory. Figure adapted from http://deeplearning.net/tutorial/lstm.html.
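For reference, a single step of a standard LSTM cell (the textbook formulation, not necessarily the exact variant used in the paper) can be sketched as:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_step(x, h_prev, c_prev, W, U, b):
    """One LSTM time step. W maps the input, U maps the previous hidden
    state; both project to the stacked pre-activations of the four
    components: input gate i, forget gate f, output gate o, candidate g."""
    z = W @ x + U @ h_prev + b
    i, f, o, g = np.split(z, 4)
    i, f, o = sigmoid(i), sigmoid(f), sigmoid(o)
    g = np.tanh(g)
    c = f * c_prev + i * g   # cell state: gated mix of old memory and new input
    h = o * np.tanh(c)       # hidden state emitted at this step
    return h, c
```

The cell state `c` is what lets the recurrent layers in this architecture memorize context across a sequence of characters.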
  14. BME Representation. B (begin): the first 3 letters, in one-hot form. M (middle): all letters, as counts over the alphabet. E (end): the last 3 letters, in one-hot form.
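The BME encoding above can be sketched as follows. The Latin alphabet is used here only for illustration (the paper targets the Russian alphabet), and the handling of words shorter than 3 letters is an assumption:

```python
from collections import Counter
import string

ALPHABET = string.ascii_lowercase  # illustration only; the paper uses Russian

def one_hot(letter):
    """One-hot vector of length |alphabet| for a single letter."""
    vec = [0] * len(ALPHABET)
    vec[ALPHABET.index(letter)] = 1
    return vec

def bme_encode(word):
    """Concatenate B (one-hot of first 3 letters), M (letter counts over
    the alphabet), and E (one-hot of last 3 letters)."""
    counts = Counter(word)
    begin = [x for ch in word[:3] for x in one_hot(ch)]
    middle = [counts.get(ch, 0) for ch in ALPHABET]
    end = [x for ch in word[-3:] for x in one_hot(ch)]
    return begin + middle + end
```

For a 26-letter alphabet this yields a fixed 3*26 + 26 + 3*26 = 182-dimensional input per word, independent of vocabulary size.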
  16. (Architecture figure.)
  17. Math description. Negative contrastive estimation loss: NCE = e^{-s(v,c)} + e^{s(v,c')} (1), where v is a word vector, c is a word vector from the word's context, c' is a word vector outside of the word's context, and s(x, y) is a scoring function. We use cosine similarity as the scoring function: cos(x, y) = (x · y) / (|x| |y|) (2).
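Equations (1) and (2) can be written out directly; this sketch uses plain lists for vectors:

```python
import math

def cos(x, y):
    """Cosine similarity, the scoring function s(x, y) of Eq. (2)."""
    dot = sum(a * b for a, b in zip(x, y))
    nx = math.sqrt(sum(a * a for a in x))
    ny = math.sqrt(sum(b * b for b in y))
    return dot / (nx * ny)

def nce_loss(v, c_pos, c_neg):
    """Loss of Eq. (1): small when v is similar to the in-context vector
    c_pos and dissimilar to the out-of-context vector c_neg."""
    return math.exp(-cos(v, c_pos)) + math.exp(cos(v, c_neg))
```

Minimizing this loss pulls a word vector towards vectors from its context and pushes it away from vectors sampled outside the context.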
  19. Corpus Description. News-headings corpus in Russian. 3 classes: strong paraphrase, weak paraphrase, and non-paraphrase. First introduced in 2015 in [Pronoza2015]; an extended version appeared in 2016.
  22. Corpus statistics. Pairs: 7227 (in our experiments we use only the strong-paraphrase (1) and non-paraphrase (-1) classes). Strong paraphrase pairs: 1668. Weak paraphrase pairs: 2957. Non-paraphrase pairs: 2582.
  23. Experiment setup. The metric is ROC AUC, with cosine similarity interpreted as the probability of the positive class (strong paraphrase). We add artificial noise: inserted letters, deleted letters, and replaced letters. 10 runs for each noise level.
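The three kinds of artificial noise named above can be sketched as a per-character corruption process. The exact noise model of the paper is not specified on the slides, so the per-character probability and the uniform choice among the three operations are assumptions (the Latin alphabet again stands in for Russian):

```python
import random

ALPHABET = "abcdefghijklmnopqrstuvwxyz"  # illustration; the paper uses Russian

def add_noise(text, level, rng=None):
    """Corrupt each character with probability `level` by one of the three
    operations from the slide: insert a random letter, delete the
    character, or replace it with a random letter."""
    rng = rng or random.Random(0)
    out = []
    for ch in text:
        if rng.random() < level:
            kind = rng.choice(["insert", "delete", "replace"])
            if kind == "insert":
                out.append(ch)
                out.append(rng.choice(ALPHABET))
            elif kind == "replace":
                out.append(rng.choice(ALPHABET))
            # "delete": append nothing
        else:
            out.append(ch)
    return "".join(out)
```

Running the evaluation at increasing `level` values and averaging ROC AUC over 10 seeded runs per level reproduces the shape of the experiment described above.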
  26. Results. Figure: results on the Paraphraser corpus.
  28. Corpus Description. Plagiarism detection in scientific papers. 150 pairs of articles' titles & descriptions in Russian. 3 human experts each produced an evaluation in [0, 1]. Introduced in 2014 in [Derbenev2014].
  32. Results (2). Table: results on the scientific-plagiarism corpus. Random Baseline: 0.213 ± 0.025. Word2Vec Baseline: 0.189. Robust Word2Vec: 0.232.
  33. Summary. We have introduced an architecture that produces word vectors based on characters. It does not store word vectors explicitly, so it has a fixed number of weights that does not depend on the vocabulary size. The architecture does not rely on any pre-processing (e.g. stemming). It outperforms existing word-vector models in noisy environments.
  34. Future Work. Find more paraphrase corpora, ideally naturally noisy (e.g. the Twitter Paraphrase Corpus for English). Try the architecture on other languages. Improve quality in low-noise regimes by means of deeper architectures, attention, etc.
  35. Thank you for your attention! I would be happy to answer your questions.
  36. References I. N. V. Derbenev, D. A. Kozliuk, V. V. Nikitin, V. O. Tolcheev. Experimental Research of Near-Duplicate Detection Methods for Scientific Papers. Machine Learning and Data Analysis, Vol. 1 (7), 2014 (in Russian). E. Pronoza, E. Yagunova, A. Pronoza. Construction of a Russian Paraphrase Corpus: Unsupervised Paraphrase Extraction. Proceedings of the 9th Russian Summer School in Information Retrieval (RuSSIR 2015, Young Scientist Conference), August 24-28, 2015, Saint Petersburg, Russia. Springer CCIS. T. Mikolov et al. Distributed Representations of Words and Phrases and their Compositionality. Advances in Neural Information Processing Systems, 2013.
  37. References II. W. Ling et al. Finding Function in Form: Compositional Character Models for Open Vocabulary Word Representation. Proceedings of EMNLP 2015. J. Pennington, R. Socher, and C. D. Manning. GloVe: Global Vectors for Word Representation. EMNLP, Vol. 14, 2014.
