SlideShare una empresa de Scribd logo
1 de 38
Descargar para leer sin conexión
Using Word Embedding for Cross-Language
Plagiarism Detection
Authors
Jérémy Ferrero Frédéric Agnès Laurent Besacier Didier Schwab
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 1
What is Cross-Language Plagiarism Detection?
Cross-Language Plagiarism is a plagiarism by translation, i.e. a text has been
plagiarized while being translated (manually or automatically).
From a text in a language L, we must find similar passage(s) in other text(s) from
among a set of candidate texts in language L’ (cross-language textual similarity).
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 2
Why is it so important?
Sources:
- McCabe, D. (2010). Students’ cheating takes a high-tech turn. In Rutgers Business School.
- Josephson Institute. (2011). What would honest Abe Lincoln say?
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 3
Research Questions
• Are Word Embeddings useful for cross-language
plagiarism detection?
• Is syntax weighting in distributed representations of
sentences useful for the text entailment?
• Are cross-language plagiarism detection methods
complementary?
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 4
State-of-the-Art Methods
MT-Based Models
Translation + Monolingual Analysis [Muhr et al., 2010, Barrón-Cedeño, 2012]
Comparable Corpora-Based Models
CL-KGA, CL-ESA [Gabrilovich and Markovitch, 2007, Potthast et al., 2008]
Parallel Corpora-Based Models
CL-ASA [Barrón-Cedeño et al., 2008, Pinto et al., 2009], CL-LSI, CL-KCCA
Dictionary-Based Models
CL-VSM, CL-CTS [Gupta et al., 2012, Pataki, 2012]
Syntax-Based Models
Length Model, CL-CnG [Mcnamee and Mayfield, 2004, Potthast et al., 2011], Cognateness
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 5
Augmented CL-CTS
We use DBNary [Sérasset, 2015] as linked lexical resource.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 6
Augmented CL-CTS
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 6
Augmented CL-CTS
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 6
CL-WES: Cross-Language Word Embedding-based Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 7
CL-WES: Cross-Language Word Embedding-based Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 7
CL-WES: Cross-Language Word Embedding-based Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 7
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 8
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 8
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 8
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec)
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 8
Evaluation Dataset
[Ferrero et al., 2016]1
• French, English and Spanish;
• Parallel and comparable (mix of Wikipedia, conference papers, product reviews,
Europarl and JRC);
• Different granularities: document level, sentence level and chunk level;
• Human and machine translated texts;
• Obfuscated (to make the similarity detection more complicated) and without
added noise;
• Written and translated by multiple types of authors;
• Cover various fields.
1
A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity
Detection. In Proceedings of LREC 2016.
https://github.com/FerreroJeremy/Cross-Language-Dataset
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 9
Evaluation Protocol
• We compared each English textual unit to its corresponding
French unit and to 999 other units randomly selected;
• We threshold the obtained distance matrix to find the threshold
giving the best F1 score;
• We repeat these two steps 10 times, leading to a 10 folds:
- 2 folds for tuning (CL-WESS) and fusion (Decision Tree)
- 8 folds for validation
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 10
Results
Overall (%)
Chunk-Level Sentence-Level
State-of-the-Art Methods
CL-C3G 50.76 49.34
CL-CTS 42.84 47.50
CL-ASA 47.32 35.81
CL-ESA 14.81 14.44
T+MA 37.12 37.42
New Proposed Methods
CL-CTS-WE 46.67 50.69
CL-WES 41.95 41.43
CL-WESS 53.73 56.35
Decision Tree 89.15 88.50
Table: Average F1 scores of methods applied
on EN→FR sub-corpora.
• CL-CTS-WE boosts CL-CTS (+3.83% on
chunks and +3.19% on sentences);
• CL-WESS boosts CL-WES (+11.78% on
chunks and +14.92% on sentences);
• CL-WESS is better than CL-C3G (+2.97% on
chunks and +7.01% on sentences);
• Decision Tree fusion significantly improves the
results.
CL-CTS-WE: Cross-Language Conceptual Thesaurus-based Similarity with
Word-Embedding
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 11
Results
Overall (%)
Chunk-Level Sentence-Level
State-of-the-Art Methods
CL-C3G 50.76 49.34
CL-CTS 42.84 47.50
CL-ASA 47.32 35.81
CL-ESA 14.81 14.44
T+MA 37.12 37.42
New Proposed Methods
CL-CTS-WE 46.67 50.69
CL-WES 41.95 41.43
CL-WESS 53.73 56.35
Decision Tree 89.15 88.50
Table: Average F1 scores of methods applied
on EN→FR sub-corpora.
• CL-CTS-WE boosts CL-CTS (+3.83% on
chunks and +3.19% on sentences);
• CL-WESS boosts CL-WES (+11.78% on
chunks and +14.92% on sentences);
• CL-WESS is better than CL-C3G (+2.97% on
chunks and +7.01% on sentences);
• Decision Tree fusion significantly improves the
results.
CL-WES: Cross-Language Word Embedding-based Similarity
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 11
Results
Overall (%)
Chunk-Level Sentence-Level
State-of-the-Art Methods
CL-C3G 50.76 49.34
CL-CTS 42.84 47.50
CL-ASA 47.32 35.81
CL-ESA 14.81 14.44
T+MA 37.12 37.42
New Proposed Methods
CL-CTS-WE 46.67 50.69
CL-WES 41.95 41.43
CL-WESS 53.73 56.35
Decision Tree 89.15 88.50
Table: Average F1 scores of methods applied
on EN→FR sub-corpora.
• CL-CTS-WE boosts CL-CTS (+3.83% on
chunks and +3.19% on sentences);
• CL-WESS boosts CL-WES (+11.78% on
chunks and +14.92% on sentences);
• CL-WESS is better than CL-C3G (+2.97%
on chunks and +7.01% on sentences);
• Decision Tree fusion significantly improves the
results.
CL-C3G: Cross-Language Character 3-Gram
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 11
Results
Overall (%)
Chunk-Level Sentence-Level
State-of-the-Art Methods
CL-C3G 50.76 49.34
CL-CTS 42.84 47.50
CL-ASA 47.32 35.81
CL-ESA 14.81 14.44
T+MA 37.12 37.42
New Proposed Methods
CL-CTS-WE 46.67 50.69
CL-WES 41.95 41.43
CL-WESS 53.73 56.35
Decision Tree 89.15 88.50
Table: Average F1 scores of methods applied
on EN→FR sub-corpora.
• CL-CTS-WE boosts CL-CTS (+3.83% on
chunks and +3.19% on sentences);
• CL-WESS boosts CL-WES (+11.78% on
chunks and +14.92% on sentences);
• CL-WESS is better than CL-C3G (+2.97% on
chunks and +7.01% on sentences);
• Decision Tree fusion significantly improves
the results.
Decision Tree fusion: C4.5 [Quinlan, 1993]
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 11
Conclusion
• Augmentation of several baseline approaches using word embeddings
instead of lexical resources;
• CL-WESS beats in overall the precedent best state-of-the-art methods;
• Methods are complementary and their fusion significantly helps
cross-language textual similarity detection performance;
• Winning method at SemEval-2017 Task 1 track 4a, i.e. the task on
Spanish-English Cross-lingual Semantic Textual Similarity detection.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 12
Thank you for your attention.
Do you have any questions?
 jeremy.ferrero@compilatio.net
 @FerreroJeremy
 github.com/FerreroJeremy
 fr.linkedin.com/in/FerreroJeremy
 researchgate.net/profile/Jeremy_Ferrero
Augmented CL-CTS
• CL-CTS-WE uses the top 10 closest words in the embeddings model to build the
BOW of a word;
• A BOW of a sentence is a merge of the BOW of its words;
• Jaccard distance between the two BOW.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 14
CL-WES: Cross-Language Word Embedding-based Similarity
The similarity between two sentences S and S is calculated by Cosine Distance
between the two vectors V and V , built such as:
V (S) =
|S|
i=1
(vector(ui)) (1)
• ui is the ith word of S;
• vector is the function which gives the word embedding vector of a word.
This feature is available in MultiVec2 [Berard et al., 2016].
2
https://github.com/eske/multivec
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 15
CL-WESS: Cross-Language Word Embedding-based Syntax Similarity
V (S) =
|S|
i=1
(weight(pos(ui)).vector(ui)) (2)
• ui is the ith word of S;
• pos is the function which gives the universal part-of-speech tag of a word;
• weight is the function which gives the weight of a part-of-speech;
• vector is the function which gives the word embedding vector of a word;
• . is the scalar product.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 16
Complementarity
Figure: Distribution histograms of CL-CNG (left) and CL-ASA (right) for 1000 positives and
1000 negatives (mis)matches.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 17
Fusions
Weighted Average Fusion Decision Tree Fusion C4.5 [Quinlan, 1993]
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 18
Weighted Fusion
fus(M) =
|M|
j=1(wj ∗ mj)
|M|
j=1 wj
(3)
• M is the set of the scores of the methods for one match;
• mj and wj are the score and the weight of the jth method respectively.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 19
Results at Chunk-Level
Chunk level
Methods Wikipedia (%) TALN (%) JRC (%) APR (%) Europarl (%) Overall (%)
CL-C3G 63.04 40.80 36.80 80.69 53.26 50.76
CL-CTS 58.05 33.66 30.15 67.88 45.31 42.84
CL-ASA 23.70 23.24 33.06 26.34 55.45 47.32
CL-ESA 64.86 23.73 13.91 23.01 13.98 14.81
T+MA 58.26 38.90 28.81 73.25 36.60 37.12
CL-CTS-WE 58.00 38.04 31.73 73.13 49.91 46.67
CL-WES 37.53 21.70 32.96 39.14 46.01 41.95
CL-WESS 52.68 34.49 45.00 56.83 57.06 53.73
Average fusion 81.34 65.78 61.87 91.87 79.77 75.82
Weighed fusion 84.61 69.69 67.02 94.38 83.74 80.01
Decision Tree 95.25 74.10 72.19 97.05 95.16 89.15
Table: Average F1 scores of cross-language similarity detection methods applied on chunk-level
EN→FR sub-corpora – 8 folds validation.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 20
Results at Sentence-Level
Sentence level
Methods Wikipedia (%) TALN (%) JRC (%) APR (%) Europarl (%) Overall (%)
CL-C3G 48.24 48.19 36.85 61.30 52.70 49.34
CL-CTS 46.71 38.93 28.38 51.43 53.35 47.50
CL-ASA 27.68 27.33 34.78 25.95 36.73 35.81
CL-ESA 50.89 14.41 14.45 14.18 14.09 14.44
T+MA 50.39 37.66 32.31 61.95 37.70 37.42
CL-CTS-WE 47.26 43.93 31.63 57.85 56.39 50.69
CL-WES 28.48 24.37 33.99 39.10 44.06 41.43
CL-WESS 45.65 40.45 48.64 58.08 58.84 56.35
Decision Tree 80.45 80.89 72.70 78.91 94.04 88.50
Table: Average F1 scores of cross-language similarity detection methods applied on
sentence-level EN→FR sub-corpora – 8 folds validation.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 21
References I
Barrón-Cedeño, A. (2012).
On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism.
In PhD thesis, València, Spain.
Barrón-Cedeño, A., Rosso, P., Pinto, D., and Juan, A. (2008).
On Cross-lingual Plagiarism Analysis using a Statistical Model.
In Benno Stein and Efstathios Stamatatos and Moshe Koppel, editor, Proceedings
of the ECAI’08 PAN Workshop: Uncovering Plagiarism, Authorship and Social
Software Misuse, pages 9–13, Patras, Greece.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 13
References II
Berard, A., Servan, C., Pietquin, O., and Besacier, L. (2016).
MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP.
In Proceedings of the Tenth International Conference on Language Resources and
Evaluation (LREC’16), pages 4188–4192, Portoroz, Slovenia. European Language
Resources Association (ELRA).
Ferrero, J., Agnès, F., Besacier, L., and Schwab, D. (2016).
A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language
Textual Similarity Detection.
In Proceedings of the Tenth International Conference on Language Resources and
Evaluation (LREC’16), pages 4162–4169, Portoroz, Slovenia. European Language
Resources Association (ELRA).
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 14
References III
Gabrilovich, E. and Markovitch, S. (2007).
Computing Semantic Relatedness using Wikipedia-based Explicit Semantic
Analysis.
In Proceedings of the 20th International Joint Conference on Artifical Intelligence
(IJCAI’07), pages 1606–1611, Hyderabad, India. Morgan Kaufmann Publishers Inc.
Gupta, P., Barrón-Cedeño, A., and Rosso, P. (2012).
Cross-language High Similarity Search using a Conceptual Thesaurus.
In Information Access Evaluation. Multilinguality, Multimodality, and Visual
Analytics, pages 67–75, Rome, Italy. Springer Berlin Heidelberg.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 15
References IV
Mcnamee, P. and Mayfield, J. (2004).
Character N-Gram Tokenization for European Language Text Retrieval.
In Information Retrieval Proceedings, volume 7, pages 73–97. Kluwer Academic
Publishers.
Muhr, M., Kern, R., Zechner, M., and Granitzer, M. (2010).
External and Intrinsic Plagiarism Detection Using a Cross-Lingual Retrieval and
Segmentation System - Lab Report for PAN at CLEF 2010.
In Braschler, M., Harman, D., and Pianta, E., editors, CLEF Notebook, Padua,
Italy.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 16
References V
Pataki, M. (2012).
A New Approach for Searching Translated Plagiarism.
In Proceedings of the 5th International Plagiarism Conference, pages 49–64,
Newcastle, UK.
Pinto, D., Civera, J., Juan, A., Rosso, P., and Barrón-Cedeño, A. (2009).
A Statistical Approach to Crosslingual Natural Language Tasks.
In CEUR Workshop Proceedings, volume 64 of Journal of Algorithms, pages 51–60.
Potthast, M., Barrón-Cedeño, A., Stein, B., and Rosso, P. (2011).
Cross-Language Plagiarism Detection.
In Language Resources and Evaluation, volume 45, pages 45–62.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 17
References VI
Potthast, M., Stein, B., and Anderka, M. (2008).
A Wikipedia-Based Multilingual Retrieval Model.
In 30th European Conference on IR Research (ECIR’08), volume 4956 of LNCS of
Lecture Notes in Computer Science, pages 522–530, Glasgow, Scotland. Springer.
Quinlan, J. R. (1993).
C4.5: Programs for Machine Learning.
The Morgan Kaufmann series in machine learning. Morgan Kaufmann Publishers
Inc., San Francisco, CA, USA.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 18
References VII
Sérasset, G. (2015).
DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF.
In Semantic Web Journal (special issue on Multilingual Linked Open Data),
volume 6, pages 355–361.
Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017
Using Word Embedding for Cross-Language Plagiarism Detection 19

Más contenido relacionado

La actualidad más candente

Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distancesGanesh Borle
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimEdgar Marca
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Bhaskar Mitra
 
Word2vec algorithm
Word2vec algorithmWord2vec algorithm
Word2vec algorithmAndrew Koo
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalBhaskar Mitra
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopiwan_rg
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - IntroductionChristian Perone
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结君 廖
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsBhaskar Mitra
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Hady Elsahar
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Jinpyo Lee
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Daniele Di Mitri
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesFelipe Moraes
 
(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modelingSerhii Havrylov
 
Interview presentation
Interview presentationInterview presentation
Interview presentationJoseph Gubbins
 

La actualidad más candente (18)

Word2Vec
Word2VecWord2Vec
Word2Vec
 
Word Embedding to Document distances
Word Embedding to Document distancesWord Embedding to Document distances
Word Embedding to Document distances
 
Word2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensimWord2vec: From intuition to practice using gensim
Word2vec: From intuition to practice using gensim
 
Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)Neural Text Embeddings for Information Retrieval (WSDM 2017)
Neural Text Embeddings for Information Retrieval (WSDM 2017)
 
Word2vec algorithm
Word2vec algorithmWord2vec algorithm
Word2vec algorithm
 
Using Text Embeddings for Information Retrieval
Using Text Embeddings for Information RetrievalUsing Text Embeddings for Information Retrieval
Using Text Embeddings for Information Retrieval
 
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshopورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
ورشة تضمين الكلمات في التعلم العميق Word embeddings workshop
 
Word Embeddings - Introduction
Word Embeddings - IntroductionWord Embeddings - Introduction
Word Embeddings - Introduction
 
(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结(Deep) Neural Networks在 NLP 和 Text Mining 总结
(Deep) Neural Networks在 NLP 和 Text Mining 总结
 
A Simple Introduction to Word Embeddings
A Simple Introduction to Word EmbeddingsA Simple Introduction to Word Embeddings
A Simple Introduction to Word Embeddings
 
What is word2vec?
What is word2vec?What is word2vec?
What is word2vec?
 
Word Embeddings, why the hype ?
Word Embeddings, why the hype ? Word Embeddings, why the hype ?
Word Embeddings, why the hype ?
 
Understanding GloVe
Understanding GloVeUnderstanding GloVe
Understanding GloVe
 
Word2vec slide(lab seminar)
Word2vec slide(lab seminar)Word2vec slide(lab seminar)
Word2vec slide(lab seminar)
 
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
Word2Vec: Learning of word representations in a vector space - Di Mitri & Her...
 
Representation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and PhrasesRepresentation Learning of Vectors of Words and Phrases
Representation Learning of Vectors of Words and Phrases
 
(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling(Kpi summer school 2015) word embeddings and neural language modeling
(Kpi summer school 2015) word embeddings and neural language modeling
 
Interview presentation
Interview presentationInterview presentation
Interview presentation
 

Similar a Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism Detection

Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...Association for Computational Linguistics
 
Interspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiInterspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiHiroyuki Miyoshi
 
Combining Vocabulary Alignment Techniques
Combining Vocabulary Alignment TechniquesCombining Vocabulary Alignment Techniques
Combining Vocabulary Alignment Techniquespancsurka
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsDimitris Kontokostas
 
Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...csandit
 
Word sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsWord sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsijnlc
 
The Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationThe Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationGennadi Lembersky
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESkevig
 
Sentence analysis
Sentence analysisSentence analysis
Sentence analysiskrukob9
 
Bangla spell checker & suggestion generator
Bangla spell checker & suggestion generatorBangla spell checker & suggestion generator
Bangla spell checker & suggestion generatorMdAlAmin187
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationijnlc
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsTae Hwan Jung
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...cscpconf
 
Doc format.
Doc format.Doc format.
Doc format.butest
 
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...csandit
 
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...cscpconf
 

Similar a Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism Detection (20)

Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
Jérémy Ferrero - 2017 - Deep Investigation of Cross-Language Plagiarism Det...
 
ijcai11
ijcai11ijcai11
ijcai11
 
Interspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshiInterspeech 2017 s_miyoshi
Interspeech 2017 s_miyoshi
 
Combining Vocabulary Alignment Techniques
Combining Vocabulary Alignment TechniquesCombining Vocabulary Alignment Techniques
Combining Vocabulary Alignment Techniques
 
NLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology ConstraintsNLP Data Cleansing Based on Linguistic Ontology Constraints
NLP Data Cleansing Based on Linguistic Ontology Constraints
 
Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...Improvement wsd dictionary using annotated corpus and testing it with simplif...
Improvement wsd dictionary using annotated corpus and testing it with simplif...
 
Parafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdfParafraseo-Chenggang.pdf
Parafraseo-Chenggang.pdf
 
Word sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy wordsWord sense disambiguation using wsd specific wordnet of polysemy words
Word sense disambiguation using wsd specific wordnet of polysemy words
 
The Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine TranslationThe Effect of Translationese on Statistical Machine Translation
The Effect of Translationese on Statistical Machine Translation
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIESTHE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
THE ABILITY OF WORD EMBEDDINGS TO CAPTURE WORD SIMILARITIES
 
Sentence analysis
Sentence analysisSentence analysis
Sentence analysis
 
Bangla spell checker & suggestion generator
Bangla spell checker & suggestion generatorBangla spell checker & suggestion generator
Bangla spell checker & suggestion generator
 
A survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classificationA survey on phrase structure learning methods for text classification
A survey on phrase structure learning methods for text classification
 
Neural machine translation of rare words with subword units
Neural machine translation of rare words with subword unitsNeural machine translation of rare words with subword units
Neural machine translation of rare words with subword units
 
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
ON THE UTILITY OF A SYLLABLE-LIKE SEGMENTATION FOR LEARNING A TRANSLITERATION...
 
Doc format.
Doc format.Doc format.
Doc format.
 
Wsd final paper
Wsd final paperWsd final paper
Wsd final paper
 
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
Analysis of lexico syntactic patterns for antonym pair extraction from a turk...
 
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
ANALYSIS OF LEXICO-SYNTACTIC PATTERNS FOR ANTONYM PAIR EXTRACTION FROM A TURK...
 

Más de Association for Computational Linguistics

Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Association for Computational Linguistics
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Association for Computational Linguistics
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsAssociation for Computational Linguistics
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsAssociation for Computational Linguistics
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Association for Computational Linguistics
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Association for Computational Linguistics
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Association for Computational Linguistics
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopAssociation for Computational Linguistics
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...Association for Computational Linguistics
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...Association for Computational Linguistics
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Association for Computational Linguistics
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Association for Computational Linguistics
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopAssociation for Computational Linguistics
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Association for Computational Linguistics
 

Más de Association for Computational Linguistics (20)

Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal TextMuis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
Muis - 2016 - Weak Semi-Markov CRFs for NP Chunking in Informal Text
 
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
Castro - 2018 - A High Coverage Method for Automatic False Friends Detection ...
 
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour AnalysisCastro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
Castro - 2018 - A Crowd-Annotated Spanish Corpus for Humour Analysis
 
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
Muthu Kumar Chandrasekaran - 2018 - Countering Position Bias in Instructor In...
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text SimplificationElior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
Elior Sulem - 2018 - Semantic Structural Evaluation for Text Simplification
 
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future DirectionsDaniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
Daniel Gildea - 2018 - The ACL Anthology: Current State and Future Directions
 
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
Wenqiang Lei - 2018 - Sequicity: Simplifying Task-oriented Dialogue Systems w...
 
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
Matthew Marge - 2017 - Exploring Variation of Natural Human Commands to a Rob...
 
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
Venkatesh Duppada - 2017 - SeerNet at EmoInt-2017: Tweet Emotion Intensity Es...
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
John Richardson - 2015 - KyotoEBMT System Description for the 2nd Workshop on...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
Zhongyuan Zhu - 2015 - Evaluating Neural Machine Translation in English-Japan...
 
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
Hyoung-Gyu Lee - 2015 - NAVER Machine Translation System for WAT 2015
 
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 WorkshopSatoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
Satoshi Sonoh - 2015 - Toshiba MT System Description for the WAT2015 Workshop
 
Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015Chenchen Ding - 2015 - NICT at WAT 2015
Chenchen Ding - 2015 - NICT at WAT 2015
 
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
Graham Neubig - 2015 - Neural Reranking Improves Subjective Quality of Machin...
 

Último

BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdfSoniaTolstoy
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...RKavithamani
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...EduSkills OECD
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactdawncurless
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityGeoBlogs
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxRoyAbrique
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpinRaunakKeshri1
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Educationpboyjonauth
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionSafetyChain Software
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphThiyagu K
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformChameera Dedduwage
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104misteraugie
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfJayanti Pande
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeThiyagu K
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxpboyjonauth
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdfQucHHunhnh
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationnomboosow
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactPECB
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Krashi Coaching
 

Último (20)

BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdfBASLIQ CURRENT LOOKBOOK  LOOKBOOK(1) (1).pdf
BASLIQ CURRENT LOOKBOOK LOOKBOOK(1) (1).pdf
 
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
Privatization and Disinvestment - Meaning, Objectives, Advantages and Disadva...
 
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
Presentation by Andreas Schleicher Tackling the School Absenteeism Crisis 30 ...
 
Accessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impactAccessible design: Minimum effort, maximum impact
Accessible design: Minimum effort, maximum impact
 
Paris 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activityParis 2024 Olympic Geographies - an activity
Paris 2024 Olympic Geographies - an activity
 
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptxContemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
Contemporary philippine arts from the regions_PPT_Module_12 [Autosaved] (1).pptx
 
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"Mattingly "AI & Prompt Design: The Basics of Prompt Design"
Mattingly "AI & Prompt Design: The Basics of Prompt Design"
 
Student login on Anyboli platform.helpin
Student login on Anyboli platform.helpinStudent login on Anyboli platform.helpin
Student login on Anyboli platform.helpin
 
Introduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher EducationIntroduction to ArtificiaI Intelligence in Higher Education
Introduction to ArtificiaI Intelligence in Higher Education
 
Mastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory InspectionMastering the Unannounced Regulatory Inspection
Mastering the Unannounced Regulatory Inspection
 
Z Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot GraphZ Score,T Score, Percential Rank and Box Plot Graph
Z Score,T Score, Percential Rank and Box Plot Graph
 
A Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy ReformA Critique of the Proposed National Education Policy Reform
A Critique of the Proposed National Education Policy Reform
 
Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104Nutritional Needs Presentation - HLTH 104
Nutritional Needs Presentation - HLTH 104
 
Web & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdfWeb & Social Media Analytics Previous Year Question Paper.pdf
Web & Social Media Analytics Previous Year Question Paper.pdf
 
Measures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and ModeMeasures of Central Tendency: Mean, Median and Mode
Measures of Central Tendency: Mean, Median and Mode
 
Introduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptxIntroduction to AI in Higher Education_draft.pptx
Introduction to AI in Higher Education_draft.pptx
 
1029 - Danh muc Sach Giao Khoa 10 . pdf
1029 -  Danh muc Sach Giao Khoa 10 . pdf1029 -  Danh muc Sach Giao Khoa 10 . pdf
1029 - Danh muc Sach Giao Khoa 10 . pdf
 
Interactive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communicationInteractive Powerpoint_How to Master effective communication
Interactive Powerpoint_How to Master effective communication
 
Beyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global ImpactBeyond the EU: DORA and NIS 2 Directive's Global Impact
Beyond the EU: DORA and NIS 2 Directive's Global Impact
 
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
Kisan Call Centre - To harness potential of ICT in Agriculture by answer farm...
 

Jérémy Ferrero - 2017 - Using Word Embedding for Cross-Language Plagiarism Detection

  • 1. Using Word Embedding for Cross-Language Plagiarism Detection Authors Jérémy Ferrero Frédéric Agnès Laurent Besacier Didier Schwab Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 1
  • 2. What is Cross-Language Plagiarism Detection? Cross-Language Plagiarism is a plagiarism by translation, i.e. a text has been plagiarized while being translated (manually or automatically). From a text in a language L, we must find similar passage(s) in other text(s) from among a set of candidate texts in language L’ (cross-language textual similarity). Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 2
  • 3. Why is it so important? Sources: - McCabe, D. (2010). Students’ cheating takes a high-tech turn. In Rutgers Business School. - Josephson Institute. (2011). What would honest Abe Lincoln say? Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 3
  • 4. Research Questions • Are Word Embeddings useful for cross-language plagiarism detection? • Is syntax weighting in distributed representations of sentences useful for the text entailment? • Are cross-language plagiarism detection methods complementary? Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 4
  • 5. State-of-the-Art Methods MT-Based Models Translation + Monolingual Analysis [Muhr et al., 2010, Barrón-Cedeño, 2012] Comparable Corpora-Based Models CL-KGA, CL-ESA [Gabrilovich and Markovitch, 2007, Potthast et al., 2008] Parallel Corpora-Based Models CL-ASA [Barrón-Cedeño et al., 2008, Pinto et al., 2009], CL-LSI, CL-KCCA Dictionary-Based Models CL-VSM, CL-CTS [Gupta et al., 2012, Pataki, 2012] Syntax-Based Models Length Model, CL-CnG [Mcnamee and Mayfield, 2004, Potthast et al., 2011], Cognateness Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 5
  • 6. Augmented CL-CTS We use DBNary [Sérasset, 2015] as linked lexical resource. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 6
  • 7. Augmented CL-CTS Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 6
  • 8. Augmented CL-CTS Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 6
  • 9. CL-WES: Cross-Language Word Embedding-based Similarity This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec) Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 7
  • 10. CL-WES: Cross-Language Word Embedding-based Similarity This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec) Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 7
  • 11. CL-WES: Cross-Language Word Embedding-based Similarity This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec) Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 7
  • 12. CL-WESS: Cross-Language Word Embedding-based Syntax Similarity This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec) Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 8
  • 13. CL-WESS: Cross-Language Word Embedding-based Syntax Similarity This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec) Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 8
  • 14. CL-WESS: Cross-Language Word Embedding-based Syntax Similarity This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec) Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 8
  • 15. CL-WESS: Cross-Language Word Embedding-based Syntax Similarity This feature is available in MultiVec [Berard et al., 2016] (https://github.com/eske/multivec) Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 8
  • 16. Evaluation Dataset [Ferrero et al., 2016]1 • French, English and Spanish; • Parallel and comparable (mix of Wikipedia, conference papers, product reviews, Europarl and JRC); • Different granularities: document level, sentence level and chunk level; • Human and machine translated texts; • Obfuscated (to make the similarity detection more complicated) and without added noise; • Written and translated by multiple types of authors; • Cover various fields. 1 A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection. In Proceedings of LREC 2016. https://github.com/FerreroJeremy/Cross-Language-Dataset Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 9
  • 17. Evaluation Protocol • We compared each English textual unit to its corresponding French unit and to 999 other units randomly selected; • We threshold the obtained distance matrix to find the threshold giving the best F1 score; • We repeat these two steps 10 times, leading to a 10 folds: - 2 folds for tuning (CL-WESS) and fusion (Decision Tree) - 8 folds for validation Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 10
  • 18. Results Overall (%) Chunk-Level Sentence-Level State-of-the-Art Methods CL-C3G 50.76 49.34 CL-CTS 42.84 47.50 CL-ASA 47.32 35.81 CL-ESA 14.81 14.44 T+MA 37.12 37.42 New Proposed Methods CL-CTS-WE 46.67 50.69 CL-WES 41.95 41.43 CL-WESS 53.73 56.35 Decision Tree 89.15 88.50 Table: Average F1 scores of methods applied on EN→FR sub-corpora. • CL-CTS-WE boosts CL-CTS (+3.83% on chunks and +3.19% on sentences); • CL-WESS boosts CL-WES (+11.78% on chunks and +14.92% on sentences); • CL-WESS is better than CL-C3G (+2.97% on chunks and +7.01% on sentences); • Decision Tree fusion significantly improves the results. CL-CTS-WE: Cross-Language Conceptual Thesaurus-based Similarity with Word-Embedding Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 11
  • 19. Results Overall (%) Chunk-Level Sentence-Level State-of-the-Art Methods CL-C3G 50.76 49.34 CL-CTS 42.84 47.50 CL-ASA 47.32 35.81 CL-ESA 14.81 14.44 T+MA 37.12 37.42 New Proposed Methods CL-CTS-WE 46.67 50.69 CL-WES 41.95 41.43 CL-WESS 53.73 56.35 Decision Tree 89.15 88.50 Table: Average F1 scores of methods applied on EN→FR sub-corpora. • CL-CTS-WE boosts CL-CTS (+3.83% on chunks and +3.19% on sentences); • CL-WESS boosts CL-WES (+11.78% on chunks and +14.92% on sentences); • CL-WESS is better than CL-C3G (+2.97% on chunks and +7.01% on sentences); • Decision Tree fusion significantly improves the results. CL-WES: Cross-Language Word Embedding-based Similarity CL-WESS: Cross-Language Word Embedding-based Syntax Similarity Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 11
  • 20. Results Overall (%) Chunk-Level Sentence-Level State-of-the-Art Methods CL-C3G 50.76 49.34 CL-CTS 42.84 47.50 CL-ASA 47.32 35.81 CL-ESA 14.81 14.44 T+MA 37.12 37.42 New Proposed Methods CL-CTS-WE 46.67 50.69 CL-WES 41.95 41.43 CL-WESS 53.73 56.35 Decision Tree 89.15 88.50 Table: Average F1 scores of methods applied on EN→FR sub-corpora. • CL-CTS-WE boosts CL-CTS (+3.83% on chunks and +3.19% on sentences); • CL-WESS boosts CL-WES (+11.78% on chunks and +14.92% on sentences); • CL-WESS is better than CL-C3G (+2.97% on chunks and +7.01% on sentences); • Decision Tree fusion significantly improves the results. CL-C3G: Cross-Language Character 3-Gram CL-WESS: Cross-Language Word Embedding-based Syntax Similarity Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 11
  • 21. Results Overall (%) Chunk-Level Sentence-Level State-of-the-Art Methods CL-C3G 50.76 49.34 CL-CTS 42.84 47.50 CL-ASA 47.32 35.81 CL-ESA 14.81 14.44 T+MA 37.12 37.42 New Proposed Methods CL-CTS-WE 46.67 50.69 CL-WES 41.95 41.43 CL-WESS 53.73 56.35 Decision Tree 89.15 88.50 Table: Average F1 scores of methods applied on EN→FR sub-corpora. • CL-CTS-WE boosts CL-CTS (+3.83% on chunks and +3.19% on sentences); • CL-WESS boosts CL-WES (+11.78% on chunks and +14.92% on sentences); • CL-WESS is better than CL-C3G (+2.97% on chunks and +7.01% on sentences); • Decision Tree fusion significantly improves the results. Decision Tree fusion: C4.5 [Quinlan, 1993] Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 11
  • 22. Conclusion • Augmentation of several baseline approaches using word embeddings instead of lexical resources; • CL-WESS beats in overall the precedent best state-of-the-art methods; • Methods are complementary and their fusion significantly helps cross-language textual similarity detection performance; • Winning method at SemEval-2017 Task 1 track 4a, i.e. the task on Spanish-English Cross-lingual Semantic Textual Similarity detection. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 12
  • 23. Thank you for your attention. Do you have any questions?  jeremy.ferrero@compilatio.net  @FerreroJeremy  github.com/FerreroJeremy  fr.linkedin.com/in/FerreroJeremy  researchgate.net/profile/Jeremy_Ferrero
  • 24. Augmented CL-CTS • CL-CTS-WE uses the top 10 closest words in the embeddings model to build the BOW of a word; • A BOW of a sentence is a merge of the BOW of its words; • Jaccard distance between the two BOW. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 14
  • 25. CL-WES: Cross-Language Word Embedding-based Similarity The similarity between two sentences S and S is calculated by Cosine Distance between the two vectors V and V , built such as: V (S) = |S| i=1 (vector(ui)) (1) • ui is the ith word of S; • vector is the function which gives the word embedding vector of a word. This feature is available in MultiVec2 [Berard et al., 2016]. 2 https://github.com/eske/multivec Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 15
  • 26. CL-WESS: Cross-Language Word Embedding-based Syntax Similarity V (S) = |S| i=1 (weight(pos(ui)).vector(ui)) (2) • ui is the ith word of S; • pos is the function which gives the universal part-of-speech tag of a word; • weight is the function which gives the weight of a part-of-speech; • vector is the function which gives the word embedding vector of a word; • . is the scalar product. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 16
  • 27. Complementarity Figure: Distribution histograms of CL-CNG (left) and CL-ASA (right) for 1000 positives and 1000 negatives (mis)matches. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 17
  • 28. Fusions Weighted Average Fusion Decision Tree Fusion C4.5 [Quinlan, 1993] Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 18
  • 29. Weighted Fusion fus(M) = |M| j=1(wj ∗ mj) |M| j=1 wj (3) • M is the set of the scores of the methods for one match; • mj and wj are the score and the weight of the jth method respectively. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 19
  • 30. Results at Chunk-Level Chunk level Methods Wikipedia (%) TALN (%) JRC (%) APR (%) Europarl (%) Overall (%) CL-C3G 63.04 40.80 36.80 80.69 53.26 50.76 CL-CTS 58.05 33.66 30.15 67.88 45.31 42.84 CL-ASA 23.70 23.24 33.06 26.34 55.45 47.32 CL-ESA 64.86 23.73 13.91 23.01 13.98 14.81 T+MA 58.26 38.90 28.81 73.25 36.60 37.12 CL-CTS-WE 58.00 38.04 31.73 73.13 49.91 46.67 CL-WES 37.53 21.70 32.96 39.14 46.01 41.95 CL-WESS 52.68 34.49 45.00 56.83 57.06 53.73 Average fusion 81.34 65.78 61.87 91.87 79.77 75.82 Weighed fusion 84.61 69.69 67.02 94.38 83.74 80.01 Decision Tree 95.25 74.10 72.19 97.05 95.16 89.15 Table: Average F1 scores of cross-language similarity detection methods applied on chunk-level EN→FR sub-corpora – 8 folds validation. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 20
  • 31. Results at Sentence-Level Sentence level Methods Wikipedia (%) TALN (%) JRC (%) APR (%) Europarl (%) Overall (%) CL-C3G 48.24 48.19 36.85 61.30 52.70 49.34 CL-CTS 46.71 38.93 28.38 51.43 53.35 47.50 CL-ASA 27.68 27.33 34.78 25.95 36.73 35.81 CL-ESA 50.89 14.41 14.45 14.18 14.09 14.44 T+MA 50.39 37.66 32.31 61.95 37.70 37.42 CL-CTS-WE 47.26 43.93 31.63 57.85 56.39 50.69 CL-WES 28.48 24.37 33.99 39.10 44.06 41.43 CL-WESS 45.65 40.45 48.64 58.08 58.84 56.35 Decision Tree 80.45 80.89 72.70 78.91 94.04 88.50 Table: Average F1 scores of cross-language similarity detection methods applied on sentence-level EN→FR sub-corpora – 8 folds validation. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 21
  • 32. References I Barrón-Cedeño, A. (2012). On the Mono- and Cross-Language Detection of Text Re-Use and Plagiarism. In PhD thesis, València, Spain. Barrón-Cedeño, A., Rosso, P., Pinto, D., and Juan, A. (2008). On Cross-lingual Plagiarism Analysis using a Statistical Model. In Benno Stein and Efstathios Stamatatos and Moshe Koppel, editor, Proceedings of the ECAI’08 PAN Workshop: Uncovering Plagiarism, Authorship and Social Software Misuse, pages 9–13, Patras, Greece. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 13
  • 33. References II Berard, A., Servan, C., Pietquin, O., and Besacier, L. (2016). MultiVec: a Multilingual and Multilevel Representation Learning Toolkit for NLP. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4188–4192, Portoroz, Slovenia. European Language Resources Association (ELRA). Ferrero, J., Agnès, F., Besacier, L., and Schwab, D. (2016). A Multilingual, Multi-style and Multi-granularity Dataset for Cross-language Textual Similarity Detection. In Proceedings of the Tenth International Conference on Language Resources and Evaluation (LREC’16), pages 4162–4169, Portoroz, Slovenia. European Language Resources Association (ELRA). Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 14
  • 34. References III Gabrilovich, E. and Markovitch, S. (2007). Computing Semantic Relatedness using Wikipedia-based Explicit Semantic Analysis. In Proceedings of the 20th International Joint Conference on Artifical Intelligence (IJCAI’07), pages 1606–1611, Hyderabad, India. Morgan Kaufmann Publishers Inc. Gupta, P., Barrón-Cedeño, A., and Rosso, P. (2012). Cross-language High Similarity Search using a Conceptual Thesaurus. In Information Access Evaluation. Multilinguality, Multimodality, and Visual Analytics, pages 67–75, Rome, Italy. Springer Berlin Heidelberg. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 15
  • 35. References IV Mcnamee, P. and Mayfield, J. (2004). Character N-Gram Tokenization for European Language Text Retrieval. In Information Retrieval Proceedings, volume 7, pages 73–97. Kluwer Academic Publishers. Muhr, M., Kern, R., Zechner, M., and Granitzer, M. (2010). External and Intrinsic Plagiarism Detection Using a Cross-Lingual Retrieval and Segmentation System - Lab Report for PAN at CLEF 2010. In Braschler, M., Harman, D., and Pianta, E., editors, CLEF Notebook, Padua, Italy. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 16
  • 36. References V Pataki, M. (2012). A New Approach for Searching Translated Plagiarism. In Proceedings of the 5th International Plagiarism Conference, pages 49–64, Newcastle, UK. Pinto, D., Civera, J., Juan, A., Rosso, P., and Barrón-Cedeño, A. (2009). A Statistical Approach to Crosslingual Natural Language Tasks. In CEUR Workshop Proceedings, volume 64 of Journal of Algorithms, pages 51–60. Potthast, M., Barrón-Cedeño, A., Stein, B., and Rosso, P. (2011). Cross-Language Plagiarism Detection. In Language Resources and Evaluation, volume 45, pages 45–62. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 17
  • 37. References VI Potthast, M., Stein, B., and Anderka, M. (2008). A Wikipedia-Based Multilingual Retrieval Model. In 30th European Conference on IR Research (ECIR’08), volume 4956 of LNCS of Lecture Notes in Computer Science, pages 522–530, Glasgow, Scotland. Springer. Quinlan, J. R. (1993). C4.5: Programs for Machine Learning. The Morgan Kaufmann series in machine learning. Morgan Kaufmann Publishers Inc., San Francisco, CA, USA. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 18
  • 38. References VII Sérasset, G. (2015). DBnary: Wiktionary as a Lemon-Based Multilingual Lexical Resource in RDF. In Semantic Web Journal (special issue on Multilingual Linked Open Data), volume 6, pages 355–361. Jérémy Ferrero, Frédéric Agnès, Laurent Besacier and Didier Schwab EACL - April 2017 Using Word Embedding for Cross-Language Plagiarism Detection 19