Context-based movie search for user questions
that ask the title of the movie
장진규, 박정인
School of
Business
Administration
2018. 4. 18
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Text Mining Term Project 4
Introduction
People sometimes crave to find a movie they once glimpsed. In that situation, they often ask for the movie title on Q&A sites and get an answer. The answerers can seem like 'gods of movies', so we want to imitate their prophecy.
Question Examples
Text Mining Term Project 5
Data Gathering
We chose one expert in this field and gathered his answers.
Q&A Site http://kin.naver.com
Expert ID xedz****
Question & answer pairs 39,758
Period December 2012 ~ March 2018
Unique movies 5,900
Gathered Data Information
Text Mining Term Project 6
2 Types of Text Representation
There are two kinds of text representation: sparse and dense.
Sparse: One-Hot Encoding Dense: Word Embedding
Comparison of Text Representations
Dimension: Sparse has as many dimensions as there are unique words; Dense is set freely, usually 20~200 dimensions.
Information: Sparse is mostly zero values and carries little information; Dense has a value in every element and carries abundant information.
source: https://dreamgonfly.github.io/machine/learning,/natural/language/processing/2017/08/16/word2vec_explained.html
Text Mining Term Project 7
Main Idea of Word2Vec
Word2Vec is one of the word embedding methods.
Its main idea is “You shall know a word by the company it keeps.”
Every word has friends around it.
Text Mining Term Project 8
Algorithms of Word2Vec
Word2vec has two model architectures: continuous bag-of-words (CBOW) and skip-gram.
Diagrams of CBOW and Skip-gram
source: https://aws.amazon.com/ko/blogs/korea/amazon-sagemaker-blazingtext-parallelizing-word2vec-on-multiple-cpus-or-gpus/
Text Mining Term Project 9
Algorithms of Doc2Vec
Doc2vec has two model architectures: the distributed memory model (PV-DM) and the distributed bag-of-words model (PV-DBOW).
Diagrams of PV-DM and PV-DBOW
source: Distributed Representations of Sentences and Documents
PV-DM: The concatenation or average of the paragraph vector with a context of three words is used to predict the fourth word. The paragraph vector represents the information missing from the current context.
PV-DBOW: Ignores the context words in the input and forces the model to predict words randomly sampled from the paragraph in the output. Similar to the skip-gram model.
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Text Mining Term Project 11
Preprocessing
We preprocessed the data in two stages for better performance: cleaning of the whole raw text, followed by tokenization.
Raw text cleaning
▪ Remove unnecessary strings
• URLs, special characters (!, ?, *, @, <, >), emoticons (ㅋㅋ, ㅠㅠ), multiple spaces
▪ Normalize words the dictionary cannot handle
• (남주 → 남자주인공), (페북 → 페이스북), (영환 → 영화인데), (여자애 → 여자)
▪ Delete unnecessary boilerplate phrases in questions
• 좀 옛날 영화인데 ~, 페북에서 봤는데, ~ 장면이 있었는데 기억이안나네요
▪ Delete questions shorter than 30 characters
Tokenizing
▪ Tokenize with KoNLPy
• using the Twitter package
▪ POS-tagging
• keep only nouns, verbs, and adjectives
▪ Remove tokens that have only one character
▪ Remove stop-words
▪ Delete questions with fewer than 10 tokens
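A minimal sketch of this two-stage pipeline in Python, assuming KoNLPy's Okt tagger (the renamed successor of the Twitter package mentioned above); the stop-word list and normalization dictionary are hypothetical placeholders, not the project's actual lists.

import re
from konlpy.tag import Okt  # Okt is the renamed Twitter tagger in recent KoNLPy versions

okt = Okt()
STOPWORDS = {"영화", "제목"}                              # hypothetical stop-word list
NORMALIZE = {"남주": "남자주인공", "페북": "페이스북"}     # hypothetical normalization dictionary

def clean_text(text):
    """Stage 1: raw-text cleaning on the whole question."""
    text = re.sub(r"https?://\S+", " ", text)            # remove URLs
    text = re.sub(r"[!?*@<>]", " ", text)                 # remove special characters
    text = re.sub(r"[ㅋㅠ]+", " ", text)                  # remove emoticons such as ㅋㅋ, ㅠㅠ
    text = re.sub(r"\s+", " ", text).strip()              # collapse multiple spaces
    for src, dst in NORMALIZE.items():
        text = text.replace(src, dst)
    return text

def tokenize(text):
    """Stage 2: POS-tag and keep only nouns, verbs, and adjectives."""
    tokens = [w for w, tag in okt.pos(text, norm=True, stem=True)
              if tag in ("Noun", "Verb", "Adjective")]
    return [t for t in tokens if len(t) > 1 and t not in STOPWORDS]

def preprocess(question):
    text = clean_text(question)
    if len(text) < 30:                     # drop questions shorter than 30 characters
        return None
    tokens = tokenize(text)
    return tokens if len(tokens) >= 10 else None   # drop questions with fewer than 10 tokens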
Text Mining Term Project 12
Select Movies and Split Dataset
There are 5,900 movies in the dataset, but many of them have only a few questions, so we remove movies whose question count falls below a cutoff value. We then split the remaining dataset into train and test sets with an 8:2 ratio to evaluate the model.
The number of questions per movie
Movie                        Count
스파이더위크가의 비밀         259
캐빈 인 더 우즈               222
비밀의 숲 테라비시아          179
…  (cutoff)  …
무서운 영화 2                 1
전우                          1
전우치                        1

Split into Train and Test
Movie                        Train  Test
스파이더위크가의 비밀         207    52
캐빈 인 더 우즈               177    45
비밀의 숲 테라비시아          143    36
레모니 스니켓의 위험한 대결   142    36
플립                          141    36
…                             …      …
*Basic cutoff = 3
*Using stratified method
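A minimal sketch of the cutoff filter and the stratified 8:2 split, assuming a pandas DataFrame df with hypothetical columns "question" (token list) and "movie" (answer label) and using scikit-learn.

from sklearn.model_selection import train_test_split

CUTOFF = 3  # basic cutoff: keep movies with at least this many questions

# df: DataFrame with columns "question" (token list) and "movie" (answer label)
counts = df["movie"].value_counts()
kept = counts[counts >= CUTOFF].index
df = df[df["movie"].isin(kept)]

# 8:2 split, stratified so every movie keeps the same train/test ratio
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["movie"], random_state=42)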
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Text Mining Term Project 14
Modeling – Word2vec
To train the word2vec model, we insert the answer (label) among the tokenized words of each question, placing the label after every 5 words. Using this corpus, we trained the word2vec model.
Q: question, A: answer (label), W: word
Number of unique labels in the train and test data: 2,021
Train set: 22,620 questions, Test set: 5,655 questions
Skip-gram architecture
Dimensionality of the feature vectors: 300
Window size: 10
Hierarchical softmax
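A minimal sketch of this training step with gensim, assuming train_df from the split above; insert_label is a hypothetical helper reconstructing "put the label after every 5 words".

from gensim.models import Word2Vec

def insert_label(tokens, label, every=5):
    """Insert the answer label after every 5 question tokens (hypothetical helper)."""
    out = []
    for i, tok in enumerate(tokens, start=1):
        out.append(tok)
        if i % every == 0:
            out.append(label)
    return out

sentences = [insert_label(row.question, row.movie)
             for row in train_df.itertuples()]

w2v = Word2Vec(
    sentences,
    vector_size=300,   # dimensionality of the feature vectors
    window=10,         # window size
    sg=1,              # skip-gram
    hs=1, negative=0,  # hierarchical softmax
    min_count=1)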
Text Mining Term Project 15
Modeling – Word2vec
Text Mining Term Project 16
Modeling – Word2vec
Each word in the test set is embedded with the trained model to obtain a word vector.
All word vectors of a question are then combined into one vector per question (the document vector).
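A minimal sketch of building one document vector per test question by averaging its word vectors; the slides only say "combine", so averaging is an assumption about the aggregation used.

import numpy as np

def doc_vector(tokens, model):
    """Average the word vectors of all tokens the model knows."""
    vecs = [model.wv[t] for t in tokens if t in model.wv]
    if not vecs:
        return np.zeros(model.vector_size)
    return np.mean(vecs, axis=0)

test_doc_vecs = np.vstack([doc_vector(row.question, w2v)
                           for row in test_df.itertuples()])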
Text Mining Term Project 17
Modeling – Word2vec
Also embedding the unique answers (label) into the model to obtain label vector. After
that, Calculate pairwise cosine similarity between the label vectors (𝑽 𝑨𝒏 ) and the
document vectors (𝐕′ 𝒌)
K: the number test set data
n: the number of unique labels
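A minimal sketch of the pairwise similarity computation with scikit-learn, assuming unique_labels is the list of 2,021 unique answers and test_doc_vecs comes from the previous step.

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity

label_vecs = np.vstack([w2v.wv[label] for label in unique_labels])  # V_An, shape (n, 300)

# sim[k, n] = cosine similarity between test question k and label n
sim = cosine_similarity(test_doc_vecs, label_vecs)                  # shape (K, n)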
Text Mining Term Project 18
Modeling – Word2vec
Finally, we normalize the cosine-similarity scores for each document vector, binarize the answers (labels), and evaluate the performance of the model.
Test set example:
V'_k = [0.05, 0.001, 0.003, …, 0.002]  (normalized similarity of question k to every label)
A_k = [1, 0, 0, …, 0]  (binary indicator of the correct answer)
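A minimal sketch of this evaluation setup; row-wise normalization and scikit-learn's label_binarize are assumptions about how "normalize" and "binarize" were implemented.

from sklearn.preprocessing import label_binarize

# Normalize each row so the similarity scores of one question sum to 1
scores = sim / sim.sum(axis=1, keepdims=True)

# Binary indicator matrix of the true answers, one row per test question
y_true = label_binarize(test_df["movie"], classes=unique_labels)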
Text Mining Term Project 19
Modeling – Doc2vec
In the doc2vec model, we do not need to insert the correct answer into the question as in the word2vec model, because the answer (label) is learned as the document tag.
PV-DM (distributed memory) architecture
Dimensionality of the feature vectors: 300
Window size: 3
Hierarchical softmax
Sum of the context word vectors
The paragraph vectors (label vectors) are trained on the task of predicting the next word in the sentence. Every paragraph is mapped to a unique vector, and the paragraph vector and word vectors are combined to predict the next word in a context.
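A minimal sketch of this setup with gensim, where each question becomes a TaggedDocument whose tag is its answer label; train_df is the hypothetical DataFrame from the earlier sketches.

from gensim.models.doc2vec import Doc2Vec, TaggedDocument

docs = [TaggedDocument(words=row.question, tags=[row.movie])
        for row in train_df.itertuples()]

d2v = Doc2Vec(
    docs,
    dm=1,              # PV-DM (distributed memory)
    vector_size=300,   # dimensionality of the feature vectors
    window=3,          # window size
    hs=1, negative=0,  # hierarchical softmax
    dm_mean=0,         # use the sum of the context word vectors
    min_count=1)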
Text Mining Term Project 20
Modeling – Doc2vec
*Computes cosine similarity between a simple mean of the projection weight vectors of the given docs.
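A minimal sketch of scoring a new test question with the trained doc2vec model: infer a vector for the question and rank the label (tag) vectors by cosine similarity. The use of infer_vector and the topn value are assumptions about the retrieval step.

def rank_movies(tokens, model, topn=5):
    """Return the top-n candidate movie titles for a tokenized question."""
    qvec = model.infer_vector(tokens)                 # vector for the unseen question
    return model.dv.most_similar([qvec], topn=topn)   # (label, cosine similarity) pairs

print(rank_movies(test_df.iloc[0]["question"], d2v))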
Contents
I. Introduction
II. Preprocessing
III. Analysis
IV. Results
Text Mining Term Project 22
Model evaluation – ROC curve
The ROC curve results for each label are evaluated with two averaging methods: micro-averaging and macro-averaging.
micro-averaging: considers each element of the label indicator matrix as a binary prediction
macro-averaging: gives equal weight to the classification of each label
word2vec: AUC 0.78 (micro), 0.82 (macro)
doc2vec: AUC 0.97 (micro), 0.97 (macro)
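A minimal sketch of the micro- and macro-averaged AUC computation with scikit-learn, reusing the binarized answers and normalized scores from the earlier sketches.

from sklearn.metrics import roc_auc_score

micro_auc = roc_auc_score(y_true, scores, average="micro")
macro_auc = roc_auc_score(y_true, scores, average="macro")
print(f"micro AUC: {micro_auc:.2f}, macro AUC: {macro_auc:.2f}")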
Text Mining Term Project 23
Model evaluation – Top-n accuracy
Top-n accuracy is evaluated for each model. For example, top-5 accuracy means that one of the model's 5 highest-scoring answers must match the expected answer (n: 1~10).
word2vec: accuracy* 0.08 to 0.20
doc2vec: accuracy* 0.49 to 0.73
*top-1 and top-10 accuracy
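A minimal sketch of the top-n accuracy computation from the score matrix; the argsort-based ranking is an assumption about how the counting was done.

import numpy as np

def top_n_accuracy(scores, true_idx, n):
    """Fraction of questions whose true label is among the n highest-scoring labels."""
    top_n = np.argsort(-scores, axis=1)[:, :n]
    return np.mean([true_idx[k] in top_n[k] for k in range(len(true_idx))])

true_idx = np.argmax(y_true, axis=1)        # index of the correct label per question
for n in (1, 5, 10):
    print(n, top_n_accuracy(scores, true_idx, n))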
Text Mining Term Project 24
Discussion
• Conclusion
✓ Overall, the doc2vec model performs better than the word2vec model
✓ A service could be built that presents a list of n (at least 5) candidate answers for a new question
✓ The approach could be applied to a speech-recognition-based movie recommendation service
• Further study
✓ Questions about untrained movies remain a problem
- this could be addressed by additionally learning the synopses of the movies
✓ A method for dealing with imbalanced movie data is needed
Text Mining Term Project 25
Thank you
Text Mining Term Project 26
APPENDIX
We drew graphs to find which movies and genres are asked about most often. We found that people mostly wanted to find mysterious and thrilling movies.
Asked Movie Ranking (number of questions)
스파이더위크가의 비밀        259
캐빈 인 더 우즈              222
비밀의 숲 테라비시아         179
레모니 스니켓의 위험한…      178
플립                         177
트루먼 쇼                    166
다이버전트                  151
스플라이스                  147
아바타                      143
업사이드 다운               131

Asked Movie Genre (number of questions)
공포, 스릴러     7447
SF, 판타지       5001
로맨스, 멜로     4585
액션, 무협       2560
코미디           784
드라마           706
애니메이션       501
Text Mining Term Project 27
APPENDIX
Delete unnecessary phrases in questions
At the beginning and at the end of a question, if a word is in the check-word list (and lies within the first or last 20% of the question's length), all phrases before or after that word are removed.