Treasure Data Summer Internship Final Report

Summer Internship
Final Report
Naoki Ishikawa (@NeokiStones)
2015/09/30 13:30-

Who am I
• Naoki Ishikawa
• Waseda University, Information Science M1
• Research: Evolutionary Computation / Reinforcement Learning
• Laboratory: Sugawara Lab
• Laboratory theme: Artificial Intelligence

Table of contents
• Implemented Algorithms
• Factorization Machine
• Latent Dirichlet Allocation

Factorization Machine
• Algorithm for Recommendation
• Classification (Clustering)
• Regression
• Supervised Learning
• Needs Input/Output Data
• Suitable for Sparse Data

Application
• Prediction of Movie Rating
• Task: Predict a movie rating (real number)
• Regression
- Input: Self-designed Matrix
- Output: Rating Vector

Prediction of Movie Rating
[Figure: Input and Output]

INPUT Details
• Identifier
- User Identifier: [0, 0, …, 0, 1, 0, …, 0]
- Movie Identifier: [0, 0, …, 0, 0, 1, 0, …, 0]
• Designed Feature
- Rating of Other Movies
- Time
- Last Movie Rated
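
The input layout above can be sketched as one row per rating: a one-hot user identifier, a one-hot movie identifier, and the designed features appended at the end. This is only an illustration; the helper names `one_hot` and `build_fm_row` and the two example feature values are assumptions, not part of the internship code.

```python
def one_hot(index, size):
    """Return a list of `size` zeros with a single 1.0 at position `index`."""
    vec = [0.0] * size
    vec[index] = 1.0
    return vec


def build_fm_row(user_idx, movie_idx, n_users, n_movies, extra_features):
    """Concatenate the one-hot identifiers with the hand-designed features."""
    return one_hot(user_idx, n_users) + one_hot(movie_idx, n_movies) + list(extra_features)


# Hypothetical example: 3 users, 4 movies, and two designed features
# (a normalized timestamp and the rating given to the last movie rated).
row = build_fm_row(user_idx=1, movie_idx=2, n_users=3, n_movies=4,
                   extra_features=[0.5, 4.0])
print(row)
```

Most entries of such a row are zero, which is why the slides describe the method as suitable for sparse data.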

Recommendation Algorithm
• Collaborative Filtering
• Association Analysis
• Bayesian Network

Prediction of Movie Rating
• Hivemall
• Matrix Factorization
• Recommendation

Difference from Matrix Factorization
• Data Structure
• Matrix Factorization
• User-Item Matrix
[Figure: Matrix Factorization, with the Input and the Learning Parameter labeled. Source: http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png]

[Figure: Factorization Machine, with the Input and the Learning Parameters (V, W) labeled]

Advantage of Factorization Machine
• Considers
• context data
• interaction between variables

Prediction by Factorization Machine (d=2)

• Global bias (mean)
• Interaction: Factorization (Wkj)
• Regression coefficient of the k-th variable
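
The three labeled terms correspond to the standard second-order Factorization Machine prediction, y(x) = w0 + sum_i w_i x_i + sum_{i<j} <v_i, v_j> x_i x_j, where w0 is the global bias, w_i is the regression coefficient of the i-th variable, and <v_i, v_j> is the factorized interaction weight. The sketch below evaluates that formula directly; the function name `fm_predict` and the toy numbers are assumptions for illustration only.

```python
def fm_predict(x, w0, w, V):
    """Second-order (d=2) Factorization Machine prediction.

    x  : list of n feature values
    w0 : global bias
    w  : list of n linear (regression) coefficients
    V  : n x K list of lists; V[i] is the latent factor vector of feature i
    """
    n = len(x)
    linear = sum(w[i] * x[i] for i in range(n))
    interaction = 0.0
    for i in range(n):
        for j in range(i + 1, n):
            # The pairwise weight is the dot product of the two factor vectors.
            dot = sum(vi * vj for vi, vj in zip(V[i], V[j]))
            interaction += dot * x[i] * x[j]
    return w0 + linear + interaction


# Toy example with n = 3 features and K = 2 latent factors.
x = [1.0, 0.0, 1.0]
w0 = 3.0
w = [0.5, -0.25, 0.25]
V = [[1.0, 2.0], [0.0, 1.0], [0.5, 0.5]]
print(fm_predict(x, w0, w, V))  # 3.0 + 0.75 + 1.5 = 5.25
```

The double loop is written for readability; it costs O(K n^2), while practical implementations usually rewrite the pairwise term so that it costs O(K n).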

Learning Method
• Stochastic Gradient Descent (SGD)
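
A minimal sketch of one SGD step for a d=2 Factorization Machine with squared loss follows. It assumes the standard gradients (1 for the bias, x_i for w_i, and x_i * sum_j v_jf x_j - v_if x_i^2 for v_if); the names `predict`, `sgd_step`, `learning_rate`, and `reg` are hypothetical and not taken from the Hivemall code.

```python
def predict(x, w0, w, V):
    """FM prediction using the O(K*n) reformulation of the pairwise term."""
    K = len(V[0])
    linear = sum(wi * xi for wi, xi in zip(w, x))
    pair = 0.0
    for f in range(K):
        s = sum(V[i][f] * x[i] for i in range(len(x)))
        s2 = sum((V[i][f] * x[i]) ** 2 for i in range(len(x)))
        pair += 0.5 * (s * s - s2)
    return w0 + linear + pair


def sgd_step(x, y, w0, w, V, learning_rate=0.01, reg=0.0):
    """One SGD update on squared loss. Mutates w and V; returns the new bias."""
    K = len(V[0])
    err = predict(x, w0, w, V) - y
    # Cache sum_j V[j][f] * x[j], computed from the parameters before the update.
    sums = [sum(V[j][f] * x[j] for j in range(len(x))) for f in range(K)]
    w0 -= learning_rate * err
    for i, xi in enumerate(x):
        if xi == 0.0:
            continue  # sparse input: zero-valued features have zero gradient
        w[i] -= learning_rate * (err * xi + reg * w[i])
        for f in range(K):
            grad = xi * sums[f] - V[i][f] * xi * xi
            V[i][f] -= learning_rate * (err * grad + reg * V[i][f])
    return w0


# Repeatedly fit a single toy example and compare the error before and after.
x, y = [1.0, 0.0, 1.0], 4.0
w0, w = 0.0, [0.0, 0.0, 0.0]
V = [[0.1, 0.1], [0.1, 0.1], [0.1, 0.1]]
before = abs(predict(x, w0, w, V) - y)
for _ in range(200):
    w0 = sgd_step(x, y, w0, w, V, learning_rate=0.05)
after = abs(predict(x, w0, w, V) - y)
print(before, after)
```

Skipping zero-valued features keeps the cost of a step proportional to the number of non-zero entries, which matters for the one-hot inputs used here.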

• d-way
• FM / MF
• Assume K latent attributes
• Matrix Factorization: d = 2
• Factorization Machine: d ≥ 2

Hyperparameters
• K: the number of hidden factors
• η: the regularization parameter

Implemented Model
• d = 2
• MapModel
• ArrayModel

Implemented Model
• MapModel
• For unknown data
• Flexible
• Suitable for Online Learning

Implemented Model
• ArrayModel
• For known data
• Less overhead
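
The trade-off between the two models can be illustrated with a rough sketch. The class names follow the slides, but the internals below are assumptions and not the actual Hivemall implementation: a map-backed model creates a weight the first time a feature is seen, while an array-backed model needs the number of features up front and avoids hashing overhead.

```python
class MapModel:
    """Weights keyed by feature name; unseen features are created lazily."""

    def __init__(self):
        self.weights = {}

    def get(self, feature):
        return self.weights.get(feature, 0.0)

    def update(self, feature, delta):
        self.weights[feature] = self.get(feature) + delta


class ArrayModel:
    """Weights in a fixed-size list; the feature count must be known in advance."""

    def __init__(self, n_features):
        self.weights = [0.0] * n_features

    def get(self, index):
        return self.weights[index]

    def update(self, index, delta):
        self.weights[index] += delta


map_model = MapModel()
map_model.update("user=42", 0.5)    # a new feature can appear at any time
map_model.update("user=42", 0.25)

array_model = ArrayModel(n_features=3)
array_model.update(1, 0.5)          # indices must stay within the known range
print(map_model.get("user=42"), map_model.get("user=7"), array_model.weights)
```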

Other Use Case
• E-Commerce User-Item Recommendation
• Input Data
• Age
• Purchase time zone
• Previously bought items
• Cluster ID
• Target Data
• Evaluation of an Item by a User

Latent Dirichlet Allocation
• Most popular topic-model algorithm
• Mostly applied to text data
• Finds the hidden structure of data
• Unsupervised Learning
• Needs Input Data only
• Generative Model

• Generative Modelling in LDA
• Mimics how a document is generated
• 1. Choose what to write about (a topic)
• 2. Choose a word from the topic
• 3. Write it

• Input
• Text data (Documents)
• Output
• Topic-word distribution
• Document-Topic distribution

Image sources:
https://www.vappingo.com/word-blog/wp-content/uploads/2011/01/paper2.jpg
https://wellecks.wordpress.com/2014/10/26/ldaoverflow-with-online-lda/

Learning Method
• Define a generative model
• For documents
• Learn parameters to reproduce the documents

Learning Method
http://heartruptcy.blog.fc2.com/blog-entry-124.html

Graphical Model (Code)
For Topic k = {1, …, K}
    WordDistribution[k] ~ Dir(β)
For Document d = {1, …, D}
    TopicDistribution[d] ~ Dir(α)
    For Word n = {1, …, numOfWord[d]}
        WordTopic[d][n] ~ Multinomial(TopicDistribution[d])
        Word[d][n] ~ Multinomial(WordDistribution[WordTopic[d][n]])
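
The generative process above can be run as a small simulation. This is a sketch using only the Python standard library; the function names, the toy vocabulary, and the default values of `alpha` and `beta` are assumptions chosen for illustration. A Dirichlet sample is obtained by normalizing independent Gamma draws.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(vocab, n_topics, doc_lengths, alpha=0.5, beta=0.1, seed=0):
    """Generate documents by following the LDA generative process."""
    rng = random.Random(seed)
    # One word distribution per topic: WordDistribution[k] ~ Dir(beta).
    word_dist = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    docs = []
    for length in doc_lengths:
        # One topic distribution per document: TopicDistribution[d] ~ Dir(alpha).
        topic_dist = sample_dirichlet(alpha, n_topics, rng)
        words = []
        for _ in range(length):
            # Choose a topic for this position, then a word from that topic.
            k = rng.choices(range(n_topics), weights=topic_dist)[0]
            words.append(rng.choices(vocab, weights=word_dist[k])[0])
        docs.append(words)
    return docs


vocab = ["game", "team", "windows", "drive"]
docs = generate_corpus(vocab, n_topics=2, doc_lengths=[5, 3])
print(docs)
```

Learning LDA runs this process in reverse: given only the documents, it estimates the topic-word and document-topic distributions that could have produced them.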

Learning Method
• Variational Bayes
• Gibbs Sampling (MCMC)
• Particle Filtering
faster than Gibbs Sampling

Mini-batch Online LDA
• Faster than the Batch Algorithm
• Less noisy than pure Online LDA
[Figure: Batch Size axis running from Pure Online through Mini-batch Online to Batch]
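
The batch-size axis can be made concrete with a small sketch: a batch size of 1 corresponds to pure online learning, and a batch size equal to the corpus size corresponds to the batch algorithm. The decaying step size `(tau0 + t) ** (-kappa)` is the schedule commonly used in the online LDA literature; the slides do not state which schedule was used, so the function names and the default values of `tau0` and `kappa` are assumptions.

```python
def minibatches(docs, batch_size):
    """Yield consecutive slices of `docs` with at most `batch_size` items each."""
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


def learning_rate(t, tau0=1.0, kappa=0.7):
    """Step size for the t-th mini-batch; it shrinks as t grows."""
    return (tau0 + t) ** (-kappa)


doc_ids = list(range(10))  # stand-ins for 10 documents
sizes = [len(batch) for batch in minibatches(doc_ids, batch_size=4)]
rates = [learning_rate(t) for t in range(3)]
print(sizes, rates)
```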

Implemented Model
• Mini-Batch Map Model
• Doesn't assume a Vocabulary List
• Mini-Batch Array Model (other implementation)
• For known data
• Assumes a Vocabulary List

Faced Implementation Problem
• Meaningless words
• LDA: clusters words by co-occurrence
• a, the, I, He, is, in, on
• Stop Words: ignore them
• TF-IDF: how important a word is to a document in a collection or dataset

• TF-IDF
• Can be calculated by Hivemall
• Input Data: (DocId, Words)
• https://github.com/myui/hivemall/wiki/TFIDF-calculation
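
To show what a TF-IDF score over (DocId, Words) input looks like, here is a small standalone sketch. It uses one common variant, tf = count / document length and idf = log(N / df) + 1, which is an assumption and not necessarily the exact formula used by Hivemall; the function name `tfidf` and the toy documents are also illustrative.

```python
import math
from collections import Counter


def tfidf(docs):
    """docs: dict mapping doc id -> list of words. Returns doc id -> {word: score}."""
    n_docs = len(docs)
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for words in docs.values():
        df.update(set(words))
    scores = {}
    for doc_id, words in docs.items():
        counts = Counter(words)
        scores[doc_id] = {
            word: (count / len(words)) * (math.log(n_docs / df[word]) + 1.0)
            for word, count in counts.items()
        }
    return scores


toy_docs = {1: ["justice", "law", "law"], 2: ["law", "windows"]}
scores = tfidf(toy_docs)
print(scores)
```

Words that occur in every document, such as "law" here, get the smallest idf, so a threshold on the score can filter out uninformative words before running LDA.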

• TF-IDF
• 1 ["justice:0.1641245850805637","found:0.06564983513276658","discussion:0.06564983513276658",
"law:0.06564983513276658","based:0.06564983513276658","religion:0.06564983513276658",
"viewpoints:0.03282491756638329","rationality:0.03282491756638329","including:0.03282491756638329",
"context:0.03282491756638329","concept:0.03282491756638329","rightness:0.03282491756638329",
"general:0.03282491756638329","many:0.03282491756638329","differing:0.03282491756638329",
"fairness:0.03282491756638329","social:0.03282491756638329","broadest:0.03282491756638329",
"equity:0.03282491756638329","includes:0.03282491756638329","theology:0.03282491756638329"]

• Vocabulary List Model
• Initialize lambda for all words at first
• If a word does not appear in the document:
• Lambda decreases at the same rate
• No initialization problem

• Online Map Model
• Initialize lambda when a new word is fetched
• Final lambda depends on when the word first appeared
• Initialization problem

• Prepared Dummy Lambda
• Initialize dummy lambdas at first
• Apply the lambda update rule to the dummy lambda
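
One way to read the dummy-lambda idea is sketched below for a single topic. It assumes the online LDA update lambda <- (1 - rho) * lambda + rho * (eta + stat), where stat is 0 for a word that does not occur in the mini-batch. The dummy value receives that stat-free update every step, so a word seen for the first time can start from the value it would have had in a full-vocabulary model. The class name `LazyLambda`, the deterministic initial value, and the toy numbers are assumptions, not the internship code.

```python
class LazyLambda:
    """Lambda of one topic, with a shared dummy value for words not seen yet."""

    def __init__(self, init_value, eta):
        self.eta = eta
        self.dummy = init_value   # current lambda of any still-unseen word
        self.values = {}          # word -> lambda for words seen so far

    def update(self, rho, stats):
        """Apply one mini-batch update; `stats` maps word -> sufficient statistic."""
        for word in stats:
            if word not in self.values:
                # Start from the dummy so the skipped decay steps are accounted for.
                self.values[word] = self.dummy
        for word in self.values:
            stat = stats.get(word, 0.0)
            self.values[word] = (1 - rho) * self.values[word] + rho * (self.eta + stat)
        self.dummy = (1 - rho) * self.dummy + rho * self.eta


lam = LazyLambda(init_value=1.0, eta=0.1)
lam.update(rho=0.5, stats={"a": 2.0})    # only "a" occurs in the first mini-batch
lam.update(rho=0.25, stats={"b": 4.0})   # "b" appears for the first time
print(lam.values, lam.dummy)
```

In this toy run, "b" ends at the same value it would have reached had it been initialized to 1.0 before the first mini-batch, which removes the dependence on when a word first appears.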

• Implicit Φ Normalization
• Not written explicitly
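
In the variational update used for LDA, Φ is usually given only up to proportionality, phi_k ∝ exp(E[log theta_k] + E[log beta_k]), so an implementation has to normalize over the K topics itself. The sketch below assumes this is the normalization the slide refers to; the function name `normalized_phi` and the example values are illustrative, and the two expectation vectors are taken as given inputs.

```python
import math


def normalized_phi(elog_theta_d, elog_beta_w):
    """Return phi over K topics for one (document, word) pair, summing to 1."""
    logits = [t + b for t, b in zip(elog_theta_d, elog_beta_w)]
    m = max(logits)  # subtract the max before exponentiating for numerical stability
    unnorm = [math.exp(v - m) for v in logits]
    total = sum(unnorm)
    return [u / total for u in unnorm]


# Hypothetical expectations for K = 2 topics.
phi = normalized_phi([-1.0, -2.0], [-0.5, -0.5])
print(phi)
```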

• Difficult Debugging
• Circular reference
[Figure: dependence arrows among Φ, γ, and β]

Result: Online LDA
• Data: 20News
• Topics: 6
• Iterations: 10

• Topic:1
• No.0 writes[6]: 0.007909349
• No.1 article[7]: 0.006535292
• No.2 apr[3]: 0.0034389505
• No.3 team[4]: 0.00340712
• No.4 game[4]: 0.0033219245
• No.5 year[4]: 0.0032751847
• No.6 good[4]: 0.0032546786
• No.7 time[4]: 0.0030503264
• No.8 play[4]: 0.00262638
• No.9 games[5]: 0.002433915
• No.10 season[6]: 0.0022433712
• No.11 ll[2]: 0.0020719478
• No.12 players[7]: 0.0020332362
• No.13 win[3]: 0.0019284738
• No.14 hockey[6]: 0.0018870989
• No.15 league[6]: 0.0018450991
• No.16 baseball[8]: 0.0018226414
• No.17 years[5]: 0.0017960512
• No.18 mail[4]: 0.0017936684
• No.19 people[6]: 0.0017642054
• No.20 teams[5]: 0.0016675185
• No.21 great[5]: 0.001642102
• No.22 ve[2]: 0.0015846819
• No.23 point[5]: 0.0015730233
• No.24 cs[2]: 0.0015609838
• No.25 didn[4]: 0.0015398773
• No.26 lot[3]: 0.0015123658
• No.27 mike[4]: 0.0014935194
• No.28 university[10]: 0.0014718652
• No.29 player[6]: 0.0014655796
• Interpretation: Sports

• Topic:3
• No.0 writes[6]: 0.0065424195
• No.1 article[7]: 0.005621346
• No.2 apr[3]: 0.002746017
• No.3 work[4]: 0.002731466
• No.4 good[4]: 0.00266331
• No.5 ve[2]: 0.0025969497
• No.6 time[4]: 0.0025880735
• No.7 system[6]: 0.0024449623
• No.8 problem[7]: 0.002349667
• No.9 mail[4]: 0.0023234019
• No.10 windows[7]: 0.0021310966
• No.11 people[6]: 0.0018598152
• No.12 find[4]: 0.0018072439
• No.13 computer[8]: 0.0017470584
• No.14 email[5]: 0.0017204053
• No.15 drive[5]: 0.0017121765
• No.16 bit[3]: 0.0016401116
• No.17 program[7]: 0.001636191
• No.18 software[8]: 0.0016341405
• No.19 university[10]: 0.0015907411
• No.20 ll[2]: 0.0015530549
• No.21 thing[5]: 0.0015159848
• No.22 card[4]: 0.0013826761
• No.23 doesn[5]: 0.0013809163
• No.24 phone[5]: 0.0013786326
• No.25 question[8]: 0.0013721529
• No.26 internet[8]: 0.001368883
• No.27 file[4]: 0.0013417117
• No.28 things[6]: 0.0013097903
• No.29 set[3]: 0.0013029057
• Interpretation: Computer

Impression about Internship
• Machine Learning
• Implementing ML algorithms from scratch was fun
• Contributing to OSS was a precious experience for me

Unfinished Business
• Documentation
• Write entries for FM/Online LDA
• UDTF
• Build the functions into Hivemall

• Thank you for listening
