Treasure Data Summer Internship
Final Report
Naoki Ishikawa (@NeokiStones)
2015/09/30 13:30-
Who am I
2
• Naoki Ishikawa
• Waseda University, Information Science M1
• Research: Evolutionary Computation / Reinforcement Learning
• Laboratory: Sugawara Lab
• Laboratory theme: Artificial Intelligence
Table of contents
3
• Implemented Algorithm
• Factorization Machine
• Latent Dirichlet Allocation
Factorization Machine
5
• Algorithm for recommendation
• Classification
• Regression
• Supervised learning
• Needs both input and output data
• Suitable for sparse data
Application
7
• Prediction of Movie Rating
• Task: predict a movie rating (real number)
• Regression
- Input: self-designed matrix
- Output: rating vector
8
Prediction of Movie Rating
[Figure: Input and Output]
INPUT Details
9
• Identifier
- User Identifier: [0, 0, …, 0, 1, 0, …, 0]
- Movie Identifier: [0, 0, …, 0, 0, 1, 0, …, 0]
• Designed Feature
- Ratings of other movies
- Time
- Last movie rated
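A minimal sketch of how one rating event could be encoded as the sparse input row described above. The function name `encode_row`, the column layout, and the two example features are assumptions for illustration, not the implementation from the internship.

```python
def encode_row(user_idx, movie_idx, num_users, num_movies, extra=None):
    """Return one sparse input row as {column_index: value}.

    Columns [0, num_users) one-hot encode the user,
    columns [num_users, num_users + num_movies) one-hot encode the movie,
    and designed features are appended after those two blocks.
    """
    if not 0 <= user_idx < num_users:
        raise ValueError("user_idx out of range")
    if not 0 <= movie_idx < num_movies:
        raise ValueError("movie_idx out of range")
    row = {user_idx: 1.0, num_users + movie_idx: 1.0}
    offset = num_users + num_movies
    for i, value in enumerate(extra or []):
        if value != 0.0:  # skip zeros so the row stays sparse
            row[offset + i] = float(value)
    return row


# Hypothetical example: 3 users, 4 movies, two designed features
# (a normalized timestamp and the rating of the last movie rated).
example = encode_row(user_idx=1, movie_idx=2, num_users=3, num_movies=4,
                     extra=[0.5, 4.0])
print(example)
```

Storing only the non-zero columns is one way to exploit the sparsity that the one-hot identifiers introduce.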
10
Recommendation Algorithm
• Collaborative Filtering
• Association Analysis
• Bayesian Network
Prediction of Movie Rating
11
• Hivemall
• Matrix Factorization
• Recommendation
12
Difference from Matrix Factorization
• Data Structure
• Matrix Factorization
• User-Item Matrix
http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png
[Figure labels: Input, Learning Parameter]
13
Difference from Matrix Factorization
• Factorization Machine
[Figure: Input matrix and Learning Parameters V and W]
14
• Factorization Machine considers
• context data
• interactions between variables
Advantage of Factorization Machine
15
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
16
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
• Global bias (mean)
• Factorized interaction (W_kj)
• Regression coefficient of the k-th variable
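For reference, the standard second-order (d = 2) Factorization Machine prediction, which the labels above appear to annotate, can be written as follows. The slide's own equation image is not recoverable, so the symbols w_0, w_k, v_k, n and K follow common FM notation and are assumptions here.

```latex
\hat{y}(\mathbf{x}) = w_0
  + \sum_{k=1}^{n} w_k x_k
  + \sum_{k=1}^{n} \sum_{j=k+1}^{n} \langle \mathbf{v}_k, \mathbf{v}_j \rangle \, x_k x_j,
\qquad
\langle \mathbf{v}_k, \mathbf{v}_j \rangle = \sum_{f=1}^{K} v_{k,f} \, v_{j,f}
```

In this notation w_0 is the global bias, w_k is the regression coefficient of the k-th variable, and the inner product of v_k and v_j plays the role of the factorized interaction weight W_kj.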
17
Difference from Matrix Factorization
Prediction by Factorization Machine (d=2)
Learning Method
Stochastic Gradient Descent (SGD)
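A minimal sketch of second-order FM prediction and one SGD step on squared error, assuming the standard FM formulation. The names `fm_predict`, `fm_sgd_step`, `lr`, and `reg` are hypothetical and this is not the Hivemall code.

```python
import random


def fm_predict(x, w0, w, v):
    """x: sparse row {index: value}; w: linear weights; v: per-variable factor lists."""
    k_dim = len(v[0])
    linear = sum(w[i] * xi for i, xi in x.items())
    pair = 0.0
    for f in range(k_dim):
        s = sum(v[i][f] * xi for i, xi in x.items())
        s2 = sum((v[i][f] * xi) ** 2 for i, xi in x.items())
        pair += 0.5 * (s * s - s2)  # equals the sum over pairs i < j
    return w0 + linear + pair


def fm_sgd_step(x, y, w0, w, v, lr=0.01, reg=0.0):
    """One SGD step on 0.5 * (prediction - y)^2; updates w and v in place, returns new w0."""
    err = fm_predict(x, w0, w, v) - y
    k_dim = len(v[0])
    sums = [sum(v[i][f] * xi for i, xi in x.items()) for f in range(k_dim)]
    w0 -= lr * err  # the bias is left unregularized in this sketch
    for i, xi in x.items():
        w[i] -= lr * (err * xi + reg * w[i])
        for f in range(k_dim):
            grad = xi * (sums[f] - v[i][f] * xi)
            v[i][f] -= lr * (err * grad + reg * v[i][f])
    return w0


# Hypothetical toy run: fit a single sparse row to the target rating 4.0.
rng = random.Random(0)
n, k = 5, 2
w0, w = 0.0, [0.0] * n
v = [[rng.gauss(0.0, 0.1) for _ in range(k)] for _ in range(n)]
x = {0: 1.0, 3: 1.0, 4: 0.5}
before = abs(fm_predict(x, w0, w, v) - 4.0)
for _ in range(200):
    w0 = fm_sgd_step(x, 4.0, w0, w, v, lr=0.05)
after = abs(fm_predict(x, w0, w, v) - 4.0)
print(before, after)
```

The pairwise term uses the usual identity that rewrites the double sum over pairs as a difference of squares, so the cost per row is linear in the number of non-zero entries.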
18
Local Implementation
19
Difference from Matrix Factorization
• d-way
• FM / MF
• assume K latent attributes
• Matrix Factorization: d = 2
• Factorization Machine: d ≥ 2
20
HyperParameter
• K: the number of latent factors
• η: the regularization parameter
21
Implemented Model
• Implemented Model
• d = 2
• MapModel
• ArrayModel
22
Implemented Model
• MapModel
• For unknown data
• Flexible
• Suitable for Online Learning
23
Implemented Model
• ArrayModel
• For known data
• Less overhead
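The trade-off between the two models can be sketched as follows. The class names are taken from the slides, but the bodies are a hypothetical illustration and not the author's implementation.

```python
class MapModel:
    """Weights keyed by feature name; unseen features are created lazily."""

    def __init__(self, default=0.0):
        self.default = default
        self.weights = {}

    def get(self, feature):
        return self.weights.get(feature, self.default)

    def add(self, feature, delta):
        self.weights[feature] = self.get(feature) + delta


class ArrayModel:
    """Weights in a fixed-size list; the feature space must be known up front."""

    def __init__(self, size, default=0.0):
        self.weights = [default] * size

    def get(self, index):
        return self.weights[index]

    def add(self, index, delta):
        self.weights[index] += delta


# The map accepts a feature it has never seen; the array needs a valid index.
map_model = MapModel()
map_model.add("user=42", 0.5)
array_model = ArrayModel(size=3)
array_model.add(2, 0.5)
print(map_model.get("user=42"), map_model.get("unseen"), array_model.get(2))
```

The dictionary pays for hashing on every access, which is one plausible reading of the "less overhead" point for the array variant.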
24
Other Use Case
• E-Commerce User-Item Recommendation
• Input Data
• Age
• Purchase timezone
• Previously purchased items
• Cluster ID
• Target Data
• Evaluation of an item by a user
Latent Dirichlet Allocation
26
• The most popular topic model algorithm
• Mostly applied to text data
• Finds hidden structure in data
• Unsupervised learning
• Needs input data only
• Generative model
Latent Dirichlet Allocation
27
• Generative modelling in LDA
• Mimics how a document is written
• 1. Choose the topic to write about
• 2. Choose a word from that topic
• 3. Write it down
Latent Dirichlet Allocation
28
• Input
• Text data (Documents)
• Output
• Topic-word distribution
• Document-Topic distribution
Latent Dirichlet Allocation
29
https://www.vappingo.com/word-blog/wp-content/uploads/2011/01/paper2.jpg
https://wellecks.wordpress.com/2014/10/26/ldaoverflow-with-online-lda/
Learning Method
30
• Define Generative model
• For documents
• Learn parameters to reproduce the document
Learning Method
31
[Figure: K topics]
Learning Method
32
http://heartruptcy.blog.fc2.com/blog-entry-124.html
Graphical Model(Code)
33
• For Topic k = 1, …, K
    WordDistribution[k] ~ Dir(β)
• For Document d = 1, …, D
    TopicDistribution[d] ~ Dir(α)
    For Word n = 1, …, numOfWord[d]
        WordTopic[d][n] ~ Multinomial(TopicDistribution[d])
        Word[d][n] ~ Multinomial(WordDistribution[WordTopic[d][n]])
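The generative process above can be sketched in runnable form with the standard library only. The function names, the toy vocabulary, and the hyperparameter values are assumptions for illustration. Dirichlet samples are drawn by normalizing Gamma draws.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Symmetric Dirichlet sample obtained by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def sample_index(probs, rng):
    """Draw one index from a categorical distribution."""
    return rng.choices(range(len(probs)), weights=probs, k=1)[0]


def generate_corpus(num_topics, vocab, doc_lengths, alpha, beta, seed=0):
    rng = random.Random(seed)
    # For each topic: WordDistribution[k] ~ Dir(beta)
    word_dist = [sample_dirichlet(beta, len(vocab), rng) for _ in range(num_topics)]
    docs = []
    for length in doc_lengths:
        # For each document: TopicDistribution[d] ~ Dir(alpha)
        topic_dist = sample_dirichlet(alpha, num_topics, rng)
        words = []
        for _ in range(length):
            topic = sample_index(topic_dist, rng)                    # WordTopic[d][n]
            words.append(vocab[sample_index(word_dist[topic], rng)])  # Word[d][n]
        docs.append(words)
    return docs, word_dist


# Hypothetical toy vocabulary and settings.
vocab = ["game", "team", "hockey", "windows", "drive", "software"]
docs, word_dist = generate_corpus(num_topics=2, vocab=vocab,
                                  doc_lengths=[8, 5, 10], alpha=0.5, beta=0.5)
for doc in docs:
    print(" ".join(doc))
```

Learning LDA runs this process in reverse: given only the documents, it estimates the topic-word and document-topic distributions that could have produced them.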
Learning Method
34
• Variational Bayes (faster than Gibbs Sampling)
• Gibbs Sampling (MCMC)
• Particle Filtering
Mini-batch Online LDA
36
• Faster than Batch Algorithm
• Less noise than pure Online LDA
[Figure: batch-size axis running from Pure Online through Mini-batch Online to Batch]
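The three regimes differ only in how many documents are consumed per update, which a short sketch can make concrete. The helper name `mini_batches` is hypothetical.

```python
def mini_batches(docs, batch_size):
    """Yield consecutive slices of docs.

    batch_size == 1 corresponds to pure online learning,
    batch_size == len(docs) to batch learning,
    and anything in between to mini-batch online learning.
    """
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(docs), batch_size):
        yield docs[start:start + batch_size]


doc_ids = list(range(5))
print(list(mini_batches(doc_ids, 2)))
```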
37
Implemented Model
• Mini-Batch Map Model
• For unknown data
• Doesn't assume a vocabulary list
• Mini-Batch Array Model (other implementation)
• For known data
• Assumes a vocabulary list
39
Faced Implementation Problem
• Meaningless words
• LDA: clusters words by co-occurrence
• "a", "the", "I", "He", "is", "in", "on"
• Stop words: ignore them
• TF-IDF: how important a word is to a document in a collection or dataset
• TF-IDF
• can be calculated by Hivemall
• Input Data: (DocId, Words)
• https://github.com/myui/hivemall/wiki/TFIDF-calculation
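A minimal sketch of stop-word removal followed by TF-IDF over (DocId, Words) pairs. It is plain Python rather than the Hivemall query, and the exact weighting (term frequency times `log(N / df) + 1`) is one common variant chosen here as an assumption.

```python
import math
from collections import Counter

# Stop words taken from the examples on the slide.
STOP_WORDS = {"a", "the", "i", "he", "is", "in", "on"}


def tfidf(docs):
    """docs: {doc_id: [word, ...]} -> {doc_id: {word: score}}."""
    filtered = {
        doc_id: [w.lower() for w in words if w.lower() not in STOP_WORDS]
        for doc_id, words in docs.items()
    }
    doc_freq = Counter()
    for words in filtered.values():
        doc_freq.update(set(words))  # count each word once per document
    n_docs = len(filtered)
    scores = {}
    for doc_id, words in filtered.items():
        counts = Counter(words)
        total = len(words)
        scores[doc_id] = {
            w: (c / total) * (math.log(n_docs / doc_freq[w]) + 1.0)
            for w, c in counts.items()
        }
    return scores


# Hypothetical two-document corpus.
corpus = {
    1: ["the", "justice", "is", "law"],
    2: ["the", "game", "is", "on", "law"],
}
scores = tfidf(corpus)
print(scores[1])
```

A word that appears in every document ("law" here) gets a lower score than one that is specific to a single document ("justice").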
41
Faced Implementation Problem
• 1 ["justice:0.1641245850805637","found:0.06564983513276658","discussion:0.06564983513276658","law:0.06564983513276658","based:0.06564983513276658","religion:0.06564983513276658","viewpoints:0.03282491756638329","rationality:0.03282491756638329","including:0.03282491756638329","context:0.03282491756638329","concept:0.03282491756638329","rightness:0.03282491756638329","general:0.03282491756638329","many:0.03282491756638329","differing:0.03282491756638329","fairness:0.03282491756638329","social:0.03282491756638329","broadest:0.03282491756638329","equity:0.03282491756638329","includes:0.03282491756638329","theology:0.03282491756638329"]
42
Faced Implementation Problem
• TF-IDF
• Vocabulary List Model
• Initialize lambda for all words up front
• If a word does not appear in the document:
• Lambda decreases at the same rate
• No initialization problem
43
Faced Implementation Problem
• Online Map Model
• Initialize lambda when a new word is first fetched
• Final lambda depends on when the word first appeared
• Initialization problem
44
Faced Implementation Problem
• Prepared Dummy Lambda
• Initialize dummy lambdas up front
• Apply the lambda update rule to the dummy lambdas
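One way to read the dummy-lambda idea is sketched below: a per-topic dummy entry receives the same update as any word absent from the batch, and a newly seen word starts from the dummy's current value. The class `MapLambda`, the simplified update rule, and the values of `rho` and `eta` are assumptions for illustration, not the author's code.

```python
def absent_update(value, rho, eta):
    """Update for a word absent from the batch: lambda <- (1 - rho) * lambda + rho * eta."""
    return (1.0 - rho) * value + rho * eta


class MapLambda:
    """Topic-word parameters stored in a dict, plus one dummy value per topic."""

    def __init__(self, num_topics, init_value, eta):
        self.eta = eta
        self.dummy = [init_value] * num_topics
        self.words = {}

    def step(self, rho, stats):
        """stats maps word -> per-topic statistics observed in this batch."""
        for word in stats:
            if word not in self.words:
                # A new word starts from the dummy, as if it had been tracked all along.
                self.words[word] = list(self.dummy)
        for word, lam in self.words.items():
            seen = stats.get(word, [0.0] * len(lam))
            for k in range(len(lam)):
                lam[k] = absent_update(lam[k], rho, self.eta) + rho * seen[k]
        self.dummy = [absent_update(v, rho, self.eta) for v in self.dummy]


# Compare a map that meets "hockey" late with one that knew it from the start.
online = MapLambda(num_topics=2, init_value=1.0, eta=0.01)
known = MapLambda(num_topics=2, init_value=1.0, eta=0.01)
known.words["hockey"] = [1.0, 1.0]  # vocabulary-list style initialization
for rho, stats in [(0.5, {}), (0.4, {}), (0.3, {"hockey": [2.0, 0.0]})]:
    online.step(rho, stats)
    known.step(rho, stats)
print(online.words["hockey"], known.words["hockey"])
```

Under these assumptions the final lambda no longer depends on when the word first appeared, which is the initialization problem described for the Online Map Model.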
45
Faced Implementation Problem
• Implicit Φ Normalization
• Not written explicitly
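In the standard variational update for LDA, Φ is given only up to proportionality (Φ ∝ exp(E[log θ] + E[log β])), so the normalization over topics has to be added by the implementer. A minimal sketch of that step, assuming this is the normalization the slide refers to; `normalize_phi` and its input are hypothetical.

```python
import math


def normalize_phi(log_weights):
    """Turn per-topic log weights for one word into probabilities that sum to 1.

    Subtracting the maximum before exponentiating avoids underflow
    when every log weight is very negative.
    """
    peak = max(log_weights)
    exps = [math.exp(value - peak) for value in log_weights]
    total = sum(exps)
    return [e / total for e in exps]


phi = normalize_phi([-1000.0, -1001.0, -1002.0])
print(phi, sum(phi))
```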
48
Faced Implementation Problem
49
Faced Implementation Problem
• Difficult Debugging
• Circular reference
[Figure: circular dependence among Φ, γ, and β]
50
Result: Online LDA
• Data: 20News
• Topics: 6
• Iterations: 10
• Topic 1 (Sports)
• No.0 writes[6]: 0.007909349
• No.1 article[7]: 0.006535292
• No.2 apr[3]: 0.0034389505
• No.3 team[4]: 0.00340712
• No.4 game[4]: 0.0033219245
• No.5 year[4]: 0.0032751847
• No.6 good[4]: 0.0032546786
• No.7 time[4]: 0.0030503264
• No.8 play[4]: 0.00262638
• No.9 games[5]: 0.002433915
• No.10 season[6]: 0.0022433712
• No.11 ll[2]: 0.0020719478
• No.12 players[7]: 0.0020332362
• No.13 win[3]: 0.0019284738
• No.14 hockey[6]: 0.0018870989
• No.15 league[6]: 0.0018450991
• No.16 baseball[8]: 0.0018226414
• No.17 years[5]: 0.0017960512
• No.18 mail[4]: 0.0017936684
• No.19 people[6]: 0.0017642054
• No.20 teams[5]: 0.0016675185
• No.21 great[5]: 0.001642102
• No.22 ve[2]: 0.0015846819
• No.23 point[5]: 0.0015730233
• No.24 cs[2]: 0.0015609838
• No.25 didn[4]: 0.0015398773
• No.26 lot[3]: 0.0015123658
• No.27 mike[4]: 0.0014935194
• No.28 university[10]: 0.0014718652
• No.29 player[6]: 0.0014655796
• Topic 3 (Computer)
• No.0 writes[6]: 0.0065424195
• No.1 article[7]: 0.005621346
• No.2 apr[3]: 0.002746017
• No.3 work[4]: 0.002731466
• No.4 good[4]: 0.00266331
• No.5 ve[2]: 0.0025969497
• No.6 time[4]: 0.0025880735
• No.7 system[6]: 0.0024449623
• No.8 problem[7]: 0.002349667
• No.9 mail[4]: 0.0023234019
• No.10 windows[7]: 0.0021310966
• No.11 people[6]: 0.0018598152
• No.12 find[4]: 0.0018072439
• No.13 computer[8]: 0.0017470584
• No.14 email[5]: 0.0017204053
• No.15 drive[5]: 0.0017121765
• No.16 bit[3]: 0.0016401116
• No.17 program[7]: 0.001636191
• No.18 software[8]: 0.0016341405
• No.19 university[10]: 0.0015907411
• No.20 ll[2]: 0.0015530549
• No.21 thing[5]: 0.0015159848
• No.22 card[4]: 0.0013826761
• No.23 doesn[5]: 0.0013809163
• No.24 phone[5]: 0.0013786326
• No.25 question[8]: 0.0013721529
• No.26 internet[8]: 0.001368883
• No.27 file[4]: 0.0013417117
• No.28 things[6]: 0.0013097903
• No.29 set[3]: 0.0013029057
Impression about Internship
55
• Machine Learning
• Implementing ML algorithms from scratch was fun
• Contributing to OSS was a precious experience for me
Unfinished Business
56
• Documentation
• Write entries for FM / Online LDA
• UDTF
• Build the functions into Hivemall
57
• Thank you for listening

Presentation Vikram Lander by Vedansh Gupta.pptxPresentation Vikram Lander by Vedansh Gupta.pptx
Presentation Vikram Lander by Vedansh Gupta.pptx
 
American Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptxAmerican Type Culture Collection (ATCC).pptx
American Type Culture Collection (ATCC).pptx
 
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
TEST BANK For Radiologic Science for Technologists, 12th Edition by Stewart C...
 
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
Labelling Requirements and Label Claims for Dietary Supplements and Recommend...
 
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
Discovery of an Accretion Streamer and a Slow Wide-angle Outflow around FUOri...
 
GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)GBSN - Biochemistry (Unit 1)
GBSN - Biochemistry (Unit 1)
 
Seismic Method Estimate velocity from seismic data.pptx
Seismic Method Estimate velocity from seismic  data.pptxSeismic Method Estimate velocity from seismic  data.pptx
Seismic Method Estimate velocity from seismic data.pptx
 

Treasure Data Summer Internship Final Report

  • 1. Summer Internship Final Report Naoki Ishikawa (@NeokiStones) 2015/09/30 13:30-
  • 2. Who am I • Naoki Ishikawa • Waseda University, Information Science M1 • Research: Evolutionary Computation / Reinforcement Learning • Laboratory: Sugawara Lab • Laboratory theme: Artificial Intelligence
  • 3. Table of contents • Implemented Algorithm • Factorization Machine • Latent Dirichlet Allocation
  • 5. Factorization Machine • Algorithm for Recommendation • Classification(Clustering) • Regression • Supervised Learning • Needs Input/Output Data • Suitable for Sparse Data
  • 7. Application • Prediction of Movie Rating • Task: Predict a movie rating (real number) • Regression
 - Input: Self-designed Matrix
 - Output: Rating Vector
  • 9. INPUT Details • Identifier
 - User Identifier : [0, 0, …, 0, 1, 0, …, 0]
 - Movie Identifier : [0, 0, …, 0, 0, 1, 0, …, 0] • Designed Feature
 - Rating of Other Movies
 - Time
 - Last Movie Rated
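The input layout on this slide can be sketched as a sparse row: one one-hot block for the user, one for the movie, then the designed features. This is only an illustration; `build_feature_row`, the column ordering, and the example values are assumptions, not taken from the slides.

```python
def build_feature_row(user_idx, movie_idx, n_users, n_movies, extra_features):
    """Return one sparse input row as {column_index: value}.

    Hypothetical layout: [user one-hot | movie one-hot | designed features].
    """
    row = {}
    row[user_idx] = 1.0                  # user identifier block
    row[n_users + movie_idx] = 1.0       # movie identifier block
    offset = n_users + n_movies
    for i, value in enumerate(extra_features):
        if value != 0.0:                 # keep the row sparse
            row[offset + i] = float(value)
    return row


# Example: user 2 of 4, movie 1 of 3, three designed features.
row = build_feature_row(user_idx=2, movie_idx=1, n_users=4, n_movies=3,
                        extra_features=[0.5, 0.0, 1.0])
```

Storing only the non-zero columns is what makes this representation a natural fit for the sparse data mentioned earlier.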
  • 10. Recommendation Algorithm • Collaborative Filtering • Association Analysis • Bayesian Network
  • 11. Prediction of Movie Rating • Hivemall • Matrix Factorization • Recommendation
  • 12. Difference from Matrix Factorization • Data Structure • Matrix Factorization • User-Item Matrix (figure labels: Input, Learning Parameter; image: http://ampcamp.berkeley.edu/big-data-mini-course/img/matrix_factorization.png)
  • 13. Difference from Matrix Factorization • Factorization Machine (figure labels: Input, Learning Parameter, Vv, k, Wk, 1)
  • 14. Advantage of Factorization Machine • Factorization Machine • Considers • context data • interaction between variables
  • 15. Difference from Matrix Factorization • Prediction by Factorization Machine (d=2)
  • 16. Difference from Matrix Factorization • Prediction by Factorization Machine (d=2) (equation labels: global bias (mean); interaction; factorization (Wkj); regression coefficient of k-th variable)
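The equation image for this slide did not survive in the transcript. For reference, the standard second-order (d=2) factorization machine model, which matches the slide's labels, can be written as follows. The symbol names are the usual ones and are an assumption about the slide's notation.

```latex
\hat{y}(\mathbf{x}) \;=\;
\underbrace{w_0}_{\text{global bias}}
\;+\; \sum_{k=1}^{n} \underbrace{w_k}_{\text{coefficient of $k$-th variable}} x_k
\;+\; \sum_{k=1}^{n} \sum_{j=k+1}^{n}
\underbrace{\langle \mathbf{v}_k, \mathbf{v}_j \rangle}_{\text{factorized interaction } W_{kj}} \, x_k x_j
```

Here each $\mathbf{v}_k$ is a vector of $K$ latent factors, so the interaction weight $W_{kj}$ is never stored directly but reconstructed from the two factor vectors.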
  • 17. Difference from Matrix Factorization • Prediction by Factorization Machine (d=2) • Learning Method: Stochastic Gradient Descent (SGD)
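A minimal sketch of d=2 prediction and one SGD step on squared error, in plain Python. It is not the Hivemall implementation; the function names, the `learning_rate` and `reg` parameters, and the data layout (sparse row as a dict, `V` as a list of factor lists) are assumptions made for illustration.

```python
def fm_predict(x, w0, w, V):
    """Second-order FM prediction for a sparse row x = {index: value}."""
    k_dim = len(V[0])
    linear = sum(w[i] * xi for i, xi in x.items())
    interaction = 0.0
    for f in range(k_dim):
        # Pairwise term via the identity 0.5 * ((sum v*x)^2 - sum (v*x)^2).
        s = sum(V[i][f] * xi for i, xi in x.items())
        s_sq = sum((V[i][f] * xi) ** 2 for i, xi in x.items())
        interaction += 0.5 * (s * s - s_sq)
    return w0 + linear + interaction


def sgd_step(x, y, w0, w, V, learning_rate=0.01, reg=0.0):
    """One SGD update on 0.5 * (prediction - y)^2.

    Updates w and V in place and returns the new w0.
    """
    err = fm_predict(x, w0, w, V) - y
    k_dim = len(V[0])
    # Per-factor sums are computed once, before any parameter changes.
    sums = [sum(V[i][f] * xi for i, xi in x.items()) for f in range(k_dim)]
    w0 -= learning_rate * err
    for i, xi in x.items():
        w[i] -= learning_rate * (err * xi + reg * w[i])
        for f in range(k_dim):
            grad = xi * (sums[f] - V[i][f] * xi)
            V[i][f] -= learning_rate * (err * grad + reg * V[i][f])
    return w0
```

Only the indices present in the sparse row are touched, so the cost of one step grows with the number of non-zero features times K rather than with the full feature count.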
  • 19. Difference from Matrix Factorization • d-way • FM / MF • assume K latent attributes • Matrix Factorization: d = 2 • Factorization Machine: d ≥ 2
  • 20. HyperParameter • K: the number of hidden factors • η: the regularization parameter
  • 21. Implemented Model • d = 2 • MapModel • ArrayModel
  • 22. Implemented Model • MapModel • For unknown data • Flexible • Suitable for Online Learning
  • 23. Implemented Model • ArrayModel • For known data • Less overhead
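The trade-off between the two models can be illustrated with two small parameter stores. The class names follow the slides, but the bodies below are a hypothetical sketch and not the actual Hivemall code: one store grows as new features arrive, the other needs the feature count up front.

```python
class MapModel:
    """Weights keyed by feature name; unseen features are created on first use."""

    def __init__(self, default=0.0):
        self.weights = {}
        self.default = default

    def get(self, feature):
        return self.weights.setdefault(feature, self.default)

    def update(self, feature, delta):
        self.weights[feature] = self.get(feature) + delta


class ArrayModel:
    """Weights in a fixed-size list; feature indices must be known in advance."""

    def __init__(self, n_features, default=0.0):
        self.weights = [default] * n_features

    def get(self, index):
        return self.weights[index]

    def update(self, index, delta):
        self.weights[index] += delta


m = MapModel()
m.update("user=42", 0.5)      # new key is created on the fly
a = ArrayModel(n_features=3)
a.update(1, 0.25)             # index must already be within range
```

The map version pays for hashing on every access, while the array version is a plain index lookup but fails on any feature outside the declared range.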
  • 24. Other Use Case • E-Commerce User-Item Recommendation • Input Data • Age • Purchase timezone • Previously purchased items • Cluster ID • Target Data • Evaluation of an Item by a User
  • 25. Table of contents • Implemented Algorithm • Factorization Machine • Latent Dirichlet Allocation
  • 26. Latent Dirichlet Allocation • Most popular topic-model algorithm • Mostly applied to text data • Finds hidden structure in data • Unsupervised Learning • Needs Input Data only • Generative Model
  • 27. Latent Dirichlet Allocation • Generative Modelling in LDA • Mimics how a document is written • 1. Choose a topic to write about • 2. Choose a word from that topic • 3. Write the word
  • 28. Latent Dirichlet Allocation • Input • Text data (Documents) • Output • Topic-word distribution • Document-topic distribution
  • 30. Learning Method • Define a generative model for documents • Learn parameters that reproduce the documents
  • 33. Graphical Model (Code)
 For Topic k in {1,…, K}:
   WordDistribution[k] ~ Dir(β)
 For Document d in {1,…, D}:
   TopicDistribution[d] ~ Dir(α)
   For Word n in {1,…, numOfWord[d]}:
     WordTopic[d][n] ~ TopicDistribution[d]
     Word[d][n] ~ WordDistribution[WordTopic[d][n]]
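The generative process on this slide can be sketched with the Python standard library alone. The function names, the symmetric priors, and the default values of `alpha` and `beta` are assumptions chosen for illustration. This samples documents from the model; it is not the learning algorithm.

```python
import random


def sample_dirichlet(concentration, size, rng):
    """Draw from a symmetric Dirichlet by normalizing Gamma(concentration, 1) draws."""
    draws = [rng.gammavariate(concentration, 1.0) for _ in range(size)]
    total = sum(draws)
    return [d / total for d in draws]


def generate_corpus(n_topics, vocab, doc_lengths, alpha=0.5, beta=0.5, seed=0):
    """Generate one document (list of words) per entry in doc_lengths."""
    rng = random.Random(seed)
    # WordDistribution[k] ~ Dir(beta)
    word_dist = [sample_dirichlet(beta, len(vocab), rng) for _ in range(n_topics)]
    corpus = []
    for length in doc_lengths:
        # TopicDistribution[d] ~ Dir(alpha)
        topic_dist = sample_dirichlet(alpha, n_topics, rng)
        doc = []
        for _ in range(length):
            # WordTopic[d][n] is drawn from the document's topic distribution.
            topic = rng.choices(range(n_topics), weights=topic_dist)[0]
            # Word[d][n] is drawn from that topic's word distribution.
            doc.append(rng.choices(vocab, weights=word_dist[topic])[0])
        corpus.append(doc)
    return corpus


corpus = generate_corpus(n_topics=2, vocab=["game", "team", "file", "disk"],
                         doc_lengths=[4, 6])
```

Learning runs this process in reverse: given only the words, it estimates the topic-word and document-topic distributions that would most plausibly have produced them.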
  • 35. Learning Method • Variational Bayes • Gibbs Sampling (MCMC) • Particle Filtering (slide note: faster than Gibbs Sampling)
  • 36. Mini-batch Online LDA • Faster than Batch Algorithm • Less noise than pure Online LDA (figure: Batch Size axis running from Pure Online through Mini-batch Online to Batch)
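The three regimes on the batch-size axis differ only in how many documents are consumed per parameter update. A small sketch follows; `mini_batches` and the sample document names are hypothetical.

```python
def mini_batches(documents, batch_size):
    """Yield consecutive slices of `documents` with at most `batch_size` items."""
    if batch_size < 1:
        raise ValueError("batch_size must be >= 1")
    for start in range(0, len(documents), batch_size):
        yield documents[start:start + batch_size]


docs = ["d0", "d1", "d2", "d3", "d4"]
pure_online = list(mini_batches(docs, 1))        # one update per document
mini_batch = list(mini_batches(docs, 2))         # one update per small group
batch = list(mini_batches(docs, len(docs)))      # one update per full pass
```

A batch size of 1 gives the noisiest updates, the full corpus gives the slowest, and anything in between trades one against the other.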
  • 37. Implemented Model • Mini-Batch Map Model • For unknown data • Doesn't assume a Vocabulary List • Mini-Batch Array Model (Other implementation) • For known data • Assumes a Vocabulary List
  • 39. Faced Implementation Problem • Meaningless words • LDA: Clusters words by co-occurrence • "a", "the", "I", "He", "is", "in", "on" • Stop Words: Ignore them • TF-IDF: how important a word is to a document in a collection or dataset
  • 41. Faced Implementation Problem • TF-IDF • can be calculated by Hivemall • Input Data: (DocId, Words) • https://github.com/myui/hivemall/wiki/TFIDF-calculation
  • 42. Faced Implementation Problem • TF-IDF • 1 ["justice:0.1641245850805637","found:0.06564983513276658","discussion:0.06564983513276658","law:0.06564983513276658","based:0.06564983513276658","religion:0.06564983513276658","viewpoints:0.03282491756638329","rationality:0.03282491756638329","including:0.03282491756638329","context:0.03282491756638329","concept:0.03282491756638329","rightness:0.03282491756638329","general:0.03282491756638329","many:0.03282491756638329","differing:0.03282491756638329","fairness:0.03282491756638329","social:0.03282491756638329","broadest:0.03282491756638329","equity:0.03282491756638329","includes:0.03282491756638329","theology:0.03282491756638329"]
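To show the idea behind the per-document scores above, here is a plain-Python sketch that drops stop words and then computes TF-IDF from (DocId, Words) input. It is not the Hivemall query from the linked wiki page. The formula used, `tf * (log(n_docs / df) + 1)`, is one common variant chosen as an assumption, and the stop-word set simply reuses the examples from the slides.

```python
import math
from collections import Counter

# Stop words taken from the slide's examples (lower-cased).
STOP_WORDS = {"a", "the", "i", "he", "is", "in", "on"}


def tf_idf(docs):
    """docs: {doc_id: [word, ...]} -> {doc_id: {word: score}}."""
    filtered = {d: [w.lower() for w in ws if w.lower() not in STOP_WORDS]
                for d, ws in docs.items()}
    # Document frequency: in how many documents each word appears.
    df = Counter()
    for words in filtered.values():
        df.update(set(words))
    n_docs = len(filtered)
    scores = {}
    for d, words in filtered.items():
        counts = Counter(words)
        total = len(words)
        scores[d] = {w: (c / total) * (math.log(n_docs / df[w]) + 1.0)
                     for w, c in counts.items()}
    return scores


scores = tf_idf({1: ["The", "law", "is", "justice"],
                 2: ["the", "game", "is", "on"]})
```

Words that occur in every document get the lowest weight for a given term frequency, which is why TF-IDF helps even for uninformative words that are not on the stop-word list.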
  • 43. Faced Implementation Problem • Vocabulary List Model • Initialize lambda for all words at first • If a word does not appear in the Doc: • Lambda decreases at the same rate • No initialization problem
  • 44. Faced Implementation Problem • Online Map Model • Initialize lambda when a new word is fetched • Final lambda depends on when the word first appeared • Initialization problem
  • 45. Faced Implementation Problem • Prepared Dummy Lambda • Initialize dummy lambdas at first • Apply the lambda update rule to the dummy lambdas
  • 48. Faced Implementation Problem • Implicit Φ Normalization • Not written explicitly
  • 49. Faced Implementation Problem • Difficult Debugging • Circular reference (figure: Φ, γ, β; arrows show dependence)
  • 50. Result: Online LDA • Data: 20News • Topics: 6 • Iterations: 10
  • 52. Result: Online LDA • Topic: 1 (Sports) • No.0 writes[6]: 0.007909349 • No.1 article[7]: 0.006535292 • No.2 apr[3]: 0.0034389505 • No.3 team[4]: 0.00340712 • No.4 game[4]: 0.0033219245 • No.5 year[4]: 0.0032751847 • No.6 good[4]: 0.0032546786 • No.7 time[4]: 0.0030503264 • No.8 play[4]: 0.00262638 • No.9 games[5]: 0.002433915 • No.10 season[6]: 0.0022433712 • No.11 ll[2]: 0.0020719478 • No.12 players[7]: 0.0020332362 • No.13 win[3]: 0.0019284738 • No.14 hockey[6]: 0.0018870989 • No.15 league[6]: 0.0018450991 • No.16 baseball[8]: 0.0018226414 • No.17 years[5]: 0.0017960512 • No.18 mail[4]: 0.0017936684 • No.19 people[6]: 0.0017642054 • No.20 teams[5]: 0.0016675185 • No.21 great[5]: 0.001642102 • No.22 ve[2]: 0.0015846819 • No.23 point[5]: 0.0015730233 • No.24 cs[2]: 0.0015609838 • No.25 didn[4]: 0.0015398773 • No.26 lot[3]: 0.0015123658 • No.27 mike[4]: 0.0014935194 • No.28 university[10]: 0.0014718652 • No.29 player[6]: 0.0014655796
  • 54. Result: Online LDA • Topic: 3 (Computer) • No.0 writes[6]: 0.0065424195 • No.1 article[7]: 0.005621346 • No.2 apr[3]: 0.002746017 • No.3 work[4]: 0.002731466 • No.4 good[4]: 0.00266331 • No.5 ve[2]: 0.0025969497 • No.6 time[4]: 0.0025880735 • No.7 system[6]: 0.0024449623 • No.8 problem[7]: 0.002349667 • No.9 mail[4]: 0.0023234019 • No.10 windows[7]: 0.0021310966 • No.11 people[6]: 0.0018598152 • No.12 find[4]: 0.0018072439 • No.13 computer[8]: 0.0017470584 • No.14 email[5]: 0.0017204053 • No.15 drive[5]: 0.0017121765 • No.16 bit[3]: 0.0016401116 • No.17 program[7]: 0.001636191 • No.18 software[8]: 0.0016341405 • No.19 university[10]: 0.0015907411 • No.20 ll[2]: 0.0015530549 • No.21 thing[5]: 0.0015159848 • No.22 card[4]: 0.0013826761 • No.23 doesn[5]: 0.0013809163 • No.24 phone[5]: 0.0013786326 • No.25 question[8]: 0.0013721529 • No.26 internet[8]: 0.001368883 • No.27 file[4]: 0.0013417117 • No.28 things[6]: 0.0013097903 • No.29 set[3]: 0.0013029057
  • 55. Impression about Internship • Machine Learning • Implementing ML algorithms from scratch was fun • Contributing to OSS was a precious experience for me
  • 56. Unfinished Business • Documentation • Write entries for FM/Online LDA • UDTF • Build the functions into Hivemall
  • 57. Thank you for listening