When it comes to recommendation systems and natural language processing, data that can be modeled as a multinomial, or as a vector of counts, is ubiquitous. For example, if there are two possible user-generated ratings (like and dislike), then each item is represented as a vector of two counts. In a higher-dimensional case, each document may be expressed as a vector of word counts, with the vector large enough to encompass all the important words in that corpus of documents. The Dirichlet distribution is one of the basic probability distributions for describing this type of data.
The Dirichlet distribution is surprisingly expressive on its own, but it can also be used as a building block for even more powerful and deep models such as mixtures and topic models.
14. What does the raw data look like?

Counts!

id   # likes   # dislikes
1    231       23
2    81        40
3    67        9
4    121       14
5    9         31
6    18        0
7    1         1
15. What does the raw data look like?

More specifically:
- K columns of counts
- N rows of data

id   # likes   # dislikes
1    231       23
2    81        40
3    67        9
4    121       14
5    9         31
6    18        0
7    1         1
17. BUT...

We can estimate the multinomial distribution with the counts, using the maximum likelihood estimate.

Counts: 366, 181, 203
18. BUT...

Sum = 366 + 181 + 203 = 750
19. BUT...

366 / 750,  181 / 750,  203 / 750
20. BUT...

366 / 750 = 48.8%,  181 / 750 = 24.1%,  203 / 750 = 27.1%
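The maximum likelihood estimate above is just the normalized counts; a minimal sketch (the function name is mine, not from the talk):

```python
def multinomial_mle(counts):
    """Maximum likelihood estimate of multinomial probabilities:
    each probability is its count divided by the total."""
    total = sum(counts)
    return [c / total for c in counts]

# The counts from the slides: 366 + 181 + 203 = 750
print(multinomial_mle([366, 181, 203]))  # roughly [0.488, 0.241, 0.271]
```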
87. Chinese Restaurant Process

The expected infinite distribution (mean) is exponential. The number of white balls controls the exponent.
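One quick way to build intuition here is to simulate the process. This sketch is my own, not from the talk; the concentration parameter `alpha` plays the role of the number of white balls in the urn:

```python
import random

def chinese_restaurant_process(n, alpha, seed=0):
    """Seat n customers one at a time: customer i joins an existing
    table with probability proportional to its size, or starts a new
    table with probability proportional to alpha."""
    rng = random.Random(seed)
    tables = []  # tables[t] = number of customers seated at table t
    for i in range(n):
        r = rng.random() * (i + alpha)
        for t, size in enumerate(tables):
            if r < size:
                tables[t] += 1  # join an occupied table
                break
            r -= size
        else:
            tables.append(1)  # open a new table
    return tables

print(chinese_restaurant_process(50, 1.0))
```

With `alpha = 0` every customer joins the first table; larger `alpha` opens new tables more often.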
88. So coming back to the count data:

What Dirichlet parameters explain the data?

20   0    2
0    1    17
14   6    0
15   5    0
0    20   0
0    14   6
89. So coming back to the count data:

Newton’s Method: requires Gradient + Hessian
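Full Newton needs the digamma and trigamma functions for the gradient and Hessian. As a lighter-weight sketch of the same estimation problem, here is Minka's fixed-point update for fitting a Dirichlet prior to rows of counts (Dirichlet-multinomial model). The stdlib-only digamma, the uniform starting guess, and the iteration count are my own choices, not necessarily what the talk's code does:

```python
import math

def digamma(x):
    # psi(x) via the recurrence psi(x) = psi(x+1) - 1/x,
    # then the asymptotic series once x is large enough
    r = 0.0
    while x < 6.0:
        r -= 1.0 / x
        x += 1.0
    f = 1.0 / (x * x)
    return r + math.log(x) - 0.5 / x - f * (1 / 12.0 - f * (1 / 120.0 - f / 252.0))

def fit_dirichlet(rows, iters=200):
    """Fixed-point estimate of Dirichlet parameters from count rows."""
    K = len(rows[0])
    alpha = [1.0] * K  # starting guess (an assumption)
    for _ in range(iters):
        a0 = sum(alpha)
        denom = sum(digamma(sum(r) + a0) - digamma(a0) for r in rows)
        alpha = [alpha[k]
                 * sum(digamma(r[k] + alpha[k]) - digamma(alpha[k]) for r in rows)
                 / denom
                 for k in range(K)]
    return alpha

# The count rows from the slide
data = [[20, 0, 2], [0, 1, 17], [14, 6, 0], [15, 5, 0], [0, 20, 0], [0, 14, 6]]
print(fit_dirichlet(data))
```

Small fitted alphas indicate the rows are far more spread out than a single multinomial would produce, which matches this data.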
90. So coming back to the count data:

Each iteration reads all of the data...
91. So coming back to the count data:

https://github.com/maxsklar/BayesPy/tree/master/ConjugatePriorTools
92. So coming back to the count data:

Compress the data into a matrix and a vector: works for lots of sparsely populated rows.
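One concrete reading of "a matrix and a vector" (my reconstruction; the repo may lay the data out differently): the gradient sums only involve terms of the form psi(n + a) - psi(a), and psi(n + a) - psi(a) = sum over j < n of 1/(a + j). So it suffices to record, for each column k and each level j, how many rows have a count greater than j (a matrix), and how many rows have a total greater than j (a vector):

```python
def compress(rows):
    """u[k][j] = number of rows whose count in column k exceeds j;
    v[j] = number of rows whose total count exceeds j."""
    K = len(rows[0])
    u = [[] for _ in range(K)]
    v = []
    for row in rows:
        for k in range(K):
            for j in range(row[k]):
                if j >= len(u[k]):
                    u[k].append(0)
                u[k][j] += 1
        for j in range(sum(row)):
            if j >= len(v):
                v.append(0)
            v[j] += 1
    return u, v

def psi_diff_sum(stats, a):
    # sum over rows of [psi(n + a) - psi(a)], computed from the
    # compressed statistics: psi(n + a) - psi(a) = sum_{j<n} 1/(a + j)
    return sum(c / (a + j) for j, c in enumerate(stats))
```

Each iteration then touches only the compressed statistics, whose size is bounded by the largest count rather than by N; sparse rows with many zeros contribute nothing to most columns.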