2. Exercise 1: Probabilities
How can Bayes' rule be derived from simpler definitions, such as the definition of conditional
probability, the symmetry of the joint probability, and the chain rule? Give a step-wise
derivation, mentioning which rule you applied at each step.
We have a set of possible outcomes for the values of x and y:
x = { x1, x2, …, xn }
y = { y1, y2, …, ym }
We need to show how Bayes' rule follows from these simpler definitions. Bayes' rule is as follows:
P(X = x | Y = y) = P(Y = y | X = x) * P(X = x) / P(Y = y)
Step 1 (definition of conditional probability):
P(X = x | Y = y) = P(X = x, Y = y) / P(Y = y)
Step 2 (chain rule: joint = conditional * marginal), applied with the roles of X and Y swapped, which is allowed by the symmetry of the joint probability, P(X = x, Y = y) = P(Y = y, X = x):
P(X = x, Y = y) = P(Y = y | X = x) * P(X = x)
Step 3 (substitute the result of Step 2 into Step 1):
P(X = x | Y = y) = P(Y = y | X = x) * P(X = x) / P(Y = y)
In conclusion: substituting the chain-rule expansion of the joint probability into the definition of conditional probability yields exactly Bayes' rule.
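The derivation above can also be checked numerically. The sketch below uses a small made-up 2x2 joint distribution (the values are illustrative, not part of the exercise) and verifies that the directly computed conditional P(X | Y) matches the right-hand side of Bayes' rule:

```python
# Hypothetical joint distribution P(X = x, Y = y) over two binary variables.
joint = {
    ("x1", "y1"): 0.10, ("x1", "y2"): 0.30,
    ("x2", "y1"): 0.25, ("x2", "y2"): 0.35,
}

def marginal_x(x):
    # P(X = x), summing the joint over all y
    return sum(p for (xi, _), p in joint.items() if xi == x)

def marginal_y(y):
    # P(Y = y), summing the joint over all x
    return sum(p for (_, yi), p in joint.items() if yi == y)

def cond_x_given_y(x, y):
    # Definition of conditional probability: P(X|Y) = P(X, Y) / P(Y)
    return joint[(x, y)] / marginal_y(y)

def cond_y_given_x(y, x):
    # Same definition with the roles swapped: P(Y|X) = P(X, Y) / P(X)
    return joint[(x, y)] / marginal_x(x)

def bayes(x, y):
    # Right-hand side of Bayes' rule: P(Y|X) * P(X) / P(Y)
    return cond_y_given_x(y, x) * marginal_x(x) / marginal_y(y)

# Both sides agree for every outcome pair.
for x in ("x1", "x2"):
    for y in ("y1", "y2"):
        assert abs(cond_x_given_y(x, y) - bayes(x, y)) < 1e-12
```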
3. Exercise 2: Entropy
2.1 Assume a variable X with three possible values: a, b, and c. If p(a) = 0.4, and
p(b) = 0.25, what is the entropy of X, i.e., what is H(X)?
Since the three probabilities must sum to 1, we can solve for P(c):
P(total) = 1
P(a) = 0.4
P(b) = 0.25
P(c) = P(total) – P(a) – P(b)
P(c) = 0.35
Now we calculate the entropy from all three probabilities (note the minus sign in the definition H(X) = −Σ p(x) log2 p(x)):
H(X) = −(0.4 log2(0.4) + 0.25 log2(0.25) + 0.35 log2(0.35))
H(X) ≈ 1.5589 bits
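The calculation can be reproduced in a few lines of Python (a quick sketch, not part of the original exercise):

```python
import math

# Entropy H(X) = -sum over x of p(x) * log2(p(x)), for the distribution in 2.1.
probs = [0.4, 0.25, 0.35]
H = -sum(p * math.log2(p) for p in probs)
print(round(H, 4))  # prints 1.5589
```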
2.2 Assume a variable X with three possible values: a, b, and c. What is the probability
distribution with the highest entropy? Which one(s) has/have the lowest one? Explain in a
sentence or two and in your in own words why these distributions have the highest and lowest
entropies.
We need to find the distribution that gives the highest entropy (i.e., the maximum uncertainty).
If we know nothing that distinguishes the values 'a', 'b', and 'c', then we can make no prediction
about which of them will occur: the three outcomes are interchangeable, so the chance of getting an
'a' is equal to the chance of getting a 'b' or a 'c'. This is the uniform distribution:
P(a) = P(b) = P(c)
P(a) + P(b) + P(c) = 1
P(a) = P(b) = P(c) = 1/3
The lowest entropy (H = 0) occurs when we know beforehand which value the outcome will be, i.e.,
when one of the values 'a', 'b', or 'c' has probability 1 and the other two have probability 0.
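A quick sketch comparing the two extreme cases, the uniform distribution and a fully certain outcome (the helper function is my own, not part of the exercise):

```python
import math

def entropy(probs):
    # H = -sum p * log2(p); terms with p = 0 contribute 0 by convention.
    return -sum(p * math.log2(p) for p in probs if p > 0)

uniform = [1/3, 1/3, 1/3]   # maximum uncertainty: every value equally likely
certain = [1.0, 0.0, 0.0]   # outcome known in advance

print(entropy(uniform))  # log2(3), roughly 1.585 bits
print(entropy(certain))  # 0 bits
```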
2.3 In general, if a variable X has n possible values, what is the maximum entropy?
For the maximum entropy we again need a uniform distribution, which assigns each of the n values
the same probability:
P(xi) = 1/n, for i = 1, 2, …, n
The maximum entropy is then the sum of n identical terms:
H(X) = −Σ (1/n) log2(1/n) = −log2(1/n) = log2(n)
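As a sanity check, the sketch below (my own, using a hypothetical helper `max_entropy`) confirms that a uniform distribution over n values has entropy log2(n):

```python
import math

def max_entropy(n):
    # Entropy of the uniform distribution over n outcomes:
    # n identical terms of (1/n) * log2(1/n), negated.
    return -sum((1 / n) * math.log2(1 / n) for _ in range(n))

# For every n, the sum collapses to log2(n).
for n in (2, 3, 8, 100):
    assert abs(max_entropy(n) - math.log2(n)) < 1e-9
```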