Introduction to Gaussian Processes


Lecture on Gaussian Processes that was delivered for MSc level students at University of Tartu (2018 spring)


1. Dmytro Fishman (dmytro@ut.ee): Introduction to Gaussian Processes
2. x → f(x) → y. Let's take a look inside.
3. $y = f(x)$
4–5. Let $y = f(x)$ be a linear function: $y = \theta_0 + \theta_1 x$.
6. With observations $y_i = \theta_0 + \theta_1 x_i + \epsilon_i$ and predictions $\hat{y}_i = \theta_0 + \theta_1 x_i$, find $\theta_0$ and $\theta_1$ by optimising the squared error: $\arg\min_\theta \sum_{i=1}^{n} (y_i - \hat{y}_i)^2$.
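The least-squares fit on this slide can be sketched in a few lines of numpy. The data here is hypothetical (a noisy line with $\theta_0 = 2$, $\theta_1 = 3$); `np.linalg.lstsq` solves the $\arg\min$ above via the normal equations.

```python
import numpy as np

# Hypothetical data: y = 2 + 3x + Gaussian noise
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 50)
y = 2.0 + 3.0 * x + rng.normal(scale=0.1, size=x.size)

# Minimise sum_i (y_i - (theta0 + theta1 * x_i))^2
A = np.column_stack([np.ones_like(x), x])   # design matrix [1, x]
theta, *_ = np.linalg.lstsq(A, y, rcond=None)
theta0, theta1 = theta
print(theta0, theta1)  # close to 2 and 3
```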
7–8. But what if the data is not linear? $y = \theta_0 + \theta_1 x$ no longer fits.
9. Add a quadratic term: $y = \theta_0 + \theta_1 x + \theta_2 x^2$.
10. Or a cubic one: $y = \theta_0 + \theta_1 x + \theta_2 x^2 + \theta_3 x^3$.
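A quick numerical illustration of why the extra polynomial terms help, on hypothetical data from a sine curve (not from the slides): a straight line underfits a full sine period, while a cubic captures its shape.

```python
import numpy as np

# Hypothetical nonlinear data: y = sin(2*pi*x) + small noise
rng = np.random.default_rng(1)
x = np.linspace(0, 1, 40)
y = np.sin(2 * np.pi * x) + rng.normal(scale=0.05, size=x.size)

# Fit degree-1 and degree-3 polynomials by least squares
line = np.polynomial.polynomial.polyfit(x, y, deg=1)
cubic = np.polynomial.polynomial.polyfit(x, y, deg=3)

def mse(coeffs):
    # mean squared error of the polynomial fit on the training data
    return np.mean((np.polynomial.polynomial.polyval(x, coeffs) - y) ** 2)

print(mse(line), mse(cubic))  # cubic error is much smaller
```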
11. What if we don't want to assume a specific form?
12. GPs let you model any function directly.
13–14. Parametric ML vs nonparametric ML. A learning model that summarizes the data with a set of parameters of fixed size (independent of the number of training examples) is called a parametric model, e.g. $y = \theta_0 + \theta_1 x$. Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms. Question: is K-nearest neighbours a parametric or a nonparametric algorithm according to these definitions?
15–16. GPs let you model any function directly and estimate the uncertainty of each new prediction.
17–18. If I ask you to predict $y_i$ for an $x_i$ far from the data, you had better be very uncertain.
19. How is it even possible?
20. We will need the normal distribution.
21–22. Many important processes follow the normal distribution $N(\mu, \sigma^2)$, with density $\frac{1}{\sigma\sqrt{2\pi}} e^{-\frac{(x-\mu)^2}{2\sigma^2}}$: mean coordinate $\mu$ and standard deviation $\sigma$ from the centre.
23–24. $X_1 \sim N(\mu_1, \sigma_1^2)$. What if I draw another distribution?
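The density formula above can be checked empirically: a large sample from $N(0,1)$ should have mean near $\mu$ and standard deviation near $\sigma$. A minimal sketch with numpy:

```python
import numpy as np

# Density of N(mu, sigma^2), matching the formula on the slide
def normal_pdf(x, mu=0.0, sigma=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

# Empirical check: sample mean ~ mu, sample std ~ sigma
rng = np.random.default_rng(2)
samples = rng.normal(loc=0.0, scale=1.0, size=100_000)
print(samples.mean(), samples.std())  # both close to 0 and 1
```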
25–30. $X_1 \sim N(0, 1)$ and $X_2 \sim N(0, 1)$: $\mu_1 = 0$, $\sigma_1 = 1$; $\mu_2 = 0$, $\sigma_2 = 1$.
31. What if we joined them into one plot?
32–36. Plotting $X_1$ against $X_2$ as one 2D scatter.
37. Stack the means and values into vectors: $M = \begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}$, $X = \begin{bmatrix}x_1 \\ x_2\end{bmatrix}$.
38–42. The joint distribution of the variables $x_1$ and $x_2$: $\begin{bmatrix}x_1 \\ x_2\end{bmatrix} \sim N\left(\begin{bmatrix}0 \\ 0\end{bmatrix}, \begin{bmatrix}1 & 0 \\ 0 & 1\end{bmatrix}\right)$.
43. The second argument is the covariance matrix, $\Sigma$.
44–46. The off-diagonal entries of $\Sigma$ encode the similarity between the variables.
47. With $\begin{bmatrix}x_1 \\ x_2\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1 & 0\\0 & 1\end{bmatrix}\right)$ there is no similarity (no correlation): a positive value of $X_1$ does not tell us much about $X_2$. With $\begin{bmatrix}x_1 \\ x_2\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1 & 0.5\\0.5 & 1\end{bmatrix}\right)$ there is some similarity (correlation): a positive $X_1$ means, with good probability, a positive $X_2$.
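The claim on this slide can be verified by sampling: under the identity covariance, knowing $X_1 > 0$ leaves $P(X_2 > 0)$ at about one half, while with covariance 0.5 it rises well above one half. A minimal sketch:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50_000

# Identity covariance: X1 tells us nothing about X2
uncorr = rng.multivariate_normal([0, 0], [[1, 0], [0, 1]], size=n)
# Covariance 0.5: a positive X1 makes a positive X2 more likely
corr = rng.multivariate_normal([0, 0], [[1, 0.5], [0.5, 1]], size=n)

def p_x2_pos_given_x1_pos(xy):
    # empirical P(X2 > 0 | X1 > 0)
    x1_pos = xy[xy[:, 0] > 0]
    return np.mean(x1_pos[:, 1] > 0)

p_uncorr = p_x2_pos_given_x1_pos(uncorr)
p_corr = p_x2_pos_given_x1_pos(corr)
print(p_uncorr, p_corr)  # ~0.5 vs noticeably above 0.5
```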
48–49. The same joint distribution, now with correlation: $\begin{bmatrix}x_1 \\ x_2\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1 & 0.5\\0.5 & 1\end{bmatrix}\right)$.
50–53. In general, $\begin{bmatrix}x_1 \\ x_2\end{bmatrix} \sim N\left(\begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}, \begin{bmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}\right)$. Observing a value of $x_2$ slices the joint density $P(x_1, x_2)$.
54–55. The slice is the conditional distribution, which is again Gaussian: $P(x_1 \mid x_2) = N(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$ with $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.
56. Recap: the normal distribution (1D Gaussian) $N(\mu, \sigma^2)$; the joint distribution (2D Gaussian) $\begin{bmatrix}x_1 \\ x_2\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1 & 0.5\\0.5 & 1\end{bmatrix}\right)$; and the conditional distribution $P(x_1 \mid x_2) = N(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$.
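The conditioning formulas above reduce to simple arithmetic in the 2D case, where $\Sigma_{22}^{-1} = 1/\Sigma_{22}$. A worked example with the covariance 0.5 from the slides and a hypothetical observation $x_2 = 1$:

```python
# Conditional of a joint Gaussian, matching the slide's formulas:
#   mu_{1|2} = mu1 + S12 * S22^{-1} * (x2 - mu2)
#   S_{1|2}  = S11 - S12 * S22^{-1} * S21
mu1, mu2 = 0.0, 0.0
S11, S12, S21, S22 = 1.0, 0.5, 0.5, 1.0

x2_observed = 1.0  # hypothetical observed value of x2
mu_cond = mu1 + S12 / S22 * (x2_observed - mu2)
var_cond = S11 - S12 / S22 * S21

print(mu_cond, var_cond)  # 0.5, 0.75
```

Note that the conditional variance 0.75 is smaller than the prior variance 1: observing $x_2$ reduces our uncertainty about $x_1$.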
57–61. Sampling from the 2D Gaussian $N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1 & 0\\0 & 1\end{bmatrix}\right)$: a sample such as $(-0.23, 1.13)$ can be plotted as two points, the 1st and 2nd coordinates.
62–65. With a second sample such as $(-1.14, 0.65)$ we see that, under the identity covariance, there is little dependency between the coordinates.
66–73. Sampling instead from $N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1 & 0.5\\0.5 & 1\end{bmatrix}\right)$: samples such as $(0.13, 0.52)$ and $(-0.03, -0.24)$ have more dependent coordinate values.
74. How would a sample from a 20D Gaussian look?
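Pairs like $(0.13, 0.52)$ above come from one call to a multivariate-normal sampler. A minimal sketch that draws many such pairs and checks that the empirical covariance recovers the matrix we sampled from:

```python
import numpy as np

rng = np.random.default_rng(4)

# Covariance from the slide: correlated coordinates
cov = np.array([[1.0, 0.5], [0.5, 1.0]])
samples = rng.multivariate_normal(mean=[0.0, 0.0], cov=cov, size=100_000)

# The empirical covariance of the samples approximates cov
emp_cov = np.cov(samples.T)
print(np.round(emp_cov, 2))
```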
75–76. From the 2D Gaussian to a 20D Gaussian?
77. A 20D Gaussian: $N\left(\begin{bmatrix}0 \\ 0 \\ \vdots \\ 0\end{bmatrix}, \begin{bmatrix}1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & \vdots & \ddots & \vdots \\ 0 & 0 & \cdots & 1\end{bmatrix}\right)$. A sample $(0.73, -0.12, 0.42, 1.2, \ldots, \text{16 more})$ plotted as 20 points (1st, 2nd, 3rd, …) looks like noise.
78–80. Let's add more dependency between the points: set every off-diagonal covariance entry to 0.5. A sample such as $(0.73, 0.18, 0.68, -0.2, \ldots, \text{16 more})$ is more correlated, but still not smooth.
81–82. We want some notion of smoothness between points, so that the dependency between the 1st and 2nd points is larger than between the 1st and the 3rd.
83. We could just increase the corresponding values in the covariance matrix by hand, but better: we need a way to generate a "smooth" covariance matrix automatically, depending on the distance between points.
84–85. We will use a similarity measure (a kernel): $K_{ij} = e^{-\|z_i - z_j\|^2}$, so that $K_{ij} \to 0$ as $\|z_i - z_j\| \to \infty$ and $K_{ij} = 1$ when $z_i = z_j$. This fills the $20 \times 20$ covariance matrix with entries $K_{11}, K_{12}, \ldots, K_{20\,20}$.
86–92. The same construction scales to a 200D Gaussian with a $200 \times 200$ covariance matrix ($K_{11} \ldots K_{200\,200}$): samples now look like smooth functions of the input, and across many samples we can read off a pointwise mean $\mu_*$ and spread $\sigma_*$.
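The smooth samples on these slides can be reproduced directly: build the kernel covariance over 200 input locations and draw from the resulting high-dimensional Gaussian. The input range and the small jitter added for numerical stability are my choices, not from the slides.

```python
import numpy as np

# Squared-exponential kernel from the slide: K_ij = exp(-||z_i - z_j||^2)
def kernel(a, b):
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

# 200 input locations -> a 200x200 covariance matrix
z = np.linspace(-5, 5, 200)
K = kernel(z, z)

# A sample from N(0, K) is a smooth function of z: nearby points are
# highly correlated, distant points are almost independent.
# (Tiny jitter on the diagonal keeps the matrix numerically positive definite.)
rng = np.random.default_rng(5)
sample = rng.multivariate_normal(np.zeros(len(z)), K + 1e-6 * np.eye(len(z)))

# Smoothness check: consecutive values change very little
print(np.abs(np.diff(sample)).max())
```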
93. We are interested in modelling $F(z)$ for given inputs $Z = \{z_1, z_2, z_3\}$, so that $f_2$ is more correlated with $f_1$ than with $f_3$.
94–95. Previously we used $\begin{bmatrix}f_1 \\ f_2\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\end{bmatrix}, \begin{bmatrix}1 & 0.5\\0.5 & 1\end{bmatrix}\right)$ to generate correlated points. Can we do it again here? Wait! Now we have three points, so we cannot use the same formula.
96–98. Ok… what about $\begin{bmatrix}f_1 \\ f_2 \\ f_3\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\\0\end{bmatrix}, \begin{bmatrix}1 & 0.5 & 0.5 \\ 0.5 & 1 & 0.5 \\ 0.5 & 0.5 & 1\end{bmatrix}\right)$? Wait, didn't we just say that $f_2$ should be more correlated with $f_1$ than with $f_3$?
99–100. Better now: $\begin{bmatrix}f_1 \\ f_2 \\ f_3\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\\0\end{bmatrix}, \begin{bmatrix}1 & 0.7 & 0.2 \\ 0.7 & 1 & 0.5 \\ 0.2 & 0.5 & 1\end{bmatrix}\right)$. Yes, but what if we want to obtain this matrix automatically, based on how close the points are in $Z$?
101–102. Using the similarity measure $K_{ij} = e^{-\|z_i - z_j\|^2}$, it becomes $\begin{bmatrix}f_1 \\ f_2 \\ f_3\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\\0\end{bmatrix}, \begin{bmatrix}K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33}\end{bmatrix}\right)$.
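A concrete check that the kernel produces the desired correlation structure. The input locations below are hypothetical: $z_1$ and $z_2$ are close, $z_3$ is further away, so $K_{12}$ should exceed $K_{13}$.

```python
import numpy as np

# Hypothetical inputs: z1 and z2 close together, z3 further away
z = np.array([1.0, 1.5, 3.0])

# Kernel covariance K_ij = exp(-||z_i - z_j||^2)
K = np.exp(-(z[:, None] - z[None, :]) ** 2)
print(np.round(K, 3))
# K[0,1] (f1-f2 correlation) is larger than K[0,2] (f1-f3), as desired
```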
103–105. What is $f_*$? Given $\{(f_1, z_1); (f_2, z_2); (f_3, z_3)\}$ and a new input $z_*$, predict $f_*$.
106–108. Ok, so we have just modelled $f$: $\begin{bmatrix}f_1 \\ f_2 \\ f_3\end{bmatrix} \sim N\left(\begin{bmatrix}0\\0\\0\end{bmatrix}, \begin{bmatrix}K_{11} & K_{12} & K_{13} \\ K_{21} & K_{22} & K_{23} \\ K_{31} & K_{32} & K_{33}\end{bmatrix}\right)$, which is the same as saying $f \sim N(0, K)$. But how do we model $f_*$?
109–114. Well, probably again some kind of normal: $f_* \sim N(0, ?)$. The "?" is the covariance of $z_*$ with itself, and $K_{**} = e^{-\|z_* - z_*\|^2} = 1$, so $f_* \sim N(0, K_{**})$.
115–118. What else is left? The joint distribution of $f$ and $f_*$: $\begin{bmatrix}f \\ f_*\end{bmatrix} \sim N\left(\begin{bmatrix}0 \\ 0\end{bmatrix}, \begin{bmatrix}K & K_* \\ K_*^T & K_{**}\end{bmatrix}\right)$, where $K_* = \begin{bmatrix}K_{1*} & K_{2*} & K_{3*}\end{bmatrix}^T$.
119–120. Only one entity is left: $K_{i*} = K(z_i, z_*)$, and we know how to calculate that one with the kernel $K_{ij} = e^{-\|z_i - z_j\|^2}$.
121–123. We did it! Wait… what do we do now? Remember the conditional distribution…
124. Recall: for $\begin{bmatrix}x_1 \\ x_2\end{bmatrix} \sim N\left(\begin{bmatrix}\mu_1 \\ \mu_2\end{bmatrix}, \begin{bmatrix}\Sigma_{11} & \Sigma_{12} \\ \Sigma_{21} & \Sigma_{22}\end{bmatrix}\right)$, the conditional distribution is $P(x_1 \mid x_2) = N(x_1 \mid \mu_{1|2}, \Sigma_{1|2})$ with $\mu_{1|2} = \mu_1 + \Sigma_{12}\Sigma_{22}^{-1}(x_2 - \mu_2)$ and $\Sigma_{1|2} = \Sigma_{11} - \Sigma_{12}\Sigma_{22}^{-1}\Sigma_{21}$.
125–127. What if we substitute $x_1$ with $f_*$ and $x_2$ with $f$? Then we can compute the mean and standard deviation of $f_*$! Exactly.
128–131. Substituting into the conditioning formula gives the GP posterior at $z_*$: $\mu_* = \mu(z_*) + K_*^T K^{-1}(f - \mu_f)$ and $\Sigma_* = K_{**} - K_*^T K^{-1} K_*$.
132–135. Evaluating $\mu_*$ and $\Sigma_*$ at every test input $z_*$ traces out the posterior mean curve and an uncertainty band around it, passing through the observed points $(z_1, f_1), (z_2, f_2), (z_3, f_3)$.
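The whole construction of the last slides fits in a short script: build $K$, $K_*$, $K_{**}$ with the kernel, then apply the posterior formulas. The three training points and the test grid below are hypothetical; the prior mean is zero, as in the slides.

```python
import numpy as np

def kernel(a, b):
    # Squared-exponential kernel: K_ij = exp(-||z_i - z_j||^2)
    return np.exp(-(a[:, None] - b[None, :]) ** 2)

# Hypothetical training data (z_i, f_i)
z_train = np.array([-2.0, 0.0, 1.5])
f_train = np.array([0.5, -0.3, 0.8])

# Test inputs z_*
z_star = np.linspace(-3, 3, 61)

K = kernel(z_train, z_train)       # K      (3 x 3)
K_star = kernel(z_train, z_star)   # K_*    (3 x 61)
K_ss = kernel(z_star, z_star)      # K_**   (61 x 61)

# Posterior (zero prior mean):
#   mu_*  = K_*^T K^{-1} f
#   Sig_* = K_** - K_*^T K^{-1} K_*
K_inv = np.linalg.inv(K + 1e-8 * np.eye(3))   # jitter for numerical stability
mu_star = K_star.T @ K_inv @ f_train
cov_star = K_ss - K_star.T @ K_inv @ K_star
std_star = np.sqrt(np.clip(np.diag(cov_star), 0, None))

# The posterior mean interpolates the data, and the uncertainty
# collapses to ~0 at the training inputs.
i = np.argmin(np.abs(z_star - 0.0))  # test point at a training location
print(mu_star[i], std_star[i])       # ~ -0.3 and ~ 0
```

Far from the training points, `std_star` grows back toward the prior standard deviation of 1, which is exactly the "you had better be very uncertain" behaviour from the start of the lecture.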
136. Pros: 1. Can model almost any function directly. 2. Can be made more flexible with different kernels. 3. Provides uncertainty estimates. Cons: 1. Hard to interpret. 2. Loses efficiency in high-dimensional spaces. 3. Can overfit.
137. Cat or Dog? "It's always seemed obvious to me that it's better to know that you don't know, than to think you know and act on wrong information." — Katherine Bailey
138. Teaching statistics vs. doing statistics.
139. Resources:
- Katherine Bailey's presentation: http://katbailey.github.io/gp_talk/Gaussian_Processes.pdf
- Katherine Bailey's blog post: From both sides now: the math of linear regression (http://katbailey.github.io/post/from-both-sides-now-the-math-of-linear-regression/)
- Katherine Bailey's blog post: Gaussian processes for dummies (http://katbailey.github.io/post/gaussian-processes-for-dummies/)
- Kevin P. Murphy's book: Machine Learning: A Probabilistic Perspective, Chapter 15 (https://www.amazon.com/Machine-Learning-Probabilistic-Perspective-Computation/dp/0262018020)
- Alex Bridgland's blog post: Introduction to Gaussian Processes - Part I (http://bridg.land/posts/gaussian-processes-1)
- Nando de Freitas: Machine Learning - Introduction to Gaussian Processes (https://youtu.be/4vGiHC35j9s)
140. In class / under review.