Murphy's Machine Learning: 14. Kernels
Jungkyu Lee
Daum Search Quality Team
14.1 Introduction
• how do we represent a text document or protein sequence, which can be of variable length?
• One approach to such problems is to define a generative model for the data, and use the inferred latent
representation and/or the parameters of the model as features, and then to plug these features in to standard
methods.
• For example, in Chapter 28, we discuss deep learning, which is essentially an unsupervised way to learn good
feature representations.
• Another approach is to assume that we have some way of measuring the similarity between objects, that doesn’t
require preprocessing them into feature vector format.
• For example, when comparing strings, we can compute the edit distance between them.
• In this chapter, we will discuss several kinds of kernel functions
• How do we represent features of variable length?
• Latent representation: e.g., LDA gives the topic distribution of each document, which can be used as features.
• Kernel: represent the data only through similarities between pairs of inputs.
• e.g., comparing two strings via the edit distance.
• Support Vector Machine
• one view: turning kernelized ridge regression into a sparse linear machine
• another view: as a maximum margin classifier
14.2 Kernel functions
• We define a kernel function to be a real-valued function of two arguments, κ(x, x′) ∈ R, for x, x′ ∈ X.
• symmetric
• non-negative
• can be interpreted as a measure of similarity
14.2.1 RBF kernels
• squared exponential kernel (SE kernel) or Gaussian kernel: κ(x, x′) = exp(−½(x − x′)ᵀΣ⁻¹(x − x′))
• If Σ is diagonal, this can be written as a product of one-dimensional Gaussians.
• We can interpret the σj as defining the characteristic length scale of dimension j.
• If Σ is spherical, we get the isotropic radial basis function or RBF kernel, κ(x, x′) = exp(−‖x − x′‖²/(2σ²)). Here σ² is known as the bandwidth.
• As σ grows, the kernel value stays reasonably large even when two feature vectors are far apart.
• When σ is small, the kernel similarity drops off sharply as soon as the distance increases, as the sketch below illustrates.
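A minimal NumPy sketch of the isotropic RBF kernel, to make the bandwidth behaviour concrete (the function name and the toy vectors are illustrative assumptions, not anything from the book):

```python
import numpy as np

def rbf_kernel(x, x2, sigma=1.0):
    """Isotropic RBF (Gaussian) kernel between two vectors."""
    sq_dist = np.sum((x - x2) ** 2)                 # squared Euclidean distance
    return np.exp(-sq_dist / (2.0 * sigma ** 2))

x  = np.array([0.0, 0.0])
x2 = np.array([1.0, 1.0])
print(rbf_kernel(x, x2, sigma=0.5))   # small bandwidth: similarity decays fast
print(rbf_kernel(x, x2, sigma=5.0))   # large bandwidth: similarity stays high
```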
14.2.2 Kernels for comparing documents
• If we use a bag of words representation, where xij is the number of times word j occurs in document i, we can use the cosine similarity.
• Unfortunately, this simple method does not work very well, partly because frequent (stop) words dominate the counts; replacing raw counts with a TF-IDF representation improves the kernel.
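A sketch of the cosine-similarity kernel on TF-IDF-weighted counts (the toy count matrix and the particular TF-IDF smoothing are illustrative assumptions, not the book's exact recipe):

```python
import numpy as np

def tf_idf(counts):
    """counts: N x V matrix of word counts; returns TF-IDF weighted rows."""
    tf = counts.astype(float)
    df = (counts > 0).sum(axis=0)                               # document frequency per word
    idf = np.log((1.0 + counts.shape[0]) / (1.0 + df)) + 1.0    # smoothed inverse doc frequency
    return tf * idf

def cosine_kernel(xi, xj):
    return xi @ xj / (np.linalg.norm(xi) * np.linalg.norm(xj) + 1e-12)

counts = np.array([[3, 0, 1, 2],     # doc 1 bag-of-words counts
                   [2, 1, 0, 2],     # doc 2
                   [0, 4, 3, 0]])    # doc 3
X = tf_idf(counts)
print(cosine_kernel(X[0], X[1]))     # similar documents -> value near 1
print(cosine_kernel(X[0], X[2]))     # dissimilar documents -> value near 0
```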
14.2.3 Mercer (positive definite) kernels
• In general, if the kernel is Mercer, then there exists a function φ mapping x ∈ X to R^D such that κ(x, x′) = φ(x)ᵀφ(x′).
• For example, consider the (non-stationary) polynomial kernel κ(x, x′) = (γ xᵀx′ + r)^M, whose feature map contains all monomials up to degree M.
14.2.3 Mercer (positive definite) kernels
• Gram matrix: Kij = κ(xi, xj) for a set of inputs {xi}.
• A Mercer kernel is a kernel function that satisfies the requirement that the Gram matrix be positive definite for any set of inputs.
• It can be shown (Schoelkopf and Smola 2002) that the Gaussian kernel is a Mercer kernel as is the cosine similarity
kernel (Sahami and Heilman 2006)
• Mercer’s theorem.
• If the Gram matrix is positive definite, we can compute an eigenvector decomposition of it, K = UᵀΛU with Λ ≻ 0; then Kij = φ(xi)ᵀφ(xj) with φ(xi) = Λ^(1/2) U:,i, which makes the feature map explicit.
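A quick numerical illustration of the Mercer property (illustrative only, using made-up data): for a Mercer kernel, the Gram matrix of any set of inputs should have no negative eigenvalues, up to round-off.

```python
import numpy as np

def gram_matrix(X, kernel):
    """K[i, j] = kernel(X[i], X[j]) for all pairs of rows of X."""
    N = X.shape[0]
    return np.array([[kernel(X[i], X[j]) for j in range(N)] for i in range(N)])

rbf  = lambda a, b: np.exp(-np.sum((a - b) ** 2) / 2.0)
poly = lambda a, b: (a @ b + 1.0) ** 2            # (non-stationary) polynomial kernel

X = np.random.randn(20, 3)
for name, k in [("rbf", rbf), ("poly", poly)]:
    eigvals = np.linalg.eigvalsh(gram_matrix(X, k))
    print(name, "min eigenvalue:", eigvals.min())  # >= 0 (up to round-off) for Mercer kernels
```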
14.2.4 Linear kernels
• Deriving the feature vector implied by a kernel is in general quite difficult, and only possible if the kernel is Mercer.
• However, deriving a kernel from a feature vector is easy: we just use
• If φ(x) = x, we get the linear kernel, defined by κ(x, x′) = xᵀx′.
• This is useful if the original data is already high dimensional, and if the original features are individually informative,
e.g., a bag of words representation where the vocabulary size is large, or the expression level of many genes.
• In such a case, the decision boundary is likely to be representable as a linear combination of the original features, so it is not necessary to work in some other feature space.
• Of course, not all high dimensional problems are linearly separable. For example, images are high dimensional, but
individual pixels are not very informative, so image classification typically requires non-linear kernels (see e.g.,
Section 14.2.7).
14.2.5 Matern kernels
• The Matern kernel, which is commonly used in Gaussian process regression (see Section 15.2)
• where r = ‖x − x′‖, ν > 0, ℓ > 0, and Kν is a modified Bessel function.
• As ν → ∞, this approaches the SE kernel. If ν = 1/2, the kernel simplifies to the exponential kernel κ(r) = exp(−r/ℓ).
14.2.6 String kernels
• The real power of kernels arises when the inputs are structured objects.
• As an example, we now describe one way of comparing two variable length strings using a string kernel.
• Consider two strings x and x′ of lengths D and D′, each defined over the alphabet A
• A={A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V}
• Now let φs(x) denote the number of times that substring s appears in string x
• More formally and more generally, let us say that s is a substring of x if we can write x=usv for some (possibly empty) strings u, s and v.
The kernel between strings x and x′ becomes large when they share many substrings.
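A deliberately naive sketch of that idea: count all shared substrings, κ(x, x′) = Σ_s φ_s(x)·φ_s(x′). Real string kernels weight substrings and use suffix-tree tricks to run in linear time; the brute-force version below is only to make the definition concrete (the max_len cutoff and the toy strings are my own assumptions).

```python
from collections import Counter

def substring_counts(x, max_len=3):
    """phi_s(x): how many times each substring s (up to max_len) occurs in x."""
    counts = Counter()
    for i in range(len(x)):
        for j in range(i + 1, min(i + 1 + max_len, len(x) + 1)):
            counts[x[i:j]] += 1
    return counts

def string_kernel(x, x2, max_len=3):
    phi_x, phi_x2 = substring_counts(x, max_len), substring_counts(x2, max_len)
    return sum(phi_x[s] * phi_x2[s] for s in phi_x.keys() & phi_x2.keys())

print(string_kernel("GHIKKQ", "GHIKAQ"))   # many shared substrings -> large value
print(string_kernel("GHIKKQ", "WWYYVV"))   # no shared substrings -> 0
```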
14.2.7 Pyramid match kernels
• Extract interest points from the image and represent them as a bag of words (e.g., SIFT descriptors).
• Divide the space into grids at several resolutions and measure how well the two bags match (how often points land in the same grid cell).
• This is a Mercer kernel.
14.3 Using kernels inside GLMs
• We define a kernel machine to be a GLM where the input feature vector has the form φ(x) = [κ(x, μ1), . . . , κ(x, μK)], where the μk are a set of K centroids.
• If κ is an RBF kernel, this is called an RBF network. We discuss ways to choose the μk parameters below.
• We will call Equation 14.28 a kernelised feature vector. Note that in this approach, the kernel need not be a Mercer kernel.
• We can use the kernelized feature vector for logistic regression by defining p(y | x, θ) = Ber(y | sigm(wᵀφ(x))).
• This provides a simple way to define a non-linear decision boundary.
• As an example, consider the data coming from the exclusive or (xor) function.
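A sketch of a kernel machine for the xor example under these assumptions: use all four xor points as centroids μk, build the kernelised feature vector φ(x) = [κ(x, μ1), ..., κ(x, μK)], and feed it to an off-the-shelf logistic regression (scikit-learn, purely for convenience; the bandwidth and C are arbitrary choices).

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

def rbf(a, b, sigma=0.5):
    return np.exp(-np.sum((a - b) ** 2) / (2 * sigma ** 2))

# xor data: not linearly separable in the original space
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

mus = X                                              # use every example as a centroid
def kernelise(X, mus):
    return np.array([[rbf(x, mu) for mu in mus] for x in X])

clf = LogisticRegression(C=10.0).fit(kernelise(X, mus), y)
print(clf.predict(kernelise(X, mus)))                # should recover [0, 1, 1, 0]
```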
14.3 Using kernels inside GLMs
• A kernelized version of linear regression.
14.3.2 L1VMs, RVMs, and other sparse vector machines
• The main issue with kernel machines is: how do we choose the centroids μk?
• If the input is low-dimensional Euclidean space, we can uniformly tile the space occupied by the data with prototypes, as we did in Figure 14.2(c) (i.e., divide the space into a uniform grid).
• However, this approach breaks down in higher numbers of dimensions because of the curse of dimensionality: as the dimension grows, the number of grid cells needed to tile the space explodes.
• A simpler approach is to make each example xi be a prototype, so we get
• Now we see D = N, so we have as many parameters as data points. However, we can use any of the sparsity-
promoting priors for w discussed in Chapter 13 to efficiently select a subset of the training exemplars.
This turns the sparsity-promoting approach to w from Chapter 13 (sparse linear models) into a way of selecting a subset of the training examples: if wi = 0, example xi is not needed for prediction.
• We call this a sparse vector machine.
• A sparse vector machine that penalizes w with ℓ1 regularization is called the L1VM.
• One that penalizes w with ℓ2 regularization is called the L2VM.
• A model that obtains sparsity via automatic relevance determination (ARD) / sparse Bayesian learning (SBL) (Section 13.7) is called the relevance vector machine or RVM (Tipping 2001).
• Another very popular approach to creating a sparse kernel machine is to use a support vector machine or SVM. Rather than using a sparsity-promoting prior, it essentially modifies the likelihood term, which is rather unnatural from a Bayesian point of view. Nevertheless, the effect is similar, as we will see.
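A rough sketch of an L1VM under these assumptions: kernelise with every training point as a prototype, then fit an L1-penalized logistic regression (scikit-learn's liblinear solver) so that most weights go exactly to zero. The data and hyperparameters below are made up for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics.pairwise import rbf_kernel

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] ** 2 + X[:, 1] ** 2 > 1.0).astype(int)   # a non-linear (circular) boundary

K = rbf_kernel(X, X, gamma=1.0)                       # kernelised features: K[i, k] = kappa(x_i, x_k)

l1vm = LogisticRegression(penalty="l1", C=1.0, solver="liblinear").fit(K, y)
n_used = np.sum(l1vm.coef_ != 0)
print(f"{n_used} of {len(X)} training points kept as prototypes")  # typically a small subset
```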
14.4 The kernel trick
• Work with the kernel itself, rather than with x or a high-dimensional mapping of x.
• Replace every inner product <x, x′> with κ(x, x′).
• In, say, linear regression, using a high-dimensional mapping directly would mean working with the mapped training examples, making training and prediction expensive; the kernel trick avoids that cost.
• It even allows implicitly infinite-dimensional mappings (e.g., the RBF / Gaussian kernel).
• It also lets existing algorithms run on structured objects: instead of hard-to-define features of the objects themselves, we only use the kernel, i.e., a similarity between objects.
• This is called the kernel trick.
14.4.1 Kernelized nearest neighbor classification
• Recall that in a 1NN classifier (Section 1.4.2), we just need to compute the Euclidean distance of a test vector to all
the training points, find the closest one, and look up its label.
• This can be kernelized by observing that ‖xi − xi′‖² = ⟨xi, xi⟩ + ⟨xi′, xi′⟩ − 2⟨xi, xi′⟩.
• This allows us to apply the nearest neighbor classifier to structured data objects.
• For example, using the document kernel discussed earlier, we can classify documents with KNN (label a test document by the k documents whose TF-IDF vectors are most similar).
The similarity is computed using only the kernel.
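A sketch of kernelized 1NN using the identity above: the squared distance is computed only from kernel evaluations, so the same function works for any objects for which we have a kernel. The RBF kernel and the toy labelled points are illustrative assumptions.

```python
import numpy as np

def kernel_1nn_predict(x_test, X_train, y_train, kernel):
    """1NN using only kernel evaluations:
       ||x - xi||^2 = k(x, x) + k(xi, xi) - 2 k(x, xi)."""
    d2 = [kernel(x_test, x_test) + kernel(xi, xi) - 2 * kernel(x_test, xi)
          for xi in X_train]
    return y_train[int(np.argmin(d2))]

rbf = lambda a, b: np.exp(-np.sum((np.asarray(a) - np.asarray(b)) ** 2) / 2.0)

X_train = [[0.0, 0.0], [0.1, 0.2], [3.0, 3.1], [2.9, 3.0]]
y_train = np.array(["cat", "cat", "dog", "dog"])
print(kernel_1nn_predict([2.8, 3.2], X_train, y_train, rbf))   # -> "dog"
```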
14.4.2 Kernelized K-medoids clustering
• Just as K-means uses the mean of the features in each cluster as its center, K-medoids uses a representative member (the medoid) of each cluster.
• Since we never touch the feature vectors themselves, we cannot compute a mean.
• E-step: assign each data point xi to the cluster with the closest medoid, exactly as in K-means.
• M-step: the new medoid of cluster k is the member of the cluster whose summed distance to the other members is smallest.
• For example, using a string kernel we can cluster strings that are similar in the sense of sharing many substrings.
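A compact sketch of kernelized K-medoids under these assumptions: distances come only from the kernel via d²(i, j) = K[i,i] + K[j,j] − 2K[i,j], and the medoid of each cluster is the member with the smallest summed distance to the other members. Initialization and tie-breaking are kept deliberately simple; the toy Gram matrix is made up.

```python
import numpy as np

def kernel_kmedoids(K, n_clusters, n_iter=20, seed=0):
    """K: N x N Gram matrix. Returns (medoid indices, cluster assignment)."""
    N = K.shape[0]
    d2 = np.diag(K)[:, None] + np.diag(K)[None, :] - 2 * K   # distances implied by the kernel
    rng = np.random.default_rng(seed)
    medoids = rng.choice(N, size=n_clusters, replace=False)
    for _ in range(n_iter):
        assign = np.argmin(d2[:, medoids], axis=1)            # E-step: nearest medoid
        for k in range(n_clusters):                           # M-step: recompute medoids
            members = np.where(assign == k)[0]
            if len(members) > 0:
                within = d2[np.ix_(members, members)].sum(axis=1)
                medoids[k] = members[np.argmin(within)]
    return medoids, assign

# toy usage with an RBF Gram matrix over two well-separated blobs
X = np.vstack([np.random.randn(10, 2), np.random.randn(10, 2) + 5.0])
K = np.exp(-((X[:, None, :] - X[None, :, :]) ** 2).sum(-1) / 2.0)
medoids, assign = kernel_kmedoids(K, n_clusters=2)
print(medoids, assign)
```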
14.4.3 Kernelized ridge regression
• Applying the kernel trick to distance-based methods was straightforward.
• It is not so obvious how to apply it to parametric models such as ridge regression.
• 14.4.3.1 The primal problem
• Let x ∈ RD be some feature vector, and X be the corresponding N × D design matrix
• Minimize J(w) = (y − Xw)ᵀ(y − Xw) + λ‖w‖².
• The optimal solution is given by w = (XᵀX + λI_D)⁻¹Xᵀy, which takes O(D³) time.
• 14.4.3.2 The dual problem
• Using the matrix inversion lemma (Equation 4.107), we can rewrite this as w = Xᵀ(XXᵀ + λI_N)⁻¹y,
• which takes O(N³ + N²D) time to compute. This can be advantageous if D is large.
14.4.3 Kernelized ridge regression
• we can partially kernelize this, by replacing XXT with the Gram matrix K.
• But what about the leading XT term?
• Let us define the following dual variables: α ≜ (K + λI_N)⁻¹y.
• Then we can rewrite the primal variables as w = Xᵀα = Σi αi xi.
• This tells us that the solution vector is just a linear sum of the N training vectors.
• When we plug this in at test time to compute the predictive mean, we get f̂(x) = wᵀx = Σi αi xiᵀx = Σi αi κ(xi, x).
• What is kernelized ridge regression good for? If the dimensionality of the training examples is, say, 10⁸, ordinary ridge regression needs O(D³) time for the matrix inverse, whereas kernel ridge regression needs O(N³).
• However, prediction cost goes up from O(D) to O(ND). We would therefore like to make α sparse, as in the SVM, so that only the support vectors are needed for prediction.
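A NumPy sketch contrasting the primal and dual ridge solutions quoted above (the random data, dimensions, and λ are for illustration only; the linear kernel K = XXᵀ is used so the two answers can be compared exactly):

```python
import numpy as np

rng = np.random.default_rng(0)
N, D, lam = 50, 200, 0.1                      # more features than examples
X = rng.normal(size=(N, D))
y = rng.normal(size=N)

# Primal: w = (X^T X + lam I_D)^-1 X^T y      -> O(D^3)
w_primal = np.linalg.solve(X.T @ X + lam * np.eye(D), X.T @ y)

# Dual: alpha = (K + lam I_N)^-1 y, w = X^T alpha  -> O(N^3), with K = X X^T (linear kernel)
K = X @ X.T
alpha = np.linalg.solve(K + lam * np.eye(N), y)
w_dual = X.T @ alpha

print(np.allclose(w_primal, w_dual))          # True: the two solutions agree

# Prediction at a test point uses only kernel evaluations with the training data
x_test = rng.normal(size=D)
f_hat = alpha @ (X @ x_test)                  # sum_i alpha_i * kappa(x_i, x_test)
print(np.isclose(f_hat, w_primal @ x_test))
```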
14.5 Support vector machines (SVMs)
• Consider a regularized loss J(w) = Σi L(yi, f(xi)) + λ‖w‖². If L is quadratic loss, this is equivalent to ridge regression.
• Kernelized ridge regression is kernelized, but not sparse.
• If we instead use the hinge loss (for classification) or the epsilon insensitive loss (for regression), the solution becomes sparse, and prediction depends only on a subset of the training data, the support vectors.
• This combination of the kernel trick plus a modified loss function is known as a support vector machine or SVM.
14.5.1 SVMs for regression
• The problem with kernelized ridge regression is that the solution vector w depends on all the training inputs.
• We now seek a method to produce a sparse estimate
• The epsilon insensitive loss function: L_ε(y, ŷ) = 0 if |y − ŷ| < ε, and |y − ŷ| − ε otherwise.
• This means that any point lying inside an ε-tube around the prediction is not penalized, as in Figure 14.10.
If the prediction error is within ε, no penalty is applied; beyond that, the penalty is the error minus ε.
• This objective is convex and unconstrained, but not differentiable, because of the absolute value function in the loss term.
• One popular approach is to formulate the problem as a constrained optimization problem.
• In particular, we introduce slack variables to represent the degree to which each point lies outside the tube.
We enlarge ξ just enough to make the constraints hold, and pay a penalty proportional to the enlargement.
• One can show (see e.g., (Schoelkopf and Smola 2002)) that the optimal solution has the form ŵ = Σi αi xi, where αi ≥ 0 and the vector α is sparse.
• Once the model is trained, we can then make predictions using ŷ(x) = ŵ0 + Σi αi κ(xi, x).
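A brief usage sketch with scikit-learn's SVR (the dataset and the choices of C and epsilon are illustrative); the point is that only the points falling outside the ε-tube end up as support vectors:

```python
import numpy as np
from sklearn.svm import SVR

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(-3, 3, size=(100, 1)), axis=0)
y = np.sinc(X).ravel() + 0.05 * rng.normal(size=100)      # noisy 1-D regression target

svr = SVR(kernel="rbf", C=1.0, epsilon=0.1).fit(X, y)
print("support vectors:", len(svr.support_), "of", len(X))  # points outside the eps-tube
print(svr.predict(np.array([[0.5]])))                        # prediction at a new input
```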
14.5.2 SVMs for classification
• 14.5.2.1 Hinge loss
The hinge loss L(y, η) = max(0, 1 − yη), with η = f(x), gives no penalty when yη ≥ 1:
i.e., when f(x) ≥ 1 and y = 1,
or f(x) ≤ −1 and y = −1.
In other words, points that are classified correctly with enough margin incur no penalty.
• The overall objective has the form min_{w,w0} ½‖w‖² + C Σi max(0, 1 − yi f(xi)).
• Once again, this is non-differentiable, because of the max term. However, by introducing slack variables ξi, one can show that this is equivalent to solving min ½‖w‖² + C Σi ξi subject to ξi ≥ 0 and yi(wᵀxi + w0) ≥ 1 − ξi.
• Standard QP solvers take O(N³) time.
• However, specialized algorithms, which avoid the use of generic QP solvers, have been developed for this problem, such as the sequential minimal optimization or SMO algorithm (Platt 1998). In practice SMO can take O(N²).
• For very large datasets, it is common to use linear SVMs, which take O(N) time to train (Joachims 2006; Bottou et al. 2007).
We enlarge ξ just enough to make the constraints satisfiable, and pay a penalty proportional to the enlargement.
• One can show that the solution has the form ŵ = Σi αi xi,
• where αi = λi yi and where α is sparse (because of the hinge loss).
• The xi for which αi > 0 are called support vectors; these are points which are either incorrectly classified, or are classified correctly but are on or inside the margin (we discuss margins below).
Support vectors are the points that are misclassified, or correctly classified but lying on or inside the margin.
• At test time, prediction is done using ŷ(x) = sgn(ŵ0 + Σi αi κ(xi, x)).
• This takes O(sD) time to compute, where s ≤ N is the number of support vectors.
• This depends on the sparsity level, and hence on the regularizer C.
The prediction is a weighted sum of kernel values over the training examples, and only the support vectors (αi > 0) contribute.
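A matching classification sketch with scikit-learn's SVC (the data and the two values of C are illustrative); note how the number of support vectors, and hence prediction cost, depends on C:

```python
import numpy as np
from sklearn.svm import SVC

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = (X[:, 0] * X[:, 1] > 0).astype(int)          # xor-like labels: needs a non-linear kernel

for C in [0.1, 10.0]:
    clf = SVC(kernel="rbf", C=C).fit(X, y)
    print(f"C={C}: {len(clf.support_)} support vectors, "
          f"train accuracy {clf.score(X, y):.2f}")
```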
14.5.2.2 The large margin principle
• In this section, we derive Equation 14.58 from a completely different perspective.
• The goal is to find a discriminant function f(x) that is linear in the space into which the kernel maps the features.
x can be written as its projection x⊥ onto the decision surface plus a step of length r along the direction of w: x = x⊥ + r w/‖w‖.
14.5.2.2 The large margin principle
• We would like to make this distance r = f(x)/||w|| as large as possible, for reasons illustrated in Figure 14.11.
• intuitively, the best one to pick is the one that maximizes the margin, i.e., the perpendicular distance to the closest
point
• In addition, we want to ensure each point is on the correct side of the boundary, hence we want f(xi) yi > 0.
• So our objective becomes
• Note that by rescaling the parameters using w → kw and w0 → kw0, we do not change the distance of any point to
the boundary, since the k factor cancels out when we divide by ||w||.
• Therefore let us define the scale factor such that yifi = 1 for the point that is closest to the decision boundary.
Rescale so that yi(wᵀxi + w0) = 1 for the point closest to the decision boundary.
• We therefore want to optimize min_{w,w0} ½‖w‖², subject to yi(wᵀxi + w0) ≥ 1 for all i.
• The constraint says that we want all points to be on the correct side of the decision boundary with a margin of at least 1.
This is the large margin classifier: find the w for which the margin to the closest training example is largest.
soft margin constraints
• If the data is not linearly separable (even after using the kernel trick), there will be no feasible solution in which yifi
≥ 1 for all i.
• We replace the hard constraints that yifi ≥ 0 with the soft margin constraints that yifi ≥ 1 − ξi. The new objective
becomes min_{w,w0} ½‖w‖² + C Σi ξi, subject to ξi ≥ 0 and yi(wᵀxi + w0) ≥ 1 − ξi, where C ≥ 0 controls how strongly margin violations are penalized.
• We therefore introduce slack variables ξi ≥ 0 such that ξi = 0 if the point is on or inside the correct margin boundary,
and ξi = |yi −fi| otherwise
• 0 < ξi ≤ 1 the point lies inside the margin, but on the correct side of the decision boundary.
• If ξi > 1, the point lies on the wrong side of the decision boundary.
We enlarge ξ just enough to relax the constraints so they can be satisfied, and pay a penalty proportional to the relaxation.
14.5.2.3 Probabilistic output
• An SVM classifier produces a hard-labeling, ˆy(x) = sign(f(x)).
• However, we often want a measure of confidence in our prediction.
• One heuristic approach is to interpret f(x) as the log-odds ratio, log(p(y=1|x)/p(y=0|x)), and to convert it to a probability via p(y = 1 | x, θ) = sigm(a f(x) + b),
• where a, b can be estimated by maximum likelihood on a separate validation set (Platt scaling).
• However, the resulting probabilities are not particularly well calibrated.
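A sketch of that heuristic: fit a sigmoid p(y = 1 | x) = sigm(a·f(x) + b) on decision values from a held-out validation set. Here the sigmoid is fit with a 1-D logistic regression for brevity (the data split and models are illustrative; scikit-learn's `probability=True` option for SVC does something similar internally).

```python
import numpy as np
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(0)
X = rng.normal(size=(400, 2))
y = (X[:, 0] + X[:, 1] > 0).astype(int)
X_tr, X_val, y_tr, y_val = train_test_split(X, y, test_size=0.5, random_state=0)

svm = SVC(kernel="rbf").fit(X_tr, y_tr)
f_val = svm.decision_function(X_val).reshape(-1, 1)   # f(x) on a separate validation set

platt = LogisticRegression().fit(f_val, y_val)        # learns p(y=1|x) = sigm(a f(x) + b)
print(platt.predict_proba(svm.decision_function(X[:3]).reshape(-1, 1))[:, 1])
```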
14.5.2.4 SVMs for multi-class classification
• In Section 8.3.7, we saw how we could “upgrade” a binary logistic regression model to the multiclass case, by
replacing the sigmoid function with the softmax, and the Bernoulli distribution with the multinomial.
• Upgrading an SVM to the multi-class case is not so easy, since the outputs are not on a calibrated scale and hence
are hard to compare to each other
• The obvious approach is to use a one-versus-the-rest approach (also called one-vs-all), in which we train C binary
classifiers, fc(x), where the data from class c is treated as positive, and the data from all the other classes is treated
as negative.
• However, this can result in regions of input space which are ambiguously labeled, as shown in Figure 14.14(a).
• Another approach is to use the one-versus-one or OVO approach, also called all pairs, in which we train C(C−1)/2
classifiers to discriminate all pairs fc,c′.
(Figure 14.14: one-vs-rest needs C classifiers but can leave regions of input space ambiguously labeled; one-vs-one needs C(C−1)/2 classifiers.)
14.5.3 Choosing C
• Typically C is chosen by cross-validation.
• Note, however, that C interacts quite strongly with the kernel parameters.
• To choose C efficiently, one can develop a path following algorithm in the spirit of lars (Section 13.3.4).
• The basic idea is to start with λ large, so that the margin 1/||w(λ)|| is wide, and hence all points are inside of it and
have αi = 1.
• By slowly decreasing λ, a small set of points will move from inside the margin to outside, and their αi values will
change from 1 to 0, as they cease to be support vectors.
14.5.4 Summary of key points
• Summarizing the above discussion, we recognize that SVM classifiers involve three key ingredients:
• the kernel trick: prevents underfitting
• sparsity and the large margin principle: prevent overfitting
14.6 Comparison of discriminative kernel methods
• We have mentioned several different methods for classification and regression based on kernels, which we
summarize in Table 14.1.
• GPs and L2VM are generally the slowest, taking O(N³) time
• if speed matters, use an RVM,
• if well-calibrated probabilistic output matters (e.g., for active learning or control problems), use a GP.
• To “tune” the kernel parameters, use efficient gradient-based optimizers that maximize the marginal likelihood (possible for the probabilistic models, such as GPs and RVMs).
The key distinctions in Table 14.1: is the method probabilistic, and does the kernel have to be a Mercer kernel?
14.7 Kernels for building generative models
• There is a different kind of kernel known as a smoothing kernel which can be used to create non-parametric density
estimates.
• This can be used for unsupervised density estimation, p(x), as well as for creating generative
models for classification and regression by making models of the form p(y, x).
• 14.7.1 Smoothing kernels
• A smoothing kernel is a function of one argument which satisfies ∫κ(x)dx = 1, ∫xκ(x)dx = 0, and ∫x²κ(x)dx > 0.
• A simple example is the Gaussian kernel, κ(x) = (2π)^(−1/2) exp(−x²/2).
• We can control the width of the kernel by introducing a bandwidth parameter h: κh(x) ≜ (1/h)κ(x/h).
14.7.2 Kernel density estimation (KDE) aka Parzen window density estimator
• Recall the Gaussian mixture model.
• However, it requires specifying the number K and the locations μk of the clusters.
• As the number of Gaussians grows we can approximate essentially any density on R^D; in the limit we place one Gaussian on every training example.
• We can generalize the approach by writing p̂(x) = (1/N) Σi κh(x − xi), the kernel density estimate (Parzen window estimate).
• simple non-parametric density model: no model fitting is required
• The disadvantage is that the model takes a lot of memory to store, and a lot of time to evaluate.
(The training examples themselves are the model, and every one of them is used at evaluation time.)
The density at a point x is determined by N Gaussians, one centered on each training example xi; the Gaussians of the xi close to x contribute the most, since each contribution grows as x − xi approaches 0.
• In other words, each data point xi carries a Gaussian centered on itself, and to evaluate the density at x we sum how much probability each of the N Gaussians assigns to x.
(Figure: a query point lying close to several of the Gaussians gets a high density; increasing the variance of each Gaussian makes the estimate smoother.)
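A minimal NumPy sketch of a Gaussian KDE, p̂(x) = (1/N) Σi N(x | xi, h²), on 1-D data (the data and bandwidths are illustrative; a larger h gives a smoother estimate, as noted above):

```python
import numpy as np

def kde(x_query, data, h):
    """Gaussian kernel density estimate at the points x_query."""
    diffs = (x_query[:, None] - data[None, :]) / h          # (Q, N) standardized distances
    gauss = np.exp(-0.5 * diffs ** 2) / np.sqrt(2 * np.pi)  # one Gaussian per training point
    return gauss.mean(axis=1) / h                           # average, rescale by bandwidth

data = np.concatenate([np.random.normal(-2, 0.5, 100), np.random.normal(2, 0.5, 100)])
xs = np.linspace(-5, 5, 7)
print(kde(xs, data, h=0.3))   # narrow bandwidth: bumpy, follows the data closely
print(kde(xs, data, h=2.0))   # wide bandwidth: much smoother estimate
```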
14.7.4 Kernel regression
• Kernel regression (Nadaraya-Watson) predicts f̂(x) = Σi wi(x) yi with normalized weights wi(x) = κh(x − xi) / Σj κh(x − xj).
Because wi(x) is large when x is close to xi, the prediction largely follows the y values of the nearby training points.
14.7.5 Locally weighted regression
• If we define κh(x−xi)=κ(x,xi), we can rewrite the prediction made by kernel regression as follows
• Note that κ(x, xi) need not be a smoothing kernel. If it is not, we no longer need the normalization term, so we can just write f̂(x) as an unnormalized combination,
• i.e., a blend of the N basis functions κ(x∗, xi).
• Whereas linear regression uses the same weights w for every input,
LWR fits a separate weight vector β(x∗) for each test point x∗ and predicts y with it, i.e., for each test point we fit min_{β(x∗)} Σi κ(x∗, xi) [yi − β(x∗)ᵀxi]².
• The solution is the weighted least squares estimate β(x∗) = (XᵀD(x∗)X)⁻¹XᵀD(x∗)y, with D(x∗) = diag(κ(x∗, xi)).
In words: we learn the weights β(x∗) for the current test point x∗ by minimizing the prediction error over all the training points, where the kernel makes the errors of training points close to x∗ count more.
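A sketch of both estimators under these assumptions: Nadaraya-Watson kernel regression as a normalized weighted average of the training targets, and locally weighted (linear) regression solving a separate weighted least-squares problem β(x∗) at each test point. The 1-D data, bandwidth, and local model [1, x] are made up for illustration.

```python
import numpy as np

def gauss(r, h):
    return np.exp(-0.5 * (r / h) ** 2)

def kernel_regression(x_star, X, y, h=0.3):
    """Nadaraya-Watson: f(x*) = sum_i w_i(x*) y_i with normalized kernel weights."""
    w = gauss(x_star - X, h)
    return np.sum(w * y) / np.sum(w)

def locally_weighted_regression(x_star, X, y, h=0.3):
    """Fit beta(x*) by weighted least squares, weighting training points near x*."""
    Phi = np.column_stack([np.ones_like(X), X])      # local linear model [1, x]
    W = np.diag(gauss(x_star - X, h))                # kernel weights as a diagonal matrix
    beta = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)
    return beta @ np.array([1.0, x_star])

rng = np.random.default_rng(0)
X = np.sort(rng.uniform(0, 2 * np.pi, 80))
y = np.sin(X) + 0.1 * rng.normal(size=80)
for x_star in [1.0, 3.0, 5.0]:
    print(x_star, kernel_regression(x_star, X, y), locally_weighted_regression(x_star, X, y))
```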
Murpy's Machine Learning:14. Kernel

More Related Content

What's hot

[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded ImaginationDeep Learning JP
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural NetworksMasahiro Suzuki
 
Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2TOMMYLINK1
 
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlowLearning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlowAltoros
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi DivergenceMasahiro Suzuki
 
Topological Inference via Meshing
Topological Inference via MeshingTopological Inference via Meshing
Topological Inference via MeshingDon Sheehy
 
[컴퓨터비전과 인공지능] 6. 역전파 1
[컴퓨터비전과 인공지능] 6. 역전파 1[컴퓨터비전과 인공지능] 6. 역전파 1
[컴퓨터비전과 인공지능] 6. 역전파 1jdo
 
Tensor Decomposition and its Applications
Tensor Decomposition and its ApplicationsTensor Decomposition and its Applications
Tensor Decomposition and its ApplicationsKeisuke OTAKI
 
新たなRNNと自然言語処理
新たなRNNと自然言語処理新たなRNNと自然言語処理
新たなRNNと自然言語処理hytae
 
Radial Basis Function Interpolation
Radial Basis Function InterpolationRadial Basis Function Interpolation
Radial Basis Function InterpolationJesse Bettencourt
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Larry Guo
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural NetworksSharath TS
 
Design of a Pseudo-Random Binary Code Generator via a Developed Simulation Model
Design of a Pseudo-Random Binary Code Generator via a Developed Simulation ModelDesign of a Pseudo-Random Binary Code Generator via a Developed Simulation Model
Design of a Pseudo-Random Binary Code Generator via a Developed Simulation ModelIDES Editor
 
Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Int...
Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Int...Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Int...
Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Int...Universitat Politècnica de Catalunya
 
Gaussian processing
Gaussian processingGaussian processing
Gaussian processing홍배 김
 
Lecture 06 marco aurelio ranzato - deep learning
Lecture 06   marco aurelio ranzato - deep learningLecture 06   marco aurelio ranzato - deep learning
Lecture 06 marco aurelio ranzato - deep learningmustafa sarac
 

What's hot (20)

Machine Learning 1
Machine Learning 1Machine Learning 1
Machine Learning 1
 
[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination[DL輪読会]Generative Models of Visually Grounded Imagination
[DL輪読会]Generative Models of Visually Grounded Imagination
 
(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks(研究会輪読) Weight Uncertainty in Neural Networks
(研究会輪読) Weight Uncertainty in Neural Networks
 
Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2Rabbit challenge 3 DNN Day2
Rabbit challenge 3 DNN Day2
 
Astaño 4
Astaño 4Astaño 4
Astaño 4
 
Dft
DftDft
Dft
 
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlowLearning Financial Market Data with Recurrent Autoencoders and TensorFlow
Learning Financial Market Data with Recurrent Autoencoders and TensorFlow
 
(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence(DL hacks輪読) Variational Inference with Rényi Divergence
(DL hacks輪読) Variational Inference with Rényi Divergence
 
Topological Inference via Meshing
Topological Inference via MeshingTopological Inference via Meshing
Topological Inference via Meshing
 
[컴퓨터비전과 인공지능] 6. 역전파 1
[컴퓨터비전과 인공지능] 6. 역전파 1[컴퓨터비전과 인공지능] 6. 역전파 1
[컴퓨터비전과 인공지능] 6. 역전파 1
 
Tensor Decomposition and its Applications
Tensor Decomposition and its ApplicationsTensor Decomposition and its Applications
Tensor Decomposition and its Applications
 
新たなRNNと自然言語処理
新たなRNNと自然言語処理新たなRNNと自然言語処理
新たなRNNと自然言語処理
 
Radial Basis Function Interpolation
Radial Basis Function InterpolationRadial Basis Function Interpolation
Radial Basis Function Interpolation
 
Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10) Deep Learning: Recurrent Neural Network (Chapter 10)
Deep Learning: Recurrent Neural Network (Chapter 10)
 
Recurrent Neural Networks
Recurrent Neural NetworksRecurrent Neural Networks
Recurrent Neural Networks
 
Design of a Pseudo-Random Binary Code Generator via a Developed Simulation Model
Design of a Pseudo-Random Binary Code Generator via a Developed Simulation ModelDesign of a Pseudo-Random Binary Code Generator via a Developed Simulation Model
Design of a Pseudo-Random Binary Code Generator via a Developed Simulation Model
 
Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Int...
Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Int...Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Int...
Deep Generative Models I (DLAI D9L2 2017 UPC Deep Learning for Artificial Int...
 
Gaussian processing
Gaussian processingGaussian processing
Gaussian processing
 
Lecture 06 marco aurelio ranzato - deep learning
Lecture 06   marco aurelio ranzato - deep learningLecture 06   marco aurelio ranzato - deep learning
Lecture 06 marco aurelio ranzato - deep learning
 
Cpsc125 ch6sec3
Cpsc125 ch6sec3Cpsc125 ch6sec3
Cpsc125 ch6sec3
 

Viewers also liked

ThinkBayes: chapter 13  simulation
ThinkBayes: chapter 13  simulationThinkBayes: chapter 13  simulation
ThinkBayes: chapter 13  simulationJungkyu Lee
 
3 Generative models for discrete data
3 Generative models for discrete data3 Generative models for discrete data
3 Generative models for discrete dataJungkyu Lee
 
Eigenvalues of regular graphs
Eigenvalues of regular graphsEigenvalues of regular graphs
Eigenvalues of regular graphsJungkyu Lee
 
파이널 판타지 3 루트 공략
파이널 판타지 3 루트 공략파이널 판타지 3 루트 공략
파이널 판타지 3 루트 공략Jungkyu Lee
 
머피의 머신러닝: Undirencted Graphical Model
머피의 머신러닝: Undirencted Graphical Model머피의 머신러닝: Undirencted Graphical Model
머피의 머신러닝: Undirencted Graphical ModelJungkyu Lee
 
Murpy's Machine Learning 9. Generalize Linear Model
Murpy's Machine Learning 9. Generalize Linear ModelMurpy's Machine Learning 9. Generalize Linear Model
Murpy's Machine Learning 9. Generalize Linear ModelJungkyu Lee
 
ThinkBayes: Chapter 9 two_dimensions
ThinkBayes: Chapter 9 two_dimensionsThinkBayes: Chapter 9 two_dimensions
ThinkBayes: Chapter 9 two_dimensionsJungkyu Lee
 
Murpy's Machine Learing: 10. Directed Graphical Model
Murpy's Machine Learing: 10. Directed Graphical ModelMurpy's Machine Learing: 10. Directed Graphical Model
Murpy's Machine Learing: 10. Directed Graphical ModelJungkyu Lee
 
머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear ModelJungkyu Lee
 
TETRIS AI WITH REINFORCEMENT LEARNING
TETRIS AI WITH REINFORCEMENT LEARNINGTETRIS AI WITH REINFORCEMENT LEARNING
TETRIS AI WITH REINFORCEMENT LEARNINGJungkyu Lee
 
머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear ModelJungkyu Lee
 
7. Linear Regression
7. Linear Regression7. Linear Regression
7. Linear RegressionJungkyu Lee
 
4. Gaussian Model
4. Gaussian Model4. Gaussian Model
4. Gaussian ModelJungkyu Lee
 
앙상블 학습 기반의 추천시스템 개발
앙상블 학습 기반의 추천시스템 개발앙상블 학습 기반의 추천시스템 개발
앙상블 학습 기반의 추천시스템 개발Jungkyu Lee
 
머피's 머신러닝, Mixture model and EM algorithm
머피's 머신러닝, Mixture model and EM algorithm머피's 머신러닝, Mixture model and EM algorithm
머피's 머신러닝, Mixture model and EM algorithmJungkyu Lee
 
From A Neural Probalistic Language Model to Word2vec
From A Neural Probalistic Language Model to Word2vecFrom A Neural Probalistic Language Model to Word2vec
From A Neural Probalistic Language Model to Word2vecJungkyu Lee
 
Support Vector Machine Tutorial 한국어
Support Vector Machine Tutorial 한국어Support Vector Machine Tutorial 한국어
Support Vector Machine Tutorial 한국어Jungkyu Lee
 
8. Logistic Regression
8. Logistic Regression8. Logistic Regression
8. Logistic RegressionJungkyu Lee
 
머피의 머신러닝 : Gaussian Processes
머피의 머신러닝 : Gaussian Processes머피의 머신러닝 : Gaussian Processes
머피의 머신러닝 : Gaussian ProcessesJungkyu Lee
 
집단지성 프로그래밍 07-고급 분류 기법-커널 기법과 svm-01
집단지성 프로그래밍 07-고급 분류 기법-커널 기법과 svm-01집단지성 프로그래밍 07-고급 분류 기법-커널 기법과 svm-01
집단지성 프로그래밍 07-고급 분류 기법-커널 기법과 svm-01Kwang Woo NAM
 

Viewers also liked (20)

ThinkBayes: chapter 13  simulation
ThinkBayes: chapter 13  simulationThinkBayes: chapter 13  simulation
ThinkBayes: chapter 13  simulation
 
3 Generative models for discrete data
3 Generative models for discrete data3 Generative models for discrete data
3 Generative models for discrete data
 
Eigenvalues of regular graphs
Eigenvalues of regular graphsEigenvalues of regular graphs
Eigenvalues of regular graphs
 
파이널 판타지 3 루트 공략
파이널 판타지 3 루트 공략파이널 판타지 3 루트 공략
파이널 판타지 3 루트 공략
 
머피의 머신러닝: Undirencted Graphical Model
머피의 머신러닝: Undirencted Graphical Model머피의 머신러닝: Undirencted Graphical Model
머피의 머신러닝: Undirencted Graphical Model
 
Murpy's Machine Learning 9. Generalize Linear Model
Murpy's Machine Learning 9. Generalize Linear ModelMurpy's Machine Learning 9. Generalize Linear Model
Murpy's Machine Learning 9. Generalize Linear Model
 
ThinkBayes: Chapter 9 two_dimensions
ThinkBayes: Chapter 9 two_dimensionsThinkBayes: Chapter 9 two_dimensions
ThinkBayes: Chapter 9 two_dimensions
 
Murpy's Machine Learing: 10. Directed Graphical Model
Murpy's Machine Learing: 10. Directed Graphical ModelMurpy's Machine Learing: 10. Directed Graphical Model
Murpy's Machine Learing: 10. Directed Graphical Model
 
머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model
 
TETRIS AI WITH REINFORCEMENT LEARNING
TETRIS AI WITH REINFORCEMENT LEARNINGTETRIS AI WITH REINFORCEMENT LEARNING
TETRIS AI WITH REINFORCEMENT LEARNING
 
머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model머피's 머신러닝: Latent Linear Model
머피's 머신러닝: Latent Linear Model
 
7. Linear Regression
7. Linear Regression7. Linear Regression
7. Linear Regression
 
4. Gaussian Model
4. Gaussian Model4. Gaussian Model
4. Gaussian Model
 
앙상블 학습 기반의 추천시스템 개발
앙상블 학습 기반의 추천시스템 개발앙상블 학습 기반의 추천시스템 개발
앙상블 학습 기반의 추천시스템 개발
 
머피's 머신러닝, Mixture model and EM algorithm
머피's 머신러닝, Mixture model and EM algorithm머피's 머신러닝, Mixture model and EM algorithm
머피's 머신러닝, Mixture model and EM algorithm
 
From A Neural Probalistic Language Model to Word2vec
From A Neural Probalistic Language Model to Word2vecFrom A Neural Probalistic Language Model to Word2vec
From A Neural Probalistic Language Model to Word2vec
 
Support Vector Machine Tutorial 한국어
Support Vector Machine Tutorial 한국어Support Vector Machine Tutorial 한국어
Support Vector Machine Tutorial 한국어
 
8. Logistic Regression
8. Logistic Regression8. Logistic Regression
8. Logistic Regression
 
머피의 머신러닝 : Gaussian Processes
머피의 머신러닝 : Gaussian Processes머피의 머신러닝 : Gaussian Processes
머피의 머신러닝 : Gaussian Processes
 
집단지성 프로그래밍 07-고급 분류 기법-커널 기법과 svm-01
집단지성 프로그래밍 07-고급 분류 기법-커널 기법과 svm-01집단지성 프로그래밍 07-고급 분류 기법-커널 기법과 svm-01
집단지성 프로그래밍 07-고급 분류 기법-커널 기법과 svm-01
 

Similar to Murpy's Machine Learning:14. Kernel

ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...grssieee
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recapbutest
 
properties, application and issues of support vector machine
properties, application and issues of support vector machineproperties, application and issues of support vector machine
properties, application and issues of support vector machineDr. Radhey Shyam
 
Data Science - Part IX - Support Vector Machine
Data Science - Part IX -  Support Vector MachineData Science - Part IX -  Support Vector Machine
Data Science - Part IX - Support Vector MachineDerek Kane
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Zihui Li
 
Kernel density estimation (kde)
Kernel density estimation (kde)Kernel density estimation (kde)
Kernel density estimation (kde)Padma Metta
 
Soft Computing Techniques Based Image Classification using Support Vector Mac...
Soft Computing Techniques Based Image Classification using Support Vector Mac...Soft Computing Techniques Based Image Classification using Support Vector Mac...
Soft Computing Techniques Based Image Classification using Support Vector Mac...ijtsrd
 
Recurrent and Recursive Nets (part 2)
Recurrent and Recursive Nets (part 2)Recurrent and Recursive Nets (part 2)
Recurrent and Recursive Nets (part 2)sohaib_alam
 
Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)
Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)
Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)ALINLAB
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5Sunwoo Kim
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptxNameetDaga1
 
Instance based learning
Instance based learningInstance based learning
Instance based learningswapnac12
 
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State SpacesDeep Learning JP
 
Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Boris Yen
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache CassandraJacky Chu
 

Similar to Murpy's Machine Learning:14. Kernel (20)

ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
ABayesianApproachToLocalizedMultiKernelLearningUsingTheRelevanceVectorMachine...
 
Db Scan
Db ScanDb Scan
Db Scan
 
Lecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning RecapLecture 17: Supervised Learning Recap
Lecture 17: Supervised Learning Recap
 
properties, application and issues of support vector machine
properties, application and issues of support vector machineproperties, application and issues of support vector machine
properties, application and issues of support vector machine
 
Data Science - Part IX - Support Vector Machine
Data Science - Part IX -  Support Vector MachineData Science - Part IX -  Support Vector Machine
Data Science - Part IX - Support Vector Machine
 
Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)Machine Learning Algorithms (Part 1)
Machine Learning Algorithms (Part 1)
 
Kernel density estimation (kde)
Kernel density estimation (kde)Kernel density estimation (kde)
Kernel density estimation (kde)
 
Soft Computing Techniques Based Image Classification using Support Vector Mac...
Soft Computing Techniques Based Image Classification using Support Vector Mac...Soft Computing Techniques Based Image Classification using Support Vector Mac...
Soft Computing Techniques Based Image Classification using Support Vector Mac...
 
Recurrent and Recursive Nets (part 2)
Recurrent and Recursive Nets (part 2)Recurrent and Recursive Nets (part 2)
Recurrent and Recursive Nets (part 2)
 
Neural nw k means
Neural nw k meansNeural nw k means
Neural nw k means
 
Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)
Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)
Polynomial Tensor Sketch for Element-wise Matrix Function (ICML 2020)
 
PRML Chapter 5
PRML Chapter 5PRML Chapter 5
PRML Chapter 5
 
CNN for modeling sentence
CNN for modeling sentenceCNN for modeling sentence
CNN for modeling sentence
 
Word_Embedding.pptx
Word_Embedding.pptxWord_Embedding.pptx
Word_Embedding.pptx
 
lec10svm.ppt
lec10svm.pptlec10svm.ppt
lec10svm.ppt
 
Instance based learning
Instance based learningInstance based learning
Instance based learning
 
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
【DL輪読会】Efficiently Modeling Long Sequences with Structured State Spaces
 
Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011Talk about apache cassandra, TWJUG 2011
Talk about apache cassandra, TWJUG 2011
 
Talk About Apache Cassandra
Talk About Apache CassandraTalk About Apache Cassandra
Talk About Apache Cassandra
 
Support vector machines
Support vector machinesSupport vector machines
Support vector machines
 

Recently uploaded

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MIND CTI
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Victor Rentea
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityWSO2
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...Zilliz
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Angeliki Cooney
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistandanishmna97
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxRustici Software
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...apidays
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Jeffrey Haguewood
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxRemote DBA Services
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingEdi Saputra
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 

Recently uploaded (20)

Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024MINDCTI Revenue Release Quarter One 2024
MINDCTI Revenue Release Quarter One 2024
 
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024Finding Java's Hidden Performance Traps @ DevoxxUK 2024
Finding Java's Hidden Performance Traps @ DevoxxUK 2024
 
Platformless Horizons for Digital Adaptability
Platformless Horizons for Digital AdaptabilityPlatformless Horizons for Digital Adaptability
Platformless Horizons for Digital Adaptability
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ..."I see eyes in my soup": How Delivery Hero implemented the safety system for ...
"I see eyes in my soup": How Delivery Hero implemented the safety system for ...
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
Biography Of Angeliki Cooney | Senior Vice President Life Sciences | Albany, ...
 
WSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering DevelopersWSO2's API Vision: Unifying Control, Empowering Developers
WSO2's API Vision: Unifying Control, Empowering Developers
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost SavingRepurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
Repurposing LNG terminals for Hydrogen Ammonia: Feasibility and Cost Saving
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 

Murpy's Machine Learning:14. Kernel

  • 2. 14.1 Introduction • how do we represent a text document or protein sequence, which can be of variable length? • One approach to such problems is to define a generative model for the data, and use the inferred latent representation and/or the parameters of the model as features, and then to plug these features in to standard methods. • For example, in Chapter 28, we discuss deep learning, which is essentially an unsupervised way to learn good feature representations. • Another approach is to assume that we have some way of measuring the similarity between objects, that doesn’t require preprocessing them into feature vector format. • For example, when comparing strings, we can compute the edit distance between them. • In this chapter, we will discuss several kinds of kernel functions • 길이가 일정치 않은 feature는 어떻게 표현할까 • Latent representation  LDA 각 문서의 토픽 분포 • Kernel: 두 feature 간의 similarity만으로 feature를 표현 • e.g 두 스트링 비교  editing distance • Support Vector Machine • kernelized ridge regression 을 sparse linear machine으로 바꾸는 관점 • maximum margin classifier로 보는 관점
  • 3. 14.2 Kernel functions • We define a kernel function to be a real-valued function of two arguments, κ(x,x)∈R,for x,x’∈X. • symmetric • non-negative • can be interpreted as a measure of similarity
  • 4. 14.2.1 RBF kernels • squared exponential kernel(SE kernel) or Gaussian kernel • If Σ is diagonal, this can be written as • We can interpret the σj as defining the characteristic length scale of dimension j. radial basis function or RBF kernel Here σ2 is known as the bandwidth.
  • 5. • 시그마가 커질수록, 두 feature간 거리가 멀어도 어느정도 수치는 나오게 된다. • 시그마가 작으면, 거리가 좀만 멀어져도 kernel 유사도는 급히 작아진다
  • 6. 14.2.2 Kernels for comparing documents • If we use a bag of words representation, where xij is the number of times words j occurs in document i, we can use the cosine similarity, • Unfortunately, this simple method does not work very well, (stop word)
  • 7. 14.2.3 Mercer (positive definite) kernels • In general, if the kernel is Mercer, then there exists a function φ mapping x∈X to RD such that • For example, consider the (non-stationary)polynomial kernel
  • 8. 14.2.3 Mercer (positive definite) kernels • Gram matrix, • kernel function satisfy the requirement that the Gram matrix be positive definite for any set of inputs • It can be shown (Schoelkopf and Smola 2002) that the Gaussian kernel is a Mercer kernel as is the cosine similarity kernel (Sahami and Heilman 2006) • Mercer’s theorem. • If the Gram matrix is positive definite, we can compute an eigenvector decomposition of it as follows
  • 9. 14.2.4 Linear kernels • Deriving the feature vector implied by a kernel is in general quite difficult, and only possible if the kernel is Mercer. • However, deriving a kernel from a feature vector is easy: we just use • If φ(x)=x,we get the linear kernel, defined by κ • This is useful if the original data is already high dimensional, and if the original features are individually informative, e.g., a bag of words representation where the vocabulary size is large, or the expression level of many genes. • s. In such a case, the decision boundary is likely to be representable as a linear combination of the original features, so it is not necessary to work in some other feature space. • Of course, not all high dimensional problems are linearly separable. For example, images are high dimensional, but individual pixels are not very informative, so image classification typically requires non-linear kernels (see e.g., Section 14.2.7).
  • 10. 14.2.5 Matern kernels • The Matern kernel, which is commonly used in Gaussian process regression (see Section 15.2) • Where r=||x−x ||, ν>0, l>0, and Kν is a modified Bessel function. • As ν→∞, this approaches the SE kernel. If ν=12, the kernel simplifies to
  • 11. 14.2.6 String kernels • The real power of kernels arises when the inputs are structured objects. • As an example, we now describe one way of comparing two variable length strings using a string kernel. • Consider two strings x, and x of lengths D, D’, each defined over the alphabet A • A={A, R, N, D, C, E, Q, G, H, I, L, K, M, F, P, S, T, W, Y, V} • Now let φs(x) denote the number of times that substrings appears in string x • More formally and more generally, let us say that’s is a substring of x if we can write x=usv for some (possibly empty) strings u, s and v . • . x string, x' string의 커널은, substring을 많이 공유하면 커진다
  • 12. 14.2.7 Pyramid match kernels • 이미지에서 점을 추출하고 bag of words 로 표현한다(SIFT) • 여러 레벨을 가진 grid로 나누고, 얼마나 일치하는지 본다(같은 grid에 얼마나 같이 등장하는지?) • This is a Mercer kernel.
  • 13. 14.3 Using kernels inside GLMs • We define a kernel machine to be a GLM where the input feature vector has the form • If κ is an RBF kernel, this is called an RBF network We discuss ways to choose the μk parameters below. • We will call Equation 14.28 a kernelised feature vector. Note that in this approach, the kernel need not be a Mercer kernel. • We can use the kernelized feature vector for logistic regression by defining • This provides a simple way to define a non-linear decision boundary. • As an example, consider the data coming from the exclusive or or xor function.
  • 14. 14.3 Using kernels inside GLMs • Linear regression의 kernelized 버전
  • 15. 14.3.2 L1VMs, RVMs, and other sparse vector machines • The main issue with kernel machines is: how do we choose the centroids μk? • If the input is low-dimensional Euclidean space, we can uniformly tile the space occupied by the data with prototypes, as we did in Figure 14.2(c). = 공간을 균일하게 나눈다 • However, this approach breaks down in higher numbers of dimensions because of the curse of dimensionality. 차원이 커지게 되면, 각 공간이 너무 많아 져서 차원의 저주에 걸리게 된다 • A simpler approach is to make each example xi be a prototype, so we get • Now we see D = N, so we have as many parameters as data points. However, we can use any of the sparsity- promoting priors for w discussed in Chapter 13 to efficiently select a subset of the training exemplars. 이런 접근법은 13장 sparse linear model에서 했던 w를 sparse하게 만드는 접근법을 training example의 subset을 고르는 문제로 바꾸는 것이면 가능하게 한다. wi=0이면, xi는 없어도 예측 가능 • We call this a sparse vector machine
  • 16. • w에 대해서 1 regularization으로 penalty를 주는 sparse vector machine을 L1VM • w에 대해서 2 regularization으로 penalty를 주는 sparse vector machine을 L2VM • Automatic relevance determination (ARD)/sparse Bayesian learning (SBL) (13.7)로 sparsity를 얻는 모델을 relevance vector machine or RVM (Tipping 2001)라 한다 • Another very popular approach to creating a sparse kernel machine is to use a support vector machine or SVM Rather than using a sparsity-promoting prior, it essentially modifies the likelihood term, which is rather unnatural from a Bayesian point of view. Nevertheless, the effect is similar, as we will see.
  • 17.
  • 18. 14.4 The kernel trick • X자체나 X의 고차원 mapping을다루는 것이 아니라 kernel 자체를 다룸 • <x,x’> 를 k(x,x’)로 교체 • linear regression 같은데에서 커널을 쓰면, 고차원으로 사상된 training example 을 다루어야 했기 때문에, 학습 이나 예측 계산이 무거워졌었는데, 그런한 단점을 회피하는 trick 같은 것, • 심지어 무한 차원 매핑까지 할 수 있다(RBF= Gaussian Kernel) • 또는 structured object 의 다루기 어려운 feature 자체를 쓰는 것이 아니라, feature 간의 유사도라고 할 수 있는 kernel 을 씀으로써, 기존 알고리즘들이 structured object 에도 수행될 수 있게 해줌 • This is called the kernel trick.
  • 19. 14.4.1 Kernelized nearest neighbor classification • Recall that in a 1NN classifier (Section 1.4.2), we just need to compute the Euclidean distance of a test vector to all the training points, find the closest one, and look up its label. • This can be kernelized by observing that • This allows us to apply the nearest neighbor classifier to structured data objects. • 예를 들어서, 전에 다루었던 문서 커널을 써서, KNN으로 문서 분류를 할 수 있다.(tf-idf가 가장 비슷한 문서 k개 로 test 문서를 분류) Similarity를 kernel로만 계산
  • 20. 14.4.2 Kernelized K-medoids clustering • K –means가 각 cluster에 속한 feature들의 평균을 중심으로 하는 것처럼, k-medoids는 각 cluster의 중간값 • feature 자체를 다루는 것이 아니기 때문에 평균을 내지 못한다 • e-step = 각 데이터 xi가 어느 cluster에 속하는지 정한다(가장 가까운 centroid), k means와 같음 • m-step = 각 cluster k의 중심은 k cluster에 속한 애들 중에서, 중간값을 구한다 = k cluster에 속한 애들 중에서 다 른 애들과의 거리의 합이 최소 • 예를 들어서 string kernel 을 써서, 공통된 substring이 많은 비슷한 string을 clustering 할 수 있다
  • 21. 14.4.3 Kernelized ridge regression • Applying the kernel trick to distance-based methods was straightforward. • It is not so obvious how to apply it to parametric models such as ridge regression. • 14.4.3.1 The primal problem • Let x ∈ RD be some feature vector, and X be the corresponding N × D design matrix • Minimize • The optimal solution is given b • 14.4.3.2 The dual problem • using the matrix inversion lemma (Equation 4.107) • takes O(N3 + N2D) time to compute. This can be advantageous if D is large
  • 22. 14.4.3 Kernelized ridge regression • we can partially kernelize this, by replacing XXT with the Gram matrix K. • But what about the leading XT term? • Let us define the following dual variables: • Then we can rewrite the primal variables as follows • This tells us that the solution vector is just a linear sum of the N training vectors • When we plug this in at test time to compute the predictive mean, we get • kernelize ridge regression 의 쓰임새? 만약, 다루고자 하는 training example의 차원이 100000000이라면, 원래의 ridge regression이라면, 역행렬을 계산하는데 O(D^3)의 시간복잡도가 들 것이지만, Kernel ridge regression은 O(N^3)의 시간 복잡도를 가진다. • 하지만, 예측 계산은 O(D)에서 O(ND)로 증가한다. 그러므로 α를 svm처럼 sparse하게 만들어서(예측에 support vector만 써서) 해결해야 한다
  • 23. 14.5 Support vector machines (SVMs) • If L is quadratic loss, this is equivalent to ridge regression, • Kernelized ridge regression is kernelized, but not sparse • loss function을 hinge(classification)나 epsilon insensitive loss function(regression)을 쓰면, solution이 sparse해지 고, prediction은 오직 training data의 일부(support vectors)만을 가지고, 예측할 수 있다. • This combination of the kernel trick plus a modified loss function is known as a support vector machine or SVM
  • 24. 14.5.1 SVMs for regression • The problem with kernelized ridge regression is that the solution vector w depends on all the training inputs. • We now seek a method to produce a sparse estimate • epsilon insensitive loss function • This means that any point lying inside an-tube around the prediction is not penalized, as in Figure 14.10. 예측치의 오차가 특정 범위(epsilon)에 있으면, 페 널티는 가하지 않는다. 하지만, 그 범위를 넘어가 면 오차-epsilon만큼 페널티를 가한다
  • 25. • This objective is convex and unconstrained, but not differentiable, because of the absolute value function in the loss term. • One popular approach is to formulate the problem as a constrained optimization problem. • In particular, we introduce slack variables to represent the degree to which each point lies outside the tube. Each ξ is enlarged just enough to make its constraint hold, and the objective pays a penalty proportional to that enlargement
  • 26. • One can show (see e.g., (Schoelkopf and Smola 2002)) that the optimal solution has the form ŵ = Σ_i α_i x_i, where α is sparse • Once the model is trained, we can then make predictions using ŷ(x) = ŵ_0 + Σ_i α_i κ(x, x_i)
  • 27. 14.5.2 SVMs for classification • 14.5.2.1 Hinge loss: L_hinge(y, η) = max(0, 1 − yη) • If yη ≥ 1 there is no penalty, i.e., when f(x) ≥ 1 and y = 1, or f(x) ≤ −1 and y = −1; in other words, there is no penalty when the prediction is correct with sufficient margin
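A minimal sketch of the hinge loss for labels in {−1, +1}; the example scores are assumptions chosen to show the three regimes:

```python
import numpy as np

def hinge_loss(y, eta):
    """L_hinge(y, eta) = max(0, 1 - y * eta); y in {-1, +1}, eta = f(x)."""
    return np.maximum(0.0, 1.0 - y * eta)

# correct with margin -> 0, correct but inside margin -> small, wrong side -> large
print(hinge_loss(np.array([1, 1, -1]), np.array([2.0, 0.5, 0.5])))  # [0.  0.5 1.5]
```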
  • 28. • The overall objective has the form min_{w,w0} (1/2)||w||² + C Σ_i (1 − y_i f(x_i))_+ • Once again, this is non-differentiable because of the max term. However, by introducing slack variables ξi, one can show that this is equivalent to the quadratic program min (1/2)||w||² + C Σ_i ξ_i s.t. ξ_i ≥ 0, y_i(w^T x_i + w0) ≥ 1 − ξ_i • Standard solvers take O(N^3) time. • However, specialized algorithms, which avoid the use of generic QP solvers, have been developed for this problem, such as the sequential minimal optimization or SMO algorithm (Platt 1998). In practice this can take O(N^2). • If N is large this can still be too slow; in such settings, it is common to use linear SVMs, which take O(N) time to train (Joachims 2006; Bottou et al. 2007). Again, each ξ is enlarged just enough to relax its constraint, and the objective pays a penalty proportional to the relaxation
  • 29. • One can show that the solution has the form ŵ = Σ_i α_i x_i, where αi = λiyi and where α is sparse (because of the hinge loss). • The xi for which αi > 0 are called support vectors; these are points which are either incorrectly classified, or are classified correctly but are on or inside the margin (we discuss margins below). • At test time, prediction is done using ŷ(x) = sgn(ŵ_0 + Σ_i α_i κ(x, x_i)), a weighted sum of kernel evaluations in which only the support vectors contribute. • This takes O(sD) time to compute, where s ≤ N is the number of support vectors. • This depends on the sparsity level, and hence on the regularizer C
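A sketch of the sparse prediction rule, assuming `alpha`, `w0`, and the kernel come from an already-trained SVM (they are placeholders here, not the text's values):

```python
import numpy as np

def rbf_kernel(a, b, sigma=1.0):
    d = a - b
    return np.exp(-np.dot(d, d) / (2 * sigma ** 2))

def svm_predict(x, X_train, alpha, w0, kernel=rbf_kernel):
    """y_hat(x) = sgn(w0 + sum_i alpha_i k(x, x_i));
    only the support vectors (alpha_i != 0) contribute."""
    sv = np.nonzero(alpha)[0]
    score = w0 + sum(alpha[i] * kernel(x, X_train[i]) for i in sv)
    return np.sign(score)
```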
  • 30. 14.5.2.2 The large margin principle • In this section, we derive Equation 14.58 from a completely different perspective. • The goal is to find a discriminant function f(x) that is linear in the space into which the kernel maps the features. • A point x can be written as its projection x⊥ onto the decision boundary plus a step of length r along the direction of w, i.e., x = x⊥ + r w/||w||
  • 31. 14.5.2.2 The large margin principle • We would like to make this distance r = f(x)/||w|| as large as possible, for reasons illustrated in Figure 14.11. • Intuitively, the best boundary to pick is the one that maximizes the margin, i.e., the perpendicular distance to the closest point. • In addition, we want to ensure each point is on the correct side of the boundary, hence we want f(xi) yi > 0. • So our objective becomes max_{w,w0} min_i y_i(w^T x_i + w0)/||w|| • Note that by rescaling the parameters using w → kw and w0 → kw0, we do not change the distance of any point to the boundary, since the k factor cancels out when we divide by ||w||. • Therefore let us define the scale factor such that yifi = 1 for the point that is closest to the decision boundary, i.e., rescale so that yi(wT xi + w0) = 1 for that point. • We therefore want to optimize min_{w,w0} (1/2)||w||² s.t. y_i(w^T x_i + w0) ≥ 1, i = 1:N • The constraint says that we want all points to be on the correct side of the decision boundary with a margin of at least 1. This is the large margin classifier: find the w for which the margin to the closest training example is largest
  • 32. Soft margin constraints • If the data is not linearly separable (even after using the kernel trick), there will be no feasible solution in which yifi ≥ 1 for all i. • We therefore introduce slack variables ξi ≥ 0 such that ξi = 0 if the point is on or inside the correct margin boundary, and ξi = |yi − fi| otherwise. • We replace the hard constraints yifi ≥ 1 with the soft margin constraints yifi ≥ 1 − ξi. The new objective becomes min_{w,w0,ξ} (1/2)||w||² + C Σ_i ξ_i s.t. ξ_i ≥ 0, y_i(w^T x_i + w0) ≥ 1 − ξ_i • If 0 < ξi ≤ 1, the point lies inside the margin, but on the correct side of the decision boundary. • If ξi > 1, the point lies on the wrong side of the decision boundary. As before, each ξ is enlarged just enough to relax its constraint, and the objective pays a penalty proportional to the relaxation
  • 33. 14.5.2.3 Probabilistic output • An SVM classifier produces a hard labeling, ŷ(x) = sign(f(x)). • However, we often want a measure of confidence in our prediction. • One heuristic approach is to interpret f(x) as the log-odds ratio, log(p(y=1|x)/p(y=0|x)), and convert it to a probability via p(y=1|x) = σ(a f(x) + b), where a, b can be estimated by maximum likelihood on a separate validation set (commonly known as Platt scaling). • However, the resulting probabilities are not particularly well calibrated
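A rough sketch of fitting the sigmoid map by maximum likelihood with plain gradient ascent; the function names, learning rate, and held-out scores `f_val`, labels `y_val` in {0,1} are assumptions:

```python
import numpy as np

def fit_platt(f_val, y_val, n_iter=2000, lr=0.01):
    """Fit p(y=1|x) = sigmoid(a * f(x) + b) by maximum likelihood
    on held-out SVM scores f_val and binary labels y_val."""
    a, b = 1.0, 0.0
    n = len(f_val)
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(a * f_val + b)))
        a += lr * np.sum((y_val - p) * f_val) / n  # gradient of log-likelihood w.r.t. a
        b += lr * np.sum(y_val - p) / n            # gradient of log-likelihood w.r.t. b
    return a, b

def platt_prob(f, a, b):
    """Convert an SVM score f(x) into an (approximate) probability."""
    return 1.0 / (1.0 + np.exp(-(a * f + b)))
```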
  • 34. 14.5.2.4 SVMs for multi-class classification • In Section 8.3.7, we saw how we could “upgrade” a binary logistic regression model to the multiclass case, by replacing the sigmoid function with the softmax, and the Bernoulli distribution with the multinomial. • Upgrading an SVM to the multi-class case is not so easy, since the outputs are not on a calibrated scale and hence are hard to compare to each other. • The obvious approach is to use a one-versus-the-rest approach (also called one-vs-all), in which we train C binary classifiers, fc(x), where the data from class c is treated as positive, and the data from all the other classes is treated as negative. • However, this can result in regions of input space which are ambiguously labeled, as shown in Figure 14.14(a). • Another approach is the one-versus-one or OVO approach, also called all pairs, in which we train C(C−1)/2 classifiers to discriminate all pairs fc,c′. (Figure 14.14: C classifiers for one-vs-rest vs. C(C−1)/2 classifiers for OVO, both with ambiguously labeled regions.)
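A trivial sketch of the one-vs-rest decision rule; `score_fns` is a hypothetical list of per-class binary decision functions, and note the caveat above that raw SVM scores are not calibrated:

```python
import numpy as np

def ovr_predict(x, score_fns):
    """One-vs-rest: score_fns[c](x) is the decision score of the
    'class c vs the rest' binary classifier; predict the argmax."""
    return int(np.argmax([f(x) for f in score_fns]))
```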
  • 35. 14.5.3 Choosing C • Typically C is chosen by cross-validation. • Note, however, that C interacts quite strongly with the kernel parameters. • To choose C efficiently, one can develop a path following algorithm in the spirit of lars (Section 13.3.4). • The basic idea is to start with λ large, so that the margin 1/||w(λ)|| is wide, and hence all points are inside of it and have αi = 1. • By slowly decreasing λ, a small set of points will move from inside the margin to outside, and their αi values will change from 1 to 0, as they cease to be support vectors.
  • 36. 14.5.4 Summary of key points • Summarizing the above discussion, we recognize that SVM classifiers involve three key ingredients: the kernel trick, sparsity, and the large margin principle • The kernel trick prevents underfitting; sparsity and the large margin principle prevent overfitting
  • 37. 14.6 Comparison of discriminative kernel methods • We have mentioned several different methods for classification and regression based on kernels, which we summarize in Table 14.1. • GPs and L2VM are generally the slowest, taking O(N^3) time. • If speed matters, use an RVM; if well-calibrated probabilistic output matters (e.g., for active learning or control problems), use a GP. • To “tune” the kernel parameters, use efficient gradient based optimizers to maximize the marginal likelihood. • The table also records, for each method, whether it is a probabilistic model and whether it requires the kernel to be a Mercer kernel
  • 38. 14.7 Kernels for building generative models • There is a different kind of kernel known as a smoothing kernel which can be used to create non-parametric density estimates. • This can be used for unsupervised density estimation, p(x), as well as for creating generative models for classification and regression by making models of the form p(y, x). • 14.7.1 Smoothing kernels • A smoothing kernel is a function of one argument which satisfies ∫κ(x)dx = 1, ∫xκ(x)dx = 0, and ∫x²κ(x)dx > 0 • A simple example is the Gaussian kernel, κ(x) = (2π)^{−1/2} e^{−x²/2} • We can control the width of the kernel by introducing a bandwidth parameter h: κ_h(x) = (1/h) κ(x/h)
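A minimal sketch of the Gaussian smoothing kernel and its bandwidth-scaled version; the function names are assumptions:

```python
import numpy as np

def gauss_kernel(x):
    """Gaussian smoothing kernel: integrates to 1, zero mean, positive variance."""
    return np.exp(-0.5 * x ** 2) / np.sqrt(2 * np.pi)

def kernel_h(x, h):
    """Bandwidth-scaled kernel: kappa_h(x) = (1/h) * kappa(x/h)."""
    return gauss_kernel(x / h) / h
```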
  • 39. 14.7.2 Kernel density estimation (KDE), aka the Parzen window density estimator • Recall the Gaussian mixture model. • However, it requires specifying the number K and locations μk of the clusters. • If we let the number of Gaussians grow, we can represent any distribution on R^D; in the limit, each training example gets its own Gaussian • We can generalize the approach by writing p̂(x) = (1/N) Σ_i κ_h(x − x_i) • This is a simple non-parametric density model: no model fitting is required • The disadvantage is that the model takes a lot of memory to store, and a lot of time to evaluate (the training examples themselves are the model, and every evaluation uses all of them) • The probability of a particular x is determined by the N Gaussians centered at the training examples x_i; the x_i closest to x contribute the most, since κ_h(x − x_i) grows as x − x_i approaches 0
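A minimal 1D Parzen window sketch using the Gaussian smoothing kernel above; the toy data and bandwidth are assumptions:

```python
import numpy as np

def kde(x_query, X_train, h=0.3):
    """Parzen window estimate: p_hat(x) = (1/N) * sum_i kappa_h(x - x_i)."""
    z = (x_query - X_train) / h
    return np.mean(np.exp(-0.5 * z ** 2) / (h * np.sqrt(2 * np.pi)))

# toy usage: density near the cluster around 0 is much higher than at 2.0
X = np.array([0.0, 0.1, 0.2, 3.0])
print(kde(0.1, X), kde(2.0, X))
```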
  • 40. • Each data example xi contributes a Gaussian centered on itself, giving N Gaussians in total; to evaluate the density at a query point x, we measure how far x is from each of the N Gaussians and sum their contributions. (In the figure, an x that is close to three of the Gaussians gets a high value; increasing the variance of each Gaussian makes the estimate smoother)
  • 41. 14.7.4 Kernel regression • The prediction is ŷ(x) = Σ_i w_i(x) y_i, with weights w_i(x) = κ_h(x − x_i) / Σ_j κ_h(x − x_j) • Since w_i(x) is large when x is close to x_i, the prediction mostly follows the y values of the nearby training points
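A minimal 1D sketch of this weighted-average (Nadaraya-Watson style) prediction; the Gaussian kernel choice, bandwidth, and toy data are assumptions:

```python
import numpy as np

def kernel_regression(x_query, X_train, y_train, h=0.5):
    """y_hat(x) = sum_i w_i(x) y_i with
    w_i(x) = kappa_h(x - x_i) / sum_j kappa_h(x - x_j)."""
    k = np.exp(-0.5 * ((x_query - X_train) / h) ** 2)
    w = k / k.sum()
    return np.dot(w, y_train)

# toy usage
X = np.linspace(0, 6, 50)
y = np.sin(X)
print(kernel_regression(1.5, X, y))  # kernel-weighted average of nearby y values
```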
  • 42. 14.7.5 Locally weighted regression • If we define κh(x−xi) = κ(x,xi), we can rewrite the prediction made by kernel regression as a combination of the N functions κ(x∗, xi) • Note that κ(x, xi) need not be a smoothing kernel. If it is not, we no longer need the normalization term • Whereas linear regression uses the same w for all the data, LWR fits a separate weight vector β(x∗) for each test point x∗ and predicts y with it • The solution: for the current test point x∗, learn β(x∗) by minimizing the error of predicting all the training points with β(x∗), where the kernel weights the errors so that points close to x∗ count more, i.e., min_β Σ_i κ(x∗, x_i) [y_i − β^T x_i]²; the prediction is then β(x∗)^T x∗
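A rough 1D sketch of locally weighted regression under the formulation above, solved per query point via weighted least squares; the intercept column, Gaussian weighting, and toy data are assumptions:

```python
import numpy as np

def lwr_predict(x_query, X, y, h=0.3):
    """beta(x*) = argmin_b sum_i k(x*, x_i) * (y_i - b^T phi(x_i))^2,
    solved by weighted least squares; prediction is beta(x*)^T phi(x*)."""
    Phi = np.column_stack([np.ones_like(X), X])   # design matrix [1, x]
    w = np.exp(-0.5 * ((x_query - X) / h) ** 2)   # kernel weights around x*
    W = np.diag(w)
    beta = np.linalg.solve(Phi.T @ W @ Phi, Phi.T @ W @ y)
    return np.array([1.0, x_query]) @ beta

# toy usage
X = np.linspace(0, 6, 60)
y = np.sin(X)
print(lwr_predict(1.0, X, y))  # locally-fitted estimate at x* = 1.0
```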