To cite this version: Axel Aronio de Romblay. Symbolic Data Analysis: Regression on Gaussian symbols. Master course, France, 2015. HAL Id: cel-01620878 (https://hal.archives-ouvertes.fr/cel-01620878, submitted on 24 Oct 2017).
SYMBOLIC DATA ANALYSIS:
Regression on Gaussian symbols

Axel Aronio de Romblay
Telecom Paristech
axel.aronio-de-romblay@telecom-paristech.fr

December 2015
1 Abstract
In classical data analysis, data are single values. This is the case if you consider a dataset of $n$ patients whose age and height you know. But what if you record the blood pressure or the weight of each patient over the course of a day? Then, for each patient, you no longer have single-valued data but a set of values, since blood pressure and weight are not constant during the day.
Suppose now that you do not want to record blood pressure a thousand times for each patient and store it all in a database, because your memory space is limited. You then need to aggregate each set of values into symbols: intervals (lower and upper bounds only), box plots, histograms or even distributions (a distribution law with a mean and a variance).
Thus, the issue is to adapt classical statistical tools to symbolic data analysis. More precisely, this article proposes a method to fit a regression on Gaussian distributions. The paper is organized as follows: first, it presents the computation of the maximum likelihood estimator, and then it compares the new approach with the usual least squares regression.
KEY WORDS: Symbolic data, Big Data, Linear regression, Maximum likelihood, Gaussian symbols
2 Introduction
Symbolic data were first introduced by Diday (1987). A realization of a symbolic-valued variable may take a finite or an infinite set of values in $\mathbb{R}^p$. A random variable that takes a finite set of either quantitative or qualitative values is called a multi-valued variable. A variable that takes an infinite set of numerical values ranging from a low to a high value is called an interval-valued variable. A more complex symbolic-valued variable may have weights, probabilities, capacities, credibilities or even a distribution associated with its values. This class of variable is called modal-valued. Classical data are a special case of symbolic data where the internal distribution puts probability 1 on a single point value in $\mathbb{R}^p$.
Symbolic data analysis offers a solution to massive data problems, especially when the entities of
interest are groups or classes of observations. Recent advances in technology enable storage of extremely large
databases. Larger datasets provide more information about the subjects of interest. However, performing even
simple exploratory procedures on these datasets requires a lot of computing power. As a result, much research
effort in recent years has been steered toward finding more efficient methods to accommodate large datasets.
In many situations, characteristics of classes of observations are of higher interest to an analyst than those of
individual observations. In these cases, individual observations can be aggregated into classes of observations.
A series of papers, Diday (1995), Diday et al. (1996), Emilion (1997) and Diday and Emilion (1996, 1998), established a mathematical probabilistic framework for classes of symbolic data. These results were extended to Galois lattices in the symbolic data context in Diday and Emilion (1997, 1998, 2003) and Brito and Polaillon (2005). Beyond those fundamental results, a few recent papers (2010) have presented likelihood functions for symbolic data, but none of them came up with concrete results or closed-form solutions.
Thus, in this paper, we mainly concentrate on deriving an expression of the symbolic likelihood that is valid and usable. Furthermore, we focus on Gaussian symbolic data, which are very common. Some existing work considers uniform distributions instead, but struggles to obtain a workable solution.
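To fix ideas before the formal setting, here is a hypothetical sketch of how raw repeated measurements could be aggregated into Gaussian symbols $(\mu_i, \Sigma_i)$; the data, shapes and names are illustrative, not from the paper:

```python
import numpy as np

# Hypothetical illustration: aggregate raw repeated measurements into
# Gaussian symbols (mu_i, Sigma_i). Shapes and names are our own.
rng = np.random.default_rng(0)
readings = rng.normal(size=(100, 1000, 2))  # n=100 patients, 1000 readings each, p=2 features

mu = readings.mean(axis=1)                                     # (100, 2): one mean per patient
Sigma = np.array([np.cov(r, rowvar=False) for r in readings])  # (100, 2, 2): one covariance per patient

# Each patient is now stored as the symbol (mu[i], Sigma[i]) instead of 1000 raw records.
```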
3 General settings
Let $X$ be a dataset of $n$ observations and $y$ the corresponding target:
$$X = \begin{pmatrix} x_1^\top \\ \vdots \\ x_n^\top \end{pmatrix}, \quad x_i \in \mathbb{R}^p, \; i = 1, \dots, n \qquad\qquad y = \begin{pmatrix} y_1 \\ \vdots \\ y_n \end{pmatrix}, \quad y_i \in \mathbb{R}, \; i = 1, \dots, n$$
We suppose $y = X\beta + \epsilon$ with $\epsilon \sim \mathcal{N}(0, \sigma^2 I_n)$, and we want to estimate $\theta = (\beta, \sigma^2)$, where $\beta \in \mathbb{R}^p$ and $\sigma \in \mathbb{R}$.
4 Gaussian symbols
4.1 Definitions
Here, we consider the case where each observation $x_i$ is no longer a point in $\mathbb{R}^p$ but a $p$-dimensional normal distribution with known parameters $\mu_i$ and $\Sigma_i$:
$$y_i \mid x_i \sim \mathcal{N}(x_i^\top \beta, \sigma^2), \qquad x_i \sim \mathcal{N}(\mu_i, \Sigma_i)$$
Furthermore, we suppose that both samples $y_i \mid x_i$ and $x_i$ are i.i.d.
4.2 Symbolic likelihood
First, let us compute the density of $(y, X)$, i.e. of $(y, x_1, \dots, x_n)$:
$$p(y, X) = p(y \mid X)\, p(X)$$
where each term is given by:
$$p(X) = \prod_{i=1}^{n} (2\pi)^{-\frac{p}{2}} |\Sigma_i|^{-\frac{1}{2}} \exp\left( -\frac{1}{2} (x_i - \mu_i)^\top \Sigma_i^{-1} (x_i - \mu_i) \right) \quad (1)$$
$$p(y \mid X) = \prod_{i=1}^{n} (2\pi)^{-\frac{1}{2}} (\sigma^2)^{-\frac{1}{2}} \exp\left( -\frac{1}{2} \frac{(y_i - x_i^\top \beta)^2}{\sigma^2} \right) \quad (2)$$
Thus, we can write the joint density as:
$$p(y, X) = C \exp\left( -\frac{1}{2} \left[ \frac{\|y - X\beta\|^2}{\sigma^2} + \sum_{i=1}^{n} \tilde{x}_i^\top \Sigma_i^{-1} \tilde{x}_i \right] \right) \quad (3)$$
using the following notations: $C = (2\pi)^{-\frac{n(p+1)}{2}} \sigma^{-n} \prod_{i=1}^{n} |\Sigma_i|^{-\frac{1}{2}}$ and $\tilde{x}_i = x_i - \mu_i$.
Finally, we can write the log-likelihood:
$$\ell(\theta; \mu_i, \Sigma_i, y_i) = \log p(y) = \log \int_{x_1} \cdots \int_{x_n} p(y, X) \, dx_1 \cdots dx_n \quad (4)$$
Indeed, from (4), we can notice that $y$ follows a normal distribution with parameters $\mu_y$ and $\Sigma_y$, since it is a linear combination of independent Gaussian variables.
$$\mathbb{E}[y] = \mathbb{E}[X\beta] = M\beta \quad (5)$$
with: $M = \begin{pmatrix} \mu_1^\top \\ \vdots \\ \mu_n^\top \end{pmatrix}$
If we suppose that $X$ and $\epsilon$ are independent:
$$\Sigma_y = \mathrm{Var}(y) = \mathrm{Var}(X\beta) + \mathrm{Var}(\epsilon) = \mathbb{E}[X\beta (X\beta)^\top] - \mathbb{E}[X\beta]\, \mathbb{E}[X\beta]^\top + \sigma^2 I_n$$
where:
$$\mathbb{E}[X\beta (X\beta)^\top] = \big( \mathbb{E}[(x_i^\top \beta)(x_j^\top \beta)] \big)_{i,j} = \big( \beta^\top \mathbb{E}[x_i x_j^\top] \beta \big)_{i,j} = \big( \beta^\top (\mathrm{Cov}(x_i, x_j) + \mu_i \mu_j^\top) \beta \big)_{i,j} = \big( \beta^\top \delta_{i,j} \Sigma_i \beta \big)_{i,j} + M\beta\beta^\top M^\top$$
As a conclusion:
$$\Sigma_y = \big( \beta^\top \delta_{i,j} \Sigma_i \beta \big)_{i,j} + \sigma^2 I_n = \begin{pmatrix} \beta^\top \Sigma_1 \beta + \sigma^2 & & (0) \\ & \ddots & \\ (0) & & \beta^\top \Sigma_n \beta + \sigma^2 \end{pmatrix} \quad (6)$$
And the log-likelihood is given by:
$$\ell(\theta; \mu_i, \Sigma_i, y_i) = -\frac{n}{2} \log(2\pi) - \frac{1}{2} \sum_{i=1}^{n} \log(\beta^\top \Sigma_i \beta + \sigma^2) - \frac{1}{2} (y - M\beta)^\top \Sigma_y^{-1} (y - M\beta) \quad (7)$$
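Since $\Sigma_y$ is diagonal (see (6)), (7) can be evaluated without any matrix inversion. A minimal NumPy sketch (the function and argument names are ours, not the paper's):

```python
import numpy as np

def log_likelihood(beta, sigma2, mu, Sigma, y):
    """Evaluate the symbolic log-likelihood (7).

    mu:    (n, p) matrix M of symbol means
    Sigma: (n, p, p) array of symbol covariances
    y:     (n,) targets
    """
    n = len(y)
    v = np.einsum('j,ijk,k->i', beta, Sigma, beta) + sigma2  # v_i = beta' Sigma_i beta + sigma^2
    r = y - mu @ beta                                        # residuals y - M beta
    # Sigma_y is diagonal with entries v_i (see (6)), so its inverse is element-wise.
    return -0.5 * n * np.log(2 * np.pi) - 0.5 * np.sum(np.log(v)) - 0.5 * np.sum(r**2 / v)
```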
4.3 Maximum Likelihood Estimator
Let us see whether we can get a closed form for $\hat{\beta}_{MLE}$ by computing the partial derivative of the log-likelihood with respect to $\beta$.
We write $l_1(\beta) = \sum_{i=1}^{n} \log(\beta^\top \Sigma_i \beta + \sigma^2)$ and $l_2(\beta) = (y - M\beta)^\top \Sigma_y^{-1} (y - M\beta)$; since $\Sigma_y$ is diagonal, $l_2(\beta) = \sum_{i=1}^{n} \frac{(y_i - \langle \mu_i | \beta \rangle)^2}{\beta^\top \Sigma_i \beta + \sigma^2}$. Then:
$$\frac{\partial l_1}{\partial \beta}(\beta) = 2 \sum_{i=1}^{n} \frac{\Sigma_i \beta}{\beta^\top \Sigma_i \beta + \sigma^2}$$
$$\frac{\partial l_2}{\partial \beta}(\beta) = \sum_{i=1}^{n} \frac{\partial}{\partial \beta} \left[ \frac{(y_i - \langle \mu_i | \beta \rangle)^2}{\beta^\top \Sigma_i \beta + \sigma^2} \right] = \sum_{i=1}^{n} \left[ -\frac{2 (y_i - \langle \mu_i | \beta \rangle)^2 \, \Sigma_i \beta}{(\beta^\top \Sigma_i \beta + \sigma^2)^2} - \frac{2 (y_i - \langle \mu_i | \beta \rangle) \, \mu_i}{\beta^\top \Sigma_i \beta + \sigma^2} \right]$$
Now, since $\ell = -\frac{n}{2}\log(2\pi) - \frac{1}{2} l_1 - \frac{1}{2} l_2$, if we gather the two terms and factorize, we get:
$$\frac{\partial \ell}{\partial \beta}(\theta; \mu_i, \Sigma_i, y_i) = 0 \iff \sum_{i=1}^{n} \frac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left[ \tilde{y}_i \mu_i - \left( 1 - \frac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2} \right) \Sigma_i \beta \right] = 0 \quad (8)$$
where $\tilde{y}_i = y_i - \langle \mu_i | \beta \rangle = y_i - \hat{y}_i$.
In the general setting ($\Sigma_i \neq 0$), we cannot get a closed form for $\hat{\beta}_{MLE}$. Besides, we do not know $\sigma^2$. Let us recall the expression of the partial derivative of the log-likelihood with respect to $\beta$:
$$\frac{\partial \ell}{\partial \beta}(\theta; \mu_i, \Sigma_i, y_i) = \sum_{i=1}^{n} \frac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left[ \tilde{y}_i \mu_i - \left( 1 - \frac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2} \right) \Sigma_i \beta \right] \quad (9)$$
Now we compute the partial derivative of the log-likelihood with respect to $\sigma^2$:
$$\frac{\partial l_1}{\partial \sigma^2} = \sum_{i=1}^{n} \frac{1}{\beta^\top \Sigma_i \beta + \sigma^2}, \qquad \frac{\partial l_2}{\partial \sigma^2} = -\sum_{i=1}^{n} \frac{\tilde{y}_i^2}{(\beta^\top \Sigma_i \beta + \sigma^2)^2}$$
Thus:
$$\frac{\partial \ell}{\partial \sigma^2}(\theta; \mu_i, \Sigma_i, y_i) = \frac{1}{2} \sum_{i=1}^{n} \frac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left( \frac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2} - 1 \right) \quad (10)$$
To conclude, the gradient decomposes as a sum over observations:
$$\nabla \ell(\theta) = \sum_{i=1}^{n} \nabla \ell^{(i)}(\theta) \quad (11)$$
with:
$$\nabla \ell^{(i)}(\theta) = \begin{pmatrix} \dfrac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left[ \tilde{y}_i \mu_i - \left( 1 - \dfrac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2} \right) \Sigma_i \beta \right] \\[2ex] \dfrac{1}{2} \, \dfrac{1}{\beta^\top \Sigma_i \beta + \sigma^2} \left( \dfrac{\tilde{y}_i^2}{\beta^\top \Sigma_i \beta + \sigma^2} - 1 \right) \end{pmatrix} \quad (12)$$
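For completeness, a direct NumPy transcription of (11)-(12) might look as follows (again a sketch with our own names; it reuses the shapes of the `log_likelihood` sketch above):

```python
import numpy as np

def gradient(beta, sigma2, mu, Sigma, y):
    """Per-observation gradients (12), summed as in (11).

    Returns (grad_beta, grad_sigma2); names are ours.
    """
    v = np.einsum('j,ijk,k->i', beta, Sigma, beta) + sigma2  # beta' Sigma_i beta + sigma^2
    r = y - mu @ beta                                        # y_i - <mu_i | beta>
    Sb = Sigma @ beta                                        # (n, p): Sigma_i beta for each i
    grad_beta = ((r / v)[:, None] * mu
                 - ((1 - r**2 / v) / v)[:, None] * Sb).sum(axis=0)
    grad_sigma2 = 0.5 * np.sum((r**2 / v - 1) / v)
    return grad_beta, grad_sigma2
```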
4.4 Comments
• Finally, we do have an explicit formula for the gradient of the log-likelihood, which is a sum of $n$ vectors in a $(p+1)$-dimensional space. Thus, with (11) or (12), we can run a full-batch or a stochastic gradient descent algorithm to get an approximation of $\hat{\theta}_{MLE}$ (see the sketch after these comments).
• We also notice that we can easily recover the 'usual' least squares regression case where $\Sigma_i \to 0$ and $x_i \leftrightarrow \mu_i$:
$$\sum_{i=1}^{n} \frac{1}{\sigma^2} \tilde{y}_i \mu_i = 0 \iff (y - X\beta)^\top X = 0 \quad (13)$$
$$\frac{1}{2\sigma^2} \sum_{i=1}^{n} \left( \frac{\tilde{y}_i^2}{\sigma^2} - 1 \right) = 0 \iff -\frac{n}{2\sigma^2} + \frac{1}{2\sigma^4} (y - X\beta)^\top (y - X\beta) = 0 \quad (14)$$
• We supposed that there is no intercept. Indeed, we can get rid of it easily if we first apply the following transformation:
$$x_i \sim \mathcal{N}\left( \begin{pmatrix} \mu_i \\ 1 \end{pmatrix}, \begin{pmatrix} \Sigma_i & 0 \\ 0 & 0 \end{pmatrix} \right) \quad \text{and} \quad \beta \in \mathbb{R}^{p+1}$$
• After estimating $\beta$, we can deduce an unbiased prediction: $\hat{y}_j = \mu_j^\top \hat{\beta}$.
• We expect $\hat{\sigma}^2 \to +\infty$ when $\Sigma_i \to +\infty$. Indeed, if we plot the log-likelihood as a function of $\sigma^2$ for different values of $\Sigma_i$ (we suppose that the $\Sigma_i$ are equal for all $i$; see details in the Appendix), we get this result:

[Figure 1: log-likelihoods for different $\Sigma_i$]
• From (6), we can see that the variance of $y_i$ depends on $\Sigma_i$ but especially on $\beta$, which makes sense. Indeed, if $p$ is large and each coordinate of $\beta$ is large, the locations of the $y_i$ are not accurate. This can lead to an issue when fitting the regression, since we can have overlapping symbols.
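To make the first comment concrete, here is a minimal full-batch gradient ascent sketch built on the hypothetical `gradient` helper above; the step size, iteration count, variance floor and initial $\sigma^2$ are our own choices, not the paper's:

```python
import numpy as np

def fit_mle(mu, Sigma, y, lr=1e-4, n_iter=5000):
    # Warm start on the symbol means: beta0 = (M'M)^{-1} M'y
    # (the initialization recommended in the conclusion).
    beta = np.linalg.solve(mu.T @ mu, mu.T @ y)
    sigma2 = np.mean((y - mu @ beta) ** 2)          # heuristic initial sigma^2 (our choice)

    for _ in range(n_iter):
        g_beta, g_sigma2 = gradient(beta, sigma2, mu, Sigma, y)
        beta = beta + lr * g_beta                   # ascend the log-likelihood
        sigma2 = max(sigma2 + lr * g_sigma2, 1e-8)  # keep the variance positive
    return beta, sigma2
```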
4.5 Results
In this section, we simulate different datasets and compare the maximum likelihood estimation with the naive solution of fitting a regression on $(\mu_i, y_i)$. We recall the settings: we observe $(\mu_i, \Sigma_i, y_i)$ and we want to estimate $\beta$ (and $\sigma^2$). Let $m$, $\Sigma$, $s_1$ and $s_2$ denote the following parameters (a generation sketch follows the list):
• M = uniform(0, m, (n, p)): each coordinate of $M$ (see definition in section 4.2) is generated uniformly between 0 and $m$.
• Σ = [Σi for i in range(n)] = [np.diag(uniform(s1, s2, p)) for i in range(n)]: we suppose here that within a symbol the $p$ features are independent, and we generate a diagonal matrix for each $\Sigma_i$ whose diagonal coefficients are drawn uniformly in $[s_1, s_2]$.
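The exact simulation code is not given in the paper, but under the settings above the generation step could look like this (a sketch; the seed and the sampling of the latent $x_i$ are our own, following the model of section 4.1):

```python
import numpy as np

rng = np.random.default_rng(42)
n, p, m, s1, s2, sigma2 = 100, 5, 100, 0.1, 1.0, 0.5

M = rng.uniform(0, m, size=(n, p))                                     # symbol means
Sigma = np.array([np.diag(rng.uniform(s1, s2, p)) for _ in range(n)])  # diagonal symbol covariances
beta_true = rng.uniform(-2, 2, size=p)

# Draw the latent x_i inside each symbol, then the targets y_i (model of section 4.1).
X = np.array([rng.multivariate_normal(M[i], Sigma[i]) for i in range(n)])
y = X @ beta_true + rng.normal(0.0, np.sqrt(sigma2), size=n)
# Only (M, Sigma, y) are handed to the estimators; X stays hidden.
```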
4.5.1 Dataset 1 (reference dataset): $n=100$, $p=5$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=0.5$, $\beta \in [-2, 2]$

[Figure 2: Convergence to the ML estimator]
[Figure 3: MSE for each method]
4.5.2 Dataset 2: $n=100$, $p=5$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=0.5$, $\beta \in [-0.1, 0.1]$

[Figure 4: Convergence to the ML estimator]
[Figure 5: MSE for each method]
4.5.3 Dataset 3: $n=100$, $p=5$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=5$, $\beta \in [-2, 2]$

[Figure 6: Convergence to the ML estimator]
[Figure 7: MSE for each method]
4.5.4 Dataset 4: $n=1000$, $p=5$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=0.5$, $\beta \in [-2, 2]$

[Figure 8: Convergence to the ML estimator]
[Figure 9: MSE for each method]
4.5.5 Dataset 5: $n=100$, $p=5$, $m=100$, $[s_1, s_2]=[1, 5]$, $\sigma^2=0.5$, $\beta \in [-2, 2]$

[Figure 10: Convergence to the ML estimator]
[Figure 11: MSE for each method]
4.5.6 Dataset 6: $n=100$, $p=20$, $m=100$, $[s_1, s_2]=[0.1, 1]$, $\sigma^2=0.5$, $\beta \in [-2, 2]$

[Figure 12: Convergence to the ML estimator]
[Figure 13: MSE for each method]
5 Conclusion
From these results, we can see that the maximum likelihood estimator is consistently more accurate than fitting a naive least squares regression on the $(\mu_i, y_i)$. Nevertheless, the computation time is much longer with a simple gradient descent, even with an 'appropriate' learning rate and a 'good' initialization. To address this issue, it is highly recommended to start with $\beta_0 = (M^\top M)^{-1} M^\top y$. It then takes just a few seconds to get an 'accurate' approximation of $\hat{\beta}_{MLE}$. Besides, from the graphics, we can notice that:
• When the quantity $\beta^\top \Sigma_i \beta + \sigma^2$ is 'large' (cf. Figures 5, 7 and 11), the MLE is far more accurate than the naive estimator. Indeed, the naive estimator does not take into account the internal variation within a symbol and is therefore very biased. But in this case, we need to run more iterations.
• When $n$ is 'large' (cf. Figure 9): we need fewer iterations, but each iteration takes more time. The more symbols we have, the more accurate both estimators become. Here, it is preferable to run a stochastic gradient descent algorithm (a sketch follows these comments): it will definitely be faster, but a bit less accurate (unless you run some more iterations).
• When $p$ is 'large' (cf. Figure 13): the difference between the two estimators is negligible. The MLE approximation needs a lot of iterations, and subsampling is not a good solution to speed up convergence.
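As an illustration of the stochastic variant mentioned above, one could update $\theta$ on a single random symbol per step, reusing the per-symbol gradient (12); the step-size schedule, scaling and iteration count are our own assumptions:

```python
import numpy as np

def fit_mle_sgd(mu, Sigma, y, lr=1e-3, n_iter=50000, seed=0):
    rng = np.random.default_rng(seed)
    n = len(y)
    beta = np.linalg.solve(mu.T @ mu, mu.T @ y)      # same warm start as before
    sigma2 = np.mean((y - mu @ beta) ** 2)

    for t in range(n_iter):
        i = rng.integers(n)                          # pick one symbol at random
        v = beta @ Sigma[i] @ beta + sigma2          # beta' Sigma_i beta + sigma^2
        r = y[i] - mu[i] @ beta                      # y_i - <mu_i | beta>
        g_beta = (r * mu[i] - (1 - r**2 / v) * (Sigma[i] @ beta)) / v
        g_sigma2 = 0.5 * (r**2 / v - 1) / v
        step = lr / (1 + t / n_iter)                 # mild decay (our choice)
        beta = beta + n * step * g_beta              # rescale the single-symbol gradient
        sigma2 = max(sigma2 + n * step * g_sigma2, 1e-8)
    return beta, sigma2
```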
6 Acknowledgment
I would like to thank my supervisor Scott Sisson, who guided me through my research exchange and helped me. I would also like to thank the administration team at UNSW for their welcome and their helpful assistance in setting up my office.
7 References
[1] Billard, L. and Diday, E. (2000). Regression Analysis for Interval-Valued Data. In: Data Analysis, Classification, and Related Methods (eds. H.A.L. Kiers, J.-P. Rasson, P.J.F. Groenen and M. Schader). Springer-Verlag, Berlin, 369-374.
[2] Billard, L. and Diday, E. (2007). Symbolic Data Analysis: Conceptual Statistics and Data Mining. Wiley, Chichester.
[3] de Carvalho, F.A.T., Lima Neto, E.A. and Tenorio, C.P. (2004). A New Method to Fit a Linear Regression Model for Interval-Valued Data. In: Lecture Notes in Computer Science, KI 2004: Advances in Artificial Intelligence. Springer-Verlag, 295-306.
[4] Le-Rademacher, J. and Billard, L. (2010). Likelihood Functions and Some Maximum Likelihood Estimators for Symbolic Data. Journal of Statistical Planning and Inference, submitted.
[5] Lima Neto, E.A. and de Carvalho, F.A.T. (2010). Constrained Linear Regression Models for Symbolic Interval-Valued Variables. Computational Statistics & Data Analysis, 54(2), 333-347.
8 Appendix
(*) Figure 1 has been plotted with these parameters: $n=500$, $p=5$, $m=10$, $s_1 = s_2$, $\sigma^2 = 0.5$, $\beta \in [-2, 2]$