ICML 2016
Estimating Structured Vector Autoregressive Models
NEC
ICML 2016 reading session, 2016/7/21
Overview
• Regularized estimation of Vector Autoregressive (VAR) models.
• Time-series samples are dependent, so the standard i.i.d. analysis does not apply directly.
• Prior analyses of Lasso and Group-Lasso estimators largely rely on i.i.d.-style arguments.
• This paper gives guarantees for a general norm regularizer R(·) under the dependence induced by the VAR process.
• Error and sample-complexity bounds are stated in terms of properties of R(·).
…ranging from describing the behavior of economic and financial time series (Tsay, 2005) to modeling dynamical systems (Ljung, 1998) and estimating brain function connectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is defined as

    x_t = A_1 x_{t-1} + A_2 x_{t-2} + ... + A_d x_{t-d} + ε_t,    (2)

where x_t ∈ R^p denotes a multivariate time series, A_k ∈ R^{p×p}, k = 1, ..., d, are the parameters of the model, and d ≥ 1 is the order of the model. In this work, we assume that the noise ε_t ∈ R^p follows a Gaussian distribution, ε_t ~ N(0, Σ), with E(ε_t ε_t^T) = Σ and E(ε_t ε_{t+τ}^T) = 0 for τ ≠ 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix Σ is assumed to be positive definite with bounded largest eigenvalue, i.e., Λ_min(Σ) > 0 and Λ_max(Σ) < ∞. In the current context, the parameters {A_k} are additionally assumed to be structured, and the regularizer R(·) is chosen to match that structure.
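As a concrete illustration of the model in (2) (my own sketch, not from the paper; the dimensions and coefficients are made up), the following simulates a small stable VAR(d) process with Gaussian noise:

```python
import numpy as np

def simulate_var(A_list, Sigma, T, rng=None):
    """Simulate x_t = A_1 x_{t-1} + ... + A_d x_{t-d} + eps_t, eps_t ~ N(0, Sigma)."""
    rng = np.random.default_rng(rng)
    d = len(A_list)
    p = A_list[0].shape[0]
    x = np.zeros((T + d, p))                      # first d rows are zero initial conditions
    for t in range(d, T + d):
        x[t] = sum(A_list[k] @ x[t - k - 1] for k in range(d))
        x[t] += rng.multivariate_normal(np.zeros(p), Sigma)
    return x[d:]                                  # drop the burn-in initial conditions

# toy example: p = 3, d = 2, small (stable) coefficients
p = 3
A1 = 0.3 * np.eye(p)
A2 = 0.1 * np.eye(p)
X = simulate_var([A1, A2], Sigma=0.5 * np.eye(p), T=200, rng=0)
print(X.shape)   # (200, 3)
```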
    vec(Y) = (I_{p×p} ⊗ X) vec(B) + vec(E),  i.e.,  y = Zβ + ε,

where y ∈ R^{Np}, Z = (I_{p×p} ⊗ X) ∈ R^{Np×dp^2}, β ∈ R^{dp^2}, ε ∈ R^{Np}, and ⊗ is the Kronecker product. The covariance matrix of the noise ε is now E[εε^T] = Σ ⊗ I_{N×N}. Consequently, the regularized estimator takes the form

    β̂ = argmin_{β ∈ R^{dp^2}} (1/N) ‖y − Zβ‖_2^2 + λ_N R(β),    (4)

where R(β) can be any vector norm, separable along the rows of the matrices A_k. Specifically, if we denote β = [β_1^T ... β_p^T]^T and A_k(i,:) as the i-th row of matrix A_k for k = 1, ..., d, then our assumption is equivalent to

    R(β) = Σ_{i=1}^p R(β_i) = Σ_{i=1}^p R([A_1(i,:)^T ... A_d(i,:)^T]^T).    (5)
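As a minimal illustration of the row-wise separability assumption in (5) (a sketch of mine, not code from the paper), the following evaluates R(β) when each per-row norm is the Euclidean norm, i.e., a group-lasso penalty whose i-th group collects row i of A_1, ..., A_d:

```python
import numpy as np

def row_separable_norm(A_list, row_norm=np.linalg.norm):
    """R(beta) = sum_i R([A_1(i,:), ..., A_d(i,:)]) for a row-separable norm R."""
    p = A_list[0].shape[0]
    total = 0.0
    for i in range(p):
        # beta_i stacks the i-th row of every lag matrix A_k
        beta_i = np.concatenate([A_k[i, :] for A_k in A_list])
        total += row_norm(beta_i)
    return total

A1, A2 = 0.3 * np.eye(3), 0.1 * np.eye(3)
print(row_separable_norm([A1, A2]))                                       # L2 norm per row (group lasso)
print(row_separable_norm([A1, A2], row_norm=lambda v: np.abs(v).sum()))   # L1 per row = overall L1
```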
Related work
• Lasso-regularized VAR estimation with a dependence-aware analysis [Basu15].
• Regularized estimation with general structured norms, assuming i.i.d. samples [Banerjee14].
• This work: general norms R(·) for VAR models, where the samples are not i.i.d.
Main results (informal)
• The VAR samples are not i.i.d., yet analogous estimation guarantees can be established.
• Both the bound on the regularization parameter and the restricted eigenvalue condition hold with high probability.
Problem setup
• x_t: a p-dimensional vector following a VAR(d) model.
• Goal: estimate the transition matrices A_1, ..., A_d from one observed trajectory.
…estimation problems of the form:

    β̂ = argmin_{β ∈ R^q} (1/M) ‖y − Zβ‖_2^2 + λ_M R(β),    (1)

where {(y_i, z_i), i = 1, ..., M}, y_i ∈ R, z_i ∈ R^q, such that y = [y_1^T, ..., y_M^T]^T and Z = [z_1^T, ..., z_M^T]^T, is a training set of M independently and identically distributed (i.i.d.) samples, λ_M > 0 is a regularization parameter, and R(·) denotes a suitable norm (Tibshirani, …).
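For intuition, a minimal sketch (mine, not the paper's) of the generic estimator (1) with R(·) = L1 on i.i.d. synthetic data, using scikit-learn's Lasso, whose objective matches (1) up to constant factors:

```python
import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
M, q = 200, 50
beta_true = np.zeros(q)
beta_true[:5] = 1.0                                    # sparse ground truth
Z = rng.standard_normal((M, q))                        # i.i.d. design
y = Z @ beta_true + 0.1 * rng.standard_normal(M)

lam = 0.1 * np.sqrt(np.log(q) / M)                     # Lasso-style choice of lambda_M
beta_hat = Lasso(alpha=lam, fit_intercept=False).fit(Z, y).coef_
print(np.linalg.norm(beta_hat - beta_true))
```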
2.1. Regularized Estimator

To estimate the parameters of the VAR model, we transform the model in (2) into the form suitable for the regularized estimator (1). Let (x_0, x_1, ..., x_T) denote the T + 1 samples generated by the stable VAR model in (2); stacking them together we obtain

    [ x_d^T     ]   [ x_{d-1}^T  x_{d-2}^T ... x_0^T     ] [ A_1^T ]   [ ε_d^T     ]
    [ x_{d+1}^T ] = [ x_d^T      x_{d-1}^T ... x_1^T     ] [ A_2^T ] + [ ε_{d+1}^T ]
    [    ...    ]   [    ...        ...    ...   ...     ] [  ...  ]   [    ...    ]
    [ x_T^T     ]   [ x_{T-1}^T  x_{T-2}^T ... x_{T-d}^T ] [ A_d^T ]   [ ε_T^T     ]

which can also be compactly written as

    Y = XB + E,    (3)

where Y ∈ R^{N×p}, X ∈ R^{N×dp}, B ∈ R^{dp×p}, and E ∈ R^{N×p} for N = T − d + 1. Vectorizing (column-wise) each matrix in (3), we get

    vec(Y) = (I_{p×p} ⊗ X) vec(B) + vec(E),  i.e.,  y = Zβ + ε.
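A small sketch (illustrative only, reusing the simulate_var helper defined above) of the stacking in (3): each row of X collects the d most recent lags, and Y collects the corresponding next observations.

```python
import numpy as np

def build_design(X_series, d):
    """Stack an (n_obs, p) series into Y (N x p) and X (N x dp) as in Y = X B + E, N = n_obs - d."""
    n_obs, p = X_series.shape
    N = n_obs - d
    Y = X_series[d:]                                   # rows x_d^T, ..., x_T^T
    X = np.zeros((N, d * p))
    for t in range(N):
        # lags ordered x_{t+d-1}, x_{t+d-2}, ..., x_t (most recent lag first)
        lags = [X_series[t + d - k] for k in range(1, d + 1)]
        X[t] = np.concatenate(lags)
    return Y, X

# Y, X = build_design(simulate_var([A1, A2], 0.5 * np.eye(3), T=200, rng=0), d=2)
```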
Proof outline (details delegated to the supplement)
• To establish the lower bound on the regularization parameter λ_N, derive an upper bound R*[(1/N) Z^T ε] ≤ α for some α > 0, which gives the required relationship λ_N ≥ α ≥ R*[(1/N) Z^T ε]  (Theorem 3.3; Theorem 4.3 in the supplement).
• To establish the restricted eigenvalue condition, show inf_{Δ ∈ cone(Ω_E)} ‖(I_{p×p} ⊗ X)Δ‖_2 / ‖Δ‖_2 ≥ ν for some ν > 0  (Theorem 3.4; Theorem 4.4 in the supplement), building on Lemmas 4.1 and 4.2.
• These two high-probability results are then combined with Lemmas 3.1 and 3.2 to bound the estimation error.
Restricted Eigenvalue Condition
• Requires the design matrix Z to be well-conditioned along the relevant error directions.
• The condition is imposed only on a cone of the error set, as in Lasso-type analyses.
Theorem 3.3  Let Ω_R = {u ∈ R^dp | R(u) ≤ 1}, and define w(Ω_R) = E[sup_{u ∈ Ω_R} ⟨g, u⟩] to be the Gaussian width of the set Ω_R for g ~ N(0, I). For any ε_1 > 0 and ε_2 > 0, with probability at least 1 − c exp(−min(ε_2^2, ε_1) + log(p)) we can establish that

    R*[(1/N) Z^T ε] ≤ c_2 (1 + ε_2) w(Ω_R)/√N + c_1 (1 + ε_1) w^2(Ω_R)/N^2,

where c, c_1 and c_2 are positive constants.

To establish the restricted eigenvalue condition, we will show that inf_{Δ ∈ cone(Ω_E)} ‖(I_{p×p} ⊗ X)Δ‖_2 / ‖Δ‖_2 ≥ ν for some ν > 0, and then set √N κ = ν (with κ as in the RE condition (16)).

Theorem 3.4  Let Θ = cone(Ω_Ej) ∩ S^{dp−1}, where S^{dp−1} is the unit sphere. The error set Ω_Ej is defined as Ω_Ej = {δ_j ∈ R^dp | R(β*_j + δ_j) ≤ R(β*_j) + (1/r) R(δ_j)}, for r > 1, j = 1, ..., p, and Δ = [δ_1^T, ..., δ_p^T]^T, where each δ_j is of size dp × 1, and β* = [β*_1^T ... β*_p^T]^T, with β*_j ∈ R^dp. The set Ω_Ej is a part of the decomposition Ω_E = Ω_E1 × ··· × Ω_Ep due to the assumption on the row-wise separability of the norm R(·) in (5). Also define w(Θ) = E[sup_{u ∈ Θ} ⟨g, u⟩] to be the Gaussian width of the set Θ for g ~ N(0, I) and u ∈ R^dp. Then with probability at least 1 − c_1 exp(−c_2 η^2 + log(p)), for any η > 0,

    inf_{Δ ∈ cone(Ω_E)} ‖(I_{p×p} ⊗ X)Δ‖_2 / ‖Δ‖_2 ≥ ν,

where ν = √(NL) − 2√M − c·w(Θ) − η and c, c_1, c_2 are positive constants, and L, M are defined in (9) and (13).
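The Gaussian width w(Ω_R) driving these bounds can be approximated numerically. A rough Monte-Carlo sketch (my illustration, not from the paper) for the L1 unit ball, where sup_{‖u‖_1 ≤ 1} ⟨g, u⟩ = ‖g‖_∞ and hence w(Ω_R) is on the order of √(2 log(dp)):

```python
import numpy as np

def gaussian_width_l1_ball(dim, n_samples=2000, rng=0):
    """Monte-Carlo estimate of w({u : ||u||_1 <= 1}) = E[sup_{||u||_1<=1} <g, u>] = E[||g||_inf]."""
    rng = np.random.default_rng(rng)
    g = rng.standard_normal((n_samples, dim))
    return np.abs(g).max(axis=1).mean()

d, p = 2, 10
print(gaussian_width_l1_ball(d * p))          # empirically close to sqrt(2 * log(d * p))
print(np.sqrt(2 * np.log(d * p)))
```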
Restricted Error Set
• If λ_N is chosen large enough relative to the dual norm of the noise term, the error Δ = β̂ − β* is confined to a restricted set Ω_E.
• Ω_E is defined through the norm R (e.g., with r = 2), and the analysis works on its cone.
[Figure: β*, β* + Δ and the restricted error set Ω_E with its cone, sketched for the L1 norm.]
3. Regularized Estimation Guarantees

Denote by Δ = β̂ − β* the error between the solution of optimization problem (4) and β*, the true value of the parameter. The focus of our work is to determine conditions under which the optimization problem in (4) has guarantees on the accuracy of the obtained solution, i.e., the error term is bounded: ‖Δ‖_2 ≤ δ for some known δ. To establish such conditions, we utilize the framework of (Banerjee et al., 2014). Specifically, estimation error analysis is based on the following known results adapted to our settings. The first one characterizes the restricted error set Ω_E, where the error belongs.

Lemma 3.1  Assume that

    λ_N ≥ r R*[(1/N) Z^T ε],    (14)

for some constant r > 1, where R*[(1/N) Z^T ε] = sup_{R(U) ≤ 1} ⟨(1/N) Z^T ε, U⟩ is the dual form of the vector norm R(·), for U ∈ R^{dp^2} with U = [u_1^T, u_2^T, ..., u_p^T]^T and u_i ∈ R^dp. Then the error vector Δ belongs to the set

    Ω_E = { Δ ∈ R^{dp^2} | R(β* + Δ) ≤ R(β*) + (1/r) R(Δ) }.    (15)
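A tiny numerical sketch (illustration only) of the set membership in (15): given β*, a candidate error Δ, and r, check whether Δ lies in Ω_E for the L1 norm.

```python
import numpy as np

def in_restricted_error_set(beta_star, delta, r=2.0, R=lambda v: np.abs(v).sum()):
    """Check R(beta* + delta) <= R(beta*) + R(delta) / r, i.e., delta in Omega_E from (15)."""
    return R(beta_star + delta) <= R(beta_star) + R(delta) / r

beta_star = np.array([1.0, 0.0, 0.0, -2.0])
print(in_restricted_error_set(beta_star, np.array([0.1, 0.0, 0.0, 0.1])))   # small error on the support: True
print(in_restricted_error_set(beta_star, np.array([0.0, 3.0, 3.0, 0.0])))   # large error off the support: False
```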


Estimation error bound (deterministic part)
• Under the RE condition (16), the error norm ‖Δ‖_2 is bounded as in (17).
• The bound involves λ_N, the RE constant κ, and the norm compatibility constant Ψ(cone(Ω_E)).
The second condition in (Banerjee et al., 2014) establishes the upper bound on the estimation error.

Lemma 3.2  Assume that the restricted eigenvalue (RE) condition holds:

    ‖ZΔ‖_2 / ‖Δ‖_2 ≥ √N κ,    (16)

for Δ ∈ cone(Ω_E) and some constant κ > 0, where cone(Ω_E) is the cone of the error set. Then

    ‖Δ‖_2 ≤ ((1 + r)/r) (λ_N / κ^2) Ψ(cone(Ω_E)),    (17)

where Ψ(cone(Ω_E)) is a norm compatibility constant, defined as Ψ(cone(Ω_E)) = sup_{U ∈ cone(Ω_E)} R(U) / ‖U‖_2.

Note that the above error bound is deterministic, i.e., if (14) and (16) hold, then the error satisfies the upper bound in (17). However, the results are defined in terms of the quantities involving Z and ε, which are random. Therefore, in the following we establish high probability bounds on the regularization parameter in (14) and the RE condition in (16).
High-probability bounds and the Lasso case
• The random quantities in (14) and (16) are controlled with high probability (Theorems 3.3 and 3.4).
• Combining them yields a bound on the error norm ‖Δ‖_2.
• Specializing R(·) to the L1 norm recovers Lasso-type rates.
Analyzing Theorems 3.3 and 3.4, we can interpret the established results as follows. As the size and dimensionality N, p and d of the problem increase, we emphasize the scale of the results and use the order notations to denote the constants. Select a number of samples at least N ≥ O(w^2(Θ)) and let the regularization parameter satisfy λ_N ≥ O(w(Ω_R)/√N + w^2(Ω_R)/N^2). With high probability the restricted eigenvalue condition ‖ZΔ‖_2/‖Δ‖_2 ≥ √N κ for Δ ∈ cone(Ω_E) then holds, so that κ = O(1) is a positive constant. Moreover, the norm of the estimation error in optimization problem (4) is bounded by ‖Δ‖_2 ≤ O(w(Ω_R)/√N + w^2(Ω_R)/N^2) Ψ(cone(Ω_Ej)). Note that the norm compatibility constant Ψ(cone(Ω_Ej)) is assumed to be the same for all j = 1, ..., p, which follows from our assumption in (5).

Consider now Theorem 3.3 and the bound on the regularization parameter λ_N ≥ O(w(Ω_R)/√N + w^2(Ω_R)/N^2). As the dimensionality of the problem p and d grows and the number of samples N increases, the first term w(Ω_R)/√N will dominate the second one w^2(Ω_R)/N^2. This can be seen by computing the N for which the two terms become equal: w(Ω_R)/√N = w^2(Ω_R)/N^2, which happens at N = w^{2/3}(Ω_R) < w(Ω_R). Therefore, we can rewrite our results as follows: once the restricted eigenvalue condition holds and λ_N ≥ O(w(Ω_R)/√N), the error norm is upper-bounded by ‖Δ‖_2 ≤ O(w(Ω_R)/√N) Ψ(cone(Ω_Ej)).

3.3. Special Cases

While the presented results are valid for any norm R(·), separable along the rows of A_k, it is instructive to specialize our analysis to a few popular regularization choices, such as L1 and Group Lasso, Sparse Group Lasso and OWL norms.

3.3.1. Lasso

To establish results for the L1 norm, we assume that the parameter β* is s-sparse, where s represents the largest number of non-zero elements in the combined i-th row of A_1, ..., A_d, over i = 1, ..., p. Since L1 is decomposable, it can be shown that Ψ(cone(Ω_Ej)) ≤ 4√s. Next, since Ω_R = {u ∈ R^dp | R(u) ≤ 1}, then using Lemma 3 in (Banerjee et al., 2014) and the Gaussian width results in (Chandrasekaran et al., 2012), we can establish that w(Ω_R) ≤ O(√log(dp)). Therefore, based on Theorem 4.3 and the discussion at the end of Section 3.2, the bound on the regularization parameter takes the form λ_N ≥ O(√(log(dp)/N)). Hence, the estimation error is bounded by ‖Δ‖_2 ≤ O(√(s log(dp)/N)), as long as N > O(log(dp)).

A similar argument applies to the OWL norm, whose bound involves the sequence of weights and the sorted sequence of absolute values of β; using the Gaussian width and norm compatibility bounds of (Chen & Banerjee, 2015), the resulting error bound is similar to the one obtained for the Lasso.
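To make the Lasso special case concrete, here is an illustrative sketch (my own, not the authors' code) that estimates a sparse VAR by solving the row-wise L1-regularized least squares in (4) with scikit-learn, reusing the build_design helper above; note that scikit-learn's Lasso objective differs from (4) by constant factors, and the choice of λ_N only follows the √(log(dp)/N) scaling suggested by the theory.

```python
import numpy as np
from sklearn.linear_model import Lasso

def estimate_var_lasso(X_series, d, lam=None):
    """Row-wise Lasso estimate of B in Y = X B + E; returns B_hat of shape (d*p, p)."""
    Y, X = build_design(X_series, d)
    N, p = Y.shape
    if lam is None:
        lam = np.sqrt(np.log(d * p) / N)          # scaling suggested by the theory (constants arbitrary)
    B_hat = np.zeros((d * p, p))
    for j in range(p):                            # problem (4) with L1 decouples over output coordinates
        model = Lasso(alpha=lam, fit_intercept=False)
        model.fit(X, Y[:, j])
        B_hat[:, j] = model.coef_
    return B_hat

# B_hat = estimate_var_lasso(simulate_var([A1, A2], 0.5 * np.eye(3), T=400, rng=1), d=2)
```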


Assumption on R(·): row-wise separability
• R(·) can be any vector norm that is separable along the rows of the A_k matrices.
• Different rows may even use different norms; within a row the norm need not be decomposable.
[Figure: the i-th rows of A_1, A_2, A_3, A_4 are collected into a single group β_i.]

To reduce clutter and without loss of generality, we assume the norm R(·) to be the same for each row i. Since the analysis decouples across rows, it is straightforward to extend our analysis to the case when a different norm is used for each row of A_k, e.g., L1 for row one, L2 for row two, K-support norm (Argyriou et al., 2012) for row three, etc. Observe that within a row, the norm need not be decomposable across columns.
Theorem 3.4 and sample complexity
• Theorem 3.4 establishes the restricted eigenvalue condition with high probability.
• It implies a lower bound on the number of samples, N ≥ O(w^2(Θ)).
3.2. Discussion

From Theorem 3.4, we can choose η = (1/2)√(NL) and set √N κ = √(NL) − 2√M − c·w(Θ) − η; since √N κ > 0 must be satisfied, we can establish a lower bound on the number of samples N:

    √N > (2√M + c·w(Θ)) / (√L / 2) = O(w(Θ)).    (18)

Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if Λ_min(A) and Λ_min(Σ) are larger and Λ_max(A) and Λ_max(Σ) are smaller. In turn, this means that the matrices A(ω) defined in (10) and (12) must be well conditioned and the VAR process stable, with eigenvalues well inside the unit circle (see Section 2.2). Alternatively, we can also understand (18) as showing that large values of M and small values of L indicate stronger dependency in the data, thus requiring more samples for the RE conditions to hold with high probability.
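A back-of-the-envelope sketch (illustration only; the constant c, and the values of L, M and w(Θ), are placeholders) of how the sample-size bound (18) would be evaluated once those quantities are known:

```python
import numpy as np

def min_samples(L, M, width, c=1.0):
    """Smallest N with sqrt(N) > (2*sqrt(M) + c*width) / (sqrt(L)/2), as in (18)."""
    return int(np.ceil(((2 * np.sqrt(M) + c * width) / (np.sqrt(L) / 2)) ** 2))

# e.g., with hypothetical L = 0.4, M = 2.5 and width w(Theta) ~ sqrt(2*log(d*p)):
print(min_samples(L=0.4, M=2.5, width=np.sqrt(2 * np.log(20))))
```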
Interpretation
• The required sample size scales with the squared Gaussian width w^2(Θ).
• Better conditioned, more weakly dependent processes (large L, small M) need fewer samples.
Controlling the dependence: spectral analysis
• The correlations in X are controlled through the spectral density of the VAR process.
• This yields the constants L (lower bound on Λ_min(C_X)) and M (upper bound on Λ_max(Q_a)) used in Theorem 3.4.
sion
m 4.4, we can choose ⌘ = 1
2
p
NL and set
p
N =
p
NL 2
p
M cw(⇥)
ust be satisfied, we can establish a lower bound on the number of samples N
p
N >
2
p
M + cw(⇥)
p
L/2
= O(w(⇥)).
is bound and using (9) and (13), we can conclude that the number of samples n
eigenvalue condition is smaller if ⇤min(A) and ⇤min(⌃) are larger and ⇤max(A
n turn, this means that matrices A and A in (10) and (12) must be well cond
is stable, with eigenvalues well inside the unit circle (see Section 3.2). Altern
nd (18) as showing that large values of M and small values of L indicate strong
us requiring more samples for the RE conditions to hold with high probability.
eorems 4.3 and 4.4 we can interpret the established results as follows. As the s
and d of the problem increase, we emphasize the scale of the results and use the
constants. Select a number of samples at least N O(w2(⇥)) and let the reg
y N O
⇣
w(⌦R)
p
N
+ w2(⌦R)
N2
⌘
. With high probability then the restricted eigen
N for 2 cone(⌦E) holds, so that  = O(1) is a positive constant. Moreover,
VAR
⇢X(!) = I Ae i! 1
⌃E
h
I Ae i! 1
i⇤
,
where ⌃E is the covariance matrix of a noise vector and A
are as defined in expression (6). Thus, an upper bound on
CU can be obtained as ⇤max[CU ]  ⇤max(⌃)
⇤min(A) , where we
defined ⇤min(A) = min
!2[0,2⇡]
⇤min(A(!)) for
A(!) = I AT
ei!
I Ae i!
. (12)
Referring back to covariance matrix Qa in (11), we get
⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M. (13)
We note that for a general VAR model, there might not exist
closed-form expressions for ⇤max(A) and ⇤min(A). How-
ever, for some special cases there are results establishing
the bounds on these quantities (e.g., see Proposition 2.2 in
(Basu & Michailidis, 2015)).
inf
1jp
!2[0,2⇡]
⇤j[⇢(!)]  ⇤k[CX]
1kdp
 sup
1jp
!2[0,2⇡]
⇤j[⇢(!)], (8)
k[·] denotes the k-th eigenvalue of a matrix and for i =
p
1, ⇢(!) =
P1
h= 1 (h)e hi!, ! 2
s the spectral density, i.e., a Fourier transform of the autocovariance matrix (h). The advantage of
spectral density is that it has a closed form expression (see Section 9.4 of [24])
⇢(!)= I
dX
k=1
Ake ki!
! 1
⌃
2
4 I
dX
k=1
Ake ki!
! 1
3
5
⇤
,
denotes a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound
⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9)
e defined ⇤max(A) = max
!2[0,2⇡]
⇤max(A(!)) for
A(!)= I
dX
k=1
AT
k eki!
!
I
dX
k=1
Ake ki!
!
, (10)
endix B.1 for additional details.
ishing high probability bounds we will also need information about a vector q = Xa 2 RN for
Rdp, kak2 = 1. Since each element XT
i,:a ⇠ N(0, aT CXa), it follows that q ⇠ N(0, Qa) with a
ce matrix Qa 2 RN⇥N . It can be shown (see Appendix B.3 for more details) that Qa can be written
4 .. .. .. ..
(d 1)T (d 2)T . . . (0)
5
E(xtxT
t+h) 2 Rp⇥p. It turns out that since CX is a block-Toeplitz matrix, its eigenvalues can
(see [13])
inf
1jp
!2[0,2⇡]
⇤j[⇢(!)]  ⇤k[CX]
1kdp
 sup
1jp
!2[0,2⇡]
⇤j[⇢(!)], (8)
notes the k-th eigenvalue of a matrix and for i =
p
1, ⇢(!) =
P1
h= 1 (h)e hi!, ! 2
pectral density, i.e., a Fourier transform of the autocovariance matrix (h). The advantage of
al density is that it has a closed form expression (see Section 9.4 of [24])
⇢(!)= I
dX
k=1
Ake ki!
! 1
⌃
2
4 I
dX
k=1
Ake ki!
! 1
3
5
⇤
,
es a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound
⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9)
ned ⇤max(A) = max
!2[0,2⇡]
⇤max(A(!)) for
A(!)= I
dX
k=1
AT
k eki!
!
I
dX
k=1
Ake ki!
!
, (10)
B.1 for additional details.
high probability bounds we will also need information about a vector q = Xa 2 RN for
kak2 = 1. Since each element XT
i,:a ⇠ N(0, aT CXa), it follows that q ⇠ N(0, Qa) with a
rix Qa 2 RN⇥N . It can be shown (see Appendix B.3 for more details) that Qa can be written
Qa = (I ⌦ aT
)CU (I ⌦ a), (11)
inf
1ldp
!2[0,2⇡]
⇤l[⇢X(!)]  ⇤k[CU ]
1kNdp
 sup
1ldp
!2[0,2⇡]
⇤l[⇢X(!
The closed form expression of spectral density is
⇢X(!) = I Ae i! 1
⌃E
h
I Ae i! 1
where ⌃E is the covariance matrix of a noise vector and A are as defined in
bound on CU can be obtained as ⇤max[CU ]  ⇤max(⌃)
⇤min(A) , where we defined ⇤
for
A(!) = I AT
ei!
I Ae i!
.
Referring back to covariance matrix Qa in (11), we get
⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M.
We note that for a general VAR model, there might not exist closed-form
⇤min(A). However, for some special cases there are results establishing
(e.g., see Proposition 2.2 in [5]).
3. Regularized Estimation

Denote by Δ = β̂ − β* the error between the solution of the optimization problem (4) and the true parameter β*. The focus of our work is to determine the conditions under which the optimization problem (4) provides guarantees on the accuracy of the obtained solution, i.e., the error norm ‖Δ‖₂ is bounded. To establish such conditions, we utilize the framework of (Banerjee et al., 2014). Specifically, the estimation error analysis is based on the following known results, adapted to our setting. The first one characterizes the restricted error set to which the error Δ belongs.

Lemma 3.1 Assume that λ_N ≥ r·R*[(1/N) Z^T ε] for some r > 1. Then the error Δ = β̂ − β* belongs to the set Ω_E = {Δ ∈ R^{dp²} : R(β* + Δ) ≤ R(β*) + (1/r)·R(Δ)}.
  R(\beta) = \sum_{i=1}^{p} R(\beta_i) = \sum_{i=1}^{p} R\Big(\big[A_1(i,:)^T \ \dots \ A_d(i,:)^T\big]^T\Big).   (5)

To reduce clutter and without loss of generality, we assume the norm R(·) to be the same for each row i. Since the analysis decouples across rows, it is straightforward to extend it to the case when a different norm is used for each row of the A_k, e.g., L1 for row one, L2 for row two, K-support norm for row three, etc. Observe that within a row, the norm need not be decomposable across columns.

The main difference between the estimation problem in (1) and the formulation in (4) is the dependence between the samples (x_0, x_1, …, x_T), violating the i.i.d. assumption on the samples {(y_i, z_i), i = 1, …, Np}. In particular, this leads to correlations between the rows and columns of the matrix X (and consequently of Z). To deal with such dependencies, following [5], we utilize the spectral representation of the autocovariance of VAR models to control the dependencies in the matrix X.
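As a concrete illustration of the row-wise separability in (5), the sketch below evaluates R(β) by stacking, for each i, the i-th rows of A_1, …, A_d and applying a per-row norm; the ℓ1 and ℓ2 choices shown correspond to Lasso- and Group-Lasso-style regularizers, and the helper name and random example are ours, not the paper's.

```python
import numpy as np

def row_separable_norm(A_list, row_norm):
    """R(beta) = sum_i R([A_1(i,:), ..., A_d(i,:)]) as in (5):
    the regularizer acts on the i-th rows of all lag matrices jointly."""
    p = A_list[0].shape[0]
    return sum(row_norm(np.concatenate([A[i, :] for A in A_list])) for i in range(p))

rng = np.random.default_rng(0)
A_list = [rng.standard_normal((5, 5)) * 0.1 for _ in range(2)]      # d = 2, p = 5
R_lasso = row_separable_norm(A_list, lambda v: np.abs(v).sum())     # l1 within each row
R_group = row_separable_norm(A_list, np.linalg.norm)                # l2 within each row (group)
```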
3.2 Stability of VAR Model

Since VAR models are (linear) dynamical systems, for the analysis we need to establish the conditions under which the VAR model (2) is stable, i.e., the time-series process does not diverge over time. For understanding stability, it is convenient to rewrite the VAR model of order d in (2) as an equivalent VAR model of order 1:

  \begin{bmatrix} x_t \\ x_{t-1} \\ \vdots \\ x_{t-(d-1)} \end{bmatrix}
  =
  \underbrace{\begin{bmatrix}
  A_1 & A_2 & \cdots & A_{d-1} & A_d \\
  I & 0 & \cdots & 0 & 0 \\
  0 & I & \cdots & 0 & 0 \\
  \vdots & \vdots & \ddots & \vdots & \vdots \\
  0 & 0 & \cdots & I & 0
  \end{bmatrix}}_{\mathbf{A}}
  \begin{bmatrix} x_{t-1} \\ x_{t-2} \\ \vdots \\ x_{t-d} \end{bmatrix}
  +
  \begin{bmatrix} \epsilon_t \\ 0 \\ \vdots \\ 0 \end{bmatrix},   (6)

where 𝐀 ∈ R^{dp×dp}. Therefore, the VAR process is stable if all the eigenvalues of 𝐀, i.e., the solutions of det(λI_{dp×dp} − 𝐀) = 0 for λ ∈ C, satisfy |λ| < 1. Equivalently, expressed in terms of the original parameters A_k, stability holds if all the roots of det(I − Σ_{k=1}^d A_k λ^{−k}) = 0 satisfy |λ| < 1 (see Appendix A for more details).
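The stability condition is straightforward to check numerically: build the companion matrix 𝐀 from (6) and verify that its spectral radius is strictly below one; a stable model can then be simulated to produce the samples (x_0, …, x_T) used in (3). A minimal sketch follows; the rescaling loop that shrinks the lag matrices until stability holds is purely for illustration.

```python
import numpy as np

def companion(A_list):
    """Companion matrix of a VAR(d), eq. (6): first block row [A_1 ... A_d], identity blocks below."""
    d, p = len(A_list), A_list[0].shape[0]
    A = np.zeros((d * p, d * p))
    A[:p, :] = np.hstack(A_list)
    A[p:, :(d - 1) * p] = np.eye((d - 1) * p)
    return A

def is_stable(A_list):
    """Stable iff every eigenvalue of the companion matrix lies strictly inside the unit circle."""
    return np.max(np.abs(np.linalg.eigvals(companion(A_list)))) < 1.0

def simulate_var(A_list, Sigma, T, rng):
    """Generate x_0, ..., x_T from the VAR(d) model (2) with noise eps_t ~ N(0, Sigma)."""
    d, p = len(A_list), A_list[0].shape[0]
    x = np.zeros((T + 1, p))
    chol = np.linalg.cholesky(Sigma)
    for t in range(d, T + 1):
        x[t] = sum(A_list[k] @ x[t - 1 - k] for k in range(d)) + chol @ rng.standard_normal(p)
    return x

rng = np.random.default_rng(0)
p, d = 10, 2
A_list = [rng.standard_normal((p, p)) * 0.15 for _ in range(d)]
while not is_stable(A_list):             # shrink the lag matrices until stable (illustration only)
    A_list = [0.9 * A for A in A_list]
x = simulate_var(A_list, np.eye(p), T=500, rng=rng)
```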


Theorem 3.4 Let Θ = cone(Ω_{E_j}) ∩ S^{dp−1}, where S^{dp−1} is the unit sphere. The error set Ω_{E_j} is defined as Ω_{E_j} = {Δ_j ∈ R^{dp} : R(β*_j + Δ_j) ≤ R(β*_j) + (1/r)·R(Δ_j)}, for r > 1, j = 1, …, p, and Δ = [Δ_1^T, …, Δ_p^T]^T, where each Δ_j is of size dp × 1, and β* = [β*_1^T … β*_p^T]^T, for β*_j ∈ R^{dp}. The set Ω_{E_j} is a part of the decomposition Ω_E = Ω_{E_1} × ⋯ × Ω_{E_p}, due to the assumption on the row-wise separability of the norm R(·) in (5). Also define w(Θ) = E[sup_{u∈Θ} ⟨g, u⟩] to be the Gaussian width of the set Θ for g ~ N(0, I) and u ∈ R^{dp}. Then with probability at least 1 − c₁ exp(−c₂η² + log(p)), for any η > 0,

  \inf_{\Delta \in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p\times p} \otimes X)\Delta\|_2}{\|\Delta\|_2} \;\ge\; \nu, \qquad \nu = \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \eta,

where c, c₁, c₂ are positive constants and L, M are defined in (9) and (13).
Discussion

From Theorem 3.4, we can choose η = (1/2)√(NL) and set κ√N = √(NL) − 2√M − c·w(Θ) − η; since κ√N > 0 must be satisfied, we can establish a lower bound on the number of samples N:

  \sqrt{N} \;>\; \frac{2\sqrt{M} + c\,w(\Theta)}{\sqrt{L}/2} \;=\; O(w(\Theta)).   (18)

Examining this bound and using (9) and (13), we can conclude that the number of samples needed to satisfy the restricted eigenvalue condition is smaller if Λ_min(A) and Λ_min(Σ) are larger and Λ_max(𝒜) and Λ_max(Σ) are smaller. In turn, this means that the matrices 𝒜(ω) and A(ω) in (10) and (12) must be well conditioned and the VAR process stable, with eigenvalues well inside the unit circle (see Section 3.2). Alternatively, we can understand (18) as showing that large values of M and small values of L indicate stronger dependency in the data, thus requiring more samples for the RE conditions to hold with high probability.
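The bound (18) turns into a concrete sample-size requirement once L, M, c and w(Θ) are available, e.g., L and M from the spectral-density sweep sketched earlier and w(Θ) from a Gaussian-width estimate or a known bound. The numbers below are placeholders chosen purely for illustration.

```python
import numpy as np

# Illustrative values only: L, M as in (9)/(13), c from Theorem 3.4, w_Theta a Gaussian-width estimate.
L, M, c, w_Theta = 0.5, 4.0, 1.0, 6.0

# From (18): sqrt(N) > (2*sqrt(M) + c*w(Theta)) / (sqrt(L)/2), i.e. N = O(w(Theta)^2).
N_min = ((2 * np.sqrt(M) + c * w_Theta) / (np.sqrt(L) / 2)) ** 2
print(f"RE condition requires roughly N > {N_min:.0f} samples")
```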




Here Ψ(cone(Ω_E)) denotes the norm compatibility constant, defined as Ψ(cone(Ω_E)) = sup_{U∈cone(Ω_E)} R(U)/‖U‖₂. The error bound obtained from Lemmas 3.1 and 3.2 is deterministic: if the condition (14) on λ_N and the RE condition (16) hold, then the error satisfies the bound in (17). However, these results are stated in terms of quantities involving Z and ε, which are random; therefore, in the following we establish high probability bounds on the regularization parameter in (14) and on the RE condition in (16).

High Probability Bounds

In this section we present the main results of our work, followed by a discussion of their properties and special cases based on the popular Lasso and Group Lasso regularization norms. In Section 3.4 we present the main ideas of our proof technique, with all the details delegated to the supplement (Appendices C and D).

(Proof-structure diagram: Lem. 4.1, Lem. 4.2, Thm. 4.3, Thm. 4.4 — the longer version's numbering of Lemmas 3.1, 3.2 and Theorems 3.3, 3.4.)

To establish a lower bound on the regularization parameter λ_N, we derive an upper bound R*[(1/N) Z^T ε] ≤ α for some α > 0, which will establish the required relationship λ_N ≥ α ≥ R*[(1/N) Z^T ε].

Theorem 3.3 Let Ω_R = {u ∈ R^{dp} : R(u) ≤ 1}, and define w(Ω_R) = E[sup_{u∈Ω_R} ⟨g, u⟩] to be the Gaussian width of the set Ω_R for g ~ N(0, I). For any ε₁ > 0 and ε₂ > 0, with probability at least 1 − c·exp(−min(ε₂², ε₁) + log(p)) we can establish that

  R^*\Big[\tfrac{1}{N} Z^T \epsilon\Big] \;\le\; c_2(1+\epsilon_2)\,\frac{w(\Omega_R)}{\sqrt{N}} \;+\; c_1(1+\epsilon_1)\,\frac{w^2(\Omega_R)}{N^2},

where c, c₁ and c₂ are positive constants.

To establish the restricted eigenvalue condition, we show that inf_{Δ∈cone(Ω_E)} ‖(I_{p×p}⊗X)Δ‖₂/‖Δ‖₂ ≥ ν for some ν > 0 and then set κ√N = ν; this is the content of Theorem 3.4 above.
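For a concrete norm, w(Ω_R) can be estimated by Monte Carlo directly from its definition, since sup_{u: R(u)≤1} ⟨g, u⟩ is just the dual norm R*(g). The sketch below does this for the Lasso case, where R = ℓ1 and R*(g) = ‖g‖_∞, and compares the estimate with the familiar √(2 log(dp)) bound; the sample count n_mc and the dimension are arbitrary choices.

```python
import numpy as np

def gaussian_width_l1_ball(dim, n_mc=5000, seed=0):
    """Monte Carlo estimate of w(Omega_R) = E[sup_{||u||_1 <= 1} <g, u>] = E[||g||_inf], g ~ N(0, I)."""
    rng = np.random.default_rng(seed)
    g = rng.standard_normal((n_mc, dim))
    return np.abs(g).max(axis=1).mean()

dp = 1000                                   # dimension of one row block, d*p
w_est = gaussian_width_l1_ball(dp)
print(w_est, np.sqrt(2 * np.log(dp)))       # estimate vs. the usual sqrt(2 log(dp)) upper bound
# Theorem 3.3 then suggests scaling the regularization parameter as
# lambda_N ~ w(Omega_R)/sqrt(N) + w(Omega_R)^2/N^2, up to constants.
```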
• Synthetic experiments (d = 1): estimation error versus sample size N for different p, for sparse and group-sparse VAR models.
• Dependence of the regularization parameter λ_N on N and on the problem size.
(Figure 1 plots, panels (a)–(h): ‖Δ‖₂ against N and against N/[s log(pd)] or N/[s(m + log K)]; λ_N/λ_max against N and against K.)
Figure 1: Results for estimating parameters of a stable first order sparse VAR (top row) and group sparse VAR (bottom row). Problem dimensions: p ∈ [10, 600], N ∈ [10, 5000], λ_N/λ_max ∈ [0, 1], K ∈ [2, 60] and d = 1. Figures (a) and (e) show the dependency of errors on sample size for different p; in Figure (b) the N is scaled by (s log p) and plotted against ‖Δ‖₂ to show that errors scale as (s log p)/N; (f) is similar to (b) but for the group sparse VAR; in (c) and (g) we show the dependency of λ_N on p (or the number of groups K in (g)) for a fixed sample size N; finally, Figures (d) and (h) display the dependency of λ_N on N for fixed p.
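The synthetic study can be reproduced in outline: simulate a stable sparse VAR(1), form (X, Y) as in (3), and solve (4) row by row — for the ℓ1 penalty each row decouples into an ordinary Lasso problem. The sketch below uses scikit-learn's Lasso; the sparsity level, coefficient magnitude, λ_N schedule and dimensions are illustrative choices, not the exact settings behind Figure 1.

```python
import numpy as np
from sklearn.linear_model import Lasso

def fit_var_lasso(x, d, lam):
    """Row-wise Lasso estimate of B in Y = X B + E built from samples x_0, ..., x_T (eq. (3))."""
    T, p = x.shape[0] - 1, x.shape[1]
    X = np.hstack([x[d - k - 1:T - k] for k in range(d)])     # N x dp, lags 1..d
    Y = x[d:T + 1]                                            # N x p
    B_hat = np.zeros((d * p, p))
    for i in range(p):   # the estimator decouples across the p output coordinates
        B_hat[:, i] = Lasso(alpha=lam, fit_intercept=False, max_iter=5000).fit(X, Y[:, i]).coef_
    return B_hat

rng = np.random.default_rng(1)
p, s = 30, 3
A1 = np.zeros((p, p))
for i in range(p):                                   # s non-zero coefficients per row
    A1[i, rng.choice(p, size=s, replace=False)] = 0.4
while np.max(np.abs(np.linalg.eigvals(A1))) >= 1:    # enforce stability (for d = 1 the companion is A1)
    A1 *= 0.9
for N in [100, 400, 1600]:
    x = np.zeros((N + 1, p))
    for t in range(1, N + 1):                        # simulate the stable VAR(1) with unit noise
        x[t] = A1 @ x[t - 1] + rng.standard_normal(p)
    B_hat = fit_var_lasso(x, d=1, lam=0.1 * np.sqrt(np.log(p) / N))
    print(N, np.linalg.norm(B_hat - A1.T))           # estimation error, shrinking with N
```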
• Real-data experiment: a first-order VAR (d = 1) fit to the aviation dataset and evaluated by one-step-ahead prediction error.
• Comparison of Lasso, OWL, Group Lasso, Sparse Group Lasso and Ridge regularizers (Table 1).
Table 1. Results of the five regularizers R(·) used in fitting the VAR model, evaluated on the aviation dataset; values in parentheses denote one standard deviation after averaging over 300 flights.

R(·)                   Lasso       OWL          Group Lasso   Sparse Group Lasso   Ridge
nMSE (%)               32.3 (6.5)  32.2 (6.6)   32.7 (6.5)    32.2 (6.4)           33.5 (6.1)
Non-zeros in A1 (%)    32.7 (7.9)  44.5 (15.6)  75.3 (8.4)    38.4 (9.6)           99.9 (0.2)

The MSE row is computed using one-step-ahead prediction errors; the non-zeros row shows the average number of non-zero entries as a percentage of the total number of elements in the VAR matrix. The last row of the original table shows a typical sparsity pattern in A1 for each method (darker dots indicate stronger dependencies, lighter dots weaker dependencies).
Table 1 reports the results after averaging across 300 flights. From the table we can see that the considered problem exhibits a sparse structure, since all the methods detected similar patterns in the matrix A1. In particular, the analysis of such patterns revealed meaningful relationships among the flight parameters (darker dots): e.g., normal acceleration had a high dependency on vertical speed and angle-of-attack, altitude depended mainly on fuel quantity, and vertical speed on aircraft nose pitch angle. The results also showed that sparse regularization helps in recovering this structure.
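The metric behind Table 1 is a one-step-ahead prediction error: with an estimate Â_1 (d = 1), predict x̂_t = Â_1 x_{t−1} on held-out flight data and aggregate the squared residuals. The deck does not spell out the exact normalization, so the normalized MSE below (residual energy relative to the centered test-series energy) is an assumption made for illustration, as is the sparsity-percentage helper.

```python
import numpy as np

def one_step_nmse(A1_hat, x_test):
    """One-step-ahead prediction error of a VAR(1) estimate on a held-out series x_0, ..., x_T.
    Normalizing by the centered test-series energy is an illustrative choice, not necessarily the paper's."""
    pred = x_test[:-1] @ A1_hat.T                    # x_hat_t = A1_hat @ x_{t-1}, vectorized over t
    resid = x_test[1:] - pred
    return 100.0 * np.sum(resid ** 2) / np.sum((x_test[1:] - x_test[1:].mean(axis=0)) ** 2)

def nonzero_fraction(A1_hat, tol=1e-8):
    """Percentage of non-zero entries in the estimated VAR matrix (the 'non-zeros' row of Table 1)."""
    return 100.0 * np.mean(np.abs(A1_hat) > tol)
```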
5. Conclusions

In this work we present a set of results for characterizing the non-asymptotic estimation error in estimating structured vector autoregressive models. The analysis holds for any norms separable along the rows of the parameter matrices. Our analysis is general, as it is expressed in terms of Gaussian widths, a geometric measure of the size of suitable sets, and it includes as special cases many of the existing results focused on structured sparsity in VAR models.
The OWL (ordered weighted ℓ1) regularizer used in Table 1 is described in [Zheng15]. That paper develops the atomic formulation of the OWL norm and, among other contributions: instantiates the conditional gradient (CG) algorithm to handle the Ivanov formulation of OWL regularization, showing how the atomic formulation allows solving efficiently the linear programming problem in each iteration of the CG algorithm, with convergence of the resulting algorithm and explicit values for the constants based on results from [25]; gives a new derivation of the proximity operator of the OWL norm, arguably simpler than those in [10] and [44], highlighting its connection to isotonic regression and the pool adjacent violators (PAV) algorithm [3], [8]; proposes an efficient method to project onto an OWL norm ball, based on a root-finding scheme; and solves the Ivanov formulation under OWL regularization using projected gradient algorithms based on the proposed OWL projection. The paper covers the OWL norm and its dual in Section II, the proximity operator and ball projection in Section III, the CG and projected-gradient algorithms in Section IV, and experiments in Section V.

The OWL norm of x ∈ R^n is Ω_w(x) = Σ_{i=1}^n w_i |x|_[i], where |x|_[1] ≥ … ≥ |x|_[n] are the magnitudes of x sorted in non-increasing order and w is a vector of non-increasing weights, i.e., belonging to the so-called monotone non-negative cone [13],

  K_{m+} = \{x \in R^n : x_1 \ge x_2 \ge \cdots \ge x_n \ge 0\} \subset R^n_+   (eq. (2) of [Zheng15]).

OWL balls in R² and R³ for different choices of the weight vector w are illustrated in Figs. 1 and 2 of [Zheng15] (R²: (a) w₁ > w₂ > 0, (b) w₁ = w₂ > 0, (c) w₁ > w₂ = 0; R³: (a) w₁ > w₂ > w₃ > 0, (b) w₁ > w₂ = w₃ > 0, (c) w₁ = w₂ > w₃ > 0, (d) w₁ = w₂ > w₃ = 0, (e) w₁ > w₂ = w₃ = 0, (f) w₁ = w₂ = w₃ > 0). The fact that, if w ∈ K_{m+}\{0}, then Ω_w is indeed a norm (convex and homogeneous of degree 1) was shown in [10], and it is clear that Ω_w is lower bounded by the (appropriately scaled) ℓ∞ norm:

  \Omega_w(x) \;\ge\; w_1 |x|_{[1]} = w_1 \|x\|_\infty   (eq. (3) of [Zheng15]).
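Evaluating the OWL norm is a one-liner — sort the magnitudes and take the weighted sum — which also makes the lower bound (3) of [Zheng15] easy to check numerically. The weight vector in the sketch below is an arbitrary non-increasing choice.

```python
import numpy as np

def owl_norm(x, w):
    """Ordered weighted l1 norm: Omega_w(x) = sum_i w_i * |x|_[i], with |x|_[1] >= ... >= |x|_[n]
    and w a non-increasing, non-negative weight vector (the monotone non-negative cone)."""
    w = np.asarray(w, dtype=float)
    assert np.all(np.diff(w) <= 0) and np.all(w >= 0), "weights must lie in K_{m+}"
    return np.sum(w * np.sort(np.abs(x))[::-1])

x = np.array([0.3, -2.0, 1.1, 0.0])
w = np.array([1.0, 0.6, 0.3, 0.1])                 # decreasing weights, chosen for illustration
print(owl_norm(x, w), w[0] * np.max(np.abs(x)))    # Omega_w(x) >= w_1 * ||x||_inf
```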
• Summary: error bounds for structured VAR estimation that hold without the i.i.d. assumption; the temporal dependence enters only through the spectral-density quantities L and M of (9) and (13).
• Key condition: the restricted eigenvalue bound of Theorem 3.4, inf_{Δ∈cone(Ω_E)} ‖(I_{p×p}⊗X)Δ‖₂/‖Δ‖₂ ≥ ν with ν = √(NL) − 2√M − c·w(Θ) − η, which yields the sample-size requirement √N > (2√M + c·w(Θ))/(√L/2) = O(w(Θ)) of (18).
• Consequence: with N ≥ O(w²(Θ)) samples and λ_N ≥ O(w(Ω_R)/√N + w²(Ω_R)/N²), the estimation error satisfies ‖Δ‖₂ ≤ O(w(Ω_R)/√N + w²(Ω_R)/N²)·Ψ(cone(Ω_{E_j})).
• [Negahban09]: the unified framework for the analysis of regularized M-estimators.
• ARX models.

Estimating Structured Vector Autoregressive Models Using Regularized Estimation

  • 2. • Vector Autoregressive 
 • • i.i.d. • →( ) Lasso Group-Lasso i.i.d. • →( ) R ( ) • R ap- y in lar- (1) uch , is dis- ion ani, hine ume nancial time series (Tsay, 2005) to modeling the dynamical systems (Ljung, 1998) and estimating brain function con- nectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is defined as xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2) where xt 2 Rp denotes a multivariate time series, Ak 2 Rp⇥p , k = 1, . . . , d are the parameters of the model, and d 1 is the order of the model. In this work, we as- sume that the noise ✏t 2 Rp follows a Gaussian distribu- tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T t ) = ⌃ and E(✏t✏T t+⌧ ) = 0, for ⌧ 6= 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix ⌃ is assumed to be positive definite with bounded largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1. In the current context, the parameters {Ak} are assumed vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- ance matrix of the noise ✏ is now E[✏✏T ] = ⌃ ⌦ IN⇥N . Consequently, the regularized estimator takes the form ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) where R( ) can be any vector norm, separable along the rows of matrices Ak. Specifically, if we denote = [ T 1 . . . T p ]T and Ak(i, :) as the row of matrix Ak for k = 1, . . . , d, then our assumption is equivalent to R( )= pX i=1 R i = pX i=1 R ✓h A1(i, :)T . . .Ad(i, :)T iT ◆ . (5) where A all the for 2 of origi Pd k=1 A and Sec 2.3. Pro In what trix X i results w bounds Define we assu is distri matrix C C = 2
  • 4. • i.i.d. • • • high probability ||Z ||2 || ||2 p N = O(1) is a pos- he estimation er- ded by k k2  ote that the norm assumed to be the rom our assump- und on the regu- + w2 (⌦R) N2 ⌘ . As nd d grows and 2 ✏1) + log(p)) we (1+✏1) w2 (⌦R) N2 ◆ . ion, we will show some ⌫ > 0 and p 1 , where Sdp 1 defined as ⌦Ej = R( j) o , for r > T , for j is of size isfy N O N + N2 . With high then the restricted eigenvalue condition ||Z || || ||2 for 2 cone(⌦E) holds, so that  = O(1 itive constant. Moreover, the norm of the est ror in optimization problem (4) is bounded b O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note tha compatibility constant (cone(⌦Ej )) is assume same for all j = 1, . . . , p, which follows from o tion in (5). Consider now Theorem 3.3 and the bound on larization parameter N O ⇣ w(⌦R) p N + w2 ( N the dimensionality of the problem p and d the number of samples N increases, the first t will dominate the second one w2 (⌦R) N2 . This c by computing N for which the two terms bec 2 ↑ w, the regularization parameter nd on R⇤ [ 1 N ZT ✏]  ↵, for lish the required relationship 2 Rdp |R(u)  1}, and define be a Gaussian width of set ny ✏1 > 0 and ✏2 > 0 with ( min(✏2 2, ✏1) + log(p)) we (⌦R) p N + c1(1+✏1) w2 (⌦R) N2 ◆ e constants. alue condition, we will show ⌫, for some ⌫ > 0 and L indicate stronger dependency in the data, thus more samples for the RE conditions to hold with h ability. Analyzing Theorems 3.3 and 3.4 we can interpre tablished results as follows. As the size and di ality N, p and d of the problem increase, we e the scale of the results and use the order notatio note the constants. Select a number of sample N O(w2 (⇥)) and let the regularization param isfy N O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ . With high pr then the restricted eigenvalue condition ||Z ||2 || ||2 for 2 cone(⌦E) holds, so that  = O(1) itive constant. Moreover, the norm of the estim ror in optimization problem (4) is bounded by O ⇣ w(⌦R) p N + w2 (⌦R) N2 ⌘ (cone(⌦Ej )). Note that compatibility constant (cone(⌦Ej )) is assumed same for all j = 1, . . . , p, which follows from our tion in (5). y ⌘ > 0, inf 2cone(⌦E ) ||(Ip⇥p⌦X) ||2 || ||2 ⌫, L 2 p M cw(⇥) ⌘ and c, c1, c2 are nts, and L, M are defined in (9) and (13). 3.4, we can choose ⌘ = 1 2 p NL and set 2 p M cw(⇥) ⌘ and since p N > 0 ed, we can establish a lower bound on the ples N: p N > 2 p M+cw(⇥) p L/2 = O(w(⇥)). bound and using (9) and (13), we can con- 3.3 W sep ize su OW 3. To ram VAR h= 1 X(h) = E[Xj,:XT j+h,:] 2 Rdp⇥dp , then we can write inf 1ldp !2[0,2⇡] ⇤l[⇢X(!)]  ⇤k[CU ] 1kNdp  sup 1ldp !2[0,2⇡] ⇤l[⇢X(!)]. The closed form expression of spectral density is ⇢X(!) = I Ae i! 1 ⌃E h I Ae i! 1 i⇤ , where ⌃E is the covariance matrix of a noise vector and A are as defined in expression (6). Thus, an upper bound on CU can be obtained as ⇤max[CU ]  ⇤max(⌃) ⇤min(A) , where we defined ⇤min(A) = min !2[0,2⇡] ⇤min(A(!)) for A(!) = I AT ei! I Ae i! . (12) Referring back to covariance matrix Qa in (11), we get ⇤max[Qa]  ⇤max(⌃)/⇤min(A) = M. (13) We note that for a general VAR model, there might not exist closed-form expressions for ⇤max(A) and ⇤min(A). How- ever, for some special cases there are results establishing the bounds on these quantities (e.g., see Proposition 2.2 in Lemma 3.2 Assume tha condition holds ||Z || for 2 cone(⌦E) an cone(⌦E) is a cone of th || ||2  1 + r where (cone(⌦E)) is a fined as (cone(⌦E)) = Note that the above error and (16) hold, then the (17). 
However, the result tities, involving Z and ✏, the following we establis Estimating Structured VAR density is that it has a closed form expression (see Section 9.4 of (Priestley, 1981)) ⇢(!)= I dX k=1 Ake ki! ! 1 ⌃ 2 4 I dX k=1 Ake ki! ! 1 3 5 ⇤ , where ⇤ denotes a Hermitian of a matrix. Therefore, from (8) we can establish the following lower bound ⇤min[CX] ⇤min(⌃)/⇤max(A) = L, (9) 3. Regularized Es Denote by = ˆ optimization problem rameter. The focus of under which the optim tees on the accuracy of term is bounded: || || lish such conditions, w et al., 2014). Specifica4
  • 5. • : p vector VAR(d) • • - n - ) h s - n , e e nancial time series (Tsay, 2005) to modeling the dynamical systems (Ljung, 1998) and estimating brain function con- nectivity (Valdes-Sosa et al., 2005), among others. A VAR model of order d is defined as xt = A1xt 1 + A2xt 2 + · · · + Adxt d + ✏t , (2) where xt 2 Rp denotes a multivariate time series, Ak 2 Rp⇥p , k = 1, . . . , d are the parameters of the model, and d 1 is the order of the model. In this work, we as- sume that the noise ✏t 2 Rp follows a Gaussian distribu- tion, ✏t ⇠ N(0, ⌃), with E(✏t✏T t ) = ⌃ and E(✏t✏T t+⌧ ) = 0, for ⌧ 6= 0. The VAR process is assumed to be stable and stationary (Lutkepohl, 2007), while the noise covariance matrix ⌃ is assumed to be positive definite with bounded largest eigenvalue, i.e., ⇤min(⌃) > 0 and ⇤max(⌃) < 1. In the current context, the parameters {Ak} are assumed timation problems of the form: ˆ = argmin 2Rq 1 M ky Z k2 2 + M R( ) , (1) {(yi, zi), i = 1, . . . , M}, yi 2 R, zi 2 Rq , such = [yT 1 , . . . , yT M ]T and Z = [zT 1 , . . . , zT M ]T , is ining set of M independently and identically dis- d (i.i.d.) samples, M > 0 is a regularization eter, and R(·) denotes a suitable norm (Tibshirani, dings of the 33rd International Conference on Machine g, New York, NY, USA, 2016. JMLR: W&CP volume pyright 2016 by the author(s). model of order d is defined as xt = A1xt 1 + A2xt 2 + · where xt 2 Rp denotes a mult Rp⇥p , k = 1, . . . , d are the par d 1 is the order of the mod sume that the noise ✏t 2 Rp fo tion, ✏t ⇠ N (0, ⌃), with E(✏t✏T t for ⌧ 6= 0. The VAR process is stationary (Lutkepohl, 2007), w matrix ⌃ is assumed to be posi largest eigenvalue, i.e., ⇤min(⌃) In the current context, the para Estimating Structured VAR characterizing sample complexity and error bounds. 2.1. Regularized Estimator To estimate the parameters of the VAR model, we trans- form the model in (2) into the form suitable for regularized estimator (1). Let (x0, x1, . . . , xT ) denote the T + 1 sam- ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, of matrix X (and conseque dependencies, following (B utilize the spectral represen VAR models to control the d 2.2. Stability of VAR Mode Since VAR models are (line analysis we need to establi VAR model (2) is stable, i.e not diverge over time. For u venient to rewrite VAR mod alent VAR model of order 1 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = 2 6 6 6 6 6 4 A1 A2 . . . I 0 . . . 0 I . . . ... ... ... 0 0 . . . | {z A where A 2 Rdp⇥dp . There all the eigenvalues of A sa for 2 C, | | < 1. Equi of original parameters Ak, P (1). Let (x0, x1, . . . , xT ) denote the T + 1 sam- ated by the stable VAR model in (2), then stack- ogether we obtain = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 also be compactly written as Y = XB + E, (3) 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 N = T d+1. 
Vectorizing (column-wise) each (3), we get ec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) Since VAR mo analysis we ne VAR model (2 not diverge ove venient to rewr alent VAR mod 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = 2 6 6 6 6 6 4 | where A 2 R all the eigenva for 2 C, | les generated by the stable VAR model in (2), then stack- ng them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 hich can also be compactly written as Y = XB + E, (3) here Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 N⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) Since V analysi VAR m not div venient alent V 2 6 6 6 4 x xt xt ( where all the ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) ples generated by the stable VAR model in (2), then stack- ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) ing them together we obtain 2 6 6 6 4 xT d xT d+1 ... xT T 3 7 7 7 5 = 2 6 6 6 4 xT d 1 xT d 2 . . . xT 0 xT d xT d 1 . . . xT 1 ... ... ... ... xT T 1 xT T 2 . . . xT T d 3 7 7 7 5 2 6 6 6 4 AT 1 AT 2 ... AT d 3 7 7 7 5 + 2 6 6 6 4 ✏T d ✏T d+1 ... ✏T T 3 7 7 7 5 which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- analysis we VAR model not diverge o venient to re alent VAR m 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 where A 2 all the eigen for 2 C, of original p Pd k=1 Ak 1 k and Section xT T xT T 1 xT T 2 . . . xT T d AT d ✏T T which can also be compactly written as Y = XB + E, (3) where Y 2 RN⇥p , X 2 RN⇥dp , B 2 Rdp⇥p , and E 2 RN⇥p for N = T d+1. Vectorizing (column-wise) each matrix in (3), we get vec(Y ) = (Ip⇥p ⌦ X)vec(B) + vec(E) y = Z + ✏, where y 2 RNp , Z = (Ip⇥p ⌦ X) 2 RNp⇥dp2 , 2 Rdp2 , ✏ 2 RNp , and ⌦ is the Kronecker product. The covari- ance matrix of the noise ✏ is now E[✏✏T ] = ⌃ ⌦ IN⇥N . Consequently, the regularized estimator takes the form ˆ = argmin 2Rdp2 1 N ||y Z ||2 2 + N R( ), (4) 2 6 6 6 4 xt xt 1 ... xt (d 1) 3 7 7 7 5 = where A 2 all the eigen for 2 C, of original p Pd k=1 Ak 1 k and Section 2.3. Propert In what follo trix X in (3) results will th i.i.d. 5
• 6. Proof roadmap
  • The analysis combines four pieces: Lem. 4.1 (once λ_N ≥ r R*[(1/N) Z^T ε], the error Δ = β̂ − β* lies in the restricted error set Ω_E), Thm. 4.3 (high-probability bound on the regularization parameter), Thm. 4.4 (restricted eigenvalue condition), and Lem. 4.2 (deterministic error bound given the other three).
  • (Roadmap figure: excerpts of the corresponding statements from the paper.)
• 7. Proof roadmap (revisited)
  • Same diagram as the previous slide, stepping into the first ingredients: the restricted eigenvalue condition and the restricted error set, discussed on the next two slides.
• 8. Restricted Eigenvalue (RE) condition
  • The error bound requires the design Z = I_{p×p} ⊗ X to be well conditioned on the cone of the error set: a Lasso-type restricted eigenvalue condition  ||ZΔ||_2 / ||Δ||_2 ≥ κ√N  for all Δ ∈ cone(Ω_E).
  • To establish it, one shows  inf_{Δ ∈ cone(Ω_E)} ||(I_{p×p} ⊗ X)Δ||_2 / ||Δ||_2 ≥ ν  for some ν > 0 and then sets κ√N = ν; this is the content of Thm. 4.4 (slide 15).
• 9. Restricted error set
  • Goal: bound the error Δ = β̂ − β* between the solution of (4) and the true parameter, i.e. show ||Δ||_2 ≤ δ for a known δ, using the framework of (Banerjee et al., 2014).
  • Lem. 4.1: if  λ_N ≥ r R*[(1/N) Z^T ε]   (14)  for some r > 1, where R*[·] is the dual norm, R*[(1/N) Z^T ε] = sup_{R(U)≤1} ⟨(1/N) Z^T ε, U⟩ with U = [u_1^T, ..., u_p^T]^T, u_i ∈ R^{dp}, then the error belongs to the restricted error set
    Ω_E = { Δ ∈ R^{dp²} | R(β* + Δ) ≤ R(β*) + (1/r) R(Δ) }.   (15)
  • The illustration uses r = 2 and the L1 norm: Ω_E and its cone around β*.
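A worked instance (not on the original slide) of how (15) specializes for the L1 norm with r = 2: writing S for the support of an s-sparse β*, every Δ ∈ Ω_E satisfies the familiar Lasso cone condition:

\begin{align*}
\|\beta^* + \Delta\|_1 &= \|\beta^*_S + \Delta_S\|_1 + \|\Delta_{S^c}\|_1
  \;\ge\; \|\beta^*\|_1 - \|\Delta_S\|_1 + \|\Delta_{S^c}\|_1, \\
\|\beta^* + \Delta\|_1 \le \|\beta^*\|_1 + \tfrac12\|\Delta\|_1
  &\;\Longrightarrow\; \|\Delta_{S^c}\|_1 - \|\Delta_S\|_1 \le \tfrac12\|\Delta_S\|_1 + \tfrac12\|\Delta_{S^c}\|_1
  \;\Longleftrightarrow\; \|\Delta_{S^c}\|_1 \le 3\,\|\Delta_S\|_1 .
\end{align*}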
• 10. Proof roadmap (revisited)
  • Transition slide: with the restricted error set in place, the next step is the deterministic error bound (Lem. 4.2) on the following slide.
• 11. Deterministic error bound (Lem. 4.2)
  • Lem. 4.2 (Banerjee et al., 2014): if the RE condition
    ||ZΔ||_2 / ||Δ||_2 ≥ κ√N   (16)
    holds for all Δ ∈ cone(Ω_E) with some constant κ > 0, then
    ||Δ||_2 ≤ ((1 + r)/r) · (λ_N / κ²) · Ψ(cone(Ω_E)),   (17)
    where Ψ(cone(Ω_E)) = sup_{U ∈ cone(Ω_E)} R(U)/||U||_2 is the norm compatibility constant.
  • The bound is purely deterministic: it holds whenever (14) and (16) do. Both involve the random quantities Z and ε, so high-probability bounds on λ_N and on the RE condition are established next.
• 12. Putting the pieces together
  • Choose the number of samples N ≥ O(w²(Θ)) and the regularization parameter λ_N ≥ O(w(Ω_R)/√N + w²(Ω_R)/N²). Then, with high probability, the RE condition (16) holds with κ = O(1), and the error in (4) satisfies
    ||Δ||_2 ≤ O( w(Ω_R)/√N + w²(Ω_R)/N² ) · Ψ(cone(Ω_{E_j})),
    the compatibility constant being the same for every j = 1, ..., p by the row-wise separability assumption (5).
  • The two terms balance at N = w^{2/3}(Ω_R) < w(Ω_R); for larger N the first term dominates, so once the RE condition holds and λ_N ≥ O(w(Ω_R)/√N), the error is bounded by ||Δ||_2 ≤ O(w(Ω_R)/√N) · Ψ(cone(Ω_{E_j})).
  • Lasso: if β* is s-sparse (s = largest number of non-zeros in a combined row β_i), decomposability of L1 gives Ψ(cone(Ω_{E_j})) ≤ 4√s, and w(Ω_R) ≤ O(√log(dp)) (Lemma 3 of Banerjee et al., 2014; Chandrasekaran et al., 2012), so λ_N ≥ O(√(log(dp)/N)) and ||Δ||_2 ≤ O(√(s log(dp)/N)) as long as N > O(log(dp)).
  • OWL: using the Gaussian-width and compatibility results of (Chen & Banerjee, 2015), an analogous bound of order √(s log(dp)/N) (with constants depending on the OWL weight sequence) is obtained, similar to the Lasso case.
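The √log(dp) factor in the Lasso case comes from the Gaussian width of the L1 unit ball; a quick numerical illustration (an illustrative sketch, not from the paper): since sup_{||u||_1 ≤ 1} ⟨g, u⟩ = ||g||_∞, the width is E||g||_∞ ≈ √(2 log(dp)).

import numpy as np

def gaussian_width_l1_ball(dim, n_mc=10000, seed=0):
    """Monte Carlo estimate of w({u : ||u||_1 <= 1}) = E[ ||g||_inf ] for g ~ N(0, I_dim)."""
    g = np.random.default_rng(seed).standard_normal((n_mc, dim))
    return np.abs(g).max(axis=1).mean()

for dp in (10, 100, 1000):
    print(dp, round(gaussian_width_l1_ball(dp), 3), round(float(np.sqrt(2 * np.log(dp))), 3))
# the Monte Carlo estimate tracks sqrt(2 log(dp)), the O(sqrt(log(dp))) width used in the Lasso bound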
• 13. Proof roadmap (revisited)
  • Next ingredients: the high-probability results, namely the restricted eigenvalue condition (Thm. 4.4, after introducing the separable norms and the constants L, M) and the bound on the regularization parameter (Thm. 4.3).
• 14. Assumption on the regularizer: row-wise separable norms R(·)
  • R(·) can be any vector norm that is separable along the rows of the matrices A_k: writing β = [β_1^T ... β_p^T]^T and A_k(i,:) for the i-th row of A_k,
    R(β) = Σ_{i=1}^p R(β_i) = Σ_{i=1}^p R([A_1(i,:)^T ... A_d(i,:)^T]^T).   (5)
  • Without loss of generality the same R is used for every row i; since the analysis decouples across rows, a different norm per row is straightforward (e.g. L1 for row one, L2 for row two, K-support norm (Argyriou et al., 2012) for row three).
  • Within a row, the norm need not be decomposable across columns.
  • Illustration: the i-th rows of A_1, ..., A_4 stacked into the group β_i.
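A small sketch (illustrative only) of evaluating such a row-wise separable regularizer, with a possibly different norm per row as described above:

import numpy as np

def rowwise_separable_norm(A_list, row_norms):
    """R(beta) = sum_i R_i(beta_i), where beta_i stacks the i-th rows of A_1, ..., A_d (eq. 5)."""
    p = A_list[0].shape[0]
    total = 0.0
    for i in range(p):
        beta_i = np.concatenate([A[i, :] for A in A_list])   # combined i-th row, length dp
        total += row_norms[i](beta_i)
    return total

l1 = lambda v: np.abs(v).sum()
l2 = lambda v: np.linalg.norm(v)

p, d = 4, 2
A_list = [np.random.default_rng(k).standard_normal((p, p)) for k in range(d)]
R_value = rowwise_separable_norm(A_list, [l1] + [l2] * (p - 1))   # L1 on row 1, L2 on the rest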
• 15. RE condition via Gaussian width (Thm. 4.4)
  • Thm. 4.4: let Θ = cone(Ω_{E_j}) ∩ S^{dp−1}, where S^{dp−1} is the unit sphere and
    Ω_{E_j} = { Δ_j ∈ R^{dp} | R(β*_j + Δ_j) ≤ R(β*_j) + (1/r) R(Δ_j) },  r > 1,  j = 1, ..., p,
    so that Ω_E = Ω_{E_1} × ... × Ω_{E_p} by the row-wise separability of R(·) in (5). Let w(Θ) = E[sup_{u∈Θ} ⟨g, u⟩] be the Gaussian width of Θ for g ~ N(0, I). Then, for any η > 0, with probability at least 1 − c_1 exp(−c_2 η² + log(p)),
    inf_{Δ ∈ cone(Ω_E)} ||(I_{p×p} ⊗ X)Δ||_2 / ||Δ||_2 ≥ ν,   with   ν = √(NL) − 2√M − c·w(Θ) − η,
    where c, c_1, c_2 are positive constants and L, M are defined in (9) and (13) (slide 17).
  • Setting κ√N = ν turns this into the RE condition (16).
• 16. Discussion: sample complexity for the RE condition
  • From Thm. 4.4, choose η = √(NL)/2 and set κ√N = √(NL) − 2√M − c·w(Θ) − η; since κ√N > 0 must hold, this gives a lower bound on the number of samples
    √N > (2√M + c·w(Θ)) / (√L / 2) = O(w(Θ)).   (18)
  • By (9) and (13), fewer samples are needed when Λ_min(A) and Λ_min(Σ) are larger and Λ_max(𝒜) and Λ_max(Σ) are smaller, i.e. when the matrices 𝒜(ω) and A(ω) in (10) and (12) are well conditioned and the VAR process is stable with eigenvalues well inside the unit circle.
  • Equivalently, large M and small L indicate stronger dependency in the data, requiring more samples for the RE condition to hold with high probability.
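Spelling out the algebra behind (18) under the stated choice η = √(NL)/2 (a short derivation added for completeness):

\begin{align*}
\kappa\sqrt{N} \;=\; \nu
  &= \sqrt{NL} - 2\sqrt{M} - c\,w(\Theta) - \tfrac{1}{2}\sqrt{NL}
   \;=\; \tfrac{\sqrt{L}}{2}\,\sqrt{N} - 2\sqrt{M} - c\,w(\Theta),\\
\kappa\sqrt{N} > 0
  &\;\Longleftrightarrow\; \sqrt{N} \;>\; \frac{2\sqrt{M} + c\,w(\Theta)}{\sqrt{L}/2}\;=\;O\!\big(w(\Theta)\big),
\end{align*}

so the RE condition can only be certified once N is at least of order w²(Θ), with the constant governed by the ratio of M to L.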
• 17. Controlling the dependence: the constants L and M
  • X has dependent rows, with block-Toeplitz covariance C_X built from Γ(h) = E(x_t x_{t+h}^T). Its eigenvalues are bounded through the spectral density ρ(ω) = Σ_h Γ(h) e^{−hiω}:
    inf_{1≤j≤p, ω∈[0,2π]} Λ_j[ρ(ω)] ≤ Λ_k[C_X] ≤ sup_{1≤j≤p, ω∈[0,2π]} Λ_j[ρ(ω)],   (8)
    with the closed form  ρ(ω) = (I − Σ_{k=1}^d A_k e^{−kiω})^{−1} Σ [(I − Σ_{k=1}^d A_k e^{−kiω})^{−1}]^*.
  • Lower bound:  Λ_min[C_X] ≥ Λ_min(Σ) / Λ_max(𝒜) = L,   (9)
    where Λ_max(𝒜) = max_{ω∈[0,2π]} Λ_max(𝒜(ω)) and 𝒜(ω) = (I − Σ_k A_k^T e^{kiω})(I − Σ_k A_k e^{−kiω}).   (10)
  • For q = Xa with ||a||_2 = 1, each element X_{i,:}^T a ~ N(0, a^T C_X a), so q ~ N(0, Q_a) with Q_a = (I ⊗ a^T) C_U (I ⊗ a) (11); the spectral density of the equivalent VAR(1) process (companion matrix A, noise covariance Σ_E) gives the upper bound
    Λ_max[Q_a] ≤ Λ_max(Σ) / Λ_min(A) = M,   (13)
    where Λ_min(A) = min_{ω∈[0,2π]} Λ_min(A(ω)) and A(ω) = (I − A^T e^{iω})(I − A e^{−iω}).   (12)
  • Closed-form expressions for Λ_max(𝒜) and Λ_min(A) may not exist in general, but bounds are available for special cases (e.g. Proposition 2.2 in Basu & Michailidis, 2015).
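A numerical sketch (not from the paper; a grid over ω only approximates the min/max) of how L and M in (9) and (13) can be evaluated for given lag matrices A_1, ..., A_d and noise covariance Σ:

import numpy as np

def L_and_M(A_list, Sigma, n_grid=400):
    """Approximate L = lmin(Sigma)/max_w lmax(calA(w)) and M = lmax(Sigma)/min_w lmin(A(w))."""
    d = len(A_list)
    p = A_list[0].shape[0]
    # companion matrix of the equivalent VAR(1) representation
    comp = np.zeros((d * p, d * p))
    comp[:p, :] = np.hstack(A_list)
    if d > 1:
        comp[p:, :-p] = np.eye((d - 1) * p)
    lmax_calA, lmin_A = 0.0, np.inf
    for w in np.linspace(0.0, 2.0 * np.pi, n_grid):
        lagpoly = np.eye(p, dtype=complex) - sum(A_list[k] * np.exp(-1j * (k + 1) * w) for k in range(d))
        calA = lagpoly.conj().T @ lagpoly                  # eq. (10)
        comppoly = np.eye(d * p, dtype=complex) - comp * np.exp(-1j * w)
        Aw = comppoly.conj().T @ comppoly                  # eq. (12)
        lmax_calA = max(lmax_calA, np.linalg.eigvalsh(calA).max())
        lmin_A = min(lmin_A, np.linalg.eigvalsh(Aw).min())
    s = np.linalg.eigvalsh(Sigma)
    return s.min() / lmax_calA, s.max() / lmin_A

# e.g. for the toy VAR(1) from the earlier sketch:  L, M = L_and_M([A1], 0.01 * np.eye(p))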
• 18. Proof roadmap (revisited)
  • Last ingredient: the high-probability bound on the regularization parameter λ_N (Thm. 4.3), on the next slide.
• 19. Bound on the regularization parameter (Thm. 4.3)
  • Lem. 4.1 requires λ_N ≥ r R*[(1/N) Z^T ε]; since Z and ε are random, an upper bound R*[(1/N) Z^T ε] ≤ α is derived so that choosing λ_N ≥ α suffices with high probability.
  • Thm. 4.3: let Ω_R = {u ∈ R^{dp} | R(u) ≤ 1} and let w(Ω_R) = E[sup_{u∈Ω_R} ⟨g, u⟩] be its Gaussian width for g ~ N(0, I). For any ε_1 > 0 and ε_2 > 0, with probability at least 1 − c exp(−min(ε_2², ε_1) + log(p)),
    R*[(1/N) Z^T ε] ≤ c_2 (1 + ε_2) w(Ω_R)/√N + c_1 (1 + ε_1) w²(Ω_R)/N²,
    where c, c_1, c_2 are positive constants.
  • w(Ω_R): the Gaussian width of the unit ball of R, a geometric measure of its size.
• 20. Synthetic experiments (d = 1)
  • Estimation error vs. sample size N for a stable first-order sparse VAR (Lasso, top row of Figure 1) and a group-sparse VAR (bottom row), with p ∈ [10, 600], N ∈ [10, 5000], λ_N/λ_max ∈ [0, 1], K ∈ [2, 60] groups.
  • Figure 1 (paper): (a), (e) error ||Δ||_2 vs. N for different p; (b) plotting ||Δ||_2 against N/(s log p) collapses the curves, confirming the √((s log p)/N) scaling; (f) the analogous collapse against N/(s(m + log K)) for the group-sparse case; (c), (g) dependence of λ_N on p (or on the number of groups K) for fixed N; (d), (h) dependence of λ_N on N for fixed p.
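A sketch of the kind of scaling check behind Figure 1(b) (illustrative assumptions throughout: the sparse model generator, the λ_N constant, and the build_var_regression / fit_rowwise_lasso helpers from the earlier sketch): record ||Δ||_2 against N/(s log p) for several p and check that the curves roughly collapse.

import numpy as np

def sparse_stable_var1(p, s_off=2, rng=None):
    """A stable VAR(1) matrix: upper triangular with diagonal 0.5 and s_off extra entries per row."""
    rng = rng or np.random.default_rng(0)
    A = 0.5 * np.eye(p)
    for i in range(p - 1):
        cols = rng.choice(np.arange(i + 1, p), size=min(s_off, p - 1 - i), replace=False)
        A[i, cols] = 0.2 / s_off
    return A

results = []                                   # tuples (p, N/(s log p), ||Delta||_2)
for p in (10, 20, 40):
    A1 = sparse_stable_var1(p)
    s = int((A1 != 0).sum(axis=1).max())       # largest number of non-zeros per row
    rng = np.random.default_rng(p)
    for N in (100, 200, 400, 800, 1600):
        x = np.zeros((N + 1, p))
        for t in range(1, N + 1):
            x[t] = A1 @ x[t - 1] + 0.1 * rng.standard_normal(p)
        Y, X = build_var_regression(x, d=1)
        B_hat = fit_rowwise_lasso(Y, X, lam=0.1 * np.sqrt(np.log(p) / N))
        results.append((p, N / (s * np.log(p)), np.linalg.norm(B_hat - A1.T)))
# plotting the error against N/(s log p), grouped by p, should show curves that roughly overlap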
  • 21. • VAR(d=1) 1step • • or k N N k / / 0 1000 2000 3000 4000 5000 0 5 0 500 1000 1500 2000 2500 0 5 0 1000 2000 3000 4000 5000 0.4 0.5 0.6 N/[s(m + log K)] K 10 20 30 40 50 60 0.3 0.4 0.5 0.6 (e) (f) (g) (h) Figure 1. Results for estimating parameters of a stable first order sparse VAR (top row) and group sparse VAR (bottom row). Problem dimensions: p 2 [10, 600], N 2 [10, 5000], N max 2 [0, 1], K 2 [2, 60] and d = 1. Figures (a) and (e) show dependency of errors on sample size for different p; in Figure (b) the N is scaled by (s log p) and plotted against k k2 to show that errors scale as (s log p)/N in (f) the graph is similar to (b) but for group sparse VAR; in (c) and (g) we show dependency of N on p (or number of groups K in (g)) for fixed sample size N; finally, Figures (d) and (h) display the dependency of N on N for fixed p. Lasso OWL Group Lasso Sparse Group Lasso Ridge 32.3(6.5) 32.2(6.6) 32.7(6.5) 32.2(6.4) 33.5(6.1) 32.7(7.9) 44.5(15.6) 75.3(8.4) 38.4(9.6) 99.9(0.2) Table 1. Mean squared error (row 2) of the five methods used in fitting VAR model, evaluated on aviation dataset (MSE is computed using one-step-ahead prediction errors). Row 3 shows the average number of non-zeros (as a percentage of total number of elements) in the VAR matrix. The last row shows a typical sparsity pattern in A1 for each method (darker dots - stronger dependencies, lighter dots weaker dependencies). The values in parenthesis denote one standard deviation after averaging the results over 300 flights. the results after averaging across 300 flights. From the table we can see that the considered problem ex- hibits a sparse structure since all the methods detected sim- ilar patterns in matrix A1. In particular, the analysis of such patterns revealed a meaningful relationship among the flight parameters (darker dots), e.g., normal acceleration had high dependency on vertical speed and angle-of-attack, the altitude had mainly dependency with fuel quantity, ver- tical speed with aircraft nose pitch angle, etc. The results also showed that the sparse regularization helps in recov- 5. Conclusions In this work we present a set of results for characterizing non-asymptotic estimation error in estimating structured vector autoregressive models. The analysis holds for any norms, separable along the rows of parameter matrices Our analysis is general as it is expressed in terms of Gaus sian widths, a geometric measure of size of suitable sets and includes as special cases many of the existing result focused on structured sparsity in VAR models. A1 nMSE(%) (%) R(·) TO 2 f its atomic formulation. tiation of the CG algorithm to handle the Ivanov lation of OWL regularization; more specifically: e show how the atomic formulation of the OWL orm allows solving efficiently the linear program- ming problem in each iteration of the CG algorithm; ased on results from [25], we show convergence of he resulting algorithm and provide explicit values or the constants. w derivation of the proximity operator of the OWL arguably simpler than those in [10] and [44], ghting its connection to isotonic regression and the adjacent violators (PAV) algorithm [3], [8]. ficient method to project onto an OWL norm ball, on a root-finding scheme. ng the Ivanov formulation under OWL regulariza- sing projected gradient algorithms, based on the sed OWL projection. er is organized as follows. Section II, after reviewing norm and some of its basic properties, presents formulation, and derives its dual norm. 
[Illustration from [Zheng15]: balls of the OWL (ordered weighted ℓ1) norm Ω_w in R² and R³ for different choices of the weight vector w (e.g., w1 > w2 > 0, w1 = w2 > 0, w1 > w2 = 0), where w is required to lie in the monotone non-negative cone K_{m+} = {x ∈ Rⁿ : x1 ≥ x2 ≥ · · · ≥ xn ≥ 0} ⊂ Rⁿ₊. For w ∈ K_{m+} \ {0}, Ω_w is a norm and satisfies Ω_w(x) ≥ w1 |x|_[1], where |x|_[1] denotes the largest-magnitude entry of x. The excerpt also lists [Zheng15]'s computational tools for OWL regularization: the proximity operator (via isotonic regression and the pool adjacent violators algorithm), projection onto an OWL ball via a root-finding scheme, and conditional-gradient and accelerated projected-gradient solvers.] [Zheng15]
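For reference, the OWL norm compared in Table 1 is the weighted sum of the sorted absolute values of a vector, Ω_w(x) = Σ_i w_i |x|_[i], with non-increasing non-negative weights. A minimal sketch of evaluating it (my own few lines, not code from [Zheng15]):

```python
import numpy as np

def owl_norm(x, w):
    """Ordered weighted l1 norm: sum_i w[i] * |x|_[i], with |x|_[1] >= |x|_[2] >= ...
    The weights w are assumed non-increasing and non-negative (the monotone cone K_{m+})."""
    return float(np.dot(np.sort(np.abs(x))[::-1], np.asarray(w)))

x = np.array([0.5, -2.0, 1.0])
w = np.array([3.0, 2.0, 1.0])          # w1 >= w2 >= w3 >= 0
print(owl_norm(x, w))                  # 3*2.0 + 2*1.0 + 1*0.5 = 8.5
# Special cases: w = (1, ..., 1) gives the l1 norm; w = (1, 0, ..., 0) gives the l_inf norm.
```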
  • 22. • - 
VAR • i.i.d. • :
[Excerpt from the paper's analysis of the restricted eigenvalue (RE) condition:] with high probability,
$$\inf_{\Delta \in \mathrm{cone}(\Omega_E)} \frac{\|(I_{p\times p} \otimes X)\Delta\|_2}{\|\Delta\|_2} \;\ge\; \nu,$$
where $c, c_1, c_2$ are positive constants and $L$, $M$ are defined from the covariance structure of the process. Choosing $\eta = \tfrac{1}{2}\sqrt{NL}$, the RE condition is satisfied once the number of samples obeys
$$\sqrt{N} \;>\; \frac{2\sqrt{M} + c\,w(\Theta)}{\sqrt{L/2}} \;=\; O(w(\Theta)).$$
Examining this bound, the number of samples needed to satisfy the RE condition is smaller when $\Lambda_{\min}(\mathcal{A})$ and $\Lambda_{\min}(\Sigma)$ are larger and $\Lambda_{\max}(\mathcal{A})$ and $\Lambda_{\max}(\Sigma)$ are smaller; in turn, this means that the matrices in (10) and (12) must be well conditioned and the process stable, with eigenvalues well inside the unit circle (see Section 3.2). Equivalently, large values of $M$ and small values of $L$ indicate stronger dependency in the data, requiring more samples for the RE condition to hold with high probability.
Analyzing Theorems 4.3 and 4.4, the established results can be interpreted as follows as the size and dimensionality $N$, $p$ and $d$ of the problem increase: select a number of samples at least $N \ge O(w^2(\Theta))$ and let the regularization parameter satisfy $\lambda_N \ge O\big(\tfrac{w(\Omega_R)}{\sqrt{N}} + \tfrac{w^2(\Omega_R)}{N^2}\big)$. With high probability the RE condition $\tfrac{\|Z\Delta\|_2}{\|\Delta\|_2} \ge \kappa\sqrt{N}$ for $\Delta \in \mathrm{cone}(\Omega_E)$ then holds, with $\kappa = O(1)$ a positive constant, and the norm of the estimation error in optimization problem (4) is bounded by
$$\|\Delta\|_2 \;\le\; O\!\Big(\tfrac{w(\Omega_R)}{\sqrt{N}} + \tfrac{w^2(\Omega_R)}{N^2}\Big)\,\Psi(\mathrm{cone}(\Omega_{E_j})),$$
where the compatibility constant $\Psi(\mathrm{cone}(\Omega_{E_j}))$ is assumed to be the same for all $j = 1, \ldots, p$.
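As a concrete specialization (my own sketch, with constants omitted; it relies on standard Gaussian-width facts for ℓ1 balls rather than a formula quoted from the paper): for the Lasso penalty with at most s non-zero entries per row, the general bound reduces to the familiar √(s log(dp)/N) rate, matching the N/[s log(pd)] scaling on the horizontal axis of Figure 1(b).

```latex
% Lasso specialization of the general bound (sketch; constants omitted).
% For R(.) = \ell_1 on a row of length dp:
%   w(\Omega_R) = \mathbb{E}\,\|g\|_\infty = O(\sqrt{\log(dp)})   (Gaussian width of the unit \ell_1 ball),
%   \Psi(\mathrm{cone}(\Omega_{E_j})) = O(\sqrt{s})               (rows with at most s non-zeros),
% so, ignoring the higher-order w^2(\Omega_R)/N^2 term,
\[
  \|\Delta\|_2 \;\le\; O\!\left(\frac{w(\Omega_R)}{\sqrt{N}}\right)\Psi\big(\mathrm{cone}(\Omega_{E_j})\big)
  \;=\; O\!\left(\sqrt{\frac{s\,\log(dp)}{N}}\right).
\]
```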
  • 23. • • • • • - 
 • [Negahban09] • ARX 