2. LASP Research
Learning And Signal Processing Lab
• Electronic Engineering Department at UCL
• 4 PhD students, several MSc students and interns
• Funding: EPSRC, Royal Society, Adobe, Cisco, Huawei
Laura Toni, Silvia Rossi, Kaige Yang, Sephora Madjiheurem, Pedro Gomes
https://laspucl2016.com
3. LASP Research
Key topics: multimedia processing, signal processing, and machine learning.
Key goal: to develop novel adaptive strategies for large-scale networks exploiting graph structure.
Applications:
• virtual reality systems
• graph-based bandit problems
• influence maximization, drug discovery and smart mobility (intelligent transport)
• structural reinforcement learning
https://laspucl2016.com
4. Exploiting Structure in Data
[Thesis Figure 1.1: examples of graphs and signals on graphs. (a) Traffic bottlenecks on a transportation graph, (b) average temperature on a geographical graph, (c) fMRI brain signal on a structural, anatomical graph, (d) gender attributes on a social network graph, and (e) 3D color attributes on a mesh graph. The size and the color of each disc in (a)-(c) indicate the value of the signal at the corresponding vertex.]
Main Motivation: we are surrounded by large-scale interconnected systems with an irregular structure.
Key Intuition: from the vertex (high-dimensional) domain to the spectral (low-dimensional) domain.
Our Research: Graph-Based Learning, at the intersection of Graph Signal Processing and Machine Learning.
Main Goal: exploit the knowledge of the irregular structure to develop efficient learning algorithms.
RGCNN: Regularized Graph CNN for Point Cloud Segmentation
Gusi Te (Peking University, tegusi@pku.edu.cn), Wei Hu (Peking University, forhuwei@pku.edu.cn), Zongming Guo (Peking University, guozongming@pku.edu.cn), Amin Zheng (MTlab, Meitu Inc., zam@meitu.com)
ABSTRACT
Point cloud, an efficient 3D object representation, has become popular with the development of depth sensing and 3D laser scanning techniques. It has attracted attention in various applications such as 3D tele-presence, navigation for unmanned vehicles and heritage reconstruction. The understanding of point clouds, such as point cloud segmentation, is crucial in exploiting the informative value of point clouds for such applications. Due to the irregularity of the data format, previous deep learning works often convert point clouds to regular 3D voxel grids or collections of images before feeding them into neural networks, which leads to voluminous data and quantization artifacts. In this paper, we instead propose a regularized graph convolutional neural network (RGCNN) that directly consumes point clouds. Leveraging on spectral graph theory, we treat features of points in a point cloud as signals on a graph, and define the convolution over the graph by Chebyshev polynomial approximation. In particular, we update the graph Laplacian matrix that describes the connectivity of features in each layer according to the corresponding learned features, which adaptively captures the structure of dynamic graphs. Further, we deploy a graph-signal smoothness prior in the loss function, thus regularizing the learning process. Experimental results on the ShapeNet part dataset show that the proposed approach significantly reduces the computational complexity while achieving competitive performance with the state of the art. Also, experiments show RGCNN is much more robust to both noise and point cloud density in comparison with other methods. We further apply RGCNN to point cloud classification and achieve competitive results on the ModelNet40 dataset.
KEYWORDS
Graph CNN, graph-signal smoothness prior, updated graph Laplacian, point cloud segmentation
1 INTRODUCTION
The development of depth sensors like Microsoft Kinect and 3D scanners like LiDAR has enabled convenient acquisition of 3D point clouds, a popular signal representation of arbitrarily-shaped objects in the 3D space. Point clouds consist of a set of points, each of which is composed of 3D coordinates and possibly attributes such as color.
[Figure 1: illustration of the RGCNN architecture, which directly consumes raw point clouds (a car in this example) without voxelization or rendering. It constructs graphs based on the coordinates and normal of each point, performs graph convolution and feature learning, and adaptively updates graphs in the learning process, which provides an efficient and robust approach for 3D recognition tasks such as point cloud segmentation and classification.]
Previous point cloud segmentation works can be classified into model-driven segmentation and data-driven segmentation. Model-driven methods include edge-based [21], region-growing [22] and model-fitting [27], which are based on prior knowledge of the geometry but are sensitive to noise, uneven density and complicated structure. Data-driven segmentation, on the other hand, learns the semantics from data, for instance with deep learning methods [19]. Nevertheless, typical deep learning architectures require regular input data formats, such as images on regular 2D grids or voxels on 3D grids.
(arXiv:1806.02952v1 [cs.CV], 8 Jun 2018)
Fig. 2. Example of a point cloud of the ‘yellow dress’ sequence (a). The
geometry is captured by a graph (b) and the r component of the color is
considered as a signal on the graph (c). The size and the color of each disc
indicate the value of the signal at the corresponding vertex.
We build on our previous work [4], and propose a novel
algorithm for motion estimation and compensation in 3D
point cloud sequences. We cast motion estimation as a fea-
ture matching problem on dynamic graphs. In particular, we
compute new local features at different scales with spectral
graph wavelets (SGW) [5] for each node of the graph. Our
feature descriptors, which consist of the wavelet coefficients
of each of the signals placed in the corresponding vertex, are
then used to compute point-to-point correspondences between
graphs of different frames. We match our SGW features
in different graphs with a criterion that is based on the
Mahalanobis distance and trained from the data. To avoid
inaccurate matches, we first compute the motion on a sparse
set of matching nodes that satisfy the matching criterion. We
then interpolate the motion of the other nodes of the graph by
solving a new graph-based quadratic regularization problem,
which promotes smoothness of the motion vectors on the graph
in order to build a consistent motion field.
Then, we design a compression system for 3D point cloud
sequences, where we exploit the estimated motion information
in the predictive coding of the geometry and color information.
The basic blocks of our compression architecture are shown
in Fig. 3. We code the motion field in the graph Fourier
domain by exploiting its smoothness on the graph. Temporal
redundancy in consecutive 3D positions is removed by coding
the structural difference between the target frame and the
motion compensated reference frame. The structural difference
is efficiently described in a binary stream format as described
in [6]. Finally, we predict the color of the target frame by
interpolating it from the color of the motion compensated
reference frame. Only the difference between the actual color
information and the result of the motion compensation is actu-
ally coded with a state-of-the-art encoder for static octree data
[7]. Experimental results illustrate that our motion estimation
scheme effectively captures the correlation between consecutive frames.
[Fig. 3: schematic of the proposed compression scheme for a 3D point cloud sequence, covering the efficient coding of the geometry and color attributes.]
Irregular but structured data (volumetric images, fMRI brain signals, mobility patterns) … which can be sparsely represented in the latent space.
Graph Signal Processing tools for structure: graph Fourier transform, graph dictionary.
Our Goal: exploit the knowledge of the irregular structure to develop data-efficient online decision-making strategies.
6. GSP-Based DMS
▪ Data-efficiency: learn in a sparse domain
▪ Accuracy: a learning representation that preserves the geometry of the problem
▪ A mathematical framework is missing
Applications: network process optimization, recommendation systems, large-scale RL, multi-armed bandit problems.
8. Outline
• Graphs and Bandits
• Importance of Graphs in Decision-Making
• A Laplacian Perspective
• Output and Intuitions
• Conclusions
9. Recommendation Model
θ ∈ ℝ^d : user parameter vector
x ∈ ℝ^d : item feature vector
y : linear payoff
η : σ-sub-Gaussian noise
y_t = x_tᵀ θ + η
Aim: infer the best item (item 1, item 2, …?) by running a sequence of trials.
Cumulative regret: R_T = 𝔼[ Σ_{t=1}^{T} max_x xᵀθ − Σ_{t=1}^{T} y_t ]
Arm selection: x_t = arg max_{x∈𝒜, θ∈𝒞_t} xᵀθ, where 𝒞_t is a confidence set.
10. Our Problem
y = xᵀθ + η
Aim: infer the best item (item 1, item 2, …?) by running a sequence of trials.
This is the well-known bandit problem, under the assumptions of (i) stochasticity, (ii) i.i.d. rewards, and (iii) stationarity.
Our interest (today's talk): the multi-user (high-dimensional) case.
11. Main Challenges in DMS
[Diagram: training data → action → observation → updated knowledge]
Theoretically addressed by:
▪ multi-armed bandit problems
▪ reinforcement learning
Challenges:
▪ Find the optimal trade-off between exploration and exploitation → bandit and RL problems
▪ Sampling efficiency: the learning performance should not scale with the ambient dimension (number of arms, states, etc.) → structured learning
12. Structured DMS - Main Challenges
• In DMSs, context or action payoffs (data) carry semantically rich information.
Structured problems obviate the curse of dimensionality by exploiting the data structure.
13. Structured DMS - Main Challenges
• In DMSs, context or action payoffs (data) carry semantically rich information.
Structured problems obviate the curse of dimensionality by exploiting the data structure.
Graph Clustering (C. Gentile et al., "Online Clustering of Bandits", ICML 2014):
• reduces the curse of dimensionality
• but degrades on real-world data
→ Need for more sophisticated frameworks (than clustering) to handle high-dimensional and structured data.
14. Structured DMS - Main Challenges
• In DMSs, context or action payoffs (data) carry semantically rich information.
• It is important to identify and leverage the structure underneath these data.
Many works on bandits are graph-based; see the overview in [1].
▪ data structure in bandits:
‣ Gentile, C., Li, S., and Zappella, G., "Online clustering of bandits", ICML 2014
‣ Korda, N., Szorenyi, B., and Shuai, L., "Distributed clustering of linear bandits in peer to peer networks", JMLR, 2016
‣ Yang, K. and Toni, L., "Graph-based recommendation system", IEEE GlobalSIP, 2018
[1] Michal Valko, "Bandits on graphs and structures", habilitation thesis, École normale supérieure de Cachan (ENS Cachan), 2016
Can we capture external information beyond the data structure?
15. Structured DMS - Main Challenges
• In DMSs, context or action payoffs (data) carry semantically rich information.
• It is important to identify and leverage the structure underneath these data.
Many works on bandits are graph-based; see the overview in [1].
▪ spectral bandits:
‣ N. Cesa-Bianchi, et al., "A gang of bandits", NeurIPS 2013
‣ M. Valko, et al., "Spectral bandits for smooth graph functions", ICML 2014
‣ S. Vaswani, M. Schmidt, and L. Lakshmanan, "Horde of bandits using gaussian markov random fields", arXiv, 2017
‣ other recent works on asynchronous and decentralized network bandits
Limitations:
✦ single-user bandits
✦ no per-user error bound → coarse regret upper bounds scaling linearly with the number of users
✦ high computational complexity
[1] Michal Valko, "Bandits on graphs and structures", habilitation thesis, École normale supérieure de Cachan (ENS Cachan), 2016
16. Structured DMS - Main Challenges
• In DMSs, context or action payoffs (data) carry semantically rich information.
• It is important to identify and leverage the structure underneath these data.
• Highly interesting studies on graph bandits have already been published, but most of them work in the graph spatial (vertex) domain.
• Data can be high-dimensional, time-varying, and a composition of superimposed phenomena.
• A proper framework is needed to capture both data-structure and external-geometry information (graphs).
Graph signal processing (GSP) can be applied to DMSs to address the above challenges and needs.
17. Graph Signal Processing
[Figure: a signal f : V → ℝ^N on a 9-vertex graph (v1, …, v9), with positive, zero, and negative values at the vertices.]
Structured but irregular (network-structured) data can be represented by graph signals.
Goal: capture both structure (edges) and data (values at vertices).
18. Frequency Analysis
• Shuman, D. I., et al., "The emerging field of signal processing on graphs: Extending high-dimensional data analysis to networks and other irregular domains," IEEE Signal Processing Magazine, 30(3):83-98, 2013.
• Bronstein, M. M., et al., "Geometric deep learning: going beyond Euclidean data," IEEE Signal Processing Magazine, 34(4):18-42, 2017.

The graph Fourier transform (GFT) and its inverse (IGFT) map a signal f between the vertex domain and the graph spectral domain, ordered from low to high graph frequency λ_l:

f̂(l) = ⟨f, χ_l⟩ = Σ_{n=1}^{N} f(n) χ_l*(n),    f(n) = Σ_{l=0}^{N−1} f̂(l) χ_l(n), ∀n,

where the χ_l are the Laplacian eigenvectors; the constant eigenvector satisfies χ_0ᵀ L χ_0 = λ_0 = 0.

[Excerpt from Shuman et al., 2013:]
GRAPH SIGNAL REPRESENTATIONS IN TWO DOMAINS. The graph Fourier transform (3) and its inverse (4) give us a way to equivalently represent a signal in two different domains: the vertex domain and the graph spectral domain. While we often start with a signal g in the vertex domain, it may also be useful to define a signal ĝ directly in the graph spectral domain. We refer to such signals as kernels. In Figure 4(a) and (b), one such signal, a heat kernel, is shown in both domains. Analogously to the classical analog case, the graph Fourier coefficients of a smooth signal such as the one shown in Figure 4 decay rapidly. Such signals are compressible, as they can be closely approximated by just a few graph Fourier coefficients (see, e.g., [24]-[26] for ways to exploit this compressibility).

DISCRETE CALCULUS AND SIGNAL SMOOTHNESS WITH RESPECT TO THE INTRINSIC STRUCTURE OF THE GRAPH. When we analyze signals, it is important to emphasize that properties such as smoothness are with respect to the intrinsic structure of the data domain, which in our context is the weighted graph. Whereas differential geometry provides tools to incorporate the geometric structure of the underlying manifold into the analysis of continuous signals on differentiable manifolds, discrete calculus provides a "set of definitions and differential operators that make it possible to operate the machinery of multivariate calculus on a finite, discrete space" [14, p. 1]. In some problems, the weighted graph arises from a discrete sampling of a smooth manifold; in that situation, the discrete differential operators may converge, possibly under additional assumptions, to their namesake continuous operators as the density of the sampling increases (see [31]-[34] for the convergence of discrete graph Laplacians, normalized and unnormalized, to continuous manifold Laplacians).

The edge derivative of a signal f with respect to edge e = (i, j) at vertex i is defined as ∂f/∂e |_i = √(W_ij) [f(j) − f(i)], and the graph gradient of f at vertex i is the vector ∇_i f = [ ∂f/∂e |_i ]_{e=(i,j)∈E for some j∈V}. The local variation at vertex i,

‖∇_i f‖_2 = [ Σ_{j∈N_i} W_ij (f(j) − f(i))² ]^{1/2},

provides a measure of local smoothness of f around vertex i, as it is small when the function f has similar values at i and at all neighboring vertices of i. For notions of global smoothness, the discrete p-Dirichlet form of f is defined as

S_p(f) := (1/p) Σ_{i∈V} ‖∇_i f‖_2^p = (1/p) Σ_{i∈V} [ Σ_{j∈N_i} W_ij (f(j) − f(i))² ]^{p/2}.   (5)

When p = 1, S_1(f) is the total variation of the signal with respect to the graph. When p = 2, we have

S_2(f) = (1/2) Σ_{i∈V} Σ_{j∈N_i} W_ij [f(j) − f(i)]² = Σ_{(i,j)∈E} W_ij [f(j) − f(i)]² = fᵀ L f.   (6)

S_2(f) is known as the graph Laplacian quadratic form [17], and the seminorm ‖f‖_L is defined as ‖f‖_L := ‖L^{1/2} f‖_2 = √(fᵀ L f) = √(S_2(f)). Note from (6) that the quadratic form S_2(f) is equal to zero if and only if f is constant across all vertices (which is why ‖f‖_L is only a seminorm).

[FIG4] Equivalent representations of a graph signal in the vertex and graph spectral domains. (a) A signal g that resides on the vertices of the Minnesota road graph [27] with Gaussian edge weights as in (1); the signal's component values are represented by the blue (positive) and black (negative) bars coming out of the vertices. (b) The same signal in the graph spectral domain. In this case, the signal is a heat kernel, which is actually defined directly in the graph spectral domain by ĝ(λ_l) = e^{−5λ_l}. The signal plotted in (a) is then determined by taking an inverse graph Fourier transform (4) of ĝ.

[Excerpt from Bronstein et al., 2017, sidebar "Physical Interpretation of Laplacian Eigenfunctions":]
For a function f on the domain X, the Dirichlet energy

E_Dir(f) = ∫_X ‖∇f(x)‖² dx = ∫_X f(x) Δf(x) dx   (S1)

measures how smooth it is [the last identity in (S1) stems from (14)-(15)]. We are looking for an orthonormal basis on X containing the k smoothest possible functions (Figure S2), by solving the optimization problem

min_{φ_0} E_Dir(φ_0)  s.t. ‖φ_0‖ = 1;   min_{φ_i} E_Dir(φ_i)  s.t. ‖φ_i‖ = 1, φ_i ⊥ span{φ_0, …, φ_{i−1}}, i = 1, 2, …, k − 1.   (S2)

In the discrete setting, when the domain is sampled at n points, (S2) can be rewritten as

min_{Φ_k ∈ ℝ^{n×k}} trace(Φ_kᵀ Δ Φ_k)  s.t. Φ_kᵀ Φ_k = I,   (S3)

where Φ_k = (φ_0, …, φ_{k−1}). The solution of (S3) is given by the first k eigenvectors of Δ satisfying

Δ Φ_k = Φ_k Λ_k,   (S4)

where Λ_k = diag(λ_0, …, λ_{k−1}) is the diagonal matrix of corresponding eigenvalues. The eigenvalues 0 = λ_0 ≤ λ_1 ≤ … ≤ λ_{k−1} are nonnegative due to the positive semidefiniteness of the Laplacian and can be interpreted as frequencies, where φ_0 = const with the corresponding eigenvalue λ_0 = 0 plays the role of the direct-current component.
The Laplacian eigendecomposition can be carried out in two ways. First, (S4) can be rewritten as a generalized eigenproblem (D − W)Φ_k = A Φ_k Λ_k, resulting in A-orthogonal eigenvectors, Φ_kᵀ A Φ_k = I. Alternatively, introducing a change of variables Ψ_k = A^{1/2} Φ_k, we can obtain a standard eigendecomposition problem A^{−1/2}(D − W)A^{−1/2} Ψ_k = Ψ_k Λ_k with orthogonal eigenvectors Ψ_kᵀ Ψ_k = I. When A = D is used, the matrix Δ = D^{−1/2}(D − W)D^{−1/2} is referred to as the normalized symmetric Laplacian.

[FIGURE S2] An example of the first four Laplacian eigenfunctions φ_0, …, φ_3 on (a) a Euclidean domain (1-D line) and (b), (c) non-Euclidean domains [(b) a human shape modeled as a 2-D manifold, and (c) a Minnesota road graph]. In the Euclidean case, the result is the standard Fourier basis comprising sinusoids of increasing frequency. In all cases, the eigenfunction φ_0 corresponding to the zero eigenvalue is constant (the direct-current component). 1-D: one-dimensional.
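To make the GFT/IGFT pair above concrete, here is a minimal numpy sketch on a toy 4-node graph; the weight matrix and signal values are illustrative, not from the slides.

```python
# Minimal sketch of the graph Fourier transform (GFT) and its inverse,
# assuming an undirected graph given by a symmetric weight matrix W.
import numpy as np

W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 0],
              [1, 1, 0, 1],
              [0, 0, 1, 0]], dtype=float)   # toy 4-node graph
D = np.diag(W.sum(axis=1))                  # degree matrix
L = D - W                                   # combinatorial Laplacian

# Eigendecomposition L = chi @ diag(lam) @ chi.T: the eigenvalues lam act
# as graph frequencies, the (real) eigenvectors chi as Fourier modes.
lam, chi = np.linalg.eigh(L)

f = np.array([0.9, 1.1, 1.0, -2.0])         # signal on the vertices

f_hat = chi.T @ f                            # GFT:  f_hat(l) = <f, chi_l>
f_rec = chi @ f_hat                          # IGFT: f(n) = sum_l f_hat(l) chi_l(n)

assert np.allclose(f, f_rec)                 # the two domains are equivalent
print("graph frequencies:", lam)             # lam[0] = 0, chi[:, 0] = const (DC)
print("f^T L f =", f @ L @ f,
      "= sum_l lam_l |f_hat_l|^2 =", np.sum(lam * f_hat**2))  # quadratic form, Eq. (6)
```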
19. Filtering and Smoothness
A graph filter acts in the spectral domain: take the GFT of the input, multiply each coefficient f̂(l) by a transfer function ĝ(λ_l), and take the IGFT,

y(n) = Σ_{l=0}^{N−1} ĝ(λ_l) f̂(l) χ_l(n).

Example: an input signal x in the vertex domain with xᵀLx = 61.93; in the spectral domain, the frequency content x̂(λ) is multiplied by a low-pass kernel g to give the filtered signal ŷ; back in the vertex domain, the filtered signal y has yᵀLy = 10.75.
Observation: the low-pass filtered signal y is much smoother than x!

Denoising problem: remove noise by low-pass filtering in the graph spectral domain,

y⋆ = arg min_y { ‖y − f‖²₂ + γ yᵀ L y }
y⋆ = (I + γL)^{−1} f = χ (I + γΛ)^{−1} χᵀ f,

i.e., ŷ(l) = (1 + γλ_l)^{−1} f̂(l), a low-pass filter g(L).

M. Defferrard, "Deep Learning on Graphs: a journey from continuous manifolds to discrete networks" (KCL/UCL Junior Geometry Seminar).
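A minimal sketch of the denoising step above, assuming a random toy graph; the edge density, noise scale, and γ are illustrative.

```python
# Tikhonov smoothing on a graph:
# y* = argmin_y ||y - f||^2 + gamma * y^T L y = (I + gamma*L)^{-1} f,
# which is the low-pass spectral filter g(lam) = 1 / (1 + gamma*lam).
import numpy as np

rng = np.random.default_rng(0)
n = 20
W = (rng.random((n, n)) < 0.2).astype(float) # random toy graph
W = np.triu(W, 1); W = W + W.T               # symmetric, no self-loops
L = np.diag(W.sum(axis=1)) - W               # combinatorial Laplacian

lam, chi = np.linalg.eigh(L)
f_clean = chi[:, 1]                          # a smooth (low-frequency) signal
f = f_clean + 0.3 * rng.standard_normal(n)   # noisy observation

gamma = 2.0
y = np.linalg.solve(np.eye(n) + gamma * L, f)  # y* = (I + gamma*L)^{-1} f
# equivalently in the spectral domain:
# y = chi @ ((1.0 / (1.0 + gamma * lam)) * (chi.T @ f))

print("f^T L f =", f @ L @ f, "->  y^T L y =", y @ L @ y)  # y is much smoother
```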
20. GSP for Online DMS
GSP: exploit spectral properties. MAB: exploration-exploitation trade-off on training data.
GSP-Based MAB:
▪ Data-efficiency: learn in a sparse domain
▪ Accuracy: a learning representation that preserves the geometry of the problem
▪ A mathematical framework is missing
▪ Not many works beyond smoothness
21. Outline
• Graphs and Bandits
• Importance of Graphs in Decision-Making
• A Laplacian Perspective
• Output and Intuitions
• Conclusions
22. Recommendation Model
θ ∈ ℝ^d : user parameter vector
x ∈ ℝ^d : item feature vector
y : linear payoff
η : σ-sub-Gaussian noise
y = xᵀθ + η
Aim: infer the best item (item 1, item 2, …?) by running a sequence of trials.
23. Recommendation Model
y = xᵀθ + η
Aim: infer the best item (item 1, item 2, …?) by running a sequence of trials.
This is the well-known bandit problem, under the assumptions of (i) stochasticity, (ii) i.i.d. rewards, and (iii) stationarity.
Our interest (today's talk): the multi-user (high-dimensional) case.
24. Settings
• centralized agent
• m arms and n users
• users appearing uniformly at random
• At round t, user i_t appears, and the agent
  ‣ chooses an arm a_t
  ‣ receives a reward y_t = x_{a_t}ᵀ θ_{i_t} + η_t
• Sequential sampling strategy (bandit algorithm): a_{t+1} = F_t(i_1, a_1, y_1, …, i_t, a_t, y_t | i_{t+1})
• Goal: maximize the sum of rewards 𝔼[ Σ_{t=1}^{T} y_t ]
25. Exploiting Graph Structure
• 𝒢 = (V, E, W) : undirected weighted graph
• W_{i,j} = W_{j,i} captures the similarity between users i and j (i.e., between θ_i and θ_j): similarity captured in the latent space
• L = D − W : combinatorial Laplacian of 𝒢
26. Assumptions
• User preferences mapped into a graph of similarities
• Exploitation of a smoothness prior
Θ = [θ_1, θ_2, …, θ_n]ᵀ ∈ ℝ^{n×d} : signal on the graph
• The smoothness of Θ over the graph 𝒢 can be quantified using the Laplacian quadratic form.
• We express smoothness as a function of the random-walk Laplacian ℒ = D^{−1}L, with ℒ_ii = 1 and Σ_{j≠i} ℒ_ij = −1:

tr(Θᵀ ℒ Θ) = (1/4) Σ_{k=1}^{d} Σ_{i∼j} ( W_ij/D_ii + W_ji/D_jj ) (Θ_ik − Θ_jk)²   (smoothness measure)

• avoiding a regret scaling with D_ii
• achieving the convexity property needed to bound the estimation error
A small sketch of this smoothness measure follows this slide.
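The sketch below builds the random-walk Laplacian and evaluates tr(ΘᵀℒΘ) on an illustrative 5-user graph; the weights and Θ are made up for the example.

```python
# Random-walk Laplacian and the smoothness measure tr(Theta^T L_rw Theta).
import numpy as np

rng = np.random.default_rng(1)
n, d = 5, 3
W = np.array([[0, 1, 1, 0, 0],
              [1, 0, 1, 0, 0],
              [1, 1, 0, 1, 0],
              [0, 0, 1, 0, 1],
              [0, 0, 0, 1, 0]], dtype=float)
D = np.diag(W.sum(axis=1))
L = D - W                                     # combinatorial Laplacian
L_rw = np.linalg.solve(D, L)                  # random-walk Laplacian D^{-1} L

assert np.allclose(np.diag(L_rw), 1.0)        # L_rw[i, i] = 1
assert np.allclose(L_rw.sum(axis=1), 0.0)     # off-diagonal row entries sum to -1

Theta = rng.standard_normal((n, d))           # one user vector theta_i per row
smoothness = np.trace(Theta.T @ L_rw @ Theta) # small when neighbors agree
print("tr(Theta^T L_rw Theta) =", smoothness)
```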
27. Problem Formulation
Given
‣ the users graph 𝒢
‣ the arm feature vectors x_a, a ∈ {1, 2, …, m}
‣ no information about the user parameters θ_i, i ∈ {1, 2, …, n},
the agent seeks the optimal selection strategy that minimizes the cumulative (pseudo) regret

R_T = Σ_{t=1}^{T} ( (x_t*)ᵀ θ_{i_t} − x_tᵀ θ_{i_t} )
28. Problem Formulation
Given
‣ the users graph 𝒢
‣ the arm feature vectors x_a, a ∈ {1, 2, …, m}
‣ no information about the user parameters θ_i, i ∈ {1, 2, …, n},
the agent seeks the optimal selection strategy that minimizes the cumulative (pseudo) regret

R_T = Σ_{t=1}^{T} ( (x_t*)ᵀ θ_{i_t} − x_tᵀ θ_{i_t} )

Under the smoothness prior, the users' parameter vectors are estimated as

Θ̂_t = arg min_{Θ ∈ ℝ^{n×d}} Σ_{i=1}^{n} Σ_{τ∈t_i} (x_τᵀ θ_i − y_{i,τ})² + α tr(Θᵀ ℒ Θ)

(first term: fidelity; second term: smoothness regularizer). The agent selects sequential actions as

x_{i,t} = arg max_{(x,θ) ∈ (𝒟, 𝒞_{i,t})} xᵀθ,

where 𝒞_{i,t} is a confidence set.
29. Problem Formulation
Main Challenges
• smoothness is imposed not in the observation domain but in the representation domain
• no theoretical error bound for the Laplacian-regularized estimate
• computational complexity
Main Novelties
• derivation of a single-user estimation error bound
• a single-user UCB for the bandit problem
• a low-complexity (local) algorithm
• a cumulative regret bound as a function of graph properties
30. Laplacian-regularized Estimator

Θ̂_t = arg min_{Θ ∈ ℝ^{n×d}} Σ_{i=1}^{n} Σ_{τ∈t_i} (x_τᵀ θ_i − y_{i,τ})² + α tr(Θᵀ ℒ Θ)   (4)

Closed-form solution. Although not straightforward to see, Eq. 4 has a closed-form solution [Alvarez et al., 2012]:

vec(Θ̂_t) = (Φ_t Φ_tᵀ + α ℒ ⊗ I)^{−1} Φ_t Y_t   (5)

where ⊗ is the Kronecker product, vec(Θ̂_t) ∈ ℝ^{nd} is the concatenation of the columns of Θ̂_t, I ∈ ℝ^{d×d} is the identity matrix, Y_t = [y_1, y_2, …, y_t]ᵀ ∈ ℝ^t is the collection of all payoffs, and Φ_t = [φ_1, φ_2, …, φ_t] ∈ ℝ^{nd×t}, where each φ_t ∈ ℝ^{nd} is a long sparse vector indicating the event that the arm with feature x_t is selected for user i. Formally,

φ_tᵀ = ( 0, …, 0, x_tᵀ, 0, …, 0 ),  with (i−1)×d zeros before x_tᵀ and (n−i)×d zeros after.   (6)

Decoupling estimates. Eq. 5 gives the closed-form solution of Θ̂_t, but does not provide a closed-form solution for the single-user estimate θ̂_{i,t}, i ∈ [1, …, n]. Mathematically, θ̂_{i,t} can be obtained by decoupling Eq. 5; however, due to the inversion (Φ_t Φ_tᵀ + αℒ⊗I)^{−1}, such decoupling is non-trivial and tedious. We notice that the single-user estimate θ̂_{i,t} can be closely approximated by Lemma 1.

Lemma 1. Let Θ̂_t be obtained from Eq. 5, and let θ̂_{i,t} be the i-th row of Θ̂_t, which is the estimate of θ_i. Then θ̂_{i,t} can be approximated by

θ̂_{i,t} ≈ A_{i,t}^{−1} X_{i,t} Y_{i,t} − α A_{i,t}^{−1} Σ_{j=1}^{n} ℒ_ij A_{j,t}^{−1} X_{j,t} Y_{j,t}   (7)

where A_{i,t} = Σ_{τ∈t_i} x_τ x_τᵀ ∈ ℝ^{d×d} is the Gram matrix of user i, ℒ_ij is the (i,j)-th element of ℒ, and Y_{i,t} = [y_{i,1}, …, y_{i,t_i}] is the collection of payoffs associated with user i up to time t. A numerical sketch of Eq. 5 follows this slide.
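As referenced above, here is a minimal numpy sketch of the closed form in Eq. 5 on synthetic data; the graph, dimensions, and noise level are illustrative. It builds the sparse selection vectors of Eq. 6 and recovers Θ̂_t in one solve.

```python
# Sketch of Eq. 5: vec(Theta_hat) = (Phi Phi^T + alpha * (L kron I))^{-1} Phi Y,
# on a small synthetic history of (user, arm-feature, payoff) triples.
import numpy as np

rng = np.random.default_rng(2)
n, d, T, alpha = 4, 3, 50, 1.0

W = np.ones((n, n)) - np.eye(n)              # fully connected toy user graph
D = np.diag(W.sum(axis=1))
L_rw = np.linalg.solve(D, D - W)             # random-walk Laplacian

theta = rng.standard_normal((n, d))          # ground-truth user vectors
Phi = np.zeros((n * d, T))                   # sparse selection vectors phi_t (Eq. 6)
Y = np.zeros(T)
for t in range(T):
    i = rng.integers(n)                      # user i_t served at round t
    x = rng.standard_normal(d); x /= np.linalg.norm(x)
    Phi[i*d:(i+1)*d, t] = x                  # x_t placed in user i's block
    Y[t] = x @ theta[i] + 0.1 * rng.standard_normal()

M = Phi @ Phi.T + alpha * np.kron(L_rw, np.eye(d))
vec_theta_hat = np.linalg.solve(M, Phi @ Y)  # Eq. 5
Theta_hat = vec_theta_hat.reshape(n, d)      # row i approximates theta_i
print("estimation error:", np.linalg.norm(Theta_hat - theta))
```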
31. Laplacian-regularized Estimator
Let Λ_{i,t} ∈ ℝ^{d×d} be the precision matrix of θ̂_{i,t}. We can define a confidence set around θ̂_{i,t} based on the Mahalanobis distance [De Maesschalck et al., 2000], which is commonly used in the bandit literature [Dani et al., 2008, Valko et al., 2013, Lattimore and Szepesvári, 2018]. Formally,

C_{i,t} = { θ_i : ‖θ̂_{i,t} − θ_i‖_{Λ_{i,t}} ≤ β_{i,t} }   (8)

where β_{i,t} is the upper bound on ‖θ̂_{i,t} − θ_i‖_{Λ_{i,t}}. Let Λ_t ∈ ℝ^{nd×nd} denote the precision matrix of vec(Θ̂_t) ∈ ℝ^{nd}; Λ_{i,t} ∈ ℝ^{d×d} is the i-th block matrix along the diagonal of Λ_t, so Λ_{i,t} can be obtained from Λ_t. Specifically, define A_t = Φ_t Φ_tᵀ, L_⊗ = ℒ ⊗ I and M_t = A_t + α L_⊗. Then

Λ_t = M_t A_t^{−1} M_t   (9)

and Λ_{i,t} is

Λ_{i,t} = A_{i,t} + 2αℒ_ii I + α² Σ_{j=1}^{n} ℒ_ij² A_{j,t}^{−1}   (10)

where A_{i,t} = Σ_{τ∈t_i} x_τ x_τᵀ ∈ ℝ^{d×d} and ℒ_ij is the (i,j)-th element of ℒ. The detailed derivation is included in Appendices B and C. Given Eq. 10, we can upper bound the size of the confidence set defined in Eq. 8, which gives the value of β_{i,t}.

Lemma 2. Let t_i be the set of times at which user i is served up to time t, A_{i,t} = Σ_{τ∈t_i} x_τ x_τᵀ, V_{i,t} = A_{i,t} + αℒ_ii I, ξ_{i,t} = Σ_{τ∈t_i} x_{i,τ} η_{i,τ}, and I ∈ ℝ^{d×d} the identity matrix; Λ_{i,t} is defined in Eq. 10. Denote Δ_i = Σ_{j=1}^{n} ℒ_ij θ_j. The size of the confidence set defined in Eq. 8 satisfies the following upper bound with probability 1 − δ, δ ∈ [0, 1]:

‖θ̂_{i,t} − θ_i‖_{Λ_{i,t}} ≤ √( 2 log( |V_{i,t}|^{1/2} / (δ |αI|^{1/2}) ) ) + √α ‖Δ_i‖₂   (11)
32. Laplacian-regularized Estimator
Error bound. From Lemma 2,

β_{i,t} = √( 2 log( |V_{i,t}|^{1/2} / (δ |αI|^{1/2}) ) ) + √α ‖Δ_i‖₂   (12)

Proof: Appendix D.

Remark 2 (graph information). The graph structure is included in the term

Δ_i = Σ_{j=1}^{n} ℒ_ij θ_j = θ_i − Σ_{j≠i} (−ℒ_ij θ_j),

i.e., the difference between user i and the weighted combination of its neighbors. Its norm satisfies ‖Δ_i‖₂ ∈ [0, ‖θ_i‖₂] and decreases with the smoothness of Θ on the graph [Figure 1(a): ‖Δ_i‖₂ vs. smoothness]. For instance, in a fully connected graph with ℒ_ij = −1/(n−1) and identical user vectors, Δ_i = 0.
33. GraphUCB
Algorithm 1: GraphUCB
Input: α, T, ℒ, δ
Initialization: for any i ∈ {1, 2, …, n}: θ̂_{0,i} = 0 ∈ ℝ^d, Λ_{0,i} = 0 ∈ ℝ^{d×d}, A_{0,i} = 0 ∈ ℝ^{d×d}, β_{i,0} = 0.
for t ∈ [1, T] do
  User index i_t is selected
  1. A_{i,t} ← A_{i,t−1} + x_{i,t−1} x_{i,t−1}ᵀ if i = i_t
  2. A_{j,t} ← A_{j,t−1}, ∀j ≠ i_t
  3. Update Λ_{i,t} via Eq. 10
  4. Select x_{i,t} via Eq. 13, where β_{i,t} is defined in Eq. 12:
     x_{i,t} = arg max_{x∈𝒟} xᵀ θ̂_{i,t} + β_{i,t} ‖x‖_{Λ_{i,t}^{−1}}   (13)
  5. Receive the payoff y_{i,t}
  6. Update Θ̂_t via Eq. 4
end
Formally, at time t the arm for user i_t is selected optimistically within the confidence set C_{i,t}. A minimal sketch of the selection rule follows.
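The sketch below implements step 4 (Eq. 13), reusing Theta_hat, the per-user Gram matrices, and L_rw from the previous sketch. β is treated here as a tunable exploration constant rather than the exact bound of Eq. 12, and a small ridge term (not in the slides) keeps empty Gram matrices invertible.

```python
# Sketch of the GraphUCB selection rule (Eq. 13).
import numpy as np

def graph_ucb_arm(i, arms, Theta_hat, A, L_rw, alpha=1.0, beta=1.0):
    """Pick the arm maximizing x^T theta_hat_i + beta * ||x||_{Lambda_i^{-1}}."""
    n, d = Theta_hat.shape
    # Eq. 10: Lambda_i = A_i + 2*alpha*L_ii*I + alpha^2 * sum_j L_ij^2 * A_j^{-1}
    Lambda_i = A[i] + 2 * alpha * L_rw[i, i] * np.eye(d)
    for j in range(n):
        A_j_inv = np.linalg.inv(A[j] + 1e-6 * np.eye(d))  # ridge: invertibility
        Lambda_i += alpha**2 * L_rw[i, j]**2 * A_j_inv
    Lambda_inv = np.linalg.inv(Lambda_i)
    ucb = [x @ Theta_hat[i] + beta * np.sqrt(x @ Lambda_inv @ x) for x in arms]
    return int(np.argmax(ucb))

# Usage: at round t, user i_t arrives; pick arms[graph_ucb_arm(i_t, ...)],
# observe the payoff, update A[i_t] with the chosen feature, and recompute
# Theta_hat via the closed form of Eq. 5.
```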
34. Analysis
We now present the performance of the proposed algorithms, together with a comparison with no-graph UCB. We first present a lemma on a term which plays a central role in the regret upper bound; next, we state the regret upper bound; finally, the regret bound is compared with that of LinUCB [Li et al., 2010] and Gob.Lin [Cesa-Bianchi et al., 2013].

Lemma 3. Define

Ψ_{i,t_i} = ( Σ_{t=1}^{t_i} ‖x_{i,t}‖²_{Λ_{i,t}^{−1}} ) / ( Σ_{t=1}^{t_i} ‖x_{i,t}‖²_{V_{i,t}^{−1}} ),

where V_{i,t_i} = A_{i,t_i} + α ℒ_ii I and Λ_{i,t_i} is defined¹ in Eq. 10. Without loss of generality, assume ‖x_{i,t}‖₂ ≤ 1 for any t, t_i and i. Then

Ψ_{i,t_i} ∈ (0, 1].   (14)

Furthermore, a more densely connected graph leads to a smaller Ψ_{i,t_i}; empirical evidence is provided in Fig. 1(b). Proof: Appendix F.

6.1 Regret Upper Bound. We present the cumulative regret upper bounds satisfied by both GraphUCB and GraphUCB-Local. We first give the upper bound on the single-user cumulative regret over t_i (Theorem 1); summing over users, the network regret satisfies, with probability 1 − δ,

R_T = Σ_{i=1}^{n} R_{i,t_i} = Õ( d √(Tn) max_i Ψ_{i,t_i} ).

Proof: see Appendix G.
Remark 4. We emphasize that the bounds are satisfied by both GraphUCB and GraphUCB-Local: the two differ in the estimate θ̂_{i,t}, but the bounds are based on ground-truth quantities common to both algorithms, so the same upper bounds hold.
6.2 Comparison with LinUCB. With LinUCB [Li et al., 2010] (described in Section 3), the regret of user i over time horizon t_i satisfies, with probability 1 − δ, the bound reported on the next slides.
¹ For any isolated node i, ℒ_ii is set to 1.
35. Regret Analysis
Single-User Regret. The cumulative regret over t_i of user i satisfies the following upper bound with probability 1 − δ:

𝒪( ( √(d log t_i) + √α ‖Δ_i‖₂ ) Ψ_{i,t_i} √(d t_i log t_i) ) = 𝒪( d √t_i Ψ_{i,t_i} )

Network Regret. Assuming users are served uniformly, then, over the time horizon T, the total cumulative regret experienced by all users,

R_T = Σ_{i=1}^{n} R_{i,t_i},

satisfies the following upper bound with probability 1 − δ:

𝒪( d √(Tn) max_i Ψ_{i,t_i} )
36. Single User Comparison
Single user:
GraphUCB:  𝒪( ( √(d log t_i) + √α ‖Δ_i‖₂ ) Ψ_{i,t_i} √(d t_i log t_i) )
LinUCB:    𝒪( ( √(d log t_i) + √α ‖θ_i‖₂ ) √(d t_i log t_i) )
Since ‖Δ_i‖₂ ∈ [0, ‖θ_i‖₂] and Ψ_{i,t_i} ∈ (0, 1], smoothness and connectivity reduce the regret.
• Li, L., Chu, W., Langford, J., and Schapire, R. E., "A contextual-bandit approach to personalized news article recommendation," in Proc. 19th International Conference on World Wide Web, pp. 661-670, 2010.
• Cesa-Bianchi, N., Gentile, C., and Zappella, G., "A gang of bandits," NeurIPS 2013.
37. Single User Comparison
Single user:
GraphUCB:  𝒪( ( √(d log t_i) + √α ‖Δ_i‖₂ ) Ψ_{i,t_i} √(d t_i log t_i) )
LinUCB:    𝒪( ( √(d log t_i) + √α ‖θ_i‖₂ ) √(d t_i log t_i) )
All users:
GraphUCB:  𝒪( d √(Tn) max_i Ψ_{i,t_i} )
GOB.Lin:   𝒪( nd √T )
• Li, L., Chu, W., Langford, J., and Schapire, R. E., "A contextual-bandit approach to personalized news article recommendation," in Proc. 19th International Conference on World Wide Web, pp. 661-670, 2010.
• Cesa-Bianchi, N., Gentile, C., and Zappella, G., "A gang of bandits," NeurIPS 2013.
38. Results - Synthetic
[Figure 3: cumulative-regret performance on different graph types: (a) RBF, (b) RBF-Sparse (0.5), (c) ER (p = 0.2), (d) BA (m = 1).]
39. Results - Real World Data
We test on MovieLens and Netflix [Bennett et al., 2007], following the data preprocessing steps in [Valko et al., 2014]; we sample 50 users and test the algorithms over T = 1000 rounds.
[Figure 5: performance on (a) MovieLens and (b) Netflix.]
40. Results - Graph Features
[Figure: effect of graph features on performance: (a) smoothness, as in Eq. 21; (b) RBF (sparsity).]
41. Conclusions
• Proposed GraphUCB to solve the stochastic linear bandit problem with multiple users and a known user graph
• Derived a single-user UCB
• GraphUCB leads to lower cumulative regret than algorithms that ignore the user graph
• Proposed GraphUCB-Local, which needs further investigation
42. Conclusions
• Proposed GraphUCB to solve the stochastic linear bandit problem with multiple users and a known user graph
• Derived a single-user UCB
• GraphUCB leads to lower cumulative regret than algorithms that ignore the user graph
• Proposed GraphUCB-Local, which needs further investigation
• Next?
  • better understanding of the effect of the graph
  • bandit optimality as a function of graph features
  • graph learning and other GSP properties applied to MABs?
43. GSP in DMSs - Perspective
Data Efficiency (preserving structural properties)
[Figure: spectral clustering illustration: coordinate candidates by EVD; partition matrix by SVD and binarization; k-means, kernel methods]
Meta-Learning (preserving structural properties)
44. Translation and Sparse Representation
There is no natural notion of translation for signals on graphs, so translation is defined via the graph spectral domain:

T_n f = √N (f ∗ δ_n) = √N Σ_{l=0}^{N−1} f̂(λ_l) χ_l*(n) χ_l

• Translation of the signal f to three different nodes on the graph: due to the irregular topology, the translated signals appear different in the vertex domain …
• … but they contain the same graph spectral information.

[Thesis excerpt, Figure 2.2: translation of the same signal y to three different nodes on the graph. The size and the color of each disk represent the signal value at each vertex. Due to the irregular topology, the translated signals appear different but contain the same graph spectral information.]
We notice that the classical shift in the classical definition of the translation does not apply on graphs. Filtering is another fundamental operation in graph signal processing. Similarly to classical signal processing, the outcome y_out of the filtering of a graph signal y with a graph filter h is defined in the spectral domain as the multiplication of the graph Fourier coefficient ŷ(λ_l) with the transfer function ĥ(λ_l), such that

ŷ_out(λ_l) = ŷ(λ_l) ĥ(λ_l), ∀λ_l ∈ σ(L).   (2.6)

The filtered signal y_out at node n is given by taking the inverse graph Fourier transform of Eq. (2.6):

y_out(n) = Σ_{l=0}^{N−1} ŷ(λ_l) ĥ(λ_l) χ_l(n).   (2.7)
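A minimal sketch of the generalized translation above, assuming the real Laplacian eigenvectors chi from the earlier GFT sketch (so the conjugate can be dropped); the heat-kernel choice in the usage note is illustrative.

```python
# Generalized translation: T_n f = sqrt(N) * sum_l f_hat(l) * chi_l(n) * chi_l.
import numpy as np

def translate(f_hat, chi, node):
    """Translate a kernel, given by its GFT coefficients f_hat, to `node`."""
    N = chi.shape[0]
    return np.sqrt(N) * chi @ (f_hat * chi[node, :])

# Usage: with lam, chi from a Laplacian eigendecomposition, a heat kernel
# f_hat = np.exp(-5 * lam) can be centered at different nodes; the resulting
# vertex-domain signals look different, but share the same spectral content.
```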
45. GSP in DMSs - Perspective
Data Efficiency (preserving structural properties)
[Figures: spectral clustering and graph translation (thesis Figure 2.2), as on the previous slides]
Meta-Learning (preserving structural properties): task 1, task 2
46. GSP in DMSs - Perspective
Data Efficiency (preserving structural properties)
Meta-Learning (preserving structural properties): task 1, task 2
Global Behavior (global optimality during exploration): sampling; exploration vs. exploitation; global optimality?
47. GSP in DMSs - Perspective
Data Efficiency (preserving structural properties)
Meta-Learning (preserving structural properties): task 1, task 2
Global Behavior (global optimality during exploration): sampling; exploration vs. exploitation; global optimality? → Global Online Learning for Complex Networks
Toni, L., and Frossard, P., "Spectral MAB for Unknown Graph Processes," 26th European Signal Processing Conference (EUSIPCO), IEEE, 2018.
48. A First Contribution
Influence maximization problem:
h_t : source signal on 𝒢
𝒟 : graph dictionary
r(h_t) : instantaneous reward of action h_t
action: h_t → observation: y_t = 𝒟h_t + ϵ → graph process estimation: 𝒟?
Our solution: graph-based multi-armed bandit problems aimed at optimizing actions on high-dimensional networks.
Toni, L., and Frossard, P., "Spectral MAB for Unknown Graph Processes," 26th European Signal Processing Conference (EUSIPCO), IEEE, 2018.
49. Multi-Arm Bandit Problems
GOAL: efficiently learn the mapping action → reward, such that the best action is selected.
Multi-Armed Bandit (MAB): in a casino, N slot machines can be played. Each time you play machine i, you get a reward X_i = 0 or 1 with initially unknown mean r_i. Objective: sequentially play machines so as to maximize your average reward (over t plays). A minimal sketch of this loop follows the slide.
[Figure 2: graphical visualization of (a) the classical MAB, with arm selection, a reward look-up table, and updates of the estimated mean rewards μ_1, …, μ_4, and (b) the graph-based MAB, where the reward is a signal on a graph and a kernel g updates the estimated mean reward.]
What we propose: we learn structured (low-dimensional) dictionaries that sparsely represent the (high-dimensional) signal on the graph, taking the graph structure into account.
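The sketch below runs the classical MAB loop just described, using the standard UCB1 index; the Bernoulli means are illustrative.

```python
# UCB1 on N Bernoulli arms: play each arm once, then pick the arm with the
# highest optimistic estimate (empirical mean + exploration bonus).
import numpy as np

rng = np.random.default_rng(3)
mu = np.array([0.2, 0.5, 0.6, 0.8])          # unknown mean rewards mu_1..mu_4
N, T = len(mu), 5000
counts = np.ones(N)                           # each arm played once to start
sums = rng.binomial(1, mu).astype(float)      # initial 0/1 rewards

for t in range(N, T):
    ucb = sums / counts + np.sqrt(2 * np.log(t) / counts)  # optimism bonus
    a = int(np.argmax(ucb))                   # arm selection
    sums[a] += rng.binomial(1, mu[a])         # observe 0/1 reward
    counts[a] += 1

print("pulls per arm:", counts.astype(int))   # concentrates on the best arm
```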
50. References
• Cesa-Bianchi, N., Cesari, T. R., and Monteleoni, C., "Cooperative Online Learning: Keeping your Neighbors Updated," arXiv:1901.08082, 2019.
• Yang, K., Dong, X., and Toni, L., "Laplacian-regularized graph bandits: Algorithms and theoretical analysis," arXiv:1907.05632, 2019.
• Yang, K., Dong, X., and Toni, L., "Error Analysis on Graph Laplacian Regularized Estimator," arXiv:1902.03720, 2019.
• Yang, K., and Toni, L., "Graph-based recommendation system," IEEE GlobalSIP, 2018.
• Carpentier, A., and Valko, M., "Revealing graph bandits for maximizing local influence," AISTATS, 2016.
• Gad, E. E., et al., "Active learning on weighted graphs using adaptive and non-adaptive approaches," ICASSP, 2016.
• Li, S., Karatzoglou, A., and Gentile, C., "Collaborative filtering bandits," ACM SIGIR, 2016.
• Korda, N., Szorenyi, B., and Shuai, L., "Distributed clustering of linear bandits in peer to peer networks," JMLR, 2016.
• Gentile, C., Li, S., and Zappella, G., "Online clustering of bandits," ICML, 2014.
• Valko, M., et al., "Spectral Bandits for Smooth Graph Functions," JMLR, 2014.
• Gu, Q., and Han, J., "Online spectral learning on a graph with bandit feedback," in Proc. IEEE Int. Conf. on Data Mining, 2014.
• Thanou, D., Shuman, D. I., and Frossard, P., "Learning parametric dictionaries for signals on graphs," IEEE Transactions on Signal Processing, 2014.
• Cesa-Bianchi, N., Gentile, C., and Zappella, G., "A gang of bandits," NeurIPS, 2013.
• Chu, W., Li, L., Reyzin, L., and Schapire, R. E., "Contextual bandits with linear payoff functions," AISTATS, 2011.
• Vaswani, S., Schmidt, M., and Lakshmanan, L. V., "Horde of bandits using Gaussian Markov random fields," AISTATS, 2017.