Kdd2015reading-tabei

Monitoring Least Squares Models of
Distributed Streams
M. Gabel*, D. Keren+, A. Schuster*
*Israel Institute of Technology,
+Haifa University
Presenter: Yasuo Tabei (JST / Tokyo Institute of Technology)
KDD2015 reading group @ Kyoto University, Saturday, August 29, 2015

Why I chose this paper
•  The problem setting is new (?)
•  The method is simple (the derivation is complicated)

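Overview of the problem setting
•  Monitor changes of a model over data that arrives as a time series
–  There are k nodes
–  Data arrives at each node as a time series
–  There is a single model to monitor
•  Updating the model is costly.
Cost = collecting the data & updating β
•  We want to update the model only when it has changed substantially.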

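Overview of the problem
[Figure: k nodes V1, V2, V3, ..., Vk. Each node Vi receives a stream of data (X, y), e.g. (X^1_1, y^1_1), (X^1_2, y^1_2) at V1; (X^2_1, y^2_1) at V2; (X^3_1, y^3_1), (X^3_2, y^3_2) at V3; (X^k_1, y^k_1) at Vk. All streams feed a single model.]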

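Overview of the method
•  Naïve
–  Update β every round (incurs communication cost)
–  Update β every T time steps
(it is hard to balance the time interval against the error in β)
•  Update β only when the data at some node indicates that β may have changed substantially.
•  Issue
How can the global change in β be predicted from the local changes
at each node?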

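Method
•  Basic idea
–  Define a region C of small data changes (convex safe zone)
–  If the data change at every node stays inside the convex safe zone,
the model error is guaranteed to stay within the threshold ε
–  When the data change at some node leaves the convex safe zone,
update β
(Geometric monitoring)
•  When β is updated, the data held at each node is collected in
one place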

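Method (details)
•  In the OLS solution $\beta = \left(\sum_i x_i x_i^T\right)^{-1} \sum_i x_i y_i$, let $A = \sum_i x_i x_i^T$ and $c = \sum_i x_i y_i$
•  Then A and c can be written as linear sums of the per-node A^j and c^j,
i.e., $A = \sum_{j=1}^{k} A^j$, $c = \sum_{j=1}^{k} c^j$
•  Hence we can write $\beta = \left(\sum_j A^j\right)^{-1} \sum_j c^j$
•  That is, the change in β is determined by the pairs (Δ^j, δ^j) of changes in
A^j and c^j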
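The decomposition above can be checked numerically. The following is a minimal sketch, assuming NumPy and made-up sizes (k, n_per_node, m and the noise level are illustrative choices, not values from the slides): each node keeps only its local A^j and c^j, and the model solved from their sums matches OLS on the pooled data.

```python
import numpy as np

rng = np.random.default_rng(0)
k, n_per_node, m = 4, 50, 3          # hypothetical sizes, not from the slides
beta_true = rng.normal(size=m)

# Each node j holds its own samples (X_j, y_j).
X_nodes = [rng.normal(size=(n_per_node, m)) for _ in range(k)]
y_nodes = [X @ beta_true + 0.1 * rng.normal(size=n_per_node) for X in X_nodes]

# Local statistics: A^j = sum_i x_i x_i^T and c^j = sum_i x_i y_i.
A_local = [X.T @ X for X in X_nodes]
c_local = [X.T @ y for X, y in zip(X_nodes, y_nodes)]

# Global statistics are plain sums of the local ones.
A = sum(A_local)
c = sum(c_local)
beta_from_sums = np.linalg.solve(A, c)

# Reference: ordinary least squares on the pooled data.
X_all = np.vstack(X_nodes)
y_all = np.concatenate(y_nodes)
beta_pooled, *_ = np.linalg.lstsq(X_all, y_all, rcond=None)

print(np.allclose(beta_from_sums, beta_pooled))
```

A node therefore never needs to ship raw samples to reproduce the global model; an m-by-m matrix and an m-vector per node are enough.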

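Method (details)
•  Since $\beta = \left(\frac{1}{k}\sum_j A^j\right)^{-1} \left(\frac{1}{k}\sum_j c^j\right)$, β can be written in terms of
the averages of A^j and c^j
•  Define a convex subset C of the (Δ, δ) space so that the following
holds
•  [Lemma 1] For C, the following holds:
If $(\Delta^j, \delta^j) \in C$ for all j, then $\|\beta - \beta_0\| \le \epsilon$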
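Why convexity matters can be illustrated with a small sketch. Because the global statistics are averages of the local ones, the global drift is the average of the local drifts (Δ^j, δ^j), and a convex set contains every average of its members. The zone used below (two norm balls with radii r_A and r_c) is a stand-in chosen only for illustration; it is not the safe zone C derived in the paper, and all sizes are assumptions.

```python
import numpy as np

rng = np.random.default_rng(1)
k, m = 5, 3                          # hypothetical sizes

def in_zone(Delta, delta, r_A=1.0, r_c=1.0):
    """Stand-in convex zone (NOT the paper's C): a product of two norm balls."""
    return bool(np.linalg.norm(Delta, "fro") <= r_A and np.linalg.norm(delta) <= r_c)

# Build local drifts (Delta^j, delta^j) that each lie inside the zone.
drifts = []
for _ in range(k):
    M = rng.normal(size=(m, m))
    Delta = (M + M.T) / 2                        # symmetric, like a change in sum x x^T
    Delta *= 0.9 / np.linalg.norm(Delta, "fro")  # rescale to Frobenius norm 0.9
    delta = rng.normal(size=m)
    delta *= 0.9 / np.linalg.norm(delta)         # rescale to Euclidean norm 0.9
    drifts.append((Delta, delta))

all_local_ok = all(in_zone(D, d) for D, d in drifts)

# The global drift is the average of the local drifts.
Delta_hat = sum(D for D, _ in drifts) / k
delta_hat = sum(d for _, d in drifts) / k
global_ok = in_zone(Delta_hat, delta_hat)

print(all_local_ok, global_ok)
```

Each node can thus test its own pair locally, and no communication is needed as long as every local test passes.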

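Sliding window and infinite window
•  Sliding window: β is computed from the data in a window of length W; β0 is
computed from the data in the window of length W just before the last sync
Ø  Condition for [formula omitted] to hold: [formula omitted]
•  Infinite window: β is computed from all data seen so far; β0 is
computed from all data up to the last sync
Ø  Condition for [formula omitted] to hold: [formula omitted]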
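[Excerpt from the paper, cropped on the slide]
… shall be denoted with $\hat{\cdot}$. Hence initial values
$\hat{A}_0 = \frac{1}{k}\sum_{j=1}^{k} A^j_0$, $\hat{c}_0 = \frac{1}{k}\sum_{j=1}^{k} c^j_0$, $\hat{\beta}_0 = \hat{A}_0^{-1}\hat{c}_0$,
… values
$\hat{A} = \frac{1}{k}\sum_{j=1}^{k} A^j$, $\hat{c} = \frac{1}{k}\sum_{j=1}^{k} c^j$, $\hat{\beta} = \hat{A}^{-1}\hat{c}$.
… $= kA^{-1}$, thus $\hat{\beta} = \hat{A}^{-1}\hat{c} = A^{-1}c = \beta$ and $\hat{\beta}_0 = \beta_0$. In other words, we can compute the OLS … averages of local $A^j$, $c^j$ rather than the sums:
$\left(\frac{1}{k}\sum_j A^j\right)^{-1} \left(\frac{1}{k}\sum_j c^j\right) = \hat{A}^{-1}\hat{c}$ (3)
[Figure: time axis with "sync" and "now" marked. Sliding window: windows of length W for A^j_0 and A^j, split into old, common, and new parts. Infinite window: A^j_0 and A^j.]
Figure 3: Sliding and infinite window models. When $A^j$ overlaps $A^j_0$, $\Delta^j = A^j - A^j_0 = \sum_{\text{new}} x_i x_i^T - \sum_{\text{old}} x_i x_i^T$.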
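The sliding-window identity in the Figure 3 caption can be verified with a short sketch. It assumes a single node, a made-up stream, and hypothetical values for the window length W and the sync round. The local matrix A^j is maintained incrementally, and its drift since the last sync equals the contribution of the new samples minus that of the old samples that have left the window.

```python
from collections import deque

import numpy as np

rng = np.random.default_rng(2)
m, W = 3, 20                         # hypothetical dimension and window length
sync_round = 39                      # hypothetical round of the last sync

stream = [rng.normal(size=m) for _ in range(50)]

window = deque()                     # (round, x) pairs currently in the window
A = np.zeros((m, m))                 # running sum of x x^T over the window
A0 = None                            # snapshot of A at the last sync
first_round_at_sync = None           # oldest round in the window at sync time

for t, x in enumerate(stream):
    window.append((t, x))
    A += np.outer(x, x)
    if len(window) > W:              # evict the sample that left the window
        _, x_out = window.popleft()
        A -= np.outer(x_out, x_out)
    if t == sync_round:
        A0 = A.copy()
        first_round_at_sync = window[0][0]

# "new": in the current window but after the sync;
# "old": in the window at sync time but no longer in the current window.
new = [x for t, x in window if t > sync_round]
old = [x for t, x in enumerate(stream) if first_round_at_sync <= t < window[0][0]]

Delta = A - A0
Delta_from_parts = sum(np.outer(x, x) for x in new) - sum(np.outer(x, x) for x in old)

print(np.allclose(Delta, Delta_from_parts))
```

With these settings the window at sync covers rounds 20 to 39 and the current window covers rounds 30 to 49, so rounds 30 to 39 are the common part that cancels out of the difference.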

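Ridge regression and GLS
•  Ridge regression: [formula omitted]
β can be written in closed form
•  Generalized Least Squares (GLS): [formula omitted]
β can be written in closed form [formula omitted]
Ø  β can be monitored with the same technique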

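Experiments
•  Compare the Distributed Least Squares monitor (DILSQ) with PER(T), which
updates the model every T rounds
•  DILSQ uses the sliding window model
•  Evaluation measures: model error and normalized messages
–  The average number of messages sent by each node
•  Datasets: synthetic data, Traffic Monitoring, and Gas
Sensor Time Series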

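Experiments on synthetic data
•  In each round, y = x^T β_true + n, where the elements of x are
i.i.d. N(0,1) and n ~ N(0,σ^2)
•  The error of DILSQ never exceeds the threshold ε = 1.35
•  DILSQ updates the model in response to changes in β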
Figure 4: DILSQ model error (black) and syncs (bottom vertical lines) per round, compared to PER(100) error (green), for k = 10 simulated nodes with m = 10 dimensions, and threshold ε = 1.35. Both algorithms reduce communication to 1%, but DILSQ only syncs when β changes (bottom purple line shows ‖β‖). PER(100) syncs every 100 rounds, but is unable to maintain error below the threshold (dashed horizontal line).
[Excerpt from the paper, cropped on the slide]
[DILSQ] guarantees maximum model error below the user-selected threshold ε, but PER does not. Hence, when comparing the two, we find a posteriori the maximum period T (hence minimum communication) for which the maximum error of PER(T) is equal or below that of DILSQ. Note this gives PER an unrealistic advantage. First, in a realistic setting we cannot know a priori the optimal period T. Second, model changes in realistic settings are not necessarily stationary: the rate of model change may evolve, which DILSQ will handle gracefully while PER cannot.
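The synthetic setup and the periodic baseline can be sketched as follows. This is a simplified simulation under stated assumptions, not the authors' experiment code: k and m follow the Figure 4 caption, while the window length W, the noise strength sigma, the number of rounds, and the single abrupt model change at `change_round` are made-up values. The error is measured as the distance between the exact windowed model β and the last synced model β0.

```python
import numpy as np

rng = np.random.default_rng(4)
k, m = 10, 10                        # nodes and dimensions, as in the Figure 4 caption
W, T, rounds = 50, 100, 400          # hypothetical window length, period, horizon
sigma = 1.0                          # hypothetical noise strength
change_round = 250                   # hypothetical round at which beta_true changes

def sample_round(beta_true):
    """One sample per node: x ~ N(0, I), y = x^T beta_true + n, n ~ N(0, sigma^2)."""
    X = rng.normal(size=(k, m))
    y = X @ beta_true + sigma * rng.normal(size=k)
    return X, y

beta_true = rng.normal(size=m)
history = []                         # (X, y) per round, newest last
beta0 = None                         # model at the last sync
errors, syncs = [], 0

for t in range(rounds):
    if t == change_round:
        beta_true = rng.normal(size=m)       # abrupt model change (illustrative)
    history.append(sample_round(beta_true))
    history = history[-W:]                   # sliding window over the last W rounds
    if len(history) < W:
        continue                             # wait until the first window is full
    X_win = np.vstack([X for X, _ in history])
    y_win = np.concatenate([y for _, y in history])
    beta, *_ = np.linalg.lstsq(X_win, y_win, rcond=None)   # exact current model
    if beta0 is None or t % T == 0:          # PER(T): sync on a fixed schedule
        beta0 = beta.copy()
        syncs += 1
    errors.append(float(np.linalg.norm(beta - beta0)))

print(syncs, max(errors))
```

Because PER(T) syncs on a fixed schedule, the error after the model change keeps growing until the next scheduled sync, which is the behaviour the caption contrasts with DILSQ.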

The threshold ε affects the model update cost
•  (a) the true model is fixed, (b) the true model changes
•  The parameter of PER(T) is set so that its maximum error equals
that of DILSQ
[Excerpt from the paper, cropped on the slide]
In the fixed dataset, … elements drawn i.i.d. … with k nodes, each … vector x of size m and … and y = x^T β_true + n … noise of strength σ.
(a) Fixed dataset (b) Drift dataset
Figure 5: Communication for DILSQ (black) and periodic algorithm tuned to achieve same max error (green) at different threshold values. DILSQ communication on fixed model drops to zero for more permissive ε (not shown on logarithmic scale).

Results when varying the parameters

Traffic monitoring
•  Problem: impute a speed from the per-minute average vehicle speeds
reported by multiple sensors
•  DILSQ (black) imputes nearly as well as the exact LSM (purple)

The threshold ε affects the model update cost
(a) Window size W = 60 (b) Window size W = 30
Figure 8: Communication for DILSQ (black) and periodic algorithm (green) on the traffic dataset at different ε values.

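Summary
•  A method for monitoring model changes over distributed
stream data
•  The model is updated only when data that affects the model
arrives at some node
–  An efficient condition for updating the model is derived
Ø  Model changes can be tracked with little communication
overhead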
