Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.

Métodos Adaptativos de Minería de Datos y Aprendizaje
para Flujos de Datos.

Albert Bifet
LARCA: Laboratori d’Algorismica Relacional, Complexitat i Aprenentatge
Departament de Llenguatges i Sistemes Informàtics
Universitat Politècnica de Catalunya

Junio 2009, Santander

Minería de Datos y Aprendizaje para Flujos
de Datos con Cambio de Concepto

Extraer información de
secuencia potencialmente
inﬁnita de data
datos que varian con el
tiempo
usando pocos recursos

usando ADWIN
La Desintegración de la ADaptive Sliding WINdow:
Persistencia de la Memoria Ventana deslizante
adaptativa
1952-54
sin parámetros
Salvador Dalí
2 / 29

Minería de Datos Masivos

Explosión de Datos en los últimos años

: 100 millones búsquedas por día

: 20 millones transacciones por día
1,000 millones de transacciones de tarjetas de credito por mes
3,000 millones de llamadas telefónicas diarias en EUA
30,000 millones de e-mails diarios, 1,000 millones de SMS
Tráﬁco de redes IP: 1,000 millones de paquetes por hora por
router

3 / 29


Datos Masivos
2007
Universo Digital: 281 exabytes (mil millones de gigabytes)
La cantidad de información creada excedió el almacenaje
disponible por primera vez

Green Computing
Estudio y práctica de como usar recursos informáticos
eﬁcientemente.

Algorithmic Efﬁciency
Una de las principales maneras de hacer Green Computing

4 / 29


Koichi Kawana
Simplicidad significa conseguir el máximo efecto con los
mínimos medios.

Donald Knuth
“... we should make use of the idea of
limited resources in our own education.
We can all benefit by doing occasional
"toy" programs, when artificial
restrictions are set up, so that we are
forced to push our abilities to the limit. “

5 / 29

Introducción: Data Streams
Data Streams
Secuencia potencialmente inﬁnita
Gran cantidad de datos: espacio sublineal
Gran velocidad de llegada: tiempo sublineal por ejemplo
Cada vez que un elemento de un data stream se ha procesado,
se descarta o se archiva

Puzzle: Encontrar números que faltan
Sea π una permutación of {1, . . . , n}.
Sea π−1 la permutación π con un
elemento que falta.
π−1 [i] llega en orden creciente

Tarea: Determinar el número que falta

6 / 29

Data Streams

Usar un vector
n-bit para
Sea π−1 la permutación π con un memorizar todos
elemento que falta.
los numeros
π−1 [i] llega en orden creciente (espacio O(n) )

6 / 29

Data Streams

Data Streams:
Sea π−1 la permutación π con un espacio
elemento que falta.
O(log(n)).


6 / 29

Data Streams

Sea π una permutación of {1, . . . , n}. Almacenar
Sea π−1 la permutación π con un n(n + 1)
− ∑ π−1 [j].
elemento que falta. 2 j≤i


6 / 29

Problema
12, 35, 21, 42, 5, 43, 57, 2, 45, 67
Dados n números no ordenados, encontrar un número que
esté en la mitad superior de la lista ordenada.

2, 5, 12, 21, 35 42, 43, 45, 57, 67

Algoritmo
Elegir k números aleatoriamente. Devolver el número mayor.

Análisis
La probabilidad de que la solución sea incorrecta es la
probabilidad de que todos los k números estén en la mitad
inferior : (1/2)k
Para tener probabilidad δ usaremos k = log 1/δ muestras

7 / 29

Outline

1 Introduction

2 ADWIN : Concept Drift Mining

3 Hoeffding Adaptive Tree

4 Conclusions

8 / 29

Data Streams

Data Streams
At any time t in the data stream, we would like the per-item
processing time and storage to be simultaneously
O(log k (N, t)).

Approximation algorithms
Small error rate with high probability
˜
An algorithm (ε, δ )−approximates F if it outputs F for which
˜ − F | > εF ] < δ .
Pr[|F

9 / 29

Data Streams Approximation Algorithms
Frequency moments
Frequency moments of a stream A = {a1 , . . . , aN }:
v
Fk = ∑ fik
i=1

where fi is the frequency of i in the sequence, and k ≥ 0
F0 : number of distinct elements on the sequence
F1 : length of the sequence
F2 : self-join size, the repeat rate, or as Gini’s index of
homogeneity
Sketches can approximate F0 , F1 , F2 in O(log v + log N) space.

Noga Alon, Yossi Matias, and Mario Szegedy.
The space complexity of approximation
the frequency moments. 1996
10 / 29


1011000111 1010101

Sliding Window
We can maintain simple statistics over sliding windows, using
O( 1 log2 N) space, where
ε
N is the length of the sliding window
ε is the accuracy parameter

M. Datar, A. Gionis, P. Indyk, and R. Motwani.
Maintaining stream statistics over sliding windows. 2002

11 / 29


10110001111 0101011

Sliding Window
ε


11 / 29


101100011110 1010111

Sliding Window
ε


11 / 29


1011000111101 0101110

Sliding Window
ε


11 / 29


10110001111010 1011101

Sliding Window
ε


11 / 29


101100011110101 0111010

Sliding Window
ε


11 / 29

Outline

1 Introduction



4 Conclusions

12 / 29

Data Mining Algorithms with Concept Drift

No Concept Drift Concept Drift

DM Algorithm
DM Algorithm
input output input output
-
Counter5 - - -

Counter4 Static Model

Counter3 6

Counter2
Counter1 -
Change Detect.

13 / 29

Data Mining Algorithms with Concept Drift

No Concept Drift Concept Drift

DM Algorithm DM Algorithm
input output input output
-
Counter5 - - Estimator5 -

Counter4 Estimator4
Counter3 Estimator3
Counter2 Estimator2
Counter1 Estimator1

13 / 29

Time Change Detectors and Predictors: A
General Framework

Estimation
-
xt
-
Estimator

14 / 29

General Framework

Estimation
-
xt
-
Estimator Alarm
- -
Change Detect.

14 / 29

General Framework

Estimation
-
xt
-
Estimator Alarm
- -
Change Detect.
6
6
?
-
Memory

14 / 29

Window Management Models

W = 101010110111111

Equal ﬁxed size Total window against
subwindows subwindow
1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent
subwindows ADWIN: All Adjacent subwindows

1010101 1011 1111 1 01010110111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 10 1010110111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 101 010110111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 1010 10110111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 10101 0110111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 101010 110111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 1010101 10111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 10101011 0111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 101010110 111111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 1010101101 11111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 10101011011 1111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 101010110111 111

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 1010101101111 11

[Dasu+ 06]

15 / 29


W = 101010110111111

1010 1011011 1111
10101011011 1111
[Kifer+ 04]
[Gama+ 04]
Equal size adjacent

1010101 1011 1111 10101011011111 1

[Dasu+ 06] 11

15 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 1

ADWIN: A DAPTIVE W INDOWING A LGORITHM
1 Initialize Window W
2 for each t 0
3 do W ← W ∪ {xt } (i.e., add xt to the head of W )
4 repeat Drop elements from the tail of W
5 until |µW0 − µW1 | ≥ εc holds
ˆ ˆ
6 for every split of W into W = W0 · W1
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 1 W1 = 01010110111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 10 W1 = 1010110111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 101 W1 = 010110111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 1010 W1 = 10110111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 10101 W1 = 0110111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 101010 W1 = 110111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 1010101 W1 = 10111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111
W0 = 10101011 W1 = 0111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111 |µW0 − µW1 | ≥ εc : CHANGE DET.!
ˆ ˆ
W0 = 101010110 W1 = 111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 101010110111111 Drop elements from the tail of W
W0 = 101010110 W1 = 111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN
Example
W = 01010110111111 Drop elements from the tail of W
W0 = 101010110 W1 = 111111

2 for each t 0
ˆ ˆ
7 ˆ
Output µW

16 / 29

Algorithm ADWIN [BG07]

ADWIN has rigorous guarantees (theorems)
On ratio of false positives
On ratio of false negatives
On the relation of the size of the current window and change
rates

Other methods in the literature: [Gama+ 04], [Widmer+ 96],
[Last 02] don’t provide rigorous guarantees.

17 / 29

Algorithm ADWIN [BG07]

Theorem
At every time step we have:
1 (Few false positives guarantee) If µt remains constant within W ,
the probability that ADWIN shrinks the window at this step is at
most δ .
2 (Few false negatives guarantee) If for any partition W in two
parts W0 W1 (where W1 contains the most recent items) we have
|µW0 − µW1 | ε, and if

3 max{µW0 , µW1 } 4n
ε ≥ 4· ln
min{n0 , n1 } δ

then with probability 1 − δ ADWIN shrinks W to W1 , or shorter.

18 / 29

Outline

1 Introduction



4 Conclusions

19 / 29

Classiﬁcation
Example
Contains Domain Has Time
Data set that “Money” type attach. received spam
describes e-mail yes com yes night yes
features for yes edu no night yes
deciding if it is no com yes night yes
spam. no edu no day no
no com no day no
yes cat no day yes

Assume we have to classify the following new instance:
“Money” type attach. received spam
yes edu yes day ?

20 / 29

Classiﬁcation
Assume we have to classify the following new instance:
“Money” type attach. received spam
yes edu yes day ?

20 / 29

Decision Trees

Basic induction strategy:
A ← the “best” decision attribute for next node
Assign A as decision attribute for node
For each value of A, create new descendant of node
Sort training examples to leaf nodes
If training examples perfectly classiﬁed, Then STOP, Else iterate
over new leaf nodes

21 / 29

Hoeffding Tree / CVFDT

Hoeffding Tree : VFDT

Pedro Domingos and Geoff Hulten.
Mining high-speed data streams. 2000

With high probability, constructs an identical model that a
traditional (greedy) method would learn
With theoretical guarantees on the error rate

22 / 29

VFDT / CVFDT

Concept-adapting Very Fast Decision Trees: CVFDT

G. Hulten, L. Spencer, and P. Domingos.
Mining time-changing data streams. 2001

It keeps its model consistent with a sliding window of examples
Construct “alternative branches” as preparation for changes
If the alternative branch becomes more accurate, switch of tree
branches occurs

23 / 29

Decision Trees: CVFDT

No theoretical guarantees on the error rate of CVFDT
CVFDT parameters :
1 W : is the example window size.
2 T0 : number of examples used to check at each node if the
splitting attribute is still the best.
3 T1 : number of examples used to build the alternate tree.
4 T2 : number of examples used to test the accuracy of the
alternate tree.

24 / 29

Decision Trees: Hoeffding Adaptive Tree

Hoeffding Adaptive Tree:
replace frequency statistics counters by estimators
don’t need a window to store examples, due to the fact that we
maintain the statistics data needed with estimators

change the way of checking the substitution of alternate
subtrees, using a change detector with theoretical guarantees

Summary:
1 Theoretical guarantees
2 No Parameters

25 / 29

What is MOA?

{M}assive {O}nline {A}nalysis is a framework for online learning
from data streams.

It is closely related to WEKA
It includes a collection of ofﬂine and online as well as tools for
evaluation:
boosting and bagging
Hoeffding Trees
with and without Naïve Bayes classiﬁers at the leaves.

26 / 29

Ensemble Methods
http://www.cs.waikato.ac.nz/∼abifet/MOA/

New ensemble methods:
ADWIN bagging: When a change is detected, the worst classiﬁer
is removed and a new classiﬁer is added.
Adaptive-Size Hoeffding Tree bagging

27 / 29

Outline

1 Introduction



4 Conclusions

28 / 29

Conclusions

Adaptive and parameter-free methods based in
replace frequency statistics counters by ADWIN
don’t need a window to store examples, due to the fact that we
maintain the statistics data needed with ADWINs

using ADWIN as change detector with theoretical guarantees,

Summary:
1 Theoretical guarantees
2 No parameters needed
3 Higher accuracy
4 Less space needed

29 / 29

Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.

Recomendados

Recomendados

Más contenido relacionado

Destacado

Destacado (20)

Similar a Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.

Similar a Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos. (20)

Más de Albert Bifet

Más de Albert Bifet (20)

Último

Último (20)

Métodos Adaptativos de Minería de Datos y Aprendizaje para Flujos de Datos.