Thesis Presentation

Clustering Internet users based on their
behavior towards banner ads

Despina Stamkou
stamkou@kth.se

14 Feb 2011

Agenda

 Introduction

 Theoretical Background

 Method

 Results

 Analysis

 Conclusions

 Future Work

Introduction
:: Background

Marketing is an exchange process of values between
companies and customers
(Philip, Armstrong, Wong and Saunders, 2010)

Online Marketing

[2nd position in advertising investment]
(Orbit Scripts, 2011)

Introduction
:: Background
 Online Advertisements are promoted through Web Sites (Publishers)

 The goal is to motivate internet users to click on the online advertisements

 Users with similar profiles click on similar online advertisements (Giuffrida et al. 2001)

 Users are more likely to click on personalised advertisements compared to non-personalised ads (automatic optimisation)

Introduction
:: Background

 Automatic Optimisation Mechanism for personalised online advertisements

AdNetwork: company between Web Site publishers and clients

[Diagram: the client's advertisements (Advertisement 1, Advertisement 2, Advertisement 3, …, Advertisement N) feed the automatic optimisation mechanism, which fills the publisher's advertisement placement]

Introduction
:: Problem Statement

 Problem
AdNetworks need to develop an intelligent automatic optimisation logic
to keep a competitive position in the online marketing business

 Goal
Evaluate well-known grouping algorithms
and use the best-performing one for the automatic optimisation logic

 Purpose
To show that the superior performance of the dominant algorithm is data-independent

Introduction
:: Method & Material

 Literature Study
 Background Knowledge on clustering
 Identify algorithms with significant clustering performance

 Empirical Part
 Compare the identified algorithms

Introduction
:: Significance

 Automatic optimisation can increase the revenues of an
AdNetwork

 The thesis topic is part of the automatic optimisation project at
Tradedoubler and uses data from that AdNetwork

 Each AdNetwork has different data but can benefit from the
conclusions

 The conclusions will reinforce the data-independence of the
dominant clustering algorithm

Introduction
:: Limitations

 Only two clustering algorithms are examined

 The number of clusters is predefined

 Data set has a specific dimensionality and is not publicly available

 The data set represents a snapshot of the users' behaviour for a
specific period

Theoretical Background
:: Classification vs Clustering

Data mining is the process of discovering knowledge from data sources (Bing Liu, 2006)

Supervised Classification (Classification)
 We know the class labels and the number of classes
 Example: 1. dark blue, 2. light green, 3. dark orange, …, n. pink
 Groups users with the exact same characteristics
 Impossible to predict future actions

Unsupervised Classification (Clustering)
 We do not know the class labels and may not know the number of classes
 Example: 1. ???, 2. ???, 3. ???, …, ?. ???
 Groups users with similar characteristics
 Opportunity to predict future actions

:: Selecting the clustering method

Clustering
  Non-Exclusive: a data object belongs to one or more clusters
  Exclusive: a data object belongs to only one cluster
    Partitional
    Hierarchical
      Agglomerative
      Divisive

:: Related Research

 The most recent related studies (2011) were selected for examination

 These studies compared the clustering performance of the best-performing
algorithms from earlier related studies

 The K-means algorithm was used as a baseline

 The algorithms were examined with a predefined number of clusters

 Performance was measured with a fitness function

:: Selecting the algorithms

 Particle Swarm Optimisation (PSO) & K-means

 K-means as a baseline

 PSO because it outperformed the other clustering algorithms

 Limited studies around PSO

 Interesting to evaluate PSO performance with the available data set from
Tradedoubler and so reinforce its data-independence

Method
:: Data Selection

 Data set consists of real transactions within Tradedoubler’s AdNetwork

 254.046 rows

 Sampling by time period – 1 month

 Information columns:

Advertisement campaign info
  PROGRAM_ID   ID of the campaign to which the banner belongs
  WEBSITE_ID   ID of the website from which the action was generated
  BANNER_ID    ID of the banner with which the user interacted
  EVENT_ID     ID of the event: Click or Sale
Internet user info
  USER_AGENT   Visitor's web browser agent and operating system
  TIMESTAMP    Time the transaction was made
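
As an illustration of the columns above, one row could be held in a small record type. This is only a sketch in Python (the thesis programs were written in Perl): the `Transaction` class, the `parse_row` helper, the tab separator, the column order and the string types are all assumptions, since the storage format of the data set is not described.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Transaction:
    """One row of the data set; every field is kept as text (assumed types)."""
    program_id: str   # campaign to which the banner belongs
    website_id: str   # website from which the action was generated
    banner_id: str    # banner with which the user interacted
    event_id: str     # "Click" or "Sale"
    user_agent: str   # browser agent and operating system
    timestamp: str    # time of the transaction, format left unparsed


def parse_row(line: str, sep: str = "\t") -> Transaction:
    """Split one delimited line into a Transaction (hypothetical layout)."""
    fields = line.rstrip("\n").split(sep)
    if len(fields) != 6:
        raise ValueError(f"expected 6 columns, got {len(fields)}")
    return Transaction(*fields)
```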

Method
:: Evaluation Criteria
Clustering evaluation is a complex and difficult problem (Liu, 2006)

Types of evaluation
 External
 With readable and meaningful data, without numbers

 Indirect
 With an external application which will test the results

 Internal
 With any distance comparison function

Method
:: Fitness Function
The fitness function used is the sum, over all clusters, of the maximum
distance between a cluster and a data object that belongs to it:
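
Written out under assumed notation ($K$ clusters $C_k$, each represented by a centroid $c_k$, and a distance $d$; none of these symbols appear in the text, and taking the centroid as the cluster's reference point is an interpretation), the description corresponds to:

```latex
F = \sum_{k=1}^{K} \; \max_{x \in C_k} d(x, c_k)
```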

The smaller the sum, the better the clustering algorithm performs
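
A minimal Python sketch of such a fitness function follows. It assumes Euclidean distance, that each cluster is represented by a centroid, and that every data object belongs to its nearest centroid; the names `euclidean` and `max_distance_fitness` are illustrative and not taken from the thesis code.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def max_distance_fitness(data, centroids):
    """Sum over clusters of the largest centroid-to-member distance.

    Each point is assigned to its nearest centroid (assumption).
    An empty cluster contributes 0 to the sum.
    """
    worst = [0.0] * len(centroids)
    for point in data:
        dists = [euclidean(point, c) for c in centroids]
        k = dists.index(min(dists))          # index of the nearest centroid
        worst[k] = max(worst[k], dists[k])   # remember the farthest member
    return sum(worst)
```

A lower return value means that even the farthest member of each cluster lies close to its centroid.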

Method
:: Alternative Fitness Function

 Sum of the average distance between the centroid and the data
vectors

 Sum of the minimum distance between data objects that belong to
different clusters

The fitness function selected for this study has been used in related research
for the same purpose and with the same algorithms, and was therefore
preferred over the alternatives
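
The two alternatives can be sketched in Python as follows. How each sum is taken is not specified above, so this is an interpretation: the first function sums the per-cluster average distance, the second sums the minimum distance over every pair of clusters. Euclidean distance and both function names are assumptions.

```python
import math
from itertools import combinations


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def mean_distance_fitness(clusters, centroids):
    """Sum over clusters of the average centroid-to-member distance.

    `clusters[k]` is the list of points assigned to `centroids[k]`.
    """
    total = 0.0
    for points, centroid in zip(clusters, centroids):
        if points:  # skip empty clusters to avoid dividing by zero
            total += sum(euclidean(p, centroid) for p in points) / len(points)
    return total


def min_separation_fitness(clusters):
    """Sum over cluster pairs of the smallest distance between their members."""
    total = 0.0
    for a, b in combinations(clusters, 2):
        if a and b:
            total += min(euclidean(p, q) for p in a for q in b)
    return total
```

The first measure rewards compact clusters (smaller is better), while the second rewards well-separated clusters (larger is better).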

Results
:: Methods, Tools and Time

 Programs developed in Perl and parameterised
for the multidimensional data set

 Both algorithms ran for 10 different values of K:
5, 10, 15, 20, 25, 30, 35, 40, 45 and 50

 Operating system: Ubuntu Linux
Hardware characteristics:
3 GB RAM, Intel Core Duo processor at 2.26 GHz

 The execution time ratio between the algorithms was approximately 1:4;
K-means ran for 1.5 hours in total and PSO for 7 hours
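
For reference, the K-means baseline can be sketched in a few lines of Python. This is a textbook version and not the Perl program used in the experiments; the random initialisation from data points, the `max_iter` cap and the `seed` parameter are assumptions.

```python
import math
import random

K_VALUES = list(range(5, 55, 5))  # 5, 10, ..., 50, the values of K listed above


def euclidean(a, b):
    """Euclidean distance between two equal-length vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def kmeans(data, k, max_iter=100, seed=0):
    """Textbook K-means; returns (centroids, assignment)."""
    rng = random.Random(seed)
    # Start from k distinct data points chosen at random (assumption).
    centroids = [list(p) for p in rng.sample(data, k)]
    assignment = None
    for _ in range(max_iter):
        # Assignment step: each point goes to its nearest centroid.
        new_assignment = [
            min(range(k), key=lambda j: euclidean(p, centroids[j]))
            for p in data
        ]
        if new_assignment == assignment:
            break  # no point changed cluster, so the centroids are stable
        assignment = new_assignment
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, a in zip(data, assignment) if a == j]
            if members:  # an empty cluster keeps its previous centroid
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    return centroids, assignment
```

Running it once per entry of `K_VALUES`, for example `for k in K_VALUES: centroids, assignment = kmeans(data, k)`, mirrors the ten runs per algorithm described above.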

Results
:: Performance Chart

Analysis
:: Performance Comparison

PSO >> K-means Why?

 Both algorithms calculate the next position of the clusters
and keep moving them within the search space until
their position no longer changes, but…

…PSO evaluates each next position in the space by using
an internal fitness method

…This method keeps a memory of the previous fitness value
of each cluster and compares it with the fitness of the new
position

…Then a decision is made whether to keep the new position
or to return the cluster to the previous one
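
The memory mechanism described above can be illustrated with a standard global-best PSO for clustering, sketched here in Python. This is not the thesis implementation: the variant, the swarm size, the iteration count and the constants `w`, `c1` and `c2` are assumed values, and in this standard form the memory is each particle's best position so far rather than a literal step back to the previous position.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def fitness(data, centroids):
    """Sum over clusters of the largest centroid-to-member distance."""
    worst = [0.0] * len(centroids)
    for p in data:
        dists = [euclidean(p, c) for c in centroids]
        j = dists.index(min(dists))
        worst[j] = max(worst[j], dists[j])
    return sum(worst)


def pso_cluster(data, k, particles=10, iters=50,
                w=0.72, c1=1.49, c2=1.49, seed=0):
    """Global-best PSO; each particle is one candidate set of k centroids."""
    rng = random.Random(seed)
    dim = len(data[0])
    pos = [[list(p) for p in rng.sample(data, k)] for _ in range(particles)]
    vel = [[[0.0] * dim for _ in range(k)] for _ in range(particles)]
    # Memory: best position and best fitness seen so far by each particle.
    best_pos = [[c[:] for c in p] for p in pos]
    best_fit = [fitness(data, p) for p in pos]
    g = min(range(particles), key=best_fit.__getitem__)
    g_pos, g_fit = [c[:] for c in best_pos[g]], best_fit[g]
    for _ in range(iters):
        for i in range(particles):
            for j in range(k):
                for d in range(dim):
                    r1, r2 = rng.random(), rng.random()
                    vel[i][j][d] = (w * vel[i][j][d]
                                    + c1 * r1 * (best_pos[i][j][d] - pos[i][j][d])
                                    + c2 * r2 * (g_pos[j][d] - pos[i][j][d]))
                    pos[i][j][d] += vel[i][j][d]
            f = fitness(data, pos[i])
            if f < best_fit[i]:  # new position beats the remembered one
                best_fit[i], best_pos[i] = f, [c[:] for c in pos[i]]
                if f < g_fit:    # and also beats the best of the whole swarm
                    g_fit, g_pos = f, [c[:] for c in pos[i]]
    return g_pos, g_fit
```

Because every candidate position is scored before it can replace a remembered one, a move that worsens the fitness never overwrites the stored best solution. K-means has no such check.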

Analysis
:: Similarity Evaluation

 Through a basic external evaluation of a small sample of data
vectors, similarities were traced to support the concept that
homogeneous users had been grouped within the same clusters

 Although external evaluation is not used as an argument for the
final conclusions, it still gives confidence that the clustering
algorithms were developed properly

Analysis
:: Limitations

 The fitness function is the main evaluation method
 Combining it with indirect evaluation would give more accurate conclusions

 Fitness was measured for a predefined number of clusters
 Hypothetically, PSO would continue to perform well for higher values of K,
but the experiments do not prove this

 The basic external evaluation should not be taken as a criterion for the performance
of the algorithms; rather, it indicates that the algorithms were most likely
implemented correctly

Conclusions

 The experiments reinforce the superiority of PSO in terms of performance
regardless of the nature and the dimensionality of the data
 Important fact: the data come from real-life transactions

 Indication that the higher the number of clusters, the better the resulting fitness for PSO
 This implies additional processing effort and memory use
 The best number of clusters can be defined based on processing time and fitness

Future Work

 Compare different PSO hybrids without a predefined number of clusters

 Develop the personalised mechanism to propose relevant advertisements

Inside a cluster:
Subgroup 1: has seen Advertisement A; show Advertisement B
Subgroup 2: has seen Advertisement B; show Advertisement A
Subgroup 3: has seen Advertisement A and Advertisement B; show an advertisement from a neighbour cluster

 Users' actions will define the performance: an indirect method of evaluation
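
The subgroup rules of the proposed mechanism could look like this in Python. This is a hypothetical sketch of future work, not an existing implementation; the function name `next_advertisement` and its parameters are assumptions.

```python
def next_advertisement(seen, cluster_ads, neighbour_ads):
    """Choose the next advertisement for a user.

    Prefer an advertisement from the user's own cluster that the user has
    not seen yet; if all of them have been seen, fall back to an unseen
    advertisement from the neighbour cluster. Returns None if nothing is left.
    """
    for ad in cluster_ads:
        if ad not in seen:
            return ad
    for ad in neighbour_ads:
        if ad not in seen:
            return ad
    return None
```

With `cluster_ads=["A", "B"]`, a user who has seen only A is shown B, a user who has seen only B is shown A, and a user who has seen both is shown an advertisement from the neighbour cluster.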

Thank you!

Questions / Comments

References

Philip, K., Armstrong, G., Wong, V. and Saunders, J., 2010. Principles of Marketing, 5th edition. New Jersey: Pearson Education, p.7

Giuffrida, G., Reforgiato, D., Tribulato, G. and Zabra, C., 2001. A Banner Recommendation System Based on Web Navigation History. Computational Intelligence and Data Mining (CIDM), 2011 IEEE Symposium, Paris

Liu, B., 2006. Web Data Mining: Exploring Hyperlinks, Contents, and Usage Data. Chicago: Springer, p.6
