This presentation gives an overview of geodemographic classifications and explains why open tools and methods are needed to create them. It also describes the challenges involved in creating real-time geodemographic classifications and the use of social media data for geodemographic applications.
1. Geodemographics: Open tools and methods
Dr. Muhammad Adnan
Department of Geography, University College London
Web: http://www.uncertaintyofidentity.com
Email: m.adnan@ucl.ac.uk
Twitter: @gisandtech
2. Lecture Outline
• Geodemographic Classification
• Problems with the Geodemographic Classifications
• Real-time bespoke Geodemographic Classifications
• GeodemCreator: Software for creating Geodemographic Classifications
• Social Media data for Geodemographics
3. Geodemographics
• “Analysis of people by where they live” or “locality marketing” (Sleight, 1993:3)
[Diagram labels: Person, Home, Address, Area]
4. Steps in Creating a Geodemographic Classification
• Variable Selection
• Transformation of the Data
• Standardisation of the Data
• Clustering of the Data (k-means)
• Naming the clusters
5. Data – Census + Other
ONS Output Area Classification (2001 and 2011)
• Census data: 100%
Experian: Mosaic
• Census data: 54%
• Non-Census data: 46%
CACI: Acorn
• Census data: 30%
• Non-Census data: 70%
6. Standardising the data
• Z-Scores
• Widely used variable normalisation technique
• Can create outliers in the datasets
• Range Standardisation
• Standardise values between a range of 0-1
• Can erase interesting patterns in the data
• Principal Component Analysis (PCA)
• Reduces the dimensions of a data set
• Focuses on the part of dataset having maximum variance
• Can erase interesting patterns in the data
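The first two options above can be sketched in a few lines. This is a minimal illustration, not the procedure used for any particular classification; the variable `pct_renting` and its values are invented for the example, and PCA is left out to keep the sketch short.

```python
from statistics import mean, pstdev

def z_scores(values):
    """Centre on the mean and scale by the population standard deviation."""
    mu = mean(values)
    sigma = pstdev(values)
    if sigma == 0:
        return [0.0 for _ in values]
    return [(v - mu) / sigma for v in values]

def range_standardise(values):
    """Rescale linearly so the minimum maps to 0 and the maximum to 1."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Invented example variable: one value per area, the last one is extreme.
pct_renting = [10.0, 12.0, 11.0, 13.0, 60.0]
z = z_scores(pct_renting)
r = range_standardise(pct_renting)
```

With one extreme area, the z-score of that area stays far from the rest, while range standardisation squeezes the other four areas into a narrow band near 0, which is one way the "interesting patterns" can be flattened.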
7. Segmentations are created by cluster analysis
[Figure: a data matrix of areas (Area1, Area2, Area3, ...) by variables (V1, V2, V3, ...) plotted in variable space (Variable 1 against Variable 2), with the areas grouped into Cluster 1, Cluster 2 and Cluster 3]
11. Does one size fit all ?
• Most geodemographic classifications divide areas into a
specified number of categories
• 2011 OAC divides the Output Areas in the UK into 8 broad
categories
• Do these categories account for all the characteristics of
the population ?
• We need to create bespoke small area classifications
• Geodemographic categories only apply to a particular area
12. Closed Methods
• Commercial geodemographic classifications (e.g. Mosaic, Acorn) use closed methods
• Data sources used?
• Weighting of the variables?
• Data standardisation techniques employed?
• Clustering algorithm applied?
• We need open methods and clear documentation of the
geodemographic classifications
• 2001 OAC
• 2001 LOAC (London's Output Area Classification)
• 2011 OAC
13. Public Consultation
• Users of a classification cannot modify it or give feedback
• Users should have the control to modify the classification through their feedback
• UCL's E-Society Classification
16. Need for real time Geodemographics
• Current classifications are created using static data sources
• Rate and scale of current population change is making large
surveys (census) increasingly redundant
• Significant hidden value in transactional data
• Data is increasingly available in near real time
e.g. ONS (Office for National Statistics) NESS API
• Social media data is available in real time
17. What are real time Geodemographics ?
[Diagram labels: Specification; Real time feeds of data; Estimation; Online specification of inputs; Clustering; Testing; Visualisation]
18. Computational challenges
• Integration of large and possibly disparate databases
• E.g. NHS data; Census data
• Data normalisation and optimization for fast transactions
• Minimizing computational time of clustering algorithms (very important!)
• Common protocol
• XML (SOAP)
• Use of non-traditional data sources (Singleton, 2008)
• E.g. Flickr, Facebook, Twitter
19. Important Challenge: Selection of clustering algorithm
• K-Means
• PAM (Partitioning Around Medoids)
• CLARA (Clustering Large Applications)
• GA (Genetic Algorithm)
20. k-means
• Widely used clustering algorithm for geodemographics
• Attempts to find cluster centroids by minimising the within-cluster sum of squared distances.
• Results are unstable because they depend on the initial seed assignment.
• Sensitive to outliers in the data set.
• Creating a geodemographic classification requires running the algorithm multiple times.
• 10,000 times (Singleton, 2008)
• Computationally expensive in a real time environment.
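The points above (seed dependence, repeated runs, cost) can be illustrated with a small sketch. It assumes a plain Lloyd-style k-means and keeps the run with the lowest within-cluster sum of squares; the function names, the synthetic `areas` data and the run counts are all illustrative choices, not values from the slides.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans_once(points, k, rng, max_iter=100):
    """One k-means run starting from k randomly chosen seed points."""
    centroids = rng.sample(points, k)
    labels = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda c: sq_dist(p, centroids[c])) for p in points
        ]
        if new_labels == labels:  # assignments stopped changing
            break
        labels = new_labels
        for c in range(k):
            members = [p for p, l in zip(points, labels) if l == c]
            if members:  # keep the old centroid if a cluster empties
                centroids[c] = tuple(sum(x) / len(members) for x in zip(*members))
    wss = sum(sq_dist(p, centroids[l]) for p, l in zip(points, labels))
    return wss, labels, centroids

def kmeans_best_of(points, k, n_runs, seed=0):
    """Repeat k-means n_runs times and keep the lowest-WSS result."""
    rng = random.Random(seed)
    runs = (kmeans_once(points, k, rng) for _ in range(n_runs))
    return min(runs, key=lambda run: run[0])

# Synthetic stand-in for standardised area data: three loose groups.
data_rng = random.Random(42)
centres = [(0, 0), (10, 0), (5, 10)]
areas = [
    (cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
    for cx, cy in centres
    for _ in range(20)
]
single = kmeans_best_of(areas, 3, n_runs=1)
best = kmeans_best_of(areas, 3, n_runs=50)
```

Because both calls share the same seed, the 50-run result can never be worse than the single run; the price is 50 times the work, which is the cost that matters in a real-time setting.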
24. Alternate Clustering Algorithms
• PAM (Partitioning around medoids)
• CLARA (Clustering Large Applications)
• GA (Genetic Algorithm)
25. Alternate Clustering Algorithms…
• PAM (Partitioning around medoids)
• It tries to minimize the sum of dissimilarities of the data points to their cluster centers (medoids).
• Less sensitive to outliers than k-means.
• Does not scale to large data sets.
• Produces better results than k-means for smaller data sets.
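A minimal sketch of the medoid idea follows. It is a simplification of PAM under stated assumptions: the initial medoids are drawn at random rather than by PAM's BUILD phase, any improving swap is accepted as soon as it is found, and Manhattan distance is used as the dissimilarity. The eight toy points are invented for the example.

```python
import random

def manhattan(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def total_cost(points, medoids, dist):
    """Sum of each point's dissimilarity to its nearest medoid."""
    return sum(min(dist(p, points[m]) for m in medoids) for p in points)

def pam(points, k, dist=manhattan, seed=0):
    """Swap search over medoid indices until no single swap lowers the cost."""
    rng = random.Random(seed)
    medoids = rng.sample(range(len(points)), k)
    cost = total_cost(points, medoids, dist)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for candidate in range(len(points)):
                if candidate in medoids:
                    continue
                trial = medoids[:i] + [candidate] + medoids[i + 1:]
                trial_cost = total_cost(points, trial, dist)
                if trial_cost < cost:
                    medoids, cost, improved = trial, trial_cost, True
    return sorted(medoids), cost

# Two invented groups of four areas each.
points = [(0, 0), (0, 1), (1, 0), (1, 1), (10, 10), (10, 11), (11, 10), (11, 11)]
medoids, cost = pam(points, k=2)
```

Each candidate swap rescans every point, so the work grows quickly with the number of areas, which is the scaling limit noted above. Medoids are always actual data points, so a single extreme area cannot drag a cluster centre away the way it drags a mean.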
26. Alternate Clustering Algorithms…
• CLARA (Clustering Large Applications)
• It draws multiple samples of the dataset, applies PAM to
each sample and returns the best result.
• Can handle large data sets as it operates on samples rather than
on actual data set.
• Could be a better choice for creating classifications on the
fly.
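The sampling idea can be sketched as follows. This is an illustration under assumptions, not a reference implementation: the inner `pam` is a simplified swap search with random initial medoids, "best" is taken to mean the lowest total dissimilarity over the full data set, and the defaults for `n_samples` and `sample_size` are arbitrary choices for the example.

```python
import random

def dist(p, q):
    return sum(abs(a - b) for a, b in zip(p, q))

def cost(points, medoids):
    """Total dissimilarity of points to their nearest medoid."""
    return sum(min(dist(p, m) for m in medoids) for p in points)

def pam(sample, k, rng):
    """Simplified PAM-style swap search; returns medoid points."""
    medoids = rng.sample(sample, k)
    best = cost(sample, medoids)
    improved = True
    while improved:
        improved = False
        for i in range(k):
            for cand in sample:
                if cand in medoids:
                    continue
                trial = medoids[:i] + [cand] + medoids[i + 1:]
                c = cost(sample, trial)
                if c < best:
                    medoids, best, improved = trial, c, True
    return medoids

def clara(points, k, n_samples=5, sample_size=40, seed=0):
    """Run PAM on several random samples; keep the medoids that fit ALL points best."""
    rng = random.Random(seed)
    best_medoids, best_cost = None, float("inf")
    for _ in range(n_samples):
        sample = rng.sample(points, min(sample_size, len(points)))
        medoids = pam(sample, k, rng)
        full_cost = cost(points, medoids)  # scored on the full data set
        if full_cost < best_cost:
            best_medoids, best_cost = medoids, full_cost
    return best_medoids, best_cost

# Synthetic stand-in for a larger set of areas: three groups of 200.
data_rng = random.Random(7)
centres = [(0, 0), (10, 0), (5, 10)]
areas = [
    (cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
    for cx, cy in centres
    for _ in range(200)
]
medoids, total = clara(areas, k=3)
```

The expensive swap search only ever sees `sample_size` points; the full data set is touched once per sample, for scoring. That is why the run time grows slowly with the number of areas.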
27. Alternate Clustering Algorithms…
• GA (Genetic Algorithm)
• It is inspired by models of biological evolution. It produces
results through a breeding procedure.
• Creates hierarchies of generations and then merges the hierarchies into homogeneous groups having similar characteristics.
• Can be time consuming due to the creation of generation
hierarchies.
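One common way to apply a genetic algorithm to clustering is sketched below. It is a generic illustration and not necessarily the GA variant benchmarked in these slides: each candidate solution is a set of k data points used as cluster centres, the fitter half of each generation survives, and children are bred by mixing two parents' centres with occasional random mutation. All parameter values are arbitrary.

```python
import random

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def fitness(points, chrom):
    """Lower is better: total squared distance to the nearest chosen centre."""
    return sum(min(sq_dist(p, points[i]) for i in chrom) for p in points)

def ga_cluster(points, k, pop_size=20, generations=30, mutation_rate=0.2, seed=0):
    rng = random.Random(seed)
    n = len(points)
    # Each chromosome is a list of k distinct point indices.
    population = [rng.sample(range(n), k) for _ in range(pop_size)]
    history = []
    for _ in range(generations):
        population.sort(key=lambda c: fitness(points, c))
        history.append(fitness(points, population[0]))
        parents = population[: pop_size // 2]  # the fitter half survives
        children = []
        while len(parents) + len(children) < pop_size:
            a, b = rng.sample(parents, 2)
            pool = sorted(set(a) | set(b))
            child = rng.sample(pool, k)  # crossover: mix the parents' centres
            if rng.random() < mutation_rate:
                replacement = rng.randrange(n)
                if replacement not in child:
                    child[rng.randrange(k)] = replacement  # mutation
            children.append(child)
        population = parents + children
    best = min(population, key=lambda c: fitness(points, c))
    return best, fitness(points, best), history

# Synthetic stand-in data: three groups of 20 areas.
data_rng = random.Random(3)
centres = [(0, 0), (10, 0), (5, 10)]
areas = [
    (cx + data_rng.gauss(0, 1), cy + data_rng.gauss(0, 1))
    for cx, cy in centres
    for _ in range(20)
]
best, best_fitness, history = ga_cluster(areas, k=3)
```

Every generation re-evaluates the whole population, so the run time is roughly `pop_size * generations` fitness evaluations, which is consistent with the point above that a GA can be time consuming.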
28. Comparing computational efficiency (Z-scores)
[Figure: three panels showing OA (Output Area) level results, LSOA (Lower Super Output Area) level results and Ward level results]
29. Algorithm Stability (w.r.t. Computational time)
[Figure: three panels, one each for k-means, GA and CLARA, run on OA (Output Area) level data 120 times on each iteration. Axis ticks run from 1 to 97 and from 0 to 4; axis label: Time (s).]
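Timings of this kind can be collected with a small harness like the one below. It is a sketch under assumptions: "120 times on each iteration" is read as one timed batch of 120 runs per iteration, the `kmeans` function is only a stand-in workload, and the batch sizes in the example are kept tiny so it finishes quickly.

```python
import random
import time

def sq_dist(p, q):
    return sum((a - b) ** 2 for a, b in zip(p, q))

def kmeans(points, k, rng, iters=10):
    """Plain k-means with a fixed number of iterations (stand-in workload)."""
    centroids = rng.sample(points, k)
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(range(k), key=lambda c: sq_dist(p, centroids[c]))
            groups[nearest].append(p)
        centroids = [
            tuple(sum(x) / len(g) for x in zip(*g)) if g else centroids[i]
            for i, g in enumerate(groups)
        ]
    return centroids

def time_batches(run, n_batches, runs_per_batch):
    """Return the wall-clock seconds taken by each batch of repeated runs."""
    timings = []
    for _ in range(n_batches):
        start = time.perf_counter()
        for _ in range(runs_per_batch):
            run()
        timings.append(time.perf_counter() - start)
    return timings

rng = random.Random(1)
areas = [(rng.random(), rng.random()) for _ in range(200)]
timings = time_batches(lambda: kmeans(areas, 4, rng), n_batches=5, runs_per_batch=3)
spread = max(timings) - min(timings)  # a crude measure of timing stability
```

Swapping the lambda for a different clustering function gives comparable series for each algorithm; a small `spread` relative to the mean batch time indicates a stable run time.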
32. GeodemCreator
• Allows users to create Geodemographic Classifications
• Users have the control of how a Geodemographic
Classification is created (Open Methods !)
41. Why do we need Social Media data for Geodemographics?
• Traditional geodemographic classifications are based on
Census data
• Night time geography
• These classifications do not identify where the population is
during the day time
• We do not know about the Social links between different
people
• A solution is to combine Social Media data with traditional data sources
42. Geodemographics
• “Analysis of people by where they live” or “locality
marketing”
Social Media Geodemographics
• “Analysis of people by where they live, travel, and who
they communicate with”
43. Social Media Geodemographics
• Who: Ethnicity, Gender, and Age of social media users
• Where: Where social media conversations are happening
and who is leading them
• Intelligence about where people are located and what they are doing
• When: What time of day conversations happen
44. Twitter (www.twitter.com)
• Online social networking and microblogging service
• Launched in 2006
• Users can send messages of 140 characters or less
• Approximately 200 million active users
• 350 million tweets daily
• In 2012, the UK and London were ranked 4th and 3rd, respectively, in terms of the number of posted tweets
45. Data available through the Twitter API
• User Creation Date
• Followers
• Friends
• User ID
• Language
• Location
• Name
• Screen Name
• Time Zone
• Geo Enabled
• Latitude
• Longitude
• Tweet date and time
• Tweet text
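The fields listed above could be held in a simple record type for analysis. The sketch below is illustrative only: the attribute names are hypothetical and are not the Twitter API's own field names, the two example records are invented, and the helper functions show the "where" and "when" questions in their simplest form.

```python
from dataclasses import dataclass
from datetime import datetime
from typing import Optional

@dataclass
class TweetRecord:
    # Hypothetical attribute names mirroring the fields listed on the slide.
    user_id: int
    name: str
    screen_name: str
    user_created_at: datetime
    followers: int
    friends: int
    language: str
    location: str
    time_zone: Optional[str]
    geo_enabled: bool
    latitude: Optional[float]
    longitude: Optional[float]
    created_at: datetime
    text: str

def geotagged(records):
    """Keep only tweets that carry usable coordinates (the 'where')."""
    return [
        r for r in records
        if r.geo_enabled and r.latitude is not None and r.longitude is not None
    ]

def tweets_by_hour(records):
    """Count tweets per hour of day (the 'when')."""
    counts = {}
    for r in records:
        counts[r.created_at.hour] = counts.get(r.created_at.hour, 0) + 1
    return counts

# Two invented records for illustration.
records = [
    TweetRecord(1, "Example User", "example_user", datetime(2010, 5, 1), 120, 80,
                "en", "London", "London", True, 51.52, -0.13,
                datetime(2013, 3, 4, 8, 15), "On the way to work"),
    TweetRecord(2, "Another User", "another_user", datetime(2011, 9, 9), 15, 40,
                "en", "", None, False, None, None,
                datetime(2013, 3, 4, 22, 40), "Good night"),
]
located = geotagged(records)
hourly = tweets_by_hour(records)
```

Only geo-enabled tweets with coordinates can be placed on a map, so the share of records that survive `geotagged` bounds what any daytime-location analysis can cover.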
47. Analysing Names on Twitter
• Some examples of NAME variations on Twitter
Real Names
Kevin Hodge
Andre Alves
Jose de Franco
Carolina Thomas, Dr.
Prof. Martha Del Val
Fabíola Sanchez Fernandes
Fake Names
Castor 5.
WHAT IS LOVE?
MysticMind
KIRILL_aka_KID
Vanessa
Petuna
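The slide does not say how names were sorted into the two groups. Purely as an illustration, the rule-of-thumb filter below reproduces the split shown above: it strips a few titles and treats a display name as a plausible full name only if at least two alphabetic words remain and they are not all upper case. The `TITLES` set and the function name are hypothetical, and a real analysis would need something far more robust (for example, lookups against name databases).

```python
# Hypothetical and far from complete list of titles to ignore.
TITLES = {"dr", "prof", "mr", "mrs", "ms"}

def looks_like_full_name(display_name):
    """Heuristic: at least two alphabetic, not-all-caps words after removing titles."""
    tokens = [t.strip(".,") for t in display_name.split()]
    tokens = [t for t in tokens if t and t.lower() not in TITLES]
    if len(tokens) < 2:
        return False  # single words such as "Vanessa" lack a surname
    if not all(t.isalpha() for t in tokens):
        return False  # digits, underscores, punctuation
    if all(t.isupper() for t in tokens):
        return False  # shouted phrases rather than names
    return True

real_examples = [
    "Kevin Hodge", "Andre Alves", "Jose de Franco",
    "Carolina Thomas, Dr.", "Prof. Martha Del Val", "Fabíola Sanchez Fernandes",
]
fake_examples = [
    "Castor 5.", "WHAT IS LOVE?", "MysticMind",
    "KIRILL_aka_KID", "Vanessa", "Petuna",
]
flags_real = [looks_like_full_name(n) for n in real_examples]
flags_fake = [looks_like_full_name(n) for n in fake_examples]
```

Note that single given names are rejected only because they cannot be matched to a surname, not because they are necessarily invented.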
51. Summary
• Geodemographics is the analysis of people by where they live
• But generalised geodemographic classifications have some
problems
– We need bespoke classifications for smaller areas
• Real-time geodemographic classifications are one way to create bespoke classifications
• Methods of creating current classifications are not open
– We need Open tools and Open methods for geodemographics
• Social media data for geodemographic classifications