"Melting Pot" of the Sciences in interdisciplinary research
1. Stirring the melting pot of the sciences:
Leading the way to interdisciplinary research
Mixing Social Science into Computer Science,
Bioinformatics and more.
Natalie Jane de Vries
2. Introduction - The University of Newcastle and CIBM
• The Newcastle region is the second most
populated area in the Australian state of New
South Wales (approx 510,000)
• Situated 162 km (2 hours) North of Sydney in
the Hunter Region
• University of Newcastle established: 1965
• Directors of CIBM:
Prof. Pablo Moscato and Co-director Prof.
Rodney Scott
3. The Centre for Bioinformatics, Biomarker Discovery and
Information-based Medicine – Background
• One of only 10 Priority Research Centres of The University
of Newcastle.
• Origin: The Newcastle Bioinformatics Initiative (2002-
2006) established by the work of Moscato and Berretta in
Computer Science
Bioinformatics: the application of Computer Science and Information Technology to Biology/Life Sciences
Information-based Medicine: a shift toward a future of medicine that can become more personalized, more predictive, and ultimately more preventative
4. “Melting pot” of the Sciences?
• Big Data
• Data Analytics
• Consumer Insights
• Consumer Analytics
• ‘Internet of things’
• Social Media Analysis
• Clustering/subtyping/segmenting
• Ordering
• Ranking
• Optimization
• Community Detection
• Graph analysis
• Similarity Measures
• Classification
• Characterisation
• Predictive Analytics
• Etc.
6. Agenda
What will I talk about today?
• Part 1) General Introduction to the mixing of Computer Science,
Social Science, Marketing and Consumer Behaviour at our Centre
• Part 2) Clustering and Segmentation
– From Breast Cancer Subtypes to Consumer Behaviours to Social
Media Metrics data and more…
• Part 3) Reverse Engineering Consumer Behaviour Modelling
Constructs from Data
– We introduce the idea of functional constructs to model online
customer engagement behaviours through symbolic regression
• Part 4) Future Research Directions
– Future Directions, Aims, Conclusions and time for questions
7. Part 1: Computer Science and Consumer Behaviour
Research
• Increase in amount and size of consumer-related data
• Online technologies generate large datasets
• Increase in online behaviours towards brands
• Increasing importance of social media in marketing strategies
• Need for greater understanding of consumers through e.g. clustering
consumers (or objects in general) into similar groups
8. Part 2: Clustering and Segmentation
Novel clustering methodology developed by CIBM’s researchers: MST-kNN
Steps of the method, as illustrated on the slide:
• Start from the complete graph
• Compute the Minimum Spanning Tree
• Select and remove edges that are not k-Nearest Neighbors
• The final forest (a forest is a set of trees) gives the clusters
Previous (large scale) applications of the MST-kNN method:
• U.S. Stock market time series data (Inostroza-Ponta, Berretta, & Moscato, 2011)
• Yeast gene expression data (Inostroza-Ponta, Mendes, Berretta, & Moscato, 2007)
• Alzheimer’s disease data, in the order of 1 million data elements (Arefin, Mathieson, Johnstone, Berretta, & Moscato, 2012)
• Prostate cancer data (Capp et al., 2009)
• Social Media (Facebook) Metrics Data (Lucas et al. 2014)
These examples show that the methodology proposed here scales to larger datasets.
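The MST-kNN steps named on this slide can be sketched in a few lines of Python. This is a minimal single-pass illustration, not the published implementation: the function name `mst_knn_clusters`, the Euclidean distance, and the rule that an MST edge is kept when either endpoint is among the other's k nearest neighbours are all assumptions made for the example, and the sample points are invented.

```python
import math

def mst_knn_clusters(points, k):
    """Cluster points by keeping only MST edges that are also kNN edges (sketch)."""
    n = len(points)
    # Complete graph: all pairwise Euclidean distances (assumed metric).
    dist = [[math.dist(p, q) for q in points] for p in points]

    # Minimum Spanning Tree via Prim's algorithm.
    in_tree = [False] * n
    best = [math.inf] * n
    parent = [-1] * n
    best[0] = 0.0
    mst_edges = []
    for _ in range(n):
        u = min((i for i in range(n) if not in_tree[i]), key=lambda i: best[i])
        in_tree[u] = True
        if parent[u] >= 0:
            mst_edges.append((parent[u], u))
        for v in range(n):
            if not in_tree[v] and dist[u][v] < best[v]:
                best[v] = dist[u][v]
                parent[v] = u

    # k nearest neighbours of every point.
    knn = []
    for i in range(n):
        order = sorted((j for j in range(n) if j != i), key=lambda j: dist[i][j])
        knn.append(set(order[:k]))

    # Remove MST edges that are not kNN edges (assumed rule: either direction).
    kept = [(u, v) for u, v in mst_edges if v in knn[u] or u in knn[v]]

    # The remaining forest: each connected component is one cluster.
    root = list(range(n))

    def find(x):
        while root[x] != x:
            root[x] = root[root[x]]
            x = root[x]
        return x

    for u, v in kept:
        root[find(u)] = find(v)
    clusters = {}
    for i in range(n):
        clusters.setdefault(find(i), []).append(i)
    return sorted(clusters.values())

# Invented example: two well-separated groups of three points each.
sample = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
result = mst_knn_clusters(sample, k=2)
print(result)
```

In this toy case the single long MST edge joining the two groups is not a kNN edge, so removing it leaves two trees. The dense distance matrix is only suitable for small inputs; the large-scale applications listed above would need a more scalable implementation.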
9. Biomarker Discovery and Clustering in
Breast Cancer
• Incidence – In 2014, it is estimated that 15,270 women will be
diagnosed with breast cancer in Australia.
Molecular Subtypes:
• Luminal A
• Luminal B
• HER2-enriched
• Normal-like
• Basal-like
10. Treatment
Not all patients need the same treatment or respond to the same treatment
• Surgery
• Radiotherapy
• Hormonal therapy
• Chemotherapy
13. Customer Engagement Behaviours: behavioural manifestations
of Customer Engagement (CE) toward a firm after and beyond
purchase (van Doorn et al. 2010)
Online Customer Engagement Survey/Questionnaire Tool
14. Methodological Outline
Respondents’ chosen brand categories
Category No.   Explanation                                                  Percentage of sample
1              Fashion Brands                                               31.54%
2              Community, Charities, Personality and Sports Fan Pages       23.99%
3              Other Services                                               19.68%
4              Other Consumer Goods                                         8.09%
5              Hospitality (Restaurants, Cafes, Bars)                       7.28%
6              Consumer Electronics                                         7.01%
7              Automotive                                                   2.43%
15. Methodology: Difference Meta-features
The difference of values
between two measured
features may be able to
distinguish between two
given categories, even when
those features are not able to
do so alone (De Paula et al. 2011)
Previous successful
application of difference
meta-features in Alzheimer’s
Disease biomarker detection
(De Paula et al. 2011) and (Arefin et al.
2012), both in PLoS ONE.
Process diagram (data preparation): Data collection and pre-processing -> Meta-features: pair-wise differences -> Meta-features: pair-wise products -> Intra- and inter-construct relationships -> Distance computation
[Figure: two panels showing f1, f2 and the meta-feature (Meta-f) for samples in Class A and Class B]
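The idea of pair-wise difference meta-features can be illustrated with a short Python sketch. The helper name `difference_meta_features` and the numbers below are invented for illustration and are not taken from the study; the sketch only shows how one extra column per feature pair is derived.

```python
from itertools import combinations

def difference_meta_features(rows, names):
    """Return (meta_names, meta_rows) with one f_i - f_j column per feature pair."""
    pairs = list(combinations(range(len(names)), 2))
    meta_names = [f"{names[i]}-{names[j]}" for i, j in pairs]
    meta_rows = [[row[i] - row[j] for i, j in pairs] for row in rows]
    return meta_names, meta_rows

# Invented values: columns are f1 and f2.
class_a = [[2, 5], [6, 9], [4, 7]]
class_b = [[5, 2], [9, 6], [7, 4]]

meta_names, diff_a = difference_meta_features(class_a, ["f1", "f2"])
_, diff_b = difference_meta_features(class_b, ["f1", "f2"])
print(meta_names, diff_a, diff_b)
```

In this toy data the ranges of f1 and of f2 overlap between the two classes, while the difference f1-f2 is negative for every Class A sample and positive for every Class B sample.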
19. Future Research Directions in this study
• Various domains and contexts to apply the novel process outlined in
this study
• Combine a study using survey data as well as ‘live’ behaviour data from
social networking sites (real-time interactions)
• Further exploration of meta-features in both survey data and ‘real’
online behaviour clustering studies; ‘differences’ meta-features in this
study yielded better results
• This study guides the development of future feature selection models
to identify groups of consumers according to higher-order characteristics.
20. The MST-kNN Method in Social Media Metrics Data
Engagement in Motion: Exploring Short Term Dynamics in Page-level Social Media Metrics
Benjamin Lucas (1,2), Ahmed Shamsul Arefin (1,3), Natalie de Vries (1,3), Regina Berretta (1,3), Jamie Carlson (1,2), Pablo Moscato (1,3)
1 The University of Newcastle, Australia
2 Newcastle Business School, Faculty of Business and Law
3 The Priority Research Centre for Bioinformatics, Biomarker Discovery and Information-Based Medicine
24. Part 3: Reverse Engineering Consumer Behaviour
Modelling Constructs from Data
Consumer Behaviour Modelling is usually done by
testing hypotheses that are generated from theory
For example:
Source: de Vries & Carlson 2014 – Journal of Brand Management
Items (questions) make up
one theoretical construct in
Structural Equation Modelling
(Hair et al. 2014). For example:
29. Figure 2. The Figure shows the items ‘used’ by Eureqa through symbolic regression setting each of
the five ENG items as dependent variables (obtained using the whole data set).
de Vries NJ, Carlson J, Moscato P (2014) A Data-Driven Approach to Reverse Engineering Customer Engagement Models:
Towards Functional Constructs. PLoS ONE 9(7): e102768. doi:10.1371/journal.pone.0102768
30. Figure 3. Data Set A – Network found as a result of the application of the model finding optimization
software on each variable as a target.
de Vries NJ, Carlson J, Moscato P (2014) A Data-Driven Approach to Reverse Engineering Customer Engagement Models:
Towards Functional Constructs. PLoS ONE 9(7): e102768. doi:10.1371/journal.pone.0102768
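One way to picture how a network like the one in Figure 3 could be assembled: if each variable is modelled in turn as a target, every variable used in that model contributes a directed edge to the target. The sketch below assumes this reading of the caption; the function names, the dictionary format and the item lists are hypothetical and do not reproduce Eureqa output or the published results.

```python
def build_variable_network(used_by_target):
    """Turn {target: [variables used in its model]} into sorted (variable, target) edges."""
    edges = set()
    for target, used in used_by_target.items():
        for var in used:
            if var != target:  # ignore self-references
                edges.add((var, target))
    return sorted(edges)

def usage_count(edges):
    """Count in how many target models each variable appears."""
    counts = {}
    for var, _ in edges:
        counts[var] = counts.get(var, 0) + 1
    return counts

# Hypothetical model summaries: which items each fitted model used.
models = {
    "ENG1": ["ENG2", "ENG3"],
    "ENG2": ["ENG1"],
    "ENG3": ["ENG1", "ENG2"],
}

edges = build_variable_network(models)
counts = usage_count(edges)
print(edges)
print(counts)
```

Variables that appear in many models show up as nodes with many outgoing edges, which is one simple way to read structure from such a network.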
31. Inter-rater Agreement
de Vries NJ, Carlson J, Moscato P (2014) A Data-Driven Approach to
Reverse Engineering Customer Engagement Models: Towards Functional
Constructs. PLoS ONE 9(7): e102768. doi:10.1371/journal.pone.0102768
32. Our Future research directions
• Work on scalability of methodologies
• Improve optimisation algorithms (minimum distance, maximum
objectives, etc.)
• Meta-heuristics (Memetic Algorithms) for applications in the social
sciences
• Network alignment (complex network analysis) of consumer
behaviour networks for uncovering structure in datasets
• Proposal for an edited book on large-scale “Business and Consumer
Analytics” (Springer)
• Smart Cities Network (sensor data, optimisation of cities and their
networks)
• Digital Economy technologies
33. UoN and UKM
Things to remember:
• UoN is always open for research collaborations (depending on funds – we operate on a project basis)
• At CIBM we have supercomputing capacity available for large-scale projects
• In our team we have particularly strong expertise in operations research and management science
• CIBM is open to diversify into new areas (e.g. computational social science as demonstrated today)
• As Prof. Moscato says: “Do not hesitate to throw an ‘odd-ball’. Either we could be interested, or we
could put you in touch with other collaborators and colleagues”.
35. References
• Arefin AS, Mathieson L, Johnstone D, Berretta R, Moscato P (2012) Unveiling Clusters of RNA Transcript Pairs Associated with
Markers of Alzheimer’s Disease Progression, PLOS ONE, DOI: 10.1371/journal.pone.0045535
• Capp et al. (2009) Is there more than one proctitis syndrome? A revisitation using data from the TROG 96.01 trial, Radiotherapy
and Oncology, 90(3), 400-407
• Hair, J. F., Hult, G. T. M., Ringle, C. M. and Sarstedt, M. (2014) A Primer on Partial Least Squares Structural Equation Modeling
(PLS-SEM) Los Angeles: Sage Publications Inc.
• Inostroza-Ponta M, Mendes A, Berretta R, Moscato P (2007) An Integrated QAP-Based Approach to Visualize Patterns of Gene
Expression Similarity, Progress in Artificial Life, Lecture Notes in Computer Science, 4828, pp 156-167
• Inostroza-Ponta M, Berretta R, Moscato P (2011) QAPgrid: A Two Level QAP-Based Approach for Large-Scale Data Analysis and
Visualization, PLOS ONE, DOI: 10.1371/journal.pone.0014468
• Lucas B, Arefin AS, de Vries NJ, Berretta R, Carlson J, Moscato P (2014) Engagement in Motion: Exploring Short Term Dynamics
in Page-Level Social Media Metrics, IEEE Conference on Social Computing and Big Data and Cloud Computing (Sydney)
• de Vries NJ, Carlson J (2014) Examining the drivers and brand performance implications of customer engagement with brands in
the social media environment, Journal of Brand Management, 21, 495-515
• de Vries NJ, Carlson J, Moscato P (2014) A Data-Driven Approach to Reverse Engineering Customer Engagement Models:
Towards Functional Constructs, PLOS ONE, DOI: 10.1371/journal.pone.0102768
• de Vries NJ, Arefin AS, Moscato P (2014) Gauging Heterogeneity in Online Consumer Behaviour Data: A Proximity Graph
Approach, IEEE Conference on Social Computing and Big Data and Cloud Computing (Sydney)
• Marsden J, Budden D, Craig H, Moscato P (2013) Language Individuation and Marker Words: Shakespeare and His Maxwell's
Demon, PLOS ONE, DOI: 10.1371/journal.pone.0066813
• Naeni LM, de Vries NJ, Reis R, Arefin AS, Berretta R, Moscato P (2014) Identifying Communities of Trust and Confidence in the
Charity and Not-for-Profit Sector: A Memetic Algorithm Approach, IEEE Conference on Social Computing and Big Data and
Cloud Computing (Sydney)
• van Doorn, J., Lemon, K. N., Mittal, V., Nass, S., Pick, D., Pirner, P. and Verhoef, P. C. (2010). Customer Engagement Behavior:
Theoretical Foundations and Research Directions. Journal of Service Research, 13(3): 253-266.
37. New Publication
Published 7th April 2015 in PLOS ONE
N J de Vries, R Reis, P Moscato
Clustering of consumers based on trust and donating behaviours in the not-for-profit sector
Including symbolic regression predictive modeling for consumer involvement with charities
40. IEEE Conference paper
Methodology: Product Meta-features
The product of values between
two measured features may be
able to distinguish between
two given categories, even when
those features are not able to do
so alone.
This study is the first to trial the
application of this idea.
Left, the values of f1 (blue) and
f2 (red) do not distinguish the
classes well but their product
(meta-feature in green) does.
Process diagram (data preparation): Data collection and pre-processing -> Meta-features: pair-wise differences -> Meta-features: pair-wise products -> Intra- and inter-construct relationships -> Distance computation
[Figure: two panels showing f1, f2 and the product meta-feature (Meta-f) for samples in Class A and Class B]
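A short Python sketch of pair-wise product meta-features follows. The helper names `product_meta_features` and `ranges_overlap`, and all of the numbers, are invented for illustration; the overlap check is only a crude stand-in for "does this feature distinguish the classes" and is not the evaluation used in the paper.

```python
from itertools import combinations

def product_meta_features(rows, names):
    """Return (meta_names, meta_rows) with one f_i * f_j column per feature pair."""
    pairs = list(combinations(range(len(names)), 2))
    meta_names = [f"{names[i]}*{names[j]}" for i, j in pairs]
    meta_rows = [[row[i] * row[j] for i, j in pairs] for row in rows]
    return meta_names, meta_rows

def ranges_overlap(values_a, values_b):
    """True when the value ranges of the two classes intersect."""
    return min(values_a) <= max(values_b) and min(values_b) <= max(values_a)

# Invented values: columns are f1 and f2.
class_a = [[1, 8], [8, 1], [2, 4]]
class_b = [[4, 4], [2, 8], [8, 2]]

prod_names, prod_a = product_meta_features(class_a, ["f1", "f2"])
_, prod_b = product_meta_features(class_b, ["f1", "f2"])

f1_overlap = ranges_overlap([r[0] for r in class_a], [r[0] for r in class_b])
f2_overlap = ranges_overlap([r[1] for r in class_a], [r[1] for r in class_b])
meta_overlap = ranges_overlap([r[0] for r in prod_a], [r[0] for r in prod_b])
print(f1_overlap, f2_overlap, meta_overlap)
```

Here neither f1 nor f2 separates the classes by range, but their product takes one value for Class A and a different one for Class B.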
41. My publications
• A Data-Driven Approach to Reverse Engineering Customer Engagement
Models: Towards Functional Constructs (de Vries, Carlson and Moscato)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0102768
• Examining the drivers and brand performance implications of customer
engagement with brands in the social media environment (de Vries and
Carlson): http://www.palgrave-journals.com/bm/journal/v21/n6/abs/bm201418a.html
• Gauging Heterogeneity in Online Consumer Behaviour Data: A Proximity
Graph Approach (de Vries, Arefin and Moscato)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7034833
• Engagement in Motion: Exploring Short Term Dynamics in Page-Level Social
Media Metrics (Lucas et al)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=7034813&tag=1
• Identifying Communities of Trust and Confidence in the Charity and Not-for-
Profit Sector: A Memetic Algorithm Approach (Naeni et al)
http://ieeexplore.ieee.org/xpl/articleDetails.jsp?arnumber=7034835&refinements%3D4251871666%26filter%3DAND%28p_IS_Number%3A7034739%29
42. Other Sources
First uses of ‘meta-features’:
• Differences in Abundances of Cell-Signalling Proteins in Blood Reveal Novel
Biomarkers for Early Detection Of Clinical Alzheimer's Disease (De Paula et al)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0017481
• Unveiling Clusters of RNA Transcript Pairs Associated with Markers of Alzheimer’s
Disease Progression (Arefin et al)
http://journals.plos.org/plosone/article?id=10.1371/journal.pone.0045535
MST-kNN papers:
• An Integrated QAP-Based Approach to Visualize Patterns of Gene Expression
Similarity (Inostroza Ponta et al)
http://link.springer.com/chapter/10.1007/978-3-540-76931-6_14
• kNN-MST-Agglomerative: A fast and scalable graph-based data clustering approach
on GPU (Arefin et al)
http://ieeexplore.ieee.org/xpls/abs_all.jsp?arnumber=6295143
Editor's notes
We have all heard the following “buzzwords”, keywords and topics; they are what ‘traditional’ science and social science have in common nowadays: the analysis of large datasets and the development of scalable methods.
Note about how computational methods are highly variable (computational linguistics)
Only talk about this briefly and quickly. The only point is to highlight that the results using some sort of meta-feature were more significant
Just talk about general comparison – doing the process with 3 datasets means finding more solid “structure” in the dataset