Computational intelligence for big data analytics bda 2013
1. Big Data Analytics: Challenges and
What Computational Intelligence
Techniques May Offer
Ah-Hwee Tan
(http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University
Big Data Analytics Symposium
London, UK
13 September 2013
2. Outline
Big Data Analytics
Computational Intelligence Techniques
Web Data Analytics
Flexible Organizer for Competitive
Intelligence (FOCI)
Web Information Fusion and Associative
Discovery
Analytics for Active Living for Elderly
3. The Era of Big Data
Big data refers to
a collection of data sets so large and complex
that they exceed the competence of commonly used
IT systems in terms of processing space and/or
time.
4. Sources of Big Data
• Traditionally, mostly produced in scientific fields such as
astronomy, meteorology, genomics, physics, biology, and
environmental research.
• With the rapid development of IT and the consequent
decrease in the cost of collecting and storing data, big data
has been generated by almost every industry and sector, as
well as by governmental departments, including retail,
finance, banking, security, audit, electric power, and
healthcare.
• Recently, big data over the Web (big Web data for short),
which includes all the context data, such as user-generated
contents, browser/search log data, deep web data, etc.
5. Examples of Big Data
(Source: Wikipedia)
• Walmart handles more than 1 million customer transactions
every hour, which are imported into databases estimated to
contain more than 2.5 petabytes (2560 terabytes) of data –
the equivalent of 167 times the information contained in all the books in
the US Library of Congress.
• Facebook handles 50 billion photos from its user base.
• FICO Falcon Credit Card Fraud Detection System protects
2.1 billion active accounts world-wide.
• Windermere Real Estate uses anonymous GPS signals from
nearly 100 million drivers to help new home buyers
determine their typical drive times to and from work
throughout various times of the day.
6. Examples of Big Data
(Source: Wikipedia)
• NASA Center for Climate Simulation
(NCCS) stores 32 petabytes of
climate observations and simulations
on the Discover supercomputing
cluster.
• Utah Data Center is a data center
currently being constructed by the
United States National Security
Agency. When finished, the facility
will handle yottabytes of information
collected by NSA over the Internet.
Value     Metric
1000      kB   kilobyte
1000^2    MB   megabyte
1000^3    GB   gigabyte
1000^4    TB   terabyte
1000^5    PB   petabyte
1000^6    EB   exabyte
1000^7    ZB   zettabyte
1000^8    YB   yottabyte
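The unit table can be turned into a small conversion helper. The sketch below is only an illustration: the function names are hypothetical, and it assumes the decimal (powers of 1000) convention of the table. It also shows that the "2.5 petabytes (2560 terabytes)" figure quoted for Walmart corresponds to the binary convention (1 PB = 1024 TB) rather than the decimal one.

```python
# Decimal (SI) byte units as in the table: each step up multiplies by 1000.
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]


def to_bytes(value, unit):
    """Convert a value given in a decimal unit (e.g. 2.5, "PB") to bytes."""
    power = UNITS.index(unit) + 1  # kB -> 1000**1, MB -> 1000**2, ...
    return value * 1000 ** power


def convert(value, src, dst):
    """Convert between two decimal units, e.g. petabytes to terabytes."""
    return to_bytes(value, src) / to_bytes(1, dst)


if __name__ == "__main__":
    print(convert(2.5, "PB", "TB"))  # decimal convention: 2500.0
    print(2.5 * 1024)                # binary convention:  2560.0
```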
7. Money of Big Data
(Source: Wikipedia)
• "Big data" has increased the demand for information
management specialists.
• Software AG, Oracle Corporation, IBM, Microsoft,
SAP, EMC, and HP have spent more than $15 billion
on software firms specializing in data management
and analytics.
• In 2010, this industry on its own was worth more than
$100 billion and was growing at almost 10 percent a
year: about twice as fast as the software business as
a whole.
8. Market of Big Data
(Source: Wikipedia)
• Developed economies make increasing use of data-intensive
technologies. There are 4.6 billion mobile-phone
subscriptions worldwide, and between 1 billion and
2 billion people access the internet.
• The world's effective capacity to exchange information
through telecommunication networks was 281
petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes
in 2000, and 65 exabytes in 2007 [14], and it is predicted that
the amount of traffic flowing over the internet will reach
667 exabytes annually by 2013 [5].
9. Big Data Market Segments
(Report by Transparency Market Research)
• Segmentation of the big data market by components, by
applications and by geography.
• The different components included are software and
services, hardware and storage.
• The software and services segment dominates the components
market, whereas the storage segment will be the fastest
growing segment for the next 5 years owing to the
perpetual growth in the data generated.
10. Big Data Market Segment by
Applications
• Covers eight application segments, namely financial services,
manufacturing, healthcare, telecommunication,
government, retail, media & entertainment, and others.
• Financial services, healthcare and the government sector
are the top three contributors to the big data market and
together held more than 55% of the big data market in
2012.
• The media and entertainment and the healthcare sectors will
grow at a high CAGR of nearly 42% from 2012 to 2018. The
growth in data in the form of video, images, and games is
driving the media and entertainment segment.
Read more: http://www.digitaljournal.com/pr/1395146#ixzz2b0hvuxrQ
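To put the quoted growth rate in perspective, a CAGR of about 42% sustained over the six years from 2012 to 2018 compounds to roughly an eightfold increase. The short sketch below works through that arithmetic; the function names are illustrative, and it assumes a constant annual rate over the whole period.

```python
def compound_growth_factor(cagr, years):
    """Total growth multiple after `years` at a constant annual rate `cagr`."""
    return (1.0 + cagr) ** years


def implied_cagr(start, end, years):
    """Constant annual rate that takes `start` to `end` in `years` years."""
    return (end / start) ** (1.0 / years) - 1.0


factor = compound_growth_factor(0.42, 2018 - 2012)

if __name__ == "__main__":
    print(f"growth multiple over 6 years: {factor:.1f}x")
```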
11. Challenges of Big Data
• Volume
– Size in the order of petabytes,
exabytes, …
• Velocity
– Time-sensitive data, data that
grow exponentially or even at
rates that overwhelm the well-known Moore's Law
• Variety
– From structured data to semi-structured and
completely unstructured data of different types, such as
text, image, audio, video, click streams, log files,
12. Deeper Issues of Big Data
(The additional 3Vs)
• Validity
– Is the data correct and accurate for the intended
usage?
• Veracity
– Are the results meaningful for the given problem
space?
• Volatility
– How long do you need to look/store this data?
13. Computational Intelligence
• Neural Networks (IJCNN)
– Brain-like mathematical models for pattern
recognition, memory, and association discovery
– Examples: Perceptron, BP, SVM, SOM, ART, …
• Fuzzy Systems (IEEE-FUZZ)
– Fuzzy operators for handling non-discrete reasoning
– Examples: FNN, Fuzzy C-Means, …
14. Computational Intelligence
• Evolutionary Computing (CEC)
– Classes of heuristic algorithms that repeatedly
search for good solutions by mimicking
the process of natural evolution
– Commonly used for optimization and
search problems
– Examples: Genetic Algo, Memetic Algo,
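As a rough illustration of the evolutionary search described above, the sketch below runs a minimal genetic algorithm on a toy problem (maximizing the number of 1-bits in a bit string). The fitness function, parameter values, and operator choices (tournament selection, one-point crossover, bit-flip mutation, elitism) are assumptions made for this example and are not taken from the talk.

```python
import random


def one_max(bits):
    """Toy fitness: number of 1-bits (higher is better)."""
    return sum(bits)


def evolve(n_bits=20, pop_size=30, generations=40, mutation_rate=0.02, seed=0):
    rng = random.Random(seed)
    pop = [[rng.randint(0, 1) for _ in range(n_bits)] for _ in range(pop_size)]
    history = []  # best fitness seen in each generation
    for _ in range(generations):
        pop.sort(key=one_max, reverse=True)
        history.append(one_max(pop[0]))
        next_pop = [pop[0][:]]  # elitism: carry the best individual over unchanged
        while len(next_pop) < pop_size:
            # Tournament selection: the fitter of two randomly drawn individuals.
            p1 = max(rng.sample(pop, 2), key=one_max)
            p2 = max(rng.sample(pop, 2), key=one_max)
            cut = rng.randrange(1, n_bits)  # one-point crossover
            child = p1[:cut] + p2[cut:]
            # Bit-flip mutation with a small per-bit probability.
            child = [b ^ 1 if rng.random() < mutation_rate else b for b in child]
            next_pop.append(child)
        pop = next_pop
    best = max(pop, key=one_max)
    history.append(one_max(best))
    return best, history


best, history = evolve()

if __name__ == "__main__":
    print("first-generation best:", history[0], "final best:", history[-1])
```

Because the best individual is always copied into the next generation, the best fitness in this sketch never decreases from one generation to the next.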
15. Flagship Events of
Computational Intelligence
• World Congress on Computational Intelligence
(Australia 2012, Beijing 2014)
• IEEE Symposium on Computational Intelligence
(Singapore 2013, Florida, USA 2014)
• IEEE Symposium on Computational Intelligence
in Big Data (IEEE CIBD'2014)
16. Examples of Use of CI in Big Data
• Data size and feature space adaptation
• Uncertainty modeling in learning from big data
• Distributed learning techniques in uncertain environment
• Uncertainty in cloud computing
• Distributed parallel computation
• Feature selection/extraction in big data
• Sample selection based on uncertainty
• Incremental Learning
• Manifold Learning on big data
• Uncertainty techniques in big data classification/clustering
• Imbalance learning on big data
• Active learning on big data
• Random weight networks on big data
• Transfer learning on big data
17. Self-Organizing Neural
Networks for
Personalized Web Intelligence
Towards Personalized Web Intelligence
Ah-Hwee Tan, Hwee-Leng Ong,
Hong Pan, Jamie Ng, Qiu-Xiang Li
Knowledge and Information Systems 18 (2004) 297-306
18. Workflow for Web Data Analytics
• Search
– Getting the information
• Organize
(clustering/categorizing)
– Putting things in perspective
• Analyze (data mining)
– Discover hidden knowledge
• Share (knowledge management)
– Saving for reference and sharing
• Track
– Constant monitoring
19. Approaches to
Organizing/Analyzing
• Clustering
– Organizing information into groups based on
similarity functions and thresholds
– e.g. BullsEye, NorthernLight, Vivisimo
• Categorization
– Organizing information into a “predefined” set of
classes
– e.g. Yahoo!, Autonomy Knowledge Server
• Which is better?
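The clustering approach on this slide, grouping items by a similarity function and a threshold, might be sketched as follows. This is a generic single-pass scheme written for illustration, not the method of any of the products named above; the bag-of-words representation, the cosine similarity, and the threshold value are all assumptions.

```python
import math
from collections import Counter


def cosine(a, b):
    """Cosine similarity between two term-count vectors."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def threshold_cluster(docs, threshold=0.3):
    """Assign each document to its most similar cluster, or start a new
    cluster when no similarity reaches the threshold."""
    centroids, members = [], []
    for i, text in enumerate(docs):
        vec = Counter(text.lower().split())
        sims = [cosine(vec, c) for c in centroids]
        if sims and max(sims) >= threshold:
            j = sims.index(max(sims))
            centroids[j].update(vec)  # centroid = summed term counts
            members[j].append(i)
        else:
            centroids.append(Counter(vec))
            members.append([i])
    return members


docs = [
    "text mining tools",
    "mining text data",
    "stock market news",
    "market news today",
]

if __name__ == "__main__":
    print(threshold_cluster(docs))  # [[0, 1], [2, 3]]
```

No classes are predefined here: the number of clusters is determined by the data and the threshold, which is the trade-off the slide contrasts with categorization.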
20. Clustering
• Pros
– Unsupervised/self-organizing, require no training
or predefinition of classes
– Able to identify new themes
• Cons
– Users have no control
– Ever changing cluster structure
– Difficult to navigate and track
21. Categorization
• Pros
– Good control on classes
– Every info assigned to one or more classes
of interests
• Cons
– Require learning (supervised) and/or
definition of classification rules/knowledge
– Every info has to be assigned to one or
more classes
– Good control but lack flexibility to handle
new information
22. User-configurable Clustering
(Tan & Pan, PAKDD 2002)
• Information organization and content
management
• Online incremental clustering + user-defined
structure (preferences)
• Reduces to a clustering system if no user
indication given
• Allows personalization in a direct,
intuitive, and interactive manner
• Control + flexibility
23. ARAM for Personalized
Information Management
[Figure: ARAM architecture. A category field F2 (Information Clusters) sits above two input fields, F1a and F1b, which receive the Information Vector (A) and the Preference Vector (B).]
24. Flexible Organizer for Competitive
Intelligence (FOCI)
• A platform for gathering, organizing,
tracking, analyzing, and sharing
competitive information
• Natural way of turning raw search results
into personalized CI portfolios
– Multilingual enabled
– with Multilingual Efficient Analyzer
– Domain localization (Technology)
• Patented and licensed to many companies
27. Personalized Content Management
• Portfolio created through Search
• Unsupervised clustering (ARAM Pattern Channel A)
• Loop
– Personalization by users (ARAM Pattern Channel B)
– Reorganization of clusters (ARAM Pattern Channel A&B)
• Saving of personalized portfolio
• Tracking of new information
28. Personalization Functions
• Marking/labeling (selected) clusters
– Personal interpretation
• Inserting Clusters
– Indicate preference on groupings
• Merging clusters
– Indicate preferences on similarities
• Splitting clusters
– Indicate preferences on differences
• ...
29. Information Clustering
• A portfolio created
by a meta-search of
4 search engines
with a query on
“Text Mining”
31. Organizing New Information
42 new documents from
DirectHit, Netscape, and
BusinessWire, organized:
– Without the Personalized Portfolio
– Based on the Personalized Portfolio
32. Summary
• A fusion neural network algorithm, called fusion ART, has
been proposed for integrating clustering and categorization.
• Has been applied to competitive intelligence on the web.
• Compared with existing works, fusion ART has
advantages in
– Personalization: fusion ART performs analysis and organization
of data based on user preferences
– Low time complexity: fusion ART performs real-time search and
match of patterns, resulting in a linear time complexity
– Incremental clustering: fusion ART may adapt to a
dynamic web multimedia data set by incrementally clustering new
patterns based on the learnt cluster structure without referring to
the old data.
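To make the incremental, one-pass behaviour concrete, here is a minimal sketch of a single-channel fuzzy ART clusterer, the kind of building block that multi-channel ART models extend. It is a simplification written for illustration and is not the fusion ART implementation from the talk; the class name, parameter defaults, and example data are assumptions.

```python
def complement_code(x):
    """Complement coding: x in [0, 1]^d becomes (x, 1 - x)."""
    return list(x) + [1.0 - v for v in x]


def fuzzy_and(a, b):
    """Element-wise minimum of two vectors."""
    return [min(u, v) for u, v in zip(a, b)]


class FuzzyART:
    def __init__(self, vigilance=0.75, alpha=0.01, beta=1.0):
        self.rho, self.alpha, self.beta = vigilance, alpha, beta
        self.weights = []  # one weight vector per cluster

    def learn(self, x):
        """Present one pattern; return the index of the cluster it joins."""
        i = complement_code(x)
        norm_i = sum(i)
        # Rank clusters by the choice function T_j = |I ^ w_j| / (alpha + |w_j|).
        order = sorted(
            range(len(self.weights)),
            key=lambda j: sum(fuzzy_and(i, self.weights[j]))
            / (self.alpha + sum(self.weights[j])),
            reverse=True,
        )
        for j in order:
            match = sum(fuzzy_and(i, self.weights[j])) / norm_i
            if match >= self.rho:  # resonance: vigilance criterion met
                w = self.weights[j]
                overlap = fuzzy_and(i, w)
                self.weights[j] = [
                    self.beta * o + (1.0 - self.beta) * v
                    for o, v in zip(overlap, w)
                ]
                return j
        self.weights.append(i)  # no cluster resonates: create a new one
        return len(self.weights) - 1


art = FuzzyART(vigilance=0.8)
labels = [art.learn(p) for p in [[0.1, 0.1], [0.12, 0.1], [0.9, 0.9], [0.88, 0.92]]]

if __name__ == "__main__":
    print(labels)  # [0, 0, 1, 1]
```

Each new pattern is compared only against the current cluster weights, so earlier data never needs to be revisited, and the number of clusters grows only when the vigilance test fails for every existing cluster.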
33. Heterogeneous Data Co-clustering
for
Social Media Data
Theme Discovery and Mining
Lei Meng, Ah-Hwee Tan and Dong Xu
IEEE Transactions on Knowledge and Data Engineering, 2013
34. Introduction
• The popularity of social websites has led to a great
increase in web multimedia documents
– Massive number – Billions of images and articles online
– Diversity – Diverse content and booming emerging topics
– Multi-modal descriptors – images, text, category, tags,
comments
[Figure: an example image with its descriptors. Category: Birds. Keywords from surrounding text: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.]
35. Introduction
• Clustering of web multimedia data is challenging
– Scalability to big data
– Difficulty in integrating multi-modal feature data
– Ambiguity in deciding the number of categories
– Rich but noisy meta-information – semantic gap of images, noisy
tags
[Figure: two example images with noisy tags. Birds: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano. Beach: ocean, blue, sea, summer, vacation, sun, man, beach, water, yellow, fun, sand, play, funny, adult, humor, lifestyle, sunny, resort.]
36. Problem Statement
We define the theme discovery of web multimedia data
as a heterogeneous data co-clustering problem, which
identifies the semantic categories of data patterns
through the fusion and recognition of multiple types of
features.
[Figure: the keyword "Apple" with multiple descriptions (category, tag, user description, surrounding text) falling under different themes: Fruits, Products, Movies.]
37. Proposed Approach
• A self-organizing neural network approach to Heterogeneous
Data Co-clustering
– Based on Fusion Adaptive Resonance Theory (Fusion ART)
– Fuses an arbitrary number of feature modalities
– Adaptively tunes the weights for different feature modalities
– Two different learning functions, one for primary data, such as
images and articles, and one for meta-information, to handle short
and noisy text
– Incremental fast learning
– Does not need the number of clusters to be given
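One way to picture the fusion of several feature modalities is a weighted sum of per-channel similarities combined with a per-channel match test, in the style of multi-channel ART models. The sketch below uses fixed channel weights purely for illustration; it does not reproduce the adaptive weight tuning or the learning functions of the proposed approach, and the function names, weights, and vectors are assumptions.

```python
def fuzzy_and_norm(x, w):
    """|x ^ w|: sum of element-wise minima of two vectors."""
    return sum(min(a, b) for a, b in zip(x, w))


def choice(pattern, cluster, gammas, alpha=0.01):
    """Weighted sum of per-channel similarities between a pattern and a cluster.

    `pattern` and `cluster` are tuples holding one vector per channel
    (e.g. visual features, text features); `gammas` holds one weight per channel.
    """
    return sum(
        g * fuzzy_and_norm(x, w) / (alpha + sum(w))
        for g, x, w in zip(gammas, pattern, cluster)
    )


def resonates(pattern, cluster, vigilances):
    """Every channel must meet its own vigilance threshold.

    Assumes each channel's input vector has a non-zero sum.
    """
    return all(
        fuzzy_and_norm(x, w) / sum(x) >= rho
        for x, w, rho in zip(pattern, cluster, vigilances)
    )


# Hypothetical two-channel example: a visual vector and a tag vector.
p = ([0.2, 0.8], [1.0, 0.0, 1.0])
c = ([0.2, 0.7], [1.0, 0.0, 0.0])

if __name__ == "__main__":
    print(choice(p, c, gammas=(0.3, 0.7)))
    print(resonates(p, c, vigilances=(0.8, 0.4)))
```

Raising the weight of a channel makes that modality dominate the cluster ranking, which is why choosing the weights well matters when one modality (for example, noisy tags) is less reliable than another.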
38. Experiments
• NUS-WIDE data set
– 36784 images of 18 categories
– Visual features: Grid color moment, Edge direction histogram, and
wavelet texture
– Textual features of surrounding text: 1142 words (7 words per image on
average)
• 20 Newsgroups data set
– 12826 text documents of 10 categories
– Textual features of document content: over 60k words (800 words per
document on average)
– Textual features of category: 3 labels per document on average
39. Experiments on NUS-WIDE Data Set
• Evaluation on weight adaptation across channels for visual and
textual features
– Performance Comparison with fixed weight values
• GHF-ART with the adaptively tuned weight values γ_SA achieves the best
performance in 5 classes and the overall performance, and achieves close
performance with the best results obtained by fixed weight values
40. Experiments on NUS-WIDE Data Set
– Tracking of the change in weight values of γ_SA
• Textual features of surrounding text are assigned higher weights than visual
features
• The value of γ_SA stabilizes in [0.7, 0.8] with the increase of patterns
• Big fluctuations may result from the generation of new clusters
41. Experiments on NUS-WIDE Data Set
• Clustering performance comparison with existing algorithms in terms of
weighted average precision, cluster entropy (H_cluster), class entropy (H_class),
purity and Rand index (RI)
• GHF-ART achieves the best performance in terms of all the evaluation
measures
• With supervisory information, GHF-ART(SS) consistently obtains better
performance
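Two of the evaluation measures named above, purity and the Rand index, have standard textbook definitions that can be computed directly from cluster assignments and ground-truth classes. The sketch below follows those standard definitions; it is not the evaluation code used in the experiments, and the toy labels are made up for illustration.

```python
from collections import Counter
from itertools import combinations


def purity(clusters, classes):
    """Fraction of items that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, t in zip(clusters, classes):
        by_cluster.setdefault(c, Counter())[t] += 1
    return sum(max(cnt.values()) for cnt in by_cluster.values()) / len(clusters)


def rand_index(clusters, classes):
    """Fraction of item pairs on which the clustering and the classes agree.

    A pair agrees when it is grouped together in both or separated in both.
    Requires at least two items.
    """
    pairs = list(combinations(range(len(clusters)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (classes[i] == classes[j])
        for i, j in pairs
    )
    return agree / len(pairs)


# Toy example: six items, two clusters, two ground-truth classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]

if __name__ == "__main__":
    print(purity(clusters, classes))      # 5/6
    print(rand_index(clusters, classes))  # 10/15
```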
42. Experiments on NUS-WIDE Data Set
• Time complexity analysis
– GHF-ART and Fusion ART incur a very small increase in time cost
– For 23284 images, GHF-ART completes the clustering process in 10 seconds
43. Experiments on 20 Newsgroups Data Set
• Clustering performance comparison using document content
and category information
– Both GHF-ART and GHF-ART(SS) outperform other algorithms in all
the evaluation measures
– GHF-ART has a 5% gain over Fusion ART in terms of Average
Precision, Purity and Rand Index.
– Compared with other unsupervised algorithms, GHF-ART achieves
around 80% in Average Precision, Purity and Rand Index, while other
algorithms typically obtain less than 75%
44. Summary
• A heterogeneous data co-clustering algorithm, called GHF-ART,
is proposed to discover the themes of web multimedia data
via their rich but heterogeneous descriptors.
• Compared with existing works, GHF-ART has advantages in
– Strong noise immunity: a learning function for meta-information is
proposed to handle noise;
– Adaptive channel weighting: a well-defined weighting algorithm is
proposed to identify the important feature modalities for a better fusion of
multi-modal features in the overall similarity measure;
– Low time complexity: GHF-ART performs real-time search and match
of patterns, resulting in a linear time complexity for big data;
– Incremental clustering: GHF-ART may adapt to a dynamic
web multimedia data set by incrementally clustering new patterns based
on the learnt cluster structure without referring to the old data.
45. Research Centre of Excellence in
Active LIving for the elderLY (LILY)
Aging in Place:
Opportunities and Challenges
Ah-Hwee Tan
(http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University
JOINT UBC-NTU RESEARCH CENTRE
46. Aging in Place
“the ability to live in one's own home and community
safely, independently, and comfortably, regardless of
age, income, or ability level” - Center for Disease
Control, Dec 2011
47. Motivation
Global aging population creates silver challenges
Most adults would prefer to age in place
78 percent of adults between the ages of 50 and 64
report that they would prefer to stay in their current
residence as they age
Growing elderly population will be living
independently in own homes
Vital to transform future homes into intelligent
human-centered environment for the elderly
Golden opportunities for innovating assistive
technologies for aging in place
48. A Basic Scenario of Tender Care for Aging-in-place
[Figure: pipeline of Unobtrusive Sensing, Social Signal Processing, Context Aware Auto Tagging, and Social Cognitive Network]
Unobtrusive sensing device detects: the elder keeps walking around at an irregular
pace.
Social signal processing indicates: the elder has been silent for an unusually long
time.
Cognitive analysis result: "Your mother may be feeling anxious now…"
Reaction: "I need to call my mother now…"
50. Vision
To enable the elderly to maintain an active, healthy and
engaging lifestyle in their own homes supported by
an age-friendly intelligent environment, providing all-round
comprehensive tender care
Round-the-clock day-to-day health and wellness
monitoring
Cognitive support and recommendation of products
and services
Companionship and emotional support
Support for maintaining/stimulating social
interaction
51. Design Consideration and
Challenges
How to perform unobtrusive monitoring?
- Mobile sensing, activity tracking
How to provide all-around comprehensive care?
- Physical, cognitive, emotional, social, sustainability
How to maintain ubiquitous access and
interaction?
- Cross platform, multimedia, multimodal
How to provide friendly, personal touch?
- Adaptive user modeling, mood detection
- Proactive, natural interaction
52. Approach and Methodology
To support active living of the elderly
through an intelligent multi-agent environment
with ubiquitous access, natural interface, and all-rounded
comprehensive care
Key Technologies
Unobtrusive sensing and social signal processing
Activity pattern and user modeling
Information and service recommendation
Proactive stimulation and natural interaction
53. A Multi-Agent Collaborative
Care Environment
Isabel
(Personal Nurse)
Small talk
Recommendations
for healthcare
products and services
Alfred
(The Butler)
Small talk
User modeling
Social and travel
advisory
Frank
(Robot Dog)
Activity sensing
Pattern modeling
54. Why Multi-Agent?
Unobtrusive sensing and monitoring – agents
of different characteristics and capabilities
Ubiquitous access to information and
services – agents in different platforms and
locations
Comprehensive tender care – agents with
different domain knowledge and functions
“Three’s a party” – more opportunities for
cognitive stimulation and social interaction
55. Comprehensive Tender Care
Physical Support – activity tracking, safety and
wellness monitoring
Cognitive Support – information and
recommendation on (healthcare) products, services,
skills and activities
Emotional Support – mood detection, affective
support, small talk
Social Support – companionship and connection
to family and friends (old and new) through SMS,
email, Facebook, etc.
56. Unobtrusive Sensing and
Ubiquitous Access to Services
Unobtrusive in-home real-time data collection
and contextual social signal processing
- Essential to better understand and cater to the
elderly’s needs.
Sensing – bio sensing, motion sensors,
wearable/mobile sensors for health monitoring and
activity tracking
Cross Platform – large screen interactive display,
mobile handheld devices, physical robots
Multimedia – text, audio, video
57. Adaptive User Modelling
Identity and profile
Interests and preferences
Behaviour model: time, space, activity
Knowledge and skills
Social network: family and friends
Methods for Model Building
Explicit: user specification
Implicit: user actions, choices, conversation
58. Cognitive Support:
Product/Service Recommendation
Domain knowledge:
Healthcare, Travel, Cooking
Delivery modes:
- Question & Answer
- Proactive recommendation
- Conversation
Personal Touch:
Personalized, context sensitive, small talk
59. Challenges in
Big Living Analytics
Volume – huge amount of data through bio
sensing, motion sensors, wearable/mobile sensors
for health monitoring and activity tracking
Velocity – 24x7 real time sensing, sense making,
decision making, service recommendation
Variety – information integration and knowledge
sharing from cross platform, multimedia
unstructured data - text, audio, video, gestures
60. Research Centre of Excellence in
Active LIving for the elderLY (LILY)
Thank you!
JOINT UBC-NTU RESEARCH CENTRE