Computational intelligence for big data analytics bda 2013

Big Data Analytics: Challenges and
What Computational Intelligence
Techniques May Offer
Ah-Hwee Tan
(http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University
Big Data Analytics Symposium
London, UK
13 September 2013

Outline
 Big Data Analytics
 Computational Intelligence Techniques
 Web Data Analytics
– Flexible Organizer for Competitive Intelligence (FOCI)
– Web Information Fusion and Associative Discovery
 Analytics for Active Living for Elderly

The Era of Big Data
Big data refers to a collection of data sets so large and complex
that they exceed the capacity of commonly used IT systems
in terms of processing space and/or time.

Sources of Big Data
• Traditionally, big data was mostly produced in scientific fields such as
astronomy, meteorology, genomics, physics, biology, and
environmental research.
• With the rapid development of IT and the consequent decrease
in the cost of collecting and storing data, big data is now generated
in almost every industry and sector, as well as in government departments,
including retail, finance, banking, security, audit, electric
power, and healthcare.
• Recently, big data has also come from the Web (big Web data for short),
which includes all the context data, such as user-generated
content, browser/search log data, deep web data, etc.

Examples of Big Data
(Source: Wikipedia)
• Walmart handles more than 1 million customer transactions
every hour, which are imported into databases estimated to
contain more than 2.5 petabytes (2560 terabytes) of data,
the equivalent of 167 times the information contained in all the books in
the US Library of Congress.

• Facebook handles 50 billion photos from its user base.
• FICO Falcon Credit Card Fraud Detection System protects
2.1 billion active accounts world-wide.
• Windermere Real Estate uses anonymous GPS signals from
nearly 100 million drivers to help new home buyers
determine their typical drive times to and from work
throughout various times of the day.

Examples of Big Data
(Source: Wikipedia)
• NASA Center for Climate Simulation
(NCCS) stores 32 petabytes of
climate observations and simulations
on the Discover supercomputing
cluster.
• Utah Data Center is a data center
currently being constructed by the
United States National Security
Agency. When finished, the facility
will handle yottabytes of information
collected by NSA over the Internet.

Value     Metric
1000      kB    kilobyte
1000^2    MB    megabyte
1000^3    GB    gigabyte
1000^4    TB    terabyte
1000^5    PB    petabyte
1000^6    EB    exabyte
1000^7    ZB    zettabyte
1000^8    YB    yottabyte
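
The unit table can be turned into a small conversion helper. The sketch below is illustrative only: the names UNITS, to_bytes, and convert are assumptions, not part of the talk. It uses the decimal multiples listed in the table, and also shows that the figure of 2560 terabytes quoted for Walmart's 2.5 petabytes corresponds to a binary multiple of 1024 rather than 1000.

```python
# Decimal (SI) byte units from the table: each step up is a factor of 1000.
UNITS = ["kB", "MB", "GB", "TB", "PB", "EB", "ZB", "YB"]

def to_bytes(value, unit):
    """Convert a quantity in the given unit to bytes (decimal multiples)."""
    power = UNITS.index(unit) + 1   # kB -> 1000**1, MB -> 1000**2, ...
    return value * 1000 ** power

def convert(value, src, dst):
    """Convert a quantity between two units of the table."""
    return to_bytes(value, src) / to_bytes(1, dst)

decimal_tb = convert(2.5, "PB", "TB")   # 2.5 PB expressed in TB, factor 1000
binary_tb = 2.5 * 1024                  # the same figure with a factor of 1024
print(decimal_tb, binary_tb)
```

With decimal multiples 2.5 PB is 2500 TB; the 2560 TB figure follows only when one petabyte is counted as 1024 terabytes.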

Money of Big Data
(Source: Wikipedia)
• "Big data" has increased the demand for information
management specialists.
• Software AG, Oracle Corporation, IBM, Microsoft,
SAP, EMC, and HP have spent more than $15 billion
on software firms specializing in data management
and analytics.
• In 2010, this industry on its own was worth more than
$100 billion and was growing at almost 10 percent a
year: about twice as fast as the software business as
a whole.

Market of Big Data
(Source: Wikipedia)
• Developed economies make increasing use of data-intensive
technologies. There are 4.6 billion mobile-phone subscriptions
worldwide and there are between
1 billion and 2 billion people accessing the internet.
• The world's effective capacity to exchange information
through telecommunication networks was 281
petabytes in 1986, 471 petabytes in 1993, 2.2 exabytes
in 2000, and 65 exabytes in 2007, and it is predicted that
the amount of traffic flowing over the internet will reach
667 exabytes annually by 2013.

Big Data Market Segments
(Report by Transparency Market Research)
• Segmentation of the big data market by components, by
applications, and by geography.
• The different components included are software and
services, hardware, and storage.
• The software and services segment dominates the components
market, whereas the storage segment will be the fastest
growing segment for the next 5 years owing to the
perpetual growth in the data generated.

Big Data Market Segment by
Applications
• The application segment covers eight applications: financial services,
manufacturing, healthcare, telecommunication,
government, retail, media and entertainment, and others.
• Financial services, healthcare, and the government sector
are the top three contributors to the big data market and
together held more than 55% of the big data market in
2012.
• The media and entertainment and the healthcare sectors will
grow at a high CAGR of nearly 42% from 2012 to 2018. The
growth in data in the form of video, images, and games is
driving the media and entertainment segment.
Read more: http://www.digitaljournal.com/pr/1395146#ixzz2b0hvuxrQ

Challenges of Big Data
• Volume
– Size in the order of petabytes,
exabytes, …

• Velocity
– Time-sensitive data, data that
grows exponentially or even at
rates that outpace the well-known Moore's Law

• Variety
– From structured data to semi-structured and
completely unstructured data of different types, such as
text, image, audio, video, click streams, and log files

Deeper Issues of Big Data
(The additional 3Vs)
• Validity
– Is the data correct and accurate for the intended
usage?

• Veracity
– Are the results meaningful for the given problem
space?

• Volatility
– How long do you need to look/store this data?

Computational Intelligence

• Neural Networks (IJCNN)
– Brain-like mathematical models for pattern
recognition, memory, and association discovery
– Examples: Perceptron, BP, SVM, SOM, ART, …

• Fuzzy Systems (IEEE-FUZZ)
– Fuzzy operators for handling non-discrete reasoning
– Examples: FNN, Fuzzy C-Means, …


• Evolutionary Computing (CEC)
– Classes of heuristic algorithms that repeatedly
search for good solutions by mimicking
the process of natural evolution
– Commonly used for optimization and
search problems
– Examples: Genetic Algo, Memetic Algo, …

Flagship Events of
• World Congress on Computational Intelligence
(Australia 2012, Beijing 2014)
• IEEE Symposium on Computational Intelligence
(Singapore 2013, Florida, USA 2014)
• IEEE Symposium on Computational Intelligence
in Big Data (IEEE CIBD'2014)

Examples of Use of CI in Big Data
• Data size and feature space adaptation
• Uncertainty modeling in learning from big data
• Distributed learning techniques in uncertain environment
• Uncertainty in cloud computing
• Distributed parallel computation
• Feature selection/extraction in big data
• Sample selection based on uncertainty
• Incremental learning
• Manifold learning on big data
• Uncertainty techniques in big data classification/clustering
• Imbalance learning on big data
• Active learning on big data
• Random weight networks on big data
• Transfer learning on big data

Self-Organizing Neural
Networks for
Personalized Web Intelligence

Towards Personalized Web Intelligence
Ah-Hwee Tan, Hwee-Leng Ong,
Hong Pan, Jamie Ng, Qiu-Xiang Li
Knowledge and Information Systems 18 (2004) 297-306

Workflow for Web Data Analytics
• Search
– Getting the information

• Organize
(clustering/categorizing)
– Putting things in perspective

• Analyze (data mining)
– Discover hidden knowledge

• Share (knowledge management)
– Saving for reference and sharing

• Track
– Constant monitoring

Approaches to
Organizing/Analyzing
• Clustering
– Organizing information into groups based on
similarity functions and thresholds
– e.g. BullsEye, NorthernLight, Vivisimo

• Categorization
– Organizing information into a “predefined” set of
classes
– e.g. Yahoo!, Autonomy Knowledge Server

• Which is better?

Clustering
• Pros
– Unsupervised/self-organizing; requires no training
or predefinition of classes
– Able to identify new themes

• Cons
– Users have no control
– Ever changing cluster structure
– Difficult to navigate and track

Categorization
• Pros
– Good control on classes
– Every piece of information is assigned to one or more classes
of interest

• Cons
– Requires learning (supervised) and/or
definition of classification rules/knowledge
– Every piece of information has to be assigned to one or
more classes
– Good control but lacks flexibility to handle
new information

User-configurable Clustering
(Tan & Pan, PAKDD 2002)

• Information organization and content
management
• Online incremental clustering + user-defined
structure (preferences)
• Reduces to a clustering system if no user
indication given
• Allows personalization in a direct,
intuitive, and interactive manner
• Control + flexibility

ARAM for Personalized
Information Management
[Figure: ARAM architecture. A category field F2 (Information Clusters) is connected to two input fields, F1a receiving the Information Vector A and F1b receiving the Preference Vector B.]

Flexible Organizer for Competitive
Intelligence (FOCI)
• A platform for gathering, organizing,
tracking, analyzing, and sharing
competitive information
• Natural way of turning raw search results
into personalized CI portfolios
– Multilingual enabled
– with Multilingual Efficient Analyzer
– Domain localization (Technology)

• Patented and licensed to many companies

FOCI Architecture
[Figure: FOCI architecture. Components shown: Intranet/Internet, Content Gathering, Content Management, Content Publishing, Content Analysis, Domain-Specific Knowledge, User’s CI Portfolio, and a Visualization Front End.]

Personalized Content Management
• Portfolio created through Search
• Unsupervised clustering (ARAM Pattern Channel A)
• Loop
– Personalization by users (ARAM Pattern Channel B)
– Reorganization of clusters (ARAM Pattern Channel A&B)

• Saving of personalized portfolio
• Tracking of new information

Personalization Functions
• Marking/labeling (selected) clusters
– Personal interpretation

• Inserting Clusters
– Indicate preference on groupings

• Merging clusters
– Indicate preferences on similarities

• Splitting clusters
– Indicate preferences on differences

• ...

Information Clustering

• A portfolio created
by a meta-search of
4 search engines
with a query on
“Text Mining”

A Personalized Portfolio
after at most 19 personalization operations
(mainly labeling and creating clusters)

Organizing New Information
42 new documents from
DirectHit, Netscape, and
BusinessWire
[Figure: two panels comparing the organization of the new documents without the personalized portfolio and based on the personalized portfolio.]

Summary
• A fusion neural network algorithm, called fusion ART, has
been proposed for integrating clustering and categorization
• Has been applied to competitive intelligence on the web.
• Compared with existing works, fusion ART has
advantages in
– Personalization: fusion ART performs analysis and organization
of data based on user preferences
– Low time complexity: fusion ART performs real-time search and
match of patterns, resulting in a linear time complexity
– Incremental clustering: fusion ART may adapt to
dynamic web multimedia data sets by incrementally clustering new
patterns based on the learnt cluster structure without referring to
the old data.
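
The incremental behaviour summarised above can be illustrated with a minimal single-channel fuzzy ART sketch. It assumes the standard fuzzy ART operations (complement coding, choice function, vigilance test, and learning rule); the class name FuzzyART and the parameter values are illustrative choices, and this is not the authors' fusion ART implementation, which handles several pattern channels.

```python
import numpy as np

class FuzzyART:
    """Minimal single-channel fuzzy ART clusterer (illustrative sketch)."""

    def __init__(self, rho=0.75, alpha=0.01, beta=1.0):
        self.rho = rho      # vigilance: minimum match required to join a cluster
        self.alpha = alpha  # choice parameter
        self.beta = beta    # learning rate (1.0 means fast learning)
        self.weights = []   # one weight vector per cluster

    def learn(self, x):
        """Present one pattern with values in [0, 1]; return its cluster index."""
        x = np.asarray(x, dtype=float)
        i = np.concatenate([x, 1.0 - x])  # complement coding
        # Choice value of every existing cluster for this input.
        scores = [np.minimum(i, w).sum() / (self.alpha + w.sum())
                  for w in self.weights]
        # Try clusters from the highest choice value downwards.
        for j in np.argsort(scores)[::-1]:
            w = self.weights[j]
            match = np.minimum(i, w).sum() / i.sum()
            if match >= self.rho:  # resonance: update this cluster only
                self.weights[j] = self.beta * np.minimum(i, w) + (1 - self.beta) * w
                return int(j)
        # No cluster passes the vigilance test: create a new one.
        self.weights.append(i.copy())
        return len(self.weights) - 1

art = FuzzyART(rho=0.75)
stream = [[0.10, 0.10], [0.12, 0.09], [0.90, 0.85], [0.88, 0.90]]
labels = [art.learn(x) for x in stream]
print(labels, len(art.weights))
```

Each pattern is compared once against the current clusters and then discarded, so old data is never revisited and the number of clusters is not fixed in advance.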

Heterogeneous Data Co-clustering
for
Social Media Data
Theme Discovery and Mining

Lei Meng, Ah-Hwee Tan and Dong Xu
IEEE Transactions on Knowledge and Data Engineering, 2013

Introduction
• The popularity of social websites has led to a great
increase in web multimedia documents
– Massive number – Billions of images and articles online
– Diversity – Diverse content and booming emerging topics
– Multi-modal descriptors – images, text, category, tags,
comments
[Figure: example image with its descriptors. Category: Birds. Keywords from surrounding text: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.]

Introduction
• Clustering of web multimedia data is challenging
– Scalability to big data
– Difficulty in integrating multi-modal feature data
– Ambiguity in deciding the number of categories
– Rich but noisy meta-information – semantic gap of images, noisy
tags

[Figure: two example images with their tags.
Birds: wild, bird, beach, tree, vacation, animal, mar, sunny, playa, nayarit, arena, ave, water, vacaciones, hollyday, pelicano.
Beach: ocean, blue, sea, summer, vacation, sun, man, beach, water, yellow, fun, sand, play, funny, adult, humor, lifestyle, sunny, resort.]

Problem Statement
We define the theme discovery of web multimedia data
as a heterogeneous data co-clustering problem, which
identifies the semantic categories of data patterns
through the fusion and recognition of multiple types of
features.
[Figure: "Apple" examples under the categories Fruits, Products, Movies, ……, with multiple descriptions: category, tag, user description, and surrounding text.]

Proposed Approach
• A self-organizing neural network approach to Heterogeneous
Data Co-clustering
 Based on Fusion Adaptive Resonance Theory (Fusion ART)
 Fuses an arbitrary number of feature modalities
 Adaptively tunes the weights for different feature modalities
 Two different learning functions: one for primary data, such as
images and articles, and one for meta-information, to handle short
and noisy text
 Incremental fast learning
 Does not need the number of clusters to be given
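
How several feature modalities can be fused with per-channel weights is sketched below. The sketch assumes the commonly published fusion ART formulation, in which the choice value of a cluster is a weighted sum of per-channel fuzzy ART choice values and resonance requires every channel to pass its own vigilance test. The function names, the example vectors, and the fixed weights are illustrative; the adaptive weight tuning and the meta-information learning function of GHF-ART are not reproduced here.

```python
import numpy as np

def choice(x_channels, w_channels, gammas, alphas):
    """Weighted sum of per-channel fuzzy ART choice values for one cluster."""
    return sum(
        g * np.minimum(x, w).sum() / (a + w.sum())
        for x, w, g, a in zip(x_channels, w_channels, gammas, alphas)
    )

def resonates(x_channels, w_channels, rhos):
    """True only if every channel meets its own vigilance criterion."""
    return all(
        np.minimum(x, w).sum() / x.sum() >= r
        for x, w, r in zip(x_channels, w_channels, rhos)
    )

# Hypothetical pattern with a visual channel and a textual (tag) channel.
x = [np.array([0.2, 0.8]), np.array([1.0, 0.0, 1.0])]
# Hypothetical weight vectors of one existing cluster, one per channel.
w = [np.array([0.2, 0.6]), np.array([1.0, 0.0, 0.0])]

score = choice(x, w, gammas=[0.3, 0.7], alphas=[0.01, 0.01])
ok = resonates(x, w, rhos=[0.5, 0.5])
print(round(score, 3), ok)
```

Raising the weight of a channel makes that modality dominate the choice of the winning cluster, which is the role the adaptively tuned weights play in the approach described above.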

Experiments
• NUS-WIDE data set
– 36784 images of 18 categories
– Visual features: Grid color moment, Edge direction histogram, and
wavelet texture
– Textual features of surrounding text: 1142 words (7 words per image on
average)

• 20 Newsgroups data set
– 12826 text documents of 10 categories
– Textual features of document content: over 60k words (800 words per
document on average)
– Textual features of category: 3 labels per document on average

Experiments on NUS-WIDE Data Set
• Evaluation on weight adaptation across channels for visual and
textual features
– Performance Comparison with fixed weight values

• GHF-ART with the adaptively tuned weight values γ_SA achieves the best
performance in 5 classes and overall, and performs close to
the best results obtained with fixed weight values

– Tracking of the change in weight values of γ_SA

• Textual features of surrounding text are assigned higher weights than visual
features
• The value of γ_SA stabilizes in [0.7, 0.8] with the increase of patterns
• Big fluctuations may result from the generation of new clusters

• Clustering performance comparison with existing algorithms in terms of
weighted average precision, cluster entropy (H_cluster), class entropy (H_class),
purity and Rand index (RI)

• GHF-ART achieves the best performance in terms of all the evaluation
measures
• With supervisory information, GHF-ART(SS) consistently obtains better
performance
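
Two of the evaluation measures named above, purity and the Rand index, can be computed as in the following sketch. It uses the usual textbook definitions and made-up labels; the function names and the toy data are illustrative and are not taken from the reported experiments.

```python
from collections import Counter
from itertools import combinations

def purity(clusters, classes):
    """Fraction of patterns that belong to the majority class of their cluster."""
    by_cluster = {}
    for c, y in zip(clusters, classes):
        by_cluster.setdefault(c, []).append(y)
    majority = sum(Counter(ys).most_common(1)[0][1] for ys in by_cluster.values())
    return majority / len(clusters)

def rand_index(clusters, classes):
    """Fraction of pattern pairs on which clustering and class labels agree."""
    pairs = list(combinations(range(len(clusters)), 2))
    agree = sum(
        (clusters[i] == clusters[j]) == (classes[i] == classes[j])
        for i, j in pairs
    )
    return agree / len(pairs)

# Toy example: six patterns, two clusters, two ground-truth classes.
clusters = [0, 0, 0, 1, 1, 1]
classes = ["a", "a", "b", "b", "b", "b"]
p = purity(clusters, classes)
ri = rand_index(clusters, classes)
print(round(p, 4), round(ri, 4))
```

The pairwise loop is quadratic in the number of patterns, which is fine for a small illustration but would need a contingency-table formulation for data sets of the size discussed here.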

• Time complexity analysis

– GHF-ART and Fusion ART incur a very small increase in time cost
– For 23284 images, GHF-ART completes the clustering process in 10 seconds

Experiments on 20 Newsgroups Data Set
• Clustering performance comparison using document content
and category information

– Both GHF-ART and GHF-ART(SS) outperform other algorithms in all
the evaluation measures
– GHF-ART has a 5% gain over Fusion ART in terms of Average
Precision, Purity and Rand Index.
– Compared with other unsupervised algorithms, GHF-ART achieves
around 80% in Average Precision, Purity and Rand Index while other
algorithms typically obtain less than 75%

Summary
• A heterogeneous data co-clustering algorithm, called GHF-ART,
is proposed to discover the themes of web multimedia data
via their rich but heterogeneous descriptors.
• Compared with existing works, GHF-ART has advantages in
– Strong noise immunity: a learning function for meta-information is
proposed to handle noise;
– Adaptive channel weighting: a well-defined weighting algorithm is
proposed to identify the important feature modalities for a better fusion of
multi-modal features for overall similarity measure;
– Low time complexity: GHF-ART performs real-time search and match
of patterns, resulting in a linear time complexity for big data;
– Incremental clustering: GHF-ART may adapt to dynamic
web multimedia data sets by incrementally clustering new patterns based
on the learnt cluster structure without referring to the old data.

Research Centre of Excellence in
Active LIving for the elderLY (LILY)

Aging in Place:
Opportunities and Challenges
Ah-Hwee Tan
(http://www.ntu.edu.sg/home/asahtan)
School of Computer Engineering
Nanyang Technological University

JOINT UBC-NTU RESEARCH CENTRE

Aging in Place
“the ability to live in one's own home and community
safely, independently, and comfortably, regardless of
age, income, or ability level” - Center for Disease
Control, Dec 2011

Motivation
 Global aging population creates silver challenges
 Most adults would prefer to age in place
 78 percent of adults between the ages of 50 and 64
report that they would prefer to stay in their current
residence as they age

 A growing elderly population will be living
independently in their own homes
 Vital to transform future homes into intelligent
human-centered environments for the elderly
 Golden opportunities for innovating assistive
technologies for aging in place

A Basic Scenario of Tender Care for Aging-in-place
 Unobtrusive Sensing
 Social Signal Processing
 Context Aware Auto Tagging
 Social Cognitive Network

Unobtrusive sensing device detects: the elder keeps walking around at an irregular
pace.
Social signal processing indicates: the elder has been silent for an unusually long
time.

Cognitive Analysis result… “Your mother may be feeling anxious now…”
“I need to call my mother now…”

Vision
To enable the elderly to maintain an active, healthy and
engaging lifestyle in their own homes, supported by
an age-friendly intelligent environment providing
all-round comprehensive tender care
 Round-the-clock day-to-day health and wellness
monitoring
 Cognitive support and recommendation of products
and services
 Companionship and emotional support
 Support for maintaining/stimulating social
interaction

Design Consideration and
Challenges
 How to perform unobtrusive monitoring?
- Mobile sensing, activity tracking
 How to provide all-around comprehensive care?
- Physical, cognitive, emotional, social, sustainability
 How to maintain ubiquitous access and
interaction?
- Cross platform, multimedia, multimodal
 How to provide friendly, personal touch?
- Adaptive user modeling, mood detection
- Proactive, natural interaction

Approach and Methodology
To support active living of the elderly
through an intelligent multi-agent environment
with ubiquitous access, natural interface, and
all-rounded comprehensive care
Key Technologies
 Unobtrusive sensing and social signal processing
 Activity pattern and user modeling
 Information and service recommendation
 Proactive stimulation and natural interaction

A Multi-Agent Collaborative
Care Environment
Isabel (Personal Nurse)
– Small talk
– Recommendations for healthcare products and services

Alfred (The Butler)
– Small talk
– User modeling
– Social and travel advisory

Frank (Robot Dog)
– Activity sensing
– Pattern modeling

Why Multi-Agent?
 Unobtrusive sensing and monitoring – agents
of different characteristics and capabilities

 Ubiquitous access to information and
services – agents in different platforms and
locations

 Comprehensive tender care – agents with
different domain knowledge and functions

 “Three’s a party” – more opportunities for
cognitive stimulation and social interaction

Comprehensive Tender Care
 Physical Support – Activity tracking, safety and
wellness monitoring

 Cognitive Support – information and
recommendation on (healthcare) products, services,
skills and activities

 Emotional Support – mood detection, affective
support, small talk

 Social Support – companionship and connection
to family and friends (old and new) through SMS,
email, Facebook, etc.

Unobtrusive Sensing and
Ubiquitous Access to Services
Unobtrusive in-home real-time data collection
and contextual social signal processing
- Essential to better understand and cater to the
elderly’s needs.

 Sensing – bio sensing, motion sensors,
wearable/mobile sensors for health monitoring and
activity tracking

 Cross Platform – Large screen interactive display,
mobile handheld devices, physical robots

 Multimedia – text, audio, video

Adaptive User Modelling
 Identity and profile
 Interests and preferences
 Behaviour model: Time, space, activity
 Knowledge and skills
 Social network: Family and friends
Methods for Model Building
 Explicit: User specification
 Implicit: User actions, choices, conversation

Cognitive Support:
Product/Service Recommendation
 Domain knowledge:
Healthcare, Travel, Cooking

 Delivery modes:
- Question & Answer
- Proactive recommendation
- Conversation

 Personal Touch:
Personalized, context sensitive, small talk

Challenges in
Big Living Analytics
 Volume – huge amount of data through bio
sensing, motion sensors, wearable/mobile sensors
for health monitoring and activity tracking

 Velocity – 24x7 real time sensing, sense making,
decision making, service recommendation

 Variety – information integration and knowledge
sharing from cross platform, multimedia
unstructured data - text, audio, video, gestures

Research Centre of Excellence in
Active LIving for the elderLY (LILY)

Thank you!
JOINT UBC-NTU RESEARCH CENTRE
