Given its ability to analyze structured, unstructured, and "multi-structured" data, Hadoop is an increasingly viable option for analytics and business intelligence within the enterprise. Dramatically more scalable and cost-effective than traditional data warehousing technologies, Hadoop is also increasingly used to perform new kinds of analytics that were previously impossible. When it comes to Big Data, retailers are at the forefront of leveraging large volumes of nuanced information about customers, to improve the effectiveness of promotional campaigns, refine pricing models, and lower overall customer acquisition costs. Retailers compete fiercely for consumers' attention, time, and money, and effective use of analytics can result in sustained competitive advantage. Forward-thinking retailers can now take advantage of all data sources to construct a complete picture of a customer. This invariably consists of both structured data (customer and inventory records, spreadsheets, etc.) and unstructured data (clickstream logs, email archives, customer feedback and comment fields, etc.). This allows, for example, online retailers with structured, transactional sales data to connect that data with unstructured comments from product reviews, providing insight into how reviews affect consumers' propensity to purchase a particular product. This session will examine several real-world customer use cases applying combined analysis of structured and unstructured data.
1. Applying Big Data Analytics.
Analyzing Multi-Structured Data with Hadoop
Justin Borgman
CEO & Co-Founder
2. Company Profile
• 30 people, based in Cambridge, MA
• Founded in July 2010
• Raised $9.5M Series A from Bessemer and Norwest
• Based on the HadoopDB research project in the Yale Computer Science Department by Daniel Abadi et al.
• CEO & Co-Founder
• Previously spent 7 years as a software developer at MIT Lincoln Laboratory and product manager at startup Covectra
• Undergrad: UMass Amherst
• Grad: Yale University
4. Big Data in the Headlines
“How Target Figured Out A Teen Girl Was Pregnant Before Her Father Did”
“Digital universe” grew by 62% last year to 800K petabytes & will grow to 1.2 zettabytes this year
“Why Netflix produces BBC remake starring Kevin Spacey, directed by David Fincher”
6. Example: Big Data Analysis Process
[Diagram: Raw Data, Hadoop, MPP DBMS, and Business Analyst, connected by steps labeled extract, load, and predict]
• Raw data: web access logs, click logs, impressions, email, tweets, sensor data, documents
• Processing: term extraction, entity extraction, sentiment analysis, geocoding, cleanse, sessionization, join, aggregate, sample, filter
• Consumers: applications, BI tools, predictive analytics
7. Example: Hadapt Analysis Process
[Diagram: Raw Data, with steps labeled load and predict, feeding applications, BI tools, and predictive analytics]
8. The Evolution of Analytics: Where are we today?
The early stages of analytics
• Market Basket Analysis
• Trend Analysis
• Cyclical Analysis
• Customer Segmentation
New Analytical Models
• Pattern Detection, Discovery, Matching
• A/B Testing and Behavioral Analysis
• Sessionization
• Social Correlation Analysis
• Fractional Attribution
• Sentiment Analysis
• Personalization
9. Big Data in Action
• Amazon and Netflix engage in arbitrage on video content based on customer behavior
• Harvard predicts the spread of cholera in Haiti, and Derwent Capital out-trades the market based on tweets and their sentiment
• Entire ecosystems were shotgun gene sequenced by Celera
• Life events are predicted by Target and marketed accordingly
• *Osco Drug increased sales by optimizing product placement, e.g. beer and diapers
• Ads are optimally placed and priced *for you* by DataXu in real time
• Next Big Sound predicts new artists and hits based on signals from social media
• Real-time production optimization saves Chevron over $1B/year
• Retailer web sites are re-organized and re-optimized for content by Bloomreach
• LinkedIn suggests who you might know, eHarmony suggests who you might love
11. Example: e-Tailer
Business Opportunity
• Should I run a promotion among the Lady Gaga fans or Justin Bieber fans?
• Based on shopping cart and browsing/purchase history, what other products should be recommended before the customer checks out?
• Which items are often purchased together, and is there any correlation with shopping date/time, customer age, gender, etc.?
Challenges
• Diverse data sources
• In-depth analytics (e.g. predictive modeling)
• Real-time performance at scale
Solution
– Integrate Hadoop with RDBMS
– Develop and integrate analytic libraries
– Make analytic jobs interactive (not batch oriented)
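The "which items are often purchased together" question can be illustrated with a simple pair co-occurrence count. This is only a minimal sketch: the function name `pair_counts` and the sample baskets are hypothetical, invented for illustration, and it does not represent how Hadapt or any retailer implements market basket analysis.

```python
from collections import Counter
from itertools import combinations


def pair_counts(baskets):
    """Count how many baskets contain each unordered pair of items."""
    counts = Counter()
    for basket in baskets:
        # Deduplicate and sort so (a, b) and (b, a) share one key.
        for pair in combinations(sorted(set(basket)), 2):
            counts[pair] += 1
    return counts


# Hypothetical shopping carts, invented for illustration only.
baskets = [
    ["beer", "diapers", "chips"],
    ["beer", "diapers"],
    ["diapers", "wipes"],
    ["beer", "chips"],
]

counts = pair_counts(baskets)
for pair, n in counts.most_common(3):
    print(pair, n)
```

At real scale the same counting would run as a distributed aggregation rather than in one process; joining the counts with date/time or customer attributes would address the correlation part of the question.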
12. Example: Customer Behavior Analysis
Business Opportunity
• Analyze customer behavior to increase loyalty and trust, allocate advertising spend, optimize product incentives, identify fraud, micro-segment customer base.
Golden Path Analysis: Comparative Performance
• ETL + RDBMS & SQL = 200 minutes
• Hadoop + RDBMS = 135 minutes
• Hadapt = 11 minutes
Challenges
• Full website session-level data needed, typically from raw web logs
• Requires complex multi-pass SQL queries or new Non-SQL techniques
• Requires rewriting query to change number of clicks analyzed
Example Analytic Questions
• Which life events are strong opportunities for me to better engage my customers?
• When am I about to lose a customer?
• What are my top segments?
• Which ad campaigns produced the most lift?
• What products can I bundle to increase sales?
• Are my online offers cannibalizing my in-store sales?
• What models are my customers following so I can better predict their next move?
Hadapt Value
• Performance: Single pass over data regardless of number of clicks analyzed
• Ease of Dev & Ease of Manageability: Much simpler code
• Ease of Use: Pattern flexibility to handle varied numbers of clicks and click patterns without requiring any code rewrite
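Session-level data of the kind this slide calls for is usually derived from raw click logs by sessionization. The sketch below shows one common approach, assuming a 30-minute inactivity timeout and a hypothetical (user, timestamp, url) record layout; neither assumption comes from the slides, and this is not Hadapt's implementation.

```python
from collections import defaultdict

SESSION_GAP_SECONDS = 30 * 60  # assumed inactivity timeout, not from the slides


def sessionize(clicks, gap=SESSION_GAP_SECONDS):
    """Group (user, timestamp, url) clicks into per-user sessions.

    A new session starts when the time since the user's previous
    click exceeds `gap` seconds.
    """
    sessions = defaultdict(list)  # user -> list of sessions (lists of URLs)
    last_seen = {}                # user -> timestamp of previous click
    for user, ts, url in sorted(clicks, key=lambda c: (c[0], c[1])):
        if user not in last_seen or ts - last_seen[user] > gap:
            sessions[user].append([])
        sessions[user][-1].append(url)
        last_seen[user] = ts
    return dict(sessions)


# Hypothetical click log, invented for illustration only.
clicks = [
    ("u1", 0, "/home"),
    ("u1", 60, "/product/42"),
    ("u1", 4000, "/home"),
    ("u2", 10, "/home"),
    ("u1", 4100, "/checkout"),
]
sessions = sessionize(clicks)
print(sessions)
```

Once clicks are sorted by user and time, the grouping itself is a single pass, and the click paths inside each session are the input to path-style analyses such as the golden path comparison above.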
13. Example: Social Media Analysis
Business Opportunity
• Identify influencers based not only on # of followers and re-tweets, but also messaging content and sentiment in reply/re-tweets
• Aggregate individual sentiments by incorporating tweet authors’ influence scores
• What phrases or product defects do customers often mention before they attrite?
Challenges
• Ingest and analyze high speed incoming events
• High quality sentiment output (NLP + Big Data)
• Insights generated across data sets
Solution
– Enhance Hadoop with better interactivity
– Integrate NLP packages to Big Data platform
– Ingest, analyze, and store all datasets in one platform
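One simple way to read "aggregate individual sentiments by incorporating tweet authors' influence scores" is an influence-weighted mean. The sketch below assumes sentiment scores in [-1, 1] and non-negative influence weights; the function name and both scales are illustrative assumptions, since the slides do not specify a formula.

```python
def weighted_sentiment(tweets):
    """Influence-weighted mean of per-tweet sentiment scores.

    Each tweet is a (sentiment, influence) pair. Sentiment is assumed
    to lie in [-1, 1] and influence to be a non-negative weight.
    """
    total_weight = sum(influence for _, influence in tweets)
    if total_weight == 0:
        return 0.0  # no tweets, or no influential authors: neutral
    return sum(s * w for s, w in tweets) / total_weight


# Hypothetical scores: one positive tweet from a low-influence author,
# one negative tweet from an author with three times the influence.
tweets = [(1.0, 1.0), (-1.0, 3.0)]
print(weighted_sentiment(tweets))
```

An unweighted average of these two tweets would be neutral; weighting by influence pulls the aggregate toward the more influential author's negative sentiment.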
14. Example: Text Analysis & e-Discovery
Business Goal
• Archive ALL electronic documents (email, Office, PDF, instant messages, etc.) in a reference archive, retaining original document formats. Provide rapid, flexible access and extraction capabilities for eDiscovery and compliance measures.
Building the Archive: Scalability and Cost Issues
• Teradata/Netezza: $50K–100K/TB
• Search engine: $100K/TB
• Integration costs: $150K
• Total: $200K/TB + $150K
Challenges
• Massive scale of documents in multiple formats and structures.
• Sophisticated query and analysis requirements.
• Future formats impossible to predict.
• Must retain original document format.
Example Analytic Questions
• Retrieve all emails and instant messages from all employees in Denver office between 1995 and 1998
• Who are the top 10 recipients of emails from Bob Smith
Hadapt Value
• Cost-effective: scale to 100s of TB and PB of original document storage.
• Flexible query access: use SQL, Full Text Search, or combine SQL+Search.
• Preventative analysis: apply deduplication, sentiment analysis, categorization to accelerate document assessment.
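The "top 10 recipients of emails from Bob Smith" question maps naturally onto a SQL aggregation. The sketch below runs it against an in-memory SQLite database purely for illustration; the `emails` table, its columns, and the sample rows are hypothetical, and SQLite stands in for whatever SQL engine the archive actually exposes.

```python
import sqlite3

# In-memory database with a hypothetical schema, for illustration only.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE emails (sender TEXT, recipient TEXT, sent_year INTEGER)"
)
conn.executemany(
    "INSERT INTO emails VALUES (?, ?, ?)",
    [
        ("Bob Smith", "alice@example.com", 1996),
        ("Bob Smith", "alice@example.com", 1997),
        ("Bob Smith", "carol@example.com", 1997),
        ("Dan Jones", "alice@example.com", 1996),
    ],
)

# Count messages per recipient for one sender and keep the ten largest.
top_recipients = conn.execute(
    """
    SELECT recipient, COUNT(*) AS n
    FROM emails
    WHERE sender = ?
    GROUP BY recipient
    ORDER BY n DESC, recipient
    LIMIT 10
    """,
    ("Bob Smith",),
).fetchall()
print(top_recipients)
```

The Denver question would add filters on office and date range to the same kind of query; combining it with full-text search over message bodies is the part that depends on the specific platform.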
15. Hadapt: Key Considerations
Simplicity
• All-in-one system for “multi-structured” data analytics
• Single cluster for analysis of multiple data types: low TCO, high performance
• Analyze relational & unstructured data together to answer new questions
• Eliminate data movement between Hadoop and RDBMS
• Use SQL + Full Text Search: a fully integrated solution
Accessibility
• Leverage existing investment in SQL tools and skills
• Can roll out Hadapt analytics to existing BI tool users
• Makes Hadoop easier to adopt for SQL-heavy enterprises
Scalability / Performance
• Enormous performance boost for multi-structured data analysis
• Adaptive query planning provides on-the-fly load balancing & fault tolerance
• Ad-hoc and interactive querying of massive data sets