"Hadoop and Data Warehouse (DWH) – Friends, Enemies or Profiteers? What about Real Time?" - Slides (including TIBCO Examples) from JAX 2014 Online
1. © Copyright 2000-2014 TIBCO Software Inc.
Hadoop and Data Warehouse –
Friends, Enemies or Profiteers?
What about Real Time?
Kai Wähner
kwaehner@tibco.com
@KaiWaehner
www.kai-waehner.de
2. © Copyright 2000-2014 TIBCO Software Inc.
Disclaimer
These opinions are my own and do not necessarily represent those of my employer.
3. © Copyright 2000-2014 TIBCO Software Inc.
Key Messages
Big Data is not just Hadoop, concentrate on Business Value!
A good Big Data Architecture combines DWH, Hadoop and Real Time!
The Integration Layer is getting even more important in the Big Data Era!
4. © Copyright 2000-2014 TIBCO Software Inc.
Agenda
• Terminology
• Data Warehouse and Business Intelligence
• Big Data Processing with Hadoop
• Big Data Processing in Real Time
6. © Copyright 2000-2014 TIBCO Software Inc.
Big Data Architecture
[Diagram: DWH / BI, Hadoop and Real Time together form the Big Data Architecture]
7. © Copyright 2000-2014 TIBCO Software Inc.
DWH means analyzing OLAP Cubes
http://www.exforsys.com/tutorials/msas/data-warehouse-database-and-oltp-database.html
8. © Copyright 2000-2014 TIBCO Software Inc.
Big Data means analyzing Everything
http://blogs.teradata.com/international/tag/hadoop/
• Store everything
• Even without structure
• Use whatever you need (now or later)
9. © Copyright 2000-2014 TIBCO Software Inc.
Big Data: Three shifts in the Way we analyze Information
• Messiness: Using ALL data, not just samples
– Also bad data (e.g. Word spell checker, Google auto-complete and "did you mean..." recommendation)
• Correlations: Instead of causalities
– May not tell us WHY something is happening, but THAT it is happening
– In many situations, this is good enough
– What drug substance cures cancer? When should I buy an airplane ticket?
• Datafication: Store, process, combine, reuse, enhance all data!
– Digitalisation (Amazon Kindle → Read) vs. Datafication (Google Books → Read, Search, Process, ...)
– Words become data: Google Books: not just read, but also search, analyse, etc.
– Locations become data: GPS: not just navigation, but also insurance costs, economic routes, etc.
10. © Copyright 2000-2014 TIBCO Software Inc.
What is Big Data? The combined Vs of Big Data
• Volume (terabytes, petabytes)
• Variety (social networks, blog posts, logs, sensors, etc.)
• Velocity (real time)
• Value
[Diagram: the Vs are combined (×)]
11. © Copyright 2000-2014 TIBCO Software Inc.
Real Time
Wikipedia Definition:
• Real time programs must guarantee response within strict time constraints, often referred to as "deadlines". Real time responses are often understood to be in the order of milliseconds, and sometimes microseconds.
• The term "near real time" refers to the time delay introduced by automated data processing or network transmission.
• The distinction between the terms "near real time" and "real time" is somewhat nebulous and must be defined for the situation at hand.
Hereby, for this talk, I define:
– Real time == response in nanoseconds || microseconds || milliseconds || <= one second
– Near real time == (response time > one second)
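The working definition above can be written down as a tiny check. This is only a sketch of the talk's own threshold (one second); the function name `classify_latency` is made up for illustration.

```python
def classify_latency(response_seconds: float) -> str:
    """Classify a response time using this talk's working definition.

    Real time: response within one second (ns, µs, ms, up to 1 s).
    Near real time: anything slower than one second.
    """
    if response_seconds < 0:
        raise ValueError("response time cannot be negative")
    return "real time" if response_seconds <= 1.0 else "near real time"


if __name__ == "__main__":
    for seconds in (0.000001, 0.25, 1.0, 1.5, 3600):
        print(f"{seconds} s -> {classify_latency(seconds)}")
```

Other talks or products may draw the line elsewhere, which is exactly why the threshold has to be stated for the situation at hand.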
12. © Copyright 2000-2014 TIBCO Software Inc.
Agenda
• Terminology
• Data Warehouse and Business Intelligence
• Big Data Processing with Hadoop
• Big Data Processing in Real Time
13. © Copyright 2000-2014 TIBCO Software Inc.
Big Data Architecture
[Diagram: DWH / BI, Hadoop and Real Time together form the Big Data Architecture]
14. © Copyright 2000-2014 TIBCO Software Inc.
DWH vs. BI
• Data Warehouse (DWH) → Storage
• Business Intelligence (BI) → Analytics
• Both terms are often used as synonyms, i.e. when someone talks about a DWH, this might include analytics
• BI can be used without a DWH
15. © Copyright 2000-2014 TIBCO Software Inc.
Typical DWH Process
http://wikibon.org/blog/not-your-fathers-data-analytics/
A DWH is "Business Case driven":
• Reporting
• Dashboards
• Drill Down Analytics
Different DWH Options:
• Enterprise DWH (== EDW)
• Department / Project DWH
• Embedded BI (into Applications)
18. © Copyright 2000-2014 TIBCO Software Inc.
Products
DWH
• SQL: e.g. MySQL
• MPP: e.g. Teradata, EMC Greenplum, IBM Netezza
– Scale very well (almost linear) with very high performance, but hardware / software costs also increase a lot
BI
• Microsoft Excel
• BI Tools: e.g. TIBCO Spotfire, Tableau, MicroStrategy
Hint: Good BI tools
• allow data discovery / visualization using different sources, not just DWH
• are easy to use
20. © Copyright 2000-2014 TIBCO Software Inc.
BI Tool Example: TIBCO Spotfire
The whole team needs analytics. Spotfire is for everyone, helping users with a variety of skill levels to visualize, explore and share information. It has:
• At-a-glance business facts for managers
• Dashboards for front-line decision-makers
• Visual discovery for business users
• Deep data exploration for analysts
• Advanced predictive analytics for statisticians
• And beautiful visualizations to impress your executives
23. © Copyright 2000-2014 TIBCO Software Inc.
DWH Real World Use Case
http://spotfire.tibco.com/resources/content-center?Content%20Type=Case%20Studies
24. © Copyright 2000-2014 TIBCO Software Inc.
DWH Real World Use Case
http://spotfire.tibco.com/resources/content-center?Content%20Type=Case%20Studies
25. © Copyright 2000-2014 TIBCO Software Inc.
Embedded BI Real World Use Case
https://www.jaspersoft.com/embeddedShowcase/periscope.html
26. © Copyright 2000-2014 TIBCO Software Inc.
Problems of a DWH
No flexibility / agility
• Just structured data
• Just some (maybe aggregated) history data
• Just good for already known business cases
Low speed
• ETL is batch, usually takes hours or sometimes even days
• No proactive reactions possible → "too late architecture"
High costs (per GB)
• Just selected data
• Data that is too old is often moved to archives
28. © Copyright 2000-2014 TIBCO Software Inc.
Agenda
• Terminology
• Data Warehouse and Business Intelligence
• Big Data Processing with Hadoop
• Big Data Processing in Real Time
29. © Copyright 2000-2014 TIBCO Software Inc.
Big Data Architecture
DWH
/
BI
Hadoop
Real
Time
Big
Data
Architecture
30. © Copyright 2000-2014 TIBCO Software Inc.
Why no longer DWH, but Hadoop?
Hadoop was built to solve problems of RDBMS and DWH…
Benefits of Hadoop:
• Store and analyze all data
– all data == not just selected (maybe aggregated) data
– all data == structured + semi-structured + unstructured
→ be more flexible, adapt to changing business cases
• Better performance (massively parallel)
• Ad hoc data discovery – also for big data volumes
• Save money (commodity hardware, open source software)
31. © Copyright 2000-2014 TIBCO Software Inc.
What is Hadoop?
Apache Hadoop, an open-source software library, is a
framework that allows for the distributed processing of
large data sets across clusters of commodity hardware
using simple programming models. It is designed to scale
up from single servers to thousands of machines, each
offering local computation and storage.
32. © Copyright 2000-2014 TIBCO Software Inc.
MapReduce
Simple example:
• Input: (very large) text files with lists of strings, such as: "318, 0043012650999991949032412004...0500001N9+01111+99999999999..."
• We are interested in just some of the content: year and temperature (marked in red on the slide)
• The MapReduce function has to compute the maximum temperature for every year
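The map, shuffle and reduce steps of this example can be sketched in plain Python, without a Hadoop cluster. The record layout is an assumption: the weather records on the slide are fixed-width and elided, so this sketch uses a simplified, hypothetical "year,temperature" line format, and all function names are illustrative.

```python
from collections import defaultdict


def map_record(line):
    """Map step: turn one input line into a (year, temperature) pair.

    Assumes a simplified, hypothetical "year,temperature" layout,
    not the fixed-width record format shown on the slide.
    """
    year, temperature = line.split(",")
    return year.strip(), int(temperature)


def shuffle(pairs):
    """Shuffle step: group all temperatures by their year key."""
    groups = defaultdict(list)
    for year, temperature in pairs:
        groups[year].append(temperature)
    return groups


def reduce_max(year, temperatures):
    """Reduce step: keep only the maximum temperature for one year."""
    return year, max(temperatures)


def max_temperature_per_year(lines):
    """Run map, shuffle and reduce in-process over an iterable of lines."""
    pairs = (map_record(line) for line in lines if line.strip())
    grouped = shuffle(pairs)
    return dict(reduce_max(year, temps) for year, temps in grouped.items())


if __name__ == "__main__":
    sample = ["1949,111", "1949,78", "1950,0", "1950,22", "1950,-11"]
    print(max_temperature_per_year(sample))
```

In Hadoop the same map and reduce functions would run in parallel on many machines, with the framework doing the shuffle between them; the in-process version only shows the data flow.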
33. © Copyright 2000-2014 TIBCO Software Inc.
Hadoop Products
[Diagram: products arranged by "features included", from few to many]
• Apache Hadoop: MapReduce, HDFS, Ecosystem
35. © Copyright 2000-2014 TIBCO Software Inc.
Hadoop Products
[Diagram: products arranged by "features included", from few to many]
• Apache Hadoop: MapReduce, HDFS, Ecosystem
• Hadoop Distribution: Apache Hadoop + Packaging, Deployment-Tooling, Support
37. © Copyright 2000-2014 TIBCO Software Inc.
Hadoop Products
[Diagram: products arranged by "features included", from few to many]
• Apache Hadoop: MapReduce, HDFS, Ecosystem
• Hadoop Distribution: Apache Hadoop + Packaging, Deployment-Tooling, Support
• Big Data Suite: Hadoop Distribution + Tooling / Modeling, Code Generation, Scheduling, Integration
40. © Copyright 2000-2014 TIBCO Software Inc.
Hadoop Real World Use Case:
Replace ETL to improve Performance
“The advantage of their new system is that they can now look at their
data [from their log processing system] in anyway they want:
• Nightly MapReduce jobs collect statistics about their mail system such as spam counts by
domain, bytes transferred and number of logins.
• When they wanted to find out which part of the world their customers logged in from, a quick
[ad hoc] MapReduce job was created and they had the answer within a few hours. Not really
possible in your typical ETL system.”
http://highscalability.com/how-rackspace-now-uses-mapreduce-and-hadoop-query-terabytes-data
(no TIBCO reference)
41. © Copyright 2000-2014 TIBCO Software Inc.
Hadoop Real World Use Case:
Storage to reduce Costs
Global Parcel Service (no TIBCO reference)
• A lot of data must be stored "forever"
• Numbers increase exponentially
• Goal: As cheap as possible
• Problem: Queries must still be possible (compliance!)
• Solution: Commodity servers and "Hadoop querying"
http://archive.org/stream/BigDataImPraxiseinsatz-SzenarienBeispieleEffekte/Big_Data_BITKOM-Leitfaden_Sept.2012#page/n0/mode/2up
42. © Copyright 2000-2014 TIBCO Software Inc.
DWH or Hadoop?
Criterion | DWH | Hadoop
Data | Structured | All data
Maturity | Established in Enterprise | New concepts
Tooling | Installed, good knowledge and experience | New tools, coding required, business can still use SQL-similar queries or same BI tool
Costs | High (per GB) | Low (per GB)
43. © Copyright 2000-2014 TIBCO Software Inc.
DWH plus Hadoop?
DWH and Hadoop complement each other very well
• Store all data in Hadoop (cheap per GB)
• ETL from Hadoop to DWH (expensive per GB)
• Create specific reports / dashboards in DWH (leverage existing products and knowledge)
• Do Ad Hoc (Big) Data Discovery directly in Hadoop, no DWH needed
Good BI tools support both, DWH and Hadoop!
For example, TIBCO Spotfire has connectors to:
• RDBMS (e.g. MySQL)
• MPP (e.g. Teradata, IBM Netezza, Greenplum)
• Hadoop (e.g. Hive, Impala)
• In-Memory (e.g. TIBCO ActiveSpaces, SAP HANA)
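The division of labour described at the top of this slide (keep everything raw and cheap, ETL only a selected aggregate into the expensive DWH, report from the DWH) can be sketched as follows. Everything here is a stand-in chosen for illustration: a Python list plays the role of the raw data in Hadoop, an in-memory SQLite database plays the role of the DWH, and the table and field names are hypothetical.

```python
import sqlite3
from collections import Counter

# Stand-in for raw data kept in Hadoop: every event is stored,
# including ones that have no usable structure.
RAW_EVENTS = [
    {"customer": "alice", "amount": 30},
    {"customer": "bob", "amount": 12},
    {"customer": "alice", "amount": 8},
    {"note": "unstructured line without an amount"},
]


def etl_to_dwh(raw_events, connection):
    """Aggregate the structured subset and load it into a DWH table."""
    totals = Counter()
    for event in raw_events:
        # Only events with both fields make it into the warehouse.
        if "customer" in event and "amount" in event:
            totals[event["customer"]] += event["amount"]
    connection.execute(
        "CREATE TABLE IF NOT EXISTS revenue_per_customer "
        "(customer TEXT PRIMARY KEY, total INTEGER)"
    )
    connection.executemany(
        "INSERT OR REPLACE INTO revenue_per_customer VALUES (?, ?)",
        sorted(totals.items()),
    )
    connection.commit()


def report(connection):
    """A fixed report, as a BI tool would run it against the DWH."""
    return connection.execute(
        "SELECT customer, total FROM revenue_per_customer ORDER BY total DESC"
    ).fetchall()


dwh = sqlite3.connect(":memory:")
etl_to_dwh(RAW_EVENTS, dwh)

if __name__ == "__main__":
    print(report(dwh))
```

The unstructured event never reaches the warehouse, but it is still available in the raw store for ad hoc discovery later.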
44. © Copyright 2000-2014 TIBCO Software Inc.
Recommendation DWH vs. Hadoop vs. XYZ
• Short term: Use Hadoop (only) when you can save (a lot of) money or when you cannot solve your business problem without Hadoop. A lot of things have to be improved, e.g. governance, security, performance, and tool support.
• Long term: Hadoop can replace DWH (as you can create a DWH on top of Hadoop with SQL interface already today)!
• Be aware: A lot of other options emerge for analyzing big data besides Hadoop, e.g.
– Analytical databases with SQL interface (MemSQL, Citus Data)
– Log Analytics (Splunk, TIBCO LogLogic)
– Graph databases (Neo4j, InfiniteGraph)
45. © Copyright 2000-2014 TIBCO Software Inc.
Vendors Strategy...
Hadoop vendors push Hadoop as DWH replacement
→ Called e.g. "Enterprise Data Hub" (Cloudera) or "Data Lake" (Hortonworks)
http://gigaom.com/2013/10/29/clouderas-plan-to-become-the-center-of-your-data-universe/
http://hortonworks.com/wp-content/uploads/downloads/2013/04/Hortonworks.ApacheHadoopPatternsOfUse.v1.0.pdf
46. © Copyright 2000-2014 TIBCO Software Inc.
Vendors Strategy...
MPP / DWH vendors add Hadoop support as a complementary add-on to their DWH
→ Reason (probably): Market pressure!
→ Benefit: One platform (including tooling and support) for DWH and Hadoop
47. © Copyright 2000-2014 TIBCO Software Inc.
Example: EMC combines DWH and Hadoop
http://wikibon.org/wiki/v/EMC_Integrates_Greenplum_DB_and_Hadoop_with_Pivotal_HD
http://www.gopivotal.com/big-data/pivotal-hd
48. © Copyright 2000-2014 TIBCO Software Inc.
Example: Teradata combines DWH and Hadoop
http://www.teradata.com/Teradata-Enterprise-Access-for-Hadoop/
http://gigaom.com/2014/04/07/teradata-says-hadoop-is-good-for-business-but-for-how-long/
49. © Copyright 2000-2014 TIBCO Software Inc.
Hadoop evolving from Batch to Near Real Time
Hadoop is MapReduce == Batch (== hours, minutes, seconds)
• Good for complex transformations / computations of big data volumes
• Not so good for ad hoc data exploration
• Improvements: Hive Stinger (Hortonworks) etc.
Non-MapReduce processing engines added in the meantime (YARN makes it possible)
• Ad hoc data discovery (== seconds)
• Hive / Pig with Apache Tez replacing MapReduce under the hood for data processing
• New Query engines, e.g. Impala (Cloudera) or Apache Drill (MapR)
MPP vendors (e.g. Teradata, EMC Greenplum) also add own query engines
• Offer fast data exploration (without MapReduce)
Some Hadoop problems remain
• No good, easy tooling (Hadoop ecosystem) → might be solved in the coming years
• Missing maturity (alpha / beta versions) → might be solved in the coming years
• No "real time" (== ms, ns), but "near real time" (> 1 sec) → "too late architecture"
50. © Copyright 2000-2014 TIBCO Software Inc.
Agenda
• Terminology
• Data Warehouse and Business Intelligence
• Big Data Processing with Hadoop
• Big Data Processing in Real Time
51. © Copyright 2000-2014 TIBCO Software Inc.
Big Data Architecture
[Diagram: DWH / BI, Hadoop and Real Time together form the Big Data Architecture]
52. © Copyright 2000-2014 TIBCO Software Inc.
Real Time: “The Two-Second Advantage”
"A little bit of the right information, just a little bit beforehand – whether it is a couple of seconds, minutes or hours – is more valuable than all of the information in the world six months later… this is the two-second advantage."
Vivek Ranadivé, Founder and CEO of TIBCO
54. © Copyright 2000-2014 TIBCO Software Inc.
What is Big Data? The combined Vs of Big Data
• Volume (terabytes, petabytes)
• Variety (social networks, blog posts, logs, sensors, etc.)
• Velocity (real time)
[Diagram: the Vs are combined (×), with the label "Fast Data"]
55. © Copyright 2000-2014 TIBCO Software Inc.
Real Time Architecture?
[Diagram comparing two architectures, each showing EVENTS, Mainframe/ERP/DB/App and ACTION]
• Transaction Based Architectures
– Not Elastic, Doesn't Scale, "Always Late" architecture and analytics
• Behavior Based Architectures (Transaction Data, Event and Analytics)
– Elastic, Scales, Real time architecture (Events, Data and Analytics)
56. © Copyright 2000-2014 TIBCO Software Inc.
Complex Event / Stream Processing / In-Memory
Concepts
• Streams: Monitoring millions of events in a specific time window to react proactively
• Stateful: Collect, filter and correlate events with state to anticipate outcomes and react proactively
• Transactional: Highly performant transactional event processing
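The "Streams" concept above (watching events inside a specific time window in order to react while it still matters) can be illustrated with a small sliding-window counter. This is a minimal sketch, not how any particular product implements it: the class name, the threshold rule and the assumption that timestamps arrive in non-decreasing order are all chosen for illustration.

```python
from collections import deque


class SlidingWindowCounter:
    """Counts events inside a moving time window and flags bursts.

    Assumes timestamps (in seconds) arrive in non-decreasing order.
    """

    def __init__(self, window_seconds: float, threshold: int):
        self.window_seconds = window_seconds
        self.threshold = threshold
        self._timestamps = deque()

    def observe(self, timestamp: float) -> bool:
        """Record one event; return True if the window now holds
        at least `threshold` events."""
        self._timestamps.append(timestamp)
        # Evict events that are older than the window.
        while self._timestamps and timestamp - self._timestamps[0] > self.window_seconds:
            self._timestamps.popleft()
        return len(self._timestamps) >= self.threshold


if __name__ == "__main__":
    counter = SlidingWindowCounter(window_seconds=10, threshold=3)
    for t in [0, 4, 20, 22, 25]:
        print(t, counter.observe(t))
```

The state kept between events (the deque of timestamps) is what makes the processing "stateful": each new event is judged against what happened just before it, not in isolation.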
Products vs. Frameworks
• Products are mature, mission-critical, in production, e.g. TIBCO StreamBase, IBM InfoSphere Streams
• Open Source Frameworks, e.g. “Apache Spark” and “Apache Storm”
– Future will tell us about performance, tooling, support, etc.
– Can be combined with Hadoop
– Are complementary to Products such as TIBCO StreamBase
In-Memory
• Can also be used for “big data” (Terabytes possible!)
• Usually complementary, i.e. they can be / have to be combined with stream processing / complex event
processing
57. © Copyright 2000-2014 TIBCO Software Inc.
Stream Processing Architecture
[Diagram: stream processing architecture]
• Components shown: LiveView Datamart, Continuous Query Processor (Continuous Query, Ad Hoc Query, Alerts), CEP, Messaging (low latency), Messaging (JMS), In-Memory, ESB Integration, JDBC, ActiveSpaces
• Data shown: Social Media Data, Market Data, Sensor Data, Historical Data, Enterprise data
58. © Copyright 2000-2014 TIBCO Software Inc.
Stream Processing Architecture (Example: TIBCO StreamBase)
[Diagram: TIBCO StreamBase and TIBCO LiveView]
• TIBCO StreamBase: Continuous Query Processor (Continuous Query, Ad Hoc Query, Alerts, Active Tables)
• Data shown: Trading Signal, Transaction Cost, Orders / Executions, Market Data, Alert Setting
• TIBCO LiveView: Snapshot AND always-live updates; quickly connect to streams; anticipate opportunities, proactive action
59. © Copyright 2000-2014 TIBCO Software Inc.
Example: TIBCO StreamBase Tooling
StreamBase Development Studio
• Visual Development
• Visual Debugging
• Feed Simulation
• Unit Testing
StreamBase LiveView
• Real Time Analytics and Visualization
• Ad hoc queries
• Alerts and Notifications
• Web, Mobile and API Integration
60. © Copyright 2000-2014 TIBCO Software Inc.
Real World: Real-Time Trade Surveillance
[Diagram: real-time trade surveillance with StreamBase]
• Inputs: Market Data, Trade Data, Static Data, Systems Data
• Adapters and Handlers (on both the input and the output side)
• StreamBase Server(s) running Applications: Integration, Normalization, Aggregation, Correlation, Rules, Alerts, Automation
• StreamBase Studio for Developing EventFlow Applications
• Data Management: Persistence Stores, Logs
• Outputs: Performance Benchmarks, Automation, Desktop Alerts
61. © Copyright 2000-2014 TIBCO Software Inc.
Real Time (Stream Processing) Real World Use Case
Real-Time Fraud Detection
"The firm needs to monitor machine-driven algorithms, and look for suspicious patterns. Sounds simple, right? Not so simple! In this case, the patterns of interest required correlation of 5 streams of real-time data. Patterns happen within 15-30 second windows, during which thousands of dollars could be lost. Attacks come in bursts. The data required to find these patterns was loaded into a data warehouse and reports were checked each day. Decisions to act were made every day. LiveView now intercepts the data before it hit the warehouse by connecting LiveView to the source of data. It took 3 days to integrate these sources because it took that long to find someone who knew where 3 of the data streams came from! StreamBase detects fraud patterns in milliseconds.
But the really interesting part came next. Once this firm could see patterns of fraud, they were faced with a new challenge: what to DO about it? How many times did the pattern need to be repeated until active surveillance is started? Should the action be quarantined for a period, or halted immediately? All these questions were new, and the answers to them keeps changing. The fact that the answers keep changing highlights the importance of ease of use. Analytics must be changed quickly and be made available to fraud experts - in some cases, in hours - as understanding deepens, and as the bad guys change their tactics.
Better, higher value-add customer service for highly automated industries. Knowledge workers who anticipate sales opportunities. Spotting fraud in high-speed transactions streams and taking action."
Some more use cases:
http://streambase.typepad.com/streambase_stream_process/2012/04/streambase-liveview-10-3-stories-from-the-trenches.html
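The core idea in the quote, correlating several real-time streams inside a short window, can be sketched in a few lines. This is an illustration only and says nothing about how StreamBase or LiveView work internally: the class, the stream names and the rule "alert when one key shows up in enough distinct streams within the window" are assumptions made for the example.

```python
from collections import defaultdict


class StreamCorrelator:
    """Flags a key (e.g. an account) once it has appeared in enough
    distinct streams within a time window.

    Assumes timestamps (in seconds) arrive in non-decreasing order.
    """

    def __init__(self, window_seconds: float, required_streams: int):
        self.window_seconds = window_seconds
        self.required_streams = required_streams
        # key -> {stream name -> timestamp of the latest event from that stream}
        self._last_seen = defaultdict(dict)

    def observe(self, stream: str, key: str, timestamp: float) -> bool:
        """Record one event; return True if the key is now correlated
        across at least `required_streams` streams."""
        seen = self._last_seen[key]
        seen[stream] = timestamp
        # Forget streams whose latest event fell out of the window.
        expired = [name for name, ts in seen.items()
                   if timestamp - ts > self.window_seconds]
        for name in expired:
            del seen[name]
        return len(seen) >= self.required_streams


if __name__ == "__main__":
    correlator = StreamCorrelator(window_seconds=30, required_streams=3)
    events = [
        ("orders", "acct-1", 0),
        ("cancels", "acct-1", 10),
        ("quotes", "acct-2", 12),
        ("quotes", "acct-1", 50),
        ("orders", "acct-1", 60),
        ("cancels", "acct-1", 70),
    ]
    for stream, key, ts in events:
        print(stream, key, ts, correlator.observe(stream, key, ts))
```

With a daily warehouse report the same pattern would only be visible the next day; evaluating it per event is what makes a reaction inside the 15-30 second window possible.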
62. © Copyright 2000-2014 TIBCO Software Inc.
Real Time (CEP + In-Memory) Real World Use Case
"With 38 million fans, MGM knows how to put its customers first, it takes more than a smile too. Customers want a personalized, tailored experience, one that knows their name and can anticipate their needs. With the help of TIBCO technologies that leverage big data and give customers a digital identity, MGM can send personalized offers directly to customers, save them a seat, and have their favorite drink on the way. With multiple customer touch points and channels, MGM can reach customers in more ways, and in more places, than ever before."
https://www.youtube.com/watch?v=X-7S3kCOx9k
CEP:
• Correlate
• Analyze
• Action
In-Memory:
• Enable Real Time
• Only customers that have checked in
64. © Copyright 2000-2014 TIBCO Software Inc.
Real Time plus Hadoop?
Hadoop:
• Storage
• Complex computing (MapReduce)
Real Time:
• Immediate (proactive) reactions – automated or manually by user
• Monitor streaming data in Real Time
Example:
TIBCO StreamBase and its Apache Flume connector for reading streaming data from Hadoop / HDFS or sending streaming data to Hadoop / HDFS
65. © Copyright 2000-2014 TIBCO Software Inc.
Real Time plus Hadoop Real World Use Case
Use Case:
• Predict pricing movement in live bets
Hadoop:
• Store all history information about all past bets
• Use MapReduce to precompute odds for new matches, based on all history data
TIBCO StreamBase:
• Compute new odds in real time to react within a live game after events (e.g. when a team scores a goal)
• Monitor stream data in real time dashboards
http://www.casestudyu.com/news/2014/04/04/7762652.htm
http://vimeo.com/91461315
66. © Copyright 2000-2014 TIBCO Software Inc.
Recap: Big Data Architecture
[Diagram: DWH / BI, Hadoop and Real Time together form the Big Data Architecture]
68. © Copyright 2000-2014 TIBCO Software Inc.
Off Topic
Integration is not a talking point in this session… However:
It gets even more important in the future!
The number of different data sources and technologies increases even more than in the past
– CRM, ERP, Host, B2B, etc. will not disappear
– DWH, Hadoop cluster, event / streaming server, In-Memory DB have to communicate
– Cloud, Mobile, Internet of Things are not optional, they are our future!
69. © Copyright 2000-2014 TIBCO Software Inc.
Recap: Key Messages
Big Data is not just Hadoop, concentrate on Business Value!
A good Big Data Architecture combines DWH, Hadoop and Real Time!
The Integration Layer is getting even more important in the Big Data Era!
70. © Copyright 2000-2014 TIBCO Software Inc.
Questions?
Kai Wähner
kwaehner@tibco.com, @KaiWaehner, www.kai-waehner.de