The Marriage of Hadoop and Data Warehousing: How Big Data Platforms are Emerging

June 2012

IBM Big Data

James Kobielus
Senior Program Director, Product Marketing, Big Data, IBM

© 2012 IBM Corporation

Hadoop and DW are fast being joined into a new platform paradigm: the Hadoop DW


Agenda

• Big Data: “3 Vs” and myriad use cases
• Big Data: diverse workloads
• Big Data: emergence of the “Hadoop DW”


Scalability Imperative: 3 “Vs” Drive Big Data Everywhere

Information from Everywhere · Radical Flexibility · Extreme Scalability

• Volume: 12 terabytes of Tweets created daily
• Velocity: 5 million trade events per second
• Variety: 100s of video feeds from surveillance cameras


More Business Use Cases for Big Data Across Enterprise


More Mission-Critical Apps Ride on Big Data Platforms

Advanced Analytic Applications
Big Data Platform: Process and analyze any type of data
Accelerators

• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis and discovery
• Development environment for building new analytic applications
• Integrate and deploy applications with enterprise-grade availability, manageability, security, and performance

• Analyze data in motion
• MapReduce / NoSQL
• Machine Learning
• Text Analytics
• Text Search
• Data Discovery
• Visualization and exploration
• Scalability
• Hardware acceleration
• Stream computing


Big Data: Business Crucible for Practical Data Science

• Business and IT identify information sources available
• IT delivers a platform that enables creative exploration of all available data and content
• Business determines what questions to ask by exploring the data and relationships
• New insights drive integration to traditional technology


Big Data Initiatives: Fueled by Practical Data Science

Analyze a Variety of Information
– Novel analytics on a broad set of mixed information that could not be analyzed before

Analyze Information in Motion
– Streaming data analysis
– Large volume data bursts and ad-hoc analysis

Analyze Extreme Volumes of Information
– Cost-efficiently process and analyze PBs of information
– Manage & analyze high volumes of structured, relational data

Discover and Experiment
– Ad-hoc analytics, data discovery and experimentation

Manage and Plan
– Enforce data structure, integrity and control to ensure consistency for repeatable queries

Big Data: Marriage of Established & Emerging Approaches

Established Approach (DW)
– Structured, analytical, logical
– Traditional sources: Transaction Data, Internal App Data, Mainframe Data, OLTP System Data, ERP data
– Characteristics: Structured, Repeatable, Linear
– Examples: Monthly sales reports, Profitability analysis, Customer surveys

Emerging Approaches (Hadoop, etc.)
– Creative, holistic thought, intuition
– New sources: Web Logs, Social Data, Text Data: emails, Sensor data: images, RFID
– Characteristics: Unstructured, Exploratory, Iterative
– Examples: Brand sentiment, Product strategy, Maximum asset utilization

Enterprise Integration


Continuous Social Media Monitoring and Analytics

Data Set
• 1.1B tweets
• 5.7M blog and forum posts
• 3.5M relevant messages
• 97K referencing Product A
• 18K referencing Product B

Information extracted
• Buzz and sentiment
• Gender, Location and Occupation
• Fans
• Intent to purchase
• Specific attributes of products


Content mining, natural language processing, & classification

• How it works
– Parses text and detects meaning with extractors
– Understands the context in which the text is analyzed
– Hundreds of pre-built extractors for names, addresses, phone numbers, organizations, URL, Datetime, etc.

• Accuracy
– Highly accurate in deriving meaning from complex text

• Performance
– AQL language optimized for MapReduce

Unstructured text (document, email, etc):
“Football World Cup 2010, one team distinguished themselves well, losing to the eventual champions 1-0 in the Final. Early in the second half, Netherlands’ striker, Arjen Robben, had a breakaway, but the keeper for Spain, Iker Casillas made the save. Winger Andres Iniesta scored for Spain for the win.”

Classification and Insight:
World Cup 2010 Highlights
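
As a rough illustration of what "parsing text with extractors" can look like, the following Python sketch applies a few named regular-expression rules to the World Cup sample text. It is not AQL and does not reflect how the IBM extractors are implemented; the `TEAMS` dictionary, the `EXTRACTORS` rules, and the `extract` function are all hypothetical names made up for this example.

```python
import re

# Sample unstructured text from the slide.
TEXT = (
    "Football World Cup 2010, one team distinguished themselves well, losing to "
    "the eventual champions 1-0 in the Final. Early in the second half, "
    "Netherlands' striker, Arjen Robben, had a breakaway, but the keeper for "
    "Spain, Iker Casillas made the save. Winger Andres Iniesta scored for "
    "Spain for the win."
)

# Hypothetical dictionary of team names (assumption: a real extractor
# library would ship far larger dictionaries).
TEAMS = ("Netherlands", "Spain")

# Each extractor is a named rule; here a rule is just a regular expression.
EXTRACTORS = {
    "team": re.compile(r"\b(?:%s)\b" % "|".join(TEAMS)),
    "score": re.compile(r"\b\d+-\d+\b"),
    # A person is a two-word capitalized name that follows a role keyword.
    "person": re.compile(
        r"\b(?:striker|keeper for \w+|Winger),? ([A-Z][a-z]+ [A-Z][a-z]+)"
    ),
}


def extract(text):
    """Run every extractor and return {label: [matches in order of appearance]}."""
    results = {}
    for label, pattern in EXTRACTORS.items():
        matches = []
        for m in pattern.finditer(text):
            # Use the capture group when the rule defines one, else the whole match.
            matches.append(m.group(1) if pattern.groups else m.group(0))
        results[label] = matches
    return results


if __name__ == "__main__":
    for label, values in extract(TEXT).items():
        print(label, values)
```

Hand-written rules like these are brittle; the sketch only shows the general shape of rule-driven extraction, in which each rule labels spans of text that downstream classification can use.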


Entity Extraction and Integration


Statistical Analysis, Predictive Modeling, & Machine Learning

Enables machine learning (ML) on massive datasets
• R and Matlab-like syntax for smooth adoption
• Optimizations to generate low-level execution plans
• Out-of-box and write-your-own analytic algorithms, e.g. Regression, Clustering, Classification, Pattern Mining, Ranking, etc.
• Scale to massively parallel clusters from 10s to 1000s of machines and from Terabytes to Petabytes
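
One way to see why an algorithm such as regression can scale across many machines is that it can be expressed in terms of small per-partition summaries that are then added together. The Python sketch below fits a simple linear regression that way. It is a generic illustration, not the IBM toolkit's syntax or implementation, and the function names (`partial_stats`, `combine`, `fit_line`) are assumptions made for this example.

```python
from functools import reduce


def partial_stats(points):
    """Map step: summarize one partition of (x, y) pairs as sufficient statistics."""
    n = len(points)
    sx = sum(x for x, _ in points)
    sy = sum(y for _, y in points)
    sxx = sum(x * x for x, _ in points)
    sxy = sum(x * y for x, y in points)
    return (n, sx, sy, sxx, sxy)


def combine(a, b):
    """Reduce step: statistics from two partitions simply add element-wise."""
    return tuple(u + v for u, v in zip(a, b))


def fit_line(partitions):
    """Fit y = slope * x + intercept by least squares over all partitions."""
    n, sx, sy, sxx, sxy = reduce(combine, map(partial_stats, partitions))
    denom = n * sxx - sx * sx
    if denom == 0:
        raise ValueError("x values are all identical; slope is undefined")
    slope = (n * sxy - sx * sy) / denom
    intercept = (sy - slope * sx) / n
    return slope, intercept


if __name__ == "__main__":
    # Two partitions, as if stored on two different nodes.
    partitions = [[(0, 1), (1, 3)], [(2, 5), (3, 7)]]
    print(fit_line(partitions))
```

Because each partition contributes only five numbers, the amount of data moved between nodes stays constant no matter how large the partitions grow.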

What are people talking about in social media about a product?


Targeted E-Commerce and Next Best Action


Predictive Complex Event Processing


Intent and Sentiment Analysis

Online flow: Data-in-motion analysis
Data Sources → Stream Computing and Analytics → Timely Decisions
• Data Ingest and Prep
• Text Analytics: Timely Insights
• Entity Analytics: Profile Resolution
• Predictive Analytics: Action Determination
• Dashboard

Offline flow: Data-at-rest analysis
Hadoop System and Analytics → Reports
• Social Media and Enterprise Data
• Text Analytics
• Entity Analytics and Integration
• Comprehensive Social Media Customer Profiles
• Predictive Analytics
• Customer Models


Big Data: DW & Hadoop are Married in Spirit

DW: models, policies, metadata, aggregates, DQ, MDM, hubs, marts, cubes, ETL, databases, views, storage, memory, staging, production, cache, in-database analytics, nodes, tables, operational data stores

• Cloud-facing architectures
• Massively parallel processing
• In-database analytics
• Mixed workload management
• Hybrid storage layers


Hadoop is Core of Next-Gen Big Data DW

• Vendor-agnostic framework for massively parallel processing of advanced analytics against polystructured information
• Leverages extensible framework for building advanced analytics and data management functions
• Evolving rapidly in new directions
• Being commercialized and adopted rapidly in enterprises
• Vibrant open-source community and industry


Hadoop, DW, and other Databases Co-Exist in Big Data Ecosystem

Platforms: Hadoop & NoSQL, DW, In-memory, RDBMS, Columnar, OLAP

Tiers:
• Big Data staging, ETL, and preprocessing tier
• Big Data SVOT and governance tier
• Big Data access and interaction tier


How Hadoop and DW Complement Each Other


Single Version of Big Data: Where Hadoop DW Will Excel

Monetizable intent to see a
– Kinda feel like going to movies tonight… Any recommendations? @Texas Angelika Texas
– I don’t think anyone understands how much I like watching movies. My 3rd trip to the threatre in 3 days.

Monetizable intent to buy
– I need a new digital camera for my food pictures, and recommendations around 300?
– What should I buy?? A mini laptop with Windows 7 OR a Apply MacBook!??!

Location announcements
– I’m at Starbucks Parque Tezontle http://4sq.com/fYReSj

Life Events
– College: Off to Standard for my MBA! Bbye chicago!
– Looks like we’ll be moving to New Orleans sooner than I thought.


Hadoop DW Integration: What to Look For
• Hadoop distro functional depth
• EDW HDFS connector
• Software, appliance, and cloud form factors for Hadoop offerings
• Pluggable storage layer for Hadoop offerings
• Bundled data management and analytics offerings integrated with Hadoop solutions
• Modeling, management, acceleration, and optimization tools
• Real-time/low-latency capabilities integrated into Hadoop offerings
• Robust availability, security, and workload management tools integrated with Hadoop offerings
• And many more, focused on EDW-grade robustness, scalability, and flexibility!


Consider Big Data Platform Accelerators

• Telecommunications: CDR streaming analytics; Deep Network Analytics
• Retail Customer Intelligence: Customer Behavior and Lifetime Value Analysis
• Finance: Streaming options trading; Insurance and banking DW models
• Social Media Analytics: Sentiment Analytics, Intent to purchase
• Public transportation: Real-time monitoring and routing optimization
• Data mining: Streaming statistical analysis

• Over 100 sample applications
• User Defined Toolkits
• Standard Toolkits
• Industry Data Models: Banking, Insurance, Telco, Healthcare, Retail


How Will You Do MDM on Your Hadoop DW?

(A1) Unstructured Entity Integration (on BigInsights)
– Complex analytics to populate master data set
– Text Analytics: Rule language (AQL) for extracting entities, events, relationships from text and html documents
– Entity Integration: Rule language (HIL) to express & customize the integration, cleansing, and aggregation of the master entities

(A2) Entity Repository (on MDM)
– BigInsights Bridge: Generation of the MDM model for public master entities, from the BigInsights model; and bulk-loading of master entities
– Query-based Application Development: Supports the generation of custom queries for individual applications

Diagram:
• External public data sources (e.g., SEC/FDIC, Twitter, Blogs, Facebook)
• A1: BigInsights (Text Analytics, Entity Integration)
• A2: InfoSphere MDM with Extensions (Relational tables with master entities)
• External data subscriptions (e.g., Acxiom)
• Data services; Queries; Tooling based on entity model
• MDM DaaS Applications and Views

Query on entity model:

    select cik, Officers, Directors
    from Company
    where name = 'Citigroup'

SQL on relational tables:

    SELECT *
    FROM
      (SELECT t2.CIK as CIK, t2.NAME as NAME, t2.IS_FORMER_OFFICER as IS_FORMER_OFFICER,
              t2.IS_IMPORTANT_OFFICER as IS_IMPORTANT_OFFICER, t2.POSITION_NAME as POSITION_NAME,
              tp.EARLIEST_DATE as EARLIEST_DATE, tp.IS_EARLIEST_EXACT as IS_EARLIEST_EXACT,
              tp.LATEST_DATE as LATEST_DATE, tp.IS_LATEST_EXACT as IS_LATEST_EXACT
       FROM
         (SELECT t1.CIK as CIK, t1.NAME as NAME, t1.IS_FORMER_OFFICER as IS_FORMER_OFFICER,
                 t1.IS_IMPORTANT_OFFICER as IS_IMPORTANT_OFFICER, p.NAME as POSITION_NAME,
                 p.POSITIONSPK_ID as POSITIONSPK_ID
          FROM
            (SELECT o.CIK as CIK, o.NAME as NAME, o.IS_FORMER_OFFICER as IS_FORMER_OFFICER,
                    o.IS_IMPORTANT_OFFICER as IS_IMPORTANT_OFFICER, o.OFFICERSPK_ID as OFFICERSPK_ID
             FROM DB2ADMIN.OFFICERS o
             WHERE o.OFFICER_OF = 567830643756635868
            ) as t1
          left outer join DB2ADMIN.POSITIONS p on t1.OFFICERSPK_ID = p.POSITIONOF
         ) as t2
       left outer join D2ADMIN.RANGEOFKNOWNDATES tp
         on t2.POSITIONSPK_ID = tp.RANGE_OF_KNOWN_DATES_FOR_POS )
    UNION
    // ( OUTER UNION)
    …
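
To make the nested left outer joins in the SQL on this slide easier to follow, here is a small Python sketch that runs a simplified version of the innermost two levels against an in-memory SQLite database. Everything in it is an assumption for illustration: the reduced table layouts, the sample rows, the company id `42`, and the `officers_with_positions` function are hypothetical and are not taken from the MDM model or from DB2.

```python
import sqlite3


def build_demo_db():
    """Create hypothetical OFFICERS and POSITIONS tables with a few sample rows."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE OFFICERS (OFFICERSPK_ID INTEGER, CIK INTEGER, NAME TEXT, OFFICER_OF INTEGER);
        CREATE TABLE POSITIONS (POSITIONSPK_ID INTEGER, NAME TEXT, POSITIONOF INTEGER);
        INSERT INTO OFFICERS VALUES (1, 100, 'Officer A', 42);
        INSERT INTO OFFICERS VALUES (2, 100, 'Officer B', 42);
        INSERT INTO OFFICERS VALUES (3, 200, 'Officer C', 7);
        INSERT INTO POSITIONS VALUES (10, 'CEO', 1);
        """
    )
    return conn


def officers_with_positions(conn, company_id):
    """Select a company's officers, then left-outer-join their positions.

    The left outer join keeps officers with no known position and
    returns NULL (None in Python) for the position name.
    """
    sql = """
        SELECT t1.NAME AS NAME, p.NAME AS POSITION_NAME
        FROM (SELECT o.NAME AS NAME, o.OFFICERSPK_ID AS OFFICERSPK_ID
              FROM OFFICERS o
              WHERE o.OFFICER_OF = ?) AS t1
        LEFT OUTER JOIN POSITIONS p ON t1.OFFICERSPK_ID = p.POSITIONOF
        ORDER BY t1.NAME
    """
    return conn.execute(sql, (company_id,)).fetchall()


if __name__ == "__main__":
    conn = build_demo_db()
    for row in officers_with_positions(conn, 42):
        print(row)
```

The point of the sketch is the join shape: each level wraps the previous result in a subquery and attaches optional detail with a left outer join, so entities with missing attributes are preserved rather than dropped.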


IBM Big Data Platform

New analytic applications drive the requirements for a big data platform
• Integrate and manage the full variety, velocity and volume of data
• Apply advanced analytics to information in its native form
• Visualize all available data for ad-hoc analysis
• Development environment for building new analytic applications
• Workload optimization and scheduling
• Security and Governance

Analytic Applications: BI / Reporting, Exploration / Visualization, Functional App, Industry App, Predictive Analytics, Content Analytics

IBM Big Data Platform:
• Visualization & Discovery, Application Development, Systems Management
• Accelerators
• Hadoop System, Stream Computing, Data Warehouse
• Information Integration & Governance


Thank You!

