Big Data is a Hot Topic Because Technology Makes it Possible to Analyze
ALL Available Data
Cost effectively manage and analyze
all available data in its native form
unstructured, structured, streaming
ERP
CRM
RFID
Website
Network Switches
Social Media
Billing
BIG DATA is not just HADOOP
Manage & store huge
volume of any data
Hadoop File System
MapReduce
Manage streaming data Stream Computing
Analyze unstructured data Text Analytics Engine
Structure and control data Data Warehousing
Integrate and govern all
data sources
Integration, Data Quality, Security,
Lifecycle Management, MDM
Understand and navigate
federated big data sources
Federated Discovery and Navigation
Business-Centric Big Data Enables You to Start With a Critical Business Pain and Expand the
Foundation for Future Requirements
“Big data” isn’t just a
technology—it’s a business
strategy for capitalizing on
information resources
Getting started is crucial
Success at each entry point is
accelerated by products within
the Big Data platform
Build the foundation for future
requirements by expanding
further into the big data platform
1 – Unlock Big Data
Customer need
• Understand existing data sources
• Search and navigate data within existing
systems
• No copying of data
Value statement
• Get up and running quickly
• Discover and retrieve big data
• Let business users work directly with
big data sources
Solution
• Vivisimo Velocity, renamed to
IBM InfoSphere DataDiscovery
2 – Analyze Raw Data
Customer need
• Ingest data as-is into Hadoop
• Combine it with data from DWH
• Process very large volumes of data
Value statement
• Gain new insight
• Overcome the high cost of converting
data from unstructured to structured
format
• Experiment with analysis on different
data and combine them with other
sources
Solution
• IBM InfoSphere BigInsights
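A minimal sketch of the "ingest as-is, combine with DWH data" idea, in plain Python rather than on Hadoop. The log format, the customer extract, and the function names `parse_raw` and `page_views_by_segment` are hypothetical and only illustrate applying structure at read time instead of converting the data upfront.

```python
import csv
import io

# Hypothetical raw web-log lines, stored exactly as they arrived (no upfront schema).
RAW_LOG = """\
2013-01-05 10:01:22 cust=1001 page=/tariffs
2013-01-05 10:02:10 cust=1002 page=/support
malformed line without the expected fields
2013-01-05 10:03:45 cust=1001 page=/support
"""

# Hypothetical structured extract from the data warehouse.
DWH_CUSTOMERS = """\
customer_id,segment
1001,business
1002,consumer
"""


def parse_raw(line):
    """Apply structure at read time; return None for lines that do not fit."""
    parts = line.split()
    if len(parts) != 4:
        return None
    fields = dict(p.split("=", 1) for p in parts[2:] if "=" in p)
    if "cust" not in fields or "page" not in fields:
        return None
    return {"customer_id": fields["cust"], "page": fields["page"]}


def page_views_by_segment(raw_text, dwh_csv):
    """Count page views per (segment, page) by joining raw events with DWH data."""
    segments = {
        row["customer_id"]: row["segment"]
        for row in csv.DictReader(io.StringIO(dwh_csv))
    }
    counts = {}
    for line in raw_text.splitlines():
        event = parse_raw(line)
        if event is None:
            continue  # the raw line stays stored; it is only skipped in this analysis
        key = (segments.get(event["customer_id"], "unknown"), event["page"])
        counts[key] = counts.get(key, 0) + 1
    return counts


result = page_views_by_segment(RAW_LOG, DWH_CUSTOMERS)
```

Because the raw lines are never rewritten, a different analysis can later parse the same data in a different way.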
Merging the Traditional and Big Data
Approaches
IT
Structures the
data to answer
that question
IT
Delivers a platform to
enable creative
discovery
Business
Explores what questions
could be asked
Business Users
Determine what
question to ask
Monthly sales reports
Profitability analysis
Customer surveys
Brand sentiment
Product strategy
Maximum asset utilization
Big Data Approach
Iterative & Exploratory Analysis
Traditional Approach
Structured & Repeatable Analysis
InfoSphere BigInsights is more than
just HADOOP
IBM InfoSphere BigInsights
• Is much more than HADOOP
IBM Big Data platform
• Includes much more than
IBM InfoSphere BigInsights
Hadoop
Open-source software framework from Apache
Inspired by
Google MapReduce
GFS (Google File System)
HDFS
Map/Reduce
InfoSphere BigInsights: Platform for volume, variety,
velocity
Enhanced Hadoop
foundation
Analytics
Text analytics & tooling
Application accelerators
Usability
Web console
Spreadsheet-style tool
Ready-made “apps”
Enterprise Class
Storage, security, cluster
management
Integration
Connectivity to Netezza,
DB2, JDBC databases, etc
Apache
Hadoop
Basic Edition
Enterprise Edition
Licensed
Application accelerators
Pre-built applications
Text analytics
Spreadsheet-style tool
RDBMS, warehouse connectivity
Administrative tools, security
Eclipse development tools
Performance enhancements
. . . .
Free download
Integrated install
Online InfoCenter
BigData Univ.
Breadth of capabilities
Enterprise class
Can also run on top of
Spreadsheet-style Analysis
Web-based analysis
and visualization
Spreadsheet-like
interface
Define and manage
long running data
collection jobs
Analyze content of
the text on the
pages that have
been retrieved
Build a Big Data Program – MapReduce example
Eclipse tools
For Jaql, Hive, Pig, Java MapReduce, BigSheets
plug-ins, text analytics, etc.
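The MapReduce pattern can be sketched as a word count in plain Python. This is not Hadoop or BigInsights code. The function names `map_phase`, `shuffle`, `reduce_phase`, and `word_count` are illustrative, and everything runs in one process instead of across a cluster.

```python
from collections import defaultdict


def map_phase(document):
    """Map: emit a (word, 1) pair for every word in one input record."""
    for word in document.lower().split():
        yield word, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key, as the framework would do."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: combine all values that share one key."""
    return key, sum(values)


def word_count(documents):
    pairs = (pair for doc in documents for pair in map_phase(doc))
    return dict(reduce_phase(key, values) for key, values in shuffle(pairs).items())


counts = word_count(["big data is big", "data in motion"])
print(counts)
```

In a real cluster the map and reduce calls run in parallel on different nodes. Only the shuffle step needs to move data between them.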
Jaql – IBM’s programming language in the Hadoop world
• Jaql is a complete solutions environment
supporting all other BigInsights components
Integration point for
various analytics
– Text analytics
– Statistical analysis
– Machine learning
– Ad-hoc analysis
Integration point for
various data sources
– Local and distributed file
systems
– NoSQL databases
– Content repositories
– Relational sources
(warehouses, operational
databases)
BigInsights
Text Analytics
Statistical Analysis (R module)
Machine Learning (SystemML)
Ad-Hoc Analysis (BigSheets)
Integration (DB2, Netezza, Streams, …)
Jaql
Jaql I/O, Jaql Core Operators, Jaql Modules
DFS, NoSQL, RDBMS, File System
3 – Simplify your warehouse – SIGNIFICANTLY
Customer need
• Improve DWH performance
• Reduce DWH administration costs
Value statement
• Speed: 10 – 100x better performance
• Simplicity: Administration costs reduced by 75% - 90%
• Scalability
• Smart system
• In-database analytics
• Out-of-the-box integration with SPSS
Solution
• IBM Netezza, renamed to
IBM PureData System for Analytics
Analyst: I need to evaluate the possible relationship between client salary and overdrafts.
IT: OK. We have to evaluate a lot of statistics, set the correct DB indexes and DB partitioning. It will take us 5 days.
IT: Done. You can run your analytical query.
Analyst: Great. Thanks a lot. I’m going to check the results.
Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective.
IT: Ohhh, welcome dear friend. Understood. So, it’s … another 5 days of our work.
Analyst: Noooo!!! It’s not possible to work here!
Analyst: I need to evaluate the possible relationship between client salary and overdrafts. I will use Netezza.
Analyst: Great. I can see some nice correlations here. Now I need to look at it from a different perspective. With Netezza I can run the query immediately. The response will come back just as quickly.
IT can do something else
– much more useful
Built-In Expertise Makes This as Simple as an Appliance
Dedicated device
Optimized for purpose
Complete solution
Fast installation
Very easy operation
Standard interfaces
Low cost
IBM Netezza was renamed to IBM PureData System for Analytics
in October 2012
Netezza Genesis in T-Mobile CZ
Proof-Of-Concept Project
– New Enterprise Data Warehouse platform selection
– Comparison of existing and other platforms
– Selection Criteria
• Performance
• Operational Savings
… and the winner was: Netezza
Netezza Genesis in T-Mobile CZ
Expectations
Significant response improvement:
Faster platform means better reports response
Direct Data Availability
Higher trust in data, one version of the truth
Aggregation reduction
Any attribute available
Operational Benefits
Storage savings (no data replicas)
Administration cost reduction (DBA)
Infrastructure Simplification
Lower environment complexity
Netezza Genesis in T-Mobile CZ
Actual Status
All relevant ETL processing redesigned
Parallel run on the original and Netezza platforms finished
Netezza is now the only primary platform
Report                                          Original Platform   Netezza
Workflow Reporting                              2 hours             1 minute
Invoicing and Payments reporting
  Payment discipline of current month invoices  33 minutes          17 seconds
  Overdue Debt of Invoices – in Current Month   10 hours            23 seconds
  Average Monthly Invoice Figures               50 minutes          38 seconds
RESPONSE TIME MASSIVELY IMPROVED
Real Netezza experience from T-Mobile Czech Rep.
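The response times reported above can be turned into speedup factors with simple arithmetic. The sketch below only converts the slide's own figures to seconds and divides them. The dictionary name `MEASUREMENTS` is illustrative.

```python
# (original platform, Netezza) response times in seconds, taken from the figures above.
MEASUREMENTS = {
    "Workflow Reporting": (2 * 3600, 60),
    "Payment discipline of current month invoices": (33 * 60, 17),
    "Overdue Debt of Invoices - in Current Month": (10 * 3600, 23),
    "Average Monthly Invoice Figures": (50 * 60, 38),
}

# Speedup factor = original response time / Netezza response time.
speedups = {
    name: original / netezza
    for name, (original, netezza) in MEASUREMENTS.items()
}

for name, factor in speedups.items():
    print(f"{name}: about {factor:.0f}x faster")
```

This gives roughly 120x, 116x, 1565x and 79x for the four reports.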
4 – Reduce costs with Hadoop – SIGNIFICANTLY
Customer need
• Too much data => too expensive to store and maintain
• A big portion is kept “just in case”
• Data volume keeps growing => costs keep growing
• => too expensive to keep all data in a standard DWH
Value statement
• Leverage the architecture of parallel processing in Hadoop
• Hadoop uses cheap commodity HW
• Enable business users to keep working in the same or a similar way
Solution
• IBM InfoSphere BigInsights
BigInsights and the data warehouse
BigInsights
• Query-ready archive for “cold” warehouse data
Data Warehouse
Big Data
analytic
applications
Traditional
analytic
tools From Cognos BI
via Hive JDBC
Application
SQL interface Engine
InfoSphere BigInsights
Hive tables, HBase tables, CSV files
Data Sources
SQL Language
JDBC / ODBC Driver
JDBC / ODBC Server
Future: The SQL interface . . . .
• Rich SQL query capabilities
– SQL '92 and 2011 features
– Correlated subqueries
– Windowed aggregates
• SQL access to all data stored in
InfoSphere BigInsights
• Robust JDBC/ODBC support
• Take advantage of key features of
each data source
• Leverage MapReduce parallelism
or achieve low latency
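Two of the SQL features named above, correlated subqueries and windowed aggregates, can be illustrated with a small sketch. It uses Python's built-in `sqlite3` module as a stand-in engine, not the BigInsights SQL interface. The `invoices` table and its rows are made up for illustration, and the windowed query assumes an SQLite build with window-function support (3.25 or later).

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE invoices (customer TEXT, month TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO invoices VALUES (?, ?, ?)",
    [
        ("A", "2013-01", 100.0),
        ("A", "2013-02", 300.0),
        ("B", "2013-01", 50.0),
        ("B", "2013-02", 70.0),
    ],
)

# Correlated subquery: invoices above that same customer's average amount.
above_avg = conn.execute(
    """
    SELECT customer, month, amount
    FROM invoices AS i
    WHERE amount > (SELECT AVG(amount) FROM invoices WHERE customer = i.customer)
    ORDER BY customer, month
    """
).fetchall()

# Windowed aggregate: running total per customer, ordered by month.
running = conn.execute(
    """
    SELECT customer, month,
           SUM(amount) OVER (PARTITION BY customer ORDER BY month) AS running_total
    FROM invoices
    ORDER BY customer, month
    """
).fetchall()

print(above_avg)
print(running)
```

The same statements could in principle be sent through a JDBC/ODBC driver by a reporting tool. Only the connection would change.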
5 – Analyze Streaming Data
Customer need
• Process and leverage streaming data
• Select valuable data from data stream for
future processing
• Quickly process data that becomes useless if it is
not processed immediately
Value statement
• React in real time to seize an opportunity
before it expires
• Periodically adjust streaming models based
on analysis of data at rest
Solution
• IBM InfoSphere Streams
Streams Computing
Streaming Data
Sources
ACTION
Why and when to use InfoSphere
Streams?
Sensors
Environmental, Industrial, GPS, …
Images, Videos, …
Data Exhaust
Network data
system logs (web server, app server), …
High-rate transaction data
Financial transactions
CDRs
Isolation: processing in isolation
… or in limited windows (time / number of records)
Non-traditional formats included: spatial data, images, text, voice, …
Integration challenges:
Different connection methods
Different data rates
Different processing requirements
Multiple processing nodes: volume / rate very high => scalability required
Sub-millisecond latency: immediate analysis and response
Store & mine approach doesn’t work: because of very high volume of data (and its rates)
At least 2 of the criteria above should be fulfilled
Applications that need on-the-fly processing, filtering and analysis of streaming data
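Processing "in limited windows" by number of records can be sketched in plain Python. This is not InfoSphere Streams code. The readings, the window size, and the threshold are made-up values, and `windowed_average` and `alerts` are illustrative names.

```python
from collections import deque


def windowed_average(stream, window_size):
    """Yield the average of the last `window_size` records after each new record."""
    window = deque(maxlen=window_size)  # the oldest record drops out automatically
    for value in stream:
        window.append(value)
        yield sum(window) / len(window)


def alerts(stream, window_size, threshold):
    """Yield (position, average) whenever the windowed average exceeds the threshold."""
    for position, average in enumerate(windowed_average(stream, window_size)):
        if average > threshold:
            yield position, average


readings = [10, 12, 11, 30, 35, 33, 12, 10]
found = list(alerts(readings, window_size=3, threshold=20))
print(found)
```

Only the current window is held in memory, so the full stream never has to be stored before it is analyzed.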
Streams and BigInsights - Integrated Analytics on Data in Motion &
Data at Rest
1. Data Ingest
Data Integration,
data mining,
machine learning,
statistical modeling
Visualization of real-
time and historical
insights
3. Adaptive Analytics Model
Data ingest,
preparation, online
analysis, model
validation
Data
2. Bootstrap/Enrich
Control flow
InfoSphere
BigInsights,
Database &
Warehouse
InfoSphere
Streams
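The loop between data at rest and data in motion can be sketched as follows: a model is fitted on stored history, applied to the stream, and then refreshed once new data has been ingested. This is a deliberately simple stand-in (a mean plus standard deviation threshold), not the models used by Streams or BigInsights. All names and values are hypothetical.

```python
from statistics import mean, pstdev


def fit_threshold(history, k=2.0):
    """'At rest' step: derive an alert threshold from stored historical values."""
    return mean(history) + k * pstdev(history)


def score_stream(stream, threshold):
    """'In motion' step: flag incoming values that exceed the current threshold."""
    return [value for value in stream if value > threshold]


# Bootstrap the streaming model from stored history.
history = [10, 12, 11, 13, 9]
first_threshold = fit_threshold(history)

# Score new streaming data, then ingest it into the store.
incoming = [11, 25, 12]
flagged = score_stream(incoming, first_threshold)
history.extend(incoming)

# Adaptive step: refit the model on the enlarged history.
second_threshold = fit_threshold(history)
print(flagged, first_threshold, second_threshold)
```

The refit threshold reflects the newly ingested values, which is the point of adjusting the streaming model periodically.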
The Platform Advantage
BI /
Reporting
Exploration /
Visualization
Functional
App
Industry
App
Predictive
Analytics
Content
Analytics
Analytic Applications
IBM Big Data Platform
Systems
Management
Application
Development
Visualization
& Discovery
Accelerators
Information Integration & Governance
Hadoop
System
Stream
Computing
Data
Warehouse
BENEFITS IN DETAIL
Increase over
time
By moving from entry to a 2nd
and 3rd project
Lowering
deployment costs
Shared components
Integration
Points of leverage
Shared text analytics for
Streams and BigInsights
HDFS connectors (data
integration (ETL, …),
Streams)
Accelerators
Build across multiple
engines