This document discusses big data analytics in a heterogeneous world. It covers the variety of solutions available for big data analytics including changes in hardware, software, execution characteristics, and results. It also discusses building bridges across heterogeneous systems through comprehensive frameworks, reliable data management, versatile application services, and rich ecosystems.
4. BIG DATA ANALYTICS ISSUES
DEALING WITH VOLUME, VARIETY, VELOCITY, COSTS, SKILLS
• Volume: managing and harnessing massive data sets
• Variety: harmonizing silos of structured and unstructured data
• Velocity: keeping up with unpredictable data and query flows
• Costs: too expensive to acquire, operate, and expand
• Skills: lack of adequate skills for popular APIs
5. BIG DATA ANALYTICS MATURITY
FROM JARGON TO TRANSFORMATIONAL BUSINESS VALUE*
Enabling technologies: column store, Hadoop, NoSQL, in-memory, MPP.
Business value* of big data analytics, in increasing order of maturity: operational efficiencies, revenue growth, new strategies & business models.
*A McKinsey study, "Big data: The next frontier for innovation, competition, and productivity" (May 2011), found huge potential for big data analytics, with metrics as impressive as 60% improvement in retail operating margins, an 8% reduction in US national healthcare expenditures, and $150M in operational-efficiency savings in European economies.
6. BIG DATA ANALYTICS IN THE REAL WORLD
PREVALENT IN DATA INTENSIVE VERTICALS AND FUNCTIONAL AREAS
Verticals: banking, global capital markets, retail, government, healthcare, information providers.
Functional areas:
• Marketing analytics (digital channels): track visits and discover the best channel mix across telecom, email, social media, and search
• Sales analytics (deep correlations): predict risks by pattern-matching deal DNA (emails, meetings)
• Operational analytics (atomic machine data): analyze RFIDs, weblogs, SMS, and sensors to continuously surface operational inefficiencies
• Financial analytics (detailed simulations): liquidity and portfolio simulations, stress tests, error margins
8. CAUSAL LINKS: VARIETY, VELOCITY, VOLUME
• Variety: events data, transactional data, multimedia data, eCommerce data, graph data
• Velocity: microsecond arrival rates, continuous flows and/or bursts
• Volume: routinely petabytes
9. GROWING USER COMMUNITIES
Data scientists, business analysts, developers/programmers, administrators, business users, external consumers, and business processes.
10. HARDWARE IS SUPERIOR
Two directions: scale out with small server farms, or scale up with larger, partitioned servers.
Moving data up the storage hierarchy pays off at each step:
• Spinning disks to SSDs: 1.2x to 2x speedup
• SSDs to main memory: 4x to 200x speedup
• Main memory to CPU caches: 2x to 6x speedup
11. SOFTWARE EXPECTATIONS HAVE CHANGED
Expectations have shifted from traditional to contemporary along four axes: intelligence & automation, execution characteristics, performance & scalability, and results characteristics.
16. RESULTS CHARACTERISTICS
ACCURACY TOLERANCE FOCUS
Traditional (associated with SQL):
• Complex schemas, multiple applications
• Schema on write
• Atomic-level locking
• Consistency guarantees across system losses
• Declarative API, interactive
• Encapsulates elements of ACID
Contemporary (associated with NoSQL):
• Simple schemas, single application
• Schema on read
• Batch oriented
• Snapshot isolation
• Eventual consistency guarantees
• Procedural APIs
• Encapsulates elements of CAP
18. COMPREHENSIVE 3-TIER FRAMEWORK
COMMERCIAL AND/OR OPEN SOURCE
Eco-System: business intelligence tools, data integration tools, DBA tools, packaged apps
Application Services: in-database analytics, multi-lingual client APIs, federation, web enabled
Data Management: high performance, highly scalable, cloud enabled
19. RELIABLE DATA MANAGEMENT
A data management layer, built on a full-mesh high-speed interconnect, that can handle high-performance compression, batch, and ad-hoc analysis, and routinely scale to petabyte-class problems and thousands of concurrent jobs.
Typical characteristics:
• Massively parallel processing of complex queries
• In-memory and on-disk optimizations
• Elastic resources for user communities
• ACID guarantees
• Data variety
• Information lifecycle management
• User-friendly automation tools
• File systems (schema-free) and/or DBMS structures (schema-specific)
20. DATA MANAGEMENT INFRASTRUCTURE
ROBUST, SCALABLE, HIGH PERFORMANCE
The workflow runs from data discovery (data scientists) to application modeling (business analysts) to reports/dashboards (BI programmers) to business decisions (business end users), with infrastructure management (DBAs) underneath, over a full-mesh high-speed interconnect.
• Dynamic, elastic MPP grid
– Grow, shrink, and provision on demand
– Heavy parallelization
• Load, prepare, mine, and report in a workflow
– Privacy through isolation of resources
– Collaboration through sharing of results/data via sharing of resources
21. VERSATILE APPLICATION SERVICES
Programming APIs: Python, ADO.NET, PERL, PHP, Ruby, Java, C++, and a Web Services API.
Application services:
• In-database analytics plug-ins: SQL, PMML, C++, JAVA, …
• Comprehensive declarative and procedural APIs
• In-database analytics plug-in APIs
• In-database web services
• Query and data federation APIs
• Multi-lingual client APIs
22. VERSATILE APPLICATION SERVICES
RICH ALGORITHMS CLOSE TO DATA
Supported data shapes for in-database algorithms: scalar to scalar; scalar sets to aggregate; scalar sets to dimensional aggregates; scalar sets to multi-attribute (bulk); multi-attribute (bulk) to multi-attribute (bulk). Multi-lingual APIs are available for both deployment modes:
• In-database + in-process: user DLLs ("A", "B") are dynamically loaded as shared libraries into the Sybase IQ process and run in memory
– Highest possible performance
– Incurs security risks, but manageable via privileges
– Incurs robustness risks, but manageable via multiplex
• In-database + out-of-process: user DLLs are loaded into a separate library access process and reached via RPC calls
– Lower security risks
– Lower robustness risks
– Lower performance than in-process, but better than out of database
23. VERSATILE APPLICATION SERVICES
NATIVE MAPREDUCE
For stocks in the enterprise software sector, find the max relative strength of a stock for a trading day*.

Input (k1, v1): one row per (30-min interval, ticker symbol) with tick values for two days:

30-min interval | Ticker Symbol | TickValue Day 1 | TickValue Day 2
9:30 am  | SAP  | 51   | 52.4
9:30 am  | ORCL | 31   | 28.2
9:30 am  | TDC  | 22   | 21.3
10:00 am | SAP  | 50.9 | 53.1
10:00 am | TDC  | 21.8 | 20.9
10:00 am | ORCL | 29.4 | 27.1

Map Fn output (k2, v2): per (ticker symbol, 30-min interval), weighted variance = (a given stock's variance) / (average variance across all "N" stocks):

SAP  9:30 am:  +1.4 / (SUM(+1.4, -2.8, -0.7, …) / "N" stocks)
SAP  10:00 am: +2.2 / (SUM(+2.2, -2.3, -1.1, …) / "N" stocks)
ORCL 9:30 am:  -2.8 / (SUM(+1.4, -2.8, -0.7, …) / "N" stocks)
ORCL 10:00 am: -2.3 / (SUM(+2.2, -2.3, -1.1, …) / "N" stocks)
TDC  9:30 am:  -0.7 / (SUM(+1.4, -2.8, -0.7, …) / "N" stocks)
TDC  10:00 am: -1.1 / (SUM(+2.2, -2.3, -1.1, …) / "N" stocks)

Reduce Fn output (v3): per ticker symbol, the max absolute weighted variance:

SAP:  Max(ABS(9:30 Wt Var), ABS(10:00 Wt Var), …)
ORCL: Max(ABS(9:30 Wt Var), ABS(10:00 Wt Var), …)
TDC:  Max(ABS(9:30 Wt Var), ABS(10:00 Wt Var), …)

*Calculate the max variance for the day by comparing each 30-min interval's tick values across two days (the trading day and the day before), weighted by the average variance of all stocks for each 30-min interval.
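The worked example above can be sketched in plain Python (not the Sybase IQ native MapReduce API); the function names and in-memory layout are illustrative only:

```python
# Plain-Python sketch of the slide's map/reduce: per 30-min interval, weight each
# stock's day-over-day tick variance by the average variance across all N stocks;
# then, per ticker, keep the max absolute weighted variance.
from collections import defaultdict

# (interval, ticker, tick value day 1, tick value day 2) -- sample rows from the slide
ticks = [
    ("9:30",  "SAP",  51.0, 52.4),
    ("9:30",  "ORCL", 31.0, 28.2),
    ("9:30",  "TDC",  22.0, 21.3),
    ("10:00", "SAP",  50.9, 53.1),
    ("10:00", "TDC",  21.8, 20.9),
    ("10:00", "ORCL", 29.4, 27.1),
]

def map_fn(rows):
    """Emit (ticker, (interval, weighted variance)) pairs."""
    by_interval = defaultdict(list)
    for interval, ticker, day1, day2 in rows:
        by_interval[interval].append((ticker, day2 - day1))  # per-stock variance
    for interval, pairs in by_interval.items():
        avg = sum(v for _, v in pairs) / len(pairs)          # avg variance of all N stocks
        for ticker, v in pairs:
            yield ticker, (interval, v / avg)                # weighted variance

def reduce_fn(mapped):
    """Per ticker, the max absolute weighted variance (the 'relative strength')."""
    best = defaultdict(float)
    for ticker, (_, wv) in mapped:
        best[ticker] = max(best[ticker], abs(wv))
    return dict(best)

result = reduce_fn(map_fn(ticks))
```

The map stage partitions by interval and the reduce stage by ticker, mirroring the PARTITION BY clauses in the declarative version of this query.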
24. VERSATILE APPLICATION SERVICES
NATIVE MAPREDUCE – DECLARATIVE WAY
For stocks in the enterprise software sector, find the max relative strength of a stock for a trading day.
• Map function declaration:
  CREATE PROCEDURE MapVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float, a4 float))
  RESULT SET YZ (b1 char, b2 datetime, b3 float)
• Reduce function declaration:
  CREATE PROCEDURE RedMaxVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float))
  RESULT SET YZ (b1 char, b2 float)
• Query:
  SELECT RedMaxVarTPF.TickSymb, RedMaxVarTPF.MaxVar
  FROM RedMaxVarTPF (TABLE (SELECT MapVarTPF.TickSymb, MapVarTPF.30MinIntTime, MapVarTPF.Var
                            FROM MapVarTPF (TABLE (SELECT TickDataTab.TickSymb, TickDataTab.30MinIntTime,
                                                          TickDataTab.30MinValDay1, TickDataTab.30MinValDay2
                                                   FROM TickDataTab)
                                            OVER (PARTITION BY TickDataTab.30MinInt)))
                     OVER (PARTITION BY MapVarTPF.TickSymb))
  ORDER BY RedMaxVarTPF.TickSymb
• Native MapReduce parallel execution workflow: MapVarTPF is partitioned to 15 parallel instances, RedMaxVarTPF to 25 parallel instances, and the SQL query collates the output using 1 node, all over the SAN fabric.
• Native MapReduce with unstructured data: native MapReduce can just as easily be applied to unstructured data, e.g. text and multimedia stored in the DBMS, or to unstructured data brought into the DBMS at execution time from external files.
25. RICH ECO-SYSTEM
From source data to answers: data preparation (event processing, data integration) feeds the DBMS / filesystem, and data usage (data federation, business intelligence) turns it into answers. The eco-system around the data store includes:
• Data modeling / database design tools
• Business intelligence tools
• Data integration tools
• Data mining tools
• Application tools
• DBA tools
26. RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE I
Feature: client-side federation. Join data from the DBMS and Hadoop at the client application level.
Characteristics:
• Client tool capable of querying both the DBMS and Hadoop
• Better performance when results from the sources are pre-computed/pre-aggregated
Big data use cases:
• Ideal for bringing together big data analytics pre-computations from different domains
• Example, in telecommunications: the DBMS holds aggregated customer loyalty data and Hadoop holds aggregated network utilization data; Quest Toad for Cloud can bring data from both sources, linking customer loyalty to network utilization or network faults (e.g. dropped calls)
Topology: DBMS and Hadoop/Hive, bridged by Quest Toad for Cloud at the client.
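In miniature, the client-side federation pattern reduces to joining two already-aggregated result sets at the client; the region keys and metric values below are invented for illustration:

```python
# Toy client-side federation: each source pre-aggregates its own data; the
# client joins the two result sets on a shared key (region). All values invented.
loyalty = {"north": 8.1, "south": 6.4, "west": 7.2}           # from the DBMS
dropped_calls = {"north": 0.02, "south": 0.11, "west": 0.04}  # from Hadoop/Hive

# Join on region at the client, linking loyalty to network faults
federated = {
    region: {"loyalty": loyalty[region], "dropped_call_rate": dropped_calls[region]}
    for region in loyalty.keys() & dropped_calls.keys()
}
```

This works well precisely because each side ships only small aggregates; the client never sees raw detail rows.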
27. RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE II
Feature: load Hadoop data into the DBMS column store. Extract, transform, and load data from HDFS (Hadoop Distributed File System) into DBMS schemas.
Characteristics:
• Extract and load subsets of HDFS data into the DBMS store: raw data from HDFS, or results of Hadoop MR jobs
• HDFS data stored in the DBMS is treated like any other DBMS data: it gets the ACID properties of a DBMS; can be indexed, joined, and parallelized; can be queried in an ad-hoc way; and is visible to BI and other client tools via the DBMS ANSI SQL API
• Currently, the bulk data transfer utility Sqoop (built by Cloudera) can be used to provide this ETL capability
Big data use cases:
• Ideal for combining subsets of HDFS unstructured data, or summaries of HDFS data, into the DBMS for mid- to long-term use in business reports
• Example, in eCommerce: clickstream data from weblogs stored in HDFS, and the outputs of Hadoop MR jobs on that data (to study browsing behavior), are ETL'd into the DBMS; transactional sales data already in the DBMS is then joined with the clickstream data to understand and predict customers' browse-to-buy behavior
Topology: Hadoop/Hive (clickstream data) feeds the DBMS (sales data) via Sqoop.
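As a toy stand-in for the Sqoop-style transfer, the sketch below loads clickstream rows (as if exported from HDFS) into a relational table, after which they behave like any other DBMS data; sqlite3 and the table/column names are stand-ins, not the actual product stack:

```python
# Toy ETL: extract rows "from HDFS" (a CSV export here), transform, and load
# them into a relational store. sqlite3 stands in for the DBMS column store.
import csv
import io
import sqlite3

hdfs_export = io.StringIO("sess1,/home,3\nsess2,/cart,7\n")  # pretend HDFS output

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clickstream (session TEXT, page TEXT, dwell_secs INT)")
rows = [(s, p, int(d)) for s, p, d in csv.reader(hdfs_export)]    # transform
db.executemany("INSERT INTO clickstream VALUES (?, ?, ?)", rows)  # load
db.commit()

# Once loaded, the data is ordinary DBMS data: indexable, joinable, ACID.
total_dwell = db.execute("SELECT SUM(dwell_secs) FROM clickstream").fetchone()[0]
```

The point of the pattern is the last line: after the load, standard SQL (and every BI tool that speaks it) applies unchanged.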
28. RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE III
Feature: join HDFS data with DBMS data on the fly. Fetch and join subsets of HDFS data on demand using SQL queries from the DBMS (data federation technique).
Characteristics:
• Scan and fetch specified data subsets from HDFS via a table UDF
• Called as part of a SQL query; output joinable with DBMS data
• Multiple simultaneous UDF calls possible; sample UDFs provided in JAVA and C++
• HDFS data is not stored in the DBMS: it is fetched into DBMS in-memory tables, ACID properties are not applicable, and for repeated use the fetched data should be put in tables
• Visible to BI and other client tools via the ANSI SQL API only
Big data use cases:
• Ideal for combining subsets of HDFS data with DBMS data for operational (transient) business reports
• Example, in retail: detailed point-of-sale (POS) data is stored in HDFS; at fixed intervals the DBMS EDW fetches POS data for specific hot-selling SKUs from HDFS and combines it with inventory data in the DBMS to predict and prevent inventory "stockouts"
Topology: Hadoop/HDFS (POS data) joined to the DBMS (inventory data) via the UDF bridge.
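Conceptually, the UDF bridge behaves like a function call inside a query: fetch a transient subset from HDFS, then join it with resident data. `fetch_pos_from_hdfs` and all the SKU figures below are made up for the sketch:

```python
# Sketch of the fetch-and-join-on-the-fly pattern. The 'table UDF' is a plain
# function returning rows; nothing from HDFS is persisted on the DBMS side.
def fetch_pos_from_hdfs(skus):
    """Stand-in table UDF: (sku, units_sold) rows for the requested hot SKUs."""
    pos_in_hdfs = [("sku1", 120), ("sku2", 45), ("sku3", 300)]
    return [(sku, sold) for sku, sold in pos_in_hdfs if sku in skus]

inventory = {"sku1": 150, "sku3": 100}  # resident DBMS table: sku -> units on hand

# Join the transient POS subset with DBMS inventory to flag likely stockouts
at_risk = [
    sku for sku, sold in fetch_pos_from_hdfs(set(inventory))
    if sold > inventory[sku]            # selling faster than stock on hand
]
```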
29. RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE IV
Feature: combine results of Hadoop MR jobs with DBMS data on the fly. Initiate and join results of Hadoop MR jobs on demand using SQL queries from the DBMS (query federation technique).
Characteristics:
• Trigger Hadoop MR jobs and fetch their results via a table UDF
• Called as part of a Sybase IQ SQL query; output joinable with Sybase IQ data
• No multiple simultaneous UDF calls; sample UDFs provided in JAVA only
• HDFS data is not stored in the DBMS: it is fetched into DBMS in-memory tables, ACID properties are not applicable, and for repeated use the fetched data should be put in tables
• Visible to BI and other client tools via the DBMS ANSI SQL API only
Big data use cases:
• Ideal for combining the results of Hadoop MR jobs with DBMS data for operational (transient) business reports
• Example, in utilities: smart meter and smart grid data can be combined for load monitoring and demand forecasting; smart grid transmission quality data (multi-attribute time series) stored in HDFS is computed via Hadoop MR jobs triggered from the DBMS and combined with smart meter data stored in the DBMS to analyze demand and workload
Topology: Hadoop/HDFS (smart grid transmission data) joined to the DBMS (smart meter consumption data) via the UDF bridge.
30. RICH ECO-SYSTEM
DBMS <–> PREDICTIVE TOOLS BRIDGE
Express complex computations in the industry-standard Predictive Model Markup Language (PMML) and plug the models in close to the data for execution. PMML models pass through a PMML preprocessor (convert & validate) and become PMML UDFs in a universal predictions plug-in inside the database server; SQL applications invoke them through the DBMS via the bridge.
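The shape of the PMML flow (model exchanged as XML, preprocessor converts and validates, scoring runs next to the data) can be mimicked in a few lines; the `<LinearModel>` element here is a made-up miniature, not real PMML:

```python
# Toy 'PMML' pipeline: parse a model exchanged as XML, validate/convert it into
# an executable scoring closure (standing in for the in-database UDF).
import xml.etree.ElementTree as ET

model_xml = """
<LinearModel target="score">
  <Intercept>1.0</Intercept>
  <Coef field="x1">2.0</Coef>
  <Coef field="x2">-0.5</Coef>
</LinearModel>
"""

def compile_model(xml_text):
    """'Preprocessor': parse and validate, then return a scoring function."""
    root = ET.fromstring(xml_text)
    intercept = float(root.findtext("Intercept"))
    coefs = {c.get("field"): float(c.text) for c in root.findall("Coef")}
    def score(row):                # executed close to the data
        return intercept + sum(w * row[f] for f, w in coefs.items())
    return score

score = compile_model(model_xml)
```

Real PMML defines many model types (trees, regressions, clustering, and so on); the design point is the same: the model travels as a document, and the engine compiles it into something it can execute next to the data.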
31. RICH ECO-SYSTEM
FUNDAMENTALS OF STREAMS TECHNOLOGY
Process data without storing it:
• Events arrive on input streams.
• Derived streams and windows: apply continuous query operators to one or more input streams to produce a new stream.
• Continuous queries create a new "derived" stream or window, expressed SQL-style: SELECT … FROM one or more input streams/windows WHERE … GROUP BY …
• Windows can have state: retention rules define how many events, or for how long, events are kept; opcodes in events can indicate insert/update/delete and can be applied to the window automatically.
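A continuous query with a stateful window can be sketched with a count-based retention rule (last three events per key); the window size and event values are invented:

```python
# Sketch of a derived stream: each arriving event updates per-key window state
# and immediately emits a new aggregate (a moving average here).
from collections import defaultdict, deque

windows = defaultdict(lambda: deque(maxlen=3))  # retention rule: keep last 3 events

def on_event(symbol, price):
    """Continuous query operator: per-symbol moving average over the window."""
    w = windows[symbol]
    w.append(price)                             # window state updates per event
    return symbol, sum(w) / len(w)              # derived-stream output

outputs = [on_event(s, p)
           for s, p in [("SAP", 50), ("SAP", 52), ("SAP", 54), ("SAP", 56)]]
```

Note that the query never "runs" as a whole: each event pushes a new result onto the derived stream as it arrives.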
32. RICH ECO-SYSTEM
STREAMS DATA PROCESSING VS TRADITIONAL DATA PROCESSING
SQL-to-CCL equivalents:
• Tables become windows on event streams
• Rows become events
• Columns become fields
• On-demand becomes event-driven: a SQL query runs when information is needed; a CCL query updates when information arrives
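The on-demand vs event-driven distinction can be shown with the same aggregate computed both ways; this is a sketch of the two execution models, not either engine's API:

```python
# SQL style recomputes over stored rows when asked; CCL style maintains the
# answer incrementally as each event arrives.
stored_rows = []      # SQL-side table
running_total = 0     # CCL-side continuously maintained result

def on_demand_total():
    """On-demand: scan the table when the query runs."""
    return sum(stored_rows)

def on_event(value):
    """Event-driven: the derived result updates the moment the event arrives."""
    global running_total
    stored_rows.append(value)   # persist the row for the on-demand path too
    running_total += value
    return running_total

for v in (5, 10, 20):
    on_event(v)
```

Both paths agree on the answer; they differ in when the work happens and in whether the raw rows must be kept at all.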
33. RICH ECO-SYSTEM
STREAMS PRE-PROCESSING
Why store big data when you can deal with small data? Pre-filter unnecessary data on the fly with streams technologies: the ESP engine sits in front of Hadoop/HDFS and the DBMS (memory and disk), emits alerts, actions, and updates as events pass through, and forwards only the data worth persisting.
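Pre-filtering in the ESP layer amounts to a predicate applied before anything is persisted; the threshold and sensor events below are invented:

```python
# Stream pre-filter: unneeded events are dropped on the fly, so only a fraction
# of the raw volume ever reaches Hadoop/HDFS or the DBMS.
def esp_filter(events, threshold=100):
    """Forward only the events worth persisting."""
    for sensor, reading in events:
        if reading > threshold:        # everything else never lands on disk
            yield sensor, reading

raw = [("s1", 20), ("s2", 140), ("s1", 99), ("s3", 250)]
persisted = list(esp_filter(raw))
```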
35. 3-LAYER LOGICAL INTEGRATION
STREAM PROCESSING <-> NoSQL <-> DBMS
Eco-System: BI tools, DI tools, DBA tools, data mining tools.
App Services: Web 2.0, Java, C/C++, SQL, federation.
Data layers:
• Unstructured data: ingest + persist (Hadoop, content management)
• Structured data (DBMS)
• Streaming data (ESP)
The heterogeneous world will require co-existence and playing nice!