This document discusses big data analytics in a heterogeneous world. It covers the variety of solutions available for big data analytics including changes in hardware, software, execution characteristics, and results. It also discusses building bridges across heterogeneous systems through comprehensive frameworks, reliable data management, versatile application services, and rich ecosystems.
4. BIG DATA ANALYTICS ISSUES
DEALING WITH VOLUME, VARIETY, VELOCITY, COSTS, SKILLS
• Volume: managing and harnessing massive data sets
• Variety: harmonizing silos of structured and unstructured data
• Velocity: keeping up with unpredictable data and query flows
• Costs: too expensive to acquire, operate, and expand
• Skills: lack of adequate skills for popular APIs
5. BIG DATA ANALYTICS MATURITY
FROM JARGON TO TRANSFORMATIONAL BUSINESS VALUE*
Enabling technologies: column store, Hadoop, NoSQL, in-memory, MPP.
Business value* of big data analytics, in increasing order of maturity: operational efficiencies, revenue growth, new strategies & business models.
*A McKinsey study, "Big data: The next frontier for innovation, competition, and productivity" (May 2011), found huge potential for big data analytics, with metrics as impressive as 60% improvement in retail operating margins, an 8% reduction in US national healthcare expenditures, and $150M in operational-efficiency savings in European economies.
6. BIG DATA ANALYTICS IN THE REAL WORLD
PREVALENT IN DATA INTENSIVE VERTICALS AND FUNCTIONAL AREAS
Verticals: banking, global capital markets, retail, government, healthcare, information providers.
Functional areas:
• Marketing analytics (digital channels): track visits and discover the best channel mix across telecom, email, social media, and search
• Sales analytics (deep correlations): predict risks by pattern-matching deal DNA (emails, meetings)
• Operational analytics (atomic machine data): analyze RFIDs, weblogs, SMS, and sensors to continuously surface operational inefficiencies
• Financial analytics (detailed simulations): liquidity and portfolio simulations, stress tests, error margins
8. CAUSAL LINKS: VARIETY, VELOCITY, VOLUME
• Variety: events data, transactional data, multimedia data, eCommerce data, graph data
• Velocity: microsecond arrival rates, continuous flows and/or bursts
• Volume: routinely petabytes
9. GROWING USER COMMUNITIES
Data scientists, business analysts, developers/programmers, administrators, business users, external consumers, and business processes.
10. HARDWARE IS SUPERIOR
Two directions: scale out with small server farms, or scale up with larger, partitioned servers.
Moving data up the storage hierarchy pays off at each step:
• Spinning disks to SSDs: 1.2x to 2x speedup
• SSDs to main memory: 4x to 200x speedup
• Main memory to CPU caches: 2x to 6x speedup
11. SOFTWARE EXPECTATIONS HAVE CHANGED
Expectations have shifted from traditional to contemporary along four axes: intelligence & automation, execution characteristics, performance & scalability, and results characteristics.
16. RESULTS CHARACTERISTICS
ACCURACY TOLERANCE FOCUS
Traditional (associated with SQL):
• Complex schemas, multiple applications
• Schema on write
• Atomic-level locking
• Consistency guarantees across system losses
• Declarative API, interactive
• Encapsulates elements of ACID
Contemporary (associated with NoSQL):
• Simple schemas, single application
• Schema on read
• Batch oriented
• Snapshot isolation
• Eventual consistency guarantees
• Procedural APIs
• Encapsulates elements of CAP
18. COMPREHENSIVE 3-TIER FRAMEWORK
COMMERCIAL AND/OR OPEN SOURCE
Eco-System: business intelligence tools, data integration tools, DBA tools, packaged apps
Application Services: in-database analytics, multi-lingual client APIs, federation, web enabled
Data Management: high performance, highly scalable, cloud enabled
19. RELIABLE DATA MANAGEMENT
A data management layer, built on a full-mesh high-speed interconnect, that can handle high-performance compression, batch, and ad-hoc analysis, and routinely scale to petabyte-class problems and thousands of concurrent jobs.
Typical characteristics:
• Massively parallel processing of complex queries
• In-memory and on-disk optimizations
• Elastic resources for user communities
• ACID guarantees
• Data variety
• Information lifecycle management
• User-friendly automation tools
• File systems (schema-free) and/or DBMS structures (schema-specific)
20. DATA MANAGEMENT INFRASTRUCTURE
ROBUST, SCALABLE, HIGH PERFORMANCE
The workflow runs from data discovery (data scientists) to application modeling (business analysts) to reports/dashboards (BI programmers) to business decisions (business end users), with infrastructure management (DBAs) underneath, over a full-mesh high-speed interconnect.
• Dynamic, elastic MPP grid
– Grow, shrink, and provision on demand
– Heavy parallelization
• Load, prepare, mine, and report in a workflow
– Privacy through isolation of resources
– Collaboration through sharing of results/data via sharing of resources
21. VERSATILE APPLICATION SERVICES
Programming APIs: Python, ADO.NET, PERL, PHP, Ruby, Java, C++, and a Web Services API.
Application services:
• In-database analytics plug-ins: SQL, PMML, C++, JAVA, …
• Comprehensive declarative and procedural APIs
• In-database analytics plug-in APIs
• In-database web services
• Query and data federation APIs
• Multi-lingual client APIs
22. VERSATILE APPLICATION SERVICES
RICH ALGORITHMS CLOSE TO DATA
Supported data shapes for in-database algorithms: scalar to scalar; scalar sets to aggregate; scalar sets to dimensional aggregates; scalar sets to multi-attribute (bulk); multi-attribute (bulk) to multi-attribute (bulk). Multi-lingual APIs are available for both deployment modes:
• In-database + in-process: user DLLs ("A", "B") are dynamically loaded as shared libraries into the Sybase IQ process and run in memory
– Highest possible performance
– Incurs security risks, but manageable via privileges
– Incurs robustness risks, but manageable via multiplex
• In-database + out-of-process: user DLLs are loaded into a separate library access process and reached via RPC calls
– Lower security risks
– Lower robustness risks
– Lower performance than in-process, but better than out of database
23. VERSATILE APPLICATION SERVICES
NATIVE MAPREDUCE
For stocks in the enterprise software sector, find the max relative strength of a stock for a trading day*.

Input (k1, v1): one row per (30-min interval, ticker symbol) with tick values for two days:

30-min interval | Ticker Symbol | TickValue Day 1 | TickValue Day 2
9:30 am  | SAP  | 51   | 52.4
9:30 am  | ORCL | 31   | 28.2
9:30 am  | TDC  | 22   | 21.3
10:00 am | SAP  | 50.9 | 53.1
10:00 am | TDC  | 21.8 | 20.9
10:00 am | ORCL | 29.4 | 27.1

Map Fn output (k2, v2): per (ticker symbol, 30-min interval), weighted variance = (a given stock's variance) / (average variance across all "N" stocks):

SAP  9:30 am:  +1.4 / (SUM(+1.4, -2.8, -0.7, …) / "N" stocks)
SAP  10:00 am: +2.2 / (SUM(+2.2, -2.3, -1.1, …) / "N" stocks)
ORCL 9:30 am:  -2.8 / (SUM(+1.4, -2.8, -0.7, …) / "N" stocks)
ORCL 10:00 am: -2.3 / (SUM(+2.2, -2.3, -1.1, …) / "N" stocks)
TDC  9:30 am:  -0.7 / (SUM(+1.4, -2.8, -0.7, …) / "N" stocks)
TDC  10:00 am: -1.1 / (SUM(+2.2, -2.3, -1.1, …) / "N" stocks)

Reduce Fn output (v3): per ticker symbol, the max absolute weighted variance:

SAP:  Max(ABS(9:30 Wt Var), ABS(10:00 Wt Var), …)
ORCL: Max(ABS(9:30 Wt Var), ABS(10:00 Wt Var), …)
TDC:  Max(ABS(9:30 Wt Var), ABS(10:00 Wt Var), …)

*Calculate the max variance for the day by comparing each 30-min interval's tick values across two days (the trading day and the day before), weighted by the average variance of all stocks for each 30-min interval.
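The worked example above can be sketched in plain Python (not the Sybase IQ native MapReduce API); the function names and in-memory layout are illustrative only:

```python
# Plain-Python sketch of the slide's map/reduce: per 30-min interval, weight each
# stock's day-over-day tick variance by the average variance across all N stocks;
# then, per ticker, keep the max absolute weighted variance.
from collections import defaultdict

# (interval, ticker, tick value day 1, tick value day 2) -- sample rows from the slide
ticks = [
    ("9:30",  "SAP",  51.0, 52.4),
    ("9:30",  "ORCL", 31.0, 28.2),
    ("9:30",  "TDC",  22.0, 21.3),
    ("10:00", "SAP",  50.9, 53.1),
    ("10:00", "TDC",  21.8, 20.9),
    ("10:00", "ORCL", 29.4, 27.1),
]

def map_fn(rows):
    """Emit (ticker, (interval, weighted variance)) pairs."""
    by_interval = defaultdict(list)
    for interval, ticker, day1, day2 in rows:
        by_interval[interval].append((ticker, day2 - day1))  # per-stock variance
    for interval, pairs in by_interval.items():
        avg = sum(v for _, v in pairs) / len(pairs)          # avg variance of all N stocks
        for ticker, v in pairs:
            yield ticker, (interval, v / avg)                # weighted variance

def reduce_fn(mapped):
    """Per ticker, the max absolute weighted variance (the 'relative strength')."""
    best = defaultdict(float)
    for ticker, (_, wv) in mapped:
        best[ticker] = max(best[ticker], abs(wv))
    return dict(best)

result = reduce_fn(map_fn(ticks))
```

The map stage partitions by interval and the reduce stage by ticker, mirroring the PARTITION BY clauses in the declarative version of this query.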
24. VERSATILE APPLICATION SERVICES
NATIVE MAPREDUCE – DECLARATIVE WAY
For stocks in the enterprise software sector, find the max relative strength of a stock for a trading day.
• Map function declaration:
  CREATE PROCEDURE MapVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float, a4 float))
  RESULT SET YZ (b1 char, b2 datetime, b3 float)
• Reduce function declaration:
  CREATE PROCEDURE RedMaxVarTPF (IN XY TABLE (a1 char, a2 datetime, a3 float))
  RESULT SET YZ (b1 char, b2 float)
• Query:
  SELECT RedMaxVarTPF.TickSymb, RedMaxVarTPF.MaxVar
  FROM RedMaxVarTPF (TABLE (SELECT MapVarTPF.TickSymb, MapVarTPF.30MinIntTime, MapVarTPF.Var
                            FROM MapVarTPF (TABLE (SELECT TickDataTab.TickSymb, TickDataTab.30MinIntTime,
                                                          TickDataTab.30MinValDay1, TickDataTab.30MinValDay2
                                                   FROM TickDataTab)
                                            OVER (PARTITION BY TickDataTab.30MinInt)))
                     OVER (PARTITION BY MapVarTPF.TickSymb))
  ORDER BY RedMaxVarTPF.TickSymb
• Native MapReduce parallel execution workflow: MapVarTPF is partitioned to 15 parallel instances, RedMaxVarTPF to 25 parallel instances, and the SQL query collates the output using 1 node, all over the SAN fabric.
• Native MapReduce with unstructured data: native MapReduce can just as easily be applied to unstructured data, e.g. text and multimedia stored in the DBMS, or to unstructured data brought into the DBMS at execution time from external files.
25. RICH ECO-SYSTEM
From source data to answers: data preparation (event processing, data integration) feeds the DBMS / filesystem, and data usage (data federation, business intelligence) turns it into answers. The eco-system around the data store includes:
• Data modeling / database design tools
• Business intelligence tools
• Data integration tools
• Data mining tools
• Application tools
• DBA tools
26. RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE I
Feature: client-side federation. Join data from the DBMS and Hadoop at the client application level.
Characteristics:
• Client tool capable of querying both the DBMS and Hadoop
• Better performance when results from the sources are pre-computed/pre-aggregated
Big data use cases:
• Ideal for bringing together big data analytics pre-computations from different domains
• Example, in telecommunications: the DBMS holds aggregated customer loyalty data and Hadoop holds aggregated network utilization data; Quest Toad for Cloud can bring data from both sources, linking customer loyalty to network utilization or network faults (e.g. dropped calls)
Topology: DBMS and Hadoop/Hive, bridged by Quest Toad for Cloud at the client.
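In miniature, the client-side federation pattern reduces to joining two already-aggregated result sets at the client; the region keys and metric values below are invented for illustration:

```python
# Toy client-side federation: each source pre-aggregates its own data; the
# client joins the two result sets on a shared key (region). All values invented.
loyalty = {"north": 8.1, "south": 6.4, "west": 7.2}           # from the DBMS
dropped_calls = {"north": 0.02, "south": 0.11, "west": 0.04}  # from Hadoop/Hive

# Join on region at the client, linking loyalty to network faults
federated = {
    region: {"loyalty": loyalty[region], "dropped_call_rate": dropped_calls[region]}
    for region in loyalty.keys() & dropped_calls.keys()
}
```

This works well precisely because each side ships only small aggregates; the client never sees raw detail rows.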
27. RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE II
Feature: load Hadoop data into the DBMS column store. Extract, transform, and load data from HDFS (Hadoop Distributed File System) into DBMS schemas.
Characteristics:
• Extract and load subsets of HDFS data into the DBMS store: raw data from HDFS, or results of Hadoop MR jobs
• HDFS data stored in the DBMS is treated like any other DBMS data: it gets the ACID properties of a DBMS; can be indexed, joined, and parallelized; can be queried in an ad-hoc way; and is visible to BI and other client tools via the DBMS ANSI SQL API
• Currently, the bulk data transfer utility Sqoop (built by Cloudera) can be used to provide this ETL capability
Big data use cases:
• Ideal for combining subsets of HDFS unstructured data, or summaries of HDFS data, into the DBMS for mid- to long-term use in business reports
• Example, in eCommerce: clickstream data from weblogs stored in HDFS, and the outputs of Hadoop MR jobs on that data (to study browsing behavior), are ETL'd into the DBMS; transactional sales data already in the DBMS is then joined with the clickstream data to understand and predict customers' browse-to-buy behavior
Topology: Hadoop/Hive (clickstream data) feeds the DBMS (sales data) via Sqoop.
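As a toy stand-in for the Sqoop-style transfer, the sketch below loads clickstream rows (as if exported from HDFS) into a relational table, after which they behave like any other DBMS data; sqlite3 and the table/column names are stand-ins, not the actual product stack:

```python
# Toy ETL: extract rows "from HDFS" (a CSV export here), transform, and load
# them into a relational store. sqlite3 stands in for the DBMS column store.
import csv
import io
import sqlite3

hdfs_export = io.StringIO("sess1,/home,3\nsess2,/cart,7\n")  # pretend HDFS output

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE clickstream (session TEXT, page TEXT, dwell_secs INT)")
rows = [(s, p, int(d)) for s, p, d in csv.reader(hdfs_export)]    # transform
db.executemany("INSERT INTO clickstream VALUES (?, ?, ?)", rows)  # load
db.commit()

# Once loaded, the data is ordinary DBMS data: indexable, joinable, ACID.
total_dwell = db.execute("SELECT SUM(dwell_secs) FROM clickstream").fetchone()[0]
```

The point of the pattern is the last line: after the load, standard SQL (and every BI tool that speaks it) applies unchanged.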
28. RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE III
Feature: join HDFS data with DBMS data on the fly. Fetch and join subsets of HDFS data on demand using SQL queries from the DBMS (data federation technique).
Characteristics:
• Scan and fetch specified data subsets from HDFS via a table UDF
• Called as part of a SQL query; output joinable with DBMS data
• Multiple simultaneous UDF calls possible; sample UDFs provided in JAVA and C++
• HDFS data is not stored in the DBMS: it is fetched into DBMS in-memory tables, ACID properties are not applicable, and for repeated use the fetched data should be put in tables
• Visible to BI and other client tools via the ANSI SQL API only
Big data use cases:
• Ideal for combining subsets of HDFS data with DBMS data for operational (transient) business reports
• Example, in retail: detailed point-of-sale (POS) data is stored in HDFS; at fixed intervals the DBMS EDW fetches POS data for specific hot-selling SKUs from HDFS and combines it with inventory data in the DBMS to predict and prevent inventory "stockouts"
Topology: Hadoop/HDFS (POS data) joined to the DBMS (inventory data) via the UDF bridge.
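Conceptually, the UDF bridge behaves like a function call inside a query: fetch a transient subset from HDFS, then join it with resident data. `fetch_pos_from_hdfs` and all the SKU figures below are made up for the sketch:

```python
# Sketch of the fetch-and-join-on-the-fly pattern. The 'table UDF' is a plain
# function returning rows; nothing from HDFS is persisted on the DBMS side.
def fetch_pos_from_hdfs(skus):
    """Stand-in table UDF: (sku, units_sold) rows for the requested hot SKUs."""
    pos_in_hdfs = [("sku1", 120), ("sku2", 45), ("sku3", 300)]
    return [(sku, sold) for sku, sold in pos_in_hdfs if sku in skus]

inventory = {"sku1": 150, "sku3": 100}  # resident DBMS table: sku -> units on hand

# Join the transient POS subset with DBMS inventory to flag likely stockouts
at_risk = [
    sku for sku, sold in fetch_pos_from_hdfs(set(inventory))
    if sold > inventory[sku]            # selling faster than stock on hand
]
```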
29. RICH ECO-SYSTEM
DBMS <–> HADOOP BRIDGE IV
Feature: combine results of Hadoop MR jobs with DBMS data on the fly. Initiate and join results of Hadoop MR jobs on demand using SQL queries from the DBMS (query federation technique).
Characteristics:
• Trigger Hadoop MR jobs and fetch their results via a table UDF
• Called as part of a Sybase IQ SQL query; output joinable with Sybase IQ data
• No multiple simultaneous UDF calls; sample UDFs provided in JAVA only
• HDFS data is not stored in the DBMS: it is fetched into DBMS in-memory tables, ACID properties are not applicable, and for repeated use the fetched data should be put in tables
• Visible to BI and other client tools via the DBMS ANSI SQL API only
Big data use cases:
• Ideal for combining the results of Hadoop MR jobs with DBMS data for operational (transient) business reports
• Example, in utilities: smart meter and smart grid data can be combined for load monitoring and demand forecasting; smart grid transmission quality data (multi-attribute time series) stored in HDFS is computed via Hadoop MR jobs triggered from the DBMS and combined with smart meter data stored in the DBMS to analyze demand and workload
Topology: Hadoop/HDFS (smart grid transmission data) joined to the DBMS (smart meter consumption data) via the UDF bridge.
30. RICH ECO-SYSTEM
DBMS <–> PREDICTIVE TOOLS BRIDGE
Express complex computations in the industry-standard Predictive Model Markup Language (PMML) and plug the models in close to the data for execution. PMML models pass through a PMML preprocessor (convert & validate) and become PMML UDFs in a universal predictions plug-in inside the database server; SQL applications invoke them through the DBMS via the bridge.
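The shape of the PMML flow (model exchanged as XML, preprocessor converts and validates, scoring runs next to the data) can be mimicked in a few lines; the `<LinearModel>` element here is a made-up miniature, not real PMML:

```python
# Toy 'PMML' pipeline: parse a model exchanged as XML, validate/convert it into
# an executable scoring closure (standing in for the in-database UDF).
import xml.etree.ElementTree as ET

model_xml = """
<LinearModel target="score">
  <Intercept>1.0</Intercept>
  <Coef field="x1">2.0</Coef>
  <Coef field="x2">-0.5</Coef>
</LinearModel>
"""

def compile_model(xml_text):
    """'Preprocessor': parse and validate, then return a scoring function."""
    root = ET.fromstring(xml_text)
    intercept = float(root.findtext("Intercept"))
    coefs = {c.get("field"): float(c.text) for c in root.findall("Coef")}
    def score(row):                # executed close to the data
        return intercept + sum(w * row[f] for f, w in coefs.items())
    return score

score = compile_model(model_xml)
```

Real PMML defines many model types (trees, regressions, clustering, and so on); the design point is the same: the model travels as a document, and the engine compiles it into something it can execute next to the data.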
31. RICH ECO-SYSTEM
FUNDAMENTALS OF STREAMS TECHNOLOGY
Process data without storing it:
• Events arrive on input streams.
• Derived streams and windows: apply continuous query operators to one or more input streams to produce a new stream.
• Continuous queries create a new "derived" stream or window, expressed SQL-style: SELECT … FROM one or more input streams/windows WHERE … GROUP BY …
• Windows can have state: retention rules define how many events, or for how long, events are kept; opcodes in events can indicate insert/update/delete and can be applied to the window automatically.
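A continuous query with a stateful window can be sketched with a count-based retention rule (last three events per key); the window size and event values are invented:

```python
# Sketch of a derived stream: each arriving event updates per-key window state
# and immediately emits a new aggregate (a moving average here).
from collections import defaultdict, deque

windows = defaultdict(lambda: deque(maxlen=3))  # retention rule: keep last 3 events

def on_event(symbol, price):
    """Continuous query operator: per-symbol moving average over the window."""
    w = windows[symbol]
    w.append(price)                             # window state updates per event
    return symbol, sum(w) / len(w)              # derived-stream output

outputs = [on_event(s, p)
           for s, p in [("SAP", 50), ("SAP", 52), ("SAP", 54), ("SAP", 56)]]
```

Note that the query never "runs" as a whole: each event pushes a new result onto the derived stream as it arrives.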
32. RICH ECO-SYSTEM
STREAMS DATA PROCESSING VS TRADITIONAL DATA PROCESSING
SQL-to-CCL equivalents:
• Tables become windows on event streams
• Rows become events
• Columns become fields
• On-demand becomes event-driven: a SQL query runs when information is needed; a CCL query updates when information arrives
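The on-demand vs event-driven distinction can be shown with the same aggregate computed both ways; this is a sketch of the two execution models, not either engine's API:

```python
# SQL style recomputes over stored rows when asked; CCL style maintains the
# answer incrementally as each event arrives.
stored_rows = []      # SQL-side table
running_total = 0     # CCL-side continuously maintained result

def on_demand_total():
    """On-demand: scan the table when the query runs."""
    return sum(stored_rows)

def on_event(value):
    """Event-driven: the derived result updates the moment the event arrives."""
    global running_total
    stored_rows.append(value)   # persist the row for the on-demand path too
    running_total += value
    return running_total

for v in (5, 10, 20):
    on_event(v)
```

Both paths agree on the answer; they differ in when the work happens and in whether the raw rows must be kept at all.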
33. RICH ECO-SYSTEM
STREAMS PRE-PROCESSING
Why store big data when you can deal with small data? Pre-filter unnecessary data on the fly with streams technologies: the ESP engine sits in front of Hadoop/HDFS and the DBMS (memory and disk), emits alerts, actions, and updates as events pass through, and forwards only the data worth persisting.
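Pre-filtering in the ESP layer amounts to a predicate applied before anything is persisted; the threshold and sensor events below are invented:

```python
# Stream pre-filter: unneeded events are dropped on the fly, so only a fraction
# of the raw volume ever reaches Hadoop/HDFS or the DBMS.
def esp_filter(events, threshold=100):
    """Forward only the events worth persisting."""
    for sensor, reading in events:
        if reading > threshold:        # everything else never lands on disk
            yield sensor, reading

raw = [("s1", 20), ("s2", 140), ("s1", 99), ("s3", 250)]
persisted = list(esp_filter(raw))
```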
35. 3-LAYER LOGICAL INTEGRATION
STREAM PROCESSING <-> NoSQL <-> DBMS
Eco-System: BI tools, DI tools, DBA tools, data mining tools.
App Services: Web 2.0, Java, C/C++, SQL, federation.
Data layers:
• Unstructured data: ingest + persist (Hadoop, content management)
• Structured data (DBMS)
• Streaming data (ESP)
The heterogeneous world will require co-existence and playing nice!