Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

© 2013 IBM Corporation
The Data Scientist's Workplace of the Future - Data Science Connect, 22nd of July, 2014
Romeo Kienzler
IBM Center of Excellence for Data Science, Cognitive Systems and BigData
(A joint-venture between IBM Research Zurich and IBM Innovation Center DACH)
Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg

What is DataScience?
Source: Statoo.com http://slidesha.re/1kmNiX0

DataScience at present
● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
  ● SQL (42%)
  ● R (33%)
  ● Python (26%)
  ● Excel (25%)
  ● Java, Ruby, C++ (17%)
  ● SPSS, SAS (9%)
● Limitations (Single Node usage)
  ● Main Memory
  ● CPU <> Main Memory Bandwidth
  ● CPU
  ● Storage <> Main Memory Bandwidth (either Single node or SAN)

What is BIG data?
Big Data
Hadoop

What is BIG data?
Business Intelligence
Data Warehouse

BigData == Hadoop?
Hadoop BigData
Hadoop

What is beyond “Data Warehouse”?
Data Lake
Data Warehouse

First “BigData” UseCase ?
● Google Index
  ● 40 x 10^9 = 40.000.000.000 => 40 billion pages indexed
  ● Will break 100 PB barrier soon
  ● Derived from MapReduce
  ● Now “Caffeine”, based on “Percolator”
  ● Incremental vs. batch
  ● In-Memory vs. disk

Map-Reduce → Hadoop → BigInsights

BigData Analytics – Predictive Analytics
"sometimes it's not who has the best algorithm that wins; it's who has the most data."
(C) Google Inc.
The Unreasonable Effectiveness of Data¹
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
No Sampling => Work with full dataset => No p-Value/z-Scores anymore

Aggregated Bandwidth between CPU, Main Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
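
The arithmetic behind these figures can be sketched as follows. It assumes, as the slide implies, that every node contributes 10 GByte/s and that the 1 TB is spread evenly, so the aggregated bandwidth grows linearly with the node count. The function name `scan_seconds` is made up for this illustration.

```python
def scan_seconds(data_bytes: float, nodes: int,
                 per_node_bytes_per_s: float = 10e9) -> float:
    """Time to read data_bytes once when each node scans its own share in parallel."""
    if nodes < 1:
        raise ValueError("nodes must be >= 1")
    # Aggregated bandwidth = nodes * per-node bandwidth (assumed perfectly linear).
    return data_bytes / (nodes * per_node_bytes_per_s)


ONE_TB = 1e12  # decimal terabyte, as on the slide

for n in (1, 10, 100, 1000):
    print(f"{n:>4} nodes: {scan_seconds(ONE_TB, n):g} s")
```

In practice, coordination overhead and skewed data placement keep real clusters below this ideal line.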

Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,
3TB SEAGATE Barracuda 7200.14
< CHF 500
● 100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD
● MTBF ~ 365 d > 1,5 d
Source: http://www.cloudcomputingpatterns.org/Watchdog

“Elastic” Scale-Out
Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload

Linear scale-out of CPU Cores, Storage, Memory
Source: http://www.cloudcomputingpatterns.org/Elastic_Platform

How do Databases Scale-Out?
Shared Disk Architectures

How do Databases Scale-Out?
Shared Nothing Architectures

Hadoop?
Shared Nothing Architecture?
Shared Disk Architecture?
http://bluemix.net/
6 Node Hadoop Cluster 4 Free

Data Science on Hadoop
SQL (42%)
R (33%)
Python (26%)
Excel (25%)
Java, Ruby, C++ (17%)
SPSS, SAS (9%)
Data Science Hadoop

SQL on Hadoop
● IBM BigSQL (ANSI 92 compliant)
● Hive, Presto
● Cloudera Impala
● Lingual
● Shark
● ...
SQL Hadoop

Two types of SQL Engines
● Type I
  ● Compiler and Optimizer SQL->MapReduce
● Type II
  ● Brings own distributed execution engine on Data Nodes
  ● Brings own Task Scheduler
● The Hadoop SQL Ecosystem is evolving very fast
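
To make the Type I idea concrete, here is a minimal single-process sketch of how a query such as `select count(hour), hour from trace group by hour` could be expressed as a map, a shuffle and a reduce step. It only imitates the shape of a MapReduce job. It is not the plan Hive or any other engine actually generates, and all function names are invented for the example.

```python
from collections import defaultdict


def map_phase(rows):
    """Map: emit (group key, 1) for every input row."""
    for row in rows:
        yield row["hour"], 1


def shuffle(pairs):
    """Shuffle/sort: collect all values per key, ordered by key."""
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return dict(sorted(groups.items()))


def reduce_phase(groups):
    """Reduce: aggregate the values of each key (here: COUNT)."""
    for key, values in groups.items():
        yield key, sum(values)


def group_by_count(rows):
    return list(reduce_phase(shuffle(map_phase(rows))))


sample = [{"hour": 9}, {"hour": 10}, {"hour": 9}]
print(group_by_count(sample))
```

A Type II engine skips this translation and runs its own operators directly on the data nodes.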

Hive
● Runs on top of MapReduce
● → Type I
Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg

Lingual
● ANSI SQL Layer on top of Cascading
● Cascading
  ● Java API to express DAGs
  ● Runs on top of MapReduce
● → Type I

Limits of MapReduce
● Disk writes between Map and Reduce
● Slow for computations which depend on previously computed values
● JOINs are very slow and difficult to implement
● Only sequential data access
● Only tuple-wise data access
● Map-Side joins have sort and size constraints
● Reduce-Side joins require secondary sorting of values
● ...
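
The following sketch shows why joins are awkward in this model. It imitates a reduce-side join in a single process: both inputs are tagged with their origin, shuffled on the join key, and paired up in the reducer. The tags `"L"`/`"R"`, the function names and the sample rows are assumptions for illustration only.

```python
from collections import defaultdict
from itertools import product


def map_tagged(tag, rows, key):
    """Map: emit (join key, (source tag, row)) for every input row."""
    for row in rows:
        yield row[key], (tag, row)


def reduce_join(tagged_rows):
    """Reduce for one key: pair every left row with every right row."""
    left = [row for tag, row in tagged_rows if tag == "L"]
    right = [row for tag, row in tagged_rows if tag == "R"]
    return list(product(left, right))


def reduce_side_join(left_rows, right_rows, key):
    # Shuffle: all values with the same key meet at the same reducer.
    # A real job writes these intermediate pairs to disk; here they stay in a dict.
    shuffled = defaultdict(list)
    for k, tagged in map_tagged("L", left_rows, key):
        shuffled[k].append(tagged)
    for k, tagged in map_tagged("R", right_rows, key):
        shuffled[k].append(tagged)
    joined = []
    for k in sorted(shuffled):
        joined.extend(reduce_join(shuffled[k]))
    return joined


trace1 = [{"hour": 1, "employeeid": 7}, {"hour": 2, "employeeid": 8}]
trace2 = [{"hour": 1, "clientid": 3}, {"hour": 1, "clientid": 4}]
pairs = reduce_side_join(trace1, trace2, "hour")
print(len(pairs))
```

This reducer buffers every value of a key in memory. Secondary sorting lets one side arrive first, so only that side has to be buffered while the other is streamed.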

Impala (Type II)
http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png

Presto (Type II)
https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920

Spark / Shark (Type II)
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png

BigSQL V3.0 (Type II)
As in Spark, MapReduce has been kicked out :)
(No JobTracker, No Task Tracker, But HDFS/GPFS remains)

BigSQL V3.0 – Architecture
Putting the story together….
Big SQL shares a common SQL dialect with DB2
Big SQL shares the same client drivers with DB2

BigSQL V3.0 – Performance
● Query rewrites
  ● Exhaustive query rewrite capabilities
  ● Leverages additional metadata such as constraints and nullability
● Optimization
  ● Statistics and heuristic driven query optimization
  ● Query optimizer based upon decades of IBM RDBMS experience
● Tools and metrics
  ● Highly detailed explain plans and query diagnostic tools
  ● Extensive number of available performance metrics
SELECT ITEM_DESC, SUM(QUANTITY_SOLD),
       AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT, STORE
WHERE PERIOD.PERKEY = DAILY_SALES.PERKEY
  AND PRODUCT.PRODKEY = DAILY_SALES.PRODKEY
  AND STORE.STOREKEY = DAILY_SALES.STOREKEY
  AND CALENDAR_DATE BETWEEN '01/01/2012' AND '04/28/2012'
  AND STORE_NUMBER = '03'
  AND CATEGORY = 72
GROUP BY ITEM_DESC
Query transformation: dozens of query transformations
Access plan generation: hundreds or thousands of access plan options
[Figure: several candidate access plans joining Store, Product, Daily Sales and Period in different orders with NLJOIN, HSJOIN and ZZJOIN operators]

BigSQL V3.0 – Performance
You are substantially faster if you don't use MapReduce
IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification.
Source: http://www.ibmbigdatahub.com/blog/big-deal-about-infosphere-biginsights-v30-big-sql

BigSQL V3.0 – Query Federation
[Figure: a Head Node running Big SQL and four Compute Nodes, each running Big SQL; two of the Compute Nodes also show a Task Tracker and a Data Node]

BigSQL V1.0 – Demo (small)
● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
● 3 TB Data, ~60.937.500.000 rows (middle, Innovation Center Zurich)
● 0.7 PB Data, ~1.421875×10¹³ rows (large, Innovation Center Hursley)
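
The row counts of the two larger datasets match a linear extrapolation from the small one, using decimal units (3 TB = 3.000 GB, 0.7 PB = 700.000 GB). That this is how the numbers were obtained is an inference from the figures, not something the slide states. A quick check:

```python
SAMPLE_GB = 32
SAMPLE_ROWS = 650_000_000


def extrapolated_rows(size_gb: int) -> int:
    """Scale the row count of the 32 GB sample linearly to size_gb."""
    return SAMPLE_ROWS * size_gb // SAMPLE_GB


print(extrapolated_rows(3_000))    # 3 TB
print(extrapolated_rows(700_000))  # 0.7 PB
```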

CREATE EXTERNAL TABLE trace (
  hour integer, employeeid integer,
  departmentid integer, clientid integer,
  date string, timestamp string)
ROW FORMAT DELIMITED
  FIELDS TERMINATED BY ','
  LINES TERMINATED BY '\n'
STORED AS TEXTFILE
LOCATION '/user/biadmin/32Gtest';

[bivm.ibm.com][biadmin] 1> select count(*) from trace1;
+----------+
| |
+----------+
| 11416740 |
+----------+
1 row in results(first row: 39.78s; total: 39.78s)

select count(hour), hour from trace group by hour order by hour
30 rows in results(first row: 37.98s; total: 37.99s)

[bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner
join trace2 t4 on t3.hour=t4.hour;
+--------+
| |
+--------+
| 477340 |
+--------+

CREATE HADOOP TABLE trace3 (
hour int, employeeid int,
departmentid int,clientid int,
date varchar(30), timestamp varchar(30) )
row format delimited
fields terminated by '|'
stored as textfile;

[bivm.ibm.com][biadmin] 1> select count(*) from trace3;
+----------+
| 1 |
+----------+
| 12014733 |
+----------+

[bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner
join trace4 t4 on t3.hour=t4.hour;
+--------+
| 1 |
+--------+
| 504360 |
+--------+

[bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3
group by hour order by hour;
29 rows in results(first row: 1.88s; total: 1.89s)

R on Hadoop
● IBM BigR (based on SystemML Almaden Research project)
● RHadoop
● RHIPE
● ...
“R” Hadoop

Goal: Find column mean
Problems:
• Column vector cannot fit into memory
→ You have to partition and parallelize
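
A minimal sketch of the partitioned approach: each partition is reduced to a small (sum, count) pair, and only those pairs are combined. The partitions here are plain Python lists standing in for HDFS blocks, and the function names are made up. On a cluster, `partial_stats` would run in parallel on the nodes that hold the data.

```python
import math


def partial_stats(chunk):
    """Per-partition work: reduce one chunk to (sum, count)."""
    values = list(chunk)  # a single chunk is assumed to fit into memory
    return math.fsum(values), len(values)


def column_mean(partitions):
    """Combine the per-partition results into the global mean."""
    total, count = 0.0, 0
    for chunk in partitions:
        chunk_sum, chunk_count = partial_stats(chunk)
        total += chunk_sum
        count += chunk_count
    if count == 0:
        raise ValueError("column is empty")
    return total / count


print(column_mean([[1, 2, 3], [4, 5], [6]]))
```

Averaging the per-partition means instead would give a wrong result whenever the partitions differ in size, which is why the counts are carried along.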

● Sampling
  ● Full dataset > RAM
  ● Example: use 1% vs 100% of dataset
  ● Precision loss from skewed/sparse data
● Numerical Stability
  ● Limitation from finite precision in computing
  ● Algorithms must be carefully implemented
  ● Instability causes errors to cascade throughout your analysis
Catastrophic Cancellation Error: 6.375 – 5.625
6.375 rounds to 6.0
5.625 rounds to 6.0
True value: 0.75, Computed: 0, Relative Error: 1.0
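
The cancellation example can be reproduced with Python's `decimal` module by limiting the working precision to one significant digit. That one-digit precision is an assumption chosen to match the rounding on the slide; real floating-point hardware shows the same effect at far more digits.

```python
from decimal import Context, Decimal, ROUND_HALF_EVEN

# Arithmetic context that keeps only one significant digit.
one_digit = Context(prec=1, rounding=ROUND_HALF_EVEN)

a, b = Decimal("6.375"), Decimal("5.625")
true_diff = a - b                      # exact at the default precision

rounded_a = one_digit.plus(a)          # operand rounded to one digit
rounded_b = one_digit.plus(b)
computed = one_digit.subtract(rounded_a, rounded_b)

relative_error = abs(computed - true_diff) / abs(true_diff)
print(rounded_a, rounded_b, computed, relative_error)
```

Both operands lose only a few percent through rounding, yet their difference loses everything, because the leading digits cancel and only the rounding errors remain.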

Data in Hadoop
[Figure: you (the R user) and the data in distributed memory]

Data in Hadoop: Can run R on a single node
[Figure: you (the R user) and the data in distributed memory]

BigR (based on SystemML)
SystemML compiles hybrid runtime plans ranging from in-memory, single machine (CP) to large-scale, cluster (MR) compute
● Challenge
  ● Guaranteed hard memory constraints (budget of JVM size) for arbitrary complex ML programs
● Key Technical Innovations
  ● CP & MR Runtime: Single machine & MR operations, integrated runtime
  ● Caching: Reuse and eviction of in-memory objects
  ● Cost Model: Accurate time and worst-case memory estimates
  ● Optimizer: Cost-based runtime plan generation
  ● Dyn. Recompiler: Re-optimization for initial unknowns
[Figure "Hybrid Plans": runtime plotted against data size for CP, CP/MR and MR plans, annotated "High performance computing for small data sizes", "Gradually exploit MR parallelism" and "Scalable computing for large data sizes"]
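
The idea of choosing between CP and MR from a worst-case memory estimate can be illustrated with a toy decision rule. This is a sketch only, not SystemML's actual optimizer or cost model: the dense 8-bytes-per-cell estimate, the class `MatrixShape` and the function `choose_backend` are all assumptions made for the example.

```python
from dataclasses import dataclass

BYTES_PER_DOUBLE = 8


@dataclass(frozen=True)
class MatrixShape:
    rows: int
    cols: int

    def dense_bytes(self) -> int:
        """Worst-case size, assuming a dense matrix of doubles."""
        return self.rows * self.cols * BYTES_PER_DOUBLE


def choose_backend(inputs, output, memory_budget_bytes: int) -> str:
    """Pick "CP" if inputs plus output fit the budget, otherwise "MR"."""
    worst_case = sum(m.dense_bytes() for m in inputs) + output.dense_bytes()
    return "CP" if worst_case <= memory_budget_bytes else "MR"


budget = 1024 ** 3  # 1 GiB, standing in for the JVM budget
small = MatrixShape(rows=1_000, cols=10)
large = MatrixShape(rows=1_000_000_000, cols=10)
col_sums = MatrixShape(rows=1, cols=10)

print(choose_backend([small], col_sums, budget))
print(choose_backend([large], col_sums, budget))
```

Applying such a rule per operation rather than per program is what yields a hybrid plan, where some steps run in memory and others run on the cluster.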

BigR Architecture
[Figure: R Clients with IBM R Packages, the SystemML Statistics Engine with Embedded R Execution, and the Data Sources, with three numbered callouts (1, 2, 3)]
● Pull data (summaries) to R client
● Or, push R functions right on the data

Big R Data Structures: Proxy to entire dataset
data <- bigr.frame(…)
Appears and acts like all of the data is on your laptop
You

BigR Demo (small)
library(bigr)
# Connect to the BigInsights cluster
bigr.connect(host="bigdata", port=7052, database="default",
             user="biadmin", password="xxx")
is.bigr.connected()
# Proxy to the comma-delimited 32 GB file in HDFS (no header row)
tbr <- bigr.frame(dataSource="DEL",
                  coltypes=c("numeric", "numeric", "numeric", "numeric",
                             "character", "character"),
                  dataPath="/user/biadmin/32Gtest", delimiter=",",
                  header=FALSE, useMapReduce=TRUE)
# Histogram statistics of the first column, using 24 bins
h <- bigr.histogram.stats(tbr$V1, nbins=24)

BigR Demo (small)
  class bins   counts centroids
1   ALL    0 18289280  1.583333
2   ALL    1    15360  2.750000
3   ALL    2    55040  3.916667
4   ALL    3   189440  5.083333
5   ALL    4   579840  6.250000
6   ALL    5  5292160  7.416667
7   ALL    6  8074880  8.583333
8   ALL    7 15653120  9.750000
...
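
The printed centroids are consistent with 24 equal-width bins over the range 1 to 29, each centroid being the midpoint of its bin. That range is inferred from the numbers above and is not stated in the source; whether `bigr.histogram.stats` computes centroids this way is likewise an assumption. A small sketch under those assumptions:

```python
def bin_centroids(lo: float, hi: float, nbins: int) -> list:
    """Midpoints of nbins equal-width bins covering [lo, hi]."""
    if nbins < 1 or hi <= lo:
        raise ValueError("need nbins >= 1 and hi > lo")
    width = (hi - lo) / nbins
    return [lo + (i + 0.5) * width for i in range(nbins)]


centroids = bin_centroids(1, 29, 24)
for i in range(8):
    print(i, f"{centroids[i]:.6f}")
```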

BigR Demo (small)
jpeg('hist.jpg')
bigr.histogram(tbr$V1, nbins=24)
# This command runs on 32 GB / ~650.000.000 rows in HDFS
dev.off()

SPSS on Hadoop

BigSheets Demo (small)
This command runs on 32 GB / ~650.000.000 rows in HDFS

Text Extraction (SystemT, AQL)

If this is not enough? → BigData AppStore

BigData AppStore, Eclipse Tooling
● Write your apps in
  ● Java (MapReduce)
  ● Pig Latin, Jaql
  ● BigSQL/Hive/BigR
● Deploy it to BigInsights via Eclipse
● Automatically
  ● Schedule
  ● Update
    ● HDFS files
    ● BigSQL tables
    ● BigSheets collections

Questions?
http://www.ibm.com/software/data/bigdata/
Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps
