SlideShare una empresa de Scribd logo
1 de 73
Descargar para leer sin conexión
© 2013 IBM Corporation1
The Data Scientists Workplace of the Future - Data
Science Connect 22nd of July, 2014
Romeo Kienzler
IBM Center of Excellence for Data Science, Cognitive Systems and BigData
(A joint-venture between IBM Research Zurich and IBM Innovation Center DACH)
Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg
© 2013 IBM Corporation2
What is DataScience?
Source: Statoo.com http://slidesha.re/1kmNiX0
© 2013 IBM Corporation3
DataScience at present
●
Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html)
●
SQL (42%)
●
R (33%)
●
Python (26%)
●
Excel (25%)
●
Java, Ruby, C++ (17%)
●
SPSS, SAS (9%)
●
Limitations (Single Node usage)
●
Main Memory
●
CPU <> Main Memory Bandwidth
●
CPU
●
Storage <> Main Memory Bandwidth (either Single node or SAN)
© 2013 IBM Corporation4
What is BIG data?
© 2013 IBM Corporation5
What is BIG data?
© 2013 IBM Corporation6
What is BIG data?
Big Data
Hadoop
© 2013 IBM Corporation7
What is BIG data?
Business Intelligence
Data Warehouse
© 2013 IBM Corporation8
BigData == Hadoop?
Hadoop BigData
Hadoop
© 2013 IBM Corporation9
What is beyond “Data Warehouse”?
Data Lake
Data Warehouse
© 2013 IBM Corporation10
First “BigData” UseCase ?
●
Google Index
●
40 X 10^9 = 40.000.000.000 => 40 billion pages indexed
●
Will break 100 PB barrier soon
●
Derived from MapReduce
●
now “caffeine” based on “percolator”
●
Incremental vs. batch
●
In-Memory vs. disk
●
© 2013 IBM Corporation11
Map-Reduce → Hadoop → BigInsights
© 2013 IBM Corporation12
BigData Analytics – Predictive Analytics
"sometimes it's not
who has the best
algorithm that wins;
it's who has the most
data."
(C) Google Inc.
The Unreasonable Effectiveness of Data¹
¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf
No Sampling => Work with full dataset => No p-Value/z-Scores anymore
© 2013 IBM Corporation13
Aggregated Bandwith between CPU, Main
Memory and Hard Drive
1 TB (at 10 GByte/s)
- 1 Node - 100 sec
- 10 Nodes - 10 sec
- 100 Nodes - 1 sec
- 1000 Nodes - 100 msec
© 2013 IBM Corporation14
Fault Tolerance / Commodity Hardware
AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM,
3TB SEAGATE Barracuda 7200.14
< CHF 500
 100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD
 MTBF ~ 365 d > 1,5 d
Source: http://www.cloudcomputingpatterns.org/Watchdog
© 2013 IBM Corporation15
“Elastic” Scale-Out
Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload
© 2013 IBM Corporation16
“Elastic” Scale-Out
of
© 2013 IBM Corporation17
“Elastic” Scale-Out
of
CPU Cores
© 2013 IBM Corporation18
“Elastic” Scale-Out
of
CPU Cores Storage
© 2013 IBM Corporation19
“Elastic” Scale-Out
of
CPU Cores Storage Memory
© 2013 IBM Corporation20
“Elastic” Scale-Out
linear
Source: http://www.cloudcomputingpatterns.org/Elastic_Platform
© 2013 IBM Corporation21
How do Databases Scale-Out?
Shared Disk Architectures
© 2013 IBM Corporation22
How do Databases Scale-Out?
Shared Nothing Architectures
© 2013 IBM Corporation23
Hadoop?
Shared Nothing Architecture?
Shared Disk Architecture?
http://bluemix.net/
6 Node Hadoop Cluster 4 Free
© 2013 IBM Corporation24
Data Science on Hadoop
SQL (42%)
R (33%)
Python (26%)
Excel (25%)
Java, Ruby, C++ (17%)
SPSS, SAS (9%)
Data Science Hadoop
© 2013 IBM Corporation25
SQL on Hadoop
●
IBM BigSQL (ANSI 92 compliant)
●
HIVE, Presto
●
Cloudera Impala
●
Lingual
●
Shark
●
...
SQL Hadoop
© 2013 IBM Corporation26
Two types of SQL Engines
●
Type I
●
Compiler and Optimizer SQL->MapReduce
●
Type II
●
Brings own distributed execution engine on Data Nodes
●
Brings own Task Scheduler
●
The Hadoop SQL Ecosystem is evolving very fast
© 2013 IBM Corporation27
Hive
●
Runs on top of MapReduce
●
→ Type I
Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg
© 2013 IBM Corporation28
Lingual
●
ANSI SQL Layer on top of Cascading
●
Cascading
●
Java API do express DAG
●
Runs on top of MapReduce
●
→ Type I
© 2013 IBM Corporation29
Limits of MapReduce
●
Disk writes between Map and Reduce
●
Slow for computations which depend on previously computed values
●
JOINs are very slow and difficult to implement
●
Only sequential data access
●
Only tuple-wise data access
●
Map-Side joins have sort and size constraints
●
Reduce-Side joins require secondary sorting of values
●
…
●
...
© 2013 IBM Corporation30
Impala (Type II)
http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png
© 2013 IBM Corporation31
Presto (Type II)
https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
© 2013 IBM Corporation32
Spark / Shark (Type II)
Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
© 2013 IBM Corporation33
BigSQL V3.0 (Type II)
Like in Spark, MapReduce has been Kicked out :)
(No JobTracker, No Task Tracker, But HDFS/GPFS remains)
© 2013 IBM Corporation34
BigSQL V3.0 – Architecture
Putting the story together….
Big SQL shares a common SQL dialect with DB2
Big SQL shares the same client drivers with DB2
© 2013 IBM Corporation35
BigSQL V3.0 – Performance
Query rewrites
Exhaustive query rewrite capabilities
Leverages additional metadata such as constraints and nullability
Optimization
Statistics and heuristic driven query optimization
Query optimizer based upon decades of IBM RDBMS experience
Tools and metrics
Highly detailed explain plans and query diagnostic tools
Extensive number of available performance metrics
SELECT ITEM_DESC, SUM(QUANTITY_SOLD),
AVG(PRICE), AVG(COST)
FROM PERIOD, DAILY_SALES, PRODUCT,
STORE
WHERE
PERIOD.PERKEY=DAILY_SALES.PERKEY AND
PRODUCT.PRODKEY=DAILY_SALES.PRODKE
Y AND
STORE.STOREKEY=DAILY_SALES.STOREKEY
AND
CALENDAR_DATE BETWEEN AND
'01/01/2012' AND '04/28/2012' AND
STORE_NUMBER='03' AND
CATEGORY=72
GROUP BY ITEM_DESC
Access plan generationQuery transformation
Dozens of query
transformations
Hundreds or thousands
of access plan options
Store
Product
Product Store
NLJOIN
Daily SalesNLJOIN
Period
NLJOIN
Product
NLJOIN
Daily Sales
NLJOIN
Period
NLJOIN
Store
HSJOIN
Daily Sales
HSJOIN
Period
HSJOIN
Product
StoreZZJOIN
Daily Sales
HSJOIN
Period
© 2013 IBM Corporation36
BigSQL V3.0 – Performance
You are substantially faster if you don't use MapReduce
IBM BigInsights v3.0, with Big SQL
3.0, is the only Hadoop distribution
to successfully run ALL 99 TPC-DS
queries and ALL 22 TPC-H queries
without modification. Source:
http://www.ibmbigdatahub.com/blog/big-deal-about-
infosphere-biginsights-v30-big-sql
© 2013 IBM Corporation37
BigSQL V3.0 – Query Federation
Head Node
Big SQL
Compute Node
Task Tracker Data Node Big
SQL
Compute Node
Task Tracker Data Node
Big
SQL
Compute Node
Task Tracker Data Node
Big
SQL
Compute Node
Task Tracker Data Node
Big
SQL
© 2013 IBM Corporation38
BigSQL V1.0 – Demo (small)
●
32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
●
3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich)
●
0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
●
32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
●
3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich)
●
0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
© 2013 IBM Corporation39
BigSQL V1.0 – Demo (small)
CREATE EXTERNAL TABLE trace (
hour integer, employeeid integer,
departmentid integer, clientid integer,
date string, timestamp string)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
TERMINATED BY 'n' STORED AS TEXTFILE LOCATION
'/user/biadmin/32Gtest';
© 2013 IBM Corporation40
BigSQL V1.0 – Demo (small)
© 2013 IBM Corporation41
BigSQL V1.0 – Demo (small)
© 2013 IBM Corporation42
BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1;
+----------+
| |
+----------+
| 11416740 |
+----------+
1 row in results(first row: 39.78s; total: 39.78s)
© 2013 IBM Corporation43
BigSQL V1.0 – Demo (small)
select count(hour), hour from trace group by hour order by hour
30 rows in results(first row: 37.98s; total: 37.99s)
© 2013 IBM Corporation44
BigSQL V1.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner
join trace2 t4 on t3.hour=t4.hour;
+--------+
| |
+--------+
| 477340 |
+--------+
1 row in results(first row: 32.24s; total: 32.25s)
© 2013 IBM Corporation45
BigSQL V3.0 – Demo (small)
CREATE HADOOP TABLE trace3 (
hour int, employeeid int,
departmentid int,clientid int,
date varchar(30), timestamp varchar(30) )
row format delimited
fields terminated by '|'
stored as textfile;
© 2013 IBM Corporation46
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3;
+----------+
| 1 |
+----------+
| 12014733 |
+----------+
1 row in results(first row: 2.94s; total: 2.95s)
© 2013 IBM Corporation47
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner
join trace4 t4 on t3.hour=t4.hour;
+--------+
| 1 |
+--------+
| 504360 |
+--------+
1 row in results(first row: 0.79s; total: 0.80s)
© 2013 IBM Corporation48
BigSQL V3.0 – Demo (small)
[bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3
group by hour order by hour;
29 rows in results(first row: 1.88s; total: 1.89s)
© 2013 IBM Corporation49
R on Hadoop
●
IBM BigR (based on SystemML Almadan Research project)
●
Rhadoop
●
RHIPE
●
...
“R” Hadoop
© 2013 IBM Corporation50
© 2013 IBM Corporation5151
Goal: Find column mean
Problems:
• Column vector can not fit into memory
You have to partition and parallelize
© 2013 IBM Corporation52
● Sampling
 Full dataset > RAM
 Example: use 1% vs 100% of dataset
 Precision loss from skewed/sparse data
● Numerical Stability
 Limitation from finite precision in computing
 Algorithms must be carefully implemented
 Instability causes errors to cascade throughout your analysis
Catastrophic Cancellation Error: 6.375 – 5.625
True value: 0.75 Computed: 0 Relative Error: 1.0
6.375 round to 6.0
5.625 round to 6.0
© 2013 IBM Corporation53
Data in Hadoop
You
R User
Data in distributed
memory
© 2013 IBM Corporation54
Data in Hadoop: Can run R on a single node
R User
Data in distributed
memory
You
© 2013 IBM Corporation55
BigR (based on SystemML)
SystemML compiles hybrid runtime plans ranging from in-
memory, single machine (CP) to large-scale, cluster (MR)
compute
●
Challenge
●
Guaranteed hard memory constraints
(budget of JVM size)
●
for arbitrary complex ML programs
●
Key Technical Innovations
●
CP & MR Runtime: Single machine & MR operations, integrated runtime
●
Caching: Reuse and eviction of in-memory objects
●
Cost Model: Accurate time and worst-case memory estimates
●
Optimizer: Cost-based runtime plan generation
●
Dyn. Recompiler: Re-optimization for initial unknowns
Data size
Runtime
CP CP/MR MR
Gradually exploit
MR parallelism
High performance
computing for
small data sizes.
Scalable
computing for
large data sizes.
Hybrid Plans
© 2013 IBM Corporation56
R Clients
SystemML
Statistics
Engine
Data Sources
Embedded R
Execution
IBM R Packages
IBM R Packages
Pull data
(summaries) to
R client
Or, push R
functions
right on the
data
1
2
3
© 2014 IBM Corporation17 IBM Internal Use Only
BigR Architecture
© 2013 IBM Corporation57
Big R Data Structures: Proxy to entire dataset
data <- bigr.frame(…)
Appears and acts like all of the data is on your
laptop
You
© 2013 IBM Corporation58
BigR Demo (small)
●
32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
●
3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich)
●
0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
© 2013 IBM Corporation59
BigR Demo (small)
library(bigr)
bigr.connect(host="bigdata",
port=7052, database="default",
user="biadmin", password="xxx")
is.bigr.connected()
tbr <- bigr.frame(dataSource="DEL", coltypes =
c("numeric","numeric","numeric","numeric","character","character"),
dataPath="/user/biadmin/32Gtest", delimiter=",",
header=F, useMapReduce=T)
h <- bigr.histogram.stats(tbr$V1, nbins=24)
© 2013 IBM Corporation60
BigR Demo (small)
class bins counts centroids
1 ALL 0 18289280 1.583333
2 ALL 1 15360 2.750000
3 ALL 2 55040 3.916667
4 ALL 3 189440 5.083333
5 ALL 4 579840 6.250000
6 ALL 5 5292160 7.416667
7 ALL 6 8074880 8.583333
8 ALL 7 15653120 9.750000
...
© 2013 IBM Corporation61
BigR Demo (small)
© 2013 IBM Corporation62
BigR Demo (small)
jpeg('hist.jpg')
bigr.histogram(tbr$V1, nbins=24)
# This command runs on 32 GB / ~650.000.000 rows in HDFS
dev.off()
© 2013 IBM Corporation63
SPSS on Hadoop
© 2013 IBM Corporation64
SPSS on Hadoop
© 2013 IBM Corporation65
BigSheets Demo (small)
●
32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
●
3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich)
●
0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
●
32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich)
●
3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich)
●
0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
© 2013 IBM Corporation66
BigSheets Demo (small)
© 2013 IBM Corporation67
BigSheets Demo (small)
This command runs on 32 GB /
~650.000.000 rows in HDFS
© 2013 IBM Corporation68
BigSheets Demo (small)
© 2013 IBM Corporation69
Text Extraction (SystemT, AQL)
© 2013 IBM Corporation70
Text Extraction (SystemT, AQL)
© 2013 IBM Corporation71
If this is not enough? → BigData AppStore
© 2013 IBM Corporation72
BigData AppStore, Eclipse Tooling
●
Write your apps in
●
Java (MapReduce)
●
PigLatin,Jaql
●
BigSQL/Hive/BigR
●
Deploy it to BigInsights via Eclipse
●
Automatically
●
Schedule
●
Update
●
hdfs files
●
BigSQL tables
●
BigSheets collections
© 2013 IBM Corporation73
Questions?
http://www.ibm.com/software/data/bigdata/
Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps

Más contenido relacionado

La actualidad más candente

Expectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchExpectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchRyousei Takano
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_ENKohei KaiGai
 
Supermicro cloudera hadoop
Supermicro cloudera hadoopSupermicro cloudera hadoop
Supermicro cloudera hadoopSupermicro_SMCI
 
クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術Ryousei Takano
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - EnglishKohei KaiGai
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQLKohei KaiGai
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)Kohei KaiGai
 
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multiKohei KaiGai
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDASavith Satheesh
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...NVIDIA Taiwan
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015Kohei KaiGai
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computingbakers84
 
PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)Kohei KaiGai
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionNVIDIA Taiwan
 
HPC Cloud: Clouds on supercomputers for HPC
HPC Cloud: Clouds on supercomputers for HPCHPC Cloud: Clouds on supercomputers for HPC
HPC Cloud: Clouds on supercomputers for HPCRyousei Takano
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSPeterAndreasEntschev
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudRyousei Takano
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda enKohei KaiGai
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGaiKohei KaiGai
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Ural-PDC
 

La actualidad más candente (20)

Expectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software researchExpectations for optical network from the viewpoint of system software research
Expectations for optical network from the viewpoint of system software research
 
20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN20180920_DBTS_PGStrom_EN
20180920_DBTS_PGStrom_EN
 
Supermicro cloudera hadoop
Supermicro cloudera hadoopSupermicro cloudera hadoop
Supermicro cloudera hadoop
 
クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術クラウド時代の半導体メモリー技術
クラウド時代の半導体メモリー技術
 
20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English20181212 - PGconfASIA - LT - English
20181212 - PGconfASIA - LT - English
 
20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL20181116 Massive Log Processing using I/O optimized PostgreSQL
20181116 Massive Log Processing using I/O optimized PostgreSQL
 
SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)SQL+GPU+SSD=∞ (English)
SQL+GPU+SSD=∞ (English)
 
20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi20181016_pgconfeu_ssd2gpu_multi
20181016_pgconfeu_ssd2gpu_multi
 
GPGPU programming with CUDA
GPGPU programming with CUDAGPGPU programming with CUDA
GPGPU programming with CUDA
 
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
Recent Progress in SCCS on GPU Simulation of Biomedical and Hydrodynamic Prob...
 
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015PG-Strom - GPGPU meets PostgreSQL, PGcon2015
PG-Strom - GPGPU meets PostgreSQL, PGcon2015
 
The Rise of Parallel Computing
The Rise of Parallel ComputingThe Rise of Parallel Computing
The Rise of Parallel Computing
 
PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)PG-Strom v2.0 Technical Brief (17-Apr-2018)
PG-Strom v2.0 Technical Brief (17-Apr-2018)
 
Evolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server SolutionEvolution of Supermicro GPU Server Solution
Evolution of Supermicro GPU Server Solution
 
HPC Cloud: Clouds on supercomputers for HPC
HPC Cloud: Clouds on supercomputers for HPCHPC Cloud: Clouds on supercomputers for HPC
HPC Cloud: Clouds on supercomputers for HPC
 
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDSDistributed Multi-GPU Computing with Dask, CuPy and RAPIDS
Distributed Multi-GPU Computing with Dask, CuPy and RAPIDS
 
Exploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC CloudExploring the Performance Impact of Virtualization on an HPC Cloud
Exploring the Performance Impact of Virtualization on an HPC Cloud
 
pgconfasia2016 plcuda en
pgconfasia2016 plcuda enpgconfasia2016 plcuda en
pgconfasia2016 plcuda en
 
20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai20190909_PGconf.ASIA_KaiGai
20190909_PGconf.ASIA_KaiGai
 
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
Applying of the NVIDIA CUDA to the video processing in the task of the roundw...
 

Destacado

Open science resources for `Big Data' Analyses of the human connectome
Open science resources for `Big Data' Analyses of the human connectomeOpen science resources for `Big Data' Analyses of the human connectome
Open science resources for `Big Data' Analyses of the human connectomeCameron Craddock
 
Use of big data technologies in capital markets
Use of big data technologies in capital marketsUse of big data technologies in capital markets
Use of big data technologies in capital marketsInfosys
 
Big Data Expo 2015 - Data Science Innovation Privacy Considerations
Big Data Expo 2015 - Data Science Innovation Privacy ConsiderationsBig Data Expo 2015 - Data Science Innovation Privacy Considerations
Big Data Expo 2015 - Data Science Innovation Privacy ConsiderationsBigDataExpo
 
The Complete Guide to Capital Markets for Quantitative Professionals - Summary
The Complete Guide to Capital Markets for Quantitative Professionals - SummaryThe Complete Guide to Capital Markets for Quantitative Professionals - Summary
The Complete Guide to Capital Markets for Quantitative Professionals - Summaryradhap
 
Data Science at Atlassian: 
The transition towards a data-driven organisation
Data Science at Atlassian: 
The transition towards a data-driven organisationData Science at Atlassian: 
The transition towards a data-driven organisation
Data Science at Atlassian: 
The transition towards a data-driven organisationIlias Flaounas
 
UNICEF Innovation: Innovation Lab Do-It-Yourself Guide
UNICEF Innovation: Innovation Lab Do-It-Yourself GuideUNICEF Innovation: Innovation Lab Do-It-Yourself Guide
UNICEF Innovation: Innovation Lab Do-It-Yourself GuideChristopher Fabian
 
The Epidemiology of Innovation
The Epidemiology of InnovationThe Epidemiology of Innovation
The Epidemiology of InnovationTim Stock
 
Big Data and Data Science @ BNL - D. Morgagni & L. Dell'Anna
Big Data and Data Science @ BNL - D. Morgagni & L. Dell'AnnaBig Data and Data Science @ BNL - D. Morgagni & L. Dell'Anna
Big Data and Data Science @ BNL - D. Morgagni & L. Dell'AnnaData Driven Innovation
 
Open Innovation
Open Innovation Open Innovation
Open Innovation Alar Kolk
 
Understand Innovation in 5 Minutes
Understand Innovation in 5 MinutesUnderstand Innovation in 5 Minutes
Understand Innovation in 5 MinutesGordon Graham
 
Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Hamilton
 

Destacado (17)

Open science resources for `Big Data' Analyses of the human connectome
Open science resources for `Big Data' Analyses of the human connectomeOpen science resources for `Big Data' Analyses of the human connectome
Open science resources for `Big Data' Analyses of the human connectome
 
Use of big data technologies in capital markets
Use of big data technologies in capital marketsUse of big data technologies in capital markets
Use of big data technologies in capital markets
 
Big Data Expo 2015 - Data Science Innovation Privacy Considerations
Big Data Expo 2015 - Data Science Innovation Privacy ConsiderationsBig Data Expo 2015 - Data Science Innovation Privacy Considerations
Big Data Expo 2015 - Data Science Innovation Privacy Considerations
 
Data juice
Data juiceData juice
Data juice
 
The DATALAB - building a world-class innovation centre in data science
The DATALAB - building a world-class innovation centre in data scienceThe DATALAB - building a world-class innovation centre in data science
The DATALAB - building a world-class innovation centre in data science
 
Big data Summit
Big data SummitBig data Summit
Big data Summit
 
The Complete Guide to Capital Markets for Quantitative Professionals - Summary
The Complete Guide to Capital Markets for Quantitative Professionals - SummaryThe Complete Guide to Capital Markets for Quantitative Professionals - Summary
The Complete Guide to Capital Markets for Quantitative Professionals - Summary
 
Data Science at Atlassian: 
The transition towards a data-driven organisation
Data Science at Atlassian: 
The transition towards a data-driven organisationData Science at Atlassian: 
The transition towards a data-driven organisation
Data Science at Atlassian: 
The transition towards a data-driven organisation
 
UNICEF Innovation: Innovation Lab Do-It-Yourself Guide
UNICEF Innovation: Innovation Lab Do-It-Yourself GuideUNICEF Innovation: Innovation Lab Do-It-Yourself Guide
UNICEF Innovation: Innovation Lab Do-It-Yourself Guide
 
The Epidemiology of Innovation
The Epidemiology of InnovationThe Epidemiology of Innovation
The Epidemiology of Innovation
 
Big Data and Data Science @ BNL - D. Morgagni & L. Dell'Anna
Big Data and Data Science @ BNL - D. Morgagni & L. Dell'AnnaBig Data and Data Science @ BNL - D. Morgagni & L. Dell'Anna
Big Data and Data Science @ BNL - D. Morgagni & L. Dell'Anna
 
Open Innovation
Open Innovation Open Innovation
Open Innovation
 
Innovation can be Trained
Innovation can be TrainedInnovation can be Trained
Innovation can be Trained
 
Understand Innovation in 5 Minutes
Understand Innovation in 5 MinutesUnderstand Innovation in 5 Minutes
Understand Innovation in 5 Minutes
 
Innovation Strategy
Innovation StrategyInnovation Strategy
Innovation Strategy
 
Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science Booz Allen Field Guide to Data Science
Booz Allen Field Guide to Data Science
 
What is Big Data?
What is Big Data?What is Big Data?
What is Big Data?
 

Similar a Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...Romeo Kienzler
 
BigData processing in the cloud – Guest Lecture - University of Applied Scien...
BigData processing in the cloud – Guest Lecture - University of Applied Scien...BigData processing in the cloud – Guest Lecture - University of Applied Scien...
BigData processing in the cloud – Guest Lecture - University of Applied Scien...Romeo Kienzler
 
日本発のオープンソース・データベース GridDB
日本発のオープンソース・データベース GridDB日本発のオープンソース・データベース GridDB
日本発のオープンソース・データベース GridDBgriddb
 
Molecular Shape Searching on GPUs: A Brave New World
Molecular Shape Searching on GPUs: A Brave New WorldMolecular Shape Searching on GPUs: A Brave New World
Molecular Shape Searching on GPUs: A Brave New WorldCan Ozdoruk
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Brendan Gregg
 
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...Romeo Kienzler
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasScyllaDB
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storageKohei KaiGai
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...Amazon Web Services
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performanceinside-BigData.com
 
Leveraging Cloud for the Modern SQL Developer
Leveraging Cloud for the Modern SQL DeveloperLeveraging Cloud for the Modern SQL Developer
Leveraging Cloud for the Modern SQL DeveloperJason Strate
 
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013Amazon Web Services
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsAli Hodroj
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon RedshiftAmazon Web Services
 
IBM Analytics Accelerator Trends & Directions Namk Hrle
IBM Analytics Accelerator  Trends & Directions Namk Hrle IBM Analytics Accelerator  Trends & Directions Namk Hrle
IBM Analytics Accelerator Trends & Directions Namk Hrle Surekha Parekh
 
IBM DB2 Analytics Accelerator Trends & Directions by Namik Hrle
IBM DB2 Analytics Accelerator  Trends & Directions by Namik Hrle IBM DB2 Analytics Accelerator  Trends & Directions by Namik Hrle
IBM DB2 Analytics Accelerator Trends & Directions by Namik Hrle Surekha Parekh
 
YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing PerformanceBrendan Gregg
 

Similar a Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich (20)

SQL on Hadoop
SQL on HadoopSQL on Hadoop
SQL on Hadoop
 
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
The datascientists workplace of the future, IBM developerDays 2014, Vienna by...
 
BigData processing in the cloud – Guest Lecture - University of Applied Scien...
BigData processing in the cloud – Guest Lecture - University of Applied Scien...BigData processing in the cloud – Guest Lecture - University of Applied Scien...
BigData processing in the cloud – Guest Lecture - University of Applied Scien...
 
日本発のオープンソース・データベース GridDB
日本発のオープンソース・データベース GridDB日本発のオープンソース・データベース GridDB
日本発のオープンソース・データベース GridDB
 
Molecular Shape Searching on GPUs: A Brave New World
Molecular Shape Searching on GPUs: A Brave New WorldMolecular Shape Searching on GPUs: A Brave New World
Molecular Shape Searching on GPUs: A Brave New World
 
Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)Computing Performance: On the Horizon (2021)
Computing Performance: On the Horizon (2021)
 
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
Information Retrieval, Applied Statistics and Mathematics onBigData - German ...
 
Hadoop Fundamentals I
Hadoop Fundamentals IHadoop Fundamentals I
Hadoop Fundamentals I
 
Dissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance DilemmasDissecting Real-World Database Performance Dilemmas
Dissecting Real-World Database Performance Dilemmas
 
20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage20170602_OSSummit_an_intelligent_storage
20170602_OSSummit_an_intelligent_storage
 
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
SRV402 Deep Dive on Amazon EC2 Instances, Featuring Performance Optimization ...
 
Trends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient PerformanceTrends in Systems and How to Get Efficient Performance
Trends in Systems and How to Get Efficient Performance
 
Leveraging Cloud for the Modern SQL Developer
Leveraging Cloud for the Modern SQL DeveloperLeveraging Cloud for the Modern SQL Developer
Leveraging Cloud for the Modern SQL Developer
 
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
Empowering Congress with Data-Driven Analytics (BDT304) | AWS re:Invent 2013
 
Hybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGsHybrid Transactional/Analytics Processing with Spark and IMDGs
Hybrid Transactional/Analytics Processing with Spark and IMDGs
 
Getting Started with Amazon Redshift
Getting Started with Amazon RedshiftGetting Started with Amazon Redshift
Getting Started with Amazon Redshift
 
IBM Analytics Accelerator Trends & Directions Namk Hrle
IBM Analytics Accelerator  Trends & Directions Namk Hrle IBM Analytics Accelerator  Trends & Directions Namk Hrle
IBM Analytics Accelerator Trends & Directions Namk Hrle
 
IBM DB2 Analytics Accelerator Trends & Directions by Namik Hrle
IBM DB2 Analytics Accelerator  Trends & Directions by Namik Hrle IBM DB2 Analytics Accelerator  Trends & Directions by Namik Hrle
IBM DB2 Analytics Accelerator Trends & Directions by Namik Hrle
 
YOW2021 Computing Performance
YOW2021 Computing PerformanceYOW2021 Computing Performance
YOW2021 Computing Performance
 
DDMS
DDMSDDMS
DDMS
 

Más de Romeo Kienzler

Parallelization Stategies of DeepLearning Neural Network Training
Parallelization Stategies of DeepLearning Neural Network TrainingParallelization Stategies of DeepLearning Neural Network Training
Parallelization Stategies of DeepLearning Neural Network TrainingRomeo Kienzler
 
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & FlinkCognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & FlinkRomeo Kienzler
 
Love & Innovative technology presented by a technology pioneer and an AI expe...
Love & Innovative technology presented by a technology pioneer and an AI expe...Love & Innovative technology presented by a technology pioneer and an AI expe...
Love & Innovative technology presented by a technology pioneer and an AI expe...Romeo Kienzler
 
Blockchain Technology Book Vernisage
Blockchain Technology Book VernisageBlockchain Technology Book Vernisage
Blockchain Technology Book VernisageRomeo Kienzler
 
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...Romeo Kienzler
 
IBM Middle East Data Science Connect 2016 - Doha, Qatar
IBM Middle East Data Science Connect 2016 - Doha, QatarIBM Middle East Data Science Connect 2016 - Doha, Qatar
IBM Middle East Data Science Connect 2016 - Doha, QatarRomeo Kienzler
 
Apache SystemML - Declarative Large-Scale Machine Learning
Apache SystemML - Declarative Large-Scale Machine LearningApache SystemML - Declarative Large-Scale Machine Learning
Apache SystemML - Declarative Large-Scale Machine LearningRomeo Kienzler
 
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16Romeo Kienzler
 
DeepLearning and Advanced Machine Learning on IoT
DeepLearning and Advanced Machine Learning on IoTDeepLearning and Advanced Machine Learning on IoT
DeepLearning and Advanced Machine Learning on IoTRomeo Kienzler
 
Real-time DeepLearning on IoT Sensor Data
Real-time DeepLearning on IoT Sensor DataReal-time DeepLearning on IoT Sensor Data
Real-time DeepLearning on IoT Sensor DataRomeo Kienzler
 
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...Romeo Kienzler
 
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A ServiceScala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A ServiceRomeo Kienzler
 
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...Romeo Kienzler
 
TDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASTDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASRomeo Kienzler
 
Cloudant Overview Bluemix Meetup from Lisa Neddam
Cloudant Overview Bluemix Meetup from Lisa NeddamCloudant Overview Bluemix Meetup from Lisa Neddam
Cloudant Overview Bluemix Meetup from Lisa NeddamRomeo Kienzler
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...Romeo Kienzler
 
DBaaS Bluemix Meetup DACH 26.8.14
DBaaS Bluemix Meetup DACH 26.8.14DBaaS Bluemix Meetup DACH 26.8.14
DBaaS Bluemix Meetup DACH 26.8.14Romeo Kienzler
 
Cloud Databases, Developer Week Nuernberg 2014
Cloud Databases, Developer Week Nuernberg 2014Cloud Databases, Developer Week Nuernberg 2014
Cloud Databases, Developer Week Nuernberg 2014Romeo Kienzler
 
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursCloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursRomeo Kienzler
 

Más de Romeo Kienzler (20)

Parallelization Stategies of DeepLearning Neural Network Training
Parallelization Stategies of DeepLearning Neural Network TrainingParallelization Stategies of DeepLearning Neural Network Training
Parallelization Stategies of DeepLearning Neural Network Training
 
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & FlinkCognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
Cognitive IoT using DeepLearning on data parallel frameworks like Spark & Flink
 
Love & Innovative technology presented by a technology pioneer and an AI expe...
Love & Innovative technology presented by a technology pioneer and an AI expe...Love & Innovative technology presented by a technology pioneer and an AI expe...
Love & Innovative technology presented by a technology pioneer and an AI expe...
 
Blockchain Technology Book Vernisage
Blockchain Technology Book VernisageBlockchain Technology Book Vernisage
Blockchain Technology Book Vernisage
 
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
Architecture of the Hyperledger Blockchain Fabric - Christian Cachin - IBM Re...
 
IBM Middle East Data Science Connect 2016 - Doha, Qatar
IBM Middle East Data Science Connect 2016 - Doha, QatarIBM Middle East Data Science Connect 2016 - Doha, Qatar
IBM Middle East Data Science Connect 2016 - Doha, Qatar
 
Apache SystemML - Declarative Large-Scale Machine Learning
Apache SystemML - Declarative Large-Scale Machine LearningApache SystemML - Declarative Large-Scale Machine Learning
Apache SystemML - Declarative Large-Scale Machine Learning
 
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
Intro to DeepLearning4J on ApacheSpark SDS DL Workshop 16
 
DeepLearning and Advanced Machine Learning on IoT
DeepLearning and Advanced Machine Learning on IoTDeepLearning and Advanced Machine Learning on IoT
DeepLearning and Advanced Machine Learning on IoT
 
Geo Python16 keynote
Geo Python16 keynoteGeo Python16 keynote
Geo Python16 keynote
 
Real-time DeepLearning on IoT Sensor Data
Real-time DeepLearning on IoT Sensor DataReal-time DeepLearning on IoT Sensor Data
Real-time DeepLearning on IoT Sensor Data
 
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
Cloud scale predictive DevOps automation using Apache Spark: Velocity in Amst...
 
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A ServiceScala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
Scala, Apache Spark, The PlayFramework and Docker in IBM Platform As A Service
 
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
IBM Watson Technical Deep Dive Swiss Group for Artificial Intelligence and Co...
 
TDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAASTDWI_DW2014_SQLNoSQL_DBAAS
TDWI_DW2014_SQLNoSQL_DBAAS
 
Cloudant Overview Bluemix Meetup from Lisa Neddam
Cloudant Overview Bluemix Meetup from Lisa NeddamCloudant Overview Bluemix Meetup from Lisa Neddam
Cloudant Overview Bluemix Meetup from Lisa Neddam
 
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
The European Conference on Software Architecture (ECSA) 14 - IBM BigData Refe...
 
DBaaS Bluemix Meetup DACH 26.8.14
DBaaS Bluemix Meetup DACH 26.8.14DBaaS Bluemix Meetup DACH 26.8.14
DBaaS Bluemix Meetup DACH 26.8.14
 
Cloud Databases, Developer Week Nuernberg 2014
Cloud Databases, Developer Week Nuernberg 2014Cloud Databases, Developer Week Nuernberg 2014
Cloud Databases, Developer Week Nuernberg 2014
 
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 HoursCloudfoundry / Bluemix tutorials, compressed in 4 Hours
Cloudfoundry / Bluemix tutorials, compressed in 4 Hours
 

Último

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionfulawalesam
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Delhi Call girls
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% SecurePooja Nehwal
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceDelhi Call girls
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...amitlee9823
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Delhi Call girls
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAroojKhan71
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...amitlee9823
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023ymrp368
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxolyaivanovalion
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Callshivangimorya083
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfRachmat Ramadhan H
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxolyaivanovalion
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...SUHANI PANDEY
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...amitlee9823
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxolyaivanovalion
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxJohnnyPlasten
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...amitlee9823
 

Último (20)

Week-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interactionWeek-01-2.ppt BBB human Computer interaction
Week-01-2.ppt BBB human Computer interaction
 
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
Call Girls in Sarai Kale Khan Delhi 💯 Call Us 🔝9205541914 🔝( Delhi) Escorts S...
 
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% SecureCall me @ 9892124323  Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
Call me @ 9892124323 Cheap Rate Call Girls in Vashi with Real Photo 100% Secure
 
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort ServiceBDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
BDSM⚡Call Girls in Mandawali Delhi >༒8448380779 Escort Service
 
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
Call Girls Bannerghatta Road Just Call 👗 7737669865 👗 Top Class Call Girl Ser...
 
Sampling (random) method and Non random.ppt
Sampling (random) method and Non random.pptSampling (random) method and Non random.ppt
Sampling (random) method and Non random.ppt
 
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
Best VIP Call Girls Noida Sector 39 Call Me: 8448380779
 
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al BarshaAl Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
Al Barsha Escorts $#$ O565212860 $#$ Escort Service In Al Barsha
 
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
Call Girls Hsr Layout Just Call 👗 7737669865 👗 Top Class Call Girl Service Ba...
 
Data-Analysis for Chicago Crime Data 2023
Data-Analysis for Chicago Crime Data  2023Data-Analysis for Chicago Crime Data  2023
Data-Analysis for Chicago Crime Data 2023
 
Edukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFxEdukaciniai dropshipping via API with DroFx
Edukaciniai dropshipping via API with DroFx
 
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip CallDelhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
Delhi Call Girls CP 9711199171 ☎✔👌✔ Whatsapp Hard And Sexy Vip Call
 
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdfMarket Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
Market Analysis in the 5 Largest Economic Countries in Southeast Asia.pdf
 
Mature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptxMature dropshipping via API with DroFx.pptx
Mature dropshipping via API with DroFx.pptx
 
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
VIP Model Call Girls Hinjewadi ( Pune ) Call ON 8005736733 Starting From 5K t...
 
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
Call Girls Indiranagar Just Call 👗 7737669865 👗 Top Class Call Girl Service B...
 
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
(NEHA) Call Girls Katra Call Now 8617697112 Katra Escorts 24x7
 
VidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptxVidaXL dropshipping via API with DroFx.pptx
VidaXL dropshipping via API with DroFx.pptx
 
Log Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptxLog Analysis using OSSEC sasoasasasas.pptx
Log Analysis using OSSEC sasoasasasas.pptx
 
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
Chintamani Call Girls: 🍓 7737669865 🍓 High Profile Model Escorts | Bangalore ...
 

Data Science Connect, July 22nd 2014 @IBM Innovation Center Zurich

  • 1. © 2013 IBM Corporation1 The Data Scientists Workplace of the Future - Data Science Connect 22nd of July, 2014 Romeo Kienzler IBM Center of Excellence for Data Science, Cognitive Systems and BigData (A joint-venture between IBM Research Zurich and IBM Innovation Center DACH) Source: http://www.kdnuggets.com/2012/04/data-science-history.jpg
  • 2. © 2013 IBM Corporation2 What is DataScience? Source: Statoo.com http://slidesha.re/1kmNiX0
  • 3. © 2013 IBM Corporation3 DataScience at present ● Tools (http://blog.revolutionanalytics.com/2014/01/in-data-scientist-survey-r-is-the-most-used-tool-other-than-databases.html) ● SQL (42%) ● R (33%) ● Python (26%) ● Excel (25%) ● Java, Ruby, C++ (17%) ● SPSS, SAS (9%) ● Limitations (Single Node usage) ● Main Memory ● CPU <> Main Memory Bandwidth ● CPU ● Storage <> Main Memory Bandwidth (either Single node or SAN)
  • 4. © 2013 IBM Corporation4 What is BIG data?
  • 5. © 2013 IBM Corporation5 What is BIG data?
  • 6. © 2013 IBM Corporation6 What is BIG data? Big Data Hadoop
  • 7. © 2013 IBM Corporation7 What is BIG data? Business Intelligence Data Warehouse
  • 8. © 2013 IBM Corporation8 BigData == Hadoop? Hadoop BigData Hadoop
  • 9. © 2013 IBM Corporation9 What is beyond “Data Warehouse”? Data Lake Data Warehouse
  • 10. © 2013 IBM Corporation10 First “BigData” UseCase ? ● Google Index ● 40 X 10^9 = 40.000.000.000 => 40 billion pages indexed ● Will break 100 PB barrier soon ● Derived from MapReduce ● now “caffeine” based on “percolator” ● Incremental vs. batch ● In-Memory vs. disk ●
  • 11. © 2013 IBM Corporation11 Map-Reduce → Hadoop → BigInsights
  • 12. © 2013 IBM Corporation12 BigData Analytics – Predictive Analytics "sometimes it's not who has the best algorithm that wins; it's who has the most data." (C) Google Inc. The Unreasonable Effectiveness of Data¹ ¹http://www.csee.wvu.edu/~gidoretto/courses/2011-fall-cp/reading/TheUnreasonable%20EffectivenessofData_IEEE_IS2009.pdf No Sampling => Work with full dataset => No p-Value/z-Scores anymore
  • 13. © 2013 IBM Corporation13 Aggregated Bandwith between CPU, Main Memory and Hard Drive 1 TB (at 10 GByte/s) - 1 Node - 100 sec - 10 Nodes - 10 sec - 100 Nodes - 1 sec - 1000 Nodes - 100 msec
  • 14. © 2013 IBM Corporation14 Fault Tolerance / Commodity Hardware AMD Turion II Neo N40L (2x 1,5GHz / 2MB / 15W), 8 GB RAM, 3TB SEAGATE Barracuda 7200.14 < CHF 500  100 K => 200 X (2, 4, 3) => 400 Cores, 1,6 TB RAM, 200 TB HD  MTBF ~ 365 d > 1,5 d Source: http://www.cloudcomputingpatterns.org/Watchdog
  • 15. © 2013 IBM Corporation15 “Elastic” Scale-Out Source: http://www.cloudcomputingpatterns.org/Continuously_Changing_Workload
  • 16. © 2013 IBM Corporation16 “Elastic” Scale-Out of
  • 17. © 2013 IBM Corporation17 “Elastic” Scale-Out of CPU Cores
  • 18. © 2013 IBM Corporation18 “Elastic” Scale-Out of CPU Cores Storage
  • 19. © 2013 IBM Corporation19 “Elastic” Scale-Out of CPU Cores Storage Memory
  • 20. © 2013 IBM Corporation20 “Elastic” Scale-Out linear Source: http://www.cloudcomputingpatterns.org/Elastic_Platform
  • 21. © 2013 IBM Corporation21 How do Databases Scale-Out? Shared Disk Architectures
  • 22. © 2013 IBM Corporation22 How do Databases Scale-Out? Shared Nothing Architectures
  • 23. © 2013 IBM Corporation23 Hadoop? Shared Nothing Architecture? Shared Disk Architecture? http://bluemix.net/ 6 Node Hadoop Cluster 4 Free
  • 24. © 2013 IBM Corporation24 Data Science on Hadoop SQL (42%) R (33%) Python (26%) Excel (25%) Java, Ruby, C++ (17%) SPSS, SAS (9%) Data Science Hadoop
  • 25. © 2013 IBM Corporation25 SQL on Hadoop ● IBM BigSQL (ANSI 92 compliant) ● HIVE, Presto ● Cloudera Impala ● Lingual ● Shark ● ... SQL Hadoop
  • 26. © 2013 IBM Corporation26 Two types of SQL Engines ● Type I ● Compiler and Optimizer SQL->MapReduce ● Type II ● Brings own distributed execution engine on Data Nodes ● Brings own Task Scheduler ● The Hadoop SQL Ecosystem is evolving very fast
  • 27. © 2013 IBM Corporation27 Hive ● Runs on top of MapReduce ● → Type I Source: http://cdn.venublog.com/wp-content/uploads/2013/07/hive-1.jpg
  • 28. © 2013 IBM Corporation28 Lingual ● ANSI SQL Layer on top of Cascading ● Cascading ● Java API do express DAG ● Runs on top of MapReduce ● → Type I
  • 29. © 2013 IBM Corporation29 Limits of MapReduce ● Disk writes between Map and Reduce ● Slow for computations which depend on previously computed values ● JOINs are very slow and difficult to implement ● Only sequential data access ● Only tuple-wise data access ● Map-Side joins have sort and size constraints ● Reduce-Side joins require secondary sorting of values ● … ● ...
  • 30. © 2013 IBM Corporation30 Impala (Type II) http://blog.cloudera.com/blog/wp-content/uploads/2012/10/impala.png
  • 31. © 2013 IBM Corporation31 Presto (Type II) https://www.facebook.com/notes/facebook-engineering/presto-interacting-with-petabytes-of-data-at-facebook/10151786197628920
  • 32. © 2013 IBM Corporation32 Spark / Shark (Type II) Source: http://bighadoop.files.wordpress.com/2014/04/spark-architecture.png
  • 33. © 2013 IBM Corporation33 BigSQL V3.0 (Type II) Like in Spark, MapReduce has been Kicked out :) (No JobTracker, No Task Tracker, But HDFS/GPFS remains)
  • 34. © 2013 IBM Corporation34 BigSQL V3.0 – Architecture Putting the story together…. Big SQL shares a common SQL dialect with DB2 Big SQL shares the same client drivers with DB2
  • 35. © 2013 IBM Corporation35 BigSQL V3.0 – Performance Query rewrites Exhaustive query rewrite capabilities Leverages additional metadata such as constraints and nullability Optimization Statistics and heuristic driven query optimization Query optimizer based upon decades of IBM RDBMS experience Tools and metrics Highly detailed explain plans and query diagnostic tools Extensive number of available performance metrics SELECT ITEM_DESC, SUM(QUANTITY_SOLD), AVG(PRICE), AVG(COST) FROM PERIOD, DAILY_SALES, PRODUCT, STORE WHERE PERIOD.PERKEY=DAILY_SALES.PERKEY AND PRODUCT.PRODKEY=DAILY_SALES.PRODKE Y AND STORE.STOREKEY=DAILY_SALES.STOREKEY AND CALENDAR_DATE BETWEEN AND '01/01/2012' AND '04/28/2012' AND STORE_NUMBER='03' AND CATEGORY=72 GROUP BY ITEM_DESC Access plan generationQuery transformation Dozens of query transformations Hundreds or thousands of access plan options Store Product Product Store NLJOIN Daily SalesNLJOIN Period NLJOIN Product NLJOIN Daily Sales NLJOIN Period NLJOIN Store HSJOIN Daily Sales HSJOIN Period HSJOIN Product StoreZZJOIN Daily Sales HSJOIN Period
  • 36. © 2013 IBM Corporation36 BigSQL V3.0 – Performance You are substantially faster if you don't use MapReduce IBM BigInsights v3.0, with Big SQL 3.0, is the only Hadoop distribution to successfully run ALL 99 TPC-DS queries and ALL 22 TPC-H queries without modification. Source: http://www.ibmbigdatahub.com/blog/big-deal-about- infosphere-biginsights-v30-big-sql
  • 37. © 2013 IBM Corporation37 BigSQL V3.0 – Query Federation Head Node Big SQL Compute Node Task Tracker Data Node Big SQL Compute Node Task Tracker Data Node Big SQL Compute Node Task Tracker Data Node Big SQL Compute Node Task Tracker Data Node Big SQL
  • 38. © 2013 IBM Corporation38 BigSQL V1.0 – Demo (small) ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley) ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
  • 39. © 2013 IBM Corporation39 BigSQL V1.0 – Demo (small) CREATE EXTERNAL TABLE trace ( hour integer, employeeid integer, departmentid integer, clientid integer, date string, timestamp string) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY 'n' STORED AS TEXTFILE LOCATION '/user/biadmin/32Gtest';
  • 40. © 2013 IBM Corporation40 BigSQL V1.0 – Demo (small)
  • 41. © 2013 IBM Corporation41 BigSQL V1.0 – Demo (small)
  • 42. © 2013 IBM Corporation42 BigSQL V1.0 – Demo (small) [bivm.ibm.com][biadmin] 1> select count(*) from trace1; +----------+ | | +----------+ | 11416740 | +----------+ 1 row in results(first row: 39.78s; total: 39.78s)
  • 43. © 2013 IBM Corporation43 BigSQL V1.0 – Demo (small) select count(hour), hour from trace group by hour order by hour 30 rows in results(first row: 37.98s; total: 37.99s)
  • 44. © 2013 IBM Corporation44 BigSQL V1.0 – Demo (small) [bivm.ibm.com][biadmin] 1> select count(*) from trace1 t3 inner join trace2 t4 on t3.hour=t4.hour; +--------+ | | +--------+ | 477340 | +--------+ 1 row in results(first row: 32.24s; total: 32.25s)
  • 45. © 2013 IBM Corporation45 BigSQL V3.0 – Demo (small) CREATE HADOOP TABLE trace3 ( hour int, employeeid int, departmentid int,clientid int, date varchar(30), timestamp varchar(30) ) row format delimited fields terminated by '|' stored as textfile;
  • 46. © 2013 IBM Corporation46 BigSQL V3.0 – Demo (small) [bivm.ibm.com][biadmin] 1> select count(*) from trace3; +----------+ | 1 | +----------+ | 12014733 | +----------+ 1 row in results(first row: 2.94s; total: 2.95s)
  • 47. © 2013 IBM Corporation47 BigSQL V3.0 – Demo (small) [bivm.ibm.com][biadmin] 1> select count(*) from trace3 t3 inner join trace4 t4 on t3.hour=t4.hour; +--------+ | 1 | +--------+ | 504360 | +--------+ 1 row in results(first row: 0.79s; total: 0.80s)
  • 48. © 2013 IBM Corporation48 BigSQL V3.0 – Demo (small) [bivm.ibm.com][biadmin] 1> select count(hour), hour from trace3 group by hour order by hour; 29 rows in results(first row: 1.88s; total: 1.89s)
  • 49. © 2013 IBM Corporation49 R on Hadoop ● IBM BigR (based on SystemML Almadan Research project) ● Rhadoop ● RHIPE ● ... “R” Hadoop
  • 50. © 2013 IBM Corporation50
  • 51. © 2013 IBM Corporation5151 Goal: Find column mean Problems: • Column vector can not fit into memory You have to partition and parallelize
  • 52. © 2013 IBM Corporation52 ● Sampling  Full dataset > RAM  Example: use 1% vs 100% of dataset  Precision loss from skewed/sparse data ● Numerical Stability  Limitation from finite precision in computing  Algorithms must be carefully implemented  Instability causes errors to cascade throughout your analysis Catastrophic Cancellation Error: 6.375 – 5.625 True value: 0.75 Computed: 0 Relative Error: 1.0 6.375 round to 6.0 5.625 round to 6.0
  • 53. © 2013 IBM Corporation53 Data in Hadoop You R User Data in distributed memory
  • 54. © 2013 IBM Corporation54 Data in Hadoop: Can run R on a single node R User Data in distributed memory You
  • 55. © 2013 IBM Corporation55 BigR (based on SystemML) SystemML compiles hybrid runtime plans ranging from in- memory, single machine (CP) to large-scale, cluster (MR) compute ● Challenge ● Guaranteed hard memory constraints (budget of JVM size) ● for arbitrary complex ML programs ● Key Technical Innovations ● CP & MR Runtime: Single machine & MR operations, integrated runtime ● Caching: Reuse and eviction of in-memory objects ● Cost Model: Accurate time and worst-case memory estimates ● Optimizer: Cost-based runtime plan generation ● Dyn. Recompiler: Re-optimization for initial unknowns Data size Runtime CP CP/MR MR Gradually exploit MR parallelism High performance computing for small data sizes. Scalable computing for large data sizes. Hybrid Plans
  • 56. © 2013 IBM Corporation56 R Clients SystemML Statistics Engine Data Sources Embedded R Execution IBM R Packages IBM R Packages Pull data (summaries) to R client Or, push R functions right on the data 1 2 3 © 2014 IBM Corporation17 IBM Internal Use Only BigR Architecture
  • 57. © 2013 IBM Corporation57 Big R Data Structures: Proxy to entire dataset data <- bigr.frame(…) Appears and acts like all of the data is on your laptop You
  • 58. © 2013 IBM Corporation58 BigR Demo (small) ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
  • 59. © 2013 IBM Corporation59 BigR Demo (small) library(bigr) bigr.connect(host="bigdata", port=7052, database="default", user="biadmin", password="xxx") is.bigr.connected() tbr <- bigr.frame(dataSource="DEL", coltypes = c("numeric","numeric","numeric","numeric","character","character"), dataPath="/user/biadmin/32Gtest", delimiter=",", header=F, useMapReduce=T) h <- bigr.histogram.stats(tbr$V1, nbins=24)
  • 60. © 2013 IBM Corporation60 BigR Demo (small) class bins counts centroids 1 ALL 0 18289280 1.583333 2 ALL 1 15360 2.750000 3 ALL 2 55040 3.916667 4 ALL 3 189440 5.083333 5 ALL 4 579840 6.250000 6 ALL 5 5292160 7.416667 7 ALL 6 8074880 8.583333 8 ALL 7 15653120 9.750000 ...
  • 61. © 2013 IBM Corporation61 BigR Demo (small)
  • 62. © 2013 IBM Corporation62 BigR Demo (small) jpeg('hist.jpg') bigr.histogram(tbr$V1, nbins=24) # This command runs on 32 GB / ~650.000.000 rows in HDFS dev.off()
  • 63. © 2013 IBM Corporation63 SPSS on Hadoop
  • 64. © 2013 IBM Corporation64 SPSS on Hadoop
  • 65. © 2013 IBM Corporation65 BigSheets Demo (small) ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley) ● 32 GB Data, ~650.000.000 rows (small, Innovation Center Zurich) ● 3 TB Data, ~ 60.937.500.000 rows (middle, Innovation Center Zurich) ● 0.7 PB Data, ~ 1.421875×10¹³ rows (large, Innovation Center Hursley)
  • 66. © 2013 IBM Corporation66 BigSheets Demo (small)
  • 67. © 2013 IBM Corporation67 BigSheets Demo (small) This command runs on 32 GB / ~650.000.000 rows in HDFS
  • 68. © 2013 IBM Corporation68 BigSheets Demo (small)
  • 69. © 2013 IBM Corporation69 Text Extraction (SystemT, AQL)
  • 70. © 2013 IBM Corporation70 Text Extraction (SystemT, AQL)
  • 71. © 2013 IBM Corporation71 If this is not enough? → BigData AppStore
  • 72. © 2013 IBM Corporation72 BigData AppStore, Eclipse Tooling ● Write your apps in ● Java (MapReduce) ● PigLatin,Jaql ● BigSQL/Hive/BigR ● Deploy it to BigInsights via Eclipse ● Automatically ● Schedule ● Update ● hdfs files ● BigSQL tables ● BigSheets collections
  • 73. © 2013 IBM Corporation73 Questions? http://www.ibm.com/software/data/bigdata/ Twitter: @RomeoKienzler, @IBMEcosystem_DE, @IBM_ISV_Alps