Splout SQL: a richer open source database for Hadoop

Iván de Prado Alonso – CEO of Datasalt
www.datasalt.es
@ivanprado
@datasalt

Splout SQL
When Big Data Output is also Big Data

Full SQL*: unlike NoSQL

For Big Data: unlike RDBMS

Web latency & throughput: unlike Impala, Apache Drill, etc.

* Within each partition

How does it work?

Isolation between generation and serving

Generation
Generate tablespace CLIENTS_INFO with
table CLIENTS partitioned by CID
table SALES partitioned by CID

Input

Table CLIENTS
CID  Name
U20  Doug
U21  Ted
U40  John

Table SALES
SID   CID  Amount
S100  U20  102
S101  U20  60
S223  U40  99

Output: Tablespace CLIENTS_INFO

Partition U10 – U35
Table CLIENTS
CID  Name
U20  Doug
U21  Ted
Table SALES
SID   CID  Amount
S100  U20  102
S101  U20  60

Partition U36 – U60
Table CLIENTS
CID  Name
U40  John
Table SALES
SID   CID  Amount
S223  U40  99
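The generation step above can be sketched in a few lines of Python. This is only an illustration of range partitioning by CID, not Splout SQL's Hadoop-based generator: the helper names `partition_for` and `generate_tablespace` and the use of in-memory SQLite databases are assumptions made for the example. Only the table names, rows, and partition ranges come from the slide.

```python
import sqlite3

# Partition ranges from the slide: inclusive (low, high) bounds on CID.
PARTITIONS = [("U10", "U35"), ("U36", "U60")]

CLIENTS = [("U20", "Doug"), ("U21", "Ted"), ("U40", "John")]
SALES = [("S100", "U20", 102), ("S101", "U20", 60), ("S223", "U40", 99)]


def partition_for(cid):
    """Return the index of the partition whose range contains cid."""
    for i, (low, high) in enumerate(PARTITIONS):
        if low <= cid <= high:
            return i
    raise KeyError(f"no partition covers {cid!r}")


def generate_tablespace():
    """Build one SQLite database per partition (hypothetical helper)."""
    dbs = []
    for _ in PARTITIONS:
        db = sqlite3.connect(":memory:")
        db.execute("CREATE TABLE CLIENTS (CID TEXT, Name TEXT)")
        db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")
        dbs.append(db)
    # Both tables are partitioned by CID, so a client's rows are co-located.
    for cid, name in CLIENTS:
        dbs[partition_for(cid)].execute(
            "INSERT INTO CLIENTS VALUES (?, ?)", (cid, name))
    for sid, cid, amount in SALES:
        dbs[partition_for(cid)].execute(
            "INSERT INTO SALES VALUES (?, ?, ?)", (sid, cid, amount))
    return dbs
```

Because both tables use the same partitioning key, every row a join on CID needs ends up in the same SQLite file.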

Serving
For key = 'U20', tablespace = 'CLIENTS_INFO'
SELECT Name, sum(Amount) FROM
CLIENTS c, SALES s WHERE
c.CID = s.CID AND c.CID = 'U20';

Key 'U20' falls in partition U10 – U35, so only that partition is queried.

Partition U10 – U35
Table CLIENTS
CID  Name
U20  Doug
U21  Ted
Table SALES
SID   CID  Amount
S100  U20  102
S101  U20  60

Partition U36 – U60
Table CLIENTS
CID  Name
U40  John
Table SALES
SID   CID  Amount
S223  U40  99

Serving
For key = 'U40', tablespace = 'CLIENTS_INFO'
SELECT Name, sum(Amount) FROM
CLIENTS c, SALES s WHERE
c.CID = s.CID AND c.CID = 'U40';

Key 'U40' falls in partition U36 – U60, so only that partition is queried.

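The two serving slides can be mimicked with a small router: look up which partition range contains the key, then run the SQL against that partition only. This is a sketch under assumptions, not Splout SQL's serving code. `RANGES`, `open_partition`, and `query` are hypothetical names, and each partition is an in-memory SQLite database loaded with the rows from the slides.

```python
import sqlite3

# Hypothetical partition map: inclusive (low CID, high CID) per partition.
RANGES = [("U10", "U35"), ("U36", "U60")]

ROWS = {
    0: {"clients": [("U20", "Doug"), ("U21", "Ted")],
        "sales": [("S100", "U20", 102), ("S101", "U20", 60)]},
    1: {"clients": [("U40", "John")],
        "sales": [("S223", "U40", 99)]},
}


def open_partition(index):
    """Create an in-memory SQLite database holding one partition's rows."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE CLIENTS (CID TEXT, Name TEXT)")
    db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")
    db.executemany("INSERT INTO CLIENTS VALUES (?, ?)", ROWS[index]["clients"])
    db.executemany("INSERT INTO SALES VALUES (?, ?, ?)", ROWS[index]["sales"])
    return db


PARTITION_DBS = [open_partition(i) for i in range(len(RANGES))]


def query(key, sql, params=()):
    """Route the query to the single partition whose range contains key."""
    for (low, high), db in zip(RANGES, PARTITION_DBS):
        if low <= key <= high:
            return db.execute(sql, params).fetchall()
    raise KeyError(f"no partition covers {key!r}")


SQL = ("SELECT Name, sum(Amount) FROM CLIENTS c, SALES s "
       "WHERE c.CID = s.CID AND c.CID = ?")

print(query("U20", SQL, ("U20",)))  # [('Doug', 162)]
print(query("U40", SQL, ("U40",)))  # [('John', 99)]
```

The key is passed separately from the SQL because the router, not the database, decides which partition answers the query.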

Why does it scale?
Data is partitioned

Partitions are distributed across nodes

Adding more nodes increases capacity

Queries restricted to a single partition

Generation does not impact serving

OK, so what is Splout SQL useful for?

Big Data Analytics
Manageable output

Big Data Analytics
Sometimes Big Data output is also Big Data

Splout SQL allows you to serve Big Data results

Building a Google Analytics
Imagine that one crazy day you decide to build some kind of Google Analytics…

Zillions of events
Millions of domains
Individual panel per domain

Requirements
Time-based charts (day/hour aggregations)

Flexible dimension breakdown
Per page, per browser
Per country, per language
…
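The two requirements above map naturally onto SQL aggregations inside one partition. The sketch below is illustrative only: the `pageviews` table, its columns, the sample rows, and the helper functions are hypothetical and do not come from Splout SQL. It assumes data has already been pre-aggregated per hour and that one domain's rows live in a single partition.

```python
import sqlite3

db = sqlite3.connect(":memory:")
# Hypothetical pre-aggregated table for one partition; not a Splout SQL schema.
db.execute("""CREATE TABLE pageviews (
    domain TEXT, day TEXT, hour INTEGER,
    page TEXT, browser TEXT, country TEXT, hits INTEGER)""")
db.executemany("INSERT INTO pageviews VALUES (?, ?, ?, ?, ?, ?, ?)", [
    ("example.com", "2013-01-01", 9, "/", "Firefox", "ES", 10),
    ("example.com", "2013-01-01", 9, "/docs", "Chrome", "US", 4),
    ("example.com", "2013-01-01", 10, "/", "Chrome", "ES", 7),
    ("example.org", "2013-01-01", 9, "/", "Chrome", "US", 3),
])
db.execute("CREATE INDEX idx_domain_day ON pageviews (domain, day, hour)")


def hourly_chart(domain, day):
    """Time-based chart: hits per hour for one domain and day."""
    return db.execute(
        "SELECT hour, sum(hits) FROM pageviews "
        "WHERE domain = ? AND day = ? GROUP BY hour ORDER BY hour",
        (domain, day)).fetchall()


def breakdown(domain, day, dimension):
    """Dimension breakdown: hits per value of the chosen column."""
    # Whitelist the column name, since it cannot be bound as a parameter.
    if dimension not in {"page", "browser", "country"}:
        raise ValueError(f"unsupported dimension: {dimension}")
    return db.execute(
        f"SELECT {dimension}, sum(hits) FROM pageviews "
        "WHERE domain = ? AND day = ? "
        f"GROUP BY {dimension} ORDER BY {dimension}",
        (domain, day)).fetchall()


print(hourly_chart("example.com", "2013-01-01"))          # [(9, 14), (10, 7)]
print(breakdown("example.com", "2013-01-01", "browser"))  # [('Chrome', 11), ('Firefox', 10)]
```

New breakdowns only need a different GROUP BY column, which is the flexibility the slide asks for.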

Splout SQL provides SQL consolidated views for Hadoop data

Let’s see more details about Splout SQL

Each partition is …
Backed by SQLite

Generated on Hadoop
Including any indexes needed
Data can be sorted before insertion to minimize disk seeks at query time
Pre-sampling for balancing partition size
Distributed on Splout SQL cluster
With replication for failover
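One common way to balance partition sizes by pre-sampling is to sort a random sample of the keys and use evenly spaced quantiles as range boundaries. The slides do not describe how Splout SQL samples, so the following is a generic sketch of that idea. `split_points` and `partition_of` are hypothetical names.

```python
import bisect
import random


def split_points(keys, n_partitions, sample_size=1000, seed=0):
    """Pick n_partitions - 1 boundary keys from a random sample of keys."""
    rng = random.Random(seed)
    sample = sorted(rng.sample(keys, min(sample_size, len(keys))))
    # Evenly spaced quantiles of the sample approximate equal-sized ranges.
    return [sample[len(sample) * i // n_partitions]
            for i in range(1, n_partitions)]


def partition_of(key, boundaries):
    """Index of the partition for key: the number of boundaries <= key."""
    return bisect.bisect_right(boundaries, key)


keys = [f"U{i:04d}" for i in range(10000)]
boundaries = split_points(keys, n_partitions=4)
sizes = [0, 0, 0, 0]
for k in keys:
    sizes[partition_of(k, boundaries)] += 1
print(boundaries, sizes)
```

With a sample much smaller than the data set the partitions are only approximately equal. A larger sample trades generation time for better balance.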

Atomicity
A tablespace is a set of tables that share the same partitioning schema

Tablespaces are versioned
Only one version served at a time

Several tablespaces can be deployed at once
All-or-nothing semantics (atomicity)
Rollback support
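The versioning rules above can be modelled with a toy registry: each tablespace keeps a list of versions plus a pointer to the one being served, a multi-tablespace deploy either moves all pointers or none, and rollback moves a pointer back. This is an in-memory illustration only. `TablespaceRegistry` and its methods are hypothetical and are not Splout SQL's API.

```python
class TablespaceRegistry:
    """Toy in-memory model of versioned tablespaces (hypothetical)."""

    def __init__(self):
        self.versions = {}  # tablespace name -> list of deployed versions
        self.serving = {}   # tablespace name -> index of the served version

    def deploy(self, new_versions):
        """Deploy several tablespaces at once with all-or-nothing semantics."""
        # Validate everything first so a failure leaves the registry untouched.
        for name, data in new_versions.items():
            if data is None:
                raise ValueError(f"tablespace {name!r} failed to load")
        # Build the new state on copies, then swap both maps together.
        versions = {k: list(v) for k, v in self.versions.items()}
        serving = dict(self.serving)
        for name, data in new_versions.items():
            versions.setdefault(name, []).append(data)
            serving[name] = len(versions[name]) - 1
        self.versions, self.serving = versions, serving

    def rollback(self, name):
        """Serve the previous version of a tablespace again."""
        if self.serving[name] == 0:
            raise ValueError(f"no earlier version of {name!r}")
        self.serving[name] -= 1

    def current(self, name):
        """Return the single version currently being served."""
        return self.versions[name][self.serving[name]]
```

A failed deploy raises before any pointer moves, so readers keep seeing the previous versions of every tablespace in the batch.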

Characteristics
Millisecond latencies ensured
Even when queries hit disk

Controlled by the developer, who selects the proper:
- Cluster topology
- Partitioning
- Indexes
- Data collocation (insertion order)
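Two of the levers listed above, indexes and insertion order, can be tried directly in SQLite, which backs each partition. The sketch below inserts rows sorted by the lookup column, adds an index on it, and asks SQLite whether the query would use that index. The index name `idx_sales_cid` and the sample rows are assumptions for the example.

```python
import sqlite3

db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE SALES (SID TEXT, CID TEXT, Amount INTEGER)")

rows = [("S223", "U40", 99), ("S101", "U20", 60), ("S100", "U20", 102)]
# Data collocation: insert sorted by the lookup column so rows for the same
# CID are written next to each other.
db.executemany("INSERT INTO SALES VALUES (?, ?, ?)",
               sorted(rows, key=lambda r: (r[1], r[0])))
# Index on the column used in the WHERE clause.
db.execute("CREATE INDEX idx_sales_cid ON SALES (CID)")

plan = db.execute(
    "EXPLAIN QUERY PLAN SELECT sum(Amount) FROM SALES WHERE CID = ?",
    ("U20",)).fetchall()
# The last column of each plan row is a human-readable description.
plan_text = " ".join(row[-1] for row in plan)
uses_index = "idx_sales_cid" in plan_text

total = db.execute(
    "SELECT sum(Amount) FROM SALES WHERE CID = ?", ("U20",)).fetchone()[0]
print(uses_index, total)
```

Checking the query plan at generation time is a cheap way to confirm that a query pattern will not fall back to a full table scan.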

Characteristics (II)
100% SQL
But restricted to a single partition
Real-time aggregations
Joins

Scalability
In data capacity
In performance

Characteristics (III)
Atomicity
New data replaces old data all at once

High availability
Through the use of replication

Open Source

Characteristics (IV)
Easy to manage
Changing the size of the cluster can be done without any downtime

Read only
Data is updated in batches
Updates come from new tablespace deployments

Characteristics (V)
Native connectors
Hive
Pig
Cascading

API - Generation
Command line
Loading CSV files
$ hadoop jar splout-*-hadoop.jar generate …

Java API

Connectors

API - Service
REST API

JSON response

Benchmark
350 GB Wikipedia logs
Aggregation queries touching 15 rows on average
2-machine cluster
900 queries/second, 80 ms/query, 80 threads

Benchmark (II)
4-machine cluster
3150 queries/second, 40 ms/query, 160 threads

More info:
http://sploutsql.com/performance.html
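As a rough sanity check on the two benchmark slides, the reported throughput can be compared with the ceiling implied by thread count and latency. This assumes each thread is a blocking client that issues one query at a time and that the quoted latency is an average. The slides do not state either, so treat the bound as an estimate.

```python
def max_throughput(threads, latency_ms):
    """Upper bound on queries/second for blocking clients (assumed model)."""
    return threads * 1000 / latency_ms


RUNS = {
    "2 machines": {"threads": 80, "latency_ms": 80, "reported_qps": 900},
    "4 machines": {"threads": 160, "latency_ms": 40, "reported_qps": 3150},
}

for name, run in RUNS.items():
    bound = max_throughput(run["threads"], run["latency_ms"])
    print(f"{name}: bound {bound:.0f} q/s, reported {run['reported_qps']} q/s")

# Throughput ratio between the two runs.
speedup = RUNS["4 machines"]["reported_qps"] / RUNS["2 machines"]["reported_qps"]
print(f"speedup: {speedup:.1f}x")
```

Under that model the bounds are 1000 and 4000 queries/second, so both reported figures sit below their ceilings, and throughput grows 3.5x when both machines and client threads are doubled.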

Web latency

SQL

Consolidated Views

For Hadoop
“A good candidate for the serving layer of a lambda architecture”

www.SploutCloud.com - Splout SQL as a service

Future work
Growing the community
Do you want to collaborate? 

Automatic rebalancing on failover
Almost done

Some read/write capabilities
Enabling Splout SQL to become the speed layer on lambda architectures


Questions?
