An explanation of how CloverETL Cluster processes big data using parallel processing on multiple nodes, including features such as load balancing, the data locality principle, and robustness.
11. CloverETL Cluster - BIG DATA features
Optimizing for speed...
12. Traditionally, data transformations were run on a single, big server with multiple CPUs and plenty of RAM.
And it was expensive.
13. Then the CloverETL team developed the concept of a data transformation cluster.
The CloverETL Cluster was born.
It creates a powerful data transformation beast from a set of low-cost commodity hardware machines.
14. Now, one data transformation can be set to run in parallel on all available nodes of the CloverETL Cluster.
15. Each cluster node executing the transformation is automatically fed with a different portion of the input data.
[Diagram: the input data is split into Part 1, Part 2, and Part 3, one part per cluster node.]
17. That sounds nice and simple.
But how is it really done?
18. CloverETL allows certain transformation components to be assigned to multiple cluster nodes.
Such components then run in multiple instances.
We call this Allocation.
[Diagram: components allocated to Node 1, Node 2, and Node 3 of the CloverETL Cluster; a component allocated to a single node runs 1x, a component allocated to all three nodes runs 3x.]
19. Special components allow incoming data to be split and sent in parallel flows to multiple nodes where the processing flow continues.
[Diagram: serial data on Node 1 is partitioned into parallel flows handled by the 1st, 2nd, and 3rd instances on Node 1, Node 2, and Node 3.]
20. Other components gather data from parallel flows back into a single, serial one.
[Diagram: partitioned data from the 1st, 2nd, and 3rd instances on Node 1, Node 2, and Node 3 is gathered back into serial data on Node 1.]
21. The original transformation is automatically “rewritten” into several smaller ones, which are executed by cluster nodes in parallel.
Which nodes will be used is determined by Allocation.
[Diagram: serial data is partitioned, processed by the 1st, 2nd, and 3rd instances on Node 1, Node 2, and Node 3, and gathered back into serial data.]
22. Let’s take a look at an example.
23. In this example, we’ll read data about company addresses. There are 10,499,849 records in total.
We also calculate statistics of the number of companies residing in each US state.
We get a total of 51 records – one record per US state.
(serial processing)
24. Here, we’re processing the same input data, but in parallel now.
We get a total of 51 records again.
[Diagram: a Split component feeds 3 parallel streams; each parallel stream gets a portion of the input data and produces partial results, which a Gather component combines.]
25. Go parallel in 1 minute.
[Diagram: the serial graph is turned into a parallel one with two drag & drop steps.]
26. What’s the Trick?
Split the input data into parallel streams.
Do the heavy lifting on smaller data portions in parallel.
Bring the individual pieces of results together at the end.
DONE
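The same three-step pattern can be sketched outside CloverETL as well. Below is a minimal, illustrative Python sketch of split / parallel processing / gather; the stream count, the sample records, and the heavy_lifting function are assumptions made up for this example, not anything from CloverETL itself:

from concurrent.futures import ProcessPoolExecutor

def split(records, n_streams):
    # Round-robin partitioning of the input into n parallel streams.
    streams = [[] for _ in range(n_streams)]
    for i, rec in enumerate(records):
        streams[i % n_streams].append(rec)
    return streams

def heavy_lifting(stream):
    # Placeholder for the real per-stream work (filtering, transforming, ...).
    return [rec.upper() for rec in stream]

def gather(partials):
    # Bring the individual pieces of results back into one serial stream.
    return [rec for part in partials for rec in part]

if __name__ == "__main__":
    records = ["al", "ak", "az", "ar", "ca", "co"]
    streams = split(records, n_streams=3)                      # 1. split
    with ProcessPoolExecutor(max_workers=3) as pool:
        partials = list(pool.map(heavy_lifting, streams))      # 2. heavy lifting in parallel
    print(gather(partials))                                    # 3. gather

The point of the sketch is only the shape of the flow: the split and gather steps are cheap, and the expensive middle step runs once per stream.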
28. A Sandbox
We assume you are familiar with the CloverETL Server’s concept of a SANDBOX.
A sandbox is a logical name for a file directory structure managed by the Server. It allows individual projects on the Server to be separated into logical units. Each CloverETL data transformation can access multiple sandboxes, either locally or remotely.
Let’s look at a special type of sandbox – partitioned.
29. In a partitioned sandbox, the input file is split into subfiles, each residing on a different node of the Cluster in a similarly structured folder.
The sandbox presents the “originals” – the combined data.
[Diagram: partitioned sandbox “SboxP” with Part 1, Part 2, and Part 3 stored on Node 1, Node 2, and Node 3.]
30. Partitioned Sandboxes
A partitioned sandbox is a logical abstraction on top of similarly structured folders on different Cluster nodes.
[Screenshot: the sandbox’s logical structure – a unified view of folders & files.]
[Screenshot: the sandbox’s physical structure – listing the locations/nodes of the files’ portions.]
31. Partitioned Sandbox & Allocation
A partitioned sandbox defines how data is partitioned across the nodes of the CloverETL Cluster.
Allocation defines how a transformation’s run is distributed across the nodes of the CloverETL Cluster.
The allocation can be set to derive from the sandbox layout: data processing happens where the data resides.
We tell the cluster to run our transformation components on nodes that also contain the portions of data we want to process.
32. Allocation Determined By a Partitioned Sandbox:
4 partitions → 4 parallel transformations.
There’s no gathering at the end – partitioned results are stored directly to the partitioned sandbox. Allocation for the aggregator is derived from the sandbox being used.
33. Allocation Determined By an Explicit Number:
8 parallel transformations.
Partitioning at the beginning and gathering at the end are necessary, as we need to cross the serial⇿parallel boundary twice.
34. A Data Skew
Data is not uniformly distributed across partitions. This is called a data skew.
It indicates that the chosen partitioning key is not the best one for maximum performance.
However, the chosen key allows us to perform a single-pass aggregation (no semi-results) – thus it’s a good tradeoff.
The busiest worker will have to process 2.5 million rows whereas the least busy only 0.67 million – that is, roughly 3.7x less.
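As a rough way to quantify the skew, one can compare the largest and smallest partition. The small Python sketch below uses only the two row counts quoted on the slide; the helper function itself is our own illustration, not part of CloverETL:

def skew(partition_sizes):
    # Ratio between the largest and smallest partition; 1.0 means perfectly uniform.
    return max(partition_sizes) / min(partition_sizes)

# Row counts from the slide: the busiest worker gets 2.5 million rows,
# the least busy only 0.67 million.
print(f"skew: {skew([2_500_000, 670_000]):.1f}x")   # -> 3.7x

Since the parallel phase finishes only when the busiest worker finishes, the busiest partition determines the wall-clock time of that phase.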
35. Parallel Pitfalls
Aggregating, Sorting, Joining…
When processing data in parallel, a few things should be considered.
Working in parallel means producing “parallel”/semi-results.
First, we produce 4 aggregated semi-results. Then we aggregate the semi-results to get the final result.
These partial results have to be further processed to get the final result.
[Diagram: record streams 1–4 each produce a semi-result; semi-results 1, 2, 3, 4 are then combined into the final result.]
The good news: when increasing or changing the number of parallel streams, we don’t have to change the transformation.
36. Parallel Pitfalls
Aggregating, Sorting, Joining…
Full transformation – parallel aggregation & post-processing of semi-results.
Example: parallel counting of occurrences of companies per state using count().
Why?
Step 1: we produce partial results using count(). Because records are partitioned round-robin, data for one state may appear in multiple parallel streams. For example, we might get data for NY as 4 partial results in 4 different streams.
Step 2: we merge all the partial results from the 4 parallel streams into a single sequence and then aggregate again to get the final numbers. At this step the aggregation function is sum() – we sum the partial counts.
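The count-then-sum idea can be shown with a few lines of plain Python (not CloverETL); the sample records and the three-stream round-robin split are invented purely for illustration:

from collections import Counter

def partial_count(stream):
    # Step 1: count() inside one parallel stream. With round-robin partitioning
    # the same state can appear in several streams, so these are only semi-results.
    return Counter(rec["state"] for rec in stream)

def final_sum(partials):
    # Step 2: merge the semi-results and sum() the partial counts per state.
    total = Counter()
    for partial in partials:
        total.update(partial)   # Counter.update adds the counts together
    return total

records = [{"state": s} for s in ["NY", "CA", "NY", "TX", "NY", "CA"]]
streams = [records[i::3] for i in range(3)]          # round-robin into 3 streams

partials = [partial_count(s) for s in streams]       # semi-results, one per stream
print(final_sum(partials))                           # Counter({'NY': 3, 'CA': 2, 'TX': 1})

Note that the second step really has to use sum(), not count(): counting the semi-results again would only count how many streams mention each state.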
37. Parallel Pitfalls
Aggregating, Sorting, Joining…
Parallel sorting
Why?
1. Sorting in parallel ➔ records are sorted within the individual parallel streams, but not across all streams.
2. Bringing the parallel sorted streams together into a serial stream ➔ records have to be merged according to the same key as used in the parallel sorting ➔ to produce an overall sorted serial result.
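Here is the same two-step idea as a hedged Python sketch (the state values are invented for the example): each stream is sorted independently, and the serial result is produced by a k-way merge on the same key.

import heapq

# Step 1: each parallel stream sorts its own records on the same key.
streams = [["NY", "AL", "TX"], ["CA", "OR"], ["AZ", "WA", "CO"]]
sorted_streams = [sorted(s) for s in streams]

# Step 2: merging (not concatenating) the pre-sorted streams on the same key
# yields an overall sorted serial result.
overall = list(heapq.merge(*sorted_streams))
print(overall)   # ['AL', 'AZ', 'CA', 'CO', 'NY', 'OR', 'TX', 'WA']

Simply concatenating the streams would give a result that is sorted only within each block, which is exactly the pitfall the slide warns about.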
38. Parallel Pitfalls
Aggregating, Sorting, Joining…
Parallel joining
Why?
Joining in parallel ➔ master & slave(s) records must be partitioned by the same key/field. The same key must be used for joining the records.
Otherwise, there is a danger that records from the master & slave with the same key will not join, as they end up in different parallel streams. A joiner joins only within one stream, not across streams.
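A small Python sketch of why the partitioning key matters for a parallel join (the sample records and the three-stream setup are assumptions for illustration; CloverETL’s own components are not involved): hash-partitioning both inputs on the join key keeps matching records in one stream, while round-robin partitioning can separate them so the per-stream joiner never sees the match.

def partition_by_key(records, n):
    # Hash-partition on the join key: equal keys always land in the same stream.
    streams = [[] for _ in range(n)]
    for rec in records:
        streams[hash(rec["state"]) % n].append(rec)
    return streams

def join_within_stream(masters, slaves):
    # A joiner instance only sees its own stream; it cannot match across streams.
    lookup = {s["state"]: s for s in slaves}
    return [(m, lookup[m["state"]]) for m in masters if m["state"] in lookup]

master = [{"state": s, "companies": 10} for s in ["NY", "CA", "TX"]]
slave = [{"state": s, "population": 1} for s in ["TX", "NY", "CA"]]

# Partitioning both inputs by the same key: every master record finds its slave.
m_by_key = partition_by_key(master, 3)
s_by_key = partition_by_key(slave, 3)
joined = [p for m, s in zip(m_by_key, s_by_key) for p in join_within_stream(m, s)]
print(len(joined))    # 3 - all master records joined

# Round-robin partitioning ignores the key: matching records can end up in
# different streams and the join silently drops them.
m_rr = [master[i::3] for i in range(3)]   # [NY], [CA], [TX]
s_rr = [slave[i::3] for i in range(3)]    # [TX], [NY], [CA]
joined_rr = [p for m, s in zip(m_rr, s_rr) for p in join_within_stream(m, s)]
print(len(joined_rr))  # 0 - no master record joined

This mirrors the two example slides that follow: partitioning by state joins every master record, while round-robin joins only the records that happen to land in the same stream.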
39. Parallel Pitfalls
Aggregating, Sorting, Joining…
Example: parallel joining – 3 parallel streams – partitioning by state
Slave streams (all states, partitioned by state):
1 ⥤ [AL AK AZ AR CA CO CT DC DE FL]
2 ⥤ [GA HI ID IL IN IA KS KY LA ME MD MA MI MN MS MO MT NE NV NH NJ NM NY NC ND]
3 ⥤ [OH OK OR PA RI SC SD TN TX UT VT VA WA WV WI WY]
Master streams (partitioned by the same state key):
1 ⥤ [AK AZ DE]
2 ⥤ [IL MD NY]
3 ⥤ [OR PA VA]
Result (all master records joined):
1 ⥤ [AK AZ DE]
2 ⥤ [IL MD NY]
3 ⥤ [OR PA VA]
40. Parallel Pitfalls
Aggregating, Sorting, Joining…
Example: parallel joining – 3 parallel streams – partitioning round-robin
Slave streams (all states, partitioned round-robin):
1 ⥤ [AL AR CT FL GA HI IA LA MA MS NE NJ NC OH PA SD UT WA WY]
2 ⥤ [AK CA DC IL KS ME MI MO NV NM ND OK RI TN VT WV]
3 ⥤ [AZ CO DE ID IN KY MD MN MT NH NY OR SC TX VA WI]
Master streams (partitioned round-robin):
[AK IL OR]
[AZ MD VA]
[DE NY PA]
Result (only some master records joined):
[]
[]
[DE NY]
Only DE and NY ended up in the same stream as their slave counterparts; the remaining master records landed in different streams and were not joined.
41. Bringing it all together…
Going parallel is easy!
Try it out for yourself.
☞ BIG DATA problems are handled through the Cluster’s scalability
☞ Existing transformations can be easily converted to parallel
☞ There’s no magic – users have full control over what’s happening
☞ CloverETL Cluster has built-in fault resiliency and load balancing
42. If you have any questions, check out:
www.cloveretl.com
forum.cloveretl.com
blog.cloveretl.com