The document describes a case study on using Clojure for multi-core parallelization of the K-means clustering algorithm. It provides background on parallel programming concepts, an introduction to Clojure, and details on how the authors implemented a parallel K-means algorithm in Clojure using agents and software transactional memory. They present results showing speedups from parallelization and accuracy comparable to R's implementation on both synthetic and real-world datasets.
Multi-core Parallelization in Clojure - a Case Study
1. Multi-core Parallelization in Clojure - a Case Study
Johann M. Kraus and Hans A. Kestler
AG Bioinformatics and Systems Biology
Institute of Neural Information Processing
University of Ulm
29.06.2009
2. Outline
1. Concepts of parallel programming
2. Short introduction to Clojure
3. Multi-core parallel K-means - the case study
4. Analysis and Results
5. Summary
3. Parallel Programming
Definition:
Parallel programming is a form of programming where many calculations
are performed simultaneously.
• Physical constraints prevent frequency scaling of processors
• This led to an increasing interest in parallel hardware and parallel
programming
• Multi-core hardware is standard on desktop computers
• Parallel software can use this hardware to the full capacity
4. • Large problems are divided into smaller ones and the sub-problems are solved simultaneously
• Speedup S is limited by the fraction of parallelizable code P
• Amdahl's law: S = 1 / ((1 - P) + P/N), where N is the number of processors
[Figure: Amdahl's law - speedup (0-20) vs. number of processors (1 to 65536, log scale) for parallelizable code fractions 0.95, 0.90, 0.75, and 0.50]
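As a quick numerical illustration of the law above (not part of the original slides), the speedup formula can be evaluated directly; the class and parameter names are ours:

```java
public class Amdahl {
    // Amdahl's law: S = 1 / ((1 - P) + P / N)
    // p = parallelizable fraction of the code, n = number of processors
    public static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 95% parallelizable code, 8 cores give less than a 6x speedup
        System.out.printf("P=0.95, N=8     -> S = %.2f%n", speedup(0.95, 8));
        // ...and no processor count can push S past 1 / (1 - P) = 20
        System.out.printf("P=0.95, N=65536 -> S = %.2f%n", speedup(0.95, 65536));
    }
}
```

This makes the practical point of the plot concrete: the serial fraction, not the core count, dominates the achievable speedup.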
5. Concepts of Parallel Programming
Explicit vs. implicit parallelization
• Explicitly define communication and synchronization details for
each task:
• MPI
• Java Threads
• Functional programming allows implicit parallelization:
• Parallel processing of functions
• Functions are free of side-effects
• Data is immutable
6. Distributed vs. local hardware
• Master-slave parallelization (e.g. Message Passing Interface): a master process sends data to slave processes and collects their results
• Shared memory parallelization (e.g. Open Multi-Processing): several CPUs read from and write to a common shared memory
[Diagrams: a master with slaves 0-4 exchanging data and results; CPUs 0-3 attached to a shared memory via read/write]
7. Thread programming
• Threads are refinements of a process that share the same memory and can be processed separately and simultaneously
• Available in many languages, e.g. PThreads (C), Java Threads (Java), OpenMP Threads (C, Fortran)
• Execution of threads is handled by a scheduler that manages the available processing time
• Communication between threads is faster than communication between processes
• Invoking threads is also faster than fork/join of processes
[Diagram: thread life cycle - new -(start)-> runnable -(schedule)-> running -(block)-> waiting -(awake)-> runnable; running -(end)-> terminated]
8. Concurrency control via locking and synchronizing
• Concurrency control ensures that threads can access shared memory
without violating data integrity
• The most popular approach to concurrency is locking and synchronizing
public class Counter {
    private int value = 0;
    public synchronized void incr() {
        value = value + 1;
    }
}

Counter counter = new Counter();
counter.incr();
• Problems might occur when using too many locks, too few locks, the wrong locks, or locks in the wrong order
• Using locks can be fatally error-prone, e.g. deadlocks
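The `Counter` class above can be exercised to show that synchronized increments lose no updates under contention; the harness below is our illustration, not from the slides:

```java
public class CounterDemo {
    static class Counter {
        private int value = 0;
        // synchronized makes the read-modify-write atomic
        public synchronized void incr() { value = value + 1; }
        public synchronized int get() { return value; }
    }

    // Two threads each increment the shared counter perThread times.
    public static int countWithTwoThreads(int perThread) {
        Counter counter = new Counter();
        Runnable task = () -> {
            for (int i = 0; i < perThread; i++) counter.incr();
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        // With synchronized, the result is always 2 * perThread;
        // without it, lost updates would make the result nondeterministic.
        System.out.println(countWithTwoThreads(100_000)); // prints 200000
    }
}
```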
9. Concurrency control via transactional memory
• Transactional memory offers a flexible alternative to lock-based
concurrency control
• Its functionality is analogous to concurrency control in database management systems
• Transactions ensure properties:
• Atomicity: Either all changes of a transaction occur or none do
• Consistency: Only valid changes are committed
• Isolation: No transaction sees the effect of other transactions
• Durability: Changes from transactions will be persistent
10. • Software transactional memory maps transactional memory to concurrency control in parallel programming
[Sequence diagram: Transaction 0 and Transaction 1 each get the data, work on a consistent snapshot, and send back modified data; a transaction whose snapshot became stale retries, so only consistent data is ever committed]
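Clojure's STM is more sophisticated than this (it uses multi-version snapshots), but the optimistic get / modify / commit-or-retry cycle in the diagram can be sketched in Java with a compare-and-set loop; the class is our illustration, not Clojure's implementation:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class OptimisticRef<T> {
    private final AtomicReference<T> slot;

    public OptimisticRef(T initial) { slot = new AtomicReference<>(initial); }

    // "Transaction": read the current value, compute a new one, and
    // commit only if nobody else committed in between - otherwise retry.
    public T alter(UnaryOperator<T> f) {
        while (true) {
            T seen = slot.get();                      // get data
            T modified = f.apply(seen);               // work on a consistent snapshot
            if (slot.compareAndSet(seen, modified)) { // send modified data
                return modified;
            }
            // another "transaction" won the race: retry with fresh data
        }
    }

    public T deref() { return slot.get(); }

    public static void main(String[] args) {
        OptimisticRef<Integer> counter = new OptimisticRef<>(0);
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) counter.alter(n -> n + 1);
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        System.out.println(counter.deref()); // 20000: no lost updates, no locks
    }
}
```

Note that, unlike locking, no thread ever blocks another; a conflicting writer simply repeats its work on fresh data.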
11. Clojure
• Functional programming language hosted on the JVM
• Extends the code-as-data paradigm to maps and vectors
• Based on immutable data structures
• Provides built-in concurrency support via software transactional
memory
• Completely symbiotic with Java, e.g. easy access to Java libraries
• Platform independent
12. • Java interaction
(import '(cern.jet.random.sampling
          RandomSamplingAssistant))

(defn sample
  [n k]
  (seq (. RandomSamplingAssistant
          (sampleArray k (int-array (range n))))))
• Dynamic typing and multi-methods
• An object is defined as the sum of what it can do (methods),
rather than the sum of what it is (type hierarchy)
• Add type hints to speed up code
(defn da+ [#^doubles as #^doubles bs]
  (amap as i ret
        (+ (aget as i) (aget bs i))))
13. Transactional references and STM
• Transactional references ensure safe coordinated synchronous
changes to mutable storage locations
• Are bound to a single storage location for their lifetime
• Only allow mutation of that location to occur within transactions
• Available operations are ref-set, alter, and commute
• No explicit locking is required
(def counter (ref 0))
(dosync (alter counter inc))
14. Agents
• Agents allow independent asynchronous change of mutable
locations
• Are bound to a single storage location for their lifetime
• Only allow mutation of that location to a new state to occur as a
result of an action
• Actions are functions that are asynchronously applied to the state
of an Agent
• The return value of an action becomes the new state of the Agent
• Agents are integrated with the STM
(def counter (agent 0))
(send counter inc)
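A Clojure agent can be approximated in Java by a holder whose actions are queued on a single dedicated thread, so the state itself never needs locking. This is a deliberately minimal sketch (Clojure agents additionally offer error handling, send-off, and STM integration); the class and method names are ours:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

public class MiniAgent<T> {
    private volatile T state;
    // one thread applies all actions in order, like an agent's action queue
    private final ExecutorService actions = Executors.newSingleThreadExecutor();

    public MiniAgent(T initial) { state = initial; }

    // like Clojure's send: asynchronous, returns immediately
    public void send(UnaryOperator<T> action) {
        actions.submit(() -> state = action.apply(state));
    }

    // like await + deref: drain the queued actions, then read the state
    // (for simplicity this sketch shuts the agent down afterwards)
    public T awaitAndGet() {
        actions.shutdown();
        try {
            actions.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return state;
    }

    public static void main(String[] args) {
        MiniAgent<Integer> counter = new MiniAgent<>(0);
        for (int i = 0; i < 1000; i++) counter.send(n -> n + 1);
        System.out.println(counter.awaitAndGet()); // 1000
    }
}
```

The key design point mirrors the slide: mutation happens only as the result of an action, and actions on one agent are serialized, so callers never coordinate explicitly.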
15. Cluster analysis
• Given a data set X, compute a partition of X into k disjoint clusters C, such that:
(1) C_1 ∪ ... ∪ C_k = X
(2) C_i ≠ ∅ and C_i ∩ C_j = ∅ for i ≠ j
• How many clusters are in the data set?
[Figure: two example data sets, one with 3 clusters and one with 9 clusters]
16. Cluster algorithms
• For all possible partitions, evaluate the objective function f and search for the optimum.
• The cardinality of the set of all possible partitions is given by the Stirling numbers of the second kind:
S_N^(k) = (1/k!) * sum_{i=0}^{k} (-1)^(k-i) * (k choose i) * i^N
[Plot: runtime (nanoseconds) of exhaustive partition search vs. number of data points and number of clusters]
Cluster algorithms provide a heuristic for this search:
• Partitional clustering (K-means, Neuralgas, SOM, Fuzzy C-means, ...)
• Hierarchical clustering (Divisive/agglomerative, Complete linkage, ...)
• Graph-based clustering (Spectral clustering, NMF, Affinity propagation, ...)
• Model-based clustering, Biclustering, Semi-supervised clustering
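The Stirling-number formula above can be evaluated directly, which shows why exhaustive search over all partitions is hopeless even for tiny problems; the helper below is our illustration:

```java
import java.math.BigInteger;

public class Stirling {
    // Stirling number of the second kind:
    // S(N, k) = (1/k!) * sum_{i=0}^{k} (-1)^(k-i) * C(k, i) * i^N
    public static BigInteger stirling2(int n, int k) {
        BigInteger sum = BigInteger.ZERO;
        for (int i = 0; i <= k; i++) {
            BigInteger term = binomial(k, i).multiply(BigInteger.valueOf(i).pow(n));
            sum = ((k - i) % 2 == 0) ? sum.add(term) : sum.subtract(term);
        }
        return sum.divide(factorial(k)); // the alternating sum is divisible by k!
    }

    static BigInteger binomial(int n, int r) {
        return factorial(n).divide(factorial(r).multiply(factorial(n - r)));
    }

    static BigInteger factorial(int n) {
        BigInteger f = BigInteger.ONE;
        for (int i = 2; i <= n; i++) f = f.multiply(BigInteger.valueOf(i));
        return f;
    }

    public static void main(String[] args) {
        // Already for 10 points and 3 clusters there are 9330 partitions,
        // and the count grows super-exponentially in N.
        System.out.println(stirling2(10, 3)); // 9330
        System.out.println(stirling2(100, 5));
    }
}
```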
17. K-means algorithm
Function KMeans
Input:  X = {x_1, ..., x_n} (Data to be clustered)
        k (Number of clusters)
Output: C = {c_1, ..., c_k} (Cluster centroids)
        m: X -> C (Cluster assignments)

Initialize C (e.g. random selection from X)
While C has changed
    For each x_i in X
        m(x_i) = argmin_j distance(x_i, c_j)
    End
    For each c_j in C
        c_j = centroid({x_i | m(x_i) = j})
    End
End
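The pseudocode above translates almost line-for-line into a one-dimensional sketch; data values and the fixed initialization below are illustrative, not from the case study:

```java
import java.util.Arrays;

public class KMeans1D {
    // Sequential 1-D K-means following the pseudocode:
    // alternate assignment and centroid update until C stops changing.
    public static double[] kmeans(double[] x, double[] initialCentroids) {
        double[] c = initialCentroids.clone();
        int[] m = new int[x.length];
        boolean changed = true;
        while (changed) {
            // assignment step: m(x_i) = argmin_j distance(x_i, c_j)
            for (int i = 0; i < x.length; i++) {
                int best = 0;
                for (int j = 1; j < c.length; j++)
                    if (Math.abs(x[i] - c[j]) < Math.abs(x[i] - c[best])) best = j;
                m[i] = best;
            }
            // update step: c_j = centroid of the points assigned to j
            changed = false;
            for (int j = 0; j < c.length; j++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < x.length; i++)
                    if (m[i] == j) { sum += x[i]; count++; }
                if (count > 0 && sum / count != c[j]) {
                    c[j] = sum / count;
                    changed = true;
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        double[] centroids = kmeans(data, new double[]{1.0, 10.0});
        System.out.println(Arrays.toString(centroids)); // [1.5, 10.5]
    }
}
```

The parallel version discussed later keeps this structure and distributes only the two inner loops.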
18. Cluster Validation
• Evaluation requires repeated runs of clustering, e.g.:
• Resampled data sets
• Different parameters
• MCA-index: mean proportion of samples being consistent over different clusterings
MCA = (1/n) * max_π sum_{i=1}^{k} |A_i ∩ B_π(i)|
(π ranges over the permutations of the cluster labels)
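For small k, the maximum over label permutations in the MCA formula can be computed by brute force; this is our illustrative implementation, not the authors' code:

```java
public class MCA {
    // MCA = (1/n) * max over label permutations pi of
    //       sum_i |A_i ∩ B_pi(i)|
    // a and b are cluster labels (0..k-1) for the same n samples.
    public static double mca(int[] a, int[] b, int k) {
        int[] perm = new int[k];
        for (int i = 0; i < k; i++) perm[i] = i;
        return bestMatch(a, b, perm, 0) / (double) a.length;
    }

    // Recursively try every permutation of the labels of clustering a
    // and count how many samples then agree with clustering b.
    private static int bestMatch(int[] a, int[] b, int[] perm, int pos) {
        if (pos == perm.length) {
            int agree = 0;
            for (int i = 0; i < a.length; i++)
                if (perm[a[i]] == b[i]) agree++;
            return agree;
        }
        int best = 0;
        for (int j = pos; j < perm.length; j++) {
            swap(perm, pos, j);
            best = Math.max(best, bestMatch(a, b, perm, pos + 1));
            swap(perm, pos, j);
        }
        return best;
    }

    private static void swap(int[] p, int i, int j) { int t = p[i]; p[i] = p[j]; p[j] = t; }

    public static void main(String[] args) {
        // identical clusterings up to label renaming -> MCA = 1.0
        System.out.println(mca(new int[]{0, 0, 1, 1}, new int[]{1, 1, 0, 0}, 2)); // 1.0
        // one of four samples placed inconsistently -> MCA = 0.75
        System.out.println(mca(new int[]{0, 0, 1, 1}, new int[]{0, 1, 1, 1}, 2)); // 0.75
    }
}
```

In practice the permutation maximum is found with an assignment (Hungarian) algorithm rather than brute force, but the O(k!) version shows the definition directly.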
19. Estimation of the expected value of a validation index
• Random label: randomly assign each item to one of k clusters
• Random partition: choose a random partition
• Random prototype: assign each item to its nearest prototype
• Mean value from 100 runs
[Plot: mean MCA index (0.0-1.0) vs. number of clusters (0-50) for the three randomization baselines]
20. Multi-core K-means with Clojure
• Split the data set into smaller pieces that are handled by agents
• Each cluster is represented by an agent
• Add a commutative list of cluster members within a transactional
reference to accelerate the centroid update step
[Diagram: the data set is split across data agents 0..n; each cluster agent 0..k owns a member ref that collects its cluster members - cluster agents read the data, data agents write memberships]
21. [Diagrams: simultaneous read - every cluster agent reads all data agents in parallel; simultaneous write - every data agent writes memberships into the member refs in parallel]
22. read: (nearest-cluster), write: (commute), (assoc)

(defn assignment []
  (map #(send % update-dataagent) DataAgents))

(defn update-dataagent [datapoints]
  (map update-datapoint datapoints))

(defn update-datapoint [datapoint]
  (let [newass (nearest-cluster datapoint)]
    (dosync (commute (nth MemberRefs newass)
                     conj (:data datapoint)))
    (assoc datapoint :assignment newass)))
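The same design idea - parallelize the assignment step over all cores and collect cluster memberships concurrently - can be sketched in Java with parallel streams. This is our analogue for illustration, not the McKmeans code; all names are ours:

```java
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelAssignment {
    // Assignment step of K-means: each data point index is assigned to
    // its nearest centroid; the work is spread over the available cores
    // and the member lists are accumulated in a concurrent map
    // (playing the role of the member refs in the Clojure version).
    public static ConcurrentMap<Integer, List<Integer>> assign(double[] x, double[] centroids) {
        return IntStream.range(0, x.length)
                .parallel()
                .boxed()
                .collect(Collectors.groupingByConcurrent(i -> nearest(x[i], centroids)));
    }

    static int nearest(double value, double[] centroids) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++)
            if (Math.abs(value - centroids[j]) < Math.abs(value - centroids[best])) best = j;
        return best;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        ConcurrentMap<Integer, List<Integer>> members = assign(data, new double[]{1.5, 10.5});
        System.out.println(members.get(0)); // indices of points nearest centroid 1.5
        System.out.println(members.get(1)); // indices of points nearest centroid 10.5
    }
}
```

As in the slide's `commute`, the concurrent collector exploits the fact that adding a member to a cluster is order-independent, so writers need not be serialized.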
23. Benchmark results
Large data sets (artificial):
• Each data point is sampled from N(0,1)
• Summary for 10 runs of K-means
• Settings: 10,000 cases x 100 dimensions and 1,000,000 cases x 200 dimensions, 20 clusters each
[Bar charts: runtimes of ParaKMeans, K-means (R), and McKmeans - in seconds for the small data set (up to ~150 s) and in minutes for the large one (up to ~450 min)]
24. • Number of computer cores used / • Number of data agents used
• Setting: 100,000 x 500, 20 clusters
[Plots: runtime (seconds) vs. number of computer cores (1, 4, 8) and vs. number of data agents (4, 6, 8, 10)]
25. Large data sets with cluster structure
• Data sampled from a multivariate normal distribution
• 100,000 samples, 200/500 dimensions, 10/20 clusters
[Bar chart: runtime (seconds, up to ~2000) of K-means (R) and McKmeans for each dimensions/clusters combination (200/10, 200/20, 500/10, 500/20)]
26. Accuracy compared to the known grouping of data
• Measured with the MCA index
• Red bars indicate the random-prototype baseline
• Settings: 100,000 x 200 and 100,000 x 500, each with 10 and 20 clusters
[Bar chart: MCA index (0.0-1.0) of McKmeans and K-means (R) in each setting]
27. Real world data set
• Microarray data (Radiation-induced changes in
human gene expression)
• 22277 samples (genes) and 465 features (profiles)
[Bar chart: runtime (seconds, up to ~350) of K-means (R) and McKmeans for 2, 5, 10, and 20 clusters]
Smirnov D, Morley M, Shin E, Spielman R, Cheung V: Genetic analysis of radiation-induced changes in human gene expression. Nature 2009, 459:587–591
28. Application to Cluster Number Estimation
• Repeated clustering with different subsets of data
• Repeated for different number of clusters k
• Most stable clustering is produced for the ‘real’ cluster number
• Jackknife resampling
• Evaluation with the MCA index
• Data set: 100,000 samples, 100 features, 3 clusters
• 10 runs per cluster number
• 49.26 minutes on a dual quad-core 3.2 GHz machine
[Plot: MCA index (0.0-1.0) vs. number of clusters (2-7)]
29. Java GUI
(import '(javax.swing JFrame JLabel JTextField JButton)
        '(java.awt.event ActionListener)
        '(java.awt GridLayout))

(let [frame (new JFrame "Hello, World!")
      hello-button (new JButton "Say hello")
      hello-label (new JLabel "")]
  (. hello-button
     (addActionListener
       (proxy [ActionListener] []
         (actionPerformed [evt]
           (. hello-label
              (setText "Hello, World!"))))))
  (doto frame
    (.setLayout (new GridLayout 1 1 3 3))
    (.add hello-button)
    (.add hello-label)
    (.setSize 300 80)
    (.setVisible true)))
31. Summary
• Writing parallel programs usually requires careful software design and deep knowledge of thread-safe programming
• Concurrency control via transactional memory circumvents the problems of lock-based concurrency strategies
• Immutable data structures play a key role in software transactional memory
• Clojure combines Lisp, Java, and a powerful STM system
• This enables fast parallelization of algorithms, even for rapid prototyping
• Our simulations show good performance of the parallelized code