The document describes a case study on using Clojure for multi-core parallelization of the K-means clustering algorithm. It provides background on parallel programming concepts, an introduction to Clojure, and details on how the authors implemented a parallel K-means algorithm in Clojure using agents and software transactional memory. They present results showing speedups from parallelization and accuracy comparable to R's implementation on both synthetic and real-world datasets.
Multi-core Parallelization in Clojure - a Case Study
1. Multi-core Parallelization in Clojure - a Case Study
Johann M. Kraus and Hans A. Kestler
AG Bioinformatics and Systems Biology
Institute of Neural Information Processing
University of Ulm
29.06.2009
2. Outline
1. Concepts of parallel programming
2. Short introduction to Clojure
3. Multi-core parallel K-means - the case study
4. Analysis and Results
5. Summary
3. Parallel Programming
Definition:
Parallel programming is a form of programming where many calculations
are performed simultaneously.
• Physical constraints prevent frequency scaling of processors
• This led to an increasing interest in parallel hardware and parallel
programming
• Multi-core hardware is standard on desktop computers
• Parallel software can use this hardware to the full capacity
4. • Large problems are divided into smaller ones and the sub-problems are solved simultaneously
• Speedup S is limited by the fraction of parallelizable code P
• Amdahl's law: S = 1 / ((1 - P) + P/N), where N is the number of processors
[Figure: Amdahl's law - speedup (0-20) vs. number of processors (1 to 65536, log scale) for parallelizable code fractions 0.95, 0.90, 0.75, and 0.50]
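As a quick numerical illustration of the law above (not part of the original slides), the speedup formula can be evaluated directly; the class and parameter names are ours:

```java
public class Amdahl {
    // Amdahl's law: S = 1 / ((1 - P) + P / N)
    // p = parallelizable fraction of the code, n = number of processors
    public static double speedup(double p, int n) {
        return 1.0 / ((1.0 - p) + p / n);
    }

    public static void main(String[] args) {
        // Even with 95% parallelizable code, 8 cores give less than a 6x speedup
        System.out.printf("P=0.95, N=8     -> S = %.2f%n", speedup(0.95, 8));
        // ...and no processor count can push S past 1 / (1 - P) = 20
        System.out.printf("P=0.95, N=65536 -> S = %.2f%n", speedup(0.95, 65536));
    }
}
```

This makes the practical point of the plot concrete: the serial fraction, not the core count, dominates the achievable speedup.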
5. Concepts of Parallel Programming
Explicit vs. implicit parallelization
• Explicitly define communication and synchronization details for
each task:
• MPI
• Java Threads
• Functional programming allows implicit parallelization:
• Parallel processing of functions
• Functions are free of side-effects
• Data is immutable
6. Distributed vs. local hardware
• Master-slave parallelization (e.g. Message Passing Interface): a master process sends data to slave processes and collects their results
• Shared memory parallelization (e.g. Open Multi-Processing): several CPUs read from and write to a common shared memory
[Diagrams: a master with slaves 0-4 exchanging data and results; CPUs 0-3 attached to a shared memory via read/write]
7. Thread programming
• Threads are refinements of a process that share the same memory and can be processed separately and simultaneously
• Available in many languages, e.g. PThreads (C), Java Threads (Java), OpenMP Threads (C, Fortran)
• Execution of threads is handled by a scheduler that manages the available processing time
• Communication between threads is faster than communication between processes
• Invoking threads is also faster than fork/join of processes
[Diagram: thread life cycle - new -(start)-> runnable -(schedule)-> running -(block)-> waiting -(awake)-> runnable; running -(end)-> terminated]
8. Concurrency control via locking and synchronizing
• Concurrency control ensures that threads can access shared memory
without violating data integrity
• The most popular approach to concurrency is locking and synchronizing
public class Counter {
    private int value = 0;
    public synchronized void incr() {
        value = value + 1;
    }
}

Counter counter = new Counter();
counter.incr();
• Problems might occur when using too many locks, too few locks, the wrong locks, or locks in the wrong order
• Using locks can be fatally error-prone, e.g. deadlocks
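The `Counter` class above can be exercised to show that synchronized increments lose no updates under contention; the harness below is our illustration, not from the slides:

```java
public class CounterDemo {
    static class Counter {
        private int value = 0;
        // synchronized makes the read-modify-write atomic
        public synchronized void incr() { value = value + 1; }
        public synchronized int get() { return value; }
    }

    // Two threads each increment the shared counter perThread times.
    public static int countWithTwoThreads(int perThread) {
        Counter counter = new Counter();
        Runnable task = () -> {
            for (int i = 0; i < perThread; i++) counter.incr();
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        try {
            t1.join(); t2.join();
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return counter.get();
    }

    public static void main(String[] args) {
        // With synchronized, the result is always 2 * perThread;
        // without it, lost updates would make the result nondeterministic.
        System.out.println(countWithTwoThreads(100_000)); // prints 200000
    }
}
```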
9. Concurrency control via transactional memory
• Transactional memory offers a flexible alternative to lock-based
concurrency control
• Its functionality is analogous to concurrency control in database management systems
• Transactions ensure properties:
• Atomicity: Either all changes of a transaction occur or none do
• Consistency: Only valid changes are committed
• Isolation: No transaction sees the effect of other transactions
• Durability: Changes from transactions will be persistent
10. • Software transactional memory maps transactional memory to concurrency control in parallel programming
[Sequence diagram: Transaction 0 and Transaction 1 each get the data, work on a consistent snapshot, and send back modified data; a transaction whose snapshot became stale retries, so only consistent data is ever committed]
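Clojure's STM is more sophisticated than this (it uses multi-version snapshots), but the optimistic get / modify / commit-or-retry cycle in the diagram can be sketched in Java with a compare-and-set loop; the class is our illustration, not Clojure's implementation:

```java
import java.util.concurrent.atomic.AtomicReference;
import java.util.function.UnaryOperator;

public class OptimisticRef<T> {
    private final AtomicReference<T> slot;

    public OptimisticRef(T initial) { slot = new AtomicReference<>(initial); }

    // "Transaction": read the current value, compute a new one, and
    // commit only if nobody else committed in between - otherwise retry.
    public T alter(UnaryOperator<T> f) {
        while (true) {
            T seen = slot.get();                      // get data
            T modified = f.apply(seen);               // work on a consistent snapshot
            if (slot.compareAndSet(seen, modified)) { // send modified data
                return modified;
            }
            // another "transaction" won the race: retry with fresh data
        }
    }

    public T deref() { return slot.get(); }

    public static void main(String[] args) {
        OptimisticRef<Integer> counter = new OptimisticRef<>(0);
        Runnable task = () -> {
            for (int i = 0; i < 10_000; i++) counter.alter(n -> n + 1);
        };
        Thread t1 = new Thread(task), t2 = new Thread(task);
        t1.start(); t2.start();
        try { t1.join(); t2.join(); } catch (InterruptedException e) { throw new RuntimeException(e); }
        System.out.println(counter.deref()); // 20000: no lost updates, no locks
    }
}
```

Note that, unlike locking, no thread ever blocks another; a conflicting writer simply repeats its work on fresh data.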
11. Clojure
• Functional programming language hosted on the JVM
• Extends the code-as-data paradigm to maps and vectors
• Based on immutable data structures
• Provides built-in concurrency support via software transactional
memory
• Completely symbiotic with Java, e.g. easy access to Java libraries
• Platform independent
12. • Java interaction
(import '(cern.jet.random.sampling
          RandomSamplingAssistant))

(defn sample
  [n k]
  (seq (. RandomSamplingAssistant
          (sampleArray k (int-array (range n))))))
• Dynamic typing and multi-methods
• An object is defined as the sum of what it can do (methods),
rather than the sum of what it is (type hierarchy)
• Add type hints to speed up code
(defn da+ [#^doubles as #^doubles bs]
  (amap as i ret
        (+ (aget as i) (aget bs i))))
13. Transactional references and STM
• Transactional references ensure safe coordinated synchronous
changes to mutable storage locations
• Are bound to a single storage location for their lifetime
• Only allow mutation of that location to occur within transactions
• Available operations are ref-set, alter, and commute
• No explicit locking is required
(def counter (ref 0))
(dosync (alter counter inc))
14. Agents
• Agents allow independent asynchronous change of mutable
locations
• Are bound to a single storage location for their lifetime
• Only allow mutation of that location to a new state to occur as a
result of an action
• Actions are functions that are asynchronously applied to the state
of an Agent
• The return value of an action becomes the new state of the Agent
• Agents are integrated with the STM
(def counter (agent 0))
(send counter inc)
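A Clojure agent can be approximated in Java by a holder whose actions are queued on a single dedicated thread, so the state itself never needs locking. This is a deliberately minimal sketch (Clojure agents additionally offer error handling, send-off, and STM integration); the class and method names are ours:

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;
import java.util.function.UnaryOperator;

public class MiniAgent<T> {
    private volatile T state;
    // one thread applies all actions in order, like an agent's action queue
    private final ExecutorService actions = Executors.newSingleThreadExecutor();

    public MiniAgent(T initial) { state = initial; }

    // like Clojure's send: asynchronous, returns immediately
    public void send(UnaryOperator<T> action) {
        actions.submit(() -> state = action.apply(state));
    }

    // like await + deref: drain the queued actions, then read the state
    // (for simplicity this sketch shuts the agent down afterwards)
    public T awaitAndGet() {
        actions.shutdown();
        try {
            actions.awaitTermination(10, TimeUnit.SECONDS);
        } catch (InterruptedException e) {
            throw new RuntimeException(e);
        }
        return state;
    }

    public static void main(String[] args) {
        MiniAgent<Integer> counter = new MiniAgent<>(0);
        for (int i = 0; i < 1000; i++) counter.send(n -> n + 1);
        System.out.println(counter.awaitAndGet()); // 1000
    }
}
```

The key design point mirrors the slide: mutation happens only as the result of an action, and actions on one agent are serialized, so callers never coordinate explicitly.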
15. Cluster analysis
• Given a data set X, compute a partition of X into k disjoint clusters C, such that:
(1) C_1 ∪ ... ∪ C_k = X
(2) C_i ≠ ∅ and C_i ∩ C_j = ∅ for i ≠ j
• How many clusters are in the data set?
[Figure: two example data sets, one with 3 clusters and one with 9 clusters]
16. Cluster algorithms
• For all possible partitions, evaluate the objective function f and search for the optimum.
• The cardinality of the set of all possible partitions is given by the Stirling numbers of the second kind:
S_N^(k) = (1/k!) * sum_{i=0}^{k} (-1)^(k-i) * (k choose i) * i^N
[Plot: runtime (nanoseconds) of exhaustive partition search vs. number of data points and number of clusters]
Cluster algorithms provide a heuristic for this search:
• Partitional clustering (K-means, Neuralgas, SOM, Fuzzy C-means, ...)
• Hierarchical clustering (Divisive/agglomerative, Complete linkage, ...)
• Graph-based clustering (Spectral clustering, NMF, Affinity propagation, ...)
• Model-based clustering, Biclustering, Semi-supervised clustering
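The Stirling-number formula above can be evaluated directly, which shows why exhaustive search over all partitions is hopeless even for tiny problems; the helper below is our illustration:

```java
import java.math.BigInteger;

public class Stirling {
    // Stirling number of the second kind:
    // S(N, k) = (1/k!) * sum_{i=0}^{k} (-1)^(k-i) * C(k, i) * i^N
    public static BigInteger stirling2(int n, int k) {
        BigInteger sum = BigInteger.ZERO;
        for (int i = 0; i <= k; i++) {
            BigInteger term = binomial(k, i).multiply(BigInteger.valueOf(i).pow(n));
            sum = ((k - i) % 2 == 0) ? sum.add(term) : sum.subtract(term);
        }
        return sum.divide(factorial(k)); // the alternating sum is divisible by k!
    }

    static BigInteger binomial(int n, int r) {
        return factorial(n).divide(factorial(r).multiply(factorial(n - r)));
    }

    static BigInteger factorial(int n) {
        BigInteger f = BigInteger.ONE;
        for (int i = 2; i <= n; i++) f = f.multiply(BigInteger.valueOf(i));
        return f;
    }

    public static void main(String[] args) {
        // Already for 10 points and 3 clusters there are 9330 partitions,
        // and the count grows super-exponentially in N.
        System.out.println(stirling2(10, 3)); // 9330
        System.out.println(stirling2(100, 5));
    }
}
```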
17. K-means algorithm
Function KMeans
Input:  X = {x_1, ..., x_n} (Data to be clustered)
        k (Number of clusters)
Output: C = {c_1, ..., c_k} (Cluster centroids)
        m: X -> C (Cluster assignments)

Initialize C (e.g. random selection from X)
While C has changed
    For each x_i in X
        m(x_i) = argmin_j distance(x_i, c_j)
    End
    For each c_j in C
        c_j = centroid({x_i | m(x_i) = j})
    End
End
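The pseudocode above translates almost line-for-line into a one-dimensional sketch; data values and the fixed initialization below are illustrative, not from the case study:

```java
import java.util.Arrays;

public class KMeans1D {
    // Sequential 1-D K-means following the pseudocode:
    // alternate assignment and centroid update until C stops changing.
    public static double[] kmeans(double[] x, double[] initialCentroids) {
        double[] c = initialCentroids.clone();
        int[] m = new int[x.length];
        boolean changed = true;
        while (changed) {
            // assignment step: m(x_i) = argmin_j distance(x_i, c_j)
            for (int i = 0; i < x.length; i++) {
                int best = 0;
                for (int j = 1; j < c.length; j++)
                    if (Math.abs(x[i] - c[j]) < Math.abs(x[i] - c[best])) best = j;
                m[i] = best;
            }
            // update step: c_j = centroid of the points assigned to j
            changed = false;
            for (int j = 0; j < c.length; j++) {
                double sum = 0;
                int count = 0;
                for (int i = 0; i < x.length; i++)
                    if (m[i] == j) { sum += x[i]; count++; }
                if (count > 0 && sum / count != c[j]) {
                    c[j] = sum / count;
                    changed = true;
                }
            }
        }
        return c;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        double[] centroids = kmeans(data, new double[]{1.0, 10.0});
        System.out.println(Arrays.toString(centroids)); // [1.5, 10.5]
    }
}
```

The parallel version discussed later keeps this structure and distributes only the two inner loops.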
18. Cluster Validation
• Evaluation requires repeated runs of clustering, e.g.:
• Resampled data sets
• Different parameters
• MCA-index: mean proportion of samples being consistent over different clusterings
MCA = (1/n) * max_π sum_{i=1}^{k} |A_i ∩ B_π(i)|
(π ranges over the permutations of the cluster labels)
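For small k, the maximum over label permutations in the MCA formula can be computed by brute force; this is our illustrative implementation, not the authors' code:

```java
public class MCA {
    // MCA = (1/n) * max over label permutations pi of
    //       sum_i |A_i ∩ B_pi(i)|
    // a and b are cluster labels (0..k-1) for the same n samples.
    public static double mca(int[] a, int[] b, int k) {
        int[] perm = new int[k];
        for (int i = 0; i < k; i++) perm[i] = i;
        return bestMatch(a, b, perm, 0) / (double) a.length;
    }

    // Recursively try every permutation of the labels of clustering a
    // and count how many samples then agree with clustering b.
    private static int bestMatch(int[] a, int[] b, int[] perm, int pos) {
        if (pos == perm.length) {
            int agree = 0;
            for (int i = 0; i < a.length; i++)
                if (perm[a[i]] == b[i]) agree++;
            return agree;
        }
        int best = 0;
        for (int j = pos; j < perm.length; j++) {
            swap(perm, pos, j);
            best = Math.max(best, bestMatch(a, b, perm, pos + 1));
            swap(perm, pos, j);
        }
        return best;
    }

    private static void swap(int[] p, int i, int j) { int t = p[i]; p[i] = p[j]; p[j] = t; }

    public static void main(String[] args) {
        // identical clusterings up to label renaming -> MCA = 1.0
        System.out.println(mca(new int[]{0, 0, 1, 1}, new int[]{1, 1, 0, 0}, 2)); // 1.0
        // one of four samples placed inconsistently -> MCA = 0.75
        System.out.println(mca(new int[]{0, 0, 1, 1}, new int[]{0, 1, 1, 1}, 2)); // 0.75
    }
}
```

In practice the permutation maximum is found with an assignment (Hungarian) algorithm rather than brute force, but the O(k!) version shows the definition directly.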
19. Estimation of the expected value of a validation index
• Random label: randomly assign each item to one of k clusters
• Random partition: choose a random partition
• Random prototype: assign each item to its nearest prototype
• Mean value from 100 runs
[Plot: mean MCA index (0.0-1.0) vs. number of clusters (0-50) for the three randomization baselines]
20. Multi-core K-means with Clojure
• Split the data set into smaller pieces that are handled by agents
• Each cluster is represented by an agent
• Add a commutative list of cluster members within a transactional
reference to accelerate the centroid update step
[Diagram: the data set is split across data agents 0..n; each cluster agent 0..k owns a member ref that collects its cluster members - cluster agents read the data, data agents write memberships]
21. [Diagrams: simultaneous read - every cluster agent reads all data agents in parallel; simultaneous write - every data agent writes memberships into the member refs in parallel]
22. read: (nearest-cluster), write: (commute), (assoc)

(defn assignment []
  (map #(send % update-dataagent) DataAgents))

(defn update-dataagent [datapoints]
  (map update-datapoint datapoints))

(defn update-datapoint [datapoint]
  (let [newass (nearest-cluster datapoint)]
    (dosync (commute (nth MemberRefs newass)
                     conj (:data datapoint)))
    (assoc datapoint :assignment newass)))
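The same design idea - parallelize the assignment step over all cores and collect cluster memberships concurrently - can be sketched in Java with parallel streams. This is our analogue for illustration, not the McKmeans code; all names are ours:

```java
import java.util.List;
import java.util.concurrent.ConcurrentMap;
import java.util.stream.Collectors;
import java.util.stream.IntStream;

public class ParallelAssignment {
    // Assignment step of K-means: each data point index is assigned to
    // its nearest centroid; the work is spread over the available cores
    // and the member lists are accumulated in a concurrent map
    // (playing the role of the member refs in the Clojure version).
    public static ConcurrentMap<Integer, List<Integer>> assign(double[] x, double[] centroids) {
        return IntStream.range(0, x.length)
                .parallel()
                .boxed()
                .collect(Collectors.groupingByConcurrent(i -> nearest(x[i], centroids)));
    }

    static int nearest(double value, double[] centroids) {
        int best = 0;
        for (int j = 1; j < centroids.length; j++)
            if (Math.abs(value - centroids[j]) < Math.abs(value - centroids[best])) best = j;
        return best;
    }

    public static void main(String[] args) {
        double[] data = {1.0, 1.5, 2.0, 10.0, 10.5, 11.0};
        ConcurrentMap<Integer, List<Integer>> members = assign(data, new double[]{1.5, 10.5});
        System.out.println(members.get(0)); // indices of points nearest centroid 1.5
        System.out.println(members.get(1)); // indices of points nearest centroid 10.5
    }
}
```

As in the slide's `commute`, the concurrent collector exploits the fact that adding a member to a cluster is order-independent, so writers need not be serialized.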
23. Benchmark results
Large data sets (artificial):
• Each data point is sampled from N(0,1)
• Summary for 10 runs of K-means
• Settings: 10,000 cases x 100 dimensions and 1,000,000 cases x 200 dimensions, 20 clusters each
[Bar charts: runtimes of ParaKMeans, K-means (R), and McKmeans - in seconds for the small data set (up to ~150 s) and in minutes for the large one (up to ~450 min)]
24. • Number of computer cores used / • Number of data agents used
• Setting: 100,000 x 500, 20 clusters
[Plots: runtime (seconds) vs. number of computer cores (1, 4, 8) and vs. number of data agents (4, 6, 8, 10)]
25. Large data sets with cluster structure
• Data sampled from a multivariate normal distribution
• 100,000 samples, 200/500 dimensions, 10/20 clusters
[Bar chart: runtime (seconds, up to ~2000) of K-means (R) and McKmeans for each dimensions/clusters combination (200/10, 200/20, 500/10, 500/20)]
26. Accuracy compared to the known grouping of data
• Measured with the MCA index
• Red bars indicate the random-prototype baseline
• Settings: 100,000 x 200 and 100,000 x 500, each with 10 and 20 clusters
[Bar chart: MCA index (0.0-1.0) of McKmeans and K-means (R) in each setting]
27. Real world data set
• Microarray data (Radiation-induced changes in
human gene expression)
• 22277 samples (genes) and 465 features (profiles)
[Bar chart: runtime (seconds, up to ~350) of K-means (R) and McKmeans for 2, 5, 10, and 20 clusters]
Smirnov D, Morley M, Shin E, Spielman R, Cheung V: Genetic analysis of radiation-induced changes in human gene expression. Nature 2009, 459:587–591
28. Application to Cluster Number Estimation
• Repeated clustering with different subsets of data
• Repeated for different number of clusters k
• Most stable clustering is produced for the ‘real’ cluster number
• Jackknife resampling
• Evaluation with the MCA index
• Data set: 100,000 samples, 100 features, 3 clusters
• 10 runs per cluster number
• 49.26 minutes on a dual quad-core 3.2 GHz machine
[Plot: MCA index (0.0-1.0) vs. number of clusters (2-7)]
29. Java GUI
(import '(javax.swing JFrame JLabel JTextField JButton)
        '(java.awt.event ActionListener)
        '(java.awt GridLayout))

(let [frame (new JFrame "Hello, World!")
      hello-button (new JButton "Say hello")
      hello-label (new JLabel "")]
  (. hello-button
     (addActionListener
       (proxy [ActionListener] []
         (actionPerformed [evt]
           (. hello-label
              (setText "Hello, World!"))))))
  (doto frame
    (.setLayout (new GridLayout 1 1 3 3))
    (.add hello-button)
    (.add hello-label)
    (.setSize 300 80)
    (.setVisible true)))
31. Summary
• Writing parallel programs usually requires careful software design and deep knowledge of thread-safe programming
• Concurrency control via transactional memory circumvents the problems of lock-based concurrency strategies
• Immutable data structures play a key role in software transactional memory
• Clojure combines Lisp, Java, and a powerful STM system
• This enables fast parallelization of algorithms, even for rapid prototyping
• Our simulations show good performance of the parallelized code