Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

MSA using Hadoop

Presented by:
Dr. G. Sudha Sadasivam
Professor, Dept of CSE,
PSG College of Technology,
Coimbatore

Agenda
Sequence alignment
Introduction to Clouds
Approaches for MSA
Approach 1
Approach 2
Results
Other Projects

What is Sequence Alignment?

The procedure of comparing two or more
sequences by searching for a series of individual
characters or character patterns that are in the
same order in the sequences.
Uses
For sequence similarity
Phylogenetic tree analysis
Factors – accuracy and speed

Cloud computing
Provides scalable, on-demand, real-time computing services
Suitability of cloud for Sequence Alignment
On-demand scalability of cloud makes it suitable
for dynamic nature of MSA
Low cost in maintenance of infrastructure for
applications
Data and compute parallelism in clouds through
map-reduce paradigm facilitates energy efficient and
fast MSA.

Types of Sequence Alignment
Pair-wise Alignment
Alignment of two sequences
Global – using the Needleman-Wunsch algorithm.
LGPSSKQTGKGS_SRAWDN
|    |  |||   |  |
LN_ATKSAGKGAIMRLGDA
Local – using the Smith-Waterman algorithm.
_________TGKG__________
          |||
_________AGKG__________
Multiple Sequence Alignment
Alignment of more than two sequences

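Needleman-Wunsch Algorithm
Cases
Case 1: xi aligns to yj
Case 2: xi aligns to gap
Case 3: yj aligns to gap
Initialization
F(0, 0) = 0
F(0, i) = −i * d
F(j, 0) = −j * d
Main Iteration
For each i = 1…M and j = 1…N

$$
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i,y_j), & \text{case 1} \\
F(i-1,j) - d, & \text{case 2} \\
F(i,j-1) - d, & \text{case 3}
\end{cases}
\qquad
s(x_i,y_j) =
\begin{cases}
+1, & \text{match} \\
-1, & \text{mismatch}
\end{cases}
$$

$$
\mathrm{Ptr}(i,j) =
\begin{cases}
\text{DIAG}, & \text{if case 1} \\
\text{UP}, & \text{if case 2} \\
\text{LEFT}, & \text{if case 3}
\end{cases}
$$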
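The initialization, recurrence and pointer rules above can be sketched in Python as follows. This is a minimal illustration, not the presenters' implementation: the function name `needleman_wunsch`, the tie-breaking order (DIAG, then UP, then LEFT) and the convention that rows follow `x` and columns follow `y` are assumptions made here. The defaults d = 1, +1 for a match and −1 for a mismatch are the values given on the slides.

```python
def needleman_wunsch(x, y, d=1, match=1, mismatch=-1):
    """Global alignment of x and y; returns (F, score, aligned_x, aligned_y).

    Assumption for this sketch: rows of F follow x, columns follow y,
    and ties are broken in the order DIAG, UP, LEFT.
    """
    m, n = len(x), len(y)
    F = [[0] * (n + 1) for _ in range(m + 1)]
    ptr = [[None] * (n + 1) for _ in range(m + 1)]

    # Initialization: a prefix aligned entirely against gaps costs d per gap.
    for i in range(1, m + 1):
        F[i][0] = -i * d
        ptr[i][0] = "UP"
    for j in range(1, n + 1):
        F[0][j] = -j * d
        ptr[0][j] = "LEFT"

    # Main iteration: take the best of the three cases for every cell.
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            candidates = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "UP"),        # case 2: x_i aligns to a gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: y_j aligns to a gap
            ]
            F[i][j], ptr[i][j] = max(candidates, key=lambda c: c[0])

    # Traceback: follow the pointers from the bottom-right cell to (0, 0).
    ax, ay = [], []
    i, j = m, n
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1])
            ay.append(y[j - 1])
            i, j = i - 1, j - 1
        elif move == "UP":
            ax.append(x[i - 1])
            ay.append("_")
            i -= 1
        else:
            ax.append("_")
            ay.append(y[j - 1])
            j -= 1
    return F, F[m][n], "".join(reversed(ax)), "".join(reversed(ay))


if __name__ == "__main__":
    F, score, ax, ay = needleman_wunsch("ATA", "AGTA")
    for row in F:
        print(row)
    print(score, ax, ay)
```

Called on the strings "ATA" and "AGTA", the sketch fills a 4 x 5 matrix and returns one optimal alignment; a different tie-breaking order can yield a different alignment with the same score.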

Needleman-Wunsch Algorithm
Worked example with d = 1; s(xi,yj) = +1 for a match, −1 for a mismatch

F(i,j)   i=0    1    2    3    4
                A    G    T    A
j=0        0   -1   -2   -3   -4
1   A     -1    1    0   -1   -2
2   T     -2    0    0    1    0
3   A     -3   -1   -1    0    2

Sample cells
F(1,1) = max{ F(0,0)+s(1,1) = 1, F(0,1)-1 = -2, F(1,0)-1 = -2 } = 1 (case 1)
F(1,2) = max{ F(0,1)+s(1,2) = -2, F(0,2)-1 = -3, F(1,1)-1 = 0 } = 0 (case 3)

Optimal alignment
A_TA
AGTA

Multiple Sequence Alignment
A multiple sequence alignment is a sequence
alignment of three or more biological sequences,
generally protein, DNA, or RNA.

The input is a set of query sequences that are
assumed to have an evolutionary relationship by
which they share a lineage and are descended from
a common ancestor.

From the resulting multiple sequence alignment,
phylogenetic analysis can be conducted to assess
the sequences' shared evolutionary origins.

MSA Approaches

Dynamic programming

Progressive alignment

Iterative approach

MSA methods
Method | Characteristics | Limitations | Complexity / examples
Dynamic programming | Accurate | Computationally complex | O(N^n); exhaustive (n-dimensional matrix)
Progressive | Fast approximation (aligns closest sequences first, heuristics) | Alignment cannot be modified; local maxima; less accurate | ClustalW, MAFFT
Iterative | Probabilistic / stochastic (random) | Slow and less accurate | GA and HMM

N- sequence length; n- number of sequences

MSA in cloud
CloudBurst – RMAP
Does not split sequences to load in cloud
environment
Not for MSA
No automatic scale up/down of clusters
CLUE- proposal from Maryland University
VM cloning – Snowflock with MPIs

Proposed MSA Approach – hadoop data grid
[Diagram: S1, S2 and S3 are the inputs. A first Map/Reduce aligner aligns S1 and S2 into A1S1 and A1S2; two further Map/Reduce aligners then bring in S3, producing A2S1, A2S2 and A1S3.]

1) Identify the different permutations:
S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1
2) Align each permutation in parallel in Map2.
S1 and S2 are aligned to form A1S1 and A1S2.
3) Align the output of the first Map-Reduce with the third
sequence S3 in the Map phase.
A1S1 is aligned with S3.
A1S2 is aligned with S3.
The better of these two is chosen to form
A2S1, A2S2 and A1S3.
4) Steps 2 and 3 are repeated for all the other permutations in Map1.
5) The best combination is chosen by alignment score.
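The permutation search in steps 1 to 5 can be illustrated with a deliberately simplified Python sketch. It is not the presenters' Map/Reduce job: it runs in a single process, and `score_ordering` is a stand-in that scores each ordering from pairwise Needleman-Wunsch scores of the raw sequences instead of re-aligning the gapped outputs (A1S1, A1S2) against S3 as the slides describe. The names `nw_score`, `score_ordering` and `best_ordering` are hypothetical.

```python
from itertools import permutations


def nw_score(x, y, d=1, match=1, mismatch=-1):
    """Needleman-Wunsch global alignment score, keeping only one DP row."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def score_ordering(order):
    """Stand-in score for one permutation (an assumption of this sketch).

    Align the first two sequences, then add each later sequence using the
    better of its pairwise scores against the sequences already placed.
    """
    total = nw_score(order[0], order[1])
    for k in range(2, len(order)):
        total += max(nw_score(earlier, order[k]) for earlier in order[:k])
    return total


def best_ordering(sequences):
    """Score every permutation and return (best permutation, all scores)."""
    # Each permutation is independent, so this loop is the part that a
    # Map phase could evaluate in parallel.
    scored = {order: score_ordering(order) for order in permutations(sequences)}
    best = max(scored, key=scored.get)
    return best, scored


if __name__ == "__main__":
    best, scored = best_ordering(["AGTA", "ATA", "GAT"])
    for order, value in scored.items():
        print(order, value)
    print("best:", best)
```

Because every permutation is scored independently, the dictionary comprehension is the natural place to distribute work; the final `max` corresponds to step 5.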

Varying Number of Sequences of Same Size

[Chart: time in seconds (0 to 100) versus number of sequences (2, 4, 6, 8, 10), for 2 nodes and 3 nodes]

Different Block Sizes

[Chart: time in seconds (0 to 350) versus block size in KB (10, 100, 1000, 6400), for 2 nodes and 3 nodes]

Analysis
‘n’ – Number of Sequences
‘N’ – Average length of a sequence
‘b’ – Average number of blocks in a sequence
‘K’ – Size of 1 block

Complexity measure | Proposed method | Conventional method
Score calculation | O(N) | O(n*N)
Pairwise alignment | O(K^2) | O(N^2)
MSA | O[(n-1) * N^2 / b] | O(N^n)

Proposed MSA Approach on Cloud
Time-efficient approach to sequence alignment
with quality (accuracy) in the cloud

Using the Hadoop framework
Dynamic programming approach - accuracy
Data and compute parallelism in Hadoop - speed
Blocking and scalability of Hadoop
Parallel transfer of sequence splits over the
network to remote clusters
Automated scale up/down of clusters based on
computational needs of the environment.

System Architecture
[Diagram: the client side holds the sequence fragments (AGT....CG); the server side is a virtual Hadoop cluster environment with a head server (VM) and new VMs. Labelled steps: 1. Create virtual environment; 2. Split the sequences; 2. Parallel transmission over Internet; 3. Copy to HDFS; 4. Forking VMs / deleting VMs; 5. Perform alignment; 6. Report the result.]
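The split, parallel transmission and reassembly steps of this architecture can be sketched in Python. This is only an illustration under stated assumptions: `send_fragment` is a hypothetical stand-in that returns its input instead of performing a network transfer or an HDFS copy, and the block size and worker count are arbitrary.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sequence(seq, block_size):
    """Cut a sequence into numbered fragments of at most block_size characters."""
    return [
        (index, seq[start:start + block_size])
        for index, start in enumerate(range(0, len(seq), block_size))
    ]


def send_fragment(fragment):
    """Hypothetical stand-in for sending one fragment to the head server."""
    index, data = fragment
    return index, data


def transfer_in_parallel(seq, block_size, workers=3):
    """Split the sequence and 'send' all fragments using a thread pool."""
    fragments = split_sequence(seq, block_size)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(send_fragment, fragments))


def merge_fragments(received):
    """Reassemble the sequence on the receiving side by fragment number."""
    return "".join(data for _, data in sorted(received))


if __name__ == "__main__":
    sequence = "AGTACGTTAGCCGA"
    received = transfer_in_parallel(sequence, block_size=4)
    print(received)
    print(merge_fragments(received) == sequence)
```

Numbering the fragments keeps reassembly independent of arrival order, which matters once real transfers complete at different times.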

A single Combination –
An illustration

S1 = "AGTA"; S2 = "ATA"; S3 = "GAT"

1. ALIGNMENT OF S1 & S2
        0    1    2    3    4
             A    G    T    A
0       0   -1   -2   -3   -4
1  A   -1    1    0   -1   -2
2  T   -2    0    0    1    0
3  A   -3   -1   -1    0    2
SCORE: 4
A1S1: "AGTA"; A1S2: "A_TA"

2. ALIGNMENT OF A1S1 & S3
        0    1    2    3    4
             A    G    T    A
0       0   -1   -2   -3   -4
1  G   -1   -1    0   -1   -2
2  A   -2    0   -1    1    0
3  T   -3   -1   -1    0   -1
SCORE: -5
A2S1: "AG_TA"; A1S3: "_GAT_"

3. ALIGNMENT OF A1S2 & A1S3
        0    1    2    3    4    5
             A    _    T    A    _
0       0   -1   -2   -3   -4   -5
1  _   -1    0    0   -1   -2   -3
2  G   -2   -1   -1   -1   -2   -2
3  A   -3   -1   -1   -2    0   -1
4  T   -4   -2   -1    0   -1    0
5  _   -5   -3   -1   -1    0    0
SCORE: -3
A2S2: "A__TA_";
A2S3: "_GAT__"

Analysis
‘n’ – Number of Sequences
‘N’ – Average length of a sequence
‘k’ – Average number of blocks in a sequence
‘K’ – Size of 1 block

Complexity measure | Proposed method | Conventional method
Score calculation | O(N) | O(n*N)
Pairwise alignment | O(K^2) | O(N^2)
MSA | O[K^2 * n(n-1)/2] | O(N^n)
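To get a feel for the gap between the two MSA growth terms in this table, the expressions can be evaluated for sample sizes. The sketch below treats the big-O terms as if they were exact operation counts, which they are not (constant factors and lower-order terms are ignored), and the sample values of N, K and n are hypothetical.

```python
def conventional_cost(N, n):
    """Growth term for exhaustive dynamic programming: N^n."""
    return N ** n


def proposed_cost(K, n):
    """Growth term quoted for the proposed method: K^2 * n(n-1)/2."""
    return K ** 2 * n * (n - 1) // 2


if __name__ == "__main__":
    N, K = 1000, 100  # hypothetical sequence length and block size
    for n in (3, 4, 5):
        print(n, conventional_cost(N, n), proposed_cost(K, n))
```

The conventional term grows exponentially in the number of sequences n, while the proposed term grows quadratically in n for a fixed block size K.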

2. Parallelised data transfer
'T' – Time for serial sequence transfer; 'k' –
block size
T/k – Time for sequence transfer in parallel

3. Dynamic cluster creation
Advantage: Computation power of remote cluster
is optimal and not wasted
Disadvantage: Time to set up the cluster

Effect of parallel file transfer
File size (MB) | File transfer (sec) | Split time (sec) | Merge time (sec) | C1 (sec) | T1 (sec) | C2 (sec) | T2 (sec)
100 | 6.23 | 0.02 | 0.03 | 2.13 | 2.18 | 0.73 | 0.78
200 | 9.32 | 0.23 | 0.43 | 2.96 | 3.62 | 1.23 | 1.89
300 | 11.43 | 0.85 | 1.64 | 3.84 | 6.33 | 1.16 | 3.65

C1: Communication time from 3 client VMs to the server without multithreading.
C2: Communication time from 3 client VMs to the server with multithreading.
T1: Total time for file transfer from client to server without multithreading.
T2: Total time for file transfer from client to server with multithreading.
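The figures in this table are consistent with each total being the sum of split time, merge time and communication time. The short Python check below recomputes T1 and T2 that way from the tabulated values. The speedup column assumes that the "File transfer" figure is the time to send the unsplit file, which the slide does not state explicitly.

```python
# (size MB, file transfer, split, merge, C1, T1, C2, T2), all times in seconds,
# copied from the table above.
ROWS = [
    (100, 6.23, 0.02, 0.03, 2.13, 2.18, 0.73, 0.78),
    (200, 9.32, 0.23, 0.43, 2.96, 3.62, 1.23, 1.89),
    (300, 11.43, 0.85, 1.64, 3.84, 6.33, 1.16, 3.65),
]


def total_time(split, merge, comm):
    """Total transfer time as split + merge + communication, to 2 decimals."""
    return round(split + merge + comm, 2)


def summarize(rows):
    """Recompute T1 and T2 and derive a speedup for each file size."""
    summary = []
    for size, whole, split, merge, c1, _t1, c2, t2 in rows:
        summary.append({
            "size_mb": size,
            "t1": total_time(split, merge, c1),
            "t2": total_time(split, merge, c2),
            # Assumption: 'whole' is the transfer time of the unsplit file.
            "speedup_vs_file_transfer": round(whole / t2, 2),
        })
    return summary


if __name__ == "__main__":
    for entry in summarize(ROWS):
        print(entry)
```

The recomputed totals match the T1 and T2 columns, which suggests that split and merge overheads are already included in the reported totals.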

Time to start virtual machines
[Chart: time in seconds (0 to 120) versus number of VMs (1 to 4)]

Parallelised starting of VMs can be done to reduce time

Cluster performance wrt number of VMs
30 KB sequences with 2 KB splits – up to 5 sequences
[Chart: time in seconds (0 to 350) versus number of sequences, for 4 slave VMs and 6 slave VMs]

When the number of sequences is less than 6, a five-node Hadoop cluster is sufficient.

Dynamic scaling up/down of clusters
VMs instantiated based on number of Map-Reduce Tasks
The number of tasks was checked dynamically; new VMs were started and tasks were
reallocated
Old VMs were destroyed if not used
Block size: 10 MB
Static VM creation: based on predicted application load (maps + reduces)
Dynamic VM creation: based on actual application load (maps + reduces)

File size (GB) | Static: time (min-sec) | Static: VMs | Dynamic: time (min-sec) | Dynamic: new VMs added
1 | 5-36 | 2 | 3-16 | 1
2 | 5-52 | 3 | 5-40 | 1
3 | 8-27 | 4 | 5-48 | 2
5 | 12-13 | 5 | 6-39 | 9
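A scaling rule of the kind described here (start VMs when there are more Map-Reduce tasks than capacity, destroy idle ones) could look like the Python sketch below. It is a hypothetical policy, not the one used in the experiments: the function names, the number of task slots per VM and the minimum cluster size are all assumptions.

```python
import math


def vms_needed(pending_tasks, slots_per_vm):
    """Number of VMs required to run all pending map and reduce tasks at once."""
    if pending_tasks <= 0:
        return 0
    return math.ceil(pending_tasks / slots_per_vm)


def scaling_action(running_vms, pending_tasks, slots_per_vm=2, min_vms=1):
    """Decide whether to start VMs, destroy VMs, or leave the cluster as it is.

    slots_per_vm and min_vms are assumed values for this sketch.
    """
    target = max(min_vms, vms_needed(pending_tasks, slots_per_vm))
    if target > running_vms:
        return ("start", target - running_vms)
    if target < running_vms:
        return ("destroy", running_vms - target)
    return ("hold", 0)


if __name__ == "__main__":
    print(scaling_action(running_vms=2, pending_tasks=10))
    print(scaling_action(running_vms=5, pending_tasks=3))
    print(scaling_action(running_vms=1, pending_tasks=0))
```

Re-evaluating such a rule periodically against the actual task count corresponds to the dynamic VM creation described above, whereas evaluating it once against a predicted task count corresponds to the static case.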

Conclusion
1) Proposed MSA improves on the computation time and also
maintains the accuracy.
Parallelism of sequence alignment in three levels.
Hadoop data grids - Data and compute parallelism &
scalability
Dynamic Programming - accuracy.
2) Complexity is reduced from O(N^n) to O[K^2 * n(n-1)/2]
Combining progressive and dynamic approaches.
Blocking in hadoop
3) Enhancements (using clouds for MSA)
Automatic configuration of the cloud environment
based on the computational needs
Efficient upload of data into the HDFS by parallel
transfer of sequence fragments over the Internet.

Other Projects
Enhancement of existing fairshare scheduler in
hadoop
Reliability using Reed Solomon codes
Hybrid scheduler
Motif identification for MSA
CBIR using image signatures
Text categorization
Hybrid PSO (PSO and GA) for job scheduling
Semantic search using hadoop framework.
Others – Globus and GridSim

Acknowledgement

The Research has been carried out as a result of PSG-Yahoo
Research programme on Grid and Cloud computing.
Sincere Thanks to
1) Dr R Rudramoorthy, Principal,
PSG College of Technology, Coimbatore.
2) Mr K V Chidambaran,
Director, Grid and Cloud Systems Group,
Yahoo, Bangalore
