MSA using Hadoop


        Presented by:
   Dr. G.Sudha Sadasivam
   Professor, Dept of CSE,
  PSG College of Technology,
         Coimbatore
Agenda
Sequence alignment
Introduction to Clouds
Approaches for MSA
Approach 1
Approach 2
Results
Other Projects
What is Sequence Alignment?

The procedure of comparing two or more
sequences by searching for a series of individual
characters or character patterns that are in the
same order in the sequences.
  Uses
     For sequence similarity
     Phylogenetic tree analysis
  Factors – accuracy and speed
Cloud computing
Provides scalable, on-demand, RT computing services
Suitability of cloud for Sequence Alignment
  On-demand scalability of cloud makes it suitable
  for dynamic nature of MSA
  Low cost in maintenance of infrastructure for
  applications
  Data and compute parallelism in clouds through
  map-reduce paradigm facilitates energy efficient and
  fast MSA.
Types of Sequence Alignment
Pair-wise Alignment
  Alignment of two sequences
     Global – using the Needleman-Wunsch algorithm.
   LGPSSKQTGKGS_SRAWDN
   |    |  |||   |  |
   LN_ATKSAGKGAIMRLGDA
     Local – using the Smith-Waterman algorithm.
   _________TGKG__________
             |||
   _________AGKG__________
Multiple Sequence Alignment
 Alignment of more than two sequences
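
To make the local case concrete, here is a minimal sketch of Smith-Waterman scoring. The scoring values (+1 match, -1 mismatch, gap penalty 1) are an assumption borrowed from the Needleman-Wunsch slides; the deck does not state which values were used for local alignment, and the function name is illustrative.

```python
def smith_waterman_score(x, y, match=1, mismatch=-1, gap=1):
    """Best local alignment score between x and y (linear gap penalty)."""
    best = 0
    prev = [0] * (len(y) + 1)  # row i-1; borders are 0 for local alignment
    for i in range(1, len(x) + 1):
        cur = [0]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            # Unlike global alignment, a cell never drops below 0,
            # so an alignment can restart anywhere.
            cell = max(0, prev[j - 1] + s, prev[j] - gap, cur[j - 1] - gap)
            cur.append(cell)
            best = max(best, cell)
        prev = cur
    return best


print(smith_waterman_score("TGKG", "AGKG"))  # 3: the shared run GKG
```

Only the best-scoring region counts, which is why the flanking residues in the example above do not lower the score.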
Needleman Wunsch Algorithm
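   Initialization
    F(0, 0) = 0
    F(i, 0) = −i * d
    F(0, j) = −j * d
   Main Iteration
    For each i = 1…M and j = 1…N

$$
F(i,j) = \max
\begin{cases}
F(i-1,j-1) + s(x_i,y_j) & \text{case 1: } x_i \text{ aligns to } y_j \\
F(i-1,j) - d & \text{case 2: } x_i \text{ aligns to gap} \\
F(i,j-1) - d & \text{case 3: } y_j \text{ aligns to gap}
\end{cases}
\qquad
s(x_i,y_j) =
\begin{cases}
+1 & \text{match} \\
-1 & \text{mismatch}
\end{cases}
$$

$$
\mathrm{Ptr}(i,j) =
\begin{cases}
\text{DIAG} & \text{if case 1} \\
\text{UP} & \text{if case 2} \\
\text{LEFT} & \text{if case 3}
\end{cases}
$$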
Needleman Wunsch Algorithm
 Optimal alignment (d = 1; s = +1 for a match, -1 for a mismatch)
    A_TA
    AGTA

 Sample cell computations
    F(1,1) = max{ F(0,0)+s(x1,y1) = 1,  F(0,1)-1 = -2,  F(1,0)-1 = -2 } = 1  (case 1)
    F(1,2) = max{ F(0,1)+s(x1,y2) = -2, F(0,2)-1 = -3,  F(1,1)-1 = 0 }  = 0  (case 3)

 F(i,j)        i=0     1     2     3     4
                       A     G     T     A
 j=0            0     -1    -2    -3    -4
   1     A     -1      1     0    -1    -2
   2     T     -2      0     0     1     0
   3     A     -3     -1    -1     0     2
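
The recurrence and pointer table can be sketched in Python as follows. This is a minimal illustration rather than the presenter's implementation; the function name and the tie-breaking order (DIAG before UP before LEFT) are assumptions.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Global alignment of x and y; returns (score, aligned_x, aligned_y)."""
    M, N = len(x), len(y)
    F = [[0] * (N + 1) for _ in range(M + 1)]
    ptr = [[None] * (N + 1) for _ in range(M + 1)]

    # Initialization: aligning a prefix against nothing costs one gap per symbol.
    for i in range(1, M + 1):
        F[i][0], ptr[i][0] = -i * d, "UP"
    for j in range(1, N + 1):
        F[0][j], ptr[0][j] = -j * d, "LEFT"

    # Main iteration over the three cases of the recurrence.
    for i in range(1, M + 1):
        for j in range(1, N + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            options = [
                (F[i - 1][j - 1] + s, "DIAG"),  # case 1: x_i aligns to y_j
                (F[i - 1][j] - d, "UP"),        # case 2: x_i aligns to gap
                (F[i][j - 1] - d, "LEFT"),      # case 3: y_j aligns to gap
            ]
            F[i][j], ptr[i][j] = max(options, key=lambda o: o[0])

    # Traceback from the bottom-right cell, following the pointers.
    ax, ay = [], []
    i, j = M, N
    while i > 0 or j > 0:
        move = ptr[i][j]
        if move == "DIAG":
            ax.append(x[i - 1]); ay.append(y[j - 1]); i -= 1; j -= 1
        elif move == "UP":
            ax.append(x[i - 1]); ay.append("_"); i -= 1
        else:
            ax.append("_"); ay.append(y[j - 1]); j -= 1
    return F[M][N], "".join(reversed(ax)), "".join(reversed(ay))


print(needleman_wunsch("AGTA", "ATA"))  # (2, 'AGTA', 'A_TA')
```

With x = AGTA and y = ATA the final cell holds 2, matching the bottom-right entry of the matrix, and the traceback recovers the A_TA / AGTA alignment.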
Multiple Sequence Alignment
A multiple sequence alignment is a sequence
alignment of three or more biological sequences,
generally protein, DNA, or RNA.

The input is a set of query sequences that are
assumed to have an evolutionary relationship by
which they share a lineage and are descended from
a common ancestor.

From the resulting multiple sequence alignment,
phylogenetic analysis can be conducted to assess
the sequences' shared evolutionary origins.
MSA Approaches



Dynamic programming

Progressive alignment

Iterative approach
MSA methods
Approach                      Nature            Drawbacks                  Complexity / examples
Dynamic programming           Accurate          Computationally complex    O(N^n), exhaustive
(n-dim matrix)
Progressive approximation     Fast              Alignment cannot be        ClustalW, MAFFT
(aligns closest seq first,                      modified; local maxima;
heuristics)                                     less accurate
Iterative                     Probabilistic /   Slow and less accurate     GA and HMM
                              stochastic
                              (random)

     N - sequence length; n - number of sequences
MSA in cloud
CloudBurst – RMAP
  Does not split sequences to load in cloud
  environment
  Not for MSA
  No automatic scale up/down of clusters
CLUE- proposal from Maryland University
VM cloning – Snowflock with MPIs
Proposed MSA Approach – hadoop data grid
       S1            S2                 S3
        \            /                   |
      Map/Reduce aligner                 |
        /            \                   |
     A1S1           A1S2                 |
        \            |                  /
            Map/Reduce aligner
        /            |            \
     A2S1          A2S2          A1S3
1) Identify the different permutations
S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1
2) Perform alignment of each permutation in parallel in Map2
   S1 and S2 are aligned to form A1S1 and A1S2
3) Align the output of the first Map-Reduce with the third
   sequence S3 in the Map phase.
        A1S1 is aligned with S3
        A1S2 is aligned with S3
         The best of these two is chosen to form
              A2S1, A2S2 and A1S3.
4) Steps 2 and 3 are repeated for all the other permutations in Map1
5) The best possible combination is chosen (by alignment score)
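
The outer map over permutations and the final choice of the best one can be sketched as below. This is a simplified stand-in, not the deck's method: `order_score` (a hypothetical helper) scores an ordering as the sum of pairwise Needleman-Wunsch scores of neighbouring sequences, whereas the approach in the slides re-aligns intermediate alignments. The sequences are the ones used in the later illustration slide.

```python
from itertools import permutations

SEQS = {"S1": "AGTA", "S2": "ATA", "S3": "GAT"}


def nw_score(x, y, match=1, mismatch=-1, d=1):
    """Needleman-Wunsch global alignment score, keeping only two rows."""
    prev = [-j * d for j in range(len(y) + 1)]
    for i in range(1, len(x) + 1):
        cur = [-i * d]
        for j in range(1, len(y) + 1):
            s = match if x[i - 1] == y[j - 1] else mismatch
            cur.append(max(prev[j - 1] + s, prev[j] - d, cur[j - 1] - d))
        prev = cur
    return prev[-1]


def order_score(order):
    """Assumed scoring: sum of pairwise scores of neighbours in the order."""
    strings = [SEQS[name] for name in order]
    return sum(nw_score(a, b) for a, b in zip(strings, strings[1:]))


def best_order(names):
    # "Map": score every permutation independently (parallelisable).
    scored = map(lambda order: (order_score(order), order), permutations(names))
    # "Reduce": keep the highest-scoring permutation.
    return max(scored, key=lambda pair: pair[0])


for order in permutations(SEQS):
    print(order, order_score(order))
print("best:", best_order(list(SEQS)))
```

Because each permutation is scored independently, the map step has no shared state, which is the property that lets the permutations be distributed across map tasks.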
Varying Number of Sequences of Same Size

[Figure: time in seconds (0 to 100) against number of sequences (2 to 10), for 2 nodes and 3 nodes]
Different Block Sizes



[Figure: time in seconds (0 to 350) against block size in KB (10, 100, 1000, 6400), for 2 nodes and 3 nodes]
Analysis
‘n’ – Number of sequences
‘N’ – Average length of a sequence
‘b’ – Average number of blocks in a sequence
‘K’ – Size of one block

Complexity measure    Proposed method        Conventional method
Score calculation     O(N)                   O(n*N)
Pairwise alignment    O(K^2)                 O(N^2)
MSA                   O[(n-1) * N^2 / b]     O(N^n)
Proposed MSA Approach on Cloud
Time-efficient approach to sequence alignment
  with quality (accuracy) in the cloud

    Using the Hadoop framework
      Dynamic programming approach: accuracy
      Data and compute parallelism in Hadoop: speed
      Blocking and scalability of Hadoop
    Parallel transfer of sequence splits over the
    network to remote clusters
    Automated scale up/down of clusters based on
    the computational needs of the environment.
System Architecture
 CLIENT SIDE VIRTUAL ENVIRONMENT
   1. Create virtual environment
   2. Split the sequences into fragments (AGT….CG, AGT….CG, …) and
      transmit them in parallel over the Internet

 SERVER SIDE HADOOP CLUSTER (head server VM plus new VMs)
   3. Copy to HDFS
   4. Forking VMs / deleting VMs
   5. Perform alignment
   6. Report the result
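
The client-side split and parallel transmission steps can be sketched as follows. This is a sketch under stated assumptions: `send_block` is a hypothetical stand-in for the real network upload (it only echoes the block back), and threads stand in for the parallel connections.

```python
from concurrent.futures import ThreadPoolExecutor


def split_sequence(seq, block_size):
    """Cut a sequence into consecutive fragments of at most block_size symbols."""
    return [seq[i:i + block_size] for i in range(0, len(seq), block_size)]


def send_block(indexed_block):
    """Hypothetical upload: a real client would transmit the block to the head server."""
    index, block = indexed_block
    return index, block


def parallel_transfer(seq, block_size, workers=3):
    blocks = list(enumerate(split_sequence(seq, block_size)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        received = list(pool.map(send_block, blocks))
    # The receiving side reassembles the fragments in index order.
    return "".join(block for _, block in sorted(received))


print(split_sequence("AGTACGT", 3))     # ['AGT', 'ACG', 'T']
print(parallel_transfer("AGTACGT", 3))  # AGTACGT
```

Tagging every fragment with its index is what allows fragments to arrive in any order and still be merged correctly before the copy to HDFS.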
A single Combination –
     An illustration
S1 = “AGTA”; S2 = “ATA”; S3 = “GAT”

1. ALIGNMENT OF S1 & S2
          0    1    2    3    4
               A    G    T    A
0         0   -1   -2   -3   -4
1     A  -1    1    0   -1   -2
2     T  -2    0    0    1    0
3     A  -3   -1   -1    0    2
      SCORE: 4
      A1S1: “AGTA”; A1S2: “A_TA”

2. ALIGNMENT OF A1S1 & S3
          0    1    2    3    4
               A    G    T    A
0         0   -1   -2   -3   -4
1     G  -1   -1    0   -1   -2
2     A  -2    0   -1   -1    0
3     T  -3   -1   -1    0   -1
      SCORE: -5
      A2S1: “AG_TA”; A1S3: “_GAT_”

3. ALIGNMENT OF A1S2 & A1S3
          0    1    2    3    4    5
               A    _    T    A    _
0         0    -1   -2   -3   -4   -5
1     _   -1   0    0    -1   -2   -3
2     G   -2   -1   -1   -1   -2   -2
3     A   -3   -1   -1   -2   0    -1
4     T   -4   -2   -1   0    -1   0
5     _   -5   -3   -1   -1   0    0
      SCORE: -3
      A2S2: “A__TA_”;
      A2S3: “_GAT__”
Analysis
‘n’ – Number of sequences
‘N’ – Average length of a sequence
‘k’ – Average number of blocks in a sequence
‘K’ – Size of one block

Complexity measure    Proposed method          Conventional method
Score calculation     O(N)                     O(n*N)
Pairwise alignment    O(K^2)                   O(N^2)
MSA                   O[K^2 * n(n-1)/2]        O(N^n)
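
To get a feel for the gap between the two MSA expressions, the sketch below plugs numbers into them. The values of N, n and K are illustrative assumptions, not figures from the slides, and Big-O expressions hide constant factors, so the results indicate orders of magnitude only.

```python
def conventional_cells(N, n):
    """Cells in the n-dimensional dynamic programming matrix: N^n."""
    return N ** n


def proposed_cells(K, n):
    """K^2 cells per block alignment, for n(n-1)/2 sequence pairs."""
    return K * K * n * (n - 1) // 2


N, n, K = 1000, 5, 100  # assumed example sizes
print(conventional_cells(N, n))  # 1000000000000000
print(proposed_cells(K, n))      # 100000
```

The conventional count grows exponentially with the number of sequences n, while the proposed count grows only quadratically in n.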
2. Parallelised data transfer
‘T’ – Time for sequence transfer serially & ‘k’ –
block size
T/k – Time for sequence transfer in parallel

3. Dynamic cluster creation
Advantage: Computation power of remote cluster
is optimal and not wasted
Disadvantage: Time to set up the cluster
Effect of parallel file transfer
File size  File transfer  Split time  Merge time  C1     T1     C2     T2
(MB)       (sec)          (sec)       (sec)       (sec)  (sec)  (sec)  (sec)
100        6.23           0.02        0.03        2.13   2.18   0.73   0.78
200        9.32           0.23        0.43        2.96   3.62   1.23   1.89
300        11.43          0.85        1.64        3.84   6.33   1.16   3.65

C1: Communication time from 3 client VMs to the server without multithreading.
C2: Communication time from 3 client VMs to the server with multithreading.
T1: Total time for file transfer from client to server without multithreading.
T2: Total time for file transfer from client to server with multithreading.
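
The figures in the table are consistent with each total being split time + merge time + communication time. The slides do not state that relation explicitly, so treat it as an inference from the numbers. The sketch below checks it and derives the speedup T1/T2 of the multithreaded transfer.

```python
# (file size MB, split, merge, C1, T1, C2, T2), all times in seconds, from the table
ROWS = [
    (100, 0.02, 0.03, 2.13, 2.18, 0.73, 0.78),
    (200, 0.23, 0.43, 2.96, 3.62, 1.23, 1.89),
    (300, 0.85, 1.64, 3.84, 6.33, 1.16, 3.65),
]


def total_time(split, merge, comm):
    """Assumed relation: total = split + merge + communication."""
    return split + merge + comm


def speedup(t1, t2):
    return t1 / t2


for size, split, merge, c1, t1, c2, t2 in ROWS:
    # Both totals match the assumed relation to within rounding.
    assert abs(total_time(split, merge, c1) - t1) < 0.005
    assert abs(total_time(split, merge, c2) - t2) < 0.005
    print(size, "MB speedup:", round(speedup(t1, t2), 2))
```

The speedup falls from about 2.79 at 100 MB to about 1.73 at 300 MB, because the split and merge times grow with file size and are not reduced by multithreading.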
Time to start virtual machines
[Figure: time in seconds (0 to 120) against number of VMs (1 to 4)]


Parallelised starting of VMs can be done to reduce time
Cluster performance wrt number of VMs
   30 KB sequences with 2 KB splits – up to 5 sequences
[Figure: time in seconds (0 to 350) against number of sequences, for 4 slave VMs and 6 slave VMs]

When the number of sequences is less than 6, a five-node Hadoop cluster is sufficient.
Dynamic scaling up/down of clusters
VMs are instantiated based on the number of Map-Reduce tasks
The number of tasks was checked dynamically; new VMs were started and tasks were
reallocated
Old VMs were destroyed if not used

Block size: 10 MB

File size   Static VM creation based on      Dynamic VM creation based on
(GB)        predicted application load       actual application load
            (maps + reduces)                 (maps + reduces)
            Time (min-sec)     VMs           Time (min-sec)     New VMs added
1           5-36               2             3-16               1
2           5-52               3             5-40               1
3           8-27               4             5-48               2
5           12-13              5             6-39               9
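
A scaling decision of this kind can be sketched as below. Everything here is a hypothetical illustration: the slides do not describe the actual policy, and the function names, the `slots_per_vm` capacity and the `min_vms` floor are assumptions.

```python
import math


def vms_needed(pending_tasks, slots_per_vm):
    """VMs required to run all pending map and reduce tasks at once."""
    return math.ceil(pending_tasks / slots_per_vm) if pending_tasks > 0 else 0


def scaling_action(running_vms, pending_tasks, slots_per_vm=2, min_vms=1):
    """Positive result: start that many VMs. Negative: destroy that many idle VMs."""
    target = max(min_vms, vms_needed(pending_tasks, slots_per_vm))
    return target - running_vms


print(scaling_action(running_vms=2, pending_tasks=10))  # 3: start three VMs
print(scaling_action(running_vms=4, pending_tasks=2))   # -3: destroy three VMs
```

Re-evaluating this on each check of the task count gives the behaviour described above: VMs are added while work is queued and removed once they sit idle.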
Conclusion
1) Proposed MSA improves on the computation time and also
   maintains the accuracy.
        Parallelism of sequence alignment in three levels.
        Hadoop data grids - Data and compute parallelism &
        scalability
        Dynamic Programming - accuracy.
2) Complexity is reduced from O(N^n) to O[K^2 * (n * (n-1)/2)]
        Combining progressive and dynamic approaches.
        Blocking in hadoop
3) Enhancements (using clouds for MSA)
        Automatic configuration of the cloud environment
        based on the computational needs
        Efficient upload of data into the HDFS by parallel
        transfer of sequence fragments over the Internet.
Other Projects
 Enhancement of existing fairshare scheduler in
 hadoop
 Reliability using Reed Solomon codes
 Hybrid scheduler
 Motif identification for MSA
 CBIR using image signatures
 Text categorization
 Hybrid PSO (PSO and GA) for job scheduling
 Semantic search using hadoop framework.
 Others – Globus and GridSim
Acknowledgement

The Research has been carried out as a result of PSG-Yahoo
Research programme on Grid and Cloud computing.
Sincere Thanks to
1) Dr R Rudramoorthy, Principal,
PSG College of Technology, Coimbatore.
2) Mr K V Chidambaran,
Director, Grid and Cloud Systems Group,
Yahoo, Bangalore
THANK YOU




QUESTIONS?

 
CICD at Oath using Screwdriver
CICD at Oath using ScrewdriverCICD at Oath using Screwdriver
CICD at Oath using Screwdriver
 
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathBig Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath
 
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuHow @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu
 
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolThe Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool
 
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...
 
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...
 
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathHDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath
 
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...
 
Moving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, OathMoving the Oath Grid to Docker, Eric Badger, Oath
Moving the Oath Grid to Docker, Eric Badger, Oath
 
Architecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI ApplicationsArchitecting Petabyte Scale AI Applications
Architecting Petabyte Scale AI Applications
 
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...
 
Jun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step BeyondJun 2017 HUG: YARN Scheduling – A Step Beyond
Jun 2017 HUG: YARN Scheduling – A Step Beyond
 
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies
 
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...
 
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache ApexFebruary 2017 HUG: Exactly-once end-to-end processing with Apache Apex
February 2017 HUG: Exactly-once end-to-end processing with Apache Apex
 
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsFebruary 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics
 

Hadoop Summit 2010 Multiple Sequence Alignment Using Hadoop

  • 1. MSA using Hadoop Presented by: Dr. G.Sudha Sadasivam Professor, Dept of CSE, PSG College of Technology, Coimbatore
  • 2. Agenda Sequence alignment Introduction to Clouds Approaches for MSA Approach 1 Approach 2 Results Other Projects
  • 3. What is Sequence Alignment? The procedure of comparing two or more sequences by searching for a series of individual characters or character patterns that are in the same order in the sequences. Uses For sequence similarity Phylogenetic tree analysis Factors – accuracy and speed
  • 4. Cloud computing Provides scalable, on-demand, RT computing services Suitability of cloud for Sequence Alignment On-demand scalability of cloud makes it suitable for dynamic nature of MSA Low cost in maintenance of infrastructure for applications Data and compute parallelism in clouds through map-reduce paradigm facilitates energy efficient and fast MSA.
  • 5. Types of Sequence Alignment
    Pair-wise Alignment: alignment of two sequences
      Global, using the Needleman Wunsch algorithm:
        LGPSSKQTGKGS_SRAWDN
        |    |  | | | |    |
        LN_ATKSAGKGAIMRL GDA
      Local, using the Smith Waterman algorithm:
        _________TGKG__________
                  | | |
        _________AGKG__________
    Multiple Sequence Alignment: alignment of more than two sequences
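  • 6. Needleman Wunsch Algorithm
    Initialization:
    \[ F(0,0) = 0, \qquad F(i,0) = -i \cdot d, \qquad F(0,j) = -j \cdot d \]
    Main iteration, for each i = 1…M and j = 1…N:
    \[ F(i,j) = \max \begin{cases} F(i-1,j-1) + s(x_i,y_j) & \text{case 1: } x_i \text{ aligns to } y_j \\ F(i-1,j) - d & \text{case 2: } x_i \text{ aligns to gap} \\ F(i,j-1) - d & \text{case 3: } y_j \text{ aligns to gap} \end{cases} \]
    \[ s(x_i,y_j) = \begin{cases} +1 & \text{match} \\ -1 & \text{mismatch} \end{cases} \qquad \mathrm{Ptr}(i,j) = \begin{cases} \text{DIAG} & \text{if case 1} \\ \text{UP} & \text{if case 2} \\ \text{LEFT} & \text{if case 3} \end{cases} \]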
  • 7. Needleman Wunsch Algorithm (worked example, d = 1; s = +1 for a match, -1 for a mismatch)
    Score matrix F(i,j) for AGTA (columns, i = 0…4) against ATA (rows, j = 0…3):

              i=0    1    2    3    4
                     A    G    T    A
      j=0       0   -1   -2   -3   -4
      1   A    -1    1    0   -1   -2
      2   T    -2    0    0    1    0
      3   A    -3   -1   -1    0    2

    F(1,1) = max{ f(0,0)+s(1,1) = 1, f(0,1)-1 = -2, f(1,0)-1 = -2 } = 1 (case 1)
    Next cell: max{ f(0,1)+s(1,2) = -2, f(0,2)-1 = -3, f(1,1)-1 = 0 } = 0 (case 3)
    Optimal alignment:
      A_TA
      AGTA
    Legend: F(0, 0) = 0; F(0, i) = -i * d; F(j, 0) = -j * d; PTR = DIAG if case 1, UP if case 2, LEFT if case 3.
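    The worked example above can be reproduced with a short Python sketch. This is an illustrative implementation, not the presenter's code: the function name `needleman_wunsch` and its parameters are assumptions, and the matrix is stored with rows following the second sequence so that it lines up with the table on the slide. In this layout UP consumes a character of the row sequence and LEFT a character of the column sequence.

```python
def needleman_wunsch(x, y, match=1, mismatch=-1, d=1):
    """Fill the score matrix and trace back one optimal global alignment.

    Columns follow x and rows follow y, as in the slide's table.
    """
    rows, cols = len(y) + 1, len(x) + 1
    F = [[0] * cols for _ in range(rows)]
    ptr = [[None] * cols for _ in range(rows)]

    # Initialization: the first row and column hold cumulative gap penalties.
    for c in range(1, cols):
        F[0][c], ptr[0][c] = -c * d, "LEFT"
    for r in range(1, rows):
        F[r][0], ptr[r][0] = -r * d, "UP"

    # Main iteration: take the best of the three cases for every cell.
    for r in range(1, rows):
        for c in range(1, cols):
            s = match if x[c - 1] == y[r - 1] else mismatch
            candidates = [
                (F[r - 1][c - 1] + s, "DIAG"),  # both characters aligned
                (F[r - 1][c] - d, "UP"),        # y character against a gap
                (F[r][c - 1] - d, "LEFT"),      # x character against a gap
            ]
            F[r][c], ptr[r][c] = max(candidates, key=lambda t: t[0])

    # Traceback from the bottom-right corner to the origin.
    ax, ay = [], []
    r, c = rows - 1, cols - 1
    while r > 0 or c > 0:
        move = ptr[r][c]
        if move == "DIAG":
            ax.append(x[c - 1]); ay.append(y[r - 1]); r -= 1; c -= 1
        elif move == "UP":
            ax.append("_"); ay.append(y[r - 1]); r -= 1
        else:
            ax.append(x[c - 1]); ay.append("_"); c -= 1
    return F, "".join(reversed(ax)), "".join(reversed(ay))


F, aligned_x, aligned_y = needleman_wunsch("AGTA", "ATA")
```

    For the slide's inputs the sketch returns the same matrix, with aligned_x = AGTA and aligned_y = A_TA.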
  • 8. Multiple Sequence Alignment A multiple sequence alignment is a sequence alignment of three or more biological sequences, generally protein, DNA, or RNA. The input is a set of query sequences that are assumed to have an evolutionary relationship by which they share a lineage and are descended from a common ancestor. From the resulting multiple sequence alignment, phylogenetic analysis can be conducted to assess the sequences' shared evolutionary origins.
  • 9. MSA Approaches Dynamic programming Progressive alignment Iterative approach
  • 10. MSA methods
    Dynamic Programming: accurate; exhaustive (n-dim matrix); computationally complex, O(N^n)
    Progressive Alignment: fast approximation (aligns closest seq first, heuristics); cannot be modified, local maxima; less accurate; examples: ClustalW, MAFFT
    Iterative: probabilistic / stochastic (random); slow and less accurate; examples: GA and HMM
    N: sequence length; n: number of sequences
  • 11. MSA in cloud
    CloudBurst – RMAP: does not split sequences to load in cloud environment; not for MSA; no automatic scale up/down of clusters
    CLUE: proposal from Maryland University
    VM cloning: Snowflock with MPIs
  • 12. Proposed MSA Approach – hadoop data grid [Diagram: S1 and S2 feed a Map/Reduce aligner that outputs A1S1 and A1S2; these outputs and S3 feed two further Map/Reduce aligners that output A2S1, A2S2 and A1S3]
  • 13. Steps of the proposed approach
    1) Identify the different permutations: S1,S2,S3; S1,S3,S2; S2,S1,S3; S2,S3,S1; S3,S1,S2; S3,S2,S1
    2) Perform the alignment of each permutation in parallel in Map2. S1 and S2 are aligned to form A1S1 and A1S2.
    3) Align the output of the first Map-Reduce with the third sequence S3 in the Map phase. A1S1 is aligned with S3, and A1S2 is aligned with S3. The best among these two is chosen to form A2S1, A2S2 and A1S3.
    4) Steps 2 and 3 are repeated for all the other permutations in Map1.
    5) The best possible combination is chosen by alignment score.
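    The permutation step can be sketched in Python as follows. This is a simplified, serial illustration rather than the Map-Reduce implementation described on the slide: `nw_score`, `ordering_score` and `best_ordering` are hypothetical names, and `ordering_score` is a stand-in that sums the Needleman Wunsch scores of consecutive pairs instead of realigning the intermediate outputs A1S1 and A1S2 against S3.

```python
from itertools import permutations


def nw_score(x, y, match=1, mismatch=-1, d=1):
    """Global alignment score, keeping only two rows of the score matrix."""
    prev = [-c * d for c in range(len(x) + 1)]
    for r in range(1, len(y) + 1):
        cur = [-r * d]
        for c in range(1, len(x) + 1):
            s = match if x[c - 1] == y[r - 1] else mismatch
            cur.append(max(prev[c - 1] + s, prev[c] - d, cur[c - 1] - d))
        prev = cur
    return prev[-1]


def ordering_score(order):
    # Assumption: score an ordering by its consecutive pairwise alignments.
    return sum(nw_score(a, b) for a, b in zip(order, order[1:]))


def best_ordering(seqs):
    """Enumerate every ordering (step 1) and keep the best-scoring one (step 5)."""
    orders = list(permutations(seqs))
    return max(orders, key=ordering_score), len(orders)


best, count = best_ordering(["AGTA", "ATA", "GAT"])
```

    With three sequences the enumeration yields 3! = 6 orderings, matching the list in step 1. In a Map-Reduce setting each ordering would be scored by a separate map task rather than in a loop.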
  • 14. Varying Number of Sequences of Same Size [Chart: time in seconds (0 to 100) against number of sequences (2, 4, 6, 8, 10); series: 2 nodes, 3 nodes]
  • 15. Different Block Sizes [Chart: time in seconds (0 to 350) against block size in KB (10, 100, 1000, 6400); series: 2 nodes, 3 nodes]
  • 16. Analysis
    ‘n’: number of sequences; ‘N’: average length of a sequence; ‘b’: average number of blocks in a sequence; ‘K’: size of 1 block

    Complexity measure    Proposed method         Conventional method
    Score calculation     O(N)                    O(n*N)
    Pairwise alignment    O(K^2)                  O(N^2)
    MSA                   O[(n-1) * N^2 / b]      O(N^n)
  • 17. Proposed MSA Approach on Cloud
    A time-efficient approach to sequence alignment that preserves quality (accuracy) in the cloud, using the hadoop framework:
      Dynamic approach: accuracy
      Data and compute parallelism in hadoop: speed
      Blocking and scalability of hadoop
      Parallel transfer of sequence splits over the network to remote clusters
      Automated scale up/down of clusters based on the computational needs of the environment
  • 18. System Architecture [Diagram: the client side and the server side, which hosts a virtual hadoop cluster environment with a head server (VM) and new VMs holding sequence fragments (AGT….CG). Labelled steps: 1. Create virtual environment; 2. Split the sequences; 2. Parallel transmission over Internet; 3. Copy to HDFS; 4. Forking VMs / deleting VMs; 5. Perform alignment; 6. Report the result]
  • 19. A single Combination – An illustration
  • 20. S1 = “AGTA”; S2 = “ATA”; S3 = “GAT”
    1. Alignment of S1 and S2

              0    1    2    3    4
                   A    G    T    A
      0       0   -1   -2   -3   -4
      1  A   -1    1    0   -1   -2
      2  T   -2    0    0    1    0
      3  A   -3   -1   -1    0    2

      SCORE: 4
      A1S1: “AGTA”; A1S2: “A_TA”

    2. Alignment of A1S1 and S3

              0    1    2    3    4
                   A    G    T    A
      0       0   -1   -2   -3   -4
      1  G   -1   -1    0   -1   -2
      2  A   -2    0   -1    1    0
      3  T   -3   -1   -1    0   -1

      SCORE: -5
      A2S1: “AG_TA”; A1S3: “_GAT_”
  • 21. 3. Alignment of A1S2 and A1S3

              0    1    2    3    4    5
                   A    _    T    A    _
      0       0   -1   -2   -3   -4   -5
      1  _   -1    0    0   -1   -2   -3
      2  G   -2   -1   -1   -1   -2   -2
      3  A   -3   -1   -1   -2    0   -1
      4  T   -4   -2   -1    0   -1    0
      5  _   -5   -3   -1   -1    0    0

      SCORE: -3
      A2S2: “A _ _TA_”; A2S3: “ _GAT_ _”
  • 22. Analysis
    ‘n’: number of sequences; ‘N’: average length of a sequence; ‘k’: average number of blocks in a sequence; ‘K’: size of 1 block

    Complexity measure    Proposed method             Conventional method
    Score calculation     O(N)                        O(n*N)
    Pairwise alignment    O(K^2)                      O(N^2)
    MSA                   O[K^2 * n(n-1)/2]           O(N^n)
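    To give a feel for the gap between the two MSA bounds in the table, the sketch below simply evaluates both expressions for one set of inputs. The function names and the sample values (N = 1000, K = 100, n = 3) are illustrative assumptions, and since big-O notation hides constant factors the numbers indicate growth only, not measured running time.

```python
def conventional_cells(N, n):
    """Evaluate N^n, the bound for exhaustive n-dimensional dynamic programming."""
    return N ** n


def proposed_cells(K, n):
    """Evaluate K^2 * n(n-1)/2, the bound stated for the proposed method."""
    return K ** 2 * n * (n - 1) // 2


# Hypothetical inputs: three sequences of length 1000, blocks of size 100.
conventional = conventional_cells(1000, 3)
proposed = proposed_cells(100, 3)
```

    For these inputs the conventional bound evaluates to 1,000,000,000 and the proposed bound to 30,000.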
  • 23. 2. Parallelised data transfer: ‘T’ is the time to transfer a sequence serially and ‘k’ is the block size; T/k is the time to transfer the sequence in parallel. 3. Dynamic cluster creation. Advantage: the computation power of the remote cluster is optimal and not wasted. Disadvantage: the time needed to set up the cluster.
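    The split, parallel send, and merge pattern behind the parallelised transfer can be sketched as below. This is a local illustration only: `send_block` is a hypothetical stand-in that echoes each block back instead of sending it over a network, and the function names, block size and worker count are assumptions rather than details from the slides.

```python
from concurrent.futures import ThreadPoolExecutor


def split_blocks(sequence, block_size):
    """Cut a sequence into consecutive blocks of at most block_size characters."""
    return [sequence[i:i + block_size] for i in range(0, len(sequence), block_size)]


def send_block(indexed_block):
    # Hypothetical stand-in for a network send: returns the block unchanged.
    index, block = indexed_block
    return index, block


def parallel_transfer(sequence, block_size, workers=4):
    """Send all blocks concurrently, then merge them back in index order."""
    blocks = list(enumerate(split_blocks(sequence, block_size)))
    with ThreadPoolExecutor(max_workers=workers) as pool:
        received = list(pool.map(send_block, blocks))
    return "".join(block for _, block in sorted(received))


merged = parallel_transfer("AGTACGTTAGC", block_size=4)
```

    Each block carries its index so that the receiving side can merge the fragments correctly even if they arrive out of order.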
  • 24. Effect of parallel file transfer

    File Size   File Transfer   Split Time   Merge Time   C1      T1      C2      T2
    (MB)        (sec)           (sec)        (sec)        (sec)   (sec)   (sec)   (sec)
    100         6.23            0.02         0.03         2.13    2.18    0.73    0.78
    200         9.32            0.23         0.43         2.96    3.62    1.23    1.89
    300         11.43           0.85         1.64         3.84    6.33    1.16    3.65

    C1: Communication time from 3 client VMs to the server without multithreading.
    C2: Communication time from 3 client VMs to the server with multithreading.
    T1: Total time for file transfer from client to server without multithreading.
    T2: Total time for file transfer from client to server with multithreading.
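    The figures in the table appear to satisfy T1 = split + merge + C1 and T2 = split + merge + C2. That relation is inferred from the numbers, not stated on the slide, so the sketch below treats it as an assumption and checks it, then derives the T1/T2 ratio for each file size. The names `ROWS`, `is_consistent` and `speedup` are illustrative.

```python
# (size_mb, split, merge, c1, t1, c2, t2); all times in seconds, from the table.
ROWS = [
    (100, 0.02, 0.03, 2.13, 2.18, 0.73, 0.78),
    (200, 0.23, 0.43, 2.96, 3.62, 1.23, 1.89),
    (300, 0.85, 1.64, 3.84, 6.33, 1.16, 3.65),
]


def is_consistent(row, tol=0.005):
    """Check the assumed relation: total = split + merge + communication."""
    _, split, merge, c1, t1, c2, t2 = row
    return (abs(split + merge + c1 - t1) < tol
            and abs(split + merge + c2 - t2) < tol)


def speedup(row):
    """Ratio of total time without multithreading to total time with it."""
    return round(row[4] / row[6], 2)


all_consistent = all(is_consistent(row) for row in ROWS)
speedups = [speedup(row) for row in ROWS]
```

    Under that assumption all three rows are consistent, and the ratio falls as file size grows because split and merge time take up a larger share of the total.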
  • 25. Time to start virtual machines [Chart: time in seconds (0 to 120) against number of VMs (1 to 4)] Parallelised starting of VMs can be done to reduce time.
  • 26. Cluster performance with respect to number of VMs: 30 KB sequences with 2 KB splits, up to 5 sequences [Chart: time in seconds (0 to 350) against number of sequences; series: 4 slave VMs (sec), 6 slave VMs (sec)] When the number of sequences is less than 6, a five-node hadoop cluster is sufficient.
  • 27. Dynamic scaling up/down of clusters
    VMs were instantiated based on the number of Map-Reduce tasks.
    The number of tasks was checked dynamically.
    New VMs were started and tasks were reallocated.
    Old VMs were destroyed if not used.

    Static VM creation is based on the predicted application load (maps + reduces); dynamic VM creation is based on the actual application load (maps + reduces). Block size: 10 MB.

    File Size   Static: Time   Static: VMs   Dynamic: Time   Dynamic: New VMs
    (GB)        (min-sec)                    (min-sec)       added
    1           5-36           2             3-16            1
    2           5-52           3             5-40            1
    3           8-27           4             5-48            2
    5           12-13          5             6-39            9
  • 28. Conclusion
    1) The proposed MSA improves on computation time while maintaining accuracy. Sequence alignment is parallelised at three levels. Hadoop data grids: data and compute parallelism and scalability. Dynamic programming: accuracy.
    2) Complexity is reduced from O(N^n) to O[K^2 * (n * (n-1)/2)] by combining the progressive and dynamic approaches and by blocking in hadoop.
    3) Enhancements (using clouds for MSA): automatic configuration of the cloud environment based on computational needs, and efficient upload of data into HDFS by parallel transfer of sequence fragments over the Internet.
  • 29. Other Projects: Enhancement of existing fairshare scheduler in hadoop; Reliability using Reed Solomon codes; Hybrid scheduler; Motif identification for MSA; CBIR using image signatures; Text categorization; Hybrid PSO (PSO and GA) for job scheduling; Semantic search using hadoop framework; Others – Globus and GridSim
  • 30. Acknowledgement The research has been carried out as a result of the PSG-Yahoo Research programme on Grid and Cloud computing. Sincere thanks to 1) Dr R Rudramoorthy, Principal, PSG College of Technology, Coimbatore. 2) Mr K V Chidambaran, Director, Grid and Cloud Systems Group, Yahoo, Bangalore