MapReduce and Clouds for Science
http://salsahpc.indiana.edu/
Indiana University Bloomington
Geoffrey Fox, Judy Qiu, SALSA Group

The SALSA project (salsahpc.indiana.edu) investigates new programming models for parallel multicore computing and Cloud/Grid computing. It aims to develop and apply parallel and distributed cyberinfrastructure to support large-scale data analysis. We illustrate this with a project for the life sciences: clustering of biology Alu and metagenomics sequences; a study of the usability and performance of different Cloud approaches; an iterative MapReduce runtime, Twister, to support complex data analysis algorithms for scientific applications; and engagement of undergraduate students in new programming models using Dryad and TPL through class, REU, and minority outreach programs.

Processing/Visualizing DNA Sequencing Pipeline

There is a data deluge throughout science, and all areas need analysis pipelines or workflows to propel the data from instruments through various stages to scientific discovery, often aided by visualization. It is well known that these pipelines typically offer natural data parallelism that can be implemented within many different frameworks. We chose to look at the MapReduce frameworks because they stem from the commercial information retrieval field, which is perhaps currently the world's most demanding data analysis problem. Exploiting commercial approaches offers a good chance of achieving high-quality, robust environments, and MapReduce has a mixture of commercial and open source implementations. This figure illustrates results from our research of a pipeline mode to provide services on demand (Software as a Service, SaaS) for genomics.

Biology MDS and Clustering Results

Alu Families: This visualizes results of Alu repeats from Chimpanzee and Human Genomes. Young families (green, yellow) are tight clusters.

Metagenomics: This visualizes results of clustering and dimension reduction to 3D of 30000 gene sequences from an environmental sample.

Usability and Performance of Different Cloud/MapReduce Models

We have demonstrated that clouds offer attractive computing paradigms for loosely coupled scientific applications. Higher level models include Dryad and Hadoop, which we find are easier to use than EC2 and Azure (less setup and fewer lines of code). The cost effectiveness of cloud data centers, combined with the comparable performance reported here, suggests that loosely coupled science applications will increasingly be implemented on clouds and that using MapReduce will offer convenient user interfaces with little overhead. Earlier studies have shown that MPI is similar in performance to Hadoop and Dryad.

Twister (MapReduce++)

Twister supports iterative MapReduce computations and allows MapReduce to achieve higher performance, perform faster data transfers, and reduce the time it takes to process vast sets of data for data mining and machine learning applications. The open source code supports streaming communication and long-running processes.
http://www.iterativemapreduce.org/

Undergraduate Research Experiences

The IU HBCU STEM Summer Scholar Institute is an eight-week program that provides opportunities for minority students to engage in continuous, substantive research and to work with researchers of our group on active projects. Funded by NSF, a team of STEM summer scholars from North Carolina A&T has joined the Community Grids Lab and is involved in research activities with the SALSA project, which is funded by Microsoft Research.
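The iterative MapReduce pattern mentioned for Twister can be illustrated with a small sketch. The following is not Twister's API and does not reproduce the SALSA clustering code; it is a minimal, single-process Python illustration of the general idea, assuming a k-means style clustering in which the same map and reduce steps are repeated until the cluster centers stop moving. All function names (`map_assign`, `reduce_mean`, `iterative_mapreduce_kmeans`) and parameters are hypothetical.

```python
from collections import defaultdict


def map_assign(point, centers):
    """Map step: emit (index of the nearest center, point)."""
    distances = [
        sum((p - c) ** 2 for p, c in zip(point, center)) for center in centers
    ]
    return distances.index(min(distances)), point


def reduce_mean(points):
    """Reduce step: average all points that share one center index."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def iterative_mapreduce_kmeans(points, centers, max_iters=20, tol=1e-9):
    """Repeat map -> group by key -> reduce until the centers converge.

    In a real iterative runtime the points would stay cached on the
    workers between iterations; here everything runs in one process.
    """
    centers = [tuple(c) for c in centers]
    for _ in range(max_iters):
        # Map phase: each point is processed independently (data parallel).
        groups = defaultdict(list)
        for key, value in (map_assign(p, centers) for p in points):
            groups[key].append(value)

        # Reduce phase: one new center per key; keep the old center
        # if no point was assigned to it.
        new_centers = [
            reduce_mean(groups[i]) if i in groups else centers[i]
            for i in range(len(centers))
        ]

        # Convergence check: largest squared movement of any center.
        shift = max(
            sum((a - b) ** 2 for a, b in zip(old, new))
            for old, new in zip(centers, new_centers)
        )
        centers = new_centers
        if shift <= tol:
            break
    return centers


if __name__ == "__main__":
    data = [(0, 0), (0, 2), (10, 10), (10, 12)]
    print(iterative_mapreduce_kmeans(data, [(0, 0), (10, 10)]))
```

The map phase touches each point independently, which is the natural data parallelism described above; only the small set of centers changes between iterations, which is why a runtime that keeps static data in long-running processes can avoid reloading it on every pass.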
