This document introduces BioCloud, a proposal for using cloud platforms and frameworks such as Amazon EC2 and Hadoop to process large biological datasets in parallel. It discusses how biology applications are becoming more resource-intensive and how cloud platforms can provide scalable computing resources at a lower cost than local hardware. It provides an overview of Hadoop and MapReduce as a framework for processing vast amounts of data across clusters of machines. Examples of companies applying MapReduce to terabytes of data include Google, Yahoo, and Facebook.
2. Disclaimer
I work in computer security research... there is no biology
background anywhere in my field, not even computer viruses ;)
While working, I stumbled across Hadoop for scalable web
spidering purposes.
I'm not a bioinformatician (yet)... but I saw a powerful tool that
could be useful in your research field(s):
"biodatacrunching" ?
4. Biology and computer science
• Increasingly resource-hungry applications
o Nowadays, they can be approached by "brute force"
o More data means more "iron" to crunch it
• Neither the local IT team nor the budget can keep up with this pace
o €€€ spent on new hardware
o €€€ spent on IT personnel
o Isn't it wiser to scale one machine at a time ?
• Developers get angry or frustrated about
o Delays in software installation and configuration
o Unscheduled downtimes
o Delays caused by insufficient computing power
5. What is cloud computing ?
In plain English:
http://www.youtube.com/watch?v=XdBd14rjcs0
8. Infrastructure
• Amazon
o EC2
o S3
o AMI
Recently added bioinformatics appliances
Public data sets
• Eucalyptus
o Open-source, server-side implementation of EC2 + AMI
o We run it for our internal projects
• Enomalism
• RightScale & Service Cloud
o Tools and consultants for upcoming cloud issues
10. Application layer
• Hadoop
o Open source MapReduce implementation
o Java-based, but any language can be used (via Hadoop Streaming; see the sketch below)
• CloudBurst (cloudburst-bio)
o MapReduce implementation fine-tuned for bio: parallel short-read mapping
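Since any language can plug into Hadoop via Hadoop Streaming, here is a minimal, illustrative sketch (not from the original slides): a hypothetical Python mapper/reducer pair that counts nucleotide frequencies in sequence data read line by line from stdin.

#!/usr/bin/env python
# mapper.py -- emits "<base><TAB>1" for every nucleotide on stdin
import sys

for line in sys.stdin:
    seq = line.strip().upper()
    if not seq or seq.startswith(">"):   # skip FASTA headers
        continue
    for base in seq:
        if base in "ACGT":
            print("%s\t1" % base)

#!/usr/bin/env python
# reducer.py -- sums the per-base counts; Hadoop delivers keys grouped and sorted
import sys

current, total = None, 0
for line in sys.stdin:
    key, value = line.rstrip("\n").split("\t", 1)
    if key != current and current is not None:
        print("%s\t%d" % (current, total))
        total = 0
    current = key
    total += int(value)
if current is not None:
    print("%s\t%d" % (current, total))

Both scripts would be submitted with the Streaming jar shipped with Hadoop (paths and input/output names here are placeholders):

hadoop jar hadoop-streaming.jar -input reads -output base_counts \
    -mapper mapper.py -reducer reducer.py -file mapper.py -file reducer.py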
12. What is Hadoop ?
Quotation from official web page:
"Hadoop is a software platform that lets one easily write and
run applications that process vast amounts of data."
"vast amounts of data (ATGTTAG...)" + "easily" = sounds good
isn't it ? or is it vaporware ?
13. What is it used for ?
• Attack problems that involve several GB, TB, or even PB of data
• The programmer does not need to care about job management
o The focus is on data transformation and piping (useful work)
• Not intended for real-time processing
• Suitable for offloading long batch jobs from databases
14. What is MapReduce ?
See the Joel on Software explanation
Useful to crunch *tons* of data, parallelized by design (see the sketch below)
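As a toy illustration of the model (plain Python, no Hadoop; all names and data here are invented): map emits key/value pairs, the framework sorts and groups them by key (the "shuffle" phase), and reduce folds each group.

from itertools import groupby
from operator import itemgetter

def map_fn(read):
    # emit (3-mer, 1) for every overlapping 3-mer in a DNA read
    for i in range(len(read) - 2):
        yield read[i:i+3], 1

def reduce_fn(kmer, counts):
    return kmer, sum(counts)

reads = ["ATGTTAG", "TTAGGCA"]
pairs = sorted((kv for r in reads for kv in map_fn(r)),
               key=itemgetter(0))        # stands in for the shuffle/sort phase
counts = [reduce_fn(k, [v for _, v in grp])
          for k, grp in groupby(pairs, key=itemgetter(0))]
print(counts)   # TTA and TAG appear twice, every other 3-mer once

Because map works on each read independently and reduce works on each key independently, both phases can be spread across as many machines as there is data.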
17. Who is using it ?
• Google
o Lots of internal projects (proprietary MapReduce)
Gmail spam filtering (machine learning)
Google Maps
...
• Yahoo
o Internal web graph (powers the search engine)
o Pig (SQL-ish abstraction)
o Sorted 1 terabyte of data in 209 seconds
• Facebook
o Large user graph, used for data mining (Hive)
19. Next steps ?
Identify resource-hungry applications (batch vs. interactive)
Migrate apps to the cloud:
1) Allocate a fixed amount of money
2) Give Amazon EC2 a try (see the sketch below)
3) Optional: build a (local) Rocks cluster with a Eucalyptus cloud
Test, deploy, automate, automate and automate ... Puppet ?
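For step 2, a minimal sketch of scripting EC2 from Python with the boto library (the region, AMI id and key pair below are placeholders, not values from the slides):

import boto.ec2

# connect using AWS credentials taken from the environment
conn = boto.ec2.connect_to_region("us-east-1")
reservation = conn.run_instances(
    "ami-12345678",           # placeholder AMI, e.g. a Hadoop appliance
    key_name="my-keypair",    # placeholder SSH key pair
    instance_type="m1.small",
)
instance = reservation.instances[0]
instance.update()             # refresh instance state from the EC2 API
print(instance.id, instance.state)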