1. Design of a DSL by Ruby
for heavy computations
over map-reduce clusters
the 37th Grace seminar
16th June, 2010
Koichi Fujikawa
Cirius Technologies, Inc.
4. We Live in the "Big Data" era
World-wide web page data (text only) is estimated at
400TB (at one point in time).
Some web service companies (like Google,
Yahoo, etc.) have to process this data for
their business, but..
A typical HDD can read data at 50MB/sec. This
means it would take about 2,200 hours (approx. 90
days) to read the total web data (400TB) on one
machine.
We need parallel processing and a parallel file system.
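The back-of-the-envelope estimate on this slide can be reproduced with a few lines of Ruby. This is only a sketch: it assumes decimal units (1TB = 10^12 bytes, 1MB = 10^6 bytes), and the 1000-machine figure is a hypothetical illustration that assumes ideal scaling with no coordination overhead.

```ruby
# Rough read time for the whole data set at a fixed sequential read speed.
# Assumes decimal units: 1 TB = 10**12 bytes, 1 MB = 10**6 bytes.
TOTAL_BYTES      = 400 * 10**12 # 400TB of text-only web page data
BYTES_PER_SECOND = 50 * 10**6   # 50MB/sec per disk

# Hours needed when the data is split evenly over `machines` disks
# (ideal scaling, an assumption made for illustration only).
def read_hours(total_bytes, bytes_per_second, machines = 1)
  seconds = total_bytes.to_f / (bytes_per_second * machines)
  seconds / 3600
end

one_machine   = read_hours(TOTAL_BYTES, BYTES_PER_SECOND)
many_machines = read_hours(TOTAL_BYTES, BYTES_PER_SECOND, 1000)

puts "1 machine:     #{one_machine.round} hours (#{(one_machine / 24).round} days)"
puts "1000 machines: #{many_machines.round(1)} hours"
```

Under these assumptions a single disk needs roughly three months, while an evenly split read finishes in a few hours, which is the motivation for parallel processing.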
5. MapReduce
MapReduce is one of the parallel skeletons
Popularized by Google's paper (2004)
MapReduce has two phases
Map phase: transforms a key and value into
another (key and) value
Reduce phase: aggregates and calculates
values for each key
Each record is processed by the map phase first and
then by the reduce phase
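The two phases can be sketched in plain Ruby with a word count. This is a single-process illustration, not Hadoop code: the names `map_phase`, `reduce_phase`, and `run_map_reduce` are hypothetical, and the grouping step stands in for the work a real framework does between the two phases.

```ruby
# Map phase: turn one input record (a line of text) into [key, value] pairs.
def map_phase(line)
  line.split.map { |word| [word.downcase, 1] }
end

# Reduce phase: aggregate all values that share one key.
def reduce_phase(key, values)
  [key, values.sum]
end

# Single-process stand-in for the framework: run map on every record,
# group the intermediate pairs by key, then run reduce once per key.
def run_map_reduce(records)
  pairs   = records.flat_map { |record| map_phase(record) }
  grouped = pairs.group_by { |key, _value| key }
  grouped.map { |key, kvs| reduce_phase(key, kvs.map { |_key, value| value }) }.to_h
end

counts = run_map_reduce(["Hadoop runs MapReduce", "MapReduce has two phases"])
p counts
```

On a cluster, the map calls run in parallel across machines and each key's values are routed to one reducer; the programmer still only supplies the two functions.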
7. Hadoop
Hadoop is an open source clone of Google
MapReduce, hosted by the Apache Software Foundation
Big web service providers (Yahoo, Facebook,
etc.) contribute to this project actively.
Large development and user community all
over the world (including Japan)
Hadoop conference Japan 2009
Hadoop source code reading events
9. Programming Model
General programmers and engineers are not
familiar with this "MapReduce" model, so it is
difficult for them to try out and use
Especially separating the work into Map and Reduce
There are no established "patterns of
MapReduce programming" because this
technology is not yet mature among engineers.
We have to find them individually, which is very
difficult and time-consuming.
10. Programming Language
Hadoop is written in Java, so
programmers need to write Map and Reduce
procedures in Java.
Java is a strongly typed, compiled language.
Some web service engineers don't like this kind of
language.
No problem if the code is fixed and
complete, but I wonder whether it is suitable for
ad-hoc prototyping and easy querying.
MapReduce jobs depend on what users want to
get, so flexibility is important, I think.
12. Hide complexity of MapReduce
I found that the description for MapReduce could
be simpler in some specific cases (e.g. log
analysis).
In this case (and almost all Hadoop usage today
is log analysis), it would be nice if
programmers could write the description without
worrying about MapReduce!
13. DSL approach by Ruby
For this description, I created a DSL for each
specific usage.
The log analysis DSL is a reference
implementation that I prepared.
As the DSL runtime environment for Hadoop, I
chose Ruby and JRuby, a Ruby
runtime that works on the JVM.
Ruby is a very flexible and reusable
object-oriented language, so it is very easy to create
a DSL processor.
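To illustrate why Ruby makes a DSL processor easy to write, here is a minimal sketch of a log analysis DSL built on `instance_eval`. Everything in it is hypothetical: the class `LogAnalysis` and the DSL words `columns` and `count_by` are made up for this example and are not the syntax of Hadoop Papyrus or any other library. The runner is a single-process stand-in for a cluster.

```ruby
# Hypothetical mini-DSL for log analysis; not the syntax of any real library.
class LogAnalysis
  def initialize
    @columns = []
    @key_column = nil
  end

  # DSL word: declare the space-separated columns of each log line, in order.
  def columns(*names)
    @columns = names
  end

  # DSL word: count records grouped by the given column.
  def count_by(name)
    @key_column = name
  end

  # Evaluate the block with self set to a new LogAnalysis, so the DSL words
  # above can be called without an explicit receiver.
  def self.define(&block)
    analysis = new
    analysis.instance_eval(&block)
    analysis
  end

  # Derived map step: emit [value of the key column, 1] for one line.
  def map_record(line)
    fields = @columns.zip(line.split).to_h
    [fields[@key_column], 1]
  end

  # Derived reduce step: sum the counts collected for one key.
  def reduce_key(key, values)
    [key, values.sum]
  end

  # Single-process runner standing in for the cluster.
  def run(lines)
    lines.map { |line| map_record(line) }
         .group_by(&:first)
         .map { |key, pairs| reduce_key(key, pairs.map(&:last)) }
         .to_h
  end
end

analysis = LogAnalysis.define do
  columns :host, :method, :path
  count_by :path
end

result = analysis.run([
  "10.0.0.1 GET /index.html",
  "10.0.0.2 GET /about.html",
  "10.0.0.3 GET /index.html"
])
p result
```

The user writes only the two-line block passed to `define`; the map and reduce steps are derived from it, so the description never mentions MapReduce.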
15. Hadoop Papyrus
DSL framework for Hadoop, built on JRuby
We can write log analysis code in
only several lines.
Open source (Apache License), same as
Hadoop
Hosted on GitHub
Distributed via the common Ruby archive site
RubyGems.org
Supported by IPA mitoh 2009
20. On the way to big challenge
We need a parallel processing method to
handle massive web-scale data.
MapReduce and Hadoop are good tools,
but..
It is difficult to describe Map and Reduce
Writing Java is irritating for some people :-)
Hadoop Papyrus provides the key!
Ruby-based DSL framework for Hadoop
You can write Map and Reduce at once