Design of a DSL by Ruby for heavy computations

Koichi Fujikawa

Presentation given at the 37th GRACE seminar at NII on 16th June, 2010. http://www.grace-center.jp/event/node/59

Design of a DSL by Ruby
for heavy computations
over map-reduce clusters

the 37th Grace seminar
16th June, 2010

Koichi Fujikawa
Cirius Technologies, Inc.

Today's Agenda
Background
Problem
Approach
My Project
Conclusion

Background
Where are we in the world?

We Live in the "Big Data" Era
The text-only data of the world's web pages was estimated at 400 TB (at one point in time).
Web service companies such as Google and Yahoo have to process this data for their business, but...
A typical HDD reads data at about 50 MB/sec. Reading the whole 400 TB of web data on one machine would therefore take about 2,200 hours (more than 90 days).
We need parallel processing and a parallel file system.
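
The estimate above can be checked with a few lines of Ruby. This is only a back-of-the-envelope sketch: it assumes decimal units (1 TB = 10^12 bytes, 1 MB = 10^6 bytes), and the `read_time` helper and the 1000-machine cluster are illustrative assumptions, not figures from the talk. With these assumptions a single disk needs about 2,222 hours, or roughly 93 days, in line with the slide's rounded estimate.

```ruby
# Back-of-the-envelope read time for the figures on the slide.
# Assumption: decimal units (1 TB = 10**12 bytes, 1 MB = 10**6 bytes).
TOTAL_BYTES   = 400 * 10**12   # 400 TB of text-only web data
BYTES_PER_SEC = 50 * 10**6     # 50 MB/sec sequential read from one HDD

# Hypothetical helper: time to read total_bytes when the work is split
# evenly over `machines` disks that each read at bytes_per_sec.
def read_time(total_bytes, bytes_per_sec, machines = 1)
  seconds = total_bytes.to_f / (bytes_per_sec * machines)
  hours = seconds / 3600
  { seconds: seconds, hours: hours, days: hours / 24 }
end

single = read_time(TOTAL_BYTES, BYTES_PER_SEC)
puts format("1 machine: %.0f hours (%.1f days)", single[:hours], single[:days])
# prints "1 machine: 2222 hours (92.6 days)"

# Assumed cluster size, with the idealized assumption of perfect scaling.
cluster = read_time(TOTAL_BYTES, BYTES_PER_SEC, 1000)
puts format("1000 machines: %.1f hours", cluster[:hours])
# prints "1000 machines: 2.2 hours"
```

Real clusters do not scale this cleanly, but the gap between days and hours is what motivates reading the data in parallel.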

MapReduce
MapReduce is one of the parallel skeletons.
It became popular through Google's paper (2004).
MapReduce has two phases:
Map phase: transforms a key and value into another (key and) value
Reduce phase: aggregates and calculates over the values that share one key
Each record is processed by the map phase first and then by the reduce phase.
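
The two phases can be illustrated with a single-process word count in plain Ruby. This is a minimal sketch with no Hadoop involved: the names `map_phase`, `shuffle`, and `reduce_phase` are illustrative, and the explicit grouping step stands in for the work a real framework does between the phases across machines.

```ruby
# Single-process sketch of the MapReduce phases (illustrative names only).

# Map phase: turn each input record [line number, line text] into
# zero or more intermediate [key, value] pairs.
def map_phase(records)
  records.flat_map do |_line_no, text|
    text.split.map { |word| [word.downcase, 1] }
  end
end

# Shuffle: group the intermediate values by key. A real framework performs
# this step between the two phases.
def shuffle(pairs)
  pairs.group_by(&:first).transform_values { |kvs| kvs.map(&:last) }
end

# Reduce phase: aggregate all the values that share one key.
def reduce_phase(grouped)
  grouped.to_h { |key, values| [key, values.sum] }
end

records = [
  [1, "Hadoop runs MapReduce"],
  [2, "MapReduce runs map then reduce"]
]

counts = reduce_phase(shuffle(map_phase(records)))
counts.each { |word, n| puts "#{word}\t#{n}" }
```

Because the map step looks at one record at a time and the reduce step looks at one key at a time, both can be spread over many machines without changing the logic.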

Hadoop
Hadoop is an open-source clone of Google MapReduce, hosted by the Apache Software Foundation.
Big web service providers (Yahoo, Facebook, etc.) contribute to this project actively.
It has a large developer and user community all over the world (including Japan):
Hadoop Conference Japan 2009
Hadoop source code reading events

Problem
What issues do we face?

Programming Model
Most programmers and engineers are not familiar with the MapReduce model, so it is too difficult for them to try out and use.
Separating a task into Map and Reduce is especially hard.
There are no established patterns of MapReduce programming yet, because the technology is not mature among engineers.
Each of us has to work these patterns out individually, which is very difficult and time-consuming.

Programming Language
Hadoop is written in Java, so programmers need to write the Map and Reduce procedures in Java.
Java is a strongly typed, compiled language. Some web service engineers do not like this kind of language.
This is no problem if the code is fixed and complete, but I doubt that it is suitable for ad hoc prototyping and easy querying.
MapReduce jobs depend on what users want to get, so I think flexibility is important.

Approach
How do we resolve it?

Hide Complexity of MapReduce
I found that the description of a MapReduce job can be simpler in some specific cases (e.g. log analysis).
In such cases (and almost all Hadoop usage today is log analysis), it would be nice if programmers could write the description without having to think about MapReduce at all!
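
As a rough illustration of this idea, the sketch below hides the map and reduce steps behind a single call, so the caller only states which field of a log line to count. The `count_by` helper and the log lines are hypothetical and made up for this example. They run in one process and are not part of Hadoop or of the framework introduced later.

```ruby
# Hypothetical helper: the caller describes only the key to count;
# the map and reduce steps stay hidden inside.
def count_by(lines, &extract_key)
  lines
    .map { |line| extract_key.call(line) }   # hidden "map": line -> key
    .compact                                  # skip lines that did not match
    .each_with_object(Hash.new(0)) { |key, counts| counts[key] += 1 }  # hidden "reduce"
end

# Made-up access log lines for illustration.
log = [
  '127.0.0.1 - - [16/Jun/2010:10:00:00 +0900] "GET /index.html HTTP/1.1" 200 512',
  '127.0.0.1 - - [16/Jun/2010:10:00:01 +0900] "GET /about.html HTTP/1.1" 200 734',
  '127.0.0.1 - - [16/Jun/2010:10:00:02 +0900] "GET /index.html HTTP/1.1" 404 0'
]

# The whole "job" is one line: count requests per path.
hits_per_path = count_by(log) { |line| line[/"GET (\S+)/, 1] }
hits_per_path.each { |path, n| puts "#{path}\t#{n}" }
```

The point of the sketch is the last two lines: the analysis reads as a description of the result rather than as separate Map and Reduce procedures.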

DSL Approach by Ruby
To make such descriptions possible, I created a DSL for each specific usage.
The log analysis DSL is a reference implementation that I prepared.
As the DSL runtime environment for Hadoop, I chose Ruby and JRuby, which is a Ruby runtime that works on the JVM.
Ruby is a very flexible and reusable object-oriented language, so it is very easy to create a DSL processor with it.
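
To show why Ruby makes a DSL processor easy to build, here is a minimal sketch of an internal DSL based on `instance_eval`. The `LogAnalysis` class and its keywords `pattern` and `count_per` are invented for this illustration. They are assumptions, not the syntax of the framework described in this talk, and the sketch runs in a single process rather than on Hadoop.

```ruby
# Hypothetical mini DSL for log analysis (invented keywords, not a real API).
class LogAnalysis
  def initialize(&description)
    @pattern = nil
    @counted = []
    instance_eval(&description)   # evaluate the block with self = this object
  end

  # DSL keyword: a regular expression with one named capture per column.
  def pattern(regexp)
    @pattern = regexp
  end

  # DSL keyword: request a count of lines per value of the named column.
  def count_per(column)
    @counted << column.to_s
  end

  # The "processor": interpret the description against some input lines.
  def run(lines)
    result = @counted.to_h { |column| [column, Hash.new(0)] }
    lines.each do |line|
      match = @pattern.match(line)
      next unless match            # ignore lines the pattern does not cover
      @counted.each { |column| result[column][match[column]] += 1 }
    end
    result
  end
end

# The description reads like a small configuration language.
job = LogAnalysis.new do
  pattern(/"(?<method>\w+) (?<path>\S+)/)
  count_per :path
  count_per :method
end

lines = ['"GET /index.html', '"POST /login', '"GET /index.html']
p job.run(lines)
```

Blocks, `instance_eval`, and optional parentheses let ordinary method calls read as keywords, so the whole processor fits in one small class.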

My Project
What do I do?

Hadoop Papyrus
A DSL framework for Hadoop, built on JRuby
Log analysis code can be written in only a few lines.
Open source (Apache License), the same as Hadoop
Hosted on GitHub
Distributed through RubyGems.org, the common Ruby archive site
Supported by IPA Mitoh 2009

Conclusion
What is achieved now?

On the Way to a Big Challenge
We need a parallel processing method to handle massive web-scale data.
MapReduce and Hadoop are good tools for this, but...
It is difficult to describe Map and Reduce.
Writing Java is irritating for some people :-)
Hadoop Papyrus provides the key!
It is a Ruby-based DSL framework for Hadoop.
You can write Map and Reduce at once.

Questions?
Thank you very much!
Twitter ID: @fujibee
