This document discusses how small data can be analyzed effectively with simple command-line tools and scripting languages such as Ruby. The key points made are:
1) Most datasets are small and do not require a large Hadoop cluster for processing; for such workloads, command-line tools and scripts are often dramatically faster.
2) The Unix shell is a powerful programming environment: simple commands can be strung together into pipelines that analyze and transform data in flexible ways (a short pipeline sketch follows this list).
3) Ruby is a great fit for scripting small-data tasks and integrating with Unix tools, thanks to its clean syntax, large standard library, and suitability for both one-liners and full scripts (a script sketch also follows below).
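To make point 2 concrete, here is a minimal pipeline sketch. The file name (access.log) and the field position are hypothetical assumptions (a combined-format web server log with the HTTP status code in the ninth whitespace-separated field), not details taken from the document itself:

  # Tally HTTP status codes, most frequent first.
  # ruby -n loops over stdin line by line; -a autosplits each line into $F.
  cat access.log | ruby -nae 'puts $F[8] if $F[8]' | sort | uniq -c | sort -rn

Each stage does one small job: Ruby extracts a field, sort groups identical values so uniq -c can count them, and the final sort -rn orders the counts.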
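For point 3, the same tally can grow into a standalone Ruby script once a one-liner gets cramped; the input format is again an assumed example:

  # tally.rb - count occurrences of the ninth whitespace-separated field
  # (e.g. the status code column of a combined-format access log).
  counts = Hash.new(0)          # every new key starts at a count of zero
  ARGF.each_line do |line|      # ARGF reads the files named in ARGV, or stdin
    field = line.split[8]
    counts[field] += 1 if field # skip lines that are too short
  end
  counts.sort_by { |_, n| -n }.each { |value, n| puts "#{n} #{value}" }

Run it as: ruby tally.rb access.log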
“For the same amount of data I was able to use my laptop to get the results in about 12 seconds (270MB/sec), while the Hadoop cluster took about 26 minutes (1.14MB/sec).”
Adam Drake, “Command-line tools can be 235x faster than your Hadoop cluster”, http://bit.ly/1sS01aP