Hadoop is an excellent environment for analyzing large data sets, but it lacks an easy-to-use graphical interface for building data pipelines and performing advanced analytics. RapidMiner is an excellent open-source tool for data analytics, but it is limited to running on a single machine. In this presentation, we will introduce Radoop, an extension to RapidMiner that lets users interact with a Hadoop cluster. Radoop combines the strengths of both projects and provides a user-friendly interface for editing and running ETL, analytics, and machine learning processes on Hadoop. We will also discuss lessons learned while integrating HDFS, Hive, and Mahout with RapidMiner.
2. Who we are
• Active members of a Data Mining Research Group in Europe
• We started using Hadoop two years ago
• We are using basic Hadoop, Hive, and Mahout
11/9/2011 HadoopWorld 2011 2
3. Data mining tools
• Closed software
– SAS Enterprise Miner
– IBM SPSS Modeler
• Open-source software
– Rapid-I RapidMiner
– R
• Graphical user interface
• Data-flow structure
• Adaptability is important
4. Hadoop vs. Data mining tools
[Figure: Hadoop compared with data mining tools]
5. Why is it important?
• Barrier to entry for Hadoop
– Using Hadoop without expert Hadoop knowledge
• Development time vs. running time
• User-friendly graphical interface
– Program readability
6. RapidMiner
• The most used data mining tool in 2010*
• Open-source software
• Supports extensions
• Data-flow structure
• Marketplace
* Source: http://www.kdnuggets.com/
8. Implementation difficulties
RapidMiner and Hive data types
RapidMiner
• Nominal
– Text
– Polynominal
– Binominal
• Numeric
– Integer
– Real
• Date and time
– Date
– Time
Hive
• TINYINT
• SMALLINT
• INT
• BIGINT
• BOOLEAN
• FLOAT
• DOUBLE
• STRING
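One way to picture the mismatch between the two type systems is a lookup from RapidMiner attribute types to Hive column types. The sketch below is illustrative only: the slides list both type systems but do not say which Hive type Radoop picks for each RapidMiner type, so the mapping and the helper names `hive_column_type` and `create_table_statement` are assumptions.

```python
from typing import List, Tuple

# Hypothetical mapping from RapidMiner attribute types to Hive column types.
# Every choice here is an assumption made for illustration.
RAPIDMINER_TO_HIVE = {
    "text": "STRING",
    "polynominal": "STRING",
    "binominal": "STRING",
    "integer": "BIGINT",
    "real": "DOUBLE",
    # The Hive types listed above include no date or time type,
    # so this sketch falls back to STRING for both.
    "date": "STRING",
    "time": "STRING",
}


def hive_column_type(rapidminer_type: str) -> str:
    """Return the assumed Hive type for a RapidMiner attribute type."""
    key = rapidminer_type.strip().lower()
    if key not in RAPIDMINER_TO_HIVE:
        raise ValueError(f"unsupported RapidMiner type: {rapidminer_type!r}")
    return RAPIDMINER_TO_HIVE[key]


def create_table_statement(table: str, attributes: List[Tuple[str, str]]) -> str:
    """Build a HiveQL CREATE TABLE statement from (name, RapidMiner type) pairs."""
    columns = ", ".join(
        f"{name} {hive_column_type(rm_type)}" for name, rm_type in attributes
    )
    return f"CREATE TABLE {table} ({columns})"
```

For example, `create_table_statement("t", [("id", "integer"), ("name", "text")])` yields a two-column table definition. Note that the mapping is lossy: several RapidMiner types collapse onto STRING, so the original type would have to be tracked elsewhere to convert back.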
9. Implementation difficulties
• Input data restrictions for Mahout
– Data must be converted between Hive and Mahout
• Mahout needs data in a special format
– Data must be stored as VectorWritable objects
• Hive can export data
– Plain text or SequenceFile format
• Solution: simple MapReduce jobs
– Convert an exported plain-text Hive table to VectorWritable format and vice versa
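The per-record logic of such a conversion can be sketched as follows. This is not the authors' job: a real implementation would be a Java MapReduce job writing Mahout VectorWritable values into a SequenceFile. The sketch only shows how one row of a Hive plain-text export might be parsed into a numeric vector and written back. It assumes Hive's default text format (Ctrl-A field delimiter, `\N` for NULL), all-numeric columns, and NULLs replaced by a fill value; the function names are hypothetical.

```python
from typing import List

FIELD_DELIMITER = "\x01"  # Hive's default field separator for text tables (Ctrl-A)
NULL_TOKEN = "\\N"        # Hive's default text representation of NULL


def row_to_vector(line: str, missing: float = 0.0) -> List[float]:
    """Parse one exported Hive text row into a dense numeric vector.

    Assumes every column is numeric; NULL fields become `missing`
    (an assumption of this sketch, not something the slides specify).
    """
    fields = line.rstrip("\n").split(FIELD_DELIMITER)
    return [missing if field == NULL_TOKEN else float(field) for field in fields]


def vector_to_row(vector: List[float]) -> str:
    """Serialize a dense vector back into a Hive-style text row."""
    return FIELD_DELIMITER.join(repr(value) for value in vector)
```

Each function handles a single record, so either direction fits a map-only job with no reduce phase.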
10. Implementation difficulties
• Running Mahout jobs remotely
• Hadoop Common and Hive handle remote connections well
• Mahout, however, does not support remote execution
• Solution: modifications to Mahout’s base source code
11. Implementation status
• Data imports and exports
– CSV, Excel, and Database import/export
• Data transformations
– The most commonly used data manipulation functions
• Scalable machine learning and data mining
– Clustering algorithms
– Classification algorithms
17. Radoop case study
Creates a new view with a WHERE clause
18. Radoop case study
Creates a new view with a GROUP BY clause
19. Radoop case study
Creates a new view with a SORT BY clause
20. Radoop case study
Creates a new view with a LIMIT clause
21. Radoop case study
Creates a new table from the last view
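The five case-study steps map naturally onto a chain of HiveQL views that ends in a materialized table. The slides show only the captions, not the statements Radoop generates, so the sketch below is a guess at their general shape. The function name, the view-naming scheme, and the columns `customer` and `amount` are all hypothetical.

```python
from typing import List


def case_study_statements(source: str, target: str) -> List[str]:
    """Build a chain of HiveQL statements mirroring the five case-study steps.

    Each view selects from the previous one; only the last statement
    creates a table. Column names are placeholders for illustration.
    """
    filtered = f"{target}_filtered"
    grouped = f"{target}_grouped"
    ordered = f"{target}_sorted"
    top = f"{target}_top"
    return [
        # Step 1: view with a WHERE clause
        f"CREATE VIEW {filtered} AS SELECT * FROM {source} WHERE amount > 0",
        # Step 2: view with a GROUP BY clause
        f"CREATE VIEW {grouped} AS SELECT customer, SUM(amount) AS total "
        f"FROM {filtered} GROUP BY customer",
        # Step 3: view with a SORT BY clause
        f"CREATE VIEW {ordered} AS SELECT * FROM {grouped} SORT BY total DESC",
        # Step 4: view with a LIMIT clause
        f"CREATE VIEW {top} AS SELECT * FROM {ordered} LIMIT 10",
        # Step 5: table created from the last view
        f"CREATE TABLE {target} AS SELECT * FROM {top}",
    ]
```

Because views are only stored query definitions, Hive does no real work until the final `CREATE TABLE ... AS SELECT` runs, at which point the whole chain is executed as one query.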
22. Future
• “We believe that more than half of the world’s data will be stored in Apache Hadoop within five years.” (Hortonworks)
• Radoop is opening the doors for people who are less comfortable with Hadoop but want to use Hadoop for Big Data analytics