Hadoop World 2011: Radoop: a Graphical Analytics Tool for Big Data - Gabor Makrai, Radoop


Hadoop is an excellent environment for analyzing large data sets, but it lacks an easy-to-use graphical interface for building data pipelines and performing advanced analytics. RapidMiner is an excellent open-source tool for data analytics, but it is limited to running on a single machine. In this presentation, we will introduce Radoop, an extension to RapidMiner that lets users interact with a Hadoop cluster. Radoop combines the strengths of both projects and provides a user-friendly interface for editing and running ETL, analytics, and machine learning processes on Hadoop. We will also discuss lessons learned while integrating HDFS, Hive, and Mahout with RapidMiner.


Radoop: A Graphical Analytics Tool for Big Data

Gabor Makrai, CTO, Radoop

Who we are

• Active members of a data mining research group in Europe

• We started using Hadoop two years ago

• We are using basic Hadoop, Hive, and Mahout

11/9/2011 HadoopWorld 2011 2

Data mining tools

• Closed software
– SAS Enterprise Miner
– IBM SPSS Modeler
• Open-source software
– Rapid-I RapidMiner
– R
• Graphical user interface
• Data-flow structure
• Adaptability is important

Hadoop vs. Data mining tools

Hadoop Data mining tools


Why is it important?

• Barrier to entry for Hadoop
– Using Hadoop without expert Hadoop knowledge

• Development time vs. running time

• User-friendly graphical interface
– Program readability


RapidMiner

• The most used data mining tool in 2010*
• Open-source software
• Supports extensions
• Data-flow structure
• Marketplace

* Source: http://www.kdnuggets.com/


Radoop architecture


Implementation difficulties

RapidMiner and Hive data types

RapidMiner
• Nominal
– Text
– Polynominal
– Binominal
• Numeric
– Integer
– Real
• Date and time
– Date
– Time

Hive
• TINYINT
• SMALLINT
• INT
• BIGINT
• BOOLEAN
• FLOAT
• DOUBLE
• STRING
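The slide lists the two type systems side by side but does not say how Radoop pairs them. As a minimal sketch of what such a mapping could look like, the Python below assigns each RapidMiner value type a Hive column type and collapses the Hive types back onto RapidMiner types. Every pairing, along with the helper name `hive_column_definitions`, is an assumption made for illustration, not Radoop's actual mapping.

```python
# Illustrative mapping only: the slide names the two type systems but not
# how Radoop pairs them, so every pairing below is an assumption.
RAPIDMINER_TO_HIVE = {
    "text": "STRING",
    "polynominal": "STRING",
    "binominal": "STRING",
    "integer": "BIGINT",
    "real": "DOUBLE",
    # The Hive list on the slide has no date type, so dates and times are
    # assumed to be stored as strings here.
    "date": "STRING",
    "time": "STRING",
}

# Reverse direction: several Hive types collapse onto one RapidMiner type.
HIVE_TO_RAPIDMINER = {
    "TINYINT": "integer",
    "SMALLINT": "integer",
    "INT": "integer",
    "BIGINT": "integer",
    "BOOLEAN": "binominal",
    "FLOAT": "real",
    "DOUBLE": "real",
    "STRING": "polynominal",
}


def hive_column_definitions(attributes):
    """Build the column list of a CREATE TABLE statement.

    `attributes` is a list of (name, rapidminer_type) pairs.
    """
    return ", ".join(
        f"{name} {RAPIDMINER_TO_HIVE[rm_type]}" for name, rm_type in attributes
    )
```

The two dictionaries are not inverses of each other: converting a column to Hive and back can lose information (for example, a Text attribute would come back as Polynominal), which is one plausible reason the type systems appear under "Implementation difficulties".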


Implementation difficulties

• Input data restrictions for Mahout
– Conversion between Hive and Mahout
• Mahout needs data in a special format
– Data must be stored as VectorWritable objects
• Hive can export data
– Plain text or SequenceFile format
• Solution: simple MapReduce jobs
– Convert an exported plain-text Hive table to VectorWritable format and vice versa
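The per-record logic of such a conversion job can be sketched in a few lines. The Python below is not the MapReduce job from the talk (that job would write Mahout `VectorWritable` objects into a SequenceFile from Java); it only illustrates the parsing step, assuming the table was exported with Hive's default text format, where fields are separated by Ctrl-A and NULL is written as `\N`. The function names and the choice to map NULL to 0.0 are assumptions.

```python
# Hive's default text format separates fields with Ctrl-A (\x01) and writes
# NULL as \N. Mapping NULL to 0.0 is an assumption made for this sketch.
HIVE_FIELD_DELIMITER = "\x01"
HIVE_NULL = "\\N"


def hive_row_to_vector(line):
    """Parse one exported Hive text row into a dense list of floats."""
    fields = line.rstrip("\n").split(HIVE_FIELD_DELIMITER)
    return [0.0 if field == HIVE_NULL else float(field) for field in fields]


def vector_to_hive_row(vector):
    """Serialize a dense vector back into a Hive text row."""
    return HIVE_FIELD_DELIMITER.join(repr(float(value)) for value in vector)
```

In a real job, each mapper would apply the first function to its input lines and emit the result as a vector record; the reverse job would apply the second function to each vector.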


Implementation difficulties

• Running Mahout jobs remotely

• Hadoop Common and Hive handle remote connections well

• Mahout, by contrast, does not support remote execution

• Solution: modifications to Mahout’s base source code


Implementation status

• Data imports and exports
– CSV, Excel, and Database import/export

• Data transformations
– Most used data manipulation functions

• Scalable machine learning and data mining
– Clustering algorithms
– Classification algorithms


Radoop base elements

• Operator

• Process


Radoop case study

• Gets the Hive table
• Creates a new view with a WHERE clause
• Creates a new view with a GROUP BY clause
• Creates a new view with a SORT BY clause
• Creates a new view with a LIMIT clause
• Creates a new table from the last view
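The case-study steps (get the table, filter, group, sort, limit, materialize) suggest that each operator in a process becomes a Hive view built on the previous one, with a table created only at the end. The Python below is a hedged sketch of that idea, not Radoop's implementation: the `Operator` class, the `compile_process` function, the view-naming scheme, and the `weblog` table with its `status` and `url` columns are all hypothetical.

```python
from dataclasses import dataclass


@dataclass
class Operator:
    """One step of a process; `template` uses {source} for its input."""
    name: str
    template: str


def compile_process(source_table, operators, result_table):
    """Return the HiveQL statements for a linear chain of operators."""
    statements = []
    current = source_table
    for step, op in enumerate(operators, start=1):
        # Each operator reads the previous view and defines a new one.
        view = f"{result_table}_v{step}"
        select = op.template.format(source=current)
        statements.append(f"CREATE VIEW {view} AS {select}")
        current = view
    # Only the last step materializes data as a table.
    statements.append(f"CREATE TABLE {result_table} AS SELECT * FROM {current}")
    return statements


# Hypothetical table and column names, chosen only for illustration.
process = [
    Operator("filter", "SELECT * FROM {source} WHERE status = 200"),
    Operator("aggregate", "SELECT url, COUNT(*) AS hits FROM {source} GROUP BY url"),
    Operator("sort", "SELECT * FROM {source} SORT BY hits DESC"),
    Operator("limit", "SELECT * FROM {source} LIMIT 10"),
]

for statement in compile_process("weblog", process, "top_urls"):
    print(statement)
```

Because views are only definitions, no work would run on the cluster until the final `CREATE TABLE ... AS SELECT` is executed. Note that in Hive, `SORT BY` orders rows within each reducer, while `ORDER BY` produces a total order.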

Future

• “We believe that more than half of the world’s data will be stored in Apache Hadoop within five years.” (Hortonworks)

• Radoop opens the door for people who are less comfortable with Hadoop but want to use it for Big Data analytics


Contacts

• Gabor Makrai
– makrai@radoop.eu

• Webpage
– http://www.radoop.eu/

• E-mail
– radoop@radoop.eu

• Twitter
– @radoopeu







