Business Analyst Tools for Hadoop
Amr Awadallah
CTO, Cloudera, Inc.
Hadoop World
October 12th, 2010
Copyright 2010 Cloudera Inc. All Rights Reserved.
The Spectrum of Hadoop Users
[Diagram. Systems and tools shown: Logs, Files, Web Data; Enterprise Data Warehouse; Web Application; Enterprise Reporting; BI, Analytics; IDEs; Relational Databases; Low-Latency Serving Systems; Cloudera Enterprise. Users shown: Analysts, Business Users, Customers, Engineers, Operators.]
Evolution of Hadoop Query/Programming Languages
1. Java MapReduce: Offers the most flexibility and performance, but development cycles can be long (the “assembly language” of Hadoop).
2. Streaming MapReduce (also Pipes): Lets you develop in the programming language of your choice, at the cost of slightly lower performance and less flexibility.
3. Cascading: A thin Java library that sits on top of MapReduce and lets developers assemble complex processes.
4. Pig: A high-level language out of Yahoo, suitable for batch data flow workloads.
5. Hive: A SQL interpreter out of Facebook that also includes a metastore mapping files to their schemas and associated SerDe.
6. Oozie: A workflow server engine whose XML process definition language (PDL) enables creating a workflow of jobs composed of any of the above.
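As a rough illustration of the Streaming model in item 2, the sketch below writes a mapper and a reducer as plain Python functions over lines of text and runs them locally, with a sort standing in for Hadoop's shuffle. The function names (`mapper`, `reducer`, `run_local`) and the tab-separated input layout are assumptions made for this example; on a real cluster each stage would be a script that reads stdin and writes stdout.

```python
from itertools import groupby


def mapper(lines):
    """Emit one 'key<TAB>1' record per positive value in the first column."""
    for line in lines:
        fields = line.rstrip("\n").split("\t")
        try:
            value = int(fields[0])
        except ValueError:
            continue  # skip rows whose first column is not an integer
        if value > 0:
            yield f"{value}\t1"


def reducer(sorted_records):
    """Sum the counts for each key; the input must already be sorted by key."""
    keyed = (record.split("\t") for record in sorted_records)
    for key, group in groupby(keyed, key=lambda kv: kv[0]):
        yield f"{key}\t{sum(int(count) for _, count in group)}"


def run_local(lines):
    """Simulate map -> shuffle/sort -> reduce inside a single process."""
    return list(reducer(sorted(mapper(lines))))


if __name__ == "__main__":
    sample = ["3\ta", "-1\tb", "3\tc", "7\td"]
    for record in run_local(sample):
        print(record)
```

The number of records the reducer emits is the count of distinct positive values, which is the same question the Hive and Pig examples on the next slide answer.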
Hive vs Pig Example (count distinct values > 0)
• Hive syntax:
SELECT COUNT(DISTINCT col1)
FROM mytable
WHERE col1 > 0;
• Pig syntax:
mytable = LOAD 'myfile' AS (col1, col2, col3);
mytable = FOREACH mytable GENERATE col1;
mytable = FILTER mytable BY col1 > 0;
mytable = DISTINCT mytable;
mytable = GROUP mytable ALL;
mytable = FOREACH mytable GENERATE COUNT(mytable);
DUMP mytable;
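To make the comparison concrete, the following sketch answers the same question on an in-memory list twice: once as a single expression in the spirit of the Hive query, and once step by step in the spirit of the Pig data flow. The sample rows and function names are invented for illustration, and nothing here runs on Hadoop.

```python
def count_distinct_positive(rows):
    """Declarative style: one expression, like the Hive query."""
    return len({row[0] for row in rows if row[0] > 0})


def count_distinct_positive_stepwise(rows):
    """Data-flow style: one intermediate result per step, like the Pig script."""
    col1 = [row[0] for row in rows]        # FOREACH ... GENERATE col1
    positive = [v for v in col1 if v > 0]  # FILTER ... BY col1 > 0
    distinct = set(positive)               # DISTINCT
    return len(distinct)                   # GROUP + COUNT


if __name__ == "__main__":
    rows = [(3, "a", "x"), (-1, "b", "y"), (3, "c", "z"), (7, "d", "w"), (0, "e", "v")]
    print(count_distinct_positive(rows))
    print(count_distinct_positive_stepwise(rows))
```

Both functions return the same number; the difference is only in how much of the intermediate data flow the author has to spell out.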
Hive Features
• A subset of SQL covering the most common statements
• Agile data types: Array, Map, Struct, and JSON objects
• User Defined Functions and Aggregates
• Regular Expression support
• MapReduce support
• JDBC/ODBC support
• Partitions and Buckets (for performance optimization)
• In The Works: Indices, Columnar Storage, Views, MicroStrategy compatibility, Explode/Collect
• More details: http://wiki.apache.org/hadoop/Hive
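The performance benefit of partitions and buckets comes from reading less data. The sketch below models a partitioned table as a mapping from partition value to rows, so a filter on the partition column touches only one entry, and it assigns rows to buckets by hashing a column modulo the bucket count. The `dt` and `user_id` column names, the sample rows, and the use of a plain modulo as the hash are illustrative assumptions, not Hive's exact implementation.

```python
from collections import defaultdict


def partition_rows(rows, column):
    """Group rows by a partition column, like one directory per value."""
    partitions = defaultdict(list)
    for row in rows:
        partitions[row[column]].append(row)
    return dict(partitions)


def bucket_for(value, num_buckets):
    """Pick a bucket for an integer key; a plain modulo stands in for the hash."""
    return value % num_buckets


def scan(partitions, wanted):
    """Read only the requested partition instead of the whole table."""
    return partitions.get(wanted, [])


if __name__ == "__main__":
    rows = [
        {"dt": "2010-10-11", "user_id": 10},
        {"dt": "2010-10-12", "user_id": 11},
        {"dt": "2010-10-12", "user_id": 14},
    ]
    table = partition_rows(rows, "dt")
    print(len(scan(table, "2010-10-12")))               # rows read for one day
    print([bucket_for(r["user_id"], 4) for r in rows])  # bucket chosen per row
```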
The Hadoop Query Tool Ecosystem
Platform: Cloudera Enterprise, Cloudera’s Distribution for Hadoop
• In Memory: PowerPivot, QlikTech, EdgeSpring, Tableau
• ETL: Informatica, Pervasive, IBM DataStage, Microsoft SSIS, Talend, Kettle
• Query Authoring: Karmasphere, Quest (Toad)
• Spreadsheet: IBM BigSheets, Datameer
• BI/OLAP: MicroStrategy, IBM Cognos, SAP BOBJ, Microsoft SSRS, Jaspersoft, Pentaho
• Developer: Karmasphere, Eclipse, Cascading
• Stats/Math: SAS, IBM SPSS, Matlab, R/RHIPE, Mahout, Hama
• Reporting: SAP Crystal, Actuate/BIRT
Hadoop is very flexible; use the right tool for the job at hand.
Toad for Cloud (for Query Authoring)
[Screenshot labels: RDBMS, Hadoop]
Learn more at: http://www.ToadForCloud.com
Karmasphere (for Developers and Analysts)
Tableau (for Advanced Visualization)
Datameer (for Analysts, Spreadsheet UI)
MicroStrategy (for interactive Dashboards)
Talend (for Extract-Transform-Load, aka ETL)
General Advice for Choosing the Right Tool.
• First and foremost, what problem are you trying to solve? And
what is your skill set? Use the tool that gets you there fastest.
• What is the learning curve involved with this new tool?
• Does the tool interoperate with other systems?
• Does the tool leverage your existing investment in Pig/Hive?
• Does the tool lock you into a proprietary file format?
• Is the tool certified for Cloudera’s Distribution for Hadoop?
Appendix
Hive Agile Data Types
• STRUCTS:
    • SELECT mytable.mycolumn.myfield FROM …
• MAPS (Hashes):
    • SELECT mytable.mycolumn[mykey] FROM …
• ARRAYS:
    • SELECT mytable.mycolumn[5] FROM …
• JSON:
    • SELECT get_json_object(mycolumn, objpath
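For readers more familiar with general-purpose languages, these four access patterns map closely onto nested records, dictionaries, lists, and parsed JSON. The sketch below shows Python analogues; the row contents are invented, and `get_json_path` is a hypothetical helper that handles only simple `$.a.b` paths. It is not a reimplementation of Hive's `get_json_object`.

```python
import json


def get_json_path(json_text, path):
    """Return the value at a simple '$.a.b' path, or None if a key is missing."""
    node = json.loads(json_text)
    keys = [k for k in path.lstrip("$").split(".") if k]
    for key in keys:
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


if __name__ == "__main__":
    row = {
        "mystruct": {"myfield": "blue"},         # STRUCT: fixed, named fields
        "mymap": {"mykey": 42},                  # MAP: lookup by key
        "myarray": [0, 1, 2, 3, 4, 5],           # ARRAY: lookup by position
        "myjson": '{"owner": {"name": "amy"}}',  # JSON: stored as a string
    }
    print(row["mystruct"]["myfield"])
    print(row["mymap"]["mykey"])
    print(row["myarray"][5])
    print(get_json_path(row["myjson"], "$.owner.name"))
```

The JSON case differs from the other three in that the column holds text, so the value has to be parsed on every access rather than read from a typed structure.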