Hadoop Summit 2012 | Integrating Hadoop Into the Enterprise

Integrating Hadoop into the Enterprise
Jonathan Seidman
Hadoop Summit 2012
June 14th, 2012

Who I Am

• Solutions Architect, Partner Engineering
Team.
• Co-founder of Chicago Hadoop User
Group and co-founder/organizer of
Chicago Big Data.
• jseidman@cloudera.com
• @jseidman
• cloudera.com/careers

©2012 Cloudera, Inc. All Rights Reserved.

What I’ll Be Talking About
• Some Background.
• Common uses of Hadoop in an enterprise data
infrastructure.
• Hadoop Integration – the big picture.
• Deeper dive:
– Data import/export: Moving data between Hadoop
and existing data stores.
– ETL tools.
– Business intelligence (BI) and analytic tools.
• Example architectures and data flows.
• Conclusions


My Life Before Cloudera…


Hadoop at Orbitz
[Chart: Queries vs. Searches; y-axis 0% to 100%, x-axis 1 to 20; labeled data points: 71.67%, 34.30%, 31.87%, 2.78%]


But Hadoop Was An Isolated System

Developers Business Analysts Normal
Users Humans


Hadoop + the Data Warehouse…


…Enabled New Analyses


In our opinion, integration with existing IT systems
and software is critical, as we know enterprises will
not be replacing these technologies anytime soon.

For Hadoop platforms this means integration with
existing databases, data warehouses, and
business-analytics and business-visualization
tools. *

* A near-term outlook for big data, Jo Maitland, GigaOM Pro, March 2012


What Can We Do?
• ETL
– Scalable ETL – allows companies to meet SLAs
(inexpensively).
– Agile – facilitates rapid modifications.
• Moving analysis off of existing systems.
• Sandbox for exploratory analytics.
• Using Hadoop as an active archive.
• Joining transactional data from a DB with
interaction data.
• Common theme: freeing up existing systems for
tasks they’re better suited for.


[Diagram: the big picture of Hadoop integration. Labels: BI/Analytics Tools; Enterprise Data Warehouse; Relational Databases; Flume; Data Import/Export; ETL Tools; Appliances; NoSQL]


Data Import/Export

[Diagram labels: Enterprise Data Warehouse; Relational Databases]


Sqoop Overview

• Apache project designed to ease import
and export of data between Hadoop and
relational databases.
• Provides functionality to do bulk imports
and exports of data with HDFS, Hive and
HBase.
• Java based. Leverages MapReduce to
transfer data in parallel.
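
To make the parallel transfer concrete, here is a minimal Python sketch of how a numeric key range could be divided among map tasks. It follows the general idea behind Sqoop's `--split-by` and `--num-mappers` options, where each mapper pulls its own slice of the table. The function name `split_ranges` and the boundary handling are assumptions made for illustration; this is not Sqoop's actual code.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into contiguous slices.

    Illustrative only: each slice stands for the key bounds one map task
    would use when pulling its share of the rows.
    """
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        raise ValueError("max_id must not be smaller than min_id")
    total = max_id - min_id + 1
    # Never create more slices than there are key values.
    slices = min(num_mappers, total)
    base, extra = divmod(total, slices)
    ranges = []
    start = min_id
    for i in range(slices):
        # The first `extra` slices take one additional key each.
        size = base + (1 if i < extra else 0)
        end = start + size - 1
        ranges.append((start, end))
        start = end + 1
    return ranges


if __name__ == "__main__":
    for low, high in split_ranges(1, 1000, 4):
        print(f"SELECT ... WHERE id >= {low} AND id <= {high}")
```

Evenly sized slices only balance the work when the split column is evenly distributed, which is why the choice of split column matters in practice.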


Sqoop Overview

• Uses a “connector” abstraction.
• Two types of connectors
– Standard connectors are JDBC based.
– Direct connectors use native database
interfaces to improve performance.
• Direct connectors are available for many
open-source and commercial databases –
MySQL, PostgreSQL, Oracle, SQL
Server, Teradata, etc.
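
As a rough illustration of the two connector types, the Python sketch below assembles a `sqoop import` command line and adds the `--direct` flag when a direct connector is wanted. The command is only built, not executed. The helper name `build_import_command`, the host `db.example.com`, the database `sales`, and the table `orders` are all hypothetical.

```python
import shlex


def build_import_command(jdbc_url, table, target_dir, direct=False, num_mappers=4):
    """Return the argument list for a `sqoop import` call (not executed)."""
    argv = [
        "sqoop", "import",
        "--connect", jdbc_url,
        "--table", table,
        "--target-dir", target_dir,
        "--num-mappers", str(num_mappers),
    ]
    if direct:
        # Request the database's native interface instead of plain JDBC.
        argv.append("--direct")
    return argv


if __name__ == "__main__":
    # Hypothetical connection details, for illustration only.
    command = build_import_command(
        "jdbc:mysql://db.example.com/sales", "orders", "/data/orders", direct=True
    )
    print(" ".join(shlex.quote(arg) for arg in command))
```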


Sqoop Import Flow

1. The client runs the import.
2. Sqoop collects metadata.
3. Sqoop generates code and executes an MR job.
4. The map tasks pull the data in parallel.
5. The map tasks write to Hadoop.


Sqoop Limitations

Sqoop has some limitations, including:
• Poor support for security.
$ sqoop import --username scott --password tiger…
– Sqoop can read command line options from
an option file, but this still has holes.
• Error-prone syntax.
• Tight coupling to JDBC model – not a
good fit for non-RDBMS systems.
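
The options-file approach mentioned above can be sketched in Python as follows. The script writes connection settings to a file readable only by its owner, then builds a command that uses `--options-file` and `-P` (prompt for the password) rather than putting the password on the command line. The helper names, file name, and connection details are hypothetical. The file on disk is still a weak spot, since anyone who can read it sees the connection details.

```python
import os
import tempfile


def write_options_file(path, options):
    """Write a Sqoop options file: one option or value per line."""
    # Create the file readable and writable by the owner only.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_TRUNC, 0o600)
    with os.fdopen(fd, "w") as handle:
        handle.write("# Connection settings kept off the command line\n")
        for item in options:
            handle.write(item + "\n")
    # Enforce the mode even if the file already existed.
    os.chmod(path, 0o600)


def build_command(options_path, table):
    """Return a `sqoop import` argument list that prompts for the password."""
    return ["sqoop", "import", "--options-file", options_path, "--table", table, "-P"]


if __name__ == "__main__":
    # Hypothetical file location and connection details.
    options_path = os.path.join(tempfile.mkdtemp(), "import.opts")
    write_options_file(
        options_path,
        ["--connect", "jdbc:mysql://db.example.com/sales", "--username", "scott"],
    )
    print(" ".join(build_command(options_path, "orders")))
```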


Fortunately…

Sqoop 2 (incubating) will address many of
these limitations:
• Adds a web-based GUI.
• Centralized configuration.
• More flexible model.
• Improved security model.


Informatica PowerExchange

• Not just RDBMS integration – provides
consistent, native integration between
Hadoop and a range of data
sources, databases, legacy
systems, standard file formats, CRM…
• Integrated with PowerCenter for pre/post-
processing of data, administration, and
metadata management.


PowerExchange – Data Import

Access Data → Pre-Process → Ingest Data

• Sources: web server; databases, data warehouse; message queues, email, social media; ERP, CRM; mainframe.
• PowerExchange access modes: batch, CDC, real-time.
• PowerCenter pre-processing: e.g. filter, join, cleanse.
• Targets: HDFS, Hive.


PowerExchange – Data Export

Extract Data → Post-Process → Deliver Data

• Source: HDFS.
• PowerCenter post-processing: e.g. transform to target schema.
• PowerExchange delivery modes: batch, real-time.
• Targets: web server; databases, data warehouse; ERP, CRM; mainframe.


Informatica PowerExchange
1. Create Ingest or Extract Mapping
2. Create Hadoop Connection
3. Configure Workflow
4. Configure Hive Properties


There’s Always the Low-Tech Way…

[Diagram: Hadoop processing → Hive → local disk → GPLoad → Greenplum]


ETL Tools


ETL – The Wikipedia Definition

• Extract, transform and load (ETL) is a
process in database usage and especially
in data warehousing that involves:
– Extracting data from outside sources
– Transforming it to fit operational needs
– Loading it into the end target (DB or data
warehouse)

http://en.wikipedia.org/wiki/Extract,_transform,_load
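
The three steps in this definition can be sketched end to end in a few lines of Python. This is a toy example under stated assumptions: the CSV sample, the `purchases` table, and the cleaning rules are invented for illustration, and an in-memory SQLite database stands in for the target warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input standing in for an outside source.
RAW_CSV = """user_id,amount,currency
1,19.99,usd
2,5.00,USD
3,,usd
"""


def extract(text):
    """Extract: parse the raw CSV into a list of dictionaries."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: drop incomplete rows, store cents, normalize currency."""
    cleaned = []
    for row in rows:
        if not row["amount"]:
            continue  # skip rows with a missing amount
        cents = round(float(row["amount"]) * 100)
        cleaned.append((int(row["user_id"]), cents, row["currency"].upper()))
    return cleaned


def load(rows):
    """Load: insert the cleaned rows into the target table."""
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE purchases (user_id INTEGER, amount_cents INTEGER, currency TEXT)"
    )
    conn.executemany("INSERT INTO purchases VALUES (?, ?, ?)", rows)
    conn.commit()
    return conn


if __name__ == "__main__":
    connection = load(transform(extract(RAW_CSV)))
    for record in connection.execute("SELECT * FROM purchases ORDER BY user_id"):
        print(record)
```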


ETL Tools

• Very common use case for Hadoop.
• Most ETL in Hadoop is still done through
plain old MapReduce.
• Companies want to leverage their existing
developer skills – many enterprises have
armies of SQL and ETL developers.
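
For readers coming from SQL, the shape of a hand-written MapReduce ETL job can be sketched in plain Python. The example below runs the map, shuffle, and reduce phases in a single process rather than on a cluster, so it only illustrates the programming model. The three-field log format and the function names are assumptions made for this sketch.

```python
from collections import defaultdict


def map_line(line):
    """Map phase: emit (page, 1) for each well-formed log line."""
    parts = line.split()
    if len(parts) != 3:
        return []  # skip malformed records
    _timestamp, _user, page = parts
    return [(page, 1)]


def reduce_counts(key, values):
    """Reduce phase: sum the counts collected for one key."""
    return key, sum(values)


def run_job(lines):
    """Simulate the shuffle by grouping mapper output by key."""
    groups = defaultdict(list)
    for line in lines:
        for key, value in map_line(line):
            groups[key].append(value)
    return dict(reduce_counts(key, values) for key, values in sorted(groups.items()))


if __name__ == "__main__":
    # Hypothetical log lines: timestamp, user, page.
    logs = ["t1 u1 /home", "t2 u2 /search", "corrupt-line", "t3 u1 /home"]
    print(run_job(logs))
```

The equivalent SQL would be a single `GROUP BY` query, which is part of why teams with SQL skills look for higher-level tools.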


Informatica HParser

• Not exactly ETL – provides data
transformation and parsing optimized for
parallel processing on Hadoop.
• Supports deeply hierarchical data and
complex data formats.
• Transformations are defined in a Windows
UI and then deployed to a Hadoop Cluster
for execution.


HParser – How does it work?
hadoop … dt-hadoop.jar
… My_Parser /input/*/input*.txt

HDFS

1. Develop a DT transformation.
2. Deploy the transformation to Hadoop.
3. Run DT on Hadoop to produce tabular data.
4. Analyze the data with Hive / Pig / MapReduce / Other…
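
To show what "producing tabular data" from hierarchical input means, here is a generic Python sketch that flattens nested JSON records into tab-separated rows. It is a stand-in for the idea only and does not use or imitate HParser's API; the function names, the dotted column naming, and the sample record are assumptions.

```python
import json


def flatten(record, prefix=""):
    """Flatten nested dictionaries into one level with dotted column names."""
    flat = {}
    for key, value in record.items():
        name = f"{prefix}{key}"
        if isinstance(value, dict):
            flat.update(flatten(value, prefix=name + "."))
        else:
            flat[name] = value
    return flat


def to_tsv(json_lines, columns):
    """Turn JSON lines into tab-separated rows with a fixed column order."""
    rows = []
    for line in json_lines:
        flat = flatten(json.loads(line))
        # Missing columns become empty fields so every row has the same width.
        rows.append("\t".join(str(flat.get(column, "")) for column in columns))
    return rows


if __name__ == "__main__":
    # Hypothetical hierarchical input record.
    sample = ['{"id": 1, "user": {"name": "ann", "geo": {"city": "Chicago"}}}']
    for row in to_tsv(sample, ["id", "user.name", "user.geo.city"]):
        print(row)
```

Rows in this shape can be loaded into a Hive table and queried like any other delimited data.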


Pentaho

• Existing BI tools extended to support
Hadoop.
• Not just ETL – also provides data
import/export, job
orchestration, reporting, and analysis
functionality.
• Supports integration with HDFS, Hive and
HBase.
• Community and Enterprise Editions
offered.

Pentaho
• Primary component is Pentaho Data Integration (PDI), also
known as Kettle.
• PDI provides a graphical drag-and-drop environment for
defining ETL jobs, which interface with Java MapReduce
to execute in-cluster transformations.


Other ETL Solutions

• Talend
– Also following an open-source model.
– Extending their existing data integration tools
to big data integration.
• Pervasive RushAnalyzer
– Software to build and run big data ETL, data
transformation, mining and visualization on
Hadoop.


Business Intelligence/Analytics Tools


BI – The Forrester Research Definition

“Business Intelligence is a set of
methodologies, processes, architectures, and
technologies that transform raw data into
meaningful and useful information used to
enable more effective strategic, tactical, and
operational insights and decision-making.” *

* http://en.wikipedia.org/wiki/Business_intelligence


Business Intelligence/Analytics Tools

[Diagram labels: Relational Databases; …; Data Warehouses]


Cloudera ODBC Driver
• Most of these tools use the ODBC standard.
• Since Hive is an SQL-like system, it’s a good fit for ODBC.
• An ODBC driver for Hive is available, but has licensing issues.
• Because of this, Cloudera developed its own drivers,
available for free download.

[Diagram: ODBC driver → HiveQL → Hive Server → Hive]

Hive ODBC Limitations

• Hive does not have full SQL support.
• Multi-user is currently not supported by
Hive Server.
• Poor support for security.
• Dependent on Hive – data must be loaded
in Hive to be available.
• The Thrift API in the Hive Server doesn’t
support common ODBC calls.


Hive ODBC Limitations

The Hive community is working on Hive Server 2 to
address some of these limitations:
• Improved support for multiple users.
• Improved support for ODBC and JDBC
drivers.
• And better support for security is coming.


MicroStrategy


Tableau


Other BI Connectors

• Microsoft ODBC Driver
– Part of the Hadoop on Windows solution.
– Provides connectivity for MS BI tools such as
Excel, PowerPivot, etc.
• MapR ODBC driver
– Support for standard ODBC based tools.


Analytic Tools

– RHadoop project.

– Integration of SAS analytics with Hadoop.

– Integration of SAP HANA with Hadoop

– Toad for Cloud


Hadoop Specific Tools – Karmasphere


Hadoop Specific Tools – Datameer


Example Integration

Event Logs → HParser → Hive → PowerCenter/PowerExchange → Data Warehouse

https://community.informatica.com/mpresources/Communities/IW2012/Docs/bos_65.pdf


Example – Migration of ETL

• Before: Logs → Raw Tables → ETL (SQL) → Target Tables, all within the Data Warehouse.
• After: Logs → Flume → HDFS → ETL (MapReduce) → Sqoop → Target Tables in the Data Warehouse.
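
The Sqoop step in this flow, which moves the MapReduce output from HDFS into the warehouse's target tables, could look roughly like the command assembled below. The Python helper only builds the `sqoop export` argument list and does not run it. The helper name, JDBC URL, table name, HDFS path, user name, and the tab delimiter are all assumptions for illustration.

```python
import shlex


def build_export_command(jdbc_url, table, export_dir, username, delimiter="\t"):
    """Return the argument list for a `sqoop export` call (not executed)."""
    return [
        "sqoop", "export",
        "--connect", jdbc_url,
        "--table", table,
        # HDFS directory holding the output of the ETL job.
        "--export-dir", export_dir,
        "--input-fields-terminated-by", delimiter,
        "--username", username,
        "-P",  # prompt for the password instead of passing it as an argument
    ]


if __name__ == "__main__":
    # Hypothetical warehouse and path names.
    command = build_export_command(
        "jdbc:postgresql://dw.example.com/warehouse",
        "target_table",
        "/etl/output/daily",
        "etl_user",
    )
    print(" ".join(shlex.quote(arg) for arg in command))
```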


What’s Missing?

• Better tools for ETL without coding.
• Better tools for data governance, data
quality, etc.
– Ensuring that data in Hadoop complies with
policies, rules, etc.
• Integration with commercial enterprise
schedulers/workflow engines.
– Although open-source workflow schedulers
exist (e.g. Oozie).


Conclusions
• Hadoop integration is still in the early stages.
– Expect to see new/better tools coming from both vendors
and the open-source community.
• Despite the relative immaturity of this space, there’s
already a dizzying array of solutions available.
– Choose solutions based on existing skills and tools already
in use by your organization.
• If using current BI tools integrated with Hive, keep in
mind that enhancements for multi-user support, security, etc.
are on the way.
• And it bears repeating: always use the right tool for the
job.
– Hadoop won’t replace your data warehouses and
databases, but will complement them.


Thank You! Questions?
http://www.cloudera.com/partners/spotlight/

+1 (888) 789-1488
sales@cloudera.com
cloudera.com
twitter.com/cloudera
facebook.com/cloudera
