Integrating Hadoop Into the Enterprise
2. Who I Am
• Solutions Architect, Partner Engineering
Team.
• Co-founder of Chicago Hadoop User
Group and co-founder/organizer of
Chicago Big Data.
• jseidman@cloudera.com
• @jseidman
• cloudera.com/careers
©2012 Cloudera, Inc. All Rights Reserved.
3. What I’ll Be Talking About
• Some Background.
• Common uses of Hadoop in an enterprise data
infrastructure.
• Hadoop Integration – the big picture.
• Deeper dive:
– Data import/export: Moving data between Hadoop
and existing data stores.
– ETL tools.
– Business intelligence (BI) and analytic tools.
• Example architectures and data flows.
• Conclusions
4. My Life Before Cloudera…
5. Hadoop at Orbitz
[Chart: Queries and Searches series; y-axis 0% to 100%, x-axis 1 to 20; labeled values 71.67%, 34.30%, 31.87%, 2.78%]
6. But Hadoop Was An Isolated System
[Diagram labels: Developers, Business Analysts, Users, Normal Humans]
7. Hadoop + the Data Warehouse…
9. In our opinion, integration with existing IT systems
and software is critical, as we know enterprises will
not be replacing these technologies anytime soon.
For Hadoop platforms this means integration with
existing databases, data warehouses, and
business-analytics and business-visualization
tools. *
* A near-term outlook for big data, Jo Maitland, GigaOM Pro, March 2012
10. What Can We Do?
• ETL
– Scalable ETL – allows companies to meet SLAs
(inexpensively).
– Agile – facilitates rapid modifications.
• Moving analysis off of existing systems.
• Sandbox for exploratory analytics.
• Using Hadoop as an active archive.
• Joining transactional data from a DB with
interaction data.
• Common theme: freeing up existing systems for
tasks they’re better suited for.
11. BI/Analytics Tools
[Diagram: Hadoop integration points: BI/analytics tools, enterprise data warehouse, relational databases, Flume, data import/export, ETL tools, appliances, NoSQL]
12. Data Import/Export
[Diagram: enterprise data warehouse, relational databases]
13. Sqoop Overview
• Apache project designed to ease import
and export of data between Hadoop and
relational databases.
• Provides functionality to do bulk imports
and exports of data with HDFS, Hive and
HBase.
• Java based. Leverages MapReduce to
transfer data in parallel.
14. Sqoop Overview
• Uses a “connector” abstraction.
• Two types of connectors
– Standard connectors are JDBC based.
– Direct connectors use native database
interfaces to improve performance.
• Direct connectors are available for many
open-source and commercial databases –
MySQL, PostgreSQL, Oracle, SQL
Server, Teradata, etc.
15. Sqoop Import Flow
1. Client runs the import.
2. Sqoop collects metadata from the database.
3. Sqoop generates code and executes a MapReduce job.
4. Map tasks pull the data in parallel.
5. Map tasks write the data to Hadoop.
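As a rough illustration of how an import can be spread across map tasks, the sketch below divides the range of a numeric split column into contiguous slices, one per mapper. It is a simplified model written for this summary, not Sqoop's actual code; the names `split_ranges` and `where_clause`, the column `id`, and the example bounds are all hypothetical.

```python
def split_ranges(min_id, max_id, num_mappers):
    """Divide the inclusive range [min_id, max_id] into up to num_mappers
    contiguous, non-overlapping (lo, hi) slices of near-equal size."""
    if num_mappers < 1:
        raise ValueError("num_mappers must be at least 1")
    if max_id < min_id:
        return []
    total = max_id - min_id + 1
    # Never create more slices than there are values to import.
    slices = min(num_mappers, total)
    base, extra = divmod(total, slices)
    ranges = []
    lo = min_id
    for i in range(slices):
        # The first `extra` slices take one additional value each.
        size = base + (1 if i < extra else 0)
        hi = lo + size - 1
        ranges.append((lo, hi))
        lo = hi + 1
    return ranges


def where_clause(column, lo, hi):
    """Render the predicate one map task would use to pull its slice."""
    return f"{column} >= {lo} AND {column} <= {hi}"


if __name__ == "__main__":
    # Hypothetical table with ids 1..1000 imported by four map tasks.
    for lo, hi in split_ranges(1, 1000, 4):
        print(where_clause("id", lo, hi))
```

Each slice maps to an independent query, which is what lets the map tasks pull data in parallel; a skewed split column would leave some tasks with far more rows than others.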
16. Sqoop Limitations
Sqoop has some limitations, including:
• Poor support for security.
$ sqoop import --username scott --password tiger…
– Sqoop can read command line options from
an option file, but this still has holes.
• Error-prone syntax.
• Tight coupling to JDBC model – not a
good fit for non-RDBMS systems.
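The options-file approach can be sketched as follows. This minimal example writes a Sqoop options file (one option or value per line) with owner-only permissions, so the credentials stay out of shell history and process listings. The helper name `write_options_file`, the JDBC URL, and the file path are hypothetical; the credentials are the ones from the slide. The password still sits in plain text on disk, which is presumably one of the remaining holes.

```python
import os
import stat
import tempfile


def write_options_file(path, options):
    """Write one option or value per line, readable only by the owner."""
    # O_EXCL makes creation fail if the file already exists, so we never
    # reuse a file that might have looser permissions.
    fd = os.open(path, os.O_WRONLY | os.O_CREAT | os.O_EXCL, 0o600)
    with os.fdopen(fd, "w") as handle:
        for token in options:
            handle.write(token + "\n")


if __name__ == "__main__":
    target = os.path.join(tempfile.mkdtemp(), "import.opts")
    write_options_file(target, [
        "import",
        "--connect", "jdbc:mysql://db.example.com/sales",  # hypothetical URL
        "--username", "scott",
        "--password", "tiger",  # still plain text on disk
    ])
    # Show the resulting permission bits of the file.
    print(oct(stat.S_IMODE(os.stat(target).st_mode)))
    # The job would then be started with: sqoop --options-file <path>
```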
17. Fortunately…
Sqoop 2 (incubating) will address many of
these limitations:
• Adds a web-based GUI.
• Centralized configuration.
• More flexible model.
• Improved security model.
18. Informatica PowerExchange
• Not just RDBMS integration – provides
consistent, native integration between
Hadoop and a range of data
sources, databases, legacy
systems, standard file formats, CRM…
• Integrated with PowerCenter for pre/post-
processing of data, administration, and
metadata management.
19. PowerExchange – Data Import
• Access Data (PowerExchange): web server; databases, data warehouse; message queues, email, social media; ERP, CRM; mainframe. Batch, CDC, or real-time.
• Pre-Process (PowerCenter): e.g. filter, join, cleanse.
• Ingest Data: HDFS, Hive.
20. PowerExchange – Data Export
• Extract Data: HDFS.
• Post-Process (PowerCenter): e.g. transform to target schema.
• Deliver Data (PowerExchange): web server; databases, data warehouse; ERP, CRM; mainframe. Batch or real-time.
21. Informatica PowerExchange
1. Create Ingest or Extract Mapping
2. Create Hadoop Connection
3. Configure Workflow
4. Configure Hive Properties
22. There’s Always the Low-Tech Way…
[Diagram: Hadoop processing → Hive → local disk → GPLoad → GreenPlum]
23. BI/Analytics Tools
[Diagram: Hadoop integration points: BI/analytics tools, enterprise data warehouse, relational databases, Flume, data import/export, ETL tools, appliances, NoSQL]
26. ETL – The Wikipedia Definition
• Extract, transform and load (ETL) is a
process in database usage and especially
in data warehousing that involves:
– Extracting data from outside sources
– Transforming it to fit operational needs
– Loading it into the end target (DB or data
warehouse)
http://en.wikipedia.org/wiki/Extract,_transform,_load
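The three stages in this definition can be sketched in a few lines. The example below is only an illustration under made-up assumptions: the CSV layout, the `purchases` table, and the function names are hypothetical, and an in-memory SQLite database stands in for the target database or data warehouse.

```python
import csv
import io
import sqlite3

# Hypothetical raw input: one interaction event per line.
RAW_LOG = """user_id,event,amount
1,purchase,19.99
2,search,
1,purchase,5.01
"""


def extract(text):
    """Extract: read rows from an outside source (here, CSV text)."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: keep purchases only and convert fields to proper types."""
    return [
        (int(row["user_id"]), float(row["amount"]))
        for row in rows
        if row["event"] == "purchase"
    ]


def load(records, connection):
    """Load: insert the cleaned records into the target table."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS purchases (user_id INTEGER, amount REAL)"
    )
    connection.executemany("INSERT INTO purchases VALUES (?, ?)", records)
    connection.commit()


if __name__ == "__main__":
    db = sqlite3.connect(":memory:")
    load(transform(extract(RAW_LOG)), db)
    print(db.execute("SELECT COUNT(*) FROM purchases").fetchone()[0])
```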
27. ETL Tools
• Very common use case for Hadoop.
• Most ETL in Hadoop is still done through
plain old MapReduce.
• Companies want to leverage their existing
developer skills – many enterprises have
armies of SQL and ETL developers.
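To make "plain old MapReduce" concrete, here is a toy, single-process imitation of the map, shuffle, and reduce phases applied to a log-aggregation step. It is not Hadoop code; the three-field log format and all function names are assumptions made for illustration.

```python
from collections import defaultdict


def map_phase(line):
    """Map: emit (status_code, 1) for each well-formed log line."""
    parts = line.split()
    if len(parts) == 3:
        _, _, status = parts
        yield status, 1


def shuffle(pairs):
    """Shuffle: group all emitted values by key."""
    grouped = defaultdict(list)
    for key, value in pairs:
        grouped[key].append(value)
    return grouped


def reduce_phase(key, values):
    """Reduce: collapse the values for one key into a single count."""
    return key, sum(values)


def run_job(lines):
    """Run map, shuffle, and reduce in sequence over the input lines."""
    pairs = (pair for line in lines for pair in map_phase(line))
    return dict(reduce_phase(k, v) for k, v in shuffle(pairs).items())


if __name__ == "__main__":
    logs = ["GET /home 200", "GET /missing 404", "POST /cart 200", "garbage"]
    print(run_job(logs))
```

On a real cluster the map and reduce functions run on many nodes and the framework performs the shuffle; writing and maintaining those functions by hand is the kind of coding effort that SQL and ETL developers would rather avoid.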
28. Informatica HParser
• Not exactly ETL – provides data
transformation and parsing optimized for
parallel processing on Hadoop.
• Supports deeply hierarchical data and
complex data formats.
• Transformations are defined in a Windows
UI and then deployed to a Hadoop Cluster
for execution.
29. HParser – How does it work?
Example invocation (data in HDFS): hadoop … dt-hadoop.jar … My_Parser /input/*/input*.txt
1. Develop a DT transformation.
2. Deploy the transformation to Hadoop.
3. Run DT on Hadoop to produce tabular data.
4. Analyze the data with Hive / Pig / MapReduce / other…
30. Pentaho
• Existing BI tools extended to support
Hadoop.
• Not just ETL – also provides data
import/export, job
orchestration, reporting, and analysis
functionality.
• Supports integration with HDFS, Hive and
HBase.
• Community and Enterprise Editions
offered.
31. Pentaho
• Primary component is Pentaho Data Integration (PDI), also known as Kettle.
• PDI provides a graphical drag-and-drop environment for defining ETL jobs, which interface with Java MapReduce to execute in-cluster transformations.
32. Other ETL Solutions
• Talend
– Also following an open-source model.
– Extending their existing data integration tools
to support Hadoop.
• Pervasive RushAnalyzer
– Software to build and run big data ETL, data
transformation, mining and visualization on
Hadoop.
33. BI/Analytics Tools
[Diagram: Hadoop integration points: BI/analytics tools, enterprise data warehouse, relational databases, Flume, data import/export, ETL tools, appliances, NoSQL]
35. BI – The Forrester Research Definition
“Business Intelligence is a set of methodologies, processes, architectures, and technologies that transform raw data into meaningful and useful information used to enable more effective strategic, tactical, and operational insights and decision-making.” *
* http://en.wikipedia.org/wiki/Business_intelligence
37. Cloudera ODBC Driver
• Most of these tools use the ODBC standard.
• Since Hive is an SQL-like system, it’s a good fit for ODBC.
• An ODBC driver for Hive is available, but has licensing issues.
• Because of this, Cloudera developed its own drivers, available for free download.
[Diagram: ODBC driver → HiveQL → Hive Server → Hive]
38. Hive ODBC Limitations
• Hive does not have full SQL support.
• Multi-user access is not currently supported by
Hive Server.
• Poor support for security.
• Dependent on Hive – data must be loaded
in Hive to be available.
• The Thrift API in the Hive Server doesn’t
support common ODBC calls.
39. Hive ODBC Limitations
The Hive community is working on Hive Server 2 to
address some of these limitations:
• Improved support for multiple users.
• Improved support for ODBC and JDBC
drivers.
• And better support for security is coming.
41. Tableau
42. Other BI Connectors
• Microsoft ODBC Driver
– Part of the Hadoop on Windows solution.
– Provides connectivity for MS BI tools such as
Excel, PowerPivot, etc.
• MapR ODBC driver
– Support for standard ODBC based tools.
43. Analytic Tools
– RHadoop project.
– Integration of SAS analytics with Hadoop.
– Integration of SAP HANA with Hadoop
– Toad for Cloud
46. Example Integration
[Diagram: event logs → HParser → Hive → PowerCenter/PowerExchange → data warehouse]
https://community.informatica.com/mpresources/Communities/IW2012/Docs/bos_65.pdf
47. Example – Migration of ETL
Before: Logs → Raw Tables → ETL (SQL) → Target Tables (all in the data warehouse)
After: Logs → Flume → HDFS → ETL (MapReduce) → Sqoop → Target Tables (in the data warehouse)
48. What’s Missing?
• Better tools for ETL without coding.
• Better tools for data governance, data
quality, etc.
– Ensuring that data in Hadoop complies with
policies, rules, etc.
• Integration with commercial enterprise
schedulers/workflow engines.
– Although open-source workflow schedulers
exist (e.g. Oozie).
49. Conclusions
• Hadoop integration is still in the early stages.
– Expect to see new/better tools coming from both vendors
and the open-source community.
• Despite the relative immaturity of this space, there’s
already a dizzying array of solutions available.
– Choose solutions based on existing skills and tools already
in use by your organization.
• If using current BI tools integrated with Hive, keep in
mind that enhancements for multi-user support, security, etc.
are on the way.
• And it bears repeating: always use the right tool for the
job.
– Hadoop won’t replace your data warehouses and
databases, but will complement them.
50. Thank You! Questions?
http://www.cloudera.com/partners/spotlight/
+1 (888) 789-1488
sales@cloudera.com
cloudera.com
twitter.com/cloudera
facebook.com/cloudera
Editor's Notes
• Common theme: moving time-, space-, or processor-intensive processing to Hadoop.
• Flume provides ingestion of streaming data (e.g. logs) into Hadoop.
• Client executes Sqoop job. Sqoop interrogates the DB for column names, types, etc. Based on the extracted metadata, Sqoop creates source code for a table class, and then kicks off an MR job. This table class can be used for processing the extracted records. By default, Sqoop will guess at a column for splitting data for distribution across the cluster. This can also be specified by the client.
• Pentaho also has integration with NoSQL DBs (Mongo, Cassandra, etc.).
• Most of these tools integrate with existing data stores using the ODBC standard.
• MSTR and Tableau are tested and certified now with the Cloudera driver, but other standard ODBC-based tools should also work, and more integrations will be supported soon. Also, Cloudera has implemented a solution for multi-user access, which will also soon support authentication.
• In-memory model supports low-latency queries.