Apache Sqoop was created to efficiently transfer bulk data between Hadoop-related systems (such as HDFS, Hive, and HBase) and structured data stores (such as relational databases, data warehouses, and NoSQL systems). The popularity of Sqoop in enterprise systems confirms that Sqoop handles bulk transfer admirably. At the same time, we have encountered many new challenges that have outgrown the abilities of the current infrastructure. To fulfill more data integration use cases, as well as become easier to manage and operate, a new generation of Sqoop, known as Sqoop 2, is currently under development to address several key areas, including ease of use, ease of extension, and security. This session covers Sqoop 2 from both the development and operations perspectives.
Hadoop Summit 2012 | A New Generation of Data Transfer Tools for Hadoop: Sqoop 2
1. A New Generation of Data Transfer
Tools for Hadoop: Sqoop 2
Bilung Lee (blee at cloudera dot com)
Kathleen Ting (kathleen at cloudera dot com)
Hadoop Summit 2012. 6/13/12 Apache Sqoop
Copyright 2012 The Apache Software Foundation
2. Who Are We?
• Bilung Lee
– Apache Sqoop Committer
– Software Engineer, Cloudera
• Kathleen Ting
– Apache Sqoop Committer
– Support Manager, Cloudera
3. What is Sqoop?
• Bulk data transfer tool
– Import/Export from/to relational databases,
enterprise data warehouses, and NoSQL systems
– Populate tables in HDFS, Hive, and HBase
– Integrate with Oozie as an action
– Support plugins via connector based architecture
– May ‘09: First version (HADOOP-5815)
– March ‘10: Moved to GitHub
– August ‘11: Moved to Apache
– April ‘12: Apache Top Level Project
4. Sqoop 1 Architecture
[Diagram: a Sqoop command invokes Hadoop Map Tasks that move data between external stores (relational databases, enterprise data warehouses, document-based systems) and HDFS/HBase/Hive.]
5. Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Tight coupling between data transfer and
output format
• Security concerns with openly shared
credentials
• Not easy to manage installation/configuration
• Connectors are forced to follow JDBC model
6. Sqoop 2 Architecture
7. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
8. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
9. Ease of Use
Sqoop 1 → Sqoop 2
• Client-only architecture → Client/server architecture
• CLI based → CLI + web based
• Client access to Hive, HBase → Server access to Hive, HBase
• Oozie and Sqoop tightly coupled → Oozie uses the REST API
10. Sqoop 1: Client-side Tool
• Client-side installation + configuration
– Connectors are installed/configured locally
– Local installation requires root privileges
– JDBC drivers are needed locally
– Database connectivity is needed locally
11. Sqoop 2: Sqoop as a Service
• Server-side installation + configuration
– Connectors are installed/configured in one place
– Managed by administrator and run by operator
– JDBC drivers are needed in one place
– Database connectivity is needed on the server
12. Client Interface
• Sqoop 1 client interface:
– Command line interface (CLI) based
– Can be automated via scripting
• Sqoop 2 client interface:
– CLI based (in either interactive or script mode)
– Web based (remotely accessible)
– REST API is exposed for external tool integration
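As a sketch of how an external tool such as Oozie could drive Sqoop 2 over its REST API instead of bundling the Sqoop CLI, consider the following toy Python. The endpoint path, port, and function name here are hypothetical, for illustration only; the real paths are defined by the Sqoop 2 server.

```python
import json
from urllib.request import Request

def build_submit_request(server, job_id):
    """Build an HTTP request that an external tool could send to a
    Sqoop 2 server to start a stored job. The /v1/job/<id>/submit
    path is an illustrative placeholder, not the documented API."""
    url = "%s/v1/job/%s/submit" % (server.rstrip("/"), job_id)
    return Request(url, data=json.dumps({}).encode(), method="POST")

req = build_submit_request("http://sqoop-server:12000/sqoop", 42)
print(req.get_full_url())  # http://sqoop-server:12000/sqoop/v1/job/42/submit
```

The point is architectural: the client only needs HTTP connectivity to the server, not a local Sqoop installation, JDBC drivers, or database access.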
13. Sqoop 1: Service Level Integration
• Hive, HBase
– Require local installation
• Oozie
– von Neumann(esque) integration:
• Package Sqoop as an action
• Then run Sqoop from node machines, causing one MR
job to be dependent on another MR job
• Error-prone, difficult to debug
14. Sqoop 2: Service Level Integration
• Hive, HBase
– Server-side integration
• Oozie
– REST API integration
15. Ease of Use
Sqoop 1 → Sqoop 2
• Client-only architecture → Client/server architecture
• CLI based → CLI + web based
• Client access to Hive, HBase → Server access to Hive, HBase
• Oozie and Sqoop tightly coupled → Oozie uses the REST API
16. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
17. Ease of Extension
Sqoop 1 → Sqoop 2
• Connectors forced to follow the JDBC model → Connectors given free rein
• Connectors must implement all functionality → Connectors benefit from a common framework of functionality
• Connector selection is implicit → Connector selection is explicit
18. Sqoop 1: Implementing Connectors
• Connectors are forced to follow JDBC model
– Connectors are limited/required to use common
JDBC vocabulary (URL, database, table, etc)
• Connectors must implement all Sqoop
functionality they want to support
– New functionality may not be available for
previously implemented connectors
19. Sqoop 2: Implementing Connectors
• Connectors are not restricted to JDBC model
– Connectors can define own domain
• Common functionality is abstracted out of connectors
– Connectors are only responsible for data transfer
– Common Reduce phase implements data
transformation and system integration
– Connectors can benefit from future development
of common functionality
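The split described above can be sketched in miniature. This is toy Python, not Sqoop code: the connector only knows how to move records, while a shared framework phase (the common Reduce phase in Sqoop 2's design) applies transformations such as format conversion, so every connector benefits when the framework improves.

```python
class EchoConnector:
    """Hypothetical connector: responsible only for data transfer,
    i.e. producing raw records from the external store."""
    def extract(self):
        yield from [("1", "alice"), ("2", "bob")]

def framework_transform(records, delimiter=","):
    """Common functionality lives outside the connector: here, a
    trivial stand-in for format conversion in the shared phase."""
    for rec in records:
        yield delimiter.join(rec)

rows = list(framework_transform(EchoConnector().extract()))
print(rows)  # ['1,alice', '2,bob']
```

A connector author writes only the `extract` side; output formats added to the framework later apply to this connector with no new connector code, which is the forward-compatibility benefit the slide describes.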
20. Different Options, Different Results
Which is running MySQL?
$ sqoop import --connect jdbc:mysql://localhost/db \
    --username foo --table TEST
$ sqoop import --connect jdbc:mysql://localhost/db \
    --driver com.mysql.jdbc.Driver --username foo --table TEST
• Different options may lead to unpredictable
results
– Sqoop 2 requires explicit selection of a connector,
thus disambiguating the process
21. Sqoop 1: Using Connectors
• Choice of connector is implicit
– In a simple case, based on the URL in --connect
string to access the database
– Specification of different options can lead to
different connector selection
– Error-prone but good for power users
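A toy illustration of why implicit selection is fragile. The mapping below is illustrative, not Sqoop's actual selection logic, but it mirrors the behavior from the previous slide: the connector is guessed from the JDBC URL, and one extra option silently changes the choice.

```python
def pick_connector(connect_url, driver=None):
    """Guess a connector from the options, Sqoop 1 style (toy logic).
    Explicitly passing --driver bypasses the specialized connector
    and falls back to the generic JDBC one."""
    if driver is not None:
        return "generic-jdbc"
    if connect_url.startswith("jdbc:mysql:"):
        return "mysql-direct"
    return "generic-jdbc"

assert pick_connector("jdbc:mysql://localhost/db") == "mysql-direct"
# Same database, one extra option, different connector:
assert pick_connector("jdbc:mysql://localhost/db",
                      driver="com.mysql.jdbc.Driver") == "generic-jdbc"
```

Sqoop 2 removes the guesswork by making the user name the connector explicitly.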
22. Sqoop 1: Using Connectors
• Require knowledge of database idiosyncrasies
– e.g. Couchbase has no notion of a table name, yet --table is required, so it gets overloaded to mean a backfill or dump operation
– e.g. the --null-string representation is not supported by all connectors
• Functionality is limited to what the implicitly
chosen connector supports
23. Sqoop 2: Using Connectors
• Users make explicit connector choice
– Less error-prone, more predictable
• Users need not be aware of the functionality
of all connectors
– Couchbase users need not care that other
connectors use tables
24. Sqoop 2: Using Connectors
• Common functionality is available to all
connectors
– Connectors need not worry about common
downstream functionality, such as transformation
into various formats and integration with other
systems
25. Ease of Extension
Sqoop 1 → Sqoop 2
• Connectors forced to follow the JDBC model → Connectors given free rein
• Connectors must implement all functionality → Connectors benefit from a common framework of functionality
• Connector selection is implicit → Connector selection is explicit
26. Sqoop 2 Themes
• Ease of Use
• Ease of Extension
• Security
27. Security
Sqoop 1 → Sqoop 2
• Support only for Hadoop security → Support for Hadoop security and role-based access control to external systems
• High risk of abusing access to external systems → Reduced risk of abusing access to external systems
• No resource management policy → Resource management policy
28. Sqoop 1: Security
• Inherit/Propagate Kerberos principal for the
jobs it launches
• Access to files on HDFS can be controlled via
HDFS security
• Limited support (user/password) for secure
access to external systems
29. Sqoop 2: Security
• Inherit/Propagate Kerberos principal for the
jobs it launches
• Access to files on HDFS can be controlled via
HDFS security
• Support for secure access to external systems
via role-based access to connection objects
– Administrators create/edit/delete connections
– Operators use connections
30. Sqoop 1: External System Access
• Every invocation requires necessary
credentials to access external systems (e.g.
relational database)
– Workaround: create a user with limited access in
lieu of giving out password
• Does not scale
• Permission granularity is hard to obtain
• Hard to prevent misuse once credentials are
given
31. Sqoop 2: External System Access
• Connections are enabled as first-class objects
– Connections encompass credentials
– Connections are created once and then used
many times for various import/export jobs
– Connections are created by administrator and
used by operator
• Safeguard credential access from end users
• Connections can be restricted in scope based
on operation (import/export)
– Operators cannot abuse credentials
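A minimal sketch of the connection idea (not Sqoop's implementation; class and field names are invented for illustration): credentials live inside an admin-created connection object, operators run jobs against it without ever handling the password, and each connection can be scoped to specific operations.

```python
class Connection:
    """Toy first-class connection object. An administrator creates it
    with credentials; an operator only calls run()."""
    def __init__(self, url, user, password, allowed_ops):
        self._url, self._user, self._password = url, user, password
        self.allowed_ops = set(allowed_ops)  # e.g. {"import"}

    def run(self, operation):
        # Scope restriction: the connection refuses operations it was
        # not enabled for, so operators cannot abuse the credentials.
        if operation not in self.allowed_ops:
            raise PermissionError("connection not enabled for " + operation)
        return "running %s against %s as %s" % (operation, self._url, self._user)

conn = Connection("jdbc:mysql://db/sales", "etl", "s3cret", ["import"])
print(conn.run("import"))
try:
    conn.run("export")  # out of scope for this connection
except PermissionError as e:
    print("blocked:", e)
```

Because the password stays private to the object, handing an operator a connection is safer than handing out database credentials.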
32. Sqoop 1: Resource Management
• No explicit resource management policy
– Users specify the number of map jobs to run
– Cannot throttle load on external systems
33. Sqoop 2: Resource Management
• Connections allow specification of resource
management policy
– Administrators can limit the total number of
physical connections open at one time
– Connections can also be disabled
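One way to model such a policy, as a sketch only (Sqoop 2's actual mechanism may differ): a bounded semaphore caps how many physical connections are open at once, and a flag lets an administrator disable a connection entirely.

```python
import threading

class ConnectionPool:
    """Toy resource-management policy attached to a connection:
    an admin-set cap on concurrent physical connections."""
    def __init__(self, max_open, enabled=True):
        self.enabled = enabled
        self._slots = threading.BoundedSemaphore(max_open)

    def acquire(self):
        if not self.enabled:
            raise RuntimeError("connection disabled by administrator")
        # Non-blocking acquire: refuse rather than queue, to throttle
        # the load placed on the external system.
        if not self._slots.acquire(blocking=False):
            raise RuntimeError("connection limit reached")
        return "connection opened"

    def release(self):
        self._slots.release()

pool = ConnectionPool(max_open=2)
print(pool.acquire())
print(pool.acquire())
try:
    pool.acquire()  # third open exceeds the policy
except RuntimeError as e:
    print("throttled:", e)
```

This is the contrast with Sqoop 1, where users pick a map-task count and nothing stops them from overwhelming the database.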
34. Security
Sqoop 1 → Sqoop 2
• Support only for Hadoop security → Support for Hadoop security and role-based access control to external systems
• High risk of abusing access to external systems → Reduced risk of abusing access to external systems
• No resource management policy → Resource management policy
35-39. Demo Screenshots
40. Takeaway
Sqoop 2 Highlights:
– Ease of Use: Sqoop as a Service
– Ease of Extension: Connectors benefit from
shared functionality
– Security: Connections as first-class objects and
role-based security
41. Current Status: Work in Progress
• Sqoop 2 Development:
http://issues.apache.org/jira/browse/SQOOP-365
• Sqoop 2 Blog Post:
http://blogs.apache.org/sqoop/entry/apache_sqoop_highlights_of_sqoop
• Sqoop 2 Design:
http://cwiki.apache.org/confluence/display/SQOOP/Sqoop+2
Editor's notes
Apache Sqoop was created to efficiently transfer bulk data between Hadoop and external structured datastores, because databases are not easily accessible by Hadoop. The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, to enhance its functionality, Sqoop needs to fulfill data integration use cases as well as become easier to manage and operate.
It’s an Apache TLP now.
Different connectors interpret these options differently. Some options are not understood for the same operation by different connectors, while some connectors have custom options that do not apply to others. Confusing for users and detrimental to effective use.
• Some connectors may support a certain data format while others don’t.
• Required to use common JDBC vocabulary (URL, database, table, etc.).
• Cryptic and contextual command line arguments can lead to error-prone connector matching, resulting in user errors.
• Due to tight coupling between data transfer and the serialization format, some connectors may support a certain data format that others don’t (e.g. the direct MySQL connector can’t support sequence files).
• There are security concerns with openly shared credentials.
• By requiring root privileges, local configuration and installation are not easy to manage.
• Debugging the map job is limited to turning on the verbose flag.
• Connectors are forced to follow the JDBC model and are required to use common JDBC vocabulary (URL, database, table, etc.), regardless of whether it is applicable.
Sqoop 1’s challenges will be addressed by the Sqoop 2 architecture:
• Connector only focuses on connectivity.
• Serialization, format conversion, and Hive/HBase integration should be uniformly available via the framework.
• Repository Manager: creates/edits connectors, connections, jobs.
• Connector Manager: registers new connectors, enables/disables connectors.
• Job Manager: submits new jobs to MR, gets job progress, kills specific jobs.
Kathleen to take over. Sqoop 2 is a work in progress and you are welcome to join our weekly conference calls discussing its design and implementation. Please see me afterwards for more details if interested. During our discussions we’ve identified three pain points for Sqoop 2 to address. Those of you who have used Sqoop will find those points apparent: ease of use, ease of extension, and security.
Pause for Q. Kathleen to take over. Sqoop 2 is a work in progress and you are welcome to join our weekly conference calls discussing its design and implementation. Please see me afterwards for more details if interested. During our discussions we’ve identified three pain points for Sqoop 2 to address. Those of you who have used Sqoop will find those points apparent: ease of use, ease of extension, and security.
There are 4 points we want to address with Sqoop 2’s ease of use.
Like other client-side tools, Sqoop 1 requires everything – connectors, root privileges, drivers, db connectivity – to be installed and configured locally.
Sqoop 2 will be a service, and as such you can install once and then run everywhere. This means that connectors will be configured in one place, managed by the Admin role and run by the Operator role, which will be discussed in detail later. Likewise, JDBC drivers will be in one place and database connectivity will only be needed on the server.
• Sqoop as a web-based service exposes the REST API.
• Front-ended by CLI and browser.
• Back-ended by a metadata repository.
An example of a document-based system is Couchbase. Sqoop 1 has something called a sqoop metastore, which is similar to a repository for metadata but not quite. That said, the model of operation for Sqoop 1 and Sqoop 2 is very different: Sqoop 1 was a limited-vocabulary tool while Sqoop 2 is more metadata driven. The design of Sqoop 2’s metadata repository is such that it can be replaced by other providers.
Sqoop 1 was intended for power users, as evidenced by its CLI. Sqoop 2 has two modes: one for the power user and one for the newbie. Those new to Sqoop will appreciate the interactive UI, which walks you through import/export setup, eliminating redundant/incorrect options. Various connectors are added in one place, with connectors exposing necessary options to the Sqoop framework and with the user only required to provide info relevant to their use case. Not bound by the terminal; well-documented return codes.
In Sqoop 1, Hive and HBase require local installation. Currently Oozie launches Sqoop by bundling it and running it on the cluster, which is error-prone and difficult to debug.
With Sqoop 2, Hive and HBase integration happens not from the client but from the backend. Hive does not need to be installed on Sqoop at all; what Sqoop will do is submit requests to the HiveServer over the wire. Exposing a REST API for operation and management will help Sqoop integrate better with external systems such as Oozie.
• Oozie and Sqoop will be decoupled: if you install a new Sqoop connector, then you don’t need to install it in Oozie also.
• Hive will not invoke anything in Sqoop, while Oozie does invoke Sqoop, so the REST API does not benefit Hive in any way, but it does benefit Oozie.
• Which Hive/HBase server the data will be put into is the responsibility of the reduce phase, which will have its own configuration. Since both these systems are on Hadoop, we don’t need any added security besides passing down the Kerberos principal.
4 points.
Pause for questions.
Because Sqoop is heavily JDBC centric, it’s not easy to work with non-relational databases.
• The Couchbase implementation required a different interpretation.
• Inconsistencies between connectors.
Two phases: first, transfer; second, transform/integration with other components.
• Option to opt out of downstream processing (i.e. revert to Sqoop 1).
• Trade-off between ease of connector/tooling development vs. faster performance.
• Separating data transfer (Map) from data transform (Reduce) allows connectors to specialize.
• Connectors benefit from a common framework of functionality: they don’t have to worry about forward compatibility.
• Functionally, Sqoop 2 is a superset of Sqoop 1, but does it in a different way.
• Too early in the design process to tell if the same CLI commands could be used, but most likely not, primarily because it is a fundamentally incompatible change.
• Reduce phase limited to stream transformations (no aggregation to start with).
The former is running MySQL, because specifying the driver option prevents the MySQL connector from working, i.e. it would end up using the generic JDBC connector.
Based on the URL in the connect string used to access the database (which is JDBC centric and something Sqoop 2 is moving away from), Sqoop attempts to predict which driver it should load. What are connectors?
• Plugin components based on Sqoop’s extension framework.
• Efficiently transfer data between Hadoop and external stores.
• Meant for optimized import/export, or for stores that don’t support native JDBC.
• Bundled connectors: MySQL, PostgreSQL, Oracle, SQL Server, JDBC.
• High-performance data transfer: Direct MySQL, Direct PostgreSQL.
Cryptic and contextual command line arguments can lead to error-prone connector matching, resulting in user errors. Due to tight coupling between data transfer and the serialization format, some connectors may support a certain data format that others don't (e.g. direct MySQL connector can't support sequence files).
With the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more predictable. Connectors are no longer forced to follow the JDBC model and are no longer required to use common JDBC vocabulary (URL, database, table, etc), regardless if it is applicable.
Common functionality will be abstracted out of connectors, holding them responsible only for data transport. The reduce phase will implement common functionality, ensuring that connectors benefit from future development of functionality.
Pause for questions. Bilung to take over.
No code generation and no compilation allow Sqoop to run where there are no compilers, which makes it more secure by preventing bad code from running. Previously, direct access to Hive/HBase was required; it is now more secure because requests are routed through the Sqoop server rather than opening up access to all clients to perform jobs.