This document provides an overview of Apache Sqoop and discusses the transition from Sqoop 1 to Sqoop 2. Sqoop is a tool for transferring data between relational databases and Hadoop. Sqoop 1 was connector-based and had challenges around usability and security. Sqoop 2 addresses these with a new architecture separating connections from jobs, centralized metadata management, and role-based security for database access. Sqoop 2 is the primary focus of ongoing development to improve ease of use, extensibility, and security of data transfers with Hadoop.
May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools
1. 1
Apache Sqoop 2: The next
generation of data transfer
Hari Shreedharan
Software Engineer, Cloudera
Contributor, Apache Sqoop
Committer/PMC Member, Apache Flume
2. 2
What is Sqoop?
• Apache Top-Level Project
• SQl to hadOOP
• Tool to transfer data from relational databases
• Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To Hadoop ecosystem
• HDFS (text, sequence file), Hive, HBase, Avro
• And vice versa
3. Why Sqoop?
• Efficient/Controlled resource utilization
• Concurrent connections, Time of operation
• Datatype mapping and conversion
• Automatic, and User override
• Metadata propagation
• Sqoop Record
• Hive Metastore
• Avro
3
5. Sqoop 1
• Based on Connectors
• Responsible for Metadata lookups, and Data Transfer
• Majority of connectors are JDBC based
• Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
• HBase Import, Avro Support, ...
5
6. 6
Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Security concerns
• Type mapping is not clearly defined
• Client needs access to Hadoop binaries/configuration
and database
• JDBC model is enforced
7. Sqoop 1 Challenges
• Non-uniform functionality
• Different connectors support different capabilities
• Overlapped/Duplicated functionality
• Different connectors may implement same capabilities
differently
• High coupling with Hadoop
• Database vendors required to understand Hadoop
idiosyncrasies in order to build connectors.
7
9. Sqoop 2 – Design Goals
• Security and Separation of Concerns
• Role based access and use
• Ease of extension
• No low-level Hadoop knowledge needed
• No functional overlap between Connectors
• Ease of Use
• Uniform functionality
• Domain specific interactions
9
10. 10
Sqoop 2: Connection vs Job metadata
There are two distinct sets of options to pass into Sqoop:
Job (distinct per table)Connection (distinct per database)
11. Sqoop 2: Workings
• Connectors register metadata
• Metadata enables creation of Connections and Jobs
• Connections and Jobs stored in Metadata Repository
• Operator runs Jobs that use appropriate connections
• Admins set policy for connection use
11
12. 12
Sqoop 2: Security
• Support for secure access to external systems via role-
based access to connection objects
• Administrators create/edit/delete connections
• Operators use connections
13. 13
Current Status: Sqoop 2
• Primary focus of the Sqoop Community
• Current Release: 1.99.2
• bits and docs: http://sqoop.apache.org/
Acommon problem that new users of Hadoop face is how to get their data into Hadoop. Using a tool called distcp, Hadoop users can efficiently copy data between clusters. This begs the question: how do I get all my data into Hadoop in the first place? One such ingest tool is Sqoop, which was created to efficiently transfer bulk data between Hadoop and external structured datastores because databases are not easily accessible by Hadoop. The canonical use case is performing a nightly dump of all the data in a transactional relational database into Hadoop for offline analysis. The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, there are further data integration use-cases that Sqoop needs to address as well as become easier to manage and operate.
Bulk data transfer toolImport/Export from/to relational databases, enterprise data warehouses, and NoSQL systemsPopulate tables in HDFS, Hive, and Hbase
* You might ask why Sqoop? What is the additional value?* Why not simply mysqldump
Command line toolClient applicationConfigured via command line arguments (60+!)+ Connector specific extra arguments+ Hadoop -D argumentsPluggable connectors/db specificUsing local Hadoop binaries and configurationWorkflow: command input from client, connect and fetch metadata from db, submit MR job, code generation
* Strength of Sqoop is one tool for all db systems* Heterogenous environments* Scope of connector is not limited to metadata operations* Going back to my previous example, Sqoop to run mysqldump for you* One drawback of this approach, too much power, not using wisely
Driver option is non-intuitive: specifying it means you want the generic connector instead of the specialized built-in connectorSecurity concerns: Openly shared cred, All clients needs to know credentials to DBs, Database access can't be centrally throttledSome connectors may support a certain data format that others don't (e.g. direct MySQL connector can't support sequence files) Sqoop users should not need to concern themselves with Sqoop admin responsibilitiesCouchbase does not need to specify a table name, which is required, causing --table to get overloaded as backfill or dump operation
* Sqoop is a great tool that is doing the job very well* Challenges that both user and developer needs to overcome were opened on previous slide* Too much power means too much responsibility
In Sqoop 2's client server architecture, the client will interact with the server and no longer directly interact with Hadoop nor the database. You'll install and configure connectors and drivers on the server, which will interface with the metastore repository and do resource throttling. Serialization, format conversion, and Hive/HBase integration should be uniformly available via the Sqoop framework.Sqoop 2 will be a service and as such you can install once and then run everywhere. This means that connectors will be configured in one place, managed by the Admin role and run by the Operator role. Likewise, JDBC drivers will be in one place and database connectivity will only be needed on the server.With Sqoop 2, Hive and HBase integration happens not from the client but on the server. In fact, Hive does not need to be installed on Sqoop at all - what Sqoop will do is submit requests to the HiveServer over the wire. Exposing a REST API for operation and management will help Sqoop integrate better with external systems such as Oozie.Oozie and Sqoop will be decoupled: if you install a new Sqoop connector then don’t need to install it in Oozie also.Hive will not invoke anything in Sqoop, while Oozie does invoke Sqoop so the REST API does not benefit Hive in any way but it does benefit Oozie. Which Hive/HBase server the data will be put into is the responsibility of the reduce phase which will have its own configuration and since both these systems have are on Hadoop - we don't need any added security besides passing down the Kerberos principal.
* Cost of improving existing Sqoop 1 would be too high, without clear path to address some challenges
What happens when you submit a Sqoop job – you create a connection and submit a jobWith the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more predictable. Users need not be aware of the functionality of all connectors e.g. Couchbase users need not care that other connectors use tablesConnectors are no longer forced to follow the JDBC model and are no longer required to use common JDBC vocabulary (URL, database, table, etc), regardless if it is applicable.Connections are enabled as first-class objects,encompass credentials,are created once and then used many times for various import/export jobsMetadata is divided into two groups: Connections and Jobs.Sqoop 2 saves the metadata in the server, which houses a metadata repository – so you don’t have to keep typing it nor be responsible for securing it.
* So far we've talked a lot about high level structures, so let's take a look on the workflow of creating the job and how it differs from Sqoop1* Let's discuss how it will work from users perspective
No code generation, no compilation allows Sqoop to run where there are no compilers, which makes it more secure by preventing bad code from runningPreviously required direct access to Hive/HBaseMore secure because routed through Sqoop server rather than opening up access to all clients to perform jobsConnections are created by administrator and used by operator (safeguard credential access from end users),can be restricted in scope based on operation (import/export) - operators cannot abuse credentials. Operators are allowed to create Jobs to import/export specific tables.It prevents misuse and abuse.
Register for HBaseCon 2013If you have questions, Jarcec will have office hours in the Cloudera booth today at 5pm and Kate will have office hours tomorrow at 11am.