May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

•Descargar como PPTX, PDF•

3 recomendaciones•2,605 vistas

This document provides an overview of Apache Sqoop and discusses the transition from Sqoop 1 to Sqoop 2. Sqoop is a tool for transferring data between relational databases and Hadoop. Sqoop 1 was connector-based and had challenges around usability and security. Sqoop 2 addresses these with a new architecture separating connections from jobs, centralized metadata management, and role-based security for database access. Sqoop 2 is the primary focus of ongoing development to improve ease of use, extensibility, and security of data transfers with Hadoop.

Tecnología

1
Apache Sqoop 2: The next
generation of data transfer
Hari Shreedharan
Software Engineer, Cloudera
Contributor, Apache Sqoop
Committer/PMC Member, Apache Flume

2
What is Sqoop?
• Apache Top-Level Project
• SQl to hadOOP
• Tool to transfer data from relational databases
• Teradata, MySQL, PostgreSQL, Oracle, Netezza
• To Hadoop ecosystem
• HDFS (text, sequence file), Hive, HBase, Avro
• And vice versa

Why Sqoop?
• Efficient/Controlled resource utilization
• Concurrent connections, Time of operation
• Datatype mapping and conversion
• Automatic, and User override
• Metadata propagation
• Sqoop Record
• Hive Metastore
• Avro
3

Sqoop 1
• Based on Connectors
• Responsible for Metadata lookups, and Data Transfer
• Majority of connectors are JDBC based
• Non-JDBC (direct) connectors for optimized data transfer
• Connectors responsible for all supported functionality
• HBase Import, Avro Support, ...
5

6
Sqoop 1 Challenges
• Cryptic, contextual command line arguments
• Security concerns
• Type mapping is not clearly defined
• Client needs access to Hadoop binaries/configuration
and database
• JDBC model is enforced

Sqoop 1 Challenges
• Non-uniform functionality
• Different connectors support different capabilities
• Overlapped/Duplicated functionality
• Different connectors may implement same capabilities
differently
• High coupling with Hadoop
• Database vendors required to understand Hadoop
idiosyncrasies in order to build connectors.
7

Sqoop 2 – Design Goals
• Security and Separation of Concerns
• Role based access and use
• Ease of extension
• No low-level Hadoop knowledge needed
• No functional overlap between Connectors
• Ease of Use
• Uniform functionality
• Domain specific interactions
9

10
Sqoop 2: Connection vs Job metadata
There are two distinct sets of options to pass into Sqoop:
Job (distinct per table)Connection (distinct per database)

Sqoop 2: Workings
• Connectors register metadata
• Metadata enables creation of Connections and Jobs
• Connections and Jobs stored in Metadata Repository
• Operator runs Jobs that use appropriate connections
• Admins set policy for connection use
11

12
Sqoop 2: Security
• Support for secure access to external systems via role-
based access to connection objects
• Administrators create/edit/delete connections
• Operators use connections

13
Current Status: Sqoop 2
• Primary focus of the Sqoop Community
• Current Release: 1.99.2
• bits and docs: http://sqoop.apache.org/

Más contenido relacionado

La actualidad más candente

Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UKSkills Matter

Streamline Hadoop DevOps with Apache AmbariDataWorks Summit/Hadoop Summit

Sqoop tutorialAshoka Vanjare

Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...Edureka!

Apache Sqoop: A Data Transfer Tool for HadoopCloudera, Inc.

January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...Yahoo Developer Network

HBase and Accumulo | Washington DC Hadoop User GroupCloudera, Inc.

The First Class Integration of Solr with Hadooplucenerevolution

11. From Hadoop to Spark 2/2Fabio Fumarola

SQOOP - RDBMS to HadoopSofian Hadiwijaya

Meet hbase 2.0enissoz

Big data: Loading your data with flume and sqoopChristophe Marchal

Hive on kafkaSzehon Ho

Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQLArseny Chernov

Get started with Developing Frameworks in Go on Apache MesosJoe Stein

Flexible and Real-Time Stream Processing with Apache FlinkDataWorks Summit

Multi-tenant, Multi-cluster and Multi-container Apache HBase DeploymentsDataWorks Summit

Using Morphlines for On-the-Fly ETLCloudera, Inc.

Hive on spark berlin buzzwordsSzehon Ho

Tuning Apache Ambari performance for Big Data at scale with 3000 agentsDataWorks Summit

La actualidad más candente (20)

Introduction to Sqoop Aaron Kimball Cloudera Hadoop User Group UK

Streamline Hadoop DevOps with Apache Ambari

Sqoop tutorial

Apache Sqoop Tutorial | Sqoop: Import & Export Data From MySQL To HDFS | Hado...

Apache Sqoop: A Data Transfer Tool for Hadoop

January 2015 HUG: Using HBase Co-Processors to Build a Distributed, Transacti...

HBase and Accumulo | Washington DC Hadoop User Group

The First Class Integration of Solr with Hadoop

11. From Hadoop to Spark 2/2

SQOOP - RDBMS to Hadoop

Meet hbase 2.0

Big data: Loading your data with flume and sqoop

Hive on kafka

Compressed Introduction to Hadoop, SQL-on-Hadoop and NoSQL

Get started with Developing Frameworks in Go on Apache Mesos

Flexible and Real-Time Stream Processing with Apache Flink

Multi-tenant, Multi-cluster and Multi-container Apache HBase Deployments

Using Morphlines for On-the-Fly ETL

Hive on spark berlin buzzwords

Tuning Apache Ambari performance for Big Data at scale with 3000 agents

Destacado

Keeley's Scoop #2seniorbps

re:Invent 2013-foster-madduriRavi Madduri

ADAS&ME presentation @ the SCOUT project expert workshop (22-02-2017, Brussels)joseplaborda

Jsm madduri-august-2015Ravi Madduri

CI4CC sustainability-panelRavi Madduri

Internet2 Bio IT 2016 v2Dan Taylor

Supporting Barack Obama for PresidentHarvard Medical School, Partners Healthcare

Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWSEd Dodds

Public.Cdsc.MiddletonHarvard Medical School, Partners Healthcare

Role of Amyloid Burden in cognitive decline Ravi Madduri

Effective ansibleWu Bigo

Big Data and GenomicsAl Costa

Big Process for Big DataIan Foster

HL7: Clinical Decision SupportHarvard Medical School, Partners Healthcare

Leap Motion Development (Rohan Puri)Camera Culture Group, MIT Media Lab

What is Media in MIT Media Lab, Why 'Camera Culture'Camera Culture Group, MIT Media Lab

Coded Photography - Ramesh RaskarCamera Culture Group, MIT Media Lab

What is SIGGRAPH NEXT? Intro by Ramesh RaskarCamera Culture Group, MIT Media Lab

Globus Genomics: Democratizing NGS AnalysisRavi Madduri

Raskar UIST Keynote 2015 NovemberCamera Culture Group, MIT Media Lab

Destacado (20)

Keeley's Scoop #2

re:Invent 2013-foster-madduri

ADAS&ME presentation @ the SCOUT project expert workshop (22-02-2017, Brussels)

Jsm madduri-august-2015

CI4CC sustainability-panel

Internet2 Bio IT 2016 v2

Supporting Barack Obama for President

Experiences In Building Globus Genomics Using Galaxy, Globus Online and AWS

Public.Cdsc.Middleton

Role of Amyloid Burden in cognitive decline

Effective ansible

Big Data and Genomics

Big Process for Big Data

HL7: Clinical Decision Support

Leap Motion Development (Rohan Puri)

What is Media in MIT Media Lab, Why 'Camera Culture'

Coded Photography - Ramesh Raskar

What is SIGGRAPH NEXT? Intro by Ramesh Raskar

Globus Genomics: Democratizing NGS Analysis

Raskar UIST Keynote 2015 November

Similar a May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

SQOOP AND IOTS ARCHITECTURE AND ITS APPLICATION.pptAjajKhan23

Big Data Developers Moscow Meetup 1 - sql on hadoopbddmoscow

Hadoop ppt1chariorienit

Cloudera Impala: A Modern SQL Engine for HadoopCloudera, Inc.

Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014cdmaxime

Big data - Online TrainingLearntek1

Impala presentationtrihug

Improvements in Hadoop SecurityChris Nauroth

Stream processing on mobile networkspbelko82

Cloudera Impala - San Diego Big Data Meetup August 13th 2014cdmaxime

Webinar: What's new in CDAP 3.5?Cask Data

Hpc lunch and learnJohn D Almon

Etu Solution Day 2014 Track-D: 掌握Impala和SparkJames Chen

Apache hadoop technology : BeginnersShweta Patnaik

Deploying Grid Services Using HadoopGeorge Ang

impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdfssusere05ec21

A Closer Look at Apache KuduAndriy Zabavskyy

Securing Hadoop in an Enterprise ContextHellmar Becker

Similar a May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools (20)

SQOOP AND IOTS ARCHITECTURE AND ITS APPLICATION.ppt

Big Data Developers Moscow Meetup 1 - sql on hadoop

Hadoop ppt1

Cloudera Impala: A Modern SQL Engine for Hadoop

Cloudera Impala - Las Vegas Big Data Meetup Nov 5th 2014

Big data - Online Training

Impala presentation

Improvements in Hadoop Security

Stream processing on mobile networks

Cloudera Impala - San Diego Big Data Meetup August 13th 2014

Webinar: What's new in CDAP 3.5?

Hpc lunch and learn

Etu Solution Day 2014 Track-D: 掌握Impala和Spark

Apache hadoop technology : Beginners

Deploying Grid Services Using Hadoop

impalapresentation-130130105033-phpapp02 (1)_221220_235919.pdf

A Closer Look at Apache Kudu

Securing Hadoop in an Enterprise Context

Más de Yahoo Developer Network

Developing Mobile Apps for Performance - Swapnil Patel, Verizon MediaYahoo Developer Network

Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...Yahoo Developer Network

Athenz & SPIFFE, Tatsuya Yano, Yahoo JapanYahoo Developer Network

Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...Yahoo Developer Network

CICD at Oath using ScrewdriverYahoo Developer Network

Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, OathYahoo Developer Network

How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenuYahoo Developer Network

The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, AmpoolYahoo Developer Network

Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...Yahoo Developer Network

Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...Yahoo Developer Network

HDFS Scalability and Security, Daryn Sharp, Senior Engineer, OathYahoo Developer Network

Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...Yahoo Developer Network

Moving the Oath Grid to Docker, Eric Badger, OathYahoo Developer Network

Architecting Petabyte Scale AI ApplicationsYahoo Developer Network

Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...Yahoo Developer Network

Jun 2017 HUG: YARN Scheduling – A Step BeyondYahoo Developer Network

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies Yahoo Developer Network

February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...Yahoo Developer Network

February 2017 HUG: Exactly-once end-to-end processing with Apache ApexYahoo Developer Network

February 2017 HUG: Data Sketches: A required toolkit for Big Data AnalyticsYahoo Developer Network

Más de Yahoo Developer Network (20)

Developing Mobile Apps for Performance - Swapnil Patel, Verizon Media

Athenz - The Open-Source Solution to Provide Access Control in Dynamic Infras...

Athenz & SPIFFE, Tatsuya Yano, Yahoo Japan

Athenz with Istio - Single Access Control Model in Cloud Infrastructures, Tat...

CICD at Oath using Screwdriver

Big Data Serving with Vespa - Jon Bratseth, Distinguished Architect, Oath

How @TwitterHadoop Chose Google Cloud, Joep Rottinghuis, Lohit VijayaRenu

The Future of Hadoop in an AI World, Milind Bhandarkar, CEO, Ampool

Apache YARN Federation and Tez at Microsoft, Anupam Upadhyay, Adrian Nicoara,...

Containerized Services on Apache Hadoop YARN: Past, Present, and Future, Shan...

HDFS Scalability and Security, Daryn Sharp, Senior Engineer, Oath

Hadoop {Submarine} Project: Running deep learning workloads on YARN, Wangda T...

Moving the Oath Grid to Docker, Eric Badger, Oath

Architecting Petabyte Scale AI Applications

Introduction to Vespa – The Open Source Big Data Serving Engine, Jon Bratseth...

Jun 2017 HUG: YARN Scheduling – A Step Beyond

Jun 2017 HUG: Large-Scale Machine Learning: Use Cases and Technologies

February 2017 HUG: Slow, Stuck, or Runaway Apps? Learn How to Quickly Fix Pro...

February 2017 HUG: Exactly-once end-to-end processing with Apache Apex

February 2017 HUG: Data Sketches: A required toolkit for Big Data Analytics

Último

SALESFORCE EDUCATION CLOUD | FEXLE SERVICESmohitsingh558521

Sample pptx for embedding into website for demoHarshalMandlekar2

Gen AI in Business - Global Trends Report 2024.pdfAddepto

Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity

Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada

Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University

What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett

"ML in Production",Oleksandr BaganFwdays

Rise of the Machines: Known As Drones...Rick Flair

WordPress Websites for Engineers: Elevate Your Brandgvaughan

What is Artificial Intelligence?????????blackmambaettijean

A Journey Into the Emotions of Software DevelopersNicole Novielli

TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc

From Family Reminiscence to Scholarly Archive .Alan Dix

Take control of your SAP testing with UiPath Test SuiteDianaGray10

The State of Passkeys with FIDO Alliance.pptxLoriGlavin3

Passkey Providers and Enabling Portability: FIDO Paris Seminar.pptxLoriGlavin3

Artificial intelligence in cctv survelliance.pptxhariprasad279825

"Debugging python applications inside k8s environment", Andrii SoldatenkoFwdays

unit 4 immunoblotting technique complete.pptxBkGupta21

May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

1. 1 Apache Sqoop 2: The next generation of data transfer Hari Shreedharan Software Engineer, Cloudera Contributor, Apache Sqoop Committer/PMC Member, Apache Flume

2. 2 What is Sqoop? • Apache Top-Level Project • SQl to hadOOP • Tool to transfer data from relational databases • Teradata, MySQL, PostgreSQL, Oracle, Netezza • To Hadoop ecosystem • HDFS (text, sequence file), Hive, HBase, Avro • And vice versa

3. Why Sqoop? • Efficient/Controlled resource utilization • Concurrent connections, Time of operation • Datatype mapping and conversion • Automatic, and User override • Metadata propagation • Sqoop Record • Hive Metastore • Avro 3

4. Sqoop 1 4

5. Sqoop 1 • Based on Connectors • Responsible for Metadata lookups, and Data Transfer • Majority of connectors are JDBC based • Non-JDBC (direct) connectors for optimized data transfer • Connectors responsible for all supported functionality • HBase Import, Avro Support, ... 5

6. 6 Sqoop 1 Challenges • Cryptic, contextual command line arguments • Security concerns • Type mapping is not clearly defined • Client needs access to Hadoop binaries/configuration and database • JDBC model is enforced

7. Sqoop 1 Challenges • Non-uniform functionality • Different connectors support different capabilities • Overlapped/Duplicated functionality • Different connectors may implement same capabilities differently • High coupling with Hadoop • Database vendors required to understand Hadoop idiosyncrasies in order to build connectors. 7

8. Sqoop 2 8

9. Sqoop 2 – Design Goals • Security and Separation of Concerns • Role based access and use • Ease of extension • No low-level Hadoop knowledge needed • No functional overlap between Connectors • Ease of Use • Uniform functionality • Domain specific interactions 9

10. 10 Sqoop 2: Connection vs Job metadata There are two distinct sets of options to pass into Sqoop: Job (distinct per table)Connection (distinct per database)

11. Sqoop 2: Workings • Connectors register metadata • Metadata enables creation of Connections and Jobs • Connections and Jobs stored in Metadata Repository • Operator runs Jobs that use appropriate connections • Admins set policy for connection use 11

12. 12 Sqoop 2: Security • Support for secure access to external systems via role- based access to connection objects • Administrators create/edit/delete connections • Operators use connections

13. 13 Current Status: Sqoop 2 • Primary focus of the Sqoop Community • Current Release: 1.99.2 • bits and docs: http://sqoop.apache.org/

14. 14

Notas del editor

Acommon problem that new users of Hadoop face is how to get their data into Hadoop. Using a tool called distcp, Hadoop users can efficiently copy data between clusters. This begs the question: how do I get all my data into Hadoop in the first place? One such ingest tool is Sqoop, which was created to efficiently transfer bulk data between Hadoop and external structured datastores because databases are not easily accessible by Hadoop. The canonical use case is performing a nightly dump of all the data in a transactional relational database into Hadoop for offline analysis. The popularity of Sqoop in enterprise systems confirms that Sqoop does bulk transfer admirably. That said, there are further data integration use-cases that Sqoop needs to address as well as become easier to manage and operate.
Bulk data transfer toolImport/Export from/to relational databases, enterprise data warehouses, and NoSQL systemsPopulate tables in HDFS, Hive, and Hbase
* You might ask why Sqoop? What is the additional value?* Why not simply mysqldump
Command line toolClient applicationConfigured via command line arguments (60+!)+ Connector specific extra arguments+ Hadoop -D argumentsPluggable connectors/db specificUsing local Hadoop binaries and configurationWorkflow: command input from client, connect and fetch metadata from db, submit MR job, code generation
* Strength of Sqoop is one tool for all db systems* Heterogenous environments* Scope of connector is not limited to metadata operations* Going back to my previous example, Sqoop to run mysqldump for you* One drawback of this approach, too much power, not using wisely
Driver option is non-intuitive: specifying it means you want the generic connector instead of the specialized built-in connectorSecurity concerns: Openly shared cred, All clients needs to know credentials to DBs, Database access can't be centrally throttledSome connectors may support a certain data format that others don't (e.g. direct MySQL connector can't support sequence files) Sqoop users should not need to concern themselves with Sqoop admin responsibilitiesCouchbase does not need to specify a table name, which is required, causing --table to get overloaded as backfill or dump operation
* Sqoop is a great tool that is doing the job very well* Challenges that both user and developer needs to overcome were opened on previous slide* Too much power means too much responsibility
In Sqoop 2's client server architecture, the client will interact with the server and no longer directly interact with Hadoop nor the database. You'll install and configure connectors and drivers on the server, which will interface with the metastore repository and do resource throttling. Serialization, format conversion, and Hive/HBase integration should be uniformly available via the Sqoop framework.Sqoop 2 will be a service and as such you can install once and then run everywhere. This means that connectors will be configured in one place, managed by the Admin role and run by the Operator role. Likewise, JDBC drivers will be in one place and database connectivity will only be needed on the server.With Sqoop 2, Hive and HBase integration happens not from the client but on the server. In fact, Hive does not need to be installed on Sqoop at all - what Sqoop will do is submit requests to the HiveServer over the wire. Exposing a REST API for operation and management will help Sqoop integrate better with external systems such as Oozie.Oozie and Sqoop will be decoupled: if you install a new Sqoop connector then don’t need to install it in Oozie also.Hive will not invoke anything in Sqoop, while Oozie does invoke Sqoop so the REST API does not benefit Hive in any way but it does benefit Oozie. Which Hive/HBase server the data will be put into is the responsibility of the reduce phase which will have its own configuration and since both these systems have are on Hadoop - we don't need any added security besides passing down the Kerberos principal.
* Cost of improving existing Sqoop 1 would be too high, without clear path to address some challenges
What happens when you submit a Sqoop job – you create a connection and submit a jobWith the user making an explicit connector choice in Sqoop 2, it will be less error-prone and more predictable. Users need not be aware of the functionality of all connectors e.g. Couchbase users need not care that other connectors use tablesConnectors are no longer forced to follow the JDBC model and are no longer required to use common JDBC vocabulary (URL, database, table, etc), regardless if it is applicable.Connections are enabled as first-class objects,encompass credentials,are created once and then used many times for various import/export jobsMetadata is divided into two groups: Connections and Jobs.Sqoop 2 saves the metadata in the server, which houses a metadata repository – so you don’t have to keep typing it nor be responsible for securing it.
* So far we've talked a lot about high level structures, so let's take a look on the workflow of creating the job and how it differs from Sqoop1* Let's discuss how it will work from users perspective
No code generation, no compilation allows Sqoop to run where there are no compilers, which makes it more secure by preventing bad code from runningPreviously required direct access to Hive/HBaseMore secure because routed through Sqoop server rather than opening up access to all clients to perform jobsConnections are created by administrator and used by operator (safeguard credential access from end users),can be restricted in scope based on operation (import/export) - operators cannot abuse credentials. Operators are allowed to create Jobs to import/export specific tables.It prevents misuse and abuse.
Register for HBaseCon 2013If you have questions, Jarcec will have office hours in the Cloudera booth today at 5pm and Kate will have office hours tomorrow at 11am.

May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Recomendados

Recomendados

Más contenido relacionado

La actualidad más candente

La actualidad más candente (20)

Destacado

Destacado (20)

Similar a May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Similar a May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools (20)

Más de Yahoo Developer Network

Más de Yahoo Developer Network (20)

Último

Último (20)

May 2013 HUG: Apache Sqoop 2 - A next generation of data transfer tools

Notas del editor