sqoop
Easy, parallel database import/export



Aaron Kimball
Cloudera Inc.
June 3, 2010
Sqoop is…


        … a suite of tools that connect Hadoop
        and database systems.



▪   Import tables from databases into HDFS for deep analysis
▪   Replicate database schemas in Hive’s metastore
▪   Export MapReduce results back to a database for presentation to
    end-users
In this talk…
▪   How Sqoop works
▪   Working with imported data
▪   Sqoop 1.0 Release
The problem
Structured data in traditional databases cannot be easily
combined with complex data stored in HDFS




                Where’s the bridge?
Sqoop = SQL-to-Hadoop
▪   Easy import of data from many databases to HDFS
▪   Generates code for use in MapReduce applications
▪   Integrates with Hive




[Diagram: data flows from mysql through sqoop into Hadoop; custom MapReduce programs reinterpret the data via auto-generated datatype definitions]
Example data pipeline
Key features of Sqoop
▪   JDBC-based implementation
    ▪   Works with many popular database vendors
▪   Auto-generation of tedious user-side code
    ▪   Write MapReduce applications to work with your data, faster
▪   Integration with Hive
    ▪   Allows you to stay in a SQL-based environment
▪   Extensible backend
    ▪   Database-specific code paths for better performance
Example input
mysql> use corp;
Database changed

mysql> describe employees;
+------------+-------------+------+-----+---------+----------------+
| Field      | Type        | Null | Key | Default | Extra          |
+------------+-------------+------+-----+---------+----------------+
| id         | int(11)     | NO   | PRI | NULL    | auto_increment |
| firstname  | varchar(32) | YES  |     | NULL    |                |
| lastname   | varchar(32) | YES  |     | NULL    |                |
| jobtitle   | varchar(64) | YES  |     | NULL    |                |
| start_date | date        | YES |      | NULL    |                |
| dept_id    | int(11)     | YES |      | NULL    |                |
+------------+-------------+------+-----+---------+----------------+
Loading into HDFS

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees




▪   Imports “employees” table into HDFS directory
    ▪   Data imported as text or SequenceFiles
    ▪   Optionally compress and split data during import (see the example after this list)
▪   Generates employees.java for your use
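
As a hedged example of those options (flag spellings as documented for Sqoop releases of this era; verify against your version), an import that stores the data as compressed SequenceFiles and uses eight parallel tasks might look like:

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --as-sequencefile \
           --compress \
           -m 8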
Example output

$ hadoop fs -cat employees/part-00000
0,Aaron,Kimball,engineer,2008-10-01,3
1,John,Doe,manager,2009-01-14,6




▪   Files can be used as input to MapReduce processing
Auto-generated class

public class employees {
  public Integer get_id();
  public String get_firstname();
  public String get_lastname();
  public String get_jobtitle();
  public java.sql.Date get_start_date();
  public Integer get_dept_id();
  // parse() methods that understand text
  // and serialization methods for Hadoop
}
Hive integration
 Imports table definition into Hive after data is imported to HDFS
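
A hedged example: the --hive-import flag is the documented way to request this combined import (check your Sqoop release for exact behavior). Roughly, Sqoop derives a CREATE TABLE statement from the JDBC column metadata and then loads the imported files into the Hive warehouse.

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --hive-import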
Export back to a database
mysql> CREATE TABLE ads_results (
       id INT NOT NULL PRIMARY KEY,
       page VARCHAR(256),
       score DOUBLE);


$ sqoop export \
         --connect jdbc:mysql://db.foo.com/corp \
         --table ads_results \
         --export-dir results




▪   Exports “results” dir into “ads_results” table
Additional options
▪   Multiple data representations supported
    ▪   TextFile – ubiquitous; easy import into Hive
    ▪   SequenceFile – for binary data; better compression support,
        higher performance
▪   Supports local and remote Hadoop clusters, databases
▪   Can select a subset of columns, specify a WHERE clause (see the example after this list)
▪   Controls for delimiters and quote characters:
    ▪   --fields-terminated-by, --lines-terminated-by,
        --optionally-enclosed-by, etc.
    ▪   Also supports delimiter conversion
        (--input-fields-terminated-by, etc.)
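
A sketch combining these options (--columns and --where are part of Sqoop's documented import options; exact quoting depends on your shell):

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --columns "id,firstname,lastname,start_date" \
           --where "start_date >= '2009-01-01'" \
           --fields-terminated-by '\t'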
Under the hood…
▪   JDBC
    ▪   Allows Java applications to submit SQL queries to databases
    ▪   Provides metadata about databases (column names, types, etc.; see the sketch after this list)
▪   Hadoop
    ▪   Allows input from arbitrary sources via different InputFormats
    ▪   Provides multiple JDBC-based InputFormats to read from
        databases
    ▪   Can write to arbitrary sinks via OutputFormats – Sqoop includes
        a high-performance database export OutputFormat
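
A minimal sketch of the kind of JDBC metadata lookup Sqoop relies on, assuming a MySQL JDBC driver on the classpath; the connection URL, user, and password are placeholders, not part of the talk:

import java.sql.*;

public class DescribeTable {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
            "jdbc:mysql://db.foo.com/corp", "user", "password")) {
      DatabaseMetaData meta = conn.getMetaData();
      // getColumns() is standard JDBC; it returns one row per column of the table
      try (ResultSet cols = meta.getColumns(null, null, "employees", "%")) {
        while (cols.next()) {
          System.out.println(cols.getString("COLUMN_NAME")
              + " -> " + cols.getString("TYPE_NAME"));
        }
      }
    }
  }
}

This is the same information that drives both the generated class's fields and the type mapping table shown later.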
InputFormat woes
▪   DBInputFormat allows database records to be used as mapper
    inputs
▪   The trouble with using DBInputFormat directly is:
    ▪   Connecting an entire Hadoop cluster to a database is a
        performance nightmare
    ▪   Databases have lower read bandwidth than HDFS; for repeated
        analyses, much better to make a copy in HDFS first
    ▪   Users must write a class that describes a record from each table
        they want to import or work with (a “DBWritable”)
DBWritable example
class MyRecord implements Writable, DBWritable {
  long msg_id;
  String msg;

  public void readFields(ResultSet resultSet) throws SQLException {
    this.msg_id = resultSet.getLong(1);
    this.msg = resultSet.getString(2);
  }

  public void readFields(DataInput in) throws IOException {
    this.msg_id = in.readLong();
    this.msg = in.readUTF();
  }
}
A direct type mapping

        JDBC Type                    Java Type
        CHAR                         String
        VARCHAR                      String
        LONGVARCHAR                  String
        NUMERIC                      java.math.BigDecimal
        DECIMAL                      java.math.BigDecimal
        BIT                          boolean
        TINYINT                      byte
        SMALLINT                     short
        INTEGER                      int
        BIGINT                       long
        REAL                         float
        FLOAT                        double
        DOUBLE                       double
        BINARY                       byte[]
        VARBINARY                    byte[]
        LONGVARBINARY                byte[]
        DATE                         java.sql.Date
        TIME                         java.sql.Time
        TIMESTAMP                    java.sql.Timestamp
        http://java.sun.com/j2se/1.3/docs/guide/jdbc/getstart/mapping.html
Class auto-generation
Working with Sqoop
▪   Basic workflow:
    ▪   Import initial table with Sqoop
    ▪   Use auto-generated table class in MapReduce analyses
    ▪   … Or write Hive queries over imported tables
    ▪   Perform periodic re-imports to ingest new data
    ▪   Use Sqoop to export results back to databases for online access


▪   Table classes can parse records from delimited files in HDFS
Processing records in MapReduce
void map(LongWritable k, Text v, Context c) {
    MyRecord r = new MyRecord();
    r.parse(v); // auto-generated text parser
    process(r.get_msg()); // your logic here
    ...
}
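
For context, a complete mapper using the employees class generated earlier might look roughly like the following. The parse() signature and the exception it throws are assumptions based on the slide's description, not the exact generated API:

import java.io.IOException;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.NullWritable;
import org.apache.hadoop.io.Text;
import org.apache.hadoop.mapreduce.Mapper;

public class JobTitleMapper
    extends Mapper<LongWritable, Text, Text, NullWritable> {

  // "employees" is the class Sqoop generated during the import shown earlier
  private final employees record = new employees();

  @Override
  protected void map(LongWritable key, Text value, Context context)
      throws IOException, InterruptedException {
    record.parse(value); // assumed signature; the real generated parse()
                         // may declare its own checked parse exception
    context.write(new Text(record.get_jobtitle()), NullWritable.get());
  }
}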
Import parallelism
▪   Sqoop uses indexed columns to divide a table into ranges
    ▪   Based on min/max values of the primary key (see the sketch after this list)
    ▪   Allows databases to use index range scans
    ▪   Several worker tasks import a subset of the table each
▪   MapReduce is used to manage worker tasks
    ▪   Provides fault tolerance
    ▪   Workers write to separate HDFS nodes; wide write bandwidth
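
A rough sketch of the range-splitting arithmetic described above; the real splitter also handles remainders, NULLs, and non-numeric keys:

public class SplitCalculator {
  public static void main(String[] args) {
    // Values that would come from "SELECT MIN(id), MAX(id) FROM employees"
    long min = 0, max = 99_999;
    int numTasks = 4;
    long span = (max - min + numTasks) / numTasks; // ceiling of rangeSize/numTasks
    for (int i = 0; i < numTasks; i++) {
      long lo = min + (long) i * span;
      long hi = Math.min(lo + span, max + 1);
      // Each range becomes an index-friendly predicate scanned by one worker task
      System.out.printf("task %d: WHERE id >= %d AND id < %d%n", i, lo, hi);
    }
  }
}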
Parallel exports
▪   Results from MapReduce processing
    stored in delimited text files


▪   Sqoop can parse these files, and
    insert the results in a database table
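
A minimal sketch of the export side, assuming delimited text parsed into fields and written with JDBC batching; Sqoop's actual exporter uses the OutputFormat mentioned earlier and multi-row INSERT statements, and the sample records here are placeholders:

import java.sql.*;

public class ExportSketch {
  public static void main(String[] args) throws SQLException {
    try (Connection conn = DriverManager.getConnection(
             "jdbc:mysql://db.foo.com/corp", "user", "password");
         PreparedStatement stmt = conn.prepareStatement(
             "INSERT INTO ads_results (id, page, score) VALUES (?, ?, ?)")) {
      String[] lines = { "0,index.html,0.92", "1,about.html,0.41" }; // stand-in for HDFS records
      for (String line : lines) {
        String[] f = line.split(",");
        stmt.setInt(1, Integer.parseInt(f[0]));
        stmt.setString(2, f[1]);
        stmt.setDouble(3, Double.parseDouble(f[2]));
        stmt.addBatch();
      }
      stmt.executeBatch(); // one round trip for the whole batch
    }
  }
}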
Direct-mode imports and exports
▪   MySQL provides mysqldump for high-performance table output
    ▪   Sqoop special-cases jdbc:mysql:// for faster loading
    ▪   With MapReduce, think “distributed mk-parallel-dump”


▪   Similar mechanism used for PostgreSQL
▪   Avoids JDBC overhead


▪   On the other side…
    ▪   mysqlimport provides really fast Sqoop exports
    ▪   Writers stream data into mysqlimport via named FIFOs
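
As a hedged example, direct mode is requested with the --direct flag (documented for MySQL, with a similar path for PostgreSQL as noted above; check your release):

$ sqoop import \
           --connect jdbc:mysql://db.foo.com/corp \
           --table employees \
           --direct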
Recent Developments
▪   April 2010: Sqoop moves to github
▪   May 2010: Preparing for 1.0 release
    ▪   Higher-performance pipelined export process
    ▪   Improved support for storage of large data (CLOBs, BLOBs)
    ▪   Refactored API, improved documentation
    ▪   Better platform compatibility:
        ▪   Will work with to-be-released Apache Hadoop 0.21
        ▪   Will work with Cloudera’s Distribution for Hadoop 3
▪   June 2010: Planned 1.0.0 release (in time for Hadoop Summit)


▪   Plenty of bugs to fix and features to add – see me if you want to
    help!
Conclusions
▪   Most database import/export tasks are “turning the crank”
▪   Sqoop can automate a lot of this
    ▪   Allows more efficient use of existing data sources in concert with
        new, complex data
▪   Available as part of Cloudera’s Distribution for Hadoop


                     The pitch: www.cloudera.com/sqoop
                    The code: github.com/cloudera/sqoop


                            aaron@cloudera.com
(c) 2008 Cloudera, Inc. or its licensors. "Cloudera" is a registered trademark of Cloudera, Inc. All rights reserved. 1.0
