Bridging the gap of Relational to Hadoop using Sqoop @ Expedia
(Enhancing Sqoop for Synchronization)
Shashank Tandon, Expedia
Kopal Niranjan, Expedia
Agenda
• Problem statement
• Why Sqoop
• Expedia enhancements for Sqoop
• New Tool: Hive Merge
• Data Synchronization
• Demo
| Expedia Inc. Proprietary & Confidential1
| Expedia Inc. Proprietary & Confidential2
Data Synchronization
Problem Statement
• Import large volumes of data from RDBMS tables into Hive tables.
• Support multiple Hive partitions while importing.
• Handle regular updates happening on the RDBMS:
– Merge the new/updated data into the Hive tables.
– Merge the data in parallel.
| Expedia Inc. Proprietary & Confidential3
Community Solution - Sqoop
• Sqoop is an open source tool designed to efficiently transfer bulk data between Hadoop and structured data stores such as relational databases.
• Supports various relational databases such as Teradata, SQL Server, Oracle, MySQL, DB2, etc.
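For orientation, a minimal Sqoop import from MySQL to HDFS looks like this (a sketch only; the JDBC URL, credentials, table, and target directory are placeholders):

sqoop import \
  --connect jdbc:mysql://dbhost/sales \
  --username etl_user -P \
  --table ORDERS \
  --target-dir /data/raw/orders \
  -m 4    # number of parallel map tasks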
| Expedia Inc. Proprietary & Confidential4
Enhanced Sqoop Features
• Enhanced Sqoop features for community business needs:
 - Hive Merge
   - Merges the incremental data migrated to HDFS into your existing Hive tables.
   - Supports merges based on composite keys.
   - Merges older partitions as well as adds new partitions.
| Expedia Inc. Proprietary & Confidential5
Enhanced Sqoop Features
- Hive Dynamic Partition
- Hive Dynamic Partition with Partition Format
- Hive External Table
- Compression like Snappy
| Expedia Inc. Proprietary & Confidential6
HCatalog for Hive
- HCatalog is a Java wrapper on top of the Hive metastore.
- Sqoop supports all the latest Hive features through HCatalog.
| Expedia Inc. Proprietary & Confidential7
External tables with HCatalog
| Expedia Inc. Proprietary & Confidential8
Sqoop Import to Hive Managed Table
| Expedia Inc. Proprietary & Confidential9
• Sqoop connects to the MySQL database test.
• Imports the table MYTABLE into a Hive managed table test_part1.
• The Hive managed table is located under /apps/hive/warehouse.
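The original command was shown as a screenshot; a hedged reconstruction using Sqoop's HCatalog integration (ORC storage and a single mapper are assumptions):

sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table MYTABLE \
  --hcatalog-table test_part1 \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  -m 1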
| Expedia Inc. Proprietary & Confidential10
New Enhancement : Import to Hive External Table
| Expedia Inc. Proprietary & Confidential11
• The enhanced import command creates a Hive table in the user-managed directory /user/root/test_part2 (see the sketch below).
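The exact Expedia syntax is not shown in the deck; the two flags marked below are hypothetical names illustrating the SQOOP-2335 enhancement:

# --hcatalog-external-table and --hcatalog-table-location are hypothetical
# flag names; the deck shows the real command only as a screenshot.
sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table MYTABLE \
  --hcatalog-table test_part2 \
  --create-hcatalog-table \
  --hcatalog-external-table \
  --hcatalog-table-location /user/root/test_part2 \
  -m 1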
| Expedia Inc. Proprietary & Confidential12
Dynamic Partitioning with HCatalog
| Expedia Inc. Proprietary & Confidential13
Sqoop Import to Hive Static Partition
• Only one static partition can be passed as a Sqoop argument.
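A hedged example using the standard partition-key options (the partition column and value are placeholders):

sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table MYTABLE \
  --hcatalog-table test_part1 \
  --hive-partition-key country \
  --hive-partition-value 'US' \
  -m 1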
| Expedia Inc. Proprietary & Confidential14
Sqoop Import to Hive Static Partition
• Check Hive Partition
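For example (table name as assumed earlier):

hive -e "SHOW PARTITIONS test_part1;"
# expected: one line per partition, e.g. country=US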
| Expedia Inc. Proprietary & Confidential15
Sqoop Import to Hive Static Partition on Date column
• Only one static partition can be passed as a Sqoop argument, with the date value specified manually.
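The hard-coded date value is what makes this painful at scale (column name and value are placeholders):

sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table MYTABLE \
  --hcatalog-table test_part1 \
  --hive-partition-key trans_date \
  --hive-partition-value '2016-06-30' \
  -m 1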
| Expedia Inc. Proprietary & Confidential16
Questions
| Expedia Inc. Proprietary & Confidential17
How do we import data when there are more than 200 partitions?
Should these jobs be run manually again and again?
How do we import data when the date granularity is month, day, or year?
Is there a way to pass the format?
New Enhancement : Import to Hive Dynamic Partition
• A new Sqoop argument, --hcatalog-dynamic-partition-keys, is introduced.
• It works alongside the existing static partition key.
• If both are passed, the static partition key takes precedence.
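A hedged example of the enhanced import (the flag is the one named on this slide; the partition column is a placeholder and the exact syntax may differ from Expedia's build):

sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table MYTABLE \
  --hcatalog-table test_part1 \
  --hcatalog-dynamic-partition-keys trans_date \
  -m 1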
| Expedia Inc. Proprietary & Confidential18
| Expedia Inc. Proprietary & Confidential19
New Enhancement : Import to Hive Dynamic Partition with Date Format
• A new argument, --hcatalog-dynamic-partition-key-format, is passed together with --hcatalog-dynamic-partition-keys.
• Check the Hive partitions after the Sqoop import.
• The partitions are created in the user-specified format.
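A hedged example (a SimpleDateFormat-style pattern for the format value is an assumption, as are the column and table names):

sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table MYTABLE \
  --hcatalog-table test_part1 \
  --hcatalog-dynamic-partition-keys trans_date \
  --hcatalog-dynamic-partition-key-format yyyy-MM \
  -m 1

hive -e "SHOW PARTITIONS test_part1;"
# expected: month-level partitions, e.g. trans_date=2016-06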
| Expedia Inc. Proprietary & Confidential20
| Expedia Inc. Proprietary & Confidential21
Password encrypted in Sqoop Metastore
• The password is now saved in the Sqoop metastore in encrypted form.
• The logic is the same as Sqoop's file encryption, where a generic passkey and algorithm are passed on the command line.
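For reference, this is roughly what the existing file-based password encryption looks like in stock Sqoop (the CryptoFileLoader properties below come from the Sqoop user guide; the metastore variant's exact property names are not shown in the deck):

sqoop import \
  -Dorg.apache.sqoop.credentials.loader.class=org.apache.sqoop.util.password.CryptoFileLoader \
  -Dorg.apache.sqoop.credentials.loader.crypto.passphrase=sqoop2 \
  -Dorg.apache.sqoop.credentials.loader.crypto.alg=AES/ECB/PKCS5Padding \
  --connect jdbc:mysql://localhost/test \
  --username root \
  --password-file /user/root/password.enc \
  --table MYTABLE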
| Expedia Inc. Proprietary & Confidential22
Issues with Sqoop Merge Tool
• Designed to merge two directories on HDFS; it would need modification to support merging Hive tables.
• The output directory must be specified while performing the merge.
• Supports merging on a single key column only.
• Merging many partitions requires a separate, sequential Sqoop job per partition.
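For context, a typical invocation of the stock merge tool (paths, the record class, and the key are placeholders):

sqoop merge \
  --new-data /data/orders_incr \
  --onto /data/orders_base \
  --target-dir /data/orders_merged \
  --jar-file ORDERS.jar \
  --class-name ORDERS \
  --merge-key id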
| Expedia Inc. Proprietary & Confidential23
Merge Incremental Data Using Sqoop and Hive External Tables
• Import records from the base table into an HDFS directory.
• Import updates using incremental imports into another HDFS directory.
• Create a Hive external table over each directory.
• Create a view that combines record sets from both the Base (base_table) and Change (incremental_table) tables.
| Expedia Inc. Proprietary & Confidential24
Merge Incremental Data Using Sqoop and Hive External Tables
• The view now contains the most up-to-date set of records.
• Generate a table from the view created in the step above.
• Replace the base table with the entries from the generated table (a sketch follows).
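A minimal HiveQL sketch of the view-based reconciliation, assuming both tables share a primary key id and a modification timestamp last_update (all names are placeholders):

hive -e "
-- latest version of every id across base and incremental data
CREATE VIEW reconcile_view AS
SELECT t1.*
FROM (SELECT * FROM base_table
      UNION ALL
      SELECT * FROM incremental_table) t1
JOIN (SELECT id, MAX(last_update) AS max_modified
      FROM (SELECT * FROM base_table
            UNION ALL
            SELECT * FROM incremental_table) t2
      GROUP BY id) s
  ON t1.id = s.id AND t1.last_update = s.max_modified;

-- materialize the view, then swap it in as the new base
CREATE TABLE reporting_table AS SELECT * FROM reconcile_view;
INSERT OVERWRITE TABLE base_table SELECT * FROM reporting_table;
"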
| Expedia Inc. Proprietary & Confidential25
New Tool: Hive Merge
• Import the original base table into Hive
| Expedia Inc. Proprietary & Confidential26
New Tool : Hive Merge
• Import the incremental data into Hive
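One way to land only the changed rows into a staging Hive table (the deck also lists Sqoop job support for incremental imports; the filter column, cutoff, and table names are placeholders):

sqoop import \
  --connect jdbc:mysql://localhost/test \
  --username root -P \
  --table MYTABLE \
  --where "last_update >= '2016-06-01'" \
  --hcatalog-table mytable_incr \
  --create-hcatalog-table \
  --hcatalog-storage-stanza "stored as orcfile" \
  -m 1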
| Expedia Inc. Proprietary & Confidential27
New Tool : Hive Merge
• Finally, merge the data using the hive-merge tool (a hedged sketch follows).
| Expedia Inc. Proprietary & Confidential28
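The deck shows the real invocation only as a screenshot; every argument below is a hypothetical illustration of the hive-merge tool described on these slides (composite keys per slide 6):

# all flags are hypothetical; the actual Expedia syntax is not shown in the deck
sqoop hive-merge \
  --hive-table mytable \
  --incremental-table mytable_incr \
  --merge-keys id,country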
Acquiring locks during Hive Merge
• To allow only a single Hive merge to run on the same table at a time, the tool acquires a lock at the start and releases it once the merge finishes.
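The deck does not show the locking mechanism itself; purely as an illustration of the idea, Hive's own explicit locking (which requires hive.support.concurrency=true and a ZooKeeper-backed lock manager) behaves like this:

hive -e "
LOCK TABLE mytable EXCLUSIVE;
-- perform the merge while no other writer can acquire the lock
UNLOCK TABLE mytable;
"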
| Expedia Inc. Proprietary & Confidential29
Performance metrics : Hive Merge tool
| Expedia Inc. Proprietary & Confidential30
Other Key Enhancements
• Save encrypted password in Sqoop Metastore
• Teradata varchar/char support
• Teradata current timestamp support
• Sqoop Job runs for Incremental Import
• Snappy compression support in HCatalog
| Expedia Inc. Proprietary & Confidential31
Apache Sqoop JIRAs
These are a few of the JIRAs for which we have provided patches:
• SQOOP-2332: Dynamic Partition in Sqoop HCatalog if Hive table does not exist, and add support for Partition Date Format
• SQOOP-2335: Support for Hive External Table in Sqoop HCatalog
| Expedia Inc. Proprietary & Confidential32
• SQOOP-2585: Merging Hive tables using Sqoop
• SQOOP-2596: Precision of varchar/char column cannot be retrieved from Teradata database during Sqoop import
• SQOOP-2801: Secure RDBMS password in Sqoop metastore in an encrypted form
• SQOOP-2331: Snappy Compression Support in Sqoop HCatalog
| Expedia Inc. Proprietary & Confidential33
Demo
Questions
| Expedia Inc. Proprietary & Confidential35
Hive Merge Internal Architecture
| Expedia Inc. Proprietary & Confidential36
Step 1: Identify partitions to update. Skip this step for non-partitioned tables.
Hive Merge Internal Architecture
| Expedia Inc. Proprietary & Confidential37
Step 2: Merge the new partitions with the old partitions (only for partitioned tables).
Hive Merge Internal Architecture
| Expedia Inc. Proprietary & Confidential38
Step 3: Delete older versions.
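A rough HiveQL sketch of what the three steps amount to for a table partitioned by trans_date with composite key (id, country) and a last_update timestamp (all column, table, and setting names are assumptions; the tool generates and parallelizes this work internally):

hive -e "
SET hive.exec.dynamic.partition=true;
SET hive.exec.dynamic.partition.mode=nonstrict;

-- Steps 1-3: rebuild only the partitions touched by the incremental data,
-- keeping the latest version of each composite key; INSERT OVERWRITE on the
-- dynamic partitions replaces (i.e. deletes) the older row versions.
INSERT OVERWRITE TABLE mytable PARTITION (trans_date)
SELECT id, country, amount, last_update, trans_date
FROM (
  SELECT *,
         ROW_NUMBER() OVER (PARTITION BY id, country
                            ORDER BY last_update DESC) AS rn
  FROM (
    SELECT * FROM mytable
    WHERE trans_date IN (SELECT DISTINCT trans_date FROM mytable_incr)
    UNION ALL
    SELECT * FROM mytable_incr
  ) all_rows
) ranked
WHERE rn = 1;
"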