Vineet Kumar
Data Migration with Spark
#UnifiedAnalytics #SparkAISummit
Why Data Migration?
• Business requirement - Migrate historical data into the data lake for analysis. This may
require some light transformation.
• Data Science - This data can provide better business insight. More data -> better
models -> better predictions.
• Enterprise Data Lake - There may be thousands of data files sitting in multiple
source systems.
• EDW - Archive data from the EDW into the lake.
• Standards - Store data with some standards, for example: partition strategy and
storage formats such as Parquet, Avro etc.
Data Lake
1000s of files
- Text Files
- XML
- JSON
RDBMS
- EDW
- Data Archive
Migration Issues
• Schema for text files - header missing, or first line used as the header.
• Data in files is neither consistent nor in the target standard format.
– Example – timestamp: the Hive standard format is yyyy-MM-dd HH:mm:ss
• Over time the file structure(s) might have changed: some new columns added or removed.
• There could be thousands of files with different structures - separate ETL mapping/code for each file?
• Target table can be partitioned or non-partitioned.
• Source data size can range from a few megabytes to terabytes, and from a few columns to thousands of columns.
• The partition column can be one of the existing columns, or a custom column based on a value passed as an
argument.
– partitionBy=partitionField, partitionValue=<<Value>>
– Example: partitionBy='ingestion_date' partitionValue='20190401'
Data Migration Approach
Create Spark/Hive context:
val sparksession = SparkSession
  .builder()
  .enableHiveSupport()
  .getOrCreate()
Text Files (mydata.csv):
val df = sparksession.read.format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("mydata.csv")
XML Files:
val df = sparksession.read.format("com.databricks.spark.xml")
  .load("mydata.xml")
RDBMS: JDBC call
<<transformation logic>>
Write to Hive:
df.write.mode("APPEND")
  .insertInto(HivetableName)
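Putting those fragments together in order, a minimal end-to-end sketch; the Hive table name below is a placeholder and the transformation step is elided, as on the slide:

import org.apache.spark.sql.SparkSession

val sparksession = SparkSession
  .builder()
  .enableHiveSupport()
  .getOrCreate()

val df = sparksession.read
  .format("com.databricks.spark.csv")
  .option("delimiter", ",")
  .load("mydata.csv")

// <<transformation logic>>

df.write.mode("APPEND").insertInto("mydb.mytable")   // placeholder table name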
Wait… Schema?
Text Files – how to get schema?
• File header is not present in the file.
• Spark can infer the schema, but what about column names?
• Data spans the last couple of years and the file format has evolved.
scala> df.columns
res01: Array[String] = Array(_c0, _c1, _c2)
XML, JSON or RDBMS sources
• Schema present in the source.
• Spark can automatically infer schema from the source.
scala> df.columns
res02: Array[String] = Array(id, name, address)
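For an RDBMS source, a minimal sketch of a JDBC read; the URL, table and credentials below are placeholders, not values from the talk:

val jdbcDF = sparksession.read
  .format("jdbc")
  .option("url", "jdbc:oracle:thin:@//dbhost:1521/service")   // placeholder connection string
  .option("dbtable", "edw.customer")                          // placeholder source table
  .option("user", "etl_user")
  .option("password", sys.env.getOrElse("DB_PASSWORD", ""))
  .load()

jdbcDF.printSchema()   // column names and types are inferred from the source table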
Schema for Text files
• Option 1: File header exists in the first line
• Option 2: File header from an external file – JSON
• Option 3: Create an empty table corresponding to the CSV file structure
• Option 4: Define the schema with StructType or a case class (see the sketch below)
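A minimal sketch of Option 4, assuming the same four columns as the file1.csv example that follows; the raw dob is kept as a string here and converted to a timestamp later:

import org.apache.spark.sql.types.{StructType, StructField, StringType}

val fileSchema = StructType(Seq(
  StructField("id", StringType),
  StructField("name", StringType),
  StructField("address", StringType),
  StructField("dob", StringType)      // raw text; converted to the Hive timestamp format later
))

val df = sparksession.read
  .option("delimiter", "|")
  .schema(fileSchema)
  .csv("mydata.csv")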
Schema for CSV files: Option 3
• Create an empty table corresponding to the text file structure
– Example – text file: file1.csv
1|John|100 street1,NY|10/20/1974
– Create a Hive structure corresponding to the file: file1_raw
create table file1_raw(id string, name string, address string, dob timestamp)
– Map the Dataframe columns to the structure from the previous step
val hiveSchema = sparksession.sql("select * from file1_raw where 1=0")
val dfColumns = hiveSchema.schema.fieldNames.toList
// You should check that the column count is the same before mapping it (see the sketch below)
val textFileWithSchemaDF = df.toDF(dfColumns: _*)
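A minimal sketch of that column-count check; failing fast on a mismatch is an assumption here, the talk does not say how a mismatch should be handled:

require(
  df.columns.length == dfColumns.length,
  s"Column count mismatch: file has ${df.columns.length} columns, file1_raw has ${dfColumns.length}")

val textFileWithSchemaDF = df.toDF(dfColumns: _*)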
Before - Column Names:
scala> df.show()
+---+-----+--------------+----------+
|_c0|  _c1|           _c2|       _c3|
+---+-----+--------------+----------+
| 1| John|100 street1,NY|10/20/1975|
| 2|Chris|Main Street,KY|10/20/1975|
| 3|Marry|park Avenue,TN|10/20/1975|
+---+-----+--------------+----------+
After:
scala> df2.show()
+---+-----+--------------+----------+
| id| name|       address|       dob|
+---+-----+--------------+----------+
| 1| John|100 street1,NY|10/20/1975|
| 2|Chris|Main Street,KY|10/20/1975|
| 3|Marry|park Avenue,TN|10/20/1975|
+---+-----+--------------+----------+
Dates/timestamps?
• Historical files - the date format can change over time.
Files from year 2004:
1|John|100 street1,NY|10/20/1974
2|Chris|Main Street,KY|10/01/1975
3|Marry|park Avenue,TN|11/10/1972
…
Files from year 2018 onwards (date format changed, new columns added):
1|John|100 street1,NY|1975-10-02
2|Chris|Main Street,KY|2010-11-20|Louisville
3|Marry|park Avenue,TN|2018-04-01 10:20:01.001
Files have different formats – the same file can use a different date format over time:
File1 – dd/mm/yyyy
File2 – mm/dd/yyyy
File3 – yyyy/mm/dd:hh24:mi:ss
….
Timestamp columns
• Target Hive Table:
id int,
name string,
dob timestamp,
address string,
move_in_date timestamp,
rent_due_date timestamp
PARTITION COLUMNS…
Timestamp columns can be at any location in the target table – find these first
(a sketch of building hiveSchemaArray follows the loop):
for (i <- 0 to (hiveSchemaArray.length - 1)) {
  hiveSchemaArray(i).toString.contains("Timestamp")
}
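One way hiveSchemaArray could be built and filtered (an assumption; the talk does not show how it is obtained): take the schema of the empty select over the target table and keep the timestamp fields.

import org.apache.spark.sql.types.TimestampType

val hiveSchemaArray = sparksession.sql("select * from hivetable where 1=0").schema.fields

// Names of the columns that will need timestamp conversion
val timestampCols = hiveSchemaArray
  .filter(_.dataType == TimestampType)
  .map(_.name)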
Transformation for Timestamp data
• Hive timestamp format - yyyy-MM-dd HH:mm:ss
• Create a UDF to return a valid date format. The input can be in any of the expected formats.
val getHiveDateFormatUDF = udf(getValidDateFormat)
10/12/2019 -> getHiveDateFormatUDF -> 2019-10-12 00:00:00
UDF Logic for Date transformation
import java.time.LocalDateTime
import java.time.format.{DateTimeFormatter, DateTimeParseException}

// The format list can be built from a file passed in as an argument
val inputFormats = Array("MM/dd/yyyy", "yyyyMMdd", ...)
val validHiveFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")

// The input date is a parameter to the function
val inputDate = "10/12/2019"

for (format <- inputFormats) {
  try {
    val dateFormat = DateTimeFormatter.ofPattern(format)
    val newDate = LocalDateTime.parse(inputDate, dateFormat)
    newDate.format(validHiveFormat)
  } catch {
    case e: DateTimeParseException => null
  }
}
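A sketch of how that loop could be packaged as the getValidDateFormat function wrapped by the UDF; the date-only fallback via LocalDate.atStartOfDay and the concrete format list are assumptions:

import java.time.{LocalDate, LocalDateTime}
import java.time.format.{DateTimeFormatter, DateTimeParseException}
import org.apache.spark.sql.functions.udf

val getValidDateFormat: String => String = (inputDate: String) => {
  val inputFormats = Array("MM/dd/yyyy", "yyyyMMdd", "yyyy-MM-dd HH:mm:ss")   // illustrative list
  val validHiveFormat = DateTimeFormatter.ofPattern("yyyy-MM-dd HH:mm:ss")
  inputFormats.iterator.map { fmt =>
    val formatter = DateTimeFormatter.ofPattern(fmt)
    try Some(LocalDateTime.parse(inputDate, formatter).format(validHiveFormat))
    catch {
      case _: DateTimeParseException =>
        // date-only inputs such as 10/12/2019 have no time part, so fall back to LocalDate
        try Some(LocalDate.parse(inputDate, formatter).atStartOfDay().format(validHiveFormat))
        catch { case _: DateTimeParseException => None }
    }
  }.collectFirst { case Some(formatted) => formatted }
   .orNull   // null ends up as NULL in the Hive column when no format matches
}

val getHiveDateFormatUDF = udf(getValidDateFormat)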
Date transformation – Hive format
• Find the columns with timestamp data type and apply the UDF to those columns
var newdf = df
for (i <- 0 to (hiveSchemaArray.length - 1)) {
  if (hiveSchemaArray(i).toString.contains("Timestamp")) {
    val field = hiveSchemaArray(i).toString.replace("StructField(", "").split(",")(0)
    val tempfield = field + "_tmp"
    newdf = newdf.withColumn(tempfield, getHiveDateFormatUDF(col(field)))
      .drop(field).withColumnRenamed(tempfield, field)
  }
}
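As a possible simplification (not from the talk): since withColumn with an existing column name replaces that column in place, the temp-column rename and the mutable variable can be avoided with a fold over the timestamp fields:

import org.apache.spark.sql.functions.col

val newdf = hiveSchemaArray
  .filter(_.toString.contains("Timestamp"))
  .map(_.toString.replace("StructField(", "").split(",")(0))
  .foldLeft(df) { (acc, field) =>
    acc.withColumn(field, getHiveDateFormatUDF(col(field)))
  }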
Before applying UDF:
scala> df2.show()
+---+-----+--------------+----------+
| id| name|       address|       dob|
+---+-----+--------------+----------+
| 1| John|100 street1,NY|10/20/1975|
| 2|Chris|Main Street,KY|10/20/1975|
| 3|Marry|park Avenue,TN|10/20/1975|
+---+-----+--------------+----------+
After applying the date UDF:
scala> newDF.show()
+---+-----+--------------+-------------------+
| id| name|       address|                dob|
+---+-----+--------------+-------------------+
|  1| John|100 street1,NY|1975-10-20 00:00:00|
|  2|Chris|Main Street,KY|1975-10-20 00:00:00|
|  3|Marry|park Avenue,TN|1975-10-20 00:00:00|
+---+-----+--------------+-------------------+
Column/Data Element Position
• Spark Dataframe(df) format from text file: name at position 2, address at position 3
id| name| address| dob
--------------------------------
1|John|100 street1,NY|10/20/1974
df.write.mode("APPEND").insertInto(HivetableName)
Or
val query = "INSERT OVERWRITE TABLE hivetable SELECT * FROM textFileDF"
sparksession.sql(query)
• Hive table format: address at position 2
• name gets switched to address, and address to name
id| address| name| dob
----------------------------------------------------------
1|John|100 street1,NY|Null
File format:            Hive structure:
Id: int                 Id: int
Name: string            Address: string
Address: string         Name: string
Dob: string             Dob: timestamp
Column/Data Element Position
• Relational world:
INSERT INTO hivetable(id,name,address)
SELECT id,name,address FROM <<textFileDataFrame>>
• Spark: read the target Hive table structure and column positions.
// Get the Hive columns with position from the target table
val hiveTableColumns = sparksession.sql("select * from hivetable where 1=0").columns
val columns = hiveTableColumns.map(x => col(x))
// Select the columns from the source data frame and insert into the target table
dfnew.select(columns:_*).write.mode("APPEND").insertInto(tableName)
Partitioned Table.
• Target Hive tables can be partitioned or non-partitioned.
newDF.select(columns:_*).write.mode("APPEND").insertInto(tableName)
• Tables may be partitioned daily, monthly or hourly.
• Tables may be partitioned on different columns.
Example:
Hivetable1 – frequency daily – partition column – load_date
Hivetable2 – frequency monthly – partition column – load_month
• Add the partition column as the last column before inserting into Hive:
val newDF2 = newDF.withColumn("load_date", lit(current_timestamp()))
Partitioned Table.
• Partition column is one of the existing fields from the file:
val hiveTableColumns = sparksession.sql("select * from hivetable where 1=0").columns
val columns = hiveTableColumns.map(x => col(x))
newDF.select(columns:_*).write.mode("APPEND").insertInto(tableName)
• Partition is based on a custom field and a value passed as arguments from the command line (see the parsing sketch after this slide).
spark2-submit arguments: partitionBy="field1", partitionValue="1100"
• Add the partition column as the last column in the Dataframe:
val newDF2 = newDF.withColumn(partitionBy.trim().toLowerCase(), lit(partitionValue))
• Final step before inserting into the Hive table:
newDF2.select(columns:_*).write.mode("APPEND").insertInto(tableName)
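A hedged sketch of how the partitionBy / partitionValue command-line arguments might be parsed inside the application's main(args) method; the key=value argument format and the defaults are assumptions:

import org.apache.spark.sql.functions.lit

// args: Array[String] from main, e.g. Array("partitionBy=field1", "partitionValue=1100")
val argMap: Map[String, String] = args
  .map(_.split("=", 2))
  .collect { case Array(k, v) => k.trim -> v.trim.stripPrefix("\"").stripSuffix("\"") }
  .toMap

val partitionBy    = argMap.getOrElse("partitionBy", "ingestion_date")   // assumed default
val partitionValue = argMap.getOrElse("partitionValue", "")

val newDF2 =
  if (partitionValue.nonEmpty) newDF.withColumn(partitionBy.trim.toLowerCase, lit(partitionValue))
  else newDF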
Performance: Cluster Resources
• Migration runs in its own YARN pool (a sizing sketch follows this list).
• Large (file size > 10 GB, or 1000+ columns)
# repartition size = min(fileSize(MB)/256, 50)
# executors
# executor-cores
# executor-size
• Medium (file size 1 – 10 GB, or < 1000 columns)
# repartition size = min(fileSize(MB)/256, 20)
# executors
# executor-cores
# executor-size
• Small
# executors
# executor-cores
# executor-size
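A minimal sketch of the sizing rule above: repartition to min(fileSize(MB)/256, cap), where the cap depends on the tier; the concrete file size and the 10 GB threshold below are illustrative, not prescribed by the talk:

def repartitionSize(fileSizeMB: Long, cap: Int): Int =
  math.max(1, math.min((fileSizeMB / 256).toInt, cap))

val fileSizeMB: Long = 4096   // example: a 4 GB file; in practice taken from the file's metadata

val numPartitions =
  if (fileSizeMB > 10 * 1024L) repartitionSize(fileSizeMB, 50)   // large tier
  else repartitionSize(fileSizeMB, 20)                           // medium tier

val repartitionedDF = textFileWithSchemaDF.repartition(numPartitions)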
Data Pipeline
Read datafile -> Dataframe
Apply schema on the Dataframe from the Hive table corresponding to the text file
Perform transformations – timestamp conversion etc.
Add partitioned column to the Dataframe
Write to Hive table (Parquet table)
DON’T FORGET TO RATE
AND REVIEW THE SESSIONS
SEARCH SPARK + AI SUMMIT