SlideShare una empresa de Scribd logo
1 de 13
Dealing With Changed Data on
Hadoop
An old data warehouse problem in a new world
Kunal Jain, Big Data Solutions Architect at Informatica
June, 2014
Agenda
• Challenges with Traditional Data Warehouse
• Requirements for Data Warehouse Optimization
• Data Warehouse Optimization Process Flow
• Dealing With Changed Data on Hadoop
• Demo
Challenges With Traditional Data Warehousing
• Expensive to scale as data volumes grow and new data types emerge
• Staging of raw data and ELT consuming capacity of data warehouse
too quickly forcing costly upgrades
• Network becoming a bottleneck to performance
• Does not handle new types of multi-structured data
• Changes to schemas cause delays in project delivery
3
Requirements for an Optimized Data Warehouse
• Cost-effective scale out infrastructure to support unlimited data
volumes
• Leverage commodity hardware and software to lower infrastructure
costs
• Leverage existing skills to lower operational costs
• Must support all types of data
• Must support agile methodologies with schema-on-read, rapid
prototyping, metadata-driven visual IDE’s, and collaboration tools
• Integrates with existing and new types of infrastructure
4
Data Warehouse Optimization Process Flow
BI Reports & AppsData Warehouse
1. Offload data & ELT
processing to Hadoop
3. Parse & prepare
(e.g. ETL, data quality)
data for analysis
4. Move high value
curated data into data
warehouse
2. Batch load raw
data (e.g. transactions,
multi-structured)
Relational, Mainframe
Documents and Emails
Social Media, Web Logs
Machine Device, Cloud
Use Case: Updates in Traditional DW/RDBMS
• Example Requirement: Historical table containing 10 Billion rows
of data
• Every day gets incremental data of 10 million rows (70% new inserts,
30% updates)
• Traditional approach: Straightforward to insert and update in a
traditional DW/RDBMS
• Challenge: Traditional infrastructure cannot scale to the data size
and is not cost-effective.
Use Case: Update/Insert in Hadoop/Hive
• Requirement: Use Hive to store massive amounts of data, but need
to perform inserts, deletes and updates.
• Typical approach: Since Hive does not support updates, the
workaround used is to perform a FULL OUTER JOIN and a FULL
TABLE REFRESH to update impacted rows
• Challenge: Table refresh / full outer join on historical tables (10B+
rows) would blow SLAs out of the water
Use Case: Update/Insert in Hadoop/Hive
TXID Description Transaction Date Amount Last Modified Date
1 Xxxx 20-JAN-13 200 20-JAN-13
2 Yyy 21-FEB-13 300 21-FEB-13
3 Aaa 22-MAR-13 400 22-MAR-13
TXID Description Transaction Date Amount Last Modified Date
1 Xxxx 20-JAN-13 210 23-MAR-13
2 Yyy 21-FEB-13 300 21-FEB-13
3 Aaa 22-MAR-13 400 22-MAR-13
4 Ccc 23-MAR-13 150 23-MAR-13
6 Bbb 23-MAR-13 500 23-MAR-13
Target Table (10 billion + 7 million rows)
Target Table (10 billion rows)
TXID Description Transaction Date Amount Last Modified Date
1 Xxxx 20-JAN-13 210 23-MAR-13
4 Ccc 23-MAR-13 150 23-MAR-13
6 Bbb 23-MAR-13 500 23-MAR-13
Staging Table (10 million rows) with 70% Inserts and 30% Updates
UPDATE
INSERT
Partitioning rows by date significantly reduces
total # of partitions impacted by updates
Use Case: Update/Insert in Hadoop/Hive
Relational
Data Source
Inserts (70%)
Updates(30%)
Staging
Target
Target
Rows: ~10M
Rows: ~10B
Rows: ~10B
Inserts
Updates
Temporary
Rows: ~13M
1. Extract & Load
2b. Bring
unchanged data
from impacted
partitions
2a. Bring new
data and the
updated data 3. Delete matching
partitions from
Target
4. Load all data
from Temporary
into TargetImpacted
Partitions
Rows: ~10B+7M
Impacted
Partition
DEMO
Optimize the Entire Data Pipeline
Increase Performance & Productivity on Hadoop
Archive
Profile Parse CleanseETL Match
Stream
Load Load
Services
Events
Replicate
Topics
Machine Device,
Cloud
Documents and
Emails
Relational, Mainframe
Social Media, Web
Logs
Data Warehouse
Mobile Apps
Analytics & Op
Dashboards
Alerts
Analytics Teams
Informatica on Hadoop Benefits
• Cost-effectively scale storage and processing (over 2x the
performance)
• Increase developer productivity (up to 5x over hand-coding)
• Continue to leverage existing ETL skills you have today
• Informatica Hive partitioning/UPSERT is a key capability for rapid
implementation of CDC use-case
• Ensure success with proven leader in big data and data warehouse
optimization
15

Más contenido relacionado

La actualidad más candente

Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoophadooparchbook
 
HBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonCaserta
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsDataWorks Summit/Hadoop Summit
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureDataWorks Summit
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezJan Pieter Posthuma
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production SuccessAllen Day, PhD
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Sudhir Mallem
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyDataWorks Summit
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesDataWorks Summit
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataOfir Manor
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterCloudera, Inc.
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobileDataWorks Summit
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBaseHortonworks
 

La actualidad más candente (20)

Tame that Beast
Tame that BeastTame that Beast
Tame that Beast
 
HDFS tiered storage
HDFS tiered storageHDFS tiered storage
HDFS tiered storage
 
Data warehousing with Hadoop
Data warehousing with HadoopData warehousing with Hadoop
Data warehousing with Hadoop
 
HBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBaseHBaseCon 2015 General Session: State of HBase
HBaseCon 2015 General Session: State of HBase
 
Big Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive ComparisonBig Data Warehousing: Pig vs. Hive Comparison
Big Data Warehousing: Pig vs. Hive Comparison
 
Big Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the ExpertsBig Data in the Cloud - The What, Why and How from the Experts
Big Data in the Cloud - The What, Why and How from the Experts
 
To The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid AnalyticsTo The Cloud and Back: A Look At Hybrid Analytics
To The Cloud and Back: A Look At Hybrid Analytics
 
Hadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and FutureHadoop Infrastructure @Uber Past, Present and Future
Hadoop Infrastructure @Uber Past, Present and Future
 
Big Data Platform Industrialization
Big Data Platform Industrialization Big Data Platform Industrialization
Big Data Platform Industrialization
 
Hadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to TezHadoop from Hive with Stinger to Tez
Hadoop from Hive with Stinger to Tez
 
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
20140228 - Singapore - BDAS - Ensuring Hadoop Production Success
 
Splice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakesSplice machine-bloor-webinar-data-lakes
Splice machine-bloor-webinar-data-lakes
 
Curb your insecurity with HDP
Curb your insecurity with HDPCurb your insecurity with HDP
Curb your insecurity with HDP
 
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
Interactive SQL POC on Hadoop (Hive, Presto and Hive-on-Tez)
 
Exploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthyExploiting machine learning to keep Hadoop clusters healthy
Exploiting machine learning to keep Hadoop clusters healthy
 
Hive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenchesHive at Yahoo: Letters from the trenches
Hive at Yahoo: Letters from the trenches
 
Interactive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroDataInteractive SQL-on-Hadoop and JethroData
Interactive SQL-on-Hadoop and JethroData
 
HBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart MeterHBaseCon 2013: Being Smarter Than the Smart Meter
HBaseCon 2013: Being Smarter Than the Smart Meter
 
Practice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China MobilePractice of large Hadoop cluster in China Mobile
Practice of large Hadoop cluster in China Mobile
 
Integration of HIve and HBase
Integration of HIve and HBaseIntegration of HIve and HBase
Integration of HIve and HBase
 

Destacado

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveDataWorks Summit
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingDataWorks Summit
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Hortonworks
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with HadoopPhilippe Julio
 
How to Use HIT for CCM
How to Use HIT for CCMHow to Use HIT for CCM
How to Use HIT for CCMPhytel
 
Change Data Capture
Change Data CaptureChange Data Capture
Change Data Capturearnoud_otte
 
Streaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data StreamStreaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data StreamInformaticaMarketplace
 
Oracle GoldenGate for Big Data
Oracle GoldenGate for Big DataOracle GoldenGate for Big Data
Oracle GoldenGate for Big DataCentrifuge LLC
 
5 Keys to Oracle GoldenGate Implemenations
5 Keys to Oracle GoldenGate Implemenations5 Keys to Oracle GoldenGate Implemenations
5 Keys to Oracle GoldenGate ImplemenationsBobby Curtis
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Bobby Curtis
 
Data warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaData warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaJyrki Määttä
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Samsung Business USA
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverDataWorks Summit
 
Go Faster - Remove Inhibitors to Rapid Innovation
Go Faster - Remove Inhibitors to Rapid InnovationGo Faster - Remove Inhibitors to Rapid Innovation
Go Faster - Remove Inhibitors to Rapid InnovationFred George
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingMichael Rainey
 
Oracle streams-step-by-step-ppt
Oracle streams-step-by-step-pptOracle streams-step-by-step-ppt
Oracle streams-step-by-step-pptabderrazak jouini
 
Replacing Oracle CDC with Oracle GoldenGate
Replacing Oracle CDC with Oracle GoldenGateReplacing Oracle CDC with Oracle GoldenGate
Replacing Oracle CDC with Oracle GoldenGateStewart Bryson
 
Change data capture
Change data captureChange data capture
Change data captureJames Deppen
 

Destacado (20)

Hive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep DiveHive + Tez: A Performance Deep Dive
Hive + Tez: A Performance Deep Dive
 
Apache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data ProcessingApache Tez - A New Chapter in Hadoop Data Processing
Apache Tez - A New Chapter in Hadoop Data Processing
 
Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014Teradata - Presentation at Hortonworks Booth - Strata 2014
Teradata - Presentation at Hortonworks Booth - Strata 2014
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Big Data Analytics with Hadoop
Big Data Analytics with HadoopBig Data Analytics with Hadoop
Big Data Analytics with Hadoop
 
How to Use HIT for CCM
How to Use HIT for CCMHow to Use HIT for CCM
How to Use HIT for CCM
 
Change Data Capture
Change Data CaptureChange Data Capture
Change Data Capture
 
Streaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data StreamStreaming real time data with Vibe Data Stream
Streaming real time data with Vibe Data Stream
 
Oracle GoldenGate for Big Data
Oracle GoldenGate for Big DataOracle GoldenGate for Big Data
Oracle GoldenGate for Big Data
 
5 Keys to Oracle GoldenGate Implemenations
5 Keys to Oracle GoldenGate Implemenations5 Keys to Oracle GoldenGate Implemenations
5 Keys to Oracle GoldenGate Implemenations
 
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
Oracle GoldenGate Presentation from OTN Virtual Technology Summit - 7/9/14 (PDF)
 
Data warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-clouderaData warehouse-optimization-with-hadoop-informatica-cloudera
Data warehouse-optimization-with-hadoop-informatica-cloudera
 
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
Big Data SSD Architecture: Digging Deep to Discover Where SSD Performance Pay...
 
Hadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game ForeverHadoop & Hive Change the Data Warehousing Game Forever
Hadoop & Hive Change the Data Warehousing Game Forever
 
Go Faster - Remove Inhibitors to Rapid Innovation
Go Faster - Remove Inhibitors to Rapid InnovationGo Faster - Remove Inhibitors to Rapid Innovation
Go Faster - Remove Inhibitors to Rapid Innovation
 
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data StreamingOracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
Oracle GoldenGate and Apache Kafka: A Deep Dive Into Real-Time Data Streaming
 
Oracle streams-step-by-step-ppt
Oracle streams-step-by-step-pptOracle streams-step-by-step-ppt
Oracle streams-step-by-step-ppt
 
Innovation in diabetes
Innovation in diabetesInnovation in diabetes
Innovation in diabetes
 
Replacing Oracle CDC with Oracle GoldenGate
Replacing Oracle CDC with Oracle GoldenGateReplacing Oracle CDC with Oracle GoldenGate
Replacing Oracle CDC with Oracle GoldenGate
 
Change data capture
Change data captureChange data capture
Change data capture
 

Similar a Dealing with Changed Data in Hadoop

Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013alanfgates
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveDataWorks Summit
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysRahul Agarwal
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMatei Zaharia
 
Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeTorsten Steinbach
 
Mutable data @ scale
Mutable data @ scaleMutable data @ scale
Mutable data @ scaleOri Reshef
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyAlluxio, Inc.
 
Resolve issues with throttled dynamo db tables
Resolve issues with throttled dynamo db tablesResolve issues with throttled dynamo db tables
Resolve issues with throttled dynamo db tablesJean Joseph
 
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...DataStax
 
Lens at apachecon
Lens at apacheconLens at apachecon
Lens at apacheconamarsri
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL ServerMark Kromer
 
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Alluxio, Inc.
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeSaurabh K. Gupta
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Tech Triveni
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?DataWorks Summit
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoDataWorks Summit
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Vinoth Chandar
 
Design Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseDesign Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseRob Winters
 

Similar a Dealing with Changed Data in Hadoop (20)

Stinger hadoop summit june 2013
Stinger hadoop summit june 2013Stinger hadoop summit june 2013
Stinger hadoop summit june 2013
 
An In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in HiveAn In-Depth Look at Putting the Sting in Hive
An In-Depth Look at Putting the Sting in Hive
 
DC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion DaysDC Migration and Hadoop Scale For Big Billion Days
DC Migration and Hadoop Scale For Big Billion Days
 
Making Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse TechnologyMaking Data Timelier and More Reliable with Lakehouse Technology
Making Data Timelier and More Reliable with Lakehouse Technology
 
Suburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data LakeSuburface 2021 IBM Cloud Data Lake
Suburface 2021 IBM Cloud Data Lake
 
Mutable data @ scale
Mutable data @ scaleMutable data @ scale
Mutable data @ scale
 
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio JourneyModernizing Global Shared Data Analytics Platform and our Alluxio Journey
Modernizing Global Shared Data Analytics Platform and our Alluxio Journey
 
Resolve issues with throttled dynamo db tables
Resolve issues with throttled dynamo db tablesResolve issues with throttled dynamo db tables
Resolve issues with throttled dynamo db tables
 
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
Webinar - Macy’s: Why Your Database Decision Directly Impacts Customer Experi...
 
Lens at apachecon
Lens at apacheconLens at apachecon
Lens at apachecon
 
Big Data with SQL Server
Big Data with SQL ServerBig Data with SQL Server
Big Data with SQL Server
 
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...Data Con LA 2022 - Making real-time analytics a reality for digital transform...
Data Con LA 2022 - Making real-time analytics a reality for digital transform...
 
Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS Enabling big data & AI workloads on the object store at DBS
Enabling big data & AI workloads on the object store at DBS
 
Harness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data LakeHarness the power of Data in a Big Data Lake
Harness the power of Data in a Big Data Lake
 
Gruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in TelcoGruter TECHDAY 2014 Realtime Processing in Telco
Gruter TECHDAY 2014 Realtime Processing in Telco
 
Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...Apache CarbonData+Spark to realize data convergence and Unified high performa...
Apache CarbonData+Spark to realize data convergence and Unified high performa...
 
What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?What's New in Apache Hive 3.0?
What's New in Apache Hive 3.0?
 
What's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - TokyoWhat's New in Apache Hive 3.0 - Tokyo
What's New in Apache Hive 3.0 - Tokyo
 
Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017Hoodie - DataEngConf 2017
Hoodie - DataEngConf 2017
 
Design Principles for a Modern Data Warehouse
Design Principles for a Modern Data WarehouseDesign Principles for a Modern Data Warehouse
Design Principles for a Modern Data Warehouse
 

Más de DataWorks Summit

Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisDataWorks Summit
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiDataWorks Summit
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...DataWorks Summit
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...DataWorks Summit
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal SystemDataWorks Summit
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExampleDataWorks Summit
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberDataWorks Summit
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixDataWorks Summit
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiDataWorks Summit
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsDataWorks Summit
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureDataWorks Summit
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EngineDataWorks Summit
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...DataWorks Summit
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudDataWorks Summit
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiDataWorks Summit
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerDataWorks Summit
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...DataWorks Summit
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouDataWorks Summit
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkDataWorks Summit
 

Más de DataWorks Summit (20)

Data Science Crash Course
Data Science Crash CourseData Science Crash Course
Data Science Crash Course
 
Floating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache RatisFloating on a RAFT: HBase Durability with Apache Ratis
Floating on a RAFT: HBase Durability with Apache Ratis
 
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFiTracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
Tracking Crime as It Occurs with Apache Phoenix, Apache HBase and Apache NiFi
 
HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...HBase Tales From the Trenches - Short stories about most common HBase operati...
HBase Tales From the Trenches - Short stories about most common HBase operati...
 
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
Optimizing Geospatial Operations with Server-side Programming in HBase and Ac...
 
Managing the Dewey Decimal System
Managing the Dewey Decimal SystemManaging the Dewey Decimal System
Managing the Dewey Decimal System
 
Practical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist ExamplePractical NoSQL: Accumulo's dirlist Example
Practical NoSQL: Accumulo's dirlist Example
 
HBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at UberHBase Global Indexing to support large-scale data ingestion at Uber
HBase Global Indexing to support large-scale data ingestion at Uber
 
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and PhoenixScaling Cloud-Scale Translytics Workloads with Omid and Phoenix
Scaling Cloud-Scale Translytics Workloads with Omid and Phoenix
 
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFiBuilding the High Speed Cybersecurity Data Pipeline Using Apache NiFi
Building the High Speed Cybersecurity Data Pipeline Using Apache NiFi
 
Supporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability ImprovementsSupporting Apache HBase : Troubleshooting and Supportability Improvements
Supporting Apache HBase : Troubleshooting and Supportability Improvements
 
Security Framework for Multitenant Architecture
Security Framework for Multitenant ArchitectureSecurity Framework for Multitenant Architecture
Security Framework for Multitenant Architecture
 
Presto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything EnginePresto: Optimizing Performance of SQL-on-Anything Engine
Presto: Optimizing Performance of SQL-on-Anything Engine
 
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
Introducing MlFlow: An Open Source Platform for the Machine Learning Lifecycl...
 
Extending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google CloudExtending Twitter's Data Platform to Google Cloud
Extending Twitter's Data Platform to Google Cloud
 
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFiEvent-Driven Messaging and Actions using Apache Flink and Apache NiFi
Event-Driven Messaging and Actions using Apache Flink and Apache NiFi
 
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache RangerSecuring Data in Hybrid on-premise and Cloud Environments using Apache Ranger
Securing Data in Hybrid on-premise and Cloud Environments using Apache Ranger
 
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
Big Data Meets NVM: Accelerating Big Data Processing with Non-Volatile Memory...
 
Computer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near YouComputer Vision: Coming to a Store Near You
Computer Vision: Coming to a Store Near You
 
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache SparkBig Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
Big Data Genomics: Clustering Billions of DNA Sequences with Apache Spark
 

Último

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek SchlawackFwdays
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfAlex Barbosa Coqueiro
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024The Digital Insurer
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Mattias Andersson
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyAlfredo García Lavilla
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfSeasiaInfotech2
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticscarlostorres15106
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Patryk Bandurski
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 

Último (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
DMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special EditionDMCC Future of Trade Web3 - Special Edition
DMCC Future of Trade Web3 - Special Edition
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Unraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdfUnraveling Multimodality with Large Language Models.pdf
Unraveling Multimodality with Large Language Models.pdf
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024My INSURER PTE LTD - Insurtech Innovation Award 2024
My INSURER PTE LTD - Insurtech Innovation Award 2024
 
Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?Are Multi-Cloud and Serverless Good or Bad?
Are Multi-Cloud and Serverless Good or Bad?
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
Commit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easyCommit 2024 - Secret Management made easy
Commit 2024 - Secret Management made easy
 
The Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdfThe Future of Software Development - Devin AI Innovative Approach.pdf
The Future of Software Development - Devin AI Innovative Approach.pdf
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmaticsKotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
Kotlin Multiplatform & Compose Multiplatform - Starter kit for pragmatics
 
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
Integration and Automation in Practice: CI/CD in Mule Integration and Automat...
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 

Dealing with Changed Data in Hadoop

  • 1. Dealing With Changed Data on Hadoop An old data warehouse problem in a new world Kunal Jain, Big Data Solutions Architect at Informatica June, 2014
  • 2. Agenda • Challenges with Traditional Data Warehouse • Requirements for Data Warehouse Optimization • Data Warehouse Optimization Process Flow • Dealing With Changed Data on Hadoop • Demo
  • 3. Challenges With Traditional Data Warehousing • Expensive to scale as data volumes grow and new data types emerge • Staging of raw data and ELT consuming capacity of data warehouse too quickly forcing costly upgrades • Network becoming a bottleneck to performance • Does not handle new types of multi-structured data • Changes to schemas cause delays in project delivery 3
  • 4. Requirements for an Optimized Data Warehouse • Cost-effective scale out infrastructure to support unlimited data volumes • Leverage commodity hardware and software to lower infrastructure costs • Leverage existing skills to lower operational costs • Must support all types of data • Must support agile methodologies with schema-on-read, rapid prototyping, metadata-driven visual IDE’s, and collaboration tools • Integrates with existing and new types of infrastructure 4
  • 5. Data Warehouse Optimization Process Flow BI Reports & AppsData Warehouse 1. Offload data & ELT processing to Hadoop 3. Parse & prepare (e.g. ETL, data quality) data for analysis 4. Move high value curated data into data warehouse 2. Batch load raw data (e.g. transactions, multi-structured) Relational, Mainframe Documents and Emails Social Media, Web Logs Machine Device, Cloud
  • 6. Use Case: Updates in Traditional DW/RDBMS • Example Requirement: Historical table containing 10 Billion rows of data • Every day gets incremental data of 10 million rows (70% new inserts, 30% updates) • Traditional approach: Straightforward to insert and update in a traditional DW/RDBMS • Challenge: Traditional infrastructure cannot scale to the data size and is not cost-effective.
  • 7. Use Case: Update/Insert in Hadoop/Hive • Requirement: Use Hive to store massive amounts of data, but need to perform inserts, deletes and updates. • Typical approach: Since Hive does not support updates, the workaround used is to perform a FULL OUTER JOIN and a FULL TABLE REFRESH to update impacted rows • Challenge: Table refresh / full outer join on historical tables (10B+ rows) would blow SLAs out of the water
  • 8. Use Case: Update/Insert in Hadoop/Hive TXID Description Transaction Date Amount Last Modified Date 1 Xxxx 20-JAN-13 200 20-JAN-13 2 Yyy 21-FEB-13 300 21-FEB-13 3 Aaa 22-MAR-13 400 22-MAR-13 TXID Description Transaction Date Amount Last Modified Date 1 Xxxx 20-JAN-13 210 23-MAR-13 2 Yyy 21-FEB-13 300 21-FEB-13 3 Aaa 22-MAR-13 400 22-MAR-13 4 Ccc 23-MAR-13 150 23-MAR-13 6 Bbb 23-MAR-13 500 23-MAR-13 Target Table (10 billion + 7 million rows) Target Table (10 billion rows) TXID Description Transaction Date Amount Last Modified Date 1 Xxxx 20-JAN-13 210 23-MAR-13 4 Ccc 23-MAR-13 150 23-MAR-13 6 Bbb 23-MAR-13 500 23-MAR-13 Staging Table (10 million rows) with 70% Inserts and 30% Updates UPDATE INSERT Partitioning rows by date significantly reduces total # of partitions impacted by updates
  • 9. Use Case: Update/Insert in Hadoop/Hive Relational Data Source Inserts (70%) Updates(30%) Staging Target Target Rows: ~10M Rows: ~10B Rows: ~10B Inserts Updates Temporary Rows: ~13M 1. Extract & Load 2b. Bring unchanged data from impacted partitions 2a. Bring new data and the updated data 3. Delete matching partitions from Target 4. Load all data from Temporary into TargetImpacted Partitions Rows: ~10B+7M Impacted Partition
  • 10. DEMO
  • 11. Optimize the Entire Data Pipeline Increase Performance & Productivity on Hadoop Archive Profile Parse CleanseETL Match Stream Load Load Services Events Replicate Topics Machine Device, Cloud Documents and Emails Relational, Mainframe Social Media, Web Logs Data Warehouse Mobile Apps Analytics & Op Dashboards Alerts Analytics Teams
  • 12. Informatica on Hadoop Benefits • Cost-effectively scale storage and processing (over 2x the performance) • Increase developer productivity (up to 5x over hand-coding) • Continue to leverage existing ETL skills you have today • Informatica Hive partitioning/UPSERT is a key capability for rapid implementation of CDC use-case • Ensure success with proven leader in big data and data warehouse optimization
  • 13. 15

Notas del editor

  1. As data volumes and business complexity grew, traditional scale up and scale out architectures become too costly. Therefore most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a “time window” for data retention beyond which data is archived. Of course, this data is not in the warehouse so business users cannot benefit from it. For traditional grid computing the network was becoming the bottleneck as large data volumes were pushed to the CPU workloads. This placed a limit on how much data could be processed in a reasonable amount of time to meet business SLA’s Does not handle new types of multi-structured data Changes to schemas cause delays in project delivery
  2. The requirements for an optimized DW architecture include: Cost-effective scale out infrastructure to support unlimited data volumes Leverage commodity hardware and software to lower infrastructure costs Leverage existing skills to lower operational costs Must supports all types of data Must support agile methodologies with schema-on-read, rapid prototyping, metadata-driven visual IDE’s, and collaboration tools Integrates with existing and new types of infrastructure
  3. First start by identifying what data and processing to offload from the DW to Hadoop Inactive or infrequently used data can be moved to a Hadoop-based environment Transformations that are consuming too much CPU capacity in the DW can be moved Unstructured and multi-structured data (e.g. non-relational) data should be staged in Hadoop and not the DW You can also offload data from relational and mainframe systems to the Hadoop-based environment For lower latency data originating in relational database, data can be replicated, in real-time, from relational sources to the Hadoop-based environment Use change data capture (CDC) to capture changes as the occur in your operational transactions systems and propagate these changes to Hadoop. Also, because HDFS doesn’t impose schema requirements on data, unstructured data that was previously not available to the warehouse can also be loaded Collect real-time machine and sensor data at the source as it is created and stream it directly into Hadoop instead of staging it in a temporary file system or worse yet staging it in the DW As data is ingested into the Hadoop-based environment you can leverage the power of high performance distributed grid computing to parse, extract features, integrate, normalize, standardize, and cleanse data for analysis. Data must be parsed and prepared to further analysis. For example, semi-structured data, like json or xml, is parsed into a tabular format for easier downstream consumption by analysis programs and tools. Data cleansing logic can be applied to increase the data’s trustworthiness. The Hadoop-based environment cost-effectively and automatically scales to prepare all types of data no matter the volume for analysis. After the data has been cleansed and transformed, move copy high-value datasets from the Hadoop-based environment into the DW that have been refined, curated, to augment existing tables to make it directly accessible by the enterprise’s existing BI reports and applications.
  4. Classic Data Warehouse offloading use case
  5. Informatica enables you to define the data processing flow (e.g. ETL, DQ, etc) with transformations and rules using a visual design UI. We call these mappings. When these data flows or mappings are deployed and run, Informatica optimizes the end-to-end flow from source to target to generate Hive-QL scripts Transformations that don’t map to HQL, for example name and address cleansing routines, will be run as User Defined Functions (UDF) via the VibeTM virtual data machine libraries that resides on each of the Hadoop nodes. Because we have separated the design from the deployment you can take existing PowerCenter mappings and run them on Hadoop In fact, the source and target data don’t have to reside in Hadoop. Informatica will stream the data from the source into Hadoop for processing and then deliver it to the target whether on Hadoop or another system Tech notes: Currently, the VibeTM virtual data machine library is approx 1.3GB of jar and shared library files. Note that the VibeTM virtual data machine is not a continuously executing service process (i.e. daemon), but rather is a set of libraries that are executed only within the context of map-reduce jobs.
  6. One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. Sure you can build custom adapters and scripts but there are several challenges that comes with it. To name a few: require expert knowledge of the source systems, applications, data structures, and formats The custom code should perform and scale as data volumes grows Along with the need for speed, security and reliability can not be overlooked. Thus building a robust custom adapter takes time and can be costly to maintain as software versions change. On the other hand, Informatica PowerExchange can access data from virtually any data source at any latency (e.g., batch, real time, or near real time) and deliver all your data directly to/from a Hadoop-based environment. Proven Path to Innovation 5000+ customers, 500+ partners, 100,000+ trained Informatica developers Enterprise scalability, security, & support