Dealing with Changed Data in Hadoop
1. Dealing With Changed Data on Hadoop
An old data warehouse problem in a new world
Kunal Jain, Big Data Solutions Architect at Informatica
June, 2014
2. Agenda
• Challenges with Traditional Data Warehouse
• Requirements for Data Warehouse Optimization
• Data Warehouse Optimization Process Flow
• Dealing With Changed Data on Hadoop
• Demo
3. Challenges With Traditional Data Warehousing
• Expensive to scale as data volumes grow and new data types emerge
• Staging of raw data and ELT consuming capacity of the data warehouse too quickly, forcing costly upgrades
• Network becoming a bottleneck to performance
• Does not handle new types of multi-structured data
• Changes to schemas cause delays in project delivery
4. Requirements for an Optimized Data Warehouse
• Cost-effective scale-out infrastructure to support unlimited data volumes
• Leverage commodity hardware and software to lower infrastructure costs
• Leverage existing skills to lower operational costs
• Must support all types of data
• Must support agile methodologies with schema-on-read, rapid prototyping, metadata-driven visual IDEs, and collaboration tools
• Integrates with existing and new types of infrastructure
5. Data Warehouse Optimization Process Flow
[Diagram: data flows from sources through Hadoop into the Data Warehouse, feeding BI Reports & Apps]
1. Offload data & ELT processing to Hadoop
2. Batch load raw data (e.g. transactions, multi-structured)
3. Parse & prepare data for analysis (e.g. ETL, data quality)
4. Move high-value curated data into the data warehouse
Sources: Relational, Mainframe; Documents and Emails; Social Media, Web Logs; Machine Device, Cloud
6. Use Case: Updates in Traditional DW/RDBMS
• Example requirement: Historical table containing 10 billion rows of data
• Every day brings incremental data of 10 million rows (70% new inserts, 30% updates)
• Traditional approach: Straightforward to insert and update in a traditional DW/RDBMS
• Challenge: Traditional infrastructure cannot scale to the data size and is not cost-effective.
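The RDBMS-style upsert described above can be sketched as follows; this is a minimal illustration with the table modeled as a dict keyed on TXID (an assumption for demonstration — a real DW would use MERGE/UPDATE statements):

```python
# Minimal sketch of a keyed upsert, as a traditional DW/RDBMS performs
# natively. Table modeled as a dict keyed on TXID (illustrative only).
def upsert(target, staging):
    inserts = updates = 0
    for txid, row in staging.items():
        if txid in target:
            updates += 1        # key exists: update in place
        else:
            inserts += 1        # new key: insert
        target[txid] = row
    return inserts, updates

target = {1: {"amount": 200}, 2: {"amount": 300}, 3: {"amount": 400}}
staging = {1: {"amount": 210}, 4: {"amount": 150}, 6: {"amount": 500}}
print(upsert(target, staging))  # (2, 1): two inserts, one update
```

Because the database updates rows in place, the cost is proportional to the 10 million incoming rows, not the 10 billion stored ones.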
7. Use Case: Update/Insert in Hadoop/Hive
• Requirement: Use Hive to store massive amounts of data, but need to perform inserts, deletes, and updates.
• Typical approach: Since Hive does not support updates, the workaround used is to perform a FULL OUTER JOIN and a FULL TABLE REFRESH to update impacted rows
• Challenge: A table refresh / full outer join on historical tables (10B+ rows) would blow SLAs out of the water
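The workaround above can be simulated in a few lines; this is a sketch of the semantics only, with tables modeled as dicts keyed on TXID (an illustrative assumption, not Hive-QL):

```python
# Sketch of the full-refresh workaround: with no UPDATE support in
# (2014-era) Hive, the entire target is rebuilt via a FULL OUTER JOIN of
# target and staging on the key, preferring the staging row on a match.
def full_refresh(target, staging):
    all_keys = set(target) | set(staging)   # FULL OUTER JOIN on TXID
    return {k: staging.get(k, target.get(k)) for k in all_keys}

target = {1: {"amount": 200}, 2: {"amount": 300}}
staging = {1: {"amount": 210}, 4: {"amount": 150}}
refreshed = full_refresh(target, staging)
print(sorted(refreshed))  # [1, 2, 4]
```

Note that every target row is reread and rewritten even when only a tiny fraction changed; scaled to 10B+ rows, that is what blows the SLA.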
8. Use Case: Update/Insert in Hadoop/Hive

Target Table (10 billion rows)
TXID | Description | Transaction Date | Amount | Last Modified Date
1    | Xxxx        | 20-JAN-13        | 200    | 20-JAN-13
2    | Yyy         | 21-FEB-13        | 300    | 21-FEB-13
3    | Aaa         | 22-MAR-13        | 400    | 22-MAR-13

Staging Table (10 million rows) with 70% Inserts and 30% Updates
TXID | Description | Transaction Date | Amount | Last Modified Date
1    | Xxxx        | 20-JAN-13        | 210    | 23-MAR-13   (UPDATE)
4    | Ccc         | 23-MAR-13        | 150    | 23-MAR-13   (INSERT)
6    | Bbb         | 23-MAR-13        | 500    | 23-MAR-13   (INSERT)

Target Table after merge (10 billion + 7 million rows)
TXID | Description | Transaction Date | Amount | Last Modified Date
1    | Xxxx        | 20-JAN-13        | 210    | 23-MAR-13
2    | Yyy         | 21-FEB-13        | 300    | 21-FEB-13
3    | Aaa         | 22-MAR-13        | 400    | 22-MAR-13
4    | Ccc         | 23-MAR-13        | 150    | 23-MAR-13
6    | Bbb         | 23-MAR-13        | 500    | 23-MAR-13

Partitioning rows by date significantly reduces the total number of partitions impacted by updates.
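The point about partitioning can be made concrete with a small sketch: only partitions that contain an updated row need to be rewritten, while pure inserts land in new partitions. Row values below are the illustrative ones from the tables above; the dict-of-rows modeling is an assumption:

```python
# Sketch: when the target is partitioned by transaction date, only the
# partitions holding an updated row must be rewritten.
def impacted_partitions(staging_rows, target_txids):
    """Partitions (dates) touched by updates; inserts create/extend new ones."""
    return {r["tx_date"] for r in staging_rows if r["txid"] in target_txids}

staging_rows = [
    {"txid": 1, "tx_date": "20-JAN-13"},   # update of an existing row
    {"txid": 4, "tx_date": "23-MAR-13"},   # insert
    {"txid": 6, "tx_date": "23-MAR-13"},   # insert
]
print(impacted_partitions(staging_rows, target_txids={1, 2, 3}))
# {'20-JAN-13'}: only one historical partition needs rewriting
```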
9. Use Case: Update/Insert in Hadoop/Hive
[Diagram: partition-level upsert flow — Relational Data Source → Staging → Temporary → Target]
1. Extract & load the incremental data from the relational data source into a Staging table (~10M rows: 70% inserts, 30% updates).
2a. Bring the new data and the updated data from Staging into a Temporary table.
2b. Bring the unchanged data from the impacted Target partitions into the Temporary table (Temporary: ~13M rows).
3. Delete the matching (impacted) partitions from the Target table (~10B rows).
4. Load all data from Temporary into the Target (~10B + 7M rows).
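The four-step flow above can be sketched end to end. This models the partitioned target as {partition_date: {txid: row}} dicts — an illustrative stand-in for partitioned Hive tables, not Informatica's actual implementation:

```python
# Sketch of the partition-level upsert: only impacted partitions are
# dropped and reloaded; untouched partitions are never read or written.
def partition_upsert(target, staging):
    # Step 2a: all staging rows (new + updated) go into Temporary,
    # grouped by partition key (transaction date).
    temp = {}
    for txid, row in staging.items():
        temp.setdefault(row["tx_date"], {})[txid] = row
    impacted = set(temp) & set(target)
    # Step 2b: bring along the unchanged rows of each impacted partition.
    for part in impacted:
        for txid, row in target[part].items():
            temp[part].setdefault(txid, row)   # staging version wins on a clash
    # Step 3: delete the matching partitions from Target.
    for part in impacted:
        del target[part]
    # Step 4: load everything in Temporary into Target.
    target.update(temp)
    return target
```

Step 1 (extract & load into Staging) is assumed done before the call. The key property: the cost is proportional to the impacted partitions (~13M rows here), not the full 10B-row table.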
11. Optimize the Entire Data Pipeline
Increase Performance & Productivity on Hadoop
[Diagram: sources (Relational, Mainframe; Documents and Emails; Social Media, Web Logs; Machine Device, Cloud) are loaded, replicated, streamed, and archived into Hadoop, where data is profiled, parsed, cleansed, matched, and transformed (ETL); curated results are loaded into the Data Warehouse and delivered to BI reports, analytics & operational dashboards, mobile apps, alerts, services/events/topics, and analytics teams.]
12. Informatica on Hadoop Benefits
• Cost-effectively scale storage and processing (over 2x the performance)
• Increase developer productivity (up to 5x over hand-coding)
• Continue to leverage the ETL skills you have today
• Informatica Hive partitioning/UPSERT is a key capability for rapid implementation of the CDC use case
• Ensure success with a proven leader in big data and data warehouse optimization
As data volumes and business complexity grow, traditional scale-up and scale-out architectures become too costly, so most organizations are unable to keep all the data they would like to query directly in the data warehouse. They have to archive the data to more affordable offline systems, such as a storage grid or tape backup. A typical strategy is to define a "time window" for data retention beyond which data is archived. Of course, this data is no longer in the warehouse, so business users cannot benefit from it.
For traditional grid computing, the network was becoming the bottleneck as large data volumes were pushed to the CPU workloads. This placed a limit on how much data could be processed in a reasonable amount of time to meet business SLAs.
Does not handle new types of multi-structured data
Changes to schemas cause delays in project delivery
The requirements for an optimized DW architecture include:
Cost-effective scale out infrastructure to support unlimited data volumes
Leverage commodity hardware and software to lower infrastructure costs
Leverage existing skills to lower operational costs
Must support all types of data
Must support agile methodologies with schema-on-read, rapid prototyping, metadata-driven visual IDEs, and collaboration tools
Integrates with existing and new types of infrastructure
First, start by identifying what data and processing to offload from the DW to Hadoop
Inactive or infrequently used data can be moved to a Hadoop-based environment
Transformations that are consuming too much CPU capacity in the DW can be moved
Unstructured and multi-structured (e.g. non-relational) data should be staged in Hadoop, not the DW
You can also offload data from relational and mainframe systems to the Hadoop-based environment
For lower latency data originating in relational databases, data can be replicated, in real time, from relational sources to the Hadoop-based environment
Use change data capture (CDC) to capture changes as they occur in your operational transaction systems and propagate these changes to Hadoop.
Also, because HDFS doesn’t impose schema requirements on data, unstructured data that was previously not available to the warehouse can also be loaded
Collect real-time machine and sensor data at the source as it is created and stream it directly into Hadoop instead of staging it in a temporary file system or worse yet staging it in the DW
As data is ingested into the Hadoop-based environment, you can leverage the power of high-performance distributed grid computing to parse, extract features, integrate, normalize, standardize, and cleanse data for analysis. Data must be parsed and prepared for further analysis. For example, semi-structured data, like JSON or XML, is parsed into a tabular format for easier downstream consumption by analysis programs and tools. Data cleansing logic can be applied to increase the data's trustworthiness.
The Hadoop-based environment cost-effectively and automatically scales to prepare all types of data no matter the volume for analysis.
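The parse-and-prepare step for semi-structured data can be illustrated with a tiny sketch that flattens JSON records into tabular rows; the field names and helper are illustrative assumptions, not a specific Informatica API:

```python
# Sketch of "parse & prepare": turning semi-structured JSON records into
# tabular rows for downstream analysis tools.
import json

def to_tabular(json_lines, columns):
    rows = []
    for line in json_lines:
        rec = json.loads(line)
        rows.append([rec.get(c) for c in columns])  # missing fields -> None
    return rows

raw = ['{"txid": 1, "amount": 200}', '{"txid": 2}']
print(to_tabular(raw, ["txid", "amount"]))  # [[1, 200], [2, None]]
```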
After the data has been cleansed and transformed, copy the refined, curated, high-value datasets from the Hadoop-based environment into the DW to augment existing tables, making them directly accessible to the enterprise's existing BI reports and applications.
Classic Data Warehouse offloading use case
Informatica enables you to define the data processing flow (e.g. ETL, DQ, etc.) with transformations and rules using a visual design UI. We call these mappings.
When these data flows or mappings are deployed and run, Informatica optimizes the end-to-end flow from source to target to generate Hive-QL scripts
Transformations that don't map to HQL, for example name and address cleansing routines, will be run as User Defined Functions (UDFs) via the Vibe™ virtual data machine libraries that reside on each of the Hadoop nodes. Because design is separated from deployment, you can take existing PowerCenter mappings and run them on Hadoop
In fact, the source and target data don’t have to reside in Hadoop. Informatica will stream the data from the source into Hadoop for processing and then deliver it to the target whether on Hadoop or another system
Tech notes: Currently, the Vibe™ virtual data machine library is approximately 1.3 GB of jar and shared library files. Note that the Vibe™ virtual data machine is not a continuously executing service process (i.e. a daemon), but rather a set of libraries executed only within the context of MapReduce jobs.
One of the first challenges Hadoop developers face is accessing all the data needed for processing and getting it into Hadoop. Sure, you can build custom adapters and scripts, but several challenges come with that. To name a few:
They require expert knowledge of the source systems, applications, data structures, and formats
The custom code must perform and scale as data volumes grow
Along with the need for speed, security and reliability cannot be overlooked.
Thus building a robust custom adapter takes time and can be costly to maintain as software versions change. On the other hand, Informatica PowerExchange can access data from virtually any data source at any latency (e.g., batch, real time, or near real time) and deliver all your data directly to/from a Hadoop-based environment.
Proven Path to Innovation
5000+ customers, 500+ partners, 100,000+ trained Informatica developers
Enterprise scalability, security, & support