2. Introducing DMExpress™ - Fast. Efficient. Simple. Cost-Effective.
A Family of High-Performance, Purpose-Built Data Integration Tools

Integrate
→ Core ETL processing & database transformation
→ High-performance ETL offload (Oracle PL/SQL, Teradata, and others)
Optimize
→ ETL optimization for Informatica, DataStage, and others
→ Hadoop optimization for Apache, Hortonworks, Cloudera, and others
→ Rehosting optimization for Clerity, Micro Focus, Oracle, and others
→ Sort optimization for SAS, DFSORT, Trillium, and others
Migrate
→ High-performance sort for z/OS, z/VSE, and Windows/UNIX/Linux
Syncsort Confidential and Proprietary - do not copy or distribute
3. Do You Need Data Integration Optimization/Acceleration?
• ETL is taking longer and longer
• Large budgets required to purchase additional hardware and database capacity
• A shift of data integration processing into the database or to hand-coded solutions
• The data integration environment can't easily be governed, maintained, or expanded
• Inability to launch or staff initiatives due to lack of resources
• Long time-to-value
• Users may lose confidence in the data
4. What Is Optimization with DMExpress™?
Better performance – no tuning
Lower costs for:
• Hardware
• Licenses
• IT staff
Improves your ability to deliver:
• Reduces resource usage
• More work in less time
• Protects the investments you have already made
5. Examples of Optimization with DMExpress™

IBM DataStage – Major Logistics Company
→ 10x faster than DataStage Parallel
→ 26x faster than DataStage Server

Informatica – Information Service Provider
→ 27 days down to 15 hours
→ 6 weeks to production

Informatica – Major Insurance Provider
→ 1/20 of the disk space
→ Significantly less memory

ComScore
→ Cost/TB down from US $1,538 to US $46

PL/SQL – Global Payments
→ Costs reduced by $2.9 million
→ 2.35 h down to 3 min

Ab Initio – Financial Service Provider
→ 4:42 h down to 1:12 h
→ 360 GB of workspace down to 4 GB
6. DMExpress Delivers Significantly Faster Performance – Even Without Any Tuning

[Chart] Elapsed time in minutes for 1. Copy, 2. Sort, 3. Aggregate:
DMExpress (no tuning) is up to 5x faster than Informatica (tuned).

[Chart] Elapsed time in minutes for 1. Copy/Filter, 2. Sort, 3. Aggregate/Rollup:
DMExpress (no tuning) is up to 4x faster than Ab Initio (tuned).
7. DMExpress Seamlessly Scales to Support Growing Requirements

[Chart] Volume & complexity over time: business requirements grow steadily; conventional ETL falls behind from the point of problem awareness, while DMExpress keeps pace.

Seamlessly scale with DMExpress:
• No tuning
• No ELT
• Defer hardware purchases

Conventional ETL must continuously implement performance stop-gap measures:
• Manual tuning
• Add/upgrade hardware
• Push-down (ELT)
8. Fast: Intelligent Sort Algorithms – High Frequency and Impact

Sort impacts every aspect of ETL: source extract (compress & FTP), data partitioning, joining, merging, transformation, aggregation, and database load.

[Diagram] Compression (ratio up to 6x) increases throughput at every stage:
• Partition data: up to 40% faster
• Join records: up to 60% faster
• Merge & transformation: up to 50% faster
• Aggregation: up to 70% faster
• Database load & index: up to 40% faster

Syncsort has been the market-leading sort technology since 1968.
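The claim above rests on classic external sorting: sort what fits in memory, spill compressed runs to disk, then merge the runs. A minimal sketch of the technique (an illustration only, not Syncsort's patented implementation; all function names here are ours):

```python
import gzip
import heapq
import os
import tempfile

def external_sort(records, memory_limit=1000):
    """Sort an arbitrarily large record stream using bounded memory.

    In-memory runs are sorted, compressed, and spilled to disk; the final
    pass streams a k-way merge over the runs. Compressing the workspace
    trades CPU for disk I/O, the trade the slide describes.
    """
    run_files = []
    buffer = []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= memory_limit:
            run_files.append(_spill(sorted(buffer)))
            buffer = []
    if buffer:
        run_files.append(_spill(sorted(buffer)))
    streams = [_read_run(path) for path in run_files]
    try:
        # k-way merge of the sorted, compressed runs
        yield from heapq.merge(*streams)
    finally:
        for path in run_files:
            os.unlink(path)

def _spill(sorted_run):
    """Write one sorted run to a compressed temporary file."""
    fd, path = tempfile.mkstemp(suffix=".run.gz")
    os.close(fd)
    with gzip.open(path, "wt") as f:
        for rec in sorted_run:
            f.write(rec + "\n")
    return path

def _read_run(path):
    """Stream one spilled run back, record by record."""
    with gzip.open(path, "rt") as f:
        for line in f:
            yield line.rstrip("\n")
```

A real engine would add record-format awareness and overlapped I/O; the structure, bounded memory plus a streaming merge, is the same.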
9. Maximizing Performance with Optimum Resource Utilization – The Performance Triangle

[Diagram] The ETL Process Optimizer sits inside a triangle of CPU, memory, and disk/I/O, balancing partition & buffer management, pipeline parallelism, instruction cache optimization, memory cache optimization, I/O optimization, and algorithm selection against whichever resource is the current bound.

DMExpress is different:
• Patented algorithms – dynamically responds to CPU, memory, and disk availability
• Direct I/O – bypasses the file system buffer, accessing data directly at block level for higher performance
• Compression – used for read/write and, crucially, for the active workspace (minimizes disk touches and transfer volume)
10. DMExpress Dynamically Maximizes Throughput at Run Time

Conventional data integration – manual and static algorithms:
■ Scaling requires expensive hardware
■ I/O operations well below disk speed
■ Requires exhaustive tuning
■ Sub-optimal consumption of resources
■ Uses all memory, overflows to disk

Data integration with DMExpress – automatic and dynamic algorithms:
■ Extremely efficient on commodity hardware
■ I/O operations at near disk speed
■ Automatic parallelism and pipelining
■ Automatic, efficient caching and hashing
■ Minimizes disk caching
11. Efficient: Dynamic ETL Optimizer

[Diagram] The ETL Process Optimizer combines resource analysis (memory, CPU, I/O) and data analysis (data type, record format, number of records/columns) to drive partition & buffer management, pipeline parallelism, instruction cache optimization, memory cache optimization, file system I/O optimization, and algorithm selection.

The fully automatic, continuously self-tuning optimizer maximizes throughput and resource efficiency:
– Evaluates the hardware, software, and data environment
– Determines the optimal algorithmic flow at start-up
– Begins execution with an auto-generated optimizer plan
– Continuously adjusts algorithms, memory use, and parallelism based on the application and the run-time environment
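As a toy illustration of the "algorithm selection" step (hypothetical logic, not the actual optimizer), an aggregation might use a hash table when the estimated group count fits the memory budget, and fall back to a sort-based strategy otherwise:

```python
import itertools

def aggregate(rows, est_groups, memory_budget_groups=1_000_000):
    """Sum `value` per `key`, choosing the algorithm by estimated group count.

    Hash aggregation holds one accumulator per group in memory; when the
    estimated group count exceeds the budget, sort-based aggregation keeps
    memory flat by streaming over key-ordered rows instead.
    """
    if est_groups <= memory_budget_groups:
        # Hash aggregation: O(groups) memory, single pass over the input.
        totals = {}
        for key, value in rows:
            totals[key] = totals.get(key, 0) + value
        return sorted(totals.items())
    # Sort-based aggregation: sort (externally, in a real engine),
    # then stream group by group with O(1) state per group.
    out = []
    for key, grp in itertools.groupby(sorted(rows), key=lambda r: r[0]):
        out.append((key, sum(v for _, v in grp)))
    return out
```

A production optimizer makes this choice from live statistics rather than a caller-supplied estimate, and can revisit it mid-run; the decision shape is what matters here.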
12. Design Once, Inherit Performance

[Diagram] An ETL job reads from sources, joins, aggregates, and writes to targets (EDW); DMExpress maps the job's tasks onto dynamically managed threads.

• Each ETL task runs in a separate process
• Automatic, dynamic thread management for each task
• Automatic parallelism and pipelining
• Automatic, dynamic algorithm selection
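The task model above, independent tasks wired so that downstream stages consume records while upstream stages are still producing them, can be sketched as follows. This is an illustrative sketch only: DMExpress runs each task in a separate process, whereas for portability this toy version uses threads and queues, which share the same pipelining structure.

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream between pipeline stages

def reader(records, out_q):
    """Source task: emit records downstream as they are produced."""
    for rec in records:
        out_q.put(rec)
    out_q.put(SENTINEL)

def transformer(in_q, out_q):
    """Transform task: consumes upstream output concurrently (pipelining)."""
    while (rec := in_q.get()) is not SENTINEL:
        out_q.put(rec.upper())  # stand-in for a real transformation
    out_q.put(SENTINEL)

def run_pipeline(records):
    """Wire the tasks together and collect the final output."""
    q1, q2 = queue.Queue(), queue.Queue()
    stages = [threading.Thread(target=reader, args=(records, q1)),
              threading.Thread(target=transformer, args=(q1, q2))]
    for t in stages:
        t.start()
    results = []
    while (rec := q2.get()) is not SENTINEL:
        results.append(rec)
    for t in stages:
        t.join()
    return results
```

Because every stage runs concurrently, total elapsed time approaches the cost of the slowest stage rather than the sum of all stages, which is the point of pipelining.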
14. DMExpress Architecture Delivers Maximum Performance and Data Scalability with Automatic, Dynamic Optimizations

Integration / customization (SDK, open APIs)
Graphical development environment

DMExpress Engine
• High-performance transformations: Sort, Merge, Aggregate, Join/Lookup, Copy, Load Presort, Filter, Reformat, Partition, Advanced Text Processing, Data Partitioning
• User-defined functions plus built-in functions: Numeric, Text, Date and Time, Logical
• Automatic, continuous optimization across deployment, metadata, algorithms, and processing time

Source/target connectivity
15. Five Simple Steps to Deploy. Tuning Is NOT One of Them.

1. Install DMExpress
   • Single install
   • Takes less than 5 minutes
2. Choose a "Task" template
   • Primary tasks: Sort, Merge, Aggregate, Join/Lookup, Copy
   • Secondary tasks: Filter, Reformat, Partition
3. Fill in the blanks
   • Connectivity
   • Standard functions: Numeric, Text, Date/Time, Logical
   • User-defined functions
4. Integrate
   • Create complete ETL "jobs" by combining multiple "tasks"
   • Define flows – from files to direct flows
5. Deploy
   • Schedule
   • Parameterize
   • Monitor
16. Syncsort DMExpress Is Simple but Powerful
An intuitive graphical interface enables development and maintenance:
• Graphical development environment
• Expression builder
• Job/task diff – detects differences between development, test, and production environments
→ No coding required
→ No tuning required
→ Easily build and edit jobs and tasks
→ Users are fully functional within a few days
17. DMExpress Architecture

[Diagram] DMExpress clients (job editor, task editor, command line) work against a flat-file-based metadata repository, with check-in/check-out to a third-party version control tool. Design services and a time view run on the local Windows/UNIX/Linux data server; the DMExpress engine runs on local or remote servers against the data sources/targets.
19. Acceleration POC – Scenario A

[Chart] Processing time in minutes for the 'high-load jobs':
• DataStage Parallel: 32 min on 4/6 cores (virtual), Linux
• DMExpress: 19 min on 1 core (physical/virtual), Linux
→ 1/2 the time on 1/6 the hardware
20. Acceleration POC – Scenario B

[Chart] Processing time in minutes for 'Scenario B':
• DataStage Server: 40.00 min on 14 cores (physical), HP-UX
• DMExpress: 21.30 min on 1 core (virtual), Linux
→ 1/2 the time on 1/14 the hardware
21. Use Case 1: Global Information Service Provider
Business Challenge
Severe competitive pressure from Google Finance, Yahoo! Finance, Morningstar, and others forced development of strategic
new offerings
Environment
Informatica 8.11 SP3, Oracle 10.2 RAC (6 nodes), DMExpress 5.2.15, 16-core Linux machine
Technical Challenge
Weekly Reporting application on 8 million DUNS numbers
Data Sizes: 5 tables of ~1 TB each
Bottleneck step was to join 5 tables and aggregate the output
Prior Attempts to Increase Performance
Manual tuning of the ETL routines – many consultant-months and dollars spent
Converted the ETL mapping to ELT. No success – the process would abort with "ORA-01555: snapshot too old"
Broke the ELT process into 100,000-record batches to prevent the Oracle error. The process ran in 27 days (extrapolated)
The problem had existed since February 2009, through many attempts and touch points, with production due in October
Solution
DMExpress extracted the five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours – a total run time of 15 hours for this step in DMExpress vs. 27 days
DMExpress is invoked at the command line prior to Informatica
Benefits
New offering launched on time
Able to meet SLAs
2 weeks to finish POC
In production in 6 weeks
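For context on the batched-ELT workaround described under Prior Attempts: committing every 100,000 rows keeps each transaction short, so no single statement holds read consistency long enough to trigger ORA-01555. A rough sketch of that pattern (using SQLite purely for illustration; all table and column names are hypothetical):

```python
import sqlite3

BATCH = 100_000  # commit every N rows so no transaction runs long

def copy_in_batches(conn, batch=BATCH):
    """Move rows from a staging table to a target in short transactions.

    Each iteration reads the next key range and commits immediately. On
    Oracle, short transactions avoid ORA-01555 because no query has to
    reconstruct a read-consistent image across the whole multi-day run.
    Table and column names here are made up for the sketch.
    """
    moved = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM staging "
            "WHERE id > (SELECT COALESCE(MAX(id), 0) FROM target) "
            "ORDER BY id LIMIT ?", (batch,)).fetchall()
        if not rows:
            break
        conn.executemany("INSERT INTO target (id, amount) VALUES (?, ?)", rows)
        conn.commit()  # short transaction boundary
        moved += len(rows)
    return moved
```

The sketch also shows why the workaround was slow: per-batch round trips and restartable key scans add overhead that an out-of-database ETL pass avoids entirely.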
22. Use case 2: Major Insurance Provider
Business Challenge
Inability to complete processing within the weekend window to deliver new, highly personalized offers and pricing to agents via the agent marketing portal, impacting conversion rates for promotions to policyholders
Processing had to start Friday night at 6 pm, and the data load completed only by Wednesday at 6 pm
Environment
Informatica 7.x and 8.6.1, Trillium, Teradata, reporting via MicroStrategy and Hyperion/Brio, DMExpress 6.9, Maestro, Sun Solaris
Technical Challenge
500 GB of data, including joins and aggregations, had to be processed within the weekend window
Certain jobs would not run at all and had to be aborted (30+ hour runs); there was no alternative – no tuning worked
Very slow I/O when joins spill to disk; all of the system's memory is grabbed, causing virtual memory errors
No capacity in Teradata to push down transformations
Prior Attempts to Increase Performance
Tuning did not solve the problem
Dynamically adjusting cache did not solve the bottleneck
Solution
Output from Trillium is sent to DMExpress and Informatica to integrate and aggregate the data (joins and aggregations)
Started with 10 critical DMExpress jobs; now expanded to 700+ DMExpress tasks across 200 DMExpress jobs
Orchestrated within the PowerCenter Workflow Manager (command task) and also called separately from Maestro
Benefits
DMExpress completes within weekend batch window
Extremely simple and scalable approach – very short learning curve, with 1 month to deploy DMExpress
Significantly less memory used by DMExpress, allowing more parallel jobs; DMExpress takes 1/20th the disk space
23. Case Study: Enabling Up to $3M in Data Integration Cost Savings

Before – PL/SQL scripts (ELT), avg. 13.5M rows per file/table:
Read files with iWay Data Migrator → load into an Oracle staging area, dedupe and summarize using PL/SQL scripts → load into the Oracle production data warehouse → analysis & reporting
• Est. TCO over 3 years: $4.4M
• Total processing time: 2.35 hrs
• Complex architecture with PL/SQL, iWay Data Migrator, and lots of Oracle staging
• Manual coding. Manual tuning. No reusability
• No scalability to support business goals

After – DMExpress (ETL), avg. 13.5M rows per file/table:
Read files → dedupe, summarize, and load into the Oracle data warehouse with DMExpress → analysis & reporting (Vertica)
• Est. TCO over 3 years: $1.5M
• Total processing time: 3 min
• One tool. One ETL engine. No staging
• No coding. No tuning. Reusable objects
• Scalable architecture supports business growth and profitability objectives
24. POC Results – Informatica

                    Elapsed   Memory     Approx.    Max I/O read  Avg I/O read  Max I/O write  Avg I/O write
                    time      peak (MB)  CPU time   (MB/s)        (MB/s)        (MB/s)         (MB/s)
PowerCenter         0:28:10   11,875     1:06:29.2  53            12            82             39
DMExpress           0:13:26    9,438     0:16:53.9  154           33            101            66
DMExpress (Linux)   0:05:43    9,957     0:16:21    N/A           83            N/A            142

[Charts] Elapsed time, memory (GB), and CPU time compared for PowerCenter, DMExpress, and DMExpress (Linux).
26. Ab Initio Benchmark

Scenario 1 (Copy/Filter)
            Elapsed time    CPU time         Temp workspace  Records read   Records written  Data read (bytes)  Data written (bytes)
DMExpress   47 minutes      3 hours 44 min   0 GB            2,926,155,265  452,375,411      383,326,339,715    59,261,178,841
Ab Initio   66 minutes      4 hours 38 min   0 GB            2,926,155,265  452,375,411      383,326,339,715    59,261,178,841

Scenario 2 (Sort)
DMExpress   1 hour 12 min   7 hours 26 min   60 GB           2,926,155,265  2,926,155,265    383,326,339,715    383,326,339,715
Ab Initio   4 hours 42 min  9 hours 48 min   360 GB          2,926,155,265  2,926,155,265    383,326,339,715    383,326,339,715

Scenario 3 (Aggregation/Rollup)
DMExpress   1 hour 21 min   7 hours 10 min   4 GB            2,926,155,265  27,179,924       383,326,339,715    4,022,628,752
Ab Initio   2 hours         10 hours 14 min  360 GB          2,926,155,265  27,179,924       383,326,339,715    4,022,628,752

Ab Initio tuned 8 ways; DMExpress with no tuning.
28. ETL to DMExpress Acceleration / Conversion

Automatic conversion utility
• Parsing: Informatica, IBM DataStage, PL/SQL, etc.
• Processing: flow analysis, expression & type analysis, optimization
• Output generation: DMExpress, documentation

Cognizant Migration / Optimization COE inputs:
• UNIX shell scripts, Informatica workflows, Informatica mappings
• Spreadsheets identifying the production workflows and mappings
• Timing information for job executions over a two-month period
• Resource data points for the workflows
So, how do you know if DMExpress is the right technology for you? You can start by using the TDWI Checklist report for accelerating data integration. <Bring a hard copy of the Checklist and deliver to the customer>
The result is that you can normally achieve much higher performance than the leading DI tools, even with no tuning. As an example, I'm showing two benchmarks we ran at a customer site, comparing DMExpress vs. Informatica at the top and Ab Initio at the bottom.
So we talked about speed and efficiency. Now let's talk a bit more about ease of use. Most DI platforms talk about ease of use in terms of a nice GUI. However, Syncsort takes the concept of ease of use one step further to attack one of the most complex and time-consuming tasks: fine-tuning. For that, let me tell you a little bit about our technology.

Traditional data integration is manual and static. Moreover, it was not designed with efficiency in mind. This means there is a suboptimal use of resources: while these tools are very CPU- and memory-intensive, they still run I/O operations well below disk speed. Therefore, scaling requires very expensive hardware and time-consuming tuning. Every time there are changes, IT has to go back and re-tune the system.

DMExpress provides a completely different approach: it is completely automatic and dynamic. Coming from 40 years of performance expertise, the engine minimizes CPU and memory utilization while running I/O operations at or near disk speed. More importantly, it requires no tuning whatsoever, automatically adapting to changes in real time and providing automatic parallelism and pipelining. This translates into: higher performance out of the box; much better ease of use, to the point where users can design high-performance ETL tasks and jobs with minimal training; and significant savings in IT staff hours and hardware.
A Task is a basic unit of work: sort, aggregate, join, etc. A DMExpress Job is a collection of Tasks. Each Task executes in a separate process, and DMExpress automatically manages the threads for each Task.
Dun & Bradstreet. Data sizes: 5 tables of ~1 TB each. Processing need: the bottleneck step in INFA was joining the 5 tables and aggregating the output. Application: weekly reporting application on millions of DUNS numbers. Data warehouse: Oracle 10g. Original approach: ETL using INFA, not meeting SLAs; the SLA is to run this process within a week's time.

Attempts to improve performance: Tuned the ETL environment to try to meet the SLAs. No success. Converted the ETL mapping to ELT in INFA. No success; the process would abort with "ORA-01555: snapshot too old" because it ran in the database so long that the tables were being updated during the processing. Broke up the ELT process into 100,000-record batches to prevent the Oracle error. The process ran in 27 days (extrapolated)!

DMExpress benchmarked: DMExpress extracted the five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours. The output file was then read by INFA and loaded into the target table. Total run time for this step in DMExpress was 15 hours. POC environment: 4-core Linux box. DMExpress is currently in production and used as a performance complement to INFA. Current production environment: 16-core Linux box. High-level flow in production: sources → Oracle → DMX (extract, 9 hours) → flat files → DMX → INFA → target data mart.

Where they used DMX: aggregation was not fast enough; they had to presort, and there was not enough memory to aggregate without DMX. The alternative was push-down, but pushing down to a database table just to ORDER BY was not an option. DMX extracted the data in 6 hours, filtering on the fly and landing 2 to 3 TB to disk, offloading the work from the database. The detail trade data mart is transactional and very busy, so the offload really benefited the customer. A lot of Cognizant folks and a lot of time had been spent over many months.
The application is used for campaign management, portfolio management, product analysis, marketing analytics, and customer analytics. SLA: start Friday at 6 pm; the final load is on Monday at 6 pm.

Data flow: flat-file sources trickle in from 10 source systems, 200 flat files, 500 GB in total (3 customer systems, quote systems), weekly on Friday night → standardization process (INFA + DMX: aggregation and preparing data for Trillium; Friday 6 pm to Saturday 3 pm) → Trillium + DMX plug-in (customer householding and address standardization; 12 hours, ending 3 am Sunday) → DI (INFA and DMX: building customer hierarchies, i.e. aggregating customers into households, plus a bunch of roll-ups; 18 hours, ending Sunday 9 pm) → dimensional model builds and loads (sorting, joining, CDC, joining keys back to the fact) → dimensional data mart (the Teradata load time is a good portion of the 18 hours).

Some anecdotal info from Jeff (Baax):
1. Push-down was not practical: going flat file to database and back to flat file to do the work in Trillium meant network costs and database load/unload costs; loading a 40 GB file just to sort it was not an option!
2. It took the engineers only 2 weeks by themselves and enabled a 6-month deployment (1/6 of that time was DMX).
3. For one of the larger tables (150 million rows), the original approach was truncate-and-load (12 to 16 hours). They changed the approach to do CDC in DMX and load just the inserts and updates using Teradata MultiLoad. Now it takes hours to do the DMX CDC and half an hour to load the results!
4. Machine downtime and maintenance add to the complexity.
5. Database IDs get locked on Monday at 8 am (the real SLA is 8 am; an exception is needed to extend to the hard SLA of 6 pm, which causes a lot of aggravation!).
6. Due to data volume growth the customer is looking to optimize all the time; DMX provides a very easy, scalable way to deal with this need and implement the jobs.
7. DMX/INFA hand-off: today it is a file hand-off; they are exploring pipes instead of files; Maestro calls a separate DMX job; a Workflow/Session command task invokes DMX (landing a file).
When and where are the two tools both necessary? a) A huge join: they started by building a 50 GB join (30 hours); the inner join's output file gets read into INFA, which applies some business logic. b) A huge aggregation: for an INFA in-memory aggregation, do a DMX sort first, then the complex aggregation in INFA. 8. The ratio of INFA to DMX jobs is 70/30.
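The CDC-plus-MultiLoad approach mentioned in point 3, diffing the new snapshot against the prior one and loading only the inserts and updates, can be sketched as a simple snapshot comparison (an illustration with made-up row shapes, not the actual DMX job):

```python
def cdc(old_rows, new_rows):
    """Compare (key, value) snapshots and return inserts and updates.

    Rows present only in new_rows are inserts; rows whose value changed
    are updates. Rows present only in old_rows (deletes) are skipped,
    matching an insert/update-only MultiLoad target.
    """
    old = dict(old_rows)
    changes = []
    for key, value in new_rows:
        if key not in old:
            changes.append(("INSERT", key, value))
        elif old[key] != value:
            changes.append(("UPDATE", key, value))
    return changes
```

For a 150-million-row table this dict would be replaced by a merge over two key-sorted files, but the payoff is the same: the load touches only the changed rows instead of truncating and reloading everything.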
Global Payments, a leading electronic transaction processing organization serving millions of customers, uses DMExpress as their ETL standard.

Business challenges: Global Payments came to us as they were planning to consolidate all of their global operations into their US data center. With this came several challenges:
+ First, they wanted to reduce costs; that was one of the key drivers behind the initiative.
+ Reduce operational risk and improve customer service, providing a more consistent level of service across the world. (Having to manually script many of their transformations in PL/SQL, pushing transformations into a strained Oracle database, sometimes resulted in errors that could jeopardize daily operations. In addition, under the existing architecture they had to lock Oracle tables for hours, which had a huge impact on all database users.)
+ GPN wanted to open a new revenue source by offering a new service with more granular reporting to their customers. For the reasons above, transformations for the new service had to happen outside the Oracle database.
+ Cut processing times to allow for future growth; they were experiencing around 50% year-over-year data growth.
+ Global operations also meant shorter batch windows with 24x7 operations.
+ Consolidating operations meant that the staff of 5 FTEs previously managing US and NA operations would now have to manage all the international operations.
+ They wanted to go into production in less than 60 days, while minimizing any impact on existing operations.

Before / pain points: They had an architecture with iWay Data Migrator doing some of the work, but since this tool could not cope with the performance and scalability requirements, they had to hand-code many of their transformations in PL/SQL. This resulted in several pain points, including a very complex architecture due to the use of both PL/SQL and Data Migrator.
Constant tuning was required, with little or no reusability, resulting in very long development cycles and time-to-value. Their architect said the limitations of error logging in Data Migrator were a real pain; having a tool like DMExpress helped significantly in this area. Higher costs: both the hardware required by their ETL tool and the database capacity needed to execute PL/SQL scripts. One of their processes had to dedupe and summarize several tables, some exceeding 13M rows; processing was taking more than 2 hours to complete.

Benefits: We went on site and conducted a POC and a business value analysis (BVA). The results showed:
• Processing times improved dramatically, from 141 min to 3 min for key processing tasks.
• Significant savings compared to other options, including Informatica, their existing architecture, and DataStage (they had prior experience with DataStage, so they were looking heavily at it). During the BVA we did a thorough analysis of their DI strategy and TCO, evaluating operational as well as capital costs in 3 key categories: hardware costs, database/staging costs, and IT staff productivity. Note that ETL software license costs were not included in the analysis; our pricing was nevertheless very competitive and at the lower end of the competition.
The analysis showed savings of nearly US $3M over 3 years (more details about the analysis can be found on the third slide). Global Payments was able to deploy to production in approximately 4 weeks. The new architecture is helping GPN achieve their growth and profitability goals with a technology that can scale cost-effectively to support growing data volumes.

Discovery questions that helped qualify this opportunity:
• How critical is your need to reduce processing time (improve performance)?
• What is your time frame for getting the problem solved?
• What solutions have you considered?
• How many people do you have developing/maintaining PL/SQL?
• What is the size (type / number of cores) of your DB server(s)?
• Would you find it advantageous to reduce your DB costs?
• Do you know what the DB server(s) are costing you?
• What would the impact be if you could move the DI work off the DB server(s)?

Other discovery questions: the transformations taking place (sort, merge, join, look-ups), data sizes, current performance (processing) times, and the DI/DW/BI environment.