2. Introducing DMExpress™ - Fast. Efficient. Simple. Cost-Effective.
A Family of High-Performance, Purpose-Built Data Integration Tools

Integrate
→ Core ETL processing & database transformation
→ High-performance ETL offload (Oracle PL/SQL, Teradata, and others)
Optimize
→ ETL optimization for Informatica, DataStage, and others
→ Hadoop optimization for Apache, Hortonworks, Cloudera, and others
→ Rehosting optimization for Clerity, Micro Focus, Oracle, and others
→ Sort optimization for SAS, DFSORT, Trillium, and others
Migrate
→ High-performance sort for z/OS, z/VSE, and Windows/UNIX/Linux
Syncsort Confidential and Proprietary - do not copy or distribute
3. Do You Need Data Integration Optimization/Acceleration?
• ETL is taking longer and longer
• Large budgets required to purchase additional hardware and database capacity
• A shift of data integration processing into the database or to hand-coded solutions
• The data integration environment can't easily be governed, maintained, or expanded
• Inability to launch or staff initiatives due to lack of resources
• Long time-to-value
• Users may lose confidence in the data
4. What Is Optimization with DMExpress™?
Better performance – no tuning
Lower costs for:
• Hardware
• Licenses
• IT staff
Improves your ability to deliver:
• Reduces resource usage
• More work in less time
• Protects the investments you have already made
5. Examples of Optimization with DMExpress™

IBM DataStage – Major Logistics Company
→ 10x faster than DataStage Parallel
→ 26x faster than DataStage Server

Informatica – Information Service Provider
→ 27 days down to 15 hours
→ 6 weeks to production

Informatica – Major Insurance Provider
→ 1/20 of the disk space
→ Significantly less memory

ComScore
→ Cost/TB down from US $1,538 to US $46

PL/SQL – Global Payments
→ Costs reduced by $2.9 million
→ 2.35 h down to 3 min

Ab Initio – Financial Service Provider
→ 4:42 h down to 1:12 h
→ 360 GB of workspace down to 4 GB
6. DMExpress Delivers Significantly Faster Performance – Even Without Any Tuning

[Chart] Elapsed time in minutes for 1. Copy, 2. Sort, 3. Aggregate:
DMExpress (no tuning) is up to 5x faster than Informatica (tuned).

[Chart] Elapsed time in minutes for 1. Copy/Filter, 2. Sort, 3. Aggregate/Rollup:
DMExpress (no tuning) is up to 4x faster than Ab Initio (tuned).
7. DMExpress Seamlessly Scales to Support Growing Requirements

[Chart] Volume & complexity over time: business requirements grow steadily; conventional ETL falls behind from the point of problem awareness, while DMExpress keeps pace.

Seamlessly scale with DMExpress:
• No tuning
• No ELT
• Defer hardware purchases

Conventional ETL must continuously implement performance stop-gap measures:
• Manual tuning
• Add/upgrade hardware
• Push-down (ELT)
8. Fast: Intelligent Sort Algorithms – High Frequency and Impact

Sort impacts every aspect of ETL: source extract (compress & FTP), data partitioning, joining, merging, transformation, aggregation, and database load.

[Diagram] Compression (ratio up to 6x) increases throughput at every stage:
• Partition data: up to 40% faster
• Join records: up to 60% faster
• Merge & transformation: up to 50% faster
• Aggregation: up to 70% faster
• Database load & index: up to 40% faster

Syncsort has been the market-leading sort technology since 1968.
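The claim above rests on classic external sorting: sort what fits in memory, spill compressed runs to disk, then merge the runs. A minimal sketch of the technique (an illustration only, not Syncsort's patented implementation; all function names here are ours):

```python
import gzip
import heapq
import os
import tempfile

def external_sort(records, memory_limit=1000):
    """Sort an arbitrarily large record stream using bounded memory.

    In-memory runs are sorted, compressed, and spilled to disk; the final
    pass streams a k-way merge over the runs. Compressing the workspace
    trades CPU for disk I/O, the trade the slide describes.
    """
    run_files = []
    buffer = []
    for rec in records:
        buffer.append(rec)
        if len(buffer) >= memory_limit:
            run_files.append(_spill(sorted(buffer)))
            buffer = []
    if buffer:
        run_files.append(_spill(sorted(buffer)))
    streams = [_read_run(path) for path in run_files]
    try:
        # k-way merge of the sorted, compressed runs
        yield from heapq.merge(*streams)
    finally:
        for path in run_files:
            os.unlink(path)

def _spill(sorted_run):
    """Write one sorted run to a compressed temporary file."""
    fd, path = tempfile.mkstemp(suffix=".run.gz")
    os.close(fd)
    with gzip.open(path, "wt") as f:
        for rec in sorted_run:
            f.write(rec + "\n")
    return path

def _read_run(path):
    """Stream one spilled run back, record by record."""
    with gzip.open(path, "rt") as f:
        for line in f:
            yield line.rstrip("\n")
```

A real engine would add record-format awareness and overlapped I/O; the structure, bounded memory plus a streaming merge, is the same.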
9. Maximizing Performance with Optimum Resource Utilization – The Performance Triangle

[Diagram] The ETL Process Optimizer sits inside a triangle of CPU, memory, and disk/I/O, balancing partition & buffer management, pipeline parallelism, instruction cache optimization, memory cache optimization, I/O optimization, and algorithm selection against whichever resource is the current bound.

DMExpress is different:
• Patented algorithms – dynamically responds to CPU, memory, and disk availability
• Direct I/O – bypasses the file system buffer, accessing data directly at block level for higher performance
• Compression – used for read/write and, crucially, for the active workspace (minimizes disk touches and transfer volume)
10. DMExpress Dynamically Maximizes Throughput at Run Time

Conventional data integration – manual and static algorithms:
■ Scaling requires expensive hardware
■ I/O operations well below disk speed
■ Requires exhaustive tuning
■ Sub-optimal consumption of resources
■ Uses all memory, overflows to disk

Data integration with DMExpress – automatic and dynamic algorithms:
■ Extremely efficient on commodity hardware
■ I/O operations at near disk speed
■ Automatic parallelism and pipelining
■ Automatic, efficient caching and hashing
■ Minimizes disk caching
11. Efficient: Dynamic ETL Optimizer

[Diagram] The ETL Process Optimizer combines resource analysis (memory, CPU, I/O) and data analysis (data type, record format, number of records/columns) to drive partition & buffer management, pipeline parallelism, instruction cache optimization, memory cache optimization, file system I/O optimization, and algorithm selection.

The fully automatic, continuously self-tuning optimizer maximizes throughput and resource efficiency:
– Evaluates the hardware, software, and data environment
– Determines the optimal algorithmic flow at start-up
– Begins execution with an auto-generated optimizer plan
– Continuously adjusts algorithms, memory use, and parallelism based on the application and the run-time environment
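As a toy illustration of the "algorithm selection" step (hypothetical logic, not the actual optimizer), an aggregation might use a hash table when the estimated group count fits the memory budget, and fall back to a sort-based strategy otherwise:

```python
import itertools

def aggregate(rows, est_groups, memory_budget_groups=1_000_000):
    """Sum `value` per `key`, choosing the algorithm by estimated group count.

    Hash aggregation holds one accumulator per group in memory; when the
    estimated group count exceeds the budget, sort-based aggregation keeps
    memory flat by streaming over key-ordered rows instead.
    """
    if est_groups <= memory_budget_groups:
        # Hash aggregation: O(groups) memory, single pass over the input.
        totals = {}
        for key, value in rows:
            totals[key] = totals.get(key, 0) + value
        return sorted(totals.items())
    # Sort-based aggregation: sort (externally, in a real engine),
    # then stream group by group with O(1) state per group.
    out = []
    for key, grp in itertools.groupby(sorted(rows), key=lambda r: r[0]):
        out.append((key, sum(v for _, v in grp)))
    return out
```

A production optimizer makes this choice from live statistics rather than a caller-supplied estimate, and can revisit it mid-run; the decision shape is what matters here.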
12. Design Once, Inherit Performance

[Diagram] An ETL job reads from sources, joins, aggregates, and writes to targets (EDW); DMExpress maps the job's tasks onto dynamically managed threads.

• Each ETL task runs in a separate process
• Automatic, dynamic thread management for each task
• Automatic parallelism and pipelining
• Automatic, dynamic algorithm selection
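The task model above, independent tasks wired so that downstream stages consume records while upstream stages are still producing them, can be sketched as follows. This is an illustrative sketch only: DMExpress runs each task in a separate process, whereas for portability this toy version uses threads and queues, which share the same pipelining structure.

```python
import queue
import threading

SENTINEL = object()  # marks end-of-stream between pipeline stages

def reader(records, out_q):
    """Source task: emit records downstream as they are produced."""
    for rec in records:
        out_q.put(rec)
    out_q.put(SENTINEL)

def transformer(in_q, out_q):
    """Transform task: consumes upstream output concurrently (pipelining)."""
    while (rec := in_q.get()) is not SENTINEL:
        out_q.put(rec.upper())  # stand-in for a real transformation
    out_q.put(SENTINEL)

def run_pipeline(records):
    """Wire the tasks together and collect the final output."""
    q1, q2 = queue.Queue(), queue.Queue()
    stages = [threading.Thread(target=reader, args=(records, q1)),
              threading.Thread(target=transformer, args=(q1, q2))]
    for t in stages:
        t.start()
    results = []
    while (rec := q2.get()) is not SENTINEL:
        results.append(rec)
    for t in stages:
        t.join()
    return results
```

Because every stage runs concurrently, total elapsed time approaches the cost of the slowest stage rather than the sum of all stages, which is the point of pipelining.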
14. DMExpress Architecture Delivers Maximum Performance and Data Scalability with Automatic, Dynamic Optimizations

Integration / customization (SDK, open APIs)
Graphical development environment

DMExpress Engine
• High-performance transformations: Sort, Merge, Aggregate, Join/Lookup, Copy, Load Presort, Filter, Reformat, Partition, Advanced Text Processing, Data Partitioning
• User-defined functions plus built-in functions: Numeric, Text, Date and Time, Logical
• Automatic, continuous optimization across deployment, metadata, algorithms, and processing time

Source/target connectivity
15. Five Simple Steps to Deploy. Tuning Is NOT One of Them.

1. Install DMExpress
   • Single install
   • Takes less than 5 minutes
2. Choose a "Task" template
   • Primary tasks: Sort, Merge, Aggregate, Join/Lookup, Copy
   • Secondary tasks: Filter, Reformat, Partition
3. Fill in the blanks
   • Connectivity
   • Standard functions: Numeric, Text, Date/Time, Logical
   • User-defined functions
4. Integrate
   • Create complete ETL "jobs" by combining multiple "tasks"
   • Define flows – from files to direct flows
5. Deploy
   • Schedule
   • Parameterize
   • Monitor
16. Syncsort DMExpress Is Simple but Powerful
An intuitive graphical interface enables development and maintenance:
• Graphical development environment
• Expression builder
• Job/task diff – detects differences between development, test, and production environments
→ No coding required
→ No tuning required
→ Easily build and edit jobs and tasks
→ Users are fully functional within a few days
17. DMExpress Architecture

[Diagram] DMExpress clients (job editor, task editor, command line) work against a flat-file-based metadata repository, with check-in/check-out to a third-party version control tool. Design services and a time view run on the local Windows/UNIX/Linux data server; the DMExpress engine runs on local or remote servers against the data sources/targets.
19. Acceleration POC – Scenario A

[Chart] Processing time in minutes for the 'high-load jobs':
• DataStage Parallel: 32 min on 4/6 cores (virtual), Linux
• DMExpress: 19 min on 1 core (physical/virtual), Linux
→ 1/2 the time on 1/6 the hardware
20. Acceleration POC – Scenario B

[Chart] Processing time in minutes for 'Scenario B':
• DataStage Server: 40.00 min on 14 cores (physical), HP-UX
• DMExpress: 21.30 min on 1 core (virtual), Linux
→ 1/2 the time on 1/14 the hardware
21. Use Case 1: Global Information Service Provider
Business Challenge
Severe competitive pressure from Google Finance, Yahoo! Finance, Morningstar, and others forced development of strategic
new offerings
Environment
Informatica 8.11 SP3, Oracle 10.2 RAC (6 nodes), DMExpress 5.2.15, 16-core Linux machine
Technical Challenge
Weekly Reporting application on 8 million DUNS numbers
Data Sizes: 5 tables of ~1 TB each
Bottleneck step was to join 5 tables and aggregate the output
Prior Attempts to Increase Performance
Manual tuning of the ETL routines – many consultant-months and dollars spent
Converted the ETL mapping to ELT. No success – the process would abort with "ORA-01555: snapshot too old"
Broke the ELT process into 100,000-record batches to prevent the Oracle error. The process ran in 27 days (extrapolated)
The problem had existed since February 2009, through many attempts and touch points, with production due in October
Solution
DMExpress extracted the five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours – a total run time of 15 hours for this step in DMExpress vs. 27 days
DMExpress is invoked at the command line prior to Informatica
Benefits
New offering launched on time
Able to meet SLAs
2 weeks to finish POC
In production in 6 weeks
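For context on the batched-ELT workaround described under Prior Attempts: committing every 100,000 rows keeps each transaction short, so no single statement holds read consistency long enough to trigger ORA-01555. A rough sketch of that pattern (using SQLite purely for illustration; all table and column names are hypothetical):

```python
import sqlite3

BATCH = 100_000  # commit every N rows so no transaction runs long

def copy_in_batches(conn, batch=BATCH):
    """Move rows from a staging table to a target in short transactions.

    Each iteration reads the next key range and commits immediately. On
    Oracle, short transactions avoid ORA-01555 because no query has to
    reconstruct a read-consistent image across the whole multi-day run.
    Table and column names here are made up for the sketch.
    """
    moved = 0
    while True:
        rows = conn.execute(
            "SELECT id, amount FROM staging "
            "WHERE id > (SELECT COALESCE(MAX(id), 0) FROM target) "
            "ORDER BY id LIMIT ?", (batch,)).fetchall()
        if not rows:
            break
        conn.executemany("INSERT INTO target (id, amount) VALUES (?, ?)", rows)
        conn.commit()  # short transaction boundary
        moved += len(rows)
    return moved
```

The sketch also shows why the workaround was slow: per-batch round trips and restartable key scans add overhead that an out-of-database ETL pass avoids entirely.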
22. Use case 2: Major Insurance Provider
Business Challenge
Inability to complete processing within the weekend window to deliver new, highly personalized offers and pricing to agents via the agent marketing portal, impacting conversion rates for promotions to policyholders
Processing had to start Friday night at 6 pm, and the data load completed only by Wednesday at 6 pm
Environment
Informatica 7.x and 8.6.1, Trillium, Teradata, reporting via MicroStrategy and Hyperion/Brio, DMExpress 6.9, Maestro, Sun Solaris
Technical Challenge
500 GB of data, including joins and aggregations, had to be processed within the weekend window
Certain jobs would not run at all and had to be aborted (30+ hour runs); there was no alternative – no tuning worked
Very slow I/O when joins spill to disk; all of the system's memory is grabbed, causing virtual memory errors
No capacity in Teradata to push down transformations
Prior Attempts to Increase Performance
Tuning did not solve the problem
Dynamically adjusting cache did not solve the bottleneck
Solution
Output from Trillium is sent to DMExpress and Informatica to integrate and aggregate the data (joins and aggregations)
Started with 10 critical DMExpress jobs; now expanded to 700+ DMExpress tasks across 200 DMExpress jobs
Orchestrated within the PowerCenter Workflow Manager (command task) and also called separately from Maestro
Benefits
DMExpress completes within weekend batch window
Extremely simple and scalable approach – very short learning curve, with 1 month to deploy DMExpress
Significantly less memory used by DMExpress, allowing more parallel jobs; DMExpress takes 1/20th the disk space
23. Case Study: Enabling Up to $3M in Data Integration Cost Savings

Before – PL/SQL scripts (ELT), avg. 13.5M rows per file/table:
Read files with iWay Data Migrator → load into an Oracle staging area, dedupe and summarize using PL/SQL scripts → load into the Oracle production data warehouse → analysis & reporting
• Est. TCO over 3 years: $4.4M
• Total processing time: 2.35 hrs
• Complex architecture with PL/SQL, iWay Data Migrator, and lots of Oracle staging
• Manual coding. Manual tuning. No reusability
• No scalability to support business goals

After – DMExpress (ETL), avg. 13.5M rows per file/table:
Read files → dedupe, summarize, and load into the Oracle data warehouse with DMExpress → analysis & reporting (Vertica)
• Est. TCO over 3 years: $1.5M
• Total processing time: 3 min
• One tool. One ETL engine. No staging
• No coding. No tuning. Reusable objects
• Scalable architecture supports business growth and profitability objectives
24. POC Results – Informatica

                    Elapsed   Memory     Approx.    Max I/O read  Avg I/O read  Max I/O write  Avg I/O write
                    time      peak (MB)  CPU time   (MB/s)        (MB/s)        (MB/s)         (MB/s)
PowerCenter         0:28:10   11,875     1:06:29.2  53            12            82             39
DMExpress           0:13:26    9,438     0:16:53.9  154           33            101            66
DMExpress (Linux)   0:05:43    9,957     0:16:21    N/A           83            N/A            142

[Charts] Elapsed time, memory (GB), and CPU time compared for PowerCenter, DMExpress, and DMExpress (Linux).
26. Ab Initio Benchmark

Scenario 1 (Copy/Filter)
            Elapsed time    CPU time         Temp workspace  Records read   Records written  Data read (bytes)  Data written (bytes)
DMExpress   47 minutes      3 hours 44 min   0 GB            2,926,155,265  452,375,411      383,326,339,715    59,261,178,841
Ab Initio   66 minutes      4 hours 38 min   0 GB            2,926,155,265  452,375,411      383,326,339,715    59,261,178,841

Scenario 2 (Sort)
DMExpress   1 hour 12 min   7 hours 26 min   60 GB           2,926,155,265  2,926,155,265    383,326,339,715    383,326,339,715
Ab Initio   4 hours 42 min  9 hours 48 min   360 GB          2,926,155,265  2,926,155,265    383,326,339,715    383,326,339,715

Scenario 3 (Aggregation/Rollup)
DMExpress   1 hour 21 min   7 hours 10 min   4 GB            2,926,155,265  27,179,924       383,326,339,715    4,022,628,752
Ab Initio   2 hours         10 hours 14 min  360 GB          2,926,155,265  27,179,924       383,326,339,715    4,022,628,752

Ab Initio tuned 8 ways; DMExpress with no tuning.
28. ETL to DMExpress Acceleration / Conversion

Automatic conversion utility
• Parsing: Informatica, IBM DataStage, PL/SQL, etc.
• Processing: flow analysis, expression & type analysis, optimization
• Output generation: DMExpress, documentation

Cognizant Migration / Optimization COE inputs:
• UNIX shell scripts, Informatica workflows, Informatica mappings
• Spreadsheets identifying the production workflows and mappings
• Timing information for job executions over a two-month period
• Resource data points for the workflows
So, how do you know if DMExpress is the right technology for you? You can start by using the TDWI Checklist report for accelerating data integration. <Bring a hard copy of the Checklist and deliver to the customer>
The result is that you can normally achieve much higher performance than the leading DI tools, even with no tuning. As an example, I'm showing two benchmarks we ran at a customer site, comparing DMExpress vs. Informatica at the top and Ab Initio at the bottom.
So we talked about speed and efficiency. Now let's talk a bit more about ease of use. Most DI platforms talk about ease of use in terms of a nice GUI. However, Syncsort takes the concept of ease of use one step further to attack one of the most complex and time-consuming tasks: fine-tuning. For that, let me tell you a little bit about our technology.

Traditional data integration is manual and static. Moreover, it was not designed with efficiency in mind. This means there is a suboptimal use of resources: while these tools are very CPU- and memory-intensive, they still run I/O operations well below disk speed. Therefore, scaling requires very expensive hardware and time-consuming tuning. Every time there are changes, IT has to go back and re-tune the system.

DMExpress provides a completely different approach: it is completely automatic and dynamic. Coming from 40 years of performance expertise, the engine minimizes CPU and memory utilization while running I/O operations at or near disk speed. More importantly, it requires no tuning whatsoever, automatically adapting to changes in real time and providing automatic parallelism and pipelining. This translates into: higher performance out of the box; much better ease of use, to the point where users can design high-performance ETL tasks and jobs with minimal training; and significant savings in IT staff hours and hardware.
A Task is a basic unit of work: sort, aggregate, join, etc. A DMExpress Job is a collection of Tasks. Each Task executes in a separate process, and DMExpress automatically manages the threads for each Task.
Dun & Bradstreet. Data sizes: 5 tables of ~1 TB each. Processing need: the bottleneck step in INFA was joining the 5 tables and aggregating the output. Application: weekly reporting application on millions of DUNS numbers. Data warehouse: Oracle 10g. Original approach: ETL using INFA, not meeting SLAs; the SLA is to run this process within a week's time.

Attempts to improve performance: Tuned the ETL environment to try to meet the SLAs. No success. Converted the ETL mapping to ELT in INFA. No success; the process would abort with "ORA-01555: snapshot too old" because it ran in the database so long that the tables were being updated during the processing. Broke up the ELT process into 100,000-record batches to prevent the Oracle error. The process ran in 27 days (extrapolated)!

DMExpress benchmarked: DMExpress extracted the five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours. The output file was then read by INFA and loaded into the target table. Total run time for this step in DMExpress was 15 hours. POC environment: 4-core Linux box. DMExpress is currently in production and used as a performance complement to INFA. Current production environment: 16-core Linux box. High-level flow in production: sources → Oracle → DMX (extract, 9 hours) → flat files → DMX → INFA → target data mart.

Where they used DMX: aggregation was not fast enough; they had to presort, and there was not enough memory to aggregate without DMX. The alternative was push-down, but pushing down to a database table just to ORDER BY was not an option. DMX extracted the data in 6 hours, filtering on the fly and landing 2 to 3 TB to disk, offloading the work from the database. The detail trade data mart is transactional and very busy, so the offload really benefited the customer. A lot of Cognizant folks and a lot of time had been spent over many months.
The application is used for campaign management, portfolio management, product analysis, marketing analytics, and customer analytics. SLA: start Friday at 6 pm; the final load is on Monday at 6 pm.

Data flow: flat-file sources trickle in from 10 source systems, 200 flat files, 500 GB in total (3 customer systems, quote systems), weekly on Friday night → standardization process (INFA + DMX: aggregation and preparing data for Trillium; Friday 6 pm to Saturday 3 pm) → Trillium + DMX plug-in (customer householding and address standardization; 12 hours, ending 3 am Sunday) → DI (INFA and DMX: building customer hierarchies, i.e. aggregating customers into households, plus a bunch of roll-ups; 18 hours, ending Sunday 9 pm) → dimensional model builds and loads (sorting, joining, CDC, joining keys back to the fact) → dimensional data mart (the Teradata load time is a good portion of the 18 hours).

Some anecdotal info from Jeff (Baax):
1. Push-down was not practical: going flat file to database and back to flat file to do the work in Trillium meant network costs and database load/unload costs; loading a 40 GB file just to sort it was not an option!
2. It took the engineers only 2 weeks by themselves and enabled a 6-month deployment (1/6 of that time was DMX).
3. For one of the larger tables (150 million rows), the original approach was truncate-and-load (12 to 16 hours). They changed the approach to do CDC in DMX and load just the inserts and updates using Teradata MultiLoad. Now it takes hours to do the DMX CDC and half an hour to load the results!
4. Machine downtime and maintenance add to the complexity.
5. Database IDs get locked on Monday at 8 am (the real SLA is 8 am; an exception is needed to extend to the hard SLA of 6 pm, which causes a lot of aggravation!).
6. Due to data volume growth the customer is looking to optimize all the time; DMX provides a very easy, scalable way to deal with this need and implement the jobs.
7. DMX/INFA hand-off: today it is a file hand-off; they are exploring pipes instead of files; Maestro calls a separate DMX job; a Workflow/Session command task invokes DMX (landing a file).
When and where are the two tools both necessary? a) A huge join: they started by building a 50 GB join (30 hours); the inner join's output file gets read into INFA, which applies some business logic. b) A huge aggregation: for an INFA in-memory aggregation, do a DMX sort first, then the complex aggregation in INFA. 8. The ratio of INFA to DMX jobs is 70/30.
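The CDC-plus-MultiLoad approach mentioned in point 3, diffing the new snapshot against the prior one and loading only the inserts and updates, can be sketched as a simple snapshot comparison (an illustration with made-up row shapes, not the actual DMX job):

```python
def cdc(old_rows, new_rows):
    """Compare (key, value) snapshots and return inserts and updates.

    Rows present only in new_rows are inserts; rows whose value changed
    are updates. Rows present only in old_rows (deletes) are skipped,
    matching an insert/update-only MultiLoad target.
    """
    old = dict(old_rows)
    changes = []
    for key, value in new_rows:
        if key not in old:
            changes.append(("INSERT", key, value))
        elif old[key] != value:
            changes.append(("UPDATE", key, value))
    return changes
```

For a 150-million-row table this dict would be replaced by a merge over two key-sorted files, but the payoff is the same: the load touches only the changed rows instead of truncating and reloading everything.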
Global Payments, a leading electronic transaction processing organization serving millions of customers, uses DMExpress as their ETL standard.

Business challenges: Global Payments came to us as they were planning to consolidate all of their global operations into their US data center. With this came several challenges:
+ First, they wanted to reduce costs; that was one of the key drivers behind the initiative.
+ Reduce operational risk and improve customer service, providing a more consistent level of service across the world. (Having to manually script many of their transformations in PL/SQL, pushing transformations into a strained Oracle database, sometimes resulted in errors that could jeopardize daily operations. In addition, under the existing architecture they had to lock Oracle tables for hours, which had a huge impact on all database users.)
+ GPN wanted to open a new revenue source by offering a new service with more granular reporting to their customers. For the reasons above, transformations for the new service had to happen outside the Oracle database.
+ Cut processing times to allow for future growth; they were experiencing around 50% year-over-year data growth.
+ Global operations also meant shorter batch windows with 24x7 operations.
+ Consolidating operations meant that the staff of 5 FTEs previously managing US and NA operations would now have to manage all the international operations.
+ They wanted to go into production in less than 60 days, while minimizing any impact on existing operations.

Before / pain points: They had an architecture with iWay Data Migrator doing some of the work, but since this tool could not cope with the performance and scalability requirements, they had to hand-code many of their transformations in PL/SQL. This resulted in several pain points, including a very complex architecture due to the use of both PL/SQL and Data Migrator.
Constant tuning was required, with little or no reusability, resulting in very long development cycles and time-to-value. Their architect said the limitations of error logging in Data Migrator were a real pain; having a tool like DMExpress helped significantly in this area. Higher costs: both the hardware required by their ETL tool and the database capacity needed to execute PL/SQL scripts. One of their processes had to dedupe and summarize several tables, some exceeding 13M rows; processing was taking more than 2 hours to complete.

Benefits: We went on site and conducted a POC and a business value analysis (BVA). The results showed:
• Processing times improved dramatically, from 141 min to 3 min for key processing tasks.
• Significant savings compared to other options, including Informatica, their existing architecture, and DataStage (they had prior experience with DataStage, so they were looking heavily at it). During the BVA we did a thorough analysis of their DI strategy and TCO, evaluating operational as well as capital costs in 3 key categories: hardware costs, database/staging costs, and IT staff productivity. Note that ETL software license costs were not included in the analysis; our pricing was nevertheless very competitive and at the lower end of the competition.
The analysis showed savings of nearly US $3M over 3 years (more details about the analysis can be found on the third slide). Global Payments was able to deploy to production in approximately 4 weeks. The new architecture is helping GPN achieve their growth and profitability goals with a technology that can scale cost-effectively to support growing data volumes.

Discovery questions that helped qualify this opportunity:
• How critical is your need to reduce processing time (improve performance)?
• What is your time frame for getting the problem solved?
• What solutions have you considered?
• How many people do you have developing/maintaining PL/SQL?
• What is the size (type / number of cores) of your DB server(s)?
• Would you find it advantageous to reduce your DB costs?
• Do you know what the DB server(s) are costing you?
• What would the impact be if you could move the DI work off the DB server(s)?

Other discovery questions: the transformations taking place (sort, merge, join, look-ups), data sizes, current performance (processing) times, and the DI/DW/BI environment.