Optimization with DMExpress

Steven Haddad – Senior Software Architect
shaddad@syncsort.com
Introducing DMExpress™ - Fast. Efficient. Simple. Cost Effective.

A Family of High-Performance, Purpose-Built Data Integration Tools

Integrate, Optimize, Migrate:

→ High-Performance ETL: for core ETL processing & database transformation offload (Oracle PL/SQL, Teradata, and others)
→ ETL Optimization: for Informatica, DataStage, and others
→ Hadoop Optimization: for Apache, HortonWorks, Cloudera, and others
→ Rehosting Optimization: for Clerity, MicroFocus, Oracle, and others
→ High-Performance Sort: for z/OS, z/VSE, and Windows/UNIX/Linux
→ Sort Optimization: for SAS, DFSORT, Trillium, and others



Syncsort Confidential and Proprietary - do not copy or distribute                                                                                  3
Do You Need Data Integration Optimization/Acceleration?
• ETL is taking longer and longer
• Large budgets to purchase additional hardware and databases
• A shift in data integration processing to database or hand-coded solutions
• Data integration environment can’t easily be governed, maintained, or expanded
• Inability to launch or staff initiatives due to lack of resources
• Long time-to-value
• Users may lose confidence in data
What is Optimization with DMExpress™?
• Better performance with no tuning
• Lower costs for:
   – Hardware
   – Licenses
   – IT staff
• Improves your capability to deliver
• Reduces resource usage
• More work in less time
• Protects your existing investment
Examples for Optimization with DMExpress™

IBM DataStage
   → 10x faster than DataStage Parallel (Major Logistics Company)
   → 26x faster than DataStage Server (Major Logistics Company)
Informatica
   → 27 days down to 15 hours; 6 weeks to production (Information Service Provider)
   → 1/20 of the disk space; significantly less memory (Major Insurance Provider)
   → Cost per TB down from US$1,538 to US$46 (ComScore)
PL/SQL
   → Costs reduced by US$2.9 million; 2.35 h down to 3 min (Global Payments)
Ab Initio
   → 4:42 h down to 1:12 h; 360 GB down to 4 GB WS (Financial Service Provider)

DMExpress Delivers Significantly Faster Performance
  Even Without Any Tuning

[Chart 1: Elapsed Time (m) for 1. Copy, 2. Sort, 3. Aggregate; INFA vs. DMExpress. DMExpress is up to 5x faster (DMExpress: no tuning; Informatica: tuned).]

[Chart 2: Elapsed Time (m) for 1. Copy / Filter, 2. Sort, 3. Aggregate / Rollup; Ab Initio vs. DMExpress. DMExpress is up to 4x faster (DMExpress: no tuning; Ab Initio: tuned).]
DMExpress Seamlessly Scales to Support Growing Requirements
[Chart: Volume & Complexity over Time, with curves for Business Requirements, Conventional ETL, and DMExpress.]

DMExpress: seamlessly scale
• No tuning
• No ELT
• Defer hardware purchases

Conventional ETL: from the point of problem awareness, continuously implement performance stop-gap measures
• Manual tuning
• Add/upgrade hardware
• Push-down (ELT)
Fast: Intelligent Sort Algorithms
High Frequency and Impact

Sort impacts every aspect of ETL:
• Source Extract, Compress & FTP: compression ratio increases 6X
• Partition Data: up to 40% faster
• Joining Records: up to 60% faster
• Merge & Transformation: up to 50% faster
• Aggregation: up to 70% faster
• Database Load & Index: up to 40% faster

Syncsort has been the market leading sort technology since 1968
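
Why sorting matters for a step such as aggregation can be shown with a minimal Python sketch: once records are sorted by key, each group is contiguous and can be aggregated in a single streaming pass. The record layout and the name `aggregate_sorted` are hypothetical, and this is a generic illustration, not the DMExpress algorithm.

```python
from itertools import groupby
from operator import itemgetter

def aggregate_sorted(records):
    """Sum amounts per key; input may arrive in any order."""
    # Sorting makes equal keys adjacent, so groupby can stream over each group.
    ordered = sorted(records, key=itemgetter(0))
    return [(key, sum(amount for _, amount in group))
            for key, group in groupby(ordered, key=itemgetter(0))]

# Hypothetical (customer, amount) records, not data from the slides.
sales = [("b", 5), ("a", 1), ("b", 7), ("a", 2), ("c", 4)]
totals = aggregate_sorted(sales)
print(totals)  # [('a', 3), ('b', 12), ('c', 4)]
```

The same idea underlies sort-merge joins: with both inputs sorted on the join key, matching records can be found in one forward pass over each input.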
Maximizing Performance with Optimum Resource Utilization

[Diagram: The Performance Triangle, with corners CPU, I/O, and Memory and the label "Disk & I/O Bound". At the center, the ETL Process Optimizer is surrounded by Partition & Pipeline Parallelism, Buffer Management, Instruction Cache Optimization, Memory Cache Optimization, I/O Optimization, and Algorithm Selection.]

DMExpress Is Different
• Patented Algorithms: dynamically respond to CPU, memory & disk availability
• Direct I/O: bypasses the file system buffer, accessing data directly at block level for higher performance
• Compression: used for read/write and, crucially, active workspace (minimizes disk touches & transfer volume)
DMExpress Dynamically Maximizes Throughput at Run Time

Conventional Data Integration: Manual and Static
[Chart: Algorithms vs. Processing Time]
■ Scaling requires expensive hardware
■ I/O operations well below disk speed
■ Requires exhaustive tuning
■ Sub-optimal consumption of resources
   ■ Uses all memory, overflows to disk

Data Integration with DMExpress: Automatic and Dynamic
[Chart: Algorithms vs. Processing Time]
■ Extremely efficient on commodity hardware
■ I/O operations at near disk speed
■ Automatic parallelism and pipelining
■ Automatic, efficient caching and hashing
   ■ Minimizes disk caching

Efficient: Dynamic ETL Optimizer
[Diagram: Resource Analysis (Memory, CPU, I/O, File System) and Data Analysis (Data Type, Record Format, #Records / Columns) feed the ETL Process Optimizer, which covers Partition & Pipeline Parallelism, Buffer Management, Instruction Cache Optimization, Memory Cache Optimization, I/O Optimization, and Algorithm Selection.]

Fully automatic, continuously self-tuning optimizer maximizes throughput and resource efficiencies
 –   Evaluates hardware, software, and data environment
 –   Determines optimal algorithmic flow at start-up
 –   Begins execution with auto-generated optimizer plan
 –   Continuously adjusts algorithms, memory use, parallelism based on application and run time environment
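
The general idea of choosing an algorithm from the observed environment can be sketched in Python. This is a toy illustration under stated assumptions, not the DMExpress optimizer (which is proprietary): `adaptive_sort` and `memory_budget` are hypothetical names, and the only "resource analysis" is comparing the record count against a record budget. Small inputs are sorted in memory; larger ones are sorted in runs, spilled to temporary files, and combined with a k-way merge.

```python
import heapq
import os
import tempfile

def _spill(sorted_chunk):
    """Write one sorted run to a temporary file and return its path."""
    with tempfile.NamedTemporaryFile("w", delete=False, suffix=".run") as f:
        f.writelines(f"{value}\n" for value in sorted_chunk)
        return f.name

def _read_run(path):
    """Stream integers back from a spilled run."""
    with open(path) as f:
        for line in f:
            yield int(line)

def adaptive_sort(values, memory_budget):
    """Pick a sort strategy from the input size and a record budget.

    memory_budget is a hypothetical limit (> 0) on records sorted at once.
    For simplicity the sketch materializes the input list up front.
    Returns (strategy_name, sorted_list).
    """
    values = list(values)
    if len(values) <= memory_budget:
        return "in-memory", sorted(values)
    # Too large for the budget: sort budget-sized runs, spill, then merge.
    paths = [_spill(sorted(values[i:i + memory_budget]))
             for i in range(0, len(values), memory_budget)]
    try:
        merged = list(heapq.merge(*(_read_run(p) for p in paths)))
    finally:
        for p in paths:
            os.remove(p)
    return "external-merge", merged

print(adaptive_sort([3, 1, 2], memory_budget=10))
print(adaptive_sort([9, 4, 7, 1, 8, 2], memory_budget=2))
```

A real optimizer would also weigh CPU count, I/O bandwidth, and data types, and would re-evaluate its plan while running; the sketch makes a single decision at start-up.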

Design Once, Inherit Performance

[Diagram: an ETL job flows from Sources through Read, Join, Aggregate, and Write tasks to Targets (EDW, DM); the tasks are shown with thread management and dynamic optimizations.]

                                             • Each ETL task runs on a separate process
                                             • Automatic, dynamic thread management for each task
                                             • Automatic parallelism and pipelining
                                             • Automatic, dynamic algorithm selection
Architecture
DMExpress – White Boarding the Data
        Acceleration Sales
DMExpress Architecture Delivers Maximum Performance and Data Scalability with Automatic Dynamic Optimizations

[Diagram: layered architecture]
• Graphical Development Environment
• DMExpress Engine
   – High Performance Transformations: Sort, Merge, Aggregate, Join / Lookup, Copy, Load Presort, Filter, Reformat, Partition
   – User Defined Functions
   – Built-in Functions: Numeric, Text, Date and Time, Logical, Advanced Text Processing, Data Partitioning
   – Automatic Continuous Optimization (chart: Algorithms vs. Processing Time)
• Source/Target Connectivity
• Alongside all layers: Metadata; Integration / Customization (SDK, Open APIs); Deployment
Five Simple Steps to Deploy. Tuning Is NOT One of Them.

1. Install DMExpress
   • Single install
   • Takes less than 5 minutes
2. Choose “Task” Template
   • Primary Tasks: Sort, Merge, Aggregate, Join / Lookup, Copy
   • Secondary Tasks: Filter, Reformat, Partition
3. Fill in the blanks
   • Connectivity
   • Standard Functions: Numeric, Text, Date/Time, Logical
   • User-defined Functions
4. Integrate
   • Create Complete ETL “Jobs” by Combining Multiple “Tasks”
   • Define Flows – from files to direct flows
5. Deploy
   • Schedule
   • Parameterize
   • Monitor
Syncsort DMExpress Is Simple but Powerful
Intuitive Graphical Interface Enables Development and Maintenance

• Graphical Development Environment
• Expression Builder
• Job/Task Diff

→ No coding required
→ No tuning required
→ Easily build/edit jobs and tasks
→ Detect differences between development, test, and production environments
→ Users are fully functional within a few days




DMExpress Architecture
[Diagram, top to bottom:
 • DMExpress Clients: Job Editor, Task Editor, Command Line
 • Flat File Based Metadata Repository, with check-in/check-out to a 3rd party version control tool
 • Services on a Local Server and a Remote Server (Windows / Unix / Linux); Design Time View of Data
 • DMExpress Engine
 • Data Sources / Targets]
Use Cases
Acceleration POC – Scenario A
Processing Time in Minutes of ‘High Load Jobs’

                    DataStage Parallel           DMExpress
Processing time     32 min                       19 min
Hardware            4/6 cores (Physical/Virt.)   1 core (Virtual)
OS                  Linux                        Linux

→ 1/2 the time
→ 1/6 the hardware
Acceleration POC – Scenario B
Processing Time in Minutes of ‘Scenario B’

                    DataStage Server      DMExpress
Processing time     40.00 min             21.30 min
Hardware            14 cores (Physical)   1 core (Virtual)
OS                  HP-UX                 Linux

→ 1/2 the time
→ 1/14 the hardware
Use Case 1: Global Information Service Provider
• Business Challenge
   – Severe competitive pressure from Google Finance, Yahoo! Finance, Morningstar, and others forced development of strategic new offerings
• Environment
   – Informatica 8.11 SP3, Oracle 10.2 RAC (6 nodes), DMExpress 5.2.15
   – 16-core Linux machine
• Technical Challenge
   – Weekly reporting application on 8 million DUNS numbers
   – Data sizes: 5 tables of ~1 TB each
   – The bottleneck step was joining the 5 tables and aggregating the output
• Prior Attempts to Increase Performance
   – Manual tuning of ETL routines: many consultants, months, and dollars spent
   – Converted the ETL mapping to ELT without success: the process aborted with ORA-01555 (snapshot too old)
   – Broke up the ELT process into 100,000-record batches to prevent the Oracle error; the process then ran in 27 days (extrapolated)
   – The problem had existed since February 2009, with many attempts and touch points; production in October
• Solution
   – DMExpress extracted the five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours: a total run time of 15 hours for this step in DMExpress vs. 27 days
   – DMExpress is invoked at the command line prior to Informatica
• Benefits
   – New offering launched on time
   – Able to meet SLAs
   – 2 weeks to finish POC
   – In production in 6 weeks
Use Case 2: Major Insurance Provider
• Business Challenge
   – Unable to complete processing within the weekend window to deliver new, highly personalized offers and pricing to agents via the agent marketing portal, which impacts conversion rates for promotions to policyholders
   – Processing had to start on Friday at 6 pm, and the data load was done only by Wednesday at 6 pm
• Environment
   – Informatica versions 7.x and 8.6.1, Trillium, Teradata, reporting with MicroStrategy and Hyperion/Brio, DMExpress 6.9, Maestro, Sun Solaris
• Technical Challenge
   – 500 GB of data, including joins and aggregations, must be processed during the weekend window
   – Certain jobs would not even run to completion and had to be aborted (30+ hour runs). No alternative; no tuning worked
   – Very slow I/O when joins spill to disk. All of the memory on the system is consumed, leading to virtual memory errors
   – No capacity in Teradata to push down transformations
• Prior Attempts to Increase Performance
   – Tuning did not solve the problem
   – Dynamically adjusting the cache did not remove the bottleneck
• Solution
   – Output from Trillium is sent to DMExpress and Informatica to integrate and aggregate the data (joins and aggregations)
   – Started out with 10 critical DMExpress jobs and has now expanded to 700+ DMExpress tasks in 200 DMExpress jobs
   – Orchestrated within PowerCenter Workflow Manager (command task) and also called separately from Maestro
• Benefits
   – DMExpress completes within the weekend batch window
   – Extremely simple and scalable approach with a very short learning curve: 1 month to deploy DMExpress
   – Significantly less memory used by DMX, allowing more parallel jobs. DMExpress takes 1/20th of the disk space
Case Study: Enabling Up to $3M in Data Integration Cost Savings

Before: PL/SQL Scripts (ELT)
[Diagram labels: ETLTL, Oracle, Data Migrator, Oracle, Analytics; avg. 13.5M rows per file/table]
   1. Read files
   2. Load into staging area, dedupe, and summarize using PL/SQL scripts and iWay Data Migrator
   3. Load into the Oracle production data warehouse for analysis & reporting
   • Est. TCO over 3 years: $4.4M
   • Total processing time: 2.35 hrs
   • Complex architecture with PL/SQL, iWay Data Migrator and lots of Oracle staging
   • Manual coding. Manual tuning. No reusability
   • No scalability to support business goals

After: DMExpress (ETL)
[Diagram labels: DMExpress, Oracle, Vertica, Analytics; avg. 13.5M rows per file/table]
   1. Read files
   2. Dedupe, summarize and load into Oracle data warehouse
   3. Analysis & reporting
   • Est. TCO over 3 years: $1.5M
   • Total processing time: 3 min
   • One tool. One ETL engine. No staging
   • No coding. No tuning. Reusable objects
   • Scalable architecture supports business growth and profitability objectives
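
The headline savings figure follows directly from the two TCO estimates on this slide. A minimal Python check, working in thousands of dollars to avoid floating-point noise (the variable names are illustrative only):

```python
# Figures from the slide: estimated 3-year TCO, in thousands of US dollars.
TCO_BEFORE_K = 4_400   # PL/SQL scripts (ELT)
TCO_AFTER_K = 1_500    # DMExpress (ETL)

savings_k = TCO_BEFORE_K - TCO_AFTER_K
reduction_pct = 100 * savings_k / TCO_BEFORE_K

print(f"Savings: ${savings_k / 1000:.1f}M ({reduction_pct:.0f}% lower TCO)")
# Savings: $2.9M (66% lower TCO)
```

The $2.9M difference is consistent with the "up to $3M" in the slide title.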
POC Results – Informatica

                                                               I/O utilization (MB/s)
                    Elapsed time   Memory peak (MB)   Approx. CPU time   Max read   Avg read   Max write   Avg write
PowerCenter         0:28:10        11,875             1:06:29.2          53         12         82          39
DMExpress           0:13:26         9,438             0:16:53.9          154        33         101         66
DMExpress (Linux)   0:05:43         9,957             0:16:21            N/A        83         N/A         142
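
The ratios implied by the table above can be derived with a short Python sketch. The timings are copied from the table; `to_seconds` is a helper introduced here for illustration.

```python
def to_seconds(hms):
    """Convert 'h:mm:ss' or 'h:mm:ss.s' to seconds."""
    hours, minutes, seconds = hms.split(":")
    return int(hours) * 3600 + int(minutes) * 60 + float(seconds)

# (elapsed time, approximate CPU time) per run, from the POC table.
runs = {
    "PowerCenter": ("0:28:10", "1:06:29.2"),
    "DMExpress": ("0:13:26", "0:16:53.9"),
    "DMExpress (Linux)": ("0:05:43", "0:16:21"),
}

base_elapsed, base_cpu = map(to_seconds, runs["PowerCenter"])
ratios = {}
for name, (elapsed, cpu) in runs.items():
    # Ratio > 1 means the run was faster (or used less CPU) than PowerCenter.
    ratios[name] = (base_elapsed / to_seconds(elapsed),
                    base_cpu / to_seconds(cpu))
    print(f"{name}: {ratios[name][0]:.1f}x elapsed, {ratios[name][1]:.1f}x CPU")
```

On these figures DMExpress finishes in roughly half the elapsed time of PowerCenter, the Linux run in roughly a fifth, and both use about a quarter of the CPU time.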




[Charts: Elapsed Time, Memory (GB), and CPU Time for PC, DMX, and DMX (Linux).]
Benchmark Details: DMExpress vs. Informatica

5 GB file, 45 M records
          Task               Current          DMX              Saving
          Copy               4 min 09 s       0 min 50 s       80%
          Sort               7 min 26 s       1 min 19 s       82%
          Aggregate          9 min 37 s       1 min 09 s       88%
          Sort & Aggregate   3 min 43 s       1 min 37 s       57%

25 GB file, 225 M records
          Task               Current          DMX              Saving
          Copy               20 min 53 s      4 min 12 s       80%
          Sort               31 min 48 s      6 min 17 s       80%
          Aggregate          20 min 45 s      4 min 30 s       78%
          Sort & Aggregate   14 min 53 s      6 min 38 s       55%
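
The Saving column can be reproduced from the two time columns. The sketch below assumes the saving is the relative reduction in elapsed time, rounded to a whole percent. The slide does not state the formula, and the function and variable names are illustrative.

```python
def seconds(minutes: int, secs: int) -> int:
    """Convert a minutes/seconds pair to total seconds."""
    return minutes * 60 + secs


def saving_percent(current_s: int, dmx_s: int) -> int:
    """Relative time reduction, rounded to a whole percent (assumed formula)."""
    return round(100 * (current_s - dmx_s) / current_s)


# (current time, DMX time) per task, copied from the benchmark tables above.
benchmarks = {
    "5 GB": {
        "Copy": (seconds(4, 9), seconds(0, 50)),
        "Sort": (seconds(7, 26), seconds(1, 19)),
        "Aggregate": (seconds(9, 37), seconds(1, 9)),
        "Sort & Aggregate": (seconds(3, 43), seconds(1, 37)),
    },
    "25 GB": {
        "Copy": (seconds(20, 53), seconds(4, 12)),
        "Sort": (seconds(31, 48), seconds(6, 17)),
        "Aggregate": (seconds(20, 45), seconds(4, 30)),
        "Sort & Aggregate": (seconds(14, 53), seconds(6, 38)),
    },
}

savings = {
    size: {task: saving_percent(cur, dmx) for task, (cur, dmx) in tasks.items()}
    for size, tasks in benchmarks.items()
}

for size, tasks in savings.items():
    for task, pct in tasks.items():
        print(f"{size:6s} {task:18s} {pct}%")
```

Under that assumption, every computed value matches the percentage printed on the slide.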
Ab Initio Benchmark

Scenario 1 (copy/filter)
             Elapsed time     CPU time          Temp workspace   Records read    Records written   Data read (bytes)   Data written (bytes)
DMExpress    47 minutes       3 hours 44 min    0 GB             2,926,155,265   452,375,411       383,326,339,715     59,261,178,841
Ab Initio    66 minutes       4 hours 38 min    0 GB             2,926,155,265   452,375,411       383,326,339,715     59,261,178,841

Scenario 2 (sort)
             Elapsed time     CPU time          Temp workspace   Records read    Records written   Data read (bytes)   Data written (bytes)
DMExpress    1 hour 12 min    7 hours 26 min    60 GB            2,926,155,265   2,926,155,265     383,326,339,715     383,326,339,715
Ab Initio    4 hours 42 min   9 hours 48 min    360 GB           2,926,155,265   2,926,155,265     383,326,339,715     383,326,339,715

Scenario 3 (aggregation/rollup)
             Elapsed time     CPU time          Temp workspace   Records read    Records written   Data read (bytes)   Data written (bytes)
DMExpress    1 hour 21 min    7 hours 10 min    4 GB             2,926,155,265   27,179,924        383,326,339,715     4,022,628,752
Ab Initio    2 hours          10 hours 14 min   360 GB           2,926,155,265   27,179,924        383,326,339,715     4,022,628,752

             Ab Initio tuned 8 ways
             DMExpress with no tuning
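
For a rough sense of scale, the three scenarios can be turned into elapsed-time ratios, temp-workspace ratios, and an effective read rate. This is a sketch only: the read rate is simply input bytes divided by elapsed time, a derived figure that does not appear on the slide, and all names are illustrative.

```python
# (elapsed minutes, temp workspace in GB) per tool, copied from the tables above.
scenarios = {
    "1 (copy/filter)": {"DMExpress": (47, 0), "Ab Initio": (66, 0)},
    "2 (sort)": {"DMExpress": (72, 60), "Ab Initio": (282, 360)},
    "3 (aggregation/rollup)": {"DMExpress": (81, 4), "Ab Initio": (120, 360)},
}
BYTES_READ = 383_326_339_715  # identical input size in every scenario


def read_throughput_mb_s(elapsed_min: int) -> float:
    """Input bytes divided by elapsed time, in decimal MB/s (derived figure)."""
    return BYTES_READ / (elapsed_min * 60) / 1e6


summary = {}
for name, tools in scenarios.items():
    dmx_min, dmx_ws = tools["DMExpress"]
    abi_min, abi_ws = tools["Ab Initio"]
    summary[name] = {
        "elapsed_ratio": round(abi_min / dmx_min, 2),
        # No ratio is reported when DMExpress used no temp workspace.
        "workspace_ratio": round(abi_ws / dmx_ws, 1) if dmx_ws else None,
        "dmx_mb_s": round(read_throughput_mb_s(dmx_min), 1),
        "abi_mb_s": round(read_throughput_mb_s(abi_min), 1),
    }
    print(name, summary[name])
```

The sort scenario gives the largest elapsed-time ratio, just under 4, which corresponds to the "Up to 4x Faster" claim made earlier in the deck.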
Metadata with Miti
DMExpress – White Boarding the Data
        Acceleration Sales
ETL to DMExpress acceleration / conversion

Automatic Conversion Utility
Cognizant Migration / Optimization COE

Conversion Utility

Parsing
•   Informatica
•   IBM DataStage
•   PL/SQL
•   Etc…

Processing
•   Flow analysis
•   Expression & type analysis
•   Optimization

Output Generation
•   DMExpress
•   Documentation

•   UNIX shell scripts
•   Informatica workflows
•   Informatica mappings
•   Spreadsheets identifying the production workflows and mappings
•   Timing information of the job executions over a two-month period
•   Resource data points for the workflows
DMX Live Demo
DMExpress – White Boarding the Data
       Acceleration Sales P

Optimization

  • 1. Optimization with DMExpress Steven Haddad – Senior Software Architect shaddad@syncsort.com
  • 2. Introducing DMExpress™ - Fast. Efficient. Simple. Cost Effective. A Family of High-Performance, Purpose-Built Data Integration Tools For core ETL processing & database transformation → High-Performance ETL offload (Oracle PL/SQL, Teradata, and others) Integrate → ETL Optimization For Informatica, DataStage, and others → Hadoop Optimization For Apache, HortonWorks, Cloudera, and others Optimize → Rehosting Optimization For Clerity, MicroFocus, Oracle, and others → High-Performance Sort For z/OS, z/VSE, and Windows/UNIX/Linux Migrate → Sort Optimization For SAS, DFSORT, Trillium, and others Syncsort Confidential and Proprietary - do not copy or distribute 3
  • 3. Do You Need Data Integration Optimization/Acceleration?  ETL is taking longer and longer  Large budgets to purchase additional hardware and database  A shift in data integration processing to database or hand-coded solutions  Data integration environment can’t easily be govern, maintained or expanded  Inability to launch or staff initiatives due to lack of resources  Long time-to-value  Users may lose confidence in data Syncsort Confidential and Proprietary - do not copy or distribute 4
  • 4. What is Optimization with DMExpress™ ?  Better Performance – No Tuning  Lower Costs for:  Hardware  Licenses  IT Stuff  Improves your Capabilities to deliver  Reduces usage of resources  More work in less time  Secure your already done investment Syncsort Confidential and Proprietary - do not copy or distribute 5
  • 5. Examples for Optimization with DMExpress™ → 10 * Faster then Major Logistic Company DataStage Parallel IBM DataStage → 26 * Faster then Major Logistic Company DataStage Server → 27 days down to 15 hours Information Service Provider →6 week to production Informatica → 1/20 of disc space Major Insurance Provider → significant less Memory → Costs/TB down from ComScore → 1538 US$ to 46 US$ → Reduce costs by 2.9 Mio $ PL/SQL Global Payments → 2.35h down to 3 min → 4:42 h down to 1:12h AbInitio Financial Service Provider → 360 GB down to 4 GB WS Syncsort Confidential and Proprietary - do not copy or distribute 6
  • 6. DMExpress Delivers Significantly Faster Performance Even Without Any Tuning 35 Elapsed Time (m) 30 25 INFA 20 DMExpress Up to 5x Faster 15 10 → DMExpress: No Tuning 5 → Informatica: Tuned 0 1. Copy 2. Sort 3. Aggregate 300 Elapsed Time (m) 250 Ab Initio 200 DMExpress Up to 4x Faster 150 100 → DMExpress: No Tuning 50 → Ab Initio: Tuned 0 1. Copy / Filter 2. Sort 3. Aggregate / Rollup Syncsort Confidential and Proprietary - do not copy or distribute 7
  • 7. DMExpress Seamlessly Scales to Support Growing Requirements Volume & Complexity Seamlessly scale: Business Requirements • No tuning Conventional ETL • No ELT • Defer hardware purchases DMExpress Time Continuously implement performance stop-gap measures: • Manual tuning • Add/upgrade hardware Point of problem • Push-down (ELT) awareness Syncsort Confidential and Proprietary - do not copy or distribute 8
  • 8. Fast: Intelligent Sort Algorithms High Frequency and Impact Compression Source Extract, Ratio Compress & FTP 6X Sort impacts every aspect of ETL increases Partition Up To Source Extract Data Faster 40% Database Load Compress & FTP Joining Up To Records Faster 60% Merge & Up To Partition Data Transformation Faster 50% Aggregation Aggregation Up To Faster 70% Merging & Joining Records Transformation Database Up To Load & Index Faster 40% Syncsort has been the market leading sort technology since 1968
  • 9. Maximizing Performance with Optimum Resource Utilization The Performance Triangle CPU DMExpress Is Different • Patented Algorithms Dynamically responds to CPU, Memory & disk availability Partition & Buffer • Direct I/O Pipeline Parallelism Management Bypasses file system buffer Instruction Memory Cache accessing data directly at block Cache ETL Process Optimization Optimizer Optimization level for higher performance I/O Optimization Algorithm Selection • Compression Used for read/write & crucially active workspace (minimizes disk I/O Memory touches & transfer volume) Disk & I/O Bound Syncsort Confidential and Proprietary - do not copy or distribute 10
  • 10. DMExpress Dynamically Maximizes Throughput at Run Time Conventional Data Integration Data Integration with DMExpress Automatic and Dynamic Manual and Static Algorithms Algorithms Processing Time Processing Time ■ Scaling requires expensive hardware ■ Extremely efficient in commodity hardware ■ I/O operations well below disk speed ■ I/O operations at near disk speed ■ Requires exhaustive tuning ■ Automatic parallelism and pipelining ■ Sub-optimal consumption of resources ■ Automatic, efficient caching and hashing ■ Uses all memory, overflows to disk ■ Minimizes disk caching Syncsort Confidential and Proprietary - do not copy or distribute 11
  • 11. Efficient: Dynamic ETL Optimizer Resource Analysis Memory Partition & Buffer CPU Pipeline Management Parallelism I/O Instruction Memory Cache ETL Process Cache File System Optimization Optimizer Optimization I/O Algorithm Optimization Selection Data Type Data Analysis Record Format Fully automatic, continuously self-tuning optimizer maximizes #Records / throughput and resource efficiencies Columns – Evaluates hardware, software, and data environment – Determines optimal algorithmic flow at start-up – Begins execution with auto-generated optimizer plan – Continuously adjusts algorithms, memory use, parallelism based on application and run time environment 12 Sy ncs
  • 12. Design Once Inherit Performance Sources Read Join Aggregate Write Targets EDW ETL Job DM Thread Management Tasks Dynamic Optimizations • Each ETL task runs on a separate process • Automatic, dynamic thread management for each task • Automatic parallelism and pipelining • Automatic, dynamic algorithm selection Syncsort Confidential and Proprietary - do not copy or distribute 13
  • 13. Architecture DMExpress – White Boarding the Data Acceleration Sales
  • 14. DMExpress Architecture Delivers Maximum Performance and Data Scalability with Automatic Dynamic Optimizations Integration / Customization (SDK, Open APIs) Graphical Development Environment DMExpress Engine High Performance Transformations User Defined Functions Automatic Continuous Optimization Deployment • Sort • Load Presort Built in Functions: Metadata • Merge • Filter • Numeric • Aggregate • Reformat • Text Algorithms • Join / Lookup • Partition • Date and Time • Copy • Logical • Advanced Text Processing • Data Partitioning Processing Time Source/Target Connectivity Syncsort Confidential and Proprietary - do not copy or distribute 15
  • 15. Five Simple Steps to Deploy. Tuning Is NOT One of Them. • Single install 1. Install DMExpress • Takes less than 5 minutes • Primary Tasks: Sort, Merge, Aggregate, Join / 2. Choose “Task” Template Lookup, Copy • Secondary Tasks: Filter, Reformat, Partition • Connectivity • Standard Functions 3. Fill-in the blanks • Numeric, Text, Date/Time, Logical • User-defined Functions • Create Complete ETL “Jobs” by Combining 4. Integrate Multiple “Tasks” • Define Flows – from files to direct flows • Schedule 5. Deploy • Parameterize • Monitor Syncsort Confidential and Proprietary - do not copy or distribute 16
  • 16. Syncsort DMExpress Is Simple but powerful Intuitive Graphical Interface enables Development and Maintenance • Graphical → No coding required Development Environment → No tuning required → Easily build/edit jobs and tasks • Expression Builder → Detect differences between development, test, and production environments • Job/Task Diff → Users are fully functional within a few days Syncsort Confidential and Proprietary - do not copy or distribute 17
  • 17. DMExpress Architecture DmExpress Clients Command Line Job Task Editor Editor Flat File Based 3rd party version Metadata Repository Check-in control tool Check-out Design Services Time View Local Windows / Unix / Linux Remote Data Server Server DMExpress Engine Data Sources / Targets
  • 18. Use Cases DMExpress – White Boarding the Data Acceleration Sales
  • 19. Acceleration POC – Scenario A Processing Time in Minutes of ‘High Load Jobs’ 32 40 19 30 1/2 The time 20 10 0 DataStage DMExpress Parallel 4/6 cores 1 core (Virtual) 1/6 The hardware (Physical/Virt.) Linux Linux 20
  • 20. Acceleration POC – Scenario B Processing Time in Minutes of ‘Scenario B’ 40.00 40.00 21.30 30.00 1/2 The time 20.00 10.00 0.00 DataStage DMExpress Server 14 cores 1 core 1/14 The (Physical) (Virtual) Hardware HP-UX Linux 21
  • 21. Use Case 1: Global Information Service Provider  Business Challenge  Severe competitive pressure from Google Finance, Yahoo! Finance, Morningstar, and others forced development of strategic new offerings  Environment  Informatica 8.11 SP3, Oracle 10.2 RAC 6 nodes, DMExpress 5.2.15.  16 core LINUX machine  Technical Challenge  Weekly Reporting application on 8 million DUNS numbers  Data Sizes: 5 tables of ~1 TB each  Bottleneck step was to join 5 tables and aggregate the output  Prior Attempts to Increase Performance  Manual tuning of ETL routines - lots of consultants spent many months and dollars  Converted the ETL mapping to ELT. No success - Process would abort with ORA-01555: Snapshot too old error  Broke up the ELT process into 100,000 record batches to prevent the oracle error. The process ran in 27 days (extrapolated)  Problem existed since February on 2009, many attempts and touch points, production in October.  Solution  DMExpress extracted five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours. Total run time was 15 hour to run this step in DMExpress vs. 27 days.  DMExpress invoked at the command line prior to Informatica  Benefits  New offering launched on time  Able to meet SLAs  2 weeks to finish POC  In production in 6 weeks
  • 22. Use case 2: Major Insurance Provider  Business Challenge  Unable to complete processing to deliver new highly personalized offers and pricing to their agents via their agent marketing portal over weekend window impacts conversion rates for promotions to policyholders  Need to start the processing on Friday night 6pm, causing data from load to be done only by Wednesday 6 pm  Environment  Informatica version 7.x, 8.6.1, Trillium, Teradata, reporting - MicroStrategy, Hyperion/Brio,DMExpress 6.9, Maestro , Sun Solaris  Technical Challenge  500 of GB of data, including joins and aggregations, need to be completed during weekend window  Certain jobs would not even not run – need to abort (30 hour + runs). No alternative – no tuning worked  Very slow I/O when joins spill to disk. All of the memory on the system is grabbed! Virtual memory errors  No capacity in Teradata to push down transformations  Prior Attempts to Increase Performance  Tuning did not solve the problem  Dynamically adjusting cache did not solve the bottleneck  Solution  Output from Trillium is sent to DMExpress and Informatica to integrate and aggregate the data (Joins, and aggregations)  Started out with 10 critical DMExpress jobs and now expanded to 700+ DMExpress tasks, 200 DMExpress jobs  Orchestrated within PowerCenter Workflow Manager – command task and also called separately from Maestro.  Benefits  DMExpress completes within weekend batch window  Extremely simple and scalable approach – very short learning curve – 1 month to deploy DMExpress  Significantly less memory used by DMX - more parallel jobs due to efficiency. DMExpress takes 1/20th the disk space
  • 23. Case Study: Enabling Up to $3M in Data Integration Cost Savings Before After PL/SQL Scripts (ELT) DMExpress (ETL) Avg. 13.5M rows per file/table Avg. 13.5M rows per file/table ETLTL Vertica Oracle Oracle DMExpress Oracle Oracle Data Migrator Analytics Analytics Read files Load into staging Load into the Oracle Read files Dedupe, summarize Analysis & reporting area, dedupe, and production data and load into Oracle summarize using warehouse for data warehouse PL/SQL scripts and analysis & reporting iWay Data Migrator • Est. TCO over 3 years: $4.4M Est. TCO over 3 years: $1.5M • Total processing time: 2.35 hrs Total processing time: 3 min • Complex architecture with PL/SQL, iWay Data One tool. One ETL engine. No staging Migrator and lots of Oracle staging No coding. No tuning. Reusable objects • Manual coding. Manual tuning. No reusability Scalable architecture supports business growth • No scalability to support business goals and profitability objectives Syncsort Confidential and Proprietary - do not copy or distribute 24
  • 24. POC Results – Informatica Max I/O Ave I/O Max I/O Ave I/O Utilization - Utilization Utilization Utilization Memory Peak Approximate Read – Read – Write – Write Elapsed time (Mb) CPU Time MB/Sec (Meg/s) (MB/Sec MB/Sec PowerCenter 0:28:10 11,875 1:06:29.2 53 12 82 39 DMExpress 0:13:26 9,438 0:16:53.9 154 33 101 66 DMExpress (Linux) 0:05:43 9,957 0:16:21 N/A 83 N/A 142 Elapsed Time Memory (Gb) CPU Time 00:36:00 14.0 1:12:00 12.0 1:04:48 00:28:48 0:57:36 10.0 0:50:24 00:21:36 8.0 0:43:12 0:36:00 00:14:24 6.0 0:28:48 4.0 0:21:36 00:07:12 0:14:24 2.0 0:07:12 00:00:00 0.0 0:00:00 PC DMX DMX (Linux) PC DMX DMX (Linux) PC DMX DMX (Linux)
  • 25. Benchmark Details DMExpress vs. Informatica Current DMX Task Time Task Time Saving Copy 4mins 09 seconds Copy 0mins 50 seconds 80% 5 Gb Sort 7mins 26 seconds Sort 1mins 19 seconds 82% File – Aggregate 9mins 37 seconds Aggregate 1mins 9 seconds 88% 45 M Sort & Aggregate 3mins 43 seconds Sort & Aggregate 1mins 37 seconds 57% Records Task Time Task Time Saving Copy 20mins 53 seconds Copy 4mins 12 seconds 80% Sort 31mins 48 seconds Sort 6mins 17 seconds 80% 25 Gb Aggregate 20mins 45 seconds Aggregate 4mins 30 seconds 78% File – 225 M Sort & Aggregate 14mins 53 seconds Sort & Aggregate 6mins 38 seconds 55% Records
  • 26. Ab Initio Benchmark Scenario1 (copy/filter) Elapsed time CPU time Temp Workspace Records read Record written Data read Data written (bytes) DMExpress 47 minutes 3 hours 44 min 0 GB 2,926,155,265 452,375,411 383,326,339,715 59,261,178,841 Ab Initio 66 minutes 4 hours 38 min 0 GB 2,926,155,265 452,375,411 383,326,339,715 59,261,178,841 Scenario2 (Sort) Elapsed time CPU time Temp Workspace Records read Record written Data read Data written (bytes) DMExpress 1 hour 12 min 7 hours 26 min 60 GB 2,926,155,265 2,926,155,265 383,326,339,715 383,326,339,715 Ab Initio 4 hours 42 min 9 hours 48 min 360 GB 2,926,155,265 2,926,155,265 383,326,339,715 383,326,339,715 Scenario3 (Aggregation/Rollup) Elapsed time CPU time Temp Workspace Records read Record written Data read Data written (bytes) DMExpress 1 hour 21 min 7 hour 10 min 4 GB 2,926,155,265 27,179,924 383,326,339,715 4,022,628,752 Ab Initio 2 hours 10 hours 14 min 360 GB 2,926,155,265 27,179,924 383,326,339,715 4,022,628,752 Ab Initio tuned 8 ways DMExpress with no tuning Syncsort Confidential and Proprietary - do not copy or distribute 27
  • 27. Metadata with Miti DMExpress – White Boarding the Data Acceleration Sales
  • 28. ETL to DMExpress acceleration / conversion
      Automatic Conversion Utility
        Parsing: Informatica; IBM DataStage; PL/SQL; etc…
        Processing: flow analysis; expression & type analysis; optimization
        Output generation: DMExpress; documentation
      Cognizant Migration / Optimization COE
        UNIX shell scripts
        Informatica workflows
        Informatica mappings
        Spreadsheets identifying the production workflows and mappings
        Timing information of the job executions over a two-month period
        Resource data points for the workflows
  • 29. DMX Live Demo DMExpress – White Boarding the Data Acceleration Sales P
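
The Saving column in the slide 25 benchmark is consistent with the listed task times. A minimal Python sketch of that arithmetic, assuming "saving" means (current time - DMX time) / current time rounded to the nearest percent; the deck does not state the formula, and the helper names here are illustrative only:

```python
def to_seconds(minutes: int, seconds: int) -> int:
    """Convert a 'M mins S seconds' reading into seconds."""
    return minutes * 60 + seconds


def saving_pct(current_s: int, dmx_s: int) -> int:
    """Assumed definition: share of the current run time that DMX removes."""
    return round(100 * (current_s - dmx_s) / current_s)


# (current, DMX) task times copied from slide 25.
TIMES = {
    ("5 GB", "Copy"): (to_seconds(4, 9), to_seconds(0, 50)),
    ("5 GB", "Sort"): (to_seconds(7, 26), to_seconds(1, 19)),
    ("5 GB", "Aggregate"): (to_seconds(9, 37), to_seconds(1, 9)),
    ("5 GB", "Sort & Aggregate"): (to_seconds(3, 43), to_seconds(1, 37)),
    ("25 GB", "Copy"): (to_seconds(20, 53), to_seconds(4, 12)),
    ("25 GB", "Sort"): (to_seconds(31, 48), to_seconds(6, 17)),
    ("25 GB", "Aggregate"): (to_seconds(20, 45), to_seconds(4, 30)),
    ("25 GB", "Sort & Aggregate"): (to_seconds(14, 53), to_seconds(6, 38)),
}

SAVINGS = {key: saving_pct(cur, dmx) for key, (cur, dmx) in TIMES.items()}

for (size, task), pct in SAVINGS.items():
    print(f"{size:>5}  {task:<16} {pct}%")
```

Under that assumption the computed values match the percentages on the slide for both file sizes.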

Editor's notes

  1. So, how do you know if DMExpress is the right technology for you? Well, you can start by using the TDWI Checklist report for accelerating data integration… <Bring a hard copy of the Checklist and deliver it to the customer>
  2. So, how do you know if DMExpress is the right technology for you? Well, you can start by using the TDWI Checklist report for accelerating data integration… <Bring a hard copy of the Checklist and deliver it to the customer>
  3. The result is that you can normally achieve much higher performance than the leading DI tools, even with no tuning. As an example, I’m showing 2 benchmarks we ran at a customer site, comparing DMExpress with Informatica at the top and with Ab Initio at the bottom.
  4. So we talked about speed and efficiency. Now let’s talk a bit more about ease of use. Most DI platforms talk about ease of use in terms of a nice GUI. However, Syncsort takes the concept one step further to attack one of the most complex and time-consuming tasks: fine-tuning. For that, let me tell you a little bit about our technology. Traditional data integration is manual and static. Moreover, it was not designed with efficiency in mind, which means resources are used suboptimally: these tools are very CPU and memory intensive, yet they still run I/O operations well below disk speed. Therefore, scaling requires very expensive hardware and time-consuming tuning. Every time there are changes, IT has to go back and re-tune the system. DMExpress takes a completely different approach: it is completely automatic and dynamic. Coming from 40 years of performance expertise, the engine minimizes CPU and memory utilization while running I/O operations at or near disk speed. More importantly, it requires no tuning whatsoever: it automatically adapts to changes in real time, providing automatic parallelism and pipelining. This translates into: higher performance out of the box; much better ease of use, to the point where users can design high-performance ETL tasks and jobs with minimum training; and significant savings in IT staff hours and hardware.
  5. A Task is a basic unit of work: sort, aggregate, join, etc. A DMExpress Job is a collection of Tasks. Each Task executes in a separate process. DMExpress automatically manages threads for each Task.
  6. Dun & Bradstreet
     Data sizes: 5 tables of ~1 TB each.
     Processing need: the bottleneck step in INFA was joining 5 tables and aggregating the output.
     Application: weekly reporting application on millions of DUNS numbers.
     Data warehouse: Oracle 10g.
     Original approach: ETL using INFA, which was not meeting SLAs. The SLA is to run this process within a week.
     Attempts to improve performance:
       Tuned the ETL environment to try to meet the SLAs. No success.
       Converted the ETL mapping to ELT in INFA. No success. The process would abort with "ORA-01555: snapshot too old" because it ran in the database too long while the tables were being updated.
       Broke up the ELT process into 100,000-record batches to prevent the Oracle error. The process ran in 27 days (extrapolated)!
     DMExpress benchmark: DMExpress extracted five 1 TB tables in 6 hours and performed the joins and aggregation in 9 hours. The output file was then read by INFA and loaded into the target table. Total run time for this step in DMExpress was 15 hours.
     POC environment: 4-core Linux box.
     DMExpress is currently in production and used as a performance complement to INFA.
     Current production environment: 16-core Linux box.
     High-level flow in production: SOURCES -> Oracle -> DMX (extract 9 hours) -> Flat Files -> DMX INFA -> TARGET DATA MART
     When they used DMX:
       Aggregation was not fast: the data has to be presorted, and there is not enough memory to aggregate without DMX; the alternative is pushdown. However, pushing down to a database table just to do an ORDER BY is not an option.
       DMX extracted the data in 6 hours, filtering on the fly and landing it to disk (2 to 3 TB), which offloaded the load from the database.
       Detail Trade data mart: transactional and very busy, so the offload really benefited the customer.
       A lot of Cognizant folks and a lot of time spent over many months.
  7. Application is used for: campaign management, portfolio management, product analysis, marketing analytics, customer analytics.
     SLA: start Friday at 6 pm; final load is on Monday at 6 pm.
     Data flow: flat-file sources trickling in from 10 source systems, 200 flat files, 500 GB in total (3 customer systems, quote systems; weekly, Friday night) -> Standardization process (INFA + DMX, aggregation, preparing data for Trillium; Friday at 6 pm to Saturday 3 pm) -> Trillium + DMX plug-in (customer householding and address standardization; 12 hours, ends 3 am Sunday) -> DI (INFA and DMX; building customer hierarchies, i.e. aggregating customers to households, a bunch of roll-ups; 18 hours, ends Sunday at 9 pm) -> Dimensional Model Builds and Loads (sorting, joining, CDC, joining keys back to the fact) -> Dim Data Mart (Teradata load time is a good portion of the 18 hours).
     Some anecdotal info from Jeff (Baax):
       1. Pushdown is not practical: going from flat file to database and back to flat file to do the work in Trillium incurs network costs and database load/unload costs. Loading 40 GB just to sort is not an option!
       2. It took the engineers only 2 weeks by themselves and enabled a 6-month deployment (1/6 of that time was DMX).
       3. One of the larger tables (150 million): the original approach was truncate and load (12 to 16 hours). They changed the approach to do a CDC in DMX and just do inserts and updates using Teradata MultiLoad. Now it takes hours to do the DMX CDC and ½ hour to load the results!
       4. Machine downtime and maintenance add to the complexity.
       5. Database Monday IDs get locked on Monday at 8 am (the real SLA is 8 am; an exception is needed to extend to the hard SLA, which is 6 pm, and that causes a lot of aggravation!).
       6. Due to data volume growth, the customer is looking to optimize all the time. DMX provides a very easy, scalable way to deal with this need and implement the jobs.
       7. DMX/INFA hand-off: today it’s a file hand-off; they are exploring pipes instead of files; Maestro calls a separate DMX job; in a Workflow/Session, a command task invokes DMX (landing a file). When and where are 2 tools necessary? a.) A huge join: started by building a 50 GB join (30 hours); the inner join output file gets read into INFA to do some business logic. b.) A huge aggregation: INFA in-memory aggregation; do the DMX sort, then the INFA complex aggregation.
       8. The ratio of the number of INFA to DMX jobs is 70/30.
  8. Global Payments, a leading electronic transaction processing organization serving millions of customers, uses DMExpress as their ETL standard.
     BUSINESS CHALLENGES
     Global Payments came to us as they were planning to consolidate all of their global operations into their US data center. In doing so they faced some challenges:
     + First, they wanted to reduce costs; that was one of the key drivers behind the initiative.
     + Reduce operational risk and improve customer service, providing a more consistent level of service across the world. (They had to manually script many of their transformations in PL/SQL, pushing them into a strained Oracle database, which sometimes resulted in errors that could jeopardize daily operations. In addition, under the existing architecture they had to lock Oracle tables for hours, which had a huge impact on all database users.)
     + GPN wanted to open a new revenue source by offering a new service with more granular reporting to their customers. For the reasons above, transformations for the new service had to happen outside the Oracle database.
     + Cut processing times to allow for future growth. They were experiencing around 50% YoY data growth.
     + Global operations also meant shorter batch windows with 24x7 operations.
     + Consolidation of operations meant that the staff of 5 FTEs previously managing US and NA operations would now have to manage all the international operations.
     + They wanted to go into production in less than 60 days, while minimizing any impact on existing operations.
     BEFORE / PAIN POINTS
     They had an architecture with iWay Data Migrator doing some of the work, but since this tool couldn’t cope with the performance and scalability requirements, they had to hand-code a lot of their transformations in PL/SQL. This resulted in several pain points, including:
     + Very complex architecture due to the use of both PL/SQL and Data Migrator.
     + Constant tuning required with little or no reusability, resulting in very long development cycles and time-to-value.
     + Their architect said there was real pain around the limitations of error logging with Data Migrator. Having a tool like DMExpress helped significantly in this area.
     + Higher costs, in terms of the hardware required by their ETL tool as well as the database capacity needed to execute PL/SQL scripts.
     + One of their processes had to dedupe and summarize several tables, some of them exceeding 13M rows in length. Processing was taking more than 2 hrs to complete.
     BENEFITS
     We went on site and conducted a POC and a business value analysis (BVA). The results showed:
     + Processing times improved by almost 9x (from 141 min to 3 min for key processing tasks).
     + Significant savings when compared to other options (including Informatica, their existing architecture, and DataStage; in fact, they had prior experience working with DataStage, so they were looking heavily at DS).
     During the BVA we did a thorough analysis of their DI strategy and TCO, evaluating operational as well as capital costs in 3 key categories: hardware costs, database/staging costs, and IT staff productivity. Please notice that ETL software license costs were not included in the analysis. However, our pricing was still very competitive and at the lower end of the competition.
     The results of the analysis show savings of nearly US $3M over 3 years (more details about the analysis can be found on the third slide). Global Payments was able to deploy to production in approximately 4 weeks. The new architecture is helping GPN achieve their growth and profitability goals with a technology that can scale cost-effectively to support growing data volumes.
     DISCOVERY QUESTIONS THAT HELPED QUALIFY THIS OPPORTUNITY
     + How critical is your need to reduce processing time (improve performance)?
     + What is your time frame for getting the problem solved?
     + What solutions have you considered?
     + How many people do you have developing/maintaining PL/SQL?
     + What is the size (type/# cores) of your DB server(s)?
     + Would you find it advantageous to reduce your DB cost?
     + Do you know what the DB server(s) are costing you?
     + What would the impact be if you could move the DI work off the DB server(s)?
     Other discovery questions:
     + Transformations taking place (sort, merge, join, look-ups)
     + Data sizes
     + Current performance (processing) times
     + DI/DW/BI environment
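
Note 7 mentions replacing a truncate-and-load with change data capture (CDC), so that only inserts and updates are sent to the database. A minimal Python sketch of that idea, assuming both snapshots are already sorted by a unique key; this is not how DMExpress implements CDC, and the row layout and function name are hypothetical:

```python
from typing import Iterable, Iterator, Tuple

# Hypothetical row layout: (key, tuple of the remaining column values).
Row = Tuple[str, tuple]


def cdc(previous: Iterable[Row], current: Iterable[Row]) -> Iterator[Tuple[str, Row]]:
    """Merge two key-sorted snapshots and yield only inserts and updates."""
    prev_iter = iter(previous)
    prev = next(prev_iter, None)
    for cur in current:
        # Advance past keys that are smaller than the current key; keys that
        # exist only in the previous snapshot (deletes) are ignored here.
        while prev is not None and prev[0] < cur[0]:
            prev = next(prev_iter, None)
        if prev is None or prev[0] > cur[0]:
            yield ("insert", cur)   # key is new in the current snapshot
        elif prev[1] != cur[1]:
            yield ("update", cur)   # same key, changed values
        # identical rows produce no output


old = [("a", (1,)), ("b", (2,)), ("c", (3,))]
new = [("a", (1,)), ("b", (20,)), ("d", (4,))]
changes = list(cdc(old, new))
print(changes)
```

Because both inputs are read once in key order, only one row from each snapshot is held in memory at a time, which is why a sort step usually precedes this kind of comparison.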