SAP HANA SPS10 - Enterprise Information Management

© 2014 SAP AG or an SAP affiliate company. All rights reserved.
SAP HANA SPS 10 - What’s New?
Enterprise Information Management
SAP HANA Product Management May, 2015
(Delta from SPS 09 to SPS 10)

© 2015 SAP SE or an SAP affiliate company. All rights reserved. Public
Agenda
SAP HANA smart data integration
 New Adapters
 Writing to Virtual Tables
 Web-Based .hdbflowgraph Editor
 Remote Object Search
 DDL Replication
 Support for Multitenant Database Containers
 Support for Extended Storage Tables (Dynamic Tiering)
 Support for HANA smart data access remote sources
 Logical Partitions
 New Load Behaviors
 Adapter SDK Enhancements

Agenda
SAP HANA smart data quality
 Profiling – Metadata, Semantic and Frequency Distribution
 Updated Cleanse Transform
 New Match Transform
 Side Effect Data – Match & Cleanse
 Task Management

SAP HANA smart data integration

New Adapters
ASEAdapter
 Federation
 Bulk extraction
 Log Based Real Time Replication
HanaAdapter
 Federation
 Bulk extraction
 Trigger Based Real Time Replication
TeradataAdapter
 Federation
 Bulk extraction
 Trigger Based Real Time Replication

Writing to Virtual Tables
Provides the ability to write data to a virtual table in a remote source
In SPS 09, virtual tables could be queried directly or used as a Data Source in a flowgraph. In SPS 10,
a Data Sink node (i.e. the target) can also point to a virtual table from a remote source
configured using one of the following adapters
 ASEAdapter
 FileAdapter
 HanaAdapter
 TeradataAdapter
 DB2LogReaderAdapter
 OracleLogReaderAdapter
 MssqlLogReaderAdapter

New .hdbflowgraph editor
The HANA Web-Based Development Workbench has a new .hdbflowgraph editor that allows you to model a
set of transformations applied to one or many data sources
It provides the same capabilities already available in HANA Studio in SPS 09
 Batch and real-time data movements with transformations
It also provides the following new capabilities
 An updated Cleanse transform with content type detection and an easy-to-follow configuration process
 A new Match transform with content type detection and an easy-to-follow configuration process

Remote Object Search
Allows you to search for remote objects (e.g. tables) in a remote source
When invoking this functionality for the first time, you must populate the dictionary (a HANA table)
that holds the object names and descriptions.
This functionality can be invoked
 By right-clicking a remote source (Web-Based Development Workbench – Catalog only)
 When selecting objects for replication in the .hdbreptask editor
This functionality is supported for remote sources configured using the following adapters
 FileAdapter
 HanaAdapter
 TeradataAdapter
 DB2LogReaderAdapter
 OracleLogReaderAdapter
 MssqlLogReaderAdapter
 DB2ECCAdapter
 OracleECCAdapter
 MssqlECCAdapter

DDL Replication
Data Definition Language (DDL) operations can be replicated just like insert, update and delete operations
The following DDL operations are supported
 ALTER TABLE ADD COLUMN
 ALTER TABLE DROP COLUMN
DDL replication is possible when
 The .hdbreptask is enabled for real time
 The Table Level Replication setting is selected for the remote object
DDL replication is supported for remote sources configured using the following adapters
 All tables
– DB2LogReaderAdapter
– OracleLogReaderAdapter
– MssqlLogReaderAdapter
 Transparent tables only
– DB2ECCAdapter
– OracleECCAdapter
– MssqlECCAdapter

Support for Multitenant Database Containers
HANA EIM can be used to replicate or transform data in a HANA system with Multitenant
Database Containers
Each container
 Has its own dpserver
 Must be configured individually
– Register the Data Provisioning Agent(s)
– Register the Data Provisioning Adapter(s)
– Create Remote Sources
Support for Multitenant Database Containers was introduced in HANA SPS09 revision 95

Support for Extended Storage Tables (Dynamic Tiering)
The .hdbflowgraph object supports extended storage tables as Data Sources (source) or as Data Sinks (target)
Data can be taken from a row/column table and loaded into an extended storage table, or vice versa
 The data can be transformed before it’s persisted in the target
– Filter, Join, Union, Pivot, Case, etc.
 The data movement can be scheduled
– By calling the task in a stored procedure and scheduling the stored procedure using the XS Job Scheduler
– By creating a script that uses HDBSQL to call the task and invoking the script with a third-party scheduler

Support for HANA smart data access remote sources
Remote sources created using HANA smart data access adapters are now displayed in the
.hdbreptask editor of the HANA Web-Based Development Workbench
When configuring a remote source, HANA smart data access adapters always have indexserver as the
Source Location.
 Initial Load Only
– Smart data access adapters don’t have real-time change data capture capabilities, so this configuration option
is selected and disabled

Logical Partitions
Provides the ability to expedite the extraction of data from a remote source
When multiple logical partitions are created, the system executes parallel queries on a virtual table, each
extracting a subset of the entire dataset
 Available in the Partitions tab of the .hdbreptask editor and in the Partitions tab of the Data Source
node of the .hdbflowgraph editor
 One or more named partitions can be created
– Partitions are used to create filter criteria to select subsets of data
 A hidden partition is created to extract all records that don’t meet the filter criteria of any named
partition
 Partitions can only be created on one column of the table
 Partitions are only allowed on non-null columns
Recommendation – Select a column that is indexed in the remote source for even better performance

Logical Partitions
The following types of partitions are supported
 Range
– Each partition can only contain a single value
– The values must be entered in order from lowest to highest, e.g. 10,000,000; 20,000,000
o These two partitions, together with the hidden partition, generate three different queries that are
executed in parallel
• select col1, col2, coln from table where colx <= 10000000
• select col1, col2, coln from table where colx > 10000000 and colx <= 20000000
• select col1, col2, coln from table where colx > 20000000
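The way range boundaries turn into parallel queries can be sketched in a few lines of Python. This is only an illustration of the filter logic described on the slide, not the product's implementation; the function name range_partition_predicates is hypothetical, and the column and table names are the placeholders used above.

```python
def range_partition_predicates(column, boundaries):
    """Build one WHERE predicate per range boundary, plus the open-ended remainder.

    Hypothetical helper: it mirrors the slide's example, not an SAP API.
    """
    boundaries = list(boundaries)
    if not boundaries:
        raise ValueError("at least one boundary value is required")
    if any(a >= b for a, b in zip(boundaries, boundaries[1:])):
        raise ValueError("boundaries must be ordered from lowest to highest")

    predicates = []
    previous = None
    for upper in boundaries:
        if previous is None:
            predicates.append(f"{column} <= {upper}")
        else:
            predicates.append(f"{column} > {previous} and {column} <= {upper}")
        previous = upper
    # Remainder (the hidden partition): everything above the last boundary.
    predicates.append(f"{column} > {previous}")
    return predicates


queries = [
    f"select col1, col2, coln from table where {predicate}"
    for predicate in range_partition_predicates("colx", [10_000_000, 20_000_000])
]
for query in queries:
    print(query)
```

Two boundary values yield three queries, because the rows above the highest boundary still have to be extracted.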

Logical Partitions
The following types of partitions are supported
 List
– Each named partition can contain a single value
o Canada – ‘CA’
o United States – ‘US’
o Germany – ‘DE’
– Each named partition can contain multiple comma
delimited values
o North America – ‘CA’, ‘US’, ‘MX’
o Europe – ‘DE’, ‘FR’, ‘GB’, ‘IT’, ‘ES’
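A comparable sketch for list partitions, again in Python and purely illustrative: each named partition becomes an IN filter, and the hidden partition is modelled as a NOT IN filter over all listed values. The function name, the "<hidden>" label and the column name country are assumptions made for this example.

```python
def list_partition_predicates(column, partitions):
    """Map each named list partition to a filter, plus a hidden catch-all filter.

    partitions: dict of partition name -> list of string values.
    Hypothetical helper for illustration only.
    """
    if not partitions or not all(partitions.values()):
        raise ValueError("every named partition needs at least one value")

    def quoted(values):
        # Double any embedded single quotes so the literal stays well formed.
        return ", ".join("'" + value.replace("'", "''") + "'" for value in values)

    predicates = {}
    all_values = []
    for name, values in partitions.items():
        predicates[name] = f"{column} IN ({quoted(values)})"
        all_values.extend(values)
    # Hidden partition: rows whose value matches no named partition.
    # Partition columns are non-null, so NOT IN does not silently drop rows.
    predicates["<hidden>"] = f"{column} NOT IN ({quoted(all_values)})"
    return predicates


predicates = list_partition_predicates(
    "country",
    {
        "North America": ["CA", "US", "MX"],
        "Europe": ["DE", "FR", "GB", "IT", "ES"],
    },
)
for name, predicate in predicates.items():
    print(name, "->", predicate)
```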

Replicate, Replicate with logical delete
Allows you to change the behavior of the real time replication functionality
When selecting a table for real time replication, you can choose one of the following load behaviors
 Replicate (default value)
– Applies insert, update and delete operations to the target table in HANA.
 Replicate with logical delete
– Applies insert and update operations and converts delete operations to update operations
– Creates two new columns in the target table
o The incoming database operation (I, U or D)
o The timestamp of the transaction applied to the target table in HANA
– Produces rows that can be used by consuming applications like SAP Business Warehouse and SAP Data
Services to identify which records changed and when. This is especially useful when the remote source
doesn’t provide a way for SAP BW or SAP DS to identify changed records directly.

Preserve all
 Preserve all
– Applies insert operations and converts update and delete operations to insert operations, resulting in a history
table containing all changes that occur over time
– Creates three new columns in the target table and adds them to the primary key
o The incoming database operation (I, U or D)
o The timestamp of the transaction applied to the target table in HANA
o The sequence number of the operations within a transaction
• Is necessary to ensure uniqueness because a single transaction can contain multiple update operations on the same
record
– Produces rows that can be used by consuming applications like SAP Business Warehouse and SAP Data
Services to identify which records changed and when. This is especially useful when the remote source
doesn’t provide a way for SAP BW or SAP DS to identify changed records directly.
– Produces rows that can be used for historical reporting
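The difference between the three load behaviors can be illustrated with a small in-memory simulation in Python. This is a sketch under stated assumptions, not the replication engine: the change-record layout (operation, key, value, transaction timestamp), the behavior labels and the column names op and ts are all hypothetical.

```python
def apply_changes(changes, behavior):
    """Apply (op, key, value, ts) change records to an in-memory target table.

    op is 'I', 'U' or 'D'; ts stands in for the transaction timestamp.
    """
    if behavior == "replicate":
        table = {}
        for op, key, value, ts in changes:
            if op == "D":
                table.pop(key, None)          # deletes remove the row
            else:
                table[key] = {"value": value}
        return table

    if behavior == "logical_delete":
        table = {}
        for op, key, value, ts in changes:
            row = table.get(key, {"value": value})
            if op != "D":
                row["value"] = value
            # Deletes become updates: the row stays, flagged with op 'D'.
            row["op"], row["ts"] = op, ts
            table[key] = row
        return table

    if behavior == "preserve_all":
        history = {}
        sequence_in_transaction = {}
        for op, key, value, ts in changes:
            seq = sequence_in_transaction.get(ts, 0) + 1
            sequence_in_transaction[ts] = seq
            # Every change is an insert; op, ts and seq extend the primary key.
            history[(key, op, ts, seq)] = {"value": value}
        return history

    raise ValueError(f"unknown load behavior: {behavior}")


changes = [
    ("I", 1, "a", 100),
    ("U", 1, "b", 101),
    ("U", 1, "c", 101),   # second update to the same record in one transaction
    ("D", 1, None, 102),
    ("I", 2, "x", 102),
]
replicated = apply_changes(changes, "replicate")
logical = apply_changes(changes, "logical_delete")
history = apply_changes(changes, "preserve_all")
print(replicated)
print(logical)
print(len(history), "history rows")
```

The two updates in transaction 101 show why the sequence number is needed: without it, both history rows would share the same key, operation and timestamp.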

UPSERT
The Adapter SDK provides new operations that enable the creation of new custom HANA
EIM adapters or enhance the capabilities of existing custom adapters
In addition to the Insert, Before Image (Update), After Image (Update) and Delete operations that were
introduced in the initial version of the HANA EIM SDK in SPS 09, the following row types are now
available.
 RowType.UPSERT
– Inserts or updates the record
– The primary key columns of the target table, not those of the source table, are used to check for the
existence of the record
– Performs an update if the record exists in the target table
– Performs an insert if the record doesn’t exist in the target table

EXTERMINATE
 RowType.EXTERMINATE
– Deletes records based on the primary key from the incoming source record
– Only the primary key fields are used; all others may be null
– If these records are sent to a table via a remote subscription with a filter, the filter is not applied
– If these records are sent to a task, they are only provided to the Table Comparison transform for processing
and to the table writer to perform the delete.
Note that RowType.DELETE requires the entire record as it exists in the target table in
order to perform the delete, so RowType.EXTERMINATE may be the preferable option.
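The semantics of UPSERT and EXTERMINATE, compared with DELETE, can be sketched with a dictionary standing in for the target table. This Python fragment is an illustration only and does not use the Adapter SDK; the function apply_row and the way DELETE is modelled (it succeeds only when the full record matches) are assumptions made for this example.

```python
def apply_row(target, pk_columns, row_type, row):
    """Apply one incoming row to an in-memory target table.

    target: dict keyed by a tuple of the TARGET table's primary-key values.
    Returns True if the target table changed.
    """
    key = tuple(row[column] for column in pk_columns)
    if row_type == "UPSERT":
        # Update if the key exists in the target, insert otherwise.
        target[key] = dict(row)
        return True
    if row_type == "EXTERMINATE":
        # Only the primary-key fields matter; other fields may be None.
        return target.pop(key, None) is not None
    if row_type == "DELETE":
        # Modelled here as needing the entire record as stored in the target.
        if target.get(key) == row:
            del target[key]
            return True
        return False
    raise ValueError(f"unsupported row type: {row_type}")


target = {}
apply_row(target, ["id"], "UPSERT", {"id": 1, "name": "a"})   # insert
apply_row(target, ["id"], "UPSERT", {"id": 1, "name": "b"})   # update
after_upserts = dict(target)

# A DELETE carrying only the key does not match the stored record.
deleted = apply_row(target, ["id"], "DELETE", {"id": 1, "name": None})
# EXTERMINATE succeeds with the key alone.
exterminated = apply_row(target, ["id"], "EXTERMINATE", {"id": 1, "name": None})
print(after_upserts, deleted, exterminated, target)
```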

REPLACE
The following row types are used together to replace an existing set of rows in a target table
with a new set of incoming rows.
For example, an existing sales order is changed: some items are added, others are removed and others
have their quantities changed. When a remote source can’t provide the details of the change but instead
provides the end result, the following row types must be used.
 RowType.BEGIN_REPLACE_SET
– A row that indicates that a set of rows to be replaced will be provided immediately after this row

REPLACE
 RowType.TRUNCATE_REPLACE_TARGET
– A row that identifies all records to be removed
o The column values in the row are used to identify the records to be deleted, e.g. order_id = ‘010203’ will
delete all order detail records for this order
o The columns that have values can be primary key columns
o The columns that have values can be non-primary key columns, but those columns must be non-null
o LOB columns can’t be used
– If all the values in the row are null, the entire table will be truncated
 RowType.REPLACE
– A new row to be inserted
– Is optional. If no replace rows are provided, then rows will be deleted and not replaced.
 RowType.END_REPLACE_SET
– Indicates that all rows to be replaced were provided
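How the four row types cooperate can be sketched as a small Python simulation of one replace set applied to an in-memory table. It is a sketch of the described behavior only, not SDK code; the function apply_replace_set and the order_id, item and qty columns are hypothetical.

```python
def apply_replace_set(table, rows):
    """Apply one replace set to table (a list of dict rows), in place.

    rows: sequence of (row_type, values) pairs; values is a dict or None.
    """
    in_set = False
    for row_type, values in rows:
        if row_type == "BEGIN_REPLACE_SET":
            in_set = True
        elif row_type == "END_REPLACE_SET":
            in_set = False
        elif not in_set:
            raise ValueError(f"{row_type} outside of a replace set")
        elif row_type == "TRUNCATE_REPLACE_TARGET":
            # Non-null values identify the rows to remove. With no criteria,
            # all() is True for every row, so the whole table is truncated.
            criteria = {c: v for c, v in values.items() if v is not None}
            table[:] = [
                row for row in table
                if not all(row.get(c) == v for c, v in criteria.items())
            ]
        elif row_type == "REPLACE":
            table.append(dict(values))
        else:
            raise ValueError(f"unsupported row type: {row_type}")
    return table


order_items = [
    {"order_id": "010203", "item": 1, "qty": 2},
    {"order_id": "010203", "item": 2, "qty": 1},
    {"order_id": "999999", "item": 1, "qty": 7},
]
apply_replace_set(order_items, [
    ("BEGIN_REPLACE_SET", None),
    ("TRUNCATE_REPLACE_TARGET", {"order_id": "010203", "item": None, "qty": None}),
    ("REPLACE", {"order_id": "010203", "item": 1, "qty": 5}),
    ("REPLACE", {"order_id": "010203", "item": 4, "qty": 1}),
    ("END_REPLACE_SET", None),
])
print(order_items)
```

Rows of other orders are untouched, and omitting the REPLACE rows would simply delete the order's items.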

Profiling
Metadata, Semantic and Frequency Distribution

Semantic Profiling
Semantic profiling shows the character semantics and byte semantics of existing data and
assigns a content type to each column specified
This process relies on reviewing the existing data to determine and uncover anomalies in the
databases. Such a profile is useful in finding areas where the content of the existing system is not what
we would have expected it to be because of irregularities in the data.
Semantic profiling stored procedure:
PROCEDURE _SYS_TASK.PROFILE_SEMANTIC (
IN schema_name NVARCHAR(256),
IN object_name NVARCHAR(256),
IN profile_sample TINYINT,
IN columns _SYS_TASK.PROFILE_SEMANTIC_COLUMNS,
OUT result _SYS_TASK.PROFILE_SEMANTIC_RESULT
)

Metadata Profiling
Metadata profiling looks at column names, lengths and types as well as the location of the table
to determine its contents
The metadata can then be used to discover problems such as illegal values, misspellings, missing
values, varying value representations, and duplicates
Metadata profiling stored procedure:
PROCEDURE _SYS_TASK.PROFILE_METADATA (
IN columns _SYS_TASK.PROFILE_METADATA_COLUMNS,
OUT result _SYS_TASK.PROFILE_METADATA_RESULT
)

Frequency Distribution Profiling
Distribution profiling allows you to create profiles of patterns, words and fields in existing data
For example, you could perform distribution profiling on single columns of data individually to get an
understanding of frequency distribution of different values, type, and use of each column
Contains pattern, word and field profiling
Frequency distribution stored procedure:
CREATE PROCEDURE _SYS_TASK.PROFILE_METADATA (
IN columns _SYS_TASK.PROFILE_METADATA_COLUMNS,
OUT result _SYS_TASK.PROFILE_METADATA_RESULT
)
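What pattern, word and field profiling mean can be illustrated with a short Python sketch that counts frequencies over one column. It does not call the HANA stored procedures; the pattern convention used here (9 for a digit, A or a for an upper-case or lower-case letter) and the function names are assumptions chosen for the example.

```python
from collections import Counter


def pattern_of(value):
    """Reduce a value to its character pattern: digits -> 9, letters -> A/a."""
    out = []
    for ch in value:
        if ch.isdigit():
            out.append("9")
        elif ch.isalpha():
            out.append("A" if ch.isupper() else "a")
        else:
            out.append(ch)  # keep punctuation and spaces as they are
    return "".join(out)


def profile_column(values):
    """Return field, word and pattern frequency distributions for one column."""
    return {
        "field": Counter(values),
        "word": Counter(word for value in values for word in value.split()),
        "pattern": Counter(pattern_of(value) for value in values),
    }


profile = profile_column(["10 Main St", "22 Main St", "10 Main St", "PO Box 7"])
print(profile["field"].most_common(1))
print(profile["word"].most_common(2))
print(profile["pattern"].most_common())
```

In this sample the rare pattern stands out immediately, which is the kind of irregularity that distribution profiling is meant to surface.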

Cleanse
HANA Web-Based Development Workbench – .hdbflowgraph editor

Cleanse Configuration
A wizard will guide users through the process of
creating a cleanse configuration. Cleanse rules will
be suggested based upon semantic profiling results
The following cleanse components are supported
 Person, Firm, Address, Phone, Email and Title

Content Types
Content types describe data within each column and
are grouped together to form cleanse components.
The cleanse components determine the cleanse rules
that can be used.
The semantic profiling results can be reviewed and
modified if needed
 To change the content type if the results were ambiguous
 To fine-tune the results in order to affect the mapping of columns
to the cleanse components
There are over 20 pre-defined content types that can be assigned
to any column

Cleanse Components
Cleanse components are the entities that are mapped into the cleanse operation
Cleanse components can be composed of
 One or more input columns, depending on the component type
– Address and Person will usually have more than one input column associated with them
 Data from one input source

Cleanse Configuration Settings
The cleanse configuration settings will determine how
the data will be formatted on output
The cleanse configuration settings consist of
 Person, Address, Firm, Title, Email and Phone settings
 Enabling/Disabling the generation of side effect data

Cleanse Configuration Output
A set of best-practice output fields is automatically selected for the user based on the semantic
profiling results
Users can adjust the output field selection in the following ways
 Adjust the output fields based on the visual representation
 Select from a list of suggested actions
 Manually customize the output fields from a list of fields for each cleanse component
Full control of the entire output schema of the cleanse operation is possible

Match
HANA Web-Based Development Workbench – .hdbflowgraph editor

Match Configuration
A wizard will guide users through the process of
creating a match configuration. Match policies will
be suggested based upon semantic profiling results
The following match components are supported
 Person, Firm, Address, Phone, Email, Date and Custom
 Components are used to define match policies
The following policies are supported and can be used in
combination with each other
 Person, Firm, Address, Phone, Email, Date and Custom

Content Types
Content types describe the data in each column and
are grouped together to form match components
For each source, the semantic profiling results for each
content type can be chosen or ignored for matching
 View cleansed components
 View uncleansed columns (input data)
Address and Person components contain multiple content
types
 Person may contain First Name and Last Name and other
combinations
 Address may contain Country, Address Line, City, Region and
Postcode

Match Components
Match components are used individually or in
combination with each other to form match policies
Match components can be composed of
 Multiple input columns from semantic profiling results defined
by content types
– Each match component can be user defined
 Multiple input columns from a cleanse operation defined
from the MATCH_STD_* columns
If a cleanse operation does not precede the match
operation, then the MATCH_STD_* fields will be generated

Adding Custom Match Components
Custom match components can be added to a
configuration to be used to create a custom match
policy
A custom match component is defined:
 By providing a name for the match component
 By selecting the column associated with the match component
– On a source-by-source basis when multiple sources are
being used
Custom match components can be used in match policies:
 When performing exact-based matching
 When performing fuzzy-based matching
– Only when combined with Phone, Email or Address

Match Policies
Match policies are used to determine how matches
are identified within a single source, or across
multiple sources of data
Policies can be created by:
 Selecting one or more components
A match policy must contain one of the following
components:
 Address
 Phone
 Email
 Date
 Custom

Match Configuration and Policy Settings
The settings for the match configuration and policies
can be customized to fine-tune how matches are
determined
Person, Address and Firm components
 Thresholds can be made tighter or looser
 Settings can be enabled/disabled for different match scenarios
Custom component
 Thresholds can be made tighter or looser
 Settings can be enabled/disabled for different match scenarios
Side effect data
 None, Minimal, Basic, Full

Multi-source Matching
The match operation supports finding duplicates within sources of data and across sources of data
This can be configured by
 Directly mapping each data source to the match operation
 Leveraging the union operation to combine the multiple sources into a common data model
– A column specifying the source is required here
Source settings
 Define a constant source ID
 Get a source ID from a column
 Exclude a source from duplicate detection within itself
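The effect of the source settings can be sketched in Python with a simple exact-match comparison over records that carry a source ID. This is not the Match transform's algorithm (which also supports fuzzy matching and policies); the function find_matches, the skip_within parameter and the sample records are hypothetical.

```python
from itertools import combinations


def find_matches(records, key, skip_within=()):
    """Return (row_id, row_id) pairs of records whose match keys are equal.

    records: dicts with 'row_id' and 'source'; key: function building the
    comparison key; skip_within: sources excluded from intra-source matching.
    """
    pairs = []
    for a, b in combinations(records, 2):
        same_source = a["source"] == b["source"]
        if same_source and a["source"] in skip_within:
            continue  # do not look for duplicates inside this source
        if key(a) == key(b):
            pairs.append((a["row_id"], b["row_id"]))
    return pairs


records = [
    {"row_id": 1, "source": "CRM", "email": "a@x.com"},
    {"row_id": 2, "source": "CRM", "email": "A@x.com"},
    {"row_id": 3, "source": "ERP", "email": "a@x.com"},
    {"row_id": 4, "source": "ERP", "email": "b@x.com"},
]


def email_key(record):
    return record["email"].lower()


all_pairs = find_matches(records, email_key)
cross_only_for_crm = find_matches(records, email_key, skip_within={"CRM"})
print(all_pairs)
print(cross_only_for_crm)
```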

Side Effect Data
Match & Cleanse

Side Effect Data Overview
Side effect data is generated by the cleanse and match operations and provides insight into the
impact and results of each operation. This provides a framework for developing custom review
and remediation tools for data quality in HANA
Side effect cleanse/match configuration options:
 None
– Side effect data is not generated
 Minimal
– Generates only the statistics tables, stored in the _SYS_TASK schema, that contain summary information about the operation
 Basic
– Generates the statistics tables that contain summary and detailed information about the operation
 Full
– Generates everything in Basic along with a copy of the input data prior to the operation. The copy of the input data is
stored in the user’s schema

Side Effect Data for Match
Match side effect data will provide summary and detailed information related to the match
operation along with details specific to each match found on a group or record level
Match side effect tables consist of (in schema _SYS_TASK):
 MATCH_STATISTICS
– Provides a summary of a specified match operation including match groups, matches found, unique records, number of
match groups to review, the comparisons performed and number of decisions made
 MATCH_SOURCE_STATISTICS
– Provides a summary of input sources and the data when doing multi-source matching
 MATCH_GROUP_INFO
– Provides detailed information of a specified match group within a match operation including how many records are in the
match group, review/conflict flags and how many sources of data the match group contains
 MATCH_RECORD_INFO
– Provides the relationship information on a record-by-record basis for each match group within a match operation
 MATCH_TRACING
– Provides very detailed information on a record-by-record basis as to how and why the match was made along with the score

Match Side Effect Data – Table Relationships
The match side effect data is stored in a relational data model
The data in the tables is stored in order of the level of detail provided,
from summary information in MATCH_STATISTICS to detailed
match record information in MATCH_TRACING.
Essentially all data can be queried using TASK_EXECUTION_ID,
GROUP_ID and ROW_ID
[Diagram: table relationships among TASK_EXECUTIONS, MATCH_STATISTICS, MATCH_SOURCE_STATISTICS, MATCH_GROUP_INFO, MATCH_RECORD_INFO and MATCH_TRACING]

Side Effect Data for Cleanse
Cleanse side effect data provides summary and detailed information related to the cleanse
operation along with details specific to how the data (entities and components) was changed
Cleanse side effect tables consist of (in schema _SYS_TASK):
 CLEANSE_STATISTICS
– Provides a summary of a specified cleanse operation including the number of valid, suspect, blank and high-significance
changes on an entity-by-entity basis. An entity is equivalent to a cleanse component (Address, Person, Firm, Phone, etc.)
 CLEANSE_ADDRESS_RECORD_INFO
– Provides a summary of the address cleansing results of a specific operation including assignment level, assignment type
and assignment information code (V/I/C) for each row in the input data
 CLEANSE_CHANGE_INFO
– Provides detailed information on a row-by-row, entity-by-entity and component-by-component basis that explains the
significance of the change and the type of change. This makes the cleanse operation fully transparent (a white box)
 CLEANSE_INFO_CODES
– Provides detailed information on a row-by-row and entity-by-entity basis that defines exactly which issue with the data
caused the entity to fail validation during the cleansing operation

Cleanse Side Effect Data – Table Relationships
The cleanse side effect data is stored in a relational data model
The data in the tables is stored in order of the level of detail provided,
from summary information in CLEANSE_STATISTICS to detailed
cleanse information in CLEANSE_CHANGE_INFO.
Essentially all data can be queried using TASK_EXECUTION_ID,
ENTITY_ID and ROW_ID
ENTITY_ID can be looked up using data found in the
TASK_LOCALIZATION table using the LOC_ID column
[Diagram: table relationships among TASK_EXECUTIONS, CLEANSE_STATISTICS, CLEANSE_ADDRESS_RECORD_INFO, CLEANSE_CHANGE_INFO and TASK_LOCALIZATION]

Task Management
Tasks can now be stopped before execution completes using a new SQL statement
CANCEL TASK <TASK_EXECUTION_ID> [WAIT <TIME_IN_SECONDS>]
The cancel task command can be used:
 Within a SQL console
 Within a stored procedure
Retrieve the TASK_EXECUTION_ID by:
 Obtaining the last task execution ID
– SELECT session_context('TASK_EXECUTION_ID') FROM dummy;
 Viewing the monitoring information
– SELECT * FROM M_TASKS WHERE TASK_EXECUTION_ID = CAST(session_context('TASK_EXECUTION_ID') AS BIGINT);
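When the cancel command is issued from a script, the statement text can be assembled with a small helper such as the following Python sketch. It only builds the string according to the syntax shown above and does not connect to a database; the function name cancel_task_statement is hypothetical, and how the statement is then executed depends on the client in use.

```python
def cancel_task_statement(task_execution_id, wait_seconds=None):
    """Build a CANCEL TASK statement for the given task execution ID.

    Values are validated as non-negative integers so that nothing other
    than a number is ever interpolated into the statement text.
    """
    def checked(name, value):
        if isinstance(value, bool) or not isinstance(value, int) or value < 0:
            raise ValueError(f"{name} must be a non-negative integer")
        return value

    statement = f"CANCEL TASK {checked('task_execution_id', task_execution_id)}"
    if wait_seconds is not None:
        statement += f" WAIT {checked('wait_seconds', wait_seconds)}"
    return statement


print(cancel_task_statement(12345))
print(cancel_task_statement(12345, wait_seconds=30))
```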

Disclaimer
This presentation outlines our general product direction and should not be relied on in making
a purchase decision. This presentation is not subject to your license agreement or any other
agreement with SAP.
SAP has no obligation to pursue any course of business outlined in this presentation or to
develop or release any functionality mentioned in this presentation. This presentation and
SAP’s strategy and possible future developments are subject to change and may be changed
by SAP at any time for any reason without notice.
This document is provided without a warranty of any kind, either express or implied, including
but not limited to, the implied warranties of merchantability, fitness for a particular purpose, or
non-infringement. SAP assumes no responsibility for errors or omissions in this document,
except if such damages were caused by SAP intentionally or grossly negligent.

Additional Resources
 SAP HANA EIM documentation on SAP Help Portal
– http://help.sap.com/hana_options_eim
 SAP HANA Academy on YouTube – What’s new with SAP HANA SPS10 playlist
– https://www.youtube.com/playlist?list=PLkzo92owKnVxweu0HK_3QjCfHiMn0jIcA

Thank you
Contact information
Richard LeBlanc | Ken Beutler
SAP HANA EIM Product Management
richard.leblanc@sap.com | ken.beutler@sap.com
