NRB - LUXEMBOURG MAINFRAME DAY 2017 - Data Spark and the Data Federation

© 2017 IBM Corporation
Data: Spark and the Data Federation
Leif Pedersen
Executive IT Specialist,
z Analytics, Europe
Email: Leif.Pedersen@dk.ibm.com

Systems of Insight, Systems of Record, Systems of Engagement
Look like a “déjà vu”?

In the new insight economy, winners infuse analytics
everywhere to drive better outcomes!
Create new business models
(CEO)
Attract, grow, retain
customers
(CMO)
Transform financial
& management
processes
(CFO)
Manage risk
(CRO)
Prioritize IT investment
for innovation
(CIO, CDO)
Optimize
operations
(COO)
Fight fraud and
counter threats
(CSO)
Systems of Insight, Systems of Record, Systems of Engagement

IBM Analytics Point of View - Make DATA SIMPLE and ACCESSIBLE to ALL
All Data, New Analytics, New Dev Styles, More People
Business Value
• Embrace all data
• Enable all analytics
• Run at the speed of business
DATA Professionals are leading THE Transformation!

The Evolution in the Approach to Getting Value from Data
Operations Data Warehousing Self-service
Analytics
New Business
Imperatives
Maturity High
High
Low
Data-Informed
Decision Making
• Full dataset analysis
(no more sampling)
• Extract value from
non-relational data
• 360° view of all
enterprise data
• Exploratory analysis
and discovery
Warehouse
Modernization
• Data lake
• Data offload
• ETL offload
• Queryable archive
and staging
Lower the Cost
of Storage
Ensure resiliency
and availability
Business
Transformation
• Create new business
models
• Risk-aware decision
making
• Fight fraud and
counter threats
• Optimize operations
• Attract, grow, retain
customers
Value
We are here

Analytics evolution to support all Analytics Apps on all Data – The Mainframe Use case
Diagram layers (Applications and Data):
• SoR: Core Business supported by CICS, IMS, WAS; Operational Data stored in VSAM, IMS, DB2
• SoI, The Data Lake Evolution: HDFS, Map / Reduce, Spark, Other Data, IT Operational Data
• SoI, BI Reporting: Data Warehouse / Data Marts; Historical data in DB2 for z/OS & IBM DB2 Analytics Accelerator
• SoI, The Predictive Analytics Evolution: Machine Learning, Score Creation
• z/OS: Rules, Score execution
• SoE

z Systems Analytics Areas complement existing Analytics
Environments.
IBM DB2 Analytics Accelerator
In transaction rules and
score execution
Intraday capability for ad-hoc
queries & predictive analytics
Availability of historical
data (in raw format)
Accelerated reporting to
fulfill internal and regulatory
requirements
Ability to transform
data before offload to
DWH or reporting
Ability to create new
models at any time
Quasi Real Time
availability of data
for analytics
Instant access to raw data
for new report generation in
hours instead of days
Load and merge of ANY non-DB2 z/OS data
z Apps: Scoring, Rules
z Data
Explore data to uncover hidden insights

Opportunity to rethink business processes: analytics as an integral part of the process itself,
rather than a separate activity performed after the fact
o Transform business processes, not just provide existing styles of analytics faster and without latency
Enable business leaders to perform, in the context of operational processes, advanced and
sophisticated real-time analysis of their business data
“Hybrid transaction/analytical processing will empower application leaders to
innovate via greater situation awareness and improved business agility.”
Gartner Research Note G00259033, 28 January 2014: Hybrid Transaction/Analytical Processing Will Foster Opportunities for Dramatic Business Innovation
The integration of transactions and analytics is an
emerging and important market segment
Analytics as
part of the
flow of
business
Insights on
every
transaction

Hybrid Transaction/Analytical Processing (HTAP) - with DB2
Analytics Accelerator
OLAP
DB2 for z/OS
Processing
IBM DB2 Analytics Accelerator
DB2 for z/OS CPU savings
target
• Operational (in transaction)
analytics
• (complex) OLTP
Accelerator focus
• Ad-hoc queries
• Complex queries scanning
large amount of data
• ETL acceleration/virtual
transformation
Complex queries (more history)
OLTP Transactions
High concurrency
Hybrid Transactional &
Analytical Processing
Standard reports

Data Warehouse and Data Lake
A Data Lake is…
+An analytics sandbox for exploring data to
gain insight
+An enterprise-wide catalog to find data across
the enterprise and to link from business term to
technical metadata
+An environment for enabling reuse of data
transformations and queries
+An environment where users can access vast
amounts of raw data
+An environment for developing and proving
an analytics model and then moving into
production; experience in production may drive
further experimentation in the data lake
A Data Lake is not…
- A data warehouse or data mart of all of the data
in an enterprise
- A high-performance production environment
- A production reporting application
- A purpose-built system to solve a specific
problem

Spark, a Transaction Manager for Analytics Applications
Fast Runtime Environment
– Interactive or batch processing
– Based on in-memory data processing
• High performance for multi-step processes where Spark can pass the data directly without using disk storage.
– Parallel processing
Interface to Data
– Accessing Hadoop-based HDFS data, Cassandra, HBase, …
– Accessing any traditional database using JDBC
Interface for Applications – Ease of Use APIs supported by modern languages
– Stack of libraries including SQL, Machine Learning, GraphX, and Spark Streaming
– Over 80 high-level operators that make it easy to build parallel applications
– Many languages supported including Java, Scala, Python and R
• Spark is actually written in Scala
Spark is NOT a datastore, NOT a replacement for Hadoop!

Why Spark matters to a business?
1. Spark makes it easier to access and work with all data
- Enables new data-based use cases
- All data: Internal/External, Structured/Unstructured
- Real-time insights, from all data sources
- Automates analytics with Machine Learning
- Clients that lead in data, lead their industry
2. Spark lets you develop line-of-business applications faster
3. Spark learns from data and delivers in real time
With Hadoop, you ask a question and get back a batch of data. With Spark, you may say,
“continue to give me answers to this question”…and when new data comes, the user is smarter.
Diagram labels: Design, Development, Data Science

Spark can run on z/OS close to z/OS-based Applications & Data
Diagram components:
• Spark Applications: IBM and Partners
• Apache Spark Core: Spark Streaming, Spark SQL, MLlib, GraphX (RDD, DF)
• Optimized data access
• IBM z/OS Platform for Apache Spark
• z/OS: Key Business Transaction & Batch Systems; VSAM, Adabas, IMS, DB2 z/OS, and *many* more including SMF, OPERLOG, SYSLOGs, . . .
• Distributed: Teradata, HDFS, and *many* more . . .
Values:
• Data-in-place analytics, without need to ETL or move data for analytic purposes
• Optimized access and z/OS governed ‘in-memory’ capabilities for core business data
• Unique capability to access almost all z/OS sources with Apache Spark SQL & many non-z data sources
• Almost all zIIP eligible
• Integration of analytics across core systems, social data, website information, etc.

Examples of Spark Use Case

Client Insight: Analytics over transactions & customer interactions
Leverage data on z/OS (DB2, VSAM) & distributed (Oracle, SQL Server, HDFS) to give data science teams
focused on client insight real-time access for developing patterns and models
Data Distillation - Hybrid Architecture
Run Spark z/OS to access, aggregate, filter and *distill* large volumes of data
Make available smaller, aggregated analytic results for access by: customer insight solutions, data science
environments
360 Degree View: Customers, Payments, Transactions
Leverage Spark z/OS to get real-time or near real-time view of current status of payments, transactions,
customers combining data from OLTP, distributed sources, & streaming
IT Analytics
Analyze real-time streamed SMF data, combined with archived SMF data and syslog data; visualize and interact with the data
in a data science Jupyter Notebook to find patterns
Use Case Patterns

Use Case #1: Hybrid Data Science
Distill the Data:
• Use Spark z/OS for data blending, cleansing, transformation, etc. with data-in-place
• Store results in ‘Tidy’ Data Repository
• Refresh as needed
Explore the results:
• Data exploration, investigation leveraging ‘Tidy’ Repository
Values:
• Leverage most current business data for data science
• Efficiencies in reducing ETL
• Leverage common analytics ecosystem skill
• Integrate Spark on multiple platforms for optimal analytics infrastructure
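The "distill" step of this use case can be illustrated with a minimal sketch. It is plain Python standing in for the cleansing and aggregation a Spark job would perform, not Spark code and not the product's API. The record layout (`customer`, `amount`) and the function name `distill` are hypothetical, chosen only for illustration.

```python
from collections import defaultdict


def distill(raw_transactions):
    """Cleanse, filter and aggregate raw records into a small 'tidy' result set.

    Hypothetical record layout: {"customer": str, "amount": number}.
    """
    totals = defaultdict(lambda: {"count": 0, "amount": 0.0})
    for rec in raw_transactions:
        # Cleansing: normalise the customer key.
        customer = (rec.get("customer") or "").strip().upper()
        amount = rec.get("amount")
        # Filtering: skip records without a key or without a numeric amount.
        if not customer or not isinstance(amount, (int, float)):
            continue
        totals[customer]["count"] += 1
        totals[customer]["amount"] += amount
    # Tidy output: one row per customer, sorted by key.
    return [
        {"customer": c, "count": t["count"], "amount": round(t["amount"], 2)}
        for c, t in sorted(totals.items())
    ]


raw = [
    {"customer": " c001 ", "amount": 120.0},
    {"customer": "C001", "amount": 30.5},
    {"customer": "C002", "amount": 10},
    {"customer": None, "amount": 99.0},     # dropped: no customer key
    {"customer": "C003", "amount": "n/a"},  # dropped: non-numeric amount
]
tidy = distill(raw)
for row in tidy:
    print(row)
```

The point of the pattern is that only the small `tidy` result leaves the platform, while the raw records stay in place; a refresh simply reruns the same function on current data.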

Use Case #2: Optimized Customer Insight
Diagram components:
• z/OS data: Customer, Transaction, Merchant
• Apache Spark Core: Spark Streaming, Spark SQL, MLlib, GraphX (RDD, DF)
• Optimized Data Layer
• IBM z/OS Platform for Apache Spark
• Spark Analytic Result Set: subset of data, distilled, filtered, transformed
• Transform (if needed), & populate BBCI staging area / cache
• Tidy Data: Input & Output
• Customer Insight for Banking Solution: BI Dashboard Components, Data Cube, Analytical Engines, Web Portal, Analytics, API Gateway, APIs, Pre-Built Dashboards, Pre-Built Data Models, Pre-Built Analytical Models
• Call Center
Values:
• Avoid costly and ineffective wholesale copy of data
• Frequent refresh of most relevant data elements to customer insights solution
• Faster time to implementation for business solution to deliver insights on churn, cross-sell, etc.

Use Case #3: Real-Time Application Event
Analytics Use Case
Spark
z/OS
Event Stream
CICS Event triggers create an event stream that would
be captured by Spark running on its own z/OS LPAR
Spark configured for high availability to avoid impacting
CICS
Real-Time Analytics with Spark z/OS:
Real time analytics to provide feedback into the
Systems of Engagement or Monitoring Systems on
types of banking services and frequency of
consumption
Real time monitoring of core business processes
and applications
Historical Analysis leverages IDAA (IBM DB2 Analytics Accelerator):
Batch Load of Events for historical, trending and
reporting
Diagram components:
• CICS Transactions, Logstream
• Real-Time Consumption: Real Time Analytics, can include scoring; System of Engagement; Monitor
• Batch Load Overnight: DB2 Analytics Accelerator Loader; IBM DB2 Analytics Accelerator; DB2 z/OS; Historical Analysis, Reporting
• Channel
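As a rough illustration of the real-time side of this use case (types of banking services and frequency of consumption), the sketch below keeps a sliding-window count of events per service type. It is plain Python, not Spark Streaming code. The class name `ServiceUsageWindow`, the event shape (timestamp, service name) and the 60-second window are assumptions made for the example, and events are assumed to arrive in timestamp order.

```python
from collections import Counter, deque


class ServiceUsageWindow:
    """Counts events per service type over the last `window_seconds`.

    Hypothetical helper; assumes events arrive in timestamp order.
    """

    def __init__(self, window_seconds):
        self.window_seconds = window_seconds
        self.events = deque()    # (timestamp, service) in arrival order
        self.counts = Counter()  # live count per service inside the window

    def add(self, timestamp, service):
        self.events.append((timestamp, service))
        self.counts[service] += 1
        self._expire(timestamp)

    def _expire(self, now):
        # Drop events that have fallen out of the window.
        while self.events and self.events[0][0] <= now - self.window_seconds:
            _, old_service = self.events.popleft()
            self.counts[old_service] -= 1
            if self.counts[old_service] == 0:
                del self.counts[old_service]

    def snapshot(self):
        return dict(self.counts)


window = ServiceUsageWindow(window_seconds=60)
stream = [(0, "balance"), (10, "transfer"), (30, "balance"), (70, "payment")]
for ts, service in stream:
    window.add(ts, service)
usage = window.snapshot()
print(usage)
```

A snapshot like `usage` is the kind of small result that could be fed back to a System of Engagement or a monitor, while the full event history goes to the overnight batch load.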

Use Case #4: Surface Spark Results to JDBC / ODBC Applications
Diagram components:
• Data sources: DB2 z/OS, IMS, VSAM, HDFS
• Optimized Data Layer
• Apache Spark Core: Spark Streaming, Spark SQL, MLlib, GraphX (RDD, DF)
• DF Store:
  – Persist specific Spark Result Sets
  – Backed by VSAM
  – Leverage z/OS SAF, Dataset mgmt
• JDBC / ODBC / REST, noSQL client accessing Spark RDDs, example: Cognos, Tableau, …

Use Case #5: Analyzing SMF Data with Spark
• Spark application is
agnostic to data source
and number of sources
• MDSS required on at
least one system, MDSS
agents required on all
systems. No IPL required
for installation
• Logstream recording
mode required for
realtime interfaces
Diagram components:
• LPAR1, LPAR2, LPAR3: MDSS Client, SMF Realtime, Logstreams
• LPARn: SMF Realtime, Logstreams
• Dump Data Sets
• Optimized Data Integration Layer (MDSS)
• JDBC
• Spark Application using SparkSQL
Analyze real-time in-memory SMF data, combined with archived data
Analyze data across multiple LPARs
Augment with SYSLOG and other sources for richer analytic outcome
Efficiencies in avoiding data movement
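The idea of combining real-time and archived SMF data across LPARs in one SQL query can be sketched as follows. This uses Python's built-in `sqlite3` purely as a stand-in for the SQL interface; it is not SparkSQL and not the MDSS JDBC driver. The table names (`smf_realtime`, `smf_archive`) and columns (`lpar`, `job`, `cpu_seconds`) are hypothetical and do not reflect real SMF record layouts.

```python
import sqlite3

# In-memory database standing in for the federated SQL layer (assumption).
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE smf_realtime (lpar TEXT, job TEXT, cpu_seconds REAL);
    CREATE TABLE smf_archive  (lpar TEXT, job TEXT, cpu_seconds REAL);
    """
)
conn.executemany(
    "INSERT INTO smf_realtime VALUES (?, ?, ?)",
    [("LPAR1", "PAYROLL", 2.0), ("LPAR2", "BILLING", 1.5)],
)
conn.executemany(
    "INSERT INTO smf_archive VALUES (?, ?, ?)",
    [("LPAR1", "PAYROLL", 10.0), ("LPAR2", "BILLING", 4.0),
     ("LPAR3", "REPORTS", 7.5)],
)

# One query over both sources; the application does not care where rows come from.
query = """
    SELECT lpar, SUM(cpu_seconds) AS total_cpu
    FROM (SELECT lpar, cpu_seconds FROM smf_realtime
          UNION ALL
          SELECT lpar, cpu_seconds FROM smf_archive)
    GROUP BY lpar
    ORDER BY lpar
"""
cpu_by_lpar = conn.execute(query).fetchall()
conn.close()
print(cpu_by_lpar)
```

The query itself is source-agnostic, which mirrors the slide's first point: adding another LPAR or another archive only adds a branch to the union, not a new application.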

Use Cases for Real Time SMF Analytics
Detect excessive memory consumption – SMF 30
Monitor high water mark for real memory usage for jobs and send alerts if usage exceeds normal
consumption
Detect security violations in real-time – SMF 80
Monitor volume of datasets/files accessed per user within a given time period and raise alerts for above
normal access rates
Real-time monitoring of resource usage in cloud environments (CPU, Memory, Disk)
A list of supported SMF record types can be found in the Redbook “Apache Spark Implementation on IBM z/OS” - page 78
http://www.redbooks.ibm.com/abstracts/sg248325.html
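The two alerting ideas above (memory high-water marks and per-user access volume) can be sketched in plain Python. This is an illustration of the logic only: it does not parse real SMF 30 or SMF 80 records, and every name here is hypothetical. In particular, defining "normal" as mean plus k standard deviations of a job's history is an assumption of this sketch, not something the slides specify.

```python
from collections import defaultdict
from statistics import mean, pstdev


def memory_alerts(history, current, k=3.0):
    """Flag jobs whose current high-water mark exceeds mean + k * stdev of their history.

    `history` maps job name -> list of past high-water marks (assumed layout).
    """
    alerts = []
    for job, hwm in current.items():
        past = history.get(job, [])
        if len(past) < 2:
            continue  # not enough history to define "normal"
        threshold = mean(past) + k * pstdev(past)
        if hwm > threshold:
            alerts.append(job)
    return sorted(alerts)


def access_alerts(access_events, window_start, window_end, max_datasets):
    """Flag users who touched more than `max_datasets` distinct data sets in the window.

    `access_events` is an iterable of (user, dataset, timestamp) tuples (assumed layout).
    """
    seen = defaultdict(set)
    for user, dataset, ts in access_events:
        if window_start <= ts < window_end:
            seen[user].add(dataset)
    return sorted(u for u, d in seen.items() if len(d) > max_datasets)


history = {"PAYROLL": [100, 110, 90, 100], "BILLING": [200, 210, 190, 200]}
current = {"PAYROLL": 105, "BILLING": 400}
mem = memory_alerts(history, current)

events = [
    ("ALICE", "A.DATA", 5), ("ALICE", "B.DATA", 6), ("ALICE", "C.DATA", 7),
    ("BOB", "A.DATA", 5), ("BOB", "A.DATA", 8), ("ALICE", "D.DATA", 99),
]
acc = access_alerts(events, window_start=0, window_end=60, max_datasets=2)
print(mem, acc)
```

Both functions return only the names that need attention, so the alerting side stays small even when the underlying record volume is large.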

IBM Open Data Analytics for z/OS

Leveraging IBM Z for Optimized Analytics
Federate analytics leveraging data in place for more current insights at scale,
optimized security, privacy and reduced costs
Diagram components:
• Business Applications: Customer, Transaction, Merchant
• Distributed
• Apache Spark, Python, Optimized Data Integration Layer
• Query Acceleration
• IBM Machine Learning for z/OS: Data, Data Prep, ML Algo, Model, Deploy, Predict; Govern, Manage, Algorithm Assist…; Monitor, Feedback
• Distilled Insight, Analytic Result Sets
• Pauseless GC, New SIMD instructions, 32 TB Memory, Pervasive Encryption

IBM Open Data Analytics for z/OS: Offering Overview
What is in the Offering?
IBM Open Data Analytics for z/OS (IBM
product):
• Apache Spark 2.1.1 enabled for z/OS
• Python 3.6.1
• All Pre-requisite libraries
• Select Anaconda Libraries (approx. 250 including
pandas, dask, numpy, scikit-learn, matplotlib…)
• Optimized Data Integration Layer: optimized for
Spark & Python db access to z/OS data
• Integration with WLM z/OS for resource
management aligned with job priority
• Integration with security (SAF) interfaces
• Support & Service available from IBM for a fee
–Very aggressive pricing for zIIPs (cores) and memory for
Open Data Analytics z/OS workload
Ecosystem
–GitHub zos-spark repository
•Jupyter Notebooks (Scala, Python Workbenches)
•Kernel gateway, Jupyter client, Apache Toree kernel
•Sample data & code snippets
–Rocket:
•Collaboration for Optimized Data Layer
•Industry vertical mappings, e.g. ISO8583-1, ACH,
SMF, etc.
–Continuum:
• Access to z/OS channel on Anaconda cloud for
updates / refreshes & Package management
• Option to license private mirrored environment
• Services & Consulting for Python

Value: Increase Integration through Persisting Analytic Results for Enterprise Collaboration
Diagram components:
• Data sources: VSAM, IMS, DB2 z/OS, HDFS
• Optimized Data Layer
• Apache Spark Core: Spark Streaming, Spark SQL, MLlib, GraphX (DF)
• Python 3.6.1 Core Packages: numpy, scikit-learn, dask, pandas, Matplotlib, etc.
• DF Store:
  – Specific Spark & Python Result Sets
  – Backed by VSAM
  – Leverage z/OS SAF, Dataset mgmt
• JDBC / ODBC / REST, noSQL client accessing Spark RDDs, example: Cognos, Tableau, …
