Big data has changed the IT landscape. Learn how
your existing IIG investment, combined with our
latest innovations in integration and governance, is a
springboard to success with big data use cases that
unlock valuable new insights.
Presenter: David Corrigan, Big Data Specialist, IBM
New Innovations in Information Management for Big Data - Smarter Business 2013
1. New Innovations in Information Integration &
Governance (IIG) for Big Data
David Corrigan
Director of Product Marketing, InfoSphere
2. Data Confidence Is Essential
If you want to find new insights
from big data . . .
and ACT on those insights . . .
you need confidence in the data
used for insight
Information Integration & Governance (IIG)
• Make decisions with greater certainty
• Analyze rapidly while providing necessary controls
• Increase the value of data
3. Building Big Data Confidence is Essential
3x
Outperform Competitors
Organizations with IIG outperform their competitors
80%
Transform the Front Office Experience
Organizations rated their decision making as good or excellent
77%
Establish Trusted Information
Organizations establish high or very high level of trust in data
4. IIG Evolves for the Era of Big Data
Automated Integration
How do I get access to new big data sources?
Business users need rapid data provisioning among the zones
Visual Context
How do I digest all of this new information?
Categorize, index, and find big data to optimize its usage
Agile Governance
How do I manage all of this new data?
Ensure appropriate actions based on the value of the data
5. Six Innovations that Build Big Data Confidence
Visual
Context
Agile
Governance
Automated
Integration
Big Match *
Integration of master records from
big data with probabilistic matching
powered by Hadoop
Big Data Catalogue *
Categorize metadata on all big
data sources
MDM for Big Data *
Rapid mastering of new big data
sources and extension of 360°
view with unstructured big data
Data Click
Self-service data
provisioning for big
data repositories
Information Governance
Dashboard
Visual context to give immediate
status on governance policies
Big Data Privacy &
Security
Monitor and mask sensitive big
data in Hadoop, NoSQL, &
relational systems
* Statement of Direction
6. InfoSphere Data Click
Self-service Data Provisioning
Innovation
• Two-click data provisioning designed for business
users
• Integration of more big data sources – JSON,
NoSQL, Hadoop, JDBC
Value
• Rapid provisioning of ad-hoc repositories
• Faster time to insight
• Self service to eliminate the IT bottleneck
Usage
• Enables rapid analysis of big data sources
Data provisioning in 1/5000th the time of traditional approach
Automated Integration
2-Click Data Access
* Source: IBM performance lab testing, showing JDBC inserts at
5.8% to 74% faster
7. Big Match
Find & Integrate Master Data in Big Data Sources
MDM BigInsights
Big Match Engine
Match
Millions
Of Records
Automated
Integration
How It Works
• Probabilistic matching on big data platform
(BigInsights-Hadoop)
• Matching at a higher volume
• Matching of a wider variety of data sets
Client Value
• Find master data within big data sources
• Get an answer faster – enable real-time matching
at big data volumes
Usage
• Provides more context by detecting master
entities faster
* Source: IBM InfoSphere performance
team test results
8. Big Data Catalogue
Find Big Data More Easily
Visual
Context
Big Data Catalogue
170x
Improvement in
metadata import
performance*
Innovation
• Stores metadata on every available big data
source
• Provides structure to the Hadoop landing zone so
data may be easily found and leveraged
• Classifies data (origin, lineage, source, value….)
Value
• Find data more easily within a growing Hadoop
landing zone and a complex zone architecture
• Rapidly leverage new big data sources
Usage
• Enables optimal usage of big data
* Source: IBM internal performance results, where three test runs with the latest version averaged 11.46 seconds vs 1,964 seconds with the previous release
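As a quick sanity check, the 170x figure can be reproduced from the two timings quoted in the source note. The snippet below is only illustrative arithmetic; the variable names are ours and not part of any IBM tooling.

```python
# Timings quoted in the slide footnote (average of three test runs).
previous_release_seconds = 1964.0
latest_version_seconds = 11.46

# Speedup factor = old elapsed time / new elapsed time.
speedup = previous_release_seconds / latest_version_seconds

print(f"Metadata import speedup: about {speedup:.0f}x")  # about 171x
```

The ratio comes out at roughly 171, which is consistent with the rounded "170x" headline.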
9. Information Governance Dashboard
Visualize and Control Governance
Visual Context
Innovation
• Measurements for policies and KPIs
• Rapid creation of tailored dashboards
Value
• Immediate insight into governance policy status
• Interception of issues when they start, right at the
source
Usage
• Raises data confidence with visual governance
status
1000s
Of data points
and policies
visualized
10. Big Data Privacy and Security
Protect a Wider Variety of Sources
InfoSphere
Optim
InfoSphere
Guardium
Agile
Governance
80%
Faster Activity
Monitoring*
Innovation
• Data activity monitoring of more NoSQL, Hadoop,
and Relational Systems
• Masking of sensitive data used in Hadoop
Value
• Protection is a pre-requisite for the fundamental
assumption of big data – sharing data for new
insight
• Automation enables protection without inhibiting
speed
Usage
• Ensures sensitive data is protected and secure
RDBMS
Hadoop
NoSQL
Data Warehouses
Application Data
and Files
* Source: IBM internal benchmarks of InfoSphere Guardium V9 p50
11. MDM for Big Data
The Complete 360° View of Important Data
MDM Data Explorer
Agile
Governance
21K
Customer-centric
transactions per
second*
How It Works
• Extend the master view with federated,
unstructured big data
• Hybrid styles enable linking source records or
consolidating based on confidence
Client Value
• Visualize every related data item in the 360° view
• Rapidly onboard new big data sources
• MDM adapts to the source
Usage
• Provides a complete understanding of the
customer or master entity
* Source: InfoSphere MDM with DB2 pureScale
achieves: 21,000 customer-centric transactions a
second, 2X transaction rate of Oracle MDM on
Exalogic/Exadata using ½ the number of cores
Note to U.S. Government Users Restricted Rights --
Use, duplication or disclosure restricted by GSA ADP
Schedule Contract with IBM Corp.
Approved Claim in US/Canada only.
Results valid as of 10/21/2012.
13. InfoSphere Delivers Data Confidence
For Big Data Use Cases
• Big Data Exploration: Understand confidence; determine risk
• Enhanced 360° View of the Customer: Establish master record; extend to all sources
• Security/Intelligence Extension: Automatic data protection; mask sensitive information
• Operations Analysis: High volume data integration; automatic data protection
• Data Warehouse Augmentation: High volume data integration; agile big data archiving and retrieval
14. Use Case Spotlight: Enhanced 360° View
MDM and Big Data
Deliver the Complete 360° View
Capabilities Required to
Be Successful
1. Combine structured MDM and
unstructured big data
2. Rapidly onboard uncertain data
sources in a registry style to
separate low and high confidence
data
3. Find and match master data
entities within big data sources
[Diagram: MDM, Integration & Quality, and Data Explorer; labels: Single Version of the Truth, Extended View of Master Data]
15. Use Case Spotlight: Data Warehouse Augmentation
Improve your data warehouse
by improving data confidence
[Diagram: Data Warehouse surrounded by Integration & Quality, MDM, Archiving, Security & Privacy, and Test Data Management; benefits shown: high performance data loads, automated archiving, automated data protection, self-service testing, more accurate analysis]
Capabilities Required to
Be Successful
1. Self-service integration for ad-hoc
requests
2. Understand context of all available
big data with a single metadata
repository and business glossary
3. Mask any variety of sensitive data
before ingestion
4. Automatically protect big data with
activity monitoring
5. Store and analyze archive files on
Hadoop
16. A Busy Year of Innovation within the Labs
Literally dozens of
innovations that raise
confidence in big data
Two highlights:
1. BLU Acceleration
2. PureData System
for Hadoop
17. BLU Acceleration
IBM Research & Development Lab Innovations
Dynamic In-Memory
In-memory columnar processing with
dynamic movement of unused data to storage
Actionable Compression
Industry’s first data compression that preserves order
so that the data can be used without decompressing
Parallel Vector Processing
Multi-core and SIMD parallelism
(Single Instruction Multiple Data)
Data Skipping
Skips unnecessary processing of irrelevant data
Super Fast, Super Easy—
Create, Load and Go!
No indexes, No aggregates,
No tuning, No SQL changes,
No schema changes
18. BLU Acceleration: Customers are Seeing Great Results
“100x speed up with literally no tuning!”
“Converting this row-organized uncompressed table to a column-organized table in DB2 10.5 delivered a massive 15.4x savings!”
“With BLU Acceleration, we’ve been able to reduce the time spent on pre-aggregation by 30x, from one hour to two minutes! BLU Acceleration is truly amazing.”
Customers quoted:
Iqbal Goralwalla, Head of DB2 Managed Services, Triton
Lennart Henäng, IT Architect
Yong Zhou, Sr. Manager of Data Warehouse & Business Intelligence Dept.
19. PureData System for Hadoop
Bringing big data to the enterprise
Simplify the delivery of unstructured data to the enterprise
Integrate Hadoop with the data warehouse
Leverage Hadoop for data archive
Provide best in class security
Provide data exploration across structured and unstructured
data
Accelerate insight with machine data
Accelerate insight with social data
20. Confidence Is Essential for Actionable Insight
• Make decisions with greater certainty
• Analyze rapidly while providing necessary
controls
• Increase the value of data
Visual Context
Agile Governance
Automated Integration
Key Points:
- Big data has created a new era of opportunity for organizations of all types.
- Big data offers insights that can lead to solutions to some of the thorniest business challenges.
- But before you act on the insights gleaned from big data, you need confidence in the data.
- Effective IIG helps you gain that confidence.
Message in a Nutshell:
- Gain confidence in your data before you act!
Key Points
There’s a notion that you should govern data to make it an asset, or because you ought to do it, or because you have to due to compliance.
Those are true, but the real reason you do it is for competitive advantage.
Information is supposed to inform all of our decisions – to unlock new insights for competitive advantage, to gain market share, etc.
But the biggest hindrance to using information is confidence – if users don’t trust the data, they won’t use it. Trusting the data means you actually use it to your advantage, and that’s the source of outperforming peers.
Those same companies are able to transform their front office experience by making faster and better decisions at the point of interaction. In fact, 4 out of 5 companies with mature IIG rated their decision making as 7/10 (very good) or higher. In other words, better data means better decisions.
And the users have confidence in their data because they know it’s trusted – it’s made obvious to them what has been done to verify, validate, and improve the information they are using. In other words, they make better decisions because they trust their data.
Client Stories & Anecdotes
24,800 lives saved with better information confidence: Premier used a variety of IBM software products to improve patient health and reduce costs. The InfoSphere products (Master Data Management, Information Server, DataStage, and QualityStage) created a single, trusted view of each data entry in the system. Together, the products delivered a better data warehouse.
Catchy Statement
The reason you integrate and govern data is as simple as this – you’ll outperform your competitors by making better decisions because your employees have confidence in and therefore use data available to them.
Key Points:
- IIG was important before the big data era
- But in this new world, where it’s assumed that new and diverse data will be broadly shared and used for deeper insights, it’s more important than ever before.
- We have looked at the new requirements and built on our existing capabilities so our clients can have a sound basis for confidence in their data
and the insights derived from the data
Message in a Nutshell: Without confidence, what good are analysis and insights based on big data?
Key Points:
- Beyond the capabilities included in this announcement, IBM is moving toward extending its support for automated integration, visual context and agile governance.
- Now we would like to share some of the additional capabilities included in this Statement of Direction.
Capabilities included in SOD:
Big Match: Rapid matching of master data for faster insights
Big Data Catalog: An easy way to find the right data, despite high volumes
Agile MDM for Big Data: Extension of MDM across unstructured big data
More details are ahead on the next few slides
Message in a Nutshell: IBM has a continuing plan to enhance big data confidence.
Key Points:
- With so many initiatives dependent on data, simply getting access to the right data is a challenge.
- InfoSphere Data Click accelerates a whole host of projects by making it easier to get started, without dealing
with long waits for IT resources
- Data Click has been very well received since its introduction last year, and now it is becoming even more
helpful by enabling integration of data from more big data sources (JSON, NoSQL, Hadoop, lots of others
via JDBC)
How InfoSphere Data Click Works:
Data Click now provides rapid access to a wide range of data, in repositories like Teradata, Netezza, SQL Server,
Greenplum, Informix, Sybase, files and more . . . in addition to the original sources (DB2, Oracle) and original
target (Netezza)
Message in a Nutshell: Universal connectivity with just two clicks
Key Points:
- Matching master records is a compute-intensive process—one that can become a bottleneck with big data.
- Without understanding master entities such as customers, products, and locations, how can you derive actionable and accurate insight from big data?
How Big Match Is Designed to Work:
- By running the matching engine on Hadoop (InfoSphere BigInsights), MDM can match in real time.
- Big Match will enable rapid and accurate detection of duplicates and related information within large volumes or streams of big data, prior to ingestion within key internal systems or analysis
Message in a Nutshell: Big Match matches big data fast.
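To make "probabilistic matching" concrete, here is a minimal sketch of one common approach: each compared field adds a log-likelihood weight when it agrees and a penalty when it disagrees, and the total decides between match, review, and non-match. This is not the Big Match engine. The field names, the m/u probabilities, and the thresholds are all illustrative assumptions.

```python
import math

# Hypothetical per-field probabilities (illustrative values, not from IBM):
# m = P(field agrees | same entity), u = P(field agrees | different entities).
FIELD_PARAMS = {
    "name": (0.95, 0.01),
    "birth_date": (0.98, 0.003),
    "postcode": (0.90, 0.05),
}


def normalize(value: str) -> str:
    """Lowercase and drop whitespace so trivial differences do not count."""
    return "".join(value.lower().split())


def match_score(a: dict, b: dict) -> float:
    """Sum log-likelihood-ratio weights over the compared fields."""
    score = 0.0
    for field_name, (m, u) in FIELD_PARAMS.items():
        if normalize(a[field_name]) == normalize(b[field_name]):
            score += math.log2(m / u)              # agreement adds evidence
        else:
            score += math.log2((1 - m) / (1 - u))  # disagreement subtracts it
    return score


def classify(score: float, upper: float = 8.0, lower: float = 0.0) -> str:
    """Map a score to a decision using two assumed thresholds."""
    if score >= upper:
        return "match"
    if score <= lower:
        return "non-match"
    return "review"


master = {"name": "Ann Lee", "birth_date": "1980-01-02", "postcode": "M5V 2T6"}
incoming = {"name": "ann  lee", "birth_date": "1980-01-02", "postcode": "m5v2t6"}
print(classify(match_score(master, incoming)))  # match
```

Scoring each candidate pair is independent of every other pair, which is why this style of matching lends itself to being spread across a cluster such as Hadoop.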
Key Points:
- One of the hardest challenges of big data is simply finding the right data.
- A Big Data Catalogue can make it easy for data users and scientists to ‘shop for data.’
How Big Data Catalog Is Designed to Work:
- It ingests and stores metadata from every available source, classifies data, and makes it easy to search and find via a user interface or SOA APIs.
- A Big Data Catalogue provides structure to Hadoop landing zones, enabling users to search, find, and leverage big data more quickly.
Relevant Story:
- Early testing shows significant improvements in the speed of importing metadata.
- We also intend to provide programmatic methods for importing large amounts of metadata.
- We’re seeing results like metadata import performance that is up to 170x faster than before and which can be run programmatically via the command line rather than manually orchestrated via the GUI.
Message in a Nutshell: Big Data Catalog enables shopping for data.
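The "shopping for data" idea can be pictured with a toy in-memory catalogue that stores metadata per source, classifies it with tags, and supports search. This is a sketch only, under our own assumptions: the class names, tags, and source locations are hypothetical and do not reflect the product's data model or APIs.

```python
from dataclasses import dataclass, field


@dataclass
class CatalogueEntry:
    name: str
    source: str                              # where the data lives
    tags: set = field(default_factory=set)   # classification labels


class Catalogue:
    """A toy metadata store: register sources, then search by name or tag."""

    def __init__(self):
        self._entries = {}

    def register(self, name, source, tags=()):
        self._entries[name] = CatalogueEntry(name, source, {t.lower() for t in tags})

    def search(self, term):
        term = term.lower()
        return sorted(
            e.name
            for e in self._entries.values()
            if term in e.name.lower() or term in e.tags
        )


catalogue = Catalogue()
# Hypothetical sources in a Hadoop landing zone and a relational system.
catalogue.register("web_clickstream", "hdfs:///landing/clicks", ["customer", "raw"])
catalogue.register("crm_accounts", "jdbc-crm", ["customer", "master"])
catalogue.register("sensor_logs", "hdfs:///landing/sensors", ["machine"])

print(catalogue.search("customer"))  # ['crm_accounts', 'web_clickstream']
```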
Key Points:
- A dashboard can be customized to reflect each organization’s policies and priorities.
- A dashboard can display both governance policies and operational results
- The more broadly an organization uses IIG capabilities, the richer the dashboard can be.
How Information Governance Dashboard Works:
- Metadata APIs enable application-specific dashboards and views in areas like data quality, master data, security and privacy
- The dashboard can support drill-down from a top-level view, for examining further detail and prompting appropriate action.
Relevant Story: Until now, executives, managers and leaders like Chief Data Officers haven’t been able to get a clear and complete view of governance policies and operational results. A dashboard enables a new level of insight—available immediately, to support informed decisions.
Message in a Nutshell: Seeing is believing!
Key Points:
- A common misconception is that data governance is a heavy-weight process that needs to be applied consistently against all data, for all use cases, if at all.
- Now there is a much better approach: agile governance, with controls that are appropriate to the data, the use case and the organization.
How Big Data Privacy and Security Works:
- IBM provides agile privacy and security for sensitive data in both traditional environments and newer NoSQL platforms, including Cassandra, Greenplum, Hortonworks and MongoDB.
- The new 64-bit architecture for high performance security provides data security at big data scale.
Relevant Story: We’re seeing performance improvements of up to 300% from previous versions of our data security capabilities—by processing more data in batches, doing more things in parallel, generally running through big data faster, to make sure it is secure and protected.
Message in a Nutshell: We’re providing appropriate governance for different big data use cases.
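As an illustration of masking sensitive data before it is shared, the sketch below replaces selected fields with a keyed hash, so the same input always yields the same masked token while the original value is not exposed. It is a generic technique shown under our own assumptions, not the InfoSphere Optim or Guardium implementation. The key, field names, and token length are hypothetical.

```python
import hashlib
import hmac

SECRET_KEY = b"demo-key-change-me"   # hypothetical key, for illustration only
SENSITIVE_FIELDS = {"ssn", "email"}  # hypothetical field names


def mask_value(value: str) -> str:
    """Return a short, deterministic token derived from the value."""
    digest = hmac.new(SECRET_KEY, value.encode("utf-8"), hashlib.sha256)
    return digest.hexdigest()[:12]


def mask_record(record: dict) -> dict:
    """Mask sensitive string fields and pass all other fields through."""
    return {
        key: mask_value(value) if key in SENSITIVE_FIELDS else value
        for key, value in record.items()
    }


row = {"customer_id": "42", "ssn": "123-45-6789", "email": "ann@example.com"}
masked = mask_record(row)
print(masked["customer_id"], len(masked["ssn"]))  # 42 12
```

Because the masking is deterministic, masked values can still be joined or counted across data sets, which keeps the data useful for analysis.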
Key Points:
- Organizations have fluid requirements that cannot all be addressed by a single MDM style, whether virtual or physical.
- A unified InfoSphere MDM engine would support implementations with virtual, physical and hybrid MDM styles, with high performance
How Agile MDM for Big Data Is Designed to Work:
- InfoSphere Data Explorer provides the capabilities to extend MDM across unstructured big data with federated views, and visualization of the complete master record.
Relevant Story:
- In our preliminary testing, we’re seeing 21,000 customer-centric transactions processed per second when InfoSphere MDM works with DB2 pureScale
- That’s twice the transaction rate of Oracle MDM on Exalogic/Exadata using ½ the number of cores
Message in a Nutshell: Agile MDM will be flexible and fast enough for big data environments.
Key Points
Through hundreds of client implementations, briefings and consultations – we’ve determined a common set of big data use cases
Each of the use cases requires different big data technology
Each of the use cases requires a different set of governance capabilities and a different level of appropriate governance
Take big data exploration, for example. This use case is all about ingesting big data quickly or discovering it in its source systems, determining its relative value, experimenting with big data, and utilizing it. From an IIG perspective – it’s critical that you be able to discover and determine the confidence of the data. That’s not to say it should be improved or governed yet while you’re exploring. It’s focused on understanding your confidence level in the data to determine if you trust the outcomes, or whether the data needs to be improved before it’s analyzed.
Enhanced 360° View – this use case is about truly knowing everything about master entities such as the customer. In order to find big data for the customer, you first need to establish the unique customer record – and that’s where MDM along with data quality and integration play a role.
Security and Intelligence Extension – this use case is about monitoring data – log data, network data – to prevent data loss, threats, fraud, among other things. IIG helps by providing automatic protection of sensitive data, masking it, and also aiding in the detection of fraudulent individuals and networks.
Operations Analysis – this use case is all about analyzing operational data – from machines and networks – either streaming information or data at rest. It requires high volume data integration to move and integrate data among the zones.
DW Augmentation – this use case is focused on augmenting the DW – sometimes that means archiving data from the DW but still being able to access and analyze it, sometimes it includes complementing the DW with unstructured data and unconventional sources. IIG helps by providing high volume data integration to and from the DW, as well as archiving capabilities to track the lifecycle of data.
Key Points
The use case is about joining the power of MDM with the power of big data to truly know everything about your customer.
MDM manages big data volumes for structured master data – matching, consolidating, and providing master data as a service.
Data Explorer extends that view by finding and displaying all available big data related to that customer record.
The capabilities you need for a true 360° view include:
Combining structured and unstructured master data – join master records with unstructured content in one view
Onboard new data sources – keep them as separate but linked records to enable a complete view, and as your confidence level with those uncertain sources rises, merge them into a single golden record. Hybrid MDM is the ability to act as a virtual/registry style approach for some systems while acting as a transaction hub with a single physical record for other systems. It enables organizations to onboard big data systems as ‘virtual records’ rapidly and consolidate to the physical record over time.
Finding master entities within big data sources – the ability to match data at big data volumes as well as identifying master records in new big data sources.
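The registry-then-consolidate idea described above can be sketched as follows: low-confidence sources stay as linked records, and they are folded into the golden record only once confidence passes a threshold. This is a simplified illustration under our own assumptions. The class, method names, and the 0.9 threshold are hypothetical and not InfoSphere MDM behaviour.

```python
from dataclasses import dataclass, field

MERGE_THRESHOLD = 0.9  # hypothetical confidence cut-off for consolidation


@dataclass
class MasterEntity:
    entity_id: str
    golden: dict = field(default_factory=dict)  # consolidated physical record
    linked: list = field(default_factory=list)  # registry-style linked records

    def _merge(self, record: dict) -> None:
        # Only fill attributes that are not already mastered.
        for key, value in record.items():
            self.golden.setdefault(key, value)

    def onboard(self, source: str, record: dict, confidence: float) -> None:
        """Merge trusted sources; keep uncertain ones as linked records."""
        if confidence >= MERGE_THRESHOLD:
            self._merge(record)
        else:
            self.linked.append(
                {"source": source, "record": record, "confidence": confidence}
            )

    def promote(self, source: str, confidence: float) -> None:
        """Re-evaluate a source's linked records at a new confidence level."""
        still_linked = []
        for item in self.linked:
            if item["source"] == source and confidence >= MERGE_THRESHOLD:
                self._merge(item["record"])
            else:
                still_linked.append(item)
        self.linked = still_linked


customer = MasterEntity("C1")
customer.onboard("crm", {"name": "Ann Lee"}, confidence=0.99)
customer.onboard("social", {"handle": "@ann"}, confidence=0.40)
customer.promote("social", confidence=0.95)  # confidence has risen over time
print(customer.golden)  # {'name': 'Ann Lee', 'handle': '@ann'}
```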
Catchy Statement
Many software categories have proclaimed victory in the holy grail that is the “360° view” but each has fallen short by only offering a piece of that view. Finally, this is a solution that delivers on that promise.
Key Points
This use case is about augmenting the DW with the power of new big data technologies
In order to do that effectively, you also need IIG capabilities, such as
Self-service integration – the ability for business users, or data scientists and analytic professionals who work in the LOB, to access and integrate data on demand
Understand context – to view the context of what data is available in the DW, what is available to augment the DW, and how it is related. Also the ability to have a business glossary of terms, of very industry-specific terms, to ensure everyone is utilizing the correct terminology
Mask sensitive data to ensure privacy
Protect and monitor data within the DW to prevent data loss/breaches.
Store and analyze archive files on Hadoop – manage the lifecycle of data and the compliance requirements for archiving and disposal of data.
Key Points:
- The innovations just keep coming, and many of them can increase user confidence in big data.
- IBM has had a drumbeat of important big data-related announcements in the last several months
- A network of partners extends our reach and extends functionality by building applications on top of our platform.
Relevant Story: A few business partners who are expanding our solutions are here today.
- Kingland’s 360 Data Enterprise Hub builds on our MDM capabilities with unique capabilities for financial markets and banking.
- Stream Integration offers Product Information Monitor, to help clients deliver new products to market with confidence and understanding of the impression that product will have before it’s even released. It monitors data quality, policies and lineage, and also gathers market sentiment from external sources.
- InfoTrellis Customer ConnectID helps clients to leverage big data to improve customer service and increase share of wallet
Message in a Nutshell: IBM and IBM partners keep innovating to bring more value to clients.
Key Points:
BLU Acceleration is a combination of innovations from IBM® Research and Development Labs that dramatically simplify and speed reporting and analytics. Ten labs around the world have filed more than 25 patents over the years of developing these new technologies.
The result of these innovations, as demonstrated by our early adopter clients and partners, is a performance boost of 8-25 times as compared to a traditional relational database approach, and data compression of 10 times as compared to uncompressed tables. We have even seen examples of 1200x faster analytic query performance.
With Dynamic In-memory capabilities, BLU Acceleration is memory optimized, but not memory constrained. This means it can deliver the performance of in-memory columnar processing without the cost or limitations of in-memory only systems. BLU Acceleration does not require all data to fit in memory in order to achieve breakthrough performance. The system has the efficiency and intelligence of keeping the most relevant data in memory to maximize performance – optimizing both system memory and CPU memory (known as cache). This means, as data volumes grow, clients do not need to continuously buy expensive memory.
The patented encoding technology of Actionable Compression preserves the order of the data, enabling compressed data in BLU tables to be used without decompressing it. As a result of the very high levels of actionable compression and elimination of indexes and aggregates, BLU Acceleration significantly reduces the need for storage. These storage savings result in cost saving on multiple fronts: e.g., hardware, power, and maintenance.
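One way to see why order-preserving encoding allows work on compressed data is a dictionary encoding whose codes are assigned in sorted order: comparing codes then gives the same answer as comparing the original values. The sketch below illustrates that general idea only. It is not IBM's patented Actionable Compression scheme, and the function names are our own.

```python
import bisect


def build_dictionary(values):
    """Sorted distinct values; a value's position in the list is its code."""
    return sorted(set(values))


def encode(values, dictionary):
    """Replace each value with its (order-preserving) integer code."""
    return [bisect.bisect_left(dictionary, v) for v in values]


def count_at_least(codes, dictionary, threshold):
    """Count rows with value >= threshold by comparing codes only."""
    threshold_code = bisect.bisect_left(dictionary, threshold)
    return sum(1 for code in codes if code >= threshold_code)


years = [2012, 2009, 2010, 2012, 2008]
dictionary = build_dictionary(years)  # [2008, 2009, 2010, 2012]
codes = encode(years, dictionary)     # [3, 1, 2, 3, 0]

# The predicate "year >= 2010" is evaluated without decoding any row.
print(count_at_least(codes, dictionary, 2010))  # 3
```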
BLU Acceleration is designed to take full advantage of the latest innovations in microprocessor advancements. With SIMD processing (Single Instruction Multiple Data), BLU Acceleration can apply a single instruction to many data elements simultaneously, for faster data processing. BLU Acceleration is also designed to take advantage of multiple cores for maximum core utilization.
BLU Acceleration automatically detects large sections of data that don’t qualify for a query – and skips the unnecessary processing of this irrelevant data. E.g. skipping all the records prior to 2010 for a question about data from 2010 to the present.
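The 2010 example can be sketched with a small min/max summary per block of rows: a block whose maximum year is below the query's start year is skipped without reading its rows. This is a generic illustration of data skipping under our own assumptions (block size, row layout, function names), not the BLU Acceleration internals.

```python
def build_blocks(rows, block_size):
    """Split rows into blocks and record each block's min and max year."""
    blocks = []
    for start in range(0, len(rows), block_size):
        chunk = rows[start:start + block_size]
        years = [row["year"] for row in chunk]
        blocks.append({"min": min(years), "max": max(years), "rows": chunk})
    return blocks


def scan_from_year(blocks, start_year):
    """Return matching rows and how many blocks actually had to be read."""
    matches, blocks_read = [], 0
    for block in blocks:
        if block["max"] < start_year:
            continue  # the whole block predates the query range: skip it
        blocks_read += 1
        matches.extend(row for row in block["rows"] if row["year"] >= start_year)
    return matches, blocks_read


rows = [{"year": year} for year in range(2005, 2013)]  # 2005 through 2012
blocks = build_blocks(rows, block_size=2)               # four blocks of two rows
matches, blocks_read = scan_from_year(blocks, 2010)

print(len(matches), blocks_read)  # 3 2
```

Here only two of the four blocks are read, because the first two contain nothing from 2010 onward.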
Relevant Story:
What makes these results even more remarkable is the simplicity of BLU Acceleration. Easy to set up and self optimizing, BLU Acceleration eliminates the need for indexes, aggregates, or time consuming database tuning to achieve top performance and storage efficiency. BLU Acceleration is delivered as multi-platform software with flexibility to deploy on existing infrastructure to reduce cost and risk.
Message in a Nutshell:
Speed - Lightning-fast analytics and reporting
Simplicity - Easy to set up, use and maintain
Affordability - Efficient use of resources for dramatic cost savings
Key Points:
The result of these innovations, as demonstrated by our early adopter clients and partners, is a performance boost of 8-25 times as compared to a traditional relational database approach, and data compression of 10 times as compared to uncompressed tables. We have even seen examples of 1200x faster analytic query performance.
By providing analytical insights at lightning speed, BLU Acceleration can fulfill the promise of “speed of thought” analytics—where the system can answer questions almost as rapidly as the user can think to ask them. Faster answers can unlock insights that lead to more satisfied and loyal customers, more revenue, more cost efficient operations, lower business risk, or a combination of these that unlock new business opportunities.
With BLU, clients can analyze more data faster and more efficiently than ever before to uncover insights for growing revenue and for reducing cost or risk. They can get more value from the IT budget by reducing labor, storage and system resources required for high performance reporting and analytics.
Relevant Story:
A large credit card processing company in Europe started a Proof of Concept (POC) with BLU Acceleration, and within minutes uncovered tens of thousands of Euros in fraudulent transactions! They asked IBM not to turn the POC off!
Message in a Nutshell:
With BLU Acceleration, clients across the board are seeing orders of magnitude improvement in performance, massive reduction in storage requirements, and they are doing all that while reducing complexity and time-to-value!
Key Points:
- Taking advantage of the big data opportunity means gaining new insights and putting them to work.
- Before you rely on new insights, you need confidence in the underlying data.
- With existing IIG capabilities, with today’s important new announcements, and with things to come from IBM and IBM partners, we are delivering what’s needed
for organizations to build confidence so they can act on new insights based on big data.
Message in a Nutshell: InfoSphere IIG builds confidence in big data.
Key Points
Confidence is iterative. Varying amounts of IIG are required for each big data use case. It’s only with agile governance that you can apply the appropriate level of governance to be successful.
I’ll leave you with a final question – are you confident in your data?
You definitely need to answer that question before you begin your big data journey.