1. © 2014
The Emerging Data Lake IT Strategy
An Evolving Approach for Dealing with Big Data & Changing Environments
SPEAKERS:
Thomas Kelly, Practice Director
Cognizant Technology Solutions
Sean Martin, Founder and CTO
Cambridge Semantics
bit.ly/DataLake
2.
We’re living in an amazing world of information sharing,
connecting with family, neighbors, vendors, and customers
all over the world
3.
Telling the world
about what we like
and don’t like
#HIMYMfinale
@MLB
… is now following Cognizant Technology Solutions
and Cambridge Semantics
5.
We’re deciding what advertising we want to see…
… and what we don’t
Unsubscribe
Influencing
how business
and customers
engage
6.
Many businesses have emerged that embrace this model of
customer engagement,
and we’ve said goodbye to businesses that didn’t
• 10 million stays in 2013, without owning a hotel
• Grew to nearly $75B in annual retail revenue in 2013, without opening a storefront
• Shares over 40 million photos each day
7.
Retail
Engaging in a more
personalized shopping
experience, retailers are
building a stronger
relationship with each
customer
9.
Life Sciences and Healthcare
Combining health, genetic,
clinical, and public sciences
data to bring effective
therapies to patients sooner
10.
Financial Services
Delivering innovative
products and services,
based on a 360° view of
the Customer, across all
business lines, engaging
all available data assets,
internal and external
11.
The Challenges That We're Addressing
Onboarding and Integrating Data is Slow and Expensive
• Transforming data from a growing variety of technologies
• Custom coded ETL
• Existing ETL processes are not reusable
• Optimization for analytics is time-consuming and costly
• Onboarding often waits until there is a defined need for a set of data, delaying benefits realization
Data Provenance is Often Poorly Recorded
• Data meaning is “lost in translation”
• Data transformations tracked in spreadsheets
• Post-onboarding, maintenance and analysis cost for onboarded data is high
• Recreating data lineage is manual, time-consuming, and error-prone
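As one alternative to tracking transformations in spreadsheets, lineage can be recorded at the moment each transformation runs. The following is a minimal sketch under stated assumptions: the `tracked` decorator, the `lineage` helper, and the data set names (`crm.accounts`, `lake.accounts`) are hypothetical and are not taken from any product discussed here.

```python
# Sketch: record lineage automatically as each transformation runs,
# instead of in a separate spreadsheet. All names are hypothetical.
from datetime import datetime, timezone

provenance_log = []


def tracked(step_name, source, target):
    """Decorator that logs which step produced `target` from `source`."""
    def wrap(func):
        def run(rows):
            result = func(rows)
            provenance_log.append({
                "step": step_name,
                "source": source,
                "target": target,
                "rows_in": len(rows),
                "rows_out": len(result),
                "ran_at": datetime.now(timezone.utc).isoformat(),
            })
            return result
        return run
    return wrap


@tracked("drop_test_accounts", source="crm.accounts", target="lake.accounts")
def drop_test_accounts(rows):
    # Example transformation: remove records created for testing.
    return [r for r in rows if not r["name"].startswith("TEST")]


def lineage(target):
    """Answer 'where did this data set come from?' from the log."""
    return [(e["source"], e["step"]) for e in provenance_log if e["target"] == target]


clean = drop_test_accounts([{"name": "Acme"}, {"name": "TEST-1"}])
print(lineage("lake.accounts"))
```

Because the log entry is written by the same call that performs the transformation, the lineage record cannot drift out of date the way a manually maintained spreadsheet can.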
12.
The Challenges That We're Addressing
Target Data is Difficult to Consume
• Optimization favors known analytics and is not well suited to new requirements
• A one-size-fits-all canonical view is used rather than fit-for-purpose views
• Or, the target data lacks a conceptual model that would make it easy to consume
• Difficult to identify what data is available, how to get access, and how to
integrate the data to answer a question
Industrializing the Big Data Environment is Difficult to Manage
• Proliferation of data silos leads to inconsistency/syncing issues
• Conflicting objectives of opening access to data assets while managing
security and privacy requirements
• Velocity of business change rapidly invalidates data organization and analytics
optimizations
• Managing the integration/interaction with the multiple data management
technologies that make up the Big Data environment
13.
The Data Lake is made up of four key components
• Data Ingestion
• Data Lake Management
• Data Management
• Query Management
Delivering
• Low Cost, High Performance Storage
• Flexible, Easy-to-Use Data Organization
• Performance-Optimized Analytics
• Automation of most manual Development and Query Activities
• Self-Service End-User Features
• Intelligent Processing
14.
Data Ingestion
[Diagram of the four Data Lake components and their data sources]
Data Sources: Linked Data, Internet of Things (IoT), Desktop and Mobile, Operational Systems, Social Media and Cloud
Data Ingestion: On-Demand Query, Streaming, Semantic Tagging, Scheduled Batch Load, Model-Driven, Self-Service
15.
Data Management
[Diagram of the four Data Lake components and their data sources]
Data Sources: Linked Data, Internet of Things (IoT), Desktop and Mobile, Operational Systems, Social Media and Cloud
Data Ingestion: On-Demand Query, Streaming, Semantic Tagging, Scheduled Batch Load, Model-Driven, Self-Service
Data Management: Semantic Graph, Columnar, In Memory, NoSQL, Map Reduce, Provenance, Data Movement
HDFS Storage: Structured and Unstructured Data
16.
Data Lake Management
[Diagram of the four Data Lake components and their data sources]
Data Sources: Linked Data, Internet of Things (IoT), Desktop and Mobile, Operational Systems, Social Media and Cloud
Data Ingestion: On-Demand Query, Streaming, Semantic Tagging, Scheduled Batch Load, Model-Driven, Self-Service
Data Management: Semantic Graph, Columnar, In Memory, NoSQL, Map Reduce, Provenance, Data Movement
HDFS Storage: Structured and Unstructured Data
Data Lake Management: Data Assets Catalog, Workflow, Models, Access Management
Data Assets Catalog
• Internal and External Data Assets
• Defined Data Orgs (ontologies, taxonomies, thesauri)
Data Mappings
• Source-to-Target
• Transformations
Workflow
• Processes
• Schedules
• Provenance Capture
Access Management
• Authorization and Access Rules
• Rule-based Security
• Group, Role, and User Level Authorization
• Auditable Access
Business-Focused
• Business Unit Data Organization and Terms
• Optimized to Assist Analytics
Data Governance
• Focus on Shared Data
• Standard Models
• Controlled Vocabulary
• Common Definitions
• Standards-based Data Views (FIBO, CDISC/RDF)
Monitoring
• Monitor and Manage Data Lake Operations
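The Access Management items on this slide (rule-based security, group/role/user level authorization, auditable access) can be sketched in a few lines. This is only an illustration under stated assumptions: the `User` and `AccessManager` classes, the rule format, and the asset names are hypothetical, and a real deployment would rely on the security features of the underlying platform.

```python
# Sketch of rule-based access checks with an audit trail.
# Rule format, class names, and asset names are assumptions for illustration.
from dataclasses import dataclass, field
from typing import List, Set


@dataclass(frozen=True)
class User:
    name: str
    roles: frozenset = frozenset()
    groups: frozenset = frozenset()


@dataclass
class AccessManager:
    # Each rule is (level, principal, asset), with level in {"user", "role", "group"}.
    rules: Set[tuple] = field(default_factory=set)
    audit_log: List[tuple] = field(default_factory=list)

    def grant(self, level, principal, asset):
        self.rules.add((level, principal, asset))

    def is_allowed(self, user, asset):
        allowed = (
            ("user", user.name, asset) in self.rules
            or any(("role", r, asset) in self.rules for r in user.roles)
            or any(("group", g, asset) in self.rules for g in user.groups)
        )
        # Every decision is recorded so that access is auditable.
        self.audit_log.append((user.name, asset, allowed))
        return allowed


am = AccessManager()
am.grant("role", "analyst", "sales_orders")
am.grant("user", "dana", "clinical_trials")

alice = User("alice", roles=frozenset({"analyst"}))
print(am.is_allowed(alice, "sales_orders"))     # granted through the analyst role
print(am.is_allowed(alice, "clinical_trials"))  # no matching rule
print(am.audit_log)
```

Keeping the rules as data, rather than as code scattered across applications, is what lets access be opened quickly while still being reviewed after the fact.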
17.
Query Management
[Diagram of the four Data Lake components and their data sources]
Data Sources: Linked Data, Internet of Things (IoT), Desktop and Mobile, Operational Systems, Social Media and Cloud
Data Ingestion: On-Demand Query, Streaming, Semantic Tagging, Scheduled Batch Load, Model-Driven, Self-Service
Data Management: Semantic Graph, Columnar, In Memory, NoSQL, Map Reduce, Provenance, Data Movement
HDFS Storage: Structured and Unstructured Data
Query Management
• Query Data, Metadata, and Provenance
• Capture and Share Analytics Expertise
• Semantic Search
• Analytics Directed to the Best Query Engine
• Data Discovery
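"Analytics Directed to the Best Query Engine" can be pictured as a small routing decision. The sketch below is purely illustrative: the `QueryProfile` traits and the routing rules in `choose_engine` are assumptions made for this example, not the selection logic of any actual product; only the engine names come from the slide.

```python
# Sketch: direct a query to a suitable engine based on simple traits.
# The traits and routing rules are assumptions, not the deck's design.
from dataclasses import dataclass


@dataclass
class QueryProfile:
    traverses_relationships: bool = False  # e.g. following links between entities
    aggregates_many_rows: bool = False     # e.g. sums over very large tables
    needs_low_latency: bool = False        # e.g. interactive dashboards


def choose_engine(q: QueryProfile) -> str:
    """Pick one of the engine types named on the slide."""
    if q.traverses_relationships:
        return "semantic graph"
    if q.needs_low_latency:
        return "in memory"
    if q.aggregates_many_rows:
        return "columnar"
    return "map reduce"  # assumed fallback for large batch work


print(choose_engine(QueryProfile(traverses_relationships=True)))
print(choose_engine(QueryProfile(aggregates_many_rows=True)))
```

The point of the sketch is that the end user describes the question once, and the choice of engine stays an internal decision of the query management layer.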
18.
Semantic Technology Delivers “Smart” Data
Integrates a network of internal and external data assets,
insulating end users from the details of the underlying
technologies
Captures expertise (logic, inferencing) and integrates it with
the data, delivering “smart” data to non-expert users
Manages a comprehensive inventory of the data assets
Secures access to the right data assets by the right users
19.
Key W3C Standards in Semantic Technology
Resource Description Framework (RDF)
• Framework for storing and integrating data and data definitions in the form of subject-predicate-object expressions, or “triples”. Relationships are organized in a logical graph model.
• Reduced development time and cost; faster time-to-business value.
Web Ontology Language (OWL)
• An ontology is a comprehensive model of data definitions and relationships that is human- and machine-readable. Ontologies are inheritable and extensible.
• Improved application quality, flexible iterative / investigative approach, easily adapts to business change.
SPARQL Query Language
• SQL-like query language for semantic data that can leverage the ontological relationships and constructs to execute smarter queries. Access multiple internal and external databases simultaneously in a single query.
• Access and integrate data across business silos.
Inference
• Reasoning over data through business rules. Expertise is captured and embedded in the ontology model, accessible through user queries. This is the “smart” in Smart Data.
• Easier end user access to expertise; intelligent systems capabilities.
Linked Data
• Connects data contained in different databases, allowing queries to find, share and combine data so insights can be identified across the Web.
• Connect disparate databases to navigate and integrate data regardless of location or technology platform.
RDB to RDF Mapping Language (R2RML)
• Preserving current investments in relational technology, R2RML maps relational data to an ontology. SPARQL can query RDF and relational databases simultaneously.
• Low cost of entry to use Semantic Technology to deliver high-value solutions
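The triple and inference ideas on this slide can be illustrated without any semantic tooling. The following is a toy sketch in plain Python, not RDF, OWL, or SPARQL themselves: the `match` and `infer_types` helpers and every `ex:` term are hypothetical names chosen for the example, and only one inference rule (subclass membership) is applied.

```python
# Toy illustration of subject-predicate-object triples, pattern queries,
# and a single inference rule. Plain Python only; not a real RDF store.
from typing import Set, Tuple

Triple = Tuple[str, str, str]

# Example data: all identifiers are made up for this sketch.
triples: Set[Triple] = {
    ("ex:Order42", "rdf:type", "ex:Order"),
    ("ex:Order42", "ex:placedBy", "ex:Alice"),
    ("ex:Alice", "rdf:type", "ex:PremiumCustomer"),
    ("ex:PremiumCustomer", "rdfs:subClassOf", "ex:Customer"),
    ("ex:Customer", "rdfs:subClassOf", "ex:Party"),
}


def match(data, s=None, p=None, o=None):
    """Return the triples matching a pattern; None acts as a wildcard."""
    return sorted(
        t for t in data
        if (s is None or t[0] == s)
        and (p is None or t[1] == p)
        and (o is None or t[2] == o)
    )


def infer_types(data):
    """Apply one rule until nothing new is derived:
    if X has type C and C is a subclass of D, then X has type D."""
    inferred = set(data)
    changed = True
    while changed:
        changed = False
        for (x, _, c) in match(inferred, p="rdf:type"):
            for (_, _, d) in match(inferred, s=c, p="rdfs:subClassOf"):
                new = (x, "rdf:type", d)
                if new not in inferred:
                    inferred.add(new)
                    changed = True
    return inferred


# Without inference, a query for customers finds nothing,
# because Alice is only recorded as a PremiumCustomer.
print(match(triples, p="rdf:type", o="ex:Customer"))

# With inference, the subclass knowledge in the model is applied to the data.
smart = infer_types(triples)
print(match(smart, p="rdf:type", o="ex:Customer"))
```

The second query succeeds only because the expertise ("a premium customer is a customer") lives in the model rather than in each user's query, which is the sense in which the data is "smart".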
20.
The Common Model is the “Data Glue”
Source Systems
• Lead (SFA system)
• Quote (Quote system)
• Order (OMS system)
• Contract (CMS system)
Common Model (“Data Glue”)
• Different business entities in
physical systems actually share
many of the same concepts,
meanings, and relationships
• Semantic data science exposes
common business concepts and
connects them with their physical
expression in production systems
• Data is “glued” together by its
business meaning, rather than
physical structures dictated by
the underlying technologies
The conceptual model can be directly used by both business and IT users to
operationalize data services, understand the data landscape, track data lineage, and
conduct downstream analytics.
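A small sketch may help show how a common model "glues" records together by business meaning. Everything specific here is an assumption made for illustration: the field names in each source record, the `MAPPINGS` table, and the `to_common_model` and `related` helpers are hypothetical.

```python
# Sketch: relate records from different source systems through a shared
# conceptual model. Field names and mappings are invented for illustration.

# Each source system uses its own physical field names.
lead = {"system": "SFA", "cust_nm": "Acme Corp", "prod_cd": "X100"}
quote = {"system": "Quote", "customer": "Acme Corp", "item": "X100", "price": 950}
order = {"system": "OMS", "buyer_name": "Acme Corp", "sku": "X100", "qty": 3}

# Source-to-target mappings: physical field -> common-model concept.
MAPPINGS = {
    "SFA": {"cust_nm": "Customer", "prod_cd": "Product"},
    "Quote": {"customer": "Customer", "item": "Product", "price": "Price"},
    "OMS": {"buyer_name": "Customer", "sku": "Product", "qty": "Quantity"},
}


def to_common_model(record):
    """Re-express a source record in common-model terms."""
    mapping = MAPPINGS[record["system"]]
    return {concept: record[field] for field, concept in mapping.items()}


def related(records, concept, value):
    """List the source systems whose records share a business concept value."""
    return [
        r["system"] for r in records
        if to_common_model(r).get(concept) == value
    ]


print(to_common_model(order))
print(related([lead, quote, order], "Customer", "Acme Corp"))
```

The three records never share a physical column name, yet they connect through the `Customer` and `Product` concepts, which is the role the slide assigns to the common model.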
21.
Semantic Models Relate Data by Business Meaning
[Diagram of related concepts: Customer, Life Events, Life Style, Preferences, Interests, Music, Purchasing, Personal Network, Entertainment, Profession]
22.
Implications to the Existing IT Architecture and Practices
• User Tools to Discover and Optimize Data Relationships
• Structured and Unstructured Data, Voice, and Video
• Data Analysis Automation
• Extends Existing Investments in IT Architecture
• Manages Secure Access
• Builds Out Enterprise Data Models, with Integration Hub Capabilities
• Self-Service Data Feeds and Analytics
• Infrastructure Capacity Elasticity
• Reduction of Data Mart Silos
• Easier Access to External Data
23.
Data Lake Approach to Meeting Business Needs
For each business need: Traditional Technologies and Practices compared with Data Lake Technologies and Practices
Onboard New Data
• Traditional: Comprehensive analysis creates a rigid structure that is difficult to change, or minimal definition of data organization requires detailed understanding of data contents
• Data Lake: Flexible data model can be revised or extended without redesign of the database
• Data Lake: Agile, evolutionary refinement of the data organization, leveraging new insights as users work with the data
Connect External Data
• Traditional: External data is collected and loaded into the analytics repository. Data is streamed, or is refreshed on a scheduled frequency.
• Data Lake: External data can be sourced from databases, spreadsheets, Web pages, news feeds, and more; data is queried through common methods, without regard to location, with real-time values delivered at query time.
Integrate Data between Business Units or Business Partners
• Traditional: Governance activities establish common vocabulary and data definitions
• Data Lake: In addition, systems of record publish existing data specifications or ontology model; each organization defines data in a manner that is best suited for its business.
• Traditional: Shared data is copied to an integrated database.
• Data Lake: Federation and virtualization features provide choices in which data to copy and which data to retain in the system(s) of record
• Traditional: Organization-specific definitions may require duplicating certain data in marts
• Data Lake: All models can be supported through a single copy of the data, maintained in the data lake or system of record.
Capture and Embed Expertise
• Traditional: Expertise is often captured in the reporting and analytics; this creates a change management challenge when updates are required.
• Data Lake: Expertise is captured in the data definitions; a single, shared definition minimizes change management efforts
24.
Lessons learned from early adopters
• Prioritize: Prioritize data onboarding by the data’s ability to contribute to customer engagement
• Onboard: Onboard data assets as they become available
• Connect: Connect to available internal and external data assets
• Load: Load the data unfiltered/untransformed
• Organize: Use models to provide organization to the data
• Customize: Create models that are tailored to the needs of the business groups
• Search: Make it easy to find data
• Secure: Manage security and privacy, but make it easy to authorize access to data that users need
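The "Load", "Organize", and "Customize" lessons fit together: data is stored as it arrives, and models give it shape only when it is read. The sketch below illustrates that idea under stated assumptions; the record contents, the `MODELS` definitions, and the `read_with_model` helper are hypothetical.

```python
# Sketch: keep raw records exactly as loaded, and apply a model only
# when reading. Record shapes and model definitions are illustrative.
import json

# "Load the data unfiltered/untransformed": store the raw payloads as-is.
raw_store = [
    '{"id": 1, "name": "Acme Corp", "region": "EMEA", "spend": "1200.50"}',
    '{"id": 2, "name": "Globex", "region": "APAC", "spend": "830.00", "tier": "gold"}',
]

# "Create models tailored to the needs of the business groups":
# each model lists the fields it wants and how to interpret them.
MODELS = {
    "finance": {"id": int, "spend": float},
    "marketing": {"name": str, "region": str, "tier": str},
}


def read_with_model(store, model_name):
    """Project raw records through a model at read time."""
    model = MODELS[model_name]
    rows = []
    for line in store:
        record = json.loads(line)
        rows.append({
            # Fields missing from a raw record are reported as None.
            field: cast(record[field]) if field in record else None
            for field, cast in model.items()
        })
    return rows


print(read_with_model(raw_store, "finance"))
print(read_with_model(raw_store, "marketing"))
```

Both business groups read the same single copy of the raw data, and adding a new model later does not require reloading anything.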
25.
Addressing Challenges
- Privacy vs Personal Value
- Granularity of customer understanding
- Delivering strategic objectives when projects tend
to have a technical focus
- Opening access to data
- Need for executive sponsorship
- Access to external data
- Establishing firewalls
- Persistent, pervasive data quality issues
26.
Clues to better customer engagement will be
found in the ever-growing volume of data that
we’re creating
27.
A Data Lake Strategy helps you to create a
personalized, engaging experience with each
customer
• Visibility
• Self-Service
• Smart
• Provenance
• Open, yet Secure
• Internet Scale
• Agile
• Adaptable
• Universal Data Access