Businesses often have to interact with different data sources to get a unified view of the business or to resolve discrepancies. These EDW data repositories are often large and complex, are business critical, and cannot afford downtime. This session shares best practices and lessons learned for building a Data Fabric on Spark / Hadoop / Hive / NoSQL that provides a unified view, enables simplified access to the data repositories, resolves technical challenges, and adds business value.
From Insights to Value - Building a Modern Logical Data Lake to Drive User Adoption and Business Value
1. From Insights to Value
Building a Modern Logical Data Lake To Drive User Adoption and Business Value
Vineet Tyagi / Impetus
2. We Make Big Data Work
We have been supporting several Fortune 500 customers on their Big Data journey for the last 10 years.
Across the board we have seen
• Fast changing analytic and reporting requirements
• Lack of end user self service capabilities
• Need for better collaboration and agility in working with trusted data
• Data sits in silos, making it difficult to get closer to customers; additionally, data from traditional sources still remains important
5. The new normal for enterprise IT
EDW + BDW (Lake/s) == Unified Enterprise Data
6. Challenge 1: Providing a complete, seamless view of business data
Making insights and data in the lake readily discoverable, accessible and usable
“Visual data discovery, an important enabler of end-user self-service”
7. Challenge 2: Simplified & self-serve enablement of BI
• Provision cluster
• Discover and blend new sources
• Data access and exploration
• Ingest and transform data
• Security and governance
• BI, analytics and models
8. Challenge 3: Support use-case-driven data access mechanisms
• SQL: specific query & reporting
• OLAP: cross-dimensional fast slice/dice and drill-down
• Data virtualization: data from MPP, relational and Hadoop sources
• Search: finding the “needle in a haystack”
• Self-service data discovery: “don’t know what you don’t know”
9. Challenge 4 : Leverage EDW and BDW coexistence
Optimizing the placement of enterprise workloads and the data on which they operate
The multi-platform environment is the warehouse
• Frees capacity on high-end analytic and data warehouse systems
• Immediate ROI on Hadoop
• Get a platform better suited to advanced analytics
10. Challenge 5 : Collaboration & Reuse
Collaboration & Reuse of data and analytical assets on a logical data lake
Data Democratization
• Lowering adoption barriers for your stakeholders
• Getting the data they want should be fast and easy
Analytical Democratization
• Publishing and Discovery of analytical assets
• Ability to reuse them in a simple and consistent way
12. Logical Data Lake: Modern Analytical Data Fabric
[Architecture diagram] Sources (structured, unstructured, external social, machine, geospatial, time series, streaming, and traditional data repositories such as RDBMS and MPP) feed a landing-and-ingestion layer into the Enterprise Data Lake. Data federation/virtualization, exploration & discovery, and data wrangling sit on top and serve real-time applications. Cross-cutting layers provide enterprise metadata management, accelerators, and provisioning, workflow, monitoring and governance.
13. Providing a complete seamless view of business data
• Simple, consistent view of meta information
• Automated sourcing and seeding
• Social and usage based enrichment
• Not ONLY a data catalogue
• Analytical asset catalogue
• Leverage and supplement existing business ontologies
14. Leverage EDW and BDW coexistence
Optimizing the placement of enterprise workloads and the data on which they operate
“Right Positioning” of workloads based on price / performance
• Most bang for the buck
Build a platform better suited to advanced analytics with Big Data technologies
• Retaining what works “in situ”
16. Architectural Patterns
Architectural Patterns: Streaming
Pattern 1: Streaming ingestion
Pattern 2: Near real-time event processing with external context
Pattern 3: Near real-time partitioned event processing with external context
Pattern 4: Complex topology for aggregations or machine learning
Architectural Patterns: Batch + Streaming
Pattern 5: The Lambda Architecture: Hadoop and Storm
Pattern 6: Merging batch and streaming: Kappa, a post-Lambda architecture
Pattern 7: Unified batch and stream processing: Flink or Spark
17. Pattern 1: Streaming Ingestion
Use Case Scenarios
1. Efficiently collecting, aggregating, and moving large amounts of streaming data into a Hadoop cluster
2. Emphasis on low-latency persisting of events to
• HDFS
• Apache HBase
• Apache Solr
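The buffer-and-flush behaviour this pattern relies on can be sketched in a few lines of plain Python. This is an illustrative stand-in, not Flume or Spark code; `StreamingIngester` and the list-based sink are invented for the example, and a real sink would be an HDFS, HBase or Solr writer:

```python
# Hypothetical sketch of low-latency streaming ingestion: events are
# buffered and flushed to a sink (standing in for HDFS/HBase/Solr)
# whenever the buffer fills. Names are illustrative.

class StreamingIngester:
    def __init__(self, sink, batch_size=3):
        self.sink = sink              # in practice: an HDFS/HBase/Solr writer
        self.batch_size = batch_size
        self.buffer = []

    def on_event(self, event):
        self.buffer.append(event)
        if len(self.buffer) >= self.batch_size:
            self.flush()

    def flush(self):
        if self.buffer:
            self.sink.append(list(self.buffer))  # one write per micro-batch
            self.buffer.clear()

sink = []                             # stands in for the Hadoop cluster
ingester = StreamingIngester(sink, batch_size=3)
for e in ["e1", "e2", "e3", "e4"]:
    ingester.on_event(e)
ingester.flush()                      # drain the tail on shutdown
# sink now holds two micro-batches: ["e1", "e2", "e3"] and ["e4"]
```

Batching the writes is what keeps per-event overhead low while still persisting events within a small latency budget.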
18. Pattern 2: Near real-time event processing
Use case scenarios:
• Alerting, flagging, transforming and filtering of events as they arrive
• Taking immediate decisions to transform the data or take some external action
• The decision often depends on an external profile or metadata
• The user code can interact with local memory, a distributed cache, or an external storage system such as HBase
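The enrich-then-decide step can be sketched in plain Python. The `profile_cache` dictionary stands in for a local or distributed cache of external profile metadata (in production this could be backed by HBase); the event shapes and the `process` function are invented for the example:

```python
# Sketch of near-real-time event processing with external context:
# each event is enriched from a local profile cache and either flagged
# or passed through. The cache stands in for HBase / a distributed cache.

profile_cache = {                     # local copy of external profile metadata
    "u1": {"tier": "gold"},
    "u2": {"tier": "blocked"},
}

def process(event):
    """Alert, flag, transform or filter an event as it arrives."""
    profile = profile_cache.get(event["user"], {})
    if profile.get("tier") == "blocked":
        # Immediate decision: raise an alert instead of forwarding the event.
        return {"action": "alert", "user": event["user"]}
    return {"action": "pass", "user": event["user"],
            "tier": profile.get("tier", "unknown")}

results = [process(e) for e in [{"user": "u1"}, {"user": "u2"}, {"user": "u3"}]]
```

The key property is that the decision is made per event, as it arrives, using context that is already close to the processing code.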
19. Pattern 3: Near real-time processing with partitioned external context
Use case scenarios:
• The external context information required for event processing doesn’t fit in local memory
• Calling an external system such as HBase does not meet the SLA requirements
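The core idea, routing events by key so each worker only needs its shard of the context, can be sketched in plain Python. The shard layout and function names are invented for illustration; a streaming engine would do the routing for you via keyed partitioning:

```python
# Sketch of partitioned near-real-time processing: when the full context
# table does not fit in one worker's memory, events are routed by key so
# each worker holds only its shard of the context.
import zlib

NUM_PARTITIONS = 4

def partition_for(key):
    # Deterministic routing: the same key always lands on the same
    # partition, so that partition's context shard is guaranteed to cover it.
    return zlib.crc32(key.encode()) % NUM_PARTITIONS

# Each partition holds only its slice of the external context.
context_shards = [dict() for _ in range(NUM_PARTITIONS)]
for user, tier in {"u1": "gold", "u2": "silver", "u3": "bronze"}.items():
    context_shards[partition_for(user)][user] = tier

def process(event):
    shard = context_shards[partition_for(event["user"])]
    return {"user": event["user"], "tier": shard.get(event["user"], "unknown")}

out = [process({"user": u}) for u in ["u1", "u2", "u3"]]
```

Because routing and shard assignment use the same hash, every lookup stays in local memory and no per-event call to an external store is needed.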
20. Pattern 4: Complex topology for aggregations or ML
Use case scenarios:
• Real-time data flows through a complex and flexible set of operations
• Complex operations such as counts, averages and sessionization
• Results often depend on windowed computations or require more active data
• Focus shifts from ultra-low latency to functionality and accuracy
• Machine-learning model building that operates on batches of data
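A windowed computation of the kind this pattern runs can be sketched in plain Python: events carry a timestamp and are counted per 10-second tumbling window. The window size, event shape and `window_counts` helper are illustrative, not a specific engine's API:

```python
# Sketch of a tumbling-window aggregation: count events per key per
# 10-second window, the kind of result that feeds dashboards or ML.
from collections import defaultdict

WINDOW_MS = 10_000

def window_counts(events):
    counts = defaultdict(int)
    for ts, key in events:
        window_start = (ts // WINDOW_MS) * WINDOW_MS  # align to window boundary
        counts[(window_start, key)] += 1
    return dict(counts)

events = [(1_000, "click"), (4_000, "click"), (12_000, "click"), (13_000, "view")]
counts = window_counts(events)
# {(0, 'click'): 2, (10000, 'click'): 1, (10000, 'view'): 1}
```

In a real topology the engine maintains such windows incrementally over the stream rather than over a finished list, but the aggregation logic is the same.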
21. Pattern 5: Lambda Architecture
Batch layer
1. Stores the master data set
2. Computes batch views (Hadoop, Solr cluster)
Speed layer
1. Compensates for recently updated data
2. Does fast incremental computations on the newly arrived data
3. The batch layer eventually overwrites the speed layer
4. Provides random reads and random writes (Storm, Flume etc.)
Serving layer
1. Random access to batch views
2. Bulk updates from the batch layer
3. No random writes (HBase, Solr etc.)
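The query-time interplay between the layers can be sketched in plain Python: the batch view holds counts over the master data set, the speed view compensates for events that arrived after the last batch run, and a query merges the two. The dictionaries and `query` function are invented stand-ins for the real stores:

```python
# Sketch of the Lambda serving path: batch view + speed view are merged
# at query time; the next batch run absorbs the recent events and the
# speed view's contribution is reset.

batch_view = {"clicks": 100}   # computed by the batch layer (Hadoop)
speed_view = {"clicks": 3}     # incremental counts on new data (Storm etc.)

def query(key):
    # The serving layer stitches both views into one complete answer.
    return batch_view.get(key, 0) + speed_view.get(key, 0)

before = query("clicks")       # 100 batch + 3 recent = 103

# Next batch run: the batch layer recomputes over ALL data, overwriting
# what the speed layer had been compensating for.
batch_view["clicks"] = 103
speed_view["clicks"] = 0
after = query("clicks")        # still 103: the handoff is seamless
```

The sketch also shows the cost the next slide criticises: the same counting logic effectively exists twice, once per layer, and must stay in sync.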
22. Pattern 6: Kappa Architecture
Problems with the Lambda architecture:
• You implement your transformation logic twice, once in the batch system and again in the stream processing system; the two need to be kept in sync to give the right result.
• You stitch together results from both the batch views and the real-time views to produce a complete answer.
The Kappa architecture switches to a canonical data store that is an append-only immutable log, instead of a relational DB like SQL or a key-value store like Cassandra. The serving layer uses the data streamed through the computational system and stored in auxiliary stores for serving.
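The essential Kappa move, an append-only log as the canonical store so a view is just "the log replayed through the current code", can be sketched in plain Python. The `append`/`build_view` helpers and the two counting versions are invented for the example; in practice the log would be something like Kafka:

```python
# Sketch of the Kappa idea: views are rebuilt by replaying the immutable
# log through processing code. Changing the logic means replaying with a
# new version, not maintaining a second batch code path.

log = []                                   # append-only, immutable log

def append(event):
    log.append(event)                      # no updates, no deletes

def build_view(process):
    """Replay the whole log through `process` to (re)build a serving view."""
    view = {}
    for event in log:
        process(view, event)
    return view

def count_v1(view, event):                 # version 1: count all events
    view[event["user"]] = view.get(event["user"], 0) + 1

def count_v2(view, event):                 # version 2: count purchases only
    if event["type"] == "purchase":
        view[event["user"]] = view.get(event["user"], 0) + 1

for e in [{"user": "u1", "type": "view"},
          {"user": "u1", "type": "purchase"},
          {"user": "u2", "type": "purchase"}]:
    append(e)

v1 = build_view(count_v1)   # {'u1': 2, 'u2': 1}
v2 = build_view(count_v2)   # {'u1': 1, 'u2': 1}  -- same log, new code
```

One code path serves both "batch" (full replay) and "streaming" (processing new appends), which is exactly what removes the Lambda duplication.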
23. Pattern 7: Unified Batch and Streaming Framework
1. Flink
+ Distributed stream and batch data processing
+ Distributed computations over data streams
+ High throughput and low latency
+ Exactly-once guarantee
+ Batch processing applications as special cases of stream processing
2. Spark
+ Fast
+ Low latency
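The unifying principle ("batch is a special case of streaming") can be sketched in plain Python: one transformation, written once, is applied both to a bounded collection and to an unbounded iterator. The `transform` function and record shapes are invented for illustration, not the Flink or Spark API:

```python
# Sketch of unified batch/stream processing: the same generator-based
# transformation consumes a bounded list (batch) or an unbounded
# iterator (stream) without any code change.

def transform(records):
    """Shared logic: keep error records and normalise their message."""
    for r in records:
        if r["level"] == "ERROR":
            yield {"level": "ERROR", "msg": r["msg"].strip()}

# Batch: a bounded, already-materialised input.
batch = [{"level": "INFO", "msg": "ok"}, {"level": "ERROR", "msg": " boom "}]
batch_out = list(transform(batch))

# Stream: the same logic over an iterator that would be unbounded in
# real life (here it is finite so the example terminates).
def stream():
    yield {"level": "ERROR", "msg": " late "}
    yield {"level": "INFO", "msg": "ok"}

stream_out = list(transform(stream()))     # identical code path
```

Writing the logic once against a record-at-a-time interface is what lets a unified engine run it over either input shape.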
24. Time-range-based technology choices
• ~50 ms: Flume Interceptors, Custom
• >500 ms: Spark Streaming, Storm Trident, Impala
• >30,000 ms: Spark, Tez, Impala
• >90,000 ms: MapReduce, Spark, Tez, Impala
25. Logical Data Lake: Modern Analytical Data Fabric
[Architecture diagram, repeated from slide 12, highlighting three capabilities on the fabric: metadata & discovery, data blending, and workload migration.]