Traditional databases and batch ETL operations can no longer keep pace with growing data volumes and the need for fast, continuous data processing.
How can modern enterprises provide their business users real-time access to the most up-to-date and complete data?
In our upcoming webinar, our experts will talk about how real-time CDC improves data availability and fast data processing through incremental updates in the big data lake, without modifying or slowing down source systems. Join this session to learn:
What CDC is and how it impacts the business
The various methods for CDC in the enterprise data warehouse
The key factors to consider while building a next-gen CDC architecture:
• Batch vs. real-time approaches
• Moving from just capturing and storing to capturing, enriching, transforming, and storing
• Moving from stopgap silos to straight-through processing
Implementation of CDC through a live demo and use case
You can view the webinar here - https://www.streamanalytix.com/webinar/planning-your-next-gen-change-data-capture-cdc-architecture-in-2019/
For more information visit - https://www.streamanalytix.com
3. Agenda
What is CDC?
Various methods for CDC in the enterprise data warehouse
Key considerations for implementing a next-gen CDC architecture
Demo
Q&A
4. About Impetus
We exist to create powerful and intelligent enterprises through deep
data awareness, data integration, and data analytics.
5. About Impetus
Many of North America’s most respected and well-known brands trust
us as their strategic big data and analytics partner.
6. End-to-End Big Data Solutions
Transformation: Legacy EDW to big data/cloud
Unification: Data processing, preparation, and access
Analytics: Real-time, machine learning, and AI
Self-service: BI on big data/cloud
7. What are the different change data capture use cases currently deployed
in your organization (choose all that apply)?
Continuous Ingestion in the Data Lake
Capturing streaming data changes
Database migration to cloud
Data preparation for analytics and ML jobs
We still have a legacy system
15. What Does CDC Mean for the Enterprise?
[Diagram: two paths from the same sources (RDBMS, files, data warehouse, legacy). The batch path replicates, filters, and transforms data into an RDBMS or data warehouse; the streaming path delivers real-time incremental changes through in-memory processing into Hadoop.]
16. Modern CDC Applications
Data lake: Continuous ingestion and pipeline automation
Streaming: Data changes to Kafka, Kinesis, or other queues
Cloud: Data workload migration
Business applications: Data preparation for analytics and ML jobs
Legacy system: Data delivery and query offload
[Diagram: CDC feeding data lakes, cloud, streaming, data warehouse, file, legacy, and RDBMS targets.]
17. Methods of Change Data Capture
• Database triggers
• Data modification stamping
• Log-based CDC
19. Data Modification Stamping
Transactional applications keep track of metadata in every row
• Tracks when the row was created and last modified
• Enables filtering on the DATE_MODIFIED column
Challenges
• There is no DATE_MODIFIED value for a deleted row
• Maintaining DATE_MODIFIED often requires triggers
• Extracting changes is resource intensive
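As a sketch, DATE_MODIFIED-based extraction reduces to filtering on a watermark from the last successful run. The table and column names below are hypothetical, with SQLite standing in for the source database:

```python
# Sketch of DATE_MODIFIED-based incremental extraction (hypothetical
# table/column names), using SQLite to stand in for the source RDBMS.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE orders (id INTEGER PRIMARY KEY, amount REAL, date_modified TEXT)"
)
conn.executemany(
    "INSERT INTO orders VALUES (?, ?, ?)",
    [(1, 10.0, "2019-01-01T00:00:00"),
     (2, 20.0, "2019-01-02T12:00:00"),
     (3, 30.0, "2019-01-03T08:30:00")],
)

def extract_changes(conn, watermark):
    """Return rows modified since the last successful extract."""
    cur = conn.execute(
        "SELECT id, amount, date_modified FROM orders "
        "WHERE date_modified > ? ORDER BY date_modified",
        (watermark,),
    )
    return cur.fetchall()

# Only rows changed after the watermark are pulled. Note the limitation
# called out above: a row DELETEd at the source simply vanishes -- there
# is no DATE_MODIFIED left to filter on, so the delete is never captured.
changes = extract_changes(conn, "2019-01-02T00:00:00")
print(changes)  # rows 2 and 3
```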
20. Log Based CDC
Uses the database transaction logs
Challenges
• Interpreting proprietary transaction log formats
• Vendors provide no direct interface to the transaction log
• Agents and interfaces change with new database versions
• Supplemental logging increases the volume of logged data
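Whatever the log format, a log-based CDC consumer ultimately applies a stream of insert/update/delete events to a target. The event envelope below (op/before/after) is a hypothetical simplification of what such tools emit, not any specific vendor's schema:

```python
# Minimal sketch of consuming log-based CDC events. The envelope shape
# (op/before/after) is illustrative only, loosely modeled on common
# log-based CDC tools rather than any vendor's exact format.

def apply_event(replica, event):
    """Apply one change event (insert/update/delete) to an in-memory replica."""
    op = event["op"]
    if op in ("insert", "update"):
        row = event["after"]          # post-image of the row
        replica[row["id"]] = row
    elif op == "delete":
        replica.pop(event["before"]["id"], None)  # pre-image carries the key
    return replica

events = [
    {"op": "insert", "after": {"id": 1, "name": "alice"}},
    {"op": "update", "before": {"id": 1, "name": "alice"},
     "after": {"id": 1, "name": "alicia"}},
    {"op": "insert", "after": {"id": 2, "name": "bob"}},
    {"op": "delete", "before": {"id": 2, "name": "bob"}},
]

replica = {}
for e in events:
    apply_event(replica, e)

print(replica)  # {1: {'id': 1, 'name': 'alicia'}}
```

Because deletes arrive as explicit events, this method avoids the missing-delete problem of DATE_MODIFIED stamping.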
22. Next-Gen Architecture Considerations
Ease of use: pre-packaged operators, extensibility, and a modern user experience
Real-time: change data capture, streaming of live updates, optimized for high performance
Hybrid: multiple vendors; on-premise and cloud; databases, data warehouses, and data lakes
23. Value Proposition of CDC
Incremental update efficiency
Source/production impact
Time to value
TCO
Scale and flexibility
24. What are the different change data capture use cases currently deployed in your organization (choose all that apply)?
• Continuous ingestion in the data lake: 46%
• Capturing streaming data changes: 58%
• Database migration to cloud: 38%
• Data preparation for analytics and ML jobs: 35%
• We still have a legacy system: 46%
25. ETL, Real-time Stream Processing and Machine Learning Platform
+ A Visual IDE for Apache Spark
26. CDC with StreamAnalytix
• Turnkey adapters for CDC vendors
• ETL and data wrangling visual operators
• Elastic compute
[Diagram: CDC streams, unstructured data streams, structured data stores, and file stores flow through Transform, Enrich, and Reconcile stages into structured data stores, message queues, Hadoop/Hive, and cloud storage and data warehouses.]
29. CDC Capabilities in StreamAnalytix
Turnkey reconciliation feature for Hadoop offload
30. CDC Capabilities in StreamAnalytix
Large set of visual operators for ETL, analytics, and stream processing
Zero-code approach to ETL design
Built-in NFR support
31. StreamAnalytix CDC Solution Design
StreamAnalytix Workflow
A complete CDC solution has three parts; each is modelled as a StreamAnalytix pipeline:
1. Data ingestion and staging: stream data from Attunity Replicate (via Kafka) or LogMiner for multiple tables, and store the raw data in HDFS
2. Data de-normalization: join transactional data with data at rest, and store the de-normalized data on HDFS
3. Incremental updates in Hive: merge previously processed transactional data with new incremental updates
32. Pipeline #1: Data Ingestion and Staging (Streaming)
Data ingestion via Attunity ‘channel’: reads the data that Attunity Replicate publishes to Kafka; configured to read data feeds and metadata from separate topics
Data enrichment: enriches incoming data with metadata and an event timestamp
HDFS: stores CDC data in an HDFS landing area using the out-of-the-box (OOB) HDFS emitter; HDFS files are rotated based on time and size
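The enrichment step in Pipeline #1 can be sketched as attaching metadata and an event timestamp to each raw record before it lands on HDFS. Record and field names here are illustrative, not the platform's actual schema:

```python
# Sketch of the Pipeline #1 enrichment step: attach source metadata and
# an event timestamp to each raw CDC record before writing it to the
# HDFS landing area. Field names prefixed with "_" are illustrative.
import json
import time

def enrich(record, source_table):
    enriched = dict(record)                       # keep the raw payload
    enriched["_source_table"] = source_table      # table-level metadata
    enriched["_event_ts"] = int(time.time() * 1000)  # epoch millis
    return enriched

raw = {"op": "update", "id": 42, "amount": 99.5}
line = json.dumps(enrich(raw, "sales.orders"))    # one line of the landing file
```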
35. Pipeline #2: Data De-normalization (Batch)
HDFS data channel: ingests incremental data from previous runs of the staging location, and reads reference data (data at rest) from a fixed HDFS location
Join: performs an outer join to merge the incremental and static data
HDFS: stores the de-normalized data to an HDFS directory
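The de-normalization step boils down to a full outer join of the incremental CDC rows with reference data at rest, keyed on a shared id. The real pipeline runs this as a Spark join; plain Python dicts are used below purely for illustration:

```python
# Sketch of the Pipeline #2 de-normalization step: a full outer join of
# incremental CDC rows with reference data, keyed on "id". In the actual
# pipeline this is a Spark join; dicts are used here for illustration.

def outer_join(incremental, reference, key="id"):
    left = {r[key]: r for r in incremental}
    right = {r[key]: r for r in reference}
    merged = []
    for k in sorted(set(left) | set(right)):   # full outer: union of keys
        row = {}
        row.update(right.get(k, {}))           # reference fields first
        row.update(left.get(k, {}))            # incremental values win on conflict
        merged.append(row)
    return merged

incremental = [{"id": 1, "amount": 150.0}]
reference = [{"id": 1, "region": "EMEA"}, {"id": 2, "region": "APAC"}]
print(outer_join(incremental, reference))
# id 1 carries both amount and region; id 2 appears with reference data only
```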
39. Pipeline #3: Incremental Updates in Hive (Batch)
Load step: runs a Hive SQL query to load a managed table from the HDFS incremental data generated by Pipeline #2
Reconciliation step: a Hive “merge into” SQL statement performs insert, update, and delete operations based on the operation flag in the incremental data
Clean-up step: runs a drop table command on the managed table to clean up processed data and avoid repeated processing
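The reconciliation step can be sketched by applying each staged row's operation flag against the target table. Hive's MERGE INTO is emulated below with SQLite (which lacks MERGE); the table names and I/U/D flags are illustrative assumptions:

```python
# Sketch of the Pipeline #3 reconciliation step. Hive's MERGE INTO is
# emulated with SQLite by applying each staged row's operation flag
# (I = insert, U = update, D = delete) against the target table.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE target (id INTEGER PRIMARY KEY, amount REAL)")
conn.execute("CREATE TABLE staging (id INTEGER, amount REAL, op TEXT)")
conn.execute("INSERT INTO target VALUES (1, 10.0), (2, 20.0)")
conn.executemany("INSERT INTO staging VALUES (?, ?, ?)", [
    (1, 15.0, "U"),   # update an existing row
    (2, None, "D"),   # delete an existing row
    (3, 30.0, "I"),   # insert a new row
])

# Equivalent in intent to Hive:
#   MERGE INTO target USING staging ON target.id = staging.id
#   WHEN MATCHED AND op = 'D' THEN DELETE
#   WHEN MATCHED THEN UPDATE ...
#   WHEN NOT MATCHED THEN INSERT ...
conn.execute(
    "DELETE FROM target WHERE id IN (SELECT id FROM staging WHERE op = 'D')"
)
conn.execute("""
    INSERT OR REPLACE INTO target (id, amount)
    SELECT id, amount FROM staging WHERE op IN ('I', 'U')
""")
conn.execute("DELETE FROM staging")  # clean-up step: avoid reprocessing

print(conn.execute("SELECT id, amount FROM target ORDER BY id").fetchall())
# [(1, 15.0), (3, 30.0)]
```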
42. Workflow: Oozie Coordinator Job
An Oozie orchestration flow created using the StreamAnalytix web studio; it combines Pipeline #2 and Pipeline #3 into a single Oozie flow that can be scheduled as shown here
43. Demo
Data channels
• Attunity Replicate and LogMiner
Data processing pipeline walkthrough
• Data filters and enrichment
• Analytics and data processing operators
• Data stores
44. Summary
Do more with your data acquisition flows
• Acquire and process data in real-time
• Enrich data from data marts
• Publish processed data as it arrives
• Multiple parallel processing paths (read once, process multiple times)
Move away from fragmented processes
• Unify data analytics and data processing/ETL flows
45. Conclusion
The right next-gen CDC solution can make data ready for analytics as it arrives, in near real-time
CDC-based data integration is far more complex than a full export and import of your database
A unified platform simplifies and reduces the complexity of operationalizing CDC flows
46. LIVE Q&A
For a free trial download or cloud access visit www.StreamAnalytix.com
For any questions, contact us at inquiry@streamanalytix.com
Editor's notes
intro - 5 min
Poll - 2 mins
Saurabh
Background 2 min
Goals 3 min
Steps 3 min
Methods 3 min
Architectural considerations 2 min
Sameer
CDC with SAX 3 min
Deployment
Key Benefits 3 min
Demo - 10 min
Beyond CDC - 5 min
Q&A - 10 mins
CDC minimizes the resources required for ETL (extract, transform, load) processes because it deals only with data changes. The goal of CDC is to keep source and target data in sync.
DataGenerator: generates a dummy record on pipeline start.
LoadStagingInHive: runs a Hive SQL query to load a managed staging table from the HDFS incremental data.
MergeStagingAndMasterData: runs a Hive “merge into” statement to reconcile the staging data with the master table.
DropStagingHiveData: runs a drop table command in Hive to drop the managed table loaded in the second step.