Real Time Big Data Analytic Platform with
Oracle GoldenGate Big Data Adapters
Rajit Saha Principal Big Data Engineer
Vengata(Venky) Guruswamy Principal Database Administrator
Confidential
Agenda
• Introduction - LendingClub
• Real-time Big Data Platform Use Cases
• Lambda Architecture – On-Premise / AWS
• Data Processing Flow/Algorithm
• GoldenGate Big Data Adapter Implementation Architecture, Configuration and Troubleshooting Scenarios
LendingClub
LendingClub is America’s largest online marketplace connecting borrowers and investors.
• Headquartered in San Francisco, CA
• Office in Westborough, MA
Product Lines
Personal Loans
• Loans up to $40K
• 600+ FICO
• 36 and 60 mo. terms
Small Business
• Business loans up to $300K
• At least $75,000 in annual sales
• At least 2 years in business
Patient Financing
• Extended plans up to $50K; no-interest plans up to $32K
Auto Refinance
• Must have an outstanding balance of $5K-$55K, initiated in the last 3 months with 24 months of remaining payments
Use Cases of Oracle GoldenGate @ LendingClub
• Maintain reporting databases
• Implement transaction history for key tables (Change Data Capture)
• Enable real-time feed to the Big Data platform
Why Big Data Platform?
• LendingClub’s Big Data Platform is responsible for generating thousands of reports for the company: daily, monthly, quarterly
• Investor Reports, Financial Reports, Collection Reports, Risk Reports, Marketing Reports, etc.
Why Real-time Data Needed?
• Near real-time availability of OLTP data in the Hadoop-based OLAP warehouse
• Thousands of tables and rapidly increasing storage needs
• Hundreds of additional internal ad hoc users added every month
Real-time Use Cases - NO CALL List Generation
The NO CALL list is a list of phone numbers we upload to the dialer system in order to suppress Collections calls due to loan and/or customer status changes throughout the day.
• Typical changes are:
  • Positive response from customers
  • Indicated DON’T CALL on a particular phone number from previous contacts
• Under certain circumstances, the number of times we can contact a borrower can also be limited for legal or regulatory reasons.
• The NO CALL list needs to be uploaded and fed into the dialer in a very timely fashion, ideally in near real time as the event/change occurs.
Real-time Use Cases - Marketing Dashboard
What is it?
Track the progress of all the marketing channels with respect to the quarterly forecast.
Why are we using Real-time?
The business needs numbers every hour so it can adjust strategy based on each channel's performance compared to target.
Who is it for?
Marketing Team
Lambda Architecture - On-Premise HortonWorks Cluster
[Architecture diagram. Labels: EXTRACT, TRAIL, REPLICAT, INCREMENTAL TRANSACTIONS, Data Science, CONSUMERS]
Lambda Architecture - AWS S3/EMR Clusters
[Architecture diagram. Labels: EXTRACT, TRAIL, REPLICAT, INCREMENTAL TRANSACTIONS, Data Science, CONSUMERS]
SPEED LAYER - DATA FLOW DIAGRAM
Process Flow - Few Important Steps in Speed Layer Processing
• DDL verification from Oracle source to GG table before processing starts
• Data inserted from Avro GG table to CDC partition table
  • Hive dynamic partition on OP_TS
  • ORC (Optimized Row Columnar) format
• Nightly snapshot creation
  • Latest updates from CDC tables and the rest from the previous day's snapshot
• Create Hive and Presto Real Time View with CDC and nightly snapshot
• Data validation: full-table checksum with Sqoop snapshot
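The nightly snapshot step (latest updates from CDC, the rest from the previous day's snapshot) can be pictured as a simple merge. The following is a minimal Python sketch under stated assumptions, not the production Hive job: rows are held in dicts keyed by primary key, the CDC input is already reduced to the latest change per key, and the handling of deletes (op_type "D" removes the row) is an assumption the slides do not describe. The `status` column and all sample values are made up for illustration.

```python
from typing import Dict

Row = Dict[str, str]


def build_snapshot(previous: Dict[str, Row], latest_cdc: Dict[str, Row]) -> Dict[str, Row]:
    """Merge the previous day's snapshot with the latest CDC row per key."""
    snapshot = dict(previous)  # start from the previous day's snapshot
    for pkey, row in latest_cdc.items():
        if row.get("op_type") == "D":
            # Assumption: a delete seen in CDC removes the row from the snapshot.
            snapshot.pop(pkey, None)
        else:
            # Inserts and updates replace whatever the old snapshot held.
            snapshot[pkey] = row
    return snapshot


# Hypothetical sample data.
previous_day = {
    "1": {"pkey": "1", "status": "ISSUED", "op_type": "I"},
    "2": {"pkey": "2", "status": "ISSUED", "op_type": "I"},
}
todays_changes = {
    "2": {"pkey": "2", "status": "PAID", "op_type": "U"},
    "3": {"pkey": "3", "status": "ISSUED", "op_type": "I"},
}
new_snapshot = build_snapshot(previous_day, todays_changes)
```

Untouched keys come straight from the previous snapshot, which is why only the changed rows need to be read from the CDC table each night.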
How are we getting the latest update for a row of a table within a transaction?
Input Table From GG Adapter
CREATE EXTERNAL TABLE
SCHEMA_GG.FOO (
TABLE STRING,
OP_TYPE STRING,
OP_TS STRING,
CURRENT_TS STRING ,
POS STRING ,
PRIMARY_KEYS ARRAY<STRING> ,
TOKENS MAP<STRING,STRING> ,
PKEY STRING ,
..
..
..
SELECT TOKENS FROM SCHEMA_GG.FOO LIMIT 1;
{"TKN-OPTYPE":"INSERT","TKN-RECORDLENGTH":"1400","TKN-LOGRBA":"1774157",
"TKN-OBJECTNAME":"SCHEMA.FOO","TKN-FILERBA":"","TKN-LAGMSEC":"54420",
"TKN-USERNAME":"XXX","TKN-SCN":"7633890649912","TKN-RSN":"7633890649838",
"TKN-TRANSACTIONINDICATOR":"BEGIN","TKN-RECORDTIMESTAMP":"2017-07-30 23:49:14",
"TKN-LOGPOSITION":"19784208","TKN-FILESEQNO":"","TKN-ROWID":"AAA46FACCAAD4CtAAS",
"TKN-XID":"184.13.1380795","TKN-COMMITTIMESTAMP":"2017-07-30 23:49:15.000000","TKN-REDOTHREAD":"1"}
All the transaction records go to the Hive partitioned ORC CDC table SCHEMA_RT_CDC.FOO
SELECT DISTINCT VW.PKEY, ..
FROM (
  SELECT FOO.ID, ..
         RANK() OVER (PARTITION BY FOO.PKEY
                      ORDER BY FOO.TKN_SCN DESC, FOO.TKN_RSN DESC, FOO.POS DESC, FOO.CURRENT_TS DESC) AS DEDUP_RANK
  FROM SCHEMA_RT_CDC.FOO
  WHERE FOO.OP_TS_DATE >= '2017-07-30'
) VW
WHERE VW.DEDUP_RANK = 1
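The ranking logic of the query can also be sketched outside Hive. The following is a minimal Python illustration, not the production job: the record layout (dicts with lowercase column names) and the function names are assumptions made for the example, while the sample values are the CDC rows for PKEY 2181322736 shown on the verification slide.

```python
from typing import Dict, Iterable, List, Tuple


def sort_key(rec: Dict[str, str]) -> Tuple[int, int, str, str]:
    """Ordering that decides which change is the latest for a key.

    SCN and RSN are compared numerically; POS is a zero-padded string and
    CURRENT_TS is compared as text, as stored in the sample data.
    """
    return (int(rec["tkn_scn"]), int(rec["tkn_rsn"]), rec["pos"], rec["current_ts"])


def latest_per_key(records: Iterable[Dict[str, str]]) -> List[Dict[str, str]]:
    """Keep one record per pkey: the one that would get DEDUP_RANK = 1.

    Exact duplicates tie on the ordering; keeping the first one seen has the
    same effect as SELECT DISTINCT over the rank-1 rows.
    """
    best: Dict[str, Dict[str, str]] = {}
    for rec in records:
        current = best.get(rec["pkey"])
        if current is None or sort_key(rec) > sort_key(current):
            best[rec["pkey"]] = rec
    return list(best.values())


cdc_rows = [
    {"pkey": "2181322736", "tkn_scn": "7630205764444", "tkn_rsn": "7630205764072",
     "pos": "00004187440038392981", "current_ts": "2017-06-28 10:04:13.76",
     "modified_d": "2017-06-28 10:02:58.0", "op_type": "I"},
    {"pkey": "2181322736", "tkn_scn": "7630236411132", "tkn_rsn": "7630236411130",
     "pos": "00004191270019251264", "current_ts": "2017-06-28 17:41:09.519",
     "modified_d": "2017-06-28 17:39:16.0", "op_type": "U"},
    {"pkey": "2181322736", "tkn_scn": "7630245451950", "tkn_rsn": "7630245451909",
     "pos": "00004193750007692575", "current_ts": "2017-06-28 19:33:53.778",
     "modified_d": "2017-06-28 19:32:33.0", "op_type": "U"},
    # The CDC table can hold exact duplicates of a change record.
    {"pkey": "2181322736", "tkn_scn": "7630245451950", "tkn_rsn": "7630245451909",
     "pos": "00004193750007692575", "current_ts": "2017-06-28 19:33:53.778",
     "modified_d": "2017-06-28 19:32:33.0", "op_type": "U"},
]
result = latest_per_key(cdc_rows)
print(result[0]["modified_d"])  # 2017-06-28 19:32:33.0
```

The single surviving row carries the same MODIFIED_D that the Real Time View returns for this key.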
Verification of Real Time View definition
Records from CDC
SELECT PKEY, OP_TS, TKN_SCN, TKN_RSN, POS, CURRENT_TS, MODIFIED_D, OP_TYPE
FROM SCHEMA_RT_CDC.FOO
WHERE PKEY = 2181322736
ORDER BY TKN_SCN DESC, TKN_RSN DESC, POS DESC, CURRENT_TS DESC
PKEY        OP_TS                    TKN_SCN        TKN_RSN        POS                   CURRENT_TS               MODIFIED_D             OP_TYPE
2181322736  2017-06-29 02:32:33.0    7630245451950  7630245451909  00004193750007692575  2017-06-28 19:33:53.778  2017-06-28 19:32:33.0  U
2181322736  2017-06-29 02:32:33.0    7630245451950  7630245451909  00004193750007692575  2017-06-28 19:33:53.778  2017-06-28 19:32:33.0  U
2181322736  2017-06-29 00:39:16.001  7630236411132  7630236411130  00004191270019251264  2017-06-28 17:41:09.519  2017-06-28 17:39:16.0  U
2181322736  2017-06-29 00:39:16.001  7630236411132  7630236411130  00004191270019251264  2017-06-28 17:41:09.519  2017-06-28 17:39:16.0  U
2181322736  2017-06-28 17:02:59.002  7630205764444  7630205764072  00004187440038392981  2017-06-28 10:04:13.76   2017-06-28 10:02:58.0  I
2181322736  2017-06-28 17:02:59.002  7630205764444  7630205764072  00004187440038392981  2017-06-28 10:04:13.76   2017-06-28 10:02:58.0  I
2181322736  2017-06-28 17:02:59.002  7630205764444  7630205764072  00004187440038392981  2017-06-28 10:04:13.76   2017-06-28 10:02:58.0  I
2181322736  2017-06-28 17:02:59.002  7630205764444  7630205764072  00004187440038392981  2017-06-28 10:04:13.76   2017-06-28 10:02:58.0  I
Records from Real Time View
SELECT PKEY, MODIFIED_D FROM SCHEMA_RT.FOO WHERE PKEY = 2181322736
PKEY        MODIFIED_D
2181322736  2017-06-28 19:32:33.0
SLAs
• Transaction -> Import Process Start: avg ~5 mins
• Transaction -> Import Process End: avg ~5-15 mins
Spikes are for NIGHTLY SNAPSHOT creation:
1. Full table creation
2. TEZ is not good for skew JOIN
3. Cluster busy
GoldenGate Big Data Adapter Implementation Architecture
GoldenGate BigData Adapter Pipeline
[Pipeline diagram: DB nodes (DB, Extract, Data Pump) with trail files 1-3 → GoldenGate Adapter Servers (HA) with trail files 1-3 → Replicat for HDFS/Hive]
GoldenGate BigData Infrastructure Architecture
GoldenGate BigData Infrastructure Architecture – Site Failover
GoldenGate Big Data Adapter Configuration
Source Extract Parameter - eprod.prm
NOCOMPRESSUPDATES
GETUPDATEBEFORES
TABLE <schema name>.*, tokens (
TKN-COMMITTIMESTAMP = @GETENV('GGHEADER', 'COMMITTIMESTAMP'),
TKN-FILESEQNO = @GETENV('RECORD', 'FILESEQNO'),
TKN-FILERBA = @GETENV('RECORD', 'FILERBA'),
TKN-LAGMSEC = @GETENV('LAG', 'MSEC'),
TKN-LOGPOSITION = @GETENV('GGHEADER', 'LOGPOSITION'),
TKN-OBJECTNAME = @GETENV('GGHEADER', 'OBJECTNAME'),
TKN-OPTYPE = @GETENV('GGHEADER', 'OPTYPE'),
TKN-RECORDLENGTH = @GETENV('GGHEADER', 'RECORDLENGTH'),
TKN-RECORDTIMESTAMP = @GETENV('RECORD', 'TIMESTAMP'),
TKN-ROWID = @GETENV('RECORD', 'ROWID'),
TKN-RSN = @GETENV('RECORD', 'RSN'),
TKN-SCN = @GETENV('TRANSACTION', 'CSN'),
TKN-TRANSACTIONINDICATOR = @GETENV('GGHEADER', 'TRANSACTIONINDICATOR'),
TKN-USERNAME = @GETENV('TRANSACTION', 'USERNAME'),
TKN-XID = @GETENV('TRANSACTION', 'XID')
);
Source Data Pump Extract Parameter - pprod.prm
EXTRACT PPROD
PASSTHRU
RMTHOST <Goldengate adapter server>, MGRPORT <port no>
-- Trail file written by Pump to the remote host
RMTTRAIL /filesystem/pumptrail_ALL/pt
-- Pass data through without mapping, filtering, conversion:
PASSTHRU
-- Specify tables to be captured:
TABLE *.*;
Target – Replicat Parameter - rhdfs.prm
REPLICAT RHDFS
TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
REPORTCOUNT EVERY 1 MINUTES, RATE
GROUPTRANSOPS 10000
-- Schemas
MAP SCHEMA1.*, TARGET SCHEMA1_GG.*;
MAP SCHEMA2.*, TARGET SCHEMA2_GG.*;
-- Heartbeat
MAP DBASCHEMA.HEARTBEAT, TARGET DBASCHEMA_GG.HEARTBEAT;
Target – Replicat Parameters - hdfs.props
gg.handler.hdfs.type=hdfs
## performance
gg.handler.hdfs.maxFileSize=256m
gg.handler.hdfs.fileRollInterval=5m
gg.handler.hdfs.inactivityRollInterval=5m
# config
gg.handler.hdfs.fileSuffix=.avro
gg.handler.hdfs.partitionByTable=true
gg.handler.hdfs.rollOnMetadataChange=true
gg.handler.hdfs.format=avro_row_ocf
## hive jdbc config
gg.handler.hdfs.schemaFilePath=/ogg/schema
gg.handler.hdfs.rootFilePath=/ogg/data
gg.handler.hdfs.hiveJdbcUrl=jdbc:hive2://@Hive_Server_Name:Hive_Port
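The three performance settings in hdfs.props control when the handler closes the current HDFS file and starts a new one. The sketch below is a rough Python model of that decision, not the handler's actual implementation: it assumes a file is rolled when it reaches the maximum size, has been open longer than the roll interval, or has received no writes for the inactivity interval. All names in the code are hypothetical.

```python
from dataclasses import dataclass

MAX_FILE_SIZE = 256 * 1024 * 1024   # maxFileSize=256m, in bytes
FILE_ROLL_INTERVAL = 5 * 60         # fileRollInterval=5m, in seconds
INACTIVITY_ROLL_INTERVAL = 5 * 60   # inactivityRollInterval=5m, in seconds


@dataclass
class OpenFile:
    size_bytes: int
    opened_at: float      # seconds since some reference point
    last_write_at: float  # seconds since the same reference point


def should_roll(f: OpenFile, now: float) -> bool:
    """Assumed roll conditions: size, age, or inactivity."""
    too_big = f.size_bytes >= MAX_FILE_SIZE
    too_old = now - f.opened_at >= FILE_ROLL_INTERVAL
    idle = now - f.last_write_at >= INACTIVITY_ROLL_INTERVAL
    return too_big or too_old or idle


# A small file opened a minute ago and written to recently stays open.
keep_open = should_roll(OpenFile(1024, opened_at=0.0, last_write_at=50.0), now=60.0)
# The same file is rolled once it has been open for five minutes.
rolled_by_age = should_roll(OpenFile(1024, opened_at=0.0, last_write_at=290.0), now=300.0)
```

With both intervals at five minutes, a table with steady traffic produces a new file at least every five minutes, which is consistent with the import SLA figures quoted earlier in the deck.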
Oracle GoldenGate to AWS S3 Connector
s3replicat.props
gg.handlerlist=hdfs
gg.handler.hdfs.type=hdfs
gg.handler.hdfs.rootFilePath=s3://LCBUCKET/OGG
Custom-built Hadoop client from trunk: hadoop-3.0.0-alpha3-SNAPSHOT
<property>
<name>fs.s3a.access.key</name>
<value><< AWS KEY >></value>
</property>
<property>
<name>fs.s3a.secret.key</name>
<value><<AWS SECRET KEY >></value>
</property>
<property>
<name>fs.s3a.proxy.host</name>
<value> << PROXY SERVER >> </value>
</property>
<property>
<name>fs.s3a.proxy.port</name>
<value><< PROXY PORT >></value>
</property>
<property>
<name>fs.s3a.connection.ssl.enabled</name>
<value>true</value>
</property>
<property>
<name>fs.s3.impl</name>
<value>org.apache.hadoop.fs.s3a.S3AFileSystem</value>
</property>
<property>
<name>fs.s3a.server-side-encryption-algorithm</name>
<value>SSE-KMS</value>
</property>
<property>
<name>fs.s3a.server-side-encryption-key</name>
<value>arn:aws:kms:<< KEY >></value>
</property>
The Hadoop client talks to S3 through the hadoop-aws package, using the s3a protocol.
EMR only understands s3, so the s3a-to-s3 mapping is done by the fs.s3.impl property above.
Security - Authentication
### Kerberos params
gg.handler.name.authType=kerberos
gg.handler.name.kerberosKeytabFile=/etc/security/keytabs/lcapp.headless.keytab
gg.handler.name.kerberosPrincipal=osuser@PROD-DOMAIN.COM
Security - Username/Password Encryption
ORACLEWALLETUSERNAME ggadalias ggadapters
ORACLEWALLETPASSWORD ggadalias ggadapters
# Encrypted, via credential store alias and domain:
gg.handler.hdfs.hiveJdbcUsername=ORACLEWALLETUSERNAME[ggadalias ggadapters]
gg.handler.hdfs.hiveJdbcPassword=ORACLEWALLETPASSWORD[ggadalias ggadapters]
# Unencrypted equivalent:
gg.handler.hdfs.hiveJdbcUsername=unencrypted_username
gg.handler.hdfs.hiveJdbcPassword=unencrypted_password
Security - Credential Store
Password encryption steps:
1. Add credential store:
GGSCI > ADD CREDENTIALSTORE
Credential store created in ./dircrd/.
2. Add the credential for these users:
ALTER CREDENTIALSTORE ADD USER <Hive_username>, password <Hive_password> alias ggadalias domain ggadapters
How Does the HDFS Handler Handle the Data Flow?
• The output format chosen is Avro Row OCF (avro_row_ocf), for three reasons:
  • Integration with Hive
  • Seamless schema evolution
  • More compact than Avro Op OCF
• Two files are produced: an Avro data file (.avro) and an Avro schema file (.avsc).
Configure High Availability for GoldenGate Resource
As the root user:
agctl add goldengate gg1 --gg_home /bdggvol/ggaws --instance_type dual
--nodes node1,node2 --network 1 --ip 99.99.99.99 --user appuser --group oinstall
--filesystems ora.registry.acfs --environment_vars
"LD_LIBRARY_PATH=/opt/java/default/jre/lib/amd64/server:/lib,
PATH=/opt/java/default/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin,
JAVA_HOME=/opt/java/default"
$GRID_HOME/bin/crsctl setperm resource xag.gg1-vip.vip -u user:oracle:rwx
$GRID_HOME/bin/crsctl setperm resource xag.gg1-vip.vip -u group:oinstall:rwx
$GRID_HOME/bin/crsctl setperm resource xag.gg1.goldengate -u group:oinstall:rwx
$GRID_HOME/bin/crsctl setperm resource ora.net1.network -u user:appuser:rwx
Schema Regex
Avro doesn’t like the ($) symbol! No problem:
gg.schemareplaceregex=[$:]
gg.schemareplacestring=_
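To see what this replacement does to a name, the same character class can be tried in Python. This only illustrates the regex [$:] with the replacement string _; the adapter itself applies the pattern in Java, and the sample names below are made up.

```python
import re

SCHEMA_REPLACE_REGEX = r"[$:]"  # same pattern as gg.schemareplaceregex
SCHEMA_REPLACE_STRING = "_"     # same value as gg.schemareplacestring


def avro_safe_name(name: str) -> str:
    """Replace every '$' or ':' in a name with '_'."""
    return re.sub(SCHEMA_REPLACE_REGEX, SCHEMA_REPLACE_STRING, name)


# Hypothetical object names.
examples = ["SYS$LOG", "AUDIT:TRAIL", "PLAIN_NAME"]
cleaned = [avro_safe_name(n) for n in examples]
print(cleaned)  # ['SYS_LOG', 'AUDIT_TRAIL', 'PLAIN_NAME']
```

Inside a character class the dollar sign is a literal, so no escaping is needed in the pattern.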
GoldenGate Big Data Adapter – Things To Remember
• The upgrade from 12.2 to 12.3 changed the timestamp format from [YYYY-MM-DD:HH24:MI:SS.FFF] to [YYYY-MM-DD HH24:MI:SS.FFF].
• To avoid the dreaded "ERROR OGG-15050 Error loading Java VM runtime library: (2 No such file or directory)", make sure JAVA_HOME, LD_LIBRARY_PATH and PATH are set properly. Remember that the manager passes the ENV variables.
• The adapter is a multi-threaded process, so make sure to allocate sufficient Unix resources such as NPROC.
• Monitor the heartbeat table from Hive.
• Design the application to be idempotent.
• Run the HDFS handler from a dedicated Hadoop client node.
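Downstream code that parses these timestamps may need to accept both layouts while the 12.2 to 12.3 upgrade is rolled out. The following is a minimal Python sketch that assumes only the two layouts quoted on this slide; the function name parse_gg_timestamp and the sample values are hypothetical.

```python
from datetime import datetime

# 12.2 puts ':' between date and time; 12.3 uses a space.
FORMATS = ("%Y-%m-%d:%H:%M:%S.%f", "%Y-%m-%d %H:%M:%S.%f")


def parse_gg_timestamp(value: str) -> datetime:
    """Parse a timestamp written in either layout."""
    for fmt in FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue  # try the next layout
    raise ValueError(f"unrecognized timestamp layout: {value!r}")


old_style = parse_gg_timestamp("2017-06-28:19:32:33.000")
new_style = parse_gg_timestamp("2017-06-28 19:32:33.000")
```

Both calls yield the same datetime value, so consumers can be upgraded before or after the adapter without reprocessing old files.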
Troubleshooting Scenarios
Rewind – Functionality
Before Patch - 17995064
After Patch - 17995064
LOBs - CLOBs - BLOBs
• The data team needed the CLOBs even when the CLOB is not updated.
• GETUPDATEBEFORES doesn’t work for CLOBs.
• LOBs: Always Include LOBS In Trail File (Doc ID 1639717.1)
• Include the extract parameter NOCOMPRESSUPDATES and the FETCHCOLS option in the TABLE parameter, for example:
NOCOMPRESSUPDATES
TABLE schema.mytable, fetchcols (mylob1, mylob2);
Ability to handle Hadoop failures
• SR 3-15688859001: Data loss after a glitch between the GoldenGate adapter and the NameNode
  • Please test the rewind functionality and the ability to handle duplicate data.
  • Please upgrade to the latest release, OGGBD 12.3.1.1.0, which contains the fix for the abend issue.
• Monitor the heartbeat table in Hive.
• Monitor the ERROR messages from logfiles - RHDFS_info_log4j.log [Splunk]
  • java.io.EOFException: Premature EOF: no length prefix available
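The heartbeat check can be as simple as comparing the newest heartbeat timestamp visible in Hive with the current time. The following is a minimal Python sketch: fetching the timestamp from Hive is left out, and the function name and the 15-minute threshold are assumptions (the threshold loosely follows the 5-15 minute import SLA quoted earlier in the deck).

```python
from datetime import datetime, timedelta


def heartbeat_is_stale(last_heartbeat: datetime, now: datetime,
                       max_lag: timedelta = timedelta(minutes=15)) -> bool:
    """Return True when the newest heartbeat row is older than max_lag."""
    return now - last_heartbeat > max_lag


# Hypothetical timestamps.
now = datetime(2017, 7, 30, 23, 59, 0)
fresh = heartbeat_is_stale(datetime(2017, 7, 30, 23, 50, 0), now)  # 9 minutes behind
stale = heartbeat_is_stale(datetime(2017, 7, 30, 23, 30, 0), now)  # 29 minutes behind
```

A stale heartbeat is worth alerting on even when the Replicat reports as running, since the data-loss scenario above involved a process that did not surface the failure on its own.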
GoldenGate Big Data Adapter - Monitoring [Wavefront]
Q & A
Thank You

How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFAAndrei Kaleshka
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.pptamreenkhanum0307
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryJeremy Anderson
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...Boston Institute of Analytics
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...GQ Research
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档208367051
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfBoston Institute of Analytics
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceSapana Sha
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)jennyeacort
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our WorldEduminds Learning
 

Último (20)

20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf20240419 - Measurecamp Amsterdam - SAM.pdf
20240419 - Measurecamp Amsterdam - SAM.pdf
 
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
1:1定制(UQ毕业证)昆士兰大学毕业证成绩单修改留信学历认证原版一模一样
 
DBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdfDBA Basics: Getting Started with Performance Tuning.pdf
DBA Basics: Getting Started with Performance Tuning.pdf
 
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
2006_GasProcessing_HB (1).pdf HYDROCARBON PROCESSING
 
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
办理学位证中佛罗里达大学毕业证,UCF成绩单原版一比一
 
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTDINTERNSHIP ON PURBASHA COMPOSITE TEX LTD
INTERNSHIP ON PURBASHA COMPOSITE TEX LTD
 
Multiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdfMultiple time frame trading analysis -brianshannon.pdf
Multiple time frame trading analysis -brianshannon.pdf
 
ASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel CanterASML's Taxonomy Adventure by Daniel Canter
ASML's Taxonomy Adventure by Daniel Canter
 
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
科罗拉多大学波尔得分校毕业证学位证成绩单-可办理
 
Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...Student Profile Sample report on improving academic performance by uniting gr...
Student Profile Sample report on improving academic performance by uniting gr...
 
How we prevented account sharing with MFA
How we prevented account sharing with MFAHow we prevented account sharing with MFA
How we prevented account sharing with MFA
 
Machine learning classification ppt.ppt
Machine learning classification  ppt.pptMachine learning classification  ppt.ppt
Machine learning classification ppt.ppt
 
Defining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data StoryDefining Constituents, Data Vizzes and Telling a Data Story
Defining Constituents, Data Vizzes and Telling a Data Story
 
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
NLP Data Science Project Presentation:Predicting Heart Disease with NLP Data ...
 
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
Biometric Authentication: The Evolution, Applications, Benefits and Challenge...
 
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
原版1:1定制南十字星大学毕业证(SCU毕业证)#文凭成绩单#真实留信学历认证永久存档
 
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdfPredicting Salary Using Data Science: A Comprehensive Analysis.pdf
Predicting Salary Using Data Science: A Comprehensive Analysis.pdf
 
Call Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts ServiceCall Girls In Dwarka 9654467111 Escorts Service
Call Girls In Dwarka 9654467111 Escorts Service
 
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
Call Us ➥97111√47426🤳Call Girls in Aerocity (Delhi NCR)
 
Learn How Data Science Changes Our World
Learn How Data Science Changes Our WorldLearn How Data Science Changes Our World
Learn How Data Science Changes Our World
 

Oracle Goldengate for Big Data - LendingClub Implementation

  • 1. Real Time Big Data Analytic Platform with Oracle GoldenGate Big Data Adapters. Rajit Saha, Principal Big Data Engineer; Vengata (Venky) Guruswamy, Principal Database Administrator.
  • 2. Confidential. Agenda: Introduction to LendingClub; real-time Big Data platform use cases; Lambda Architecture (on-premise / AWS); data processing flow/algorithm; GoldenGate Big Data Adapter implementation architecture, configuration, and troubleshooting scenarios.
  • 3. LendingClub: LendingClub is America’s largest online marketplace connecting borrowers and investors. Headquartered in San Francisco, CA, with an office in Westborough, MA.
  • 4. Product Lines. Personal Loans: loans up to $40K; 600+ FICO; 36- and 60-month terms. Small Business: business loans up to $300K; at least $75,000 in annual sales; at least 2 years in business. Patient Financing: extended plans up to $50K; no-interest plans up to $32K. Auto Refinance: must have an outstanding balance of $5K-$55K, initiated in the last 3 months, with 24 months of remaining payments.
  • 5. Use Cases of Oracle GoldenGate @ LendingClub: maintain reporting databases; implement transaction history for key tables (change data capture); enable a real-time feed to the Big Data platform.
  • 6. Why a Big Data Platform? LendingClub’s Big Data Platform is responsible for generating thousands of reports for the company (daily, monthly, quarterly): investor, financial, collection, risk, and marketing reports, etc. Why is real-time data needed? Near-real-time availability of OLTP data in the Hadoop-based OLAP warehouse; thousands of tables and rapidly increasing storage needs; hundreds of additional internal ad hoc users added every month.
  • 7. Real-time Use Cases: NO CALL List Generation. The NO CALL list is a list of phone numbers we upload to the dialer system in order to suppress Collections calls due to loan and/or customer status changes throughout the day. Typical changes are: a positive response from the customer; DON’T CALL indicated on a particular phone number from previous contacts; under certain circumstances, legal or regulatory reasons can also limit how many times we can contact a borrower. The NO CALL list needs to be uploaded and fed into the dialer in a very timely fashion, ideally in near real time after the event/change occurs.
  • 8. Real-time Use Cases: Marketing Dashboard. What is it? It tracks the progress of all the marketing channels with respect to the quarterly forecast. Why are we using real-time? The business needs numbers every hour so it can change strategy based on how each channel is performing compared to target. Who is it for? The Marketing team.
  • 9. Lambda Architecture: On-Premise Hortonworks Cluster. [Diagram labels: EXTRACT, TRAIL, REPLICAT, incremental transactions, Data Science, consumers]
  • 10. Lambda Architecture: AWS S3/EMR Clusters. [Diagram labels: EXTRACT, TRAIL, REPLICAT, incremental transactions, Data Science, consumers]
  • 11. Speed Layer: Data Flow Diagram
  • 12. Process Flow: A Few Important Steps in Speed Layer Processing. DDL verification from the Oracle source to the GG table before processing starts; data inserted from the Avro GG table into the CDC partition table; Hive dynamic partition on OP_TS; ORC (Optimized Row Columnar) format; nightly snapshot creation (latest updates from the CDC tables and the rest from the previous day's snapshot); create Hive and Presto real-time views from the CDC tables and the nightly snapshot; data validation: full-table checksum against the Sqoop snapshot.
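The nightly snapshot step above (latest updates from CDC, the rest from the previous day's snapshot) can be sketched in plain Python. This is only an illustration under stated assumptions, not the deck's Hive job: `build_snapshot` and the dict layout are hypothetical, and treating `OP_TYPE == "D"` as a delete is an assumption (the deck only shows `I` and `U`).

```python
from typing import Dict


def build_snapshot(previous: Dict[str, dict], cdc_latest: Dict[str, dict]) -> Dict[str, dict]:
    """Merge the previous day's snapshot with the latest CDC record per key.

    previous:   PKEY -> row from yesterday's snapshot
    cdc_latest: PKEY -> most recent CDC record for that key (already deduplicated)
    """
    snapshot = dict(previous)  # start from yesterday's rows
    for pkey, rec in cdc_latest.items():
        if rec.get("OP_TYPE") == "D":  # assumption: "D" marks a delete
            snapshot.pop(pkey, None)
        else:
            snapshot[pkey] = rec  # inserts and updates overwrite the old row
    return snapshot


previous_day = {
    "1": {"PKEY": "1", "STATUS": "open"},
    "2": {"PKEY": "2", "STATUS": "open"},
}
todays_changes = {
    "2": {"PKEY": "2", "STATUS": "closed", "OP_TYPE": "U"},
    "3": {"PKEY": "3", "STATUS": "open", "OP_TYPE": "I"},
}
new_snapshot = build_snapshot(previous_day, todays_changes)
```

Rows untouched during the day (key `1` here) are carried over unchanged, which is why the snapshot only needs the CDC records since the previous run.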
  • 13. How are we getting the latest update for a row of a table, within a transaction?
      Input table from the GG adapter:
        CREATE EXTERNAL TABLE SCHEMA_GG.FOO (
          TABLE STRING,
          OP_TYPE STRING,
          OP_TS STRING,
          CURRENT_TS STRING,
          POS STRING,
          PRIMARY_KEYS ARRAY<STRING>,
          TOKENS MAP<STRING,STRING>,
          PKEY STRING,
          .. .. ..
      SELECT TOKENS FROM SCHEMA_GG.FOO LIMIT 1;
        {"TKN-OPTYPE":"INSERT","TKN-RECORDLENGTH":"1400","TKN-LOGRBA":"1774157","TKN-OBJECTNAME":"SCHEMA.FOO","TKN-FILERBA":"","TKN-LAGMSEC":"54420","TKN-USERNAME":"XXX","TKN-SCN":"7633890649912","TKN-RSN":"7633890649838","TKN-TRANSACTIONINDICATOR":"BEGIN","TKN-RECORDTIMESTAMP":"2017-07-30 23:49:14","TKN-LOGPOSITION":"19784208","TKN-FILESEQNO":"","TKN-ROWID":"AAA46FACCAAD4CtAAS","TKN-XID":"184.13.1380795","TKN-COMMITTIMESTAMP":"2017-07-30 23:49:15.000000","TKN-REDOTHREAD":"1"}
      All the transaction records go to the Hive partitioned ORC CDC table SCHEMA_RT_CDC.FOO:
        SELECT DISTINCT VW.PKEY, ..
        FROM (SELECT FOO.ID, ..
                     RANK() OVER (PARTITION BY FOO.PKEY
                                  ORDER BY FOO.TKN_SCN DESC, FOO.TKN_RSN DESC, FOO.POS DESC, FOO.CURRENT_TS DESC) AS DEDUP_RANK
              FROM SCHEMA_RT_CDC.FOO
              WHERE FOO.OP_TS_DATE >= '2017-07-30') VW
        WHERE VW.DEDUP_RANK = 1
  • 14. Verification of Real Time View definition
      Records from CDC:
        SELECT PKEY, OP_TS, TKN_SCN, TKN_RSN, POS, CURRENT_TS, MODIFIED_D, OP_TYPE
        FROM SCHEMA_RT_CDC.FOO
        WHERE PKEY = 2181322736
        ORDER BY TKN_SCN DESC, TKN_RSN DESC, POS DESC, CURRENT_TS DESC

        PKEY        OP_TS                    TKN_SCN        TKN_RSN        POS                   CURRENT_TS               MODIFIED_D             OP_TYPE
        2181322736  2017-06-29 02:32:33.0    7630245451950  7630245451909  00004193750007692575  2017-06-28 19:33:53.778  2017-06-28 19:32:33.0  U
        2181322736  2017-06-29 02:32:33.0    7630245451950  7630245451909  00004193750007692575  2017-06-28 19:33:53.778  2017-06-28 19:32:33.0  U
        2181322736  2017-06-29 00:39:16.001  7630236411132  7630236411130  00004191270019251264  2017-06-28 17:41:09.519  2017-06-28 17:39:16.0  U
        2181322736  2017-06-29 00:39:16.001  7630236411132  7630236411130  00004191270019251264  2017-06-28 17:41:09.519  2017-06-28 17:39:16.0  U
        2181322736  2017-06-28 17:02:59.002  7630205764444  7630205764072  00004187440038392981  2017-06-28 10:04:13.76   2017-06-28 10:02:58.0  I
        2181322736  2017-06-28 17:02:59.002  7630205764444  7630205764072  00004187440038392981  2017-06-28 10:04:13.76   2017-06-28 10:02:58.0  I
        2181322736  2017-06-28 17:02:59.002  7630205764444  7630205764072  00004187440038392981  2017-06-28 10:04:13.76   2017-06-28 10:02:58.0  I
        2181322736  2017-06-28 17:02:59.002  7630205764444  7630205764072  00004187440038392981  2017-06-28 10:04:13.76   2017-06-28 10:02:58.0  I
      Records from Real Time View:
        SELECT PKEY, MODIFIED_D FROM SCHEMA_RT.FOO WHERE PKEY = 2181322736

        PKEY        MODIFIED_D
        2181322736  2017-06-28 19:32:33.0
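The ranking logic from slides 13 and 14 can be mimicked in a few lines of Python, which may help when reasoning about which CDC record wins. This is a minimal sketch, not the production Hive view: `latest_per_key` is a hypothetical helper, the sample rows are three of the CDC records shown above (one of them duplicated), and keeping a single record per key stands in for the view's `RANK() ... = 1` plus `SELECT DISTINCT`.

```python
from typing import Dict, List


def latest_per_key(records: List[dict]) -> Dict[str, dict]:
    """Return, per PKEY, the record that sorts highest on
    (TKN_SCN, TKN_RSN, POS, CURRENT_TS), i.e. the row with DEDUP_RANK = 1."""
    best: Dict[str, tuple] = {}
    for rec in records:
        # SCN/RSN are compared numerically; POS is a fixed-width, zero-padded
        # string and CURRENT_TS is a string column, so both compare as text.
        order = (int(rec["TKN_SCN"]), int(rec["TKN_RSN"]), rec["POS"], rec["CURRENT_TS"])
        current = best.get(rec["PKEY"])
        if current is None or order > current[0]:
            best[rec["PKEY"]] = (order, rec)
    return {pkey: rec for pkey, (_, rec) in best.items()}


cdc_rows = [
    {"PKEY": "2181322736", "TKN_SCN": "7630205764444", "TKN_RSN": "7630205764072",
     "POS": "00004187440038392981", "CURRENT_TS": "2017-06-28 10:04:13.76",
     "MODIFIED_D": "2017-06-28 10:02:58.0", "OP_TYPE": "I"},
    {"PKEY": "2181322736", "TKN_SCN": "7630245451950", "TKN_RSN": "7630245451909",
     "POS": "00004193750007692575", "CURRENT_TS": "2017-06-28 19:33:53.778",
     "MODIFIED_D": "2017-06-28 19:32:33.0", "OP_TYPE": "U"},
    {"PKEY": "2181322736", "TKN_SCN": "7630236411132", "TKN_RSN": "7630236411130",
     "POS": "00004191270019251264", "CURRENT_TS": "2017-06-28 17:41:09.519",
     "MODIFIED_D": "2017-06-28 17:39:16.0", "OP_TYPE": "U"},
]
cdc_rows.append(dict(cdc_rows[1]))  # the adapter may deliver the same record twice

latest = latest_per_key(cdc_rows)
```

Because duplicates carry identical ordering values, replaying the same records does not change the result, which is the property the real-time view relies on.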
  • 15. SLAs. Transaction -> import process start: avg ~5 mins. Transaction -> import process end: avg ~5-15 mins. Spikes are due to NIGHTLY SNAPSHOT creation: 1. full table creation; 2. Tez does not handle skew joins well; 3. cluster busy.
  • 16. GoldenGate Big Data Adapter Implementation Architecture
  • 17. GoldenGate Big Data Adapter Pipeline. [Diagram labels: DB Node 1 (labeled twice), DB Extract, Data Pump, trail files 1-3 on each side, GoldenGate Adapter Servers (HA), Replicat for HDFS, Hive]
  • 18. GoldenGate Big Data Infrastructure Architecture
  • 19. GoldenGate Big Data Infrastructure Architecture: Site Failover
  • 20. GoldenGate Big Data Adapter Configuration
  • 21. Source Extract Parameter - eprod.prm
        NOCOMPRESSUPDATES
        GETUPDATEBEFORES
        TABLE <schema name>.*, tokens (
          TKN-COMMITTIMESTAMP = @GETENV('GGHEADER', 'COMMITTIMESTAMP'),
          TKN-FILESEQNO = @GETENV('RECORD', 'FILESEQNO'),
          TKN-FILERBA = @GETENV('RECORD', 'FILERBA'),
          TKN-LAGMSEC = @GETENV('LAG', 'MSEC'),
          TKN-LOGPOSITION = @GETENV('GGHEADER', 'LOGPOSITION'),
          TKN-OBJECTNAME = @GETENV('GGHEADER', 'OBJECTNAME'),
          TKN-OPTYPE = @GETENV('GGHEADER', 'OPTYPE'),
          TKN-RECORDLENGTH = @GETENV('GGHEADER', 'RECORDLENGTH'),
          TKN-RECORDTIMESTAMP = @GETENV('RECORD', 'TIMESTAMP'),
          TKN-ROWID = @GETENV('RECORD', 'ROWID'),
          TKN-RSN = @GETENV('RECORD', 'RSN'),
          TKN-SCN = @GETENV('TRANSACTION', 'CSN'),
          TKN-TRANSACTIONINDICATOR = @GETENV('GGHEADER', 'TRANSACTIONINDICATOR'),
          TKN-USERNAME = @GETENV('TRANSACTION', 'USERNAME'),
          TKN-XID = @GETENV('TRANSACTION', 'XID')
        );
  • 22. Source Data Pump Extract Parameter - pprod.prm
        EXTRACT PPROD
        PASSTHRU
        RMTHOST <Goldengate adapter server>, MGRPORT <port no>
        -- Trail file written by Pump to the remote host
        RMTTRAIL /filesystem/pumptrail_ALL/pt
        -- Pass data through without mapping, filtering, conversion:
        PASSTHRU
        -- Specify tables to be captured:
        TABLE *.*;
  • 23. Target - Replicat Parameter - rhdfs.prm
        REPLICAT RHDFS
        TARGETDB LIBFILE libggjava.so SET property=dirprm/hdfs.props
        REPORTCOUNT EVERY 1 MINUTES, RATE
        GROUPTRANSOPS 10000
        -- Schemas
        MAP SCHEMA1.*, TARGET SCHEMA1_GG.*;
        MAP SCHEMA2.*, TARGET SCHEMA2_GG.*;
        -- Heartbeat
        MAP DBASCHEMA.HEARTBEAT, TARGET DBASCHEMA_GG.HEARTBEAT;
  • 25. Oracle GoldenGate to AWS S3 Connector
      s3replicat.props:
        gg.handlerlist=hdfs
        gg.handler.hdfs.type=hdfs
        gg.handler.hdfs.rootFilePath=s3://LCBUCKET/OGG
      Custom-built Hadoop client from trunk (hadoop-3.0.0-alpha3-SNAPSHOT):
        <property> <name>fs.s3a.access.key</name> <value><< AWS KEY >></value> </property>
        <property> <name>fs.s3a.secret.key</name> <value><<AWS SECRET KEY >></value> </property>
        <property> <name>fs.s3a.proxy.host</name> <value> << PROXY SERVER >> </value> </property>
        <property> <name>fs.s3a.proxy.port</name> <value><< PROXY PORT >></value> </property>
        <property> <name>fs.s3a.connection.ssl.enabled</name> <value>true</value> </property>
        <property> <name>fs.s3.impl</name> <value>org.apache.hadoop.fs.s3a.S3AFileSystem</value> </property>
        <property> <name>fs.s3a.server-side-encryption-algorithm</name> <value>SSE-KMS</value> </property>
        <property> <name>fs.s3a.server-side-encryption-key</name> <value>arn:aws:kms:<< KEY >></value> </property>
      The Hadoop client talks to S3 through the hadoop-aws package, using the s3a protocol.
      EMR only understands s3, so s3a to s3 mapping is done by
  • 26. Security - Authentication
        ### Kerberos params
        gg.handler.name.authType=kerberos
        gg.handler.name.kerberosKeytabFile=/etc/security/keytabs/lcapp.headless.keytab
        gg.handler.name.kerberosPrincipal=osuser@PROD-DOMAIN.COM
  • 27. Security - Username/password Encryption
      Wallet placeholders:
        ORACLEWALLETUSERNAME ggadalias ggadapters
        ORACLEWALLETPASSWORD ggadalias ggadapters
      Encrypted form:
        gg.handler.hdfs.hiveJdbcUsername=ORACLEWALLETUSERNAME[ggadalias ggadapters]
        gg.handler.hdfs.hiveJdbcPassword=ORACLEWALLETPASSWORD[ggadalias ggadapters]
      Unencrypted form:
        gg.handler.hdfs.hiveJdbcUserName=unencrypted_username
        gg.handler.hdfs.hiveJdbcPassword=unencrypted_password
  • 28. Security - Credential Store. Password encryption steps:
      1. Add the credential store:
        GGSCI > ADD CREDENTIALSTORE
        Credential store created in ./dircrd/.
      2. Add the credential for these users:
        ALTER CREDENTIALSTORE ADD USER <Hive_username>, password <Hive_password> alias ggadalias domain ggadapters
  • 29. How does the HDFS Handler handle the data flow? The output format chosen is Avro row OCF, for three reasons: integration with Hive; seamless schema evolution; more compact than Avro op OCF. Two kinds of files are produced: Avro data files and Avsc (Avro schema) files.
  • 30. Configure High Availability for the GoldenGate Resource. As the ROOT user:
        agctl add goldengate gg1 --gg_home /bdggvol/ggaws --instance_type dual \
          --nodes node1,node2 --network 1 --ip 99.99.99.99 \
          --user appuser --group oinstall --filesystems ora.registry.acfs \
          --environment_vars "LD_LIBRARY_PATH=/opt/java/default/jre/lib/amd64/server:/lib, PATH=/opt/java/default/bin:/usr/lib64/qt-3.3/bin:/usr/local/bin, JAVA_HOME=/opt/java/default"
        $GRID_HOME/bin/crsctl setperm resource xag.gg1-vip.vip -u user:oracle:rwx
        $GRID_HOME/bin/crsctl setperm resource xag.gg1-vip.vip -u group:oinstall:rwx
        $GRID_HOME/bin/crsctl setperm resource xag.gg1.goldengate -u group:oinstall:rwx
        $GRID_HOME/bin/crsctl setperm resource ora.net1.network -u user:appuser:rwx
  • 31. Schema Regex. Avro doesn’t like the ($) symbol! No problem:
        gg.schemareplaceregex=[$:]
        gg.schemareplacestring=_
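To see what that property pair does to a name, here is a small Python sketch of the same substitution. It only mirrors the regex and replacement string from the slide; `sanitize_name` and the sample names are hypothetical, and how the adapter applies the rule internally is not shown in the deck.

```python
import re

# Same character class and replacement as gg.schemareplaceregex / gg.schemareplacestring.
SCHEMA_REPLACE_REGEX = re.compile(r"[$:]")
SCHEMA_REPLACE_STRING = "_"


def sanitize_name(name: str) -> str:
    """Replace every '$' or ':' with '_' so the name is acceptable in an Avro schema."""
    return SCHEMA_REPLACE_REGEX.sub(SCHEMA_REPLACE_STRING, name)


cleaned = [sanitize_name(n) for n in ["SYS$AUDIT", "NS:COLUMN", "PLAIN_NAME"]]
```

Names without either character pass through unchanged, so the rule is safe to apply to every table and column.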
  • 32. GoldenGate Big Data Adapter: Things To Remember. The upgrade from 12.2 to 12.3 changed the timestamp format from [YYYY-MM-DD:HH24:MI:SS.FFF] to [YYYY-MM-DD HH24:MI:SS.FFF]. To avoid the dreaded "ERROR OGG-15050 Error loading Java VM runtime library: (2 No such file or directory)", make sure JAVA_HOME, LD_LIBRARY_PATH, and PATH are set properly; remember that the manager passes its environment variables on. The adapter is a multi-threaded process, so allocate sufficient Unix resources such as NPROC. Monitor the heartbeat table from Hive. Design the application to be idempotent. Run the HDFS handler from a dedicated Hadoop client node.
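One way a downstream job could tolerate the 12.2 to 12.3 timestamp change is to accept both layouts when parsing. The sketch below is an assumption about how a consumer might cope, not something shown in the deck; `parse_gg_timestamp` is a hypothetical helper and the sample values are made up to match the two documented formats.

```python
from datetime import datetime

# 12.2 style: date and time joined by ':'; 12.3 style: joined by a space.
_FORMATS = ("%Y-%m-%d:%H:%M:%S.%f", "%Y-%m-%d %H:%M:%S.%f")


def parse_gg_timestamp(value: str) -> datetime:
    """Parse a timestamp written in either the 12.2 or the 12.3 layout."""
    for fmt in _FORMATS:
        try:
            return datetime.strptime(value, fmt)
        except ValueError:
            continue
    raise ValueError(f"unrecognized timestamp format: {value!r}")


old_style = parse_gg_timestamp("2017-07-30:23:49:15.000")
new_style = parse_gg_timestamp("2017-07-30 23:49:15.000")
```

Parsing into a `datetime` before comparing or partitioning avoids mixing the two string layouts in the same table.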
  • 35. GGSCI output before patch 17995064
  • 36. GGSCI output after patch 17995064
  • 37. LOBs: CLOBs and BLOBs. The data team needed the CLOBs even when the CLOB is not updated. GETUPDATEBEFORES doesn’t work for CLOBs. See "Lobs: Always Include LOBS In Trail File" (Doc ID 1639717.1). Include the extract parameter NOCOMPRESSUPDATES and the FETCHCOLS option in the TABLE parameter, for example:
        NOCOMPRESSUPDATES
        TABLE schema.mytable, fetchcols (mylobs1, mylob2);
  • 38. Ability to handle Hadoop failures. SR 3-15688859001: data loss after a glitch between the GoldenGate adapter and the NameNode. Test the rewind functionality and the ability to handle duplicate data. Upgrade to the latest release, OGGBD 12.3.1.1.0, which contains the fix for the abend issue. Monitor the heartbeat table in Hive. Monitor the ERROR messages from the log file RHDFS_info_log4j.log [Splunk], for example: java.io.EOFException: Premature EOF: no length prefix available
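The heartbeat monitoring advice can be reduced to a simple staleness check. The sketch below assumes the most recent heartbeat timestamp has already been read from the Hive heartbeat table; the function name, the 15-minute threshold (borrowed from the upper end of the SLA slide), and the sample times are all illustrative assumptions.

```python
from datetime import datetime, timedelta

MAX_LAG = timedelta(minutes=15)  # assumption: alert once lag exceeds the SLA's upper bound


def heartbeat_is_stale(last_heartbeat: datetime, now: datetime,
                       max_lag: timedelta = MAX_LAG) -> bool:
    """Return True when the newest heartbeat row is older than the allowed lag."""
    return now - last_heartbeat > max_lag


now = datetime(2017, 7, 30, 23, 49, 15)
healthy = heartbeat_is_stale(datetime(2017, 7, 30, 23, 45, 0), now)
stalled = heartbeat_is_stale(datetime(2017, 7, 30, 23, 0, 0), now)
```

A stale heartbeat is only a symptom; the log file messages mentioned above are still needed to tell whether the replicat abended or is simply lagging.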
  • 39. GoldenGate Big Data Adapter - Monitoring [Wavefront]
  • 40. Q & A Thank You