This topic describes the use of Spark and SequoiaDB in the Operational Data Lake of China's financial industry, including how to use SequoiaDB to provide online high-concurrency services and how to use Spark for data processing and machine learning. China has the world's largest population and the world's second-largest economy, and many of the best technologies used in the United States and Europe are difficult to apply effectively in China. This topic shows how Spark and SequoiaDB are able to provide online financial services to a population of more than a billion.
Building Operational Data Lake using Spark and SequoiaDB with Yang Peng
1. Building Operational Data Lake using Spark and SequoiaDB
Yang Peng, VP of Solutions
pengyang@sequoiadb.com
2. Top Distributed Database Vendor in China
SequoiaDB has more enterprise customers than MongoDB and Couchbase combined in China: over 200 enterprise customers, 10 of them Global Fortune 500 companies.
Backed by Silicon Valley based VCs: Qiming Venture's A round investment, and DCM's B round of $10 million, the largest investment among all Chinese distributed database companies so far.
7. Client Data Scale
• Around 700M user accounts in total
• 2PB data volume
• 10K RPS at peak
• Queries across 10B rows of data
• Responses within 100ms
8. Highlights of Operational Data Lake
• Real-time & high concurrency queries
• Full-scale history data storage
• Multi-Model data management
9. Banking Operational Data Lake Use Case
Requirements:
Provide a legal and fraud investigation data query platform, with centralized storage of all historical data from the core systems, unified data query interfaces, and ad-hoc query services.
■ Centralized Data Storage and Data Cleaning
• T+1 data synchronization from core systems such as the credit card system and the e-banking system, over a unified data channel.
• Unified ETL, including data cleaning.
■ Unified Interfaces for Query
• Query API and interfaces for operators
• Ad-hoc queries
10. Initial Idea
[Architecture diagram] Data acquisition: production-environment systems (core system, credit card system, e-banking, payment) feed an ODS through middleware, with an initial load plus T+1 synchronization delivered over FTP. Data processing: core-system and history data are imported into SDB through the SDB API, and Spark SQL and PG SQL serve history data querying. Data displaying: queries come from the investigation system, counter, gateway, debt collection, the open platform, the big data platform, and the access platform.
11. Historical Data Ratio
Data ratio in the data management platform:
• Legacy Credit Card System: 32%
• Debt Collection: 22%
• Legacy Core System: 21%
• New Credit Card System: 20%
• New Core System: 4%
• Online customer…
Data services in the history data platform:
1. Legacy Core System
2. Core System
3. Legacy Credit Card System
4. Credit Card System
5. Credit Card Collection
6. Credit Card Customer Services
7. Administration
12. Initial Architecture
[Architecture diagram] Source data tables (TB1, TB2, TB3, …) from the core system, the credit card system, e-banking, and other systems are delivered through ODS and FTP into the SDB data storage area. The source tables TB1, TB2, and TB3 are imported directly into SDB using the SDB API; Spark and PGSQL then transform the data and store the result into TB4 (TB4 = TB1 + TB2 + TB3), which holds the modified data. Web applications (MySQL, Python, SparkSQL) query the SDB database cluster, where each group of data runs with 1 master and 2 slaves, alongside an SDB data processing area.
13. Challenges
• Main repository data is used as a replacement for tape
• The main repository contains all the business source data in different schemas, which may not be query-friendly
• I/O and computing activities on those boxes must be minimized
• Reading data directly from the main repository is NOT allowed
• Lack of isolation in source data management
• Scaling issues
• Lack of unified management tools for data, system resources, and queries
• Lack of a standard interface for external systems
• Performance issues when querying data in the main repository
14. Solutions
• Manage the query data and computing resources in an isolated area
– Data is cleaned and reconstructed in the cache region based on the query request
– Data can be replayed and reconstructed at any time
[Diagram] Queries such as SELECT * FROM T1, T2, T3 are served from the cache region, which is dynamically loaded from the main repository.
15. Operational Data Lake: Overall Business Architecture
[Architecture diagram] The lake is split into a main repository area holding all the original data from all the business systems, a data scheduling and processing area, an online query area (data services on SDB clusters for online real-time queries that require fast response), an ad-hoc query area, and a data-lab area with a sandbox, all governed by data lifecycle management. Ad-hoc queries must be tested in the sandbox testing area, which validates the query instructions, before querying the business database.
16. ODL: Detail Technology Architecture
[Architecture diagram] Original systems (core systems, credit card systems, e-banking system, …) feed the lake through ODS and NAS. CDC fetches database logs to synchronize online data in real time, while the main repository is imported from the latest T+1 data via ODS+FTP. The main repository is built from SequoiaDB clusters (core data clusters, credit card data clusters, e-banking data clusters, …). Above it sit the online trading services, the online query data services (holding the data defined by the specific online query business), and the ad-hoc query area, which keeps data replicas for the ad-hoc query business plus replicas of one week of data in the sandbox for testing ad-hoc query instructions before execution. An ECM platform manages online, warm, and archived data (images and documents). The server layer provides archiving, mission, source, users, and monitoring management.
17. ODL: Online Query
■ Query Process
1. Data is imported into the main repository from external systems
2. Data is reconstructed by the scheduling and processing layer
3. Modified data is imported into the query layer
4. Users execute queries
5. The middleware layer queries the data through SQL or the SDB APIs
6. The query result is returned to the system
[Diagram] Source data (core systems, other systems, …) → main repository data storage → data process & schedule layer (modified data, data replica) → query middleware → query services.
18. ODL: Ad-hoc Query
■ Ad-hoc Query Process (a sketch of the gatekeeping follows below)
1. Data is imported into the main repository
2. Data is imported into the ad-hoc query layer by the scheduling and processing layer
3. The sandbox layer fetches replicas of sample data from the ad-hoc query layer
4. Users enter the query instructions
5. The testing area forwards the instructions to the sandbox layer for testing
6. The sandbox executes the instructions and returns the result
7. If the preview test passes, the query is sent to the executing area
8. The executing area requests the query result from the ad-hoc query data storage layer
9-10. The result of the ad-hoc query is stored in a temporary result space and the processing result is returned
[Diagram] Data sources (core system, other systems, …) → main repository data storage → data process & schedule layer → structured data storage with original data replicas (one month of replicas in the sandbox for testing free query instructions) → external data service and result temporary space.
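The gatekeeping in steps 4-10 can be summarized in a few lines. This is a minimal sketch only: QueryEngine, Row, and runAdHocQuery are illustrative names invented for this example, not an API from the platform.

```scala
// Minimal sketch of the sandbox gatekeeping flow above; QueryEngine,
// Row, and runAdHocQuery are illustrative names, not the platform's API.
object AdHocGatekeeper {
  type Row = Map[String, Any]

  trait QueryEngine {
    def execute(sql: String): Either[String, Seq[Row]] // Left = error message
  }

  def runAdHocQuery(sql: String,
                    sandbox: QueryEngine,  // one-month sample replicas (steps 5-6)
                    executor: QueryEngine  // full ad-hoc data replicas (steps 7-8)
                   ): Either[String, Seq[Row]] =
    sandbox.execute(sql) match {
      case Left(err) => Left(s"rejected by sandbox preview: $err")
      case Right(_)  =>
        // Preview passed: run against the executing area; the platform
        // stores the result in a temporary space before returning it (9-10).
        executor.execute(sql)
    }
}
```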
20. SparkSQL + SequoiaDB Connector
• SparkSQL is very useful as an ETL tool
• Users can write standard SQL to join data from multiple tables and load the result into a target table
• Spark is able to connect to any external data source as long as a connector is provided
Requirements: SequoiaDB 2.8.1+, Spark 2.0+, JDK 7+, Scala 2.11.x
21. SequoiaDB Connector – Spark SQL
1. Spark SQL
• Create Table or View
create [temporary] <table|view> <name> [(schema)]
using com.sequoiadb.spark options (<options>);
• Insert Data in SQL
insert into table <name1> select * from <name2>;

Options          Description
Host             SequoiaDB address and port
CollectionSpace  The collection space for the table
Collection       The collection for the table
Username         Authentication username
Password         Authentication password
SamplingRatio    Sampling rate for schema generation
SamplingNum      Max number of records to sample
SamplingWithID   Include the "_id" column in the schema
SamplingSingle   Take the sample in a single partition
BulkSize         Batch job size of bulk insert
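Putting the DDL and the options above together, here is a hedged end-to-end sketch of using the connector for ETL. The host, collection space, and collection names (server2:11810, sample, Employee, EmployeeClean) and the cleaning predicate are placeholders invented for illustration.

```scala
// A sketch combining the DDL and options above into a small ETL job.
// Host/collection names and the cleaning predicate are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SequoiaDB-ETL").getOrCreate()

// Map the source collection sample.Employee as a temporary view.
spark.sql("""
  CREATE TEMPORARY VIEW employee
  USING com.sequoiadb.spark
  OPTIONS (
    host 'server2:11810',
    collectionspace 'sample',
    collection 'Employee',
    samplingratio '0.3'
  )""")

// Map the target collection the same way, then load it with plain SQL.
spark.sql("""
  CREATE TEMPORARY VIEW employee_clean
  USING com.sequoiadb.spark
  OPTIONS (
    host 'server2:11810',
    collectionspace 'sample',
    collection 'EmployeeClean',
    bulksize '512'
  )""")

// Standard SQL does the join/clean/load work, as the bullet above notes.
spark.sql("INSERT INTO TABLE employee_clean SELECT * FROM employee WHERE name IS NOT NULL")
```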
22. SequoiaDB Connector - RDD
2. RDD Usage
• Create
import org.apache.spark._
import com.sequoiadb.spark._   // brings loadFromSequoiadb/saveToSequoiadb into scope

val conf = new SparkConf()
  .setAppName("SequoiaDBRddExample")        // an app name is required by SparkContext
  .setMaster("spark://server1:7077")
conf.set("sequoiadb.host", "server2:11810") // SequoiaDB coordinator address

val spark = new SparkContext(conf)

// Load the collection sample.Employee as an RDD
val rdd = spark.loadFromSequoiadb("sample", "Employee")
println("count = " + rdd.count())

// Write the RDD back into another collection
rdd.saveToSequoiadb("server3:11810", "sample", "newEmployee")
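Because the DDL on the previous slide uses `using com.sequoiadb.spark`, which is the standard Spark data source name, the same options should also be usable through the DataFrame reader. A hedged sketch: the hosts, names, and the `age` column are placeholders, and the option keys are assumed to match the table on slide 21.

```scala
// DataFrame-based variant; a sketch assuming the data source name from the
// DDL ("com.sequoiadb.spark") and the option keys listed on slide 21 carry
// over to the DataFrameReader API. Hosts and names are placeholders.
import org.apache.spark.sql.SparkSession

val spark = SparkSession.builder().appName("SdbDataFrame").getOrCreate()

val employees = spark.read
  .format("com.sequoiadb.spark")        // same provider name as in the DDL
  .option("host", "server2:11810")
  .option("collectionspace", "sample")
  .option("collection", "Employee")
  .load()

employees.filter("age > 30").show()     // schema inferred by sampling
```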
26. What the Scheduler Does
• Dynamically load data from the main repository into the cache region, based on the tables and query predicates (a sketch of this decision follows the list)
– Ex: SELECT * FROM T1, T2 WHERE T1.c = T2.c AND T1.date BETWEEN '2017-01-01' AND '2017-02-01'
– This query loads the first month of data in T1 and the full T2 data into the cache region
– Data for most tables is both sharded by PK and partitioned by date
– Loading data from the main repository to the cache is simply an SFTP copy of the data files
– A 10Gb network is good enough for a fast copy
• Remove expired cache entries (LRU) when they are no longer needed
• Return the data from the cache region, instead of reading from the main repository
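As a rough illustration of the pruning decision, here is a minimal sketch assuming monthly date partitions. The names (DateRange, partitionsToLoad, the "table/month" file layout) are invented for this example, not the product's real API.

```scala
// Minimal sketch of the scheduler's cache-loading decision for the query
// above, assuming one data file per monthly date partition. All names
// here are illustrative.
import java.time.{LocalDate, YearMonth}

final case class DateRange(from: LocalDate, to: LocalDate)

// Map a table plus an optional date predicate to the partition files that
// must be SFTP-copied from the main repository into the cache region.
def partitionsToLoad(table: String, range: Option[DateRange]): Seq[String] =
  range match {
    case None =>
      Seq(s"$table/full")                       // no predicate: copy the whole table (e.g. T2)
    case Some(DateRange(from, to)) =>
      Iterator.iterate(YearMonth.from(from))(_.plusMonths(1))
        .takeWhile(m => !m.isAfter(YearMonth.from(to)))
        .map(m => s"$table/$m")                 // one data file per month partition
        .toSeq
  }

// For the example query: T1 is pruned to the predicate's months,
// while T2 is copied in full.
partitionsToLoad("T1",
  Some(DateRange(LocalDate.parse("2017-01-01"), LocalDate.parse("2017-02-01"))))
// => Seq("T1/2017-01", "T1/2017-02")  (boundary month included)
partitionsToLoad("T2", None) // => Seq("T2/full")
```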
Scheduler Architecture
[Architecture diagram] A business application reaches the data scheduler through a REST interface. The scheduler is built from a message processor, a partition manager, a task manager, and a metadata store, and it drives Spark (via the SequoiaDB connector and Spark RDDs) against both the SequoiaDB cluster and the standalone SequoiaDB cache nodes.
• Message Processing
• Metadata Management
• Partition Management
• Task Management
• RESTful API
28. Data Scheduler Highlights
• Override SdbPartitioner (an illustrative sketch follows this list)
– Locate the related data and load it into the cache region
– Get partition information from the cache region instead of the main repository
• Stateless scheduler service
– Performs the data copy tasks before returning the explain result to the connector
– Removes expired cache entries when they are no longer needed
– Supports HA and load balancing
• Data safety
– Ad-hoc queries run in the cache region instead of the main repository
– Spark workers are installed on the cache region servers
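The deck names SdbPartitioner but does not show its signature, so the sketch below uses simplified stand-in traits (SdbPartitioner, DataScheduler, CachedBlock, CacheRegionPartitioner) to illustrate the idea: resolve partition locations against the cache region so Spark tasks land on the cache servers.

```scala
// Illustrative sketch only: simplified traits stand in for the
// connector's real types, which are not shown in the deck.
import org.apache.spark.Partition

final case class CachedBlock(cacheHost: String, blockId: String)

trait DataScheduler {
  // Copies any missing blocks for `table` into the cache region (SFTP)
  // and returns the cached block locations.
  def ensureCached(table: String): Seq[CachedBlock]
}

trait SdbPartitioner {
  def computePartitions(): Array[Partition]
}

final case class CachePartition(index: Int, host: String, blockId: String)
  extends Partition

// The overridden partitioner resolves partition locations against the
// cache region rather than the main repository, so Spark tasks are
// scheduled onto the cache servers where the workers run.
class CacheRegionPartitioner(scheduler: DataScheduler, table: String)
  extends SdbPartitioner {

  override def computePartitions(): Array[Partition] = {
    val blocks = scheduler.ensureCached(table) // dynamic load happens here
    blocks.zipWithIndex.map { case (b, i) =>
      CachePartition(i, b.cacheHost, b.blockId)
    }.toArray
  }
}
```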
29. Project Achievement
• There are about 40 different application systems.
• OLTP features are used to make sure the data in SequoiaDB stays consistent with the production database.
• This is an online OLTP application in the financial industry built on a distributed database.
• SequoiaDB has more than 107 physical nodes deployed at this banking customer as a unified Operational Data Lake.