This talk, presented at JavaOne 2011, describes the ODC, a distributed, in-memory database built in Java that holds objects in a normalized form in a way that alleviates the performance degradation traditionally associated with joins in shared-nothing architectures. The presentation describes the two patterns that lie at the core of this model. The first is an adaptation of the Star Schema model in which data is either replicated or partitioned, depending on whether it is a fact or a dimension. In the second pattern, the data store tracks arcs on the object graph to ensure that only the minimum amount of data is replicated. Through these mechanisms, almost any join can be performed across the various entities stored in the grid, without the need for key shipping or iterative wire calls.
Balancing Replication and Partitioning in a Distributed Java Database
2. Traditional disk-oriented database architecture is showing its age. In-memory databases offer significant gains in performance, as all data is freely available: there is no need to page to and from disk. But three issues have stunted their uptake: address spaces only being large enough for a subset of a typical user’s data, the ‘one more bit’ problem, and durability. Distributed in-memory databases solve these three problems, but at the price of losing the single address space. This makes joins a problem: when data must be joined across multiple machines, performance degradation is inevitable. Snowflake Schemas allow us to mix Partitioning and Replication so joins never hit the wire. But this model only goes so far. “Connected Replication” takes us a step further, allowing us to make the best possible use of replication.
5. The lay of the land: the main architectural constructs in the database industry.
10. [Latency chart, log scale from ms down to ps: 1MB read from disk/network, cross-continental round trip, cross-network round trip, 1MB read from main memory, main memory ref, L2 cache ref, L1 cache ref]
* L1 ref is about 2 clock cycles or 0.7ns. This is the time it takes light to travel 20cm.
29. [Diagram: data spread across nodes by key range, e.g. keys Fs–Fz on one node, Xa–Yd on another] Scalable storage, bandwidth and processing.
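The key-range picture above can be sketched as a routing function: every key deterministically maps to one node, so storage, bandwidth and processing all scale with the node count. A minimal hash-based sketch (an assumption for illustration; real grids such as Coherence use a partition table with rebalancing, and the node names here are made up):

```java
import java.util.List;

public class PartitionRouter {
    private final List<String> nodes;

    public PartitionRouter(List<String> nodes) {
        this.nodes = nodes;
    }

    // Deterministic key -> node mapping via the key's hash.
    public String nodeFor(Object key) {
        return nodes.get(Math.floorMod(key.hashCode(), nodes.size()));
    }

    public static void main(String[] args) {
        PartitionRouter router =
                new PartitionRouter(List.of("node-1", "node-2", "node-3"));
        // The same key always routes to the same node.
        System.out.println(router.nodeFor("trade-17")
                .equals(router.nodeFor("trade-17")));
    }
}
```

A real grid would add rebalancing when nodes join or leave; the fixed modulo here is only the routing idea in miniature.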
33. [Diagram: the same Trade/Trader/Party object graph held as Version 1 through Version 4] …and you need versioning to do MVCC.
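The versioning requirement can be sketched as a multi-version map: writers append new versions, and readers pick the latest version at or below their snapshot, so consistent reads never block writers. This is an illustrative structure, not the ODC's actual implementation:

```java
import java.util.NavigableMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;
import java.util.concurrent.ConcurrentSkipListMap;

public class VersionedStore<K, V> {
    // Per key, an ordered map of version number -> value.
    private final ConcurrentMap<K, NavigableMap<Long, V>> data =
            new ConcurrentHashMap<>();

    // Writers simply append a new version; old versions stay readable.
    public void put(K key, long version, V value) {
        data.computeIfAbsent(key, k -> new ConcurrentSkipListMap<>())
            .put(version, value);
    }

    // Read as of a snapshot: the latest version <= the snapshot number.
    public V get(K key, long snapshot) {
        NavigableMap<Long, V> versions = data.get(key);
        if (versions == null) return null;
        var entry = versions.floorEntry(snapshot);
        return entry == null ? null : entry.getValue();
    }
}
```

A reader holding snapshot 2 keeps seeing version 1 of a trade even after version 3 is written, which is the essence of MVCC.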
34. [Diagram: Trade, Trader and Party objects repeated across the cluster]
35. So better to use partitioning, spreading data around the cluster.
52. [Chart of record counts per entity, 0 to 150,000,000]
Facts: Valuation Legs, Valuations, Transaction Mapping, Cashflow Mapping, Party Alias, Transaction, Cashflows, Legs, Parties => Big, common keys
Dimensions: Ledger Book, Source Book, Cost Centre, Product, Risk Organisation Unit, Business Unit, HCS Entity, Set of Books => Small, crosscutting keys
54. Coherence’s KeyAssociation gives us this: Trades and MTMs share a common key.
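In Coherence, a key class can declare an associated key, and the partitioning strategy routes on that instead of the key itself, so related entries land in the same partition. A self-contained sketch, where the interface is a local stand-in mirroring Coherence's `com.tangosol.net.cache.KeyAssociation` and the toy `partitionOf` stands in for the grid's partitioning strategy:

```java
// Local stand-in for com.tangosol.net.cache.KeyAssociation, so the
// example compiles without the Coherence jar.
interface KeyAssociation {
    Object getAssociatedKey();
}

// Key for an MTM entry: routing on the parent trade id co-locates
// each MTM with its Trade. (A real key class would also need proper
// equals/hashCode for cache use.)
final class MtmKey implements KeyAssociation {
    final String mtmId;
    final String tradeId;   // the "common key" shared with Trade

    MtmKey(String mtmId, String tradeId) {
        this.mtmId = mtmId;
        this.tradeId = tradeId;
    }

    @Override
    public Object getAssociatedKey() {
        return tradeId;     // co-locate with the owning Trade
    }
}

public class KeyAssociationSketch {
    // Toy partition assignment: hash the routing key.
    static int partitionOf(Object key, int partitionCount) {
        Object routing = (key instanceof KeyAssociation)
                ? ((KeyAssociation) key).getAssociatedKey()
                : key;
        return Math.floorMod(routing.hashCode(), partitionCount);
    }

    public static void main(String[] args) {
        // Trade and its MTM land in the same partition, so the join
        // never crosses the wire.
        System.out.println(partitionOf("TRADE-42", 13)
                == partitionOf(new MtmKey("MTM-7", "TRADE-42"), 13));
    }
}
```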
58. [Same chart, annotated]
Facts: Valuation Legs, Valuations, Transaction Mapping, Cashflow Mapping, Party Alias, Transaction, Cashflows, Legs, Parties => Big => Distribute
Dimensions: Ledger Book, Source Book, Cost Centre, Product, Risk Organisation Unit, Business Unit, HCS Entity, Set of Books => Small => Replicate
59. We use a variant on a Snowflake Schema: partition the big stuff, which shares the same key, and replicate the small stuff, which has crosscutting keys.
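That split can be captured as a simple per-entity routing rule (a sketch only; the entity lists here are illustrative, not the ODC's real schema):

```java
import java.util.Set;

public class CacheTopology {
    // Big entities that share the common key -> partitioned caches.
    private static final Set<String> FACTS = Set.of(
            "Transaction", "Cashflow", "Leg", "Valuation", "ValuationLeg");
    // Small crosscutting entities -> replicated caches.
    private static final Set<String> DIMENSIONS = Set.of(
            "LedgerBook", "SourceBook", "CostCentre", "Product",
            "BusinessUnit", "SetOfBooks");

    enum Scheme { PARTITIONED, REPLICATED }

    static Scheme schemeFor(String entity) {
        if (FACTS.contains(entity)) return Scheme.PARTITIONED;
        if (DIMENSIONS.contains(entity)) return Scheme.REPLICATED;
        throw new IllegalArgumentException("Unknown entity: " + entity);
    }
}
```

The point of the rule is that it is decided per entity type up front, so every join between a fact and a dimension can be satisfied either node-locally or in process.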
63. Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
LBs[] = getLedgerBooksFor(CC1)
SBs[] = getSourceBooksFor(LBs[])
So we have all the bottom-level dimensions needed to query facts.
[Partitioned facts: Transactions, Mtms, Cashflows]
64. Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
LBs[] = getLedgerBooksFor(CC1)
SBs[] = getSourceBooksFor(LBs[])
So we have all the bottom-level dimensions needed to query facts.
Get all Transactions and MTMs (cluster-side join) for the passed Source Books.
[Partitioned facts: Transactions, Mtms, Cashflows]
66. Select Transaction, MTM, ReferenceData From MTM, Transaction, Ref Where Cost Centre = ‘CC1’
LBs[] = getLedgerBooksFor(CC1)
SBs[] = getSourceBooksFor(LBs[])
So we have all the bottom-level dimensions needed to query facts.
Get all Transactions and MTMs (cluster-side join) for the passed Source Books.
Populate raw facts (Transactions) with dimension data before returning to client.
[Partitioned facts: Transactions, Mtms, Cashflows]
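The steps on these slides can be sketched with plain maps standing in for the caches (all keys, values and helper names are illustrative): walk the replicated dimensions entirely in process from Cost Centre down to Source Books, then make a single query against the partitioned facts:

```java
import java.util.List;
import java.util.Map;
import java.util.stream.Collectors;

public class StarQuery {
    // Replicated dimension caches: child key -> parent key.
    static final Map<String, String> LEDGER_BOOK_BY_COST_CENTRE =
            Map.of("LB1", "CC1", "LB2", "CC1", "LB3", "CC2");
    static final Map<String, String> SOURCE_BOOK_BY_LEDGER_BOOK =
            Map.of("SB1", "LB1", "SB2", "LB2", "SB3", "LB3");
    // Partitioned fact cache: transaction id -> source book.
    static final Map<String, String> TRANSACTION_SOURCE_BOOK =
            Map.of("T1", "SB1", "T2", "SB3", "T3", "SB2");

    static List<String> transactionsFor(String costCentre) {
        // Step 1: in-process walk of the replicated dimensions.
        var ledgerBooks = LEDGER_BOOK_BY_COST_CENTRE.entrySet().stream()
                .filter(e -> e.getValue().equals(costCentre))
                .map(Map.Entry::getKey).collect(Collectors.toSet());
        var sourceBooks = SOURCE_BOOK_BY_LEDGER_BOOK.entrySet().stream()
                .filter(e -> ledgerBooks.contains(e.getValue()))
                .map(Map.Entry::getKey).collect(Collectors.toSet());
        // Step 2: one query against the partitioned facts, filtered by
        // the bottom-level dimension keys (cluster-side in the real grid).
        return TRANSACTION_SOURCE_BOOK.entrySet().stream()
                .filter(e -> sourceBooks.contains(e.getValue()))
                .map(Map.Entry::getKey).sorted().collect(Collectors.toList());
    }
}
```

Only step 2 touches the wire; the dimension walk costs nothing because the dimension caches are replicated to every node and to the client.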
68. [Diagram: Java client and API; Dimensions held Replicated, Facts held Partitioned] We never have to do a distributed join!
69. So all the big stuff is held partitioned. And we can join without shipping keys around and creating intermediate results.
71. [Recap diagram: the Trade/Trader/Party object graph held as Version 1 through Version 4]
72. [Recap diagram: Trade, Trader and Party objects repeated across the cluster]
78. [Chart of record counts per entity, 0 to 125,000,000]
Facts: Valuation Legs, Valuations, Transaction Mapping, Cashflow Mapping, Party Alias, Transaction, Cashflows, Legs
Parties: this is a dimension.
• It has a different key to the Facts.
• And it’s BIG.
Dimensions: Ledger Book, Source Book, Cost Centre, Product, Risk Organisation Unit, Business Unit, HCS Entity, Set of Books
83. [Chart of dimension record counts, 0 to 5,000,000: Party Alias, Parties, Ledger Book, Source Book, Cost Centre, Product, Risk Organisation Unit, Business Unit, HCS Entity, Set of Books]
84. [The same dimension chart with a second, much smaller scale overlaid (20, 15, 10, 5), showing how few records in each dimension are actually used]
86. So we only replicate ‘Connected’ or ‘Used’ dimensions.
87. Processing Layer: Dimension Caches (Replicated)
Data Layer: Fact Storage (Partitioned) — Transactions, Mtms, Cashflows
As new Facts are added, the relevant Dimensions that they reference are moved to processing-layer caches.
89. Save Trade → Query Layer (with connected dimension caches) → Data Layer
The Trade is stored, all normalised, in the partitioned cache; the cache store fires a trigger.
Dimensions referenced: Party Alias, Source Book, Ccy
90. [Query Layer (with connected dimension caches) / Data Layer: the Trade, all normalised, references Party Alias, Source Book and Ccy]
91. [The recursion continues through the foreign keys: Party Alias leads to Party, Source Book leads to Ledger Book]
92. ‘Connected Replication’: a simple pattern which recurses through the foreign keys in the domain model, ensuring only ‘Connected’ dimensions are replicated.
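A minimal sketch of that recursion over a toy foreign-key graph (the schema and keys are illustrative; in the real system this would hang off the store trigger): when a fact arrives, walk its foreign keys transitively and mark every dimension reached as ‘connected’, i.e. worth replicating.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class ConnectedReplication {
    // Toy foreign-key graph: entity key -> keys it references.
    static final Map<String, List<String>> FOREIGN_KEYS = Map.of(
            "Trade:42", List.of("PartyAlias:7", "SourceBook:3", "Ccy:USD"),
            "PartyAlias:7", List.of("Party:1"),
            "SourceBook:3", List.of("LedgerBook:9"),
            "Party:1", List.of(),
            "LedgerBook:9", List.of(),
            "Ccy:USD", List.of());

    // Everything reachable from the fact via foreign keys is 'connected'.
    static Set<String> connectedDimensions(String factKey) {
        Set<String> connected = new HashSet<>();
        collect(factKey, connected);
        connected.remove(factKey);   // the fact itself stays partitioned
        return connected;
    }

    private static void collect(String key, Set<String> seen) {
        if (!seen.add(key)) return;  // already visited; graph may have cycles
        for (String ref : FOREIGN_KEYS.getOrDefault(key, List.of())) {
            collect(ref, seen);
        }
    }
}
```

Dimensions never reached by any stored fact are simply never replicated, which is how the dimension caches stay tiny.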
94. [Diagram: Java client, Java schema, API, Java ‘Stored Procedures’ and ‘Triggers’]
Big data sets are held distributed and only joined on the grid to collocated objects. Small data sets are held in replicated caches so they can be joined in process (only ‘active’ data is held).