Current MVCC model of PostgreSQL relies on snapshots which contain enumeration of transactions which are currently running. This solution served very well when it was invented. That time, typical server could have 2 or 4 sockets with single-core processors at maximum.
Much time has elapsed since then. Nowadays, you can't surprise anyone with 72-cores machine. And we have really reached the limitation of our snapshot representation. Consistent snapshot taking require shared ProcArrayLock. On heavily loaded system, contention for ProcArrayLock is very high. Even group commit optimization in 9.6 doesn't remove this problem.
The possible solution is switching to commit sequence numbers (CSNs). When using CSN one can easily determine which of them commits earlier just by comparing their CSNs. This approach completely removes contention for ProcArrayLock during taking snapshots. However, another problem arises: we need really fast XID => CSN map. Slow map in this place can completely nullify advantages of CSN approach.
In this talk I'll present a patch with elegant solution of XID => CSN map problem. Idea is to minimize usage of this map the same way we minimize clog accesses. While setting hint bits we can rewrite XIDs with CSNs. That arises set of technical problems but all of them appears to be solvable.
9999266834 Call Girls In Noida Sector 22 (Delhi) Call Girl Service
The future is CSN
1. The future is CSN
Alexander Korotkov
Postgres Professional
Alexander Korotkov The future is CSN 1 / 31
2. Russian developers of PostgreSQL:
Alexander Korotkov, Teodor Sigaev, Oleg Bartunov
▶ Speakers at PGCon, PGConf: 20+ talks
▶ GSoC mentors
▶ PostgreSQL commi ers (1+1 in progress)
▶ Conference organizers
▶ 50+ years of PostgreSQL expertship:
development, audit, consul ng
▶ Postgres Professional co-founders
PostgreSQL CORE
▶ Locale support
▶ PostgreSQL extendability:
GiST(KNN), GIN, SP-GiST
▶ Full Text Search (FTS)
▶ NoSQL (hstore, jsonb)
▶ Indexed regexp search
▶ Create AM & Generic WAL
▶ Table engines (WIP)
Extensions
▶ intarray
▶ pg_trgm
▶ ltree
▶ hstore
▶ plantuner
▶ jsquery
▶ RUM
Alexander Korotkov The future is CSN 2 / 31
3. Why do we need snapshots? (1/5)
Alexander Korotkov The future is CSN 3 / 31
4. Why do we need snapshots? (2/5)
Alexander Korotkov The future is CSN 4 / 31
5. Why do we need snapshots? (3/5)
Alexander Korotkov The future is CSN 5 / 31
6. Why do we need snapshots? (4/5)
Alexander Korotkov The future is CSN 6 / 31
7. Why do we need snapshots? (5/5)
Alexander Korotkov The future is CSN 7 / 31
8. Exis ng MVCC snapshots
Alexander Korotkov The future is CSN 8 / 31
9. How do snapshots work?
▶ Array of ac ve transac on ids is stored in shared
memory.
▶ GetSnapshotData() scans all the ac ve xids while
holding shared ProcArrayLock.
▶ Assigning of new xid doesn’t require ProcArrayLock.
▶ Clearing ac ve xid requires exclusive ProcArrayLock.
▶ 9.6 comes with “group clear xid” op miza on.
Mul ple xids of finished transac ons could be
cleared using single exclusive ProcArrayLock.
Alexander Korotkov The future is CSN 9 / 31
10. Problem with of snapshots
▶ Nowadays mul -core systems running can run
thousands of backends simultaneously. For short
queries GetSnapshotData() becomes just CPU
expensive.
▶ LWLock subsystem is just not designed for high
concurrency. In par cular, exclusive lock waits could
have infinite starva on. Therefore, it’s impossible to
connect while there is high flow of short readonly
queries.
▶ In the mixed read-write workload, ProcArrayLock
could become a bo leneck.
Alexander Korotkov The future is CSN 10 / 31
12. CSN development history
▶ Jun 7, 2013 – proposal by Ants Aasma
▶ May 30, 2014 – first path by Heikki Linnakangas
▶ PGCon 2015 – talk by Dilip Kumar (no patch
published)
▶ Aug, 2016 – Heikki returned to this work
Alexander Korotkov The future is CSN 12 / 31
13. CSN snapshots proper es
Pro:
▶ Taking snapshots is cheaper. It’s even possible to
make it lockless.
▶ CSN snapshots are more friendly to distributed
systems. Distributed visibility techniques like
incremental snapshots or Clock-SI assumes that
snapshot is represented by single number.
Cons:
▶ Have to map XID ⇒ CSN while visibility check.
Alexander Korotkov The future is CSN 13 / 31
14. XID ⇒ CSN map worst case
1 M rows table, xmin is random in 10 M transac ons
version first scan, ms next scans, ms
master 2500 50
csn 4900 4900
Without CSN we have to lookup CLOG only during first
scan of the table. During first scan hint bits are set.
Second and subsequent scans use hint bits and don’t
lookup CLOG.
Alexander Korotkov The future is CSN 14 / 31
15. Could we hint XID ⇒ CSN map as well?
▶ In general, it’s possible. We could rewrite XID of
commi ed transac on into its CSN.
▶ Xmin and xmax are 32-bit. Usage of 32-bit CSN is
undesirable. We already have xid and mul xact
wraparounds. Yet another CSN wraparound would
be discouraging.
▶ Se ng hint bits is not WAL-logged. We need to
preserve this property.
Alexander Korotkov The future is CSN 15 / 31
16. Make both XID and CSN 64-bit
▶ Add 64-bit xid_epoch, mul xact_epoch and
csn_epoch to page header.
▶ Allocate high bit of xmin and xmax for CSN flag.
▶ Actual xid or csn stored in xmin or xmax should be
found as corresponding epoch plus xmin or xmax.
▶ We s ll can address 231
xids from xmin and xmax as
we did before.
▶ Wraparound is possible only inside single page. And
it could be resolved by single page freeze.
Alexander Korotkov The future is CSN 16 / 31
17. CSN-rewrite patch
▶ Use 64-bit XID and CSN as described before.
▶ Rewrite XID to CSN instead of se ng “commi ed”
hint bit.
▶ Lockless snapshot taking.
▶ WIP, not published yet.
Alexander Korotkov The future is CSN 17 / 31
18. XID ⇒ CSN map worst case
1 M rows table, xmin is random in 10 M transac ons
version first scan, ms next scans, ms
master 2500 50
csn 4900 4900
csn-rewrite 4900 50
Subsequent scans of table is as cheap as it was before.
First scan s ll have a room for op miza on.
Alexander Korotkov The future is CSN 18 / 31
20. Taking snapshots (SELECT 1)
0 50 100 150 200 250
# Clients
0
500000
1000000
1500000
2000000
2500000
3000000
3500000
4000000
TPS
pgbench -s 1000 -j $n -c $n -M prepared -f sel1.sql on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 20 / 31
21. Read-only benchmark
0 50 100 150 200 250
# Clients
0
500000
1000000
1500000
2000000
TPS
pgbench -s 1000 -j $n -c $n -M prepared -S on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 21 / 31
22. Read-write benchmark
0 50 100 150 200 250
# Clients
0
20000
40000
60000
80000
100000
120000
140000
160000
TPS
pgbench -s 1000 -j $n -c $n -M prepared on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 22 / 31
23. Random: 78% read queries, 22% write
queries
0 50 100 150 200 250
# Clients
0
200000
400000
600000
800000
1000000
TPS
pgbench -s 1000 -j $n -c $n -M prepared -b select-only@9 -b tpcb-like@1
on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 23 / 31
24. Custom script with extra 20 read queries
set naccounts 100000 * :scale
set aid1 random(1, :naccounts)
...........................................................
set aid20 random(1, :naccounts)
set aid random(1, 100000 * :scale)
set bid random(1, 1 * :scale)
set tid random(1, 10 * :scale)
set delta random(-5000, 5000)
SELECT abalance FROM pgbench_accounts WHERE aid IN (:aid1);
...........................................................
SELECT abalance FROM pgbench_accounts WHERE aid IN (:aid20);
BEGIN;
UPDATE pgbench_accounts SET abalance = abalance + :delta WHERE aid = :aid;
SELECT abalance FROM pgbench_accounts WHERE aid = :aid;
UPDATE pgbench_tellers SET tbalance = tbalance + :delta WHERE tid = :tid;
UPDATE pgbench_branches SET bbalance = bbalance + :delta WHERE bid = :bid;
INSERT INTO pgbench_history (tid, bid, aid, delta, mtime) VALUES (:tid, :bid, :aid, :d
END;
Alexander Korotkov The future is CSN 24 / 31
25. Custom script with extra 20 read queries
0 50 100 150 200 250
# Clients
0
10000
20000
30000
40000
50000
60000
70000
TPS
pgbench -s 1000 -j $n -c $n -M prepared -f rrw.sql on 4 x 18 cores Intel Xeon E7-8890 processors
median of 3 2-minute runs with shared_buffers = 32GB, max_connections = 300
master
xact-align
csn-rewrite
Alexander Korotkov The future is CSN 25 / 31
27. Further PostgreSQL OLTP bo lenecks
▶ Buffer manager – slow hash-table, pin, locks etc.
▶ Synchronous protocol.
▶ Executor.
▶ Slow xid alloca on – a lot of locks.
Alexander Korotkov The future is CSN 27 / 31
28. Further PostgreSQL OLTP bo lenecks in
numbers
▶ SELECT val FROM t WHERE id IN (:id1, ... :id10) –
150K per second = 1.5M key-value pairs per second, no gain.
Bo leneck in buffer manager.
▶ SELECT 1 with CSN-rewrite patch – 3.9M queries per second.
Protocol and executor are bo lenecks.
▶ SELECT txid_current() – 390K per second. Bo leneck in
locks.
Alexander Korotkov The future is CSN 28 / 31
29. How can we improve PostgreSQL OLTP?
▶ True in-memory engine without buffer manager.
▶ Asynchronous binary protocol for processing more
short queries.
▶ Executor improvements including JIT-compila on.
▶ Lockless xid alloca on.
Alexander Korotkov The future is CSN 29 / 31
30. Conclusion
▶ Despite all the micro-op miza ons made, our
snapshot model could be a bo leneck on modern
mul core systems. And it would be even worse
bo leneck on future systems.
▶ CSN is the way to remove this bo leneck. It also
more friendly to distributed systems.
▶ It’s possible to minimize XID ⇒ CSN map in the
same way we minimize CLOG accesses.
Alexander Korotkov The future is CSN 30 / 31
31. Thank you for a en on!
Alexander Korotkov The future is CSN 31 / 31