SlideShare una empresa de Scribd logo
1 de 89
Descargar para leer sin conexión
Mvcc Unmasked

                           BRUCE MOMJIAN




                             January, 2012


This talk explains how Multiversion Concurrency Control
(MVCC) is implemented in Postgres, and highlights optimizations
which minimize the downsides of MVCC.
Creative Commons Attribution License         http://momjian.us/presentations



                                                                 Mvcc Unmasked 1 / 89
Unmasked: Who Are These People?




                              Mvcc Unmasked 2 / 89
Unmasked: The Original Star Wars Cast




Left to right: Han Solo, Darth Vader, Chewbacca, Leia, Luke
                     Skywalker, R2D2                    Mvcc Unmasked   3 / 89
Why Unmask MVCC?




◮   Predict concurrent query behavior
◮   Manage MVCC performance effects
◮   Understand storage space reuse




                                        Mvcc Unmasked 4 / 89
Outline




1. Introduction to MVCC
2. MVCC Implementation Details
3. MVCC Cleanup Requirements and Behavior




                                            Mvcc Unmasked 5 / 89
What is MVCC?




Multiversion Concurrency Control (MVCC) allows Postgres to
offer high concurrency even during significant database
read/write activity. MVCC specifically offers behavior where
"readers never block writers, and writers never block readers".
This presentation explains how MVCC is implemented in
Postgres, and highlights optimizations which minimize the
downsides of MVCC.




                                                          Mvcc Unmasked 6 / 89
Which Other Database Systems Support MVCC?




◮   Oracle
◮   DB2 (partial)
◮   MySQL with InnoDB
◮   Informix
◮   Firebird
◮   MSSQL (optional, disabled by default)




                                            Mvcc Unmasked 7 / 89
MVCC Behavior


Cre 40
Exp               INSERT

Cre 40
Exp 47            DELETE

Cre 64   old (delete)
Exp 78
                  UPDATE
Cre 78   new (insert)
Exp


                                 Mvcc Unmasked 8 / 89
MVCC Snapshots


MVCC snapshots control which tuples are visible for SQL
statements. A snapshot is recorded at the start of each SQL
statement in READ COMMITTED transaction isolation mode, and
at transaction start in SERIALIZABLE transaction isolation mode.
In fact, it is frequency of taking new snapshots that controls the
transaction isolation behavior.
When a new snapshot is taken, the following information is
gathered:
  ◮   the highest-numbered committed transaction
  ◮   the transaction numbers currently executing
Using this snapshot information, Postgres can determine if a
transaction’s actions should be visible to an executing statement.



                                                          Mvcc Unmasked 9 / 89
MVCC Snapshots Determine Row Visibility
 Create−Only

  Cre 30                                    Sequential Scan
  Exp                   Visible

  Cre 50                                                 Snapshot
  Exp                  Invisible

  Cre 110                                                The highest−numbered
  Exp                  Invisible                         committed transaction: 100

                                                         Open Transactions: 25, 50, 75
Create & Expire

  Cre 30                                                 For simplicity, assume all other
  Exp 80              Invisible                          transactions are committed.
  Cre 30
  Exp 75               Visible

  Cre 30
  Exp 110              Visible

Internally, the creation xid is stored in the system column ’xmin’, and expire in ’xmax’.

                                                                                            Mvcc Unmasked 10 / 89
Confused Yet?


Source code comment in src/backend/utils/time/tqual.c:
     ((Xmin == my-transaction &&           inserted by the current transaction
       Cmin < my-command &&                before this command, and
       (Xmax is null ||                    the row has not been deleted, or
        (Xmax == my-transaction &&         it was deleted by the current transaction
         Cmax >= my-command)))             but not before this command,
     ||                                    or
      (Xmin is committed &&                the row was inserted by a committed transaction, and
          (Xmax is null ||                 the row has not been deleted, or
           (Xmax == my-transaction &&      the row is being deleted by this transaction
            Cmax >= my-command) ||         but it’s not deleted "yet", or
           (Xmax != my-transaction &&      the row was deleted by another transaction
            Xmax is not committed))))      that has not been committed


mao [Mike Olson] says 17 march 1993: the tests in this routine are
correct; if you think they’re not, you’re wrong, and you should think about
it again. i know, it happened to me.




                                                                                      Mvcc Unmasked 11 / 89
Implementation Details




All queries were generated on an unmodified version of Postgres.
The contrib module pageinspect was installed to show internal
heap page information and pg_freespacemap was installed to show
free space map information.




                                                       Mvcc Unmasked 12 / 89
Setup

CREATE TABLE mvcc_demo (val INTEGER);
CREATE TABLE
DROP VIEW IF EXISTS mvcc_demo_page0;
DROP VIEW
CREATE VIEW mvcc_demo_page0 AS
        SELECT ’(0,’ || lp || ’)’ AS ctid,
                CASE lp_flags
                        WHEN 0 THEN ’Unused’
                        WHEN 1 THEN ’Normal’
                        WHEN 2 THEN ’Redirect to ’ || lp_off
                        WHEN 3 THEN ’Dead’
                END,
                t_xmin::text::int8 AS xmin,
                t_xmax::text::int8 AS xmax,
                t_ctid
        FROM heap_page_items(get_raw_page(’mvcc_demo’, 0))
        ORDER BY lp;
CREATE VIEW


                                                         Mvcc Unmasked 13 / 89
INSERT Using Xmin



    DELETE FROM mvcc_demo;
    DELETE 0
    INSERT INTO mvcc_demo VALUES (1);
    INSERT 0 1
    SELECT xmin, xmax, * FROM mvcc_demo;
     xmin | xmax | val
    ------+------+-----
     5409 |    0 | 1
    (1 row)
All the queries used in this presentation are available at
http://momjian.us/main/writings/pgsql/mvcc.sql.




                                                             Mvcc Unmasked 14 / 89
Just Like INSERT


Cre 40
Exp               INSERT

Cre 40
Exp 47            DELETE

Cre 64   old (delete)
Exp 78
                  UPDATE
Cre 78   new (insert)
Exp


                                    Mvcc Unmasked 15 / 89
DELETE Using Xmax


DELETE FROM mvcc_demo;
DELETE 1
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1
SELECT xmin, xmax, * FROM mvcc_demo;
 xmin | xmax | val
------+------+-----
 5411 |    0 | 1
(1 row)

BEGIN WORK;
BEGIN
DELETE FROM mvcc_demo;
DELETE 1



                                       Mvcc Unmasked 16 / 89
DELETE Using Xmax
SELECT xmin, xmax, * FROM mvcc_demo;
 xmin | xmax | val
------+------+-----
(0 rows)

     SELECT xmin, xmax, * FROM mvcc_demo;
        xmin | xmax | val
       ------+------+-----
        5411 | 5412 |   1
       (1 row)

SELECT txid_current();
 txid_current
--------------
         5412
(1 row)

COMMIT WORK;
COMMIT                                      Mvcc Unmasked 17 / 89
Just Like DELETE


Cre 40
Exp               INSERT

Cre 40
Exp 47            DELETE

Cre 64   old (delete)
Exp 78
                  UPDATE
Cre 78   new (insert)
Exp


                                    Mvcc Unmasked 18 / 89
UPDATE Using Xmin and Xmax


DELETE FROM mvcc_demo;
DELETE 0
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1
SELECT xmin, xmax, * FROM mvcc_demo;
 xmin | xmax | val
------+------+-----
 5413 |    0 | 1
(1 row)

BEGIN WORK;
BEGIN
UPDATE mvcc_demo SET val = 2;
UPDATE 1



                                       Mvcc Unmasked 19 / 89
UPDATE Using Xmin and Xmax


SELECT xmin, xmax, * FROM mvcc_demo;
 xmin | xmax | val
------+------+-----
 5414 |    0 | 2
(1 row)

       SELECT xmin, xmax, * FROM mvcc_demo;
        xmin | xmax | val
       ------+------+-----
        5413 | 5414 |   1
       (1 row)

COMMIT WORK;
COMMIT



                                              Mvcc Unmasked 20 / 89
Just Like UPDATE


Cre 40
Exp               INSERT

Cre 40
Exp 47            DELETE

Cre 64   old (delete)
Exp 78
                  UPDATE
Cre 78   new (insert)
Exp


                                   Mvcc Unmasked 21 / 89
Aborted Transaction IDs Remain

DELETE   FROM mvcc_demo;
DELETE   1
INSERT   INTO mvcc_demo VALUES (1);
INSERT   0 1

BEGIN WORK;
BEGIN
DELETE FROM mvcc_demo;
DELETE 1
ROLLBACK WORK;
ROLLBACK

SELECT xmin, xmax, * FROM mvcc_demo;
 xmin | xmax | val
------+------+-----
 5415 | 5416 | 1
(1 row)

                                           Mvcc Unmasked 22 / 89
Aborted IDs Can Remain Because
        Transaction Status Is Recorded Centrally
         pg_clog
      XID        Status flags

       028   0   0   0   1   0   0   1   0


       024   1   0   1   0   0   0   0   0


       020   1   0   1   0   0   1   0   0


       016   0   0   0   0   0   0   1   0


       012   0   0   0   1   0   1   1   0


       008   1   0   1   0   0   0   1   0


       004   1   0   1   0   0   0   0   0


       000   1   0   0   1   0   0   1   0     Tuple   Creation XID: 15   Expiration XID: 27

                              00 In Progress                xmin               xmax
                     01 Aborted
             10 Committed
       Transaction Id (XID)



Transaction roll back marks the transaction ID as aborted. All
sessions will ignore such transactions; it is not ncessary to revisit
each row to undo the transaction.                           Mvcc Unmasked                      23 / 89
Row Locks Using Xmax

DELETE   FROM mvcc_demo;
DELETE   1
INSERT   INTO mvcc_demo VALUES (1);
INSERT   0 1

BEGIN WORK;
BEGIN
SELECT xmin, xmax, * FROM mvcc_demo;
 xmin | xmax | val
------+------+-----
 5416 |    0 | 1
(1 row)
SELECT xmin, xmax, * FROM mvcc_demo FOR UPDATE;
 xmin | xmax | val
------+------+-----
 5416 |    0 | 1
(1 row)

                                                  Mvcc Unmasked 24 / 89
Row Locks Using Xmax



    SELECT xmin, xmax, * FROM mvcc_demo;
     xmin | xmax | val
    ------+------+-----
     5416 | 5417 | 1
    (1 row)
    COMMIT WORK;
    COMMIT
The tuple bit HEAP_XMAX_EXCL_LOCK is used to indicate that
xmax is a locking xid rather than an expiration xid.




                                                     Mvcc Unmasked 25 / 89
Multi-Statement Transactions




Multi-statement transactions require extra tracking because each
statement has its own visibility rules. For example, a cursor’s
contents must remain unchanged even if later statements in the
same transaction modify rows. Such tracking is implemented
using system command id columns cmin/cmax, which is
internally actually is a single column.




                                                       Mvcc Unmasked 26 / 89
INSERT Using Cmin



DELETE FROM mvcc_demo;
DELETE 1

BEGIN WORK;
BEGIN
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (2);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (3);
INSERT 0 1




                                    Mvcc Unmasked 27 / 89
INSERT Using Cmin




SELECT xmin, cmin, xmax, * FROM mvcc_demo;
 xmin | cmin | xmax | val
------+------+------+-----
 5419 |    0 |    0 | 1
 5419 |    1 |    0 | 2
 5419 |    2 |    0 | 3
(3 rows)
COMMIT WORK;
COMMIT




                                             Mvcc Unmasked 28 / 89
DELETE Using Cmin



DELETE FROM mvcc_demo;
DELETE 3

BEGIN WORK;
BEGIN
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (2);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (3);
INSERT 0 1




                                    Mvcc Unmasked 29 / 89
DELETE Using Cmin



SELECT xmin, cmin, xmax, * FROM mvcc_demo;
 xmin | cmin | xmax | val
------+------+------+-----
 5421 |    0 |    0 | 1
 5421 |    1 |    0 | 2
 5421 |    2 |    0 | 3
(3 rows)

DECLARE c_mvcc_demo CURSOR FOR
SELECT xmin, xmax, cmax, * FROM mvcc_demo;
DECLARE CURSOR




                                             Mvcc Unmasked 30 / 89
DELETE Using Cmin
    DELETE FROM mvcc_demo;
    DELETE 3
    SELECT xmin, cmin, xmax, * FROM mvcc_demo;
     xmin | cmin | xmax | val
    ------+------+------+-----
    (0 rows)

    FETCH ALL FROM c_mvcc_demo;
     xmin | xmax | cmax | val
    ------+------+------+-----
     5421 | 5421 |    0 | 1
     5421 | 5421 |    1 | 2
     5421 | 5421 |    2 | 3
    (3 rows)
    COMMIT WORK;
    COMMIT
A cursor had to be used because the rows were created and
deleted in this transaction and therefore never visible outside this
transaction.                                              Mvcc Unmasked   31 / 89
UPDATE Using Cmin
DELETE FROM mvcc_demo;
DELETE 0
BEGIN WORK;
BEGIN
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (2);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (3);
INSERT 0 1
SELECT xmin, cmin, xmax, * FROM mvcc_demo;
 xmin | cmin | xmax | val
------+------+------+-----
 5422 |    0 |    0 | 1
 5422 |    1 |    0 | 2
 5422 |    2 |    0 | 3
(3 rows)
DECLARE c_mvcc_demo CURSOR FOR
SELECT xmin, xmax, cmax, * FROM mvcc_demo;
                                             Mvcc Unmasked 32 / 89
UPDATE Using Cmin
UPDATE mvcc_demo SET val = val * 10;
UPDATE 3
SELECT xmin, cmin, xmax, * FROM mvcc_demo;
 xmin | cmin | xmax | val
------+------+------+-----
 5422 |    3 |    0 | 10
 5422 |    3 |    0 | 20
 5422 |    3 |    0 | 30
(3 rows)

FETCH ALL FROM c_mvcc_demo;
 xmin | xmax | cmax | val
------+------+------+-----
 5422 | 5422 |    0 | 1
 5422 | 5422 |    1 | 2
 5422 | 5422 |    2 | 3
(3 rows)
COMMIT WORK;
COMMIT
                                             Mvcc Unmasked 33 / 89
Modifying Rows From Different Transactions
DELETE FROM mvcc_demo;
DELETE 3
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1
SELECT xmin, xmax, * FROM mvcc_demo;
 xmin | xmax | val
------+------+-----
 5424 |    0 | 1
(1 row)

BEGIN WORK;
BEGIN
INSERT INTO mvcc_demo VALUES (2);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (3);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (4);
INSERT 0 1
                                       Mvcc Unmasked 34 / 89
Modifying Rows From Different Transactions



SELECT xmin, cmin, xmax, * FROM mvcc_demo;
 xmin | cmin | xmax | val
------+------+------+-----
 5424 |    0 |    0 | 1
 5425 |    0 |    0 | 2
 5425 |    1 |    0 | 3
 5425 |    2 |    0 | 4
(4 rows)
UPDATE mvcc_demo SET val = val * 10;
UPDATE 4




                                             Mvcc Unmasked 35 / 89
Modifying Rows From Different Transactions

SELECT xmin, cmin, xmax, * FROM mvcc_demo;
 xmin | cmin | xmax | val
------+------+------+-----
 5425 |    3 |    0 | 10
 5425 |    3 |    0 | 20
 5425 |    3 |    0 | 30
 5425 |    3 |    0 | 40
(4 rows)

      SELECT xmin, xmax, cmax, * FROM mvcc_demo;
        xmin | xmax | cmax | val
       ------+------+------+-----
        5424 | 5425 |    3 | 1
       (1 row)

COMMIT WORK;
COMMIT

                                                   Mvcc Unmasked 36 / 89
Combo Command Id




Because cmin and cmax are internally a single system column, it is
impossible to simply record the status of a row that is created and
expired in the same multi-statement transaction. For that reason,
a special combo command id is created that references a local
memory hash that contains the actual cmin and cmax values.




                                                         Mvcc Unmasked 37 / 89
UPDATE Using Combo Command Ids
-- use TRUNCATE to remove even invisible rows
TRUNCATE mvcc_demo;
TRUNCATE TABLE

BEGIN WORK;
BEGIN
DELETE FROM   mvcc_demo;
DELETE 0
DELETE FROM   mvcc_demo;
DELETE 0
DELETE FROM   mvcc_demo;
DELETE 0
INSERT INTO   mvcc_demo VALUES (1);
INSERT 0 1
INSERT INTO   mvcc_demo VALUES (2);
INSERT 0 1
INSERT INTO   mvcc_demo VALUES (3);
INSERT 0 1
                                                Mvcc Unmasked 38 / 89
UPDATE Using Combo Command Ids


SELECT xmin, cmin, xmax, * FROM mvcc_demo;
 xmin | cmin | xmax | val
------+------+------+-----
 5427 |    3 |    0 | 1
 5427 |    4 |    0 | 2
 5427 |    5 |    0 | 3
(3 rows)

DECLARE c_mvcc_demo CURSOR FOR
SELECT xmin, xmax, cmax, * FROM mvcc_demo;
DECLARE CURSOR
UPDATE mvcc_demo SET val = val * 10;
UPDATE 3




                                             Mvcc Unmasked 39 / 89
UPDATE Using Combo Command Ids

SELECT xmin, cmin, xmax, * FROM mvcc_demo;
 xmin | cmin | xmax | val
------+------+------+-----
 5427 |    6 |    0 | 10
 5427 |    6 |    0 | 20
 5427 |    6 |    0 | 30
(3 rows)

FETCH ALL FROM c_mvcc_demo;
 xmin | xmax | cmax | val
------+------+------+-----
 5427 | 5427 |    0 | 1
 5427 | 5427 |    1 | 2
 5427 | 5427 |    2 | 3
(3 rows)


                                             Mvcc Unmasked 40 / 89
UPDATE Using Combo Command Ids
    SELECT   t_xmin AS xmin,
             t_xmax::text::int8 AS xmax,
             t_field3::text::int8 AS cmin_cmax,
             (t_infomask::integer & X’0020’::integer)::bool AS is_combocid
    FROM heap_page_items(get_raw_page(’mvcc_demo’, 0))
    ORDER BY 2 DESC, 3;
     xmin | xmax | cmin_cmax | is_combocid
    ------+------+-----------+-------------
     5427 | 5427 |           0 | t
     5427 | 5427 |           1 | t
     5427 | 5427 |           2 | t
     5427 |     0 |          6 | f
     5427 |     0 |          6 | f
     5427 |     0 |          6 | f
    (6 rows)
    COMMIT WORK;
    COMMIT

The last query uses /contrib/pageinspect, which allows visibility
of internal heap page structures and all stored rows, including
those not visible in the current snapshot. (Bit 0x0020 is
internally called HEAP_COMBOCID.)
                                                                     Mvcc Unmasked 41 / 89
MVCC Implementation Summary



     xmin: creation transaction number, set by   INSERT   and
            UPDATE
     xmax: expire transaction number, set by UPDATE and
           DELETE; also used for explicit row locks
cmin/cmax: used to identify the command number that created
           or expired the tuple; also used to store combo
           command ids when the tuple is created and expired
           in the same transaction, and for explicit row locks




                                                          Mvcc Unmasked 42 / 89
Traditional Cleanup Requirements




Traditional single-row-version (non-MVCC) database systems
require storage space cleanup:
 ◮   deleted rows
 ◮   rows created by aborted transactions




                                                      Mvcc Unmasked 43 / 89
MVCC Cleanup Requirements



MVCC has additional cleanup requirements:
 ◮   The creation of a new row during UPDATE (rather than
     replacing the existing row); the storage space taken by the
     old row must eventually be recycled.
 ◮   The delayed cleanup of deleted rows (cleanup cannot occur
     until there are no transactions for which the row is visible)
Postgres handles both traditional and MVCC-specific cleanup
requirements.




                                                           Mvcc Unmasked 44 / 89
Cleanup Behavior




Fortunately, Postgres cleanup happens automatically:
  ◮   On-demand cleanup of a single heap page during row access,
      specifically when a page is accessed by SELECT, UPDATE, and
      DELETE
  ◮   In bulk by an autovacuum processes that runs in the
      background
Cleanup can also be initiated manually by   VACUUM.




                                                        Mvcc Unmasked 45 / 89
Aspects of Cleanup




Cleanup involves recycling space taken by several entities:
  ◮   heap tuples/rows (the largest)
  ◮   heap item pointers (the smallest)
  ◮   index entries




                                                         Mvcc Unmasked 46 / 89
Internal Heap Page




     Page Header   Item   Item   Item




8K


                                        Tuple

         Tuple               Tuple          Special




                                                Mvcc Unmasked 47 / 89
Indexes Point to Items, Not Tuples


                       Indexes




      Page Header   Item   Item   Item




8K


                                         Tuple3

          Tuple2             Tuple1           Special


                                                  Mvcc Unmasked 48 / 89
Heap Tuple Space Recycling

                              Indexes




             Page Header   Item   Item   Item

                           Dead   Dead



    8K




                                    Tuple3           Special


Indexes prevent item pointers from being recycled.
                                                        Mvcc Unmasked 49 / 89
VACUUM Later Recycle Items

                               Indexes




             Page Header   Item     Item     Item

                           Unused   Unused



    8K




                                       Tuple3       Special

VACUUM performs index cleanup, then can mark “dead” items as
“unused”.                                          Mvcc Unmasked   50 / 89
Cleanup of Deleted Rows

TRUNCATE mvcc_demo;
TRUNCATE TABLE

-- force page to < 10% empty
INSERT INTO mvcc_demo SELECT 0 FROM generate_series(1, 240);
INSERT 0 240

-- compute free space percentage
SELECT (100 * (upper - lower) / pagesize::float8)::integer AS free_
FROM page_header(get_raw_page(’mvcc_demo’, 0));
 free_pct
----------
        6
(1 row)
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1

                                                   Mvcc Unmasked 51 / 89
Cleanup of Deleted Rows
SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Normal | 5430 |    0 | (0,241)
(1 row)
DELETE FROM mvcc_demo WHERE val > 0;
DELETE 1
INSERT INTO mvcc_demo VALUES (2);
INSERT 0 1

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Normal | 5430 | 5431 | (0,241)
 (0,242) | Normal | 5432 |    0 | (0,242)
(2 rows)
                                             Mvcc Unmasked 52 / 89
Cleanup of Deleted Rows

    DELETE   FROM mvcc_demo WHERE val > 0;
    DELETE   1
    INSERT   INTO mvcc_demo VALUES (3);
    INSERT   0 1

    SELECT * FROM mvcc_demo_page0
    OFFSET 240;
      ctid | case | xmin | xmax | t_ctid
    ---------+--------+------+------+---------
     (0,241) | Dead |        |      |
     (0,242) | Normal | 5432 | 5433 | (0,242)
     (0,243) | Normal | 5434 |    0 | (0,243)
    (3 rows)
In normal, multi-user usage, cleanup might have been delayed
because other open transactions in the same database might still
need to view the expired rows. However, the behavior would be
the same, just delayed.
                                                       Mvcc Unmasked 53 / 89
Cleanup of Deleted Rows

-- force single-page cleanup via SELECT
SELECT * FROM mvcc_demo
OFFSET 1000;
 val
-----
(0 rows)

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Dead |        |      |
 (0,242) | Dead |        |      |
 (0,243) | Normal | 5434 |    0 | (0,243)
(3 rows)


                                             Mvcc Unmasked 54 / 89
Same as this Slide


                      Indexes




     Page Header   Item   Item   Item

                   Dead   Dead



8K




                            Tuple3      Special


                                           Mvcc Unmasked 55 / 89
Cleanup of Deleted Rows

SELECT pg_freespace(’mvcc_demo’);
 pg_freespace
--------------
 (0,0)
(1 row)

VACUUM mvcc_demo;
VACUUM

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Unused |      |      |
 (0,242) | Unused |      |      |
 (0,243) | Normal | 5434 |    0 | (0,243)
(3 rows)

                                             Mvcc Unmasked 56 / 89
Same as this Slide


                       Indexes




     Page Header   Item     Item     Item

                   Unused   Unused



8K




                               Tuple3       Special


                                               Mvcc Unmasked 57 / 89
Free Space Map (FSM)



    SELECT pg_freespace(’mvcc_demo’);
     pg_freespace
    --------------
     (0,416)
    (1 row)
VACUUM also updates the free space map (FSM), which records
pages containing significant free space. This information is used
to provide target pages for INSERTs and some UPDATEs (those
crossing page boundaries). Single-page vacuum does not update
the free space map.




                                                        Mvcc Unmasked 58 / 89
Another Free Space Map Example



TRUNCATE mvcc_demo;
TRUNCATE TABLE

VACUUM mvcc_demo;
VACUUM

SELECT pg_freespace(’mvcc_demo’);
 pg_freespace
--------------
(0 rows)




                                    Mvcc Unmasked 59 / 89
Another Free Space Map Example

INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1
VACUUM mvcc_demo;
VACUUM
SELECT pg_freespace(’mvcc_demo’);
 pg_freespace
--------------
 (0,8128)
(1 row)
INSERT INTO mvcc_demo VALUES (2);
INSERT 0 1
VACUUM mvcc_demo;
VACUUM
SELECT pg_freespace(’mvcc_demo’);
 pg_freespace
--------------
 (0,8096)
(1 row)
                                    Mvcc Unmasked 60 / 89
Another Free Space Map Example



DELETE FROM mvcc_demo WHERE val = 2;
DELETE 1
VACUUM mvcc_demo;
VACUUM

SELECT pg_freespace(’mvcc_demo’);
 pg_freespace
--------------
 (0,8128)
(1 row)




                                       Mvcc Unmasked 61 / 89
VACUUM Also Removes End-of-File Pages
    DELETE FROM mvcc_demo WHERE val = 1;
    DELETE 1
    VACUUM mvcc_demo;
    VACUUM

    SELECT pg_freespace(’mvcc_demo’);
     pg_freespace
    --------------
    (0 rows)
    SELECT pg_relation_size(’mvcc_demo’);
     pg_relation_size
    ------------------
                    0
    (1 row)
VACUUM FULL shrinks the table file to its minimum size, but
requires an exclusive table lock.

                                                      Mvcc Unmasked 62 / 89
Optimized Single-Page Cleanup of Old UPDATE Rows




The storage space taken by old UPDATE tuples can be reclaimed
just like deleted rows. However, certain UPDATE rows can even
have their items reclaimed, i.e. it is possible to reuse certain old
UPDATE items, rather than marking them as “dead” and requiring
VACUUM to reclaim them after removing referencing index entries.
Specifically, such item reuse is possible with special HOT update
(heap-only tuple) chains, where the chain is on a single heap page
and all indexed values in the chain are identical.




                                                          Mvcc Unmasked 63 / 89
Single-Page Cleanup of HOT UPDATE Rows


HOT update items can be freed (marked “unused”) if they are in
the middle of the chain, i.e. not at the beginning or end of the
chain. At the head of the chain is a special “Redirect” item
pointers that are referenced by indexes; this is possible because
all indexed values are identical in a HOT/redirect chain.
Index creation with HOT chains is complex because the chains
might contain inconsistent values for the newly indexed columns.
This is handled by indexing just the end of the HOT chain and
allowing the index to be used only by transactions that start after
the index has been created. (Specifically, post-index-creation
transactions cannot see the inconsistent HOT chain values due to
MVCC visibility rules; they only see the end of the chain.)




                                                         Mvcc Unmasked 64 / 89
Initial Single-Row State


                      Indexes




     Page Header   Item   Item     Item

                          Unused   Unused



8K




                            Tuple, v1       Special


                                               Mvcc Unmasked 65 / 89
UPDATE Adds a New Row

                                Indexes




              Page Header    Item   Item   Item

                                           Unused



    8K




                 Tuple, v2           Tuple, v1       Special

No index entry added because indexes only point to the head of
the HOT chain.                                        Mvcc Unmasked   66 / 89
Redirect Allows Indexes To Remain Valid


                             Indexes




         Page Header    Item     Item   Item

                  Redirect



8K




            Tuple, v2             Tuple, v3    Special


                                                  Mvcc Unmasked 67 / 89
UPDATE Replaces Another Old Row


                          Indexes




      Page Header    Item     Item   Item

               Redirect



8K




         Tuple, v4             Tuple, v3    Special


                                               Mvcc Unmasked 68 / 89
All Old UPDATE Row Versions Eventually Removed

                                  Indexes




             Page Header     Item     Item   Item

                       Redirect              Unused



    8K




                                       Tuple, v4      Special

This cleanup was performed by another operation on the same
page.                                                Mvcc Unmasked   69 / 89
Cleanup of Old Updated Rows


TRUNCATE mvcc_demo;
TRUNCATE TABLE
INSERT INTO mvcc_demo SELECT 0 FROM generate_series(1, 240);
INSERT 0 240
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Normal | 5437 |    0 | (0,241)
(1 row)




                                                   Mvcc Unmasked 70 / 89
Cleanup of Old Updated Rows




UPDATE mvcc_demo SET val = val + 1 WHERE val > 0;
UPDATE 1
SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Normal | 5437 | 5438 | (0,242)
 (0,242) | Normal | 5438 |    0 | (0,242)
(2 rows)




                                                    Mvcc Unmasked 71 / 89
Cleanup of Old Updated Rows


    UPDATE mvcc_demo SET val = val + 1 WHERE val > 0;
    UPDATE 1
    SELECT * FROM mvcc_demo_page0
    OFFSET 240;
      ctid |        case       | xmin | xmax | t_ctid
    ---------+-----------------+------+------+---------
     (0,241) | Redirect to 242 |      |      |
     (0,242) | Normal          | 5438 | 5439 | (0,243)
     (0,243) | Normal          | 5439 |    0 | (0,243)
    (3 rows)
No index entry added because indexes only point to the head of
the HOT chain.



                                                          Mvcc Unmasked 72 / 89
Same as this Slide


                         Indexes




     Page Header    Item     Item   Item

              Redirect



8K




        Tuple, v2             Tuple, v3    Special


                                              Mvcc Unmasked 73 / 89
Cleanup of Old Updated Rows



UPDATE mvcc_demo SET val = val + 1 WHERE val > 0;
UPDATE 1

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid |        case       | xmin | xmax | t_ctid
---------+-----------------+------+------+---------
 (0,241) | Redirect to 243 |      |      |
 (0,242) | Normal          | 5440 |    0 | (0,242)
 (0,243) | Normal          | 5439 | 5440 | (0,242)
(3 rows)




                                                      Mvcc Unmasked 74 / 89
Same as this Slide


                         Indexes




     Page Header    Item     Item   Item

              Redirect



8K




        Tuple, v4             Tuple, v3    Special


                                              Mvcc Unmasked 75 / 89
Cleanup of Old Updated Rows

-- transaction now committed, HOT chain allows tid
-- to be marked as ‘‘Unused’’
SELECT * FROM mvcc_demo
OFFSET 1000;
 val
-----
(0 rows)

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid |        case       | xmin | xmax | t_ctid
---------+-----------------+------+------+---------
 (0,241) | Redirect to 242 |      |      |
 (0,242) | Normal          | 5440 |    0 | (0,242)
 (0,243) | Unused          |      |      |
(3 rows)


                                                      Mvcc Unmasked 76 / 89
Same as this Slide


                         Indexes




     Page Header    Item     Item   Item

              Redirect              Unused



8K




                              Tuple, v4      Special


                                                Mvcc Unmasked 77 / 89
VACUUM Does Not Remove the Redirect



VACUUM mvcc_demo;
VACUUM

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid |        case       | xmin | xmax | t_ctid
---------+-----------------+------+------+---------
 (0,241) | Redirect to 242 |      |      |
 (0,242) | Normal          | 5440 |    0 | (0,242)
 (0,243) | Unused          |      |      |
(3 rows)




                                                      Mvcc Unmasked 78 / 89
Cleanup Using Manual VACUUM
TRUNCATE mvcc_demo;
TRUNCATE TABLE
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (2);
INSERT 0 1
INSERT INTO mvcc_demo VALUES (3);
INSERT 0 1

SELECT ctid, xmin, xmax
FROM mvcc_demo_page0;
 ctid | xmin | xmax
-------+------+------
 (0,1) | 5442 |    0
 (0,2) | 5443 |    0
 (0,3) | 5444 |    0
(3 rows)
DELETE FROM mvcc_demo;
DELETE 3
                                      Mvcc Unmasked 79 / 89
Cleanup Using Manual VACUUM
SELECT ctid, xmin, xmax
FROM mvcc_demo_page0;
 ctid | xmin | xmax
-------+------+------
 (0,1) | 5442 | 5445
 (0,2) | 5443 | 5445
 (0,3) | 5444 | 5445
(3 rows)

-- too small to trigger autovacuum
VACUUM mvcc_demo;
VACUUM

SELECT pg_relation_size(’mvcc_demo’);
 pg_relation_size
------------------
                0
(1 row)
                                        Mvcc Unmasked 80 / 89
The Indexed UPDATE Problem




The updating of any indexed columns prevents the use of
“redirect” items because the chain must be usable by all indexes,
i.e. a redirect/ HOT UPDATE cannot require additional index
entries due to an indexed value change.
In such cases, item pointers can only be marked as “dead”, like
DELETE does.
No previously shown UPDATE queries modified indexed columns.




                                                        Mvcc Unmasked 81 / 89
Index mvcc_demo Column




CREATE INDEX i_mvcc_demo_val on mvcc_demo (val);
CREATE INDEX




                                                   Mvcc Unmasked 82 / 89
UPDATE of an Indexed Column


TRUNCATE mvcc_demo;
TRUNCATE TABLE
INSERT INTO mvcc_demo SELECT 0 FROM generate_series(1, 240);
INSERT 0 240
INSERT INTO mvcc_demo VALUES (1);
INSERT 0 1

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Normal | 5449 |    0 | (0,241)
(1 row)




                                                   Mvcc Unmasked 83 / 89
UPDATE of an Indexed Column
UPDATE mvcc_demo SET val = val + 1 WHERE val > 0;
UPDATE 1
SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Normal | 5449 | 5450 | (0,242)
 (0,242) | Normal | 5450 |    0 | (0,242)
(2 rows)
UPDATE mvcc_demo SET val = val + 1 WHERE val > 0;
UPDATE 1
SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Dead |        |      |
 (0,242) | Normal | 5450 | 5451 | (0,243)
 (0,243) | Normal | 5451 |    0 | (0,243)
(3 rows)
                                                    Mvcc Unmasked 84 / 89
UPDATE of an Indexed Column


UPDATE mvcc_demo SET val = val + 1 WHERE val > 0;
UPDATE 1

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Dead |        |      |
 (0,242) | Dead |        |      |
 (0,243) | Normal | 5451 | 5452 | (0,244)
 (0,244) | Normal | 5452 |    0 | (0,244)
(4 rows)




                                                    Mvcc Unmasked 85 / 89
UPDATE of an Indexed Column

SELECT * FROM mvcc_demo
OFFSET 1000;
 val
-----
(0 rows)

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Dead |        |      |
 (0,242) | Dead |        |      |
 (0,243) | Dead |        |      |
 (0,244) | Normal | 5452 |    0 | (0,244)
(4 rows)


                                             Mvcc Unmasked 86 / 89
UPDATE of an Indexed Column


VACUUM mvcc_demo;
VACUUM

SELECT * FROM mvcc_demo_page0
OFFSET 240;
  ctid | case | xmin | xmax | t_ctid
---------+--------+------+------+---------
 (0,241) | Unused |      |      |
 (0,242) | Unused |      |      |
 (0,243) | Unused |      |      |
 (0,244) | Normal | 5452 |    0 | (0,244)
(4 rows)




                                             Mvcc Unmasked 87 / 89
Cleanup Summary


                                                 Reuse     Non-HOT   HOT

 Cleanup                                         Heap      Item      Item     Clean        Update

 Method        Triggered By      Scope           Tuples?   State     State    Indexes?     FSM

 Single-Page   SELECT, UPDATE,   single heap     yes       dead      unused   no           no

               DELETE            page

 VACUUM        autovacuum        all potential   yes       unused    unused   yes          yes

               or manually       heap pages



Cleanup is possible only when there are no active transactions for
which the tuples are visible.
HOT items are UPDATE chains that span a single page and contain
identical indexed column values.
In normal usage, single-page cleanup performs the majority of the
cleanup work, while VACUUM reclaims “dead” item pointers, removes
unnecessary index entries, and updates the free space map (FSM).

                                                                                      Mvcc Unmasked 88 / 89
Conclusion




All the queries used in this presentation are available at
http://momjian.us/main/writings/pgsql/mvcc.sql.
http://momjian.us/presentations                              Escher, Relativity
                                                                Mvcc Unmasked     89 / 89

Más contenido relacionado

La actualidad más candente

你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能
maclean liu
 
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
maclean liu
 
Transactions
TransactionsTransactions
Transactions
kalyan_bu
 
Uvm dcon2013
Uvm dcon2013Uvm dcon2013
Uvm dcon2013
sean chen
 

La actualidad más candente (20)

Caspar's First Kernel Patch
Caspar's First Kernel PatchCaspar's First Kernel Patch
Caspar's First Kernel Patch
 
你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能你所不知道的Oracle后台进程Smon功能
你所不知道的Oracle后台进程Smon功能
 
Noinject
NoinjectNoinject
Noinject
 
Threads
ThreadsThreads
Threads
 
Px execution in rac
Px execution in racPx execution in rac
Px execution in rac
 
QEMU Sandboxing for dummies
QEMU Sandboxing for dummiesQEMU Sandboxing for dummies
QEMU Sandboxing for dummies
 
The Next Step in AS3 Framework Evolution - FITC Amsterdam 2013
The Next Step in AS3 Framework Evolution - FITC Amsterdam 2013The Next Step in AS3 Framework Evolution - FITC Amsterdam 2013
The Next Step in AS3 Framework Evolution - FITC Amsterdam 2013
 
Systemd cheatsheet
Systemd cheatsheetSystemd cheatsheet
Systemd cheatsheet
 
BlueHat v18 || A mitigation for kernel toctou vulnerabilities
BlueHat v18 || A mitigation for kernel toctou vulnerabilitiesBlueHat v18 || A mitigation for kernel toctou vulnerabilities
BlueHat v18 || A mitigation for kernel toctou vulnerabilities
 
Performance tuning a quick intoduction
Performance tuning   a quick intoductionPerformance tuning   a quick intoduction
Performance tuning a quick intoduction
 
Drd
DrdDrd
Drd
 
Linux /proc filesystem for MySQL DBAs - FOSDEM 2021
Linux  /proc filesystem for MySQL DBAs - FOSDEM 2021Linux  /proc filesystem for MySQL DBAs - FOSDEM 2021
Linux /proc filesystem for MySQL DBAs - FOSDEM 2021
 
Deep review of LMS process
Deep review of LMS processDeep review of LMS process
Deep review of LMS process
 
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
【Maclean liu技术分享】拨开oracle cbo优化器迷雾,探究histogram直方图之秘 0321
 
Concurrency 2010
Concurrency 2010Concurrency 2010
Concurrency 2010
 
Introduction of unit test on android kernel
Introduction of unit test on android kernelIntroduction of unit test on android kernel
Introduction of unit test on android kernel
 
Transactions
TransactionsTransactions
Transactions
 
MyShell - English
MyShell - EnglishMyShell - English
MyShell - English
 
Uvm dcon2013
Uvm dcon2013Uvm dcon2013
Uvm dcon2013
 
Writing Nginx Modules
Writing Nginx ModulesWriting Nginx Modules
Writing Nginx Modules
 

Similar a Mvcc Unmasked (Bruce Momjian)

Mv unmasked.w.code.march.2013
Mv unmasked.w.code.march.2013Mv unmasked.w.code.march.2013
Mv unmasked.w.code.march.2013
EDB
 
Profiling ruby
Profiling rubyProfiling ruby
Profiling ruby
nasirj
 
基于Mongodb的压力评测工具 ycsb的一些概括
基于Mongodb的压力评测工具 ycsb的一些概括基于Mongodb的压力评测工具 ycsb的一些概括
基于Mongodb的压力评测工具 ycsb的一些概括
Louis liu
 
D Trace Support In My Sql Guide To Solving Reallife Performance Problems
D Trace Support In My Sql Guide To Solving Reallife Performance ProblemsD Trace Support In My Sql Guide To Solving Reallife Performance Problems
D Trace Support In My Sql Guide To Solving Reallife Performance Problems
MySQLConference
 

Similar a Mvcc Unmasked (Bruce Momjian) (20)

How Triton can help to reverse virtual machine based software protections
How Triton can help to reverse virtual machine based software protectionsHow Triton can help to reverse virtual machine based software protections
How Triton can help to reverse virtual machine based software protections
 
Mv unmasked.w.code.march.2013
Mv unmasked.w.code.march.2013Mv unmasked.w.code.march.2013
Mv unmasked.w.code.march.2013
 
Mvc express presentation
Mvc express presentationMvc express presentation
Mvc express presentation
 
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...
IBM IMPACT 2014 - AMC-1883 - Where's My Message - Analyze IBM WebSphere MQ Re...
 
WMQ Toolbox: 20 Scripts, One-liners, & Utilities for UNIX & Windows
WMQ Toolbox: 20 Scripts, One-liners, & Utilities for UNIX & Windows WMQ Toolbox: 20 Scripts, One-liners, & Utilities for UNIX & Windows
WMQ Toolbox: 20 Scripts, One-liners, & Utilities for UNIX & Windows
 
PostgreSQL Meetup Berlin at Zalando HQ
PostgreSQL Meetup Berlin at Zalando HQPostgreSQL Meetup Berlin at Zalando HQ
PostgreSQL Meetup Berlin at Zalando HQ
 
Advanced percona xtra db cluster in a nutshell... la suite plsc2016
Advanced percona xtra db cluster in a nutshell... la suite plsc2016Advanced percona xtra db cluster in a nutshell... la suite plsc2016
Advanced percona xtra db cluster in a nutshell... la suite plsc2016
 
Advanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suiteAdvanced Percona XtraDB Cluster in a nutshell... la suite
Advanced Percona XtraDB Cluster in a nutshell... la suite
 
Cloud patterns - NDC Oslo 2016 - Tamir Dresher
Cloud patterns - NDC Oslo 2016 - Tamir DresherCloud patterns - NDC Oslo 2016 - Tamir Dresher
Cloud patterns - NDC Oslo 2016 - Tamir Dresher
 
PVS-Studio Meets Octave
PVS-Studio Meets Octave PVS-Studio Meets Octave
PVS-Studio Meets Octave
 
Profiling ruby
Profiling rubyProfiling ruby
Profiling ruby
 
Counter Wars (JEEConf 2016)
Counter Wars (JEEConf 2016)Counter Wars (JEEConf 2016)
Counter Wars (JEEConf 2016)
 
基于Mongodb的压力评测工具 ycsb的一些概括
基于Mongodb的压力评测工具 ycsb的一些概括基于Mongodb的压力评测工具 ycsb的一些概括
基于Mongodb的压力评测工具 ycsb的一些概括
 
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
Analyzing the Performance Effects of Meltdown + Spectre on Apache Spark Workl...
 
D Trace Support In My Sql Guide To Solving Reallife Performance Problems
D Trace Support In My Sql Guide To Solving Reallife Performance ProblemsD Trace Support In My Sql Guide To Solving Reallife Performance Problems
D Trace Support In My Sql Guide To Solving Reallife Performance Problems
 
A Post About Analyzing PHP
A Post About Analyzing PHPA Post About Analyzing PHP
A Post About Analyzing PHP
 
Artimon - Apache Flume (incubating) NYC Meetup 20111108
Artimon - Apache Flume (incubating) NYC Meetup 20111108Artimon - Apache Flume (incubating) NYC Meetup 20111108
Artimon - Apache Flume (incubating) NYC Meetup 20111108
 
200 Open Source Projects Later: Source Code Static Analysis Experience
200 Open Source Projects Later: Source Code Static Analysis Experience200 Open Source Projects Later: Source Code Static Analysis Experience
200 Open Source Projects Later: Source Code Static Analysis Experience
 
You need Event Mesh, not Service Mesh - Chris Suszynski [WJUG 301]
You need Event Mesh, not Service Mesh - Chris Suszynski [WJUG 301]You need Event Mesh, not Service Mesh - Chris Suszynski [WJUG 301]
You need Event Mesh, not Service Mesh - Chris Suszynski [WJUG 301]
 
Training Slides: 104 - Basics - Working With Command Line Tools
Training Slides: 104 - Basics - Working With Command Line ToolsTraining Slides: 104 - Basics - Working With Command Line Tools
Training Slides: 104 - Basics - Working With Command Line Tools
 

Más de Ontico

Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Ontico
 

Más de Ontico (20)

One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
One-cloud — система управления дата-центром в Одноклассниках / Олег Анастасье...
 
Масштабируя DNS / Артем Гавриченков (Qrator Labs)
Масштабируя DNS / Артем Гавриченков (Qrator Labs)Масштабируя DNS / Артем Гавриченков (Qrator Labs)
Масштабируя DNS / Артем Гавриченков (Qrator Labs)
 
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
Создание BigData-платформы для ФГУП Почта России / Андрей Бащенко (Luxoft)
 
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
Готовим тестовое окружение, или сколько тестовых инстансов вам нужно / Алекса...
 
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
Новые технологии репликации данных в PostgreSQL / Александр Алексеев (Postgre...
 
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
PostgreSQL Configuration for Humans / Alvaro Hernandez (OnGres)
 
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
Inexpensive Datamasking for MySQL with ProxySQL — Data Anonymization for Deve...
 
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
Опыт разработки модуля межсетевого экранирования для MySQL / Олег Брославский...
 
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)
ProxySQL Use Case Scenarios / Alkin Tezuysal (Percona)
 
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)MySQL Replication — Advanced Features / Петр Зайцев (Percona)
MySQL Replication — Advanced Features / Петр Зайцев (Percona)
 
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...Внутренний open-source. Как разрабатывать мобильное приложение большим количе...
Внутренний open-source. Как разрабатывать мобильное приложение большим количе...
 
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...
Подробно о том, как Causal Consistency реализовано в MongoDB / Михаил Тюленев...
 
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...
Балансировка на скорости проводов. Без ASIC, без ограничений. Решения NFWare ...
 
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)
Перехват трафика — мифы и реальность / Евгений Усков (Qrator Labs)
 
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)
И тогда наверняка вдруг запляшут облака! / Алексей Сушков (ПЕТЕР-СЕРВИС)
 
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)
Как мы заставили Druid работать в Одноклассниках / Юрий Невиницин (OK.RU)
 
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)
Разгоняем ASP.NET Core / Илья Вербицкий (WebStoating s.r.o.)
 
100500 способов кэширования в Oracle Database или как достичь максимальной ск...
100500 способов кэширования в Oracle Database или как достичь максимальной ск...100500 способов кэширования в Oracle Database или как достичь максимальной ск...
100500 способов кэширования в Oracle Database или как достичь максимальной ск...
 
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...
Apache Ignite Persistence: зачем Persistence для In-Memory, и как он работает...
 
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...
Механизмы мониторинга баз данных: взгляд изнутри / Дмитрий Еманов (Firebird P...
 

Mvcc Unmasked (Bruce Momjian)

  • 1. Mvcc Unmasked BRUCE MOMJIAN January, 2012 This talk explains how Multiversion Concurrency Control (MVCC) is implemented in Postgres, and highlights optimizations which minimize the downsides of MVCC. Creative Commons Attribution License http://momjian.us/presentations Mvcc Unmasked 1 / 89
  • 2. Unmasked: Who Are These People? Mvcc Unmasked 2 / 89
  • 3. Unmasked: The Original Star Wars Cast Left to right: Han Solo, Darth Vader, Chewbacca, Leia, Luke Skywalker, R2D2 Mvcc Unmasked 3 / 89
  • 4. Why Unmask MVCC? ◮ Predict concurrent query behavior ◮ Manage MVCC performance effects ◮ Understand storage space reuse Mvcc Unmasked 4 / 89
  • 5. Outline 1. Introduction to MVCC 2. MVCC Implementation Details 3. MVCC Cleanup Requirements and Behavior Mvcc Unmasked 5 / 89
  • 6. What is MVCC? Multiversion Concurrency Control (MVCC) allows Postgres to offer high concurrency even during significant database read/write activity. MVCC specifically offers behavior where "readers never block writers, and writers never block readers". This presentation explains how MVCC is implemented in Postgres, and highlights optimizations which minimize the downsides of MVCC. Mvcc Unmasked 6 / 89
  • 7. Which Other Database Systems Support MVCC? ◮ Oracle ◮ DB2 (partial) ◮ MySQL with InnoDB ◮ Informix ◮ Firebird ◮ MSSQL (optional, disabled by default) Mvcc Unmasked 7 / 89
  • 8. MVCC Behavior Cre 40 Exp INSERT Cre 40 Exp 47 DELETE Cre 64 old (delete) Exp 78 UPDATE Cre 78 new (insert) Exp Mvcc Unmasked 8 / 89
  • 9. MVCC Snapshots MVCC snapshots control which tuples are visible for SQL statements. A snapshot is recorded at the start of each SQL statement in READ COMMITTED transaction isolation mode, and at transaction start in SERIALIZABLE transaction isolation mode. In fact, it is frequency of taking new snapshots that controls the transaction isolation behavior. When a new snapshot is taken, the following information is gathered: ◮ the highest-numbered committed transaction ◮ the transaction numbers currently executing Using this snapshot information, Postgres can determine if a transaction’s actions should be visible to an executing statement. Mvcc Unmasked 9 / 89
  • 10. MVCC Snapshots Determine Row Visibility Create−Only Cre 30 Sequential Scan Exp Visible Cre 50 Snapshot Exp Invisible Cre 110 The highest−numbered Exp Invisible committed transaction: 100 Open Transactions: 25, 50, 75 Create & Expire Cre 30 For simplicity, assume all other Exp 80 Invisible transactions are committed. Cre 30 Exp 75 Visible Cre 30 Exp 110 Visible Internally, the creation xid is stored in the system column ’xmin’, and expire in ’xmax’. Mvcc Unmasked 10 / 89
  • 11. Confused Yet? Source code comment in src/backend/utils/time/tqual.c: ((Xmin == my-transaction && inserted by the current transaction Cmin < my-command && before this command, and (Xmax is null || the row has not been deleted, or (Xmax == my-transaction && it was deleted by the current transaction Cmax >= my-command))) but not before this command, || or (Xmin is committed && the row was inserted by a committed transaction, and (Xmax is null || the row has not been deleted, or (Xmax == my-transaction && the row is being deleted by this transaction Cmax >= my-command) || but it’s not deleted "yet", or (Xmax != my-transaction && the row was deleted by another transaction Xmax is not committed)))) that has not been committed mao [Mike Olson] says 17 march 1993: the tests in this routine are correct; if you think they’re not, you’re wrong, and you should think about it again. i know, it happened to me. Mvcc Unmasked 11 / 89
  • 12. Implementation Details All queries were generated on an unmodified version of Postgres. The contrib module pageinspect was installed to show internal heap page information and pg_freespacemap was installed to show free space map information. Mvcc Unmasked 12 / 89
  • 13. Setup CREATE TABLE mvcc_demo (val INTEGER); CREATE TABLE DROP VIEW IF EXISTS mvcc_demo_page0; DROP VIEW CREATE VIEW mvcc_demo_page0 AS SELECT ’(0,’ || lp || ’)’ AS ctid, CASE lp_flags WHEN 0 THEN ’Unused’ WHEN 1 THEN ’Normal’ WHEN 2 THEN ’Redirect to ’ || lp_off WHEN 3 THEN ’Dead’ END, t_xmin::text::int8 AS xmin, t_xmax::text::int8 AS xmax, t_ctid FROM heap_page_items(get_raw_page(’mvcc_demo’, 0)) ORDER BY lp; CREATE VIEW Mvcc Unmasked 13 / 89
  • 14. INSERT Using Xmin DELETE FROM mvcc_demo; DELETE 0 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5409 | 0 | 1 (1 row) All the queries used in this presentation are available at http://momjian.us/main/writings/pgsql/mvcc.sql. Mvcc Unmasked 14 / 89
  • 15. Just Like INSERT Cre 40 Exp INSERT Cre 40 Exp 47 DELETE Cre 64 old (delete) Exp 78 UPDATE Cre 78 new (insert) Exp Mvcc Unmasked 15 / 89
  • 16. DELETE Using Xmax DELETE FROM mvcc_demo; DELETE 1 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5411 | 0 | 1 (1 row) BEGIN WORK; BEGIN DELETE FROM mvcc_demo; DELETE 1 Mvcc Unmasked 16 / 89
  • 17. DELETE Using Xmax SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- (0 rows) SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5411 | 5412 | 1 (1 row) SELECT txid_current(); txid_current -------------- 5412 (1 row) COMMIT WORK; COMMIT Mvcc Unmasked 17 / 89
  • 18. Just Like DELETE Cre 40 Exp INSERT Cre 40 Exp 47 DELETE Cre 64 old (delete) Exp 78 UPDATE Cre 78 new (insert) Exp Mvcc Unmasked 18 / 89
  • 19. UPDATE Using Xmin and Xmax DELETE FROM mvcc_demo; DELETE 0 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5413 | 0 | 1 (1 row) BEGIN WORK; BEGIN UPDATE mvcc_demo SET val = 2; UPDATE 1 Mvcc Unmasked 19 / 89
  • 20. UPDATE Using Xmin and Xmax SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5414 | 0 | 2 (1 row) SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5413 | 5414 | 1 (1 row) COMMIT WORK; COMMIT Mvcc Unmasked 20 / 89
  • 21. Just Like UPDATE Cre 40 Exp INSERT Cre 40 Exp 47 DELETE Cre 64 old (delete) Exp 78 UPDATE Cre 78 new (insert) Exp Mvcc Unmasked 21 / 89
  • 22. Aborted Transaction IDs Remain DELETE FROM mvcc_demo; DELETE 1 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 BEGIN WORK; BEGIN DELETE FROM mvcc_demo; DELETE 1 ROLLBACK WORK; ROLLBACK SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5415 | 5416 | 1 (1 row) Mvcc Unmasked 22 / 89
  • 23. Aborted IDs Can Remain Because Transaction Status Is Recorded Centrally pg_clog XID Status flags 028 0 0 0 1 0 0 1 0 024 1 0 1 0 0 0 0 0 020 1 0 1 0 0 1 0 0 016 0 0 0 0 0 0 1 0 012 0 0 0 1 0 1 1 0 008 1 0 1 0 0 0 1 0 004 1 0 1 0 0 0 0 0 000 1 0 0 1 0 0 1 0 Tuple Creation XID: 15 Expiration XID: 27 00 In Progress xmin xmax 01 Aborted 10 Committed Transaction Id (XID) Transaction roll back marks the transaction ID as aborted. All sessions will ignore such transactions; it is not ncessary to revisit each row to undo the transaction. Mvcc Unmasked 23 / 89
  • 24. Row Locks Using Xmax DELETE FROM mvcc_demo; DELETE 1 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 BEGIN WORK; BEGIN SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5416 | 0 | 1 (1 row) SELECT xmin, xmax, * FROM mvcc_demo FOR UPDATE; xmin | xmax | val ------+------+----- 5416 | 0 | 1 (1 row) Mvcc Unmasked 24 / 89
  • 25. Row Locks Using Xmax SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5416 | 5417 | 1 (1 row) COMMIT WORK; COMMIT The tuple bit HEAP_XMAX_EXCL_LOCK is used to indicate that xmax is a locking xid rather than an expiration xid. Mvcc Unmasked 25 / 89
  • 26. Multi-Statement Transactions Multi-statement transactions require extra tracking because each statement has its own visibility rules. For example, a cursor’s contents must remain unchanged even if later statements in the same transaction modify rows. Such tracking is implemented using system command id columns cmin/cmax, which is internally actually is a single column. Mvcc Unmasked 26 / 89
  • 27. INSERT Using Cmin DELETE FROM mvcc_demo; DELETE 1 BEGIN WORK; BEGIN INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 INSERT INTO mvcc_demo VALUES (2); INSERT 0 1 INSERT INTO mvcc_demo VALUES (3); INSERT 0 1 Mvcc Unmasked 27 / 89
  • 28. INSERT Using Cmin SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- 5419 | 0 | 0 | 1 5419 | 1 | 0 | 2 5419 | 2 | 0 | 3 (3 rows) COMMIT WORK; COMMIT Mvcc Unmasked 28 / 89
  • 29. DELETE Using Cmin DELETE FROM mvcc_demo; DELETE 3 BEGIN WORK; BEGIN INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 INSERT INTO mvcc_demo VALUES (2); INSERT 0 1 INSERT INTO mvcc_demo VALUES (3); INSERT 0 1 Mvcc Unmasked 29 / 89
  • 30. DELETE Using Cmin SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- 5421 | 0 | 0 | 1 5421 | 1 | 0 | 2 5421 | 2 | 0 | 3 (3 rows) DECLARE c_mvcc_demo CURSOR FOR SELECT xmin, xmax, cmax, * FROM mvcc_demo; DECLARE CURSOR Mvcc Unmasked 30 / 89
  • 31. DELETE Using Cmin DELETE FROM mvcc_demo; DELETE 3 SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- (0 rows) FETCH ALL FROM c_mvcc_demo; xmin | xmax | cmax | val ------+------+------+----- 5421 | 5421 | 0 | 1 5421 | 5421 | 1 | 2 5421 | 5421 | 2 | 3 (3 rows) COMMIT WORK; COMMIT A cursor had to be used because the rows were created and deleted in this transaction and therefore never visible outside this transaction. Mvcc Unmasked 31 / 89
  • 32. UPDATE Using Cmin DELETE FROM mvcc_demo; DELETE 0 BEGIN WORK; BEGIN INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 INSERT INTO mvcc_demo VALUES (2); INSERT 0 1 INSERT INTO mvcc_demo VALUES (3); INSERT 0 1 SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- 5422 | 0 | 0 | 1 5422 | 1 | 0 | 2 5422 | 2 | 0 | 3 (3 rows) DECLARE c_mvcc_demo CURSOR FOR SELECT xmin, xmax, cmax, * FROM mvcc_demo; Mvcc Unmasked 32 / 89
  • 33. UPDATE Using Cmin UPDATE mvcc_demo SET val = val * 10; UPDATE 3 SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- 5422 | 3 | 0 | 10 5422 | 3 | 0 | 20 5422 | 3 | 0 | 30 (3 rows) FETCH ALL FROM c_mvcc_demo; xmin | xmax | cmax | val ------+------+------+----- 5422 | 5422 | 0 | 1 5422 | 5422 | 1 | 2 5422 | 5422 | 2 | 3 (3 rows) COMMIT WORK; COMMIT Mvcc Unmasked 33 / 89
  • 34. Modifying Rows From Different Transactions DELETE FROM mvcc_demo; DELETE 3 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 SELECT xmin, xmax, * FROM mvcc_demo; xmin | xmax | val ------+------+----- 5424 | 0 | 1 (1 row) BEGIN WORK; BEGIN INSERT INTO mvcc_demo VALUES (2); INSERT 0 1 INSERT INTO mvcc_demo VALUES (3); INSERT 0 1 INSERT INTO mvcc_demo VALUES (4); INSERT 0 1 Mvcc Unmasked 34 / 89
  • 35. Modifying Rows From Different Transactions SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- 5424 | 0 | 0 | 1 5425 | 0 | 0 | 2 5425 | 1 | 0 | 3 5425 | 2 | 0 | 4 (4 rows) UPDATE mvcc_demo SET val = val * 10; UPDATE 4 Mvcc Unmasked 35 / 89
  • 36. Modifying Rows From Different Transactions SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- 5425 | 3 | 0 | 10 5425 | 3 | 0 | 20 5425 | 3 | 0 | 30 5425 | 3 | 0 | 40 (4 rows) SELECT xmin, xmax, cmax, * FROM mvcc_demo; xmin | xmax | cmax | val ------+------+------+----- 5424 | 5425 | 3 | 1 (1 row) COMMIT WORK; COMMIT Mvcc Unmasked 36 / 89
  • 37. Combo Command Id Because cmin and cmax are internally a single system column, it is impossible to simply record the status of a row that is created and expired in the same multi-statement transaction. For that reason, a special combo command id is created that references a local memory hash that contains the actual cmin and cmax values. Mvcc Unmasked 37 / 89
  • 38. UPDATE Using Combo Command Ids -- use TRUNCATE to remove even invisible rows TRUNCATE mvcc_demo; TRUNCATE TABLE BEGIN WORK; BEGIN DELETE FROM mvcc_demo; DELETE 0 DELETE FROM mvcc_demo; DELETE 0 DELETE FROM mvcc_demo; DELETE 0 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 INSERT INTO mvcc_demo VALUES (2); INSERT 0 1 INSERT INTO mvcc_demo VALUES (3); INSERT 0 1 Mvcc Unmasked 38 / 89
  • 39. UPDATE Using Combo Command Ids SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- 5427 | 3 | 0 | 1 5427 | 4 | 0 | 2 5427 | 5 | 0 | 3 (3 rows) DECLARE c_mvcc_demo CURSOR FOR SELECT xmin, xmax, cmax, * FROM mvcc_demo; DECLARE CURSOR UPDATE mvcc_demo SET val = val * 10; UPDATE 3 Mvcc Unmasked 39 / 89
  • 40. UPDATE Using Combo Command Ids SELECT xmin, cmin, xmax, * FROM mvcc_demo; xmin | cmin | xmax | val ------+------+------+----- 5427 | 6 | 0 | 10 5427 | 6 | 0 | 20 5427 | 6 | 0 | 30 (3 rows) FETCH ALL FROM c_mvcc_demo; xmin | xmax | cmax | val ------+------+------+----- 5427 | 5427 | 0 | 1 5427 | 5427 | 1 | 2 5427 | 5427 | 2 | 3 (3 rows) Mvcc Unmasked 40 / 89
  • 41. UPDATE Using Combo Command Ids SELECT t_xmin AS xmin, t_xmax::text::int8 AS xmax, t_field3::text::int8 AS cmin_cmax, (t_infomask::integer & X’0020’::integer)::bool AS is_combocid FROM heap_page_items(get_raw_page(’mvcc_demo’, 0)) ORDER BY 2 DESC, 3; xmin | xmax | cmin_cmax | is_combocid ------+------+-----------+------------- 5427 | 5427 | 0 | t 5427 | 5427 | 1 | t 5427 | 5427 | 2 | t 5427 | 0 | 6 | f 5427 | 0 | 6 | f 5427 | 0 | 6 | f (6 rows) COMMIT WORK; COMMIT The last query uses /contrib/pageinspect, which allows visibility of internal heap page structures and all stored rows, including those not visible in the current snapshot. (Bit 0x0020 is internally called HEAP_COMBOCID.) Mvcc Unmasked 41 / 89
  • 42. MVCC Implementation Summary xmin: creation transaction number, set by INSERT and UPDATE xmax: expire transaction number, set by UPDATE and DELETE; also used for explicit row locks cmin/cmax: used to identify the command number that created or expired the tuple; also used to store combo command ids when the tuple is created and expired in the same transaction, and for explicit row locks Mvcc Unmasked 42 / 89
  • 43. Traditional Cleanup Requirements Traditional single-row-version (non-MVCC) database systems require storage space cleanup: ◮ deleted rows ◮ rows created by aborted transactions Mvcc Unmasked 43 / 89
  • 44. MVCC Cleanup Requirements MVCC has additional cleanup requirements: ◮ The creation of a new row during UPDATE (rather than replacing the existing row); the storage space taken by the old row must eventually be recycled. ◮ The delayed cleanup of deleted rows (cleanup cannot occur until there are no transactions for which the row is visible) Postgres handles both traditional and MVCC-specific cleanup requirements. Mvcc Unmasked 44 / 89
  • 45. Cleanup Behavior Fortunately, Postgres cleanup happens automatically: ◮ On-demand cleanup of a single heap page during row access, specifically when a page is accessed by SELECT, UPDATE, and DELETE ◮ In bulk by an autovacuum processes that runs in the background Cleanup can also be initiated manually by VACUUM. Mvcc Unmasked 45 / 89
  • 46. Aspects of Cleanup Cleanup involves recycling space taken by several entities: ◮ heap tuples/rows (the largest) ◮ heap item pointers (the smallest) ◮ index entries Mvcc Unmasked 46 / 89
  • 47. Internal Heap Page Page Header Item Item Item 8K Tuple Tuple Tuple Special Mvcc Unmasked 47 / 89
  • 48. Indexes Point to Items, Not Tuples Indexes Page Header Item Item Item 8K Tuple3 Tuple2 Tuple1 Special Mvcc Unmasked 48 / 89
  • 49. Heap Tuple Space Recycling Indexes Page Header Item Item Item Dead Dead 8K Tuple3 Special Indexes prevent item pointers from being recycled. Mvcc Unmasked 49 / 89
  • 50. VACUUM Later Recycle Items Indexes Page Header Item Item Item Unused Unused 8K Tuple3 Special VACUUM performs index cleanup, then can mark “dead” items as “unused”. Mvcc Unmasked 50 / 89
  • 51. Cleanup of Deleted Rows TRUNCATE mvcc_demo; TRUNCATE TABLE -- force page to < 10% empty INSERT INTO mvcc_demo SELECT 0 FROM generate_series(1, 240); INSERT 0 240 -- compute free space percentage SELECT (100 * (upper - lower) / pagesize::float8)::integer AS free_ FROM page_header(get_raw_page(’mvcc_demo’, 0)); free_pct ---------- 6 (1 row) INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 Mvcc Unmasked 51 / 89
  • 52. Cleanup of Deleted Rows SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Normal | 5430 | 0 | (0,241) (1 row) DELETE FROM mvcc_demo WHERE val > 0; DELETE 1 INSERT INTO mvcc_demo VALUES (2); INSERT 0 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Normal | 5430 | 5431 | (0,241) (0,242) | Normal | 5432 | 0 | (0,242) (2 rows) Mvcc Unmasked 52 / 89
  • 53. Cleanup of Deleted Rows DELETE FROM mvcc_demo WHERE val > 0; DELETE 1 INSERT INTO mvcc_demo VALUES (3); INSERT 0 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Dead | | | (0,242) | Normal | 5432 | 5433 | (0,242) (0,243) | Normal | 5434 | 0 | (0,243) (3 rows) In normal, multi-user usage, cleanup might have been delayed because other open transactions in the same database might still need to view the expired rows. However, the behavior would be the same, just delayed. Mvcc Unmasked 53 / 89
  • 54. Cleanup of Deleted Rows -- force single-page cleanup via SELECT SELECT * FROM mvcc_demo OFFSET 1000; val ----- (0 rows) SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Dead | | | (0,242) | Dead | | | (0,243) | Normal | 5434 | 0 | (0,243) (3 rows) Mvcc Unmasked 54 / 89
  • 55. Same as this Slide Indexes Page Header Item Item Item Dead Dead 8K Tuple3 Special Mvcc Unmasked 55 / 89
  • 56. Cleanup of Deleted Rows SELECT pg_freespace(’mvcc_demo’); pg_freespace -------------- (0,0) (1 row) VACUUM mvcc_demo; VACUUM SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Unused | | | (0,242) | Unused | | | (0,243) | Normal | 5434 | 0 | (0,243) (3 rows) Mvcc Unmasked 56 / 89
  • 57. Same as this Slide Indexes Page Header Item Item Item Unused Unused 8K Tuple3 Special Mvcc Unmasked 57 / 89
  • 58. Free Space Map (FSM) SELECT pg_freespace(’mvcc_demo’); pg_freespace -------------- (0,416) (1 row) VACUUM also updates the free space map (FSM), which records pages containing significant free space. This information is used to provide target pages for INSERTs and some UPDATEs (those crossing page boundaries). Single-page vacuum does not update the free space map. Mvcc Unmasked 58 / 89
  • 59. Another Free Space Map Example TRUNCATE mvcc_demo; TRUNCATE TABLE VACUUM mvcc_demo; VACUUM SELECT pg_freespace(’mvcc_demo’); pg_freespace -------------- (0 rows) Mvcc Unmasked 59 / 89
  • 60. Another Free Space Map Example INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 VACUUM mvcc_demo; VACUUM SELECT pg_freespace(’mvcc_demo’); pg_freespace -------------- (0,8128) (1 row) INSERT INTO mvcc_demo VALUES (2); INSERT 0 1 VACUUM mvcc_demo; VACUUM SELECT pg_freespace(’mvcc_demo’); pg_freespace -------------- (0,8096) (1 row) Mvcc Unmasked 60 / 89
  • 61. Another Free Space Map Example DELETE FROM mvcc_demo WHERE val = 2; DELETE 1 VACUUM mvcc_demo; VACUUM SELECT pg_freespace(’mvcc_demo’); pg_freespace -------------- (0,8128) (1 row) Mvcc Unmasked 61 / 89
  • 62. VACUUM Also Removes End-of-File Pages DELETE FROM mvcc_demo WHERE val = 1; DELETE 1 VACUUM mvcc_demo; VACUUM SELECT pg_freespace(’mvcc_demo’); pg_freespace -------------- (0 rows) SELECT pg_relation_size(’mvcc_demo’); pg_relation_size ------------------ 0 (1 row) VACUUM FULL shrinks the table file to its minimum size, but requires an exclusive table lock. Mvcc Unmasked 62 / 89
  • 63. Optimized Single-Page Cleanup of Old UPDATE Rows The storage space taken by old UPDATE tuples can be reclaimed just like deleted rows. However, certain UPDATE rows can even have their items reclaimed, i.e. it is possible to reuse certain old UPDATE items, rather than marking them as “dead” and requiring VACUUM to reclaim them after removing referencing index entries. Specifically, such item reuse is possible with special HOT update (heap-only tuple) chains, where the chain is on a single heap page and all indexed values in the chain are identical. Mvcc Unmasked 63 / 89
  • 64. Single-Page Cleanup of HOT UPDATE Rows HOT update items can be freed (marked “unused”) if they are in the middle of the chain, i.e. not at the beginning or end of the chain. At the head of the chain is a special “Redirect” item pointers that are referenced by indexes; this is possible because all indexed values are identical in a HOT/redirect chain. Index creation with HOT chains is complex because the chains might contain inconsistent values for the newly indexed columns. This is handled by indexing just the end of the HOT chain and allowing the index to be used only by transactions that start after the index has been created. (Specifically, post-index-creation transactions cannot see the inconsistent HOT chain values due to MVCC visibility rules; they only see the end of the chain.) Mvcc Unmasked 64 / 89
  • 65. Initial Single-Row State Indexes Page Header Item Item Item Unused Unused 8K Tuple, v1 Special Mvcc Unmasked 65 / 89
  • 66. UPDATE Adds a New Row Indexes Page Header Item Item Item Unused 8K Tuple, v2 Tuple, v1 Special No index entry added because indexes only point to the head of the HOT chain. Mvcc Unmasked 66 / 89
  • 67. Redirect Allows Indexes To Remain Valid Indexes Page Header Item Item Item Redirect 8K Tuple, v2 Tuple, v3 Special Mvcc Unmasked 67 / 89
  • 68. UPDATE Replaces Another Old Row Indexes Page Header Item Item Item Redirect 8K Tuple, v4 Tuple, v3 Special Mvcc Unmasked 68 / 89
  • 69. All Old UPDATE Row Versions Eventually Removed Indexes Page Header Item Item Item Redirect Unused 8K Tuple, v4 Special This cleanup was performed by another operation on the same page. Mvcc Unmasked 69 / 89
  • 70. Cleanup of Old Updated Rows TRUNCATE mvcc_demo; TRUNCATE TABLE INSERT INTO mvcc_demo SELECT 0 FROM generate_series(1, 240); INSERT 0 240 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Normal | 5437 | 0 | (0,241) (1 row) Mvcc Unmasked 70 / 89
  • 71. Cleanup of Old Updated Rows UPDATE mvcc_demo SET val = val + 1 WHERE val > 0; UPDATE 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Normal | 5437 | 5438 | (0,242) (0,242) | Normal | 5438 | 0 | (0,242) (2 rows) Mvcc Unmasked 71 / 89
  • 72. Cleanup of Old Updated Rows UPDATE mvcc_demo SET val = val + 1 WHERE val > 0; UPDATE 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+-----------------+------+------+--------- (0,241) | Redirect to 242 | | | (0,242) | Normal | 5438 | 5439 | (0,243) (0,243) | Normal | 5439 | 0 | (0,243) (3 rows) No index entry added because indexes only point to the head of the HOT chain. Mvcc Unmasked 72 / 89
  • 73. Same as this Slide Indexes Page Header Item Item Item Redirect 8K Tuple, v2 Tuple, v3 Special Mvcc Unmasked 73 / 89
  • 74. Cleanup of Old Updated Rows UPDATE mvcc_demo SET val = val + 1 WHERE val > 0; UPDATE 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+-----------------+------+------+--------- (0,241) | Redirect to 243 | | | (0,242) | Normal | 5440 | 0 | (0,242) (0,243) | Normal | 5439 | 5440 | (0,242) (3 rows) Mvcc Unmasked 74 / 89
  • 75. Same as this Slide Indexes Page Header Item Item Item Redirect 8K Tuple, v4 Tuple, v3 Special Mvcc Unmasked 75 / 89
  • 76. Cleanup of Old Updated Rows -- transaction now committed, HOT chain allows tid -- to be marked as ‘‘Unused’’ SELECT * FROM mvcc_demo OFFSET 1000; val ----- (0 rows) SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+-----------------+------+------+--------- (0,241) | Redirect to 242 | | | (0,242) | Normal | 5440 | 0 | (0,242) (0,243) | Unused | | | (3 rows) Mvcc Unmasked 76 / 89
  • 77. Same as this Slide Indexes Page Header Item Item Item Redirect Unused 8K Tuple, v4 Special Mvcc Unmasked 77 / 89
  • 78. VACUUM Does Not Remove the Redirect VACUUM mvcc_demo; VACUUM SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+-----------------+------+------+--------- (0,241) | Redirect to 242 | | | (0,242) | Normal | 5440 | 0 | (0,242) (0,243) | Unused | | | (3 rows) Mvcc Unmasked 78 / 89
  • 79. Cleanup Using Manual VACUUM TRUNCATE mvcc_demo; TRUNCATE TABLE INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 INSERT INTO mvcc_demo VALUES (2); INSERT 0 1 INSERT INTO mvcc_demo VALUES (3); INSERT 0 1 SELECT ctid, xmin, xmax FROM mvcc_demo_page0; ctid | xmin | xmax -------+------+------ (0,1) | 5442 | 0 (0,2) | 5443 | 0 (0,3) | 5444 | 0 (3 rows) DELETE FROM mvcc_demo; DELETE 3 Mvcc Unmasked 79 / 89
  • 80. Cleanup Using Manual VACUUM SELECT ctid, xmin, xmax FROM mvcc_demo_page0; ctid | xmin | xmax -------+------+------ (0,1) | 5442 | 5445 (0,2) | 5443 | 5445 (0,3) | 5444 | 5445 (3 rows) -- too small to trigger autovacuum VACUUM mvcc_demo; VACUUM SELECT pg_relation_size(’mvcc_demo’); pg_relation_size ------------------ 0 (1 row) Mvcc Unmasked 80 / 89
  • 81. The Indexed UPDATE Problem The updating of any indexed columns prevents the use of “redirect” items because the chain must be usable by all indexes, i.e. a redirect/ HOT UPDATE cannot require additional index entries due to an indexed value change. In such cases, item pointers can only be marked as “dead”, like DELETE does. No previously shown UPDATE queries modified indexed columns. Mvcc Unmasked 81 / 89
  • 82. Index mvcc_demo Column CREATE INDEX i_mvcc_demo_val on mvcc_demo (val); CREATE INDEX Mvcc Unmasked 82 / 89
  • 83. UPDATE of an Indexed Column TRUNCATE mvcc_demo; TRUNCATE TABLE INSERT INTO mvcc_demo SELECT 0 FROM generate_series(1, 240); INSERT 0 240 INSERT INTO mvcc_demo VALUES (1); INSERT 0 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Normal | 5449 | 0 | (0,241) (1 row) Mvcc Unmasked 83 / 89
  • 84. UPDATE of an Indexed Column UPDATE mvcc_demo SET val = val + 1 WHERE val > 0; UPDATE 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Normal | 5449 | 5450 | (0,242) (0,242) | Normal | 5450 | 0 | (0,242) (2 rows) UPDATE mvcc_demo SET val = val + 1 WHERE val > 0; UPDATE 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Dead | | | (0,242) | Normal | 5450 | 5451 | (0,243) (0,243) | Normal | 5451 | 0 | (0,243) (3 rows) Mvcc Unmasked 84 / 89
  • 85. UPDATE of an Indexed Column UPDATE mvcc_demo SET val = val + 1 WHERE val > 0; UPDATE 1 SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Dead | | | (0,242) | Dead | | | (0,243) | Normal | 5451 | 5452 | (0,244) (0,244) | Normal | 5452 | 0 | (0,244) (4 rows) Mvcc Unmasked 85 / 89
  • 86. UPDATE of an Indexed Column SELECT * FROM mvcc_demo OFFSET 1000; val ----- (0 rows) SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Dead | | | (0,242) | Dead | | | (0,243) | Dead | | | (0,244) | Normal | 5452 | 0 | (0,244) (4 rows) Mvcc Unmasked 86 / 89
  • 87. UPDATE of an Indexed Column VACUUM mvcc_demo; VACUUM SELECT * FROM mvcc_demo_page0 OFFSET 240; ctid | case | xmin | xmax | t_ctid ---------+--------+------+------+--------- (0,241) | Unused | | | (0,242) | Unused | | | (0,243) | Unused | | | (0,244) | Normal | 5452 | 0 | (0,244) (4 rows) Mvcc Unmasked 87 / 89
  • 88. Cleanup Summary Reuse Non-HOT HOT Cleanup Heap Item Item Clean Update Method Triggered By Scope Tuples? State State Indexes? FSM Single-Page SELECT, UPDATE, single heap yes dead unused no no DELETE page VACUUM autovacuum all potential yes unused unused yes yes or manually heap pages Cleanup is possible only when there are no active transactions for which the tuples are visible. HOT items are UPDATE chains that span a single page and contain identical indexed column values. In normal usage, single-page cleanup performs the majority of the cleanup work, while VACUUM reclaims “dead” item pointers, removes unnecessary index entries, and updates the free space map (FSM). Mvcc Unmasked 88 / 89
  • 89. Conclusion All the queries used in this presentation are available at http://momjian.us/main/writings/pgsql/mvcc.sql. http://momjian.us/presentations Escher, Relativity Mvcc Unmasked 89 / 89