2. Who am I ?
Pavan Deolasee
Http://www.linkedin.com/in/pavandeolasee
Contributor to PostgreSQL and Postgres-XC Global Development
Best known for development of Heap-Only-Tuple (HOT) feature of PostgreSQL
8.3 release
Contributed several other enhancements to PostgreSQL and derivatives
Associated with Postgres-XC since its inception
Contributed to Postgres-XC's architectural design and implementation of several
features
Previously worked for EnterpriseDB and Symantec/Veritas
11/21/13
Postgres-XC
2
3. Agenda
PostgreSQL
History of Postgres-XC
Architecture Overview
Data Distribution Strategies
Scalability and Performance Results
SPOF Analysis
Development Practices, Project Status and Roadmap
Support
11/21/13
Postgres-XC
3
5. PostgreSQL Recap
PostgreSQL
is world's most advanced open source database
is very stable
is fully compliant with ANSI SQL
supports foreign key, check constraints
supports various kinds of indexes
supports inheritance
is fully extensible (data types, procedural languages etc)
supports ACID transactions
will recover your database in case of server failure
11/21/13
Postgres-XC
5
6. PostgreSQL Recap
PostgreSQL
uses write-ahead-logs for durability and recover your database in case of server
failure
built-in log based streaming synchronous/asynchronous replication
file system level backups and archive recovery
point-in-time recovery
hot standby
binary upgrades
full-text search
and many more
11/21/13
Postgres-XC
6
7. Performance and Scalability
While PostgreSQL was becoming feature rich, scalability was still a
concern, especially on high end machines.
Significant work in last few releases to address performance and
scalability issues
Heap-only-tuples (HOT) in 8.3 for reuse of dead space
Virtual transaction identifiers for read-only transactions
Spread checkpoints to reduce spikes in disk IO
Hot standby for read scalability
Visibility maps, Index-only-scans
Improved spin-lock implementations
Linear read scalability for multi-core machines (64 cores)
Unlogged tables
11/21/13
Postgres-XC
7
8. Performance and Scalability
Scale up is a continuous process
Every major release comes up with new improvements
Reduced contention for critical sections is important for multiprocess scalability
64 and 128 core machines are becoming common
Several hundred concurrent connections is a norm
Efficient and full use of resources on high end machines
Large amounts of RAM
While scale up is nice and very important, it may not be able to keep
pace with the volume and growth of data
Streaming replication and Hot standby allows read scalability, but
each server should still be capable of handling full data
Partitioning using inheritance and constraint exclusion
11/21/13
Postgres-XC
8
9. Performance and Scalability
When data grows rapidly, it may not be possible or affordable to
upgrade hardware quickly
Scale out is a very attractive solution for BIG data use cases
PostgreSQL had some support as contrib module
DBlink
Foreign Data Wrappers (FDW)
Create foreign servers and foreign tables
Natively query those tables from PostgreSQL server
FDWs exist for Postgres, MySQL, File, MongoDB etc
Though attractive, still in very early stage and may not scale with complex
queries
11/21/13
Postgres-XC
9
11. Project Kick-off
2009 - NTT Data Intellilink Corporation, Japan and EnterpriseDB (my
previous employer) decide to collaborate for developing a clustering
solution for PostgreSQL
NTT had previously internally tried to build another Postgres cluster (RITA-DB),
but could not take it beyond prototype stage. Lessons learnt
EnterpriseDB had previously acquired GridSQL, a scale-out solution. But it had
its own issues including performance problems
No other existing clustering solution was good enough
Slony-II
Pgpool-II
GridSQL
Log Shipping Standby
11/21/13
Postgres-R
pl/proxy
Postgres-XC
11
12. Project Mandate
Provide a viable open source, free alternative to a popular shared
disk cluster from a very reputable commercial vendor. Yes, Oracle
RAC
Provide write-scalability
At least 3.5 times scalability with 5 servers in a write intensive workload
Provide read-scalability
Efficient utilization of cluster nodes
Parallel execution
11/21/13
Postgres-XC
12
13. Project Mandate (Cont)
Provide global consistency
Data written from one node must be available from other nodes immediately and
in a consistent manner
Provide ACID properties of transaction
Provide seamless API so that the client applications need no change
or very minor change
Same SQL
Same libpq, JDBC, ODBC APIs
Must run and scale on commodity hardware
Expensive shared storage
Expensive network interconnects
11/21/13
Postgres-XC
13
14. Initial Team
Koichi Suzuki
Chief architect and leader of the project, NTT Data Intellilink
Mason Sharp
StormDB (previously at EnterpriseDB)
Pavan Deolasee (me)
(previously at EnterpriseDB)
Many more from NTT, EnterpriseDB and open source community has
joined and contributed since then
(Part of the content is attributed to this team and the global development
team)
11/21/13
Postgres-XC
14
16. So What is Postgres-XC ?
Shared nothing architecture
Symmetric PostgreSQL cluster
No master/slave replication
No read-only clusters
Every node can issue both read/write
Every node provides single consistent database view
Transparent transaction management
Not just a replication
Each table can be replicated/distributed by sharding
Parallel transaction/query execution
So both read/write scalability
11/21/13
Postgres-XC
16
17. PostgreSQL Streaming Replication
Master-slave architecture
Write-Ahead-Log (WAL) based replication
Synchronous, asynchronous, cascading replication supported
Hot Standby
Can answer read-only queries
Standby may give a stale view of the database since the WAL application
happens asynchronously
Do not support table level replication
11/21/13
Postgres-XC
17
21. PostgreSQL MVCC
PostgreSQL uses Multi-version concurrency control (MVCC) for
guaranteeing Isolation and Consistency (“CI” of ACID)
MVCC increases concurrency
readers don't wait on other readers/writers
writers don't wait on readers
Every write transaction is assigned an XID (32-bit identifier)
Every row that gets inserted/deleted/updated gets a stamp of the
transaction that did the operation
A MVCC Snapshot is used to decide if a given row or a version of a
given row is visible to a transaction or not
11/21/13
Postgres-XC
21
22. PostgreSQL MVCC
Two important concepts
Transaction Identifiers (XIDs)
MVCC Snapshots
A consistent view of the database can be obtained by ensuring that
these two values are managed properly
11/21/13
Postgres-XC
22
23. Global Transaction Manager (GTM)
The only central component in the XC cluster
Provides globally consistent and visible transaction identifiers (GXID)
to transactions running across the cluster
Provides global snapshots to run queries across the cluster so that
they see a consistent view irrespective of how many different nodes
the query need to fetch data from
GTM is not a database itself, though it keeps some simple state
information
Its a separate binary supplied with Postgres-XC source
Runs as a independent server process
11/21/13
Postgres-XC
23
24. Global Transaction Manager (GTM)
There can only be ONE GTM per Postgres-XC cluster, though
additional GTM standby can be configured (more later)
Comes with a GTM-proxy (more later) for request pooling and
scalability
Multi-threaded process (one thread per connection)
Recommended to run GTM on a separate physical machine for
performance and scalability
Helps in a few other side issues such as to manage sequences
11/21/13
Postgres-XC
24
25. Coordinator
Coordinator is the brain of Postgres-XC cluster
Applications connect to the coordinators and issue SQL queries (they
never talk to datanodes directly)
Parse, plan and execute queries
Knows about table distribution properties and takes advantage of the same for
query planning
Pushes down as much computation as possible to the datanodes
Pushes down transaction identifiers and snapshots to other nodes
Manages implicit two-phase transactions
11/21/13
Postgres-XC
25
26. Coordinator
Non-critical component of the Postgres-XC cluster
There could be (and most likely there will be) more than one
coordinator in the cluster
Applications can connect to any coordinator and issue read/write
queries
No master coordinator
Modified version of PostgreSQL
Remember PostgreSQL comes with PostgreSQL license
11/21/13
Postgres-XC
26
27. Datanode
The real guy managing all the data
Accepts requests from coordinators and sends responses back
SQL/DML queries
DDL queries
Maintenance commands (VACUUM, CHECKPOINT, CLUSTER etc)
Modified PostgreSQL engine but with very minimal changes
Transaction management, MVCC
Sequences
Critical component of the cluster
Failure will lead to DML/DDL queries failing unless the coordinator can decide to
bypass the datanode because its not required for the given query
11/21/13
Postgres-XC
27
29. Distribution Strategies
Distributed tables
Each row exists only on a single datanode
Distribution strategies
MODULO
ROUND ROBIN
User defined functions (Future)
Range (Future)
HASH
Values (Future)
Replicated tables
Each row of the table is stored on all datanodes where the table is replicated
11/21/13
Postgres-XC
29
30. Distributed Tables
Write to a single row is applied only on the node where the row
resides
Multiple rows can be written in parallel
Scanning rows spanning across the nodes (e.g. table scans) can hamper
performance
Point reads and writes based on the distribution column value show
good performance
Datanode where the operation happens can be identified by the
distribution column value in the query
11/21/13
Postgres-XC
30
31. Replicated Tables
Statement level replication
Each write needs to be replicated
Writes are costly
Read can happen on any node (where table is replicated)
Reads from different coordinators can be routed to different nodes
Useful for relatively static tables, with high read load
11/21/13
Postgres-XC
31
32. Distribution Samples – Horizontal Scalability
CREATE TABLE emp (empno integer, ename text) DISTRIBUTE BY HASH(empno);
CREATE TABLE emp (empno integer, ename text) DISTRIBUTE BY MODULO(empno);
CREATE TABLE emp (empno integer, ename text) DISTRIBUTE BY ROUND ROBIN;
CREATE TABLE emp (empno integer, ename text);
Default is to distribute by HASH if the datatype is supported
first column of primary key or
first column of foreign key or
first column of the table
If everything else fails, distribute by ROUND ROBIN
11/21/13
Postgres-XC
32
33. Distribution Samples (Cont)
CREATE TABLE emp (empno integer, ename text) DISTRIBUTE BY HASH(empno)
TO NODE (N1, N2);
Distribute to all nodes or a set of nodes, either explicitly named or
grouped together as named group
Default is to distribute to all the nodes configured
11/21/13
Postgres-XC
33
34. Distributed Query Processing
Coordinator
Accepts queries and plans them
Finds the right data-nodes from where to fetch the data
Frames queries to be sent to these data-nodes
Gathers data from data-nodes
Processes it to get the desired result
Datanode
Executes queries from coordinator like PostgreSQL
Has same capabilities as PostgreSQL
11/21/13
Postgres-XC
34
35. Distributed Query Processing
Query shipping
Coordinator tries to delegate maximum query processing to datanodes
Indexes are located on datanodes
Materialization of huge results is avoided in case of sorting, aggregation,
grouping, JOINs etc.
Limit, Offset clauses are shipped to the datanodes
ORDER BY, GROUP BY clauses are shipped to the datanodes
Coordinator is freed to handle large number of connections
Joins are shipped too
Distributing data wisely helps coordinator to delegate maximum query
processing and improve performance
11/21/13
Postgres-XC
35
37. Write Scalability
Tables are distributed/sharded – usually HASH of the column value
Horizontal partitioning
Inserts are directed to the datanode based on the value of the
distributed column
Integrated in the query executor
Not trigger or rule based
Supports COPY (bulk load mechanism of PostgreSQL)
Updates and Deletes also directed to the datanodes
Coordinator is completely bypassed if the query does not have any local/cross
dependency
Highly scalable in many cases
Some queries must go through the coordinator because of
dependencies
11/21/13
Postgres-XC
37
38. Write Scalability (Cont)
CREATE TABLE test (a int, b text) DISTRIBUTE BY HASH(a);
INSERT INTO test SELECT generate_series(1,100), 'foo';
EXPLAIN VERBOSE UPDATE test SET b = 'bar' WHERE a > 1;
QUERY PLAN
---------------------------------------------------------------------------Data Node Scan on "__REMOTE_FQS_QUERY__"
(cost=0.00..0.00 rows=0 width=0)
Output: ('bar'::text), ('bar'::text), test.ctid, xc_node_id
Node/s: node_dn1, node_dn2
Remote query: UPDATE public.test SET b = 'bar'::text WHERE (a > 1)
(4 rows)
11/21/13
Postgres-XC
38
39. Write Scalability (Cont)
EXPLAIN VERBOSE DELETE FROM test WHERE a = 100;
QUERY PLAN
---------------------------------------------------------------------------Data Node Scan on "__REMOTE_FQS_QUERY__"
(cost=0.00..0.00 rows=0 width=0)
Output: test.a, test.ctid, xc_node_id
Node/s: node_dn2
Remote query: DELETE FROM public.test WHERE (a = 100)
(4 rows)
Queries are started in parallel on multiple nodes thus improving
scalability of write operations
11/21/13
Postgres-XC
39
40. Read Scalability
Significant planner enhancements to improve read scalability
Many cool features already in
Continuous development for improving it further
Query is analyzed and shipped to a datanode as it is if there are no
dependencies on the coordinator or other datanodes
Fast Query Shipping
If entire query can not be shipped, parts of the query are shipped
Offload as much processing as possible to the datanodes
Do the final processing of data on the coordinator
Reduce as much data movement from datanodes to the coordinator
11/21/13
Postgres-XC
40
41. Read Scalability (Cont)
Parallel query execution
Query is stared in parallel on all involved datanodes
Significantly improves performance by utilizing capacities across the cluster
Qualifications push-down
WHERE clauses in a SELECT query are pushed down to the datanodes to
reduce the selectivity of the result set
Significantly reduces the amount of data fetched at the coordinator for final
execution
Remote expressions push-down
Push function and expression evaluation to the datanodes
11/21/13
Postgres-XC
41
42. Read Scalability (Cont)
Aggregates
Enhanced PostgreSQL's two step aggregation to three step aggregation
Collect summary at each datanode and then do the final aggregation at the
coordinator
An example, avg(integer)
Datanode D1 - 1, 2, 5, 7
Datanode D2 - 3, 4, 6, 8, 9
Datanode D1 sends: sum(integer), count => 15, 4
Datanode D2 sends: sum(integer), count => 30, 5
Coordinator computes the final avg as (15 + 30) / (4 + 5) = 5
11/21/13
Postgres-XC
42
43. Read Scalability (Cont)
Join Push-down
Complete or part of complex joins can be pushed down to the datanodes
Complex analysis of the tables involved in the join and their distribution
strategy is required to arrive at a conclusion regarding what can be pushed
and what must not be pushed
Significant performance and scalability boost
Order By
Result sets are sorted on individual datanodes and fetched on the coordinator
Coordinator does a final merge step of the sorted sets
Limit/Offset
11/21/13
Postgres-XC
43
44. Thumb Rules - Read-Write Queries
High point reads (based on distribution column)
Distributed or replicated
High read activities but no frequent writes
Better be replicated
High point writes
Better be distributed
High insert-load, but no frequent update/delete/read
Better be round-robin
11/21/13
Postgres-XC
44
45. Query Analysis
Find the relations/columns participating in equi-join conditions,
WHERE clause etc.
Distribute on those columns
Find columns participating in GROUP BY, DISTINCT clauses
Distribute on those columns
Find columns/tables which are part of primary key and foreign key
constraints
Global constraints are not yet supported in XC
Distribute on those columns
11/21/13
Postgres-XC
45
47. Example DBT1
author, item
Less frequently written
Frequently read from
Author and item are frequently JOINed
Dimension tables
Hence replicated on all nodes
11/21/13
Postgres-XC
47
48. Example DBT1
customer, address, orders, order_line, cc_xacts
Frequently written
hence distributed
Participate in JOINs amongst each other with customer_id as JOIN key
point SELECTs based on customer_id
hence distributed by hash on customer_id so that JOINs are shippable
Participate in JOINs with item
11/21/13
Having item replicated helps pushing JOINs to datanodes
Postgres-XC
48
49. Example DBT1
Shopping_cart, shopping_cart_line
Frequently written
Hence distributed
Point selects based on column shopping_cart_id
Hence distributed by hash on shopping_cart_id
JOINs with item
11/21/13
Having item replicated helps
Postgres-XC
49
52. GTM Scalability
GTM can become a bottleneck
serve XID for every new transaction in the cluster
MVCC snapshot for transaction or every statement in a transaction (readcommitted transactions)
Snapshot size can increases proportional to the number of concurrent
transactions in the system
GTM has been heavily optimized and runs a multi-threaded process
for scalability
For even greater scalability, advised to run GTM Proxy on every
coordinator
11/21/13
Postgres-XC
52
53. GTM Scalability – GTM Proxy
Runs as a proxy to GTM on other servers – preferably on the same
server where coordinators run
One proxy per coordinator server
Same API as GTM
Components can connect to a proxy or GTM directly
Pools multiple GET TXID requests into a single request
e.g get 100 new identifiers in a single request/response
Pools multiple GET SNAPSHOT requests into a single request
Some manipulation of the received snapshot is required, but that's possible
Further pools requests of multiple types
Reduces load on the GTM significantly
One proxy per server scales near linearly
11/21/13
Postgres-XC
53
54. Coordinator Scalability
Coordinator does not store table data
IO scalability not an issue
Query parsing, planning and part or full execution
CPU utilization could be high
Add more coordinators if you need more computing power
Beware of connections explosion on the datanodes
Every additional coordinator will need more connections to datanode
Built-in connection pooler in the coordinator can help a lot
11/21/13
Postgres-XC
54
55. Datanode Scalability
Datanode stores and fetches table data from the disks
In case of IO saturation
Add more datanodes and reconfigure existing tables
Easier said than done
May need data redistribution (except for round-robin distribution)
Table remains locked out (until concurrent data redistribution is added)
Add more datanodes for new tables
11/21/13
Create new tables on the new datanode
Existing tables are not impacted
Postgres-XC
55
56. Transaction Management
Postgres-XC uses two-phase commit protocol for distributed
transaction management
Every commit will make two round-trips to the datanode (PREPARE and
COMMIT)
Optimized so that single phase commit is used if only one node is involved in a
transaction
Optimized so that only write-operating datanodes/coordinators are counted for
deciding if 2PC is required or not
If possible, design transactions so that they span a single or as few
nodes as possible
11/21/13
Postgres-XC
56
58. Single Point of Failure
GTM
Obviously SPOF
GTM-Proxy
No persistent data
Just restart when fails
Coordinator
Every coordinator is essentially a copy
When fails, other coordinators work
Datanode
SPOF for sharded table
11/21/13
Postgres-XC
58
59. SPOF - GTM
GTM keeps crucial information about next transaction identifier,
currently running transactions as well as sequence information
Actual fate of each transaction is not permanently saved on the GTM though (at
least not today)
GTM failure leads to complete failure of the cluster
any useful work would needs GTM interaction
GTM Standby for rescue
11/21/13
Postgres-XC
59
60. GTM Standby
GTM Standby is a working replica of the GTM
Every operation on the GTM is replicated on the GTM, either via proxy or GTM
Create and configure standby
Only one standby can be configured currently
Can be added to a running cluster without any disruption
Can be promoted to be the new master
Same executable as GTM, only configuration change
11/21/13
Postgres-XC
60
61. SPOF - Datanode
Almost all the techniques for PostgreSQL HA are available
Streaming replication
Shared disk remount
Coordinators may require reconfiguration
Coordinators should use the new datanode
Coordinators should clean connections to failed datanode before reconfiguration
GTM
Reconnect to (new) local GTM proxy
11/21/13
Postgres-XC
61
62. SPOF – Coordinator
Coordinator only maintains catalog information
Stable and mostly read-only
All coordinators are replica of each other
No master-slave concept. Every coordinator is equal
Datanode HA techniques can be applied
Streaming replication
Shared disk remount
Applications can continue to use other coordinators while the failed
coordinator is restored
11/21/13
Postgres-XC
62
64. Development Process
Project hosted on sourceforge
http://postgres-xc.sourceforge.net/
Join developer/user mailing lists
postgres-xc-developers@lists.sourceforge.net
postgres-xc-general@lists.sourceforge.net
Test and report bugs
postgres-xc-bugs@lists.sourceforge.net
Submit patches
11/21/13
Postgres-XC
64
65. Regression and Performance Tests
PostgreSQL's regression tests are enhanced
New tests added to test Postgres-XC specific features
Existing tests modified to adjust for difference because of cluster environment
Nightly regression tests are run on buildfarm (at least last I checked)
Regular performance tests are conducted by NTT Data to guard
against performance regression
Typically DBT1 tests
Last I heard, DBT2 tests were being developed by NTT
11/21/13
Postgres-XC
65
66. Status
Release 1.1 is in beta stage (recommended)
Based on latest PostgreSQL 9.2 stable release
Many new features, enhancements and bug fixes
Triggers
Planner enhancements
GTM standby
Where current of
Data redistribution
Returning
Release 1.0 is the latest stable release
Based on PostgreSQL 9.1
11/21/13
Postgres-XC
66
67. Release Management
Yearly major release
Merge with PostgreSQL's new release
Additional XC features
Quarterly minor release
Bug fixes
Security fixes
Catch up with PostgreSQL's minor release
11/21/13
Postgres-XC
67
68. Licensing
Started with GPL license, but changed to PostgreSQL license couple
of years back
Anyone can take the source, modify and use without requiring to open source
the modified code
Very flexible licensing
11/21/13
Postgres-XC
68
69. Support
Very good support available from the core team on the mailing lists
Helps if a self containing test case is included
Encouraged to fix bugs on your own and submit a patch
A committer will look at the patch and commit (with changes if
required)
Volume of bug reports is still low and hence good turnaround time
A few commercial companies have started offering professional
support
Good for project in the long run
Skills of support staff could be limited right now
11/21/13
Postgres-XC
69
71. Current Limitations
Multi-column distribution is not supported
Global constraints not supported
Unique index on non-distribution column is not supported
Updating distribution column not supported
Statement level triggers not supported on the datanode side
Statement level triggers will fire at the coordinator
11/21/13
Postgres-XC
71
72. Current Limitations
No cluster wide deadlock detection
Timeout based deadlock resolution
No cost estimation at coordinator (table statistics not available/used
at the coordinator)
ANALYZE command is propagated to the datanodes and the stats are updated
on the datanodes
Datanodes have up-to-date statistics. So they will plan the queries in the most
efficient way
Node management needs much improvement
No built-in HA for replicated tables SELECT queries
11/21/13
Postgres-XC
72
73. Current Limitations
Truly serializable transactions not supported
Read-committed and repeatable reads are supported
Transactions accessing TEMP objects can not be committed with
2PC
PostgreSQL does not support PREPARing a transaction that has accessed
temporary objects
Set enforce_two_phase_commit = off to handle such transactions
Savepoints not supported
Exception blocks in plpgsql not supported (they need savepoint
support)
11/21/13
Postgres-XC
73
74. Roadmap
Support for remaining SQL features (continuous process)
Planner enhancements so that more and complex queries can be
pushed down to the datanodes
Table statistics at the coordinator
Concurrent redistribution of data after node additional/removal
Without exclusive lock on the table
Separate group in NTT is developing resource agents for
corosync/pacemaker
Plans not yet in public domain
Installers and further testing with DBT-2/DBT-5
11/21/13
Postgres-XC
74
76. Project Resources
Postgres-XC project home page
http://postgres-xc.sourceforge.net/
Project wiki and links to other presentations/white papers
Downloads and installation
Product RPMS available for download
Source code
Source code from GIT repository
Zipped tar balls
Other tools
pgxc_ctl utility to configure/manage/monitor the cluster
A bunch of other contrib modules
11/21/13
Postgres-XC
76
77. Project Resources
Documentation
Extensive Postgres style documentation explaining XC-specific features
Available online as well as distributed with source/rpms.
Mailing lists
Postgres-xc-general
Postgres-xc-developers
Postgres-xc-bugs
Write to me
pavan.deolasee@gmail.com
11/21/13
Postgres-XC
77