This presentation explains the index structure of Google Cloud Datastore and its underlying components: Google File System (Colossus), Bigtable, and Megastore.
Session video (Japanese)
https://youtu.be/H-tZUZGBo60?t=8524
2016/11/08 ver1.0 Published
2016/11/11 ver2.0 Add notes on Spanner
2017/02/09 ver2.1 Fix on Spanner's consistency description.
4. Dual nature of entities
● An entity represents a row of a specific "kind".
● You can think of "kind" as a table in the relational data model.
● An entity is identified by an ID (a user-specified string or an
auto-generated numeric ID) plus its (mysterious) parent key.
Figure: A row of a kind, with its unique identifier labeled
5. Dual nature of entities
● An entity represents a node of an "entity group" tree.
● An entity group can contain entities from multiple kinds.
● An entity is identified by a key (ancestor path + ID).
○ A key must contain all entities from the root.
○ Some entities in the ancestor path may not exist.
Figure: A node of an entity group (the root entity "Organization: Flywheel" doesn't exist)
Key: (Organization, 'Flywheel', User, 'Alice', Mail, '15de6') = ancestor path + ID
6. The bright/dark side of an entity
● It's safe to treat an entity as a member of an entity group.
○ Entities treated as part of an entity group are guaranteed to be strongly consistent.
● An ancestor query is a query that specifies an ancestor.
○ The search range is limited to the descendants of the specified ancestor.
○ Ancestor queries are strongly consistent.
○ In other words, it always retrieves the latest data.
○ You can use a single-phase transaction inside an entity group.
○ A cross-group transaction can also be used, but it is slower than a single-phase transaction.
● A global query is a query without specifying an ancestor.
○ Global queries are eventually consistent.
○ You may see old content and/or fail to find newly created entities.
7. Mystery of composite indexes
● Can you tell which query requires an additional (non-default) index?
○ Global query
■ SELECT * FROM Mail WHERE size>256 ⇒ ◯ (OK)
■ SELECT * FROM Mail WHERE size=256 and access_count>5 ⇒ △ (Need an additional index)
■ SELECT * FROM Mail WHERE size>256 and access_count>5 ⇒ ✕ (This is not allowed)
■ SELECT size FROM Mail WHERE size>256 ⇒ ◯ (OK)
■ SELECT title FROM Mail WHERE size>256 ⇒ △ (Need an additional index)
○ Ancestor query
■ SELECT * FROM Mail
WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ ◯
■ SELECT * FROM Mail WHERE size=256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △
■ SELECT * FROM Mail WHERE size>256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △
8. What's happening under the covers?
● How is strong consistency guaranteed for ancestor queries?
● Why do I have to define additional indexes for some queries?
● When and why do I need to specify "ancestor = True" for an index?
9. Truth is here
● Cloud Datastore is implemented on top of Megastore, which is layered over Bigtable and
Google File System. The internal architecture of Megastore, Bigtable, and Google File
System is explained in the published research papers.
● Megastore: Providing Scalable, Highly Available Storage for Interactive Services
○ http://research.google.com/pubs/pub36971.html
● Bigtable: A Distributed Storage System for Structured Data
○ http://research.google.com/archive/bigtable.html
● The Google File System
○ http://research.google.com/archive/gfs.html
Figure: Layered stack of Google File System, Bigtable, and Megastore
10. Notes on Colossus
● Colossus is a successor of Google File System which overcomes shortcomings of
Google File System. It is used as an infrastructure of Google Cloud Platform as well as
Google's internal systems today.
● The following characteristics were mentioned at Google Faculty Summit 2010.
○ Next-generation cluster-level file system
○ Automatically sharded metadata layer
○ Data typically written using Reed-Solomon (1.5x)
○ Client-driven replication, encoding and replication
○ Metadata space has enabled availability analyses
● Since the architectural details of Colossus are not yet published, this presentation explains
the architecture of Google File System.
12. What is Google File System?
● Large-scale distributed file system used in Google's internal systems to store large files.
● Optimized for file appends and sequential reads of large files.
○ Other operations are supported but may be very slow.
● Transparent file replication for redundancy.
○ Each file is split into multiple 64 MB chunks and each chunk is stored on (at least)
three chunk servers.
Figure: Typical access patterns (handing over large data between servers; streaming data aggregation)
13. Optimized dataflow
● Data is transferred serially from a client along a chain of chunk servers. Each chunk server
starts forwarding the data to the next one right after it starts receiving it.
○ Faster than sending data from a client to all chunk servers in parallel.
● Control messages are handled by the primary chunk server to keep the replicas
consistent.
Figure: Dataflow to append data, and control flow to commit the write, between a client and three chunk servers (one primary, two secondaries)
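The claim that chained transfer beats parallel fan-out can be checked with a back-of-the-envelope model. The Google File System paper gives the ideal time to push B bytes to R replicas through a pipeline as B/T + RL (T: network throughput, L: per-hop latency). The fan-out formula below is a simplifying assumption of this sketch (the client's outbound link is shared by all replicas), and the function names and numbers are hypothetical.

```python
def pipelined_time(size_bytes, throughput_bps, replicas, hop_latency_s):
    """Chained transfer: every server forwards bytes as soon as they arrive."""
    return size_bytes / throughput_bps + replicas * hop_latency_s


def fan_out_time(size_bytes, throughput_bps, replicas, hop_latency_s):
    """Assumed model: the client sends a full copy to each replica over one shared link."""
    return replicas * size_bytes / throughput_bps + hop_latency_s


# Hypothetical numbers: 1 MB of data, 100 Mbps links (12.5 MB/s), 3 replicas, 1 ms per hop.
size = 1_000_000
throughput = 12_500_000
chained = pipelined_time(size, throughput, replicas=3, hop_latency_s=0.001)
parallel = fan_out_time(size, throughput, replicas=3, hop_latency_s=0.001)
print(f"pipelined: {chained * 1000:.0f} ms, fan-out: {parallel * 1000:.0f} ms")
```

With these numbers the chained transfer costs roughly one copy of the data plus three hops, while the fan-out costs three copies over the client's link.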
14. Data corruption detection
● Each chunk is associated with a checksum to
detect data corruption.
● The whole chunk is read and validated with the
checksum for the read operation.
○ This is optimized for the sequential read.
● A new checksum is calculated with appended
data and the existing checksum for the write
operation.
○ This is optimized for the file append.
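The append-friendly checksum update can be illustrated with a running CRC. This is only a sketch: `zlib.crc32` is a stand-in chosen for this example (the slides do not say which checksum function is used), and the record contents are made up.

```python
import zlib

existing = b"log record 1\n"
appended = b"log record 2\n"

# Checksum of the data already stored in the chunk.
checksum = zlib.crc32(existing)

# Append: continue from the existing checksum using only the new bytes,
# without re-reading the data that is already in the chunk.
checksum = zlib.crc32(appended, checksum)

# Read: the whole content is validated against the stored checksum.
stored = existing + appended
is_valid = zlib.crc32(stored) == checksum

# A corrupted copy no longer matches the stored checksum.
corrupted = b"log record X\n" + appended
is_corrupted_detected = zlib.crc32(corrupted) != checksum
print(is_valid, is_corrupted_detected)
```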
16. What is Bigtable?
● Large-scale distributed key-value style datastore used in Google's internal systems to
store structured data of varying sizes (from web page URLs to satellite imagery).
● Google Cloud Platform offers a managed Bigtable service with HBase-compatible APIs.
Figure: Column family design to store HTML contents and inverse links (excerpt from the research paper)
17. Row as a Database
● Data is identified with "Row Key + Column family: Column" (+ timestamp).
● You can think of a single row as a small database.
○ A column family represents a table.
○ Columns can be dynamically added to a column family.
○ Atomic operations can be used within a single row.
Figure: Column family design for user profiles and query histories
18. Global view of the "big" table
● Rows are stored in lexicographic order by row key. The row range for a table is
dynamically partitioned into units called 'tablets'.
○ This strategy is optimized for fast row range scans.
● Tablet servers provide the access to tablets. The tablet assignment is managed by a
master node.
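Because rows are sorted by row key and each tablet covers a contiguous key range, finding the tablet for a row is a binary search over tablet boundaries. The sketch below assumes each tablet is identified by its last (inclusive) row key; the boundary keys and server names are hypothetical.

```python
import bisect

# Hypothetical tablets: each covers the row keys up to and including its end key.
tablet_end_keys = ["f", "m", "t", "\uffff"]
tablet_servers = ["server-a", "server-b", "server-c", "server-d"]


def locate_tablet(row_key):
    """Return (tablet index, tablet server) responsible for row_key."""
    index = bisect.bisect_left(tablet_end_keys, row_key)
    return index, tablet_servers[index]


print(locate_tablet("apple"))
print(locate_tablet("melon"))

# A row range scan touches only the tablets between the two end points.
first, _ = locate_tablet("n")
last, _ = locate_tablet("s")
print(list(range(first, last + 1)))
```

A scan over a narrow key range usually stays within one tablet, which is why this layout favors row range scans.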
19. Tablet representation
● Tablet data consists of in-memory data (memtable) and immutable files (SSTables)
stored in Google File System.
○ SSTables store a frozen view of a tablet at some point in time. Updates are
appended to a tablet log and applied to the memtable.
○ A tablet server constructs a unified view of the tablet by combining the memtable and
SSTables.
Figure: Tablet representation mechanism (excerpt from the research paper)
● When the memtable becomes too large, a new memtable is created and the old one is
frozen and written out as a new SSTable (minor compaction).
● When there are too many SSTables, they are merged into a single SSTable, discarding
obsolete entries (major compaction).
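The memtable/SSTable mechanism can be sketched in a few lines. This is a toy model under stated assumptions, not Bigtable's implementation: SSTables are plain dicts, the memtable limit is a tiny hypothetical number, and deletions are modeled with a tombstone marker that major compaction drops.

```python
TOMBSTONE = object()  # marker for deleted keys (an assumption of this sketch)


class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []        # append-only tablet log
        self.memtable = {}   # recent updates, in memory
        self.sstables = []   # frozen snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def delete(self, key):
        self.write(key, TOMBSTONE)

    def minor_compaction(self):
        # Freeze the memtable into a new SSTable and start a fresh memtable.
        self.sstables.append(dict(self.memtable))
        self.memtable = {}

    def major_compaction(self):
        # Merge all SSTables into one, keeping only the newest live entries.
        merged = {}
        for table in self.sstables:
            merged.update(table)
        self.sstables = [{k: v for k, v in merged.items() if v is not TOMBSTONE}]

    def read(self, key):
        # Unified view: memtable first, then SSTables from newest to oldest.
        for table in [self.memtable] + self.sstables[::-1]:
            if key in table:
                value = table[key]
                return None if value is TOMBSTONE else value
        return None


tablet = Tablet()
tablet.write("a", 1)
tablet.write("b", 2)   # memtable is full: minor compaction
tablet.write("a", 3)
tablet.delete("b")     # memtable is full again: second SSTable
tablet.write("c", 4)
tablet.major_compaction()
print(tablet.read("a"), tablet.read("b"), tablet.read("c"), len(tablet.sstables))
```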
21. Overview of Megastore
● Megastore provides ACID semantics for
globally distributed datasets using a fast
synchronous replication mechanism based
on (an enhanced version of) Paxos.
● This part explains the index structure of
Cloud Datastore implemented on top of
Megastore.
● Note that ancestor and global queries are
additional features of Cloud Datastore.
They are not part of Megastore.
Figure: Multi-datacenter replication architecture of Megastore (excerpt from the research paper)
23. How are entities stored in Bigtable?
● Row key: entity key (ancestor path + ID).
○ The whole entity group can be scanned by a row range scan (depth-first search).
● Column family: properties of an entity.
○ An independent column family is used for each property.
Row key | status of the group | email | title | size | access_count
Organization, 'Flywheel' | - | - | - | - | -
Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - | -
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024 | 9
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128 | 5
Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d' | - | - | xxxx | 256 | 3
Organization, 'Flywheel', User, 'Bob' | - | xxxx | - | - | -
...
Note on "status of the group": the transaction log and replication status are recorded
for operations with strong consistency.
Note on the row order: an entity group is read with a row range scan.
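Why a row range scan covers a whole entity group can be seen by modeling row keys as tuples, which Python compares element by element much like a lexicographic row key order. The sketch below is an illustration with hypothetical helper names, not the actual Bigtable key encoding.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

# Rows sorted by row key (entity key = ancestor path + ID).
rows = sorted([
    ORG,
    ALICE,
    ALICE + ("Mail", "15de6"),
    ALICE + ("Mail", "65067"),
    ALICE + ("Mail", "c4c0d"),
    BOB,
    BOB + ("Mail", "c6f4c"),
])


def scan_descendants(ancestor):
    """Row range scan: start at the ancestor's row, stop at the first non-descendant."""
    start = bisect.bisect_left(rows, ancestor)
    result = []
    for key in rows[start:]:
        if key[:len(ancestor)] != ancestor:
            break
        result.append(key)
    return result


alice_rows = scan_descendants(ALICE)
alice_mails = [key for key in alice_rows if key[-2] == "Mail"]
print(alice_mails)
```

All descendants of an ancestor sit in one contiguous block directly after the ancestor's own row, so the scan never has to look at Bob's rows.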
24. Ancestor query without inequality filters
● The following queries don't require an additional index since they can be done by a row
range scan.
● The scan starts from a row with the specified ancestor key.
Row key | status of the group | email | title | size
Organization, 'Flywheel' | - | - | - | -
Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - ← Starts from here
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128
SELECT * FROM Mail
WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
SELECT * FROM Mail WHERE size=256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
25. Ancestor query with inequality filters
● The following query requires an additional index.
● In theory, the same row range scan could be used, but it may not be efficient enough.
Instead, the following index should be used.
○ The row key of this index table consists of:
■ "Ancestor of the entity" + "Property value" + "Entity key (ancestor path + ID)"
○ See next pages for details.
SELECT * FROM Mail WHERE size>256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
indexes:
- kind: Mail
  ancestor: yes
  properties:
  - name: size
26. Single-property indexes for ancestor queries
● Each entity is mapped to multiple rows corresponding to all its ancestors.
○ The following example shows the rows for two entities.
○ This will be sorted in the order of row keys, then...
Row key (Ancestors | Property value | Entity key (ancestor path + ID)) → Column
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → Pointer to entity
Organization, 'Flywheel', User, 'Alice' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → Pointer to entity
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → Pointer to entity
Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → Pointer to entity
Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → Pointer to entity
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → Pointer to entity
27. Single-property indexes for ancestor queries
● Using the row keys which are sorted in lexicographic order:
○ First, the row range is limited by the specified ancestor.
○ The row range is narrowed further by the inequality filter.
Organization, 'Flywheel' | 64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel', User, 'Alice' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
SELECT * FROM Mail WHERE size>256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel')
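The two narrowing steps (first by ancestor, then by the inequality filter) amount to two binary searches over the sorted index rows. The following is a minimal sketch with tuples standing in for row keys; the helper names are hypothetical, and the real row key encoding is not shown in the slides.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

# Entity key -> size, taken from the example rows.
mails = {
    ALICE + ("Mail", "15de6"): 1024,
    ALICE + ("Mail", "65067"): 128,
    ALICE + ("Mail", "c4c0d"): 256,
    BOB + ("Mail", "15de6"): 64,
    BOB + ("Mail", "c6f4c"): 256,
    BOB + ("Mail", "f67de"): 1024,
}


def index_rows(entity_key, value):
    """One index row per ancestor prefix of the entity key, including the key itself."""
    return [(entity_key[:end], value, entity_key)
            for end in range(2, len(entity_key) + 1, 2)]


# The index table, sorted by row key: (ancestor, property value, entity key).
index = sorted(row for key, size in mails.items() for row in index_rows(key, size))
# The (ancestor, property value) part of each row key, used as the search key.
prefixes = [row[:2] for row in index]


def ancestor_query_greater_than(ancestor, lower_bound):
    """Entity keys under `ancestor` whose size is greater than lower_bound."""
    start = bisect.bisect_right(prefixes, (ancestor, lower_bound))
    end = bisect.bisect_right(prefixes, (ancestor, float("inf")))
    return [row[2] for row in index[start:end]]


print(ancestor_query_greater_than(ORG, 256))
print(ancestor_query_greater_than(ALICE, 256))
```

The matching rows form one contiguous slice `index[start:end]`, which is what makes the query a single row range scan.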
28. Composite indexes for multiple conditions
● Indexes with multiple properties are used for queries with multiple conditions.
● The following query requires the composite index.
● The order of properties in the index definition has meaning.
○ The property for equality filter must come first.
indexes:
- kind: Mail
  ancestor: yes
  properties:
  - name: size
  - name: access_count
SELECT * FROM Mail WHERE size=256 and access_count<5
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel')
Ancestor | size | access_count | Entity key (ancestor path + ID)
Organization, 'Flywheel' | 64 | 1 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
Organization, 'Flywheel' | 128 | 5 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel' | 256 | 3 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
Organization, 'Flywheel' | 256 | 8 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
Organization, 'Flywheel' | 1024 | 2 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
Organization, 'Flywheel' | 1024 | 9 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
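Putting the equality property first is what keeps the result a single range: the equality value pins down a block of rows, and the inequality then selects a slice inside that block. A minimal sketch, again using tuples as stand-ins for row keys and a hypothetical `query` helper:

```python
import bisect

ORG = ("Organization", "Flywheel")

# Composite index rows: (ancestor, size, access_count, entity key).
index = sorted([
    (ORG, 64, 1, ORG + ("User", "Bob", "Mail", "15de6")),
    (ORG, 128, 5, ORG + ("User", "Alice", "Mail", "65067")),
    (ORG, 256, 3, ORG + ("User", "Alice", "Mail", "c4c0d")),
    (ORG, 256, 8, ORG + ("User", "Bob", "Mail", "c6f4c")),
    (ORG, 1024, 9, ORG + ("User", "Alice", "Mail", "15de6")),
    (ORG, 1024, 2, ORG + ("User", "Bob", "Mail", "f67de")),
])
# The (ancestor, size, access_count) part of each row key, used as the search key.
prefixes = [row[:3] for row in index]


def query(ancestor, size, max_access_count):
    """size = `size` AND access_count < `max_access_count` under `ancestor`."""
    start = bisect.bisect_left(prefixes, (ancestor, size, float("-inf")))
    end = bisect.bisect_left(prefixes, (ancestor, size, max_access_count))
    return [row[3] for row in index[start:end]]


print(query(ORG, 256, 5))
```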
29. Multiple inequality filters are not allowed!
● The following query is not allowed.
○ The matching rows do not form a single contiguous range in the index table.
SELECT * FROM Mail WHERE size>128 AND access_count<5
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
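The non-contiguity is easy to see on a small sorted index. The sketch below uses the (size, access_count) pairs of all six example mails (an assumption: the whole Organization rather than only Alice's mails, so that the gap is visible) and checks whether the matching rows are adjacent.

```python
# (size, access_count) pairs of the example mails, sorted like a composite index.
rows = sorted([(64, 1), (128, 5), (256, 3), (256, 8), (1024, 9), (1024, 2)])

# Positions of the rows that satisfy both inequality filters.
matches = [i for i, (size, access_count) in enumerate(rows)
           if size > 128 and access_count < 5]

# A single range scan is possible only if the positions are consecutive.
contiguous = matches == list(range(matches[0], matches[-1] + 1))
print(matches, contiguous)
```

The row (256, 8) sits between the two matches, so one range scan would either return it or miss a match.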
30. Strong consistency of ancestor queries
● Indexes with "ancestor: yes" are used for ancestor queries where independent indexes
are created for each ancestor tree.
○ A single index table contains entries only for one entity group.
● Indexes are created in each datacenter and replicated.
○ Replication status is checked before starting a query to guarantee strong
consistency.
Row key | status of the group | email | title | size | access_count
Organization, 'Flywheel' (root entity) | Replication Status | - | - | - | -
Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - | -
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024 | 9
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128 | 5
32. Indexes for global queries
● Indexes with "ancestor: no" are used for global queries where indexes are created for
each kind.
○ One index table contains all entities of a specific kind including entities from
multiple entity groups.
● Megastore handles operations across
entity groups with weaker consistency
unless two-phase commit is used.
● On the Cloud Datastore layer, this results in
the eventual consistency of global queries.
Figure: Operation across entity groups (excerpt from the research paper)
33. Default single-property indexes
● Single-property indexes for global queries are automatically created (in both asc and
desc orders).
○ Ancestors are not included in row keys of the index table.
● For example, the following queries use the default indexes.
SELECT * FROM Mail WHERE size>256
SELECT size FROM Mail WHERE size>256
64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
34. Composite indexes for global queries
● Indexes with multiple properties (composite indexes) need to be created manually.
○ Projection queries also need composite indexes so that values can be retrieved
directly from the index table.
SELECT * FROM Mail WHERE size=256 and access_count>5
SELECT title FROM Mail WHERE size>256 (projection query)
indexes:
- kind: Mail
  ancestor: no
  properties:
  - name: size
  - name: access_count
- kind: Mail
  ancestor: no
  properties:
  - name: size
  - name: title
With the second index, 'title' can be retrieved directly from the index table.
35. Index direction matters for sort orders
● "ORDER BY" requires the corresponding index.
● When used with an equality filter, the index direction needs to match the sort order.
SELECT * FROM Mail WHERE size=256 ORDER BY access_count DESC
indexes:
- kind: Mail
  ancestor: no
  properties:
  - name: size
  - name: access_count
    direction: desc
● "ORDER BY" cannot be mixed with an inequality filter on a different property.
○ The following query is not allowed.
SELECT * FROM Mail WHERE size>256 ORDER BY access_count DESC
37. Design guide for entity groups
● Avoid global queries (queries without specifying an ancestor) unless you understand what
you are doing.
○ Global queries may not retrieve the latest data.
● Split data into entity groups so that updates within a single entity group are infrequent.
○ Entities in a single entity group should be updated less than once per second in total.
● Examples:
○ Web mail service
■ An entity group of mails for each user.
○ SNS user group service
■ An entity group of the user profile for each user.
■ An entity group of posts for each user group.
■ An entity group of group names and pointers to group sites, which serves as a catalog of user
groups.
○ Online map service
■ An entity group of patches for an arbitrary region of the globe.
38. References
● Under the Covers of the Google App Engine Datastore
● How Entities and Indexes are Stored
● Balancing Strong and Eventual Consistency with Google Cloud Datastore
40. What is Spanner?
● Spanner: Google's Globally-Distributed Database
○ http://research.google.com/archive/spanner.html
● Spanner is Google's scalable, multi-version, globally distributed, and synchronously replicated
database. It is used as a successor of Megastore in Google's internal systems.
● It is designed to overcome the shortcomings of Megastore and to support general-purpose
transactions with an SQL-based query language.
● Examples of Megastore's shortcomings:
○ It doesn't support the relational data model or an SQL-based query language.
○ Transactions and strong consistency are limited to a single entity group.
○ The number of updates is limited to 1 update/sec for each entity group.
41. Infrastructure design
● The overall server architecture of Spanner resembles Megastore over Bigtable.
○ A cluster in each zone contains multiple spanservers. Zones are distributed across
data centers.
○ Each spanserver manages tablets, which hold the key-value mappings:
(key: string, timestamp: int64) → value: string
○ Backend data files are stored in Colossus.
Figure: Spanner server organization (excerpt from the research paper)
● Unlike Bigtable, rows in a tablet are
versioned with a system-assigned time instead of
user-specified timestamps.
○ The versioning mechanism is used for
snapshot reads and lock-free read-only
transactions.
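The (key, timestamp) → value mapping and a snapshot read can be sketched as follows. This is a toy in-memory model with hypothetical names; it assumes that writes arrive with increasing, system-assigned timestamps.

```python
class VersionedStore:
    """Toy multi-version store: (key, timestamp) -> value."""

    def __init__(self):
        self.versions = {}  # key -> list of (timestamp, value), oldest first

    def write(self, key, timestamp, value):
        # Assumption: timestamps for a key are assigned in increasing order.
        self.versions.setdefault(key, []).append((timestamp, value))

    def snapshot_read(self, key, timestamp):
        """Return the newest value written at or before `timestamp`, without locking."""
        for version_time, value in reversed(self.versions.get(key, [])):
            if version_time <= timestamp:
                return value
        return None


store = VersionedStore()
store.write("mail/15de6/size", 10, "1024")
store.write("mail/15de6/size", 20, "2048")
print(store.snapshot_read("mail/15de6/size", 15))
print(store.snapshot_read("mail/15de6/size", 25))
```

Since old versions are never overwritten, a reader at timestamp 15 sees a stable snapshot even while newer versions are being written.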
42. Paxos-based tablet replication
● Tablets in different zones are replicated with a Paxos-based algorithm.
○ A leader in each replication group takes care of row-range write locks during
read-write transactions. A leader is re-elected through Paxos if necessary.
○ In the case of transactions which involve multiple replication groups, transaction
managers from each group cooperate to perform a two-phase commit.
Figure: Replication between tablets (excerpt from the research paper)
43. So..., what's the problem?
● The problem with a Paxos-based algorithm is that part of the replication is done asynchronously.
○ When a majority of the replicas have agreed to write the data, it's considered to be
committed. The remaining replicas are updated asynchronously.
○ If you enforced genuine full replication on each write, performance would be severely
degraded. (This is partly the reason for the limited strongly consistent updates on
Megastore.)
● Spanner associates timestamps with all writes, and every replica tracks a value called "safe
time: t-safe" which is the maximum timestamp at which a replica is up-to-date.
○ A replica can satisfy a read request for a timestamp t if t <= t-safe. If not, another
replica is used.
○ t-safe advances at each Paxos write. During a transaction, the advancement is
delayed until the transaction finishes.
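The t-safe rule boils down to one comparison per replica. The sketch below uses hypothetical replica names and integer timestamps, and simply tries replicas in order; how Spanner actually chooses among replicas is not described in the slides.

```python
from dataclasses import dataclass


@dataclass
class Replica:
    name: str
    t_safe: int  # maximum timestamp at which this replica is up-to-date

    def can_serve(self, read_timestamp):
        # A read at timestamp t is safe on this replica if t <= t-safe.
        return read_timestamp <= self.t_safe


def pick_replica(replicas, read_timestamp):
    """Return the first replica that is sufficiently up-to-date, or None."""
    for replica in replicas:
        if replica.can_serve(read_timestamp):
            return replica
    return None


replicas = [Replica("nearby", t_safe=100), Replica("remote", t_safe=120)]
print(pick_replica(replicas, 90).name)
print(pick_replica(replicas, 110).name)
print(pick_replica(replicas, 130))
```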
44. So..., again, what's the problem?
● The timestamp-based tracking requires that the clocks on all replicas are synchronized.
○ At least, clocks should be calibrated within a limited amount of uncertainty, and the
range of uncertainty is known to the system.
● Spanner clusters are equipped with a
TrueTime API system consisting of multiple
time servers that use GPS and atomic clocks.
○ The TrueTime API provides a time
interval that is guaranteed to contain
the current time.
Figure: Fluctuations of time drifts from time servers (excerpt from the research paper; annotations: hardware maintenance of two time servers, network latency improvement)
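An interval-returning clock can be sketched as follows. This is only an illustration of the idea: `tt_now` and `tt_after` are hypothetical stand-ins loosely modeled on the TT.now() and TT.after() calls described in the Spanner paper, the uncertainty bound `epsilon` is an assumed constant, and the local system clock replaces the GPS and atomic clock time servers.

```python
import time
from dataclasses import dataclass


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


def tt_now(epsilon=0.007, clock=time.time):
    """Return an interval assumed to contain the current time (epsilon in seconds)."""
    now = clock()
    return TTInterval(now - epsilon, now + epsilon)


def tt_after(timestamp, epsilon=0.007, clock=time.time):
    """True only if `timestamp` has definitely passed, even with clock uncertainty."""
    return tt_now(epsilon, clock).earliest > timestamp


# A fixed fake clock makes the behavior easy to follow.
fake_clock = lambda: 100.0
interval = tt_now(epsilon=0.5, clock=fake_clock)
print(interval)
print(tt_after(99.0, epsilon=0.5, clock=fake_clock))
print(tt_after(99.8, epsilon=0.5, clock=fake_clock))
```

The timestamp 99.8 lies inside the uncertainty window, so the system cannot yet claim that it has passed.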