Google Cloud Datastore
Inside-Out
Etsuji Nakai
Cloud Solutions Architect at Google
February 9, 2017 ver2.1
Twitter @enakai00
Cloud Datastore 101
The mystery of entity groups
Dual nature of entities
● An entity represents a row of a specific "kind".
● You can think of "kind" as a table in the relational data model.
● An entity is identified by an identifier (a user-specified name string or an
auto-generated numeric ID) plus its (mysterious) parent key.
A row of a kind, with its unique identifier (figure)
Dual nature of entities
● An entity represents a node of an "entity group" tree.
● An entity group can contain entities from multiple kinds.
● An entity is identified by a key (ancestor path + ID).
○ A key must contain the complete ancestor path, starting from the root entity.
○ Some entities in the ancestor path may not exist.
A node of an entity group (figure)
Key: (Organization, 'Flywheel', User, 'Alice', Mail, '15de6')
○ Ancestor path: (Organization, 'Flywheel', User, 'Alice')
○ ID: '15de6' (of kind Mail)
○ The root entity (Organization, 'Flywheel') doesn't exist.
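The key structure above can be sketched in a few lines of Python. This is only an illustrative model, not the Cloud Datastore client API: a key is assumed to be a flat tuple of alternating kind and ID values, and the helper names (`parent`, `root`, `has_ancestor`) are hypothetical.

```python
from typing import Tuple

# Illustrative model: a key is a flat tuple (kind, id, kind, id, ...).
Key = Tuple[str, ...]

def parent(key: Key) -> Key:
    """Return the key of the direct parent, or () for a root entity."""
    return key[:-2]

def root(key: Key) -> Key:
    """Return the root key, which identifies the entity group."""
    return key[:2]

def has_ancestor(key: Key, ancestor: Key) -> bool:
    """True if `ancestor` is a prefix of `key` on a (kind, id) boundary."""
    return len(ancestor) % 2 == 0 and key[:len(ancestor)] == ancestor

mail = ("Organization", "Flywheel", "User", "Alice", "Mail", "15de6")
print(parent(mail))  # ('Organization', 'Flywheel', 'User', 'Alice')
print(root(mail))    # ('Organization', 'Flywheel')
print(has_ancestor(mail, ("Organization", "Flywheel", "User", "Alice")))  # True
```

Note that nothing in this model requires the root entity to exist: the key only names it.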
The bright/dark side of an entity
● It's safe to treat an entity as a member of an entity group.
○ Entities treated as part of an entity group are guaranteed to be strongly consistent.
● An ancestor query is a query that specifies an ancestor.
○ The search range is limited to the descendants of the specified ancestor.
○ Ancestor queries are strongly consistent.
○ In other words, it always retrieves the latest data.
○ You can use a single-phase transaction inside an entity group.
○ A cross-group transaction can also be used, but it is slower than a single-phase transaction.
● A global query is a query without specifying an ancestor.
○ Global queries are eventually consistent.
○ You may see old content and/or fail to find newly created entities.
Mystery of composite indexes
● Can you tell which query requires an additional (non-default) index?
○ Global queries
■ SELECT * FROM Mail WHERE size>256 ⇒ ◯ (OK)
■ SELECT * FROM Mail WHERE size=256 AND access_count>5 ⇒ △ (needs an additional index)
■ SELECT * FROM Mail WHERE size>256 AND access_count>5 ⇒ ✕ (not allowed)
■ SELECT size FROM Mail WHERE size>256 ⇒ ◯ (OK)
■ SELECT title FROM Mail WHERE size>256 ⇒ △ (needs an additional index)
○ Ancestor queries
■ SELECT * FROM Mail
WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ ◯ (OK)
■ SELECT * FROM Mail WHERE size=256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △ (needs an additional index)
■ SELECT * FROM Mail WHERE size>256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △ (needs an additional index)
What's happening under the covers?
● How is strong consistency guaranteed for ancestor queries?
● Why do I have to define additional indexes for some queries?
● When and why do I need to specify "ancestor = True" for an index?
Truth is here
● Cloud Datastore is implemented on top of Megastore, which is layered over Bigtable and
Google File System. The internal architecture of Megastore, Bigtable and Google File System
is explained in published research papers.
● Megastore: Providing Scalable, Highly Available Storage for Interactive Services
○ http://research.google.com/pubs/pub36971.html
● Bigtable: A Distributed Storage System for Structured Data
○ http://research.google.com/archive/bigtable.html
● The Google File System
○ http://research.google.com/archive/gfs.html
Layers (figure): Megastore on top of Bigtable on top of Google File System
Notes on Colossus
● Colossus is a successor of Google File System which overcomes shortcomings of
Google File System. It is used as an infrastructure of Google Cloud Platform as well as
Google's internal systems today.
● The following characteristics were mentioned at Google Faculty Summit 2010.
○ Next-generation cluster-level file system
○ Automatically sharded metadata layer
○ Data typically written using Reed-Solomon (1.5x)
○ Client-driven replication, encoding and replication
○ Metadata space has enabled availability analyses
● Since the architectural details of Colossus have not yet been published, this presentation
explains the architecture of Google File System instead.
Google File System
What is Google File System?
● Large scale distributed file system used in Google's internal systems to store large files.
● Optimized for file append and sequential file read for large files.
○ Other operations are supported but may be very slow.
● Transparent file replication for redundancy.
○ Each file is split into multiple 64MB chunks and each chunk is stored in (at least)
three chunk servers.
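As a rough illustration of the chunking described above, the following sketch computes how many chunks and replicas a file needs. Only the 64 MB chunk size and the minimum of three replicas come from the slide; the function names are hypothetical.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64 MB chunks, as stated above
REPLICAS = 3                   # minimum replica count, as stated above

def chunk_count(file_size: int) -> int:
    """Number of chunks needed to hold file_size bytes (ceiling division)."""
    return (file_size + CHUNK_SIZE - 1) // CHUNK_SIZE

def chunk_index(offset: int) -> int:
    """Index of the chunk that contains the given byte offset."""
    return offset // CHUNK_SIZE

one_gib = 1024 ** 3
print(chunk_count(one_gib))             # 16 chunks
print(chunk_count(one_gib) * REPLICAS)  # 48 chunk replicas in total
print(chunk_index(200 * 1024 * 1024))   # byte offset 200 MB falls in chunk 3
```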
Typical access patterns (figure): handing over large data between servers, and streaming data aggregation
Optimized dataflow
● Data is transferred serially from a client through a chain of chunk servers. Each chunk server
starts forwarding the data to the next one right after it starts receiving it.
○ Faster than sending data from a client to all chunk servers in parallel.
● Control messages are handled by the primary chunk server to keep the consistency
among replicas.
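A back-of-the-envelope comparison shows why the serial pipeline can win. The pipelined estimate B/T + R·L follows the ideal transfer time given in the GFS paper; the "parallel" model, in which the client's single outbound link carries R full copies, is a simplifying assumption made here for comparison, and the numbers are illustrative only.

```python
def pipelined_time(size_bytes, throughput_bps, latency_s, replicas):
    """Ideal time when each server forwards data as it arrives: B/T + R*L."""
    return size_bytes * 8 / throughput_bps + replicas * latency_s

def parallel_time(size_bytes, throughput_bps, latency_s, replicas):
    """Assumed model: the client pushes R copies through its one outbound link."""
    return replicas * size_bytes * 8 / throughput_bps + latency_s

# Illustrative numbers: 1 MB payload, 100 Mbps links, 1 ms hop latency, 3 replicas.
args = (1_000_000, 100e6, 0.001, 3)
print(f"pipelined: {pipelined_time(*args) * 1000:.0f} ms")
print(f"parallel:  {parallel_time(*args) * 1000:.0f} ms")
```

Under these assumptions the pipeline uses each machine's full outbound bandwidth once, while the parallel fan-out divides the client's bandwidth among the replicas.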
Dataflow to append data, and control flow to commit the write (figure: a client and three chunk servers, one primary and two secondaries)
Data corruption detection
● Each chunk is associated with a checksum to
detect data corruption.
● The whole chunk is read and validated with the
checksum for the read operation.
○ This is optimized for the sequential read.
● A new checksum is calculated with appended
data and the existing checksum for the write
operation.
○ This is optimized for the file append.
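The append-friendly checksum update can be illustrated with a running CRC. CRC-32 from Python's `zlib` is used here purely as a stand-in; the slides do not say which checksum algorithm GFS uses, and the `verify` helper is hypothetical.

```python
import zlib

existing = b"existing chunk data, "
appended = b"newly appended record"

# Write path: extend the existing checksum with only the appended bytes.
checksum = zlib.crc32(existing)
checksum = zlib.crc32(appended, checksum)

# The running checksum equals a checksum computed over the whole chunk.
assert checksum == zlib.crc32(existing + appended)

def verify(chunk: bytes, expected: int) -> bool:
    """Read path: recompute the checksum over the whole chunk and compare."""
    return zlib.crc32(chunk) == expected

print(verify(existing + appended, checksum))            # True
print(verify(b"X" + (existing + appended)[1:], checksum))  # False: corrupted byte
```

The write never has to re-read the existing data, which is what makes the append path cheap.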
Bigtable
What is Bigtable?
● Large-scale distributed key-value style datastore used in Google's internal systems to
store structured data of widely varying sizes (from web page URLs to satellite imagery).
● Google Cloud Platform offers a managed Bigtable service with HBase-compatible APIs.
Column family design to store HTML contents and reverse links
(excerpt from the research paper)
Row as a Database
● Data is identified with "Row Key + Column family: Column" (+ timestamp).
● You can think of a single row as a small database.
○ A column family represents a table.
○ Columns can be dynamically added to a column family.
○ Atomic operations can be used within a single row.
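The "row as a small database" idea can be modelled with nested dictionaries. This is a toy in-memory sketch, not the Bigtable or HBase API; the row and column names loosely follow the web-page example in the Bigtable paper, and the values and timestamps are made up.

```python
from collections import defaultdict

# table[row_key][(family, column)] -> list of (timestamp, value), newest first
table = defaultdict(lambda: defaultdict(list))

def put(row_key, family, column, value, timestamp):
    """Add a new version of a cell; columns appear on first write."""
    cells = table[row_key][(family, column)]
    cells.append((timestamp, value))
    cells.sort(key=lambda cell: cell[0], reverse=True)

def get(row_key, family, column):
    """Return the newest version of a cell, or None if it was never written."""
    cells = table[row_key][(family, column)]
    return cells[0][1] if cells else None

def columns_in_family(row_key, family):
    """List the columns that currently exist in one family of one row."""
    return sorted(col for fam, col in table[row_key] if fam == family)

put("com.cnn.www", "contents", "", "<html>v1", timestamp=1)
put("com.cnn.www", "contents", "", "<html>v2", timestamp=2)
put("com.cnn.www", "anchor", "cnnsi.com", "CNN", timestamp=1)
put("com.cnn.www", "anchor", "my.look.ca", "CNN.com", timestamp=1)

print(get("com.cnn.www", "contents", ""))        # <html>v2
print(columns_in_family("com.cnn.www", "anchor"))  # ['cnnsi.com', 'my.look.ca']
```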
Column family design for user profiles and query histories
Global view of the "big" table
● Rows are stored in lexicographic order by row key. The row range for a table is
dynamically partitioned into units called 'tablets'.
○ This strategy is optimized for fast row range scans.
● Tablet servers provide the access to tablets. The tablet assignment is managed by a
master node.
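Why sorted row keys make range scans fast can be seen in a small sketch. The row keys, the fixed tablet size and the function names are all assumptions for illustration; real tablets are split dynamically by size.

```python
import bisect

row_keys = sorted([
    "com.cnn.www", "com.example.www", "com.google.maps",
    "com.google.www", "org.apache.www", "org.python.www",
])

def range_scan(keys, start, end):
    """Return all keys in [start, end) using two binary searches."""
    lo = bisect.bisect_left(keys, start)
    hi = bisect.bisect_left(keys, end)
    return keys[lo:hi]

def split_into_tablets(keys, tablet_size):
    """Partition the sorted key space into contiguous tablets."""
    return [keys[i:i + tablet_size] for i in range(0, len(keys), tablet_size)]

def locate_tablet(tablets, key):
    """Find the tablet whose key range covers `key`, by comparing end keys."""
    end_keys = [tablet[-1] for tablet in tablets]
    return bisect.bisect_left(end_keys, key)

tablets = split_into_tablets(row_keys, 2)
print(range_scan(row_keys, "com.google", "com.google\xff"))
print(locate_tablet(tablets, "com.google.www"))
```

Because neighbouring keys live in the same or adjacent tablets, a range scan touches only a few tablet servers.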
Tablet representation
● Tablet data consists of in-memory data (memtable) and immutable files (SSTables)
stored in Google File System.
○ SSTables store a frozen view of a tablet at some point in time. Updates are
appended to a tablet log and the memtable.
○ A tablet server constructs a unified view of the tablet by combining the memtable and
SSTables.
Tablet representation mechanism
(excerpt from the research paper)
● When the memtable becomes too large, a new
memtable is created and the old one is frozen
into a new SSTable (minor compaction).
● When there are too many SSTables, they are
merged into a single SSTable, discarding
obsolete entries (major compaction).
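The memtable and SSTable mechanics above can be sketched as a toy class. This is a simplified model under stated assumptions: the class name, the tiny memtable limit and the in-memory "SSTables" are hypothetical, and deletions are not modelled.

```python
class Tablet:
    def __init__(self, memtable_limit=2):
        self.log = []        # append-only tablet log
        self.memtable = {}   # recent writes, held in memory
        self.sstables = []   # immutable sorted snapshots, newest first
        self.memtable_limit = memtable_limit

    def write(self, key, value):
        self.log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self):
        """Freeze the memtable into a new SSTable and start a fresh memtable."""
        self.sstables.insert(0, tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def major_compaction(self):
        """Merge all SSTables into one, keeping only the newest value per key."""
        merged = {}
        for sstable in reversed(self.sstables):  # oldest first, newer overwrite
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        """Unified view: check the memtable first, then SSTables newest first."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in self.sstables:
            for k, v in sstable:
                if k == key:
                    return v
        return None

tablet = Tablet()
for key, value in [("a", 1), ("b", 2), ("a", 3), ("c", 4), ("d", 5)]:
    tablet.write(key, value)
print(len(tablet.sstables), tablet.read("a"))  # 2 SSTables; newest value of "a" is 3
tablet.major_compaction()
print(tablet.sstables)
```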
Cloud Datastore /
Megastore
Overview of Megastore
● Megastore provides the ACID semantics for
globally distributed datasets using fast
synchronous replication mechanism based
on (an enhanced version of) Paxos.
● This part explains the index structure of
Cloud Datastore implemented on top of
Megastore.
● Note that ancestor queries and global queries
are additional features of Cloud Datastore.
They are not part of Megastore.
Multi-datacenter replication architecture of Megastore
(excerpt from the research paper)
Index structure
for ancestor queries
How are entities stored in Bigtable?
● Row key: entity key (ancestor path + ID).
○ The whole entity group can be scanned by a row range scan (depth-first search).
● Column family: properties of an entity.
○ An independent column family is used for each property.
Row key | status of the group | email | title | size | access_count
Organization, 'Flywheel' | - | - | - | - | -
Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - | -
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024 | 9
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128 | 5
Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d' | - | - | xxxx | 256 | 3
Organization, 'Flywheel', User, 'Bob' | - | xxxx | - | - | -
・
・
・
○ The transaction log and replication status are recorded for operations with strong consistency.
○ Rows are read from top to bottom by a row range scan.
Ancestor query without inequality filters
● The following queries don't require an additional index since they can be done by a row
range scan.
● The scan starts from a row with the specified ancestor key.
Row key | status of the group | email | title | size
Organization, 'Flywheel' | - | - | - | -
Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - (the scan starts from here)
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128
SELECT * FROM Mail
WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
SELECT * FROM Mail WHERE size=256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
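The row range scan behind these two queries can be sketched as follows. This is an in-memory model, not the Datastore implementation: rows are assumed to be (key tuple, properties) pairs sorted by key, and `ancestor_query` is a hypothetical helper.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")

# Entity rows from the table above, sorted by row key (depth-first order).
rows = sorted([
    (ORG, {}),
    (ALICE, {"email": "xxxx"}),
    (ALICE + ("Mail", "15de6"), {"title": "xxxx", "size": 1024}),
    (ALICE + ("Mail", "65067"), {"title": "xxxx", "size": 128}),
    (ALICE + ("Mail", "c4c0d"), {"title": "xxxx", "size": 256}),
    (ORG + ("User", "Bob"), {"email": "xxxx"}),
], key=lambda row: row[0])
row_keys = [key for key, _ in rows]

def ancestor_query(kind, ancestor, **equals):
    """Scan the contiguous rows under `ancestor`, filtering on kind and equality."""
    start = bisect.bisect_left(row_keys, ancestor)  # jump to the ancestor's row
    found = []
    for key, props in rows[start:]:
        if key[:len(ancestor)] != ancestor:
            break  # left the ancestor's subtree, so the scan can stop
        if key[-2] == kind and all(props.get(n) == v for n, v in equals.items()):
            found.append(key)
    return found

print(ancestor_query("Mail", ALICE))
print(ancestor_query("Mail", ALICE, size=256))
```

The equality filter is applied while scanning; no separate index table is consulted in this model.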
Ancestor query with inequality filters
● The following query requires an additional index.
● In theory the same row range scan could answer it by filtering each row, but that may not
be efficient enough. Instead, the following index is used.
○ The row key of this index table consists of:
■ "Ancestor of the entity" + "Property value" + "Entity key (ancestor path + ID)"
○ See next pages for details.
SELECT * FROM Mail WHERE size>256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
indexes:
- kind: Mail
  ancestor: yes
  properties:
  - name: size
Single-property indexes for ancestor queries
● Each entity is mapped to multiple rows corresponding to all its ancestors.
○ The following example shows the rows for two entities.
○ This will be sorted in the order of row keys, then...
Ancestor | Property value | Entity key (ancestor path + ID) | Column
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | Pointer to entity
Organization, 'Flywheel', User, 'Alice' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | Pointer to entity
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | Pointer to entity
Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' | Pointer to entity
Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' | Pointer to entity
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' | Pointer to entity
(The first three fields form the row key of the index table.)
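The fan-out of one entity into several index rows can be sketched as below. The tuple layout (ancestor, property value, entity key) mirrors the row key described above; the function names are hypothetical and the pointer column is omitted.

```python
def ancestors(key):
    """All prefixes of the key on (kind, id) boundaries, including the key itself."""
    return [key[:i] for i in range(2, len(key) + 1, 2)]

def index_rows(key, value):
    """One index row per ancestor: (ancestor, property value, entity key)."""
    return [(ancestor, value, key) for ancestor in ancestors(key)]

mail = ("Organization", "Flywheel", "User", "Alice", "Mail", "15de6")
for row in index_rows(mail, 1024):
    print(row)
```

An entity three levels deep therefore produces three index rows per indexed property, which is part of the write cost of ancestor indexes.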
Single-property indexes for ancestor queries
● Using the row keys which are sorted in lexicographic order:
○ First, the row range is limited by the specified ancestor.
○ The row range is narrowed further by the inequality filter.
Organization, 'Flywheel' | 64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel', User, 'Alice' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
SELECT * FROM Mail WHERE size>256
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel')
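The two narrowing steps can be sketched with binary searches over a sorted index. This is an in-memory model built from the six Mail entities shown above; the helper names are hypothetical, and the index is kept as a sorted Python list rather than a Bigtable table.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

# (entity key, size) pairs taken from the index rows above.
entities = [
    (BOB + ("Mail", "15de6"), 64),
    (ALICE + ("Mail", "65067"), 128),
    (ALICE + ("Mail", "c4c0d"), 256),
    (BOB + ("Mail", "c6f4c"), 256),
    (ALICE + ("Mail", "15de6"), 1024),
    (BOB + ("Mail", "f67de"), 1024),
]

def ancestors(key):
    return [key[:i] for i in range(2, len(key) + 1, 2)]

# Index rows (ancestor, size, entity key), sorted like Bigtable row keys.
index = sorted((a, size, key) for key, size in entities for a in ancestors(key))
prefixes = [(a, size) for a, size, _ in index]

def size_greater_than(ancestor, bound):
    """Return entity keys under `ancestor` with size > bound as one index range."""
    lo = bisect.bisect_right(prefixes, (ancestor, bound))         # skip size <= bound
    hi = bisect.bisect_right(prefixes, (ancestor, float("inf")))  # end of this ancestor
    return [key for _, _, key in index[lo:hi]]

print(size_greater_than(ORG, 256))
print(size_greater_than(ALICE, 256))
```

Both bounds are found by binary search, so the query reads exactly the matching rows and nothing else.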
Composite indexes for multiple conditions
● Indexes with multiple properties are used for queries with multiple conditions.
● The following query requires the composite index.
● The order of properties in the index definition has meaning.
○ The property for equality filter must come first.
indexes:
- kind: Mail
  ancestor: yes
  properties:
  - name: size
  - name: access_count
SELECT * FROM Mail WHERE size=256 AND access_count<5
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel')
Ancestor | size | access_count | Entity key
Organization, 'Flywheel' | 64 | 1 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
Organization, 'Flywheel' | 128 | 5 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
Organization, 'Flywheel' | 256 | 3 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
Organization, 'Flywheel' | 256 | 8 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
Organization, 'Flywheel' | 1024 | 2 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
Organization, 'Flywheel' | 1024 | 9 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
Multiple inequality filters are not allowed!
● The following query is not allowed.
○ The matching rows of the index table do not form a single contiguous range for this condition.
SELECT * FROM Mail WHERE size>128 AND access_count<5
AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
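The contiguity argument can be checked directly on the composite index rows (size, access_count) shown earlier. For simplicity this sketch fixes the ancestor to (Organization, 'Flywheel') and abbreviates entity keys to short labels; the helper names are hypothetical.

```python
# Composite index rows sorted by (size, access_count, entity key label).
rows = sorted([
    (64, 1, "Bob/15de6"),
    (128, 5, "Alice/65067"),
    (256, 3, "Alice/c4c0d"),
    (256, 8, "Bob/c6f4c"),
    (1024, 2, "Bob/f67de"),
    (1024, 9, "Alice/15de6"),
])

def matching_positions(predicate):
    """Positions of the index rows that satisfy the filter."""
    return [i for i, row in enumerate(rows) if predicate(row)]

def is_contiguous(positions):
    """True if the positions form one unbroken range (or are empty)."""
    return not positions or positions == list(range(positions[0], positions[-1] + 1))

# Equality on size plus inequality on access_count: one contiguous range.
eq_then_ineq = matching_positions(lambda r: r[0] == 256 and r[1] < 5)
# Inequalities on both properties: matches are scattered through the index.
two_ineq = matching_positions(lambda r: r[0] > 128 and r[1] < 5)

print(eq_then_ineq, is_contiguous(eq_then_ineq))
print(two_ineq, is_contiguous(two_ineq))
```

With two inequalities the scan would have to skip over non-matching rows, so it can no longer be served as a single row range.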
Strong consistency of ancestor queries
● Indexes with "ancestor: yes" are used for ancestor queries. An independent index is
created for each entity group (ancestor tree).
○ A single index table contains entries only for one entity group.
● Indexes are created in each datacenter and replicated.
○ Replication status is checked before starting a query to guarantee strong
consistency.
Row key | status of the group | email | title | size | access_count
Organization, 'Flywheel' (root entity) | Replication Status | - | - | - | -
Organization, 'Flywheel', User, 'Alice' | - | xxxx | - | - | -
Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | - | - | xxxx | 1024 | 9
Organization, 'Flywheel', User, 'Alice', Mail, '65067' | - | - | xxxx | 128 | 5
Index structure
for global queries
Indexes for global queries
● Indexes with "ancestor: no" are used for global queries where indexes are created for
each kind.
○ One index table contains all entities of a specific kind including entities from
multiple entity groups.
Operation across entity groups
(excerpt from the research paper)
● Megastore handles operations across
entity groups with weaker consistency
unless two-phase commit is used.
● On the Cloud Datastore layer, this results in
the eventual consistency of global queries.
Default single-property indexes
● Single-property indexes for global queries are automatically created (in both asc and
desc orders).
○ Ancestors are not included in row keys of the index table.
● For example, the following queries use the default indexes.
SELECT * FROM Mail WHERE size>256
SELECT size FROM Mail WHERE size>256
64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
Composite indexes for global queries
● Indexes with multiple properties (composite indexes) need to be created manually.
○ Projection queries also need composite indexes so that values can be retrieved
directly from the index table.
SELECT * FROM Mail WHERE size=256 AND access_count>5
SELECT title FROM Mail WHERE size>256 (projection query)
indexes:
- kind: Mail
  ancestor: no
  properties:
  - name: size
  - name: access_count
- kind: Mail
  ancestor: no
  properties:
  - name: size
  - name: title
○ With the second index, 'title' can be retrieved directly from the index table.
Index direction matters for sort orders
● "ORDER BY" requires the corresponding index.
● When used with an equality filter, the index direction needs to match the sort order.
○ For example, the following query needs the index shown after it.
SELECT * FROM Mail WHERE size=256 ORDER BY access_count DESC
indexes:
- kind: Mail
  ancestor: no
  properties:
  - name: size
  - name: access_count
    direction: desc
● "ORDER BY" cannot be mixed with an inequality filter on another property.
○ The following query is not allowed.
SELECT * FROM Mail WHERE size>256 ORDER BY access_count DESC
Design guide
for entity groups
Design guide for entity groups
● Avoid global queries (queries without specifying an ancestor) unless you understand what
you are doing.
○ Global queries may not retrieve the latest data.
● Split data into entity groups so that updates to any single entity group are infrequent.
○ Updates to entities in a single entity group should stay below 1 update/sec.
● Examples:
○ Web mail service
■ An entity group of mails for each user.
○ SNS user group service
■ An entity group of user profile for each user.
■ An entity group of posts for each user group.
■ An entity group of group names and pointers to group sites which provides a catalog of user
groups.
○ Online map service
■ An entity group of patches for an arbitrary region of the globe.
References
● Under the Covers of the Google App Engine Datastore
● How Entities and Indexes are Stored
● Balancing Strong and Eventual Consistency with Google Cloud Datastore
Notes on Spanner
What is Spanner?
● Spanner: Google's Globally-Distributed Database
○ http://research.google.com/archive/spanner.html
● Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated
database. It is used as the successor of Megastore in Google's internal systems.
● Designed to overcome the shortcomings of Megastore and support general-purpose
transactions with SQL-based query language.
● Example of shortcomings of Megastore:
○ It doesn't support the relational data model and SQL-based query language.
○ Transactions and strong consistency are limited to a single entity group.
○ The number of updates is limited to 1 update/sec for each entity group.
Infrastructure design
● The overall server architecture of Spanner resembles Megastore over Bigtable.
○ A cluster in each zone contains multiple spanservers. Zones are distributed across
data centers.
○ Each spanserver manages tablets, which hold key-value mappings of the form:
(key: string, timestamp: int64) → value: string
○ Backend data files are stored in Colossus.
Spanner server organization
(excerpt from the research paper)
● Unlike Bigtable, rows in a tablet are
versioned with the system time instead of
user-specified timestamps.
○ The versioning mechanism is used for
snapshot read and lock-free read-only
transactions.
Paxos-based tablet replication
● Tablets in different zones are replicated with a Paxos-based algorithm.
○ A leader in each replication group takes care of row-range write locks during
read-write transactions. A leader is re-elected through Paxos if necessary.
○ In the case of transactions which involve multiple replication groups, transaction
managers from each group cooperate to perform two-phase commit.
Replication between tablets
(excerpt from the research paper)
So..., what's the problem?
● The problem with a Paxos-based algorithm is that part of the replication is done asynchronously.
○ When a majority of the replicas have agreed to write the data, it's considered to be
committed. The remaining replicas catch up asynchronously.
○ If you enforce the genuine full-replication on each write, performance will be highly
degraded. (This is partly the reason for the limited strongly consistent updates on
Megastore.)
● Spanner associates timestamps with all writes, and every replica tracks a value called "safe
time: t-safe" which is the maximum timestamp at which a replica is up-to-date.
○ A replica can satisfy a read request for a timestamp t if t <= t-safe. If not, another
replica is used.
○ t-safe advances at each Paxos write. During a transaction, the advancement is
delayed until the transaction finishes.
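The safe-time rule can be sketched as a replica-selection check. This is a simplified model: timestamps are plain integers, and the `Replica` class and `pick_replica` function are hypothetical names, not part of Spanner.

```python
from dataclasses import dataclass
from typing import List, Optional

@dataclass
class Replica:
    name: str
    t_safe: int  # maximum timestamp at which this replica is up to date

    def can_serve(self, t: int) -> bool:
        """A read at timestamp t is safe only if t <= t_safe."""
        return t <= self.t_safe

    def apply_write(self, timestamp: int) -> None:
        """Applying a replicated write advances the safe time."""
        self.t_safe = max(self.t_safe, timestamp)

def pick_replica(replicas: List[Replica], t: int) -> Optional[Replica]:
    """Return the first replica that can serve a read at timestamp t."""
    for replica in replicas:
        if replica.can_serve(t):
            return replica
    return None

replicas = [Replica("lagging", t_safe=90), Replica("fresh", t_safe=120)]
print(pick_replica(replicas, 100).name)  # the lagging replica is skipped
replicas[0].apply_write(110)
print(pick_replica(replicas, 100).name)  # now the first replica qualifies
```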
So..., again, what's the problem?
● The timestamp-based tracking requires that the clocks on all replicas are synchronized.
○ At least, clocks should be calibrated within a limited amount of uncertainty, and the
range of uncertainty is known to the system.
● Spanner clusters are equipped with the
TrueTime API system, which consists of multiple
time servers using GPS and atomic clocks.
○ The TrueTime API provides a time
interval in which the current time is
guaranteed to lie.
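The interval-based clock can be sketched as follows. The method names follow the TT.now / TT.after / TT.before interface described in the Spanner paper, but this is only a toy model: the fixed uncertainty `EPSILON` and the use of the local clock are assumptions made for illustration.

```python
import time
from dataclasses import dataclass

EPSILON = 0.007  # assumed clock uncertainty in seconds (illustrative value)

@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float

def tt_now(clock=time.time) -> TTInterval:
    """Return an interval assumed to contain the true current time."""
    t = clock()
    return TTInterval(t - EPSILON, t + EPSILON)

def tt_after(t: float, clock=time.time) -> bool:
    """True if time t has definitely passed."""
    return tt_now(clock).earliest > t

def tt_before(t: float, clock=time.time) -> bool:
    """True if time t has definitely not arrived yet."""
    return tt_now(clock).latest < t

interval = tt_now()
print(interval.latest - interval.earliest)  # width of the uncertainty window
```

A timestamp inside the uncertainty window is neither definitely past nor definitely future, which is exactly the case the system has to reason about explicitly.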
Fluctuations of time drifts from time servers
(excerpt from the research paper; annotations in the figure: hardware maintenance of two time servers, network latency improvement)
Thank you!

 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
Histor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slideHistor y of HAM Radio presentation slide
Histor y of HAM Radio presentation slide
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Script
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law DevelopmentsTrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
TrustArc Webinar - Stay Ahead of US State Data Privacy Law Developments
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Google Cloud Datastore Inside-Out

  • 1. Google Cloud Datastore Inside-Out Etsuji Nakai Cloud Solutions Architect at Google February 9, 2017 ver2.1
  • 2. Etsuji Nakai Cloud Solutions Architect at Google Twitter @enakai00 Now On Sale! 2
  • 3. Cloud Datastore 101 The mystery of entity groups
  • 4. Dual nature of entities ● An entity represents a row of a specific "kind". ● You can think of "kind" as a table in the relational data model. ● An entity is identified by an ID (user-specified string or auto-generated UUID) plus its (mysterious) parent key. A row of a kind 4 Unique identifier
  • 5. Dual nature of entities: a node of an entity group ● An entity represents a node of an "entity group" tree. ● An entity group can contain entities from multiple kinds. ● An entity is identified by a key (ancestor path + ID). ○ A key must contain every ancestor from the root down to the entity itself. ○ Some entities in the ancestor path may not exist. ● Example key: (Organization, 'Flywheel', User, 'Alice', Mail, '15de6'). The ancestor path is (Organization, 'Flywheel', User, 'Alice') and the ID is '15de6'. In this example the entity Organization: 'Flywheel' doesn't exist.
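The key structure on this slide can be sketched in a few lines of Python. This is only an illustration: the flat tuple representation and the helper names `ancestors` and `root` are assumptions made for this example, not the Cloud Datastore client API.

```python
from typing import List, Tuple

# Assumption: a key is modelled as a flat tuple (kind, ID, kind, ID, ...), root first.
Key = Tuple[str, ...]


def ancestors(key: Key) -> List[Key]:
    """Return every proper ancestor key, from the root downwards."""
    return [key[:i] for i in range(2, len(key), 2)]


def root(key: Key) -> Key:
    """The first (kind, ID) pair identifies the entity group."""
    return key[:2]


mail_key = ("Organization", "Flywheel", "User", "Alice", "Mail", "15de6")
print(ancestors(mail_key))
print(root(mail_key))
```

Nothing in the key requires the ancestors to exist as stored entities; the path is just part of the identifier.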
  • 6. The bright/dark side of an entity ● It's safe to treat an entity as a member of an entity group. ○ Entities treated as part of an entity group are guaranteed to be strongly consistent. ● An ancestor query is a query that specifies an ancestor. ○ The search range is limited to the descendants of the specified ancestor. ○ Ancestor queries are strongly consistent. In other words, they always retrieve the latest data. ○ You can use a single-phase transaction inside an entity group. ○ A cross-group transaction can also be used, but it is slower than a single-phase transaction. ● A global query is a query that doesn't specify an ancestor. ○ Global queries are eventually consistent. ○ You may see old content and/or fail to find newly created entities.
  • 7. Mystery of composite indexes ● Can you tell which query requires an additional (non-default) index?
      ○ Global query
          ■ SELECT * FROM Mail WHERE size>256 ⇒ ◯ (OK)
          ■ SELECT * FROM Mail WHERE size=256 and access_count>5 ⇒ △ (Need an additional index)
          ■ SELECT * FROM Mail WHERE size>256 and access_count>5 ⇒ ✕ (This is not allowed)
          ■ SELECT size FROM Mail WHERE size>256 ⇒ ◯ (OK)
          ■ SELECT title FROM Mail WHERE size>256 ⇒ △ (Need an additional index)
      ○ Ancestor query
          ■ SELECT * FROM Mail WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ ◯
          ■ SELECT * FROM Mail WHERE size=256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △
          ■ SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice') ⇒ △
  • 8. What's happening under the covers? ● How is strong consistency guaranteed for ancestor queries? ● Why do I have to define additional indexes for some queries? ● When and why do I need to specify "ancestor = True" for an index?
  • 9. Truth is here ● Cloud Datastore is implemented on top of Megastore, which is layered over Bigtable and Google File System. The internal architectures of Megastore, Bigtable and Google File System are explained in the published research papers. ● Megastore: Providing Scalable, Highly Available Storage for Interactive Services ○ http://research.google.com/pubs/pub36971.html ● Bigtable: A Distributed Storage System for Structured Data ○ http://research.google.com/archive/bigtable.html ● The Google File System ○ http://research.google.com/archive/gfs.html ● Layers, from bottom to top: Google File System, Bigtable, Megastore.
  • 10. Notes on Colossus ● Colossus is the successor to Google File System and overcomes its shortcomings. It is used today as infrastructure for Google Cloud Platform as well as Google's internal systems. ● The following characteristics were mentioned at Google Faculty Summit 2010. ○ Next-generation cluster-level file system ○ Automatically sharded metadata layer ○ Data typically written using Reed-Solomon (1.5x) ○ Client-driven replication, encoding and replication ○ Metadata space has enabled availability analyses ● Since the architectural details of Colossus are not yet published, this presentation explains the architecture of Google File System.
  • 12. What is Google File System? ● Large-scale distributed file system used in Google's internal systems to store large files. ● Optimized for file append and sequential read of large files. ○ Other operations are supported but may be very slow. ● Transparent file replication for redundancy. ○ Each file is split into multiple 64MB chunks and each chunk is stored on (at least) three chunk servers. ● Typical access patterns: handing over large data between servers, and streaming data aggregation.
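As a rough illustration of the chunking rule on this slide, the following sketch computes how many chunks and chunk replicas a file of a given size needs. The 64MB chunk size and the three replicas come from the slide; the function names are made up for this example.

```python
CHUNK_SIZE = 64 * 1024 * 1024  # 64MB chunks, as stated on the slide
MIN_REPLICAS = 3               # each chunk lives on at least three chunk servers


def chunk_count(file_size: int) -> int:
    """Number of chunks needed for a file of file_size bytes (ceiling division)."""
    return -(-file_size // CHUNK_SIZE)


def replica_count(file_size: int, replicas: int = MIN_REPLICAS) -> int:
    """Total number of chunk replicas spread over the chunk servers."""
    return chunk_count(file_size) * replicas


one_gib = 1024 ** 3
print(chunk_count(one_gib), replica_count(one_gib))
```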
  • 13. Optimized dataflow ● Data is transferred serially from a client through a chain of chunk servers. Each chunk server starts forwarding the data to the next one as soon as it starts receiving it. ○ This is faster than sending data from a client to all chunk servers in parallel. ● Control messages are handled by the primary chunk server to keep the replicas consistent. [Figure: dataflow to append data and control flow to commit the write, between a client and the primary and secondary chunk servers.]
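The benefit of the pipelined dataflow can be estimated with a small calculation. The GFS paper gives the ideal elapsed time for pushing B bytes through R replicas as B/T + R·L, where T is the network throughput and L the per-hop latency. The fan-out formula below is an assumption for comparison: it supposes the client's single outbound link is shared by all R streams.

```python
def pipelined_time(size_bytes: float, replicas: int, throughput: float, latency: float) -> float:
    """Ideal time for a chained transfer: B/T + R*L (formula from the GFS paper)."""
    return size_bytes / throughput + replicas * latency


def fan_out_time(size_bytes: float, replicas: int, throughput: float, latency: float) -> float:
    """Assumed model: the client sends R copies over one shared outbound link."""
    return replicas * size_bytes / throughput + latency


# 1 MB to 3 replicas over a 100 Mbps link (12.5e6 bytes/s) with 1 ms latency per hop.
args = (1_000_000, 3, 12_500_000, 0.001)
print(pipelined_time(*args), fan_out_time(*args))
```

With these example numbers the chained transfer takes roughly a third of the fan-out time, because each machine's full outbound bandwidth is used for a single stream.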
  • 14. Data corruption detection ● Each chunk is associated with a checksum to detect data corruption. ● The whole chunk is read and validated with the checksum for the read operation. ○ This is optimized for the sequential read. ● A new checksum is calculated with appended data and the existing checksum for the write operation. ○ This is optimized for the file append. 14
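The two checksum paths on this slide (validate on read, extend on append) can be sketched with CRC-32 from Python's standard library. This is a simplified illustration: the `Chunk` class is hypothetical, and CRC-32 is used only because `zlib.crc32` accepts a running value, which makes the incremental update easy to show.

```python
import zlib


class Chunk:
    """Hypothetical append-only chunk with a single running checksum."""

    def __init__(self) -> None:
        self.data = bytearray()
        self.checksum = 0  # CRC-32 of empty input

    def append(self, payload: bytes) -> None:
        # New checksum = f(existing checksum, appended data); old data is not re-read.
        self.checksum = zlib.crc32(payload, self.checksum)
        self.data.extend(payload)

    def read(self) -> bytes:
        # The whole chunk is read and validated, which suits sequential reads.
        if zlib.crc32(bytes(self.data)) != self.checksum:
            raise IOError("checksum mismatch: chunk is corrupted")
        return bytes(self.data)


chunk = Chunk()
chunk.append(b"hello ")
chunk.append(b"world")
print(chunk.read())
```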
  • 16. What is Bigtable? ● Large-scale distributed key-value style datastore used in Google's internal systems to store structured data of varying sizes (from web page URLs to satellite imagery). ● Google Cloud Platform offers a managed service for Bigtable with HBase-compatible APIs. [Figure: column family design to store HTML contents and inverse links (excerpt from the research paper).]
  • 17. Row as a Database ● Data is identified by "Row Key + Column family: Column" (+ timestamp). ● You may think of a single row as a small database. ○ A column family represents a table. ○ Columns can be dynamically added to a column family. ○ Atomic operations can be used within a single row. [Figure: column family design for user profiles and query histories.]
  • 18. Global view of the "big" table ● Rows are stored in lexicographic order by row key. The row range for a table is dynamically partitioned into units called 'tablets'. ○ This strategy is optimized for fast row range scans. ● Tablet servers provide the access to tablets. The tablet assignment is managed by a master node. 18
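A minimal sketch of why sorted row keys make range scans cheap. The `Table` class and its fixed-size tablet split are assumptions for illustration; real Bigtable splits tablets dynamically and assigns them to tablet servers through the master.

```python
import bisect


class Table:
    """Hypothetical table: rows sorted by key and partitioned into tablets."""

    def __init__(self, rows: dict, tablet_size: int = 2) -> None:
        items = sorted(rows.items())
        self.tablets = [items[i:i + tablet_size] for i in range(0, len(items), tablet_size)]
        self.start_keys = [tablet[0][0] for tablet in self.tablets]

    def scan(self, start: str, end: str) -> list:
        """Return rows with start <= key < end, touching only the relevant tablets."""
        first = max(bisect.bisect_right(self.start_keys, start) - 1, 0)
        result = []
        for tablet in self.tablets[first:]:
            for key, value in tablet:
                if key >= end:
                    return result  # keys are sorted, so nothing later can match
                if key >= start:
                    result.append((key, value))
        return result


table = Table({"com.cnn.www": 1, "com.example": 2, "org.apache": 3, "org.python": 4, "net.x": 5})
print(table.scan("net", "org.b"))
```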
  • 19. Tablet representation ● Tablet data consists of in-memory data (memtable) and immutable files (SSTables) stored in Google File System. ○ SSTables store a frozen view of a tablet at some point in time. Updates are appended to a tablet log and the memtable. ○ A tablet server constructs a unified view of the tablet by combining the memtable and the SSTables. ● When the memtable becomes too large, a new memtable is created and the old one is frozen into a new SSTable (minor compaction). ● When there are too many SSTables, they are merged into a single SSTable and obsolete entries are discarded (major compaction). [Figure: tablet representation mechanism (excerpt from the research paper).]
  • 21. Overview of Megastore ● Megastore provides ACID semantics for globally distributed datasets using a fast synchronous replication mechanism based on (an enhanced version of) Paxos. ● This part explains the index structure of Cloud Datastore implemented on top of Megastore. ● Note that ancestor and global queries are additional features of Cloud Datastore. They are not part of Megastore. [Figure: multi-datacenter replication architecture of Megastore (excerpt from the research paper).]
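The memtable and SSTable interplay can be sketched as a toy log-structured store. Everything here is a simplification under stated assumptions: the `Tablet` class, the tiny `memtable_limit`, and the in-memory "SSTables" are illustrative only.

```python
class Tablet:
    """Toy tablet: a mutable memtable plus a list of immutable SSTables."""

    def __init__(self, memtable_limit: int = 2) -> None:
        self.log = []        # tablet log: every update is appended here first
        self.memtable = {}   # recent updates, held in memory
        self.sstables = []   # immutable sorted snapshots, oldest first
        self.memtable_limit = memtable_limit

    def write(self, key, value) -> None:
        self.log.append((key, value))
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self.minor_compaction()

    def minor_compaction(self) -> None:
        """Freeze the memtable into a new SSTable and start a fresh memtable."""
        self.sstables.append(tuple(sorted(self.memtable.items())))
        self.memtable = {}

    def major_compaction(self) -> None:
        """Merge all SSTables into one, keeping only the newest value per key."""
        merged = {}
        for sstable in self.sstables:  # oldest first, so newer values overwrite
            merged.update(sstable)
        self.sstables = [tuple(sorted(merged.items()))]

    def read(self, key):
        """Unified view: memtable first, then SSTables from newest to oldest."""
        if key in self.memtable:
            return self.memtable[key]
        for sstable in reversed(self.sstables):
            for k, v in sstable:
                if k == key:
                    return v
        return None


tablet = Tablet()
for k, v in [("a", 1), ("b", 2), ("a", 3), ("c", 4)]:
    tablet.write(k, v)
print(tablet.read("a"), len(tablet.sstables))
```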
  • 23. How are entities stored in Bigtable? ● Row key: entity key (ancestor path + ID). ○ The whole entity group can be scanned by a row range scan (depth-first search). ● Column family: properties of an entity. ○ An independent column family is used for each property. ● The "status of the group" column records the transaction log and replication status for operations with strong consistency.
      Row key | status of the group | email | title | size | access_count
      Organization, 'Flywheel' | | | | |
      Organization, 'Flywheel', User, 'Alice' | | xxxx | | |
      Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | | | xxxx | 1024 | 9
      Organization, 'Flywheel', User, 'Alice', Mail, '65067' | | | xxxx | 128 | 5
      Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d' | | | xxxx | 256 | 3
      Organization, 'Flywheel', User, 'Bob' | | xxxx | | |
      ...
  • 24. Ancestor query without inequality filters ● The following queries don't require an additional index since they can be done by a row range scan. ● The scan starts from the row with the specified ancestor key.
      SELECT * FROM Mail WHERE __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
      SELECT * FROM Mail WHERE size=256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
      Row key | status of the group | email | title | size
      Organization, 'Flywheel' | | | |
      Organization, 'Flywheel', User, 'Alice' | | xxxx | | (the scan starts from here)
      Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | | | xxxx | 1024
      Organization, 'Flywheel', User, 'Alice', Mail, '65067' | | | xxxx | 128
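The row range scan behind these ancestor queries can be imitated with a sorted list of keys. This is a sketch under assumptions: keys are modelled as flat tuples, the `ancestor_query` helper is hypothetical, and the example rows are taken from the slides.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

# Entity rows keyed by (kind, ID, kind, ID, ...) tuples.
ENTITIES = {
    ORG: {},
    ALICE: {"email": "xxxx"},
    ALICE + ("Mail", "15de6"): {"title": "xxxx", "size": 1024, "access_count": 9},
    ALICE + ("Mail", "65067"): {"title": "xxxx", "size": 128, "access_count": 5},
    ALICE + ("Mail", "c4c0d"): {"title": "xxxx", "size": 256, "access_count": 3},
    BOB: {"email": "xxxx"},
}
ROW_KEYS = sorted(ENTITIES)  # rows are kept in row-key order


def ancestor_query(ancestor, kind, **equals):
    """Scan the contiguous row range under `ancestor`, filtering by kind and equality."""
    results = []
    start = bisect.bisect_left(ROW_KEYS, ancestor)  # the scan starts at the ancestor row
    for key in ROW_KEYS[start:]:
        if key[:len(ancestor)] != ancestor:
            break  # left the ancestor's subtree, so the range scan is over
        props = ENTITIES[key]
        if key[-2] == kind and all(props.get(p) == v for p, v in equals.items()):
            results.append(key)
    return results


print(ancestor_query(ALICE, "Mail"))
print(ancestor_query(ALICE, "Mail", size=256))
```

Because every descendant's key starts with the ancestor's key, the descendants form one contiguous run of rows and no separate index table is consulted in this sketch.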
  • 25. Ancestor query with inequality filters ● The following query requires an additional index. ● Theoretically it's possible to do the same table scan, but it may not be efficient enough. Instead, the following index should be used. ○ The row key of this index table consists of: ■ "Ancestor of the entity" + "Property value" + "Entity key (ancestor path + ID)" ○ See next pages for details.
      SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
      indexes:
      - kind: Mail
        ancestor: yes
        properties:
        - name: size
  • 26. Single-property indexes for ancestor queries ● Each entity is mapped to multiple rows corresponding to all its ancestors. ○ The following example shows the rows for two entities. ○ This will be sorted in the order of row keys, then...
      Row key: Ancestors | Property value | Entity key (ancestor path + ID); Column: pointer to entity
      Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → pointer to entity
      Organization, 'Flywheel', User, 'Alice' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → pointer to entity
      Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6' → pointer to entity
      Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → pointer to entity
      Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → pointer to entity
      Organization, 'Flywheel', User, 'Alice', Mail, '65067' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067' → pointer to entity
  • 27. Single-property indexes for ancestor queries ● Using the row keys, which are sorted in lexicographic order: ○ First, the row range is limited by the specified ancestor. ○ The row range is narrowed further by the inequality filter.
      SELECT * FROM Mail WHERE size>256 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel')
      Organization, 'Flywheel' | 64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
      Organization, 'Flywheel' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
      Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
      Organization, 'Flywheel' | 256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
      Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
      Organization, 'Flywheel' | 1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
      Organization, 'Flywheel', User, 'Alice' | 128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
      Organization, 'Flywheel', User, 'Alice' | 256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
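The index layout from the last two slides can be reproduced in a short sketch: every entity contributes one index row per ancestor (including itself), the rows are sorted, and the query becomes a single range scan. The tuple-based keys and the helper names `build_index` and `query_greater_than` are assumptions for this illustration.

```python
import bisect

ORG = ("Organization", "Flywheel")
ALICE = ORG + ("User", "Alice")
BOB = ORG + ("User", "Bob")

# Mail entities and their sizes, taken from the slide.
MAILS = {
    BOB + ("Mail", "15de6"): 64,
    ALICE + ("Mail", "65067"): 128,
    ALICE + ("Mail", "c4c0d"): 256,
    BOB + ("Mail", "c6f4c"): 256,
    ALICE + ("Mail", "15de6"): 1024,
    BOB + ("Mail", "f67de"): 1024,
}


def build_index(entities):
    """One row (ancestor, property value, entity key) per ancestor of each entity."""
    rows = []
    for key, value in entities.items():
        for i in range(2, len(key) + 1, 2):
            rows.append((key[:i], value, key))
    return sorted(rows)


def query_greater_than(rows, ancestor, threshold):
    """Range scan: jump past (ancestor, threshold), then read until the ancestor changes."""
    prefixes = [(a, v) for a, v, _ in rows]
    start = bisect.bisect_right(prefixes, (ancestor, threshold))
    result = []
    for row_ancestor, _, key in rows[start:]:
        if row_ancestor != ancestor:
            break
        result.append(key)
    return result


INDEX = build_index(MAILS)
print(query_greater_than(INDEX, ORG, 256))
```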
  • 28. Composite indexes for multiple conditions ● Indexes with multiple properties are used for queries with multiple conditions. ● The following query requires the composite index. ● The order of properties in the index definition matters. ○ The property for the equality filter must come first.
      SELECT * FROM Mail WHERE size=256 and access_count<5 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel')
      indexes:
      - kind: Mail
        ancestor: yes
        properties:
        - name: size
        - name: access_count
      Organization, 'Flywheel' | 64 | 1 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
      Organization, 'Flywheel' | 128 | 5 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
      Organization, 'Flywheel' | 256 | 3 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
      Organization, 'Flywheel' | 256 | 8 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
      Organization, 'Flywheel' | 1024 | 2 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
      Organization, 'Flywheel' | 1024 | 9 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
  • 29. Multiple inequality filters are not allowed! ● The following query is not allowed. ○ The matching rows of the index table do not form a single contiguous range for this condition.
      SELECT * FROM Mail WHERE size>128 AND access_count<5 AND __key__ HAS ANCESTOR Key(Organization, 'Flywheel', User, 'Alice')
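The contiguity argument from the last two slides can be checked directly. The sketch below sorts composite index rows (size, access_count, mail ID) for the ancestor Organization 'Flywheel' and tests whether the rows matching a predicate sit next to each other. The data comes from the slides; the helper names and the shortened entity IDs are made up for the example.

```python
# Composite index rows under Organization 'Flywheel': (size, access_count, "user/mail ID").
ROWS = sorted([
    (64, 1, "Bob/15de6"),
    (128, 5, "Alice/65067"),
    (256, 3, "Alice/c4c0d"),
    (256, 8, "Bob/c6f4c"),
    (1024, 9, "Alice/15de6"),
    (1024, 2, "Bob/f67de"),
])


def matching_positions(rows, predicate):
    """Positions in the sorted index whose row satisfies the predicate."""
    return [i for i, (size, count, _) in enumerate(rows) if predicate(size, count)]


def is_single_range(positions):
    """True if the positions form one contiguous run, i.e. one row range scan suffices."""
    return all(b - a == 1 for a, b in zip(positions, positions[1:]))


# Equality on the first property plus inequality on the second: one contiguous range.
eq_then_ineq = matching_positions(ROWS, lambda size, count: size == 256 and count < 5)
# Inequalities on two different properties: the matches are scattered.
two_ineq = matching_positions(ROWS, lambda size, count: size > 128 and count < 5)

print(eq_then_ineq, is_single_range(eq_then_ineq))
print(two_ineq, is_single_range(two_ineq))
```

This also shows why the equality property has to come first in the index definition: only then do all rows with `size=256` sit together, ordered by `access_count`.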
  • 30. Strong consistency of ancestor queries ● Indexes with "ancestor: yes" are used for ancestor queries, where an independent index is created for each ancestor tree. ○ A single index table contains entries for only one entity group. ● Indexes are created in each datacenter and replicated. ○ Replication status is checked before starting a query to guarantee strong consistency.
      Row key | status of the group | email | title | size | access_count
      Organization, 'Flywheel' (root entity) | Replication Status | | | |
      Organization, 'Flywheel', User, 'Alice' | | xxxx | | |
      Organization, 'Flywheel', User, 'Alice', Mail, '15de6' | | | xxxx | 1024 | 9
      Organization, 'Flywheel', User, 'Alice', Mail, '65067' | | | xxxx | 128 | 5
  • 32. Indexes for global queries ● Indexes with "ancestor: no" are used for global queries, where an index is created for each kind. ○ One index table contains all entities of a specific kind, including entities from multiple entity groups. ● Megastore handles operations across entity groups with weaker consistency unless two-phase commit is used. ● On the Cloud Datastore layer, this results in the eventual consistency of global queries. [Figure: operation across entity groups (excerpt from the research paper).]
  • 33. Default single-property indexes ● Single-property indexes for global queries are automatically created (in both asc and desc orders). ○ Ancestors are not included in row keys of the index table. ● For example, the following queries use the default indexes.
      SELECT * FROM Mail WHERE size>256
      SELECT size FROM Mail WHERE size>256
      64 | Organization, 'Flywheel', User, 'Bob', Mail, '15de6'
      128 | Organization, 'Flywheel', User, 'Alice', Mail, '65067'
      256 | Organization, 'Flywheel', User, 'Alice', Mail, 'c4c0d'
      256 | Organization, 'Flywheel', User, 'Bob', Mail, 'c6f4c'
      1024 | Organization, 'Flywheel', User, 'Alice', Mail, '15de6'
      1024 | Organization, 'Flywheel', User, 'Bob', Mail, 'f67de'
  • 34. Composite indexes for global queries ● Indexes with multiple properties (composite indexes) need to be created manually. ○ Projection queries also need composite indexes so that values can be retrieved directly from the index table.
      SELECT * FROM Mail WHERE size=256 and access_count>5
      SELECT title FROM Mail WHERE size>256 (projection query: 'title' can be retrieved directly from the index table)
      indexes:
      - kind: Mail
        ancestor: no
        properties:
        - name: size
        - name: access_count
      - kind: Mail
        ancestor: no
        properties:
        - name: size
        - name: title
  • 35. Index direction matters for sort orders ● "ORDER BY" requires the corresponding index. ● When used with an equality filter, the index direction needs to match the sort order. ○ For example, the following query uses the index below.
      SELECT * FROM Mail WHERE size=256 ORDER BY access_count DESC
      indexes:
      - kind: Mail
        ancestor: no
        properties:
        - name: size
        - name: access_count
          direction: desc
    ● "ORDER BY" cannot be mixed with an inequality filter on another property. ○ The following query is not allowed.
      SELECT * FROM Mail WHERE size>256 ORDER BY access_count DESC
  • 37. Design guide for entity groups ● Avoid global queries (queries without specifying an ancestor) unless you understand what you are doing. ○ Global queries may not retrieve the latest data. ● Split data into entity groups so that updates in a single entity group are infrequent. ○ Updates to a single entity group should stay below 1 update/sec. ● Examples: ○ Web mail service ■ An entity group of mails for each user. ○ SNS user group service ■ An entity group of user profile for each user. ■ An entity group of posts for each user group. ■ An entity group of group names and pointers to group sites, which provides a catalog of user groups. ○ Online map service ■ An entity group of patches for an arbitrary region of the globe.
  • 38. References ● Under the Covers of the Google App Engine Datastore ● How Entities and Indexes are Stored ● Balancing Strong and Eventual Consistency with Google Cloud Datastore 38
  • 40. What is Spanner? ● Spanner: Google's Globally-Distributed Database ○ http://research.google.com/archive/spanner.html ● Spanner is Google's scalable, multi-version, globally-distributed, and synchronously-replicated database. It is used as a successor of Megastore in Google's internal systems. ● Designed to overcome the shortcomings of Megastore and support general-purpose transactions with an SQL-based query language. ● Examples of the shortcomings of Megastore: ○ It doesn't support the relational data model or an SQL-based query language. ○ Transactions and strong consistency are limited to a single entity group. ○ The number of updates is limited to 1 update/sec for each entity group.
  • 41. Infrastructure design ● The overall server architecture of Spanner resembles Megastore over Bigtable. ○ A cluster in each zone contains multiple spanservers. Zones are distributed across data centers. ○ Each spanserver manages tablets which hold the key-value mappings: (key: string, timestamp: int64) → value: string ○ Backend data files are stored in Colossus. ● Unlike Bigtable, rows in a tablet are versioned with a system time instead of user-specified timestamps. ○ The versioning mechanism is used for snapshot reads and lock-free read-only transactions. [Figure: Spanner server organization (excerpt from the research paper).]
  • 42. Paxos-based tablet replication ● Tablets in different zones are replicated with a Paxos-based algorithm. ○ A leader in each replication group takes care of row-range write locks during read-write transactions. A leader is re-elected through Paxos if necessary. ○ In the case of transactions which involve multiple replication groups, transaction managers from each group cooperate to perform two-phase commit. [Figure: replication between tablets (excerpt from the research paper).]
  • 43. So..., what's the problem? ● The problem with a Paxos-based algorithm is that part of the replication is done asynchronously. ○ When a majority of the replicas have agreed to write the data, it's considered to be committed. The remaining replication will be done asynchronously. ○ If you enforce genuine full replication on each write, performance will be highly degraded. (This is partly the reason for the limited strongly consistent updates on Megastore.) ● Spanner associates timestamps with all writes, and every replica tracks a value called "safe time: t-safe", which is the maximum timestamp at which a replica is up-to-date. ○ A replica can satisfy a read request for a timestamp t if t <= t-safe. If not, another replica is used. ○ t-safe advances at each Paxos write. During a transaction, the advancement is delayed until the transaction finishes.
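The t-safe rule can be sketched as follows. This is a deliberately simplified model: the `Replica` class and `read_at` helper are hypothetical, timestamps are plain integers, and the delay of t-safe during in-flight transactions is not modelled.

```python
class Replica:
    """Hypothetical replica that tracks t_safe and multi-version data."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.t_safe = 0      # maximum timestamp at which this replica is up-to-date
        self.versions = {}   # key -> list of (timestamp, value), in write order

    def apply_write(self, timestamp: int, key: str, value: str) -> None:
        self.versions.setdefault(key, []).append((timestamp, value))
        self.t_safe = max(self.t_safe, timestamp)  # t_safe advances with each applied write

    def can_serve(self, t: int) -> bool:
        return t <= self.t_safe

    def read(self, key: str, t: int):
        """Return the newest value written at or before timestamp t."""
        candidates = [(ts, v) for ts, v in self.versions.get(key, []) if ts <= t]
        return max(candidates)[1] if candidates else None


def read_at(replicas, key: str, t: int):
    """Use the first replica whose t_safe covers t; otherwise give up."""
    for replica in replicas:
        if replica.can_serve(t):
            return replica.name, replica.read(key, t)
    raise RuntimeError("no replica is sufficiently up-to-date")


fresh, lagging = Replica("fresh"), Replica("lagging")
for replica in (fresh, lagging):
    replica.apply_write(10, "x", "v1")
fresh.apply_write(20, "x", "v2")  # the lagging replica has not applied this write yet

print(read_at([lagging, fresh], "x", 10))
print(read_at([lagging, fresh], "x", 20))
```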
  • 44. So..., again, what's the problem? ● The timestamp-based tracking requires that the clocks on all replicas are synchronized. ○ At least, clocks should be calibrated within a limited amount of uncertainty, and the range of uncertainty must be known to the system. ● Spanner clusters are equipped with the TrueTime API system, which consists of multiple time servers using GPS and atomic clocks. ○ The TrueTime API provides a time interval that is guaranteed to contain the current time. [Figure: fluctuations of time drifts from time servers (excerpt from the research paper), annotated with "Hardware maintenance of two time servers" and "Network latency improvement".]
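A toy model of the interval-based clock may help here. The Spanner paper describes `TT.now()` as returning an interval with `earliest` and `latest` bounds, plus `TT.after(t)` and `TT.before(t)`. Everything else below is an assumption for illustration: the `TrueTimeSketch` class, the fixed uncertainty `epsilon`, and the injected fake clock.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass(frozen=True)
class TTInterval:
    earliest: float
    latest: float


class TrueTimeSketch:
    """Toy clock that reports an interval instead of a single instant."""

    def __init__(self, clock: Callable[[], float], epsilon: float) -> None:
        self.clock = clock      # local clock reading, injected so the example is deterministic
        self.epsilon = epsilon  # assumed bound on the clock uncertainty, in seconds

    def now(self) -> TTInterval:
        t = self.clock()
        return TTInterval(t - self.epsilon, t + self.epsilon)

    def after(self, t: float) -> bool:
        """True only if t has definitely passed."""
        return self.now().earliest > t

    def before(self, t: float) -> bool:
        """True only if t has definitely not arrived yet."""
        return self.now().latest < t


state = {"t": 100.0}
tt = TrueTimeSketch(lambda: state["t"], epsilon=0.004)

commit_ts = tt.now().latest    # pick a timestamp that is not in the past
print(tt.after(commit_ts))     # still inside the uncertainty window
state["t"] = 100.009           # wait a little over 2 * epsilon
print(tt.after(commit_ts))     # now the timestamp has definitely passed
```

Waiting until `after(commit_ts)` holds is the idea behind what the paper calls commit wait: the smaller the uncertainty, the shorter the wait.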