These slides present three key-value stores that use a log structure: Riak, RethinkDB, and LevelDB. Note that my claim that RethinkDB employs an append-only B-tree is an estimate based on a combination of guesswork and reasoning.
4. Log Structure
• A log-structured file system is a file system design first
proposed in 1988 by John K. Ousterhout and Fred Douglis.
• Designed for high write throughput: all updates to data and
metadata are written sequentially to a continuous stream,
called a log.
• By contrast, conventional file systems tend to lay out files
with great care for spatial locality and make in-place changes
to their data structures.
5. Log Structure for flash memory
• Random writes degrade system performance and shorten
the lifetime of flash memory.
• A log structure is natively flash-friendly!
[Figure: updating data on a magnetic disk versus on flash memory. Labels: new data, data, free, erased, RAM, block.]
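The contrast between in-place updates and appends on flash can be sketched with a toy model. This is only an illustration under stated assumptions: the 4-page block geometry, the `FlashBlock` class and both helper functions are hypothetical and do not describe any real device or driver.

```python
PAGES_PER_BLOCK = 4  # assumed geometry, for illustration only


class FlashBlock:
    """A flash block: pages are writable only while erased (None)."""

    def __init__(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count = 0

    def erase(self):
        self.pages = [None] * PAGES_PER_BLOCK
        self.erase_count += 1


def update_in_place(block, page_no, data):
    """Overwrite one page: copy the block to RAM, erase it, write it back."""
    ram_copy = list(block.pages)
    ram_copy[page_no] = data
    block.erase()
    block.pages = ram_copy


def append_to_log(blocks, cursor, data):
    """Write data to the next erased page and return the advanced cursor."""
    block_no, page_no = divmod(cursor, PAGES_PER_BLOCK)
    if blocks[block_no].pages[page_no] is not None:
        raise RuntimeError("log reached a page that has not been erased")
    blocks[block_no].pages[page_no] = data
    return cursor + 1


def demo():
    updates = ["new data 1", "newer data 1", "newest data 1"]

    # In place: the same page is rewritten three times.
    block = FlashBlock()
    block.pages = ["data 1", "data 2", "data 3", "data 4"]
    for data in updates:
        update_in_place(block, 0, data)

    # Log-structured: the same three updates go to fresh pages.
    log_blocks = [FlashBlock(), FlashBlock()]
    cursor = 0
    for data in updates:
        cursor = append_to_log(log_blocks, cursor, data)

    return block.erase_count, sum(b.erase_count for b in log_blocks)
```

In this model `demo()` counts three erases for the in-place updates and none for the appended ones. The log still has to reclaim obsolete pages eventually, which is where garbage collection comes in; the sketch leaves that out.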
7. Riak?
• Riak is an open source, highly scalable, fault-tolerant
distributed database.
• Supported core features:
- operate in highly distributed environments
- no single point of failure
- highly fault-tolerant
- scales simply and intelligently
- high data availability
- low cost of operations
8. Bitcask
• A Bitcask instance is a directory, and only one
operating system process will open that Bitcask for
writing at a given time.
• The active file is only written by appending, which
means that sequential writes do not require disk
seeking.
9. Hash Index: keydir
• A keydir is simply a hash table that maps every key in
a Bitcask to a fixed-size structure giving the file, offset
and size of the most recently written entry for that
key.
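A minimal sketch of the append-only active file together with a keydir may make this concrete. The on-disk record layout used here (two big-endian lengths, then key and value) and the `TinyCask` class are assumptions made for illustration; they are not Bitcask's actual file format or API.

```python
import os
import struct
import tempfile

# Assumed, simplified record header: key size and value size.
HEADER = struct.Struct(">II")


class TinyCask:
    def __init__(self, directory):
        self.path = os.path.join(directory, "0.data")  # the single active file
        self.active = open(self.path, "ab")            # append-only writes
        self.offset = os.path.getsize(self.path)
        self.keydir = {}  # key -> (file path, value offset, value size)

    def put(self, key: bytes, value: bytes):
        record = HEADER.pack(len(key), len(value)) + key + value
        self.active.write(record)
        self.active.flush()
        value_offset = self.offset + HEADER.size + len(key)
        # The keydir always points at the most recently written entry.
        self.keydir[key] = (self.path, value_offset, len(value))
        self.offset += len(record)

    def get(self, key: bytes):
        entry = self.keydir.get(key)
        if entry is None:
            return None
        path, value_offset, size = entry
        with open(path, "rb") as f:
            f.seek(value_offset)  # one seek, then one read
            return f.read(size)

    def close(self):
        self.active.close()


def demo():
    with tempfile.TemporaryDirectory() as directory:
        db = TinyCask(directory)
        db.put(b"user:1", b"alice")
        db.put(b"user:1", b"bob")  # appended; the old entry stays on disk
        value = db.get(b"user:1")
        file_size = os.path.getsize(db.path)
        db.close()
        return value, file_size
```

After the two writes the file holds both entries, but the keydir resolves `user:1` to the newer one, so a read needs a single hash lookup and a single seek.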
10. Merge
• The merge process iterates over all non-active files
and produces as output a set of data files containing
only the “live” or latest versions of each present key.
• During the merge process, for each merged data file,
a byproduct called a hint file is generated, which can
be used to make startup and crash recovery easy.
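The merge step can be sketched as follows. This is a simplified model, not Bitcask's implementation: data files are represented as in-memory lists of records, the merged file stores values only, deletions are modeled with a `None` tombstone, and the function names are hypothetical.

```python
def merge(immutable_files):
    """Merge non-active files into one data blob plus hint entries.

    immutable_files: list of files, oldest first; each file is a list of
    (key, value) records in write order. A value of None is an assumed
    deletion marker.
    """
    latest = {}
    for records in immutable_files:
        for key, value in records:
            latest[key] = value  # later records override earlier ones

    merged = bytearray()
    hints = []  # (key, value offset, value size) per live key
    for key, value in latest.items():
        if value is None:
            continue  # deleted keys are dropped from the merged output
        hints.append((key, len(merged), len(value)))
        merged += value
    return bytes(merged), hints


def keydir_from_hints(hints):
    """Rebuild the keydir from hint entries without scanning the values."""
    return {key: (offset, size) for key, offset, size in hints}


def demo():
    old_files = [
        [(b"a", b"1"), (b"b", b"2")],
        [(b"a", b"10"), (b"b", None), (b"c", b"3")],
    ]
    merged, hints = merge(old_files)
    keydir = keydir_from_hints(hints)
    offset, size = keydir[b"a"]
    return merged, hints, merged[offset:offset + size]
```

The hint entries carry everything the keydir needs, which is why loading them at startup is cheaper than replaying every record in the data files.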
12. RethinkDB?
• RethinkDB is a persistent, industrial-strength key-value store
with full support for the Memcached protocol.
• Powerful technology:
- Linear scaling across cores
- Fine-grained durability control
- Instantaneous recovery on power failure
• Supported core features:
- Atomic increment/decrement
- Values up to 10MB in size
- Multi-GET support
- Up to one million transactions per second on commodity hardware
13. Installation & usage
• RethinkDB works on modern 64-bit distributions of
Linux.
Ubuntu 10.04.1 x86_64
Ubuntu 10.10 x86_64
Red Hat Enterprise Linux 5 x86_64
CentOS 5 x86_64
SUSE Linux 10
• Running the rethinkdb server:
Default installation path: /usr/bin/rethinkdb-1.0
./rethinkdb-1.0 -f /u01/rethinkdb_data
./rethinkdb-1.0 -f /u01/rethinkdb_data -c 4 -p 11500
./rethinkdb-1.0 -f /u01/rethinkdb_data -f /u03/rethinkdb_data -c 4 -p 11500
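Since the server is described as speaking the Memcached protocol, a client could talk to it with plain memcached text commands. The sketch below builds and parses such messages; the helper names are hypothetical, and the host and port in `send` simply assume a server started with `-p 11500` on the local machine.

```python
import socket


def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Memcached text command: set <key> <flags> <exptime> <bytes>."""
    header = f"set {key} {flags} {exptime} {len(value)}\r\n".encode()
    return header + value + b"\r\n"


def build_get(key: str) -> bytes:
    return f"get {key}\r\n".encode()


def parse_get_response(data: bytes):
    """Parse a single-key 'get' reply; returns the value, or None on a miss."""
    if data == b"END\r\n":
        return None
    header, rest = data.split(b"\r\n", 1)
    _, _key, _flags, length = header.split()  # VALUE <key> <flags> <bytes>
    return rest[: int(length)]


def send(request: bytes, host: str = "localhost", port: int = 11500) -> bytes:
    """Send one request to an assumed local server and return the raw reply.

    A single recv is only enough for small replies; a real client would
    keep reading until the reply is complete.
    """
    with socket.create_connection((host, port)) as conn:
        conn.sendall(request)
        return conn.recv(65536)
```

With a server running, `send(build_set("greeting", b"hello"))` followed by `parse_get_response(send(build_get("greeting")))` would be one way to exercise a store and a fetch.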
14. The methodology
• First, the lack of mechanical parts makes random reads
on SSDs very efficient!
• Second, random writes trigger more erases, making
these operations expensive and decreasing the drive
lifetime!
• RethinkDB takes an append-only approach to storing
data, pioneered by log-structured file systems!
What are the consequences of append-only?
15. Append-only consequences
• Benefits:
- Data Consistency
- Hot Backups
- Instantaneous Recovery
- Easy Replication
- Lock-Free Concurrency
- Live Schema Changes
- Database Snapshots
• Drawbacks:
1) Eliminating data locality requires a larger number of
disk accesses.
2) A large amount of data quickly becomes obsolete in
an environment with a heavy insert or update workload.
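Why never overwriting data makes snapshots and lock-free reads cheap can be illustrated with a copy-on-write tree. This is a sketch under loud assumptions: it uses a plain binary search tree with path copying as a stand-in, not a B-tree, and it is not RethinkDB's actual storage engine.

```python
from typing import NamedTuple, Optional


class Node(NamedTuple):
    key: str
    value: str
    left: Optional["Node"] = None
    right: Optional["Node"] = None


def insert(root, key, value):
    """Return a new root; only nodes on the search path are copied."""
    if root is None:
        return Node(key, value)
    if key < root.key:
        return root._replace(left=insert(root.left, key, value))
    if key > root.key:
        return root._replace(right=insert(root.right, key, value))
    return root._replace(value=value)


def lookup(root, key):
    while root is not None:
        if key == root.key:
            return root.value
        root = root.left if key < root.key else root.right
    return None


def demo():
    v1 = insert(insert(insert(None, "m", "1"), "c", "2"), "x", "3")
    snapshot = v1                 # a snapshot is just a kept root pointer
    v2 = insert(v1, "c", "20")    # writes new nodes, never mutates old ones
    shares_subtree = v2.right is v1.right
    return lookup(snapshot, "c"), lookup(v2, "c"), shares_subtree
```

Readers holding the old root keep a consistent view without locks, while the old nodes that no root references anymore are exactly the obsolete data listed as a drawback.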
18. LevelDB?
• LevelDB is a fast key-value storage library written at
Google that provides an ordered mapping from string
keys to string values.
• Supported core features:
- Data is stored sorted by key
- Multiple changes can be made in one atomic batch
- Users can create a transient snapshot to get a consistent
view of data
- Data is automatically compressed using the Snappy
compression library
19. Installation & usage
• LevelDB works with Snappy, a compression/decompression
library.
download snappy from http://code.google.com/p/snappy/
cd snappy-1.0.4
./configure && make && make install
• LevelDB is a library, not a database server!
svn checkout http://leveldb.googlecode.com/svn/trunk/ leveldb-read-only
cd leveldb-read-only
make && cp libleveldb.a /usr/local/lib
cp -r include/leveldb /usr/local/include
20. Log-structured merge tree
• Log file: a log file (*.log) stores a sequence of recent
updates, and each update is appended to the current
log file.
• Memtable: an in-memory structure that keeps a copy of
the current log file.
• Sorted tables: a sorted table (*.sst) stores a sequence
of entries sorted by key, and each entry is either a
value for the key or a deletion marker for the key.
[Figure: LSM-tree layout. Labels: Read, Write, Memtable, Log file, Memory, Disk, SSTable, Level-0, Level-1.]
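The interplay of log file, memtable and sorted tables can be sketched in a few lines. This is a toy model, not LevelDB's code: the `TinyLSM` class, the flush threshold and the in-memory stand-ins for *.log and *.sst files are all assumptions, and compaction between levels is left out.

```python
import bisect

TOMBSTONE = object()  # deletion marker


class TinyLSM:
    def __init__(self, memtable_limit=2):
        self.log = []        # stands in for the current *.log file
        self.memtable = {}   # in-memory copy of the current log
        self.sstables = []   # stand-ins for *.sst files, newest first
        self.memtable_limit = memtable_limit

    def put(self, key, value):
        self.log.append((key, value))  # every update is appended to the log
        self.memtable[key] = value
        if len(self.memtable) >= self.memtable_limit:
            self._flush()

    def delete(self, key):
        self.put(key, TOMBSTONE)

    def _flush(self):
        """Write the memtable out as a table sorted by key; start a new log."""
        items = sorted(self.memtable.items(), key=lambda kv: kv[0])
        keys = [k for k, _ in items]
        values = [v for _, v in items]
        self.sstables.insert(0, (keys, values))
        self.memtable = {}
        self.log = []

    def get(self, key):
        if key in self.memtable:
            value = self.memtable[key]
            return None if value is TOMBSTONE else value
        for keys, values in self.sstables:  # newest table wins
            i = bisect.bisect_left(keys, key)
            if i < len(keys) and keys[i] == key:
                return None if values[i] is TOMBSTONE else values[i]
        return None


def demo():
    db = TinyLSM(memtable_limit=2)
    db.put("b", "1")
    db.put("a", "2")   # memtable full: flushed as a sorted table
    db.put("a", "3")
    db.delete("b")     # flushed again, carrying a deletion marker
    db.put("c", "4")   # still only in the memtable and the log
    return db.get("a"), db.get("b"), db.get("c"), len(db.sstables)
```

A read checks the memtable first and then the tables from newest to oldest, so the deletion marker for `b` hides the older value that still sits in the first table.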
22. Conclusion
• A log structure enjoys high write throughput and
makes data consistency, hot backups, recovery and
snapshots easy.
• A log structure eliminates data locality, so queries
require a larger number of random disk accesses.
• An efficient garbage collection method is very
important to a log-structured storage system.