NoSQL
Why NoSQL?
NoSQL Categories
Relational Vs NoSQL Databases
Why Key/value store?
Memcached (Key/value store on memory)
Memcachedb (Key/Value store on disk)
BerkeleyDB
Document Stores
Other Info
NoSQL
NOT only SQL. It’s not about saying that SQL should never be used, or that SQL
is dead… it’s about recognizing that for some problems other storage solutions are
better suited.
Why NoSQL?
Trends that gave way for NoSQL paradigm
Exploding Data Size – Each year more and more digital data is created.
Over two years we create more digital data than all the data created in
history before that.
Increasing Connectedness – Over time data has evolved to be more and
more interlinked and connected. Hypertext has links, Blogs have pingback,
Tagging groups all related data.
Semi-structure – Individualization of content, Store more about each entity,
Acceleration of decentralized content generation (web 2.0)
Architecture – Moving towards decoupled services with their own backend
Sources: http://www.slideshare.net/novelys/nosql-3272395
http://www.slideshare.net/marin_dimitrov/nosql-databases-3584443
http://www.slideshare.net/thobe/nosql-for-dummies
NoSQL Categories
NoSQL Products
Relational Vs NoSQL Databases
Key/value store
Why Key/value store?
Even though RDBMS have provided database users with the best mix of simplicity,
robustness, flexibility, performance, scalability, and compatibility, their
performance in each of these areas is not necessarily better than that of an alternate
solution pursuing one of these benefits in isolation. This has not been much of a
problem so far because the universal dominance of RDBMS has outweighed the
need to push any of these boundaries. Nonetheless, if you really had a need that
couldn't be answered by a generic relational database, alternatives have always been
around to fill those niches.
Today, we are in a slightly different situation. For an increasing number of
applications, one of these benefits is becoming more and more critical; and while
still considered a niche, it is rapidly becoming mainstream, so much so that for an
increasing number of database users this requirement is beginning to eclipse others
in importance. That benefit is scalability. As more and more applications are
launched in environments that have massive workloads, such as web services, their
scalability requirements can, first of all, change very quickly and, secondly, grow
very large. The first scenario can be difficult to manage if you have a relational
database sitting on a single in-house server. For example, if your load triples
overnight, how quickly can you upgrade your hardware? The second scenario can
be too difficult to manage with a relational database in general.
Relational databases scale well, but usually only when that scaling happens on a
single server node. When the capacity of that single node is reached, you need to
scale out and distribute that load across multiple server nodes. This is when the
complexity of relational databases starts to rub against their potential to scale. Try
scaling to hundreds or thousands of nodes, rather than a few, and the complexities
become overwhelming, and the characteristics that make RDBMS so appealing
drastically reduce their viability as platforms for large distributed systems.
For cloud services to be viable, vendors have had to address this limitation, because
a cloud platform without a scalable data store is not much of a platform at all. So, to
provide customers with a scalable place to store application data, vendors had only
one real option. They had to implement a new type of database system that focuses
on scalability, at the expense of the other benefits that come with relational
databases.
These efforts, combined with those of existing niche vendors, have led to the rise of
a new breed of database management system.
Source: http://www.slideshare.net/marc.seeger/keyvalue-stores-a-practical-overview
Memcached (Key/value store on memory)
Definition
Free & open source, high-performance, distributed memory object caching
system, generic in nature, but intended for use in speeding up dynamic web
applications by alleviating database load.
Memcached is an in-memory key-value store for small chunks of arbitrary data
(strings, objects) from results of database calls, API calls, or page rendering.
Memcached is simple yet powerful. Its simple design promotes quick deployment,
ease of development, and solves many problems facing large data caches. Its API is
available for most popular languages.
What is it made up of?
Client software, which is given a list of available memcached servers.
A client-based hashing algorithm, which chooses a server based on the
"key" input.
Server software, which stores your values with their keys into an internal
hash table.
Server algorithms, which determine when to throw out old data (if out of
memory), or reuse memory.
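The client-side hashing algorithm mentioned above can be sketched in a few lines. This is an illustrative model, not any particular client library's code; real clients often use consistent hashing so that adding a server remaps as few keys as possible, whereas this simple modulo scheme remaps most of them.

```python
import hashlib

def pick_server(key: str, servers: list) -> str:
    """Choose a memcached server for a key using simple
    (non-consistent) client-side hashing: hash the key,
    take it modulo the number of servers."""
    digest = hashlib.md5(key.encode("utf-8")).digest()
    index = int.from_bytes(digest[:4], "big") % len(servers)
    return servers[index]

servers = ["10.0.0.1:11211", "10.0.0.2:11211", "10.0.0.3:11211"]
# The same key always maps to the same server, so every client
# that uses the same algorithm agrees on where a key lives.
print(pick_server("user:42", servers))
```

Because the mapping is purely a function of the key and the server list, no coordination between servers is ever needed, which is exactly what keeps them disconnected from each other.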
What are the Design Philosophies?
Simple Key/Value Store
The server does not care what your data looks like. Items are made up of a key, an
expiration time, optional flags, and raw data. It does not understand data structures;
you must upload data that is pre-serialized. Some commands (incr/decr) may
operate on the underlying data, but the implementation is simplistic.
Smarts Half in Client, Half in Server
A "memcached implementation" is implemented partially in a client, and partially
in a server. Clients understand how to send items to particular servers, what to do
when they cannot contact a server, and how to fetch keys from the servers. The
servers understand how to receive items, and how to expire them.
Servers are Disconnected From Each Other
Memcached servers are generally unaware of each other. There is no crosstalk, no
synchronization, no broadcasting. The lack of interconnections means adding more
servers will usually add more capacity as you expect. There might be exceptions to
this rule, but they are exceptions and carefully regarded.
O(1) Everything
For everything it can, memcached commands are O(1). Each command takes
roughly the same amount of time to process every time, and should not get
noticeably slower anywhere. This goes back to the "Simple K/V Store" principle:
you don't want to be processing data in the cache service that your tens or
hundreds or thousands of webservers may need to access at the same time.
Forgetting Data is a Feature
Memcached is, by default, a Least Recently Used cache. It is designed to have items
expire after a specified amount of time. Both of these are elegant solutions to many
problems: expire items after a minute to limit stale data being returned, or flush
unused data in an effort to retain frequently requested information.
This further allows great simplification in how memcached works. No "pauses"
waiting for a garbage collector ensures low latency, and free space is lazily
reclaimed.
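The two eviction ideas described above, per-item expiration with lazy reclamation and least-recently-used eviction, can be modeled in a toy cache. Memcached itself is written in C and uses slab allocation, so this is only a conceptual sketch:

```python
import time
from collections import OrderedDict

class TinyCache:
    """Toy model of memcached's eviction behavior: items expire
    after a TTL (reclaimed lazily on read), and when the cache is
    full the least recently used item is thrown out."""
    def __init__(self, max_items=2):
        self.max_items = max_items
        self.items = OrderedDict()  # key -> (value, expires_at)

    def set(self, key, value, ttl):
        if key in self.items:
            del self.items[key]
        elif len(self.items) >= self.max_items:
            self.items.popitem(last=False)  # evict least recently used
        self.items[key] = (value, time.monotonic() + ttl)

    def get(self, key):
        entry = self.items.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() >= expires_at:
            del self.items[key]  # lazily reclaim expired data
            return None
        self.items.move_to_end(key)  # mark as recently used
        return value
```

Note that expired items cost nothing until they are read: there is no background sweeper, which is part of why memcached avoids garbage-collection pauses.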
Cache Invalidation is a Hard Problem
Given memcached's centralized-as-a-cluster nature, invalidating a cache
entry is trivial. Instead of broadcasting invalidations to all available hosts, clients
home in on the exact location of the data to be invalidated. You may complicate
matters further to suit your needs, and there are caveats, but you start from a
strong baseline.
Architecture
The system uses client–server architecture. The servers maintain a key–value
associative array; the clients populate this array and query it. Keys are up to 250
bytes long and values can be at most 1 megabyte large.
Clients use client side libraries to contact the servers which, by default, expose their
service at port 11211. Each client knows all servers; the servers do not
communicate with each other. If a client wishes to set or read the value
corresponding to a certain key, the client's library first computes a hash of the key
to determine the server that will be used. Then it contacts that server. The server
will compute a second hash of the key to determine where to store or read the
corresponding value.
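The wire format a client library sends to port 11211 is a simple text protocol. The sketch below only builds the command bytes (actually sending them would require a running server); the `set` line carries the key, the optional flags, the expiration time, and the value length, followed by the raw value:

```python
def build_set(key: str, value: bytes, flags: int = 0, exptime: int = 0) -> bytes:
    """Frame a memcached text-protocol 'set' command: a header line
    with key, flags, expiration, and value length, then the raw data."""
    header = "set {} {} {} {}\r\n".format(key, flags, exptime, len(value))
    return header.encode("ascii") + value + b"\r\n"

def build_get(key: str) -> bytes:
    """Frame a memcached text-protocol 'get' command."""
    return "get {}\r\n".format(key).encode("ascii")

print(build_set("greeting", b"hello"))  # b'set greeting 0 0 5\r\nhello\r\n'
```

Because values are sent as opaque bytes with an explicit length, the server never has to parse or understand them, which is the "Simple Key/Value Store" philosophy at the protocol level.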
The servers keep the values in RAM; if a server runs out of RAM, it discards the
oldest values. Therefore, clients must treat Memcached as a transitory cache; they
cannot assume that data stored in Memcached is still there when they need it. A
Memcached-protocol compatible product known as MemcacheDB provides
persistent storage. There is also a solution called Membase from NorthScale that
provides persistence, replication and clustering.
If all client libraries use the same hashing algorithm to determine servers, then
clients can read each other's cached data; this is obviously desirable.
A typical deployment will have several servers and many clients. However, it is
possible to use Memcached on a single computer, acting simultaneously as client
and server.
http://memcached.org/
How does this stuff work? a.k.a. "The Memcache Pattern"
(http://code.google.com/appengine/docs/python/memcache/usingmemcache.html#Pattern)
Memcache is typically used with the following pattern:
• The application receives a query from the user or the application.
• The application checks whether the data needed to satisfy that query is in
memcache.
o If the data is in memcache, the application uses that data.
o If the data is not in memcache, the application queries the datastore
and stores the results in memcache for future requests.
The pseudocode below represents a typical memcache request:
def get_data():
    data = memcache.get("key")
    if data is not None:
        return data
    else:
        data = query_for_data()  # fall back to the datastore
        memcache.add("key", data, 60)
        return data
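The read-through pattern above can be exercised without a server by standing in a fake client. The `FakeMemcache` class and `query_for_data` below are hypothetical stand-ins invented for this sketch; the point is that the datastore is hit only on a cache miss:

```python
class FakeMemcache:
    """In-memory stand-in for a memcache client, just to exercise
    the read-through pattern (a real client talks to a server)."""
    def __init__(self):
        self.store = {}
    def get(self, key):
        return self.store.get(key)
    def add(self, key, value, ttl):
        self.store.setdefault(key, value)  # ttl ignored in this toy model

memcache = FakeMemcache()
db_hits = 0

def query_for_data():
    global db_hits
    db_hits += 1  # stands in for an expensive datastore query
    return "expensive result"

def get_data():
    data = memcache.get("key")
    if data is not None:
        return data  # cache hit: no datastore work
    data = query_for_data()
    memcache.add("key", data, 60)
    return data

get_data()
get_data()
print(db_hits)  # the datastore was queried only once
```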
Memcached allows you to take memory from parts of your system where you have
more than you need and make it accessible to areas where you have less than you
need.
Memcached also allows you to make better
use of your memory. If you consider the
diagram to the right, you can see two
deployment scenarios:
1. Each node is completely independent
(top).
2. Each node can make use of memory
from other nodes (bottom).
The first scenario illustrates the classic
deployment strategy; however, you'll find
that it is wasteful both in the sense that the
total cache size is a fraction of the actual
capacity of your web farm, and in the
amount of effort required to keep the cache
consistent across all of those nodes.
With memcached, you can see that all of the
servers are looking into the same virtual
pool of memory. This means that a given
item is always stored and always retrieved
from the same location in your entire web
cluster.
Also, as the demand for your application grows to the point where you need to have
more servers, it generally also grows in terms of the data that must be regularly
accessed. A deployment strategy where these two aspects of your system scale
together just makes sense.
The illustration to the right only shows two web servers for simplicity, but the
property remains the same as the number increases. If you had fifty web servers,
you'd still have a usable cache size of 64MB in the first example, but in the second,
you'd have 3.2GB of usable cache.
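The arithmetic behind those figures is straightforward; assuming each web server donates 64MB to the cache, as in the text:

```python
# Usable cache size under the two deployment scenarios described above.
per_node_mb = 64
for nodes in (2, 50):
    independent_mb = per_node_mb       # every node caches the same hot items
    pooled_mb = per_node_mb * nodes    # memcached: one shared virtual pool
    print("{} nodes: independent {} MB, pooled {} MB".format(
        nodes, independent_mb, pooled_mb))
```

With 50 nodes the pooled cache reaches 50 x 64MB = 3200MB, i.e. the 3.2GB mentioned above, while the independent deployment is still stuck at 64MB of effective cache.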
Of course, you aren't required to use your web server's memory for cache. Many
memcached users have dedicated machines that are built to only be memcached
servers.
Users of Memcached
LiveJournal, Wikipedia, Flickr, Bebo, Twitter, Typepad, Yellowbot, Youtube,
Digg, Wordpress, Craigslist, Mixi
Memcachedb (Key/Value store on disk)
Definition (Wiki:
http://en.wikipedia.org/wiki/Memcachedb)
MemcacheDB is a persistence-enabled variant of memcached, a general-purpose distributed
memory caching system often used to speed up dynamic database-driven
websites by caching data and objects in memory. The main difference between
MemcacheDB and memcached is that MemcacheDB has its own key-value
database system based on Berkeley DB, so it is meant for persistent storage
rather than as a cache solution. MemcacheDB is accessed through the same protocol
as memcached, so applications may use any memcached API as a means of
accessing a MemcacheDB database.
MemcacheQ is a MemcacheDB variant that provides a simple message queue
service.
MemcacheDB is a distributed key-value storage system designed for persistence. It
is NOT a cache solution, but a persistent storage engine for fast and reliable
key-value based object storage and retrieval. It conforms to the memcache protocol, so
any memcached client can have connectivity with it. MemcacheDB uses Berkeley
DB as a storing backend, so lots of features including transaction and replication
are supported.
Memcached was first developed by Brad Fitzpatrick for his website LiveJournal, on
May 22, 2003.
Features
High-performance read/write for key-value based objects. Rapid set/get
for key-value based objects, not relational. The benchmark below tells the
truth.
Highly reliable persistent storage with transactions. Transactions are used to
make your data more reliable.
High-availability data storage with replication. Replication rocks!
Achieve HA, spread your reads, make your transactions durable!
Memcache protocol compatibility. Lots of memcached client APIs can be
used with Memcachedb, in almost any language: Perl, C, Python, Java, ...
Why memcachedb?
We have MySQL, we have PostgreSQL, we have a lot of RDBMSs, so why do we
need Memcachedb?
RDBMSs are slow. They all have a complicated SQL engine on top of storage.
Our data needs to be stored and retrieved extremely fast.
They do not handle concurrency well, when thousands of clients and millions
of requests arrive.
But the data we want to store is very small! The cost is high if we use an
RDBMS.
Many critical infrastructure services need fast, reliable data storage and
retrieval, but do not need the flexibility of dynamic SQL queries.
o Index, Counter, Flags
o Identity Management (Account, Profile, User config info, Score)
o Messaging
o Personal domain name
o Metadata of distributed systems
o Other non-relational data
Performance Benchmark:
MemcacheDB is very fast.
Environment
• Box: Dell 2950III
• OS: Linux CentOS 5
• Version: memcachedb-1.0.0-beta
• Client API: libmemcached
a. Non-thread Edition
Started: memcachedb -d -r -u root -H /data1/mdbtest/ -N -v
Write (key: 16 value: 100B, 8 concurrent processes, each doing 2,000,000 sets)
No. 1 2 3 4 5 6 7 8 avg.
Cost(s) 807 835 840 853 859 857 865 868 848
2000000 * 8 / 848 = 18868 w/s
Read (key: 16 value: 100B, 8 concurrent processes, each doing 2,000,000 gets)
No. 1 2 3 4 5 6 7 8 avg.
Cost(s) 354 354 359 358 357 364 363 365 360
2000000 * 8 / 360 = 44444 r/s
b. Thread Edition (4 Threads)
Started: memcachedb -d -r -u root -H /data1/mdbtest/ -N -t 4 -v
Write (key: 16 value: 100B, 8 concurrent processes, each doing 2,000,000 sets)
No. 1 2 3 4 5 6 7 8 avg.
Cost(s) 663 669 680 680 684 683 687 686 679
2000000 * 8 / 679 = 23564 w/s
Read (key: 16 value: 100B, 8 concurrent processes, each doing 2,000,000 gets)
No. 1 2 3 4 5 6 7 8 avg.
Cost(s) 245 249 250 248 248 249 251 250 249
2000000 * 8 / 249 = 64257 r/s
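The throughput figures quoted after each table follow directly from the totals: total operations (8 processes x 2,000,000 ops) divided by the average wall-clock time, rounded to the nearest integer.

```python
def throughput(ops_per_process, processes, avg_seconds):
    """Ops per second, as computed in the benchmark tables above."""
    return round(ops_per_process * processes / avg_seconds)

# Reproduce the document's figures:
print(throughput(2_000_000, 8, 848))  # 18868 w/s (non-thread writes)
print(throughput(2_000_000, 8, 360))  # 44444 r/s (non-thread reads)
print(throughput(2_000_000, 8, 679))  # 23564 w/s (4-thread writes)
print(throughput(2_000_000, 8, 249))  # 64257 r/s (4-thread reads)
```

The 4-thread edition improves write throughput by roughly 25% and read throughput by roughly 45% over the non-threaded build on this hardware.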
How does this stuff work?
Source: http://memcachedb.org/ and http://memcachedb.org/memcachedb-guide-1.0.pdf
Non Thread version
Thread version
BerkeleyDB
(Persistent storage used by memcachedb)
Source: http://www.oracle.com/technology/products/berkeley-db/db/index.html
Oracle Berkeley DB is a high-performance embeddable database providing SQL,
Java Object and Key/Value storage. Berkeley DB offers advanced features
including transactional data storage, highly concurrent access, replication for high
availability, and fault tolerance in a self-contained, small footprint software library.
Berkeley DB enables the development of custom data management solutions,
without the overhead traditionally associated with such custom projects. Berkeley
DB provides a collection of well-proven building-block technologies that can be
configured to address any application need from the handheld device to the
datacenter, from a local storage solution to a world-wide distributed one, from
kilobytes to petabytes.
Berkeley DB can be downloaded and its source code reviewed; you then choose
your build options and compile the library in the configuration most suitable
for your needs. The Berkeley DB library is a building block that provides the
complex data management features found in enterprise-class databases. These
facilities include high throughput, low-latency reads, non-blocking writes, high
concurrency, data scalability, in-memory caching, ACID transactions, automatic
and catastrophic recovery when the application, system or hardware fails, and
high availability with replication, all in an application-configurable package.
Simply configure the library and use the particular features needed to satisfy
your particular application's requirements.
Oracle Berkeley DB fits where you need it regardless of programming language,
hardware platform, or storage media. Berkeley DB APIs are available in almost all
programming languages including ANSI-C, C++, Java, C#, Perl, Python, Ruby and
Erlang to name a few. There is a pure-Java version of the Berkeley DB library
designed for products that must run entirely within a Java Virtual Machine (JVM).
We support the Microsoft .NET environment and the Common Language Runtime
(CLR) with a C# API. Oracle Berkeley DB is tested and certified to compile and
run on all modern operating systems including Solaris, Windows, Linux, Android,
Mac OS/X, BSD, iPhone OS, VxWorks, and QNX to name a few.
Storage engine design
BerkeleyDB:
- Written in C
- Software Library
- Key/value API
- SQL API by incorporating SQLite
- BTREE, HASH, QUEUE, RECNO storage
- APIs for C++, Java/JNI, C#, Python, Perl, ...
- Replication for High Availability

BerkeleyDB Java Ed.:
- Written in Java
- Java Software Archive (JAR)
- Key/value API
- Java Direct Persistence Layer (DPL) API
- Java Collections API
- Replication for High Availability

BerkeleyDB XML:
- Written in C++
- Software Library
- Layered on Berkeley DB
- XQuery API by incorporating XQilla
- Indexed, optimized XML storage
- APIs for C++, Java/JNI, C#, Python, Perl, ...
- Replication for High Availability
Use cases of BerkeleyDB
Amazon’s Dynamo -
http://www.allthingsdistributed.com/2007/10/amazons_dynamo.html
BerkeleyDB Java Ed. on Android -
http://www.oracle.com/technetwork/database/berkeleydb/bdb-je-android-160932.pdf
Infoflex Connect AB Embeds Critical Edge into High-Speed, High-Performance
SMS Messaging Gateway -
http://www.oracle.com/customers/snapshots/infoflex-connect-database-snapshot.pdf