NoSql And The Semantic Web

NoSQL and the Semantic/Social Web
Irina Hutanu

Alexandru Ioan Cuza University, Computer Science,
Computational Linguistics (2nd year)
Faculty of Letters graduate

{Irina Hutanu, irina.hutanu@gmail.com}

Abstract. NoSQL is a new and promising approach to storing and managing information at
web scale. “Not only SQL”[5], as many define it, is spreading rapidly because its
non-relational design allows data to be distributed more easily across many machines
(horizontal scaling). In what follows we try to clarify what this newborn movement is about.

1 Introduction

NoSQL databases can handle large amounts of information thanks to a few features that
increase their storage capacity:

• The consistency requirement is relaxed. According to the CAP theorem, a distributed
system cannot guarantee Consistency, Availability and Partition tolerance all at the same time.
• Key/value storage: a deliberately simple data model.
• They run on a large number of machines, with the information replicated and
partitioned among them.

Some of the most important and highly rated database systems that work in this manner are
Google Bigtable, HBase, Hypertable, Amazon Dynamo, Voldemort, Cassandra, Riak, CouchDB,
MongoDB and Redis.

Data-driven sites like Amazon.com, Google and Facebook work with terabytes of information that
must be scaled and partitioned very efficiently. These Internet giants also run tens of thousands
of servers located all around the world. Consequently, failures occur constantly, yet the service
must stay “always-on”. Any problem that occurs while a customer queries the database and cuts
him or her off from the requested information may lead to serious financial loss. Such risks
cannot be taken lightly, which is why systems like Dynamo and Bigtable emerged. Their
non-relational architecture, incremental scalability and decentralized character offer a robust
data storage system.

2 Architecture
2.1 Partitioning Process

One important requirement for a NoSQL system is that it scales incrementally. To make this
happen quickly and evenly, Dynamo, for example, uses virtual nodes in the partitioning process.
That means that a physical node is mapped not to a single position but to several, so non-uniform
data distribution is less of a problem. Also, if a node has limited access or disappears because
of a system failure, the load it handled is spread across the remaining working nodes.
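The idea of mapping one physical node to several positions can be illustrated with a small consistent-hashing sketch. This is only an illustration under stated assumptions, not Dynamo's actual implementation: the class name HashRing, the number of virtual nodes per machine and the "node#vnodeN" labels are all hypothetical, and MD5 is used here simply as a convenient way to place strings on the ring.

```python
import bisect
import hashlib


def ring_position(key: str) -> int:
    """Place a string on the ring (using MD5 is an assumption of this sketch)."""
    return int(hashlib.md5(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hashing ring where each physical node owns many positions."""

    def __init__(self, vnodes_per_node: int = 100):
        self.vnodes_per_node = vnodes_per_node  # hypothetical tuning value
        self._positions = []  # sorted ring positions of all virtual nodes
        self._owner = {}      # ring position -> physical node name

    def add_node(self, node: str) -> None:
        # One physical node is mapped to several positions on the ring.
        for i in range(self.vnodes_per_node):
            pos = ring_position(f"{node}#vnode{i}")
            self._owner[pos] = node
            bisect.insort(self._positions, pos)

    def remove_node(self, node: str) -> None:
        # Dropping a node removes all of its virtual nodes at once.
        self._positions = [p for p in self._positions if self._owner[p] != node]
        self._owner = {p: n for p, n in self._owner.items() if n != node}

    def node_for(self, key: str) -> str:
        # Walk clockwise to the first virtual node at or after the key's position.
        idx = bisect.bisect_left(self._positions, ring_position(key))
        if idx == len(self._positions):
            idx = 0  # wrap around the ring
        return self._owner[self._positions[idx]]
```

With this layout, removing a machine only reassigns the keys that machine owned, and because its virtual nodes are scattered around the ring, those keys tend to be shared out among all the remaining machines rather than landing on a single neighbour.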

Bigtable, another non-relational storage system, partitions and retrieves data differently.
Being “a sparse, distributed, persistent multi-dimensional sorted map”[1], it indexes data by
row key, column key and timestamp. Partitioning takes place dynamically and is applied to
ranges of rows.
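A rough sketch of such a map, and of partitioning by row range, might look as follows. It is a toy model and not Bigtable's API: the names SortedTable, range_for_row and split_range, as well as the sample split points, are hypothetical and chosen only for illustration.

```python
import bisect


class SortedTable:
    """Toy (row, column, timestamp) -> value map; not Bigtable's real interface."""

    def __init__(self):
        self._cells = {}

    def put(self, row, column, timestamp, value):
        self._cells[(row, column, timestamp)] = value

    def get_latest(self, row, column):
        # Collect every timestamped version of the cell and return the newest.
        versions = [(ts, v) for (r, c, ts), v in self._cells.items()
                    if r == row and c == column]
        if not versions:
            return None
        return max(versions, key=lambda pair: pair[0])[1]

    def rows(self):
        # Rows are exposed in sorted order, which is what makes range splits possible.
        return sorted({r for (r, _, _) in self._cells})


def range_for_row(split_points, row):
    """Index of the row range that holds `row`, given sorted split points."""
    return bisect.bisect_right(split_points, row)


def split_range(split_points, new_split):
    """Return new split points with one extra boundary (a dynamic split)."""
    updated = list(split_points)
    bisect.insort(updated, new_split)
    return updated
```

Because rows are kept in sorted order, a range that grows too large can be divided by adding one more split point, without touching the rows that live in other ranges.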

2.2 Replication

Non-stop data availability is also ensured by replication. In general, these systems
replicate the information they store on multiple hosts in order to avoid data loss and to
offer durability.

Bigtable, for instance, uses a replication process that allows information to be duplicated across
different clusters, so latency is reduced and data is protected against loss: “The Personalized
Search data is replicated across several Bigtable clusters to increase availability and to reduce
latency due to distance from clients. The Personalized Search team originally built a client-side
replication mechanism on top of Bigtable that ensured eventual consistency of all replicas. The
current system now uses a replication subsystem that is built into the servers.”[1]

Fig. 1. Partitioning and Replication in Dynamo¹

2.3 Consistency versus Availability

If multiple versions of the same data exist, they must be reconciled. In a system that trades
consistency for availability, divergent versions are almost impossible to avoid. Dynamo, for
example, uses vector clocks to detect when two or more different versions of the same object
have emerged. When one version descends from another, the older one can simply be discarded.
When the versions are concurrent, the clocks alone cannot decide between them, so semantic
reconciliation is used. However, this approach adds load to the system, so it is used only
when the situation requires it.
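The role of vector clocks can be made concrete with a short sketch. It assumes a clock is a plain dictionary from node name to counter; the function names and the node names used in the usage note (sx, sy, sz) are hypothetical, and the code only illustrates the comparison rule, not Dynamo's actual data structures.

```python
def increment(clock: dict, node: str) -> dict:
    """Return a copy of the clock with this node's counter advanced by one."""
    updated = dict(clock)
    updated[node] = updated.get(node, 0) + 1
    return updated


def descends(a: dict, b: dict) -> bool:
    """True if clock `a` has seen every update recorded in clock `b`."""
    return all(a.get(node, 0) >= count for node, count in b.items())


def reconcile(versions):
    """Drop versions that are ancestors of another version.

    `versions` is a list of (clock, value) pairs. Whatever survives is a set
    of concurrent versions that the clocks cannot order; if more than one
    remains, a semantic merge is still needed.
    """
    survivors = []
    for i, (clock_i, value_i) in enumerate(versions):
        dominated = any(
            j != i and clock_j != clock_i and descends(clock_j, clock_i)
            for j, (clock_j, _) in enumerate(versions)
        )
        if not dominated and (clock_i, value_i) not in survivors:
            survivors.append((clock_i, value_i))
    return survivors
```

For example, if a version written through node sx is later updated independently through sy and through sz, neither of the two newer clocks descends from the other, so reconcile keeps both and leaves the final merge to application logic.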

Apart from some minor issues such as overloading, choosing availability over consistency has
produced interesting and unexpected results and, to some extent, a real success: “The
production use of Dynamo for the past year demonstrates that decentralized techniques can be
combined to provide a single highly-available system. Its success in one of the most
challenging application environments shows that an eventual-consistent storage system can be
a building block for highly-available applications.” [2]

¹ Image from Dynamo: Amazon’s Highly Available Key-value Store

Fig. 2. Version evolution of an object over time².

2.4 Gossip Protocol

This protocol is used both for propagating membership updates and for detecting failures. When
a node joins or leaves the system, the change is passed from node to node, allowing data to be
reorganized among the functioning nodes. To this end, every second each node contacts a
randomly chosen peer, and the two synchronize their histories of membership changes.
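A small simulation can show how such random pairwise exchanges spread membership information. It is a simplified sketch under several assumptions: the Node class, the (version, status) entries and the round-based loop are hypothetical, and peers are picked from a global list, which stands in for whatever bootstrap mechanism a real system would use.

```python
import random


class Node:
    def __init__(self, name):
        self.name = name
        # Membership view: member name -> (version, status). Higher version wins.
        self.view = {name: (1, "up")}

    def merge(self, other_view):
        # Keep, for every member, the entry with the highest version number.
        for member, (version, status) in other_view.items():
            if member not in self.view or version > self.view[member][0]:
                self.view[member] = (version, status)


def gossip_round(nodes, rng):
    """Each node contacts one random peer and the two reconcile their views."""
    for node in nodes:
        peer = rng.choice([n for n in nodes if n is not node])
        node.merge(peer.view)
        peer.merge(node.view)


def rounds_until_converged(nodes, rng, max_rounds=1000):
    """Run gossip rounds until every node holds the same view (or give up)."""
    for round_no in range(1, max_rounds + 1):
        gossip_round(nodes, rng)
        if all(n.view == nodes[0].view for n in nodes):
            return round_no
    return None
```

Calling rounds_until_converged on a handful of Node objects with a seeded random.Random instance lets one observe that every node eventually learns about every other node, even though no node ever talks to a central coordinator.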

Failure detection works through the same exchange of messages. A node considers a peer
unavailable if the peer does not respond to its messages. It then obtains the required
information from another node and periodically retries the unresponsive peer to check for
its recovery.
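The following sketch shows this purely local style of failure detection. All names (Peer, Coordinator, retry_suspected, the "__ping__" key) are hypothetical, a raised ConnectionError stands in for a missing response, and the sketch ignores how a recovered node would catch up on the data it missed.

```python
class Peer:
    """A replica that may or may not answer requests."""

    def __init__(self, name, alive=True):
        self.name = name
        self.alive = alive
        self.store = {}

    def handle_get(self, key):
        if not self.alive:
            # Stand-in for a request that never receives a reply.
            raise ConnectionError(f"{self.name} did not respond")
        return self.store.get(key)


class Coordinator:
    """Routes reads to replicas and keeps its own list of suspected peers."""

    def __init__(self, replicas):
        self.replicas = replicas
        self.suspected = set()

    def get(self, key):
        for peer in self.replicas:
            if peer.name in self.suspected:
                continue  # skip peers already considered unavailable
            try:
                return peer.handle_get(key)
            except ConnectionError:
                # No response: mark the peer locally and try the next replica.
                self.suspected.add(peer.name)
        raise RuntimeError("no replica available")

    def retry_suspected(self):
        """Called periodically to check whether suspected peers have recovered."""
        for peer in self.replicas:
            if peer.name in self.suspected:
                try:
                    peer.handle_get("__ping__")
                except ConnectionError:
                    continue
                self.suspected.discard(peer.name)
```

Note that the suspicion is kept only by the node that noticed the silence; nothing in the sketch announces the failure to the rest of the system.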

This is in fact a decentralized manner of detection because there is no superior entity
that points out the defective nodes. Instead, a gossip process enables each node to
“hear” about the arrival or departure of other nodes: “Dynamo adopts a full membership
model where each node is aware of the data hosted by its peers. To do this, each
node actively gossips the full routing table with other nodes in the system. This model works well
for a system that contains couple of hundreds of nodes.”[2]

² Image from Dynamo: Amazon’s Highly Available Key-value Store

Fig. 3. Gossip-style process³.

3 Final Remarks
A relatively new movement in the storage domain, NoSQL challenges classical SQL systems, which
are based on relational and centralized information processing. Today’s web requires
coordinating, manipulating and gathering vast quantities of data and knowledge. For such
workloads, traditional database applications are losing ground to non-relational systems that
avoid join operations and fixed schemas and, to some extent, even relax the ACID guarantees by
offering processes that are only “eventually consistent”[3].

4 References
[1] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach,
Mike Burrows, Tushar Chandra, Andrew Fikes, Robert E. Gruber, Bigtable: A Distributed Storage
System for Structured Data, in OSDI'06: Seventh Symposium on Operating System Design and
Implementation, Seattle, WA, November 2006.

[2] Giuseppe DeCandia, Deniz Hastorun, Madan Jampani, Gunavardhan Kakulapati,
Avinash Lakshman, Alex Pilchin, Swaminathan Sivasubramanian, Peter Vosshall
and Werner Vogels, Dynamo: Amazon’s Highly Available Key-value Store, 2007.

[3] Werner Vogels, Eventually Consistent - Revisited, 2008.

[4] SQL Databases Don't Scale

[5] http://nosql-databases.org/

³ Image from Pragmatic Programming Techniques
