Introducing Riak, Part 1: The language-
independent HTTP API
Store and retrieve data using Riak's HTTP interface

Simon Buckle (simon@simonbuckle.com)                            Skill Level: Intermediate
Independent Consultant
Freelance                                                             Date: 13 Mar 2012


  This is Part 1 of a two-part series about Riak, a highly scalable, distributed data
  store written in Erlang and based on Dynamo, Amazon's high availability key-
  value store. Learn the basics about Riak and how to store and retrieve items
  using its HTTP API. Explore how to use its Map/Reduce framework for doing
  distributed queries, how links allow relationships to be defined between objects,
  and how to query those relationships using link walking.

Introduction
Typical modern relational databases perform poorly on certain types of applications
and struggle to cope with the performance and scalability demands of today's
Internet applications. A different approach is needed. In the last few years, a new
type of data store, commonly referred to as NoSQL, has become popular as it
directly addresses some of the deficiencies of relational databases. Riak is one such
example of this type of data store.

Riak is not the only NoSQL data store out there. Two other popular data stores
are MongoDB and Cassandra. Although the three are similar in many ways, there are also
some significant differences. For example, Riak is a distributed system whereas
MongoDB is a single system database — Riak has no concept of a master node,
making it more resilient to failure. Though also based on Amazon's description of
Dynamo, Cassandra omits features such as vector clocks for tracking concurrent
versions of an object. Riak's data model is more flexible. In Riak, buckets are created
on the fly when they are first accessed; Cassandra's data model is defined in an XML
file so changing it requires having to reboot the entire cluster.

Another strength of Riak is that it is written in Erlang. MongoDB and Cassandra
are written in what can be referred to as general-purpose languages (C++ and
Java, respectively), whereas Erlang was designed from the ground up to support
distributed, fault-tolerant applications, and as such is better suited to building
applications, such as NoSQL data stores, that share some characteristics with the
applications Erlang was originally created for.

Map/Reduce jobs can be written only in Erlang or JavaScript. For this article,
we have chosen to write the map and reduce functions in JavaScript. While Erlang
code may be slightly quicker to execute, JavaScript is accessible to a larger
audience. See Resources for links to learn more about Erlang.

Getting started
If you want to try out some of the examples in this article, you need to install Riak
(see Resources) and Erlang on your system.

You also need to build a cluster of three nodes running on your local
machine. All data stored in Riak is replicated to a number of nodes in the cluster. A
property (n_val) on the bucket the data is stored in determines how many nodes it is
replicated to. The default value of this property is three, so we need to create a
cluster with at least three nodes (after that, you can create as many as you like) for
the replication to be effective.

After you download the source code, you need to build it. The basic steps are as
follows:

    1. Unpack the source: $ tar xzvf riak-1.0.1.tar.gz
    2. Change directory: $ cd riak-1.0.1
    3. Build: $ make all rel
This will build Riak (./rel/riak). To run multiple nodes locally you need to make copies
of ./rel/riak — one copy for each additional node. Copy ./rel/riak to ./rel/riak2, ./rel/
riak3 and so on, then make the following changes to each copy:

    • In riakN/etc/app.config, change the following values to something unique: the
      port specified in the http{} section, handoff_port, and pb_port (see the
      illustrative fragment after this list)
    • Open up riakN/etc/vm.args and change the name, again to something unique,
      for example, -name riak2@127.0.0.1
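For illustration, the relevant entries in riak2/etc/app.config might end up looking
like the fragment below. The port numbers here are arbitrary choices, not required
values; any unused ports will do:

{riak_core, [
    %% Each node needs its own HTTP port
    {http, [ {"127.0.0.1", 8099 } ]},
    %% Port used for handing off partitions between nodes
    {handoff_port, 8100 },
    ...
]},
{riak_kv, [
    %% Protocol Buffers interface port
    {pb_port, 8088 },
    ...
]},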
Now start each node in turn, as shown in Listing 1.

Listing 1. Starting each node
$ cd rel
$ ./riak/bin/riak start
$ ./riak2/bin/riak start
$ ./riak3/bin/riak start


Finally, join the nodes together to make a cluster, as shown in Listing 2.





Listing 2. Making a cluster
$ ./riak2/bin/riak-admin join riak@127.0.0.1
$ ./riak3/bin/riak-admin join riak@127.0.0.1


You should now have a 3-node cluster running locally. To test it, run the following
command: $ ./riak/bin/riak-admin status | grep ring_members.

You should see each node that is part of the cluster you just created, for example,
ring_members : ['riak2@127.0.0.1','riak3@127.0.0.1','riak@127.0.0.1'].

The Riak API
There are currently three ways of accessing Riak: an HTTP API (RESTful interface),
Protocol Buffers, and a native Erlang interface. Having more than one interface gives
you the benefit of being able to choose how to integrate your application. If you have
an application written in Erlang then it would make sense to use the native Erlang
interface so you have tight integration between the two. There are also other factors,
such as performance, that may play a part in deciding which interface to use. For
example, a client that uses the Protocol Buffers interface will perform better than
one that interacts with the HTTP API; less data is communicated and parsing all
those HTTP headers can be (relatively) costly in terms of performance. However, the
benefits of having an HTTP API are that most developers today — particularly Web
developers — are familiar with RESTful interfaces, and most programming languages
have built-in primitives for requesting resources over HTTP, for example, opening a
URL, so no additional software is needed. In this article, we will focus on the HTTP
API.

All the examples will use curl to interact with Riak through its HTTP interface. This is
just to get a better understanding of the underlying API. There are a number of client
libraries available in various languages, and you should consider using one
of those when developing an application that uses Riak as the data store. The client
libraries provide an API to Riak that makes it easy to integrate into your application;
you won't have to write code yourself to handle the kind of responses you will see
when using curl.

The API supports the usual HTTP methods: GET, PUT, POST, DELETE, which will be
used for retrieving, updating, creating and deleting objects respectively. Each one will
be covered in turn.

Storing objects
You can think of Riak as implementing a distributed map from keys (strings) to values
(objects). Riak stores values in buckets. There is no need to explicitly create a bucket
before storing an object in one; if an object is stored in a bucket that doesn't exist, it
will be created automatically for us.

Buckets are a virtual concept in Riak and exist primarily as a means of grouping
related objects. Buckets also have properties, and the values of these properties define
what Riak does with the objects stored in them. Here are some examples of
bucket properties:

   • n_val — The number of times an object should be replicated across the cluster
   • allow_mult — Whether to allow concurrent updates
You can view a bucket's properties (and their current values) by making a GET request
on the bucket itself.
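For example, the following returns the properties of a bucket named foo (the output
shown is abbreviated and illustrative):

$ curl http://localhost:8098/riak/foo
{"props":{"name":"foo","n_val":3,"allow_mult":false, ... }}

Properties are changed by sending a PUT to the bucket with a JSON body. For instance,
this sketch would turn on concurrent updates for the bucket:

$ curl -i -X PUT -H "Content-Type: application/json" \
  -d '{"props":{"allow_mult":true}}' http://localhost:8098/riak/foo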

To store an object, we do an HTTP POST to one of the URLs shown in Listing 3.

Listing 3. Storing an object
POST -> /riak/<bucket> (1)
POST -> /riak/<bucket>/<key> (2)


Keys can either be allocated automatically by Riak (1) or defined by the user (2).

When storing an object with a user-defined key it's also possible to do an HTTP PUT
to (2) to create the object.

The latest version of Riak also supports the following URL format: /buckets/<bucket>/
keys/<key>, but we will use the older format in this article in order to maintain
backwards compatibility with earlier versions of Riak.
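For example, an object stored at /riak/artists/Bruce could equivalently be retrieved
under the newer scheme with:

$ curl http://localhost:8098/buckets/artists/keys/Bruce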

If no key is specified, Riak will automatically allocate a key for the object. For
example, let's store a plain text object in the bucket "foo" without explicitly specifying
a key (see Listing 4).

Listing 4. Storing a plain text object without specifying a key
$ curl -i -H "Content-Type: text/plain" -d "Some text" \
  http://localhost:8098/riak/foo/

HTTP/1.1 201 Created
Vary: Accept-Encoding
Location: /riak/foo/3vbskqUuCdtLZjX5hx2JHKD2FTK
Content-Type: text/plain
Content-Length: ...


By examining the Location header, you can see the key that Riak allocated to the
object. It's not very memorable, so the alternative is to have the user provide a key.
Let's create an artists bucket and add an artist who goes by the name of Bruce (see
Listing 5).

Listing 5. Creating an artists bucket and adding an artist
$ curl -i -d '{"name":"Bruce"}' -H "Content-Type: application/json" \
  http://localhost:8098/riak/artists/Bruce

HTTP/1.1 204 No Content
Vary: Accept-Encoding
Content-Type: application/json
Content-Length: ...






If the object was stored correctly using the key that we specified, we will get a 204 No
Content response from the server.

In this example, we are storing the value of the object as JSON but it could just as
easily have been plain text or some other format. It is important to note that when
storing an object, the Content-Type header must be set correctly. For example, if you
want to store a JPEG image, then you should set the content type to image/jpeg.
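For example, here is a sketch of storing an image; the file name and bucket are made
up for illustration:

$ curl -i -X PUT -H "Content-Type: image/jpeg" \
  --data-binary @cover.jpg http://localhost:8098/riak/images/TheRiverCover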

Retrieving an object
To retrieve a stored object, do a GET on the bucket using the key of the object you
want to retrieve. If the object exists, it will be returned in the body of the response,
otherwise a 404 Object Not Found response will be returned by the server (see
Listing 6).

Listing 6. Performing a GET on the bucket
$ curl http://localhost:8098/riak/artists/Bruce

HTTP/1.1 200 OK
...
{ "name" : "Bruce" }


Updating an object
When updating an object, just like when storing one, the Content-Type header is
required. For example, let's add Bruce's nickname as shown in Listing 7.

Listing 7. Adding Bruce's nickname
$ curl -i -X PUT -d '{"name":"Bruce", "nickname":"The Boss"}' \
  -H "Content-Type: application/json" http://localhost:8098/riak/artists/Bruce


As mentioned earlier, Riak creates buckets automatically. The buckets have
properties. One of those properties, allow_mult, determines whether concurrent
writes are allowed. By default, it is set to false; however, if concurrent updates are
allowed then for each update, the X-Riak-Vclock header should be sent as well. The
value of this header should be set to the value that was seen when the object was
last read by the client.
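Here is a sketch of that read-then-write cycle. The vector clock value shown is
illustrative; use whatever opaque string the server returned on your last read:

# Read the object and note the X-Riak-Vclock response header
$ curl -i http://localhost:8098/riak/artists/Bruce
HTTP/1.1 200 OK
X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RdfYCwA=
...

# Send the same vector clock back with the update
$ curl -i -X PUT -d '{"name":"Bruce", "nickname":"The Boss"}' \
  -H "Content-Type: application/json" \
  -H "X-Riak-Vclock: a85hYGBgzGDKBVIcypz/fgaUHjmTwZTImMfKsMKK7RdfYCwA=" \
  http://localhost:8098/riak/artists/Bruce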

Riak uses vector clocks to determine the causality of modifications to objects. How
vector clocks work is beyond the scope of this article, but suffice it to say that when
concurrent writes are allowed there is a possibility that conflicts may occur so it will
be left up to the application to resolve these conflicts (see Resources).

Removing an object
Removing an object follows a similar pattern to the previous commands: we simply
do an HTTP DELETE to the URL that corresponds to the object we want to delete: $
curl -i -X DELETE http://localhost:8098/riak/artists/Bruce.



If the object was removed successfully we will get a 204 No Content response from
the server; if the object we are trying to delete does not exist, the server responds
with a 404 Object Not Found.

Links
So far, we have seen how to store objects by associating an object with a particular
key so it can be retrieved later on. What would be useful is if we could extend this
simple model to be able to express how (and if) objects are related to each other.
Well, we can, and Riak achieves this via links.

So, what are links? Links allow the user to create relationships between objects. If
you are familiar with UML class diagrams, you can think of a link as an association
between objects with a label describing the relationship; in a relational database, the
relationship would be expressed using a foreign key.

Links are "attached" to objects via the "Link" header. Below is an example of what a
link header looks like. The target of the relationship, that is, the object we are
linking to, is the part between the angle brackets. The relationship type — in this
case "performer" — is expressed by the riaktag property: Link: </riak/artists/
Bruce>; riaktag="performer".

Let's add some albums and associate them with the artist Bruce who performed on
the albums (see Listing 8).

Listing 8. Adding some albums
$ curl -H "Content-Type: text/plain" \
  -H 'Link: </riak/artists/Bruce>; riaktag="performer"' \
  -d "The River" http://localhost:8098/riak/albums/TheRiver

$ curl -H "Content-Type: text/plain" \
  -H 'Link: </riak/artists/Bruce>; riaktag="performer"' \
  -d "Born To Run" http://localhost:8098/riak/albums/BornToRun
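A subsequent GET on one of the albums shows the link coming back in the response
headers (output abbreviated):

$ curl -i http://localhost:8098/riak/albums/TheRiver

HTTP/1.1 200 OK
Link: </riak/artists/Bruce>; riaktag="performer"
...
The River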


Now that we have set up some relationships, it's time to query them via link walking
— link walking is the name given to the process of querying the relationships
between objects. For example, to find the artist who performed the album The River,
you would do this: $ curl -i http://localhost:8098/riak/albums/TheRiver/
artists,performer,1.

The part at the end of the URL is the link specification. The first part (artists)
specifies the bucket that we should restrict the query to. The second part (performer)
specifies the tag we want to use to limit the results, and finally, the 1 indicates
that we do want to include the results from this particular phase of the query.
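The body of a link-walking response is a multipart/mixed document containing each
matched object; in this case, the record we stored earlier for Bruce (output
abbreviated):

HTTP/1.1 200 OK
Content-Type: multipart/mixed; boundary=...
...
{"name":"Bruce", "nickname":"The Boss"}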

It's also possible to issue transitive queries. Let's assume we have set up the
relationships between albums and artists as in Figure 1.





Figure 1. Example relationship between albums and artists




It's now possible to issue queries such as, "Which artists collaborated with the
artist who performed The River," by executing the following: $ curl -i http://
localhost:8098/riak/albums/TheRiver/artists,_,0/artists,collaborator,1. The
underscore in the link specification acts like a wildcard character and indicates that
we don't care what the relationship is.

Running Map/Reduce queries
Map/Reduce is a framework popularized by Google for running distributed
computations in parallel over huge datasets. Riak also supports Map/Reduce, allowing
more powerful queries to be performed on the data stored in the cluster.

A Map/Reduce function consists of both a map phase and a reduce phase. The map
phase is applied to some data and produces zero or more results; this is equivalent in
functional programming terms to mapping a function over each item in a list. The map
phases occur in parallel. The reduce phase then takes all of the results from the map
phases and combines them together.

For example, consider counting the number of occurrences of each word across a large
set of documents. Each map phase would calculate the number of times each word
appears in a particular document. These intermediate totals, once calculated, would
then be sent to the reduce function that would tally the totals and emit the result
for the whole set of documents. See Resources for a link to Google's Map/Reduce
paper.
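To make the example concrete, here is roughly what such a word count could look like
as Riak map and reduce functions, written in the same style as the grep code developed
below. This is a sketch; the WordCount name and its simple tokenizer are our own
choices, not part of Riak:

var WordCount = {
    map: function (v) {
        // Emit one partial tally of word counts per document
        var counts = {}, words = v.values[0].data.match(/\w+/g) || [];
        for (var i = 0; i < words.length; i += 1) {
            var w = words[i].toLowerCase();
            counts[w] = (counts[w] || 0) + 1;
        }
        return [counts];
    },
    reduce: function (values) {
        // Merge the partial tallies into a single set of totals
        var totals = {};
        for (var i = 0; i < values.length; i += 1) {
            for (var w in values[i]) {
                if (values[i].hasOwnProperty(w)) {
                    totals[w] = (totals[w] || 0) + values[i][w];
                }
            }
        }
        return [totals];
    }
};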

Example: Distributed grep
For this article, we are going to develop a Map/Reduce function that will do a
distributed grep over a set of documents stored in Riak. Just like grep, the final output
will be a set of lines that match the supplied pattern. In addition, each result will also
indicate the line number in the document where the match occurred.

To execute a Map/Reduce query we do a POST to the /mapred resource. The body of
the request is a JSON representation of the query; as in previous cases, the Content-
Type header must be present and always be set to application/json. Listing 9 shows
the query that we will execute to do the distributed grep. Each part of the query will
be discussed in turn.

Listing 9. Example Map/Reduce query
{
    "inputs": [["documents","s1"],["documents","s2"]],
    "query": [
      { "map": {
          "language": "javascript",
          "name": "GrepUtils.map",
          "keep": true,
          "arg": "[s|S]herlock" }
      },
      { "reduce": { "language": "javascript", "name": "GrepUtils.reduce" } }
    ]
}


Each query consists of a number of inputs — the set of documents we want to run
the computation on — and the name of a function to run during each of the map and
reduce phases. It is also possible to include the source of the map and reduce
functions directly inline in the query by using the source property instead of name,
but I have not done that here. To use named functions, however, you will need to
make some changes to Riak's default configuration. Save the code in Listing 10
(GrepUtils.js, below) in a directory somewhere. For each node in the cluster, locate
the file etc/app.config, open it, and set the property js_source_dir to the directory
where you saved the code. You will need to restart all the nodes in the cluster for
the changes to take effect.
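The line to add (or edit) in each node's configuration looks something like the
following; the path is whatever directory you chose:

%% In the riak_kv section of etc/app.config
{js_source_dir, "/home/riak/js-functions"},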

The code in Listing 10 contains the functions that will be executed during the map
and reduce phases. The map function looks at each line in the document and checks
to see if it matches the supplied pattern (the arg parameter). The reduce function in
this particular example doesn't do much; it behaves like an identity function and just
returns its input.

Listing 10. GrepUtils.js
var GrepUtils = {
    map: function (v, k, arg) {
        var i, len, lines, r = [], re = new RegExp(arg);
        // Split the document body into individual lines
        lines = v.values[0].data.split(/\r?\n/);
        for (i = 0, len = lines.length; i < len; i += 1) {
            var match = re.exec(lines[i]);
            if (match) {
                // Record the (1-based) line number along with the matching line
                r.push((i + 1) + ". " + lines[i]);
            }
        }
        return r;
    },
    // Identity reduce: pass the collected results straight through
    reduce: function (v) {
        return [v];
    }
};


Before we can run the query, we need some data. I downloaded a couple of Sherlock
Holmes e-books from the Project Gutenberg Web site (see Resources). The first text
is stored in the "documents" bucket under the key "s1"; the second text in the same
bucket with the key "s2".

Listing 11 is an example of how you would load such a document into Riak.

Listing 11. Loading a document into Riak
$ curl -i -X POST http://localhost:8098/riak/documents/s1 \
  -H "Content-Type: text/plain" --data-binary @s1.txt


Once the documents have been loaded, we can then search them. In this case, we
want to output any lines that match the regular expression "[s|S]herlock" (see
Listing 12).

Listing 12. Searching the documents
$ curl -X POST -H "Content-Type: application/json" \
  http://localhost:8098/mapred --data @-<<EOF
{
  "inputs": [["documents","s1"],["documents","s2"]],
  "query": [
    { "map": {
        "language":"javascript",
        "name":"GrepUtils.map",
        "keep":true,
        "arg": "[s|S]herlock" }
    },
    { "reduce": { "language": "javascript", "name": "GrepUtils.reduce" } }
  ]
}
EOF


The arg property in the query contains the pattern that we want to grep for in the
documents; this value is passed in to the map function as the arg parameter.

The output from running the Map/Reduce job over the sample data is in Listing 13.

Listing 13. Sample output from running the Map/Reduce job
[["1. Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan
Doyle","9. Title: The Adventures of Sherlock Holmes","62. To Sherlock Holmes
she is always THE woman. I have seldom heard","819. as I had pictured it from
Sherlock Holmes' succinct description,","1017. \"Good-night, Mister Sherlock
Holmes.\"","1034. \"You have really got it!\" he cried, grasping Sherlock
Holmes by" …]]



Streaming Map/Reduce
To finish off this section on Map/Reduce, we'll take a brief look at Riak's streaming
Map/Reduce feature. It's useful for jobs that have map phases that take a while
to complete, since streaming the results allows you to access the results of each
map phase as soon as they become available, and before the reduce phase has
executed.

We can apply this to good effect to the distributed grep query. The reduce step in
the example doesn't actually do much. In fact, we can get rid of the reduce phase
altogether and just emit the results from each map phase directly to the client. To
achieve this, we need to modify the query by removing the reduce step and adding
?chunked=true to the end of the URL to indicate that we want to stream the results (see
Listing 14).

Listing 14. Modifying the query to stream the results
$ curl -X POST -H "Content-Type: application/json" \
  http://localhost:8098/mapred?chunked=true --data @-<<EOF
{
  "inputs": [["documents","s1"],["documents","s2"]],
  "query": [
        { "map": {
            "language": "javascript",
            "name": "GrepUtils.map",
            "keep": true, "arg": "[s|S]herlock" } }
  ]
}
EOF


The results of each map phase — in this example, lines that match the query string
— will now be returned to the client as each map phase completes. This approach
would be useful for applications that need to process the intermediary results of a
query when they become available.
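Concretely, a streamed response arrives as a multipart document in which each chunk
carries the output of one phase as JSON. A rough sketch of the body for the grep job
(abbreviated; the boundary string and chunk boundaries will vary):

HTTP/1.1 200 OK
Content-Type: multipart/mixed; boundary=...
...
{"phase":0,"data":["62. To Sherlock Holmes she is always THE woman. I have seldom heard"]}
...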

Conclusion
Riak is an open source, highly scalable key-value store based on principles from
Amazon's Dynamo paper. It's easy to deploy and to scale: additional nodes can be
added to the cluster seamlessly. Features such as link walking and support for Map/
Reduce allow for more complex queries. In addition to the HTTP API, there is also
a native Erlang API and support for Protocol Buffers. In Part 2 of this series, we'll
explore a number of client libraries available in various languages and show how
Riak can be used as a highly scalable cache.








Resources
Learn

   • See Basic Cluster Setup and Building a Development Environment for more
     detailed information on setting up a 3-node cluster.
   • Read Google's MapReduce: Simplified Data Processing on Large Clusters.
   • Introduction to programming in Erlang (Martin Brown, developerWorks, May
     2011) explains how Erlang's functional programming style compares with other
     programming paradigms such as imperative, procedural and object-oriented
     programming.
   • Read Amazon's Dynamo paper on which Riak is based. Highly recommended!
   • See the article How To Analyze Apache Logs to learn how you can use Riak to
     process your server logs.
   • Get an explanation of vector clocks and why they are easier to understand than
     you may think.
   • Find a good explanation of vector clocks and more detailed information on link
     walking on the Riak wiki.
   • The Project Gutenberg site is a great source of free texts to use for
     experimenting.
   • Find extensive how-to information, tools, and project updates to help you
     develop with open source technologies and use them with IBM products under
     developerWorks Open source
   • developerWorks Web development specializes in articles covering various web-
     based solutions.
   • To listen to interesting interviews and discussions for software developers,
     check out developerWorks podcasts.
   • Follow developerWorks on Twitter.
   • Watch developerWorks demos that range from product installation and setup for
     beginners to advanced functionality for experienced developers.

Get products and technologies

   • Download Riak from basho.com.
   • Download the Erlang programming language.
   • Innovate your next open source development project using software especially
     for developers; access IBM trial software, available for download or on DVD.

Discuss

   • Connect with other developerWorks users while exploring the developer-driven
     blogs, forums, groups, and wikis. Help build the Real world open source group
     in the developerWorks community.






About the author
Simon Buckle

                Simon Buckle is an independent consultant. His interests include
                distributed systems, algorithms, and concurrency. He has a Masters
                Degree in Computing from Imperial College, London. Check out his
                website at simonbuckle.com.



© Copyright IBM Corporation 2012
(www.ibm.com/legal/copytrade.shtml)
Trademarks
(www.ibm.com/developerworks/ibm/trademarks/)





Más contenido relacionado

La actualidad más candente

Introductive to Hive
Introductive to Hive Introductive to Hive
Introductive to Hive Rupak Roy
 
Always on. 2018-10 Reactive Summit
Always on. 2018-10 Reactive SummitAlways on. 2018-10 Reactive Summit
Always on. 2018-10 Reactive SummitEnno Runne
 
Android Resource Manager
Android Resource ManagerAndroid Resource Manager
Android Resource ManagerSandeep Marathe
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksGoDataDriven
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R PackagesCraig Warman
 
Poster Declaratively Describing Responses of Hypermedia-Driven Web APIs
Poster Declaratively Describing Responses of Hypermedia-Driven Web APIsPoster Declaratively Describing Responses of Hypermedia-Driven Web APIs
Poster Declaratively Describing Responses of Hypermedia-Driven Web APIsRuben Taelman
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Simplilearn
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureVARUN SAXENA
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFramePrashant Gupta
 
Practical Kerberos with Apache HBase
Practical Kerberos with Apache HBasePractical Kerberos with Apache HBase
Practical Kerberos with Apache HBaseJosh Elser
 
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow ManagerBreathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow ManagerArtem Ervits
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...StampedeCon
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtMichael Stack
 

La actualidad más candente (20)

Apache Spark Tutorial
Apache Spark TutorialApache Spark Tutorial
Apache Spark Tutorial
 
Introductive to Hive
Introductive to Hive Introductive to Hive
Introductive to Hive
 
Unit 5-apache hive
Unit 5-apache hiveUnit 5-apache hive
Unit 5-apache hive
 
Always on. 2018-10 Reactive Summit
Always on. 2018-10 Reactive SummitAlways on. 2018-10 Reactive Summit
Always on. 2018-10 Reactive Summit
 
Android Resource Manager
Android Resource ManagerAndroid Resource Manager
Android Resource Manager
 
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, DatabricksSpark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
Spark Meetup Amsterdam - Dealing with Bad Actors in ETL, Databricks
 
Parallelizing Existing R Packages
Parallelizing Existing R PackagesParallelizing Existing R Packages
Parallelizing Existing R Packages
 
Poster Declaratively Describing Responses of Hypermedia-Driven Web APIs
Poster Declaratively Describing Responses of Hypermedia-Driven Web APIsPoster Declaratively Describing Responses of Hypermedia-Driven Web APIs
Poster Declaratively Describing Responses of Hypermedia-Driven Web APIs
 
Scala and spark
Scala and sparkScala and spark
Scala and spark
 
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
Spark SQL Tutorial | Spark SQL Using Scala | Apache Spark Tutorial For Beginn...
 
Application Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and FutureApplication Timeline Server - Past, Present and Future
Application Timeline Server - Past, Present and Future
 
Spark Sql and DataFrame
Spark Sql and DataFrameSpark Sql and DataFrame
Spark Sql and DataFrame
 
Unit 4 lecture-3
Unit 4 lecture-3Unit 4 lecture-3
Unit 4 lecture-3
 
Spark core
Spark coreSpark core
Spark core
 
Practical Kerberos with Apache HBase
Practical Kerberos with Apache HBasePractical Kerberos with Apache HBase
Practical Kerberos with Apache HBase
 
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow ManagerBreathing new life into Apache Oozie with Apache Ambari Workflow Manager
Breathing new life into Apache Oozie with Apache Ambari Workflow Manager
 
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
Choosing an HDFS data storage format- Avro vs. Parquet and more - StampedeCon...
 
Hive: Loading Data
Hive: Loading DataHive: Loading Data
Hive: Loading Data
 
HBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the ArtHBaseConEast2016: HBase and Spark, State of the Art
HBaseConEast2016: HBase and Spark, State of the Art
 
Unit 4 lecture2
Unit 4 lecture2Unit 4 lecture2
Unit 4 lecture2
 

Similar a Os riak1-pdf

Similar a Os riak1-pdf (20)

Rails
RailsRails
Rails
 
Ruby On Rails
Ruby On RailsRuby On Rails
Ruby On Rails
 
Relational Databases to Riak
Relational Databases to RiakRelational Databases to Riak
Relational Databases to Riak
 
Intro to Rack
Intro to RackIntro to Rack
Intro to Rack
 
Ruby on Rails introduction
Ruby on Rails introduction Ruby on Rails introduction
Ruby on Rails introduction
 
Ruby On Rails Siddhesh
Ruby On Rails SiddheshRuby On Rails Siddhesh
Ruby On Rails Siddhesh
 
Riak Intro at Munich Node.js
Riak Intro at Munich Node.jsRiak Intro at Munich Node.js
Riak Intro at Munich Node.js
 
Rail3 intro 29th_sep_surendran
Rail3 intro 29th_sep_surendranRail3 intro 29th_sep_surendran
Rail3 intro 29th_sep_surendran
 
Rails interview questions
Rails interview questionsRails interview questions
Rails interview questions
 
Introduction to Rails - presented by Arman Ortega
Introduction to Rails - presented by Arman OrtegaIntroduction to Rails - presented by Arman Ortega
Introduction to Rails - presented by Arman Ortega
 
Getting Started with Rails
Getting Started with RailsGetting Started with Rails
Getting Started with Rails
 
RoR guide_p1
RoR guide_p1RoR guide_p1
RoR guide_p1
 
Ruby On Rails
Ruby On RailsRuby On Rails
Ruby On Rails
 
Red5workshop 090619073420-phpapp02
Red5workshop 090619073420-phpapp02Red5workshop 090619073420-phpapp02
Red5workshop 090619073420-phpapp02
 
Ruby on rails for beginers
Ruby on rails for beginersRuby on rails for beginers
Ruby on rails for beginers
 
Dev streams2
Dev streams2Dev streams2
Dev streams2
 
Ruby on Rails
Ruby on RailsRuby on Rails
Ruby on Rails
 
rails.html
rails.htmlrails.html
rails.html
 
rails.html
rails.htmlrails.html
rails.html
 
Viridians on Rails
Viridians on RailsViridians on Rails
Viridians on Rails
 

Último

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsSergiu Bodiu
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...Fwdays
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsMemoori
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfAddepto
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Manik S Magar
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piececharlottematthew16
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLScyllaDB
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationRidwan Fadjar
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Mark Simos
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embeddingZilliz
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):comworks
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostZilliz
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubKalema Edgar
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesZilliz
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 3652toLead Limited
 

Último (20)

DevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platformsDevEX - reference for building teams, processes, and platforms
DevEX - reference for building teams, processes, and platforms
 
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks..."LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
"LLMs for Python Engineers: Advanced Data Analysis and Semantic Kernel",Oleks...
 
AI as an Interface for Commercial Buildings
AI as an Interface for Commercial BuildingsAI as an Interface for Commercial Buildings
AI as an Interface for Commercial Buildings
 
Gen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdfGen AI in Business - Global Trends Report 2024.pdf
Gen AI in Business - Global Trends Report 2024.pdf
 
Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!Anypoint Exchange: It’s Not Just a Repo!
Anypoint Exchange: It’s Not Just a Repo!
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
Story boards and shot lists for my a level piece
Story boards and shot lists for my a level pieceStory boards and shot lists for my a level piece
Story boards and shot lists for my a level piece
 
Developer Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQLDeveloper Data Modeling Mistakes: From Postgres to NoSQL
Developer Data Modeling Mistakes: From Postgres to NoSQL
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
My Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 PresentationMy Hashitalk Indonesia April 2024 Presentation
My Hashitalk Indonesia April 2024 Presentation
 
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
Tampa BSides - Chef's Tour of Microsoft Security Adoption Framework (SAF)
 
Training state-of-the-art general text embedding
Training state-of-the-art general text embeddingTraining state-of-the-art general text embedding
Training state-of-the-art general text embedding
 
CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):CloudStudio User manual (basic edition):
CloudStudio User manual (basic edition):
 
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage CostLeverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
Leverage Zilliz Serverless - Up to 50X Saving for Your Vector Storage Cost
 
Unleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding ClubUnleash Your Potential - Namagunga Girls Coding Club
Unleash Your Potential - Namagunga Girls Coding Club
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Vector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector DatabasesVector Databases 101 - An introduction to the world of Vector Databases
Vector Databases 101 - An introduction to the world of Vector Databases
 
Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365Ensuring Technical Readiness For Copilot in Microsoft 365
Ensuring Technical Readiness For Copilot in Microsoft 365
 

Os riak1-pdf

  • 1. Introducing Riak, Part 1: The language- independent HTTP API Store and retrieve data using Riak's HTTP interface Simon Buckle (simon@simonbuckle.com) Skill Level: Intermediate Independent Consultant Freelance Date: 13 Mar 2012 This is Part 1 of a two-part series about Riak, a highly scalable, distributed data store written in Erlang and based on Dynamo, Amazon's high availability key- value store. Learn the basics about Riak and how to store and retrieve items using its HTTP API. Explore how to use its Map/Reduce framework for doing distributed queries, how links allow relationships to be defined between objects, and how to query those relationships using link walking. Introduction Typical modern relational databases perform poorly on certain types of applications and struggle to cope with the performance and scalability demands of today's Internet applications. A different approach is needed. In the last few years, a new type of data store, commonly referred to as NoSQL, has become popular as it directly addresses some of the deficiencies of relational databases. Riak is one such example of this type of data store. Riak is not the only NoSQL data store out there. Two other popular data stores are MongoDB and Cassandra. Although similar in many ways, there are also some significant differences. For example, Riak is a distributed system whereas MongoDB is a single system database — Riak has no concept of a master node, making it more resilient to failure. Though also based on Amazon's description of Dynamo, Cassandra omits features such as vector clocks and consistent hashing for organizing its data. Riak's data model is more flexible. In Riak, buckets are created on the fly when they are first accessed; Cassandra's data model is defined in an XML file so changing it requires having to reboot the entire cluster. Another strength of Riak is it is written in Erlang. MongoDB and Cassandra are written in what can be referred to as general-purpose languages (C++ and © Copyright IBM Corporation 2012 Trademarks Introducing Riak, Part 1: The language-independent Page 1 of 12 HTTP API
  • 2. developerWorks® ibm.com/developerWorks/ Java, respectively), whereas Erlang was designed from the ground up to support distributed, fault-tolerant applications, and as such is more suited to developing applications such as NoSQL data stores that share some characteristics with the applications that Erlang was originally created for. Map/Reduce jobs can only be written in either Erlang or JavaScript. For this article, we have chosen to write the map and reduce functions in JavaScript, but it is also possible to write them in Erlang. While Erlang code may be slightly quicker to execute, we have chosen JavaScript code because of its accessibility to a larger audience. See Resources for links to learn more about Erlang. Getting started If you want to try out some of the examples in this article, you need to install Riak (see Resources) and Erlang on your system. You also need to build a cluster containing three nodes running on your local machine. All data stored in Riak are replicated to a number of nodes in the cluster. A property (n_val) on the bucket the data is stored in determines the number of nodes to replicate. The default value of this property is three, therefore, we need to create a cluster with at least three nodes (after which you can create as many as you like) in order for it to be effective. After you download the source code, you need to build it. The basic steps are as follows: 1. Unpack the source: $ tar xzvf riak-1.0.1.tar.gz 2. Change directory: $ cd riak-1.0.1 3. Build: $ make all rel This will build Riak (./rel/riak). To run multiple nodes locally you need to make copies of ./rel/riak — one copy for each additional node. Copy ./rel/riak to ./rel/riak2, ./rel/ riak3 and so on, then make the following changes to each copy: • In riakN/etc/app.config change the following values: the port specified in the http{} section, handoff_port, and pb_port, to something unique • Open up riakN/etc/vm.args and change the name, again to something unique, for example, -name riak2@127.0.0.1 Now start each node in turn, as shown in Listing 1. Listing 1. Listing 1. Starting each node $ cd rel $ ./riak/bin/riak start $ ./riak2/bin/riak start $ ./riak3/bin/riak start Finally, join the nodes together to make a cluster, as shown in Listing 2. Introducing Riak, Part 1: The language-independent Page 2 of 12 HTTP API
  • 3. ibm.com/developerWorks/ developerWorks® Listing 2. Listing 2. Making a cluster $ ./riak2/bin/riak-admin join riak@127.0.0.1 $ ./riak3/bin/riak-admin join riak@127.0.0.1 You should now have a 3-node cluster running locally. To test it, run the following command: $ ./riak/bin/riak-admin status | grep ring_members. You should see each node that is part of the cluster you just created, for example, ring_members : ['riak2@127.0.0.1','riak3@127.0.0.1','riak@127.0.0.1']. The Riak API There are currently three ways of accessing Riak: an HTTP API (RESTful interface), Protocol Buffers, and a native Erlang interface. Having more than one interface gives you the benefit of being able to choose how to integrate your application. If you have an application written in Erlang then it would make sense to use the native Erlang interface so you have tight integration between the two. There are also other factors, such as performance, that may play a part in deciding which interface to use. For example, a client that uses the Protocol Buffers interface will perform better than one that interacts with the HTTP API; less data is communicated and parsing all those HTTP headers can be (relatively) costly in terms of performance. However, the benefits of having an HTTP API are that most developers today — particularly Web developers — are familiar with RESTful interfaces plus most programming languages have built-in primitives for requesting resources over HTTP, for example, opening a URL, so no additional software is needed. In this article, we will focus on the HTTP API. All the examples will use curl to interact with Riak through its HTTP interface. This is just to get a better understanding of the underlying API. There are a number of client libraries available in various different languages and you should consider using one of those when developing an application that uses Riak as the data store. The client libraries provide an API to Riak that makes it easy to integrate into your application; you won't have to write code yourself to handle the kind of responses you will see when using curl. The API supports the usual HTTP methods: GET, PUT, POST, DELETE, which will be used for retrieving, updating, creating and deleting objects respectively. Each one will be covered in turn. Storing objects You can think of Riak as implementing a distributed map from keys (strings) to values (objects). Riak stores values in buckets. There is no need to explicitly create a bucket before storing an object in one; if an object is stored in a bucket that doesn't exist, it will be created automatically for us. Buckets are a virtual concept in Riak and exist primarily as a means of grouping related objects. Buckets also have properties and the value of these properties define Introducing Riak, Part 1: The language-independent Page 3 of 12 HTTP API
  • 4. developerWorks® ibm.com/developerWorks/ what Riak does with the objects that are stored in it. Here are some examples of bucket properties: • n_val — The number of times an object should be replicated across the cluster • allow_mult — Whether to allow concurrent updates You can view a bucket's properties (and their current values) by making a GET request on the bucket itself. To store an object, we do an HTTP POST to one of the URLs shown in Listing 3. Listing 3. Listing 3. Storing an object POST -> /riak/<bucket> (1) POST -> /riak/<bucket>/<key> (2) Keys can either be allocated automatically by Riak (1) or defined by the user (2). When storing an object with a user-defined key it's also possible to do an HTTP PUT to (2) to create the object. The latest version of Riak also supports the following URL format: /buckets/<bucket>/ keys/<key>, but we will use the older format in this article in order to maintain backwards compatibility with earlier versions of Riak. If no key is specified, Riak will automatically allocate a key for the object. For example, let's store a plain text object in the bucket "foo" without explicitly specifying a key (see Listing 4). Listing 4. Listing 4. Storing a plain text object without specifying a key $ curl -i -H "Content-Type: plain/text" -d "Some text" http://localhost:8098/riak/foo/ HTTP/1.1 201 Created Vary: Accept-Encoding Location: /riak/foo/3vbskqUuCdtLZjX5hx2JHKD2FTK Content-Type: plain/text Content-Length: ... By examining the Location header, you can see the key that Riak allocated to the object. It's not very memorable, so the alternative is to have the user provide a key. Let's create an artists bucket and add an artist who goes by the name of Bruce (see Listing 5). Listing 5. Listing 5. Creating an artists bucket and adding an artist $ curl -i -d '{"name":"Bruce"}' -H "Content-Type: application/json" http://localhost:8098/riak/artists/Bruce HTTP/1.1 204 No Content Vary: Accept-Encoding Content-Type: application/json Content-Length: ... Introducing Riak, Part 1: The language-independent Page 4 of 12 HTTP API
  • 5. ibm.com/developerWorks/ developerWorks® If the object was stored correctly using the key that we specified, we will get a 204 No Content response from the server. In this example, we are storing the value of the object as JSON but it could just as easily have been plain text or some other format. It is important to note that when storing an object that the Content-Type header is set correctly. For example, if you want to store a JPEG image, then you should set the content type to image/jpeg. Retrieving an object To retrieve a stored object, do a GET on the bucket using the key of the object you want to retrieve. If the object exists, it will be returned in the body of the response, otherwise a 404 Object Not Found response will be returned by the server (see Listing 6). Listing 6. Listing 6. Performing a GET on the bucket $ curl http://localhost:8098/riak/artists/Bruce HTTP/1.1 200 OK ... { "name" : "Bruce" } Updating an object When updating an object, just like when storing one, the Content-Type header is required. For example, let's add Bruce's nickname as shown in Listing 7. Listing 7. Listing 7. Adding Bruce's nickname $ curl -i -X PUT -d '{"name":"Bruce", "nickname":"The Boss"}' -H "Content-Type: application/json" http://localhost:8098/riak/artists/Bruce As mentioned earlier, Riak creates buckets automatically. The buckets have properties. One of those properties, allow_mult, determines whether concurrent writes are allowed. By default, it is set to false; however, if concurrent updates are allowed then for each update, the X-Riak-Vclock header should be sent as well. The value of this header should be set to the value that was seen when the object was last read by the client. Riak uses vector clocks to determine the causality of modifications to objects. How vector clocks work is beyond the scope of this article but suffice to say that when concurrent writes are allowed there is a possibility that conflicts may occur so it will be left up to the application to resolve these conflicts (see Resources). Removing an object Removing an object follows a similar pattern to the previous commands: we simply do an HTTP DELETE to the URL that corresponds to the object we want to delete: $ curl -i -X DELETE http://localhost:8098/riak/artists/Bruce. Introducing Riak, Part 1: The language-independent Page 5 of 12 HTTP API
  • 6. developerWorks® ibm.com/developerWorks/ If the object was removed successfully we will get a 204 No Content response from the server; if the object we are trying to delete does not exist, the server responds with a 404 Object Not Found. Links So far, we have seen how to store objects by associating an object with a particular key so it can be retrieved later on. What would be useful is if we could extend this simple model to be able to express how (and if) objects are related to each other. Well we can and Riak achieves this via links. So, what are links? Links allow the user to create relationships between objects. If you are familiar with UML class diagrams, you can think of a link as an association between objects with a label describing the relationship; in a relational database, the relationship would be expressed using a foreign key. Links are "attached" to objects via the "Link" header. Below is an example of what a link header looks like. The target of the relationship, for example, the object we are linking to, is the thing between the angled brackets. The relationship type — in this case "performer" — is expressed by the riaktag property: Link: </riak/artists/ Bruce>; riaktag="performer". Let's add some albums and associate them with the artist Bruce who performed on the albums (see Listing 8). Listing 8. Listing 8. Adding some albums $ curl -H "Content-Type: text/plain" -H 'Link: </riak/artists/Bruce> riaktag="performer"' -d "The River" http://localhost:8098/riak/albums/TheRiver $ curl -H "Content-Type: text/plain" -H 'Link: </riak/artists/Bruce> riaktag="performer"' -d "Born To Run" http://localhost:8098/riak/albums/BornToRun Now that we have set-up some relationships, it's time to query them via link walking — link walking is the name given to the process of querying the relationships between objects. For example, to find the artist who performed the album The River, you would do this: $ curl -i http://localhost:8098/riak/albums/TheRiver/ artists,performer,1. The bit at the end is the link specification. This is what a link query looks like. The first part (artists) specifies the bucket that we should restrict the query to. The second part (performer) specifies the tag we want to use to limit the results, and finally, the 1 indicates that we do want to include the results from this particular phase of the query. It's also possible to issue transitive queries. Let's assume we have set-up the relationships between albums and artists as in Figure 1. Introducing Riak, Part 1: The language-independent Page 6 of 12 HTTP API
It's also possible to issue transitive queries. Let's assume we have set up the relationships between albums and artists as in Figure 1.

Figure 1. Example relationship between albums and artists

It's now possible to issue queries such as, "Which artists collaborated with the artist who performed The River?", by executing the following:

$ curl -i http://localhost:8098/riak/albums/TheRiver/artists,_,0/artists,collaborator,1

The underscore in the link specification acts like a wildcard character and indicates that we don't care what the relationship is. A variation on this query appears below.
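Recall that the final element of each phase in the link specification controls whether that phase's results are included in the response. Applying that rule, a small sketch of a variation sets the first phase's flag to 1 so the performer is returned alongside the collaborators:

$ curl -i http://localhost:8098/riak/albums/TheRiver/artists,_,1/artists,collaborator,1

Each kept phase contributes its own set of results, which is handy when the intermediate objects are themselves of interest.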
Running Map/Reduce queries

Map/Reduce is a framework popularized by Google for running distributed computations in parallel over huge datasets. Riak also supports Map/Reduce, allowing more powerful queries to be run over the data stored in the cluster. A Map/Reduce job consists of both a map phase and a reduce phase. The map phase is applied to some data and produces zero or more results; in functional programming terms, this is equivalent to mapping a function over each item in a list. The map phases occur in parallel. The reduce phase then takes all of the results from the map phases and combines them together.

For example, consider counting the number of occurrences of each word across a large set of documents. Each map phase would calculate the number of times each word appears in a particular document. These intermediate totals, once calculated, would then be sent to the reduce function, which would tally the totals and emit the result for the whole set of documents. See Resources for a link to Google's Map/Reduce paper.

Example: Distributed grep

For this article, we are going to develop a Map/Reduce function that does a distributed grep over a set of documents stored in Riak. Just like grep, the final output will be the set of lines that match the supplied pattern. In addition, each result will indicate the line number in the document where the match occurred.

To execute a Map/Reduce query, we do a POST to the /mapred resource. The body of the request is a JSON representation of the query; as in previous cases, the Content-Type header must be present and must always be set to application/json. Listing 9 shows the query that we will execute to do the distributed grep. Each part of the query will be discussed in turn.

Listing 9. Example Map/Reduce query

{
    "inputs": [["documents","s1"],["documents","s2"]],
    "query": [
        { "map": { "language": "javascript", "name": "GrepUtils.map",
                   "keep": true, "arg": "[s|S]herlock" } },
        { "reduce": { "language": "javascript", "name": "GrepUtils.reduce" } }
    ]
}

Each query consists of a number of inputs, that is, the set of documents we want to do some computation on, and the name of a function to run during both the map and reduce phases. It is also possible to include the source of both the map and reduce functions directly inline in the query by using the source property instead of name, but I have not done that here (see the sketch after Listing 10). In order to use named functions, you will need to make some changes to Riak's default configuration. Save the code in Listing 10 in a file called GrepUtils.js in a directory somewhere. For each node in the cluster, locate the file etc/app.config, open it, and set the property js_source_dir to the directory where you saved the code. You will need to restart all the nodes in the cluster for the changes to take effect.

The code in Listing 10 contains the functions that will be executed during the map and reduce phases. The map function looks at each line in the document and checks whether it matches the supplied pattern (the arg parameter). The reduce function in this particular example doesn't do much; it behaves like an identity function and just returns its input.

Listing 10. GrepUtils.js

var GrepUtils = {
    map: function (v, k, arg) {
        var i, len, lines, r = [], re = new RegExp(arg);
        lines = v.values[0].data.split(/\r?\n/);
        for (i = 0, len = lines.length; i < len; i += 1) {
            var match = re.exec(lines[i]);
            if (match) {
                r.push((i + 1) + ". " + lines[i]);
            }
        }
        return r;
    },
    reduce: function (v) { return [v]; }
};
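If you would rather not modify app.config and restart the nodes, the source property mentioned above lets you embed the functions directly in the query. The following is a minimal sketch and not one of the article's examples: the trivial inline map function just returns each document's size in characters (reusing the v.values[0].data accessor from Listing 10), and the inline reduce is the same identity function as GrepUtils.reduce:

$ curl -X POST -H "Content-Type: application/json" \
  http://localhost:8098/mapred --data @-<<EOF
{
    "inputs": [["documents","s1"],["documents","s2"]],
    "query": [
        { "map": { "language": "javascript",
                   "source": "function (v) { return [v.values[0].data.length]; }",
                   "keep": true } },
        { "reduce": { "language": "javascript",
                      "source": "function (v) { return [v]; }" } }
    ]
}
EOF

Inline functions need no configuration changes, which makes them convenient for experimenting before promoting code to js_source_dir.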
Before we can run the query, we need some data. I downloaded a couple of Sherlock Holmes e-books from the Project Gutenberg Web site (see Resources). The first text is stored in the "documents" bucket under the key "s1"; the second text is in the same bucket under the key "s2". Listing 11 is an example of how you would load such a document into Riak.

Listing 11. Loading a document into Riak

$ curl -i -X POST http://localhost:8098/riak/documents/s1 \
  -H "Content-Type: text/plain" --data-binary @s1.txt

Once the documents have been loaded, we can search them. In this case, we want to output any lines that match the regular expression "[s|S]herlock" (see Listing 12).

Listing 12. Searching the documents

$ curl -X POST -H "Content-Type: application/json" \
  http://localhost:8098/mapred --data @-<<EOF
{
    "inputs": [["documents","s1"],["documents","s2"]],
    "query": [
        { "map": { "language": "javascript", "name": "GrepUtils.map",
                   "keep": true, "arg": "[s|S]herlock" } },
        { "reduce": { "language": "javascript", "name": "GrepUtils.reduce" } }
    ]
}
EOF

The arg property in the query contains the pattern that we want to grep for in the documents; this value is passed to the map function as the arg parameter. The output from running the Map/Reduce job over the sample data is in Listing 13.

Listing 13. Sample output from running the Map/Reduce job

[["1. Project Gutenberg's The Adventures of Sherlock Holmes, by Arthur Conan Doyle",
  "9. Title: The Adventures of Sherlock Holmes",
  "62. To Sherlock Holmes she is always THE woman. I have seldom heard",
  "819. as I had pictured it from Sherlock Holmes' succinct description,",
  "1017. \"Good-night, Mister Sherlock Holmes.\"",
  "1034. \"You have really got it!\" he cried, grasping Sherlock Holmes by"
  …]]
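Since the identity reduce contributes so little, it is a natural place to experiment. The hedged variation below is not part of GrepUtils: it keeps the named map function but swaps in an inline reduce that merges and sorts the matched lines from both documents, and it omits "keep" from the map phase so that only the reduce output is returned:

$ curl -X POST -H "Content-Type: application/json" \
  http://localhost:8098/mapred --data @-<<EOF
{
    "inputs": [["documents","s1"],["documents","s2"]],
    "query": [
        { "map": { "language": "javascript", "name": "GrepUtils.map",
                   "arg": "[s|S]herlock" } },
        { "reduce": { "language": "javascript",
                      "source": "function (v) { return v.sort(); }" } }
    ]
}
EOF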
Streaming Map/Reduce

To finish off this section on Map/Reduce, we'll take a brief look at Riak's streaming Map/Reduce feature. It's useful for jobs whose map phases take a while to complete, since streaming lets you access the results of each map phase as soon as they become available, before the reduce phase has executed. We can apply this to good effect in the distributed grep query. The reduce step in the example doesn't actually do much. In fact, we can get rid of the reduce phase altogether and just emit the results from each map phase directly to the client. To achieve this, we modify the query by removing the reduce step and adding ?chunked=true to the end of the URL to indicate that we want to stream the results (see Listing 14).

Listing 14. Modifying the query to stream the results

$ curl -X POST -H "Content-Type: application/json" \
  http://localhost:8098/mapred?chunked=true --data @-<<EOF
{
    "inputs": [["documents","s1"],["documents","s2"]],
    "query": [
        { "map": { "language": "javascript", "name": "GrepUtils.map",
                   "keep": true, "arg": "[s|S]herlock" } }
    ]
}
EOF

The results of each map phase (in this example, lines that match the query string) will now be returned to the client as each map phase completes. This approach is useful for applications that need to process the intermediate results of a query as they become available.

Conclusion

Riak is an open source, highly scalable key-value store based on principles from Amazon's Dynamo paper. It is easy to deploy and to scale: additional nodes can be added to the cluster seamlessly. Features such as link walking and support for Map/Reduce allow for more complex queries. In addition to the HTTP API, there is also a native Erlang API and support for Protocol Buffers. In Part 2 of this series, we'll explore a number of client libraries available in various languages and show how Riak can be used as a highly scalable cache.
Resources

Learn

• See Basic Cluster Setup and Building a Development Environment for more detailed information on setting up a three-node cluster.
• Read Google's MapReduce: Simplified Data Processing on Large Clusters.
• Introduction to programming in Erlang (Martin Brown, developerWorks, May 2011) explains how Erlang's functional programming style compares with other programming paradigms such as imperative, procedural, and object-oriented programming.
• Read Amazon's Dynamo paper, on which Riak is based. Highly recommended!
• See the article How To Analyze Apache Logs to learn how you can use Riak to process your server logs.
• Get an explanation of vector clocks and why they are easier to understand than you may think.
• Find a good explanation of vector clocks and more detailed information on link walking on the Riak wiki.
• The Project Gutenberg site is a great resource if you need some text for experimenting.
• Find extensive how-to information, tools, and project updates to help you develop with open source technologies and use them with IBM products in the developerWorks Open source zone.
• developerWorks Web development specializes in articles covering various web-based solutions.
• To listen to interesting interviews and discussions for software developers, check out developerWorks podcasts.
• Follow developerWorks on Twitter.
• Watch developerWorks demos that range from product installation and setup for beginners to advanced functionality for experienced developers.

Get products and technologies

• Download Riak from basho.com.
• Download the Erlang programming language.
• Innovate your next open source development project using software especially for developers; access IBM trial software, available for download or on DVD.

Discuss

• Connect with other developerWorks users while exploring the developer-driven blogs, forums, groups, and wikis. Help build the Real world open source group in the developerWorks community.
About the author

Simon Buckle

Simon Buckle is an independent consultant. His interests include distributed systems, algorithms, and concurrency. He has a Masters degree in Computing from Imperial College, London. Check out his website at simonbuckle.com.

© Copyright IBM Corporation 2012 (www.ibm.com/legal/copytrade.shtml)
Trademarks (www.ibm.com/developerworks/ibm/trademarks/)