6. Improve Retrieval in Riak
Riak is awesome at storing data
(not so easy to query it).
MapReduce can be resource intensive
(and you have to write code).
2i is limited and has resource issues.
Goal: efficiently query nontrivial amounts of data.
7. Riak Search
Lessons Learned
Pretends to be Lucene/Solr (but it ain’t).
Lack of analyzer/language/feature support.
Poor performance on certain queries.
Poor anti-entropy (indexes can corrupt).
Basho isn’t a “search company”.
8. Solr is Better
Excellent analyzer/language support
Features: ranking, faceting, highlighting, geo, etc.
Built upon Lucene
Actively developed, built by search innovators
9. By Our Powers Combined
Riak: HA, distributed, scale out/in
Solr: efficient index, features people want, known entity
Solr scales with Riak
Make Riak searchable with Solr
12. Integration
Solr bundled with Riak, turn-key, zero config to start
Supervise Solr process, start/stop/restart
Presents the real Solr query interface
Use Solr clients to query Riak
13. Intermediary
Erlang processes.
Translates KV data into Solr docs.
Turns Solr queries into distributed Solr queries.
Communicates with KV to verify object/index convergence.
16. EC2 AMI
ami-8d9c20e4
• based on ami-6df93504, x86_64, Amazon Linux, instance storage
• Yokozuna ready to go, ~ec2-user/riak/rel
• modify node name in vm.args _before_ you start the node
• set ulimit -n
• open up port 8098
18. Riak as Normal
Key Value
Diagrams: http://wiki.basho.com/attachments/riak-ring.png and http://s3.amazonaws.com/wernervogels/public/sosp/sosp-figure3-small.png
19. One Solr Instance per Node
Riak node runs one Jetty/Solr process; Yokozuna starts/monitors the Solr proc
KV hook sends index msgs; queries go through Yokozuna
One Yokozuna index = one Solr core (e.g. an Index “people”)
1 bucket : 1 index : 1 core today; goal of M:1, bucket:index
20. All Partitions, One Solr
Riak adds special Solr fields to each doc:
id        ryan_7   (riak key + partition)
_yz_pn    7        (owning partition)
_yz_fpn   7        (first partition of the preflist)
_yz_node  dev1     (owning node)
_yz_rk    ryan     (riak key)
value_t   “...”    (extracted Value)
KV partitions 1, 4, and 7 all share the node’s one Solr index.
21. Extraction on Media Type
riak object metadata: content-type “text/xml”, plus Key and Value
yz_xml_extractor(Value) yields:
<doc>
  <person_name_s>Ryan Zezeski</person_name_s>
  <person_bio_t>...</person_bio_t>
  ...
</doc>
22. Anti-Entropy
index kept in sync on: put, obj modified, read repair, handoff
* Thanks to Ryan Zezeski, who’s an absolute rockstar, and whose work generally raises the level of Basho as an organization. * And before anyone asks, Yokozuna means “horizontal rope” in Japanese. It’s the top rank in sumo, usually translated as “Grand Champion”. Not named after the WWF wrestler.
* Although I’m probably better known as this guy, and as of yesterday, this guy
* if you want to follow latest then you can follow the getting started guide for building from source * requires special branches of riak, riak_core, and riak_kv * requires Erlang; I’ve been building against R15B01 * will pull the latest Solr 4.0 alpha and build it, so requires javac, ant, and ivy
* riak is good at storing data, not so great at retrieving it beyond primary key * map/reduce can work but is often too general and can be resource hungry * 2i is great for simple tagging, but can’t do much beyond that, has issues around large result sets * want to query data in sophisticated ways but in an efficient manner
* looks like solr but not exact semantics, different performance, lacking features * conjunction queries containing sub-queries w/ large results can hurt * basho is not in business of innovating search, stand on shoulders of giants
* solr has good language and analyzer support * has features people want that would be expensive to develop in riak search * rests upon Lucene, a well-known and tried solution * sees active development; 4.0 includes distributed search, better compaction algorithms, and “near real time” indexing, to name a few
* riak is highly available (via replication), distributed by default, and scales out/in well (or at least better than most) * Solr has efficient indexing, has the features people ask for, and is a known entity with a large community around it * the fundamental idea behind yokozuna is to use the strengths of both to complement each other * use riak to make solr HA and scalable * use solr to make data discovery in riak better
* In a nutshell, Yokozuna has this behavior
* you write KV data to riak like normal, yokozuna writes a doc to solr * query it like a single solr instance, get solr response verbatim
* yokozuna is tight integration between solr and riak, focus on getting started easily * yokozuna controls external solr instance, will restart on crash * provide the canonical solr query interface * use existing solr clients to query riak
* yokozuna acts as intermediary between KV and Solr * converts KV data to solr docs * translates single instance solr queries to dist solr queries * constantly communicates with KV to check convergence of data
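A minimal sketch of that KV-to-Solr translation, assuming a plain-dict document (the _yz_* field names come from slide 20; the function shape and signature are illustrative, not Yokozuna's actual API):

```python
# Sketch: translate a Riak KV object into the field/value pairs Yokozuna
# would hand to Solr. The _yz_* field names come from slide 20; the
# function itself is illustrative, not Yokozuna's actual API.

def kv_to_solr_doc(key, value, partition, node):
    return {
        "id": f"{key}_{partition}",  # riak key + partition, so replicas
                                     # on one node don't squash each other
        "_yz_rk": key,               # the riak key
        "_yz_pn": partition,         # owning partition (used by coverage)
        "_yz_node": node,            # owning node
        "value_t": value,            # value as extracted (text extractor)
    }

doc = kv_to_solr_doc("ryan", "...", 7, "dev1")
```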
* In the interest of dogfooding, Our new docs search is powered by yokozuna, using an AMI
* even with a guide building riak from source can be a pain, I created an AMI to get started quickly using ec2 * based on amazon linux, x86_64, instance storage * you’ll find a source build under ~ec2-user/riak/rel * after starting an instance need to modify node name in vm.args _before_ starting (cause it gets written to ring) * probably want to set ulimit, riak loves file descriptors * change ec2 security profiles to let traffic thru 8098
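The setup steps above, as a hedged sketch (the vm.args and bin paths assume the standard rel layout under ~ec2-user/riak/rel, and the node name is a placeholder):

```shell
# set the node name BEFORE the first start -- it gets written into the ring
sed -i 's/^-name .*/-name riak@ec2-NN-NN-NN-NN.compute-1.amazonaws.com/' \
  ~/riak/rel/riak/etc/vm.args

# riak loves file descriptors
ulimit -n 65536

# open port 8098 in the instance's EC2 security group, then:
~/riak/rel/riak/bin/riak start
```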
* start riak as normal, you’ll see a new beam and java process start
* attach to riak so you can create an index and add the hook
* if you want you can check the bucket props to make sure the hook is there
* write some data to riak like you normally would * the content-type will be used to extract field/value pairs * in this case the text extractor will create a field ‘text’ with the string passed as the value
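The extractor dispatch described above can be sketched as a small registry (a toy in Python; Yokozuna's extractors are Erlang modules, and the names and shapes here are illustrative):

```python
# Sketch: extractor dispatch on media type, as the notes describe.
# A toy registry in Python; Yokozuna's extractors are Erlang modules.

EXTRACTORS = {}

def register_extractor(content_type, fun):
    EXTRACTORS[content_type] = fun

def extract(content_type, value):
    return EXTRACTORS[content_type](value)

# default behavior from the notes: text/plain yields a single 'text' field
register_extractor("text/plain", lambda v: {"text": v})

fields = extract("text/plain", "hello riak")
```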
* query the data, notice ‘wt’ is a canonical solr param, no special yokozuna fields
* the key ryan was matched, as expected * that crazy string at the top is the filter query, this is the magic sauce
* okay, that’s great and all, but how about something more sophisticated * in solr you can do something called “highlighting” which shows your matches in the context of their surrounding content * this is the same thing you get when querying google, it shows how your query matched * notice the use of many canonical solr params here, it’s all passed through verbatim, yokozuna simply sends it to the right shards
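That pass-through behavior means a query is just a Solr URL. A sketch of building one (q, wt, hl, and hl.fl are canonical Solr params; the host, port, and /search/<index> path are assumptions, not a documented endpoint):

```python
from urllib.parse import urlencode

# Sketch: a Yokozuna search is an ordinary Solr query, passed through
# verbatim. The URL path here is an assumption for illustration.
params = {
    "q": "text:riak",   # the query itself
    "wt": "json",       # response writer
    "hl": "true",       # enable highlighting
    "hl.fl": "text",    # field(s) to highlight
}
url = "http://localhost:8098/search/people?" + urlencode(params)
```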
* okay, the correct key matched again, but there is no highlighting * what gives?
* by default the text extractor stores the value under the field ‘text’ * this field isn’t stored, and highlighting only works with stored fields * there is a dynamic field in the default schema *_t that does store the data, so let’s modify the extractor to use that field * keep in mind storing fields isn’t free and shouldn’t be used willy nilly, but I want to show a more complex example than a simple query
* here we re-register the extractor for the ‘text/plain’ content type so that the data is stored under the field ‘value_t’ * this will match the *_t dynamic field because the suffix matches
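The suffix match at work here, sketched (a simplification: real Solr also allows the glob at the start of a dynamic-field pattern):

```python
# Sketch: Solr dynamic-field matching by suffix, which is why 'value_t'
# lands in the '*_t' dynamic field of the default schema.

def matches_dynamic(field, pattern):
    if pattern.startswith("*"):
        return field.endswith(pattern[1:])
    return field == pattern
```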
* here I delete the key and then rewrite it * the new extractor def will be used this time
* here I run the query again * notice I had to change the field name in the ‘hl.fl’ and ‘q’ params
* and voila, now highlighting works * but seriously, I heard there is Natty Boh here
* you start with Riak as normal * key-value based storage * distributed via consistent hashing * sibling detection with vector clocks * etc
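A toy version of the consistent hashing just mentioned (real Riak hashes the bucket/key with SHA-1 onto a 160-bit ring cut into equal partitions; the modulo here is a simplification):

```python
import hashlib

# Toy consistent hashing in the spirit of Riak's ring.

RING_SIZE = 64  # number of partitions, a power of two as in Riak

def partition_for(bucket, key):
    h = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(h, "big") % RING_SIZE

def preflist(bucket, key, n_val=3):
    """The n_val partitions holding replicas, walking the ring."""
    p = partition_for(bucket, key)
    return [(p + i) % RING_SIZE for i in range(n_val)]
```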
* one solr instance per node * yokozuna uses erlang supervision to monitor/restart the external solr process * a KV hook causes documents to be created/written to solr * queries are converted by yokozuna to distributed solr queries * yokozuna has notion of index, it’s a synonym for solr core, every index is realized on disk via a solr core * currently it is a one-to-one mapping between bucket-index, plan is to allow many buckets to write to one index, e.g. storing mailing list/wiki/commit logs under different buckets but want to search under one index
* since it’s one solr instance it means all partitions live on same index * you can’t just query on value alone b/c multiple replicas could live on same node * there could be fallback data, onwership xfer causes multiple copies of same replica temporarily * id field is riak key + logical partition, otherwise multiple replicas on same node would squash each other * _yz_pn is the owning partition for this replica, needed for coverage query since only a covering subset of partitions is searched * _yz_node is needed during ownership since two nodes will have same partition copy temporarily, if both shards selected you get overlap * _yz_fpn is the first partition in the preflist, since coverage queries can overlap on partitions (i.e. 2 neighbors are selected) need to pick particular pl * there is also a field for entropy data used to calculate hashtrees * the rest of the fields are extracted from the KV value * the special fields are used as a solr filter query which is an efficient way to limit the results w/o affecting score * this is a little confusing, the key take away is this is done to prevent overlap in solr which can mess up the results
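The filter-query "magic sauce" from those notes, sketched (field names are the slides' _yz_* fields; the exact query text Yokozuna generates may differ):

```python
# Sketch: build the Solr filter query from the special fields so that
# overlapping replicas and fallbacks on a node are excluded from results.

def filter_query(partitions, node, first_partition=None):
    pn = " OR ".join(f"_yz_pn:{p}" for p in partitions)
    fq = f"_yz_node:{node} AND ({pn})"
    if first_partition is not None:
        # disambiguate when neighboring coverage partitions overlap
        fq += f" AND _yz_fpn:{first_partition}"
    return fq
```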
* data is opaque to Riak * Solr wants fields/values to index * extractors map a media type to a set of field/value pairs * can register new extractors
* any time the object is modified the index is updated * handoff comes for free, no messy iteration of Solr data, trade CPU for network/complexity * queries don’t have to serialize on VNodes
* there is a hashtree for every partition in KV and Solr, each storing hash of KV object * trees are updated as data comes in, periodically rebuilt from scratch in case data/tree get out of sync * entropy mgr continuously iterates partitions and performs exchanges * if divergence is detected the object is read-repaired and re-indexed * if one replica is found to be divergent all of them get fixed
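A flat-dict version of the exchange described above (real Yokozuna keeps a Merkle hashtree per partition so exchanges only walk divergent branches; this just shows the divergence check):

```python
import hashlib

# Sketch: the divergence check at the heart of an anti-entropy exchange.

def obj_hash(value):
    return hashlib.sha1(value.encode()).hexdigest()

def divergent_keys(kv_hashes, index_hashes):
    """Keys whose KV object and indexed doc disagree or are missing."""
    keys = set(kv_hashes) | set(index_hashes)
    return sorted(k for k in keys
                  if kv_hashes.get(k) != index_hashes.get(k))

kv = {"ryan": obj_hash("v2"), "amy": obj_hash("v1")}
idx = {"ryan": obj_hash("v1"), "amy": obj_hash("v1")}
# "ryan" diverges -> read-repair the object and re-index it
```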
* query comes into riak as canonical solr query * yokozuna builds a coverage plan: covering set of nodes/shards, filter query on PN/Node to eliminate overlap * Solr does the actual coordination/merging of the results, yokozuna just tells it what to do
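The coverage idea in miniature (a toy planner; Riak's actual planner also balances shards across nodes and handles rings whose size isn't a multiple of n_val):

```python
# Sketch: with n_val replicas, querying every n_val-th partition still
# sees every preference list at least once.

RING_SIZE = 64
N_VAL = 3

def coverage_partitions(ring_size=RING_SIZE, n_val=N_VAL):
    return list(range(0, ring_size, n_val))

parts = coverage_partitions()
# the selected partitions overlap slightly at the ring's wrap-around,
# which is exactly what the _yz_pn/_yz_fpn filter query cleans up
```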