6. Improve Retrieval in Riak
Riak is awesome at storing data
(not so easy to query it).
MapReduce can be resource intensive
(and you have to write code).
2i is limited and has resource issues.
Goal: efficiently query nontrivial amounts of data.
7. Riak Search
Lessons Learned
Pretends to be Lucene/Solr (but it ain’t).
Lack of analyzer/language/feature support.
Poor performance on certain queries.
Poor anti-entropy (indexes can corrupt).
Basho isn’t a “search company”.
8. Solr is Better
Excellent analyzer/language support
Features: ranking, faceting, highlighting, geo, etc.
Built upon Lucene
Actively developed, built by search innovators
9. By Our Powers Combined
Riak: HA, distributed, scale out/in
Solr: efficient index, features people want, known entity
Solr scales with Riak
Make Riak searchable with Solr
12. Integration
Solr bundled with Riak, turn-key, zero config to start
Supervise Solr process, start/stop/restart
Presents the real Solr query interface
Use Solr clients to query Riak
13. Intermediary
Erlang processes.
Translates KV data into Solr docs.
Turns Solr queries into distributed Solr queries.
Communicates with KV to verify object/index convergence.
16. EC2 AMI
ami-8d9c20e4
• based on ami-6df93504, x86_64, Amazon Linux, instance storage
• Yokozuna ready to go, ~ec2-user/riak/rel
• modify node name in vm.args _before_ you start the node
• set ulimit -n
• open up port 8098
18. Riak as Normal
Key Value
Diagrams: http://wiki.basho.com/attachments/riak-ring.png and http://s3.amazonaws.com/wernervogels/public/sosp/sosp-figure3-small.png
19. One Solr Instance per Node
Riak node runs one Jetty/Solr process; Yokozuna starts/monitors the Solr proc
KV hook sends index msgs; queries go through Yokozuna
One Yokozuna index = one Solr core (e.g. an Index “people”)
1 bucket : 1 index : 1 core today; goal of M:1, bucket:index
20. All Partitions, One Solr
Riak adds special Solr fields to each doc:
id        ryan_7   (riak key + partition)
_yz_pn    7        (owning partition)
_yz_fpn   7        (first partition of the preflist)
_yz_node  dev1     (owning node)
_yz_rk    ryan     (riak key)
value_t   “...”    (extracted Value)
KV partitions 1, 4, and 7 all share the node’s one Solr index.
21. Extraction on Media Type
riak object metadata: content-type “text/xml”, plus Key and Value
yz_xml_extractor(Value) yields:
<doc>
  <person_name_s>Ryan Zezeski</person_name_s>
  <person_bio_t>...</person_bio_t>
  ...
</doc>
22. Anti-Entropy
index kept in sync on: put, obj modified, read repair, handoff
* Thanks to Ryan Zezeski, who’s an absolute rockstar, and whose work generally raises the level of Basho as an organization. * And before anyone asks, Yokozuna means “horizontal rope” in Japanese. It’s the top rank in sumo, usually translated as “Grand Champion”. Not named after the WWF wrestler.
* Although I’m probably better known as this guy, and as of yesterday, this guy
* if you want to follow latest then you can follow the getting started guide for building from source * requires special branches of riak, riak_core, and riak_kv * requires Erlang; I’ve been building against R15B01 * will pull the latest Solr 4.0 alpha and build it, so requires javac, ant, and ivy
* riak is good at storing data, not so great at retrieving it beyond primary key * map/reduce can work but is often too general and can be resource hungry * 2i is great for simple tagging, but can’t do much beyond that, has issues around large result sets * want to query data in sophisticated ways but in an efficient manner
* looks like solr but not exact semantics, different performance, lacking features * conjunction queries containing sub-queries w/ large results can hurt * basho is not in business of innovating search, stand on shoulders of giants
* solr has good language and analyzer support * has features people want that would be expensive to develop in riak search * rests upon Lucene, a well-known and tried solution * sees active development; 4.0 includes distributed search, better compaction algorithms, and “near real time” indexing, to name a few
* riak is highly available (via replication), distributed by default, and scales out/in well (or at least better than most) * Solr has efficient indexing, has the features people ask for, and is a known entity with a large community around it * the fundamental idea behind yokozuna is to use the strengths of both to complement each other * use riak to make solr HA and scalable * use solr to make data discovery in riak better
* In a nutshell, Yokozuna has this behavior
* you write KV data to riak like normal, yokozuna writes a doc to solr * query it like a single solr instance, get solr response verbatim
* yokozuna is tight integration between solr and riak, focus on getting started easily * yokozuna controls external solr instance, will restart on crash * provide the canonical solr query interface * use existing solr clients to query riak
* yokozuna acts as intermediary between KV and Solr * converts KV data to solr docs * translates single instance solr queries to dist solr queries * constantly communicates with KV to check convergence of data
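A minimal sketch of that KV-to-Solr translation, assuming a plain-dict document (the _yz_* field names come from slide 20; the function shape and signature are illustrative, not Yokozuna's actual API):

```python
# Sketch: translate a Riak KV object into the field/value pairs Yokozuna
# would hand to Solr. The _yz_* field names come from slide 20; the
# function itself is illustrative, not Yokozuna's actual API.

def kv_to_solr_doc(key, value, partition, node):
    return {
        "id": f"{key}_{partition}",  # riak key + partition, so replicas
                                     # on one node don't squash each other
        "_yz_rk": key,               # the riak key
        "_yz_pn": partition,         # owning partition (used by coverage)
        "_yz_node": node,            # owning node
        "value_t": value,            # value as extracted (text extractor)
    }

doc = kv_to_solr_doc("ryan", "...", 7, "dev1")
```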
* In the interest of dogfooding, Our new docs search is powered by yokozuna, using an AMI
* even with a guide building riak from source can be a pain, I created an AMI to get started quickly using ec2 * based on amazon linux, x86_64, instance storage * you’ll find a source build under ~ec2-user/riak/rel * after starting an instance need to modify node name in vm.args _before_ starting (cause it gets written to ring) * probably want to set ulimit, riak loves file descriptors * change ec2 security profiles to let traffic thru 8098
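The setup steps above, as a hedged sketch (the vm.args and bin paths assume the standard rel layout under ~ec2-user/riak/rel, and the node name is a placeholder):

```shell
# set the node name BEFORE the first start -- it gets written into the ring
sed -i 's/^-name .*/-name riak@ec2-NN-NN-NN-NN.compute-1.amazonaws.com/' \
  ~/riak/rel/riak/etc/vm.args

# riak loves file descriptors
ulimit -n 65536

# open port 8098 in the instance's EC2 security group, then:
~/riak/rel/riak/bin/riak start
```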
* start riak as normal, you’ll see a new beam and java process start
* attach to riak so you can create an index and add the hook
* if you want you can check the bucket props to make sure the hook is there
* write some data to riak like you normally would * the content-type will be used to extract field/value pairs * in this case the text extractor will create a field ‘text’ with the string passed as the value
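The extractor dispatch described above can be sketched as a small registry (a toy in Python; Yokozuna's extractors are Erlang modules, and the names and shapes here are illustrative):

```python
# Sketch: extractor dispatch on media type, as the notes describe.
# A toy registry in Python; Yokozuna's extractors are Erlang modules.

EXTRACTORS = {}

def register_extractor(content_type, fun):
    EXTRACTORS[content_type] = fun

def extract(content_type, value):
    return EXTRACTORS[content_type](value)

# default behavior from the notes: text/plain yields a single 'text' field
register_extractor("text/plain", lambda v: {"text": v})

fields = extract("text/plain", "hello riak")
```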
* query the data, notice ‘wt’ is a canonical solr param, no special yokozuna fields
* the key ryan was matched, as expected * that crazy string at the top is the filter query, this is the magic sauce
* okay, that’s great and all, but how about something more sophisticated * in solr you can do something called “highlighting” which shows your matches in the context of their surrounding content * this is the same thing you get when querying google, it shows how your query matched * notice the use of many canonical solr params here, it’s all passed through verbatim, yokozuna simply sends it to the right shards
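That pass-through behavior means a query is just a Solr URL. A sketch of building one (q, wt, hl, and hl.fl are canonical Solr params; the host, port, and /search/<index> path are assumptions, not a documented endpoint):

```python
from urllib.parse import urlencode

# Sketch: a Yokozuna search is an ordinary Solr query, passed through
# verbatim. The URL path here is an assumption for illustration.
params = {
    "q": "text:riak",   # the query itself
    "wt": "json",       # response writer
    "hl": "true",       # enable highlighting
    "hl.fl": "text",    # field(s) to highlight
}
url = "http://localhost:8098/search/people?" + urlencode(params)
```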
* okay, the correct key matched again, but there is no highlighting * what gives?
* by default the text extractor stores the value under the field ‘text’ * this field isn’t stored, and highlighting only works with stored fields * there is a dynamic field in the default schema *_t that does store the data, so let’s modify the extractor to use that field * keep in mind storing fields isn’t free and shouldn’t be used willy nilly, but I want to show a more complex example than a simple query
* here we re-register the extractor for the ‘text/plain’ content type so that the data is stored under the field ‘value_t’ * this will match the *_t dynamic field because the suffix matches
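The suffix match at work here, sketched (a simplification: real Solr also allows the glob at the start of a dynamic-field pattern):

```python
# Sketch: Solr dynamic-field matching by suffix, which is why 'value_t'
# lands in the '*_t' dynamic field of the default schema.

def matches_dynamic(field, pattern):
    if pattern.startswith("*"):
        return field.endswith(pattern[1:])
    return field == pattern
```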
* here I delete the key and then rewrite it * the new extractor def will be used this time
* here I run the query again * notice I had to change the field name in the ‘hl.fl’ and ‘q’ params
* and voila, now highlighting works * but seriously, I heard there is Natty Boh here
* you start with Riak as normal * key-value based storage * distributed via consistent hashing * sibling detection with vector clocks * etc
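A toy version of the consistent hashing just mentioned (real Riak hashes the bucket/key with SHA-1 onto a 160-bit ring cut into equal partitions; the modulo here is a simplification):

```python
import hashlib

# Toy consistent hashing in the spirit of Riak's ring.

RING_SIZE = 64  # number of partitions, a power of two as in Riak

def partition_for(bucket, key):
    h = hashlib.sha1(f"{bucket}/{key}".encode()).digest()
    return int.from_bytes(h, "big") % RING_SIZE

def preflist(bucket, key, n_val=3):
    """The n_val partitions holding replicas, walking the ring."""
    p = partition_for(bucket, key)
    return [(p + i) % RING_SIZE for i in range(n_val)]
```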
* one solr instance per node * yokozuna uses erlang supervision to monitor/restart the external solr process * a KV hook causes documents to be created/written to solr * queries are converted by yokozuna to distributed solr queries * yokozuna has notion of index, it’s a synonym for solr core, every index is realized on disk via a solr core * currently it is a one-to-one mapping between bucket-index, plan is to allow many buckets to write to one index, e.g. storing mailing list/wiki/commit logs under different buckets but want to search under one index
* since it’s one solr instance it means all partitions live on same index * you can’t just query on value alone b/c multiple replicas could live on same node * there could be fallback data, onwership xfer causes multiple copies of same replica temporarily * id field is riak key + logical partition, otherwise multiple replicas on same node would squash each other * _yz_pn is the owning partition for this replica, needed for coverage query since only a covering subset of partitions is searched * _yz_node is needed during ownership since two nodes will have same partition copy temporarily, if both shards selected you get overlap * _yz_fpn is the first partition in the preflist, since coverage queries can overlap on partitions (i.e. 2 neighbors are selected) need to pick particular pl * there is also a field for entropy data used to calculate hashtrees * the rest of the fields are extracted from the KV value * the special fields are used as a solr filter query which is an efficient way to limit the results w/o affecting score * this is a little confusing, the key take away is this is done to prevent overlap in solr which can mess up the results
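The filter-query "magic sauce" from those notes, sketched (field names are the slides' _yz_* fields; the exact query text Yokozuna generates may differ):

```python
# Sketch: build the Solr filter query from the special fields so that
# overlapping replicas and fallbacks on a node are excluded from results.

def filter_query(partitions, node, first_partition=None):
    pn = " OR ".join(f"_yz_pn:{p}" for p in partitions)
    fq = f"_yz_node:{node} AND ({pn})"
    if first_partition is not None:
        # disambiguate when neighboring coverage partitions overlap
        fq += f" AND _yz_fpn:{first_partition}"
    return fq
```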
* data is opaque to Riak * Solr wants fields/values to index * extractors map a media type to a set of field/value pairs * can register new extractors
* any time the object is modified the index is updated * handoff comes for free, no messy iteration of Solr data, trade CPU for network/complexity * queries don’t have to serialize on VNodes
* there is a hashtree for every partition in KV and Solr, each storing hash of KV object * trees are updated as data comes in, periodically rebuilt from scratch in case data/tree get out of sync * entropy mgr continuously iterates partitions and performs exchanges * if divergence is detected the object is read-repaired and re-indexed * if one replica is found to be divergent all of them get fixed
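A flat-dict version of the exchange described above (real Yokozuna keeps a Merkle hashtree per partition so exchanges only walk divergent branches; this just shows the divergence check):

```python
import hashlib

# Sketch: the divergence check at the heart of an anti-entropy exchange.

def obj_hash(value):
    return hashlib.sha1(value.encode()).hexdigest()

def divergent_keys(kv_hashes, index_hashes):
    """Keys whose KV object and indexed doc disagree or are missing."""
    keys = set(kv_hashes) | set(index_hashes)
    return sorted(k for k in keys
                  if kv_hashes.get(k) != index_hashes.get(k))

kv = {"ryan": obj_hash("v2"), "amy": obj_hash("v1")}
idx = {"ryan": obj_hash("v1"), "amy": obj_hash("v1")}
# "ryan" diverges -> read-repair the object and re-index it
```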
* query comes into riak as canonical solr query * yokozuna builds a coverage plan: covering set of nodes/shards, filter query on PN/Node to eliminate overlap * Solr does the actual coordination/merging of the results, yokozuna just tells it what to do
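The coverage idea in miniature (a toy planner; Riak's actual planner also balances shards across nodes and handles rings whose size isn't a multiple of n_val):

```python
# Sketch: with n_val replicas, querying every n_val-th partition still
# sees every preference list at least once.

RING_SIZE = 64
N_VAL = 3

def coverage_partitions(ring_size=RING_SIZE, n_val=N_VAL):
    return list(range(0, ring_size, n_val))

parts = coverage_partitions()
# the selected partitions overlap slightly at the ring's wrap-around,
# which is exactly what the _yz_pn/_yz_fpn filter query cleans up
```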