Yokozuna
Eric Redmond
@coderoshi
Shamelessly pilfered from Ryan Zezeski @rzezeski
Nov 19th, 2012
A Little Riak Book
Gifts from Basho
https://github.com/rzezeski/yokozuna
https://github.com/coderoshi/little_riak_book
WHY DO THIS?
Improve Retrieval in Riak

Riak is awesome at storing data (not so easy to query it).
MapReduce can be resource intensive (also, you have to write code).
2i is limited and has resource issues.
Efficiently query nontrivial amounts of data.
Riak Search
Lessons Learned

Pretends to be Lucene/Solr (but it ain’t).
Lack of analyzer/language/feature support.
Poor performance on certain queries.
Poor anti-entropy (indexes can corrupt).
Basho isn’t a “search company”.
Solr is Better

Excellent analyzer/language support
Features: ranking, faceting, highlighting, geo, etc.
Built upon Lucene
Actively developed, built by search innovators
By Our Powers Combined
Riak: HA, distributed, scale out/in
Solr: efficient index, features people want, known entity
Solr scales with Riak
Make Riak searchable with Solr
WHAT IS YOKOZUNA?
IF DISTRIBUTED SOLR HAS IT
YOKOZUNA HAS IT
Integration

Solr bundled with Riak, turn-key, zero config to start
Supervise Solr process, start/stop/restart
Present real Solr query interface
Use Solr clients to query Riak
Intermediary

Erlang processes.
Translates KV data into Solr docs.
Turns single-instance Solr queries into distributed Solr queries.
Communicates with KV to verify object/index convergence.
DEMO
LET'S SEE IT!
POWERED BY YOKOZUNA
EC2 AMI
ami-8d9c20e4
•   based on ami-6df93504, x86_64, Amazon Linux, instance storage
•   Yokozuna ready to go, ~ec2-user/riak/rel
•   modify node name in vm.args
•   set ulimit -n
    (do these two _before_ you start the node)
•   open up port 8098
ARCHITECTURE
Riak as Normal
Key   Value
[Figures: http://wiki.basho.com/attachments/riak-ring.png and http://s3.amazonaws.com/wernervogels/public/sosp/sosp-figure3-small.png]
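One Solr Instance per Node
[Diagram: inside Riak, the Yokozuna side (Solr Proc, KV Hook) connects to Jetty/Solr through arrows labeled Start/Monitor, Index, and Query. Solr cores shown: people, msgs.]
Mapping: bucket M:1 index, index 1:1 core (goal of M:1, bucket:index)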
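All Partitions, One Solr
[Diagram: on the Riak side, partitions KV 1, KV 4, and KV 7, with the object ryan -> Value; on the Solr side, one doc carrying the Riak special fields below.]
id         ryan_7
_yz_pn     7
_yz_fpn    7
_yz_node   dev1
_yz_rk     ryan
value_t    “...”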
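
As a rough illustration of the special fields above, here is a minimal Python sketch of how one replica of a Riak object might be turned into a Solr doc. The function name `make_solr_doc` and its arguments are hypothetical; only the field names (`id`, `_yz_pn`, `_yz_fpn`, `_yz_node`, `_yz_rk`) and the "key + partition" id come from the slide.

```python
def make_solr_doc(riak_key, partition, first_partition, node, fields):
    """Build one Solr doc for a single replica of a Riak object.

    Hypothetical helper: only the field names mirror the slide.
    """
    doc = {
        # Riak key + logical partition, so replicas of the same key that
        # land on the same node do not overwrite each other in Solr.
        "id": f"{riak_key}_{partition}",
        "_yz_pn": partition,         # owning partition of this replica
        "_yz_fpn": first_partition,  # first partition in the preflist
        "_yz_node": node,            # node holding this replica
        "_yz_rk": riak_key,          # the plain Riak key
    }
    doc.update(fields)               # fields extracted from the KV value
    return doc


doc = make_solr_doc("ryan", 7, 7, "dev1", {"value_t": "..."})
other = make_solr_doc("ryan", 4, 7, "dev1", {"value_t": "..."})
print(doc["id"], other["id"])
```

Because the partition is part of the id, two replicas of `ryan` on the same node stay distinct documents, and the special fields can later be used in a filter query to pick exactly one of them.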
Extraction on Media Type
riak object: metadata (content-type “text/xml”), Key, Value
yz_xml_extractor(Value):
<doc>
  <person_name_s>Ryan Zezeski</person_name_s>
  <person_bio_t>...</person_bio_t>
  ...
</doc>
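
The idea of picking an extractor by media type can be sketched in Python as follows. This is not the Erlang `yz_xml_extractor`. The registry `EXTRACTORS`, the function names, and the rule of joining nested element names with `_` are assumptions made for illustration. The `text` field for plain text follows the speaker notes.

```python
import xml.etree.ElementTree as ET


def xml_extractor(value):
    """Flatten XML into (field, value) pairs.

    Assumption: nested element names are joined with '_'.
    """
    pairs = []

    def walk(elem, prefix):
        name = f"{prefix}_{elem.tag}" if prefix else elem.tag
        children = list(elem)
        if not children:
            # Leaf element: its text becomes the field value.
            pairs.append((name, (elem.text or "").strip()))
        for child in children:
            walk(child, name)

    walk(ET.fromstring(value), "")
    return pairs


def text_extractor(value):
    """Plain text goes into a single field named 'text'."""
    return [("text", value)]


# Hypothetical registry: media type -> extractor function.
EXTRACTORS = {"text/xml": xml_extractor, "text/plain": text_extractor}


def extract(content_type, value):
    return EXTRACTORS[content_type](value)


xml_value = "<person><name_s>Ryan Zezeski</name_s><bio_t>...</bio_t></person>"
print(extract("text/xml", xml_value))
```

Registering a new extractor in this sketch is just another entry in `EXTRACTORS`, which mirrors the "can register new extractors" point in the notes.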
Anti-Entropy
put / read repair / handoff
obj modified!
Active Anti-Entropy
[Diagram: the Entropy Mgr drives an Exchange between the KV Tree and the YZ Tree.]
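
To make the exchange between the two trees concrete, here is a simplified Python sketch. It uses a two-level structure (segment digests over per-key hashes) as a stand-in for the real hashtrees. `SEGMENTS`, `build_tree`, and `exchange` are hypothetical names, and the actual on-disk trees in Riak are more elaborate.

```python
import hashlib

SEGMENTS = 16  # hypothetical number of segments per tree


def obj_hash(value: bytes) -> str:
    return hashlib.sha1(value).hexdigest()


def segment_of(key: str) -> int:
    # Deterministic mapping, so both sides place a key in the same segment.
    return int(hashlib.sha1(key.encode()).hexdigest(), 16) % SEGMENTS


def build_tree(key_hashes):
    """key_hashes: dict of key -> object hash. Returns (segments, digests)."""
    segments = {i: {} for i in range(SEGMENTS)}
    for key, h in key_hashes.items():
        segments[segment_of(key)][key] = h
    digests = {
        i: hashlib.sha1(repr(sorted(seg.items())).encode()).hexdigest()
        for i, seg in segments.items()
    }
    return segments, digests


def exchange(kv_tree, yz_tree):
    """Return the keys that differ or are missing on either side."""
    kv_segs, kv_digests = kv_tree
    yz_segs, yz_digests = yz_tree
    divergent = set()
    for i in range(SEGMENTS):
        if kv_digests[i] == yz_digests[i]:
            continue  # whole segment agrees, no need to compare keys
        for key in set(kv_segs[i]) | set(yz_segs[i]):
            if kv_segs[i].get(key) != yz_segs[i].get(key):
                divergent.add(key)
    return divergent


kv = {"ryan": obj_hash(b"v2"), "eric": obj_hash(b"v1"), "new": obj_hash(b"v1")}
yz = {"ryan": obj_hash(b"v1"), "eric": obj_hash(b"v1")}
to_repair = exchange(build_tree(kv), build_tree(yz))
print(sorted(to_repair))
```

In this sketch the divergent keys are the ones that would be read-repaired and re-indexed. Comparing segment digests first is what lets matching regions be skipped without looking at individual keys.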
Query -> Dist. Query
Node A, Riak: search/people?q=zezeski
yz_cover:plan(“people”)
Node A, Solr: solr/people/select?shards=...&fq=...&q=zezeski
distributed search to Solr on NodeB, NodeB, NodeC
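
The rewrite from a plain query to a distributed one can be sketched in Python. The slide elides the real `shards` and `fq` values, so everything concrete here is an assumption: the plan format, the host addresses, the Solr port 8983, and the exact filter syntax. Only the parameter names `shards`, `fq`, and `q` and the fields `_yz_node` and `_yz_pn` come from the deck.

```python
from urllib.parse import parse_qs, urlencode


def distributed_query(index, q, plan, solr_port=8983):
    """Turn a single-instance query into a distributed Solr query string.

    plan: list of (host, node_name, partitions), a hypothetical stand-in
    for whatever yz_cover:plan returns.
    """
    shards = ",".join(f"{host}:{solr_port}/solr/{index}" for host, _node, _pns in plan)
    # One clause per node: only docs of the chosen partitions on that node,
    # so replicas of the same object are not counted twice.
    fq = " OR ".join(
        "(_yz_node:{} AND ({}))".format(
            node, " OR ".join(f"_yz_pn:{pn}" for pn in pns)
        )
        for _host, node, pns in plan
    )
    return f"solr/{index}/select?" + urlencode({"q": q, "shards": shards, "fq": fq})


plan = [("10.0.0.2", "dev2", [1, 4]), ("10.0.0.3", "dev3", [7])]
url = distributed_query("people", "zezeski", plan)
params = parse_qs(url.split("?", 1)[1])
print(params["shards"][0])
print(params["fq"][0])
```

The user's `q` passes through unchanged. The coverage information rides along in `shards` and `fq`, which is why the filter limits results without affecting scoring.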


Editor's notes

  1. * Thank Ryan Zezeski, who’s an absolute rockstar, and whose work generally raises the level of Basho as an organization. * And before anyone asks, Yokozuna means “Horizontal rope” in Japanese. It’s the top rank in sumo, usually translated as “Grand Champion”. Not named after the WWF wrestler.
  2. * Although I’m probably better known as this guy, and as of yesterday, this guy
  3. * if you want to follow latest then you can follow getting started guide for building from source * requires special branches of riak, riak core, and riak kv * requires erlang, I’ve been building against 15B01 * will pull latest of solr 4.0 alpha and build so requires javac, ant, and ivy
  4. * if you want to follow latest then you can follow getting started guide for building from source * requires special branches of riak, riak core, and riak kv * requires erlang, I’ve been building against 15B01 * will pull latest of solr 4.0 alpha and build so requires javac, ant, and ivy
  5. * riak is good at storing data, not so great at retrieving it beyond primary key * map/reduce can work but is often too general and can be resource hungry * 2i is great for simple tagging, but can’t do much beyond that, has issues around large result sets * want to query data in sophisticated ways but in an efficient manner
  6. * looks like solr but not exact semantics, different performance, lacking features * conjunction queries containing sub-queries w/ large results can hurt * basho is not in business of innovating search, stand on shoulders of giants
  7. * solr has good language and analyzer support * has features people want that would be expensive to develop in riak search * rests upon lucene, a well known and tried solution * sees active development, 4.0 including distributed search, better compaction algorithms, and “near real time” indexing, to name a few
  8. * riak is highly available (via replication), distributed by default, and scales out/in well (or at least better than most) * Solr has efficient indexing, has the features people ask for, and is known with large community around it * the fundamental idea behind yokozuna is to use the strengths of both to complement each other * use riak to make solr HA and scale * use solr to make data discovery in riak better
  9. * In a nutshell, Yokozuna has this behavior
  10. * you write KV data to riak like normal, yokozuna writes a doc to solr * query it like a single solr instance, get solr response verbatim
  11. * yokozuna is tight integration between solr and riak, focus on getting started easily * yokozuna controls external solr instance, will restart on crash * provide the canonical solr query interface * use existing solr clients to query riak
  12. * yokozuna acts as intermediary between KV and Solr * converts KV data to solr docs * translates single instance solr queries to dist solr queries * constantly communicates with KV to check convergence of data
  13. * In the interest of dogfooding, our new docs search is powered by yokozuna, using an AMI
  14. * even with a guide, building riak from source can be a pain, so I created an AMI to get started quickly using ec2 * based on amazon linux, x86_64, instance storage * you’ll find a source build under ~ec2-user/riak/rel * after starting an instance you need to modify the node name in vm.args _before_ starting the node (because it gets written to the ring) * probably want to set ulimit, riak loves file descriptors * change ec2 security profiles to let traffic through 8098
  15. * start riak as normal, you’ll see a new beam and java process start
  16. * attach to riak so you can create an index and add the hook
  17. * if you want you can check the bucket props to make sure the hook is there
  18. * write some data to riak like you normally would * the content-type will be used to extract field/value pairs * in this case the text extractor will create a field ‘text’ with the string passed as the value
  19. * query the data, notice ‘wt’ is a canonical solr param, no special yokozuna fields
  20. * the key ryan was matched, as expected * that crazy string at the top is the filter query, this is the magic sauce
  21. * okay, that’s great and all, but how about something more sophisticated * in solr you can do something called “highlighting” which shows your matches in the context of their surrounding content * this is the same thing you get when querying google, it shows how your query matched * notice the use of many canonical solr params here, it’s all passed through verbatim, yokozuna simply sends it to the right shards
  22. * okay, the correct key matched again, but there is no highlighting * what gives?
  23. * by default the text extractor stores the value under the field ‘text’ * this field isn’t stored, and highlighting only works with stored fields * there is a dynamic field in the default schema *_t that does store the data, so let’s modify the extractor to use that field * keep in mind storing fields isn’t free and shouldn’t be used willy nilly, but I want to show a more complex example than a simple query
  24. * here we re-register the extractor for the ‘text/plain’ content type so that the data is stored under the field ‘value_t’ * this will match the *_t dynamic field because the suffix matches
  25. * here I delete the key and then rewrite it * the new extractor def will be used this time
  26. * here I run the query again * notice I had to change the field name in the ‘hl.fl’ and ‘q’ params
  27. * and voila, now highlighting works * but seriously, I heard there is Natty Boh here
  28. * you start with Riak as normal * key-value based storage * distributed via consistent hashing * sibling detection with vector clocks * etc
  29. * one solr instance per node * yokozuna uses erlang supervision to monitor/restart the external solr process * a KV hook causes documents to be created/written to solr * queries are converted by yokozuna to distributed solr queries * yokozuna has notion of index, it’s a synonym for solr core, every index is realized on disk via a solr core * currently it is a one-to-one mapping between bucket-index, plan is to allow many buckets to write to one index, e.g. storing mailing list/wiki/commit logs under different buckets but want to search under one index
  30. * since it’s one solr instance it means all partitions live on same index * you can’t just query on value alone b/c multiple replicas could live on same node * there could be fallback data, ownership transfer causes multiple copies of same replica temporarily * id field is riak key + logical partition, otherwise multiple replicas on same node would squash each other * _yz_pn is the owning partition for this replica, needed for coverage query since only a covering subset of partitions is searched * _yz_node is needed during ownership transfer since two nodes will have same partition copy temporarily, if both shards are selected you get overlap * _yz_fpn is the first partition in the preflist, since coverage queries can overlap on partitions (i.e. 2 neighbors are selected) need to pick a particular preflist * there is also a field for entropy data used to calculate hashtrees * the rest of the fields are extracted from the KV value * the special fields are used as a solr filter query which is an efficient way to limit the results w/o affecting score * this is a little confusing, the key takeaway is this is done to prevent overlap in solr which can mess up the results
  31. * data is opaque to Riak * Solr wants fields/values to index * Extractors map media type to set of field/value pairs * can register new extractors
  32. * any time the object is modified the index is updated * handoff comes for free, no messy iteration of Solr data, trade CPU for network/complexity * queries don’t have to serialize on VNodes
  33. * there is a hashtree for every partition in KV and Solr, each storing hash of KV object * trees are updated as data comes in, periodically rebuilt from scratch in case data/tree get out of sync * entropy mgr continuously iterates partitions and performs exchanges * if divergence is detected the object is read-repaired and re-indexed * if one replica is found to be divergent all of them get fixed
  34. * query comes into riak as canonical solr query * yokozuna builds a coverage plan: covering set of nodes/shards, filter query on PN/Node to eliminate overlap * Solr does the actual coordination/merging of the results, yokozuna just tells it what to do
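
The highlighting walkthrough in notes 21 to 27 can be summarized with a small Python sketch. It only builds the query string and checks a field name against the `*_t` dynamic field pattern. The helper names are hypothetical and the `search/<index>` path is taken from the query slide. `q`, `wt`, `hl`, and `hl.fl` are standard Solr parameters.

```python
from fnmatch import fnmatch
from urllib.parse import urlencode


def highlight_query(index, field, term):
    """Build a highlighting query; the parameters pass through to Solr as-is."""
    params = {
        "q": f"{field}:{term}",
        "wt": "json",
        "hl": "true",    # turn highlighting on
        "hl.fl": field,  # field to highlight, which must be a stored field
    }
    return f"search/{index}?{urlencode(params)}"


def matches_dynamic_field(name, pattern="*_t"):
    """True if a field name would fall under the given dynamic field pattern."""
    return fnmatch(name, pattern)


query = highlight_query("people", "value_t", "zezeski")
print(query)
print(matches_dynamic_field("value_t"), matches_dynamic_field("text"))
```

This reflects the fix described in the notes: the default `text` field is not stored, so highlighting returns nothing, while a name such as `value_t` matches the stored `*_t` dynamic field and can be highlighted.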