The Fifth Elephant 2013, Bangalore
12th July 2013
SolrCloud and NoSQL
Anshum Gupta
Who am I?
• Anshum Gupta
• Search and related stuff for around 8 years now
• Apache Lucene since 2006, Solr since 2010
• Currently:
• Helped launch the first AWS search service, CloudSearch.
• Places I've worked at:
Big Data
• Real Value = Process +
Store + Search
• Search
- No longer expensive
- Affordable
- Necessity
- Can get as complicated as you'd want it to get.
[Figure: repeated "Loads of Data" blocks, with the labels "Data" and "Search"]
NoSQL Databases
• Wikipedia says:
A NoSQL database provides a mechanism for storage and retrieval of data that
use looser consistency models than traditional relational databases in order to
achieve horizontal scaling and higher availability. Some authors refer to them as
"Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query
language to be used.
• Non-traditional data stores
• Doesn't use / isn't designed around SQL
• May not give full ACID guarantees
- Offers other advantages such as greater scalability as a tradeoff
• Distributed, fault-tolerant architecture
DB Rankings: Overall
Source: http://db-engines.com/en/ranking
Search Engine Rankings
Source: http://db-engines.com/en/ranking/search+engine
MongoDB
• Data Model: BSON
• Distributed Model: Sharded master-slave async
replication.
• Consistency: Per table write lock.
• Search:
- Built-in full-text search, but with large gaps compared to dedicated 'search' players.
- Popular alternative: pair MongoDB with a separate search solution (Solr?), which brings consistency issues and more.
Cassandra
• Data Model: Column based data store.
• Distributed Model: Uses consistent hashing for
distributed updates.
• Consistency: Timestamps for consistency.
• Search
- Lucandra: Lucene-based search.
- Solandra: Solr-based search.
Riak
• Implements principles from the Amazon Dynamo paper.
• Riak Search - Distributed index and full-text search
engine.
- Merge Index – Storage backend used by Riak Search. It's a pure
Erlang storage format that, among other things, draws on the Apache
Lucene file format.
- Riak Solr – Adds a subset of Apache Solr HTTP capabilities to
Riak Search.
• Yokozuna
- “next generation of Riak Search that marries Riak with Apache
Solr”.
- Sits alongside Riak.
The story so far…
• Different approaches for:
- Data Model
- Distributed Update handling
- Consistency management
• Work reasonably well on different fronts as far as
storage is concerned.
• Search:
- There's barely anything native and in the core.
- (Almost) Everyone is trying to fuse together with Lucene/Solr.
Adding Search to NoSQL
• To begin with, wasn't built for that
• Compromises
• Integration is the buzzword.
• Lucandra, Solandra…No strong contender yet.
Adding NoSQL to Search
• Already store documents
• With growing data, more intuitive for this to happen
• More intuitive = makes more sense = easier (perhaps)
• No key player as yet.
Apache Solr 4 at a glance
• Document Oriented NoSQL Search Server
- Data-format agnostic (JSON, XML, CSV, binary)
- Schema-less options (more coming soon)
• Distributed
- Multi-tenanted
• Fault Tolerant
- HA + No single points of failure
• Atomic Updates
• Optimistic Concurrency
• Near Real-time Search
• Full-Text search + Hit Highlighting
• Tons of specialized queries: Faceted
search, grouping, pseudo-join, spatial search, functions
The desire for these features drove some of the “SolrCloud” architecture.
SolrCloud Design Goals
• Automatic Distributed Indexing
• HA for Writes
• Durable Writes
• Near Real-time Search
• Real-time get
• Optimistic Concurrency
SolrCloud
• Distributed Indexing designed from the ground up to
accommodate desired features
• CAP Theorem
- Consistency, Availability, Partition Tolerance (saying goes “choose 2”)
- Reality: Must handle P – the real choice is tradeoffs between C and A
• Ended up with a CP system (roughly)
- Value Consistency over Availability
- Eventual consistency is incompatible with optimistic concurrency
- Closest to MongoDB in architecture
• We still do well with Availability
- All N replicas of a shard must go down before we lose writability for that
shard
- For a network partition, the “big” partition remains active (i.e. Availability
isn't “on” or “off”)
SolrCloud
[Diagram: a query to http://.../solr/collection1/query?q=awesome is sent as load-balanced sub-requests to shard1 and shard2, each shown with replica1, replica2 and replica3; a ZooKeeper quorum of five ZK nodes sits alongside]
ZooKeeper tree shown in the diagram:
/configs
  /myconf
    solrconfig.xml
    schema.xml
/clusterstate.json
/aliases.json
/livenodes
  server1:8983/solr
  server2:8983/solr
/collections
  /collection1
    configName=myconf
    /shards
      /shard1
        server1:8983/solr
        server2:8983/solr
      /shard2
        server3:8983/solr
        server4:8983/solr
ZooKeeper holds cluster state
• Nodes in the cluster
• Collections in the cluster
• Schema & config for each
collection
• Shards in each collection
• Replicas in each shard
• Collection aliases
Distributed Indexing
[Diagram: a document update sent to http://.../solr/collection1/update; Shard1 holds Replica1 and Replica2, Shard2 holds Replica3 and Replica4; legend: Document Update, Leader, Non-leading replica]
• Update sent to any node
• Solr determines which shard the document belongs to, and forwards it to the shard leader
• Shard leader versions the document and forwards it to all other shard replicas
• HA for updates (if one leader fails, another takes its place)
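The update path described above can be sketched as a small in-memory simulation. This is only an illustration under stated assumptions: the class names (`Replica`, `Shard`, `Collection`) are hypothetical, the CRC32-modulo routing is a stand-in for Solr's real hash-range routing, and leader election is reduced to "pick the next replica".

```python
import itertools
import zlib


class Replica:
    """One copy of a shard's data (hypothetical in-memory stand-in)."""

    def __init__(self, name):
        self.name = name
        self.docs = {}  # doc_id -> (version, doc)

    def apply(self, doc_id, doc, version):
        self.docs[doc_id] = (version, doc)


class Shard:
    def __init__(self, name, replicas):
        self.name = name
        self.replicas = list(replicas)
        self.leader = self.replicas[0]
        self._versions = itertools.count(1)

    def update(self, doc_id, doc):
        # The leader assigns the version, then forwards to the other replicas.
        version = next(self._versions)
        self.leader.apply(doc_id, doc, version)
        for replica in self.replicas:
            if replica is not self.leader:
                replica.apply(doc_id, doc, version)
        return version

    def fail_leader(self):
        # Simplified failover: the next surviving replica takes over.
        self.replicas.remove(self.leader)
        if not self.replicas:
            raise RuntimeError("all replicas down: shard is no longer writable")
        self.leader = self.replicas[0]


class Collection:
    def __init__(self, shards):
        self.shards = list(shards)

    def shard_for(self, doc_id):
        # ASSUMPTION: CRC32 modulo shard count stands in for hash-range routing.
        index = zlib.crc32(doc_id.encode("utf-8")) % len(self.shards)
        return self.shards[index]

    def update(self, doc_id, doc):
        # Any node may receive the update; it is forwarded to the owning shard.
        return self.shard_for(doc_id).update(doc_id, doc)


collection = Collection([
    Shard("Shard1", [Replica("Replica1"), Replica("Replica2")]),
    Shard("Shard2", [Replica("Replica3"), Replica("Replica4")]),
])
```

Calling `collection.update("doc1", {...})` leaves the same versioned document on every replica of the owning shard, and the shard keeps accepting updates after `fail_leader()` as long as one replica survives.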
Optimistic Concurrency
• Conditional update based on document version
[Diagram: update loop between client and Solr]
2. Modify document, retaining _version_
4. Go back to step #1 if fail code=409
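A minimal sketch of a conditional-update loop of this kind, written against a hypothetical in-memory store rather than Solr itself. The slide text only preserves steps 2 and 4, so the "fetch" and "submit" steps below are assumptions about what steps 1 and 3 are; `VersionedStore`, `Conflict` and `increment_with_retry` are illustrative names, not Solr APIs.

```python
class Conflict(Exception):
    """Raised when the submitted _version_ is stale (modelled on HTTP 409)."""
    code = 409


class VersionedStore:
    """Hypothetical stand-in for a server that versions every document."""

    def __init__(self):
        self._docs = {}
        self._clock = 0

    def get(self, doc_id):
        return dict(self._docs[doc_id])  # copy, so callers cannot mutate the store

    def update(self, doc):
        current = self._docs.get(doc["id"])
        # Conditional update: reject if the caller's version is not the latest.
        if current is not None and doc.get("_version_") != current["_version_"]:
            raise Conflict(doc["id"])
        self._clock += 1
        self._docs[doc["id"]] = dict(doc, _version_=self._clock)
        return self._clock


def increment_with_retry(store, doc_id, field, max_attempts=5):
    for _ in range(max_attempts):
        doc = store.get(doc_id)                # assumed step 1: fetch the document
        doc[field] = doc.get(field, 0) + 1     # step 2: modify, retaining _version_
        try:
            return store.update(doc)           # assumed step 3: submit the update
        except Conflict:
            continue                           # step 4: start over on code=409
    raise RuntimeError("gave up after repeated conflicts")
```

A writer holding a stale copy gets `Conflict` (code 409) instead of silently overwriting a newer version, which is why this scheme needs a single authoritative version per document.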
Distributed Query Requests
• Distributed query across all shards in the collection
http://localhost:8983/solr/collection1/query?q=foo
• Explicitly specify node addresses to load-balance across
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
- Equivalent nodes in a list are separated by “|”
- Different phases of the same distributed request use the same node
• Specify logical shards to search across
shards=NY,NJ,CT
• Specify multiple collections to search across
collection=collection1,collection2
• public CloudSolrServer(String zkHost)
- ZK-aware SolrJ Java client that load-balances across all nodes in the cluster
- Calculates where a document belongs and sends it directly to the shard leader (new)
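As a rough illustration of how the request parameters above fit together, the following sketch assembles query URLs with the Python standard library. The helper names `shards_param` and `query_url` are hypothetical; only the `q`, `shards` and `collection` parameters and the example hosts come from the slide.

```python
from urllib.parse import urlencode


def shards_param(groups):
    """Join equivalent nodes with "|" and separate shards with ","."""
    return ",".join("|".join(group) for group in groups)


def query_url(base_url, q, shard_groups=None, collections=None):
    params = {"q": q}
    if shard_groups:
        params["shards"] = shards_param(shard_groups)
    if collections:
        params["collection"] = ",".join(collections)
    # Keep ":", "/", "|" and "," readable instead of percent-encoding them.
    return base_url + "/query?" + urlencode(params, safe=":/|,")


url = query_url(
    "http://localhost:8983/solr/collection1",
    "foo",
    shard_groups=[
        ["localhost:8983/solr", "localhost:8900/solr"],
        ["localhost:7574/solr", "localhost:7500/solr"],
    ],
)
```

Each inner list is one shard's set of interchangeable nodes, so the request can be load-balanced within a group while still covering every shard.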
Document Routing
numShards=4, router=compositeId
Hash ring:
shard1: 80000000-bfffffff
shard2: c0000000-ffffffff
shard3: 00000000-3fffffff
shard4: 40000000-7fffffff
id = BigCo!doc5 -> (MurmurHash3) -> 9f27 3c71 -> shard1
q=my_query with shard.keys=BigCo! -> (hash) -> 9f27 0000 to 9f27 ffff -> shard1
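The routing idea in this slide can be sketched as follows. Two loud assumptions: CRC32 is used as a stand-in for MurmurHash3 (so the hash values will not match the slide's 9f27 3c71), and the split of the 32-bit hash into "top 16 bits from the shard key, bottom 16 bits from the rest of the id" is inferred from the slide's 9f27 0000 to 9f27 ffff range. The function names and the pairing of ranges to shard names are illustrative, not Solr's code.

```python
import zlib

# Hash ranges for numShards=4, as listed on the slide.
SHARD_RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),
    "shard2": (0xC0000000, 0xFFFFFFFF),
    "shard3": (0x00000000, 0x3FFFFFFF),
    "shard4": (0x40000000, 0x7FFFFFFF),
}


def stand_in_hash(text):
    """32-bit hash. ASSUMPTION: CRC32 stands in for MurmurHash3 here."""
    return zlib.crc32(text.encode("utf-8")) & 0xFFFFFFFF


def composite_hash(doc_id):
    """Top 16 bits come from the shard key, bottom 16 from the rest of the id."""
    if "!" not in doc_id:
        return stand_in_hash(doc_id)
    shard_key, rest = doc_id.split("!", 1)
    return (stand_in_hash(shard_key) & 0xFFFF0000) | (stand_in_hash(rest) & 0x0000FFFF)


def shard_for_hash(hash_value):
    for name, (low, high) in SHARD_RANGES.items():
        if low <= hash_value <= high:
            return name
    raise ValueError("hash outside the 32-bit ring")


def key_range(shard_key):
    """All hashes a shard key can produce, e.g. 9f27 0000 to 9f27 ffff."""
    top = stand_in_hash(shard_key) & 0xFFFF0000
    return top, top | 0x0000FFFF
```

Because every id with the same prefix shares its top 16 bits, all of a tenant's documents land in one shard, and a query restricted to that key only has to visit that shard.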
Durable Writes
• Lucene flushes writes to disk on a “commit”
- Uncommitted docs are lost on a crash (at Lucene level)
• Solr 4 maintains its own transaction log
- Contains uncommitted documents
- Services real-time get requests
- Recovery (log replay on restart)
- Supports distributed “peer sync”
• Writes forwarded to multiple shard replicas
- A replica can go away forever w/o collection data loss
- A replica can do a fast “peer sync” if it's only slightly out of date
- A replica can do a full index replication (copy) from a leader.
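The role of the transaction log can be illustrated with a toy model. This is a sketch under assumptions, not Solr's implementation: `TinyNode` is a hypothetical class, the "committed" dict stands in for Lucene segments on disk, and the "tlog" list stands in for a durable append-only log.

```python
class TinyNode:
    """Toy model of a node with a committed index plus a transaction log."""

    def __init__(self):
        self.committed = {}  # stands in for what Lucene has flushed on commit
        self.tlog = []       # stands in for the durable log of uncommitted docs

    def add(self, doc_id, doc):
        # Every write is logged before it is committed.
        self.tlog.append((doc_id, doc))

    def search(self, doc_id):
        # Searches only see committed documents.
        return self.committed.get(doc_id)

    def realtime_get(self, doc_id):
        # Real-time get checks the log first, newest entry wins.
        for logged_id, doc in reversed(self.tlog):
            if logged_id == doc_id:
                return doc
        return self.committed.get(doc_id)

    def commit(self):
        for doc_id, doc in self.tlog:
            self.committed[doc_id] = doc
        self.tlog.clear()

    @classmethod
    def restart(cls, durable_committed, durable_tlog):
        # Recovery: replay the surviving log on top of the committed index.
        node = cls()
        node.committed = dict(durable_committed)
        for doc_id, doc in durable_tlog:
            node.add(doc_id, doc)
        return node
```

In this model an uncommitted document is invisible to `search` but available through `realtime_get`, and it survives a simulated restart because the log is replayed.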
Collections API
• Create a new document collection
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3
• Actions: CREATE, DELETE, ALIAS, SPLITSHARD, DELETESHARD, RELOAD
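For scripting, a Collections API request like the one above is just a URL with query parameters. The sketch below only builds the URL (it does not send it); `collections_api_url` is a hypothetical helper, and the parameter names are the ones shown on the slide.

```python
from urllib.parse import urlencode


def collections_api_url(base_url, action, **params):
    """Build a Collections API URL; keyword arguments become query parameters."""
    query = {"action": action}
    query.update(params)
    return base_url + "/admin/collections?" + urlencode(query)


create_url = collections_api_url(
    "http://localhost:8983/solr",
    "CREATE",
    name="mycollection",
    numShards=4,
    replicationFactor=3,
)
```

The resulting string could be passed to any HTTP client; keeping URL construction separate makes it easy to check the request before it touches a cluster.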
Solr 4.3: Seamless Online Shard Splitting
[Diagram: Shard1, Shard2 and Shard3, each with a leader and a replica; an update to Shard2 is also routed to the new sub-shards Shard2_0 and Shard2_1]
1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
2. New sub-shards created in “construction” state
3. Leader starts forwarding applicable updates, which are buffered by the sub-shards
4. Leader index is split and installed on the sub-shards
5. Sub-shards apply buffered updates, then become “active” leaders, and the old shard becomes “inactive”
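The ordering of steps 2 to 5 can be sketched with a toy model. Assumptions are flagged in the comments: documents are keyed directly by an integer hash, the parent's range is cut at its midpoint, and `SubShard` and `split_shard` are hypothetical names rather than Solr internals.

```python
class SubShard:
    def __init__(self, name, low, high):
        self.name, self.low, self.high = name, low, high
        self.state = "construction"   # step 2: created in "construction" state
        self.buffered = []
        self.index = {}

    def covers(self, hash_value):
        return self.low <= hash_value <= self.high


def split_shard(parent_index, low, high, updates_during_split):
    """parent_index and updates are keyed by hash (ASSUMPTION for brevity)."""
    middle = (low + high) // 2  # ASSUMPTION: the range is cut at its midpoint
    subs = [SubShard("Shard2_0", low, middle), SubShard("Shard2_1", middle + 1, high)]

    # Step 3: updates arriving during the split are forwarded and buffered.
    for hash_value, doc in updates_during_split:
        for sub in subs:
            if sub.covers(hash_value):
                sub.buffered.append((hash_value, doc))

    # Step 4: the parent's index is split and installed on the sub-shards.
    for hash_value, doc in parent_index.items():
        for sub in subs:
            if sub.covers(hash_value):
                sub.index[hash_value] = doc

    # Step 5: buffered updates are applied last, then the sub-shards go active.
    for sub in subs:
        for hash_value, doc in sub.buffered:
            sub.index[hash_value] = doc
        sub.buffered.clear()
        sub.state = "active"
    return subs, "inactive"  # second value: the old shard's new state
```

Applying the buffered updates after installing the split index is what keeps a write that arrived mid-split from being overwritten by older data copied from the parent.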
Solr 4.4: Schemaless
• “Schemaless” usually just means that the client(s) have an implicit schema.
• “No Schema” is impossible for anything based on Lucene
- A field must be indexed the same way across documents
• Dynamic fields: convention over configuration
- Only pre-define types of fields, not fields themselves
- No guessing. Any field name ending in _i is an integer
• “Guessed Schema” or “Type Guessing”
- For previously unknown fields, guess using JSON type as a hint
- Coming soon (4.4?) based on the Dynamic Schema work
• Many disadvantages to guessing
- Lose ability to catch field naming errors
- Can't optimize based on types
- Guessing incorrectly means having to start over
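The difference between the two approaches can be shown in a few lines. This is a sketch only: the `_i` rule is from the slide, while the other suffixes in `SUFFIX_TYPES` and the type names returned by `guess_type` are assumptions chosen for illustration, and both function names are hypothetical.

```python
# Dynamic fields: the suffix decides the type. Only "_i" is from the slide;
# the other suffixes are illustrative assumptions.
SUFFIX_TYPES = {"_i": "int", "_l": "long", "_f": "float", "_s": "string", "_b": "boolean"}


def type_by_convention(field_name):
    """No guessing: an unknown suffix is an error, which catches naming typos."""
    for suffix, field_type in SUFFIX_TYPES.items():
        if field_name.endswith(suffix):
            return field_type
    raise KeyError(f"no dynamic field rule matches {field_name!r}")


def guess_type(value):
    """Type guessing: use the JSON value's type as a hint for a new field."""
    if isinstance(value, bool):   # check bool first, since bool is an int in Python
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    return "string"
```

With the convention, a misspelled field such as `prcie` fails loudly; with guessing, it is silently accepted as a new field, and a numeric value first sent as the string `"42"` locks the field in as a string.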
Bangalore Apache Lucene/Solr Meetup
• 1 meetup already
• Almost 150 members
• Another one coming up soon…
• Join us at: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/
Twitter: @anshumgupta
LinkedIn: http://www.linkedin.com/in/anshumgupta
Blog: http://www.anshumgupta.net
Thanks!

SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore

  • 1. The Fifth Elephant 2013, Bangalore 12th July 2013 SolrCloud and NoSQL Anshum Gupta
  • 2. The Fifth Elephant 2013, Bangalore 12th July 20132 Who am I? • Anshum Gupta • Search and related stuff for around 8 years now • Apache Lucene since 2006, Solr since 2010 • Currently: • Helped launch the first AWS search service, CloudSearch. • Places I‟ve worked at:
  • 3. The Fifth Elephant 2013, Bangalore 12th July 2013 Big Data • Real Value = Process + Store + Search • Search - No longer expensive - Affordable - Necessity - Can get as complicated as you‟d want it to get. 3 Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Load s of Data Data Search
  • 4. The Fifth Elephant 2013, Bangalore 12th July 2013 NoSQL Databases •Wikipedia says: A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used. •Non-traditional data stores •Doesn‟t use / isn‟t designed around SQL •May not give full ACID guarantees - Offers other advantages such as greater scalability as a tradeoff •Distributed, fault-tolerant architecture
  • 5. The Fifth Elephant 2013, Bangalore 12th July 2013 DB Rankings: Overall Source: http://db-engines.com/en/ranking
  • 6. The Fifth Elephant 2013, Bangalore 12th July 2013 Search Engine Rankings Source: http://db-engines.com/en/ranking/search+engine
  • 7. The Fifth Elephant 2013, Bangalore 12th July 2013 MongoDB • Data Model: BSON • Distributed Model: Sharded master-slave async replication. • Consistency: Per table write lock. • Search: - Built in full text search, large gaps with „search‟ players. - Alternate and popular solution: Use another search solution along with MongoDB, Solr?. Consistency issues and more.
  • 8. The Fifth Elephant 2013, Bangalore 12th July 2013 Cassandra • Data Model: Column based data store. • Distributed Model: Uses consistent hashing for distributed updates. • Consistency: Timestamps for consistency. • Search - Lucandra : Lucene based search. - Solandra : Solr based search.
  • 9. The Fifth Elephant 2013, Bangalore 12th July 20139 • Implements principles from the Amazon Dynamo paper. • Riak Search - Distributed index and full-text search engine. - Merge Index – Storage backed used by Riak Search. It‟s a pure Erlang storage format and among other things uses the Apache Lucene file format. - Riak Solr – Adds a subset of Apache Solr HTTP capabilities to Riak Search. • Yokozuna - “next generation of Riak Search that marries Riak with Apache Solr”. - Sits alongside of Riak.
  • 10. The Fifth Elephant 2013, Bangalore 12th July 201310 The story so far… • Different approaches for: - Data Model - Distributed Update handling - Consistency management • Work reasonably well on different fronts as far as storage is concerned. • Search: - There‟s barely anything native and in the core. - (Almost) Everyone is trying to fuse together with Lucene/Solr.
  • 11. The Fifth Elephant 2013, Bangalore 12th July 201311 Adding Search to NoSQL • To begin with, wasn‟t built for that • Compromises • Integration is the buzzword. • Lucandra, Solandra…No strong contender yet.
  • 12. The Fifth Elephant 2013, Bangalore 12th July 201312 Adding NoSQL to Search • Already store documents • With growing data, more intuitive for this to happen • More intuitive = makes more sense = easier (perhaps) • No key player as yet.
  • 13. The Fifth Elephant 2013, Bangalore 12th July 2013
  • 14. The Fifth Elephant 2013, Bangalore 12th July 2013 Apache Solr 4 at a glance • Document Oriented NoSQL Search Server - Data-format agnostic (JSON, XML, CSV, binary) - Schema-less options (more coming soon) • Distributed - Multi-tenanted • Fault Tolerant - HA + No single points of failure • Atomic Updates • Optimistic Concurrency • Near Real-time Search • Full-Text search + Hit Highlighting • Tons of specialized queries: Faceted search, grouping, pseudo-join, spatial search, functions The desire for these features drove some of the “SolrCloud” architecture
  • 15. The Fifth Elephant 2013, Bangalore 12th July 2013 SolrCloud Design Goals • Automatic Distributed Indexing • HA for Writes • Durable Writes • Near Real-time Search • Real-time get • Optimistic Concurrency
  • 16. The Fifth Elephant 2013, Bangalore 12th July 2013 SolrCloud • Distributed Indexing designed from the ground up to accommodate desired features • CAP Theorem - Consistency, Availability, Partition Tolerance (saying goes “choose 2”) - Reality: Must handle P – the real choice is tradeoffs between C and A • Ended up with a CP system (roughly) - Value Consistency over Availability - Eventual consistency is incompatible with optimistic concurrency - Closest to MongoDB in architecture • We still do well with Availability - All N replicas of a shard must go down before we lose writability for that shard - For a network partition, the “big” partition remains active (i.e. Availability isn't “on” or “off”)
  • 17. The Fifth Elephant 2013, Bangalore 12th July 2013 SolrCloud [Diagram: a query to http://.../solr/collection1/query?q=awesome fans out as load-balanced sub-requests to one replica each of shard1 and shard2 (three replicas per shard). A five-node ZooKeeper quorum stores /configs/myconf (solrconfig.xml, schema.xml), /clusterstate.json, /aliases.json, /livenodes (server1:8983/solr, server2:8983/solr) and /collections/collection1 (configName=myconf) with /shards/shard1 (server1:8983/solr, server2:8983/solr) and /shards/shard2 (server3:8983/solr, server4:8983/solr).] ZooKeeper holds cluster state • Nodes in the cluster • Collections in the cluster • Schema & config for each collection • Shards in each collection • Replicas in each shard • Collection aliases
  • 18. The Fifth Elephant 2013, Bangalore 12th July 2013 Distributed Indexing http://.../solr/collection1/update • Update sent to any node • Solr determines what shard the document is on, and forwards to shard leader • Shard Leader versions document and forwards to all other shard replicas • HA for updates (if one leader fails, another takes its place) [Diagram: a document update arriving at Shard1 and Shard2, which hold Replica1 to Replica4; the legend distinguishes leaders from non-leading replicas.]
  • 19. The Fifth Elephant 2013, Bangalore 12th July 2013 Optimistic Concurrency • Conditional update based on document version Solr 2. Modify document, retaining _version_ 4. Go back to step #1 if fail code=409 client
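The retry loop on the Optimistic Concurrency slide can be sketched as follows. This is a minimal illustration, not Solr's client API: `InMemoryStore`, `VersionConflict` and `update_with_retry` are hypothetical names, and the transcript does not capture steps 1 and 3, so the sketch assumes they are "fetch the document" and "submit the conditional update".

```python
class VersionConflict(Exception):
    """Stands in for the HTTP 409 response returned on a version mismatch."""
    code = 409


class InMemoryStore:
    """Hypothetical stand-in for a Solr collection; not a real client."""

    def __init__(self):
        self._docs = {}
        self._clock = 0

    def get(self, doc_id):
        # Return a copy so callers cannot mutate stored state directly.
        return dict(self._docs[doc_id])

    def add(self, doc):
        # Unconditional write: assign a fresh, larger version number.
        self._clock += 1
        stored = dict(doc, _version_=self._clock)
        self._docs[doc["id"]] = stored
        return stored["_version_"]

    def update(self, doc):
        # Conditional write: succeed only if the caller saw the latest version.
        current = self._docs[doc["id"]]
        if doc.get("_version_") != current["_version_"]:
            raise VersionConflict(doc["id"])
        return self.add(doc)


def update_with_retry(store, doc_id, modify, max_attempts=5):
    for _ in range(max_attempts):
        doc = store.get(doc_id)        # step 1 (assumed): fetch doc and its _version_
        modify(doc)                    # step 2: modify document, retaining _version_
        try:
            return store.update(doc)   # step 3 (assumed): send the conditional update
        except VersionConflict:
            continue                   # step 4: on 409, go back to step 1
    raise RuntimeError("gave up after repeated version conflicts")
```

The point of the design is that no lock is held between the read and the write; a concurrent writer simply forces one more trip around the loop.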
  • 20. The Fifth Elephant 2013, Bangalore 12th July 2013 Distributed Query Requests • Distributed query across all shards in the collection: http://localhost:8983/solr/collection1/query?q=foo • Explicitly specify node addresses to load-balance across: shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr - Equivalent nodes are separated by “|” - Different phases of the same distributed request use the same node • Specify logical shards to search across: shards=NY,NJ,CT • Specify multiple collections to search across: collection=collection1,collection2 • public CloudSolrServer(String zkHost) - ZK-aware SolrJ Java client that load-balances across all nodes in the cluster - Calculates where a document belongs and sends it directly to the shard leader (new)
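The `shards` parameter syntax from the slide above (commas between shards, “|” between equivalent nodes of one shard) can be assembled like this. A small sketch only; `shards_param` is a hypothetical helper name, and the addresses are the ones shown on the slide.

```python
def shards_param(shard_groups):
    """Join equivalent nodes with '|' and distinct shards with ','."""
    return ",".join("|".join(nodes) for nodes in shard_groups)


# Two shards, each with two equivalent nodes to load-balance across.
value = shards_param([
    ["localhost:8983/solr", "localhost:8900/solr"],
    ["localhost:7574/solr", "localhost:7500/solr"],
])
print("shards=" + value)
```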
  • 21. The Fifth Elephant 2013, Bangalore 12th July 2013 Document Routing [Diagram: a hash ring for numShards=4 and router=compositeId, divided into the ranges 80000000-bfffffff, c0000000-ffffffff, 00000000-3fffffff and 40000000-7fffffff across shard1 to shard4. id = BigCo!doc5 hashes (MurmurHash3) to 9f27 3c71, which falls in shard1's range. A query q=my_query with shard.keys=BigCo! covers hashes 9f27 0000 to 9f27 ffff and is routed to shard1.]
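The routing arithmetic on the Document Routing slide can be sketched as below. Assumptions are flagged in the comments: hashes are passed in as plain integers because MurmurHash3 is not in Python's standard library, the function names are hypothetical, and only shard1's range is confirmed by the slide's example (the pairing of the remaining ranges with shard names is assumed).

```python
# Hash ranges for numShards=4. Only shard1's range is confirmed by the slide;
# the assignment of the other three ranges to shard names is an assumption.
SHARD_RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),
    "shard2": (0xC0000000, 0xFFFFFFFF),
    "shard3": (0x00000000, 0x3FFFFFFF),
    "shard4": (0x40000000, 0x7FFFFFFF),
}


def composite_hash(prefix_hash, doc_hash):
    """Upper 16 bits from the route key's hash, lower 16 bits from the doc id's hash."""
    return (prefix_hash & 0xFFFF0000) | (doc_hash & 0x0000FFFF)


def shard_for_hash(h):
    """Find the shard whose range contains the 32-bit hash h."""
    for name, (lo, hi) in SHARD_RANGES.items():
        if lo <= h <= hi:
            return name
    raise ValueError(f"hash {h:#010x} is outside every shard range")


def prefix_range(prefix_hash):
    """All hashes a route key such as 'BigCo!' can produce."""
    lo = prefix_hash & 0xFFFF0000
    return lo, lo | 0x0000FFFF


def shards_for_prefix(prefix_hash):
    """Shards whose range overlaps the route key's hash range."""
    lo, hi = prefix_range(prefix_hash)
    return sorted(
        name for name, (s_lo, s_hi) in SHARD_RANGES.items()
        if s_lo <= hi and lo <= s_hi
    )


# Hypothetical hash values chosen to reproduce the slide's 9f27 3c71.
doc_hash = composite_hash(0x9F27ABCD, 0x12343C71)
print(f"{doc_hash:08x}", shard_for_hash(doc_hash), shards_for_prefix(0x9F27ABCD))
```

Because every document sharing the `BigCo!` prefix lands in one 65,536-value slice of the ring, a query restricted to that prefix only needs the shards overlapping that slice.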
  • 22. The Fifth Elephant 2013, Bangalore 12th July 2013 Durable Writes • Lucene flushes writes to disk on a “commit” - Uncommitted docs are lost on a crash (at the Lucene level) • Solr 4 maintains its own transaction log - Contains uncommitted documents - Services real-time get requests - Recovery (log replay on restart) - Supports distributed “peer sync” • Writes forwarded to multiple shard replicas - A replica can go away forever w/o collection data loss - A replica can do a fast “peer sync” if it's only slightly out of date - A replica can do a full index replication (copy) from a leader.
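The role of the transaction log described above can be illustrated with a toy model. This is a heavily simplified sketch under stated assumptions, not Solr's implementation: `TinyIndex` is a hypothetical class, the "log" is an in-memory list that is simply assumed to survive a crash, and peer sync and replication are left out.

```python
class TinyIndex:
    """Toy model of commit, transaction log, real-time get and log replay."""

    def __init__(self):
        self.committed = {}  # documents made durable by a commit
        self.pending = {}    # uncommitted documents, lost on a crash
        self.tlog = []       # append-only log, assumed to survive a crash

    def add(self, doc):
        # Log the write first, then apply it to the in-memory state.
        self.tlog.append(dict(doc))
        self.pending[doc["id"]] = dict(doc)

    def realtime_get(self, doc_id):
        # Uncommitted writes are visible to real-time get before any commit.
        if doc_id in self.pending:
            return self.pending[doc_id]
        return self.committed.get(doc_id)

    def commit(self):
        # Flush pending documents; the logged entries are no longer needed.
        self.committed.update(self.pending)
        self.pending.clear()
        self.tlog.clear()

    def crash_and_restart(self):
        # The crash drops uncommitted in-memory documents...
        self.pending.clear()
        # ...and recovery replays the log to restore them.
        for doc in self.tlog:
            self.pending[doc["id"]] = dict(doc)


index = TinyIndex()
index.add({"id": "1", "title": "committed doc"})
index.commit()
index.add({"id": "2", "title": "uncommitted doc"})
index.crash_and_restart()
print(index.realtime_get("2"))
```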
  • 23. The Fifth Elephant 2013, Bangalore 12th July 2013 Collections API • Create a new document collection: http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3 • Operations: CREATE, DELETE, ALIAS, SPLITSHARD, DELETESHARD, RELOAD
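The CREATE request shown on the Collections API slide is an ordinary HTTP URL with query parameters, so it can be assembled programmatically. A minimal sketch; `collections_url` is a hypothetical helper, and the host, collection name and parameter values are the ones from the slide. It only builds the URL and does not contact a server.

```python
from urllib.parse import urlencode


def collections_url(base, action, **params):
    """Build a Collections API URL for the given action and parameters."""
    query = urlencode({"action": action, **params})
    return f"{base}/admin/collections?{query}"


url = collections_url(
    "http://localhost:8983/solr",
    "CREATE",
    name="mycollection",
    numShards=4,
    replicationFactor=3,
)
print(url)
```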
  • 24. The Fifth Elephant 2013, Bangalore 12th July 2013 Solr 4.3: Seamless Online Shard Splitting 1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2 2. New sub-shards created in “construction” state 3. Leader starts forwarding applicable updates, which are buffered by the sub-shards 4. Leader index is split and installed on the sub-shards 5. Sub-shards apply buffered updates, then become “active” leaders, and the old shard becomes “inactive” [Diagram: Shard1, Shard2 and Shard3, each with a leader and a replica; an update reaches Shard2, which is split into Shard2_0 and Shard2_1.]
  • 25. The Fifth Elephant 2013, Bangalore 12th July 2013 Solr 4.4: Schemaless • “Schemaless” usually means that the client(s) have an implicit schema. • “No Schema” is impossible for anything based on Lucene - A field must be indexed the same way across documents • Dynamic fields: convention over configuration - Only pre-define types of fields, not fields themselves - No guessing. Any field name ending in _i is an integer • “Guessed Schema” or “Type Guessing” - For previously unknown fields, guess using JSON type as a hint - Coming soon (4.4?) based on the Dynamic Schema work • Many disadvantages to guessing - Lose ability to catch field naming errors - Can't optimize based on types - Guessing incorrectly means having to start over
  • 26. The Fifth Elephant 2013, Bangalore 12th July 2013 Bangalore Apache Lucene/Solr Meetup • 1 meetup already • Almost 150 members • Another one coming up soon… • Join us at: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/
  • 27. The Fifth Elephant 2013, Bangalore 12th July 2013 Twitter: @anshumgupta LinkedIn: http://www.linkedin.com/in/anshumgupta Blog: http://www.anshumgupta.net Thanks!

Editor's notes

  1. - You can see the range of any shard in clusterstate.json. Hashing based on the “id” only has some advantages vs. hashing based on a different field. Clients can be more generic and not know/care what addressing scheme is being used when dealing with individual documents. The “id” always fully defines where a document lives. Enables highly scalable multi-tenanted applications.