Presentation from my talk at the big data conference, Fifth Elephant 2013, Bangalore.
It covers how Solr 4 can be used as a data store, especially in cases where the data needs to be text-searchable.
SolrCloud and NoSQL at the Fifth Elephant 2013, Bangalore
1. The Fifth Elephant 2013, Bangalore
12th July 2013
SolrCloud and NoSQL
Anshum Gupta
2. Who am I?
• Anshum Gupta
• Search and related stuff for around 8 years now
• Apache Lucene since 2006, Solr since 2010
• Currently:
• Helped launch the first AWS search service, CloudSearch.
• Places I've worked at:
3. Big Data
• Real Value = Process + Store + Search
• Search
- No longer expensive
- Affordable
- Necessity
- Can get as complicated as you'd want it to get.
[Figure: many blocks labeled “Loads of Data”, with the labels “Data” and “Search”]
4. NoSQL Databases
• Wikipedia says:
A NoSQL database provides a mechanism for storage and retrieval of data that use looser consistency models than traditional relational databases in order to achieve horizontal scaling and higher availability. Some authors refer to them as "Not only SQL" to emphasize that some NoSQL systems do allow SQL-like query language to be used.
• Non-traditional data stores
• Don't use / aren't designed around SQL
• May not give full ACID guarantees
- Offer other advantages, such as greater scalability, as a tradeoff
• Distributed, fault-tolerant architecture
5. DB Rankings: Overall
Source: http://db-engines.com/en/ranking
6. Search Engine Rankings
Source: http://db-engines.com/en/ranking/search+engine
7. MongoDB
• Data Model: BSON
• Distributed Model: Sharded, with asynchronous master-slave replication.
• Consistency: Per-database write lock.
• Search:
- Built-in full-text search, but with large gaps compared to dedicated “search” players.
- Popular alternative: use a separate search solution (Solr?) alongside MongoDB, at the cost of consistency issues and more.
8. Cassandra
• Data Model: Column based data store.
• Distributed Model: Uses consistent hashing for distributed updates.
• Consistency: Timestamps for consistency.
• Search
- Lucandra: Lucene-based search.
- Solandra: Solr-based search.
9. Riak
• Implements principles from the Amazon Dynamo paper.
• Riak Search: distributed index and full-text search engine.
- Merge Index: storage backend used by Riak Search. It's a pure Erlang storage format and, among other things, uses the Apache Lucene file format.
- Riak Solr: adds a subset of Apache Solr HTTP capabilities to Riak Search.
• Yokozuna
- “next generation of Riak Search that marries Riak with Apache Solr”.
- Sits alongside Riak.
10. The story so far…
• Different approaches for:
- Data Model
- Distributed Update handling
- Consistency management
• Work reasonably well on different fronts as far as storage is concerned.
• Search:
- There's barely anything native and in the core.
- (Almost) everyone is trying to fuse together with Lucene/Solr.
11. Adding Search to NoSQL
• NoSQL stores weren't built for search to begin with
• Compromises
• Integration is the buzzword.
• Lucandra, Solandra… no strong contender yet.
12. Adding NoSQL to Search
• Search engines already store documents
• With growing data, it is more intuitive for this to happen
• More intuitive = makes more sense = easier (perhaps)
• No key player as yet.
14. Apache Solr 4 at a glance
• Document Oriented NoSQL Search Server
- Data-format agnostic (JSON, XML, CSV, binary)
- Schema-less options (more coming soon)
• Distributed
- Multi-tenanted
• Fault Tolerant
- HA + No single points of failure
• Atomic Updates
• Optimistic Concurrency
• Near Real-time Search
• Full-Text search + Hit Highlighting
• Tons of specialized queries: faceted search, grouping, pseudo-join, spatial search, functions
The desire for these features drove some of the “SolrCloud” architecture.
15. SolrCloud Design Goals
• Automatic Distributed Indexing
• HA for Writes
• Durable Writes
• Near Real-time Search
• Real-time get
• Optimistic Concurrency
16. SolrCloud
• Distributed indexing designed from the ground up to accommodate desired features
• CAP Theorem
- Consistency, Availability, Partition tolerance (the saying goes “choose 2”)
- Reality: P must be handled, so the real choice is a tradeoff between C and A
• Ended up with a CP system (roughly)
- Values consistency over availability
- Eventual consistency is incompatible with optimistic concurrency
- Closest to MongoDB in architecture
• We still do well with availability
- All N replicas of a shard must go down before we lose writability for that shard
- For a network partition, the “big” partition remains active (i.e. availability isn't “on” or “off”)
17. SolrCloud
[Diagram: a collection with two shards (shard1, shard2), each with replica1, replica2 and replica3, coordinated by a ZooKeeper quorum of five ZK nodes]
ZooKeeper tree:
/configs
    /myconf
        solrconfig.xml
        schema.xml
/clusterstate.json
/aliases.json
/livenodes
    server1:8983/solr
    server2:8983/solr
/collections
    /collection1
        configName=myconf
        /shards
            /shard1
                server1:8983/solr
                server2:8983/solr
            /shard2
                server3:8983/solr
                server4:8983/solr
Query: http://.../solr/collection1/query?q=awesome (fanned out as load-balanced sub-requests)
ZooKeeper holds cluster state
• Nodes in the cluster
• Collections in the cluster
• Schema & config for each collection
• Shards in each collection
• Replicas in each shard
• Collection aliases
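As a rough illustration of how a client could use this state, the sketch below models the ZooKeeper tree from the slide as a plain Python dictionary and picks one live replica per shard for a query sub-request. The dictionary layout and the name `pick_replicas` are assumptions made for illustration; this is not the real clusterstate.json format, and the slide lists only the first two live nodes.

```python
import random

# Hypothetical in-memory copy of the cluster state shown on the slide.
# Assumption: all four servers are live (the slide lists only two).
CLUSTER_STATE = {
    "livenodes": [
        "server1:8983/solr", "server2:8983/solr",
        "server3:8983/solr", "server4:8983/solr",
    ],
    "collections": {
        "collection1": {
            "configName": "myconf",
            "shards": {
                "shard1": ["server1:8983/solr", "server2:8983/solr"],
                "shard2": ["server3:8983/solr", "server4:8983/solr"],
            },
        }
    },
}


def pick_replicas(state, collection, rng=random):
    """Choose one live replica per shard, as a load balancer might."""
    live = set(state["livenodes"])
    chosen = {}
    for shard, replicas in state["collections"][collection]["shards"].items():
        candidates = [r for r in replicas if r in live]
        if not candidates:
            raise RuntimeError(f"no live replica for {shard}")
        chosen[shard] = rng.choice(candidates)
    return chosen


if __name__ == "__main__":
    print(pick_replicas(CLUSTER_STATE, "collection1"))
```

A query for collection1 would then be sent as one sub-request to each chosen replica and the partial results merged.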
18. Distributed Indexing
[Diagram: Shard1 (Replica1, Replica2) and Shard2 (Replica3, Replica4)]
http://.../solr/collection1/update
• Update sent to any node
• Solr determines what shard the document is on, and forwards to shard leader
• Shard Leader versions document and forwards to all other shard replicas
• HA for updates (if one leader fails, another takes its place)
[Diagram legend: document update, leader, non-leading replica]
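The update flow above can be sketched with a toy model. Everything here is hypothetical: the `Shard` class, the byte-sum placeholder hash in `route`, and the rule that the first live replica acts as leader are simplifications for illustration, not how Solr elects leaders or routes documents.

```python
import itertools


class Shard:
    """Toy shard: the first live replica acts as leader."""

    def __init__(self, name, replicas):
        self.name = name
        # replica name -> {doc_id: (version, doc)}
        self.replicas = {r: {} for r in replicas}
        self.down = set()
        self._versions = itertools.count(1)

    def leader(self):
        for replica in self.replicas:
            if replica not in self.down:
                return replica
        raise RuntimeError(f"{self.name}: all replicas are down")

    def update(self, doc):
        leader = self.leader()
        version = next(self._versions)  # the leader versions the document
        for replica, store in self.replicas.items():
            if replica not in self.down:  # forward to every live replica
                store[doc["id"]] = (version, doc)
        return leader, version


def route(shards, doc):
    """Pick the shard for a document (placeholder hash, not Solr's router)."""
    return shards[sum(doc["id"].encode()) % len(shards)]


if __name__ == "__main__":
    shards = [Shard("Shard1", ["Replica1", "Replica2"]),
              Shard("Shard2", ["Replica3", "Replica4"])]
    doc = {"id": "doc1"}
    print(route(shards, doc).update(doc))
```

Marking the current leader as down and sending another update shows the failover bullet: the next live replica takes over and writes continue.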
19. Optimistic Concurrency
• Conditional update based on document version
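[Diagram: request loop between client and Solr; only steps 2 and 4 survive in this transcript]
2. Modify document, retaining _version_
4. Go back to step #1 if the update fails with code=409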
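A minimal sketch of the retry loop, run against an in-memory stand-in rather than a real Solr server. `FakeSolr`, `VersionConflict` and `update_with_retry` are hypothetical names; steps 1 and 3 (fetch the document, submit the update) are not legible in the transcript and are assumed here.

```python
class VersionConflict(Exception):
    """Stands in for an HTTP 409 response."""
    code = 409


class FakeSolr:
    """In-memory stand-in for a Solr collection (not a real client)."""

    def __init__(self):
        self.docs = {}
        self._next_version = 1

    def get(self, doc_id):
        return dict(self.docs[doc_id])

    def update(self, doc):
        current = self.docs.get(doc["id"])
        # Reject the write if the caller's _version_ is stale.
        if current is not None and current["_version_"] != doc.get("_version_"):
            raise VersionConflict(doc["id"])
        stored = dict(doc, _version_=self._next_version)
        self._next_version += 1
        self.docs[doc["id"]] = stored
        return stored["_version_"]


def update_with_retry(solr, doc_id, modify, max_attempts=5):
    for _ in range(max_attempts):
        doc = solr.get(doc_id)        # step 1 (assumed): fetch the document
        modify(doc)                   # step 2: modify, retaining _version_
        try:
            return solr.update(doc)   # step 3 (assumed): conditional update
        except VersionConflict:
            continue                  # step 4: start over on code=409
    raise RuntimeError("gave up after repeated version conflicts")
```

The client never holds a lock; it only pays for a retry when another writer got there first.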
20. Distributed Query Requests
Distributed query across all shards in the collection:
http://localhost:8983/solr/collection1/query?q=foo
Explicitly specify node addresses to load-balance across:
shards=localhost:8983/solr|localhost:8900/solr,localhost:7574/solr|localhost:7500/solr
- Equivalent nodes are separated by “|”
- Different phases of the same distributed request use the same node
Specify logical shards to search across:
shards=NY,NJ,CT
Specify multiple collections to search across:
collection=collection1,collection2
public CloudSolrServer(String zkHost)
- ZK-aware SolrJ Java client that load-balances across all nodes in the cluster
- Calculates where a document belongs and sends it directly to the shard leader (new)
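The request parameters above can be assembled with the Python standard library. This is only a sketch of URL construction; `shards_param` and `query_url` are hypothetical helper names, and nothing is sent over the network.

```python
from urllib.parse import urlencode


def shards_param(groups):
    """Join equivalent nodes with "|" and separate shard groups with ","."""
    return ",".join("|".join(nodes) for nodes in groups)


def query_url(base, collection, q, **params):
    """Build a /query URL; extra keyword arguments become request parameters."""
    # Keep ",", "|", ":" and "/" readable instead of percent-encoding them.
    query = urlencode({"q": q, **params}, safe=",|:/")
    return f"{base}/{collection}/query?{query}"


if __name__ == "__main__":
    base = "http://localhost:8983/solr"
    nodes = shards_param([
        ["localhost:8983/solr", "localhost:8900/solr"],
        ["localhost:7574/solr", "localhost:7500/solr"],
    ])
    print(query_url(base, "collection1", "foo"))
    print(query_url(base, "collection1", "foo", shards=nodes))
    print(query_url(base, "collection1", "foo", shards="NY,NJ,CT"))
```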
21. Document Routing
[Diagram: hash ring with numShards=4 and router=compositeId]
Four shards (shard1, shard2, shard3, shard4), each owning one of the hash ranges 00000000-3fffffff, 40000000-7fffffff, 80000000-bfffffff and c0000000-ffffffff
id = BigCo!doc5 -> hash 9f273c71 (MurmurHash3) -> shard1
q=my_query&shard.keys=BigCo! -> hash range 9f270000 to 9f27ffff -> shard1
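The range lookup in the diagram can be sketched as follows. The hash value 9f273c71 is taken from the slide as given rather than recomputed. The slide's example confirms only that shard1 owns 80000000-bfffffff; the assignment of the other three ranges below is an assumption, as are the function names.

```python
# Inclusive hash ranges per shard. Only shard1's range is confirmed by the
# slide's example; the other three assignments are assumed for illustration.
SHARD_RANGES = {
    "shard1": (0x80000000, 0xBFFFFFFF),
    "shard2": (0xC0000000, 0xFFFFFFFF),
    "shard3": (0x00000000, 0x3FFFFFFF),
    "shard4": (0x40000000, 0x7FFFFFFF),
}


def shard_for_hash(h, ranges=SHARD_RANGES):
    """Return the shard whose range contains the 32-bit hash h."""
    for shard, (lo, hi) in ranges.items():
        if lo <= h <= hi:
            return shard
    raise ValueError(f"hash {h:08x} is outside every range")


def prefix_range(prefix_hash16):
    """Range of hashes sharing the given top 16 bits (the "BigCo!" part)."""
    lo = prefix_hash16 << 16
    return lo, lo | 0xFFFF


def shards_for_range(lo, hi, ranges=SHARD_RANGES):
    """All shards whose range overlaps [lo, hi]."""
    return [s for s, (rlo, rhi) in ranges.items() if rlo <= hi and lo <= rhi]


if __name__ == "__main__":
    print(shard_for_hash(0x9F273C71))               # the document BigCo!doc5
    print(shards_for_range(*prefix_range(0x9F27)))  # a shard.keys=BigCo! query
```

Because every document with the same prefix shares the top 16 bits of its hash, a query restricted with shard.keys only has to visit the shards that overlap that narrow range.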
22. Durable Writes
• Lucene flushes writes to disk on a “commit”
- Uncommitted docs are lost on a crash (at the Lucene level)
• Solr 4 maintains its own transaction log
- Contains uncommitted documents
- Services real-time get requests
- Recovery (log replay on restart)
- Supports distributed “peer sync”
• Writes are forwarded to multiple shard replicas
- A replica can go away forever without collection data loss
- A replica can do a fast “peer sync” if it's only slightly out of date
- A replica can do a full index replication (copy) from a leader
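The role of the transaction log can be illustrated with a toy model. `ToyIndex` is hypothetical and keeps everything in memory; it only mimics the behaviour listed above (log before buffering, real-time get of uncommitted documents, replay on restart) and is not how Solr stores data.

```python
class ToyIndex:
    """Toy model of an index with a transaction log (not Solr's code)."""

    def __init__(self):
        self.committed = {}  # stands in for segments flushed to disk
        self.pending = {}    # in-memory only, lost on a crash
        self.tlog = []       # append-only log, assumed to be durable

    def add(self, doc):
        self.tlog.append(doc)          # log first, then buffer in memory
        self.pending[doc["id"]] = doc

    def realtime_get(self, doc_id):
        # Uncommitted documents are visible to a get by id.
        if doc_id in self.pending:
            return self.pending[doc_id]
        return self.committed.get(doc_id)

    def commit(self):
        self.committed.update(self.pending)
        self.pending.clear()
        self.tlog.clear()              # committed docs no longer need the log

    def crash(self):
        self.pending.clear()           # in-memory state is gone

    def recover(self):
        for doc in self.tlog:          # replay the log on restart
            self.pending[doc["id"]] = doc
```

Without the log, the `crash()` call would lose every document added since the last commit; with it, `recover()` brings them back.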
23. Collections API
Create a new document collection:
http://localhost:8983/solr/admin/collections?action=CREATE&name=mycollection&numShards=4&replicationFactor=3
Actions shown: CREATE, DELETE, ALIAS, SPLITSHARD, DELETESHARD, RELOAD
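A small sketch of building such Collections API URLs programmatically. `collections_url` is a hypothetical helper; it only formats the URL and does not contact a server.

```python
from urllib.parse import urlencode


def collections_url(base, action, **params):
    """Format a Collections API URL for the given action and parameters."""
    return f"{base}/admin/collections?" + urlencode({"action": action, **params})


if __name__ == "__main__":
    base = "http://localhost:8983/solr"
    print(collections_url(base, "CREATE", name="mycollection",
                          numShards=4, replicationFactor=3))
    print(collections_url(base, "SPLITSHARD", collection="mycollection",
                          shard="Shard2"))
```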
24. Solr 4.3: Seamless Online Shard Splitting
[Diagram: Shard1, Shard2 and Shard3, each with a leader and a replica; Shard2 is split into sub-shards Shard2_0 and Shard2_1 while updates keep arriving]
1. http://localhost:8983/solr/admin/collections?action=SPLITSHARD&collection=mycollection&shard=Shard2
2. New sub-shards are created in the “construction” state
3. The leader starts forwarding applicable updates, which are buffered by the sub-shards
4. The leader's index is split and installed on the sub-shards
5. Sub-shards apply the buffered updates and become “active” leaders; the old shard becomes “inactive”
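The split steps can be mimicked with a toy model. This is a sketch under assumptions: the hash range given to Shard2 is hypothetical, the range is simply halved, and `SubShard` keeps documents in a dictionary keyed by hash. It illustrates the buffering idea only, not Solr's implementation.

```python
def split_range(lo, hi):
    """Split an inclusive hash range into two contiguous halves."""
    mid = lo + (hi - lo) // 2
    return (lo, mid), (mid + 1, hi)


class SubShard:
    """Toy sub-shard that buffers updates until the split is installed."""

    def __init__(self, name, hash_range):
        self.name = name
        self.hash_range = hash_range
        self.state = "construction"      # step 2
        self.buffer = []
        self.docs = {}

    def owns(self, h):
        lo, hi = self.hash_range
        return lo <= h <= hi

    def receive(self, h, doc):
        if self.state == "construction":
            self.buffer.append((h, doc))  # step 3: buffer forwarded updates
        else:
            self.docs[h] = doc

    def install(self, parent_docs):
        # Step 4: take the slice of the parent index this sub-shard owns.
        self.docs = {h: d for h, d in parent_docs.items() if self.owns(h)}
        # Step 5: replay buffered updates, then go active.
        for h, doc in self.buffer:
            self.docs[h] = doc
        self.buffer.clear()
        self.state = "active"
```

Buffering is what keeps the split online: updates that arrive while the index is being copied are not lost, and they overwrite the older copies once the sub-shard is installed.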
25. Solr 4.4: Schemaless
• “Schemaless” normally just means that the client(s) have an implicit schema.
• “No Schema” is impossible for anything based on Lucene
- A field must be indexed the same way across documents
• Dynamic fields: convention over configuration
- Only pre-define types of fields, not fields themselves
- No guessing. Any field name ending in _i is an integer
• “Guessed Schema” or “Type Guessing”
- For previously unknown fields, guess using JSON type as a hint
- Coming soon (4.4?) based on the Dynamic Schema work
• Many disadvantages to guessing
- Lose the ability to catch field naming errors
- Can't optimize based on types
- Guessing incorrectly means having to start over
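The difference between the two approaches can be sketched in a few lines. Only the `_i` suffix comes from the slide; the other suffixes, the type names and both function names are assumptions chosen for illustration.

```python
import json

# Suffix -> type. Only "_i" is taken from the slide; the rest are assumed.
DYNAMIC_FIELDS = {"_i": "int", "_l": "long", "_s": "string", "_b": "boolean"}


def type_from_name(field):
    """Dynamic fields: the name alone decides the type, no guessing."""
    for suffix, field_type in DYNAMIC_FIELDS.items():
        if field.endswith(suffix):
            return field_type
    raise KeyError(f"no dynamic field rule matches {field!r}")


def guess_type(value):
    """Type guessing: use the parsed JSON value as a hint."""
    if isinstance(value, bool):   # bool is a subclass of int, check it first
        return "boolean"
    if isinstance(value, int):
        return "long"
    if isinstance(value, float):
        return "double"
    if isinstance(value, str):
        return "string"
    raise TypeError(f"cannot guess a type for {value!r}")


if __name__ == "__main__":
    doc = json.loads('{"price_i": 42, "zip": "560001"}')
    print(type_from_name("price_i"))   # decided by convention
    print(guess_type(doc["zip"]))      # a numeric-looking code guessed as string
```

With guessing, a misspelled field name silently becomes a new field, whereas a name that matches no dynamic field rule is rejected outright.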
26. Bangalore Apache Lucene/Solr Meetup
1 meetup already
Almost 150 members
Another one coming up soon…
Join us at: http://www.meetup.com/Bangalore-Apache-Solr-Lucene-Group/
27. Thanks!
Twitter: @anshumgupta
LinkedIn: http://www.linkedin.com/in/anshumgupta
Blog: http://www.anshumgupta.net
Editor's notes
- You can see the hash range of any shard in clusterstate.json.
- Hashing based only on the “id” has some advantages over hashing on a different field: clients can be more generic and need not know or care what addressing scheme is in use when dealing with individual documents, because the “id” always fully defines where a document lives.
- Enables highly scalable multi-tenanted applications.