Presented by Mark Miller, Software Engineer, Cloudera
As the NoSQL ecosystem looks to integrate great search, great search is naturally beginning to expose many NoSQL features. Will these Goliaths collide? Or will they remain specialized while intermingling – two sides of the same coin?
Come learn about where SolrCloud fits into the NoSQL landscape. What can it do? What will it do? And how will the big data, NoSQL, and search ecosystem evolve? If you are interested in Big Data, NoSQL, distributed systems, CAP theorem and other hype-filled terms, then this talk may be for you.
2. Who Am I?
• Mark Miller: Cloudera employee, Lucene PMC member, Apache member
• Started playing with Lucene in 2006
• Lucene committer since 2008
• Solr committer since 2009
4. Big Data is getting Bigger
• The total Big Data market reached $11.4 billion in 2012
• The Big Data market is projected to reach $18.1 billion in
2013, an annual growth of 61%
• On pace to exceed $47 billion by 2017.
7. Ultimately, the NoSQL market is largely up for
grabs. Each NoSQL database has its related
strengths and weaknesses, and no one NoSQL
database currently “does it all.” Big Data
practitioners must take a number of factors into
consideration when selecting a NoSQL database
to facilitate large-scale transactional workloads,
including scalability, performance, security, and
ease-of-development.
Big Data Vendor Revenue and Market Forecast
(Wikibon)
8. RDBMS
• The classic way to store your data.
• ACID is great, transactions are cool, SQL is well
known and understood.
• Scaling is *hard*, but possible (see Facebook’s
MySQL cluster)
• ‘impedance mismatch’ sucks
9. Search
• Search has been moving from an expensive,
complicated option to an affordable and easier
necessity.
• Lots of data begs for the ability to process it, store it,
and search it.
10. Enterprise Search
Engines
• Verity - acquired by Autonomy in 2005
• FAST - acquired by Microsoft in 2008
• Endeca - acquired by Oracle in 2011
• Autonomy - acquired by HP in 2011
• Vivisimo - acquired by IBM in 2012
11. NoSQL
• Not Only SQL rather than ‘No SQL’
• Except that makes little sense...
• “when ‘NoSQL’ is applied to a database, it refers to
an ill-defined set of mostly open-source databases,
mostly developed in the early 21st century, and
mostly not using SQL.” - NoSQL Distilled
17. When it comes to NoSQL,
Open Source rules the
roost.
• I won’t be talking about any solution that is not
based on Open Source - only because those
solutions are not popular.
• “there’s a notion that NoSQL is an open-source
phenomenon.” - NoSQL Distilled
18. The 2013 Future of Open
Source Survey Results
Black Duck and North Bridge
19. What’s Popular?
• NoSQL database proliferation - NoSQL databases are
a dime a dozen. Why?
• Which solutions should we look at?
20. indeed.com
• Indeed.com is an employment-related metasearch
engine for job listings
• Indeed is the #1 job site worldwide, with over 100
million unique visitors per month. Indeed is available
in more than 50 countries and 26 languages,
covering 94% of global GDP.
21. http://db-engines.com
• DB-Engines is an initiative to collect and present
information on database management systems
(DBMS). In addition to established relational DBMS,
systems and concepts of the growing NoSQL area
are emphasized.
• The DB-Engines Ranking is a list of DBMS ranked by
their current popularity. The list is updated monthly.
35. In case you forgot,
Oracle is in the
NoSQL game...
• Oracle NoSQL
36. CAP Theorem
The CAP theorem, also known as Brewer's theorem,
states that it is impossible for a distributed computer
system to simultaneously provide all three of the
following guarantees:
• Consistency (all nodes see the same data at the
same time)
• Availability (a guarantee that every request
receives a response about whether it was
successful or failed)
• Partition tolerance (the system continues to
operate despite arbitrary message loss or failure of
part of the system)
38. Architectures
• For NoSQL, generally boils down to AP or CP. CA
does not support partition tolerance.
• You have to trade off consistency versus availability.
• AP favors availability over consistency - this is the
eventually consistent architecture.
• CP favors consistency over availability.
• Of course, there is a continuum between AP and CP.
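
One way to picture the continuum between AP and CP is quorum-style replication: each item lives on N replicas, a write waits for W acknowledgements, and a read consults R replicas. The Python sketch below is only an illustration under that assumption; the function name and parameters are hypothetical and do not come from any particular database. When R + W > N, every read set overlaps every write set, which leans toward consistency; smaller values lean toward availability.

```python
def reads_see_latest_write(n: int, w: int, r: int) -> bool:
    """Return True when every read quorum must overlap every write quorum.

    n: replicas per item, w: acks required per write, r: replicas read.
    All three are illustrative knobs, not a real product's settings.
    """
    if not (1 <= w <= n and 1 <= r <= n):
        raise ValueError("w and r must be between 1 and n")
    return r + w > n


# Leaning toward consistency: reads and writes each wait for a majority.
print(reads_see_latest_write(n=3, w=2, r=2))  # True
# Leaning toward availability: one replica is enough for either operation.
print(reads_see_latest_write(n=3, w=1, r=1))  # False
```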
39. Key Design
Decisions
• Data Model - how is the data stored/accessed
• Distribution Model - how is the data distributed
• Conflict Resolution - how is it ensured that the same
update ‘wins’ on each node.
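
To make the conflict resolution point concrete, here is a minimal last-write-wins sketch in Python. It assumes each update carries a timestamp and the id of the node that accepted it; both fields and the `pick_winner` helper are hypothetical. The node id acts as a tie-breaker, so every node that sees the same set of updates picks the same winner regardless of arrival order.

```python
from typing import Iterable, NamedTuple


class Update(NamedTuple):
    timestamp: int  # hypothetical logical or wall-clock time of the write
    node_id: str    # hypothetical id of the node that accepted the write
    value: str


def pick_winner(updates: Iterable[Update]) -> Update:
    """Pick the same winner on every node: newest timestamp, then highest node id."""
    return max(updates, key=lambda u: (u.timestamp, u.node_id))


a = Update(10, "n1", "red")
b = Update(10, "n2", "blue")
c = Update(9, "n3", "green")

# Two nodes receive the same updates in different orders and still agree.
print(pick_winner([a, b, c]).value)  # blue
print(pick_winner([c, b, a]).value)  # blue
```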
40. Data Model
• key -> value (opaque)
• key -> document
• column oriented
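
The three data models can be sketched with plain Python dictionaries. This is a deliberately simplified illustration (the keys, field names, and helper functions are made up), but it shows the practical difference: with an opaque value the application must decode everything itself, a document store understands fields, and a column-oriented layout groups values by column.

```python
import json

# key -> value (opaque): the store only sees bytes; the application decodes them.
kv_store = {"user:42": b'{"name": "Ada", "city": "London"}'}

# key -> document: the store understands the structure and can address fields.
doc_store = {"user:42": {"name": "Ada", "city": "London"}}

# column oriented: values are grouped by column, then by row key.
column_store = {
    "name": {"user:42": "Ada"},
    "city": {"user:42": "London"},
}


def city_from_kv(key: str) -> str:
    return json.loads(kv_store[key])["city"]


def city_from_doc(key: str) -> str:
    return doc_store[key]["city"]


def city_from_columns(key: str) -> str:
    return column_store["city"][key]


print(city_from_kv("user:42"), city_from_doc("user:42"), city_from_columns("user:42"))
```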
42. Data Versioning and
Consistency
• Essentially, how is data kept consistent across nodes?
• Sequential consistency—ensuring that all nodes
apply operations in the same order.
• Update consistency and read consistency.
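
Sequential consistency can be sketched as replicas replaying one shared, ordered log. The Python below is a toy model under that assumption; the `Replica` class and `catch_up` method are hypothetical names. Replicas may lag behind one another, but because they apply operations in the same order they converge to the same state.

```python
class Replica:
    """Toy replica that applies operations from a shared, ordered log."""

    def __init__(self):
        self.data = {}
        self.applied = 0  # number of log entries applied so far

    def catch_up(self, log, upto=None):
        """Apply log entries in order, optionally stopping early at index `upto`."""
        end = len(log) if upto is None else min(upto, len(log))
        for key, value in log[self.applied:end]:
            self.data[key] = value
        self.applied = max(self.applied, end)


log = [("x", 1), ("y", 2), ("x", 3)]
fast, slow = Replica(), Replica()

fast.catch_up(log)          # fully caught up
slow.catch_up(log, upto=1)  # lagging: has only seen the first operation
print(slow.data)            # {'x': 1}

slow.catch_up(log)          # once it catches up, the states match
print(fast.data == slow.data)  # True
```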
43. • Data Model - BSON, a binary JSON format
• Distributed Model - sharded asynchronous master/
slave replication.
• Data Versioning and Consistency - Master / Slave, per
table write lock
44. MongoDB Search
• Built-in text search. I think of it like RDBMS built-in
full text search - major feature gaps with dedicated
full text search engines, and likely major
performance gaps.
• Common to sit a search engine next to MongoDB
45. • Data Model - column based, like BigTable
• Distributed Updates - similar to Dynamo, consistent
hashing, master-master
• Data Versioning and Consistency - timestamps
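
The consistent hashing mentioned on this slide can be sketched in a few lines of Python. This is a generic illustration, not the implementation of any specific database: the `HashRing` class, the use of SHA-256, and the number of virtual nodes are all assumptions. Each node owns many points on a ring, and a key belongs to the node that owns the first point after the key's hash.

```python
import bisect
import hashlib


def _hash(key: str) -> int:
    # SHA-256 is an arbitrary choice here; any well-mixed hash would do.
    return int(hashlib.sha256(key.encode("utf-8")).hexdigest(), 16)


class HashRing:
    """Toy consistent-hash ring with virtual nodes."""

    def __init__(self, nodes, vnodes=64):
        self._ring = sorted(
            (_hash(f"{node}#{i}"), node) for node in nodes for i in range(vnodes)
        )
        self._points = [point for point, _ in self._ring]

    def node_for(self, key: str) -> str:
        # First point clockwise from the key's hash, wrapping around at the end.
        idx = bisect.bisect_right(self._points, _hash(key)) % len(self._ring)
        return self._ring[idx][1]


ring = HashRing(["a", "b", "c"])
smaller = HashRing(["a", "b"])  # the same ring after node "c" leaves

keys = [f"k{i}" for i in range(200)]
moved = sum(1 for k in keys if ring.node_for(k) != smaller.node_for(k))
print(f"{moved} of {len(keys)} keys moved when one node left")
```

Only keys that lived on the departed node move; keys owned by the remaining nodes stay where they were, which is the property that makes this scheme attractive for rebalancing.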
49. • Riak is a NoSQL database implementing the
principles from Amazon's Dynamo paper
• Data Model - stores key/value pairs in a high level
namespace called a bucket.
• Data Versioning and Consistency - Riak uses a data
structure called a vector clock to reason about
causality and staleness of stored values. (Can also
use timestamps). Last write wins, or client resolves
conflict.
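
A vector clock comparison can be sketched as follows. This is a generic Python illustration rather than Riak's actual API; the `compare` function and the dict-of-counters representation are assumptions. One clock descends from another when all of its counters are at least as large; if neither descends from the other, the writes were concurrent and something (last write wins, or the client) has to resolve the conflict.

```python
def compare(a: dict, b: dict) -> str:
    """Compare two vector clocks given as {node_id: counter} dicts.

    Returns "equal", "before" (a happened before b), "after", or "concurrent".
    """
    nodes = set(a) | set(b)
    a_le_b = all(a.get(n, 0) <= b.get(n, 0) for n in nodes)
    b_le_a = all(b.get(n, 0) <= a.get(n, 0) for n in nodes)
    if a_le_b and b_le_a:
        return "equal"
    if a_le_b:
        return "before"
    if b_le_a:
        return "after"
    return "concurrent"


print(compare({"n1": 1}, {"n1": 2}))                    # before
print(compare({"n1": 2, "n2": 1}, {"n1": 1, "n2": 1}))  # after
print(compare({"n1": 2}, {"n2": 1}))                    # concurrent
```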
51. Yokozuna Author Enumerates
Common Reasons Custom Search
has Failed
• Pretends to be lucene/solr
• Lack of analyzer/language/features
• Bad performance/resource usage for certain queries
• Basho is not in the business of search
52. • CouchDB’s data format is JSON stored as documents
(self-contained records with no intrinsic
relationships), grouped into “database” namespaces.
• Conflicts are left to the application to resolve at write
time. CouchDB arbitrarily, but deterministically,
determines a winner and tracks a conflict. The client
must then resolve the conflict.
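
The idea of an arbitrary but deterministic winner, with the losing revisions kept around as a tracked conflict, can be sketched like this. The Python below is a simplified illustration and not CouchDB's implementation; the `(generation, revision hash, body)` tuple layout and the `choose_winner` helper are assumptions made for the example.

```python
from typing import List, Tuple

# Hypothetical layout: (generation, revision hash, document body)
Revision = Tuple[int, str, dict]


def choose_winner(revisions: List[Revision]) -> Tuple[Revision, List[Revision]]:
    """Return (winner, conflicts) so that every replica makes the same choice."""
    if not revisions:
        raise ValueError("need at least one revision")
    # Sort on generation, then hash: arbitrary, but identical on every replica.
    ordered = sorted(revisions, key=lambda rev: (rev[0], rev[1]), reverse=True)
    return ordered[0], ordered[1:]


rev_a = (2, "a1f", {"color": "red"})
rev_b = (2, "c9e", {"color": "blue"})

winner, conflicts = choose_winner([rev_a, rev_b])
print(winner[2], len(conflicts))  # {'color': 'blue'} 1
```

The losing revision is not discarded; it stays available so the client can inspect it and resolve the conflict later.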
54. • Redis is an open-source, networked, in-memory, key-
value data store with optional durability.
• Memcached is a general-purpose distributed memory
caching system
• Redis-Search
55. Adding Search to
NoSQL
• Hard to do without a lot of compromise
• Build your own, or use Lucene or Lucene based
solution
• Nothing has yet set the world on fire...
56. Adding NoSQL to
Search
• Search solutions are generally already a Document
based NoSQL solution.
• Seems a lot easier to do than the reverse
• Nothing has yet set the world on fire...
58. Schemaless?
• NoSQL databases are generally ‘schemaless’
• In some ways convenient, in other ways not.
• Implicit schema moves to application code.
• Can’t optimize based on types.
• Note: some are calling ‘guessed’ schemas
schemaless.
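
What "implicit schema moves to application code" looks like in practice can be shown with a small Python sketch. The document shape and the `read_age` helper are hypothetical. Because the store enforces nothing, the reader has to cope with missing fields and with values written under different types.

```python
def read_age(doc: dict) -> int:
    """Read an 'age' field that different writers may have stored differently."""
    raw = doc.get("age")
    if raw is None:
        raise ValueError("document has no 'age' field")
    # One writer stored an int, another a string; the reader must handle both.
    return int(raw)


print(read_age({"name": "Ada", "age": 37}))    # 37
print(read_age({"name": "Bob", "age": "41"}))  # 41
```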
59. • Most similar to the MongoDB architecture
• A CP system, though currently eventually consistent.
• The architecture supports adding strong consistency
options.
60. SolrCloud
• The length of time an inconsistency is present is
called the inconsistency window.
• SolrCloud has a very small inconsistency window.