Search is everywhere. Tell the story about kids with search and recommendations.
Properly written queries, on the other hand, are completely precise. A query will give you exactly the result you ask for. Exactly what you asked for, which may or may not be exactly what you want. The hardest part of a query is asking the right question.
Search, on the other hand, is not precise. A dumb search just gives you a randomly ordered list of items in which your search term occurs. A smart search will provide you a ranked list of items that are strongly related to the search terms you entered, even if they do not match them exactly. This means that even though some of the results will doubtless be unrelated, and sometimes absurd, there is a very good chance that there is data you can use pretty high up in those results, even if you didn’t ask quite the right question. And there may well be relevant data you didn’t even think to ask for.
Being smart in this way is a key benefit of search. Queries cannot be smart. Queries must always give you exactly what you asked for. There can be no tolerance for serendipity in query results. Search can be smart, but query must be dumb and strictly obedient.
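A minimal sketch of the query-versus-search contrast, assuming a toy list of titles. The scoring here is plain term overlap chosen for illustration; it is not how Lucene or Solr actually rank results, and the function names are hypothetical.

```python
def exact_query(items, field, value):
    """Query: return only the items whose field equals the value exactly."""
    return [item for item in items if item[field] == value]


def ranked_search(items, field, text):
    """Search: rank items by how many of the search terms they share."""
    terms = set(text.lower().split())
    scored = []
    for item in items:
        overlap = len(terms & set(item[field].lower().split()))
        if overlap:
            scored.append((overlap, item))
    # Highest overlap first; ties keep their original order (stable sort).
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [item for _, item in scored]


titles = [
    {"title": "Cassandra optimization"},
    {"title": "Optimization tips for Cassandra clusters"},
    {"title": "Solr basics"},
]

# The query is strictly obedient: a slightly different string finds nothing.
print(exact_query(titles, "title", "cassandra optimization"))
# The search tolerates an imperfect question and still surfaces useful items.
print(ranked_search(titles, "title", "cassandra optimization tricks"))
```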
“Find me all webpages that contain “Cassandra” and “Optimization””
“Recommend artists who are “like” “Taylor Swift””
“Highlight the keywords “Awesome,” “Good,” and “Amazing.””
Find me all shoes with:
Size: 12
Price: $15 - $60
Brand: Nike
Lucene is the base of nearly every popular search engine out there, including Elasticsearch (which is more of a fried taco shell)
Fast, high performance, scalable search/IR library
Open source
Initially developed by Doug Cutting (Also author of Hadoop)
Indexing and Searching
Inverted Index of documents
Provides advanced search options such as synonyms, stopwords, similarity-based matching, and proximity search.
http://lucene.apache.org/
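To make "inverted index of documents" concrete, here is a rough sketch of the idea in plain Python. It only illustrates the data structure (term to document ids); Lucene's real on-disk format, analysis chain, and scoring are far more involved, and every name below is made up for the example.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercase term to the set of document ids containing it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def search_all(index, query):
    """Return ids of documents that contain every term in the query."""
    terms = query.lower().split()
    if not terms:
        return set()
    # Look up each term's posting set, then intersect them.
    postings = [index.get(term, set()) for term in terms]
    return set.intersection(*postings)


docs = {
    1: "Cassandra optimization tips",
    2: "Solr wraps Lucene",
    3: "Cassandra and Solr together",
}
index = build_inverted_index(docs)
print(sorted(search_all(index, "cassandra solr")))
```

The point of the structure is that a lookup touches only the posting sets for the query terms, not every document.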
Created by Yonik Seeley for CNET
Enterprise Search platform for Apache Lucene
Open source
Highly reliable, scalable, fault tolerant
Supports distributed indexing (SolrCloud), replication, and load-balanced querying
http://lucene.apache.org/solr
To search! (duh)
Note: Not the same as querying.
Querying implies you are searching with a specific value in mind. Search is fuzzier, about finding likeness.
Full text search and highlighting
Faceted search (search by price, size, manufacturer, etc.)
Geospatial search (combining location information, filtering by distance, and more)
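Faceted search can be pictured as "filter, then count by field". The sketch below reuses the shoe example (size 12, $15 to $60, Nike) on made-up in-memory data; it illustrates the concept only and does not use Solr's faceting API.

```python
from collections import Counter

# Hypothetical catalog used only for this illustration.
shoes = [
    {"brand": "Nike", "size": 12, "price": 45.0},
    {"brand": "Nike", "size": 12, "price": 80.0},
    {"brand": "Adidas", "size": 12, "price": 30.0},
    {"brand": "Nike", "size": 10, "price": 20.0},
]


def filter_items(items, brand=None, size=None, price_range=None):
    """Keep items matching every filter that was supplied."""
    result = []
    for item in items:
        if brand is not None and item["brand"] != brand:
            continue
        if size is not None and item["size"] != size:
            continue
        if price_range is not None:
            low, high = price_range
            if not (low <= item["price"] <= high):
                continue
        result.append(item)
    return result


def facet_counts(items, field):
    """Count how many items fall under each value of a field."""
    return Counter(item[field] for item in items)


matches = filter_items(shoes, brand="Nike", size=12, price_range=(15, 60))
size_12 = filter_items(shoes, size=12)
print(matches)
print(facet_counts(size_12, "brand"))
```

The counts are what a storefront shows next to each filter ("Nike (2)", "Adidas (1)") so the user can narrow the results further.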
Solr is an open source enterprise search system
Solr “wraps” the Lucene open source information retrieval engine
You customize Solr via configuration & plug-ins
Free-form text search includes wildcards and phrases
Query support includes filtering (ranges, geo-spatial) and sorting
You can run Solr as a webapp, or in stand-alone (embedded) mode
You make search requests using HTTP GET requests
Documents are added and deleted via HTTP POST requests
Document updates work the same way (internally as delete+add)
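As a rough sketch of what those requests look like, the snippet below builds (but does not send) a search GET and an add POST. The host, port, and the "products" core name are assumptions for illustration; q, fq, sort, rows, and wt are standard Solr query parameters, and posting JSON to /update assumes a Solr version with JSON update support.

```python
import json
from urllib.parse import urlencode
from urllib.request import Request

BASE = "http://localhost:8983/solr/products"  # hypothetical host and core


def build_search_request(text, filters=(), sort=None, rows=10):
    """Build a GET request against Solr's /select handler."""
    params = [("q", text), ("rows", rows), ("wt", "json")]
    params += [("fq", f) for f in filters]  # one fq parameter per filter
    if sort:
        params.append(("sort", sort))
    return Request(f"{BASE}/select?{urlencode(params)}", method="GET")


def build_add_request(docs):
    """Build a POST request that sends documents to Solr's /update handler."""
    body = json.dumps(docs).encode("utf-8")
    return Request(
        f"{BASE}/update?commit=true",
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


search = build_search_request(
    "running shoes",
    filters=["brand:Nike", "price:[15 TO 60]"],
    sort="price asc",
)
add = build_add_request([{"id": "shoe-1", "brand": "Nike", "price": 45.0}])
print(search.get_method(), search.full_url)
print(add.get_method(), add.full_url)
```

Passing either object to urllib.request.urlopen would send it to a running Solr instance. Re-posting a document with the same id replaces the old one, which matches the delete+add behavior noted above.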
----- Meeting Notes (6/5/15 14:33) -----
Lucene is assembly to Solr's Java
These are extremely efficient and fast structures that live on disk. Simple queries are typically fulfilled in < 1ms. More complex ones, < 10ms
No built-in datastore; it is just a JAR file that runs in a servlet container like Jetty or Tomcat
Give me some documents and let me index them for you!
But how does that solve your scaling problem? Your uptime problem?
It doesn’t.
data is stored completely on disk
No built in way to shard or distribute
Even with SolrCloud or ElasticSearch, you will need to move your data around
SolrCloud = ZooKeeper makes Solr clusterable. Just like MySQL sharding… not transparent
Cassandra stores your data; Solr stores the indexes for fast search. Again, no ETL, and your data model doesn’t change.
Data is indexed locally. You could completely forget that Solr exists. You could just write your queries through Cassandra.
You get a lot of the benefits from using Solr on top of C*: no ETL! Just load the data into Cassandra and it’s automatically replicated to Solr!
Data is automatically replicated from the Cassandra DC to the Solr DC
No single point of failure
No ETL
Real time indexing!!
Push down predicates
Locality awareness with the sharding
Netty vs HTTP communication
Using docvalues all the time
Lockless algorithms for accessing the RAM buffer provide linear scalability: each thread can search a different portion of a segment, so you can have larger RAM buffers
All this plus some other enhancements means indexing twice as fast as OSS lucene
Solr is an enterprise search solution, meaning you have access to many APIs, there’s an administrative UI, support for importing data from different sources, and customizations (Solr is a bunch of JARs!)
It was built on top of the Lucene information retrieval engine and was developed by CNET as a product search engine.
This ONLY allows search on tags, and the “index” table needs to be maintained manually. This is fine if this is all you want to do, and is very common.
Location_type, currently with only 2 values, will work with a 2i (secondary index); name, on the other hand, would not. So how do you search by name?
Secondary indexes will allow you to search on the whole field, but those aren’t efficient for high-cardinality values, like email address
A secondary index creates additional data structures on each node that holds table partitions
Each ‘local index’ indexes values in rows stored locally
Query on an indexed column requires accessing ‘local indexes’ on all nodes
Expensive
Query on a partition key and an indexed column requires accessing a ‘local index’ on one (or a few) nodes
Efficient
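The cost difference above can be sketched with a toy cluster in which each node keeps its own local index. Everything here is a simplified assumption for illustration: the class names are invented, and node placement uses a byte-sum modulo rather than Cassandra's real token ring.

```python
from collections import defaultdict


class Node:
    """One node holding some partitions plus a local index on one column."""

    def __init__(self, name):
        self.name = name
        self.rows = []                        # rows stored on this node
        self.local_index = defaultdict(list)  # indexed value -> local rows

    def insert(self, row, indexed_column):
        self.rows.append(row)
        self.local_index[row[indexed_column]].append(row)


class Cluster:
    def __init__(self, node_count, partition_key, indexed_column):
        self.nodes = [Node(f"node{i}") for i in range(node_count)]
        self.partition_key = partition_key
        self.indexed_column = indexed_column

    def owner(self, key):
        # Toy placement: real Cassandra uses a token ring, not modulo.
        return self.nodes[sum(key.encode("utf-8")) % len(self.nodes)]

    def insert(self, row):
        self.owner(row[self.partition_key]).insert(row, self.indexed_column)

    def query(self, indexed_value, partition_value=None):
        """Return (matching rows, number of nodes contacted)."""
        if partition_value is None:
            targets = self.nodes  # no partition key: ask every node
        else:
            targets = [self.owner(partition_value)]  # only the owning node
        rows = []
        for node in targets:
            for row in node.local_index.get(indexed_value, []):
                if partition_value is None or row[self.partition_key] == partition_value:
                    rows.append(row)
        return rows, len(targets)


cluster = Cluster(node_count=4, partition_key="city", indexed_column="location_type")
for city, name, kind in [
    ("Austin", "Cafe A", "cafe"),
    ("Austin", "Diner B", "restaurant"),
    ("Boston", "Diner C", "restaurant"),
    ("Denver", "Cafe D", "cafe"),
    ("Denver", "Diner E", "restaurant"),
]:
    cluster.insert({"city": city, "name": name, "location_type": kind})

all_rows, contacted_all = cluster.query("restaurant")
one_rows, contacted_one = cluster.query("restaurant", partition_value="Austin")
print(len(all_rows), "rows from", contacted_all, "nodes")
print(len(one_rows), "rows from", contacted_one, "node")
```

The indexed-column-only query contacts every node even though most hold few or no matches, while adding the partition key limits the work to the node that owns that partition.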
But say you want to search the text in description or title? What now?