2. What is ShopWiki?
• ShopWiki is the retail division of Oversee.net.
• We run a collection of retail websites,
• Including the Comparison Shopping Engines (CSE)
– ShopWiki.com
– Compare.com
5. How do we use Elasticsearch?
• You know, for search (not logging).
• We index millions of products, offered from
hundreds of thousands of stores, and allow
users to search them.
6. Why Elasticsearch?
• ShopWiki was built using a proprietary search
server written in C++.
• Served us well for many years, but it needed
improvements, especially for non-English
language search.
• What about Lucene-based solutions?
7. Solr3
• We tried out Solr3 when building
CouponFinder.com.
• Solr worked well (for English & French), but
the coupon dataset is small in comparison to
our product dataset.
• The setup was simple master-slave replication.
8. How do we scale?
• To use Solr for our product data we needed to
shard the data across multiple machines.
• But, Solr3’s sharding capabilities were clunky
and difficult to use.
• Enter Elasticsearch!
• Designed to scale out-of-the-box.
9. Compare.com
• Compare.com was built using Elasticsearch
from the start.
• Allowed us to get up & running very quickly.
• Allowed us to scale up very quickly.
– 60 million products and growing.
• Allows us to iterate on new features quickly.
10. Other Languages
• ShopWiki search is being gradually ported to
Elasticsearch.
• Allows us to have better non-English search
right out-of-the-box.
– French
– German
– Dutch
– Spanish
11. Our Elasticsearch Cluster
• 12 indices, one for each website.
• 3 replicas per shard.
• 3 master nodes (quorum of 2).
• 6 data nodes.
• Plan to add more data nodes as we proceed with
our migration of ShopWiki (500m products).
• Expect to need less hardware than the C++
cluster (uses 50+ machines).
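The "quorum of 2" for 3 master nodes is a simple majority. A minimal sketch of the arithmetic follows; the function name is hypothetical. In Elasticsearch releases of this era the resulting value would typically be what goes into the `discovery.zen.minimum_master_nodes` setting, though the slides do not show the actual configuration.

```python
def quorum(master_eligible_nodes: int) -> int:
    """Smallest number of nodes that forms a strict majority."""
    return master_eligible_nodes // 2 + 1


# 3 master nodes need 2 votes, so one master can fail
# without the cluster losing its ability to elect a leader.
print(quorum(3))
```

With a majority requirement, two halves of a partitioned cluster can never both elect a master, which is the reason for running an odd number of master nodes.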
13. Realtime Updates
• C++ search servers need to have the entire
dataset re-indexed and swapped out all at
once.
• Could only do this once a day, at night (affects
performance).
• With Elasticsearch, we can update our data all
the time (it’s not even a limiting factor).
14. Challenges
• Use TermsFacet to suggest filters to the user.
• E.g. filter by stores or brands.
• Using the 10 most frequent brands from a
search can produce bad results.
– A single brand may have lots of products that are
all weakly relevant.
15. Top-N Faceting
• The solution in Solr is to limit facets to the
top-N results.
• Elasticsearch doesn’t have this feature (as
mentioned at last Meetup).
• Solution: TermsStatsFacet (AKA aggregations in 1.0)
• Allows us to get the brands/stores with the
most relevant results.
• E.g. Σ(score^N), where N allows us to tune facet results to our liking.
16. TermsStatsFacet for Brands
[Figure: brand facets for the query “mixing bowl”, ranked by Σ(score^N), comparing N = 0 (same as count) with N = 4]
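The effect of the exponent can be sketched client-side on toy data. This is an illustration of the Σ(score^N) ranking only, not the facet request used in production; the brand names, scores, and the `rank_brands` helper are all made up for the example.

```python
from collections import defaultdict


def rank_brands(hits, n, top=10):
    """Rank brands by the sum of score**n over their matching products.

    hits: iterable of (brand, relevance_score) pairs (hypothetical data).
    n = 0 reduces to a plain count; larger n favours strongly relevant hits.
    """
    totals = defaultdict(float)
    for brand, score in hits:
        totals[brand] += score ** n
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)[:top]


# BrandA: many weakly relevant products. BrandB: few, but highly relevant.
hits = [("BrandA", 0.1)] * 5 + [("BrandB", 0.9)] * 2

print(rank_brands(hits, n=0))  # BrandA ranks first (pure count)
print(rank_brands(hits, n=4))  # BrandB ranks first (relevance dominates)
```

Raising N shrinks the contribution of weakly relevant products, so a brand with a handful of strong matches can outrank one with many poor matches.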
17. De-duping Products
• Use “more_like_this” query to find similar
products.
• If result’s score is “high enough”, it’s likely the
same product from a different store.
• “High enough” is defined as a fraction of the
identity match’s score.
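The thresholding step can be sketched as follows, assuming the similarity results arrive as (product id, score) pairs and that the identity match is the source product matching itself. The 0.8 fraction, the ids, and the `likely_duplicates` helper are hypothetical; the slides do not state the fraction actually used.

```python
def likely_duplicates(results, source_id, fraction=0.8):
    """Return ids whose score is at least `fraction` of the identity score.

    results: list of (product_id, score) pairs from a similarity query,
             assumed to include the source product itself.
    """
    identity_score = next(score for pid, score in results if pid == source_id)
    threshold = fraction * identity_score
    return [
        pid
        for pid, score in results
        if pid != source_id and score >= threshold
    ]


results = [("p1", 10.0), ("p2", 9.1), ("p3", 4.0)]
print(likely_duplicates(results, source_id="p1"))
```

Expressing the cutoff relative to the identity score keeps it comparable across products, since raw similarity scores vary with the length and content of each product's text.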
18. Questions?
• Rob Stewart
• Lead Software Engineer
• rstewart@shopwiki.com
Editor's Notes
Similar functionality. Different business models (SEO vs SEM). ShopWiki.com was first.