ElasticSearch on AWS - Real Estate portal case study (Spitogatos.gr)
1. ElasticSearch on AWS
Real Estate portal Case Study (Spitogatos.gr)
AWSUG GR meetup #7
27 September 2012
Andreas Chatzakis
co-founder / IT Director – Spitogatos.gr
Event sponsored by: @achatzakis on twitter
4. Helping you find a property
Finding a property in Greece is complex and lacks transparency.
We make life easier for househunters via:
Powerful search functionality
Web & Mobile
Location & Criteria
Quality content
Listings (we love photos)
Articles
mySpitogatos
Email alerts
Save your search
Favorite listings & notes
Contact the realtors
5. Realtors love us too!
Professionals need help in these turbulent times.
We add value in multiple ways:
Cost effective promotion & high quality leads
Targeted channel (very)
Leads already filtered (we've seen the photos!)
Technology services for realtors
Turnkey web site solution
Listing synchronization web service
B2B via Spitogatos Network (SpiN): business network / collaboration tool for realtors
Channel for foreign buyers via the English version
7. To Search is to Find
Search is central to what we do
Users searching for property come with structured criteria of huge variety
Athens Center, residential - flat or studio, for sale, 100-150k €, 85-120 sq meter, with a garage
Athens Center & N.Kosmos, residential - flat, for sale, 75-100k €, 70-100 sq meter, 2+ bedrooms, only show listings with photos
Piraeus centre or Mikrolimano, commercial - store, for rent, 500-750 € per month, only listings with recently reduced price
Monetize: # of Listings grouped by paying member + above criteria
iPhone app → Listings within geo-rectangle + above criteria
As a result, caching is rarely our friend!
We used to think Lucene/Solr, ElasticSearch, CloudSearch etc. were only useful for text search and added no value for structured search.
We kept insisting on optimizing MySQL (multi-column indices etc.) while throwing replicas at the problem.
8. Why ElasticSearch
Selected ElasticSearch after (very) brief research* into alternatives:
AWS's own CloudSearch:
Zero management service: nice!
Not available on eu-west-1
Currently lacks ES functionality (e.g. geospatial, non-English analyzers)
Sphinx
Easy MySQL integration
How do you scale it?*
Solr
Industry standard
Seems to be perceived as somewhat harder to scale/operate*?
ElasticSearch:
Piece of cake to set up on AWS (stay tuned!)
Super distributed, scales & is easy on IT ops (more on that later!)
* Disclaimer: We did not go through a detailed product selection process!
10. ElasticSearch basics
A distributed, RESTful Search engine built on top of Lucene
Schema-free
JSON documents
Analyzers
Boost levels
Easy & flexible Search
Lucene query string or JSON based search query DSL
Facets & Highlighting
Spatial search
Custom scripts
Multi Tenancy
Store & search across multiple indices
Each with its own settings
Use-case: Logs – recent in memory, old on disk
11. Scaling ElasticSearch
Designed from the ground up to be Scalable & Highly Available
Distributed
Indices automatically broken into shards
Replicas for read performance & availability
Multiple cluster nodes, each hosting 1+ shards/replicas
Peer-to-peer: each node can delegate operations to other nodes
Add or remove nodes at will
Rebalancing & routing automagically behind the scenes
Discovery
Multicast or unicast (declarative)
Gateway
Allows recovery in case all nodes go down
Local or shared storage
Async replication in case of shared storage
12. A scale-up example
Assume a cluster with 4 shards and 1 replica configuration
1 node example – Status Yellow
2 nodes example – Status Green
3 nodes example
Diagram legend: primary shard, replica shard, master node, regular node
The master node maintains cluster state and reacts when nodes join or leave the cluster by reassigning shards.
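To make the Yellow and Green statuses above concrete, here is a small Python simulation. It is not ElasticSearch's allocator: the function names and the "least loaded node" placement rule are assumptions made for illustration. The one rule taken from ElasticSearch itself is that a node never holds two copies of the same shard, which is why a single node cannot host the replicas.

```python
def allocate(num_shards, num_replicas, nodes):
    """Place every shard copy on a node; return (placement, unassigned)."""
    placement = {node: [] for node in nodes}
    unassigned = []
    for shard in range(num_shards):
        # Copy 0 is the primary, copies 1..num_replicas are replicas.
        for copy in range(num_replicas + 1):
            # A node may hold at most one copy of a given shard.
            candidates = [n for n in nodes
                          if all(s != shard for s, _ in placement[n])]
            if not candidates:
                unassigned.append((shard, copy))
                continue
            # Hypothetical balancing rule: pick the least loaded candidate.
            target = min(candidates, key=lambda n: len(placement[n]))
            placement[target].append((shard, copy))
    return placement, unassigned


def cluster_status(num_shards, num_replicas, nodes):
    """green: all copies assigned; yellow: only replicas missing; red: a primary is missing."""
    _, unassigned = allocate(num_shards, num_replicas, nodes)
    if any(copy == 0 for _, copy in unassigned):
        return "red"
    return "yellow" if unassigned else "green"


# The three cases from the slide: 4 shards, 1 replica, 1 to 3 nodes.
for nodes in (["n1"], ["n1", "n2"], ["n1", "n2", "n3"]):
    placement, _ = allocate(4, 1, nodes)
    load = {node: len(copies) for node, copies in placement.items()}
    print(len(nodes), "node(s):", cluster_status(4, 1, nodes), load)
```

Adding the third node does not change the status (it is already Green with two), but it spreads the eight shard copies over more machines.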
13. ElasticSearch on AWS
2 modules make deployment on AWS a breeze
EC2 discovery
Filter by security group, AZ, tags
Requires IAM user with certain EC2 privileges:
DescribeAvailabilityZones, DescribeInstances, DescribeRegions,
DescribeSecurityGroups, DescribeTags
Very useful in autoscaling setups with ephemeral servers
S3 gateway
Long-term, reliable async persistence of cluster state and indices
Allows deployment without EBS volumes
Still, local gateway with EBS volumes performs better (less network used, faster recovery)
Won't protect from accidental deletion of an index (deletion will propagate to shared storage)
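As an illustration of the IAM requirement for EC2 discovery, the sketch below builds a policy document granting only the five EC2 actions listed above. The function name is hypothetical, the `Version` string is the current IAM policy language version rather than anything specific to this deployment, and the S3 gateway would need separate permissions on its bucket that are not shown here.

```python
import json

# The EC2 privileges listed on the slide for the discovery module.
EC2_DISCOVERY_ACTIONS = [
    "DescribeAvailabilityZones",
    "DescribeInstances",
    "DescribeRegions",
    "DescribeSecurityGroups",
    "DescribeTags",
]


def discovery_policy():
    """Return a read-only IAM policy document for EC2 discovery (illustrative)."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Action": ["ec2:" + action for action in EC2_DISCOVERY_ACTIONS],
                # Describe* calls are not scoped to individual resources.
                "Resource": "*",
            }
        ],
    }


print(json.dumps(discovery_policy(), indent=2))
```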
15. Indexation
Indexation of Spitogatos.gr ads
DB is still the “source of truth”
We propagate DELETEs synchronously, INSERTs & UPDATEs asynchronously
KISS: Cron job (re)indexes never-indexed or least-recently-indexed listings
ORM marks new/modified listings as never-indexed (so they go first)
Location: Multivalue field instead of nested set model in the DB
e.g. this property is in Greece, Attica, Piraeus, Port of Piraeus
Property will be included in results when I search for any of the above.
Flat schema
Searchable listing owner fields are included in the document (vs a JOIN in our DB)
Changes to other tables might lead to a large # of listings requiring reindexation (e.g. real estate agent becomes a paying member)
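The cron-based (re)indexing order described above can be sketched as a simple sort. This is a minimal illustration, not the production job: the `indexed_at` field, the function names and the in-memory list standing in for the DB table are all assumptions.

```python
from datetime import datetime


def next_batch(listings, batch_size):
    """Pick listings to (re)index: never-indexed first, then least recently indexed."""
    def priority(listing):
        ts = listing["indexed_at"]
        # None (never indexed) sorts before any real timestamp.
        return (ts is not None, ts or datetime.min)
    return sorted(listings, key=priority)[:batch_size]


def mark_modified(listing):
    """What the ORM hook would do on insert/update: reset to never-indexed."""
    listing["indexed_at"] = None


listings = [
    {"id": 1, "indexed_at": datetime(2012, 9, 1)},
    {"id": 2, "indexed_at": None},
    {"id": 3, "indexed_at": datetime(2012, 8, 1)},
]
print([l["id"] for l in next_batch(listings, 2)])
```

Because a modified listing is reset to "never indexed", it jumps to the front of the queue on the next cron run without any separate change log.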
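The flat schema and the multivalue location field from the Indexation slide might look like the following sketch. All field names and sample values are hypothetical; the point is only that the area hierarchy and the searchable owner fields are copied into each listing document, so a search needs no JOIN.

```python
def build_document(listing, owner, area_path):
    """Flatten a listing, its owner and its area hierarchy into one document."""
    return {
        "id": listing["id"],
        "price": listing["price"],
        # Multivalue field: every level of the hierarchy is stored.
        "location": list(area_path),
        # Owner fields denormalized into the listing (no JOIN at search time).
        "owner_id": owner["id"],
        "owner_is_paying_member": owner["is_paying_member"],
    }


def matches_area(doc, area):
    """A listing matches a search for any area on its path."""
    return area in doc["location"]


def listings_to_reindex(listings, owner_id):
    """The cost of denormalizing: an owner change touches all of that owner's listings."""
    return [l["id"] for l in listings if l["owner_id"] == owner_id]


doc = build_document(
    {"id": 42, "price": 120000},
    {"id": 7, "is_paying_member": True},
    ["Greece", "Attica", "Piraeus", "Port of Piraeus"],
)
print(matches_area(doc, "Attica"), matches_area(doc, "Thessaloniki"))
```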
16. Index Integrity
Making sure our index is consistent with the DB
Scrutineer ( https://github.com/Aconex/scrutineer )
Compares DB and ElasticSearch index for mismatches
Document exists in ES but not in the DB (or vice versa)
ES version not up to date
Relies on the “_version” field, which our ORM increments on every change
When indexing we explicitly set versioning to “external”
Had to “hack” it as it doesn't work with the EC2 discovery module
http://labs.spitogatos.gr/?p=45
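The kind of check described above can be sketched in a few lines. This is not Scrutineer's code or interface, only the same idea in miniature: both sides are reduced to an id → version mapping (the dict inputs and the function name are assumptions) and the three mismatch types are reported.

```python
def compare(db_versions, es_versions):
    """Compare id -> version mappings from the DB and from the search index."""
    db_ids, es_ids = set(db_versions), set(es_versions)
    return {
        # In the DB but never indexed (or lost from the index).
        "missing_in_es": sorted(db_ids - es_ids),
        # Still in the index although deleted from the DB.
        "missing_in_db": sorted(es_ids - db_ids),
        # Present on both sides but with different versions.
        "version_mismatch": sorted(
            i for i in db_ids & es_ids if db_versions[i] != es_versions[i]
        ),
    }


report = compare({1: 3, 2: 1, 3: 5}, {1: 3, 3: 4, 4: 1})
print(report)
```

The comparison only works if both sides use the same version numbers, which is why the index is written with external versioning driven by the ORM.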
17. Search – Shards & Routing
How does ElasticSearch decide in which shard to store a doc?
By default this is done based on hash of document id
Can be overridden while indexing and while searching (routing parameter)
We shard based on a hash of the area id
- Most users search for listings within a specific area
- We hit only a single shard for a large percentage of the searches.
Diagram: no routing specified vs. routing by specific areaId
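A minimal sketch of why routing by area id lets most searches hit a single shard. The hash used here (`zlib.crc32`) is a stand-in chosen for illustration, not the hash function ElasticSearch uses internally, and the function names are hypothetical.

```python
import zlib


def shard_for(routing_value, num_shards):
    """Map a routing value to a shard: hash(routing) modulo number of shards."""
    digest = zlib.crc32(str(routing_value).encode("utf-8"))
    return digest % num_shards


def shards_to_search(num_shards, area_id=None):
    """Without routing every shard is queried; with routing only one is."""
    if area_id is None:
        return list(range(num_shards))
    return [shard_for(area_id, num_shards)]


NUM_SHARDS = 4
# Two listings in the same area are indexed with the same routing value,
# so they end up on the same shard.
print(shard_for(1234, NUM_SHARDS) == shard_for(1234, NUM_SHARDS))
print(len(shards_to_search(NUM_SHARDS)), len(shards_to_search(NUM_SHARDS, area_id=1234)))
```

The same routing value has to be supplied at index time and at search time; a search spanning several areas still fans out to every shard those areas map to.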
18. Search – Flat Schema, Facets & Scoring
We rely a lot on ElasticSearch's Flat Schema, Facets & Scoring
No joins due to flat schema => fast!
Multivalue fields => fast filtering for listings in areas of various hierarchy levels
Facets functionality returns list of paying agents with # listings matching criteria
Old, slow ranking algorithm replaced by ElasticSearch scoring functionality
It used to go through our DB and refresh every listing's score
ad age is part of the equation
Now ES computes this dynamically on every search
We use custom scoring
We can modify scoring algorithm and see changes instantly
no need to recalculate scores for all listings
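The difference between a stored score and a score computed on every search can be illustrated as follows. The formula, the `quality` field and the `age_weight` parameter are hypothetical stand-ins, not the actual Spitogatos ranking; the sketch only shows that when ad age enters the score at query time, tuning the algorithm requires no batch recalculation.

```python
from datetime import date


def score(listing, today, age_weight=1.0):
    """Hypothetical score: base quality, decayed by the age of the ad."""
    age_days = (today - listing["published"]).days
    return listing["quality"] / (1.0 + age_weight * age_days / 30.0)


def rank(listings, today, age_weight=1.0):
    """Scores are computed at search time, so nothing is stored or refreshed."""
    return sorted(listings, key=lambda l: score(l, today, age_weight), reverse=True)


today = date(2012, 9, 27)
listings = [
    {"id": "a", "quality": 10, "published": date(2012, 7, 29)},  # 60 days old
    {"id": "b", "quality": 5, "published": date(2012, 9, 27)},   # published today
]
print([l["id"] for l in rank(listings, today)])
print([l["id"] for l in rank(listings, today, age_weight=0.0)])
```

Changing `age_weight` changes the ordering on the very next search, which mirrors the "see changes instantly" point above.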
19. Monitoring
Sematext SPM offers a (currently free) ES monitoring solution
Metrics covered: Cluster Health, Index Stats, Shard Stats, Search rate & latency, Cache, CPU & RAM, Disk, Network, JVM & GC
21. Backups
We take periodic copies from the Gateway
Because the Gateway is no cure for accidental deletions or bugs
s3cmd syncs S3 gateway contents to a local folder
Expect some errors here as files get deleted/modified
Disables snapshots to gateway
Syncs again (no errors this time and much faster)
Reenables snapshots to gateway
Zips local folder contents, splits into smaller files & uploads to secondary S3 bucket
Get the script here: http://labs.spitogatos.gr/?p=17
22. Learnings
Issues & lessons learned:
Faceted search can return wrong (smaller) results (on multiple shards)
Due to the way sorting/merging is done
Increase the facet size field depending on the cardinality of the faceted field
We use Elastica – a PHP client for ElasticSearch - https://github.com/ruflin/Elastica
Lacking Document Routing and Version Type support
Our own Jerry Manolarakis is working on a pull request to add setRouting, setVersionType
Filters vs queries (Query DSL)
Filters perform an order of magnitude better than plain queries since no scoring is performed and they are automatically cached.
Do it! Your DB will thank you
[Charts: CPU utilization; response time pattern]
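The faceting issue listed above (smaller counts on multiple shards) can be reproduced with a small simulation. This is a simplified model, assuming each shard returns only its own top `size` terms and the coordinating node sums what it receives; the agent names and counts are made up.

```python
from collections import Counter


def merged_facet(shards, size):
    """Each shard reports its top `size` terms; the results are summed and re-cut."""
    total = Counter()
    for counts in shards:
        for term, n in Counter(counts).most_common(size):
            total[term] += n
    return total.most_common(size)


# Listings per paying agent on two shards (hypothetical numbers).
shard_a = {"agent1": 10, "agent2": 9, "agent3": 8}
shard_b = {"agent3": 10, "agent4": 9, "agent1": 1}

# Asking for the top 2 drops agent3 on shard A and agent1 on shard B,
# so both counts come back too small.
print(merged_facet([shard_a, shard_b], size=2))

# Asking for more terms than needed and trimming afterwards fixes it here.
print(merged_facet([shard_a, shard_b], size=3)[:2])
```

How much larger the requested size needs to be depends on how many distinct values the faceted field has and how unevenly they are spread across shards.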
23. Read more
Useful resources:
https://speakerdeck.com/u/jmikola/p/symfony-live-london-elasticsearch
http://blog.sematext.com/2010/05/03/elastic-search-distributed-lucene/
http://www.slideshare.net/elasticsearch/elasticsearch-at-berlinbuzzwords-2010
http://www.slideshare.net/kucrafal/scaling-massive-elastic-search-clusters-rafa-ku-sematext
Need help integrating ElasticSearch into your app?
http://bacterials.net/
Follow us on twitter: @spitogatosLabs
Check out our blog: http://labs.spitogatos.gr