16. the tech: deployment
• Git based deployment
• Auto-pull from master every minute (danger, Will Robinson!)
• Work on develop branches and merge into master
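A minimal sketch of one auto-pull pass, assuming the checkout tracks a remote named origin and that something like cron calls it once a minute. The function names and the --ff-only flag are illustrative choices, not taken from the slides; --ff-only simply makes the pull fail loudly if the box has diverged from master.

```python
import subprocess


def build_pull_command(remote="origin", branch="master"):
    """Return the git command for one deployment pass (names are assumptions)."""
    # --ff-only refuses to create a merge commit on the server.
    return ["git", "pull", "--ff-only", remote, branch]


def deploy_once(repo_dir, run=subprocess.run):
    """Run one pull in repo_dir; `run` is injectable so it can be faked in tests."""
    # check=True makes subprocess.run raise if git exits with an error.
    run(build_pull_command(), cwd=repo_dir, check=True)
```

Because whatever lands on master is live within a minute, the merge from a develop branch is effectively the release step.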
17. the tech: configuration
• Currently read from a local file
• Unexpected annoyance
• Looking at Doozer / ZooKeeper
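A rough sketch of the "local file read" approach, assuming a JSON file overlaid on built-in defaults. The file format and the keys shown (solr_url, workers) are hypothetical; the slides do not say what the real configuration looks like.

```python
import json

# Hypothetical defaults, for illustration only.
DEFAULTS = {"solr_url": "http://localhost:8983/solr", "workers": 4}


def parse_config(text, defaults=DEFAULTS):
    """Overlay the JSON object in `text` on a copy of the defaults."""
    config = dict(defaults)
    config.update(json.loads(text))
    return config


def load_config(path, defaults=DEFAULTS):
    """Read the config file from the local disk of this machine."""
    with open(path, encoding="utf-8") as handle:
        return parse_config(handle.read(), defaults)
```

Every machine carries its own copy of the file, which is presumably the annoyance that makes a shared store such as Doozer or ZooKeeper attractive.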
18. the tech: scaling
• Every machine can disappear
• Ignore the local filesystem (FS)
• Uploads to S3
• Next: move sessions off the box, then ready!
• Not quite auto-scaling but almost there
• Plan to fail
20. the tech: index
• Apache Solr
• Learning curve not steep - just hard to find!
• ~25m documents
• Three servers
21. the tech: index
• Index documents by sentence
• Prevents cross sentence mismatches
• NLTK
• Not 100%
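Indexing by sentence can be sketched as below. To keep the example self-contained it uses a naive regular-expression splitter as a stand-in for NLTK's sentence tokenizer, and the document fields (id, parent_id, text) are hypothetical, not the schema from the talk.

```python
import re

# Naive stand-in for NLTK: split after ., ! or ? followed by whitespace.
_SENTENCE_END = re.compile(r"(?<=[.!?])\s+")


def split_sentences(text):
    """Split text into sentences, dropping empty pieces."""
    return [s for s in _SENTENCE_END.split(text.strip()) if s]


def sentence_documents(doc_id, text):
    """Build one index document per sentence, linked back to its parent."""
    return [
        {"id": f"{doc_id}-{n}", "parent_id": doc_id, "text": sentence}
        for n, sentence in enumerate(split_sentences(text))
    ]
```

With one document per sentence, a query can only match terms that occur in the same sentence. Abbreviations show why splitting is "not 100%": this naive splitter breaks "Dr. Smith arrived." after "Dr.".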
22. the tech: index
• Performance factors
• Distribute workload
• Commit frequency
• Data size
• Caching
• Memory
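Commit frequency is the easiest of these factors to show in code. The sketch below sends documents in batches and commits only every few batches; add_batch and commit are hypothetical callables standing in for whatever Solr client is used, and the batch sizes are placeholders rather than the values from the talk.

```python
def index_in_batches(documents, add_batch, commit, batch_size=1000, commit_every=10):
    """Send documents in batches and commit only every `commit_every` batches.

    `add_batch` and `commit` are stand-ins for the real Solr client calls.
    Returns the number of batches sent.
    """
    batch, batches_sent = [], 0
    for doc in documents:
        batch.append(doc)
        if len(batch) == batch_size:
            add_batch(batch)
            batch = []
            batches_sent += 1
            if batches_sent % commit_every == 0:
                commit()
    if batch:
        # Send the final, partially filled batch.
        add_batch(batch)
        batches_sent += 1
    # A last commit makes the tail of the data searchable.
    commit()
    return batches_sent
```

Commits are comparatively expensive, so committing less often generally raises indexing throughput, at the cost of new documents taking longer to become searchable.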
23. the tech: index
• 2 - 3 days to index full text
• 1 week if any issues arise
• Not a runner
• Reduced to 9 hours with optimisations
• ~450k / hr | ~125 / s
• Distributed index = distributed search
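The two throughput figures on this slide are the same rate in different units, which a couple of lines of arithmetic confirm. The helper names are illustrative, and the slide does not say exactly which unit (documents or sentences) the rate counts.

```python
def per_second(per_hour):
    """Convert an hourly rate to a per-second rate."""
    return per_hour / 3600


def items_indexed(hours, per_hour):
    """Total items processed in `hours` at a steady hourly rate."""
    return hours * per_hour


rate = per_second(450_000)          # 125.0 per second
total = items_indexed(9, 450_000)   # 4_050_000 items in a 9 hour run
```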
24. the tech: analyse
• Gearman-like approach
• One job queue server
• Many analysis servers
• Many workers per analysis server
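The shape of this setup (one queue, many workers pulling from it) can be sketched in a single process with Python's standard library. This is only an illustration of the pattern: threads stand in for the workers on the analysis servers, an in-memory queue stands in for the job queue server, and the function names are hypothetical.

```python
import queue
import threading

_STOP = object()  # sentinel telling a worker to exit


def run_workers(jobs, analyse, worker_count=4):
    """Fan jobs out to `worker_count` workers and collect (job, result) pairs."""
    job_queue = queue.Queue()
    results = []
    lock = threading.Lock()

    def worker():
        while True:
            job = job_queue.get()
            if job is _STOP:
                break
            outcome = analyse(job)
            with lock:
                results.append((job, outcome))

    threads = [threading.Thread(target=worker) for _ in range(worker_count)]
    for thread in threads:
        thread.start()
    for job in jobs:
        job_queue.put(job)
    for _ in threads:
        job_queue.put(_STOP)  # one sentinel per worker
    for thread in threads:
        thread.join()
    return results
```

Results arrive in completion order rather than submission order, so callers that care should sort them or key them by job.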
25. the tech: analyse
• How
• Solr proximity search
• Magic filters \o/
• Store in Mongo
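A proximity search in Solr's standard query syntax is a quoted phrase followed by ~N, where N is the maximum distance between the terms. The helper below only builds that query string; the field name and example terms are hypothetical, and it assumes the terms are plain words that need no escaping.

```python
def proximity_query(field, term_a, term_b, max_distance):
    """Build a Solr/Lucene proximity query such as text:"apache solr"~5."""
    if max_distance < 0:
        raise ValueError("max_distance must be non-negative")
    # Assumes term_a and term_b contain no quotes or special characters.
    return f'{field}:"{term_a} {term_b}"~{max_distance}'


query = proximity_query("text", "apache", "solr", 5)
```

Combined with the sentence-level documents, a query like this matches only when both terms sit close together inside one sentence.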