3. Project background
• SpareBank1 Gruppen
• 19 individual bank portals and 1 front page (forside)
• Boost 25 umbrella project
• "Semantic" URLs: https://www2.sparebank1.no/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_
• New search GUI
• CMS with no easy way of telling which bank has published what
• Mass duplications
• Access to other portal-specific articles
• Webcrawlers
4. What is better search?
At the very least:
• Relevant hits
• Faceting
• Query completion
• Spell checking and suggestions
• Basic search analytics
5. Relevant hits
• Relevancy = "…the quality of results returned from a query…"
• Based on hits in fields generated from document processing
• Clean and metadata-rich index
• Pushed from CMS or extracted by crawlers
8. Crawling and Indexing
• Clean and metadata-rich index
• OpenPipeline
• Ignore irrelevant articles
• Extract article text contents
• Detect duplicates
• Facet data
• Populate index fields, including *_qc and *_sp fields (see the sketch below)
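A minimal sketch of these stages, assuming a Python-style pipeline. The real implementation ran in OpenPipeline (a Java framework), so the function, document and field names below are illustrative rather than the actual code; only the *_qc/*_sp naming comes from the slides.

```python
import hashlib

def pipeline(doc: dict, seen_hashes: set, is_relevant) -> dict | None:
    """Map one crawled article to index fields, or drop it."""
    # Ignore irrelevant articles (the URL rules are sketched in the notes).
    if not is_relevant(doc["url"]):
        return None
    # Extract the article text contents.
    text = doc["title"] + " " + doc["body"]
    # Detect duplicates via a hash of the text content.
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return None
    seen_hashes.add(digest)
    # Facet data and index fields, including the dynamic *_qc and *_sp fields.
    return {
        "title": doc["title"],
        "body": doc["body"],
        "collection": doc["facet"],  # facet value, e.g. taken from the URL
        "title_qc": doc["title"],    # *_qc: feeds query completion
        "content_sp": text,          # *_sp: feeds spelling suggestions
    }
```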
9. Crawling and Indexing
• Crawlers will be as smart as you make them
• Very rigid logic
• Heavily reliant on article quality
• Don't blame the crawler
https://www2.sparebank1.no/portal/4702/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false
https://www2.sparebank1.no/portal/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false
19. Lessons Learnt
• Scope creep
• Garbage in, garbage out
• Documentation is only useful if it gets read
Speaker notes
No query completion, spell check, duplicate detection, contextual search… basically useless.
This is what we learnt at the beginning: 20 "distinct" portals, a bank selector, "semantic" URLs… In one bank portal of 1.5k docs, for example, about 50% were duplicates. Group publications are made available via the CMS, but individual banks are under no obligation to publish the article on their portal, and there's no indication as to whether or not they have. We had to use webcrawlers rather than pushing new content from the CMS directly to the indexing service; will come back to that.
For us, search at the very least means: high-speed queries (yay Solr); high-speed indexing (yay Solr, boo crawlers: up to 2 hours for 20k docs on the test server); basic search analytics, i.e. a query list with hit count and "no hit" count, average queries per time period, etc., which lets SpareBank1 see what people search for most and fail to find most often (see the sketch below). More advanced: click-through information, used to tune the relevancy model. Plus pagination, look and feel, etc.
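As an illustration, the "query list with hit count and no-hit count" part boils down to simple aggregation over a search log. This sketch assumes a (query, hit_count) log format, which is not necessarily what the statistics servers actually stored:

```python
from collections import Counter

def analyse(query_log):
    """query_log: iterable of (query, hit_count) pairs - an assumed format."""
    per_query, no_hits = Counter(), Counter()
    for query, hits in query_log:
        per_query[query.lower()] += 1
        if hits == 0:                      # the "no hit" count
            no_hits[query.lower()] += 1
    return per_query.most_common(10), no_hits.most_common(10)

top_queries, top_misses = analyse([("forsikring", 42), ("lån", 17), ("boliglån", 0)])
```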
Full definition on the Solr wiki -> at a very basic level, you get what you search for. SB1's existing search did not return relevant results. An online portal used by the general public != application search -> queries will not be very "focused", i.e. users look for general keywords rather than a specific user/file/ID etc. These were the only "reliable" (NOT ALWAYS GUARANTEED) bits of metadata we could get from the articles -> subtitles used <b> instead of <hx>. How do you determine facets based on that? How do you determine which bank an article is targeted at?
Precision = "percent of documents returned that are relevant" -> 0%. Recall = "percent of relevant documents returned" -> 0%.
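A worked example of those two definitions, with made-up document ids, matching the 0% scores:

```python
returned = {"doc1", "doc2", "doc3", "doc4"}  # what the engine returned
relevant = {"doc5", "doc6"}                  # what was actually relevant

hits = returned & relevant                   # empty set
precision = len(hits) / len(returned)        # 0.0: nothing returned was relevant
recall = len(hits) / len(relevant)           # 0.0: nothing relevant was returned
```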
Recall is low, precision is low, and there is zero indication (on the other hits) as to why they were returned. This further reveals a poor or non-existent relevancy/scoring model.
We couldn't rely on the CMS: too many articles/documents lacked the relevant/necessary metadata -> webcrawlers! One crawler per bank -> 20 crawlers. Regex was used to drop all articles without the bank id in the URL, and to drop all .css etc. From the HTML we took <title>, <abstract>, <body> and <b> -> used later to build the relevancy model. Duplicate detection was based on a hash of the text content. Facets were taken from the URL with a regex looking for the first tag after the bank id; each bank had subtly different facets. qc was based on titles, sp on title, body and description; *_ denotes dynamic fields (Solr schema) -> context-specific query completion, results and spelling suggestions. A sketch of the URL rules and facet extraction follows.
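A minimal sketch of those per-bank URL rules and the facet extraction, assuming the /portal/<bank_id>/... URL shape shown on the slides; the actual regexes and rule set were more involved:

```python
import re

def make_url_rule(bank_id: str):
    keep = re.compile(rf"^https://www2\.sparebank1\.no/portal/{bank_id}/")
    drop = re.compile(r"\.(css|js|ico|png|gif)(\?|$)")

    def allowed(url: str) -> bool:
        # Drop articles without the bank id in the URL, and all assets.
        return bool(keep.match(url)) and not drop.search(url)

    return allowed

def facet_for(url: str, bank_id: str) -> str | None:
    # Facet = first path segment ("tag") after the bank id.
    m = re.search(rf"/portal/{bank_id}/([^/?]+)", url)
    return m.group(1) if m else None

allowed = make_url_rule("9898")
url = "https://www2.sparebank1.no/portal/9898/3_privat?_nfpb=true"
print(allowed(url), facet_for(url, "9898"))  # True 3_privat
```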
If an article had no <abstract>, it wouldn't be added to the index and wouldn't show up on the results page. If an article included a link that met the rules, it would be crawled and indexed regardless of the validity/relevance of the content to the particular bank -> rogue articles! SpareBank1 were convinced there was something wrong with our implementation of the search because they were getting results from other banks. Based on the crawler's log, we found that the bank in question had a page that linked (using https://www2.sparebank1.no/irrelevant_bank_id/article) to the "rogue" article: https://www2.sparebank1.no/portal/4702/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false vs https://www2.sparebank1.no/portal/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false. You can also see the difference in facets.
Title über alles. Keyword = collection/facet. CONTENT1 = TITLE, CONTENT2 = DESCRIPTION, CONTENT3 = BODY; Stem1 = title, Stem2 = sub_title, Stem3 = body.
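As a sketch of what "title über alles" can look like at query time, here is a Solr dismax query with title boosted above description and body; the boost values, field names and URL are assumptions, not the project's actual configuration:

```python
import requests

params = {
    "q": "forsikring",
    "defType": "dismax",
    # Title weighted highest, then description, then body;
    # the boost values are illustrative.
    "qf": "title^3 description^2 body^1",
    "fq": "collection:privat",  # keyword = collection/facet filter
    "wt": "json",
}
results = requests.get("http://localhost:8983/solr/select", params=params).json()
```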
Additional request handler for each bank’s spell checker to do a filtered search for matches against the misspelt query
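A hedged sketch of such a filtered spellcheck request using standard Solr SpellCheckComponent parameters; the bank_id field, the handler URL and the per-bank filtering shown here are assumptions about the setup:

```python
import requests

def spelling_suggestions(bank_id: str, misspelt: str) -> dict:
    params = {
        "q": misspelt,
        "spellcheck": "true",
        "spellcheck.q": misspelt,      # term(s) to get suggestions for
        "spellcheck.collate": "true",  # ask for a corrected whole query
        "fq": f"bank_id:{bank_id}",    # restrict matches to this bank
        "wt": "json",
    }
    return requests.get("http://localhost:8983/solr/select", params=params).json()
```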
Demo search GUI (it has changed a lot since then). Facets are now only by section/collection, and each bank has its own individual GUI, which they now use internally to find which bank has published what and to find "rogue" articles that shouldn't appear on other bank portals, e.g. Russland for SB1 Nord-Norge showing up elsewhere even after the crawler URL rule filtering.
We went for regex query completion, allowing in-word matches rather than only beginning-of-word matches (see the sketch below). "Forsikring" was a good example, as it very rarely appears at the beginning of the word/phrase in Norwegian.
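A minimal sketch of in-word versus prefix-only matching, with a made-up term list; the real implementation ran against the *_qc fields:

```python
import re

def complete(fragment: str, terms: list[str]) -> list[str]:
    """Match the fragment anywhere in the term, not just at the start."""
    pattern = re.compile(re.escape(fragment), re.IGNORECASE)
    return [t for t in terms if pattern.search(t)]

terms = ["reiseforsikring", "bilforsikring", "innboforsikring", "forsikring"]
print(complete("forsikring", terms))
# A prefix-only matcher would return just ['forsikring']; the in-word match
# also finds the compound words that are so common in Norwegian.
```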
Started very basic and got increasingly more complex. Index size/document count was never an issue, so we didn't need sharding.
From this… with all 3 services on 1 Tomcat/machine. Very naïve; we had little idea of the scope of the project, i.e. 20 banks etc.
To this, with each service on a different Tomcat/machine. Upon realising that we'd need a crawler per bank, each potentially indexing and writing to Solr simultaneously, we split the Solr instances to optimize indexing and search time.
The end product! Users do a search on the portal servers, which are in the DMZ. Searches are logged to the search-statistics servers, which store everything in a database behind an internal firewall. The portal Solr search slave servers check for replication updates at regular intervals, and the indexing master servers crawl daily and push updates to the portal servers. Each master can have several slaves; if one master dies, switch to another (see the sketch below).
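For illustration, the master/slave state can be inspected with Solr's legacy ReplicationHandler commands; the host names below are placeholders, and the manual fetchindex is only a stand-in for the slaves' configured poll interval:

```python
import requests

MASTER = "http://indexing-master:8983/solr"  # placeholder host names
SLAVE = "http://portal-slave:8983/solr"

def index_version(base_url: str) -> int:
    r = requests.get(f"{base_url}/replication",
                     params={"command": "indexversion", "wt": "json"})
    return r.json()["indexversion"]

# If the slave has fallen behind the master, trigger a manual pull; the
# configured pollInterval normally does this automatically.
if index_version(SLAVE) < index_version(MASTER):
    requests.get(f"{SLAVE}/replication", params={"command": "fetchindex"})
```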
I.e. upon making a change in the CMS and publishing it, how well did the crawlers respond, and what did it take to make the new content searchable? New documents weren't showing up despite being published or shared. For NEW articles to be found, they needed to be linked to (with a link that adhered to the crawler URL rules) from an existing crawlable page.
Exactly what I learnt while studying, i.e. what not to do… A lovely little plan that they agreed on and then ignored from the first week onwards. Our fault for not getting them to be more specific about "better search". Their fault for not telling us about their environment requirements, e.g. Linux installation location, RPMs etc. Their fault for not telling us about their documentation requirements (took about 2 months); ours for not asking? My fault for suggesting more and more shiny features. Search will only ever be as good as the underlying information management system and the content owners'/authors' practices; should we address that? Start from the bottom up, with a search-centric information management system and best practices/SOPs. We were frequently getting emails and phone calls from different people asking the same questions, the answers to which are all in the documentation… but no one wants to read that (or write it).