3. Project background
• SpareBank1 Gruppen
• 19 individual bank portals and 1 front page (forside)
• Boost 25 umbrella project
• "Semantic" URLs: https://www2.sparebank1.no/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_
• New search GUI
• CMS with no easy way of telling which bank has published what
• Mass duplications
• Access to other portal-specific articles
• Webcrawlers
4. What is better search?
At the very least:
• Relevant hits
• Faceting
• Query completion
• Spell checking and suggestions
• Basic search analytics
5. Relevant hits
• Relevancy = "…the quality of results returned from a query…"
• Based on hits in fields generated from document processing
• Clean and metadata-rich index
• Pushed from CMS or extracted by crawlers
8. Crawling and Indexing
• Clean and metadata-rich index
• OpenPipeline
• Ignore irrelevant articles
• Extract article text contents
• Detect duplicates
• Facet data
• Populate index fields, including *_qc and *_sp fields (see the sketch below)
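A minimal sketch of these stages, assuming a Python-style pipeline. The real implementation ran in OpenPipeline (a Java framework), so the function, document and field names below are illustrative rather than the actual code; only the *_qc/*_sp naming comes from the slides.

```python
import hashlib

def pipeline(doc: dict, seen_hashes: set, is_relevant) -> dict | None:
    """Map one crawled article to index fields, or drop it."""
    # Ignore irrelevant articles (the URL rules are sketched in the notes).
    if not is_relevant(doc["url"]):
        return None
    # Extract the article text contents.
    text = doc["title"] + " " + doc["body"]
    # Detect duplicates via a hash of the text content.
    digest = hashlib.sha1(text.encode("utf-8")).hexdigest()
    if digest in seen_hashes:
        return None
    seen_hashes.add(digest)
    # Facet data and index fields, including the dynamic *_qc and *_sp fields.
    return {
        "title": doc["title"],
        "body": doc["body"],
        "collection": doc["facet"],  # facet value, e.g. taken from the URL
        "title_qc": doc["title"],    # *_qc: feeds query completion
        "content_sp": text,          # *_sp: feeds spelling suggestions
    }
```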
9. Crawling and Indexing
• Crawlers will be as smart as you make them
• Very rigid logic
• Heavily reliant on article quality
• Don't blame the crawler
https://www2.sparebank1.no/portal/4702/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false
https://www2.sparebank1.no/portal/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false
19. Lessons Learnt
• Scope creep
• Garbage in, garbage out
• Documentation is only useful if it gets read
Speaker notes
No query completion, spell check, duplicate detection, contextual search… basically useless.
This is what we learnt at the beginning: 20 "distinct" portals, a bank selector, "semantic" URLs… In one bank portal of 1.5k docs, for example, about 50% were duplicates. Group publications are made available via the CMS, but individual banks are under no obligation to publish the article on their portal, and there's no indication as to whether or not they have. We had to use webcrawlers rather than pushing new content from the CMS directly to the indexing service; will come back to that.
For us, search at the very least means: high-speed queries (yay Solr); high-speed indexing (yay Solr, boo crawlers: up to 2 hours for 20k docs on the test server); basic search analytics, i.e. a query list with hit count and "no hit" count, average queries per time period, etc., which lets SpareBank1 see what people search for most and fail to find most often (see the sketch below). More advanced: click-through information, used to tune the relevancy model. Plus pagination, look and feel, etc.
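As an illustration, the "query list with hit count and no-hit count" part boils down to simple aggregation over a search log. This sketch assumes a (query, hit_count) log format, which is not necessarily what the statistics servers actually stored:

```python
from collections import Counter

def analyse(query_log):
    """query_log: iterable of (query, hit_count) pairs - an assumed format."""
    per_query, no_hits = Counter(), Counter()
    for query, hits in query_log:
        per_query[query.lower()] += 1
        if hits == 0:                      # the "no hit" count
            no_hits[query.lower()] += 1
    return per_query.most_common(10), no_hits.most_common(10)

top_queries, top_misses = analyse([("forsikring", 42), ("lån", 17), ("boliglån", 0)])
```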
Full definition on the Solr wiki -> at a very basic level, you get what you search for. SB1's existing search did not return relevant results. An online portal used by the general public != application search -> queries will not be very "focused", i.e. users look for general keywords rather than a specific user/file/ID etc. These were the only "reliable" (NOT ALWAYS GUARANTEED) bits of metadata we could get from the articles -> subtitles used <b> instead of <hx>. How do you determine facets based on that? How do you determine which bank an article is targeted at?
Precision = "percent of documents returned that are relevant" -> 0%. Recall = "percent of relevant documents returned" -> 0%.
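A worked example of those two definitions, with made-up document ids, matching the 0% scores:

```python
returned = {"doc1", "doc2", "doc3", "doc4"}  # what the engine returned
relevant = {"doc5", "doc6"}                  # what was actually relevant

hits = returned & relevant                   # empty set
precision = len(hits) / len(returned)        # 0.0: nothing returned was relevant
recall = len(hits) / len(relevant)           # 0.0: nothing relevant was returned
```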
Recall is low, precision is low, and there is zero indication (on the other hits) as to why they were returned. This further reveals a poor or non-existent relevancy/scoring model.
We couldn't rely on the CMS: too many articles/documents lacked the relevant/necessary metadata -> webcrawlers! One crawler per bank -> 20 crawlers. Regex was used to drop all articles without the bank id in the URL, and to drop all .css etc. From the HTML we took <title>, <abstract>, <body> and <b> -> used later to build the relevancy model. Duplicate detection was based on a hash of the text content. Facets were taken from the URL with a regex looking for the first tag after the bank id; each bank had subtly different facets. qc was based on titles, sp on title, body and description; *_ denotes dynamic fields (Solr schema) -> context-specific query completion, results and spelling suggestions. A sketch of the URL rules and facet extraction follows.
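A minimal sketch of those per-bank URL rules and the facet extraction, assuming the /portal/<bank_id>/... URL shape shown on the slides; the actual regexes and rule set were more involved:

```python
import re

def make_url_rule(bank_id: str):
    keep = re.compile(rf"^https://www2\.sparebank1\.no/portal/{bank_id}/")
    drop = re.compile(r"\.(css|js|ico|png|gif)(\?|$)")

    def allowed(url: str) -> bool:
        # Drop articles without the bank id in the URL, and all assets.
        return bool(keep.match(url)) and not drop.search(url)

    return allowed

def facet_for(url: str, bank_id: str) -> str | None:
    # Facet = first path segment ("tag") after the bank id.
    m = re.search(rf"/portal/{bank_id}/([^/?]+)", url)
    return m.group(1) if m else None

allowed = make_url_rule("9898")
url = "https://www2.sparebank1.no/portal/9898/3_privat?_nfpb=true"
print(allowed(url), facet_for(url, "9898"))  # True 3_privat
```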
If an article had no <abstract>, it wouldn't be added to the index and wouldn't show up on the results page. If an article included a link that met the rules, it would be crawled and indexed regardless of the validity/relevance of the content to the particular bank -> rogue articles! SpareBank1 were convinced there was something wrong with our implementation of the search because they were getting results from other banks. Based on the crawler's log, we found that the bank in question had a page that linked (using https://www2.sparebank1.no/irrelevant_bank_id/article) to the "rogue" article: https://www2.sparebank1.no/portal/4702/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false vs https://www2.sparebank1.no/portal/9898/3_privat?_nfpb=true&_nfls=false&_pageLabel=page_privat_innhold&pId=1233149354625&_nfls=false. You can also see the difference in facets.
Title über alles. Keyword = collection/facet. CONTENT1 = TITLE, CONTENT2 = DESCRIPTION, CONTENT3 = BODY; Stem1 = title, Stem2 = sub_title, Stem3 = body.
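As a sketch of what "title über alles" can look like at query time, here is a Solr dismax query with title boosted above description and body; the boost values, field names and URL are assumptions, not the project's actual configuration:

```python
import requests

params = {
    "q": "forsikring",
    "defType": "dismax",
    # Title weighted highest, then description, then body;
    # the boost values are illustrative.
    "qf": "title^3 description^2 body^1",
    "fq": "collection:privat",  # keyword = collection/facet filter
    "wt": "json",
}
results = requests.get("http://localhost:8983/solr/select", params=params).json()
```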
Additional request handler for each bank’s spell checker to do a filtered search for matches against the misspelt query
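A hedged sketch of such a filtered spellcheck request using standard Solr SpellCheckComponent parameters; the bank_id field, the handler URL and the per-bank filtering shown here are assumptions about the setup:

```python
import requests

def spelling_suggestions(bank_id: str, misspelt: str) -> dict:
    params = {
        "q": misspelt,
        "spellcheck": "true",
        "spellcheck.q": misspelt,      # term(s) to get suggestions for
        "spellcheck.collate": "true",  # ask for a corrected whole query
        "fq": f"bank_id:{bank_id}",    # restrict matches to this bank
        "wt": "json",
    }
    return requests.get("http://localhost:8983/solr/select", params=params).json()
```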
Demo search GUI (it has changed a lot since then). Facets are now only by section/collection, and each bank has its own individual GUI, which they now use internally to find which bank has published what and to find "rogue" articles that shouldn't appear on other bank portals, e.g. Russland for SB1 Nord-Norge showing up elsewhere even after the crawler URL rule filtering.
We went for regex query completion, allowing in-word matches rather than only beginning-of-word matches (see the sketch below). "Forsikring" was a good example, as it very rarely appears at the beginning of the word/phrase in Norwegian.
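A minimal sketch of in-word versus prefix-only matching, with a made-up term list; the real implementation ran against the *_qc fields:

```python
import re

def complete(fragment: str, terms: list[str]) -> list[str]:
    """Match the fragment anywhere in the term, not just at the start."""
    pattern = re.compile(re.escape(fragment), re.IGNORECASE)
    return [t for t in terms if pattern.search(t)]

terms = ["reiseforsikring", "bilforsikring", "innboforsikring", "forsikring"]
print(complete("forsikring", terms))
# A prefix-only matcher would return just ['forsikring']; the in-word match
# also finds the compound words that are so common in Norwegian.
```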
Started very basic and got increasingly more complex. Index size/document count was never an issue, so we didn't need sharding.
From this… with all 3 services on 1 Tomcat/machine. Very naïve; we had little idea of the scope of the project, i.e. 20 banks etc.
To this, with each service on a different Tomcat/machine. Upon realising that we'd need a crawler per bank, each potentially indexing and writing to Solr simultaneously, we split the Solr instances to optimize indexing and search time.
The end product! Users do a search on the portal servers, which are in the DMZ. Searches are logged to the search-statistics servers, which store everything in a database behind an internal firewall. The portal Solr search slave servers check for replication updates at regular intervals, and the indexing master servers crawl daily and push updates to the portal servers. Each master can have several slaves; if one master dies, switch to another (see the sketch below).
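For illustration, the master/slave state can be inspected with Solr's legacy ReplicationHandler commands; the host names below are placeholders, and the manual fetchindex is only a stand-in for the slaves' configured poll interval:

```python
import requests

MASTER = "http://indexing-master:8983/solr"  # placeholder host names
SLAVE = "http://portal-slave:8983/solr"

def index_version(base_url: str) -> int:
    r = requests.get(f"{base_url}/replication",
                     params={"command": "indexversion", "wt": "json"})
    return r.json()["indexversion"]

# If the slave has fallen behind the master, trigger a manual pull; the
# configured pollInterval normally does this automatically.
if index_version(SLAVE) < index_version(MASTER):
    requests.get(f"{SLAVE}/replication", params={"command": "fetchindex"})
```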
I.e. upon making a change in the CMS and publishing it, how well did the crawlers respond, and what did it take to make the new content searchable? New documents weren't showing up despite being published or shared. For NEW articles to be found, they needed to be linked to (with a link that adhered to the crawler URL rules) from an existing crawlable page.
Exactly what I learnt while studying, i.e. what not to do… A lovely little plan that they agreed on and then ignored from the first week onwards. Our fault for not getting them to be more specific about "better search". Their fault for not telling us about their environment requirements, e.g. Linux installation location, RPMs etc. Their fault for not telling us about their documentation requirements (took about 2 months); ours for not asking? My fault for suggesting more and more shiny features. Search will only ever be as good as the underlying information management system and the content owners'/authors' practices; should we address that? Start from the bottom up, with a search-centric information management system and best practices/SOPs. We were frequently getting emails and phone calls from different people asking the same questions, the answers to which are all in the documentation… but no one wants to read that (or write it).