Deep Search or Harvest: how we decided 
Simon Moffatt Ilaria Corda 
Discovery & Access – The British Library
www.bl.uk 
Overview 
•Focussing on operational and functional considerations 
– Harvesting data and using Deep Search. 
•Along the way… 
–Challenges of operating with a large data set 
–Demonstrate our new Web Archive search of 1.3bn web pages 
IGeLU 2014 – Deep Search or Harvest: How we decided [14.5]
Introductions 
•We work in a team of six supporting and developing Discovery and Access at the British Library 
•The British Library has reading rooms and storage in London and in Yorkshire. (We are based in Yorkshire) 
•We are a UK Legal Deposit library 
–We collect everything 
–It can only be accessed in our reading rooms 
•Our Ex Libris products 
–Aleph v20, Primo v4.x, SFX v4.x
–All locally hosted 
Deep Search or Harvest: how we decided 
•Like most institutions, the remit of our search is expanding 
–New digital content 
–Migration from older technologies 
•If you have a large number of new records to add, often you have two options: 
–Harvest the records into your main index 
–Deep Search to an external index 
Harvest 
Deep Search 
Two case studies 
Recently we faced this Harvest vs Deep Search dilemma for two new sources of data 
•A replacement to our Content Management System 
–we decided to harvest 
•A search of our new Web Archive 
– we implemented a deep search 
Content Management System - Harvested 
The technology behind the BL website is being updated. 
•There are around 50,000 pages 
•It has its own index, updated daily 
Work involved 
•Creating Normalisation rules and Pipe 
•Setting up daily processes 
Not appropriate for deep search – more later…
Large datasets in Primo: Our experience 
Background: Our data 
•13m Aleph records (Books, Journals, Maps etc) 
•5m SirsiDynix Symphony records (Sound Archive) 
•48m Journal articles (growing by 2m per year) 
•1m other records from five other pipes 
Total: 67 Million records 
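As a quick check of the arithmetic on this slide, the per-source figures can be summed. This is only a sketch: the dictionary labels are chosen here for illustration, and the projection assumes that only journal articles grow, at the stated 2m per year.

```python
# Record counts in millions, as stated on the slide.
records_m = {
    "Aleph": 13,
    "SirsiDynix Symphony": 5,
    "Journal articles": 48,
    "Other pipes": 1,
}


def total_records_m(counts):
    """Sum the record counts (in millions) across all sources."""
    return sum(counts.values())


def projected_total_m(counts, years, article_growth_m=2):
    """Project the total after `years`, assuming only journal articles grow."""
    return total_records_m(counts) + years * article_growth_m


print(total_records_m(records_m))       # 67
print(projected_total_m(records_m, 5))  # 77
```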
Our topology 
[Diagram: server topology]
•FE1 to FE3: Front End Servers
•SE1 to SE8: Search Slices
•BO: Back Office Server
•DB: Database Server
Diagram labels: Load-balanced requests; Public; Private; Search Results; Indexes; Harvests; Config, PNXs etc; Config, Web 2.0 etc
Service challenges with 67m records 
Our index is ~100GB 
•Indexing takes at least 3 hours; Hotswap takes 4 hours 
–Even if there is only one new record 
–Overnight schedules are tight 
–Fast data updates are impossible 
•System restart takes 5 hours 
–Re-sync Search Schema is a whole day 
–Failover system must be available at all times 
•Primo Service Packs and Hotfixes need caution 
•Standard documented advice must be read carefully 
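To see why overnight schedules are tight, the timings above can be put into a small calculation. This is a sketch under assumptions: it treats indexing and hotswap as running back to back, and the start time and morning deadline are hypothetical values, not figures from the slides.

```python
from datetime import datetime, timedelta

INDEXING = timedelta(hours=3)  # "at least 3 hours" per the slide
HOTSWAP = timedelta(hours=4)   # per the slide


def earliest_finish(start):
    """Earliest completion time if hotswap follows indexing immediately."""
    return start + INDEXING + HOTSWAP


def fits_window(start, deadline):
    """True if indexing plus hotswap can finish by the deadline."""
    return earliest_finish(start) <= deadline


# Hypothetical overnight window: jobs may start at 22:00, service needed by 06:00.
start = datetime(2014, 9, 15, 22, 0)
deadline = datetime(2014, 9, 16, 6, 0)
print(earliest_finish(start))        # 2014-09-16 05:00:00
print(fits_window(start, deadline))  # True
```

With these assumed times there is only one hour of slack, so a 90-minute delay to the start would already miss the deadline.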
Development challenges with 67m records 
•Full re-harvest takes 7 weeks 
–Major normalisation changes only 3 times per year 
–Smaller UI changes can be made more often 
•Primo Version upgrades affect 13 servers (or 56 in total) 
•Implementing Primo enhancements 
–We must consider index size 
–Sort, browse etc all have an impact 
Our development cycle (not to scale) 
But there are compensations… 
•Speed of the index 
•Control over the data 
•Consistency of rules 
•Common development skills 
•A single point of failure 
These are all important 
Web Archive - Deep Search 
Web Archiving collects, makes accessible and preserves web resources of scholarly and cultural importance from the UK domain 
Some Figures:
•~1.3 billion documents
•Regular crawling processes, e.g. BBC News
And the infrastructure?
•Index size is ~3.5TB spread across 24 Solr shards
•80-node Hadoop cluster (where crawls are processed for submission)
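For a sense of scale, the figures above can be broken down per shard. This is a rough sketch that assumes documents and index size are spread evenly across the shards and uses decimal units (1TB = 1000GB); the slides do not state the actual distribution.

```python
DOCS = 1.3e9     # ~1.3 billion documents, per the slide
INDEX_GB = 3500  # ~3.5TB, taking 1TB = 1000GB
SHARDS = 24      # Solr shards, per the slide


def per_shard(total, shards):
    """Even split of a total across shards (an assumption, not a measurement)."""
    return total / shards


docs_per_shard = per_shard(DOCS, SHARDS)
gb_per_shard = per_shard(INDEX_GB, SHARDS)

print(f"{docs_per_shard / 1e6:.1f}m documents per shard")  # 54.2m documents per shard
print(f"{gb_per_shard:.1f}GB of index per shard")          # 145.8GB of index per shard
```

Under these assumptions each shard holds an index larger than the whole ~100GB Primo index described earlier.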
Service challenges with Deep Search
•Additional point of failure within Primo
–Newly introduced Troubleshooting procedure
–Service Notices for planned Downtime/Maintenance
–Changes to the Solr Schema that could break our search
•Primo Upgrades & Hotfixes can affect the functioning of the Deep Search
–Re-builds of the Client in case the Deep Search libraries (jars) change
Development challenges with Deep Search 
•Significant development work to implement Primo/Solr integration
•Accommodate new Primo features (versioning control)
–E.g. Multi-select faceting (Exclusion & Inclusion)
•Needs to ensure consistency across local and non-local collections
–Seemingly NO difference from a UI/UX point of view
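To illustrate what multi-select faceting involves on the Solr side, here is a minimal sketch that builds request parameters using Solr's tag/exclude local params (`{!tag=...}` on a filter query, `{!ex=...}` on the facet field). The slides do not show the British Library's client code, so the host, core name (`webarchive`), field name (`content_type`) and function names are all hypothetical.

```python
from urllib.parse import urlencode


def _any_of(values):
    """Render values as a quoted OR group, e.g. ("pdf" OR "html").

    No escaping of quotes inside values is done; this is only a sketch.
    """
    return "(" + " OR ".join('"%s"' % v for v in values) + ")"


def solr_params(text, field, include=(), exclude=(), rows=10):
    """Build Solr /select parameters with multi-select faceting on one field."""
    params = [
        ("q", text),
        ("rows", str(rows)),
        ("wt", "json"),
        ("facet", "true"),
        # {!ex=sel} makes Solr ignore filters tagged "sel" when counting this
        # facet, so unselected values keep their counts (multi-select).
        ("facet.field", "{!ex=sel}" + field),
    ]
    if include:
        # Inclusion: keep only documents matching any selected value.
        params.append(("fq", "{!tag=sel}" + field + ":" + _any_of(include)))
    if exclude:
        # Exclusion: drop documents matching any excluded value (left untagged).
        params.append(("fq", "-" + field + ":" + _any_of(exclude)))
    return params


def solr_url(base_url, params):
    """Join a core URL and parameters into a /select request URL."""
    return base_url.rstrip("/") + "/select?" + urlencode(params)


# Hypothetical core and field names, for illustration only.
p = solr_params("olympics", "content_type", include=["pdf"], exclude=["image"])
print(solr_url("http://localhost:8983/solr/webarchive", p))
```

Keeping inclusion and exclusion as separate filter queries is one simple way to mirror the two facet actions in the UI; a real integration would also need to map the results back into Primo's own formats.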
But there are compensations… 
•Ideal for large indexes and frequent updates
•Independent indexing processes
•Maintenance of existing scheduled processes
•Leverage existing built-in Primo features
•Existing and extendable Solr APIs / active community
Harvest vs Deep Search: how we decided 
[Chart comparing Deep Search and Harvest against the following criteria]
•Additional infrastructure costs
•Dev time / specialist skills
•Index size / no. of records
•Frequency & volume of updates
•Impact on current processes
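The criteria above can be turned into a toy rule of thumb. This is not the authors' decision method: it only looks at two of the five criteria (record count and update frequency), and the 100m-record threshold is a hypothetical value chosen so that the two case studies from the talk come out as described.

```python
from dataclasses import dataclass


@dataclass
class Source:
    name: str
    records: int
    needs_fast_updates: bool  # faster than an overnight index cycle allows


def suggest(source, max_harvest_records=100_000_000):
    """Suggest an approach from size and update needs (threshold is hypothetical)."""
    if source.records > max_harvest_records or source.needs_fast_updates:
        return "deep search"
    return "harvest"


# The two case studies from the talk, with figures taken from the slides.
cms = Source("Content Management System", 50_000, needs_fast_updates=False)
web_archive = Source("Web Archive", 1_300_000_000, needs_fast_updates=True)

print(cms.name, "->", suggest(cms))                  # Content Management System -> harvest
print(web_archive.name, "->", suggest(web_archive))  # Web Archive -> deep search
```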
Thank you
