Unstructured Doc Solr Overview


This document discusses Solr, an open source search platform from the Apache Lucene project. It provides full-text search, faceted search, auto-suggest capabilities, and supports multiple file formats for document indexing. The document outlines Solr's architecture and components, provides usage examples from large government sites, and recommends related open source tools.





1. Un-Structured. Or: How I Learned to Stop Worrying and Love the XML (Mike Nibeck, Asim Shaikh)

2. 1st NF, 2nd NF, 3rd NF: It’s The Way It’s Done

3. Maintainability vs. Performance

4. I’m Feeling Lucky

5. Solr (v4.4)
• Extension of Apache Lucene
• Full Text Search
• Open Interfaces (XML, JSON, HTTP)
• Faceted Search
• Database Ingest
• Document Indexing (PDF, Word, etc)
• Spelling Suggestions
• Auto Suggest
• “Cloudy”
• Advanced Input Parsing
• Relevance Ranking
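The open HTTP/JSON interface and faceted search listed above can be illustrated with a small sketch that only builds a query URL and sends nothing. The host, the core name `documents`, and the facet field `language` are assumptions made for illustration and do not come from the slides. The parameters `q`, `wt`, `rows`, `facet`, and `facet.field` are standard Solr query parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, text, facet_fields=(), rows=10):
    """Build a URL for Solr's /select handler (no request is sent)."""
    params = [("q", text), ("wt", "json"), ("rows", str(rows))]
    if facet_fields:
        # Faceting is switched on once, then one facet.field per field.
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return f"{base_url.rstrip('/')}/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and field names.
url = build_select_url(
    "http://localhost:8983/solr", "documents", "peanut butter", ["language"]
)
print(url)
```

Fetching that URL from a running Solr instance would return matching documents plus per-value counts for the facet field, in a single JSON response.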

6. You got your chocolate in my peanut butter!

7. It’s a Hammer. A really nice, efficient and free hammer.

8. A Mental Shift Pancakes & Relevancy

9. Chronicling America
• 6.8 million documents
• 10 billion vectors
• 50,000 queries/day
• Index 250 GB
• +100K documents per month

Congress.gov
• 4 million documents
• 3.3+ million queries/day (user and system)
• 36 GB indexes
• Adding many thousands/month

Library Web Search
• 18+ million documents
• 9,000 queries/day
• 28 GB index size
• + many thousands/month

World Digital Library
• 120k documents
• 7 different languages
• 10-50k queries/day
• Index < 1 GB
• +100 documents/month
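To put the daily query counts above on a common scale, a rough conversion to average queries per second may help. This is a sketch only: it assumes traffic is spread evenly over 24 hours, which the slides do not claim, so real peaks would be higher.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def avg_qps(queries_per_day):
    """Average queries per second, assuming a perfectly even daily load."""
    return queries_per_day / SECONDS_PER_DAY


# Daily query counts as stated on the slide (lower bound where a range
# or a "+" figure is given).
sites = {
    "Chronicling America": 50_000,
    "Congress.gov": 3_300_000,
    "Library Web Search": 9_000,
    "World Digital Library": 10_000,
}

for name, per_day in sites.items():
    print(f"{name}: about {avg_qps(per_day):.1f} queries/second on average")
```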

10. Solr Architecture - congress.gov (diagram; components shown: Users, Load Balancer, Web Cache, App Servers, SOLR Cores, Indexing, Database, Filesystem, ETL Processing (Extract, Translate, Load), Master Data Sources, Legacy Systems, Data Partners)

11. Analyzers, Tokenizers and Filters. Oh My!
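As a rough illustration of how these three pieces relate: an analyzer can be thought of as a tokenizer followed by a chain of token filters. The sketch below is plain Python, not Solr or Lucene code. The function names and the stop-word list are made up for illustration, and real Solr analysis chains are declared in the schema configuration rather than written like this.

```python
import re

# Illustrative stop-word list; not Solr's default.
STOP_WORDS = frozenset({"a", "an", "and", "the", "of", "to"})


def tokenize(text):
    """Tokenizer: split raw text into word tokens."""
    return re.findall(r"\w+", text)


def lowercase_filter(tokens):
    """Filter: normalize case so 'Solr' and 'solr' match."""
    return [token.lower() for token in tokens]


def stop_filter(tokens):
    """Filter: drop very common words that carry little meaning."""
    return [token for token in tokens if token not in STOP_WORDS]


def analyze(text, filters=(lowercase_filter, stop_filter)):
    """Analyzer: run the tokenizer, then each filter in order."""
    tokens = tokenize(text)
    for token_filter in filters:
        tokens = token_filter(tokens)
    return tokens


print(analyze("Analyzers, Tokenizers and Filters. Oh My!"))
# → ['analyzers', 'tokenizers', 'filters', 'oh', 'my']
```

Filter order matters in this sketch: the stop filter compares against lowercase words, so it has to run after the lowercase filter.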

12. Cores? We Don’t Need No Stinkin' Cores

13. Data Import Handler

14. Next Steps

15. Open Source Tools
• PHP / Zend
• Python / Django
• MySQL
• RabbitMQ
• Varnish
• Jenkins
• Graphite, Statsd

16. Mike Nibeck - mnib@loc.gov; Asim Shaikh - ashaikh@loc.gov
