
Unstructured, Or: How I Learned to Stop Worrying and Love the XML, presented by Mike Nibeck and Asim Shaikh


Published in: Technology

  1. Un-Structured, Or: How I Learned to Stop Worrying and Love the XML. Mike Nibeck, Asim Shaikh
  2. 1st NF, 2nd NF, 3rd NF: It’s The Way It’s Done
  3. Maintainability vs. Performance
  4. I’m Feeling Lucky
  5. Solr (v4.4)
     • Extension of Apache Lucene
     • Full Text Search
     • Open Interfaces (XML, JSON, HTTP)
     • Faceted Search
     • Database Ingest
     • Document Indexing (PDF, Word, etc.)
     • Spelling Suggestions
     • Auto Suggest
     • “Cloudy”
     • Advanced Input Parsing
     • Relevance Ranking
  6. You got your chocolate in my peanut butter!
  7. It’s a Hammer. A really nice, efficient and free hammer.
  8. A Mental Shift: Pancakes & Relevancy
  9. Solr deployments by site
     Chronicling America
     • 6.8 million documents
     • 10 billion vectors
     • 50,000 queries/day
     • Index: 250 GB
     • +100K documents per month
     Congress.gov
     • 4 million documents
     • 3.3+ million queries/day (user and system)
     • 36 GB indexes
     • Adding many thousands/month
     Library Web Search
     • 18+ million documents
     • 9,000 queries/day
     • 28 GB index size
     • + many thousands/month
     World Digital Library
     • 120k documents
     • 7 different languages
     • 10-50k queries/day
     • Index < 1 GB
     • +100 documents/month
  10. Solr Architecture - congress.gov (diagram). Components shown: Users, Web Cache, Load Balancer, App Servers, SOLR Cores (shown twice), Indexing, Database, Filesystem, ETL Processing (Extract, Translate, Load), Master Data Sources, Legacy Systems, Data Partners
  11. Analyzers, Tokenizers and Filters. Oh My!
  12. Cores? We Don’t Need No Stinkin' Cores
  13. Data Import Handler
  14. Next Steps
  15. Open Source Tools
     • PHP / Zend
     • Python / Django
     • MySQL
     • RabbitMQ
     • Varnish
     • Jenkins
     • Graphite, Statsd
  16. Mike Nibeck - mnib@loc.gov, Asim Shaikh - ashaikh@loc.gov
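
Slide 5 lists Solr's open HTTP/JSON interfaces and faceted search without showing a request. As a rough illustration only, the sketch below builds a URL for Solr's standard `/select` handler with faceting turned on. The host, the core name `newspapers`, the field names `state` and `year`, and the helper `build_select_url` are all hypothetical and do not come from the presentation; only the query parameters (`q`, `rows`, `wt`, `facet`, `facet.field`) are standard Solr parameters.

```python
from urllib.parse import urlencode


def build_select_url(base_url, core, query, facet_fields, rows=10):
    """Build a URL for Solr's /select handler with field faceting enabled.

    base_url, core, and the facet field names are placeholders chosen for
    this sketch; they are not taken from the slides.
    """
    params = [
        ("q", query),       # main query string
        ("rows", rows),     # number of documents to return
        ("wt", "json"),     # ask for a JSON response
        ("facet", "true"),  # switch faceting on
    ]
    # facet.field may be repeated, once per field to facet on
    params += [("facet.field", field) for field in facet_fields]
    return f"{base_url}/solr/{core}/select?{urlencode(params)}"


# Hypothetical host, core, and fields, for illustration only
url = build_select_url(
    "http://localhost:8983", "newspapers", "title:pancakes", ["state", "year"]
)
print(url)
```

Fetching the resulting URL with any HTTP client would, on a real Solr instance with matching fields, return the matching documents together with per-field facet counts; that step is left out here because it needs a running server.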
