SlideShare una empresa de Scribd logo
1 de 29
Descargar para leer sin conexión
LuSql: (Quickly and easily)Getting your
  data from your DBMS into Lucene
                Glen Newton
         CISTI Research, CISTI NRC
                code4lib 2009
        Providence Rhode Island Feb 24 2009
Outline

•   What is LuSql?
•   Context
•   Examples
•   Performance & comparisons
•   Next version
Context

• CISTI == Canada National Science Library
• Digital Library Research Group
• Heavy text mining, knowledge discovery tools, information
  visualization, citation analysis, recommender systems
• Large local text collection:
   – 8.4M PDFs, full text & metadata (~700GB)
   – Full text on file system
   – Metadata in MySql
• Team of 4: Lucene expert; 3 needing to use Lucene
• Daily creation of some experiment/domain/foo – specific large
  scale Lucene index
LuSql Rationale

• Need for low barrier, high performance, flexible tool for Lucene
  index creation
• Choice
   – SOLR
   – DBSight
   – Lucen4DB.net
   – Hibernate Search
   – Compass
• All one or more of:
   – Overly complicated for non-Lucene / non-Java / non-XML / non-
      framework users
   – Performance/scalability issues
   – Not Open Source Software (OSS)
LuSql

• User knowledge:
   – Knowledge of SQL
   – Knowledge of their database and tables
   – Ability to set the Java CLASSPATH in a command-line shell
   – Ability to run a command line application
LuSql Command Line
                                  Arguments
–   Create or append
–   SQL
–   JDBC URL
–   # records to index
–   Lucene Analyzer class
–   JDBC driver class
–   Indexing properties, global or by field
–   Global term value (i.e. all documents have quot;source=catquot;)
–   Lucene RAM buffer size
–   Stop word file
–   Lucene index directory
–   Multithreading toggle, #threads
–   Pluggable Document filter
–   Subqueries
Examples

java jar lusql.jar -q quot;select * from Article where
  volumeYear > 2007quot; -c
  quot;jdbc:mysql://dbhost/db?user=ID&password=PASSquot; -n
  5 -l tutorial -I 211 -t
Index Term Properties

• Index: Default:TOKENIZED
   – 0:NO
   – 1:NO_NORMS
   – 2:TOKENIZED
   – 3:UN_TOKENIZED
• Store: Default:YES
   – 0:NO
   – 1:YES
   – 2:COMPRESS
• Term vector: Default:YES
   – 0:NO
   – 1:YES
Example 2: Complex
                         Join
java -jar lusql.jar -q quot;select Publisher.name as
  pub, Journal.title as jo,Article.rawUrl as text ,
  Journal.issn, Volume.number as
  vol,Volume.coverYear as year, Issue.number as
  iss, Article.id as id, Article.title as ti,
  Article.abstract as ab, Article.startPage as
  startPage, Article.endPage as endPage from
  Publisher, Journal, Volume, Issue, Article where
  Publisher.id = Journal.publisherId and Journal.id
  = Volume.journalId and Volume.id=Issue.volumeId
  and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql
  ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l
  tutorial_2
Example 2: Large Scale
                                    Index Propertries
•   6.4M articles (only metadata)
•   20 fields, including abstract
•   Indexing time: 1h 34m
•   Index size: 21GB
Example 3: Complex
                         Join
java -jar lusql.jar -q quot;select Publisher.name as
  pub, Journal.title as jo,Article.rawUrl as text ,
  Journal.issn, Volume.number as
  vol,Volume.coverYear as year, Issue.number as
  iss, Article.id as id, Article.title as ti,
  Article.abstract as ab, Article.startPage as
  startPage, Article.endPage as endPage from
  Publisher, Journal, Volume, Issue, Article where
  Publisher.id = Journal.publisherId and Journal.id
  = Volume.journalId and Volume.id=Issue.volumeId
  and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql
  ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l
  tutorial_2
Example 4: Out-of-band
                                 document manipulation
• Plugin architecture allowing arbitrary manipulations of Documents
  before they go into the index
• Implement DocFilter interface
• Add filter class at command line:
   – -f ca.nrc.cisti.lusql.example.FileFullTextFilter
   – Looks in metadata field for PDF location in file system; finds
     corresponding .txt file; reads file & adds to Document
Example 4: Large Scale
                                   Index Properties
•   6.4M articles (metadata & full-text), ~600GB PDFs
•   21 fields, including abstract & full-text
•   Indexing time: 13h 46m
•   Index size: 86GB
Comparison to SOLR

• SOLR 1.4 November build:
   – Using DataImportHandler, with all defaults
• Lucene 2.4
• LuSql 0.90
Hardware
• Indexing and database machines:
   – Dell PowerEdge 1955 Blade server56
   – CPU: 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0
     Ghz 64bit
   – Memory: 8 GB 667MHz
   – Disk: 2 x 73GB internal 10K RPM SAS drives
• Both machines attached to:
   – Dell EMC AX150 storage arrays
   – 12 x 500 GB SATA II 7.2K RPM disks
   – via:
   – SilkWorm 200E57 Series 16-Port Capable 4Gb Fabric Switch
Software
• MySql: v5.0.45 compiled from source.
• gcc: gcc version 4.1.2 20061115 (prerelease) (SUSE Linux)
• Java: java version 1.6.0 07 SE Runtime Environment (build 1.6.0
  07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23,
  mixed mode)
• Operating System
   – Linux openSUSE 10.2 (64-bit X86-64)
   – Linux kernel: 2.6.18.8-0.10-default #1 SMP
Comparison to SOLR
Comparison to SOLR
Acknowledgments

• Greg Kresko, Andre Vellino, Jeff Demaine, various LuSql users
LuSql 0.95 in
                                   development
• Re-architected to have pluggable drivers for both input & output
• Read drivers:
   – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, RMI, Terrier
• Write drivers:
   – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, text, XML, RMI,
     Terrier
• For large volumes, concurrent multiple indexes merged at end
Questions

• Glen Newton glen.newton@nrc-cnrc.gc.ca
LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

Más contenido relacionado

La actualidad más candente

Using memcache to improve php performance
Using memcache to improve php performanceUsing memcache to improve php performance
Using memcache to improve php performanceSudar Muthu
 
Introduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System AdministratorsIntroduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System AdministratorsJignesh Shah
 
8. key value databases laboratory
8. key value databases laboratory 8. key value databases laboratory
8. key value databases laboratory Fabio Fumarola
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"IT Event
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedGear6
 
캐시 분산처리 인프라
캐시 분산처리 인프라캐시 분산처리 인프라
캐시 분산처리 인프라Park Chunduck
 
ProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQLProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQLRené Cannaò
 
Using advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/JUsing advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/JMariaDB plc
 
Proxysql sharding
Proxysql shardingProxysql sharding
Proxysql shardingMarco Tusa
 
MySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the PacemakerMySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the Pacemakerhastexo
 
Proxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynoteProxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynoteMarco Tusa
 
CUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID
 
Varnish Configuration Step by Step
Varnish Configuration Step by StepVarnish Configuration Step by Step
Varnish Configuration Step by StepKim Stefan Lindholm
 
How to scale your web app
How to scale your web appHow to scale your web app
How to scale your web appGeorgio_1999
 
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...Mydbops
 
ProxySQL for MySQL
ProxySQL for MySQLProxySQL for MySQL
ProxySQL for MySQLMydbops
 
plProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerplProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerelliando dias
 

La actualidad más candente (20)

Cassandra as Memcache
Cassandra as MemcacheCassandra as Memcache
Cassandra as Memcache
 
Using memcache to improve php performance
Using memcache to improve php performanceUsing memcache to improve php performance
Using memcache to improve php performance
 
Introduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System AdministratorsIntroduction to PostgreSQL for System Administrators
Introduction to PostgreSQL for System Administrators
 
8. key value databases laboratory
8. key value databases laboratory 8. key value databases laboratory
8. key value databases laboratory
 
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
Leonid Vasilyev  "Building, deploying and running production code at Dropbox"Leonid Vasilyev  "Building, deploying and running production code at Dropbox"
Leonid Vasilyev "Building, deploying and running production code at Dropbox"
 
Implementing High Availability Caching with Memcached
Implementing High Availability Caching with MemcachedImplementing High Availability Caching with Memcached
Implementing High Availability Caching with Memcached
 
캐시 분산처리 인프라
캐시 분산처리 인프라캐시 분산처리 인프라
캐시 분산처리 인프라
 
ProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQLProxySQL - High Performance and HA Proxy for MySQL
ProxySQL - High Performance and HA Proxy for MySQL
 
Using advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/JUsing advanced options in MariaDB Connector/J
Using advanced options in MariaDB Connector/J
 
Proxysql sharding
Proxysql shardingProxysql sharding
Proxysql sharding
 
MySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the PacemakerMySQL High Availability Sprint: Launch the Pacemaker
MySQL High Availability Sprint: Launch the Pacemaker
 
Level DB - Quick Cheat Sheet
Level DB - Quick Cheat SheetLevel DB - Quick Cheat Sheet
Level DB - Quick Cheat Sheet
 
Proxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynoteProxysql ha plam_2016_2_keynote
Proxysql ha plam_2016_2_keynote
 
CUBRID Cluster Introduction
CUBRID Cluster IntroductionCUBRID Cluster Introduction
CUBRID Cluster Introduction
 
Varnish Configuration Step by Step
Varnish Configuration Step by StepVarnish Configuration Step by Step
Varnish Configuration Step by Step
 
How to scale your web app
How to scale your web appHow to scale your web app
How to scale your web app
 
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
Analyze corefile and backtraces with GDB for Mysql/MariaDB on Linux - Nilanda...
 
ProxySQL for MySQL
ProxySQL for MySQLProxySQL for MySQL
ProxySQL for MySQL
 
plProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancerplProxy, pgBouncer, pgBalancer
plProxy, pgBouncer, pgBalancer
 
Introduction to HDFS
Introduction to HDFSIntroduction to HDFS
Introduction to HDFS
 

Destacado

Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Dataeby
 
DLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendationDLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendationeby
 
Nichols & Co. Photography Brochure
Nichols & Co. Photography BrochureNichols & Co. Photography Brochure
Nichols & Co. Photography Brochureguest1b72b3
 
RESTafarian-ism at the NLA
RESTafarian-ism at the NLARESTafarian-ism at the NLA
RESTafarian-ism at the NLAeby
 
Like a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and JangleLike a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and Jangleeby
 
CouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilegeCouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilegeeby
 
The Future of Vertical Search Engines
The Future of Vertical Search EnginesThe Future of Vertical Search Engines
The Future of Vertical Search EnginesTed Drake
 
BABOK Study Group - meeting 4
BABOK Study Group - meeting 4BABOK Study Group - meeting 4
BABOK Study Group - meeting 4Paweł Zubkiewicz
 

Destacado (10)

Why libraries should embrace Linked Data
Why libraries should embrace Linked DataWhy libraries should embrace Linked Data
Why libraries should embrace Linked Data
 
DLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendationDLF ILS Discovery Interface Task Force API recommendation
DLF ILS Discovery Interface Task Force API recommendation
 
Discovery Interfaces
Discovery InterfacesDiscovery Interfaces
Discovery Interfaces
 
Nichols & Co. Photography Brochure
Nichols & Co. Photography BrochureNichols & Co. Photography Brochure
Nichols & Co. Photography Brochure
 
RESTafarian-ism at the NLA
RESTafarian-ism at the NLARESTafarian-ism at the NLA
RESTafarian-ism at the NLA
 
Like a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and JangleLike a can opener for your data silo: simple access through AtomPub and Jangle
Like a can opener for your data silo: simple access through AtomPub and Jangle
 
Como evitar enfermarse
Como evitar enfermarseComo evitar enfermarse
Como evitar enfermarse
 
CouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilegeCouchDB is sacrilege... mmm, delicious sacrilege
CouchDB is sacrilege... mmm, delicious sacrilege
 
The Future of Vertical Search Engines
The Future of Vertical Search EnginesThe Future of Vertical Search Engines
The Future of Vertical Search Engines
 
BABOK Study Group - meeting 4
BABOK Study Group - meeting 4BABOK Study Group - meeting 4
BABOK Study Group - meeting 4
 

Similar a LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At CraigslistMySQLConference
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and dockerBob Ward
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBob Ward
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Data Con LA
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Odinot Stanislas
 
Setting up repositories
Setting up repositoriesSetting up repositories
Setting up repositoriesIryna Kuchma
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkAlex Zeltov
 
Evergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen ILS
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09Chris Purrington
 
Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresmkorremans
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Databricks
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...Glenn K. Lockwood
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsDataWorks Summit
 

Similar a LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene (20)

My Sql And Search At Craigslist
My Sql And Search At CraigslistMy Sql And Search At Craigslist
My Sql And Search At Craigslist
 
Experience sql server on l inux and docker
Experience sql server on l inux and dockerExperience sql server on l inux and docker
Experience sql server on l inux and docker
 
Brk2051 sql server on linux and docker
Brk2051 sql server on linux and dockerBrk2051 sql server on linux and docker
Brk2051 sql server on linux and docker
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
Big Data Day LA 2016/ Big Data Track - How To Use Impala and Kudu To Optimize...
 
Ceph
CephCeph
Ceph
 
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
Ceph: Open Source Storage Software Optimizations on Intel® Architecture for C...
 
Setting up repositories
Setting up repositoriesSetting up repositories
Setting up repositories
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Intro to Apache Spark
Intro to Apache SparkIntro to Apache Spark
Intro to Apache Spark
 
Typesafe spark- Zalando meetup
Typesafe spark- Zalando meetupTypesafe spark- Zalando meetup
Typesafe spark- Zalando meetup
 
Intro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with sparkIntro to big data analytics using microsoft machine learning server with spark
Intro to big data analytics using microsoft machine learning server with spark
 
Ml2
Ml2Ml2
Ml2
 
Evergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival SkillsEvergreen Sysadmin Survival Skills
Evergreen Sysadmin Survival Skills
 
AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09AWS (Hadoop) Meetup 30.04.09
AWS (Hadoop) Meetup 30.04.09
 
Vijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-featuresVijfhart thema-avond-oracle-12c-new-features
Vijfhart thema-avond-oracle-12c-new-features
 
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
Spark Saturday: Spark SQL & DataFrame Workshop with Apache Spark 2.3
 
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
The Proto-Burst Buffer: Experience with the flash-based file system on SDSC's...
 
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test ResultsUncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
Uncovering an Apache Spark 2 Benchmark - Configuration, Tuning and Test Results
 
What's New in Apache Hive
What's New in Apache HiveWhat's New in Apache Hive
What's New in Apache Hive
 

Más de eby

Using a CSS Framework
Using a CSS FrameworkUsing a CSS Framework
Using a CSS Frameworkeby
 
XForms for Metadata creation
XForms for Metadata creationXForms for Metadata creation
XForms for Metadata creationeby
 
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?eby
 
ÖpënÜRL
ÖpënÜRLÖpënÜRL
ÖpënÜRLeby
 
Building Mountains Out of Molehills
Building Mountains Out of MolehillsBuilding Mountains Out of Molehills
Building Mountains Out of Molehillseby
 
Zotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic WebZotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic Webeby
 
Creating an Academic Image Collection with Flickr
Creating an Academic Image Collection with FlickrCreating an Academic Image Collection with Flickr
Creating an Academic Image Collection with Flickreby
 
From Idea to Open Source
From Idea to Open SourceFrom Idea to Open Source
From Idea to Open Sourceeby
 
Obstacles to Agility
Obstacles to AgilityObstacles to Agility
Obstacles to Agilityeby
 
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...eby
 
Code4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher KeynoteCode4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher Keynoteeby
 
Library Data APIs Abound!
Library Data APIs Abound!Library Data APIs Abound!
Library Data APIs Abound!eby
 
Smart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject RecommendationsSmart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject Recommendationseby
 
On The Herding of Cats
On The Herding of CatsOn The Herding of Cats
On The Herding of Catseby
 
Code4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch PortalCode4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch Portaleby
 
Code4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's timeCode4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's timeeby
 
Evergreen - Future of the ILS
Evergreen - Future of the ILSEvergreen - Future of the ILS
Evergreen - Future of the ILSeby
 
Lucene Summit - Beth's Presentation
Lucene Summit - Beth's PresentationLucene Summit - Beth's Presentation
Lucene Summit - Beth's Presentationeby
 

Más de eby (18)

Using a CSS Framework
Using a CSS FrameworkUsing a CSS Framework
Using a CSS Framework
 
XForms for Metadata creation
XForms for Metadata creationXForms for Metadata creation
XForms for Metadata creation
 
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
Karen Coyle Keynote - R&D: Can Resource Description become Rigorous Data?
 
ÖpënÜRL
ÖpënÜRLÖpënÜRL
ÖpënÜRL
 
Building Mountains Out of Molehills
Building Mountains Out of MolehillsBuilding Mountains Out of Molehills
Building Mountains Out of Molehills
 
Zotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic WebZotero and You, or Bibliography on the Semantic Web
Zotero and You, or Bibliography on the Semantic Web
 
Creating an Academic Image Collection with Flickr
Creating an Academic Image Collection with FlickrCreating an Academic Image Collection with Flickr
Creating an Academic Image Collection with Flickr
 
From Idea to Open Source
From Idea to Open SourceFrom Idea to Open Source
From Idea to Open Source
 
Obstacles to Agility
Obstacles to AgilityObstacles to Agility
Obstacles to Agility
 
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...The Intellectual Property Disclosure Process: Releasing Open Source Software ...
The Intellectual Property Disclosure Process: Releasing Open Source Software ...
 
Code4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher KeynoteCode4Lib 2007: Erik Hatcher Keynote
Code4Lib 2007: Erik Hatcher Keynote
 
Library Data APIs Abound!
Library Data APIs Abound!Library Data APIs Abound!
Library Data APIs Abound!
 
Smart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject RecommendationsSmart Subjects - Application Independent Subject Recommendations
Smart Subjects - Application Independent Subject Recommendations
 
On The Herding of Cats
On The Herding of CatsOn The Herding of Cats
On The Herding of Cats
 
Code4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch PortalCode4Lib 2007: MyResearch Portal
Code4Lib 2007: MyResearch Portal
 
Code4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's timeCode4Lib 2007: Hurry up please, it's time
Code4Lib 2007: Hurry up please, it's time
 
Evergreen - Future of the ILS
Evergreen - Future of the ILSEvergreen - Future of the ILS
Evergreen - Future of the ILS
 
Lucene Summit - Beth's Presentation
Lucene Summit - Beth's PresentationLucene Summit - Beth's Presentation
Lucene Summit - Beth's Presentation
 

Último

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)Gabriella Davis
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure servicePooja Nehwal
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptxHampshireHUG
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesSinan KOZAK
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking MenDelhi Call girls
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processorsdebabhi2
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 

Último (20)

A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)A Domino Admins Adventures (Engage 2024)
A Domino Admins Adventures (Engage 2024)
 
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure serviceWhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
WhatsApp 9892124323 ✓Call Girls In Kalyan ( Mumbai ) secure service
 
Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024Tata AIG General Insurance Company - Insurer Innovation Award 2024
Tata AIG General Insurance Company - Insurer Innovation Award 2024
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
04-2024-HHUG-Sales-and-Marketing-Alignment.pptx
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
 
Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Unblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen FramesUnblocking The Main Thread Solving ANRs and Frozen Frames
Unblocking The Main Thread Solving ANRs and Frozen Frames
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men08448380779 Call Girls In Greater Kailash - I Women Seeking Men
08448380779 Call Girls In Greater Kailash - I Women Seeking Men
 
Exploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone ProcessorsExploring the Future Potential of AI-Enabled Smartphone Processors
Exploring the Future Potential of AI-Enabled Smartphone Processors
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 

LuSql: (Quickly and easily) Getting your data from your DBMS into Lucene

  • 1. LuSql: (Quickly and easily)Getting your data from your DBMS into Lucene Glen Newton CISTI Research, CISTI NRC code4lib 2009 Providence Rhode Island Feb 24 2009
  • 2. Outline • What is LuSql? • Context • Examples • Performance & comparisons • Next version
  • 3. Context • CISTI == Canada National Science Library • Digital Library Research Group • Heavy text mining, knowledge discovery tools, information visualization, citation analysis, recommender systems • Large local text collection: – 8.4M PDFs, full text & metadata (~700GB) – Full text on file system – Metadata in MySql • Team of 4: Lucene expert; 3 needing to use Lucene • Daily creation of some experiment/domain/foo – specific large scale Lucene index
  • 4. LuSql Rationale • Need for low barrier, high performance, flexible tool for Lucene index creation • Choice – SOLR – DBSight – Lucen4DB.net – Hibernate Search – Compass • All one or more of: – Overly complicated for non-Lucene / non-Java / non-XML / non- framework users – Performance/scalability issues – Not Open Source Software (OSS)
  • 5. LuSql • User knowledge: – Knowledge of SQL – Knowledge of their database and tables – Ability to set the Java CLASSPATH in a command-line shell – Ability to run a command line application
  • 6. LuSql Command Line Arguments – Create or append – SQL – JDBC URL – # records to index – Lucene Analyzer class – JDBC driver class – Indexing properties, global or by field – Global term value (i.e. all documents have quot;source=catquot;) – Lucene RAM buffer size – Stop word file – Lucene index directory – Multithreading toggle, #threads – Pluggable Document filter – Subqueries
  • 7.
  • 8. Examples java jar lusql.jar -q quot;select * from Article where volumeYear > 2007quot; -c quot;jdbc:mysql://dbhost/db?user=ID&password=PASSquot; -n 5 -l tutorial -I 211 -t
  • 9. Index Term Properties • Index: Default:TOKENIZED – 0:NO – 1:NO_NORMS – 2:TOKENIZED – 3:UN_TOKENIZED • Store: Default:YES – 0:NO – 1:YES – 2:COMPRESS • Term vector: Default:YES – 0:NO – 1:YES
  • 10.
  • 11. Example 2: Complex Join java -jar lusql.jar -q quot;select Publisher.name as pub, Journal.title as jo,Article.rawUrl as text , Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss, Article.id as id, Article.title as ti, Article.abstract as ab, Article.startPage as startPage, Article.endPage as endPage from Publisher, Journal, Volume, Issue, Article where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId and Volume.id=Issue.volumeId and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l tutorial_2
  • 12.
  • 13. Example 2: Large Scale Index Propertries • 6.4M articles (only metadata) • 20 fields, including abstract • Indexing time: 1h 34m • Index size: 21GB
  • 14.
  • 15.
  • 16. Example 3: Complex Join java -jar lusql.jar -q quot;select Publisher.name as pub, Journal.title as jo,Article.rawUrl as text , Journal.issn, Volume.number as vol,Volume.coverYear as year, Issue.number as iss, Article.id as id, Article.title as ti, Article.abstract as ab, Article.startPage as startPage, Article.endPage as endPage from Publisher, Journal, Volume, Issue, Article where Publisher.id = Journal.publisherId and Journal.id = Volume.journalId and Volume.id=Issue.volumeId and Issue.id = Article.issueIdquot; -c quot;jdbc:mysql ://dbhost/db?user=ID&password=PASSquot; -n 50000 -l tutorial_2
  • 17. Example 4: Out-of-band document manipulation • Plugin architecture allowing arbitrary manipulations of Documents before they go into the index • Implement DocFilter interface • Add filter class at command line: – -f ca.nrc.cisti.lusql.example.FileFullTextFilter – Looks in metadata field for PDF location in file system; finds corresponding .txt file; reads file & adds to Document
  • 18. Example 4: Large Scale Index Properties • 6.4M articles (metadata & full-text), ~600GB PDFs • 21 fields, including abstract & full-text • Indexing time: 13h 46m • Index size: 86GB
  • 19. Comparison to SOLR • SOLR 1.4 November build: – Using DataImportHandler, with all defaults • Lucene 2.4 • LuSql 0.90
  • 20. Hardware • Indexing and database machines: – Dell PowerEdge 1955 Blade server56 – CPU: 2 x dual-core Xeon 5050 processors with 2x2MB cache, 3.0 Ghz 64bit – Memory: 8 GB 667MHz – Disk: 2 x 73GB internal 10K RPM SAS drives • Both machines attached to: – Dell EMC AX150 storage arrays – 12 x 500 GB SATA II 7.2K RPM disks – via: – SilkWorm 200E57 Series 16-Port Capable 4Gb Fabric Switch
  • 21. Software • MySql: v5.0.45 compiled from source. • gcc: gcc version 4.1.2 20061115 (prerelease) (SUSE Linux) • Java: java version 1.6.0 07 SE Runtime Environment (build 1.6.0 07-b06) Java HotSpot(TM) 64-Bit Server VM (build 10.0-b23, mixed mode) • Operating System – Linux openSUSE 10.2 (64-bit X86-64) – Linux kernel: 2.6.18.8-0.10-default #1 SMP
  • 24.
  • 25.
  • 26. Acknowledgments • Greg Kresko, Andre Vellino, Jeff Demaine, various LuSql users
  • 27. LuSql 0.95 in development • Re-architected to have pluggable drivers for both input & output • Read drivers: – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, RMI, Terrier • Write drivers: – JDBC, Lucene, Minion, BDB. ehcache, SolrJ, text, XML, RMI, Terrier • For large volumes, concurrent multiple indexes merged at end
  • 28. Questions • Glen Newton glen.newton@nrc-cnrc.gc.ca