Creating an Open Source
Genealogical Search Engine
    With Apache Solr


              Brooke Schreier Ganz
                 info@leafseek.com
                Twitter: @LeafSeek
                www.LeafSeek.com
Hi, I'm Brooke
• I make web stuff for fun, and (sometimes) for
  profit
• Web Developer at IBM.com and Disney
  Consumer Products
• Lead Programmer at TMZ.com (yikes, sorry about that)
• Senior Web Producer at Bravo cable TV
  network and its spin-off websites
• Big dork
• Big genealogy dork
• #BigData dork
Meet Gesher Galicia
• Non-profit 501(c)(3) genealogy society
• Founded in 1993
• Hundreds of members, worldwide
• E-mail discussion group
• New website development in progress
  (existing website is fugly)
• Needs a search engine…for data
The Old Problem
The New Problem
• Diverse Data Languages
  (German, Polish, Ukrainian, Russian, Yiddish,
  Hebrew, English…)
• Diverse Data Types
  (births, marriages, deaths, divorces, tax
  lists, landsmanschaften lists, industrial
  permit lists, school
  yearbooks, governmental yearbooks…)
Diverse Data Shapes
Existing solutions
• They're okay...for small numbers of
  databases, with small amounts of data

  – Steve Morse's One-Step Tool Creator
  – Roll-your-own solution with PHP and MySQL


• Both get more difficult to manage as data
  sets increase in number and complexity
In space, no one can hear your data scream
To Sum Up
• There are lots of ways to publish your tree
• …but not so many ways to publish your
  data
• Surely there must be a way to deal with
  this?
So I Made A Thing
But “That Thing I Made With The Database And Stuff”
     was kind of an awkward name, so I called it



             LeafSeek
This is the part where I show you all
the shiny new All Galicia Database

  http://search.geshergalicia.org/
Meet Apache Solr
• Highly functional open source search
  platform
• Based on Apache Lucene (Java)…
• …plus a web wrapper/API
• Not the prettiest or simplest tool
• FREE and open source
Saves Time, and Heartache
Saves Time, and Stomachache
File Structure: Back-End
Welcome to /conf
The Important Stuff
solrconfig.xml

Make sure this part is configured, so you can
import data:
How to get your data into Solr
• Step 1: Make a properly-formatted
  spreadsheet
• Step 2: Save spreadsheet as a .CSV file
• Step 3: Create a MySQL database + table
• Step 4: Import CSV into that new table
• Step 5: Add a Unique Auto-Incrementing
  Primary Key called “id” (INT)
• Step 6: Add this table's information to
  db-data-config.xml
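
Steps 2 through 5 can be sketched in a few lines of Python. This is only an illustration under stated assumptions: it uses the standard library's sqlite3 as a stand-in for MySQL (where the key column would be declared as INT AUTO_INCREMENT PRIMARY KEY), and the table name, column names, and sample rows are all hypothetical.

```python
import csv
import io
import sqlite3

# Hypothetical sample data standing in for the exported spreadsheet (Step 2).
SAMPLE_CSV = """Surname,GivenName,Year,Town
Kohn,Rivka,1890,Lemberg
Adler,Moses,1875,Tarnow
"""

def import_csv(conn, table, csv_text):
    """Create `table` with an auto-incrementing integer `id`, then load the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # column names come from the spreadsheet's header row
    columns = ", ".join(f'"{name}" TEXT' for name in header)
    # Step 3 and Step 5: new table plus the unique auto-incrementing key called "id".
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {columns})'
    )
    # Step 4: import every remaining CSV row.
    quoted = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    conn.executemany(
        f'INSERT INTO "{table}" ({quoted}) VALUES ({placeholders})', reader
    )
    conn.commit()
    return header

conn = sqlite3.connect(":memory:")  # SQLite stand-in; the deck uses MySQL
import_csv(conn, "births", SAMPLE_CSV)
rows = conn.execute("SELECT id, Surname, Year FROM births ORDER BY id").fetchall()
print(rows)
```

Every value is stored as text here, which matches the deck's note that years are treated as text for now.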
db-data-config.xml
• Basic XML file that tells Solr how to grab
  data from your MySQL database(s)
• Add new <dataSource> for new databases
• Add new <entity> for new tables within the
  databases
• You need to make sure your MySQL
  connector .jar is installed for this to work
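
To make the shape of this file concrete, here is a small Python sketch that writes out a skeleton with one <dataSource> and one <entity> per table. The element and attribute names follow the usual Data Import Handler layout for a JDBC source; the JDBC URL, user name, password, and table names are placeholders, not values from LeafSeek.

```python
import xml.etree.ElementTree as ET

def build_data_config(jdbc_url, user, password, tables):
    """Return a dataConfig XML string: one <dataSource>, one <entity> per table."""
    root = ET.Element("dataConfig")
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",  # class provided by the MySQL connector .jar
        "url": jdbc_url,
        "user": user,
        "password": password,
    })
    document = ET.SubElement(root, "document")
    for table in tables:
        # Each table becomes its own entity with a simple SELECT query.
        ET.SubElement(document, "entity", {
            "name": table,
            "query": f"SELECT * FROM {table}",
        })
    return ET.tostring(root, encoding="unicode")

# Placeholder connection details and table names, for illustration only.
config_xml = build_data_config(
    "jdbc:mysql://localhost/genealogy", "solr_reader", "change-me",
    ["births", "deaths"],
)
print(config_xml)
```

Adding a new table then means adding one more name to the list, which mirrors the "new <entity> per table" rule above.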
Import!
schema.xml
• FieldTypes, Fields, and CopyFields
• FieldTypes give indexing and querying
  instructions to “buckets”
• Fields say what's what and whether to
  make something facetable or not
• CopyFields copy the contents of Fields into
  extra Fields, which can use a different FieldType
schema.xml - FieldTypes
• 5 Custom FieldTypes (so far):
  – givenname
  – surname
  – surname_bmpm (phonetic)
  – place (note: not merely town)
  – year (which we're treating as text right now)
schema.xml - Fields
• Capitalized field names come from the
  MySQL column names
• Examples:
  – Year
  – SchoolYear
  – Surname
  – FathersTown
  – MothersFathersGivenName
  – MaternalGrandfathersGivenName
schema.xml - Fields
• Lowercase fields are added while the
  data is being imported into Solr, and start
  with the prefix record_
• Examples:
  – record_type (birth, death, tax, whatever)
  – record_source (name of repository)
  – record_latlong (latitude,longitude)
  – record_id (required!)
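
As a rough picture of how the two kinds of fields end up side by side in one Solr document, here is a Python sketch. The helper name, the sample row, and the "type-id" format used for record_id are assumptions made for illustration; the deck only says that record_id is required.

```python
def to_solr_doc(row, record_type, record_source, latitude=None, longitude=None):
    """Combine one MySQL row (as a dict) with the lowercase record_ fields."""
    # Capitalized keys keep their MySQL column names unchanged.
    doc = {key: value for key, value in row.items() if key != "id"}
    # Assumed format: prefix the table's id with the record type to keep it unique.
    doc["record_id"] = f"{record_type}-{row['id']}"
    doc["record_type"] = record_type
    doc["record_source"] = record_source
    if latitude is not None and longitude is not None:
        doc["record_latlong"] = f"{latitude},{longitude}"  # "latitude,longitude"
    return doc

# Hypothetical row and repository name.
row = {"id": 17, "Surname": "Kohn", "Year": "1890", "FathersTown": "Lemberg"}
doc = to_solr_doc(row, "birth", "Example Archive", 49.84, 24.03)
print(doc)
```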
schema.xml - Fields
• You do not have to explicitly define every
  Field.
• If something is imported that is not named
  and defined in schema.xml it will just be
  indexed as a straight-up text string, with
  nothing done to it.
• Which is fine.
• But IMHO it's better to define everything
  anyway so you can remember what's what
  and what you are doing to it.
schema.xml - CopyFields
Add-ons and nice-to-haves
         (for the back-end)
• Wildcards, and lots of 'em
• Non-name words handled through
  stopwords.txt
• Nicknames and name synonyms handled
  through synonyms.txt
• Two files included:
  – synonyms_-_american-anglo-saxon.txt
  – synonyms_-_polish-ukrainian-jewish.txt
• Should be based on your data and your
  historical/ethnic community standards
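
The idea behind a synonyms file can be sketched in Python. Solr's synonym files accept comma-separated groups of equivalent terms and explicit "a => b" mappings, with "#" starting a comment; the parser below handles just those two forms. The name pairs are hypothetical examples, not entries from the two files listed above.

```python
def load_synonyms(lines):
    """Parse synonym lines into a dict of name -> set of names to search for."""
    table = {}
    for line in lines:
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blanks and comments
        if "=>" in line:
            # Explicit mapping: names on the left are replaced by names on the right.
            left, right = line.split("=>", 1)
            targets = {n.strip().lower() for n in right.split(",") if n.strip()}
            for source in left.split(","):
                if source.strip():
                    table.setdefault(source.strip().lower(), set()).update(targets)
        else:
            # Equivalence group: every name matches every other name in the group.
            group = {n.strip().lower() for n in line.split(",") if n.strip()}
            for name in group:
                table.setdefault(name, set()).update(group)
    return table

def expand(name, table):
    """Return the sorted list of names a query for `name` should match."""
    return sorted(table.get(name.lower(), {name.lower()}))

# Hypothetical entries for illustration.
SAMPLE_LINES = [
    "# nicknames",
    "william, bill, willy",
    "sura => sarah",
]
synonyms = load_synonyms(SAMPLE_LINES)
print(expand("Bill", synonyms))
print(expand("Sura", synonyms))
```

In Solr itself this expansion happens inside the analysis chain of a FieldType, so nothing like this has to run in the front-end.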
More add-ons and nice-to-haves
        (for the back-end)
• Translate your site into different languages: multilingual
  content deserves a real multilingual website
   – Pass user preferences through GET value or through
     accept-language header or read from a cookie or
     whatever you want
• Built-in performance monitoring hooks for New
  Relic
• Soundalike searches for surname variants
   – Levenshtein distance
   – “Regular” Soundex, Metaphone, Caverphone, etc.
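
For a feel of how the first of these soundalike options works, here is a short Python sketch of Levenshtein distance applied to surname variants. It shows the textbook dynamic-programming algorithm, not LeafSeek's or Solr's implementation, and the surnames and the max_distance threshold are illustrative assumptions.

```python
def levenshtein(a, b):
    """Return the minimum number of single-character edits that turn a into b."""
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute, or keep on a match
            ))
        previous = current
    return previous[-1]

def close_surnames(query, surnames, max_distance=1):
    """Keep the surnames within max_distance edits of the query, ignoring case."""
    query = query.lower()
    return [s for s in surnames if levenshtein(query, s.lower()) <= max_distance]

# Hypothetical index of surnames.
candidates = ["Schreyer", "Shreier", "Schneider", "Kohn"]
matches = close_surnames("Schreier", candidates)
print(matches)
```

Edit distance only counts letter changes, so it knows nothing about how a name sounds; that gap is what the phonetic options in the list are for.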
This is the part where I tell
          the story about


     THE SAGA
of Beider-Morse Phonetic Matching
             (BMPM)
Relevancy
• Right now, we're using exact matches
• (Of course, “exact” includes
  wildcards, alternate names /
  synonyms, etc.)
• Like “Old Search” on Ancestry.com
• DisMax! Boosting fields! Scoring!
• (…but not yet)
• Problems with records that contain
  multiple people's names
Lots of Front-End Options
• Ruby:
  Sunspot, RSolr, Tanning Bed, acts-as-solr
• Django/Python:
  Haystack, Sunburnt, solrpy, pysolr
• Older PHP options:
  PECL, solr-php-client
• Plugins for blog/CMS systems:
  Drupal, WordPress
Meet Solarium
•   http://www.solarium-project.org/
•   New, open source PHP wrapper for Solr
•   Very active development
•   Version 2.4 coming soon
File Structure: Front-End
Meet Solarium: The Config
Meet Solarium: The Guts
Meet Solarium: The Guts
• You choose the parts of your data to facet
• Data is submitted to the front-end by
  POST, not by GET, so the URL never
  changes
• You can (and should) paginate results
  listings
• You can't actually see the Solr server's
  URL from the front-end, not even in view-
  source
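
Behind a wrapper like Solarium, faceting and pagination come down to a handful of standard Solr request parameters (q, start, rows, facet, facet.field, wt). The Python sketch below builds such a query string; the helper name, the field names, and the page size are assumptions, and the deck's real front-end is PHP.

```python
from urllib.parse import urlencode, parse_qs

def build_select_query(field, value, page=1, per_page=25, facet_fields=()):
    """Build the query string for a Solr select request with paging and facets."""
    params = [
        ("q", f"{field}:{value}"),          # wildcards such as Gan* are allowed
        ("start", (page - 1) * per_page),   # offset of the first result on this page
        ("rows", per_page),                 # page size
        ("wt", "json"),                     # response format
    ]
    if facet_fields:
        params.append(("facet", "true"))
        params.extend(("facet.field", name) for name in facet_fields)
    return urlencode(params)

query_string = build_select_query(
    "Surname", "Gan*", page=3, facet_fields=["record_type", "Year"]
)
print(query_string)
```

Because this string is assembled and sent server-side, visitors only ever see the front-end's own form, which fits the point above about the Solr server's URL staying hidden.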
Add-ons and nice-to-haves
        (for the front-end)
• A welcome screen with information about
  the database's contents
• Instructions (maybe twice)
• How many records in the database?
• How many datasets?
• What features are coming next?
• What datasets are coming next?
Add-ons and nice-to-haves
           (for the front-end)
•   Make good UI choices
•   Pop-Up Google Maps
•   Tooltips to reduce UI clutter
•   Cross-browser compatibility
•   Still stuck with IE 7 and 8
•   CSS and code that degrades gracefully
•   No small text
Bird's Eye View of Your Data
• What (surnames, towns, etc.) do I have in
  my data?
• What are the TOP (surnames, towns, etc.)
  in my data?
• Finding incorrect data
  – Outlying years and dates
  – Figure out that hard-to-read surname
• Make charts and graphs from your data
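
The same bird's-eye questions can be sketched in plain Python over a list of records. Solr's facet counts give a similar view directly from the index; this version just shows the logic. The records, field names, and the plausible year range are hypothetical.

```python
from collections import Counter

def top_values(records, field, n=3):
    """Return the n most common non-empty values of `field` with their counts."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(n)

def outlying_years(records, earliest, latest):
    """Flag records whose Year is not a number or falls outside the expected range."""
    flagged = []
    for record in records:
        year = record.get("Year", "")  # years are stored as text
        if not year.isdigit() or not earliest <= int(year) <= latest:
            flagged.append(record)
    return flagged

# Hypothetical records for illustration.
records = [
    {"Surname": "Kohn", "Year": "1890"},
    {"Surname": "Adler", "Year": "1875"},
    {"Surname": "Kohn", "Year": "1980"},  # outside the expected range
    {"Surname": "Kohn", "Year": "18?5"},  # unreadable digit in the source
]
top_surnames = top_values(records, "Surname", n=2)
suspects = outlying_years(records, 1780, 1920)
print(top_surnames)
print(suspects)
```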
The (Back-End) Future!        (Maybe.)

• Date ranges, instead of just years
• Auto-complete as you type
• “Did you mean...?”
  (based on data frequency)
• “More Like This”
  (would have to do scoring)
• Record bookmarking system (hashes?)
The (Front-End) Future!         (Maybe.)

• Hierarchical facets for locations
• Disambiguating locations
• Social sharing of individual records
• New genealogy data schema
  http://historical-data.org/
• Membership login system
Please Do Not Build That Wall
• Password protect some of the databases
• Password protect some of the data
• Open data, but pay for record or surname
  bookmarking system
• Open data, but pay for API access
• Open data, but sell online ads
• Open data, but give people guilt trips
Presenting LeafSeek!
•   Free and Open Source
•   Code is all on GitHub
•   Please add, edit, fix, change, tinker
•   …and use it!
Why is this FREE?

And why is this important?
Thank you! :-)

Más contenido relacionado

La actualidad más candente

The Master Genealogist for Beginners 2012
The Master Genealogist for Beginners 2012The Master Genealogist for Beginners 2012
The Master Genealogist for Beginners 2012Teresa Pask
 
Intro to Neo4j - Nicole White
Intro to Neo4j - Nicole WhiteIntro to Neo4j - Nicole White
Intro to Neo4j - Nicole WhiteNeo4j
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Storesandyseaborne
 
Apache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in DepthApache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in DepthDatabricks
 
20171106 sesug bb 180 proc import ppt
20171106 sesug bb 180 proc import ppt20171106 sesug bb 180 proc import ppt
20171106 sesug bb 180 proc import pptDavid Horvath
 
Data Exploration with Elasticsearch
Data Exploration with ElasticsearchData Exploration with Elasticsearch
Data Exploration with ElasticsearchAleksander Stensby
 
semantic markup using schema.org
semantic markup using schema.orgsemantic markup using schema.org
semantic markup using schema.orgJoshua Shinavier
 

La actualidad más candente (8)

The Master Genealogist for Beginners 2012
The Master Genealogist for Beginners 2012The Master Genealogist for Beginners 2012
The Master Genealogist for Beginners 2012
 
Intro to Neo4j - Nicole White
Intro to Neo4j - Nicole WhiteIntro to Neo4j - Nicole White
Intro to Neo4j - Nicole White
 
Basics of Web Research for ELA 10
Basics of Web Research for ELA 10Basics of Web Research for ELA 10
Basics of Web Research for ELA 10
 
NoSQL and Triple Stores
NoSQL and Triple StoresNoSQL and Triple Stores
NoSQL and Triple Stores
 
Apache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in DepthApache Spark's Built-in File Sources in Depth
Apache Spark's Built-in File Sources in Depth
 
20171106 sesug bb 180 proc import ppt
20171106 sesug bb 180 proc import ppt20171106 sesug bb 180 proc import ppt
20171106 sesug bb 180 proc import ppt
 
Data Exploration with Elasticsearch
Data Exploration with ElasticsearchData Exploration with Elasticsearch
Data Exploration with Elasticsearch
 
semantic markup using schema.org
semantic markup using schema.orgsemantic markup using schema.org
semantic markup using schema.org
 

Destacado

Russian Language Centre
Russian Language CentreRussian Language Centre
Russian Language CentreLucy Bullett
 
Russian for Beginners
Russian for BeginnersRussian for Beginners
Russian for BeginnersIrina Bubnova
 
How many people speak and will speak the russian language
How many people  speak and will speak the russian languageHow many people  speak and will speak the russian language
How many people speak and will speak the russian languageSecondary School from Helsinki
 
Ensemble Contextual Bandits for Personalized Recommendation
Ensemble Contextual Bandits for Personalized RecommendationEnsemble Contextual Bandits for Personalized Recommendation
Ensemble Contextual Bandits for Personalized RecommendationLiang Tang
 
Hieber - An Introduction to Typology, Part II: Voice & Transitivity
Hieber - An Introduction to Typology, Part II: Voice & TransitivityHieber - An Introduction to Typology, Part II: Voice & Transitivity
Hieber - An Introduction to Typology, Part II: Voice & TransitivityDaniel Hieber
 
Amarigna & Tigrigna Qal Roots of Russian Language
Amarigna & Tigrigna Qal Roots of Russian LanguageAmarigna & Tigrigna Qal Roots of Russian Language
Amarigna & Tigrigna Qal Roots of Russian LanguageLegesse Allyn
 
towards mulitlingual cultural lexicography. the russian dialect dictionary as...
towards mulitlingual cultural lexicography. the russian dialect dictionary as...towards mulitlingual cultural lexicography. the russian dialect dictionary as...
towards mulitlingual cultural lexicography. the russian dialect dictionary as...eveline wandl-vogt
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageDmitry Kan
 
Russia, Russians and Russian language
Russia, Russians and Russian languageRussia, Russians and Russian language
Russia, Russians and Russian languageKaterina Vylomova
 
Languages of the world
Languages of the worldLanguages of the world
Languages of the worldManu Alias
 
Learn Russian - FSI FAST Course (Part 3)
Learn Russian - FSI FAST Course (Part 3)Learn Russian - FSI FAST Course (Part 3)
Learn Russian - FSI FAST Course (Part 3)101_languages
 
Basic Russian Language Course
Basic Russian Language CourseBasic Russian Language Course
Basic Russian Language Course101_languages
 
Language families and branches
Language families and branchesLanguage families and branches
Language families and branchesPamela Sanhueza
 
Russian Language
Russian LanguageRussian Language
Russian LanguageIzzah Ros
 

Destacado (17)

Russian Language Centre
Russian Language CentreRussian Language Centre
Russian Language Centre
 
Russian for Beginners
Russian for BeginnersRussian for Beginners
Russian for Beginners
 
How many people speak and will speak the russian language
How many people  speak and will speak the russian languageHow many people  speak and will speak the russian language
How many people speak and will speak the russian language
 
Ensemble Contextual Bandits for Personalized Recommendation
Ensemble Contextual Bandits for Personalized RecommendationEnsemble Contextual Bandits for Personalized Recommendation
Ensemble Contextual Bandits for Personalized Recommendation
 
Hieber - An Introduction to Typology, Part II: Voice & Transitivity
Hieber - An Introduction to Typology, Part II: Voice & TransitivityHieber - An Introduction to Typology, Part II: Voice & Transitivity
Hieber - An Introduction to Typology, Part II: Voice & Transitivity
 
Amarigna & Tigrigna Qal Roots of Russian Language
Amarigna & Tigrigna Qal Roots of Russian LanguageAmarigna & Tigrigna Qal Roots of Russian Language
Amarigna & Tigrigna Qal Roots of Russian Language
 
Pre-incident plan
Pre-incident planPre-incident plan
Pre-incident plan
 
towards mulitlingual cultural lexicography. the russian dialect dictionary as...
towards mulitlingual cultural lexicography. the russian dialect dictionary as...towards mulitlingual cultural lexicography. the russian dialect dictionary as...
towards mulitlingual cultural lexicography. the russian dialect dictionary as...
 
Linguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian languageLinguistic component Tokenizer for the Russian language
Linguistic component Tokenizer for the Russian language
 
Russia, Russians and Russian language
Russia, Russians and Russian languageRussia, Russians and Russian language
Russia, Russians and Russian language
 
Languages of the world
Languages of the worldLanguages of the world
Languages of the world
 
Russia
RussiaRussia
Russia
 
Learn Russian - FSI FAST Course (Part 3)
Learn Russian - FSI FAST Course (Part 3)Learn Russian - FSI FAST Course (Part 3)
Learn Russian - FSI FAST Course (Part 3)
 
Basic Russian Language Course
Basic Russian Language CourseBasic Russian Language Course
Basic Russian Language Course
 
Language families and branches
Language families and branchesLanguage families and branches
Language families and branches
 
Russian Language
Russian LanguageRussian Language
Russian Language
 
AINL 2016: Malykh
AINL 2016: MalykhAINL 2016: Malykh
AINL 2016: Malykh
 

Similar a Creating an Open Source Genealogical Search Engine with Apache Solr

Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptxIke Ellis
 
Computer-assisted reporting seminar
Computer-assisted reporting seminarComputer-assisted reporting seminar
Computer-assisted reporting seminarGlen McGregor
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring databodaceacat
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring dataSara-Jayne Terp
 
Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.orgrvguha
 
NotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersNotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersRob Fuller
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talkrtelmore
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.Jurriaan Persyn
 
PostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databasePostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databaseBarry Jones
 
Winning the Big Data SPAM Challenge__HadoopSummit2010
Winning the Big Data SPAM Challenge__HadoopSummit2010Winning the Big Data SPAM Challenge__HadoopSummit2010
Winning the Big Data SPAM Challenge__HadoopSummit2010Yahoo Developer Network
 
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleCassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleDataStax Academy
 
Nerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
Nerd Out with Hadoop: A Not-So-Basic Introduction to the PlatformNerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
Nerd Out with Hadoop: A Not-So-Basic Introduction to the PlatformSteve Hoffman
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerIke Ellis
 
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxMYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxPythian
 
Postgres Vision 2018: Five Sharding Data Models
Postgres Vision 2018: Five Sharding Data ModelsPostgres Vision 2018: Five Sharding Data Models
Postgres Vision 2018: Five Sharding Data ModelsEDB
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptxGambari Amosa Isiaka
 

Similar a Creating an Open Source Genealogical Search Engine with Apache Solr (20)

Build a modern data platform.pptx
Build a modern data platform.pptxBuild a modern data platform.pptx
Build a modern data platform.pptx
 
Computer-assisted reporting seminar
Computer-assisted reporting seminarComputer-assisted reporting seminar
Computer-assisted reporting seminar
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Session 03 acquiring data
Session 03 acquiring dataSession 03 acquiring data
Session 03 acquiring data
 
Semantic Web and Schema.org
Semantic Web and Schema.orgSemantic Web and Schema.org
Semantic Web and Schema.org
 
PHP - Introduction to PHP MySQL Joins and SQL Functions
PHP -  Introduction to PHP MySQL Joins and SQL FunctionsPHP -  Introduction to PHP MySQL Joins and SQL Functions
PHP - Introduction to PHP MySQL Joins and SQL Functions
 
NotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for PentestersNotaCon 2011 - Networking for Pentesters
NotaCon 2011 - Networking for Pentesters
 
Make Your Data Searchable With Solr in 25 Minutes
Make Your Data Searchable With Solr in 25 MinutesMake Your Data Searchable With Solr in 25 Minutes
Make Your Data Searchable With Solr in 25 Minutes
 
useR! 2012 Talk
useR! 2012 TalkuseR! 2012 Talk
useR! 2012 Talk
 
An Introduction to Elastic Search.
An Introduction to Elastic Search.An Introduction to Elastic Search.
An Introduction to Elastic Search.
 
PostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty databasePostgreSQL - It's kind've a nifty database
PostgreSQL - It's kind've a nifty database
 
Winning the Big Data SPAM Challenge__HadoopSummit2010
Winning the Big Data SPAM Challenge__HadoopSummit2010Winning the Big Data SPAM Challenge__HadoopSummit2010
Winning the Big Data SPAM Challenge__HadoopSummit2010
 
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at ScaleCassandra Summit 2014: Fuzzy Entity Matching at Scale
Cassandra Summit 2014: Fuzzy Entity Matching at Scale
 
Nerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
Nerd Out with Hadoop: A Not-So-Basic Introduction to the PlatformNerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
Nerd Out with Hadoop: A Not-So-Basic Introduction to the Platform
 
Hadoop for the Absolute Beginner
Hadoop for the Absolute BeginnerHadoop for the Absolute Beginner
Hadoop for the Absolute Beginner
 
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to SphinxMYSQL Query Anti-Patterns That Can Be Moved to Sphinx
MYSQL Query Anti-Patterns That Can Be Moved to Sphinx
 
Postgres Vision 2018: Five Sharding Data Models
Postgres Vision 2018: Five Sharding Data ModelsPostgres Vision 2018: Five Sharding Data Models
Postgres Vision 2018: Five Sharding Data Models
 
05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx05. EDT 513 Week 5 2023 Searching the Internet.pptx
05. EDT 513 Week 5 2023 Searching the Internet.pptx
 
Hands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop EcosystemHands On: Introduction to the Hadoop Ecosystem
Hands On: Introduction to the Hadoop Ecosystem
 
Splunk bsides
Splunk bsidesSplunk bsides
Splunk bsides
 

Último

COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Websitedgelyza
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxMatsuo Lab
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostMatt Ray
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAshyamraj55
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UbiTrack UK
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioChristian Posta
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintEmpowering Africa's Next Generation: The AI Leadership Blueprint
Empowering Africa's Next Generation: The AI Leadership BlueprintMahmoud Rabie
 
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesHow to Effectively Monitor SD-WAN and SASE Environments with ThousandEyes
How to Effectively Monitor SD-WAN and SASE Environments with ThousandEyesThousandEyes
 
OpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureOpenShift Commons Paris - Choose Your Own Observability Adventure
OpenShift Commons Paris - Choose Your Own Observability AdventureEric D. Schabell
 
Linked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesLinked Data in Production: Moving Beyond Ontologies
Linked Data in Production: Moving Beyond OntologiesDavid Newbury
 
Babel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxBabel Compiler - Transforming JavaScript for All Browsers.pptx
Babel Compiler - Transforming JavaScript for All Browsers.pptxYounusS2
 
Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Crea il tuo assistente AI con lo Stregatto (open source python framework)
Crea il tuo assistente AI con lo Stregatto (open source python framework)Commit University
 
GenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncGenAI and AI GCC State of AI_Object Automation Inc
GenAI and AI GCC State of AI_Object Automation IncObject Automation
 
UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6UiPath Studio Web workshop series - Day 6
UiPath Studio Web workshop series - Day 6DianaGray10
 
Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Meet the new FSP 3000 M-Flex800™
Meet the new FSP 3000 M-Flex800™Adtran
 
UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1UiPath Platform: The Backend Engine Powering Your Automation - Session 1
UiPath Platform: The Backend Engine Powering Your Automation - Session 1DianaGray10
 
Designing A Time bound resource download URL
Designing A Time bound resource download URLDesigning A Time bound resource download URL
Designing A Time bound resource download URLRuncy Oommen
 
Cloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataCloud Revolution: Exploring the New Wave of Serverless Spatial Data
Cloud Revolution: Exploring the New Wave of Serverless Spatial DataSafe Software
 
Machine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfMachine Learning Model Validation (Aijun Zhang 2024).pdf
Machine Learning Model Validation (Aijun Zhang 2024).pdfAijun Zhang
 
PicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServicePicPay - GenAI Finance Assistant - ChatGPT for Customer Service
PicPay - GenAI Finance Assistant - ChatGPT for Customer ServiceRenan Moreira de Oliveira
 

Último (20)

COMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a WebsiteCOMPUTER 10 Lesson 8 - Building a Website
COMPUTER 10 Lesson 8 - Building a Website
 
Introduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptxIntroduction to Matsuo Laboratory (ENG).pptx
Introduction to Matsuo Laboratory (ENG).pptx
 
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCostKubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
KubeConEU24-Monitoring Kubernetes and Cloud Spend with OpenCost
 
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPAAnypoint Code Builder , Google Pub sub connector and MuleSoft RPA
Anypoint Code Builder , Google Pub sub connector and MuleSoft RPA
 
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
UWB Technology for Enhanced Indoor and Outdoor Positioning in Physiological M...
 
Comparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and IstioComparing Sidecar-less Service Mesh from Cilium and Istio
Comparing Sidecar-less Service Mesh from Cilium and Istio
 
Empowering Africa's Next Generation: The AI Leadership Blueprint
Creating an Open Source Genealogical Search Engine with Apache Solr

  • 1. Creating an Open Source Genealogical Search Engine With Apache Solr Brooke Schreier Ganz info@leafseek.com Twitter: @LeafSeek www.LeafSeek.com
  • 2. Hi, I'm Brooke • I make web stuff for fun, and (sometimes) for profit • Web Developer at IBM.com and Disney Consumer Products • Lead Programmer at TMZ.com (yikes, sorry about that) • Senior Web Producer at Bravo cable TV network and its spin-off websites • Big dork • Big genealogy dork • #BigData dork
  • 3. Meet Gesher Galicia • Non-profit 501(c)(3) genealogy society • Founded in 1993 • Hundreds of members, worldwide • E-mail discussion group • New website development in progress (existing website is fugly) • Needs a search engine…for data
  • 10. The New Problem • Diverse Data Languages (German, Polish, Ukrainian, Russian, Yiddish, Hebrew, English…) • Diverse Data Types (births, marriages, deaths, divorces, tax lists, landsmanschaften lists, industrial permit lists, school yearbooks, governmental yearbooks…)
  • 14. Existing solutions • They're okay...for small numbers of databases, with small amounts of data – Steve Morse's One-Step Tool Creator – Roll-your-own solution with PHP and MySQL • Both get more difficult to manage as data sets increase in number and complexity
  • 15. In space, no one can hear your data scream
  • 16. To Sum Up • There are lots of ways to publish your tree • …but not so many ways to publish your data • Surely there must be a way to deal with this?
  • 19. So I Made A Thing But “That Thing I Made With The Database And Stuff” was kind of an awkward name, so I called it LeafSeek
  • 20. This is the part where I show you all the shiny new All Galicia Database http://search.geshergalicia.org/
  • 21. Meet Apache Solr • Highly functional open source search platform • Based on Apache Lucene (Java)… • …plus a web wrapper/API • Not the prettiest or simplest tool • FREE and open source
  • 24. Saves Time, and Heartache
  • 26. Saves Time, and Stomachache
  • 33. solrconfig.xml Make sure this part is configured, so you can import data:
  • 34. How to get your data into Solr • Step 1: Make a properly-formatted spreadsheet • Step 2: Save spreadsheet as a .CSV file • Step 3: Create a MySQL database + table • Step 4: Import CSV into that new table • Step 5: Add a Unique Auto-Incrementing Primary Key called “id” (INT) • Step 6: Add this table's information to db-data-config.xml
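
Steps 3 and 5 above can be sketched in Python as a small helper that reads the CSV header and writes out the MySQL statements. This is a rough illustration, not LeafSeek's own tooling: the helper name `table_setup_sql`, the table name `births_sample`, the all-TEXT column types, and the sample row are assumptions made up for the example.

```python
import csv
import io

def table_setup_sql(table_name, csv_text):
    """Return the SQL for steps 3 and 5: create the table, then add the id key."""
    # The first CSV row is assumed to hold the column names.
    header = next(csv.reader(io.StringIO(csv_text)))
    # Assumption: every spreadsheet column is stored as TEXT to keep things simple.
    columns = ",\n  ".join("`{}` TEXT".format(name.strip()) for name in header)
    create = "CREATE TABLE `{}` (\n  {}\n) DEFAULT CHARSET=utf8;".format(
        table_name, columns)
    # Step 5: add the auto-incrementing primary key after the CSV has been imported.
    add_id = ("ALTER TABLE `{}` ADD COLUMN `id` INT NOT NULL "
              "AUTO_INCREMENT PRIMARY KEY FIRST;").format(table_name)
    return [create, add_id]

# Made-up sample data, only to show the shape of the input.
sample_csv = "Year,Surname,FathersTown\n1885,Kohn,Lemberg\n"
statements = table_setup_sql("births_sample", sample_csv)
for statement in statements:
    print(statement)
```

Run the first statement before importing the CSV (step 4) and the second one afterwards, so the imported columns line up with the spreadsheet.
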
  • 37. db-data-config.xml • Basic XML file that tells Solr how to grab data from your MySQL database(s) • Add new <dataSource> for new databases • Add new <entity> for new tables within the databases • You need to make sure your MySQL connector .jar is installed for this to work
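
To show the shape of such a file, here is a hedged Python sketch that generates a minimal db-data-config.xml with one `<dataSource>` and one `<entity>` per table. The element and attribute names follow Solr's DataImportHandler JDBC conventions; the database name, credentials, table names, driver class, and the helper `build_data_config` are placeholders assumed for illustration, not values from LeafSeek.

```python
import xml.etree.ElementTree as ET

def build_data_config(jdbc_url, user, password, tables):
    """Build a minimal DataImportHandler config as an XML string."""
    root = ET.Element("dataConfig")
    # One <dataSource> per MySQL database.
    ET.SubElement(root, "dataSource", {
        "type": "JdbcDataSource",
        "driver": "com.mysql.jdbc.Driver",  # assumed: the classic MySQL connector class
        "url": jdbc_url,
        "user": user,
        "password": password,
    })
    document = ET.SubElement(root, "document")
    # One <entity> per table; each row becomes one Solr document.
    for table in tables:
        ET.SubElement(document, "entity", {
            "name": table,
            "query": "SELECT * FROM {}".format(table),
        })
    return ET.tostring(root, encoding="unicode")

# Hypothetical connection details and table names.
config_xml = build_data_config(
    "jdbc:mysql://localhost/genealogy", "solr_reader", "changeme",
    ["births_sample", "tax_lists_sample"])
print(config_xml)
```

A real config will usually need more than `SELECT *`, for example adding the `record_` fields per table, so treat this as a starting skeleton only.
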
  • 41. schema.xml • FieldTypes, Fields, and CopyFields • FieldTypes give indexing and querying instructions to “buckets” • Fields say what's what and whether to make something facetable or not • CopyFields collect Fields together into extra FieldTypes
  • 42. schema.xml - FieldTypes • 5 Custom FieldTypes (so far): – givenname – surname – surname_bmpm (phonetic) – place (note: not merely town) – year (which we're treating as text right now)
  • 46. schema.xml - Fields • Uppercase fields take their names from the MySQL column names • Examples: – Year – SchoolYear – Surname – FathersTown – MothersFathersGivenName – MaternalGrandfathersGivenName
  • 47. schema.xml - Fields • Lowercase fields are added as the data is imported into Solr, and start with the prefix record_ • Examples: – record_type (birth, death, tax, whatever) – record_source (name of repository) – record_latlong (latitude,longitude) – record_id (required!)
  • 48. schema.xml - Fields • You do not have to explicitly define every Field. • If something is imported that is not named and defined in schema.xml it will just be indexed as a straight-up text string, with nothing done to it. • Which is fine. • But IMHO it's better to define everything anyway so you can remember what's what and what you are doing to it.
  • 51. Add-ons and nice-to-haves (for the back-end) • Wildcards, and lots of 'em • Non-name words handled through stopwords.txt • Nicknames and name synonyms handled through synonyms.txt • Two files included: – synonyms_-_american-anglo-saxon.txt – synonyms_-_polish-ukrainian-jewish.txt • Should be based on your data and your historical/ethnic community standards
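
To make the synonyms idea concrete, here is a small Python sketch that reads text in the usual Solr synonyms format (comma-separated equivalents, `=>` for one-way mappings, `#` for comment lines) and shows which names a search term would expand to. Solr does this itself at index or query time; the parser below only illustrates the file format. The helper name `parse_synonyms` and the sample entries are made up and are not taken from the two files that ship with LeafSeek.

```python
def parse_synonyms(text):
    """Map each name to the set of names it should expand to."""
    expansions = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and comment lines
        if "=>" in line:
            # One-way mapping: names on the left are replaced by names on the right.
            left, right = line.split("=>", 1)
            sources = [t.strip().lower() for t in left.split(",") if t.strip()]
            targets = {t.strip().lower() for t in right.split(",") if t.strip()}
        else:
            # Equivalent names: each one expands to the whole group.
            sources = [t.strip().lower() for t in line.split(",") if t.strip()]
            targets = set(sources)
        for source in sources:
            expansions.setdefault(source, set()).update(targets)
    return expansions

# Made-up sample entries for illustration only.
sample = """
# nicknames
william, bill, will
chaim, hyman, herman
zelig => selig
"""
names = parse_synonyms(sample)
print(sorted(names["bill"]))
```
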
  • 54. More add-ons and nice-to-haves (for the back-end) • Translate your site into different languages – multi-lingual content deserves a real multi-lingual website – Pass user preferences through GET value or through accept-language header or read from a cookie or whatever you want • Built-in performance monitoring hooks for New Relic • Soundalike searches for surname variants – Levenshtein distance – “Regular” Soundex, Metaphone, Caverphone, etc.
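
As an illustration of the Levenshtein option for surname variants, here is a plain-Python sketch of edit distance plus a filter that keeps only surnames within a small distance of the query. Solr has its own fuzzy matching, so this is not how LeafSeek does it; the helper `close_surnames`, the threshold of 2, and the candidate spellings are assumptions chosen for the example.

```python
def levenshtein(a, b):
    """Edit distance: the fewest insertions, deletions, and substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # delete char_a
                               current[j - 1] + 1,       # insert char_b
                               previous[j - 1] + cost))  # substitute or match
        previous = current
    return previous[-1]

def close_surnames(query, candidates, max_distance=2):
    """Keep candidates whose spelling is within max_distance edits of the query."""
    return [name for name in candidates
            if levenshtein(query.lower(), name.lower()) <= max_distance]

# Hypothetical spelling variants for illustration.
matches = close_surnames("Schreier", ["Schreyer", "Shrayer", "Ganz"])
print(matches)  # ['Schreyer']
```

Plain edit distance knows nothing about how names sound, which is one reason phonetic approaches are listed alongside it.
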
  • 55. This is the part where I tell the story about THE SAGA of Beider-Morse Phonetic Matching (BMPM)
  • 56. Relevancy • Right now, we're using exact matches • (Of course, “exact” includes wildcards, alternate names / synonyms, etc.) • Like “Old Search” on Ancestry.com • DisMax! Boosting fields! Scoring! • (…but not yet) • Problems with records with multiple people's names in the record
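
For a sense of what the not-yet-implemented DisMax boosting could look like, here is a hedged Python sketch that builds Solr query parameters with per-field boosts. `defType`, `qf`, `rows`, and `wt` are standard Solr parameters; the helper `dismax_params`, the field `FathersSurname`, and the boost values are hypothetical and would need tuning against real data.

```python
from urllib.parse import urlencode

def dismax_params(user_query, field_boosts, rows=10):
    """Build Solr DisMax parameters that weight some fields above others."""
    # qf lists the fields to search, each with an optional ^boost.
    qf = " ".join("{}^{}".format(field, boost)
                  for field, boost in field_boosts.items())
    return {"defType": "dismax", "q": user_query, "qf": qf,
            "rows": rows, "wt": "json"}

# Assumption: a match on the main person's surname should outrank a match
# on a relative's surname mentioned in the same record.
params = dismax_params("Ganz", {"Surname": 3.0, "FathersSurname": 1.0})
print(urlencode(params))
```

Boosting the primary person's fields is one possible way to approach the multiple-names problem, since a record where the searched name belongs to a witness or parent would then score lower.
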
  • 59. Lots of Front-End Options • Ruby: Sunspot, RSolr, Tanning Bed, acts-as-solr • Django/Python: Haystack, Sunburnt, solrpy, pysolr • Older PHP options: PECL, solr-php-client • Plugins for blog/CMS systems: Drupal, WordPress
  • 60. Meet Solarium • http://www.solarium-project.org/ • New, open source PHP wrapper for Solr • Very active development • Version 2.4 coming soon
  • 64. Meet Solarium: The Guts • You choose the parts of your data to facet • Data is submitted to the front-end by POST, not by GET, so the URL never changes • You can (and should) paginate results listings • You can't actually see the Solr server's URL from the front-end, not even in view-source
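
Underneath a wrapper like Solarium, pagination and faceting come down to a few Solr request parameters. The Python sketch below shows that mapping under stated assumptions: `start`, `rows`, `facet`, and `facet.field` are standard Solr parameters, while the helper `page_params`, the page size of 25, and the chosen facet fields are made up for illustration. It is not Solarium's API.

```python
from urllib.parse import urlencode

def page_params(query, page, per_page=25, facet_fields=()):
    """Translate a 1-based page number and facet choices into Solr parameters."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    params = [("q", query),
              ("start", (page - 1) * per_page),  # offset of the first result
              ("rows", per_page)]                # results per page
    if facet_fields:
        params.append(("facet", "true"))
        # facet.field may repeat, once per field to facet on.
        params.extend(("facet.field", field) for field in facet_fields)
    return params

third_page = page_params("Surname:Ganz", page=3,
                         facet_fields=["record_type", "Year"])
print(urlencode(third_page))
```

Because the front-end builds this request server-side, the browser only ever talks to the front-end, which is how the Solr server's URL stays out of view.
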
  • 65. Add-ons and nice-to-haves (for the front-end) • A welcome screen with information about the database's contents • Instructions (maybe twice) • How many records in the database? • How many datasets? • What features are coming next? • What datasets are coming next?
  • 66. Add-ons and nice-to-haves (for the front-end) • Make good UI choices • Pop-Up Google Maps • Tooltips to reduce UI clutter • Cross-browser compatibility • Still stuck with IE 7 and 8 • CSS and code that degrades gracefully • No small text
  • 67. Bird's Eye View of Your Data • What (surnames, towns, etc.) do I have in my data? • What are the TOP (surnames, towns, etc.) in my data? • Finding incorrect data – Outlying years and dates – Figure out that hard-to-read surname • Make charts and graphs from your data
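
The "TOP surnames" question maps naturally onto Solr facet counts. By default Solr's JSON response returns each facet field as a flat list that alternates value and count, and the Python sketch below turns such a list into ranked pairs. The helper `top_facet_values` and the sample response (including its counts) are made up for illustration, not real All Galicia Database numbers.

```python
import json

def top_facet_values(solr_response, field, limit=10):
    """Return the most frequent (value, count) pairs for one facet field."""
    flat = solr_response["facet_counts"]["facet_fields"][field]
    # Solr's default JSON layout alternates value, count, value, count, ...
    pairs = list(zip(flat[0::2], flat[1::2]))
    pairs.sort(key=lambda pair: pair[1], reverse=True)
    return pairs[:limit]

# Made-up response fragment with invented counts.
sample_response = json.loads("""
{"facet_counts": {"facet_fields":
  {"Surname": ["Kohn", 42, "Ganz", 7, "Schreier", 19]}}}
""")
top_two = top_facet_values(sample_response, "Surname", limit=2)
print(top_two)  # [('Kohn', 42), ('Schreier', 19)]
```

The same pairs can feed a chart, and a value with a suspiciously low count (a year like 1089, say) is a quick pointer to a likely transcription error.
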
  • 69. The (Back-End) Future! (Maybe.) • Date ranges, instead of just years • Auto-complete as you type • “Did you mean...?” (based on data frequency) • “More Like This” (would have to do scoring) • Record bookmarking system (hashes?)
  • 70. The (Front-End) Future! (Maybe.) • Hierarchical facets for locations • Disambiguating locations • Social sharing of individual records • New genealogy data schema http://historical-data.org/ • Membership login system
  • 72. Please Do Not Build That Wall • Password protect some of the databases • Password protect some of the data • Open data, but pay for record or surname bookmarking system • Open data, but pay for API access • Open data, but sell online ads • Open data, but give people guilt trips
  • 73. Presenting LeafSeek! • Free and Open Source • Code is all on GitHub • Please add, edit, fix, change, tinker • …and use it!
  • 76. Why is this FREE? And why is this important?