Set Your Records Free!
LeafSeek is a new tool that helps you turn your genealogical or historical record collections into searchable online databases. Combine multiple datasets of different types — such as birth, marriage, and military records — into one unified searchable website. Find inter-connections in your data that you never noticed before.
With great features like built-in geo-spatial searches, pop-up Google Maps, Beider-Morse Phonetic Matching, name synonyms, and language localization, LeafSeek can help you turn your spreadsheets of names and dates into a full-featured genealogy search engine. It’s designed for researchers and genealogy societies alike.
Oh, and one more thing: LeafSeek is free and open source. No strings attached.
Creating an Open Source Genealogical Search Engine with Apache Solr
1. Creating an Open Source
Genealogical Search Engine
With Apache Solr
Brooke Schreier Ganz
info@leafseek.com
Twitter: @LeafSeek
www.LeafSeek.com
2. Hi, I'm Brooke
• I make web stuff for fun, and (sometimes) for
profit
• Web Developer at IBM.com and Disney
Consumer Products
• Lead Programmer at TMZ.com (yikes, sorry about that)
• Senior Web Producer at Bravo cable TV
network and its spin-off websites
• Big dork
• Big genealogy dork
• #BigData dork
3. Meet Gesher Galicia
• Non-profit 501(c)(3) genealogy society
• Founded in 1993
• Hundreds of members, worldwide
• E-mail discussion group
• New website development in progress
(existing website is fugly)
• Needs a search engine…for data
10. The New Problem
• Diverse Data Languages
(German, Polish, Ukrainian, Russian, Yiddish,
Hebrew, English…)
• Diverse Data Types
(births, marriages, deaths, divorces, tax
lists, landsmanschaften lists, industrial
permit lists, school
yearbooks, governmental yearbooks…)
14. Existing solutions
• They're okay...for small numbers of
databases, with small amounts of data
– Steve Morse's One-Step Tool Creator
– Roll-your-own solution with PHP and MySQL
• Both get more difficult to manage as data
sets increase in number and complexity
16. To Sum Up
• There are lots of ways to publish your tree
• …but not so many ways to publish your
data
• Surely there must be a way to deal with
this?
19. So I Made A Thing
But “That Thing I Made With The Database And Stuff”
was kind of an awkward name, so I called it
LeafSeek
20. This is the part where I show you all
the shiny new All Galicia Database
http://search.geshergalicia.org/
21. Meet Apache Solr
• Highly functional open source search
platform
• Based on Apache Lucene (Java)…
• …plus a web wrapper/API
• Not the prettiest or simplest tool
• FREE and open source
34. How to get your data into Solr
• Step 1: Make a properly-formatted
spreadsheet
• Step 2: Save spreadsheet as a .CSV file
• Step 3: Create a MySQL database + table
• Step 4: Import CSV into that new table
• Step 5: Add a Unique Auto-Incrementing
Primary Key called “id” (INT)
• Step 6: Add this table's information to
db-data-config.xml
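Steps 3 to 5 can be sketched in a few lines. This is a minimal illustration, not LeafSeek's own import code: it uses Python's built-in sqlite3 module as a stand-in for MySQL (where the key would be declared as INT AUTO_INCREMENT PRIMARY KEY), and the table name, column names, and sample rows are all made up.

```python
import csv
import io
import sqlite3


def load_csv(conn, table, csv_text):
    """Create `table` with an auto-incrementing integer `id`, then load the CSV rows."""
    reader = csv.reader(io.StringIO(csv_text))
    header = next(reader)  # first CSV row holds the column names
    # Quote identifiers so names like "FathersTown" are kept exactly as written.
    col_defs = ", ".join(f'"{name}" TEXT' for name in header)
    conn.execute(
        f'CREATE TABLE "{table}" (id INTEGER PRIMARY KEY AUTOINCREMENT, {col_defs})'
    )
    col_list = ", ".join(f'"{name}"' for name in header)
    placeholders = ", ".join("?" for _ in header)
    rows = [row for row in reader if row]  # skip blank lines
    conn.executemany(
        f'INSERT INTO "{table}" ({col_list}) VALUES ({placeholders})', rows
    )
    conn.commit()
    return len(rows)


# Hypothetical sample data, for illustration only.
SAMPLE_CSV = "Surname,GivenName,Year\nKowalski,Jan,1872\nAdler,Rivka,1880\n"

conn = sqlite3.connect(":memory:")
loaded = load_csv(conn, "births", SAMPLE_CSV)
first_row = conn.execute("SELECT id, Surname, Year FROM births ORDER BY id").fetchone()
```

Every column is loaded as text here, which keeps the import simple; the id column is the only one the database generates itself.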
37. db-data-config.xml
• Basic XML file that tells Solr how to grab
data from your MySQL database(s)
• Add new <dataSource> for new databases
• Add new <entity> for new tables within the
databases
• You need to make sure your MySQL
connector .jar is installed for this to work
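To make the shape of that file concrete, here is a rough sketch that generates a skeleton with one <dataSource> per database and one <entity> per table. The attribute names follow the Solr DataImportHandler conventions for a JDBC source as I understand them, so check them against your Solr version. The host, database, credentials, and table names are placeholders, and this is not the file that ships with LeafSeek.

```python
import xml.etree.ElementTree as ET


def build_data_config(sources):
    """Return a db-data-config.xml skeleton as a string.

    `sources` is a list of dicts with the (hypothetical) keys
    name, host, database, user, password, and tables.
    """
    root = ET.Element("dataConfig")
    # One <dataSource> per MySQL database.
    for src in sources:
        ET.SubElement(root, "dataSource", {
            "name": src["name"],
            "type": "JdbcDataSource",
            "driver": "com.mysql.jdbc.Driver",  # needs the MySQL connector .jar
            "url": f"jdbc:mysql://{src['host']}/{src['database']}",
            "user": src["user"],
            "password": src["password"],
        })
    # One <entity> per table, all inside a single <document>.
    document = ET.SubElement(root, "document")
    for src in sources:
        for table in src["tables"]:
            ET.SubElement(document, "entity", {
                "name": table,
                "dataSource": src["name"],
                "query": f"SELECT * FROM {table}",
            })
    return ET.tostring(root, encoding="unicode")


# Placeholder values, for illustration only.
xml_text = build_data_config([{
    "name": "galicia",
    "host": "localhost",
    "database": "records",
    "user": "solr",
    "password": "changeme",
    "tables": ["births", "marriages"],
}])
```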
41. schema.xml
• FieldTypes, Fields, and CopyFields
• FieldTypes give indexing and querying
instructions to “buckets”
• Fields say what's what and whether to
make something facetable or not
• CopyFields collect Fields together into
extra FieldTypes
42. schema.xml - FieldTypes
• 5 Custom FieldTypes (so far):
– givenname
– surname
– surname_bmpm (phonetic)
– place (note: not merely town)
– year (which we're treating as text right now)
46. schema.xml - Fields
• Uppercase fields take their names from
the MySQL column names
• Examples:
– Year
– SchoolYear
– Surname
– FathersTown
– MothersFathersGivenName
– MaternalGrandfathersGivenName
47. schema.xml - Fields
• Lowercase fields are added as the data
is imported into Solr, and start
with the prefix record_
• Examples:
– record_type (birth, death, tax, whatever)
– record_source (name of repository)
– record_latlong (latitude,longitude)
– record_id (required!)
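One way to picture the two kinds of fields is a small function that turns a database row into a Solr document. This is only a sketch of the idea, not how LeafSeek does it (there the import is driven by db-data-config.xml). The function name, the "type-id" format for record_id, and the sample values are assumptions.

```python
def to_solr_doc(row, record_type, record_source, latlong=None):
    """Combine a MySQL row (uppercase fields) with the added record_ fields."""
    if "id" not in row:
        raise ValueError("row needs the auto-incrementing id column")
    # Uppercase fields pass through under their MySQL column names;
    # empty values are dropped rather than indexed.
    doc = {k: v for k, v in row.items() if k != "id" and v not in (None, "")}
    doc["record_type"] = record_type      # birth, death, tax, whatever
    doc["record_source"] = record_source  # name of repository
    # Assumed convention: prefix the row id with the record type so ids
    # stay unique across tables.
    doc["record_id"] = f"{record_type}-{row['id']}"
    if latlong is not None:
        lat, lon = latlong
        doc["record_latlong"] = f"{lat},{lon}"  # latitude,longitude
    return doc


# Hypothetical row and source name, for illustration only.
example_doc = to_solr_doc(
    {"id": 7, "Surname": "Adler", "Year": "1872", "FathersTown": ""},
    record_type="birth",
    record_source="Example Archive",
    latlong=(50.0, 21.0),
)
```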
48. schema.xml - Fields
• You do not have to explicitly define every
Field.
• If something is imported that is not named
and defined in schema.xml it will just be
indexed as a straight-up text string, with
nothing done to it.
• Which is fine.
• But IMHO it's better to define everything
anyway so you can remember what's what
and what you are doing to it.
51. Add-ons and nice-to-haves
(for the back-end)
• Wildcards, and lots of 'em
• Non-name words handled through
stopwords.txt
• Nicknames and name synonyms handled
through synonyms.txt
• Two files included:
– synonyms_-_american-anglo-saxon.txt
– synonyms_-_polish-ukrainian-jewish.txt
• Should be based on your data and your
historical/ethnic community standards
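As a rough picture of what a synonyms file does, the sketch below reads comma-separated groups (the simple "all of these are equivalent" form of Solr's synonyms.txt) and expands a name into its group. Solr does this itself at index or query time, so this is purely illustrative. The sample lines are invented and are not taken from the two included files.

```python
def parse_synonyms(text):
    """Map each name to the full set of names on its line."""
    groups = {}
    for line in text.splitlines():
        line = line.strip()
        # Skip blanks, comments, and one-way "a => b" mappings (not handled here).
        if not line or line.startswith("#") or "=>" in line:
            continue
        names = {n.strip().lower() for n in line.split(",") if n.strip()}
        for name in names:
            groups.setdefault(name, set()).update(names)
    return groups


def expand(name, groups):
    """Return the name plus its synonyms, sorted; unknown names map to themselves."""
    key = name.strip().lower()
    return sorted(groups.get(key, {key}))


# Invented sample lines, for illustration only.
SAMPLE = """
# nicknames
william, bill, will, willy
elizabeth, betty, liz
"""

groups = parse_synonyms(SAMPLE)
bill_variants = expand("Bill", groups)
```

A search for "Bill" would then also match records indexed under "William", which is the behavior the synonym files are meant to provide.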
54. More add-ons and nice-to-haves
(for the back-end)
• Translate your site into different languages – multi-
lingual content deserves a real multi-lingual
website
– Pass user preferences through GET value or through
accept-language header or read from a cookie or
whatever you want
• Built-in performance monitoring hooks for New
Relic
• Soundalike searches for surname variants
– Levenshtein distance
– “Regular” Soundex, Metaphone, Caverphone, etc.
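For readers who have not met it, Levenshtein distance counts the single-character insertions, deletions, and substitutions needed to turn one string into another. Below is a minimal textbook implementation, with a hypothetical close_surnames helper and made-up surnames to show how it could flag spelling variants. It is not LeafSeek's code; in practice Solr's own fuzzy matching would do this work.

```python
def levenshtein(a, b):
    """Edit distance between strings a and b (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the shorter string in the inner loop
    previous = list(range(len(b) + 1))  # distances from "" to each prefix of b
    for i, ca in enumerate(a, start=1):
        current = [i]  # distance from a[:i] to ""
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute, or keep if equal
            ))
        previous = current
    return previous[-1]


def close_surnames(query, surnames, max_distance=2):
    """Hypothetical helper: surnames within max_distance edits of the query."""
    q = query.lower()
    return [s for s in surnames if levenshtein(q, s.lower()) <= max_distance]


variants = close_surnames("Schreier", ["Schreyer", "Shrayer", "Adler"])
```

The threshold matters: at max_distance=2 only "Schreyer" is returned for "Schreier", while "Shrayer" is three edits away and is missed, which is one reason phonetic approaches are listed alongside edit distance.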
55. This is the part where I tell
the story about
THE SAGA
of Beider-Morse Phonetic Matching
(BMPM)
56. Relevancy
• Right now, we're using exact matches
• (Of course, “exact” includes
wildcards, alternate names /
synonyms, etc.)
• Like “Old Search” on Ancestry.com
• DisMax! Boosting fields! Scoring!
• (…but not yet)
• Problems with records that contain
multiple people's names
64. Meet Solarium: The Guts
• You choose the parts of your data to facet
• Data is submitted to the front-end by
POST, not by GET, so the URL never
changes
• You can (and should) paginate results
listings
• You can't actually see the Solr server's
URL from the front-end, not even in view-
source
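Faceting and pagination both come down to a handful of standard Solr request parameters (q, start, rows, facet, facet.field). The sketch below builds such a query string in Python to show the arithmetic; LeafSeek itself goes through Solarium in PHP, so treat the function name, the surname field, and the default facet fields as assumptions.

```python
from urllib.parse import urlencode


def build_select_query(surname, page=1, rows=20,
                       facet_fields=("record_type", "Year")):
    """Build the query string for a faceted, paginated Solr search."""
    if page < 1:
        raise ValueError("page numbers start at 1")
    escaped = surname.replace('"', '\\"')  # keep the phrase query well-formed
    params = [
        ("q", f'surname:"{escaped}"'),
        ("start", (page - 1) * rows),  # offset of the first result on this page
        ("rows", rows),                # results per page
        ("wt", "json"),
        ("facet", "true"),
    ]
    # One facet.field parameter per field the user can filter on.
    params += [("facet.field", field) for field in facet_fields]
    return urlencode(params)


query_string = build_select_query("Adler", page=3)
```

Because the request is assembled on the server side, the browser only ever talks to the front-end, which is how the Solr URL stays out of view-source.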
65. Add-ons and nice-to-haves
(for the front-end)
• A welcome screen with information about
the database's contents
• Instructions (maybe twice)
• How many records in the database?
• How many datasets?
• What features are coming next?
• What datasets are coming next?
66. Add-ons and nice-to-haves
(for the front-end)
• Make good UI choices
• Pop-Up Google Maps
• Tooltips to reduce UI clutter
• Cross-browser compatibility
• Still stuck with IE 7 and 8
• CSS and code that degrades gracefully
• No small text
67. Bird's Eye View of Your Data
• What (surnames, towns, etc.) do I have in
my data?
• What are the TOP (surnames, towns, etc.)
in my data?
• Finding incorrect data
– Outlying years and dates
– Figure out that hard-to-read surname
• Make charts and graphs from your data
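Outside of Solr, the same questions can be asked of a plain list of records. The sketch below counts the top values for a field and flags years that fall outside an expected range. Solr's facet counts answer the first question directly, so this is only to show the idea; the function names and sample records are invented.

```python
from collections import Counter


def top_values(records, field, n=3):
    """Most common non-empty values of `field`, as (value, count) pairs."""
    counts = Counter(r[field] for r in records if r.get(field))
    return counts.most_common(n)


def outlying_years(records, earliest, latest, field="Year"):
    """Records whose year is non-numeric or outside [earliest, latest]."""
    flagged = []
    for record in records:
        value = str(record.get(field, "")).strip()
        if not value:
            continue  # a missing year is not an outlier
        # Years are stored as text, so check the format before comparing.
        if not value.isdigit() or not earliest <= int(value) <= latest:
            flagged.append(record)
    return flagged


# Invented sample records, for illustration only.
records = [
    {"Surname": "Adler", "Year": "1872"},
    {"Surname": "Adler", "Year": "1780"},
    {"Surname": "Kowalski", "Year": "18721"},  # likely a typing error
    {"Surname": "Kowalski", "Year": ""},
]

top_surname = top_values(records, "Surname", n=1)
suspect = outlying_years(records, earliest=1750, latest=1950)
```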
69. The (Back-End) Future! (Maybe.)
• Date ranges, instead of just years
• Auto-complete as you type
• “Did you mean...?”
(based on data frequency)
• “More Like This”
(would have to do scoring)
• Record bookmarking system (hashes?)
70. The (Front-End) Future! (Maybe.)
• Hierarchical facets for locations
• Disambiguating locations
• Social sharing of individual records
• New genealogy data schema
http://historical-data.org/
• Membership login system
72. Please Do Not Build That Wall
• Password protect some of the databases
• Password protect some of the data
• Open data, but pay for record or surname
bookmarking system
• Open data, but pay for API access
• Open data, but sell online ads
• Open data, but give people guilt trips
73. Presenting LeafSeek!
• Free and Open Source
• Code is all on GitHub
• Please add, edit, fix, change, tinker
• …and use it!