Enterprise search is a challenging problem for most organizations. Public search engines such as Google can index content and use link popularity, in addition to basic keyword matching, to rank results. Enterprise search is different: it often requires specially designed indexes as well as several processing steps.
At the U.S. Patent & Trademark Office, part of the Department of Commerce, a team of professionals is building the next generation of search tools using open source technologies. Like any large undertaking, it is not a simple plug-and-play project.
Main topics to be covered in this talk:
+ Architectures for Large Scale Enterprise Search
+ Leveraging Apache Cassandra & Spark
+ Customizing / Configuring Apache Solr and Indexing
+ Writing a Custom Parser for Solr in Scala
Building Enterprise Search Engines using Open Source Technologies
1. www.anant.us | solutions@anant.us | 202.905.2818
1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
Large Scale Search with Open Source Technologies
Building Search Engines
2. What do we do?
Streamline, Organize & Unify
Business Information
3. Agenda
• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala
4. Challenge – Why does this matter?
[Diagram: categories of business information (Knowledge, Project Information, Client Service Information, Corporate Guides, Collaborative Documents, Assets & Files, Corporate Resources) scattered across tools such as G Drive, Dropbox, Delta, Nutshell, Freshbooks, G Sites (KB), Workflowy, Evernote, OwnCloud, Pocket, Leaves, and WordPress sites (AIC, Anant), unified by the Appleseed Framework (Portal, Base, Search).]
5. Search Engine – 30 Thousand Foot View
The search index is only as good as your processed data. If you put everything you find into your index, you are going to spend a lot of time telling people how to search.
6. Lucene – More than meets the eye
Think of it as a “NoSQL” database with great indexing, everywhere.
7. Cassandra – NoSQL With Structure
Think of it as a “NoSQL” database that has structure. Using CQL, you can insert into and select from tables; you just can't join them.
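The insert-and-select-but-no-join model can be sketched in CQL; the table and column names below are illustrative, not taken from the USPTO system:

```sql
-- Illustrative CQL sketch: SQL-like statements, key-based reads, no joins.
CREATE TABLE documents (
  doc_id  uuid PRIMARY KEY,
  title   text,
  body    text
);

INSERT INTO documents (doc_id, title, body)
VALUES (uuid(), 'Sample document', 'Full text to be indexed later');

-- Reads are lookups by key; there is no JOIN in CQL.
SELECT title, body FROM documents
WHERE doc_id = 123e4567-e89b-12d3-a456-426614174000;
```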
8. Spark – Way Better MapReduce
Think of it as what MapReduce would be if it had been created in Scala instead of Java, with streams. It can also be up to 100 times faster.
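The map/reduce-with-streams idea can be sketched without a cluster. This pure-Java streams word count (no Spark dependency; the class name is made up for illustration) mirrors the shape of computation that Spark distributes across many machines:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Word count in the map/reduce style: split the text (map),
// then group identical words and count them (reduce).
public class WordCount {
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be"));
    }
}
```

Spark's RDD and Dataset APIs expose the same functional shape (map, filter, reduce) but evaluate it lazily and in parallel across a cluster.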
9. Configuring - Solr - 1/3
Solr is like an eighteen-wheeler you can take apart and rebuild from
the ground up with only what you need, or add as much as you want.
• Configuration - Schema
–Data Types
–Pre-Processing
–Collection Definitions
–Managed vs. Unmanaged
• Configuration - ZooKeeper
–Synchronize Configurations
–Distribute Shards
–Manage Replicas
–Elect Leaders
• Configuration - SolrConfig
–Handlers
–Components
–Indexing Configurations
–Memory / Cache
–File System
• Lessons Learned
–Try to use it out of the box first
–Try to configure your way to a solution
–Make sure to upgrade
–Not everything can be configured
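A small sketch of what the schema configuration above looks like in practice. The field and type names here are illustrative; the factory class names are standard Solr analysis classes:

```xml
<!-- Illustrative managed-schema fragment: a stored, indexed text field
     with tokenization and lowercasing applied during analysis. -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```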
14. Customizing - Solr - 1/3
Solr is like an eighteen-wheeler you can take apart and rebuild from
the ground up with only what you need, or add as much as you want.
• Customization - Parsing
–Need Specialized Syntax?
–Java or Scala Based
–Open Plugin Structure
–Several Examples
• Customization - Highlighting
–Need Special Coloring?
–Specialized Syntax Aware
–Open Plugin Structure
–Several Examples
• Customization - Term Counts
–Need Specific Information?
–Create the Logic
–Register the Component
–Complicated Examples
• Lessons Learned
–Major version upgrades = pain
–Newer classes are easier to extend
–Long term investment
15. Customizing - Solr - 2/3
• Custom Component in Scala or Java
• Installing the Component
http://wiki.apache.org/solr/SolrPlugins
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html
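Installing a component boils down to putting the jar on Solr's classpath and registering the class in solrconfig.xml; the class names below are placeholders for your own plugin:

```xml
<!-- Illustrative solrconfig.xml entries; com.example.* are placeholders. -->
<queryParser name="legacy" class="com.example.LegacyQParserPlugin"/>
<searchComponent name="termCounts" class="com.example.TermCountComponent"/>
```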
17. Creating a Custom Parser with Scala
Building a parser in Scala wasn't my first choice, but writing it
in Scala showed me how much better the language is.
• Why a Specialized Syntax?
–Legacy Syntax
–Boolean AND Proximity Queries
–Specialized Fielded Expressions
–Ranges / Classifications
• Why not ANTLR or JavaCC?
–Old parser was in Parboiled (v1)
–Parboiled2 was in Scala
–No need to learn a separate syntax for creating syntax
• Lessons Learned
–Parboiled2 Documentation = bad
–Understand the syntax
–Interactive REPL in Scala = good
–Write tons of unit tests
–Long term investment
• Customizing Solr with Scala
–Found a good Virtual Mentor
–Learned Scala (not for Spark)
–Started from the ground up
–Reduced from ~1k to 400 LOC
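To make the "legacy syntax" idea concrete, here is a deliberately tiny sketch of the kind of translation a custom query parser performs. It uses plain Java regex rather than Parboiled2, and the `field/term` legacy form is made up for illustration; the real grammar also handles Boolean AND proximity queries, ranges, and classifications:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: translating a legacy "field/term" query syntax
// into Lucene/Solr "field:term" syntax. A real custom parser builds a
// full parse tree instead of doing a one-pass rewrite.
public class LegacyQueryTranslator {
    private static final Pattern FIELDED = Pattern.compile("(\\w+)/(\\w+)");

    public static String translate(String legacy) {
        Matcher m = FIELDED.matcher(legacy);
        return m.replaceAll("$1:$2");
    }

    public static void main(String[] args) {
        System.out.println(translate("ttl/rocket AND abs/engine"));
    }
}
```

A grammar-based parser (Parboiled2, ANTLR, JavaCC) earns its keep once the syntax has nesting and precedence, which is exactly where a regex rewrite like this breaks down.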