Enterprise search is a challenging problem for most organizations. Public search engines such as Google can index content and use link popularity, in addition to basic keyword matching, to rank results. Enterprise search is different: it often requires specially designed indexes as well as several processing steps.
At the U.S. Patent & Trademark Office, part of the Department of Commerce, a team of professionals is building the next generation of search tools using open source technologies. Like any large undertaking, it is not a simple plug-and-play project.
Main topics to be covered in this talk:
+ Architectures for Large Scale Enterprise Search
+ Leveraging Apache Cassandra & Spark
+ Customizing / Configuring Apache Solr and Indexing
+ Writing a Custom Parser for Solr in Scala
Building Enterprise Search Engines using Open Source Technologies
1. www.anant.us | solutions@anant.us | 202.905.2818
1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
Large Scale Search with Open Source Technologies
Building Search Engines
2. What do we do?
Streamline, Organize & Unify
Business Information
3. Agenda
• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/Solr
• Custom Parser - Written in Scala
4. Challenge – Why does this matter?
[Diagram: categories of business information (Knowledge, Project Information, Client Service Information, Corporate Guides, Collaborative Documents, Assets & Files, Corporate Resources) scattered across tools such as G Drive, Dropbox, Delta, Nutshell, Freshbooks, G Sites (KB), Workflowy, Evernote, OwnCloud, Pocket, Leaves, and WordPress sites (AIC, Anant), unified by the Appleseed Framework (Portal, Base, Search).]
5. Search Engine – 30 Thousand Foot View
The search index is only as good as your processed data. If you put everything you find into your index, you are going to spend a lot of time telling people how to search.
6. Lucene – More than meets the eye
Think of it as a “NoSQL” database with great indexing, everywhere.
7. Cassandra – NoSQL With Structure
Think of it as a “NoSQL” database that has structure. Using CQL, you can insert into and select from tables; you just can't join them.
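The insert-and-select-but-no-join model can be sketched in CQL; the table and column names below are illustrative, not taken from the USPTO system:

```sql
-- Illustrative CQL sketch: SQL-like statements, key-based reads, no joins.
CREATE TABLE documents (
  doc_id  uuid PRIMARY KEY,
  title   text,
  body    text
);

INSERT INTO documents (doc_id, title, body)
VALUES (uuid(), 'Sample document', 'Full text to be indexed later');

-- Reads are lookups by key; there is no JOIN in CQL.
SELECT title, body FROM documents
WHERE doc_id = 123e4567-e89b-12d3-a456-426614174000;
```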
8. Spark – Way Better MapReduce
Think of it as what MapReduce would be if it had been created in Scala instead of Java, with streams. It can also be up to 100 times faster.
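The map/reduce-with-streams idea can be sketched without a cluster. This pure-Java streams word count (no Spark dependency; the class name is made up for illustration) mirrors the shape of computation that Spark distributes across many machines:

```java
import java.util.Arrays;
import java.util.Map;
import java.util.function.Function;
import java.util.stream.Collectors;

// Word count in the map/reduce style: split the text (map),
// then group identical words and count them (reduce).
public class WordCount {
    public static Map<String, Long> count(String text) {
        return Arrays.stream(text.toLowerCase().split("\\s+"))
                .collect(Collectors.groupingBy(Function.identity(),
                                               Collectors.counting()));
    }

    public static void main(String[] args) {
        System.out.println(count("to be or not to be"));
    }
}
```

Spark's RDD and Dataset APIs expose the same functional shape (map, filter, reduce) but evaluate it lazily and in parallel across a cluster.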
9. Configuring - Solr - 1/3
Solr is like an eighteen-wheeler you can take apart and rebuild from
the ground up with only what you need, or add as much as you want.
• Configuration - Schema
–Data Types
–Pre-Processing
–Collection Definitions
–Managed vs. Unmanaged
• Configuration - ZooKeeper
–Synchronize Configurations
–Distribute Shards
–Manage Replicas
–Elect Leaders
• Configuration - SolrConfig
–Handlers
–Components
–Indexing Configurations
–Memory / Cache
–File System
• Lessons Learned
–Try to use it out of the box first
–Try to configure your way to a solution
–Make sure to upgrade
–Not everything can be configured
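A small sketch of what the schema configuration above looks like in practice. The field and type names here are illustrative; the factory class names are standard Solr analysis classes:

```xml
<!-- Illustrative managed-schema fragment: a stored, indexed text field
     with tokenization and lowercasing applied during analysis. -->
<field name="title" type="text_general" indexed="true" stored="true"/>

<fieldType name="text_general" class="solr.TextField" positionIncrementGap="100">
  <analyzer>
    <tokenizer class="solr.StandardTokenizerFactory"/>
    <filter class="solr.LowerCaseFilterFactory"/>
  </analyzer>
</fieldType>
```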
14. Customizing - Solr - 1/3
Solr is like an eighteen-wheeler you can take apart and rebuild from
the ground up with only what you need, or add as much as you want.
• Customization - Parsing
–Need Specialized Syntax?
–Java or Scala Based
–Open Plugin Structure
–Several Examples
• Customization - Highlighting
–Need Special Coloring?
–Specialized Syntax Aware
–Open Plugin Structure
–Several Examples
• Customization - Term Counts
–Need Specific Information?
–Create the Logic
–Register the Component
–Complicated Examples
• Lessons Learned
–Major version upgrades = pain
–Newer classes are easier to extend
–Long term investment
15. Customizing - Solr - 2/3
• Custom Component in Scala or Java
• Installing the Component
http://wiki.apache.org/solr/SolrPlugins
http://sujitpal.blogspot.com/2011/03/using-lucenes-new-queryparser-framework.html
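Installing a component boils down to putting the jar on Solr's classpath and registering the class in solrconfig.xml; the class names below are placeholders for your own plugin:

```xml
<!-- Illustrative solrconfig.xml entries; com.example.* are placeholders. -->
<queryParser name="legacy" class="com.example.LegacyQParserPlugin"/>
<searchComponent name="termCounts" class="com.example.TermCountComponent"/>
```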
17. Creating a Custom Parser with Scala
Building a parser in Scala wasn't my first choice, but writing it
in Scala showed me how much better the language is.
• Why a Specialized Syntax?
–Legacy Syntax
–Boolean AND Proximity Queries
–Specialized Fielded Expressions
–Ranges / Classifications
• Why not ANTLR or JavaCC?
–Old parser was in Parboiled (v1)
–Parboiled2 was in Scala
–No need to learn a separate syntax for creating syntax
• Lessons Learned
–Parboiled2 Documentation = bad
–Understand the syntax
–Interactive REPL in Scala = good
–Write tons of unit tests
–Long term investment
• Customizing Solr with Scala
–Found a good Virtual Mentor
–Learned Scala (not for Spark)
–Started from the ground up
–Reduced from ~1k to 400 LOC
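To make the "legacy syntax" idea concrete, here is a deliberately tiny sketch of the kind of translation a custom query parser performs. It uses plain Java regex rather than Parboiled2, and the `field/term` legacy form is made up for illustration; the real grammar also handles Boolean AND proximity queries, ranges, and classifications:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical example: translating a legacy "field/term" query syntax
// into Lucene/Solr "field:term" syntax. A real custom parser builds a
// full parse tree instead of doing a one-pass rewrite.
public class LegacyQueryTranslator {
    private static final Pattern FIELDED = Pattern.compile("(\\w+)/(\\w+)");

    public static String translate(String legacy) {
        Matcher m = FIELDED.matcher(legacy);
        return m.replaceAll("$1:$2");
    }

    public static void main(String[] args) {
        System.out.println(translate("ttl/rocket AND abs/engine"));
    }
}
```

A grammar-based parser (Parboiled2, ANTLR, JavaCC) earns its keep once the syntax has nesting and precedence, which is exactly where a regex rewrite like this breaks down.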