SlideShare una empresa de Scribd logo
1 de 20
www.anant.us | solutions@anant.us | 202.905.2818
1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
Large Scale Search with Open Source Technologies
Building Search Engines
What do we do?
Streamline, Organize & Unify
Business Information
Agenda
• Challenge - Why does this matter?
• Search Engine - 30k Foot View
• Open - Lucene, Cassandra & Spark
• Customizing - Apache Lucene/SolR
• Custom Parser - Written in Scala
Challenge – Why does this matter?
Knowledge
Project
Information
Client Service
Information
Corporate
Guides
Collaborative
Documents
Assets
& Files
Corporate
Resources
Appleseed Framework (Portal, Base, Search)
G Drive
Delta
DropBox
G Drive
Delta
Nutshell
Dropbox
Freshbooks
G Drive
G Sites (KB)
G Drive
Workflowy
Evernote
G Drive
DropBox
OwnCloud
Pocket
Leaves
AIC (WP)
Anant (WP)
Search Engine – 30 Thousand Foot View
The search index is only as good as your processed data.
If you put everything you find in your index, you are going to
spend a lot of time telling people how to search.
Lucene – More than meets the eye
Who
Next?
Think of it like a “NoSQL” Database that has great indexing..
everywhere.
Cassandra – NoSQL With Structure
Who
Next?
Think of it like a “NoSQL” Database that has structure. Using
“CQL” You can insert into and select from.. just not join.
Spark – Way Better MapReduce
Who
Next?
Think of it like MapReduce if MapReduce were created with
scala, instead of Java, with streams. It’s also 100 times faster.
Configuring - SolR - 1/3
SolR is like an eighteen wheel truck you can take apart and rebuild from
the ground up with only what you need, or add as much as you want.
• Configuration - Schema
–Data Types
–Pre-Processing
–Collection Definitions
–Managed vs. Unmanaged
• Configuration - ZooKeeper
–Synchronize Configurations
–Distribute Shards
–Manage Replicas
–Elect Leaders
• Configuration - SolrConfig
–Handlers
–Components
–Indexing Configurations
–Memory / Cache
–File System
• Lessons Learned
–Try to use out of the box
–Try to configure your way
–Make sure to upgrade
–Not everything can be configured
Configuring - SolR - 2/3
• Before Docker
–Setup Zookeeper
•Customize zoo.cfg
•Setup Zookeeper Servers
–Setup SolR Distro
•Download SolR
•Clean up SolR
•Customize Schema.xml
•Customize SolrConfig.xml
•Setup Different Solr Servers
–Start the Cloud
•Custom Start Scripts
• Today w/ Docker
– docker run --name zookeeper 
-p 127.0.0.1:2181:2181 
-p 127.0.0.1:2888:2888 
-p 127.0.0.1:3888:3888 
jplock/zookeeper
– docker run --link zookeeper:ZK -i 
-p 127.0.0.1:8983:8983 
-t dockerimages/docker-solr 
/bin/bash -c '
cd /opt/solr/example; 
java -jar 
-Dbootstrap_confdir=./solr/collection1/conf 
-Dcollection.configName=myconf  -
DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PO
RT_2181_TCP_PORT 
-DnumShards=2 
start.jar';
https://hub.docker.com/r/dockerima
ges/docker-solr/
https://cwiki.apache.org/confluence/displa
y/solr/Getting+Started+with+SolrCloud
https://cwiki.apache.org/confluence/displa
y/solr/Taking+Solr+to+Production
Configuring - SolR - 3/3
• SolrConfig - Example • Schema - Example
https://cwiki.apache.org/confluence/displa
y/solr/Configuring+solrconfig.xml
https://wiki.apache.org/solr/SchemaXml
SolR Cloud / Zookeeper
User Interface - Super Advanced
Customizing - SolR - 1/3
SolR is like an eighteen wheel truck you can take apart and rebuild from
the ground up with only what you need, or add as much as you want.
• Customization - Parsing
–Need Specialized Syntax?
–Java or Scala Based
–Open Plugin Structure
–Several Examples
• Customization - Highlighting
–Need Special Coloring?
–Specialized Syntax Aware
–Open Plugin Structure
–Several Examples
• Customization - Term Counts
–Need Specific Information?
–Create the Logic
–Register the Component
–Complicated Examples
• Lessons Learned
–Major version upgrades = pain
–Newer classes can be extended
better
–Long term investment
Customizing - SolR - 2/3
• Custom Component in Scala or Java • Installing the Component
http://wiki.apache.org/solr/SolrPluginshttp://sujitpal.blogspot.com/2011/03/using
-lucenes-new-queryparser-
framework.html
Customizing - SolR - 3/3
Creating a Custom Parser with Scala
Building a parser in Scala wasn’t my first choice, but creating it
in Scala made me see how much better the language is.
• Why a Specialized Syntax?
–Legacy Syntax
–Boolean AND Proximity Queries
–Specialized Fielded Expressions
–Ranges / Classifications
• Why not ANTLR or JavaCC?
–Old Parser was in Parboiled(1)
–Parboiled2 was in Scala
–No need to learn a separate
Syntax for Creating Syntax
• Lessons Learned
–Parboiled2 Documentation = bad
–Understand the syntax
–Interactive REPL in Scala = good
–Write tons of unit tests
–Long term investment
• Customizing SolR with Scala
–Found a good Virtual Mentor
–Learned Scala (not for Spark)
–Started from the ground up
–Reduced from ~1k to 400 LOC
JavaCC vs. parboiled2 (Scala)
• Java CC - SurroundQuery.jj • Scala based Parboiled2
Questions & Contact
www.anant.us | solutions@anant.us | 202.905.2818
1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
@anantcorp
facebook.com/anantCorp
linkedin.com/company/anant
rahul@anant.us
linkedin.com/in/xingh
Rahul Singh
CEO & Founder
Questions & Contact
• Brown Bag Session or Meetup?
• Modern Enterprise
• Mastering Services in the Service of Others
• Hybrid Agile Project Management
• Building Search Engines
• CICD / DevOps
• Connecting Internet Software
www.anant.us | solutions@anant.us | 202.905.2818
1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007
Streamlined Data
Integration / Data Pipelines
Organized Knowledge
Search / Data Warehouses
Unified Interfaces
Portals / Dashboards / Mobile

Más contenido relacionado

La actualidad más candente

SQL Azure - the good, the bad and the ugly.
SQL Azure - the good, the bad and the ugly.SQL Azure - the good, the bad and the ugly.
SQL Azure - the good, the bad and the ugly.Pini Krisher
 
SQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutesSQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutesIke Ellis
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformIke Ellis
 
Harnessing Spark and Cassandra with Groovy
Harnessing Spark and Cassandra with GroovyHarnessing Spark and Cassandra with Groovy
Harnessing Spark and Cassandra with GroovySteve Pember
 
Test driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBTest driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBAndrew Siemer
 
Design for scale
Design for scaleDesign for scale
Design for scaleDoug Lampe
 
The Importance of Wait Statistics in SQL Server
The Importance of Wait Statistics in SQL ServerThe Importance of Wait Statistics in SQL Server
The Importance of Wait Statistics in SQL ServerGrant Fritchey
 
Performance Tuning Azure SQL Database
Performance Tuning Azure SQL DatabasePerformance Tuning Azure SQL Database
Performance Tuning Azure SQL DatabaseGrant Fritchey
 
AtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
AtlasCamp 2014: Preparing Your Plugin for JIRA Data CenterAtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
AtlasCamp 2014: Preparing Your Plugin for JIRA Data CenterAtlassian
 
SQL Server 2016 What's New For Developers
SQL Server 2016  What's New For DevelopersSQL Server 2016  What's New For Developers
SQL Server 2016 What's New For DevelopersDavide Mauri
 
DC Titanium User Group Meetup: Appcelerator Titanium Alloy jan2013
DC Titanium User Group Meetup: Appcelerator Titanium Alloy jan2013DC Titanium User Group Meetup: Appcelerator Titanium Alloy jan2013
DC Titanium User Group Meetup: Appcelerator Titanium Alloy jan2013Aaron Saunders
 
Ohio Devfest - Visual Analysis with GCP
Ohio Devfest - Visual Analysis with GCPOhio Devfest - Visual Analysis with GCP
Ohio Devfest - Visual Analysis with GCPWesley Workman
 
Static web apps by GitHub action
Static web apps by GitHub actionStatic web apps by GitHub action
Static web apps by GitHub actionSeven Peaks Speaks
 
Data modeling trends for Analytics
Data modeling trends for AnalyticsData modeling trends for Analytics
Data modeling trends for AnalyticsIke Ellis
 
Intro to API Design Principles
Intro to API Design PrinciplesIntro to API Design Principles
Intro to API Design PrinciplesVictor Osimitz
 

La actualidad más candente (20)

SQL Azure - the good, the bad and the ugly.
SQL Azure - the good, the bad and the ugly.SQL Azure - the good, the bad and the ugly.
SQL Azure - the good, the bad and the ugly.
 
SQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutesSQL PASS BAC - 60 reporting tips in 60 minutes
SQL PASS BAC - 60 reporting tips in 60 minutes
 
A lap around microsofts business intelligence platform
A lap around microsofts business intelligence platformA lap around microsofts business intelligence platform
A lap around microsofts business intelligence platform
 
Harnessing Spark and Cassandra with Groovy
Harnessing Spark and Cassandra with GroovyHarnessing Spark and Cassandra with Groovy
Harnessing Spark and Cassandra with Groovy
 
Campus days Azure HDInsight automation
Campus days Azure HDInsight automationCampus days Azure HDInsight automation
Campus days Azure HDInsight automation
 
Test driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDBTest driving Azure Search and DocumentDB
Test driving Azure Search and DocumentDB
 
Design for scale
Design for scaleDesign for scale
Design for scale
 
The Importance of Wait Statistics in SQL Server
The Importance of Wait Statistics in SQL ServerThe Importance of Wait Statistics in SQL Server
The Importance of Wait Statistics in SQL Server
 
Performance Tuning Azure SQL Database
Performance Tuning Azure SQL DatabasePerformance Tuning Azure SQL Database
Performance Tuning Azure SQL Database
 
AtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
AtlasCamp 2014: Preparing Your Plugin for JIRA Data CenterAtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
AtlasCamp 2014: Preparing Your Plugin for JIRA Data Center
 
Postgres Open
Postgres OpenPostgres Open
Postgres Open
 
In Memory Cahce Structure
In Memory Cahce StructureIn Memory Cahce Structure
In Memory Cahce Structure
 
SQL Server 2016 What's New For Developers
SQL Server 2016  What's New For DevelopersSQL Server 2016  What's New For Developers
SQL Server 2016 What's New For Developers
 
DC Titanium User Group Meetup: Appcelerator Titanium Alloy jan2013
DC Titanium User Group Meetup: Appcelerator Titanium Alloy jan2013DC Titanium User Group Meetup: Appcelerator Titanium Alloy jan2013
DC Titanium User Group Meetup: Appcelerator Titanium Alloy jan2013
 
Solr
SolrSolr
Solr
 
Ohio Devfest - Visual Analysis with GCP
Ohio Devfest - Visual Analysis with GCPOhio Devfest - Visual Analysis with GCP
Ohio Devfest - Visual Analysis with GCP
 
SPA vs. MPA
SPA vs. MPASPA vs. MPA
SPA vs. MPA
 
Static web apps by GitHub action
Static web apps by GitHub actionStatic web apps by GitHub action
Static web apps by GitHub action
 
Data modeling trends for Analytics
Data modeling trends for AnalyticsData modeling trends for Analytics
Data modeling trends for Analytics
 
Intro to API Design Principles
Intro to API Design PrinciplesIntro to API Design Principles
Intro to API Design Principles
 

Destacado

Deliver Excellent Service to your Customers
Deliver Excellent Service to your CustomersDeliver Excellent Service to your Customers
Deliver Excellent Service to your CustomersRahul Singh
 
Google GmbH - Die Zukunft der Unternehmenssuche
Google GmbH - Die Zukunft der UnternehmenssucheGoogle GmbH - Die Zukunft der Unternehmenssuche
Google GmbH - Die Zukunft der UnternehmenssucheUnic
 
Get Intelligent with Metabase
Get Intelligent with MetabaseGet Intelligent with Metabase
Get Intelligent with MetabaseAnant Corporation
 
Enterprise Search: Potenziale und Fallstricke
Enterprise Search: Potenziale und FallstrickeEnterprise Search: Potenziale und Fallstricke
Enterprise Search: Potenziale und FallstrickeAlexander Stocker
 
Domain-Driven Design with ASP.NET MVC
Domain-Driven Design with ASP.NET MVCDomain-Driven Design with ASP.NET MVC
Domain-Driven Design with ASP.NET MVCSteven Smith
 
Implementing DDD with C#
Implementing DDD with C#Implementing DDD with C#
Implementing DDD with C#Pascal Laurin
 

Destacado (8)

Deliver Excellent Service to your Customers
Deliver Excellent Service to your CustomersDeliver Excellent Service to your Customers
Deliver Excellent Service to your Customers
 
Google GmbH - Die Zukunft der Unternehmenssuche
Google GmbH - Die Zukunft der UnternehmenssucheGoogle GmbH - Die Zukunft der Unternehmenssuche
Google GmbH - Die Zukunft der Unternehmenssuche
 
Get Intelligent with Metabase
Get Intelligent with MetabaseGet Intelligent with Metabase
Get Intelligent with Metabase
 
Enterprise Search: Potenziale und Fallstricke
Enterprise Search: Potenziale und FallstrickeEnterprise Search: Potenziale und Fallstricke
Enterprise Search: Potenziale und Fallstricke
 
Domain-Driven Design with ASP.NET MVC
Domain-Driven Design with ASP.NET MVCDomain-Driven Design with ASP.NET MVC
Domain-Driven Design with ASP.NET MVC
 
Implementing DDD with C#
Implementing DDD with C#Implementing DDD with C#
Implementing DDD with C#
 
Domain Driven Design 101
Domain Driven Design 101Domain Driven Design 101
Domain Driven Design 101
 
Enterprise Solutions_German
Enterprise Solutions_GermanEnterprise Solutions_German
Enterprise Solutions_German
 

Similar a Building Enterprise Search Engines using Open Source Technologies

Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher lucenerevolution
 
Search and analyze your data with elasticsearch
Search and analyze your data with elasticsearchSearch and analyze your data with elasticsearch
Search and analyze your data with elasticsearchAnton Udovychenko
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr PerformanceLucidworks
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkRahul Jain
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksLucidworks
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCampGokulD
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scalethelabdude
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Shalin Shekhar Mangar
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceLucidworks (Archived)
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenMS Cloud Summit
 
PLAT-4 Understanding the SOLR Integration
PLAT-4 Understanding the SOLR IntegrationPLAT-4 Understanding the SOLR Integration
PLAT-4 Understanding the SOLR IntegrationAlfresco Software
 
Scaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of CollectionsScaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of CollectionsAnshum Gupta
 
Cost Effectively Run Multiple Oracle Database Copies at Scale
Cost Effectively Run Multiple Oracle Database Copies at Scale Cost Effectively Run Multiple Oracle Database Copies at Scale
Cost Effectively Run Multiple Oracle Database Copies at Scale NetApp
 
From Lucene to Solr 4 Trunk
From Lucene to Solr 4 TrunkFrom Lucene to Solr 4 Trunk
From Lucene to Solr 4 Trunktdthomassld
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systexJames Chen
 

Similar a Building Enterprise Search Engines using Open Source Technologies (20)

Big Data Technologies
Big Data Technologies Big Data Technologies
Big Data Technologies
 
Solr
SolrSolr
Solr
 
Rapid Prototyping with Solr
Rapid Prototyping with SolrRapid Prototyping with Solr
Rapid Prototyping with Solr
 
Rapid prototyping with solr - By Erik Hatcher
Rapid prototyping with solr -  By Erik Hatcher Rapid prototyping with solr -  By Erik Hatcher
Rapid prototyping with solr - By Erik Hatcher
 
Search and analyze your data with elasticsearch
Search and analyze your data with elasticsearchSearch and analyze your data with elasticsearch
Search and analyze your data with elasticsearch
 
Benchmarking Solr Performance
Benchmarking Solr PerformanceBenchmarking Solr Performance
Benchmarking Solr Performance
 
Real time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache SparkReal time Analytics with Apache Kafka and Apache Spark
Real time Analytics with Apache Kafka and Apache Spark
 
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, LucidworksYour Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
Your Big Data Stack is Too Big!: Presented by Timothy Potter, Lucidworks
 
Big Search with Big Data Principles
Big Search with Big Data PrinciplesBig Search with Big Data Principles
Big Search with Big Data Principles
 
Lucene BootCamp
Lucene BootCampLucene BootCamp
Lucene BootCamp
 
Benchmarking Solr Performance at Scale
Benchmarking Solr Performance at ScaleBenchmarking Solr Performance at Scale
Benchmarking Solr Performance at Scale
 
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
Scaling SolrCloud to a Large Number of Collections - Fifth Elephant 2014
 
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr PerformanceSFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
SFBay Area Solr Meetup - June 18th: Benchmarking Solr Performance
 
Laravel 4 presentation
Laravel 4 presentationLaravel 4 presentation
Laravel 4 presentation
 
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. NielsenJ1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
J1 T1 3 - Azure Data Lake store & analytics 101 - Kenneth M. Nielsen
 
PLAT-4 Understanding the SOLR Integration
PLAT-4 Understanding the SOLR IntegrationPLAT-4 Understanding the SOLR Integration
PLAT-4 Understanding the SOLR Integration
 
Scaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of CollectionsScaling SolrCloud to a large number of Collections
Scaling SolrCloud to a large number of Collections
 
Cost Effectively Run Multiple Oracle Database Copies at Scale
Cost Effectively Run Multiple Oracle Database Copies at Scale Cost Effectively Run Multiple Oracle Database Copies at Scale
Cost Effectively Run Multiple Oracle Database Copies at Scale
 
From Lucene to Solr 4 Trunk
From Lucene to Solr 4 TrunkFrom Lucene to Solr 4 Trunk
From Lucene to Solr 4 Trunk
 
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
[Hic2011] using hadoop lucene-solr-for-large-scale-search by systex
 

Más de Rahul Singh

Unifying Business Information with Dashboards
Unifying Business Information with Dashboards Unifying Business Information with Dashboards
Unifying Business Information with Dashboards Rahul Singh
 
Get Your Shit Together
Get Your Shit TogetherGet Your Shit Together
Get Your Shit TogetherRahul Singh
 
Machine Learning & Graph Processing w/ Spark and Accumulo
Machine Learning & Graph Processing w/ Spark and AccumuloMachine Learning & Graph Processing w/ Spark and Accumulo
Machine Learning & Graph Processing w/ Spark and AccumuloRahul Singh
 
Building Online Business Software 101 (B2B)
Building Online Business Software 101 (B2B) Building Online Business Software 101 (B2B)
Building Online Business Software 101 (B2B) Rahul Singh
 
Asynchronous Data Processing
Asynchronous Data ProcessingAsynchronous Data Processing
Asynchronous Data ProcessingRahul Singh
 
Building Smart Indexes for Drupal Sites
Building Smart Indexes for Drupal SitesBuilding Smart Indexes for Drupal Sites
Building Smart Indexes for Drupal SitesRahul Singh
 
Building People First - Lessons in Team Effectiveness & Happiness
Building People First - Lessons in Team Effectiveness & HappinessBuilding People First - Lessons in Team Effectiveness & Happiness
Building People First - Lessons in Team Effectiveness & HappinessRahul Singh
 
Select * From Internet - Integrating the Web
Select * From Internet - Integrating the WebSelect * From Internet - Integrating the Web
Select * From Internet - Integrating the WebRahul Singh
 
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...Rahul Singh
 
The Future of the Internet - The Next 30 Years
The Future of the Internet - The Next 30 YearsThe Future of the Internet - The Next 30 Years
The Future of the Internet - The Next 30 YearsRahul Singh
 
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...Modern Presidential Communications - Communicating Presidential Rhetorical Vi...
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...Rahul Singh
 
Rahul.singh.speech presentation
Rahul.singh.speech presentationRahul.singh.speech presentation
Rahul.singh.speech presentationRahul Singh
 
Anant - Micro Enterprise - The Future, Today
Anant - Micro Enterprise - The Future, TodayAnant - Micro Enterprise - The Future, Today
Anant - Micro Enterprise - The Future, TodayRahul Singh
 

Más de Rahul Singh (13)

Unifying Business Information with Dashboards
Unifying Business Information with Dashboards Unifying Business Information with Dashboards
Unifying Business Information with Dashboards
 
Get Your Shit Together
Get Your Shit TogetherGet Your Shit Together
Get Your Shit Together
 
Machine Learning & Graph Processing w/ Spark and Accumulo
Machine Learning & Graph Processing w/ Spark and AccumuloMachine Learning & Graph Processing w/ Spark and Accumulo
Machine Learning & Graph Processing w/ Spark and Accumulo
 
Building Online Business Software 101 (B2B)
Building Online Business Software 101 (B2B) Building Online Business Software 101 (B2B)
Building Online Business Software 101 (B2B)
 
Asynchronous Data Processing
Asynchronous Data ProcessingAsynchronous Data Processing
Asynchronous Data Processing
 
Building Smart Indexes for Drupal Sites
Building Smart Indexes for Drupal SitesBuilding Smart Indexes for Drupal Sites
Building Smart Indexes for Drupal Sites
 
Building People First - Lessons in Team Effectiveness & Happiness
Building People First - Lessons in Team Effectiveness & HappinessBuilding People First - Lessons in Team Effectiveness & Happiness
Building People First - Lessons in Team Effectiveness & Happiness
 
Select * From Internet - Integrating the Web
Select * From Internet - Integrating the WebSelect * From Internet - Integrating the Web
Select * From Internet - Integrating the Web
 
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...
Bill Drayton - Father of Social Entrepreneurship, Leading Leader of Social Ch...
 
The Future of the Internet - The Next 30 Years
The Future of the Internet - The Next 30 YearsThe Future of the Internet - The Next 30 Years
The Future of the Internet - The Next 30 Years
 
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...Modern Presidential Communications - Communicating Presidential Rhetorical Vi...
Modern Presidential Communications - Communicating Presidential Rhetorical Vi...
 
Rahul.singh.speech presentation
Rahul.singh.speech presentationRahul.singh.speech presentation
Rahul.singh.speech presentation
 
Anant - Micro Enterprise - The Future, Today
Anant - Micro Enterprise - The Future, TodayAnant - Micro Enterprise - The Future, Today
Anant - Micro Enterprise - The Future, Today
 

Último

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Steffen Staab
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsAlberto González Trastoy
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️Delhi Call girls
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerThousandEyes
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...panagenda
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionSolGuruz
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...MyIntelliSource, Inc.
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxComplianceQuest1
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsAndolasoft Inc
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comFatema Valibhai
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsJhone kinadey
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providermohitmore19
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️anilsa9823
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxbodapatigopi8531
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceanilsa9823
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...harshavardhanraghave
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...MyIntelliSource, Inc.
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...OnePlan Solutions
 

Último (20)

Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
Shapes for Sharing between Graph Data Spaces - and Epistemic Querying of RDF-...
 
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time ApplicationsUnveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
Unveiling the Tech Salsa of LAMs with Janus in Real-Time Applications
 
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
call girls in Vaishali (Ghaziabad) 🔝 >༒8448380779 🔝 genuine Escort Service 🔝✔️✔️
 
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected WorkerHow To Troubleshoot Collaboration Apps for the Modern Connected Worker
How To Troubleshoot Collaboration Apps for the Modern Connected Worker
 
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
W01_panagenda_Navigating-the-Future-with-The-Hitchhikers-Guide-to-Notes-and-D...
 
Diamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with PrecisionDiamond Application Development Crafting Solutions with Precision
Diamond Application Development Crafting Solutions with Precision
 
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
Try MyIntelliAccount Cloud Accounting Software As A Service Solution Risk Fre...
 
A Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docxA Secure and Reliable Document Management System is Essential.docx
A Secure and Reliable Document Management System is Essential.docx
 
How To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.jsHow To Use Server-Side Rendering with Nuxt.js
How To Use Server-Side Rendering with Nuxt.js
 
HR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.comHR Software Buyers Guide in 2024 - HRSoftware.com
HR Software Buyers Guide in 2024 - HRSoftware.com
 
Right Money Management App For Your Financial Goals
Right Money Management App For Your Financial GoalsRight Money Management App For Your Financial Goals
Right Money Management App For Your Financial Goals
 
TECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service providerTECUNIQUE: Success Stories: IT Service provider
TECUNIQUE: Success Stories: IT Service provider
 
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online  ☂️
CALL ON ➥8923113531 🔝Call Girls Kakori Lucknow best sexual service Online ☂️
 
Hand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptxHand gesture recognition PROJECT PPT.pptx
Hand gesture recognition PROJECT PPT.pptx
 
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female serviceCALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
CALL ON ➥8923113531 🔝Call Girls Badshah Nagar Lucknow best Female service
 
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS LiveVip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
Vip Call Girls Noida ➡️ Delhi ➡️ 9999965857 No Advance 24HRS Live
 
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICECHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
CHEAP Call Girls in Pushp Vihar (-DELHI )🔝 9953056974🔝(=)/CALL GIRLS SERVICE
 
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
Reassessing the Bedrock of Clinical Function Models: An Examination of Large ...
 
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
Steps To Getting Up And Running Quickly With MyTimeClock Employee Scheduling ...
 
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
Tech Tuesday-Harness the Power of Effective Resource Planning with OnePlan’s ...
 

Building Enterprise Search Engines using Open Source Technologies

  • 1. www.anant.us | solutions@anant.us | 202.905.2818 1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007 Large Scale Search with Open Source Technologies Building Search Engines
  • 2. What do we do? Streamline, Organize & Unify Business Information
  • 3. Agenda • Challenge - Why does this matter? • Search Engine - 30k Foot View • Open - Lucene, Cassandra & Spark • Customizing - Apache Lucene/SolR • Custom Parser - Written in Scala
  • 4. Challenge – Why does this matter? Knowledge Project Information Client Service Information Corporate Guides Collaborative Documents Assets & Files Corporate Resources Appleseed Framework (Portal, Base, Search) G Drive Delta DropBox G Drive Delta Nutshell Dropbox Freshbooks G Drive G Sites (KB) G Drive Workflowy Evernote G Drive DropBox OwnCloud Pocket Leaves AIC (WP) Anant (WP)
  • 5. Search Engine – 30 Thousand Foot View The search index is only as good as your processed data. If you put everything you find in your index, you are going to spend a lot of time telling people how to search.
  • 6. Lucene – More than meets the eye Who Next? Think of it like a “NoSQL” Database that has great indexing.. everywhere.
  • 7. Cassandra – NoSQL With Structure Who Next? Think of it like a “NoSQL” Database that has structure. Using “CQL” You can insert into and select from.. just not join.
  • 8. Spark – Way Better MapReduce Who Next? Think of it like MapReduce if MapReduce were created with scala, instead of Java, with streams. It’s also 100 times faster.
  • 9. Configuring - SolR - 1/3 SolR is like an eighteen wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want. • Configuration - Schema –Data Types –Pre-Processing –Collection Definitions –Managed vs. Unmanaged • Configuration - ZooKeeper –Synchronize Configurations –Distribute Shards –Manage Replicas –Elect Leaders • Configuration - SolrConfig –Handlers –Components –Indexing Configurations –Memory / Cache –File System • Lessons Learned –Try to use out of the box –Try to configure your way –Make sure to upgrade –Not everything can be configured
  • 10. Configuring - SolR - 2/3 • Before Docker –Setup Zookeeper •Customize zoo.cfg •Setup Zookeeper Servers –Setup SolR Distro •Download SolR •Clean up SolR •Customize Schema.xml •Customize SolrConfig.xml •Setup Different Solr Servers –Start the Cloud •Custom Start Scripts • Today w/ Docker – docker run --name zookeeper -p 127.0.0.1:2181:2181 -p 127.0.0.1:2888:2888 -p 127.0.0.1:3888:3888 jplock/zookeeper – docker run --link zookeeper:ZK -i -p 127.0.0.1:8983:8983 -t dockerimages/docker-solr /bin/bash -c ' cd /opt/solr/example; java -jar -Dbootstrap_confdir=./solr/collection1/conf -Dcollection.configName=myconf - DzkHost=$ZK_PORT_2181_TCP_ADDR:$ZK_PO RT_2181_TCP_PORT -DnumShards=2 start.jar'; https://hub.docker.com/r/dockerima ges/docker-solr/ https://cwiki.apache.org/confluence/displa y/solr/Getting+Started+with+SolrCloud https://cwiki.apache.org/confluence/displa y/solr/Taking+Solr+to+Production
  • 11. Configuring - SolR - 3/3 • SolrConfig - Example • Schema - Example https://cwiki.apache.org/confluence/displa y/solr/Configuring+solrconfig.xml https://wiki.apache.org/solr/SchemaXml
  • 12. SolR Cloud / Zookeeper
  • 13. User Interface - Super Advanced
  • 14. Customizing - SolR - 1/3 SolR is like an eighteen wheel truck you can take apart and rebuild from the ground up with only what you need, or add as much as you want. • Customization - Parsing –Need Specialized Syntax? –Java or Scala Based –Open Plugin Structure –Several Examples • Customization - Highlighting –Need Special Coloring? –Specialized Syntax Aware –Open Plugin Structure –Several Examples • Customization - Term Counts –Need Specific Information? –Create the Logic –Register the Component –Complicated Examples • Lessons Learned –Major version upgrades = pain –Newer classes can be extended better –Long term investment
  • 15. Customizing - SolR - 2/3 • Custom Component in Scala or Java • Installing the Component http://wiki.apache.org/solr/SolrPluginshttp://sujitpal.blogspot.com/2011/03/using -lucenes-new-queryparser- framework.html
  • 17. Creating a Custom Parser with Scala Building a parser in Scala wasn’t my first choice, but creating it in Scala made me see how much better the language is. • Why a Specialized Syntax? –Legacy Syntax –Boolean AND Proximity Queries –Specialized Fielded Expressions –Ranges / Classifications • Why not ANTLR or JavaCC? –Old Parser was in Parboiled(1) –Parboiled2 was in Scala –No need to learn a separate Syntax for Creating Syntax • Lessons Learned –Parboiled2 Documentation = bad –Understand the syntax –Interactive REPL in Scala = good –Write tons of unit tests –Long term investment • Customizing SolR with Scala –Found a good Virtual Mentor –Learned Scala (not for Spark) –Started from the ground up –Reduced from ~1k to 400 LOC
  • 18. JavaCC vs. parboiled2 (Scala) • Java CC - SurroundQuery.jj • Scala based Parboiled2
  • 19. Questions & Contact www.anant.us | solutions@anant.us | 202.905.2818 1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007 @anantcorp facebook.com/anantCorp linkedin.com/company/anant rahul@anant.us linkedin.com/in/xingh Rahul Singh CEO & Founder Questions & Contact • Brown Bag Session or Meetup? • Modern Enterprise • Mastering Services in the Service of Others • Hybrid Agile Project Management • Building Search Engines • CICD / DevOps • Connecting Internet Software
  • 20. www.anant.us | solutions@anant.us | 202.905.2818 1010 Wisconsin Ave, NW | Suite 250 | Washington, DC 20007 Streamlined Data Integration / Data Pipelines Organized Knowledge Search / Data Warehouses Unified Interfaces Portals / Dashboards / Mobile