Moinuddin Ahmed
                           -guided by
            Dr. Pushpak Bhattacharyya
                          IIT Bombay

7/15/2012                           1
Outline
  Solr
     Introduction
     Lucene vs. Solr
     Solr Features
     Indexing in Solr
     Querying in Solr
  Assamese Search Engine
     Monolingual Search
     Cross lingual Search
  Conclusions
  Future Work

What is Solr?
  Solr is an open-source enterprise search platform from
     the Apache Lucene project. [1]

  Solr = Lucene + added features

  Allows faster, more comprehensive searches over a
     large volume of data
Lucene vs Solr

  Lucene is a library, while Solr is a web application that uses
     the Lucene library.

  Built on top of Lucene, Solr extends it with a set of robust
     features such as:
        Hit highlighting
        Index replication
        Faceted searching
        Distributed searching
Features
  Hit Highlighting: Shows a snippet of each document in the search
     results, surrounding the matched search terms.

  Faceted Search: Clusters search results into drill-down
     categories. Users can then drill down by applying specific
     constraints to the search results.

  Distributed Searching: The presence of the shards parameter in a
     request causes that request to be distributed across all shards in the
     list.

  Request Parameters: A number of optional request parameters can be passed
     to the request handler to control what information is returned.

  External XML Configuration: Solr is flexible and adaptable through
     external XML configuration files.
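
The features above are all switched on through request parameters. The following is a minimal sketch of how such a request URL might be assembled, assuming a local Solr at port 8983. The parameter names (hl, hl.fl, facet, facet.field, shards) are standard Solr parameters; the field names "content" and "host" and the second shard address are assumptions made only for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit

# Assumed base URL of a local Solr select handler.
SOLR_SELECT = "http://localhost:8983/solr/select"


def build_select_url(query, highlight_field=None, facet_field=None, shards=None):
    """Build a select URL that optionally enables highlighting, faceting and sharding."""
    params = [("q", query)]
    if highlight_field:
        # Hit highlighting: return snippets from this field.
        params += [("hl", "true"), ("hl.fl", highlight_field)]
    if facet_field:
        # Faceted search: return value counts for this field.
        params += [("facet", "true"), ("facet.field", facet_field)]
    if shards:
        # Distributed search: comma-separated list of shard addresses.
        params.append(("shards", ",".join(shards)))
    return SOLR_SELECT + "?" + urlencode(params)


# Hypothetical field names and shard addresses.
url = build_select_url(
    "temples",
    highlight_field="content",
    facet_field="host",
    shards=["localhost:8983/solr", "localhost:7574/solr"],
)
decoded = parse_qs(urlsplit(url).query)
print(url)
```

Building the query string with urlencode keeps special characters in the user's query from breaking the URL.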
Hit Highlighting Example

[Screenshot: search result with the snippet marked]
Example of Faceted Searching

[Screenshot: faceted search results. "Manufacturer" is the facet; "Dell" and "HP" are constraints, each shown with its facet count.]

  • Faceted search is a technique for accessing information organized
  according to a faceted classification system.

  • Faceted search helps users who think in terms of attribute specifications
  as filtering criteria.
Faceted Searching (contd.)

  Imagine a situation where the client wants the number of
     companies in each city where the query found companies.

  This requires returning the number of documents that share the same field value.

  The chosen facet value is then used to construct a filter query that
     matches that value in the index.
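
The two steps described here, counting documents per field value and then filtering on a chosen value, can be sketched in a few lines. This is only an in-memory illustration, not Solr's implementation; the documents, the "city" field and its values are made up. In a real request the counts would come from facet=true&facet.field=city and the filter would be sent as an fq parameter.

```python
from collections import Counter

# Hypothetical documents matched by a query; names and cities are invented.
matches = [
    {"name": "Company A", "city": "Guwahati"},
    {"name": "Company B", "city": "Jorhat"},
    {"name": "Company C", "city": "Guwahati"},
]


def facet_counts(docs, field):
    """Count how many matching documents share each value of the field."""
    return Counter(doc[field] for doc in docs if field in doc)


def filter_query(field, value):
    """Build a field:"value" clause suitable for Solr's fq parameter."""
    escaped = value.replace("\\", "\\\\").replace('"', '\\"')
    return f'{field}:"{escaped}"'


def apply_filter(docs, field, value):
    """Keep only the documents whose field equals the chosen facet value."""
    return [doc for doc in docs if doc.get(field) == value]


counts = facet_counts(matches, "city")
fq = filter_query("city", "Guwahati")
narrowed = apply_filter(matches, "city", "Guwahati")
print(counts, fq, len(narrowed))
```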




Distributed Search

When an index becomes too large to fit on a single system, it can be
split into multiple shards. [2]

A single shard receives the query and distributes it to the other shards.

Solr can then query those shards and merge their results.
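
The merge step can be pictured as combining several ranked lists into one. The sketch below is not Solr's merging code; it only illustrates the idea, assuming each shard returns (score, document id) pairs already sorted by descending score. The scores and ids are invented.

```python
import heapq
from itertools import islice


def merge_shard_results(shard_results, rows):
    """Merge per-shard hit lists, each sorted by descending score, and keep the top rows."""
    merged = heapq.merge(*shard_results, key=lambda hit: hit[0], reverse=True)
    return list(islice(merged, rows))


# Hypothetical per-shard results: (score, document id).
shard_a = [(0.92, "doc-a1"), (0.40, "doc-a2")]
shard_b = [(0.75, "doc-b1"), (0.10, "doc-b2")]

top = merge_shard_results([shard_a, shard_b], rows=3)
print(top)
```

Because every shard list is already sorted, the merge only needs to compare the current head of each list instead of re-sorting all hits.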

STARTING UP THE SOLR SERVER
  Solr 1.4.1 uses the Jetty 6.1.3 server.

  Solr is started with the following command:
               java -jar start.jar

  This starts the Jetty application server on
   port 8983.
INDEXING IN SOLR
Indexing can be done in two ways:
        From the command line:

                      java -jar post.jar *.xml

        Through a framework such as Nutch:

            bin/nutch solrindex <solr url> <crawldb> -linkdb <linkdb>
            (<segment> ... | -dir <segments>)
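
The XML files given to post.jar use Solr's update format: an <add> element containing one <doc> per document, each with <field name="..."> children. A minimal sketch of generating such a file follows. The field names id, title and lang are assumed lowercase versions of the fields listed later in this deck, and the values are invented.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Serialize a list of dicts as a Solr <add><doc><field name=...> update message."""
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            # A list value becomes several <field> elements (a multi-valued field).
            values = value if isinstance(value, list) else [value]
            for item in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(item)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document; field names and values are assumptions.
xml_text = build_add_xml([{"id": "1", "title": "Example page", "lang": "as"}])
print(xml_text)
```

Saved to a file such as docs.xml (a made-up name), the output could then be posted with the post.jar command shown above.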




schema.xml

  This file contains all of the details about which fields
     the documents can contain, and

  how those fields should be dealt with when adding
     documents to the index, or when querying those
     fields.
Contents of schema.xml

 1) Data types: <types>
 2) Fields: <fields>
1) DATA TYPES
 <types>
   <fieldType name="string" class="solr.StrField" />
   <fieldType name="long" class="solr.LongField" />
   <fieldType name="float" class="solr.FloatField" />
   <fieldType name="text" class="solr.TextField" />
 </types>

The <types> section allows one to define:
1. a list of <fieldType> declarations, and
2. the underlying Solr class that should be used for each type.
2) FIELDS
            <field name="id" type="string" indexed="true"
             stored="true" multiValued="true"/>

  The <fields> section lists the individual <field> declarations one wishes
     to use in documents.

  Each <field> has
        a name, which is used to reference it when adding documents or
         executing searches, and
        an associated type, which names the <fieldType> one
         wishes to use for this field.
Some common options that fields can have are...

  default
        The default value for this field if none is provided while adding
      documents
  indexed=true|false
     True if this field should be "indexed". If (and only if) a field is
      indexed, then it is searchable, sortable, and facetable.
  stored=true|false
     True if the value of the field should be retrievable during a search
  multiValued=true|false
     True if this field may contain multiple values per document, i.e. if it
      can appear multiple times in a document


How to add analyzers to a field?
  <fieldType name="text" class="solr.TextField"
         positionIncrementGap="100">

      <analyzer type="index">
           <tokenizer class="solr.WhitespaceTokenizerFactory"/>
           <filter class="solr.StopFilterFactory"
                   ignoreCase="true" words="assamese_stop_words.txt"/>
           <filter class="solr.AssameseStemFilterFactory"/>
      </analyzer>

  </fieldType>
Querying Solr
Adding an analyzer at query time
<fieldType name="text" class="solr.TextField"
        positionIncrementGap="100">

  <analyzer type="query">
         <tokenizer class="solr.WhitespaceTokenizerFactory"/>
         <filter class="solr.StopFilterFactory"
                 words="assamese_stop_words.txt"/>
         <filter class="solr.AssameseStemFilterFactory"/>
         <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
  </analyzer>
</fieldType>
SOLR REQUEST HANDLER
 A SolrRequestHandler is a Solr plugin that defines the logic
  executed for any request. [4]
 A handler can be configured in solrconfig.xml or selected directly
  in the request URL or user interface.

List of request handlers used
 StandardRequestHandler
 DisMaxRequestHandler
 LukeRequestHandler
 MoreLikeThisHandler
DisMaxRequestHandler
 It is designed to process simple user-entered phrases and search for the
  individual words across several fields using different weighting (boosts)
  based on the significance of each field. [4]

 Some parameters of DisMaxRequestHandler:
 qf (query fields), fl (field list), pf (phrase fields), bq (boost query), etc.

  Example
 <requestHandler name="dismax" class="solr.SearchHandler">
   <lst name="defaults">
     <str name="defType">dismax</str>
     <str name="fl">title,content,anchor,host,url</str>
     <str name="qf">url^3.0 anchor content^10.0 title^3.0 host^2.0</str>
   </lst>
 </requestHandler>
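
To make the qf boosts concrete, the sketch below parses the qf string from the example and combines per-field scores the way a disjunction-max query does for a single term: take the best boosted field score and add a tie-breaker fraction of the rest. This is a simplification, not Lucene's actual scoring, and the per-field scores are invented numbers.

```python
def parse_qf(qf):
    """Parse 'field^boost field ...' into {field: boost}; a missing boost means 1.0."""
    boosts = {}
    for token in qf.split():
        field, _, boost = token.partition("^")
        boosts[field] = float(boost) if boost else 1.0
    return boosts


def dismax_score(field_scores, boosts, tie=0.0):
    """Best boosted field score, plus tie times the sum of the other boosted scores."""
    weighted = [field_scores.get(field, 0.0) * boost for field, boost in boosts.items()]
    best = max(weighted, default=0.0)
    return best + tie * (sum(weighted) - best)


boosts = parse_qf("url^3.0 anchor content^10.0 title^3.0 host^2.0")

# Hypothetical raw scores of one term in two fields of one document.
score = dismax_score({"title": 0.5, "content": 0.2}, boosts)
print(boosts, score)
```

With these numbers the content field wins (0.2 x 10.0) over the title field (0.5 x 3.0), which shows how a large boost can outweigh a better raw match in another field.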
Response Writers
 A QueryResponseWriter is a Solr plugin that defines
 the response format for any request. [3]

 The default response writer is the XML response writer.

 Several other response writers are available, such as XSLT.
XSLT RESPONSE WRITER
 The XSLT Response Writer captures the output of the XML
  Response Writer and applies an XSLT transform to it. [3]

 http://localhost:8983/solr/select/?q=<user query>&wt=xslt&tr=example.xsl
 Parameters:
        wt: selects the response writer to use
        tr: selects the XSLT transformation to use, which must be found in
            Solr's conf/xslt directory.
 The Content-Type of the response is set according to the <xsl:output>
  statement in the XSLT transform, for example:
         <xsl:output media-type="text/html"/>
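
Since the user's query goes into the URL, it has to be URL-encoded, which matters in particular for Assamese text outside ASCII. A minimal sketch follows, assuming the base URL and stylesheet name from the example above; the Assamese sample word is an assumption chosen only for illustration.

```python
from urllib.parse import urlencode, parse_qs, urlsplit


def xslt_query_url(user_query, stylesheet="example.xsl",
                   base="http://localhost:8983/solr/select/"):
    """Build a select URL whose response is transformed by the XSLT response writer."""
    params = {"q": user_query, "wt": "xslt", "tr": stylesheet}
    # urlencode percent-encodes non-ASCII text as UTF-8 and turns spaces into '+'.
    return base + "?" + urlencode(params)


url = xslt_query_url("famous temples in Guwahati")
assamese_url = xslt_query_url("অসম")  # sample word in Assamese script

print(url)
print(parse_qs(urlsplit(assamese_url).query)["q"])
```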
IMPLEMENTATION FOR ASSAMESE LANGUAGE
FIELDS IN SCHEMA.XML

 HOST
 SITE
 URL
 CONTENT
 TITLE
 LANG
 ID
 TIME
 TOPKWORDS
 DOMAIN


UNIQUE KEY: TIME (in milliseconds)
INDEXING
For Assamese monolingual search:
Indexed around 500 Assamese text files and about 120 URLs,
crawled up to depth 3.

For cross-lingual search:
Indexed a few English URLs.
Analyzers used
• Assamese stemmer
 suffix stripping (rule-based) + dictionary look-up
 accuracy: 80%

• English Porter stemmer

• Both Assamese and English use the whitespace tokenizer.

• Stop words are removed in both languages.
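
The analysis chain listed above (whitespace tokenizer, stop-word removal, then a stemmer that combines suffix stripping with a dictionary look-up) can be sketched as follows. This is one plausible way to combine the pieces, not the stemmer actually used in this work. The stop words, suffix rules and dictionary are toy Latin-script stand-ins, because the slides do not list the real Assamese resources.

```python
# All three resources below are hypothetical placeholders.
STOP_WORDS = {"the", "in", "of"}
SUFFIXES = ["ing", "es", "s"]  # tried in this order
DICTIONARY = {"temple", "search", "index"}


def tokenize(text):
    """Whitespace tokenizer: split on runs of whitespace."""
    return text.split()


def remove_stop_words(tokens):
    """Drop stop words, ignoring case."""
    return [token for token in tokens if token.lower() not in STOP_WORDS]


def stem(token):
    """Strip a suffix only if the remaining stem is a known dictionary word."""
    if token in DICTIONARY:
        return token
    for suffix in SUFFIXES:
        if token.endswith(suffix):
            candidate = token[: -len(suffix)]
            if candidate in DICTIONARY:
                return candidate
    return token  # no rule produced a dictionary word; keep the token unchanged


def analyze(text):
    """Run the whole chain: tokenize, remove stop words, stem."""
    return [stem(token) for token in remove_stop_words(tokenize(text))]


terms = analyze("Famous temples in Guwahati")
print(terms)
```

Checking each candidate stem against the dictionary is what stops a rule from over-stripping: "Famous" ends in "s", but "Famou" is not a dictionary word, so the token is left alone.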
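GUI

[Screenshot omitted. Caption: Famous temples in Guwahati]

QUERY FORMATION

[Screenshot omitted. Caption: Famous temples in Guwahati]

RESULT (XML FORMAT)

[Screenshot omitted. Caption: Famous temples in Guwahati]

XSLT

[Screenshot omitted. Caption: Famous temples in Guwahati]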
Future Work

 Parsing the query programmatically.

 Building the resources needed to add translation and
 transliteration modules to the monolingual pipeline.
CONCLUSION
  Solr uses the Lucene search library
     and extends it with a set of robust features.

  Solr's powerful external configuration allows it to be
     tailored to almost any type of application.

  Solr is therefore preferable when a programmer wants
     to add these features to an
     existing application.
REFERENCES
 1. Rafal Kuc, Apache Solr 1.4.1 Cookbook, Packt Publishing.

 2. David Smiley and Eric Pugh, Apache Solr 1.4 Enterprise Edition, 2009.

 3. Apache Lucene, http://lucene.apache.org/solr/, Feb. 2012.

 4. Scaling Solr and Lucene, http://www.lucidimagination.com/content/scaling-lucene-and-solr#article.highqueryvolume.solr, Feb. 2012.
THANK YOU

HELLO

 If you found this, don't forget
     to credit me as a reference; it's a healthy
     habit.

              • Moinuddin Ahmed

Assamese search engine using SOLR by Moinuddin Ahmed ( moin )

  • 1. Moinuddin Ahmed, guided by Dr. Pushpak Bhattacharyya, IIT Bombay. 7/15/2012
  • 2. Outline
      • Solr: Introduction; Lucene vs. Solr; Solr Features; Indexing in Solr; Querying in Solr
      • Assamese Search Engine: Monolingual Search; Cross-lingual Search
      • Conclusions
      • Future Work
  • 3. What is Solr?
      • Solr is an open source enterprise search platform from the Apache Lucene project. [1]
      • Solr = Lucene + added features.
      • It allows faster, more comprehensive searches over a large volume of data.
  • 4. Lucene vs. Solr
      • Lucene is a library, while Solr is a web application that uses the Lucene library.
      • Built on top of Lucene, Solr extends it with a set of robust features such as hit highlighting, index replication, faceted searching, and distributed searching.
  • 5. Features
      • Hit highlighting: shows a snippet of the document, surrounding the search terms, in the search results.
      • Faceted search: clusters search results into drill-down categories. Users can then "categorize" by applying specific constraints to the search results.
      • Distributed searching: the presence of the shards parameter in a request causes that request to be distributed across all shards in the list.
      • Request parameters: a number of optional parameters can be passed to the request handler to control what information is returned.
      • External XML configuration: Solr is flexible and adaptable through its XML configuration files.
  • 6. Hit highlighting example [screenshot: a search result with the highlighted snippet]
  • 7. Example of faceted searching
      • In the example, Manufacturer is the facet; Dell and HP are its constraints, each listed with a facet count.
      • Faceted search is a technique for accessing organized information.
      • It helps users who think in terms of attribute specifications as filtering criteria.
  • 8. Faceted searching (contd.)
      • Suppose the client wants the number of companies in each city where the query found companies.
      • Solr then has to return, for each value of the field, the number of documents that share that value.
      • The chosen facet value is used to construct a filter query that matches that value in the index.
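The facet-count and filter-query idea from this slide can be sketched in plain Python. This is only an in-memory illustration, not how Solr computes facets internally; the documents and the `city` field are hypothetical. In a Solr request, the counting step would typically correspond to `facet=true&facet.field=city` and the narrowing step to `fq=city:Guwahati`.

```python
from collections import Counter

# Hypothetical result set; the field names and values are illustrative only.
docs = [
    {"id": 1, "name": "Acme Ltd", "city": "Guwahati"},
    {"id": 2, "name": "Brahma Corp", "city": "Guwahati"},
    {"id": 3, "name": "Cirrus Inc", "city": "Mumbai"},
]


def facet_counts(results, field):
    """Count how many matching documents share each value of `field`."""
    return Counter(doc[field] for doc in results if field in doc)


def apply_filter(results, field, value):
    """Keep only the documents whose `field` equals the chosen facet value."""
    return [doc for doc in results if doc.get(field) == value]


counts = facet_counts(docs, "city")
filtered = apply_filter(docs, "city", "Guwahati")
```

Here `counts` plays the role of the facet counts shown next to each constraint, and `filtered` is what the user sees after clicking one of them.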
  • 9. Distributed search
      • When an index becomes too large to fit on a single system, it can be split into multiple shards. [2]
      • A single shard receives the query and distributes it to the other shards.
      • Solr can query and merge results across those shards.
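The query-then-merge flow described here can be sketched conceptually as follows. This is a toy model under stated assumptions: each shard is an in-memory dictionary, the score is a simple term count rather than Lucene's scoring, and the function names are hypothetical. It only illustrates merging already-sorted per-shard hit lists into one ranked list.

```python
import heapq


def search_shard(shard_docs, term):
    """Return (score, doc_id) pairs, best first, for one shard.

    The score is a toy term-frequency count, not Lucene's scoring.
    """
    hits = []
    for doc_id, text in shard_docs.items():
        score = text.lower().split().count(term.lower())
        if score > 0:
            hits.append((score, doc_id))
    return sorted(hits, reverse=True)


def distributed_search(shards, term, rows=10):
    """Query every shard, then merge the sorted hit lists into one ranking."""
    per_shard = [search_shard(shard, term) for shard in shards]
    merged = heapq.merge(*per_shard, reverse=True)
    return list(merged)[:rows]


# Two hypothetical shards holding different parts of the index.
shard_a = {"a1": "solr search solr", "a2": "lucene library"}
shard_b = {"b1": "solr platform", "b2": "solr solr solr index"}
top_hits = distributed_search([shard_a, shard_b], "solr")
```

Because each shard returns its hits already sorted, the merge step never needs to re-sort the full result set.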
  • 10. Starting up the Solr server
      • Solr 1.4.1 uses the Jetty 6.1.3 server.
      • Solr is started with the following command: java -jar start.jar
      • This starts the Jetty application server on port 8983.
  • 12. Indexing can be done in two ways:
      • Command line: java -jar post.jar *.xml
      • A framework such as Nutch: bin/nutch solrindex <solr url> <crawldb> -linkdb <linkdb> (<segment> ... | -dir <segments>)
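The XML files passed to post.jar use Solr's `<add><doc><field name="...">` update format. As a rough sketch, such a payload could be generated like this; the helper name and the sample document are hypothetical, and the lowercase field names are an assumption based on the schema fields listed later in the deck.

```python
import xml.etree.ElementTree as ET


def build_add_xml(docs):
    """Build a Solr <add> update message from a list of dicts.

    List values are written as repeated <field> elements, which is how
    multi-valued fields are expressed in the update format.
    """
    add = ET.Element("add")
    for doc in docs:
        doc_el = ET.SubElement(add, "doc")
        for name, value in doc.items():
            values = value if isinstance(value, list) else [value]
            for v in values:
                field = ET.SubElement(doc_el, "field", name=name)
                field.text = str(v)
    return ET.tostring(add, encoding="unicode")


# Hypothetical document for illustration only.
xml_payload = build_add_xml(
    [{"id": "doc1", "title": "Famous temples in Guwahati", "lang": "en"}]
)
```

Saving `xml_payload` to a file would give something that the command-line route above could post, assuming the field names match the schema.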
  • 13. schema.xml
      • This file contains all of the details about which fields the documents can contain, and how those fields should be dealt with when adding documents to the index or when querying those fields.
  • 14. Contents of schema.xml
      1) Data types (<types>)
      2) Fields (<fields>)
  • 15. 1) Data types
        <types>
          <fieldType name="string" class="solr.StrField"/>
          <fieldType name="long" class="solr.LongField"/>
          <fieldType name="float" class="solr.FloatField"/>
          <fieldType name="text" class="solr.TextField"/>
        </types>
      The <types> section allows one to define:
      1. a list of <fieldType> declarations;
      2. the underlying Solr class that should be used for each type.
  • 16. 2) Fields
        <field name="id" type="string" indexed="true" stored="true" multiValued="true"/>
      • The <fields> section lists the individual <field> declarations one wishes to use in documents.
      • Each <field> has a name, which is used to reference it when adding documents or executing searches, and an associated type, which names the fieldType to use for this field.
  • 17. Some common options that fields can have:
      • default: the default value for this field if none is provided while adding documents.
      • indexed=true|false: true if this field should be indexed. If (and only if) a field is indexed, it is searchable, sortable, and facetable.
      • stored=true|false: true if the value of the field should be retrievable during a search.
      • multiValued=true|false: true if this field may contain multiple values per document, i.e. if it can appear multiple times in a document.
  • 18. How to add analyzers to a field (index time)
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="index">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" ignoreCase="true" words="assamese_stop_words.txt"/>
            <filter class="solr.AssameseStemFilterFactory"/>
          </analyzer>
        </fieldType>
  • 20. Adding an analyzer at query time
        <fieldType name="text" class="solr.TextField" positionIncrementGap="100">
          <analyzer type="query">
            <tokenizer class="solr.WhitespaceTokenizerFactory"/>
            <filter class="solr.StopFilterFactory" words="assamese_stop_words.txt"/>
            <filter class="solr.AssameseStemFilterFactory"/>
            <filter class="solr.RemoveDuplicatesTokenFilterFactory"/>
          </analyzer>
        </fieldType>
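To make the order of the analysis chain concrete, here is a minimal Python sketch of the same stages: whitespace tokenizing, stop-word removal, stemming, and duplicate removal. It is an illustration under assumptions, not Solr's implementation. The function names are hypothetical, the stemmer is passed in as a parameter, and the duplicate filter is simplified: it drops repeats across the whole token stream, whereas Solr's filter only removes duplicates that occur at the same position.

```python
def whitespace_tokenize(text):
    """Split on runs of whitespace, like a whitespace tokenizer."""
    return text.split()


def remove_stop_words(tokens, stop_words, ignore_case=True):
    """Drop tokens that appear in the stop-word list."""
    if ignore_case:
        lowered = {w.lower() for w in stop_words}
        return [t for t in tokens if t.lower() not in lowered]
    return [t for t in tokens if t not in stop_words]


def remove_duplicates(tokens):
    """Simplified duplicate filter: keep the first occurrence of each token."""
    seen = set()
    unique = []
    for token in tokens:
        if token not in seen:
            seen.add(token)
            unique.append(token)
    return unique


def analyze(text, stop_words, stem=lambda token: token):
    """Run the stages in the same order as the <analyzer> chain."""
    tokens = whitespace_tokenize(text)
    tokens = remove_stop_words(tokens, stop_words)
    tokens = [stem(t) for t in tokens]
    return remove_duplicates(tokens)


# English placeholder input; the real chain runs on Assamese text.
terms = analyze("the temples in the temples city", {"the", "in"})
```

Running the same chain at index time and at query time is what lets a stemmed query term match a stemmed indexed term.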
  • 21. Solr request handler
      • A SolrRequestHandler is a Solr plugin that defines the logic executed for any request. [4]
      • It can be configured in solrconfig.xml or specified directly in the URL or user interface.
      • Request handlers utilized: StandardRequestHandler, DisMaxRequestHandler, LukeRequestHandler, MoreLikeThisHandler.
  • 22. DisMaxRequestHandler
      • It is designed to process simple user-entered phrases and search for the individual words across several fields, using different weightings (boosts) based on the significance of each field. [4]
      • Some parameters of DisMaxRequestHandler: qf (query fields), fl (field list), pf (phrase fields), bq (boost query), etc.
      • Example:
        <requestHandler name="dismax" class="solr.SearchHandler">
          <lst name="defaults">
            <str name="defType">dismax</str>
            <str name="fl">title,content,anchor,host,url</str>
            <str name="qf">url^3.0 anchor content^10.0 title^3.0 host^2.0</str>
          </lst>
        </requestHandler>
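The same dismax parameters can also be supplied on the request URL instead of in solrconfig.xml. The sketch below assembles such a URL with the field boosts from the slide; `build_dismax_url` is a hypothetical helper, the base URL assumes the default local Jetty port, and `defType=dismax` is used here to select the dismax query parser.

```python
from urllib.parse import parse_qs, urlencode, urlparse


def build_dismax_url(base_url, user_query, query_fields, return_fields):
    """Assemble a dismax request URL.

    `query_fields` is a list of (field, boost) pairs; a boost of None
    leaves the field unboosted.
    """
    qf = " ".join(
        name if boost is None else f"{name}^{boost}"
        for name, boost in query_fields
    )
    params = {
        "q": user_query,
        "defType": "dismax",
        "qf": qf,
        "fl": ",".join(return_fields),
    }
    return f"{base_url}?{urlencode(params)}"


url = build_dismax_url(
    "http://localhost:8983/solr/select",
    "famous temples",
    [("url", 3.0), ("anchor", None), ("content", 10.0), ("title", 3.0), ("host", 2.0)],
    ["title", "content", "anchor", "host", "url"],
)
# Decode the query string again to check what Solr would receive.
decoded = parse_qs(urlparse(url).query)
```

Building the query string with `urlencode` avoids hand-escaping the `^` and space characters in `qf`.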
  • 23. Response writers
      • A QueryResponseWriter is a Solr plugin that defines the response format for any request. [3]
      • By default, responses are written as XML.
      • Several other response formats are available, such as XSLT.
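A client consuming the default XML output might parse it as in the sketch below. The sample response is hypothetical: it follows the general `<response>/<result>/<doc>` shape of Solr's XML output, but the field names, values, and example.org URLs are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical response body; values are illustrative only.
SAMPLE_RESPONSE = """
<response>
  <lst name="responseHeader">
    <int name="status">0</int>
  </lst>
  <result name="response" numFound="2" start="0">
    <doc>
      <str name="title">Temples of Guwahati</str>
      <str name="url">http://example.org/temples</str>
    </doc>
    <doc>
      <str name="title">Guwahati travel guide</str>
      <str name="url">http://example.org/guide</str>
    </doc>
  </result>
</response>
"""


def parse_results(xml_text):
    """Return (numFound, list of {field name: text}) from an XML response."""
    root = ET.fromstring(xml_text.strip())
    result = root.find("result")
    num_found = int(result.get("numFound"))
    docs = [
        {field.get("name"): field.text for field in doc_el}
        for doc_el in result.findall("doc")
    ]
    return num_found, docs


num_found, docs = parse_results(SAMPLE_RESPONSE)
```

This sketch only handles single-valued fields; multi-valued fields arrive as nested `<arr>` elements and would need an extra branch.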
  • 24. XSLT response writer
      • The XSLT response writer captures the output of the XML response writer and applies an XSLT transform to it. [3]
      • Example: http://localhost:8983/solr/select/?q=<user query>&wt=xslt&tr=example.xsl
      • Parameters:
          wt: the response writer to use.
          tr: selects the XSLT transformation to use, which must be found in Solr's conf/xslt directory.
      • The Content-Type of the response is set according to the <xsl:output> statement in the XSLT transform, for example: <xsl:output media-type="text/html"/>
  • 26. Fields in schema.xml
      • HOST, SITE, URL, CONTENT, TITLE, LANG, ID, TIME, TOPKWORDS, DOMAIN
      • Unique key: TIME (in milliseconds)
  • 27. Indexing
      • For Assamese monolingual search: indexed around 500 Assamese text files and about 120 URLs up to depth 3.
      • For cross-lingual search: indexed a few English URLs.
  • 28. Analyzers used
      • Assamese stemmer: rule-based suffix stripping plus dictionary look-up; accuracy: 80%.
      • English: Porter stemmer.
      • Both Assamese and English use the whitespace tokenizer.
      • Stop words are removed in both languages.
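The slides do not list the stemmer's actual rules, so the following is only a generic sketch of the "suffix stripping plus dictionary look-up" idea. The suffix list and dictionary are English placeholders chosen for readability and are not the Assamese resources used in the project.

```python
def stem(word, suffixes, dictionary):
    """Strip the longest suffix whose remaining stem is a known dictionary word.

    Words already in the dictionary, and words for which no rule yields a
    dictionary entry, are returned unchanged.
    """
    if word in dictionary:
        return word
    for suffix in sorted(suffixes, key=len, reverse=True):
        if word.endswith(suffix) and len(word) > len(suffix):
            candidate = word[: -len(suffix)]
            if candidate in dictionary:
                return candidate
    return word


# Placeholder resources; a real system would load language-specific lists.
SUFFIXES = ["ing", "es", "s"]
DICTIONARY = {"temple", "search", "index"}

stems = [stem(w, SUFFIXES, DICTIONARY) for w in ["temples", "searching", "indexes", "guwahati"]]
```

The dictionary check is what stops over-stripping: "temples" is not cut to "templ" because only "temple" is a known word.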
  • 29. GUI [screenshot; query: "Famous temples in Guwahati"]
  • 30. Query formation [screenshot; query: "Famous temples in Guwahati"]
  • 31. Result (XML format) [screenshot; query: "Famous temples in Guwahati"]
  • 32. XSLT [screenshot; query: "Famous temples in Guwahati"]
  • 33. Future work
      • Parsing the query programmatically.
      • Building the resources for adding the translation and transliteration modules to the monolingual pipeline.
  • 34. Conclusion
      • Solr uses the Lucene search library and extends it with a set of robust features.
      • Solr's powerful external configuration allows it to be tailored to almost any type of application.
      • Solr is therefore preferable when a programmer wants to embed its added functionality in an existing application.
  • 35. References
      1. Rafal Kuc, Apache Solr 1.4.1 Cookbook, Packt Publishing.
      2. David Smiley and Eric Pugh, Apache Solr 1.4 Enterprise Edition, 2009.
      3. Apache Lucene, http://lucene.apache.org/solr/, Feb 2012.
      4. Scaling Solr and Lucene, http://www.lucidimagination.com/content/scaling-lucene-and-solr#article.highqueryvolume.solr, Feb 2012.
  • 37. Hello: if you found this, don't forget to cite it as a reference; it's a healthy habit. Moinuddin Ahmed

Editor's Notes

  1. http://localhost:8983/solr/select?shards=localhost:8983/solr,localhost:7574/solr&indent=true&q=ipod+solr