
Webinar: OpenNLP and Solr for Superior Relevance

Lucidworks Senior Software Engineer and Solr Committer Steve Rowe explains how to increase relevance using Solr with Apache OpenNLP.

Published in: Technology

  1. October 11-14, 2016, Boston, MA: http://lucenerevolution.com
  2. OpenNLP and Solr for Superior Relevance
     Steve Rowe (@steven_a_rowe), Sr. Software Engineer, Lucidworks
  3. About me
     • Previously worked at the Center for Natural Language Processing (CNLP) at Syracuse University
     • Sr. Software Engineer at Lucidworks
     • Committer on the Apache Lucene/Solr project
     • Committer on the JFlex scanner generator project
  4. Agenda
     • OpenNLP capabilities
     • OpenNLP / Solr integration
     • OpenNLP models: training and licensing
     • Part-of-speech: what is it good for (absolutely/RB something/NN)
     • Lemmatization versus stemming
     • Solr configuration and demonstration of lemmatization and named entity extraction
     • Future work
  5. Apache OpenNLP capabilities
     • Machine learning: maximum entropy and perceptron based
     • Sentence segmentation
     • Tokenization
     • Part-of-speech (POS) tagging
     • Lemmatization
     • Named entity recognition (NER)
     • Phrase chunking
     • Parsing
     • Co-reference resolution
     • Document classification
  6. OpenNLP Solr integration
     • OpenNLP isn’t integrated with Solr in any release (yet)
     • TDD (talk-driven development)
     • LUCENE-2899: work-in-progress patch; it builds and works (demo later)
     • Currently implemented:
        • Sentence segmentation and tokenization
        • Part-of-speech (POS) tagging
        • Phrase chunking
        • Dictionary-based lemmatization
        • Named entity recognition (NER)
  7. OpenNLP models: training & licensing
     • Most OpenNLP phases can be trained, but each phase depends on the previous ones.
     • Publicly available models are based on data with non-free licenses.
     • You can train your own models, and you very likely want to for production use.
     • Example workflow:
        • Use an existing model to tag your training data
        • Modify the tagged data according to your needs
           • One way to do that: the brat rapid annotation tool (OpenNLP understands its output format)
        • Run OpenNLP command-line training tools to create a model
        • Run OpenNLP command-line evaluation tools to test model performance
        • Repeat until you get the quality you want
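     To make the "modify the tagged data" step concrete, here is a minimal sketch that writes corrected annotations back out as name-finder training text. It assumes the one-sentence-per-line, whitespace-tokenized `<START:type> ... <END>` markup described in the OpenNLP manual; the `Span` tuple and the function name are hypothetical, not part of OpenNLP.

```python
from typing import List, Tuple

# Hypothetical in-memory annotation: (first token index, end index exclusive, entity type).
Span = Tuple[int, int, str]


def to_opennlp_line(tokens: List[str], spans: List[Span]) -> str:
    """Render one tokenized sentence with <START:type> ... <END> entity markup.

    Assumes spans do not overlap and each covers at least one token.
    """
    starts = {start: etype for start, _, etype in spans}
    ends = {end for _, end, _ in spans}
    out = []
    for i, tok in enumerate(tokens):
        if i in ends:            # close a span that ended just before this token
            out.append("<END>")
        if i in starts:          # open a span that begins at this token
            out.append(f"<START:{starts[i]}>")
        out.append(tok)
    if len(tokens) in ends:      # span that runs to the end of the sentence
        out.append("<END>")
    return " ".join(out)


if __name__ == "__main__":
    sentence = ["Steve", "Rowe", "works", "at", "Lucidworks", "."]
    print(to_opennlp_line(sentence, [(0, 2, "person")]))
```

     Lines produced this way would then be the input to the command-line training step, one sentence per line.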
  8. LUCENE-2899
     • Created: 30/Jan/11 10:44 <- over 5 years old
     • Lance Norskog wrote the bulk of the implementation
     • I modernized Lance’s patch and added support for dictionary-based lemmatization
  9. Lemmatization vs. Stemming
     • Both can be used with search to increase recall
     • Lemmas are real words: infinitive verbs, singular nouns
        • e.g. Speaking/VBG, spoke/VBD -> speak; stigmata/NNS -> stigma
        • Can be produced by algorithm and/or known-item dictionary
        • OpenNLP 1.6.1 will include a machine-learned lemmatization implementation
        • Caveat: part-of-speech tagging quality is poor on short query text
     • Stems are not (necessarily) real words
        • e.g. Speaking -> speak, spoke -> spoke, stigmata -> stigmata (Porter stemmer)
        • Produced via algorithm
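     The difference can be sketched in a few lines. This is an illustration only: the three-entry dictionary is hypothetical (it is not the LanguageTool file), "spoke" is tagged VBD, the Penn Treebank past-tense tag, and `naive_stem` is a toy suffix stripper standing in for an algorithmic stemmer, not the Porter algorithm.

```python
# Hypothetical miniature lemma dictionary keyed by (lowercased word, POS tag).
LEMMAS = {
    ("speaking", "VBG"): "speak",
    ("spoke", "VBD"): "speak",
    ("stigmata", "NNS"): "stigma",
}


def lemmatize(word: str, pos: str) -> str:
    """Dictionary lookup; unknown (word, tag) pairs fall back to the lowercased word."""
    return LEMMAS.get((word.lower(), pos), word.lower())


def naive_stem(word: str) -> str:
    """Toy suffix stripper (NOT Porter): removes one common suffix, ignores POS."""
    w = word.lower()
    for suffix in ("ing", "ed", "s"):
        if w.endswith(suffix) and len(w) - len(suffix) >= 3:
            return w[: -len(suffix)]
    return w


if __name__ == "__main__":
    for word, pos in [("Speaking", "VBG"), ("spoke", "VBD"), ("stigmata", "NNS")]:
        print(word, pos, "lemma:", lemmatize(word, pos), "stem:", naive_stem(word))
```

     Because the lookup is keyed on the tag as well as the word, a mis-tagged token (say, "spoke" tagged NN) misses the dictionary entry, which is why poor tagging on short queries matters.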
  10. Penn Treebank part-of-speech tags
      Source: https://www.ling.upenn.edu/courses/Fall_2003/ling001/penn_treebank_pos.html
      CC    Coordinating conjunction
      CD    Cardinal number
      DT    Determiner
      EX    Existential there
      FW    Foreign word
      IN    Preposition/subordinating conjunction
      JJ    Adjective
      JJR   Adjective, comparative
      JJS   Adjective, superlative
      LS    List item marker
      MD    Modal
      NN    Noun, singular or mass
      NNS   Noun, plural
      NNP   Proper noun, singular
      NNPS  Proper noun, plural
      PDT   Predeterminer
      POS   Possessive ending
      PRP   Personal pronoun
      PRP$  Possessive pronoun
      RB    Adverb
      RBR   Adverb, comparative
      RBS   Adverb, superlative
      RP    Particle
      SYM   Symbol
      TO    to
      UH    Interjection
      VB    Verb, base form
      VBD   Verb, past tense
      VBG   Verb, gerund or present participle
      VBN   Verb, past participle
      VBP   Verb, non-3rd person singular present
      VBZ   Verb, 3rd person singular present
      WDT   Wh-determiner
      WP    Wh-pronoun
      WP$   Possessive wh-pronoun
      WRB   Wh-adverb
  11. Solr configuration
      • Put jars on classpath
      • Add required resources to configset:
         • models
         • lemmatization dictionary
      • Add field type(s) using OpenNLP-based analysis components, then fields using these field types
  12. Put jars on the classpath
      • Add to configset’s solrconfig.xml:
        <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lucene-libs" regex=".*\.jar"/>
        <lib dir="${solr.install.dir:../../../..}/contrib/analysis-extras/lib" regex="opennlp-.*\.jar"/>
  13. Add required resources to configset
      • Download models from http://opennlp.sourceforge.net/models-1.5/
      • Download lemma dictionary from http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
      conf/
        opennlp/
          en-ner-person.bin
          en-pos-maxent.bin
          en-sent.bin
          en-token.bin
          language-tool-en-lemmatizer.txt
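      Before adding field types, it can help to confirm the files are where the analysis components will look for them. A minimal sketch, assuming the conf/opennlp/ layout listed above; the function name and the example configset path are hypothetical.

```python
from pathlib import Path

# File names taken from the configset layout on the slide.
REQUIRED = [
    "en-ner-person.bin",
    "en-pos-maxent.bin",
    "en-sent.bin",
    "en-token.bin",
    "language-tool-en-lemmatizer.txt",
]


def missing_resources(conf_dir):
    """Return the required file names that are absent from <conf_dir>/opennlp."""
    base = Path(conf_dir) / "opennlp"
    return [name for name in REQUIRED if not (base / name).is_file()]


if __name__ == "__main__":
    # Hypothetical path; point this at your own configset's conf directory.
    print(missing_resources("server/solr/configsets/opennlp/conf"))
```

      An empty list means all five files are in place.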
  14. Add field type and fields
      curl -X POST http://localhost:8983/solr/opennlp/schema \
        -H 'Content-type: application/json' --data-binary '{
        "add-field-type": {
          "name": "text_lemma",
          "class": "solr.TextField",
          "positionIncrementGap": "100",
          "analyzer": {
            "tokenizer": {
              "class": "solr.OpenNLPTokenizerFactory",
              "sentenceModel": "opennlp/en-sent.bin",
              "tokenizerModel": "opennlp/en-token.bin"
            },
            "filters": [{
              "class": "solr.OpenNLPFilterFactory",
              "posTaggerModel": "opennlp/en-pos-maxent.bin"
            },{
              "class": "solr.OpenNLPLemmatizerFilterFactory",
              "dictionary": "opennlp/language-tool-en-lemmatizer.txt"
            }]}},
        "add-field": {
          "name": "content_lemma",
          "type": "text_lemma",
          "stored": true }
      }'
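      The same Schema API request can be built from a script, which avoids shell quoting problems. This is a sketch using only the Python standard library; the function name is hypothetical, the factory class names are the ones from the slide (they come from the LUCENE-2899 patch), and the default URL assumes the local "opennlp" collection used in the curl command.

```python
import json
import urllib.request


def build_schema_request(base_url="http://localhost:8983/solr/opennlp"):
    """Build (but do not send) the POST that adds text_lemma and content_lemma."""
    payload = {
        "add-field-type": {
            "name": "text_lemma",
            "class": "solr.TextField",
            "positionIncrementGap": "100",
            "analyzer": {
                "tokenizer": {
                    "class": "solr.OpenNLPTokenizerFactory",
                    "sentenceModel": "opennlp/en-sent.bin",
                    "tokenizerModel": "opennlp/en-token.bin",
                },
                "filters": [
                    # POS tagging must run before the lemmatizer, which uses the tags.
                    {
                        "class": "solr.OpenNLPFilterFactory",
                        "posTaggerModel": "opennlp/en-pos-maxent.bin",
                    },
                    {
                        "class": "solr.OpenNLPLemmatizerFilterFactory",
                        "dictionary": "opennlp/language-tool-en-lemmatizer.txt",
                    },
                ],
            },
        },
        "add-field": {"name": "content_lemma", "type": "text_lemma", "stored": True},
    }
    return urllib.request.Request(
        base_url + "/schema",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-type": "application/json"},
        method="POST",
    )


if __name__ == "__main__":
    req = build_schema_request()
    print(req.get_method(), req.full_url)
    # To actually send it (requires a running Solr with the patch applied):
    # urllib.request.urlopen(req)
```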
  15. Demo
      • (Switch to http://localhost:8983/solr here)
  16. Future Work
      • Make Solr update request processors for named entity recognition, and maybe for the phrase chunker.
      • Optimize memory usage to only process one sentence at a time.
      • Commit/release LUCENE-2899!
  17. Resources
      • Solr: http://lucene.apache.org/solr
      • OpenNLP: http://opennlp.apache.org
      • LUCENE-2899: https://issues.apache.org/jira/browse/LUCENE-2899
      • OpenNLP pre-trained models: http://opennlp.sourceforge.net/models-1.5/
      • brat rapid annotation tool: http://brat.nlplab.org/index.html
      • LanguageTool lemma dictionaries: http://ixa2.si.ehu.es/ragerri/lemmatizer-dicts.tgz
      • Company: http://www.lucidworks.com
      • Our blog: http://www.lucidworks.com/blog
      • Twitter: @steven_a_rowe
