Tomasz Korzeniowski
   tomek@polarrose.com
Information Retrieval
Retrieval strategies
• Vector Space Model
• Latent Semantic Indexing
• Probabilistic Retrieval Strategies
• Language Models
• Inference Networks
• Extended Boolean Retrieval
• Neural Networks
• Genetic Algorithms
• Fuzzy Set Retrieval
Vector space model
Text retrieval
Analysis
Tokenization
Stop-words
Stemming

Lemmatization
http://tartarus.org/~martin/PorterStemmer/
Document

 Term
Term frequency
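[...] boost for a query on ferrari than the [...] get from a query on insurance.

Document frequency df_t: the number of documents in the collection that contain the term t.
With N the total number of documents in the corpus, the inverse document frequency (idf) of a term t is defined as follows:

$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

The idf of a rare term is high, whereas the idf of a frequent term is likely to be low.
Figure 6.4 gives an example of idf's in a co[...]

The tf-idf weighting scheme assigns to term t a weight in document d given by:

$$\text{tf-idf}_{t,d} = \mathrm{tf}_{t,d} \times \mathrm{idf}_t$$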
Search
[Figure 7.1: Cosine similarity illustrated. The diagram shows the query vector v(q) and the document vector v(d2).]
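
The cosine measure in the figure can be sketched in a few lines of Python. This is a minimal illustration, not code from the talk: the function name `cosine_similarity` and the choice to return 0.0 for an all-zero vector are assumptions made here.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length term-weight vectors."""
    if len(u) != len(v):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        # Assumed convention: an all-zero vector is similar to nothing.
        return 0.0
    return dot / (norm_u * norm_v)
```

Vectors pointing in the same direction score 1 regardless of their length, which is why the measure does not favour long documents merely for being long.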
Q: “gold silver truck”

D1: “Shipment of gold damaged in a fire”

D2: “Delivery of silver arrived in a silver truck”

D3: “Shipment of gold arrived in a truck”
TF

      a   arrived   damaged   delivery   fire   gold   in   of   shipment   silver   truck
D1    1   0         1         0          1      1      1    1    1          0        0
D2    1   1         0         1          0      0      1    1    0          2        1
D3    1   1         0         0          0      1      1    1    1          0        1
Q     0   0         0         0          0      1      0    0    0          1        1
$$\mathrm{idf}_t = \log \frac{N}{\mathrm{df}_t}$$

In this example N = 3 and logarithms are to the base 10.

  • a          log 3/3 = 0
  • arrived    log 3/2 = 0.176
  • damaged    log 3/1 = 0.477
  • delivery   log 3/1 = 0.477
  • fire       log 3/1 = 0.477
  • gold       log 3/2 = 0.176
  • in         log 3/3 = 0
  • of         log 3/3 = 0
  • shipment   log 3/2 = 0.176
  • silver     log 3/1 = 0.477
  • truck      log 3/2 = 0.176
TF-IDF

      a   arrived   damaged   delivery   fire    gold    in   of   shipment   silver   truck
D1    0   0         0.477     0          0.477   0.176   0    0    0.176      0        0
D2    0   0.176     0         0.477      0       0       0    0    0          0.954    0.176
D3    0   0.176     0         0          0       0.176   0    0    0.176      0        0.176
Q     0   0         0         0          0       0.176   0    0    0          0.477    0.176
SC(Q,D1) = (0)(0) + (0)(0) + (0)(0.477) + (0)(0) + (0)(0.477) + (0.176)(0.176) + (0)(0) + (0)(0) + (0)(0.176) + (0.477)(0) + (0.176)(0) = (0.176)(0.176) ≈ 0.031

SC(Q,D2) = (0.477)(0.954) + (0.176)(0.176) ≈ 0.486

SC(Q,D3) = (0.176)(0.176) + (0.176)(0.176) ≈ 0.062
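
The worked example can be reproduced with a short Python sketch. It assumes lowercasing and whitespace tokenization, base-10 logarithms, and the unnormalized dot product used in the SC(Q,D) lines; the helper names (`tokenize`, `weights`, `score`) are illustrative and do not come from the slides.

```python
import math
from collections import Counter

docs = {
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
}
query = "gold silver truck"


def tokenize(text):
    # Assumed analysis step: lowercase, then split on whitespace.
    return text.lower().split()


# Term frequencies per document.
tf = {name: Counter(tokenize(text)) for name, text in docs.items()}

# Document frequency and idf (base 10) per term.
n_docs = len(docs)
vocabulary = sorted({term for counts in tf.values() for term in counts})
df = {t: sum(1 for counts in tf.values() if t in counts) for t in vocabulary}
idf = {t: math.log10(n_docs / df[t]) for t in vocabulary}


def weights(counts):
    """tf-idf weight for every term that also occurs in the corpus."""
    return {t: counts[t] * idf[t] for t in counts if t in idf}


def score(q, d):
    """Unnormalized dot product of query and document weights."""
    return sum(w * d.get(t, 0.0) for t, w in q.items())


doc_weights = {name: weights(counts) for name, counts in tf.items()}
query_weights = weights(Counter(tokenize(query)))

scores = {name: score(query_weights, w) for name, w in doc_weights.items()}
ranking = sorted(scores, key=scores.get, reverse=True)

for name in ranking:
    print(name, round(scores[name], 3))
```

Rounded to three decimals, the scores match the hand computation, and D2 ranks first because "silver" is both rare in the corpus and repeated in that document.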
Inverted index
term - 1   (dn,1)    (d10,1)



term - 2   (dn,5)    (dn,3)



term - 3   (d2,11)   (d10,1)



term - 4   (dn,1)    (d2,1)



term - 5   (dn,2)    (d4,3)




term - n   (d6,1)    (d7,3)
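
A minimal Python sketch of the structure above: each term maps to a posting list of (document id, term frequency) pairs. The function name `build_inverted_index` and the whitespace tokenization are assumptions for illustration; a real index would also run the analysis steps described earlier.

```python
from collections import Counter, defaultdict


def build_inverted_index(docs):
    """Map each term to a list of (doc_id, term_frequency) postings."""
    index = defaultdict(list)
    for doc_id in sorted(docs):  # sorted so postings are ordered by doc id
        counts = Counter(docs[doc_id].lower().split())
        for term, freq in counts.items():
            index[term].append((doc_id, freq))
    return dict(index)


index = build_inverted_index({
    "D1": "Shipment of gold damaged in a fire",
    "D2": "Delivery of silver arrived in a silver truck",
    "D3": "Shipment of gold arrived in a truck",
})

print(index["silver"])
print(index["gold"])
```

Answering a query then only requires reading the posting lists of the query terms instead of scanning every document.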
Lucene
Analysis
Lucene includes several built-in analyzers. The primary ones are shown in table 4.2.
We’ll leave discussion of the two language-specific analyzers, RussianAnalyzer
and GermanAnalyzer, to section 4.8.2 and the special per-field analyzer wrapper,
PerFieldAnalyzerWrapper, to section 4.4.

Table 4.2   Primary analyzers available in Lucene

  Analyzer             Steps taken

  WhitespaceAnalyzer   Splits tokens at whitespace

  SimpleAnalyzer       Divides text at nonletter characters and lowercases

  StopAnalyzer         Divides text at nonletter characters, lowercases, and removes stop words

  StandardAnalyzer     Tokenizes based on a sophisticated grammar that recognizes e-mail
                       addresses, acronyms, Chinese-Japanese-Korean characters,
                       alphanumerics, and more; lowercases; and removes stop words
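
The first three rows of the table can be approximated in plain Python to show what each step does to the same input. This is a sketch of the described behaviour, not Lucene's implementation: the function names and the tiny `STOP_WORDS` set are assumptions, and StandardAnalyzer's grammar is deliberately left out.

```python
import re

# Assumed, deliberately small stop list for illustration only.
STOP_WORDS = {"a", "an", "and", "in", "of", "the"}


def whitespace_analyze(text):
    """Split at whitespace only; case and punctuation are kept."""
    return text.split()


def simple_analyze(text):
    """Split at non-letter characters and lowercase each token."""
    return [tok.lower() for tok in re.findall(r"[^\W\d_]+", text)]


def stop_analyze(text, stop_words=STOP_WORDS):
    """Same as simple_analyze, then drop stop words."""
    return [tok for tok in simple_analyze(text) if tok not in stop_words]


sample = "Shipment of gold, damaged in a fire!"
print(whitespace_analyze(sample))
print(simple_analyze(sample))
print(stop_analyze(sample))
```

The same analysis must be applied at index time and at query time, otherwise query terms will not match the terms stored in the index.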



The built-in analyzers we discuss in this section (WhitespaceAnalyzer,
SimpleAnalyzer, StopAnalyzer, and StandardAnalyzer) are designed to work with text in
almost any Western (European-based) language. You can see the effect of each of
these analyzers in the output in section 4.2.3. WhitespaceAnalyzer and
SimpleAnalyzer are both trivial and we don’t cover them in more detail here. We explore
the StopAnalyzer and StandardAnalyzer in more depth because they have non-
Index

• IndexWriter
• Directory
• Analyzer
• Document
• Field
Index options: store

  Value         Description
  :no           Don’t store field.
  :yes          Store field in its original format.
                Use this value if you want to highlight
                matches or print match excerpts a la Google
                search.
  :compressed   Store field in compressed format.
Index options: index

  Value                     Description
  :no                       Do not make this field searchable.
  :yes                      Make this field searchable and tokenize its contents.
  :untokenized              Make this field searchable but do not
                            tokenize its contents. Use this value
                            for fields you wish to sort by.
  :omit_norms               Same as :yes except omit the norms
                            file. The norms file can be omitted
                            if you don’t boost any fields and
                            you don’t need scoring based on field
                            length.
  :untokenized_omit_norms   Same as :untokenized except omit the
                            norms file.
Ruby Day Kraków: Full Text Search with Ferret
Index options: term_vector

  Value                     Description
  :no                       Don’t store term-vectors.
  :yes                      Store term-vectors without storing positions
                            or offsets.
  :with_positions           Store term-vectors with positions.
  :with_offsets             Store term-vectors with offsets.
  :with_positions_offsets   Store term-vectors with positions and offsets.




Search

• IndexSearcher
• Term
• Query
• Hits
Query

• API
 •   new TermQuery(new Term("name", "Tomek"));

• Lucene QueryParser
 •   queryParser.parse("name:Tomek");
TermQuery
 name:Tomek
BooleanQuery
    rambo OR ninja

+rambo +ninja -name:rocky
PhraseQuery
"ninja java" -name:rocky
SloppyPhraseQuery
 "red-faced politicians"~3
RangeQuery
releaseDate:[2000 TO 2007]
WildcardQuery
 sup?r, su*r, super*
FuzzyQuery
      color~

 colour, collor, colro
http://en.wikipedia.org/wiki/Levenshtein_distance


color → colour: distance 1

colour → coller: distance 2
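
The two distances above can be checked with the textbook dynamic-programming algorithm for Levenshtein distance. The sketch below is a generic implementation written for this example, not the code Lucene or Ferret use internally; the function name `levenshtein` is an assumption.

```python
def levenshtein(a, b):
    """Minimum number of single-character insertions, deletions, and
    substitutions needed to turn string a into string b."""
    # previous[j] holds the distance between the processed prefix of a
    # and the first j characters of b.
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete ca
                current[j - 1] + 1,      # insert cb
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


print(levenshtein("color", "colour"))
print(levenshtein("colour", "coller"))
```

Only two rows of the distance table are kept at a time, so memory use grows with the length of one string rather than with the product of both lengths.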
Equation 1. Levenshtein Distance Score

This means that an exact match will h[...]
corresponding letters will have a score [...]
Boost
title:Spring^10
Information Retrieval with Open Source