The Search Engine Index http://scienceforseo.blogspot.com IR tutorial series: Part 1
What is an index? The word “index” can mean many things in computing, but in the case of search engines, it can be defined as: A database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. Cache-based engines store the index along with the corpus (collection of documents).  When something is added to the corpus, the index is updated.
“Index” We call it that because it's exactly what we called it when it was one of these: And that took its name from the index finger Photo from: http://www.homeschoolinthewoods.com
Why use an index? If we didn't have an index, it would take too much time to search through the whole corpus to find documents that matched our query.  Creating an index makes the retrieval process faster and can make it more accurate. The search engine doesn't need to scan each document at query time to know what it's about – this saves work and makes the whole process faster.
Some things we need to think about
Indexing methods
The inverted index It is an index which has terms as keys.  Each key maps to the documents the term appears in.  The index is sorted by its keys and works well with Boolean operators (AND, OR, AND NOT). We find the documents by matching the terms, rather than finding the terms by reading the documents – this is why we say it is inverted. Diagram by http://developer.apple.com/
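To make the idea concrete, here is a minimal sketch of an inverted index in Python. The three toy documents and the helper names (`build_inverted_index`, `postings`) are made up for illustration, and splitting on whitespace is a simplifying assumption; a real engine does far more work per document.

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each term to the sorted list of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    # Sort the keys, and keep each list of document ids sorted as well.
    return {term: sorted(ids) for term, ids in sorted(index.items())}


def postings(index, term):
    """Return the set of documents containing the term (empty if unseen)."""
    return set(index.get(term.lower(), []))


# Hypothetical toy corpus.
docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the house",
    3: "the dog sat in the house",
}
index = build_inverted_index(docs)

# Boolean operators become set operations on the document ids.
dog_and_house = postings(index, "dog") & postings(index, "house")      # AND
dog_or_cat = postings(index, "dog") | postings(index, "cat")           # OR
house_and_not_cat = postings(index, "house") - postings(index, "cat")  # AND NOT
```

Because each term already points at its documents, answering a Boolean query never requires reading the documents themselves.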
Limitations In its basic form it can only tell us if a word occurs in a particular document. It can't tell us how often it occurs or where in the document, and it can't rank those documents either. That information is very important because it helps the search engine determine how relevant a document is to a query. So... we look at latent semantic indexing (LSI)
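To show what the missing information looks like as data, here is a small sketch that records word positions per document. This is not the approach these slides go on to describe; it only illustrates "how often" and "where". The function name `build_positional_index` and the toy documents are assumptions made for this example.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map term -> {doc_id: [positions]}; the list length is the term frequency."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return dict(index)


# Hypothetical toy corpus.
docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the house",
}
pos_index = build_positional_index(docs)

the_in_doc1 = pos_index["the"][1]    # word positions of "the" in document 1
the_count_doc1 = len(the_in_doc1)    # how often it occurs there
```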
LSI “Semantic” = meaning “Latent” = present but hidden It is the analysis of the hidden meaning of words and how often they occur in a document. It can infer connections between words that aren't obvious: Computer – PC – Laptop => connected It can put together documents that are not obviously related. It can do this because it creates a “latent semantic space”
How does LSI work? It uses lots of vectors and creates a “term-document matrix” from all the documents it has. Then 3 matrices are created using SVD (“singular value decomposition”). Of these 3 matrices, the 2nd is a diagonal matrix containing the singular values of the original matrix. Sets of documents are represented as d-dimensional vectors. Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents.
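The steps above can be sketched in a few lines, assuming NumPy is available. The toy corpus, the choice of raw counts as matrix entries and k = 2 dimensions are all assumptions made for illustration; real systems usually weight the counts and keep many more dimensions.

```python
import numpy as np

# Hypothetical toy corpus: rows of the matrix are terms, columns are documents.
terms = ["computer", "pc", "laptop", "dog", "field"]
docs = [
    "computer pc",
    "pc laptop",
    "computer laptop",
    "dog field",
    "dog field dog",
]

# Term-document matrix of raw counts.
A = np.array([[doc.split().count(t) for doc in docs] for t in terms], dtype=float)

# SVD gives three matrices: A = U @ diag(s) @ Vt, with s the singular values.
U, s, Vt = np.linalg.svd(A, full_matrices=False)

# Keep only the k largest singular values to get the latent semantic space.
k = 2
term_vecs = U[:, :k] * s[:k]      # one k-dimensional vector per term
doc_vecs = Vt[:k, :].T * s[:k]    # one k-dimensional vector per document


def cosine(u, v):
    """Cosine of the angle between two vectors."""
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))


sim_related = cosine(doc_vecs[0], doc_vecs[1])    # two computing documents
sim_unrelated = cosine(doc_vecs[0], doc_vecs[3])  # computing vs. dog document
sim_terms = cosine(term_vecs[0], term_vecs[2])    # "computer" vs. "laptop"
```

In this toy corpus the first two documents share only the word "pc", yet in the reduced space they point in practically the same direction, while the dog document stays unrelated to them.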
A quick sketch of LSI Box of documents -> Lots of vectors -> Term-document matrix -> Matrix 1, Matrix 2, Matrix 3 -> Sets of terms and documents = d-dimensional vectors. There are however some big limitations to this method...
The resulting dimensions can be very difficult to interpret, which can lead to mistakes.  It's unclear what the resulting similarities between terms really mean.  The input is a bag-of-words so we don't have any text structure information. A compound term (“bull-headed”) is treated as 2 terms. Ambiguous terms create noise in the vector space. There's no way to define the optimal dimensionality of the vector space. SVD is computationally expensive, which is a problem in dynamic collections.
PLSI “Probabilistic latent semantic indexing” is a better choice because: It has a more robust statistical foundation and provides a proper generative data model. It uses the EM (expectation maximization) algorithm, with care taken to avoid over-fitting (a model too specific to noise) - this makes it far more flexible. It can deal with domain-specific synonymy and polysemous words.
What did all that mean? “Generative data model” - a model that can randomly generate observable data, given some hidden parameters (HMMs are generative data models for example). “EM algorithm” - it finds the maximum likelihood estimate of parameters in a probabilistic model (where the model depends on unobserved latent variables) – good for machine learning and data clustering. Synonymy – the synonym relation between words.  Synonyms are 2 different words that mean the same thing. Polysemous – describes a word that has multiple meanings or interpretations.
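For readers who want to see EM in action, below is a rough sketch of a PLSI-style model fitted with plain EM, assuming NumPy. The hidden variable z plays the role of a topic. The function name `plsi`, the toy count matrix, the random starting point and the fixed 50 iterations are all assumptions for illustration, and the sketch includes no safeguard against over-fitting.

```python
import numpy as np


def plsi(counts, n_topics, n_iter=50, seed=0):
    """Fit P(w|z) and P(z|d) to a document-by-word count matrix with plain EM."""
    rng = np.random.default_rng(seed)
    n_docs, n_words = counts.shape
    # Random starting points, normalised so each row is a distribution.
    p_w_z = rng.random((n_topics, n_words))
    p_w_z /= p_w_z.sum(axis=1, keepdims=True)
    p_z_d = rng.random((n_docs, n_topics))
    p_z_d /= p_z_d.sum(axis=1, keepdims=True)

    for _ in range(n_iter):
        # E-step: P(z|d,w) is proportional to P(z|d) * P(w|z).
        joint = p_z_d[:, :, None] * p_w_z[None, :, :]   # (docs, topics, words)
        p_z_dw = joint / (joint.sum(axis=1, keepdims=True) + 1e-12)
        # M-step: re-estimate both distributions from the expected counts.
        expected = counts[:, None, :] * p_z_dw          # n(d,w) * P(z|d,w)
        p_w_z = expected.sum(axis=0)
        p_w_z /= p_w_z.sum(axis=1, keepdims=True) + 1e-12
        p_z_d = expected.sum(axis=2)
        p_z_d /= p_z_d.sum(axis=1, keepdims=True) + 1e-12
    return p_w_z, p_z_d


# Hypothetical counts: 4 documents (rows) over a 4-word vocabulary (columns).
counts = np.array([
    [2, 1, 0, 0],
    [1, 2, 0, 0],
    [0, 0, 2, 1],
    [0, 0, 1, 2],
], dtype=float)

p_w_z, p_z_d = plsi(counts, n_topics=2)
```

Each row of `p_w_z` is a distribution over words for one topic, and each row of `p_z_d` says how much of each topic a document contains. Which topic ends up with which words depends on the random start.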
How does it work?
How is it different to LSI? The order of the words is lost (but results are still good due to word co-occurrence). Documents can be represented by numeric vectors in a space of words. It retrieves topics. Each query uses the cosine similarity metric to find the similarity between vectors.
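Two of the points above are easy to show in code: word order being lost, and cosine similarity between a query and a document. This is a minimal sketch using only the standard library; the vocabulary, the sentences and the function names are invented for the example.

```python
import math
from collections import Counter


def bag_of_words(text, vocabulary):
    """Count how often each vocabulary word occurs; word order is discarded."""
    counts = Counter(text.lower().split())
    return [counts[word] for word in vocabulary]


def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length numeric vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0  # convention chosen here for an all-zero vector
    return dot / (norm_u * norm_v)


# Hypothetical vocabulary and texts.
vocabulary = ["dog", "bites", "man", "cat"]
doc_1 = bag_of_words("dog bites man", vocabulary)
doc_2 = bag_of_words("man bites dog", vocabulary)   # same words, other order
query = bag_of_words("cat bites dog", vocabulary)

same_vector = doc_1 == doc_2                 # the two documents look identical
score = cosine_similarity(query, doc_1)      # 1.0 would mean same direction
```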
More indexing difficulties It's easy for us to pick a document and classify it, well most of the time, but search engines have other difficulties to overcome before even getting to the classification stage.
Tokenization Machines don't understand sentences in text. They see everything in bytes. Consider: “The dog ran in the field”. We see 6 words. The machine sees 24 characters (chars), spaces included. The words found in a document are called “tokens”.  Information is extracted from documents to be placed in the index.  The tokens may be email addresses, words, URLs, ... The Part-Of-Speech, line number, sentence number, size and so on can be stored in the index.
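Here is a minimal tokenizer sketch for the example sentence, using Python's `re` module. The pattern is a deliberately simple assumption (it knows about URLs, email addresses and plain words only), and the name `tokenize` is made up for this example. It keeps the line number and character offset of each token, as examples of extra information that could go in the index.

```python
import re

# Simplified pattern: try URLs first, then email addresses, then plain words.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"                # URLs
    r"|[\w.+-]+@[\w-]+\.[\w.-]+"   # email addresses
    r"|\w+"                        # plain words
)


def tokenize(text):
    """Yield (token, line_number, char_offset) for each token in the text."""
    for line_number, line in enumerate(text.splitlines(), start=1):
        for match in TOKEN_PATTERN.finditer(line):
            yield match.group(), line_number, match.start()


sentence = "The dog ran in the field"
tokens = list(tokenize(sentence))

char_count = len(sentence)    # what the machine starts from: 24 characters
token_count = len(tokens)     # what we see: 6 words
```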
Section recognition Before tokenization happens, all the major parts of a document are identified.  Some documents are newsletters, others have a side navigation, some are reports... and the text can be displayed in columns.  Machines will read this sequentially though, and index the words sequentially as well. The difficulty is finding which view of the document is informative. Some engines will index an abstract representation of the document instead.  Most engines don't though. This is also why relying on JavaScript, for example, to display content is avoided.
Formats Documents come in all flavours on the web.  There are documents in HTML, PDF, Excel, PowerPoint, and so many others. Before documents are analysed, they are stripped down and the formatting is removed.  They are "normalised". It's important for the search engine not to mistake "markup" information for content, or the index gets polluted.
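For HTML, the stripping step could look something like the sketch below, which uses `html.parser` from Python's standard library. The class name `TextExtractor`, the helper `extract_text` and the sample page are assumptions for illustration; other formats such as PDF need their own extractors.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect visible text and skip the contents of <script> and <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep text only when we are not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def extract_text(html):
    """Return the text content of an HTML string with the markup removed."""
    parser = TextExtractor()
    parser.feed(html)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical page.
page = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>Dogs</h1><p>The dog ran in the <b>field</b></p></body></html>"
)
text = extract_text(page)
```

Only the words a reader would see survive; tag names and the stylesheet never reach the tokenizer, so they can't pollute the index.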
To conclude... The indexing process of a search engine is really very important because if this is wrong, everything is wrong.  This is why “Spamdexing” is such an issue. There are a lot of very specialised areas of computing that focus their work on making it easier for machines to create an index.  Don't let this short presentation fool you, it is a very very big research issue.  Natural language processing is used for rich text analysis, which helps identify what's going on so that the other computational elements can do their job.
Resources The inverted index in detail: http://tinyurl.com/65hbfd; The seminal PLSI paper: http://tinyurl.com/54wd76; The seminal LSI paper: http://tinyurl.com/5e8v36; The semantic indexing project: http://knowledgesearch.org/; Boulder Uni on LSA: http://lsa.colorado.edu/; Apache Lucene: http://lucene.apache.org/java/docs/; Google test data ($150): http://tinyurl.com/62t4la

"Subclassing and Composition – A Pythonic Tour of Trade-Offs", Hynek Schlawack
 

The search engine index

  • 1. The Search Engine Index http://scienceforseo.blogspot.com IR tutorial series: Part 1
  • 2. What is an index? The word “index” can mean many things in computing, but in the case of search engines, it can be defined as: A database where information (after being collected, parsed and processed) is stored to allow for quick retrieval. Cache-based engines store the index along with the corpus (collection of documents). When something is added to the corpus, the index is updated.
  • 3. “Index” We call it that because it's exactly what we called it when it was one of these: And that took its name from the index finger Photo from: http://www.homeschoolinthewoods.com
  • 4. Why use an index? If we didn't have an index, it would take too much time to search through the whole corpus to find documents that matched our query. Creating an index means that the retrieval process is faster and the accuracy is better. The search engine doesn't need to scan each document to know what it's about – this saves on storage and makes the whole process faster.
  • 5.
  • 6.
  • 7. The inverted index It is an index which has terms as keys. Each key maps to the documents that term appears in. The index is sorted by its keys and works well with Boolean operators (AND, OR, AND NOT). We find the documents by matching the terms, rather than finding the terms by scanning the documents: this is why we say it is inverted. Diagram by http://developer.apple.com/
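
A minimal sketch of the idea on this slide, assuming whitespace-separated lowercase terms and a toy three-document corpus (the function names and the corpus are illustrative, not taken from any particular engine). Terms are the keys, each key maps to a set of document ids, and the Boolean operators become set operations:

```python
from collections import defaultdict


def build_inverted_index(docs):
    """Map each lowercased term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in text.lower().split():
            index[term].add(doc_id)
    return index


def lookup(index, term):
    """Return the ids of documents containing the term (empty set if none)."""
    return index.get(term.lower(), set())


# Hypothetical toy corpus: document id -> text.
docs = {
    1: "the dog ran in the field",
    2: "the cat sat in the sun",
    3: "a dog and a cat",
}
index = build_inverted_index(docs)

dog_and_cat = lookup(index, "dog") & lookup(index, "cat")  # AND
dog_or_cat = lookup(index, "dog") | lookup(index, "cat")   # OR
dog_not_cat = lookup(index, "dog") - lookup(index, "cat")  # AND NOT

# The slide says the index is sorted by its keys.
sorted_terms = sorted(index)
```

Here `dog_and_cat` is `{3}`, `dog_or_cat` is `{1, 2, 3}` and `dog_not_cat` is `{1}`. A query never scans the document texts; it only looks up keys.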
  • 8. Limitations In this basic form, the inverted index can only tell us whether a word occurs in a particular document. It can't tell us how often the word occurs or where it is located in the document, and it can't rank the matching documents. That information is very important because it helps the search engine determine how relevant a document is to a query. So... we look at latent semantic indexing (LSI).
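
To make the missing information concrete, here is a hedged sketch of one common extension that the slides do not cover: storing a list of word positions per document instead of a bare document id. The frequency is then the length of that list. The names and the two-document corpus are assumptions for illustration only.

```python
from collections import defaultdict


def build_positional_index(docs):
    """Map each term to {doc_id: [word positions]}, counting words from 0."""
    index = defaultdict(dict)
    for doc_id, text in docs.items():
        for position, term in enumerate(text.lower().split()):
            index[term].setdefault(doc_id, []).append(position)
    return index


# Hypothetical toy corpus.
pos_docs = {1: "the dog ran in the field", 2: "the cat sat"}
pos_index = build_positional_index(pos_docs)

the_postings = pos_index["the"]            # positions of "the" per document
the_freq_doc1 = len(pos_index["the"][1])   # how often "the" occurs in doc 1
```

With this corpus `the_postings` is `{1: [0, 4], 2: [0]}` and `the_freq_doc1` is `2`. This records frequency and location, but it still does not rank documents by itself.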
  • 9. LSI “Semantic” = meaning. “Latent” = present but hidden. It is the analysis of the hidden meaning of words and how often they occur in a document. It can infer relations between words that aren't obvious: Computer – PC – Laptop => connected. It can put together documents that are not obviously related. It can do this because it creates a “latent semantic space”.
  • 10. How does LSI work? It uses lots of vectors and creates a “term document matrix” from all the documents it has. Then 3 matrices are created using SVD (“singular value decomposition”). Of these 3 matrices, the 2nd is a diagonal matrix containing the singular values of the original matrix. Sets of documents are represented as d-dimensional vectors. Using the cosine of the angle between these vectors, there is now an easy-to-calculate similarity measure between any two sets of terms and/or documents.
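
The first and last steps above can be sketched in plain Python: building a term-document matrix of raw counts and comparing two document vectors with the cosine measure. This sketch deliberately leaves out the SVD step, so it is not LSI itself; it compares documents in the original word space rather than in the reduced latent space. The corpus and function names are assumptions for illustration.

```python
import math


def term_document_matrix(docs):
    """Return (vocab, matrix): one row per term, one column per document."""
    vocab = sorted({term for text in docs for term in text.lower().split()})
    matrix = [
        [text.lower().split().count(term) for text in docs]
        for term in vocab
    ]
    return vocab, matrix


def cosine(u, v):
    """Cosine of the angle between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0 or norm_v == 0:
        return 0.0
    return dot / (norm_u * norm_v)


# Hypothetical toy corpus.
corpus = ["computer pc laptop", "pc laptop", "dog field"]
vocab, matrix = term_document_matrix(corpus)

# Each column of the matrix is one document vector.
doc_vectors = [list(column) for column in zip(*matrix)]
sim_01 = cosine(doc_vectors[0], doc_vectors[1])  # documents share "pc" and "laptop"
sim_02 = cosine(doc_vectors[0], doc_vectors[2])  # documents share no terms
```

`sim_01` comes out at about 0.82 and `sim_02` at 0. Without the SVD step, two documents that use different but related words still score 0, which is the gap the latent semantic space is meant to close.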
  • 11. A quick sketch of LSI [Diagram labels: Box of documents; Lots of vectors; Term document matrix; Matrix 1, Matrix 2, Matrix 3; Sets of terms and documents = d-dimensional vectors] There are, however, some big limitations to this method...
  • 12. The resulting dimensions can be very difficult to interpret, so there are mistakes. It's unclear what the resulting similarities between terms really mean. The input is a bag-of-words, so we don't have any text structure information. A compound term (“bull-headed”) is treated as 2 terms. Ambiguous terms create noise in the vector space. There's no way to define the optimal dimensionality of the vector space. SVD is computationally expensive, which is a problem in dynamic collections.
  • 13. PLSI “Probabilistic latent semantic indexing” is a better choice because: It has a more robust statistical foundation and provides a proper generative data model. It uses the EM algorithm (expectation maximization) in a way that avoids over-fitting (fitting too specifically to noise), which makes it far more flexible. It can deal with domain-specific synonymy and polysemous words.
  • 14. What did all that mean? “Generative data model”: it's used for randomly generating observed data from unknown parameters (HMMs are generative data models, for example). “EM algorithm”: it finds the maximum likelihood estimate of parameters in a probabilistic model (where the model depends on unobserved latent variables); good for machine learning and data clustering. Synonymy: the synonym relation between words. Two different words are synonyms when they mean the same thing. Polysemous: describes a word that has multiple meanings or interpretations.
  • 15.
  • 16. How is it different to LSI? The order of the words is lost (but results are still good due to word co-occurrence). Documents can be represented by numeric vectors in a space of words. It retrieves topics. Each query uses the cosine similarity metric to find the similarity between vectors.
  • 17. More indexing difficulties It's easy for us to pick a document and classify it (well, most of the time), but search engines have other difficulties to overcome before even getting to the classification stage.
  • 18. Tokenization Machines don't understand sentences in text. They see everything in bytes. Consider: The dog ran in the field. We see 6 words. The machine sees 24 characters (chars), spaces included. The words found in a document are called “tokens”. Information is extracted from documents to be placed in the index. The tokens may be email addresses, words, URLs, ... The part of speech, line number, sentence number, size and so on can be stored in the index.
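
A small sketch of the slide's example, assuming a deliberately simple regular expression that recognises URLs, email addresses and words (real tokenizers are far more involved, and the pattern here is an illustration, not a standard). Each token is returned with its character offset, one of the extra facts an index could store:

```python
import re

# Order matters: try URLs first, then email addresses, then plain words.
TOKEN_PATTERN = re.compile(
    r"https?://\S+"                      # URLs
    r"|[\w.+-]+@[\w-]+(?:\.[\w-]+)+"     # email addresses
    r"|\w+(?:-\w+)*"                     # words, including hyphenated ones
)


def tokenize(text):
    """Return a list of (token, character offset) pairs."""
    return [(match.group(), match.start()) for match in TOKEN_PATTERN.finditer(text)]


sentence = "The dog ran in the field"
char_count = len(sentence)     # what the machine sees: 24 characters
tokens = tokenize(sentence)    # what we see: 6 words

mixed = tokenize("Mail bob@example.com or see http://lucene.apache.org/ now")
mixed_tokens = [token for token, _ in mixed]
```

For the slide's sentence this gives 24 characters and 6 tokens. In the second example the email address and the URL each survive as a single token. Note that this pattern keeps “bull-headed” as one token, a design choice that differs from the bag-of-words treatment described for LSI.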
  • 19. Section recognition Before tokenization happens, all the major parts of a document are identified. Some documents are newsletters, others have a side navigation, some are reports... and the text can be displayed in columns. Machines will read this sequentially, though, and index the words sequentially as well. The difficulty is finding which view of the document is informative. Some engines will index an abstract representation of the document instead. Most engines don't, though. This is also why using JavaScript, for example, is avoided.
  • 20. Formats Documents come in all flavours on the web. There are documents in HTML, PDF, Excel, PowerPoint, and so many others. Before documents are analysed, they are stripped down and the formatting is extracted. They are "normalised". It's important for the search engine not to mistake "markup" information for content, or the index gets polluted.
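
For HTML, the normalising step can be sketched with Python's standard `html.parser` module. This is a minimal illustration, assuming we only want visible text and want to drop the contents of `script` and `style` elements; the class name and the sample markup are made up for the example, and other formats (PDF, Excel, PowerPoint) need their own extractors.

```python
from html.parser import HTMLParser


class TextExtractor(HTMLParser):
    """Collect text content, skipping anything inside <script> or <style>."""

    SKIPPED = {"script", "style"}

    def __init__(self):
        super().__init__()
        self.parts = []
        self._skip_depth = 0

    def handle_starttag(self, tag, attrs):
        if tag in self.SKIPPED:
            self._skip_depth += 1

    def handle_endtag(self, tag):
        if tag in self.SKIPPED and self._skip_depth > 0:
            self._skip_depth -= 1

    def handle_data(self, data):
        # Keep only non-empty text that is not inside a skipped element.
        if self._skip_depth == 0 and data.strip():
            self.parts.append(data.strip())


def normalise_html(markup):
    """Return the text of an HTML document with tags, scripts and styles removed."""
    parser = TextExtractor()
    parser.feed(markup)
    parser.close()
    return " ".join(parser.parts)


# Hypothetical sample page.
html_doc = (
    "<html><head><style>p {color: red}</style></head>"
    "<body><h1>The index</h1><p>Terms map to <b>documents</b></p>"
    "<script>var x = 1;</script></body></html>"
)
clean_text = normalise_html(html_doc)
```

`clean_text` is `"The index Terms map to documents"`: neither the tag names nor the stylesheet and script contents reach the indexer, so they cannot pollute the index.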
  • 21. To conclude... The indexing process of a search engine is really very important because if this is wrong, everything is wrong. This is why “Spamdexing” is such an issue. There are a lot of very specialised areas of computing that focus on making it easier for machines to create an index. Don't let this short presentation fool you: it is a very, very big research issue. Natural language processing is used for rich text analysis, which helps identify what's going on so that the other computational elements can do their job.
  • 22. Resources The inverted index in detail http://tinyurl.com/65hbfd The seminal PLSI paper http://tinyurl.com/54wd76 The seminal LSI paper http://tinyurl.com/5e8v36 The semantic indexing project http://knowledgesearch.org/ Boulder Uni on LSA http://lsa.colorado.edu/ Apache Lucene http://lucene.apache.org/java/docs/ Google test data ($150) http://tinyurl.com/62t4la