Exploiting Web Search Engines to Search Structured Databases
Authors: Sanjay Agrawal, Kaushik Chakrabarti, Dong Xin
Publication: WWW 2009, Madrid
Presenter: Nita Pawar
OUTLINE
1. INTRODUCTION
2. ARCHITECTURAL OVERVIEW
3. ENTITY EXTRACTION
4. ENTITY RANKING
5. EXPERIMENTAL EVALUATION
6. CONCLUSION
1. INTRODUCTION
1] Users issue queries against web search engines.
2] A query may also be sent to structured databases.
3] Each structured database is searched individually and the relevant structured data items are returned to the web search engine. The search engine gathers the structured search results and displays them alongside the web search results.
4] The result from the structured database search is independent of the result from the web search; this is called silo-ed search. This approach works well for some queries.
5] However, there is a broad class of queries for which this approach returns incomplete or even empty results.
6] The documents returned by a web search engine are likely to mention products that are relevant to the user query.
7] The paper first establishes the evidence each individual returned document provides about the products mentioned in it, and then aggregates that evidence across the returned documents.
8] The paper also studies the task of efficiently and accurately identifying relevant information in a structured database for a web search query.
9] This approach can return entity results for a much wider range of queries than silo-ed search.
10] The implementation has two main design goals: (i) be effective for a wide variety of structured data domains; (ii) be integrated with a search engine and exploit it effectively.
2. ARCHITECTURAL OVERVIEW
2.1 Key Insights
1] The web documents relevant to the query typically mention the relevant entities, e.g. Figure 1.
2.2 Architecture
Search engine components: (i) the crawlers (ii) the indexers (iii) the searchers (iv) the front end
Entity search components:
1] Entity Extraction: The entity extraction component takes a text document as input, analyzes it, and outputs all the mentions in the document of entities from the given structured database.
2] Entity Retrieval: The task of entity retrieval is to look up the DocIndex for a given document identifier and retrieve the entities extracted from the document along with their positions.
3] Entity Ranker: Ranks the set of all entities returned by the entity retrieval component.
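The entity retrieval step above can be sketched as a lookup keyed by document identifier. This is a minimal illustration that assumes the DocIndex can be modeled as an in-memory mapping; the class and method names (`DocIndex`, `add_document`, `retrieve`) are hypothetical and not taken from the paper.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# Hypothetical in-memory stand-in for the DocIndex: document id -> list of
# (entity name, position of the mention in the document).
Mention = Tuple[str, int]


class DocIndex:
    def __init__(self) -> None:
        self._mentions: Dict[str, List[Mention]] = defaultdict(list)

    def add_document(self, doc_id: str, mentions: List[Mention]) -> None:
        """Store the output of entity extraction for one document."""
        self._mentions[doc_id].extend(mentions)

    def retrieve(self, doc_id: str) -> List[Mention]:
        """Entity retrieval: return the entities extracted from a document."""
        return list(self._mentions.get(doc_id, []))


index = DocIndex()
index.add_document("d1", [("Xbox 360", 2)])
index.add_document("d2", [("PlayStation 3", 1)])
```

Because extraction results are stored ahead of time, the query-time work in this sketch is a single lookup per returned document.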
3. ENTITY EXTRACTION
Extraction proceeds in two steps:
3.1] Lookup-driven extraction
3.2] Entity mention classification
3.1 Lookup-Driven Extraction
Definition 1. (Extraction from an input document) An extraction from a document d with respect to the reference set Є is the set {m1, ..., mn} of all candidate mentions in d of the entities in Є.
Example 1. Consider the reference entity table and the documents in the figure. The extraction from d1 and d2 with respect to the reference table is {(d1, “Xbox 360”, 2), (d2, “PlayStation 3”, 1)}.
A] Approximate Match:
1] Consider a set of consumer and electronics devices. In many documents, users may refer to the entity
   i] Canon EOS Digital Rebel XTi SLR Camera by writing “Canon XTI” or “EOS XTI SLR”, and
   ii] Sony Vaio F150 Laptop by writing “Vaio F150” or “Sony F150”.
2] Requiring substrings in these documents to match entity names in the reference table exactly would cause these product mentions to be missed.
3] So it is very important to consider approximate matches between substrings in a document and an entity name in a reference table.
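One simple way to picture approximate matching is a token-subset test: a document substring counts as a candidate mention if all of its tokens occur in the entity name. This is only an illustrative sketch, not the matching function used in the paper; the helper names and the `min_tokens` threshold are assumptions.

```python
from typing import List, Set


def tokens(text: str) -> List[str]:
    """Lowercase whitespace tokenization (an assumption of this sketch)."""
    return text.lower().split()


def is_approximate_match(candidate: str, entity_name: str, min_tokens: int = 2) -> bool:
    """True if every token of the candidate occurs in the entity name.

    min_tokens is a hypothetical threshold that keeps single, ambiguous
    tokens such as "Sony" from matching on their own.
    """
    cand = tokens(candidate)
    entity_tokens: Set[str] = set(tokens(entity_name))
    return len(cand) >= min_tokens and all(t in entity_tokens for t in cand)


CANON = "Canon EOS Digital Rebel XTi SLR Camera"
SONY = "Sony Vaio F150 Laptop"
```

With these definitions, “Canon XTI” and “EOS XTI SLR” both match the Canon entry, and “Vaio F150” and “Sony F150” both match the Sony entry, while an exact-match lookup would miss all four.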
B] Memory Requirement:
We assume that the trie over all entities in the reference set fits in main memory. E.g., even when the number of entities (product names) runs up to 10 million, we observed that the memory consumption of the trie is still less than 100 MB. For cases where the trie may be larger than the available memory, we can consider sophisticated data structures such as compressed full-text indices. We can also consider approximate lookup structures such as Bloom filters.
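A minimal sketch of lookup-driven extraction over an in-memory trie, assuming a token-level trie, exact matching, and whitespace tokenization without punctuation handling. The names `TrieNode`, `build_trie` and `extract` are hypothetical, and positions here are zero-based token offsets, which may differ from the position convention in the slides.

```python
from typing import Dict, List, Optional, Tuple


class TrieNode:
    def __init__(self) -> None:
        self.children: Dict[str, "TrieNode"] = {}
        self.entity: Optional[str] = None  # set when a full entity name ends here


def build_trie(entity_names: List[str]) -> TrieNode:
    """Build a token-level trie over the reference set."""
    root = TrieNode()
    for name in entity_names:
        node = root
        for token in name.lower().split():
            node = node.children.setdefault(token, TrieNode())
        node.entity = name
    return root


def extract(doc_text: str, root: TrieNode) -> List[Tuple[str, int]]:
    """Return (entity name, token position) for every candidate mention."""
    doc_tokens = doc_text.lower().split()
    mentions: List[Tuple[str, int]] = []
    for start in range(len(doc_tokens)):
        node: Optional[TrieNode] = root
        for token in doc_tokens[start:]:
            node = node.children.get(token)
            if node is None:
                break  # no entity name continues with this token
            if node.entity is not None:
                mentions.append((node.entity, start))
    return mentions


trie = build_trie(["Xbox 360", "PlayStation 3"])
```

Each starting position walks the trie only as far as some entity name still matches, so the scan stays close to linear in document length when entity names are short.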
3.2 Entity Mention Classification
1] Here we consider two kinds of mentions: a) true mentions; b) false mentions.
2] E.g. the string “Pi” can refer to either the movie or the mathematical constant. Similarly, an occurrence of the string “Pretty Woman” may or may not refer to the film of the same name.
3] Here Є is a set of entities from known entity classes such as products, actors and movies. If a candidate mention refers to an entity e, it is a true mention of e; otherwise it is a false mention.
We now discuss two important issues for training and applying these classifiers.
a] Generating training data:
1] Training data is generated from Wikipedia (e.g., http://en.wikipedia.org/wiki/-List_of_painters_by_name).
2] If the candidate mention is linked to the Wikipedia page of the entity, then it is a true mention. If it is linked to a different page within Wikipedia, it is a false mention.
b] Feature sets: We use the following 4 types of features:
1] Document Level Features: For example, movies tend to occur in documents about entertainment, and writers in documents about literature. To capture this, we trained the classifier using the 30K n-gram features occurring most frequently in our training set.
2] Entity Context Features: Using these features can increase the accuracy of our classifiers.
3] Entity Token Frequency Features: Token frequency is an important feature for entity mention classification. For example, the movie title ‘The Others’ is a common English phrase, likely to appear in documents without referring to the movie of the same name.
4] Entity Token Length Features: Short strings and strings with few words are less likely to refer to entities than longer ones. Therefore, we include the number of words in an entity string and its length in characters as features.
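The Wikipedia labeling rule and the entity token length features can be sketched in a few lines. The function names are hypothetical, the page titles in the usage note are illustrative, and treating unlinked mentions as unlabeled is an assumption that the slides do not address.

```python
from typing import Dict, Optional


def label_from_link(link_target: Optional[str], entity_page: str) -> Optional[bool]:
    """Label a candidate mention found in a Wikipedia article.

    True  -> the mention links to the entity's own page (true mention)
    False -> the mention links to a different Wikipedia page (false mention)
    None  -> the mention is not linked; leaving it unlabeled is an assumption
    """
    if link_target is None:
        return None
    return link_target == entity_page


def length_features(entity_string: str) -> Dict[str, int]:
    """Entity token length features: word count and character length."""
    return {
        "num_words": len(entity_string.split()),
        "num_chars": len(entity_string),
    }
```

For instance, a mention of “Pi” that links to an illustrative page title `Pi_(film)` would be labeled a true mention of the movie, while one linking to `Pi` would be labeled false; `length_features("The Others")` gives two words and ten characters.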
4. ENTITY RANKING
The task of the entity ranker is to rank the set of entities.
1. Aggregation: The higher the number of mentions, the more relevant the entity.
2. Proximity: An entity mentioned near the query keywords is likely to be more relevant to the query than an entity mentioned farther away from them.
3. Document importance: An entity mentioned in a document with higher relevance to the query and/or higher static rank is likely to be more relevant than one mentioned in a less relevant and/or lower static rank document.
D: set of top N documents returned by the search engine.
De: set of documents mentioning an entity e.
imp(d): the importance of a document d.
score(d, e): e's score from the document d.
The score of an entity e is computed from imp(d) and score(d, e) over the documents d in De. If an entity is not mentioned in any document, its score is 0 by this definition.
Importance of a Document imp(d): Partition the list of top N documents into ranges of equal size B. For each document d, we compute the range range(d) the document d is in: range(d) = ⌈rank(d) / B⌉, where rank(d) denotes the document's rank among the web search results. The importance imp(d) of d is then defined in terms of range(d).
Score from a Document score(d, e): The score score(d, e) an entity e gets from a single document d depends on
(i) whether e is mentioned in the document d, and
(ii) the proximity between the query keywords and the mentions of e in the document d.
We define score(d, e) as follows:
score(d, e) = 0, if e is not mentioned in d
            = wa, if e occurs in the snippet
            = wb, otherwise
where wa and wb are two constants:
wa: the score an entity gets from a document when it occurs in the snippet,
wb: the score when it does not.
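The ranking ingredients above can be combined into a small sketch. Two parts are assumptions rather than content recovered from the slides: the importance function uses an NDCG-style discount, log(2) / log(1 + range(d)), and the per-document scores are aggregated as a sum weighted by imp(d). The values chosen for wa and wb in the usage note, and all names, are hypothetical.

```python
import math
from dataclasses import dataclass, field
from typing import Dict, List, Set, Tuple


@dataclass
class Doc:
    rank: int  # 1-based rank among the web search results
    snippet_entities: Set[str] = field(default_factory=set)
    other_entities: Set[str] = field(default_factory=set)  # mentioned outside the snippet


def doc_range(rank: int, bucket_size: int) -> int:
    """range(d): index of the size-B bucket that the document falls in."""
    return math.ceil(rank / bucket_size)


def imp(rank: int, bucket_size: int) -> float:
    """ASSUMPTION: NDCG-style discount standing in for the slide's formula."""
    return math.log(2) / math.log(1 + doc_range(rank, bucket_size))


def score_from_doc(doc: Doc, entity: str, wa: float, wb: float) -> float:
    """score(d, e): wa if e occurs in the snippet, wb if elsewhere, else 0."""
    if entity in doc.snippet_entities:
        return wa
    if entity in doc.other_entities:
        return wb
    return 0.0


def rank_entities(docs: List[Doc], bucket_size: int, wa: float, wb: float) -> List[Tuple[str, float]]:
    """ASSUMPTION: an entity's score is the imp(d)-weighted sum of score(d, e)."""
    totals: Dict[str, float] = {}
    for doc in docs:
        for entity in doc.snippet_entities | doc.other_entities:
            totals[entity] = totals.get(entity, 0.0) + imp(doc.rank, bucket_size) * score_from_doc(doc, entity, wa, wb)
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)
```

With three documents where entity A appears in the snippets of ranks 1 and 3 and entity B appears only outside the snippets of ranks 1 and 2, calling `rank_entities(docs, bucket_size=1, wa=1.0, wb=0.3)` places A first, reflecting both aggregation and proximity.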
5. EXPERIMENTAL EVALUATION
We present the results of an extensive empirical study to evaluate the approach proposed in this paper.
Superiority over silo-ed search: The integrated entity search approach overcomes the limitation of silo-ed search over entity databases.
Effectiveness of entity ranking: Ranking based on aggregation and query term proximity produces high quality results.
Low overhead: The approach imposes lower overhead on query time than other architectures proposed for entity search.
Accuracy of entity extraction: The extraction is accurate in identifying the true entity mentions in web pages.
5.1 Comparison with Silo-ed Search
1] We compare our integrated entity search approach with silo-ed search on a product database containing names and descriptions of 187,606 Consumer and Electronics products.
2] The database is set up by storing the product information in a table in Microsoft SQL Server 2008 (one row per product).
3] A text column is indexed using SQL Server Integrated Full Text Search engine (iFTS).
4] Given a keyword query, we issue the corresponding AND query on that column using iFTS and return the results.
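To see why an AND query over product text can come back empty, here is a plain-Python stand-in for the silo-ed baseline. It is not iFTS and does not reproduce SQL Server behavior; the function name `and_query` and the two product rows are hypothetical.

```python
from typing import List


def and_query(query: str, rows: List[str]) -> List[str]:
    """Return the rows that contain every keyword of the query."""
    keywords = query.lower().split()
    results = []
    for row in rows:
        row_tokens = set(row.lower().split())
        if all(keyword in row_tokens for keyword in keywords):
            results.append(row)
    return results


# Hypothetical product rows (one string per product).
products = [
    "Sony Vaio F150 Laptop",
    "Canon EOS Digital Rebel XTi SLR Camera",
]
```

A query such as "sony laptop" matches the first row, but "lightweight laptop for travel" returns nothing because no single row contains all four keywords, even though web documents about such laptops may well mention the product.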
We compared the number of results returned by the two approaches for the 316 queries. Integrated entity search returned answers for most queries (98.4% of the queries), whereas silo-ed search often returned empty results (about 36.4% of the queries).
5.2 Entity Ranking
We evaluate the quality of our entity ranking on 3 datasets:
1] TREC Question Answering benchmark: factoid questions from the TREC Question Answering track for years 1999, 2000 and 2001.
2] IMDB Data: For example, for the movie “The Godfather”, the associated people include Francis Ford Coppola, Mario Puzo, Marlon Brando, etc. We treat the movie name as the query and the associated people as ground truth for the query. We used 250 such queries, with an average of 16.3 answers per query. The figure shows the precision-recall trade-off for our integrated entity search technique. We vary the precision-recall trade-off by varying the number K of top-ranked answers returned by the integrated entity search approach (K = 1, 3, 5, 10, 20 and 30).
Using the snippet (i.e., taking proximity into account) improves the quality significantly: the precision improved from 67% to 86%, from 70% to 81% and from 71% to 79% for K = 1, 3 and 5 respectively. MRR improves from 0.8 to 0.91.
3] Wikipedia Data: We extracted 115 queries and their ground truth from Wikipedia.
The figure shows the precision-recall trade-off for our integrated entity search technique for various values of K (K = 1, 3, 5, 10, 20 and 30). The integrated entity search approach produces high quality results for this collection as well: the precision is 61% and 42% for the top 1 and 3 answers respectively. Again, using the snippets significantly improves quality: the precision improved from 54% to 61% and from 35% to 42% for K = 1 and 3 respectively. Furthermore, the MRR improves from 0.67 to 0.74.
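For reference, precision at K and MRR can be computed as in the sketch below. It assumes precision at K is the fraction of the top K returned answers found in the ground truth, and that a query with no correct answer contributes 0 to MRR; the slides do not spell out either convention, and the function names are hypothetical.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], truth: Set[str], k: int) -> float:
    """Fraction of the top-k answers that appear in the ground truth."""
    hits = sum(1 for answer in ranked[:k] if answer in truth)
    return hits / k


def reciprocal_rank(ranked: List[str], truth: Set[str]) -> float:
    """1 / rank of the first correct answer, or 0.0 if there is none."""
    for position, answer in enumerate(ranked, start=1):
        if answer in truth:
            return 1.0 / position
    return 0.0


def mean_reciprocal_rank(runs: List[List[str]], truths: List[Set[str]]) -> float:
    """Average reciprocal rank over all queries."""
    return sum(reciprocal_rank(r, t) for r, t in zip(runs, truths)) / len(runs)


godfather_truth = {"Francis Ford Coppola", "Mario Puzo", "Marlon Brando"}
godfather_ranked = ["Marlon Brando", "Unrelated Person", "Mario Puzo"]
```

In this toy run, the top answer is correct and two of the top three are correct, so precision is 1.0 at K = 1 and 2/3 at K = 3.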
5.3 Overhead of our Approach
Query Time Overhead: The overhead is less than 0.1 milliseconds, which is negligible compared to end-to-end query times of web search engines. We observe that even for the commerce queries on the product database, where the aggregation and ranking costs are relatively high, the overhead remains below 0.1 ms.
Extraction Overhead: The extraction time consists of 3 components:
1) the time to scan and decompress the crawled documents,
2) the time to identify all candidate mentions using lookup-driven extraction,
3) the time to perform entity mention classification.
We measured these times for a web sample of 30310 documents (250 MB of document text). The times for the individual components are 18306 ms, 51495 ms and 20723 ms respectively. The total extraction time is the sum of these three times: 90524 ms. This puts the average extraction overhead at less than 3 ms per web document.
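The per-document figure follows directly from the reported numbers, as this short check shows. The variable names are hypothetical; the values are the ones stated above.

```python
# Component times reported for the 30310-document sample, in milliseconds.
scan_ms = 18306      # scan and decompress the crawled documents
lookup_ms = 51495    # lookup-driven extraction of candidate mentions
classify_ms = 20723  # entity mention classification
num_docs = 30310

total_ms = scan_ms + lookup_ms + classify_ms
per_doc_ms = total_ms / num_docs

print(f"total: {total_ms} ms, average per document: {per_doc_ms:.2f} ms")
```

The total is 90524 ms, which works out to roughly 2.99 ms per document, consistent with the stated bound of less than 3 ms.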
5.4 Accuracy of Entity Classification
1] The authors use a linear Support Vector Machine (SVM) trained using Sequential Minimal Optimization as the classifier.
2] They measured the accuracy of the classifier with regard to differentiating true mentions from false ones among the set of all candidates.
3] 54.2% of all entity candidates in the test data are true mentions.
6. CONCLUSION
1. Establishing and exploiting relationships between web search results and structured entity databases significantly enhances the effectiveness of search on these structured databases.
2. The proposed architecture uses existing search engine components in order to efficiently implement the integrated entity search functionality.
3. The approach adds very little space and time overhead to current search engines while returning high quality results from the given structured databases.
THANK  YOU

Más contenido relacionado

La actualidad más candente

International Journal of Computational Engineering Research(IJCER)
 International Journal of Computational Engineering Research(IJCER)  International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER) ijceronline
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET Journal
 
Email Data Cleaning
Email Data CleaningEmail Data Cleaning
Email Data Cleaningfeiwin
 
Ijarcet vol-2-issue-4-1357-1362
Ijarcet vol-2-issue-4-1357-1362Ijarcet vol-2-issue-4-1357-1362
Ijarcet vol-2-issue-4-1357-1362Editor IJARCET
 
Context-Based Diversification for Keyword Queries over XML Data
Context-Based Diversification for Keyword Queries over XML DataContext-Based Diversification for Keyword Queries over XML Data
Context-Based Diversification for Keyword Queries over XML Data1crore projects
 
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...rahulmonikasharma
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Foldersfeiwin
 
Automatic document clustering
Automatic document clusteringAutomatic document clustering
Automatic document clusteringIAEME Publication
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articlesijma
 
Udd for multiple web databases
Udd for multiple web databasesUdd for multiple web databases
Udd for multiple web databasessabhadakwan
 
Feature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsFeature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsSease
 

La actualidad más candente (16)

International Journal of Computational Engineering Research(IJCER)
 International Journal of Computational Engineering Research(IJCER)  International Journal of Computational Engineering Research(IJCER)
International Journal of Computational Engineering Research(IJCER)
 
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
IRJET- Empower Syntactic Exploration Based on Conceptual Graph using Searchab...
 
Bi4101343346
Bi4101343346Bi4101343346
Bi4101343346
 
Ir 02
Ir   02Ir   02
Ir 02
 
Focused Crawling System based on Improved LSI
Focused Crawling System based on Improved LSIFocused Crawling System based on Improved LSI
Focused Crawling System based on Improved LSI
 
Email Data Cleaning
Email Data CleaningEmail Data Cleaning
Email Data Cleaning
 
Ijarcet vol-2-issue-4-1357-1362
Ijarcet vol-2-issue-4-1357-1362Ijarcet vol-2-issue-4-1357-1362
Ijarcet vol-2-issue-4-1357-1362
 
Context-Based Diversification for Keyword Queries over XML Data
Context-Based Diversification for Keyword Queries over XML DataContext-Based Diversification for Keyword Queries over XML Data
Context-Based Diversification for Keyword Queries over XML Data
 
LeVan, "Search Web Services"
LeVan, "Search Web Services"LeVan, "Search Web Services"
LeVan, "Search Web Services"
 
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
Context based Document Indexing and Retrieval using Big Data Analytics - A Re...
 
Scalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large FoldersScalable Discovery Of Hidden Emails From Large Folders
Scalable Discovery Of Hidden Emails From Large Folders
 
Automatic document clustering
Automatic document clusteringAutomatic document clustering
Automatic document clustering
 
A Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia ArticlesA Document Exploring System on LDA Topic Model for Wikipedia Articles
A Document Exploring System on LDA Topic Model for Wikipedia Articles
 
Udd for multiple web databases
Udd for multiple web databasesUdd for multiple web databases
Udd for multiple web databases
 
Feature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text CollectionsFeature Extraction for Large-Scale Text Collections
Feature Extraction for Large-Scale Text Collections
 
In3415791583
In3415791583In3415791583
In3415791583
 

Destacado

Budowa komputera (1)
Budowa komputera (1)Budowa komputera (1)
Budowa komputera (1)dariusz1235
 
Budowa komputera (1)
Budowa komputera (1)Budowa komputera (1)
Budowa komputera (1)dariusz1235
 
Budowa komputera
Budowa komputera Budowa komputera
Budowa komputera dariusz1235
 
Decoding A Culture: Beginning Hang Gliders
Decoding A Culture: Beginning Hang GlidersDecoding A Culture: Beginning Hang Gliders
Decoding A Culture: Beginning Hang GlidersAlexis Cristine Walden
 
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215Trần Danh Nam
 
Health & safety presentation
Health & safety presentationHealth & safety presentation
Health & safety presentationPhalinng
 
Удаленные юзабилити-тестирования
Удаленные юзабилити-тестированияУдаленные юзабилити-тестирования
Удаленные юзабилити-тестированияMaria Vasilieva
 
Success Starts With Articles
Success Starts With ArticlesSuccess Starts With Articles
Success Starts With ArticlesBryan Johnson
 
2 F Test
2 F Test2 F Test
2 F Test2FRESH
 

Destacado (17)

Budowa komputera (1)
Budowa komputera (1)Budowa komputera (1)
Budowa komputera (1)
 
Teaching 1
Teaching 1Teaching 1
Teaching 1
 
I Love Africa
I Love  AfricaI Love  Africa
I Love Africa
 
流動的世界 Part 2
流動的世界 Part 2流動的世界 Part 2
流動的世界 Part 2
 
11 litopad
11 litopad11 litopad
11 litopad
 
Budowa komputera (1)
Budowa komputera (1)Budowa komputera (1)
Budowa komputera (1)
 
流動的世界 part1
流動的世界 part1流動的世界 part1
流動的世界 part1
 
Budowa komputera
Budowa komputera Budowa komputera
Budowa komputera
 
Decoding A Culture: Beginning Hang Gliders
Decoding A Culture: Beginning Hang GlidersDecoding A Culture: Beginning Hang Gliders
Decoding A Culture: Beginning Hang Gliders
 
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
Nghien cuu-chuan-ket-noi-khong-day-zigbeeieee-80215
 
Teaching 1
Teaching 1Teaching 1
Teaching 1
 
Health & safety presentation
Health & safety presentationHealth & safety presentation
Health & safety presentation
 
Удаленные юзабилити-тестирования
Удаленные юзабилити-тестированияУдаленные юзабилити-тестирования
Удаленные юзабилити-тестирования
 
Success Starts With Articles
Success Starts With ArticlesSuccess Starts With Articles
Success Starts With Articles
 
Hr Professional
Hr ProfessionalHr Professional
Hr Professional
 
Photo Bar
Photo BarPhoto Bar
Photo Bar
 
2 F Test
2 F Test2 F Test
2 F Test
 

Similar a Exploiting web search engines to search structured

IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure DataIRJET Journal
 
Presentation dual inversion-index
Presentation dual inversion-indexPresentation dual inversion-index
Presentation dual inversion-indexmahi_uta
 
software_engg-chap-03.ppt
software_engg-chap-03.pptsoftware_engg-chap-03.ppt
software_engg-chap-03.ppt064ChetanWani
 
VRE Cancer Imaging BL RIC Workshop 22032011
VRE Cancer Imaging BL RIC Workshop 22032011VRE Cancer Imaging BL RIC Workshop 22032011
VRE Cancer Imaging BL RIC Workshop 22032011djmichael156
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...Computer Science Journals
 
IST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxIST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxpriestmanmable
 
IRJET- Review on Information Retrieval for Desktop Search Engine
IRJET-  	  Review on Information Retrieval for Desktop Search EngineIRJET-  	  Review on Information Retrieval for Desktop Search Engine
IRJET- Review on Information Retrieval for Desktop Search EngineIRJET Journal
 
Car Showroom management System in c++
Car Showroom management System in c++Car Showroom management System in c++
Car Showroom management System in c++PUBLIVE
 
IRJET- Foster Hashtag from Image and Text
IRJET-  	  Foster Hashtag from Image and TextIRJET-  	  Foster Hashtag from Image and Text
IRJET- Foster Hashtag from Image and TextIRJET Journal
 
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB MahmoudOHassouna
 
Team G
Team GTeam G
Team Gbutest
 
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...ijcseit
 
IRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET Journal
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State Universitydhabalia
 
Safeguarding Abila: Discovering Evolving Activist Networks
Safeguarding Abila: Discovering Evolving Activist NetworksSafeguarding Abila: Discovering Evolving Activist Networks
Safeguarding Abila: Discovering Evolving Activist NetworksParang Saraf
 
2.business object repository
2.business object repository2.business object repository
2.business object repositoryAjay Kumar ☁
 

Similar a Exploiting web search engines to search structured (20)

IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure DataIRJET-  	  On-AIR Based Information Retrieval System for Semi-Structure Data
IRJET- On-AIR Based Information Retrieval System for Semi-Structure Data
 
Presentation dual inversion-index
Presentation dual inversion-indexPresentation dual inversion-index
Presentation dual inversion-index
 
Web clustring engine
Web clustring engineWeb clustring engine
Web clustring engine
 
software_engg-chap-03.ppt
software_engg-chap-03.pptsoftware_engg-chap-03.ppt
software_engg-chap-03.ppt
 
VRE Cancer Imaging BL RIC Workshop 22032011
VRE Cancer Imaging BL RIC Workshop 22032011VRE Cancer Imaging BL RIC Workshop 22032011
VRE Cancer Imaging BL RIC Workshop 22032011
 
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
An Efficient Annotation of Search Results Based on Feature Ranking Approach f...
 
IST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docxIST365 - Project Deliverable #3Create the corresponding relation.docx
IST365 - Project Deliverable #3Create the corresponding relation.docx
 
IRJET- Review on Information Retrieval for Desktop Search Engine
IRJET-  	  Review on Information Retrieval for Desktop Search EngineIRJET-  	  Review on Information Retrieval for Desktop Search Engine
IRJET- Review on Information Retrieval for Desktop Search Engine
 
F04302053057
F04302053057F04302053057
F04302053057
 
E017252831
E017252831E017252831
E017252831
 
Car Showroom management System in c++
Car Showroom management System in c++Car Showroom management System in c++
Car Showroom management System in c++
 
uml.pptx
uml.pptxuml.pptx
uml.pptx
 
IRJET- Foster Hashtag from Image and Text
IRJET-  	  Foster Hashtag from Image and TextIRJET-  	  Foster Hashtag from Image and Text
IRJET- Foster Hashtag from Image and Text
 
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
Murach': HOW TO DEVELOP A DATA DRIVEN MVC WEB
 
Team G
Team GTeam G
Team G
 
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
META SEARCH ENGINE WITH AN INTELLIGENT INTERFACE FOR INFORMATION RETRIEVAL ON...
 
IRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema GeneratorIRJET- Automatic Database Schema Generator
IRJET- Automatic Database Schema Generator
 
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State UniversityLSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
LSP ( Logic Score Preference ) _ Rajan_Dhabalia_San Francisco State University
 
Safeguarding Abila: Discovering Evolving Activist Networks
Safeguarding Abila: Discovering Evolving Activist NetworksSafeguarding Abila: Discovering Evolving Activist Networks
Safeguarding Abila: Discovering Evolving Activist Networks
 
2.business object repository
2.business object repository2.business object repository
2.business object repository
 

Último

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...
Neo4j - How KGs are shaping the future of Generative AI at AWS Summit London ...Neo4j
 
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking Men08448380779 Call Girls In Diplomatic Enclave Women Seeking Men
08448380779 Call Girls In Diplomatic Enclave Women Seeking MenDelhi Call girls
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Injustice - Developers Among Us (SciFiDevCon 2024)
Injustice - Developers Among Us (SciFiDevCon 2024)Allon Mureinik
 
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...
Mastering MySQL Database Architecture: Deep Dive into MySQL Shell and MySQL R...Miguel Araújo
 
The Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxThe Codex of Business Writing Software for Real-World Solutions 2.pptx
The Codex of Business Writing Software for Real-World Solutions 2.pptxMalak Abu Hammad
 
Automating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps ScriptAutomating Google Workspace (GWS) & more with Apps Script
Automating Google Workspace (GWS) & more with Apps Scriptwesley chun
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Exploiting web search engines to search structured

  • 1. Exploiting Web Search Engines to Search Structured Databases. Authors: Sanjay Agrawal, Kaushik Chakrabarti, Dong Xin. Publication: WWW 2009, Madrid. Presenter: Nita Pawar
  • 2. OUTLINE 1. INTRODUCTION 2. ARCHITECTURAL OVERVIEW 3. ENTITY EXTRACTION 4. ENTITY RANKING 5. EXPERIMENTAL EVALUATION 6. CONCLUSION
  • 3. 1. INTRODUCTION 1] Users issue queries against web search engines. 2] Many of these queries are also relevant to a structured database. 3] Each structured database is searched individually and the relevant structured data items are returned to the web search engine. The search engine gathers the structured search results and displays them alongside the web search results. 4] The result from the structured database search is independent of the result from the web search; this is called silo-ed search. This approach works well for some queries. 5] However, there is a broad class of queries where this approach would return incomplete or even empty results. 6] The documents returned by a web search engine are likely to mention products that are relevant to the user query. 7] This paper proposes using the evidence that each individual returned document provides about the products mentioned in it, and then aggregating that evidence across the returned documents.
  • 4. 8] Here we also study the task of efficiently and accurately identifying relevant information in a structured database for a web search query.
  • 5. 9] This approach can return entity results for a much wider range of queries compared to silo-ed search. 10] To implement this approach there are two main design goals: (i) be effective for a wide variety of structured data domains, (ii) be integrated with a search engine and exploit it effectively.
  • 9. 3. ENTITY EXTRACTION The extraction is split into two steps: 3.1] Lookup-driven extraction 3.2] Entity mention classification
  • 10. 3.1 Lookup-Driven Extraction Definition 1. (Extraction from an input document) An extraction from a document d with respect to the reference set Є is the set {m1, …, mn} of all candidate mentions in d of the entities in Є. Example 1. Consider the reference entity table and the documents in the figure. The extraction from d1 and d2 with respect to the reference table is {(d1, "Xbox 360", 2), (d2, "PlayStation 3", 1)}.
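Lookup-driven extraction as defined in Definition 1 can be illustrated with a small sketch. It is only an illustration under stated assumptions: the function names (`build_trie`, `extract`), the whitespace tokenization, and the reading of the third tuple component as a token offset are all hypothetical choices, not details confirmed by the slides.

```python
from typing import Dict, List, Tuple

# Sentinel key marking the end of an entity name in the trie.
# It is an object, so it can never collide with a string token.
END = object()


def build_trie(entities: List[str]) -> Dict:
    """Build a token-level trie over lower-cased entity names."""
    root: Dict = {}
    for name in entities:
        node = root
        for token in name.lower().split():
            node = node.setdefault(token, {})
        node[END] = name  # remember the original entity name
    return root


def extract(doc_id: str, text: str, trie: Dict) -> List[Tuple[str, str, int]]:
    """Return (doc_id, entity, token_offset) for every exact mention in text."""
    tokens = text.lower().split()
    mentions = []
    for start in range(len(tokens)):
        node = trie
        for tok in tokens[start:]:
            if tok not in node:
                break
            node = node[tok]
            if END in node:
                mentions.append((doc_id, node[END], start))
    return mentions


# Hypothetical reference set and document text.
reference = ["Xbox 360", "PlayStation 3"]
trie = build_trie(reference)
print(extract("d1", "the new xbox 360 bundle ships with two games", trie))
```

This sketch only finds exact token matches; punctuation handling and approximate matching are left out on purpose.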
  • 11. A] Approximate Match: 1] Consider the set of consumer and electronics devices. In many documents, users may refer to the entity i] Canon EOS Digital Rebel XTi SLR Camera by writing "Canon XTI" or "EOS XTI SLR", and to the entity ii] Sony Vaio F150 Laptop by writing "Vaio F150" or "Sony F150". 2] Insisting that substrings in these documents match entity names in the reference table exactly would cause these product mentions to be missed. 3] So it is very important to consider approximate matches between substrings in a document and an entity name in a reference table.
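To make the idea of approximate matching concrete, here is a minimal sketch using a token-containment rule: a mention matches an entity if it has at least two tokens and all of them occur in the entity name. This rule, the `min_tokens` threshold, and the function names are assumptions chosen for illustration; the slides do not say which similarity function the authors use.

```python
from typing import List, Set


def tokens(s: str) -> Set[str]:
    """Lower-cased whitespace tokens of a string."""
    return set(s.lower().split())


def approx_matches(mention: str, entities: List[str], min_tokens: int = 2) -> List[str]:
    """Entities whose names contain every token of the mention (assumed rule)."""
    m = tokens(mention)
    if len(m) < min_tokens:
        return []  # too short to match safely under this heuristic
    return [e for e in entities if m <= tokens(e)]


reference = ["Canon EOS Digital Rebel XTi SLR Camera", "Sony Vaio F150 Laptop"]
for mention in ["Canon XTI", "EOS XTI SLR", "Vaio F150", "Sony F150"]:
    print(mention, "->", approx_matches(mention, reference))
```

All four abbreviated mentions from the slide resolve to their full entity names, while an exact string comparison would miss every one of them.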
  • 12. B] Memory Requirement: We assume that the trie over all entities in the reference set fits in main memory. E.g., even when the number of entities (product names) runs up to 10 million, we observed that the memory consumption of the trie is still less than 100 MB. For cases where the size of the trie may be larger than the available memory, we can consider more sophisticated data structures such as compressed full-text indices. We can also consider approximate lookup structures such as Bloom filters.
  • 13. 3.2 Entity Mention Classification 1] Here we consider two kinds of mentions: a) true mentions, b) false mentions. 2] E.g., the string "Pi" can refer to either the movie or the mathematical constant. Similarly, an occurrence of the string "Pretty Woman" may or may not refer to the film of the same name. 3] Here Є is a set of entities from known entity classes such as products, actors and movies. If a candidate mention refers to an entity e, it is a true mention of e; otherwise it is a false mention.
  • 14. We now discuss two important issues for training and applying these classifiers. a] Generating training data: 1] We use Wikipedia as the source of training data (e.g., http://en.wikipedia.org/wiki/-List_of_painters_by_name). 2] If the candidate mention is linked to the Wikipedia page of the entity, then it is a true mention. If it is linked to a different page within Wikipedia, it is a false mention. b] Feature sets: We use the following 4 types of features:
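The labeling rule for training data can be written down as a small sketch. The function name `label_mention`, the page identifiers, and the choice to leave unlinked candidates unlabeled are all assumptions made for illustration; the slide only states the two linked cases.

```python
from typing import Optional


def label_mention(link_target: Optional[str], entity_page: str) -> Optional[bool]:
    """Label a candidate mention using the Wikipedia link it carries.

    True  -> linked to the entity's own page (true mention)
    False -> linked to a different Wikipedia page (false mention)
    None  -> not linked at all (assumption: left unlabeled)
    """
    if link_target is None:
        return None
    return link_target == entity_page


# Hypothetical page identifiers for the movie "Pi" and the constant.
print(label_mention("Pi_(film)", "Pi_(film)"))  # linked to the movie page
print(label_mention("Pi", "Pi_(film)"))         # linked to the constant's page
print(label_mention(None, "Pi_(film)"))         # plain text, no link
```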
  • 15. 1] Document Level Features: For example, movies tend to occur in documents about entertainment, and writers in documents about literature. To implement this, we trained the classifier using the 30K n-gram features occurring most frequently in our training set. 2] Entity Context Features: By using this feature we can increase the accuracy of our classifiers. 3] Entity Token Frequency Features: This is an important feature for entity mention classification. For example, the movie title "The Others" is a common English phrase, likely to appear in documents without referring to the movie of the same name. 4] Entity Token Length Features: Short strings and strings with few words are less likely to refer to entities than longer ones. Therefore, we include the number of words in an entity string and its length in characters as features.
  • 16. 4. ENTITY RANKING The task of the entity ranker is to rank the set of entities. 1. Aggregation: The higher the number of mentions, the more relevant the entity. 2. Proximity: An entity mentioned near the query keywords is likely to be more relevant to the query than an entity mentioned far away from them. 3. Document importance: An entity mentioned in a document with higher relevance to the query and/or higher static rank is likely to be more relevant than one mentioned in a less relevant and/or lower static rank document.
  • 17. D: the set of top N documents returned by the search engine. De: the set of documents mentioning an entity e. imp(d): the importance of a document d. score(d, e): e's score from the document d. If an entity is not mentioned in any document, its score is 0 by this definition. Importance of a Document imp(d): Partition the list of top N documents into ranges of equal size B. For each document d, we compute the range range(d) that the document d is in: range(d) = ⌈rank(d)/B⌉, where rank(d) denotes the document's rank among the web search results. We define the importance imp(d) of d to be [formula not preserved in the transcript].
  • 18. Score from a Document score(d, e): The score score(d, e) that an entity e gets from a single document d depends on (i) whether e is mentioned in the document d and (ii) the proximity between the query keywords and the mentions of e in the document d. We define score(d, e) as follows: score(d, e) = 0 if e is not mentioned in d; = wa if e occurs in the snippet; = wb otherwise, where wa and wb are two constants. wa: the score an entity gets from a document when it occurs in the snippet. wb: the score when it does not.
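The ranking signals above (aggregation, proximity via the snippet, document importance) can be combined as in the following sketch. Several pieces are assumptions, because the transcript does not preserve the formulas: the overall entity score is taken to be the sum of imp(d) · score(d, e) over the documents mentioning e, `imp` is a placeholder that decreases linearly with the range index, and the values of `W_A`, `W_B`, `N` and `B` are made up for the example.

```python
import math
from collections import defaultdict
from typing import Dict, List, Tuple

W_A, W_B = 1.0, 0.5  # assumed constants: snippet score and non-snippet score
N, B = 10, 5         # assumed: top-N documents, range size B


def doc_range(rank: int, b: int = B) -> int:
    """Index of the range a document falls in, counting from 1."""
    return math.ceil(rank / b)


def imp(rank: int, n: int = N, b: int = B) -> float:
    """Placeholder importance (assumption): decreases linearly with the range."""
    num_ranges = math.ceil(n / b)
    return (num_ranges - doc_range(rank, b) + 1) / num_ranges


def score(mentioned: bool, in_snippet: bool) -> float:
    """score(d, e) as defined on the slide."""
    if not mentioned:
        return 0.0
    return W_A if in_snippet else W_B


def rank_entities(docs: List[Dict]) -> List[Tuple[str, float]]:
    """Aggregate imp(d) * score(d, e) over documents and sort descending."""
    totals: Dict[str, float] = defaultdict(float)
    for d in docs:
        for entity, in_snippet in d["mentions"].items():
            totals[entity] += imp(d["rank"]) * score(True, in_snippet)
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


# Hypothetical search results: rank plus {entity: occurs_in_snippet}.
docs = [
    {"rank": 1, "mentions": {"Xbox 360": True, "PlayStation 3": False}},
    {"rank": 2, "mentions": {"Xbox 360": False}},
    {"rank": 7, "mentions": {"PlayStation 3": True}},
]
print(rank_entities(docs))
```

An entity that appears in no document never enters `totals`, which matches the slide's remark that its score is 0.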
  • 19. 5. EXPERIMENTAL EVALUATION We present the results of an extensive empirical study to evaluate the approach proposed in this paper. Superiority over silo-ed search: the entity search approach overcomes the limitation of silo-ed search over an entity database. Effectiveness of entity ranking: ranking based on aggregation and query term proximity produces high quality results. Low overhead: the approach imposes lower overhead on query time compared to other architectures proposed for entity search. Accuracy of entity extraction: the extraction is accurate in identifying the true entity mentions in web pages.
  • 20. 5.1 Comparison with Silo-ed Search 1] We compare our integrated entity search approach with silo-ed search on a product database containing names and descriptions of 187,606 Consumer and Electronics products. 2] We implemented silo-ed search over this database by storing the product information in a table in Microsoft SQL Server 2008 (one row per product). 3] We built a full-text index on a text column using the SQL Server Integrated Full Text Search engine (iFTS). 4] Given a keyword query, we issue the corresponding AND query on that column using iFTS and return the results.
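The behavior of the silo-ed baseline can be imitated with a small in-memory sketch of an AND keyword query. This is a stand-in for illustration only: it does not use SQL Server or iFTS, and the product rows and the function name `and_query` are hypothetical.

```python
from typing import List


def and_query(query: str, rows: List[str]) -> List[str]:
    """Return rows that contain every query keyword (AND semantics)."""
    terms = query.lower().split()
    return [r for r in rows if all(t in r.lower().split() for t in terms)]


# Hypothetical product rows (name plus description in one text column).
products = [
    "Canon EOS Digital Rebel XTi SLR Camera 10.1 megapixel",
    "Sony Vaio F150 Laptop 15 inch",
]

print(and_query("canon slr", products))
print(and_query("lightweight laptop for business travel", products))
```

The second query returns nothing because words such as "lightweight" never occur in the product text, which is the kind of empty result the silo-ed approach produces.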
  • 21. Number of results returned by the two approaches for the 316 queries: integrated entity search returned answers for most queries (98.4% of the queries), whereas silo-ed search often returned empty results (for about 36.4% of the queries).
  • 22. 5.2 Entity Ranking We evaluate the quality of our entity ranking on 3 datasets: 1] TREC Question Answering benchmark: factoid questions from the TREC Question Answering track for the years 1999, 2000 and 2001.
  • 23. 2] IMDB Data: For example, for the movie "The Godfather", the associated people include Francis Ford Coppola, Mario Puzo, Marlon Brando, etc. We treat the movie name as the query and the associated people as ground truth for the query. We used 250 such queries, with an average of 16.3 answers per query. The figure shows the precision-recall tradeoff for our integrated entity search technique. We vary the precision-recall tradeoff by varying the number K of top ranked answers returned by the integrated entity search approach (K = 1, 3, 5, 10, 20 and 30).
  • 24. Using the snippet (i.e., taking proximity into account) improves the quality significantly: the precision improved from 67% to 86%, from 70% to 81% and from 71% to 79% for K = 1, 3 and 5 respectively. MRR improves from 0.8 to 0.91. 3] Wikipedia Data: We extracted 115 queries and their ground truth from Wikipedia.
  • 25. The figure shows the precision-recall tradeoff for our integrated entity search technique for various values of K (K = 1, 3, 5, 10, 20 and 30). The integrated entity search approach produces high quality results for this collection as well: the precision is 61% and 42% for the top 1 and 3 answers respectively. Again, using the snippets significantly improves quality: the precision improved from 54% to 61% and from 35% to 42% for K = 1 and 3 respectively. Furthermore, the MRR improves from 0.67 to 0.74.
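The metrics reported in this section (precision and recall for the top K answers, and MRR) can be computed as in the sketch below. The exact definitions used in the paper are not given in the slides, so the conventions here are assumptions: precision divides by the number of answers actually returned, and the reciprocal rank of a query with no correct answer is 0. The ranked lists and ground truth are made-up examples.

```python
from typing import List, Set


def precision_at_k(ranked: List[str], truth: Set[str], k: int) -> float:
    """Fraction of the top-k returned answers that are correct."""
    top = ranked[:k]
    return sum(1 for e in top if e in truth) / len(top) if top else 0.0


def recall_at_k(ranked: List[str], truth: Set[str], k: int) -> float:
    """Fraction of the ground-truth answers found in the top k."""
    return sum(1 for e in ranked[:k] if e in truth) / len(truth) if truth else 0.0


def reciprocal_rank(ranked: List[str], truth: Set[str]) -> float:
    """1 / position of the first correct answer, or 0 if there is none."""
    for pos, e in enumerate(ranked, start=1):
        if e in truth:
            return 1.0 / pos
    return 0.0


# Hypothetical ranker output and ground truth for two queries.
ranked1 = ["Francis Ford Coppola", "Steven Spielberg", "Mario Puzo"]
truth1 = {"Francis Ford Coppola", "Mario Puzo", "Marlon Brando"}
ranked2 = ["wrong answer", "right answer"]
truth2 = {"right answer"}

print(precision_at_k(ranked1, truth1, 1), precision_at_k(ranked1, truth1, 3))
print(recall_at_k(ranked1, truth1, 3))
mrr = (reciprocal_rank(ranked1, truth1) + reciprocal_rank(ranked2, truth2)) / 2
print(mrr)
```

Raising K can only keep or raise recall while precision typically drops, which is the tradeoff the figures in this section plot.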
  • 26. 5.3 Overhead of our Approach Query Time Overhead: The overhead is less than 0.1 milliseconds, which is negligible compared to end-to-end query times of web search engines. We observe that even for the commerce queries on the product database, where the aggregation and ranking costs are relatively high, the overhead remains below 0.1 ms.
  • 27. Extraction Overhead: The extraction time consists of 3 components: 1) the time to scan and decompress the crawled documents, 2) the time to identify all candidate mentions using lookup-driven extraction, 3) the time to perform entity mention classification. We measured the above times for a web sample of 30310 documents (250 MB of document text). The times for the individual components are 18306 ms, 51495 ms and 20723 ms respectively. The total extraction time is the sum of the above three times: 90524 ms. This puts the average extraction overhead at less than 3 ms per web document.
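The per-document figure follows directly from the numbers on this slide. The short check below only redoes that arithmetic; the variable names are chosen for illustration.

```python
# Component times in milliseconds, as reported on the slide.
scan_ms, lookup_ms, classify_ms = 18306, 51495, 20723
num_docs = 30310

total_ms = scan_ms + lookup_ms + classify_ms
per_doc_ms = total_ms / num_docs

print(total_ms)              # 90524
print(round(per_doc_ms, 2))  # 2.99
```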
  • 28. 5.4 Accuracy of Entity Classification 1] The authors use a linear Support Vector Machine (SVM) trained using Sequential Minimal Optimization as the classifier. 2] They measured the accuracy of the classifier with regard to differentiating true mentions from false ones among the set of all candidates. 3] 54.2% of all entity candidates in the test data are true mentions.
  • 29. 6. CONCLUSION 1. Establishing and exploiting relationships between web search results and structured entity databases significantly enhances the effectiveness of search on these structured databases. 2. The paper presents an architecture that leverages existing search engine components to efficiently implement the integrated entity search functionality. 3. The approach adds very little space and time overhead to current search engines while returning high quality results from the given structured databases.