Mining Product Synonyms - Slides
1. Mining Product Synonyms
Information Retrieval and Extraction
Project On
Presented By :
Vrishank Shete(201305642)
Mohd. Salman Khan(201305513)
Ankush Jain(201101010)
Suprabh Shukla(201001082)
Guided By :
Priya Radhakrishnan
Computer Science and Engineering
International Institute of Information Technology
2. Introduction
Problem Statement: Given an entity query, find the canonical
terms by which the entity can be distinguished.
Forms of web queries on structured data.
Gap between how users phrase queries and how creators describe entities.
E.g. a user may query "Harry Potter 6" when searching for
"Harry Potter and the Half-Blood Prince".
3. Related Works
String Similarity Measures:
◦ Levenshtein String Similarity function.
◦ Dice Coefficient.
◦ Jaccard String Similarity function.
Exploiting Web Search to Generate Synonyms for
Entities, by Surajit Chaudhuri, Venkatesh Ganti, Dong Xin
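The string similarity measures listed above can be sketched in Python as follows. This is a minimal illustration, not the project's code: the function names, the whitespace tokenisation used for Jaccard and Dice, and the length-based normalisation of the Levenshtein score are all assumptions.

```python
def levenshtein_distance(a: str, b: str) -> int:
    """Minimum number of single-character edits that turn a into b."""
    if len(a) < len(b):
        a, b = b, a
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Distance normalised to [0, 1]; 1.0 means identical (assumed normalisation)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein_distance(a, b) / longest


def tokens(s: str) -> set:
    """Lower-cased whitespace tokens (assumed tokenisation)."""
    return set(s.lower().split())


def jaccard(a: str, b: str) -> float:
    """Size of the token intersection over size of the token union."""
    ta, tb = tokens(a), tokens(b)
    union = ta | tb
    return len(ta & tb) / len(union) if union else 1.0


def dice(a: str, b: str) -> float:
    """Twice the token intersection over the sum of the token set sizes."""
    ta, tb = tokens(a), tokens(b)
    total = len(ta) + len(tb)
    return 2 * len(ta & tb) / total if total else 1.0
```

For the example from the introduction, `jaccard("Harry Potter 6", "Harry Potter and the Half Blood Prince")` shares 2 of 8 distinct tokens, which shows why a single surface-similarity score is weak evidence on its own.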
4. System Components
Extracting IDTokenSets using documents from web search.
Expanding IDTokenSets using the p-Window context.
Searching for possible canonical names in a pre-crawled list.
Validating canonical names against web documents.
5. Algorithm
1: Let Le = Pe; // all subsets of e
2: while (Le is not empty)
3:   Te = getnext(Le);
4:   Submit Te to W, and retrieve W(Te);
5:   if (corr(Te, e, W(Te)) ≥ θ)   // Te is an IDTokenSet
6:     Report Te and all its supersets as IDTokenSets;
7:     Remove Te and all its supersets from Le;
8:   else                          // Te is not an IDTokenSet
9:     Remove Te and its subsets from Le;
10: return.
Here the correlation function (corr) estimates how important Te is to
the documents W(Te) retrieved for it.
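The pruning loop above can be sketched in Python. This is an illustration under stated assumptions, not the project's implementation: `toy_search` stands in for the web search W, `toy_corr` is a made-up correlation (the fraction of retrieved documents that contain every token of e), and candidates are visited smallest subset first, which the slide leaves unspecified.

```python
from itertools import combinations


def discover_id_token_sets(entity_tokens, search, corr, threshold):
    """Return the subsets of the entity's tokens judged to be IDTokenSets."""
    entity = frozenset(entity_tokens)
    # Le: every non-empty subset of e, smallest first (assumed visiting order).
    candidates = [frozenset(c)
                  for size in range(1, len(entity) + 1)
                  for c in combinations(sorted(entity), size)]
    remaining = set(candidates)
    id_token_sets = set()
    for te in candidates:
        if te not in remaining:
            continue  # already decided through an earlier subset or superset
        documents = search(te)
        if corr(te, entity, documents) >= threshold:
            # Te identifies e, so every superset of Te does as well.
            supersets = {c for c in remaining if te <= c}
            id_token_sets |= supersets
            remaining -= supersets
        else:
            # Te does not identify e, so none of its subsets can either.
            remaining -= {c for c in remaining if c <= te}
    return id_token_sets


# Hypothetical in-memory corpus standing in for web search results.
CORPUS = [
    "harry potter and the half blood prince review",
    "harry potter 6 is harry potter and the half blood prince",
    "prince of persia game",
    "half marathon training",
]


def toy_search(token_set):
    """Documents that contain every token of the query (stand-in for W)."""
    return [d for d in CORPUS if token_set <= set(d.split())]


def toy_corr(token_set, entity, documents):
    """Assumed correlation: share of retrieved documents mentioning all of e."""
    if not documents:
        return 0.0
    hits = sum(1 for d in documents if entity <= set(d.split()))
    return hits / len(documents)
```

With the toy corpus, `discover_id_token_sets(["half", "blood", "prince"], toy_search, toy_corr, 0.7)` accepts `{blood}` and `{half, prince}` together with their supersets, and rejects `{half}` and `{prince}` on their own.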
6. Algorithm (continued)
11. After getting the substrings, we look for evidence in our data set using
the thresholds levenValue <= 0.95, jaccard > 0.10 and dice > 0.20.
12. After filtering in step 3, we filter again using the correlation method
described above. (In step 12 we get all mentions and all strings that match
the mentions. These strings may or may not be canonical names.)
13. We then store the strings in the p-window context of every mention found
in step 12, taken from the search engine results already stored in
steps 1-10.
14. We count the number of times each word occurs across all strings from
step 13.
15. We take the top K words from the count hash and search for them in all
the strings from step 12 (which may or may not be part of canonical names).
16. We match the words from step 15 against the strings from step 12. The
best-matching string is our canonical string and our synonym (the desired
result).
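Steps 13 to 16 can be sketched in Python as below. This is a minimal sketch under assumptions: the window is measured in whitespace-separated words, the "count hash" is a `collections.Counter`, and the best match is the candidate sharing the most top-K words. The function names are hypothetical.

```python
from collections import Counter


def p_window_contexts(document, mention, p):
    """Words within p positions on each side of every occurrence of mention."""
    words = document.lower().split()
    target = mention.lower().split()
    if not target:
        return []
    n = len(target)
    contexts = []
    for i in range(len(words) - n + 1):
        if words[i:i + n] == target:
            left = words[max(0, i - p):i]
            right = words[i + n:i + n + p]
            contexts.append(left + right)
    return contexts


def top_k_words(contexts, k):
    """The k most frequent words over all context windows (step 14 and 15)."""
    counts = Counter(word for context in contexts for word in context)
    return [word for word, _ in counts.most_common(k)]


def best_canonical(candidates, top_words):
    """Candidate string sharing the most words with the top-K list (step 16)."""
    top = set(top_words)
    return max(candidates, key=lambda c: len(top & set(c.lower().split())))
```

A usage example with made-up documents and candidates:

```python
documents = [
    "harry potter 6 half blood prince review",
    "trailer for harry potter 6 the half blood prince",
]
contexts = []
for document in documents:
    contexts.extend(p_window_contexts(document, "harry potter 6", p=3))

candidates = [
    "Harry Potter and the Half Blood Prince",
    "Harry Potter and the Goblet of Fire",
]
print(best_canonical(candidates, top_k_words(contexts, k=2)))
```

A real pipeline would likely drop stop words before counting, since words such as "the" can otherwise crowd out informative ones.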
9. Challenges
Web documents are highly unstructured. The
query string can appear anywhere and in any
form in a document. This case is
handled using the p-Window context in which the
string is expected to appear.
Web search engines do not allow frequent
automated queries at short intervals from a
program. A delay of 2 seconds is introduced
between two queries, which makes searching
somewhat slower but serves our purpose.
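The 2-second delay between queries can be sketched as a small wrapper. The `search` callable and the function name are hypothetical placeholders for whatever search client is used; the `sleep` parameter is only there so the delay can be replaced in tests.

```python
import time


def throttled_search(queries, search, delay_seconds=2.0, sleep=time.sleep):
    """Run search(query) for each query, pausing between consecutive calls."""
    results = {}
    for index, query in enumerate(queries):
        if index > 0:
            sleep(delay_seconds)  # wait before every query except the first
        results[query] = search(query)
    return results
```

A fixed delay is the simplest choice. It keeps the request rate predictable at the cost of total running time growing linearly with the number of queries.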
11. References
Exploiting Web Search to Generate
Synonyms for Entities, by Surajit
Chaudhuri, Venkatesh Ganti, Dong Xin.
Entity Synonyms for Structured Web
Search, by Tao Cheng, Hady W. Lauw, and
Stelios Paparizos.