Slides from my Master's thesis defense. The research was conducted under Prof. Kyu-Young Whang, and the thesis was successfully defended at the Computer Science Dept., KAIST, on December 16, 2010.
Master's Thesis Defense: Improving the Quality of Web Spam Filtering by Using Seed Refinement
1. Improving the Quality of Web Spam Filtering by Using Seed Refinement. Master's Thesis Defense, December 16, 2010. Database and Multimedia Lab, Korea Advanced Institute of Science and Technology (KAIST). Presenter: Qureshi, Muhammad Atif. Advisor: Whang, Kyu-Young.
2.
3.
4.
5.
6.
7.
8.
9.
10.
11.
12.
13.
14.
15.
16.
17.
18.
19.
20.
21.
22.
23. Experimental Parameters (Performance Evaluation), Jan 7, 2011
Table 2: Parameters used in experiments.
Parameter: Description
damp: A parameter used in TR, MTR, ATR, and MATR. It is the probability of following an outlink.
Ratio_top: The ratio for determining the input seed sets in TR, MTR, ATR, and MATR. Specifically, from the Spam (or Non-Spam) Seed Set, we retrieve the domains whose PageRank scores are larger than or equal to the PageRank score of the top-Ratio_top % domain among all domains, and then use those domains as the input seed set.
cutoff_Tr: The cutoff threshold in TR and MTR for declaring the number of non-spam domains. In this thesis, we set the value of cutoff_Tr proportional to the size of the input seed set of non-spam domains.
cutoff_ATr: The cutoff threshold in ATR and MATR for declaring the number of spam domains. In this thesis, we set the value of cutoff_ATr proportional to the size of the input seed set of spam domains.
relativeMass: A threshold used in SM and MSM for deciding whether a domain is spam: if the domain receives an excessively higher spam score compared to its non-spam score, the domain is one of the candidates for Web spam.
topPR: A threshold used in SM and MSM for selecting spam candidates: the PageRank score of the domain must be within the top percentage (i.e., topPR %) of the PageRank scores.
limitBL: A threshold used in LFS and MLFS for declaring a domain as spam if the number of bidirectional links of the domain is equal to or greater than this threshold.
limitOL: A threshold used in LFS and MLFS for declaring a domain as spam if the number of outlinks of the domain pointing to spam domains is equal to or greater than this threshold.
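The Ratio_top rule in Table 2 can be illustrated with a short Python sketch. This is only a sketch under stated assumptions, not the thesis implementation: the function name `select_input_seeds`, the dictionary-based `pagerank` input, the rounding-up of the top-Ratio_top % rank, and the toy domains are all hypothetical.

```python
import math


def select_input_seeds(seed_set, pagerank, ratio_top):
    """Keep the seed domains whose PageRank is at least the score of the
    top-ratio_top% domain among all domains (ratio_top is a percentage)."""
    if not pagerank:
        return set()
    # PageRank scores of all domains, highest first.
    ranked = sorted(pagerank.values(), reverse=True)
    # Rank of the top-ratio_top% domain; rounding up is an assumption.
    k = max(1, math.ceil(len(ranked) * ratio_top / 100))
    k = min(k, len(ranked))
    threshold = ranked[k - 1]
    # Seeds without a known score are treated as 0.0 (an assumption).
    return {d for d in seed_set if pagerank.get(d, 0.0) >= threshold}


# Hypothetical toy data, for illustration only.
pagerank = {"a.com": 0.40, "b.com": 0.30, "c.com": 0.20, "d.com": 0.10}
spam_seeds = {"b.com", "d.com"}
input_spam_seeds = select_input_seeds(spam_seeds, pagerank, ratio_top=50)
```

With this toy data and Ratio_top = 50, the threshold is the score of the second-ranked domain, so only `b.com` remains in the input spam seed set.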
24.
25. Experimental Measure (Performance Evaluation)
Table 5: Description of the measures.
Measure: Description
True positives: The number of domains correctly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]
False positives: The number of domains incorrectly labeled as belonging to the class (i.e., spam or non-spam). [BCD08]
F-measure: The combined representation of precision and recall. Precision, recall [SM86], and F-measure are expressed as follows.
Footnote 1: False negatives are the number of domains incorrectly labeled as not belonging to the class (i.e., spam or non-spam).
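For reference, the measures can be computed as in the Python sketch below. It uses the standard textbook definitions (precision = TP / (TP + FP), recall = TP / (TP + FN), and the balanced F-measure 2PR / (P + R)); whether the thesis uses exactly this balanced form is an assumption, and the function name and the sample counts are hypothetical.

```python
def precision_recall_f(tp, fp, fn):
    """Return (precision, recall, F-measure) from raw counts.

    Assumes the standard definitions and the balanced F-measure;
    the slide's own formulas may differ.
    """
    # Fraction of domains labeled as the class that truly belong to it.
    precision = tp / (tp + fp) if tp + fp else 0.0
    # Fraction of domains truly in the class that were labeled as such.
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    # Harmonic mean of precision and recall.
    f_measure = 2 * precision * recall / (precision + recall)
    return precision, recall, f_measure


# Hypothetical counts, for illustration only.
p, r, f = precision_recall_f(tp=8, fp=2, fn=8)
```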
26.
27.
28.
29. The Best Succession for the Seed Refiner (Performance Evaluation)
Table 6: Comparison for the seed refiner. The table compares true positives and false positives for finding refined non-spam domains and for finding refined spam domains. Three of the four comparisons show identical performance for both successions; the remaining one shows better performance for MATR-MTR compared to MTR-MATR.
Therefore, MATR-MTR is found to be the winner, and hence we select it as the seed refiner.
46. Possible Combinations for Seed Refinement Module (Supplement)
Succession 1 (MATR-MTR): manual spam and non-spam seed domains -> MATR -> manual non-spam domains and refined spam domains -> MTR -> refined spam and non-spam seed domains.
Succession 2 (MTR-MATR): manual spam and non-spam seed domains -> MTR -> manual spam domains and refined non-spam domains -> MATR -> refined spam and non-spam seed domains.
Legend: Seed Refiner (class), MATR and MTR (algorithms), arrows (data flow).
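The two successions differ only in which algorithm runs first and what it hands to the second. A minimal Python sketch of that data flow is given below. Everything in it is an assumption made for illustration: the function names, the call signature `(spam_seeds, nonspam_seeds)`, and above all the two placeholder lambdas, which merely drop overlapping domains and are not the real MATR and MTR algorithms.

```python
def run_succession_1(matr, mtr, manual_spam, manual_nonspam):
    """MATR-MTR: refine the spam seeds first, then the non-spam seeds."""
    # MATR sees the manual spam and non-spam seeds.
    refined_spam = matr(manual_spam, manual_nonspam)
    # MTR sees the manual non-spam seeds and the refined spam seeds.
    refined_nonspam = mtr(refined_spam, manual_nonspam)
    return refined_spam, refined_nonspam


def run_succession_2(matr, mtr, manual_spam, manual_nonspam):
    """MTR-MATR: refine the non-spam seeds first, then the spam seeds."""
    # MTR sees the manual spam and non-spam seeds.
    refined_nonspam = mtr(manual_spam, manual_nonspam)
    # MATR sees the manual spam seeds and the refined non-spam seeds.
    refined_spam = matr(manual_spam, refined_nonspam)
    return refined_spam, refined_nonspam


# Placeholder stand-ins (NOT the real algorithms): each one simply
# removes domains that also appear in the opposite seed set.
placeholder_matr = lambda spam, nonspam: spam - nonspam
placeholder_mtr = lambda spam, nonspam: nonspam - spam

manual_spam = {"s1", "x"}
manual_nonspam = {"n1", "x"}
result_1 = run_succession_1(placeholder_matr, placeholder_mtr,
                            manual_spam, manual_nonspam)
result_2 = run_succession_2(placeholder_matr, placeholder_mtr,
                            manual_spam, manual_nonspam)
```

Even with these trivial placeholders the two orders give different refined seed sets, which is why the order of succession has to be evaluated rather than assumed.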
47. Possible Combinations for Spam Detection Module (Supplement)
Combinations: single algorithm (MLFS, MSM) or a succession (MLFS-MSM, MSM-MLFS).
Succession 1 (MLFS-MSM): refined spam/non-spam seed domains -> MLFS -> spam domains and refined non-spam domains -> MSM -> detected spam domains.
Succession 2 (MSM-MLFS): refined spam/non-spam seed domains -> MSM -> spam domains and refined non-spam domains -> MLFS -> detected spam domains.
Legend: Spam Detector (class), MLFS and MSM (algorithms), arrows (data flow).
48. TR and ATR problem (Supplement)
[Figure: two example link graphs, one for TR and one for ATR. Edge weights shown: 1/2, 1/3, and 5/12.]
TR example: t(i) is the trust score of domain i. t(1)=1, t(2)=1, t(3)=5/6, t(4)=1/3, t(5)=5/12 + …, t(6)=5/12 + …. Domains 5 and 6 are involved in Web spam. Legend: a seed non-spam domain; a domain being considered.
ATR example: at(i) is the anti-trust score of domain i. at(1)=1, at(2)=1, at(3)=5/6, at(4)=1/3, at(5)=5/12, at(6)=5/12 + …, at(7)=5/12 + …. Domains 5, 6, and 7 are non-spam domains. Legend: a seed spam domain; a domain being considered.
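The scores on this slide are consistent with each domain splitting its score equally among its outlinks. The Python sketch below reproduces the arithmetic with exact fractions. The out-degrees (2 for domain 1, 3 for domain 2, 2 for domain 3) are inferred from the edge labels 1/2, 1/3, and 5/12 rather than stated on the slide, no damping is applied, and the helper name `split_score` is hypothetical.

```python
from fractions import Fraction


def split_score(score, out_degree):
    """Share of a domain's score carried by each of its outlinks."""
    return score / out_degree


# Domains 1 and 2 both have score 1 on the slide.
t1 = Fraction(1)
t2 = Fraction(1)

share_from_1 = split_score(t1, 2)   # edge label 1/2 (out-degree assumed)
share_from_2 = split_score(t2, 3)   # edge label 1/3 (out-degree assumed)

# Domain 3 receives one share from each of domains 1 and 2.
t3 = share_from_1 + share_from_2
# Domain 4 receives a single share from domain 2.
t4 = share_from_2
# Each outlink of domain 3 carries half of its score.
share_from_3 = split_score(t3, 2)
```

Here `t3` evaluates to 5/6, `t4` to 1/3, and `share_from_3` to 5/12, matching t(3), t(4), and the 5/12 contributions that reach the downstream domains regardless of whether those domains are spam or non-spam. The same arithmetic applies to the anti-trust scores at(i).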