Smartening the Crowd: Computational Techniques for Improving Human Verification, at SOUPS 2011
©2011 Carnegie Mellon University
Detecting Phishing Websites
• Method 1: Use heuristics
– Unusual patterns in URL, HTML, topology
– Approach is favored by researchers
– High true positives, some false positives
• Method 2: Manually verify
– Approach used by industry blacklists today
(Microsoft, Google, PhishTank)
– Very few false positives, low risk of liability
– Slow, easy to overwhelm
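A toy version of such URL heuristics, as a sketch (the specific checks and thresholds here are my own illustrative assumptions, not the published heuristics the slide alludes to):

```python
import re
from urllib.parse import urlparse

def suspicious_url_score(url):
    """Count simple red flags in a URL (illustrative heuristics only)."""
    host = urlparse(url).hostname or ""
    score = 0
    if re.fullmatch(r"\d{1,3}(\.\d{1,3}){3}", host):
        score += 1        # raw IP address instead of a domain name
    if host.count(".") >= 4:
        score += 1        # unusually deep subdomain nesting
    if "@" in url:
        score += 1        # '@' can disguise the real destination
    if len(url) > 75:
        score += 1        # very long URLs are a weak phishing signal
    return score

print(suspicious_url_score("http://192.0.2.7/paypal.com/login"))   # 1
```

Such rules catch many phish automatically (high true positives) but also flag some legitimate pages, which is the false-positive trade-off above.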
Wisdom of Crowds Approach
• Mechanics of PhishTank
– Submissions require at least 4 votes
and 70% agreement
– Some votes weighted more
• Total stats (Oct 2006 – Feb 2011)
– 1.1M URL submissions from volunteers
– 4.3M votes
– resulting in about 646k identified phish
• Why so many votes for only 646k phish?
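The 4-vote / 70%-agreement rule can be sketched as a small aggregation function (a minimal illustration with names of my own; real PhishTank also weights some votes more, as noted above):

```python
def phishtank_label(phish_votes, legit_votes, min_votes=4, agreement=0.70):
    """Label a submission once enough votes agree: at least `min_votes`
    total and at least `agreement` of them on the same side."""
    total = phish_votes + legit_votes
    if total < min_votes:
        return None                      # still waiting for votes
    if phish_votes / total >= agreement:
        return "phish"
    if legit_votes / total >= agreement:
        return "legitimate"
    return None                          # no consensus; collect more votes

print(phishtank_label(5, 2))   # 5/7 ≈ 71% agreement -> 'phish'
print(phishtank_label(2, 1))   # only 3 votes -> None
```

Submissions with split votes need extra rounds before either side reaches 70%, which is one plausible reason the vote count far exceeds the number of labeled phish.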
Why Care?
• Can improve performance of
human-verified blacklists
– Dramatically reduce time to blacklist
– Improve breadth of coverage
– Offer same or better level of accuracy
• More broadly, new way of improving
performance of crowd for a task
Ways of Smartening the Crowd
• Change the order URLs are shown
– Ex. most recent vs closest to completion
• Change how submissions are shown
– Ex. show one at a time or in groups
• Adjust threshold for labels
– PhishTank is 4 votes and 70%
– Ex. vote weights, algorithm also votes
• Motivating people / allocating work
– Filtering by brand, competitions,
teams of voters, leaderboards
Overview of Our Work
• Crawled unverified submissions from PhishTank over a 2-week period
• Replayed the URLs on MTurk over 2 weeks
– Required participants to play
2 rounds of Anti-Phishing Phil
– Clustered phish by html similarity
– Two cases: phish one at a time, or in a
cluster (not strictly separate conditions)
– Evaluated effectiveness of vote weight
algorithm after the fact
MTurk Tasks
• Two kinds of tasks, control and cluster
– Listed these two as separate HITs
– MTurkers paid $0.01 per label
– Cannot enforce between-subjects conditions on MTurk
– MTurker saw a given URL at most once
• Four votes minimum, 70% threshold
– Stopped at 4 votes, cannot dynamically
request more votes on MTurk
– 153 (3.9%) in control and 127 (3.2%) in
cluster not labeled
MTurk Tasks
• URLs were replayed in order
– Ex. If crawled at 2:51am from PhishTank
on day 1, then we would replay at 2:51am
on day 1 of experiment
– Listed new HITs each day rather than a
HIT lasting two weeks (to avoid delays
and last minute rush)
Summary of Experiment
• 3973 suspicious URLs
– Ground truth from Google, MSIE, and
PhishTank, checked every 10 min
– 3877 were phish, 96 not
• 239 MTurkers participated
– 174 did HITs for both control and cluster
– 26 in Control only, 39 in Cluster only
• Total of 33,781 votes placed
– 16,308 in control
– 11,463 in cluster (equivalent to 17,473 single-URL votes)
• Cost (participants + Amazon): $476.67 USD
Voteweight
• Use time and accuracy to weight votes
– Those who vote early and accurately
are weighted more
– Older votes discounted
– Incorporates a penalty for wrong votes
• Done after data was collected
– Harder to do in real-time since we don’t
know true label until later
• See paper for parameter tuning
– Of threshold and penalty function
Voteweight Results
• Control condition, best scenario (before → after)
– Before: 94.8% accuracy, avg 11.8 hrs, median 3.8 hrs
– After: 95.6% accuracy, avg 11.0 hrs, median 2.3 hrs
• Cluster condition, best scenario (before → after)
– Before: 95.4% accuracy, avg 1.8 hrs, median 0.7 hrs
– After: 97.2% accuracy, avg 0.8 hrs, median 0.5 hrs
• Overall: small gains, though potentially more fragile and more complex
Limitations of Our Study
• Two limitations of MTurk
– No separation between control and cluster
– ~3% of submissions ended in unresolved ties (would have required more votes)
• Possible learning effects?
– Hard to tease out with our data
– Aquarium doesn’t offer feedback
– Everyone played Phil
– Neither condition was prioritized over the other
• Optimistic case, no active subversion
4.1 Whitelists
The whitelists include 3208 domains:
From Google Safe Browsing (2784):
http://sb.google.com/safebrowsing/update?version=goog-white-domain:1:1
From millersmiles (424):
http://www.millersmiles.co.uk/scams.php
They reduce false positives and save human effort.
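Checking a submission against these whitelists amounts to a set lookup on the URL's domain. A minimal sketch (the sample entries and the naive domain extraction are my own assumptions; the extraction breaks on multi-part TLDs such as .co.uk):

```python
from urllib.parse import urlparse

# Hypothetical sample entries; the real lists (Google Safe Browsing +
# millersmiles) held 3208 domains in total.
WHITELIST = {"google.com", "paypal.com", "bankofamerica.com"}

def needs_human_review(url):
    """Skip human verification for URLs on a trusted-domain whitelist."""
    host = urlparse(url).hostname or ""
    domain = ".".join(host.split(".")[-2:])   # naive registered domain
    return domain not in WHITELIST

print(needs_human_review("https://www.paypal.com/signin"))     # False
print(needs_human_review("http://paypal.example.net/signin"))  # True
```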
4.2 Clustering
Content similarity measurement (shingling method):
r(q, d) = |S(q) ∩ S(d)| / |S(q) ∪ S(d)|
where S(q) and S(d) denote the sets of unique n-grams in pages q and d. The similarity threshold is 0.65.
The average time cost of calculating the similarity of two web pages was 0.063 microseconds (SD = 0.05) on a laptop with a 2 GHz dual-core CPU and 1 GB of RAM.
Clustering uses DBSCAN with Eps = 0.65 and MinPts = 2. The time cost of clustering over all 3973 pages collected was about 1 second.
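The resemblance measure can be computed directly from its definition. A minimal sketch (character n-grams here, whereas the real system presumably shingles HTML tokens; the sample pages are invented):

```python
def shingles(text, n=8):
    """Set of unique character n-grams (shingles) of a page."""
    return {text[i:i + n] for i in range(len(text) - n + 1)}

def resemblance(q, d, n=8):
    """r(q, d) = |S(q) ∩ S(d)| / |S(q) ∪ S(d)|, i.e. Jaccard on shingles."""
    sq, sd = shingles(q, n), shingles(d, n)
    if not (sq or sd):
        return 1.0
    return len(sq & sd) / len(sq | sd)

# Two near-duplicate phishing pages clear the 0.65 threshold:
a = "<html><body><p>Please verify your PayPal account information now to avoid suspension.</p></body></html>"
b = "<html><body><p>Please verify your PayPal account information today to avoid suspension.</p></body></html>"
print(resemblance(a, b) > 0.65)   # True
```

Because shingle sets ignore position, small edits (a swapped word, an updated victim name) barely move the score, which is what lets near-duplicate phishing kits cluster together.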
4.2 Clustering
Incremental update of the data:
If there is no similar web page, we create a new cluster for the new submission.
If the similarity is above the given threshold and all similar web pages are in the same cluster, we assign the new submission to that cluster (unless the cluster is at its maximum size).
If there are similar web pages in different clusters, we choose the largest cluster that is not at its maximum size.
After a new submission is grouped into a cluster, it has zero votes and does not inherit the votes of any other submission in the same cluster.
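The update rules above can be sketched as follows (a simplified version: pages are represented as shingle sets, `max_size` is an assumed cap, and since the slide does not say what happens when every similar cluster is full, this sketch falls back to a new cluster):

```python
def jaccard(a, b):
    """Similarity of two pages represented as sets of shingles."""
    return len(a & b) / len(a | b) if (a | b) else 1.0

def assign(clusters, page, threshold=0.65, max_size=50):
    """Place a new submission according to the three rules above.
    The new submission starts with zero votes either way; votes are
    never inherited from other members of the cluster."""
    # Clusters containing at least one sufficiently similar page.
    similar = [c for c in clusters
               if any(jaccard(page, p) >= threshold for p in c)]
    open_similar = [c for c in similar if len(c) < max_size]
    if open_similar:
        # One or several candidate clusters: join the largest open one.
        max(open_similar, key=len).append(page)
    else:
        # No similar page (or all similar clusters full): new cluster.
        clusters.append([page])

clusters = []
for page in [{"abc", "bcd"}, {"abc", "bcd", "cde"}, {"xyz"}]:
    assign(clusters, page)
print(len(clusters))   # 2: the two similar pages share a cluster
```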
4.3 Voteweight
The core idea behind voteweight is that participants who are more helpful, in terms of time and accuracy, are weighted more heavily than other participants.
A user's voteweight measures how strongly that user's votes influence the final status of the suspicious URLs.
Its value is derived from the accuracy of the user's voting history:
a correct vote should be rewarded and a wrong one should be penalized
recent behavior should be weighted more than past behavior
4.3 Voteweight
In our model, we use y ∈ [t, +∞) ∪ (−∞, −t] to label the status of a URL, where y is the sum of the voteweights cast on a given URL and t is the voteweight threshold: y ≥ t means the URL has been voted a phishing URL, and y ≤ −t means it has been voted legitimate.
4.3 Voteweight
(1) Normalized voteweight of user i (over M users):
    v'_i = v_i / Σ_{k=1}^{M} v_k
(2) Raw weights are floored at zero:
    v_i = RV_i if RV_i ≥ 0, and 0 otherwise
(3) Raw value combines reward and penalty:
    RV_i = R_i − α · P_i
(4) Reward from correct votes, discounted by age:
    R_i = Σ_{j=1}^{N} (1 − (T_0 − T_{ij}) / T_0) · I(C_{ij} = L_j)
(5) Penalty from wrong votes, discounted the same way:
    P_i = Σ_{j=1}^{N} (1 − (T_0 − T_{ij}) / T_0) · I(C_{ij} ≠ L_j)
(6) Indicator function:
    I_A(x) = 1 if x ∈ A, and 0 if x ∉ A
(7) Label score of URL t over its K voters:
    l_t = Σ_{i=1}^{K} v'_i · C_{it}
(8) Vote encoding:
    C_{it} = +1 if user i voted URL t as phish, −1 otherwise
where N is the number of past votes of user i, T_{ij} is the time of user i's vote on URL j, T_0 is the current time, C_{ij} is the vote cast, and L_j is the true label of URL j.
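The voteweight scheme described above can be sketched end-to-end (a simplified reading of the slide's formulas: the function names, the toy vote history, and the time units are my own illustrative assumptions):

```python
def user_voteweight(history, now, alpha=2.5):
    """Voteweight of one user from a history of (vote_time, was_correct)
    pairs: reward correct votes and penalize wrong ones (RV = R - alpha*P),
    discounting older votes linearly, then floor the result at zero."""
    reward = penalty = 0.0
    for t, correct in history:
        discount = 1 - (now - t) / now   # recent votes count for more
        if correct:
            reward += discount
        else:
            penalty += discount
    return max(reward - alpha * penalty, 0.0)

def url_score(weights, votes):
    """l_t = sum over voters of normalized voteweight times C_it,
    where C_it is +1 for a phish vote and -1 for a legitimate vote."""
    total = sum(weights.values())
    return sum(weights[u] / total * c for u, c in votes.items())

# Two historically accurate users outvote one inaccurate user:
w = {"alice": user_voteweight([(90, True), (95, True)], now=100),
     "bob":   user_voteweight([(80, True)], now=100),
     "carol": user_voteweight([(50, False), (60, False)], now=100)}
score = url_score(w, {"alice": +1, "bob": +1, "carol": -1})
print(score > 0)   # True: the weighted consensus labels the URL a phish
```

Note how carol's penalized weight is floored at zero, so her dissenting vote contributes nothing to the URL's final score.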
7. Investigating Voteweight
Tuning Parameters in the Control Condition
Voteweight achieves its best accuracy of 95.6% and a time cost of 11 hours with t = 0.08 and α = 2.5 in the control condition:
Average time cost drops to 11.0 hours (11.8 hours without voteweight)
Median time cost drops to 2.3 hours (3.8 hours without voteweight)
7. Investigating Voteweight
Tuning Parameters in the Cluster Condition
Voteweight achieves its best accuracy of 97.2% and a time cost of 0.8 hours with t = 0.06 and α = 1 in the cluster condition:
Average time cost drops to 0.8 hours (1.8 hours without voteweight)
Median time cost drops to 0.5 hours (0.7 hours without voteweight)
Editor's notes: Average accuracy for each decile of users, sorted by accuracy. For example, the average accuracy of the top 10% of users in both conditions was 100%, whereas the average accuracy of the bottom 10% was under 30% in the Control Condition and under 50% in the Cluster Condition.