Poster version of earlier work, presented at ICT.OPEN 2012.
Original paper:
Discovering User Perceptions of Semantic Similarity in Near-duplicate Multimedia Files. In Proc. of 1st International Workshop on Crowdsourcing Web Search, Lyon, France, April 17, 2012, CEUR-WS.org. Available online: http://msp.ewi.tudelft.nl/sites/default/files/crowdsearch2012-vliegendhart.pdf.
ICT.OPEN2012 - One of These Things is Not Like the Other: Crowdsourcing Semantic Similarity of Multimedia Files
Raynor Vliegendhart*, Martha Larson*, and Johan Pouwelse**
* Multimedia Information Retrieval Lab, Delft University of Technology
** Parallel and Distributed Systems Group, Delft University of Technology
Problem
● Problem: What constitutes a near duplicate?
For example: Are these two files the same? Why (not)?
  Chrono Cross - 'Dream of the Shore Near Another World' Violin/Piano Cover (YouTube: IQYNEj51EUI)
  Chrono Cross Dream of the Shore Near Another World Violin and Piano (YouTube: Iuh3YrJtK3M)
Yes: It’s the same song.
No: These are different performances by different performers.
Definition:
Functional near-duplicate multimedia items are items that fulfill the same purpose for the user. Once the user has one of these items, there is no additional need for another.
● Task: Discovering new notions of user-perceived similarity between multimedia files in a file-sharing setting.
● Motivation: Clustering items in search results.
Screenshots from Tribler 5.4 (tribler.org)
Approach
● Idea: Point the odd one out, inspired by Sesame Street’s “one of these things is not like the other”.
● Crowdsourcing Task:
  ● 3 multimedia files displayed as search results
  ● Worker points the odd one out and justifies why.
● Challenge: Eliciting serious judgments
HIT Design
Amazon Mechanical Turk (AMT) is a crowdsourcing platform to which Human Intelligence Tasks (HITs) can be submitted.
Phrasing in our HIT is important in order to elicit serious judgments:
● “Imagine that you download the three items in the list and that you view them.”
● Don’t force workers to make a contrast, and
● Explain the definition of functional similarity.
Example triad:
  Harry Potter and the Sorcerers Stone Audio Book (478 MB)
  Harry Potter and the Sorcerer s Stone (2001)(ENG GER NL) 2Lions- (4.36 GB)
  Harry Potter.And.The.Sorcerer.Stone.DVDR.NTSC.SKJACK.Universal.S (4.46 GB)
Answer options:
o The items are comparable. They are for all practical purposes the same. Someone would never really need all three of these.
o Each item can be considered unique. I can imagine that someone might really want to download all three of these items.
o One item is not like the other two. (Please mark that item in the list.) The other two items are comparable.
Experiments
● Dataset:
  ● Popular file-sharing site: The Pirate Bay (thepiratebay.se).
  ● 75 queries derived from Top 100 list.
  ● 32,773 filenames and metadata.
  ● 1000 random triads sampled from search results.
● Crowdsourcing Experiment:
  ● Recruitment HIT and Main HIT run concurrently on AMT.
  ● 8 out of 14 qualified workers produced free-text judgments for 308 triads within 36 hours.
● Card Sort:
  ● Group similar judgments into piles, merge piles iteratively, and, finally, label each pile.
  ● End result: 44 user-perceived dimensions of similarity discovered.
Conclusion
● Wealth of user-perceived dimensions of similarity discovered.
● Quick results due to interesting crowdsourcing task.
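Illustration: the dataset step "1000 random triads sampled from search results" could be sketched roughly as below. This is only a minimal sketch, not the authors' procedure. The poster does not say how triads were drawn, so the following are all assumptions: the function name `sample_triads`, the dictionary layout of the search results, drawing each triad from the results of a single query, picking queries uniformly at random, and rejecting duplicate triads. The toy filenames are hypothetical.

```python
import math
import random


def sample_triads(results_by_query, n_triads, seed=0):
    """Draw n_triads distinct (query, triad) pairs.

    Assumptions (not stated on the poster): each triad comes from the
    results of one query, queries are chosen uniformly at random among
    those with at least three distinct results, and repeats are rejected.
    """
    rng = random.Random(seed)
    # Deduplicate and sort each result list so a given seed is reproducible.
    pools = {q: sorted(set(items)) for q, items in results_by_query.items()}
    pools = {q: items for q, items in pools.items() if len(items) >= 3}

    # Number of distinct (query, triad) pairs that exist at all.
    capacity = sum(math.comb(len(items), 3) for items in pools.values())
    if n_triads > capacity:
        raise ValueError(f"only {capacity} distinct triads are available")

    queries = sorted(pools)
    seen = set()
    triads = []
    while len(triads) < n_triads:
        query = rng.choice(queries)
        # Three different items from this query, in a canonical order.
        triad = tuple(sorted(rng.sample(pools[query], 3)))
        if (query, triad) not in seen:
            seen.add((query, triad))
            triads.append((query, triad))
    return triads


# Hypothetical toy data standing in for crawled search results.
toy_results = {
    "harry potter": ["hp_audiobook", "hp_dvd_2lions", "hp_dvdr_ntsc", "hp_soundtrack"],
    "chrono cross": ["cc_ost", "cc_iso", "cc_cover"],
    "too few": ["a", "b"],  # skipped: fewer than three results
}

for query, triad in sample_triads(toy_results, n_triads=4):
    print(query, triad)
```

Rejection sampling keeps the sketch short. With 32,773 filenames spread over 75 queries, the number of possible triads is far larger than 1000, so rejected draws would be rare in that setting.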
Contact: R.Vliegendhart@tudelft.nl, @ShinNoNoir
ICT.OPEN 2012, Rotterdam, The Netherlands, 2012