Vision and reflection on Mining Software Repositories research in 2024
Better Contextual Suggestions from Domain Knowledge in ClueWeb12
1. Better Contextual Suggestions from ClueWeb12
Using Domain Knowledge Inferred from The Open Web
Results And AnalysisContextual Suggestion Model
Applying domain knowledge about sites that are more likely to offer
attractions leads to better suggestions
The best results are obtained when identifying attractions through
specialized services such as Foursquare
Conclusions
Thaer Samar†
, Alejandro Bellogín*
, Arjen P. de Vries†
†
Centrum Wiskunde & Informatica, The Netherlands
*
Universidad Autónoma de Madrid, Spain
{samar, arjen.de.vries}@cwi.nl, alejandro.bellogin@uam.es
General performance
TouristFiltered run is better than GeoFiltered run
Performance of TouristFiltered vs. GeoFiltered
Percentage of topics where TouristFiltered run is better than, equal to,
and worse than GeoFiltered run
Dimensions of relevance
Effect of geographical (geo), description (desc), and document (doc)
relevance on the performance of TouristFiltered and GeoFiltered runs
Considering desc and doc relevance only:
Both runs have almost the same effectiveness
Considering the geo relevance only
TouristFiltered is more geographically appropriate
Effect of TouristFiltered parts on the Performance
TouristFiltered sub-collection consists of three parts:
A. TouristListFiltered (TLF), B. TouristOutlinksFiltered (TOF),
and C. AttractionFiltered (AF)
Effect of each part on the performance
Best results from the part obtained through specialized service like
Foursquare
Goal: Applying domain knowledge inferred from the Open Web
to get better suggestions from the Archived Web (ClueWeb12)
2. User Profiles Generation
For each user:
Aggregate descriptions of attractions rated by the user
Split descriptions into positive and negative profiles based on user ratings
3. Representation
Represents attractions and user profiles in a Vector Space Model
1. clean attraction documents (remove HTML tags)
2. remove stop words from profiles and documents
3. transform documents and profiles into vectors of <term, frequency> pairs
4. Similarity of Attractions & User Profiles
For each set of attractions per (user, context) topic
Compute similarity between each attraction and positive and negative profiles
GeoFiltered
Exact mention of the context
Format: “City, ST”
TouristFiltered
50 Attractions: name and URL
Missing URL
“Attraction name
Miami, FL”
Get Hosts of all found URLS
Search in
ClueWeb12
1. Find Attractions in ClueWeb12
“Miami, FL”
A. Tourist sites: Any document whose URL belongs to 7 manually
selected tourist websites. (TouristListFiltered)
{yelp, tripadvisor, wikitravel, zagat, xpedia, orbitz, and travel.yahoo}
B. Expand A
Extract Outlinks
Find Outlinks in ClueWeb12 (TouristOutlinksFiltered)
C. Attraction Oriented (AttractionsFiltered)
Two runs: one based on TouristFiltered and another based on GeoFiltered
The following steps are the same for the two runs
NEW
Same as
2013
Same as
2013
Same as
2013
all: means geo, desc, and doc
are considered