Introductory seminar @ International Hellenic University (8 May 2014)
Main topics covered:
- data collection from social sources
- indexing using MongoDB and Solr
- mining (basic analytics & topic detection)
Social media crawling and mining [exercises]
1. Lecture @ International Hellenic University
Thessaloniki, 8 May 2014
Social Media Crawling and Mining
Overview of Hands-on Workshop
Symeon (Akis) Papadopoulos, Manos Schinas, Katerina Iliakopoulou,
Yiannis Kompatsiaris
Information Technologies Institute (ITI)
Centre for Research & Technologies Hellas (CERTH)
3. IHU SocialSensor Seminar – May 2014 CERTH-ITI
Streams Manager
How to run:
java -jar StreamsManager.jar stream.conf.xml input.conf.xml
Items, MediaItems and StreamUsers
Item class
Basic fields:
String id
String title
String[] tags
long publicationTime
String uid
String reference
String referenceUserId
String[] mentions
MediaItem class
Basic fields:
String id
String title
String[] tags
long publicationTime
String uid
String reference
Items, MediaItems and StreamUsers
StreamUser class
Basic fields:
String id
String username
String url
int items
long followers
long friends
Getters / Setters for each field
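As a rough illustration of the shape described above, the following is a minimal sketch of a plain Java class with the listed fields and a getter/setter pair for each. The class name StreamUserSketch is hypothetical and is not the SocialSensor class itself, which may carry more fields and behaviour.

```java
/** Hypothetical stand-in that mirrors the basic fields listed for StreamUser. */
class StreamUserSketch {
    private String id;
    private String username;
    private String url;
    private int items;      // presumably the number of items the user has posted
    private long followers;
    private long friends;

    public String getId() { return id; }
    public void setId(String id) { this.id = id; }

    public String getUsername() { return username; }
    public void setUsername(String username) { this.username = username; }

    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    public int getItems() { return items; }
    public void setItems(int items) { this.items = items; }

    public long getFollowers() { return followers; }
    public void setFollowers(long followers) { this.followers = followers; }

    public long getFriends() { return friends; }
    public void setFriends(long friends) { this.friends = friends; }
}
```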
MongoDB – Direct Queries
1. Find an Item by its id
db.Items.find({"id" : "Twitter#438612090748416"})
2. Find all Items posted before a certain date
db.Items.find({"publicationTime" : {$lt: 1393408367000}})
3. Find a Media Item by its reference
db.MediaItems.find({"reference" : "Twitter#438612090748416"})
4. Find all Users with at least 1000 followers
db.StreamUsers.find({"followers" : {$gte: 1000}})
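The date bound in the second query is a plain number. The sketch below shows one way to derive such a bound, assuming that publicationTime holds milliseconds since the Unix epoch in UTC (the 13-digit value in the query suggests this, but the slides do not state it). The class and method names are hypothetical helpers and are not part of the workshop code.

```java
import java.time.Instant;
import java.time.LocalDateTime;
import java.time.ZoneOffset;

class PublicationTimeBounds {

    /** Converts a UTC date-time to epoch milliseconds (assumed unit of publicationTime). */
    static long toEpochMillis(LocalDateTime utcDateTime) {
        return utcDateTime.toInstant(ZoneOffset.UTC).toEpochMilli();
    }

    /** Renders epoch milliseconds as an ISO-8601 UTC timestamp, for checking a bound by eye. */
    static String toIsoString(long epochMillis) {
        return Instant.ofEpochMilli(epochMillis).toString();
    }
}
```

Under that assumption, the value 1393408367000 used above corresponds to 2014-02-26T09:52:47Z.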
MongoDB – Query using DAO classes
1. Create an instance of ItemDAO to retrieve items
ItemDAO itemDAO = new ItemDAOImpl("localhost", "Snow14", "Items");
2. Create an instance of MediaItemDAO to retrieve mediaItems
MediaItemDAO mediaItemDAO =
    new MediaItemDAOImpl("localhost", "Snow14", "MediaItems");
3. Create an instance of StreamUserDAO to retrieve users
StreamUserDAO userDAO =
    new StreamUserDAOImpl("localhost", "Snow14", "StreamUsers");
MongoDB – Query using DAO classes
1. Find an Item by its id
Item item = itemDAO.getItem("Twitter#438612090748416");
2. Find a Media Item by its reference
List<String> items = new ArrayList<String>();
items.add("Twitter#438612090748416");
mediaItemDAO.getMediaItemsForItems(items, "image", 20);
3. Find the 1000 latest Items
itemDAO.getLatestItems(1000);
MongoDB – Generic queries & Iteration
Use the BasicDBObject class to represent JSON objects,
e.g. {"id" : "Twitter#1234567"} ->
BasicDBObject query = new BasicDBObject("id", "Twitter#1234567");
List<Item> items = itemDAO.getItems(query);
To iterate:
ItemIterator it = itemDAO.getIterator(query);
Use the methods hasNext() and next() to iterate over
the collection of Items.
Solr – Query using SocialSensor wrappers
1. Create an instance of SolrItemHandler to index and retrieve
items
SolrItemHandler itemHandler =
    SolrItemHandler.getInstance(
        "http://localhost:8080/solr/Items");
2. Create an instance of SolrMediaItemHandler to index and
retrieve mediaItems
SolrMediaItemHandler mediaItemHandler =
    SolrMediaItemHandler.getInstance(
        "http://localhost:8080/solr/MediaItems");
Solr – Use of UI and SocialSensor wrappers
Assignment #1
Index all the items from MongoDB to Solr
Fill in the method eu.socialsensor.ihu_workshop.indexItems
Assignment #2
Run the following queries to get relevant Items
Q1 : terror attack Q2 : Crimea Q3 : Bitcoin
Basic Social Media Analytics
Assignment #1
1. Find the N most frequent hashtags in a list of Items
   a) Process all items in the list one by one
   b) Create a map of all detected hashtags and their number of
      occurrences
   c) Select the N hashtags with the highest counts
2. Find the N most frequent terms in a list of Items using
   tokenization
3. Find the N most retweeted tweets in the dataset
   a) Process all items in the collection one by one
   b) Create a map from each item (item id) to its number of retweets
   c) Select the N items with the highest counts
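The counting steps for hashtags can be sketched as follows. This is a minimal illustration and not the workshop solution: the class HashtagCounter is hypothetical, each String[] stands in for the tags field of one Item, and lower-casing tags before counting is an assumption made here.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;

class HashtagCounter {

    /** Counts how often each hashtag occurs; each array stands for one Item's tags field. */
    static Map<String, Integer> countHashtags(List<String[]> tagsPerItem) {
        Map<String, Integer> counts = new HashMap<>();
        for (String[] tags : tagsPerItem) {
            if (tags == null) {
                continue; // item without hashtags
            }
            for (String tag : tags) {
                // Case-insensitive counting is a choice made for this sketch.
                counts.merge(tag.toLowerCase(Locale.ROOT), 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the n keys with the highest counts; ties are broken alphabetically. */
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```

The same map-then-sort pattern applies to terms (count tokens instead of tags) and to retweets (count item ids instead of tags).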
Basic Social Media Analytics
Assignment #1
4. Find N top users based on:
a) Number of posted items
b) Aggregated number of retweets
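A possible sketch of both rankings is given below. It relies on an assumption the slides do not spell out: that a retweet is stored as an Item whose referenceUserId names the author of the original tweet, so the retweets a user received can be counted by grouping on that field. The classes TopUsers and Post are hypothetical stand-ins and not the SocialSensor classes.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TopUsers {

    /** Hypothetical stand-in for Item, reduced to the two fields used here. */
    static class Post {
        final String uid;             // author of this item
        final String referenceUserId; // assumed: author of the retweeted item, or null

        Post(String uid, String referenceUserId) {
            this.uid = uid;
            this.referenceUserId = referenceUserId;
        }
    }

    /** a) Number of posted items per user. */
    static Map<String, Integer> itemsPerUser(List<Post> posts) {
        Map<String, Integer> counts = new HashMap<>();
        for (Post p : posts) {
            counts.merge(p.uid, 1, Integer::sum);
        }
        return counts;
    }

    /** b) Number of retweets received per user, under the assumption stated above. */
    static Map<String, Integer> retweetsPerUser(List<Post> posts) {
        Map<String, Integer> counts = new HashMap<>();
        for (Post p : posts) {
            if (p.referenceUserId != null) {
                counts.merge(p.referenceUserId, 1, Integer::sum);
            }
        }
        return counts;
    }

    /** Returns the n users with the highest counts; ties are broken alphabetically. */
    static List<String> topN(Map<String, Integer> counts, int n) {
        List<Map.Entry<String, Integer>> entries = new ArrayList<>(counts.entrySet());
        entries.sort((a, b) -> {
            int byCount = Integer.compare(b.getValue(), a.getValue());
            return byCount != 0 ? byCount : a.getKey().compareTo(b.getKey());
        });
        List<String> top = new ArrayList<>();
        for (int i = 0; i < Math.min(n, entries.size()); i++) {
            top.add(entries.get(i).getKey());
        }
        return top;
    }
}
```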
Basic Social Media Analytics
Assignment #1
5. Create an activity timeline for the tweets in the dataset
and for the set of original tweets
6. Create the timeline of the tweets that contain a hashtag
(or keyword) of your choice
7. Try to visualize the timelines you have created in the
previous steps.
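One simple way to build such a timeline is to group publication times into bins of fixed width and count the items per bin. The sketch below assumes publicationTime values in epoch milliseconds. The class ActivityTimeline and the text rendering are hypothetical and serve only as illustration. To restrict the timeline to original tweets or to a hashtag, filter the items before passing their timestamps in.

```java
import java.util.Map;
import java.util.TreeMap;

class ActivityTimeline {

    /**
     * Groups timestamps into bins of binMillis width.
     * Keys are bin start times; bins without any item are not present in the map.
     */
    static TreeMap<Long, Integer> build(long[] publicationTimes, long binMillis) {
        TreeMap<Long, Integer> bins = new TreeMap<>();
        for (long t : publicationTimes) {
            long binStart = Math.floorDiv(t, binMillis) * binMillis;
            bins.merge(binStart, 1, Integer::sum);
        }
        return bins;
    }

    /** Renders one line per bin: the bin start followed by one '#' per item. */
    static String render(Map<Long, Integer> bins) {
        StringBuilder sb = new StringBuilder();
        for (Map.Entry<Long, Integer> e : bins.entrySet()) {
            sb.append(e.getKey()).append(' ');
            for (int i = 0; i < e.getValue(); i++) {
                sb.append('#');
            }
            sb.append('\n');
        }
        return sb.toString();
    }
}
```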
Detection of Trending Topics and Events
What is a trending topic?
Keywords, N-grams, Named Entities,
Phrases, which are shared a lot in
social media for a certain period of
time.
Detection of Trending Topics and Events
Assignment #2
Feature-pivot topic detection using hashtags
1. Baseline method: split the data into timeslots of equal
length and calculate the most frequent hashtags in each timeslot.
2. Calculate the trending hashtags by comparing the current
frequency of each hashtag with its frequency in the previous
timeslots.
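The comparison in step 2 can be scored in many ways. The sketch below uses one simple choice that is an assumption of this example and is not prescribed by the workshop: the count in the current timeslot divided by one plus the mean count over the previous timeslots. The class TrendScore is hypothetical. Its input is a list of per-timeslot hashtag counts in chronological order, with the last entry taken as the current timeslot.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

class TrendScore {

    /**
     * Scores every hashtag of the last (current) timeslot as
     * currentCount / (1 + mean count over all previous timeslots).
     * The added 1 avoids division by zero for hashtags not seen before.
     */
    static Map<String, Double> scores(List<Map<String, Integer>> slots) {
        Map<String, Double> result = new HashMap<>();
        if (slots.isEmpty()) {
            return result;
        }
        int previousSlots = slots.size() - 1;
        Map<String, Integer> current = slots.get(previousSlots);
        for (Map.Entry<String, Integer> e : current.entrySet()) {
            double sum = 0;
            for (int i = 0; i < previousSlots; i++) {
                sum += slots.get(i).getOrDefault(e.getKey(), 0);
            }
            double mean = previousSlots == 0 ? 0.0 : sum / previousSlots;
            result.put(e.getKey(), e.getValue() / (1.0 + mean));
        }
        return result;
    }
}
```

With this score, a hashtag that is always frequent ranks below one that jumps from a low baseline, which is the difference between the baseline method and the trending variant.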
Detection of Trending Topics and Events
Assignment #2
Document pivot event detection by clustering tweets
Cluster “similar” tweets to create groups of tweets that
represent candidate events.
The similarity between two tweets could be a combination of
similarity measures across different dimensions, e.g. textual
similarity and proximity in time and space.
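As an illustration of such a combination, the sketch below mixes two of the dimensions mentioned: textual similarity, measured here as Jaccard overlap of whitespace tokens, and time proximity, measured as an exponential decay of the time gap. The measures, the weighting scheme, and the class TweetSimilarity are assumptions of this example and are not the clustering implementation provided by SocialSensor. Spatial proximity is left out for brevity.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

class TweetSimilarity {

    /** Lower-cases the text and splits it on whitespace into a set of tokens. */
    static Set<String> tokens(String text) {
        Set<String> result = new HashSet<>();
        for (String token : text.toLowerCase(Locale.ROOT).split("\\s+")) {
            if (!token.isEmpty()) {
                result.add(token);
            }
        }
        return result;
    }

    /** Jaccard similarity: |intersection| / |union| of the token sets; 0 if both are empty. */
    static double jaccard(String a, String b) {
        Set<String> ta = tokens(a);
        Set<String> tb = tokens(b);
        Set<String> union = new HashSet<>(ta);
        union.addAll(tb);
        if (union.isEmpty()) {
            return 0.0;
        }
        Set<String> intersection = new HashSet<>(ta);
        intersection.retainAll(tb);
        return (double) intersection.size() / union.size();
    }

    /** Time proximity in (0, 1]: 1 for equal timestamps, decaying with scale tauMillis. */
    static double timeProximity(long t1, long t2, double tauMillis) {
        return Math.exp(-Math.abs(t1 - t2) / tauMillis);
    }

    /** Weighted sum of both measures; textWeight is expected to lie in [0, 1]. */
    static double combined(String a, long timeA, String b, long timeB,
                           double textWeight, double tauMillis) {
        return textWeight * jaccard(a, b)
                + (1.0 - textWeight) * timeProximity(timeA, timeB, tauMillis);
    }
}
```

A clustering step could then group tweets whose combined similarity exceeds a chosen threshold.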
Detection of Trending Topics and Events
Assignment #2
Frequency pivot event detection by clustering tweets
1. Run the document-pivot clustering provided by SocialSensor to
create a set of candidate events.
2. For each produced topic, find a list of representative hashtags.
3. Try to calculate a measure of "trendiness" for each event.