Topik: Top-k Data Search in Geo-tagged
Social Media
Saurabh Deochake
Rutgers University
Piscataway, NJ, USA
Email: saurabh.deochake@rutgers.edu
Saurabh Singh
Rutgers University
Piscataway, NJ, USA
Email: s.singh@rutgers.edu
Shreepad Patil
Rutgers University
Piscataway, NJ, USA
Email: shreepad.patil@rutgers.edu
Abstract— The increasing popularity of social media platforms
and location-based services has made it possible to query the
data generated by social media interactions in novel ways
beyond traditional search. With the advent of data science
techniques, data scientists can delve deep into the data
generated by social media and inspect it to find the different
ways users interact on social media and, in turn, the Internet.
In this project, we make use of data generated by social
media interaction, for example tweets, geo-tagged Instagram
photos, or Facebook posts, on a particular topic from a
specific region and produce top-k social media results
(users/tweets/posts). Given a location l in the form of
(latitude, longitude), a distance d, and a set of keywords K,
the algorithm used in our module fetches the top-k users who
have posted media relevant to a keyword set k ⊆ K in
proximity to location l within distance d. We congregate,
analyze, and present the data to the end user. Finally, we
conduct an experimental study using real social media data
sets, such as Twitter, to evaluate the module under
development.
I. Project Description
Social media has a growing effect on our daily
lives: every second, around 6,000 tweets are posted
on Twitter. In general, people use social media
to keep in touch with their friends and stay updated
on social events.
Nowadays, with the widespread use of GPS-enabled
mobile devices, we also get data regarding users'
locations. Geo-tagged data plays an important role
in analyzing social media data in scenarios such as
friend recommendation and spatial decision making. For
example, Facebook activates its Safety Check feature
during a disaster, which lets users mark themselves
as safe. Many other location-based services are also
becoming popular.
Traditional search looks for keywords in a document;
Twitter's own search likewise fetches trending topics,
but it lacks a geo-spatial component. With geo-spatial
search it becomes possible to provide novel search
services.
Our approach is to recommend Twitter users by
taking into account users' tweet contents, locations,
and tweet popularity, which indicates the users' social
influence. In particular, we study the search for top-k
local users in tweets that are associated with
latitude/longitude information, namely geo-tagged tweets.
To that end, we devise a scalable architecture for
hybrid index construction in order to handle high-volume
geo-tagged tweets, and we propose functions
to score tweets as well as methods to rank users that
integrate social relationships, keyword relevance, and
spatial proximity to process such social searches.
Even though this is a vast topic of research and
development, with a group of three members the
project can be built in around two months. The major
stumbling blocks that may occur as the project
advances through development and testing are a) how to
store the huge volume of geo-spatial data generated by
analyzing and processing the social media data, and b)
how to represent the processed data in a graphical form
that a computer science novice can readily understand.
As assumed above, the project shall be completed within
two months (before 1 May 2016). We plan to finish the
requirement gathering and project design phases by the
end of March 2016; infrastructure and backend
development will run from the first week
of April 2016 until 25 April 2016. The testing phase
will run from 25 April 2016 to 1 May 2016, carrying
out all the designed test cases.
The project has four stages: Gathering, Design, In-
frastructure Implementation, and Testing.
A. Stage 1 - The Requirement Gathering Stage.
This section describes in detail the work done in the
requirement gathering phase. We start by describing
the proposed system and the types of users who will
work on the system. We then move on to the real-world
scenarios and the planned project phases.
• System Description:
This project will fetch top-k social media users
from the vast data pool captured via social media
platforms like Twitter. It will be a Flask-based
web project with a UI developed in Bootstrap
and JavaScript. The backend database will be
PostgreSQL, and the project's pivotal functionality
will be implemented in Python.
• Types of Users:
There are two types of users, which are described
in detail as follows:
1) End User: The end user is the one who inputs a
keyword into the system together with the desired
location in the form of (lat, long). This user has
limited access to the system, covering only the input
they provide and the output generated by their
query.
2) Super User: The Super User mode will be suit-
able for any user who wants to control the be-
haviour of the system. This shall include inspecting
the log files of the whole system in conjunction
with important data logs such as PostgreSQL’s
WAL (Write Ahead Logging) XLOG records. The
super user mode will give the user power to further
inspect the data values generated by the system
and perform data science, data mining and pattern
recognition operations.
• User Interaction Mode:
The end user will interact with the system via a
web-based portal built with the technologies
mentioned in the System Description section
above.
• Real-world Scenarios:
The real-world scenarios are described below,
grouped by user type:
Type: End User
Scenario: 1
Scenario Description: Top-k people/social media
posts in a locality about a query
System Data Input: A query ("People Talking
about Donald Trump") and the latitude and longitude
of a desired area
Input Data Type: String and a Tuple
System Data Output: List of users and
tweets/social media posts about the query in
the area
Output Data Type: List and JSON Strings
Scenario: 2
Scenario Description: Query-targeted
advertisements in a locality
System Data Input: A query ("Burrito") and the
latitude and longitude of a desired area
Input Data Type: String and a Tuple
System Data Output: List of users and tweets/social
media posts about the query in the area
Output Data Type: List and JSON Strings
Type: Super User
Scenario: 1
Scenario Description: Inspecting Log file of
PostgreSQL
System Data Input: Click on the "button" input
type on the form
Input Data Type: HTML Code
System Data Output: Opening the Log file on the
console
Output Data Type: HTML Code
Scenario: 2
Scenario Description: Managing the session and
API keys
System Data Input: Click on the "button" input
type on the form
Input Data Type: HTML Code
System Data Output: List session and API keys
used by the system
Output Data Type: HTML Code
• Project Timeline:
Stage 1, viz. requirement gathering, shall be
finished by around 21 March 2016, followed by
project design, which shall be completed by 1 April
2016. We will start development of the infrastructure
and back-end on 1 April 2016, to be finished by
25 April 2016. From 25 April 2016 to 1 May 2016,
we will thoroughly test the system against the test
cases developed by one of our team members.
The planned division of labor is as below:
Saurabh Deochake: Requirement Gathering,
Project Design, Infrastructure Implementation
including front-end and back-end, Documentation.
Shreepad Patil: Requirement Gathering, Project
Design, Infrastructure Implementation including
front-end and back-end, Documentation.
Saurabh Singh: Requirement Gathering, Project
Design, Front-end Infrastructure development,
Testing.
B. Stage 2 - The Design Stage.
This section explains the architecture of the project
and discusses the algorithms and data structures to be
used in the implementation. We also discuss the
constraints on the project.
• Project Description: We describe the main com-
ponents of the project here. The architecture is
illustrated in Figure 1.
Twitter API: We use the Twitter streaming API to
access tweets. We apply a filter on the stream to keep
only data relevant to our project. The pseudo code
for the filter is shown in Algorithm 1, followed by an
illustrative sketch.
Parser: Once we acquire the filtered stream, we
parse it to extract the posts. In the parser we remove
stop words and stem the remaining words. Each
post is stored in the format p(uid, t, l, W), where
uid is the user ID, t is the timestamp, l is the location,
and W is the set of words in the tweet. We store the
acquired posts in a database. The pseudo code for this
component is shown in Algorithm 2, again followed by
a sketch.
Graph Generator: The Graph Generator then
converts these posts into a graph. The graph is
stored as G(U, E), where U is the set of users and
E is the set of edges; (u, v) ∈ E if u has 'replied
to' or 'forwarded' v's post. This component appends
vertices and edges to an existing graph structure,
and the graph is stored as an adjacency list. The
pseudo code for this component is shown in Algorithm 3,
followed by a sketch of the adjacency-list update.
Analyzer: This component is the core of the
architecture. The analyzer receives the query from the
query processor. Internally, the analyzer finds all the
posts that were posted within the query radius and
assigns a distance score to them.
A second score, the keyword relevance score, is also
calculated for each post. This score is a function of
the tweet's popularity and the number of words shared
between the post and the query, where popularity is
based on the number of replies and retweets of the
post. The pseudo code for this component is shown in
Algorithm 4; a sketch combining plausible scoring
functions with the top-k step follows Algorithm 5.
Query Processor: We plan to use a top-k query
on an aggregate score. The aggregate score is a
weighted combination of the distance score and the
keyword relevance score. In response to the query,
the analyzer responds with the top-k users relevant
to it. The algorithm is briefly described in
Algorithm 5.
User Interface: The user interface takes user
input in the form q(W, l, r), where W is a set
of words, l is a location, and r is a distance. Once
the user submits the query, the UI passes it to
the query processor and displays the output to the
user.
The time complexity of the algorithms is linearithmic
in the number of posts; however, the number of posts
is expected to be very high. We write the time
complexity as O(n log n), where n is the number of
posts.
Figure 1. Proposed architecture for Topik
Algorithm 1: GetStream
Data: Filter F, a broad enough filter on the social
media feed from the API to generate and store data.
Result: The subset of posts from the feed that
satisfy F.
begin
p(uid, t, l, W) ←− getCurrentTweet()
if p satisfies F then
return p(uid, t, l, W)
else
return ∅
end
end
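The following is a minimal sketch of the GetStream component using Tweepy's pre-4.0 streaming API (the library named in Stage 3). The credential placeholders, the keyword/bounding-box values, and the handle_post callback are assumptions for illustration, not the project's actual code:

    # Sketch of GetStream with Tweepy (pre-4.0 StreamListener API).
    import tweepy

    class TopikListener(tweepy.StreamListener):
        def on_status(self, status):
            # Keep only geo-tagged tweets; everything else is dropped,
            # mirroring the "return ∅" branch of Algorithm 1.
            if status.coordinates or status.place:
                handle_post(status)   # hypothetical: forward p(uid, t, l, W) to the parser

        def on_error(self, status_code):
            # HTTP 420 means we are being rate limited; returning False
            # disconnects instead of hammering the endpoint.
            if status_code == 420:
                return False

    auth = tweepy.OAuthHandler(CONSUMER_KEY, CONSUMER_SECRET)  # assumed credentials
    auth.set_access_token(ACCESS_TOKEN, ACCESS_SECRET)
    stream = tweepy.Stream(auth, TopikListener())
    # The filter F: a broad keyword plus a bounding box around the target area.
    stream.filter(track=["project"], locations=[-75.0, 39.0, -73.0, 41.0])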
Algorithm 2: ParsePost
Data: Tweet p(uid, t, l, W) from GetStream after
applying the filter.
Result: Refined tweet p(uid, t, l, W) with stop words
removed and the remaining words stemmed.
begin
for ∀i ∈ W do
if !IsRelevant(i) then
W ← W − {i}
end
end
return p(uid, t, l, W)
end
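A minimal sketch of ParsePost's word refinement using NLTK, assuming the stopwords corpus is installed (nltk.download("stopwords")); IsRelevant from Algorithm 2 is approximated here by a stop-word test:

    # Sketch of ParsePost: drop stop words, stem the rest,
    # and produce the refined word set W of p(uid, t, l, W).
    import re
    from nltk.corpus import stopwords
    from nltk.stem import PorterStemmer

    STOP = set(stopwords.words("english"))
    stemmer = PorterStemmer()

    def parse_post(uid, t, l, text):
        # Tokenize on letter runs, lowercase, filter, stem.
        words = re.findall(r"[a-z]+", text.lower())
        W = {stemmer.stem(w) for w in words if w not in STOP}
        return (uid, t, l, W)

    # Example: parse_post("51235119", "2016-04-15", (40.5, -74.4),
    #                     "Working on the final project at Rutgers!")
    # yields W = {"work", "final", "project", "rutger"}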
• Flow Diagram and Major Constraints: We describe
the major constraints that will be tackled here:
– Twitter Rate limit: We are limited by the
Twitter API when fetching the streaming data.
Algorithm 3: GraphGenerator
Data: p, a refined tweet post from the parser; G, the
existing graph (if it already exists, the algorithm
appends to it).
Result: An updated graph G(U, E).
begin
for ∀u ∈ U do
if p.user replied to or forwarded u's post
then
E ← E ∪ {(p.uid, u)}
end
end
return G
end
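A minimal sketch of the adjacency-list update, assuming hypothetical reply_to and retweet_of attributes on the refined post; the project's actual post schema may differ:

    # Sketch of the GraphGenerator's adjacency-list update: G maps a
    # user id to the set of users whose posts they replied to or
    # forwarded, matching (u, v) ∈ E from the design.
    from collections import defaultdict

    G = defaultdict(set)   # adjacency list: uid -> set of target uids

    def update_graph(post):
        # post.reply_to / post.retweet_of are assumed attributes holding
        # the id of the user whose post was replied to or forwarded.
        for target in (post.reply_to, post.retweet_of):
            if target is not None:
                G[post.uid].add(target)   # add edge (post.uid, target)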
Algorithm 4: Analyzer and QueryProcessor
Data: k, the number of top posts needed in the result;
radius, the radial distance from the user's location
within which posts are analyzed;
W, the set of keywords provided as input by the user;
thresh, the threshold score above which a post is
inserted into the result set.
Result: A sorted set Ak containing the top-k results.
begin
CalculateDistanceScore(radius)
CalculateKeywordScore(W)
L1 ← SortByDistScore()
L2 ← SortByKeywordScore()
Run Top-k()
return top-k results
end
Rate limits are divided into 15-minute inter-
vals. Additionally, all endpoints require authen-
tication, so there is no concept of unauthenti-
cated calls and their rate limits.
Clients which do not implement backoff and
attempt to reconnect as often as possible will
have their connections rate limited for a small
number of minutes; rate-limited clients
receive HTTP 420 responses for all connection
requests.
Clients which break a connection and then
reconnect frequently (to change query param-
eters, for example) run the risk of being rate
limited. A sketch of a suitable backoff loop
follows this list.
– Size constraints: The data gathered from
social media is humongous, but we will be
working on local machines with limited storage
capacity, so we store a limited amount of data
and apply constraints on the streaming data
fetched.
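A sketch of the backoff behaviour the rate-limit constraint calls for, assuming a hypothetical connect() that blocks until the stream drops and returns the HTTP status it ended with:

    # Reconnect with exponential backoff: on HTTP 420, wait and double
    # the delay before reconnecting instead of retrying immediately.
    import time

    def stream_with_backoff(connect, max_delay=320):
        delay = 5
        while True:
            status = connect()          # hypothetical: returns HTTP status on disconnect
            if status == 420:           # rate limited: back off exponentially
                time.sleep(delay)
                delay = min(delay * 2, max_delay)
            else:
                delay = 5               # healthy disconnect: reset the delay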
Algorithm 5: Top-k Query Processing using the Thresh-
old Algorithm
Data: Sorted lists L1 and L2
Result: Top-k users related to the query
begin
(1) Do sorted access in parallel to each of the two
sorted lists. As a new object o is seen under
sorted access in some list, do random access
(calculating the other score) to the other list Li
to find pi(o). Compute the score
F(o) = F(p1, p2) of object o. If this score is
among the k highest scores seen so far, then
remember object o and its score F(o) (ties are
broken arbitrarily, so that only k objects and
their scores are remembered at any time).
(2) For each list Li, let pi be the score of the
last object seen under sorted access. Define
the threshold value T to be F(p1, p2). As soon
as at least k objects have been seen with
scores at least equal to T, halt.
(3) Let Ak be the set containing the k seen
objects with the highest scores. The output is
the sorted set {(o, F(o)) | o ∈ Ak}.
end
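A compact sketch of Algorithm 5 in Python, together with illustrative scoring functions for the Analyzer. The scoring formulas (linear distance decay, overlap-times-popularity) are assumptions, not the project's actual functions; the aggregate is alpha*p1 + (1-alpha)*p2 as in the query processor, and equal-length lists are assumed for brevity:

    import heapq
    import math

    def haversine_km(a, b):
        # Great-circle distance between (lat, lon) pairs in kilometres.
        lat1, lon1, lat2, lon2 = map(math.radians, (a[0], a[1], b[0], b[1]))
        h = (math.sin((lat2 - lat1) / 2) ** 2
             + math.cos(lat1) * math.cos(lat2) * math.sin((lon2 - lon1) / 2) ** 2)
        return 2 * 6371 * math.asin(math.sqrt(h))

    def distance_score(post_loc, query_loc, radius_km):
        # Illustrative: 1 at the query point, decaying linearly to 0 at the radius.
        return max(0.0, 1.0 - haversine_km(post_loc, query_loc) / radius_km)

    def keyword_score(post_words, query_words, replies, retweets):
        # Illustrative: keyword overlap weighted by a simple popularity term.
        overlap = len(set(post_words) & set(query_words)) / max(1, len(query_words))
        return overlap * (1 + replies + retweets) ** 0.5

    def threshold_topk(L1, L2, k, alpha=0.5):
        # L1, L2: (user, score) pairs sorted by score descending.
        d1, d2 = dict(L1), dict(L2)   # dict lookups stand in for random access
        seen = set()                  # users already fully scored
        best = []                     # min-heap of (F(o), user), size <= k
        for (u1, s1), (u2, s2) in zip(L1, L2):   # sorted access in parallel
            for u in (u1, u2):
                if u in seen:
                    continue
                seen.add(u)
                f = alpha * d1.get(u, 0.0) + (1 - alpha) * d2.get(u, 0.0)
                heapq.heappush(best, (f, u))
                if len(best) > k:
                    heapq.heappop(best)          # keep only the k highest scores
            T = alpha * s1 + (1 - alpha) * s2    # threshold from last seen scores
            if len(best) == k and best[0][0] >= T:
                break                            # k objects with score >= T: halt
        return sorted(best, reverse=True)        # best first: [(F(o), user), ...]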
C. Stage 3 - The Implementation Stage.
We are using a mixed framework for the implemen-
tation. The stack is as follows:
• Python Tweepy for back-end data collection
• MySQL (MariaDB) for the back-end database
• Python Flask for creating web services
• HTML/CSS, JavaScript, jQuery, and Twitter Boot-
strap for the UI
The workings are shown in the sections below:
• Data snippet: The data is taken from the Twitter fire-
hose streaming API. The streaming API delivers the
data in JSON format, which is then processed by
our parser. The data looks like the snippet below;
a sketch of mapping it to the post format follows.
{
  "id_str": "721061993747120128",
  "in_reply_to_status_id_str": null,
  "in_reply_to_user_id_str": null,
  "in_reply_to_screen_name": null,
  "user": {
    "id_str": "51235119",
    "name": "Heber R",
    "screen_name": "EoSean",
    "profile_image_url": "http://pbs.twimg.com/profile_images/688464505676812288/jLsNWXDs_normal.jpg"
  },
  "geo": null,
  "coordinates": null,
  "place": {
    "bounding_box": {
      "type": "Polygon",
      "coordinates": [..
      ]
    },
    "attributes": {}
  }
}
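A sketch of mapping this raw JSON to the post format p(uid, t, l, W); the timestamp_ms field and the bounding-box centroid fallback are assumptions about the streaming payload, not confirmed project behaviour:

    # Map a raw streaming tweet to the post tuple p(uid, t, l, W).
    import json

    def to_post(raw):
        tweet = json.loads(raw)
        uid = tweet["user"]["id_str"]
        t = tweet.get("timestamp_ms")           # may be absent in some payloads
        if tweet.get("coordinates"):
            lon, lat = tweet["coordinates"]["coordinates"]  # GeoJSON order: (lon, lat)
        else:
            # Fall back to the centroid of the place bounding box.
            box = tweet["place"]["bounding_box"]["coordinates"][0]
            lon = sum(p[0] for p in box) / len(box)
            lat = sum(p[1] for p in box) / len(box)
        W = tweet.get("text", "").split()       # refined further by the parser
        return (uid, t, (lat, lon), W)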
• Sample small output: The output is given by the
back-end via a web service in JSON format:
[
  {
    'scoreUB': 0.483731326757061,
    'LB': 1,
    'user': '192842961',
    'scoreLB': 0.481231326757061
  },
  {
    'scoreUB': 0.4376959174838795,
    'LB': 1,
    'user': '519165974',
    'scoreLB': 0.4364459174838795
  },
  {
    'scoreUB': 0.4271013729546355,
    'LB': 1,
    'user': '212309060',
    'scoreLB': 0.4271013729546355
  }
]
• Working code: One of the main functions is shown
below:
def findTopK(self, kLim=5, alpha=0.5):
    self.findTopTextResults()    # Calculate text scores
    self.findTopDistResults()    # Calculate distance scores

    LDist = []
    LText = []
    topKList = []

    # Requires the MySQLdb driver (MySQL-python / mysqlclient)
    db = MySQLdb.connect(host="localhost", user="spd", passwd="****", db="topik")

    curDist = db.cursor()
    curText = db.cursor()

    curDist.execute("SELECT uid, AVG(score) FROM temp_distScore "
                    "GROUP BY uid HAVING AVG(score) > 0 "
                    "ORDER BY AVG(score) DESC;")
    curText.execute("SELECT uid, (SUM(score)/40) FROM temp_textScore "
                    "GROUP BY uid HAVING SUM(score) > 0 "
                    "ORDER BY SUM(score) DESC;")

    LDist.extend(list(curDist.fetchall()))    # Sorted list
    LText.extend(list(curText.fetchall()))    # Sorted list

    k = 0
    buff = []

    Maxlength = max(len(LDist), len(LText))

    # Top-K Algorithm implementation
    for i in range(Maxlength):
        if i < len(LDist):
            UserL1, ScoreL1 = LDist[i][0], LDist[i][1]
        else:
            UserL1, ScoreL1 = '', 0

        if i < len(LText):
            UserL2, ScoreL2 = LText[i][0], LText[i][1]
        else:
            UserL2, ScoreL2 = '', 0

        threshold = alpha * ScoreL1 + (1 - alpha) * ScoreL2

        if UserL1 == UserL2:
            buff.append({"user": UserL1, "scoreLB": threshold,
                         "scoreUB": threshold, "LB": 1})
        else:
            # "is not None" rather than a truthiness test: index 0 is valid
            indexUserL1 = next((j for j, v in enumerate(buff)
                                if v["user"] == UserL1), None)
            if indexUserL1 is not None:
                buff[indexUserL1]["scoreLB"] += alpha * ScoreL1
                buff[indexUserL1]["scoreUB"] = buff[indexUserL1]["scoreLB"]
            else:
                buff.append({"user": UserL1, "scoreLB": alpha * ScoreL1,
                             "scoreUB": threshold, "LB": 1})

            indexUserL2 = next((j for j, v in enumerate(buff)
                                if v["user"] == UserL2), None)
            if indexUserL2 is not None:
                buff[indexUserL2]["scoreLB"] += (1 - alpha) * ScoreL2
                buff[indexUserL2]["scoreUB"] = buff[indexUserL2]["scoreLB"]
            else:
                buff.append({"user": UserL2, "scoreLB": (1 - alpha) * ScoreL2,
                             "scoreUB": threshold, "LB": 2})

        # Update buffers: the unseen half of a score can be at most
        # the score last seen at this depth in the other list
        for b in buff:
            if b["LB"] == 1:
                b["scoreUB"] = b["scoreLB"] + (1 - alpha) * ScoreL2
            else:
                b["scoreUB"] = b["scoreLB"] + alpha * ScoreL1

        # Move candidates whose lower bound already beats the
        # threshold into the result list
        tempBuff = []
        for b in buff:
            if b["scoreLB"] > threshold:
                topKList.append(b)
                k += 1
            else:
                tempBuff.append(b)
        buff = tempBuff

        if k >= kLim:
            return topKList
    return topKList
• Demo and sample findings:
– Data size: The data was collected from the
Twitter stream; after 2 hours of streaming we had
approximately 11 MB of data in the database.
– Interesting Findings: What makes our pro-
gram efficient in terms of space complexity
is that we store only stemmed words in
the database, obviating storage of the full text
retrieved from the Twitter API. As for time
complexity, overall our model works in
O(n log n). We run the No Random Access (NRA)
algorithm on the data stored in the database to
find the top-k results; using NRA
further makes our model competent, reducing
database accesses from O(mn) to O(m + n).
As this project demands operating
on big data originating from the Twitter API,
using a hash-like data structure, i.e., a
dictionary in Python, further helped us reduce
the overall running time. We also realized
that techniques like locality-sensitive
hashing over the geo-tagged stream could
further enhance our model in terms of time and
space complexity.
D. Stage 4 - User Interface.
The project is implemented as a client-server model.
The back-end runs on a server and the user interface
can be opened in a browser (the client). To open the user
interface, users navigate to the link. Once the web
page is open, the user can input his/her query as a
series of inputs in the given fields: words, latitude and
longitude, radius, and k. Once the
user clicks the Submit button, the output is displayed as
a table. This stage includes the following items:
• User inputs the text queries in the designated
fields and clicks on the submit button
• If the inputs are not acceptable to the query,
appropriate error messages are displayed to the
user in the form of alerts
– The Words field cannot be empty and cannot
contain any special characters. An alert is given
if valid input is not obtained
– Latitude and Longitude cannot be empty and
cannot contain values outside the valid ranges
(-180 to 180 for Longitude and
-85 to 85 for Latitude). An error message will
be displayed if the input is invalid
– The Radius field cannot be empty or negative. If
an invalid input is obtained, an appropriate
message is displayed
– The K field cannot be empty or negative. If an
invalid input is obtained, an appropriate mes-
sage is displayed
• If every input is valid then a query is triggered and
results are displayed to the user in a table
– Handle field shows the twitter handle used by
the user returned by the query
– Name field shows the full name of the user
returned by the query
– Image field shows the profile picture of the
user returned by the query
• The query is handled in a Flask-based Python
web server. The web server calls the top-k query
processor and executes the query; the results are
then returned in JSON format. A sketch of such a
route, including the input validation above, follows
this list.
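A minimal sketch of the Flask web service behind the UI, validating the q(W, l, r) inputs described above. find_top_k is a hypothetical wrapper around the query processor, and the route and parameter names are illustrative, not the project's actual API:

    from flask import Flask, request, jsonify

    app = Flask(__name__)

    @app.route("/topk")
    def topk():
        words = request.args.get("words", "")
        try:
            lat = float(request.args["lat"])
            lon = float(request.args["lon"])
            radius = float(request.args["radius"])
            k = int(request.args["k"])
        except (KeyError, ValueError):
            return jsonify(error="invalid or missing numeric input"), 400
        # Mirror the alert conditions from the UI, server-side.
        if not words or not words.replace(" ", "").isalnum():
            return jsonify(error="keywords can not contain special characters "
                                 "or can't be empty"), 400
        if not (-85 <= lat <= 85) or not (-180 <= lon <= 180):
            return jsonify(error="longitude/latitude not valid "
                                 "(-180/-85 to 180/85) or can't be empty"), 400
        if radius <= 0 or k <= 0:
            return jsonify(error="radius and k must be positive"), 400
        return jsonify(results=find_top_k(words.split(), (lat, lon), radius, k))

    if __name__ == "__main__":
        app.run()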
Deliverables for Stage 4 are as follows:
• To activate the UI, make sure that the MySQL
DB and the Flask server are online, then open the
frontendTopik.html file in a JavaScript-enabled
browser.
Figure 2. Initial screen for the application
Figure 3. Screen with error message for invalid Words input
• Use case: As it is finals week at Rutgers, many
students are working on project submissions, so we
ran a query to find the top 5 users tweeting the
keyword "project" within a distance of 5
miles. We show the validation cases for running
this query.
• When the user gives improper input, error mes-
sages are shown as below (refer to Figures 3-6). If all
the inputs are valid, the output is shown as
described below (refer to Figure 7).
– "keywords can not contain special characters
or can’t be empty":
– The error message explanation (upon which
violation it takes place): The error occurs
when user gives invalid input in the words field
– "longitude/latitude not valid (-180/-85 to
180/85) or can’t be empty":
– The error message explanation (upon which
violation it takes place): The error occurs
when user gives invalid input in the longitude
or latitude field
– "Radius not valid(Input a positive number) or
can’t be empty":
Figure 4. Screen with error message for invalid Latitude/Longitude
input
Figure 5. Screen with error message for invalid radius input
– The error message explanation (upon which
violation it takes place): The error occurs
when user gives invalid input in the radius field
– "k not valid(Input a positive number) or can’t
be empty":
– The error message explanation (upon which
violation it takes place): The error occurs
when user gives invalid input in the k field
Figure 6. Screen with error message for invalid k input
Figure 7. Screen with the full output as a table
• When the user hits the Submit button with correct
input in all fields, the output event is triggered.
– The output: a table with the top-k users tweeting
about the words within the radius
– The result consists of three columns in the
table: Image, Name, and the Twitter handle.