3. About Twitter
Social networking and micro-blogging service
Enables users to send and read messages of up to 140 characters, known as "tweets".
Tweets contain rich information about people's preferences.
People share their thoughts about matches and player statistics on Twitter.
4. People's opinions about a match have a huge impact on its success.
Our project includes prediction using Twitter data, and analysis of the prediction results.
A high volume of positive tweets may indicate the performance and result of a match and its players. But how do we quantify this?
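One simple way to make this quantifiable (a sketch only; the function name and data below are illustrative, not the project's actual metric) is the share of positive tweets among all polar tweets:

```r
# Hypothetical classifier output for one match (illustrative data)
labels <- c("positive", "positive", "negative", "neutral", "positive", "irrelevant")

# Share of positive tweets among polar (positive + negative) tweets
sentiment_ratio <- function(labels) {
  pos <- sum(labels == "positive")
  neg <- sum(labels == "negative")
  pos / (pos + neg)
}

sentiment_ratio(labels)  # 3 positive vs 1 negative -> 0.75
```

A ratio well above 0.5 would then be read as a favourable signal for the team in question.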
6. The core problem in Twitter analytics is classifying the polarity of a given text at the document, sentence, or feature/aspect level:
is the given document, sentence, or feature/aspect entity positive, negative, or neutral?
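A minimal illustration of document-level polarity classification, using a toy word-list approach (the lexicons below are made up for the example; real systems use much larger opinion lexicons):

```r
# Tiny illustrative lexicons (hypothetical; real lexicons are much larger)
pos.words <- c("great", "good", "win", "brilliant")
neg.words <- c("bad", "poor", "loss", "terrible")

classify_polarity <- function(text) {
  words <- unlist(strsplit(tolower(text), "[^a-z]+"))
  score <- sum(words %in% pos.words) - sum(words %in% neg.words)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

classify_polarity("What a brilliant win!")     # "positive"
classify_polarity("Poor defending, bad loss")  # "negative"
```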
7. Using social media to predict the future has become very popular in recent years.
"Predicting the Future with Social Media" (Bernardo) aims to show that Twitter-based prediction can capture the result and performance of matches and players.
"Predicting matches and players performance using social media" (Andrei Oghina, Mathias Breuss, Manos Tsagkias & Maarten de Rijke, 2012) uses Twitter and Facebook data to predict scores and results, as well as which players are likely to perform well in a match.
My project includes prediction using Twitter data and an investigation of two new topics based on the prediction results.
8. Data Collection: an existing Twitter data set plus recent tweets fetched via the Twitter API
Data Pre-processing: clean the data and transform it into the format we need
Analysis: train a classifier to label each tweet as positive, negative, neutral or irrelevant
Prediction: use the statistics of the tweets' labels to predict the match result (win/loss)
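The four steps can be sketched end-to-end on toy data (everything below — the tweets, word lists and win/loss rule — is illustrative, not the project's actual pipeline):

```r
# 1. Collection: toy stand-in for downloaded tweets
raw <- c("GREAT win by the team!!", "bad loss :(", "what a good game", "match at 7pm")

# 2. Pre-processing: lower-case and strip non-letters
clean <- tolower(gsub("[^A-Za-z ]", "", raw))

# 3. Analysis: label each tweet with a toy word-list classifier
pos <- c("great", "win", "good"); neg <- c("bad", "loss", "poor")
label <- sapply(strsplit(clean, " +"), function(w) {
  s <- sum(w %in% pos) - sum(w %in% neg)
  if (s > 0) "positive" else if (s < 0) "negative" else "neutral"
})

# 4. Prediction: win if positive tweets outnumber negative ones
prediction <- if (sum(label == "positive") > sum(label == "negative")) "win" else "loss"
```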
9. MapReduce – Data Reduction
The processing pillar in the Hadoop ecosystem is the MapReduce framework.
The framework lets you specify an operation to apply to a huge data set, divides the problem and the data, and runs the pieces in parallel.
From an analyst's point of view, this can occur along multiple dimensions. For example, a very large dataset can be reduced to a smaller subset to which analytics can then be applied.
10. MapReduce – R
Executing R code in the context of a MapReduce job elevates the kinds and sizes of analytics that can be applied to huge datasets.
Problems that fit nicely into this model include "pleasingly parallel" scenarios.
Here's a simple use case: scoring a dataset against a model built in R.
12. Namenode
• Manages the file system's namespace/metadata/file blocks
• Runs on a single machine (or several machines with HDFS federation)
Datanode
• Stores and retrieves data blocks
• Reports to the Namenode
• Runs on many machines
Secondary Namenode
• Performs housekeeping work so the Namenode doesn't have to
• Requires hardware similar to the Namenode machine
• Not used for high availability; it is not a backup for the Namenode
14. Imposes key-value input/output
Defines map and reduce functions
map: (K1,V1) → list(K2,V2)
reduce: (K2, list(V2)) → list(K3,V3)
The map function is applied to every input key-value pair
The map function generates intermediate key-value pairs
Intermediate key-values are sorted and grouped by key
Reduce is applied to the sorted and grouped intermediate key-values
Reduce emits result key-values
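The contract above can be simulated in plain R (no Hadoop involved) with a word-count example; the function and variable names here are illustrative:

```r
# map: (K1,V1) -> list(K2,V2): emit (word, 1) for every word in the line
map_fn <- function(k, v) {
  words <- unlist(strsplit(v, " +"))
  lapply(words, function(w) list(key = w, value = 1))
}

# reduce: (K2, list(V2)) -> list(K3,V3): sum the counts for one word
reduce_fn <- function(k, values) list(key = k, value = sum(unlist(values)))

input <- list(list(key = 1, value = "win win loss"),
              list(key = 2, value = "win draw"))

# Map phase: apply map_fn to every input pair
pairs <- unlist(lapply(input, function(kv) map_fn(kv$key, kv$value)),
                recursive = FALSE)

# Shuffle/sort barrier: group intermediate values by key
grouped <- split(sapply(pairs, `[[`, "value"), sapply(pairs, `[[`, "key"))

# Reduce phase: one reduce_fn call per distinct key
result <- lapply(names(grouped), function(k) reduce_fn(k, grouped[[k]]))
```

In a real cluster, the map and reduce phases run on many machines, and the framework performs the grouping step between them.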
15. Takes care of distributed processing and coordination
Scheduling
– Jobs are broken down into smaller chunks called tasks; these tasks are then scheduled
Task localization with data
– The framework strives to place tasks on the nodes that host the segment of data to be processed by that specific task
– Code is moved to where the data is
16. Error Handling
– Failures are expected behavior, so tasks are automatically retried on other machines
Data Synchronization
– The shuffle-and-sort barrier rearranges and moves data between machines
– Input and output are coordinated by the framework
17. This involves pushing the model to the task nodes in the Hadoop cluster, running a MapReduce job that loads the model into R on a task node, scoring data either row by row or in aggregates, and writing the results back to HDFS.
In the simplest case, this can be done with just a map task.
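The idea can be sketched locally: build a model in R, then treat predict() on each row as the work a single map task would do (the dataset and model here are illustrative; a real job would submit this via rmr2 or Hadoop streaming):

```r
# Model "built in R", using the built-in cars dataset
model <- lm(dist ~ speed, data = cars)

# What a single map task would do: score one row against the model
score_row <- function(row) predict(model, newdata = row)

new_rows <- data.frame(speed = c(10, 20))
scores <- vapply(seq_len(nrow(new_rows)),
                 function(i) score_row(new_rows[i, , drop = FALSE]),
                 numeric(1))
# Faster cars are predicted to need longer stopping distances
```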
19. HDFS Overview
To meet these challenges we have to start with some basics.
First, we need to understand data storage in Hadoop, how it can be leveraged from R, and why it is important.
The basic storage mechanism in Hadoop is HDFS (the Hadoop Distributed File System).
For an R programmer, being able to read and write files in HDFS from a standalone R session is the first step in working with Hadoop.
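A sketch of reading and writing HDFS files from R using the RHadoop rhdfs package (assumes a running Hadoop cluster, the rhdfs package installed, and the HADOOP_CMD environment variable set; the paths are illustrative):

```r
library(rhdfs)

hdfs.init()                                        # connect to HDFS
hdfs.put("tweets.csv", "/user/analyst/")           # local file -> HDFS
files <- hdfs.ls("/user/analyst")                  # list an HDFS directory
hdfs.get("/user/analyst/tweets.csv", "copy.csv")   # HDFS file -> local disk
```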
20. Avoid sampling/aggregation;
reduce data movement and replication;
bring the analytics as close as possible to the data; and
optimize computation speed.
21. Creating a Twitter Application
The first step in performing Twitter analysis is to create a Twitter application. This application will allow you to perform analysis by connecting your R console to Twitter using the Twitter API. The steps for creating your Twitter application are:
Go to https://dev.twitter.com and log in with your Twitter account.
Then go to My Applications and create a new application.
24. Give your application a name, describe your application in a few words, and provide your website's URL or your blog address (in case you don't have a website).
Leave the Callback URL blank for now.
Complete the other formalities and create your Twitter application.
Once all the steps are done, the created application will be shown as below.
Please note the Consumer Key and Consumer Secret, as they will be used in RStudio later.
25. This step is done. Next, I will work in RStudio.
26. Working on RStudio – Building the corpus
In this section, I will first use some R packages: twitteR, ROAuth, plyr, stringr, RJSONIO, RCurl, bitops and ggplot2.
You can install these packages with the following commands:
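The install commands themselves were not captured in this transcript; a minimal version for the packages listed above would be:

```r
# One-time installation of the required packages from CRAN
install.packages(c("twitteR", "ROAuth", "plyr", "stringr",
                   "RJSONIO", "RCurl", "bitops", "ggplot2"))

# Load them into the current session
library(twitteR); library(ROAuth); library(plyr); library(stringr)
library(RJSONIO); library(RCurl); library(bitops); library(ggplot2)
```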
27. Now run the following R script snippet.
After running this section of the script, the console will look like this:
28. Once this file has been downloaded, we move on to accessing the Twitter API.
This step includes the script code that performs the OAuth handshake using the Consumer Key and Consumer Secret of your own application.
You have to replace these entries with the keys from your application.
Following is the code you have to run to perform the handshake:
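The handshake code itself is not reproduced in this transcript; a sketch using the twitteR/ROAuth API of that era would look like the following (cacert.pem is assumed to be the CA certificate bundle downloaded in the previous step; replace the key and secret with your own):

```r
library(twitteR)
library(ROAuth)

# Your application's credentials (placeholders; use your own values)
consumerKey    <- "YOUR_CONSUMER_KEY"
consumerSecret <- "YOUR_CONSUMER_SECRET"

cred <- OAuthFactory$new(
  consumerKey    = consumerKey,
  consumerSecret = consumerSecret,
  requestURL     = "https://api.twitter.com/oauth/request_token",
  accessURL      = "https://api.twitter.com/oauth/access_token",
  authURL        = "https://api.twitter.com/oauth/authorize"
)

cred$handshake(cainfo = "cacert.pem")  # opens a PIN-based authorization flow
registerTwitterOAuth(cred)             # tell twitteR to use these credentials
```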
30. Saving Tweets
Once the handshake is done and authorized by Twitter, we can fetch the most recent tweets related to any keyword. I have used #Kejriwal, as Mr. Arvind Kejriwal is the most talked-about person in Delhi nowadays.
The code for getting tweets related to #Kejriwal is:
This command will get 1000 tweets related to Kejriwal. The function "searchTwitter" is used to download tweets from the timeline. We then convert this list of 1000 tweets into a data frame so that we can work with it, and finally write the data frame out as a .csv file.
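The code described in this slide is not reproduced in the transcript; a sketch consistent with the description, using the twitteR functions searchTwitter and twListToDF (the output filename is illustrative), would be:

```r
library(twitteR)

# Fetch the 1000 most recent tweets mentioning #Kejriwal
tweets <- searchTwitter("#Kejriwal", n = 1000, cainfo = "cacert.pem")

# Convert the list of status objects into a data frame
tweets.df <- twListToDF(tweets)

# Finally, save the data frame as a .csv file for later analysis
write.csv(tweets.df, "kejriwal_tweets.csv", row.names = FALSE)
```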