Analyzing social media with Python and other tools (2/4)
Hands-on workshop
Big (Twitter) Data
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
30 January 2014
10.45
#bigdata
In this session (2/4):
1 The data
    Recording tweets with yourTwapperkeeper
    CSV files
    Other ways to collect tweets
    Not that different: Facebook posts
2 The script
    Pseudo-code
    Python code
    The output
3 Your turn
4 Questions?
The data:
Recording tweets with yourTwapperkeeper
http://datacollection.followthenews-uva.cloudlet.sara.nl
yourTwapperkeeper
Storage
Continuously calls the Twitter API and saves all
tweets containing specific hashtags to a
MySQL database.
You tell it once which data to collect – and
then wait some months.
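The collection logic behind such a tool can be sketched in a few lines. This is an illustration only, not yourTwapperkeeper's actual code: fetch_new_tweets is a hypothetical stand-in for the call to the Twitter API, and SQLite replaces MySQL so the sketch runs on its own.

```python
import sqlite3

def fetch_new_tweets():
    # Hypothetical stand-in: a real collector would query the
    # Twitter API here and return the new tweets.
    return [("Op weg naar de #klimaattop in Warschau", "henklbr"),
            ("Nothing about Poland here", "someuser")]

# A small table for the tweets (in-memory SQLite instead of MySQL
# to keep the sketch self-contained).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (text TEXT, from_user TEXT)")

# In the real tool this loop runs continuously for months;
# here we do a single pass.
for text, user in fetch_new_tweets():
    db.execute("INSERT INTO tweets VALUES (?, ?)", (text, user))
db.commit()

stored = db.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```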
yourTwapperkeeper
Retrieving the data
You could access the MySQL database directly,
but yourTwapperkeeper has a nice interface
that lets you export the data to a format
we can use for the analysis.
The data:
CSV files
The format of our choice
• Virtually every program can read it
• Even human-readable in a simple text editor
• Plain text, with a comma (or a semicolon) denoting column breaks
• No limits regarding file size
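In Python, for instance, such a file can be read with the standard csv module. A minimal sketch, using two made-up columns instead of a file on disk:

```python
import csv
import io

# Two lines in the comma-separated format described above.
data = "text,from_user\nhello #bigdata,henklbr\n"

# csv.reader splits each line at the commas.
rows = list(csv.reader(io.StringIO(data)))

# The first row holds the column names, the rest the data.
header, records = rows[0], rows[1:]
```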
text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
:-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://pbs.twimg.com/profile_images/378800000673845195/b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820
Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg.com/profile_images/2943831271/b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935
The data:
Other ways to collect tweets
Other ways to collect tweets
Again, we want a CSV file...
• If you want tweets per person: www.allmytweets.net
• Up to six days back: www.scraperwiki.com
• Buy them from a commercial vendor
• TCAT (from the people at DMI/Media Studies)
• For specific purposes, write your own Python script to access the Twitter API
(if you want to, I can show you more about this tomorrow)
The data:
Not that different: Facebook posts
Not that different: Facebook posts
Have a look at netvizz
• Gephi files for network analysis
• ... and a tab-separated file (essentially the same as CSV) with the content
An alternative: Facepager
• A tool to query different APIs (among others, Twitter and Facebook) and store the results in a CSV table
• http://www.ls1.ifkw.uni-muenchen.de/personen/wiss_ma/keyling_till/software.html
The script:
Pseudo-code
Our task: Identify all tweets that include a reference to Poland
Let’s start with some pseudo-code!
open csv-table
for each line:
    append column 1 to a list of tweets
    append column 3 to a list of corresponding users
    look for searchstring in column 1
    append search result to a list of results
save lists to a new csv-file
The script:
Python code
#!/usr/bin/python
from unicsv import CsvUnicodeReader
from unicsv import CsvUnicodeWriter
import re
inputfilename="mytweets.csv"
outputfilename="myoutput.csv"
user_list=[]
tweet_list=[]
search_list=[]
searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')
print "Opening "+inputfilename
reader=CsvUnicodeReader(open(inputfilename,"r"))
for row in reader:
    tweet_list.append(row[0])
    user_list.append(row[2])
    matches1 = searchstring1.findall(row[0])
    matchcount1=0
    for word in matches1:
        matchcount1=matchcount1+1
    search_list.append(matchcount1)
print "Constructing data matrix"
outputdata=zip(tweet_list,user_list,search_list)
headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
print "Write data matrix to ",outputfilename
writer=CsvUnicodeWriter(open(outputfilename,"wb"))
writer.writerows(headers)
writer.writerows(outputdata)
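The script above targets Python 2 and the custom unicsv helper used in this workshop. As a point of comparison, the same analysis can be sketched in Python 3 with the standard csv module alone; the regular expression and the column positions are taken from the script above, while the helper function name and the inline example rows are my own.

```python
import csv
import re

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")

def count_poland_mentions(rows):
    # For each row, return (tweet, user, number of matches).
    # Column 0 holds the tweet text and column 2 the user,
    # as in the yourTwapperkeeper export.
    return [(row[0], row[2], len(searchstring1.findall(row[0])))
            for row in rows]

# With the real files this would be:
#   with open("mytweets.csv", newline="") as f:
#       results = count_poland_mentions(csv.reader(f))
# Here we use two inline rows so the sketch runs on its own.
results = count_poland_mentions([
    ["Wat waren de resultaten van de #klimaattop in #Warschau?", "", "Europarl_NL"],
    ["hello #bigdata", "", "henklbr"],
])
```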
#!/usr/bin/python
# We start with importing some modules:
from unicsv import CsvUnicodeReader
from unicsv import CsvUnicodeWriter
import re

# Let us define two variables that contain
# the names of the files we want to use
inputfilename="mytweets.csv"
outputfilename="myoutput.csv"
# We create some empty lists that we will use later on.
# A list can contain several variables
# and is denoted by square brackets.
user_list=[]
tweet_list=[]
search_list=[]
# What do we want to look for?
searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

# Enough preparation, let the program begin!
# We tell the user what is going on...
print "Opening "+inputfilename

# ... and call the module that reads the input file.
reader=CsvUnicodeReader(open(inputfilename,"r"))
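What does this pattern match? The four alternatives cover the Dutch words for Poland and Polish, plus the Dutch and Polish spellings of Warsaw, each with the first letter in upper or lower case. A quick illustration with a made-up sentence:

```python
import re

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")

# findall returns every match in the string, from left to right:
matches = searchstring1.findall("De klimaattop in Warschau gaat over Polen")
```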
# Now we read the file line by line.
# The indented block is repeated for each row
# (thus, each tweet)
for row in reader:
    # append data from the current row to our lists.
    # Note that we start counting with 0.
    tweet_list.append(row[0])
    user_list.append(row[2])

    # Let us count how often our searchstring is used
    # in this tweet
    matches1 = searchstring1.findall(row[0])
    matchcount1=0
    for word in matches1:
        matchcount1=matchcount1+1
    search_list.append(matchcount1)
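A side note on the counting loop: it is written out step by step for didactic reasons; len() applied to the list that findall returns gives the same number. A small check, with a made-up tweet:

```python
import re

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")
tweet = "Polen, Polen, Warschau"

# The explicit loop from the slide ...
matches1 = searchstring1.findall(tweet)
matchcount1 = 0
for word in matches1:
    matchcount1 = matchcount1 + 1

# ... is equivalent to:
shortcut = len(searchstring1.findall(tweet))
```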
# Time to put all the data in one container
# and save it:
print "Constructing data matrix"
outputdata=zip(tweet_list,user_list,search_list)
headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
print "Write data matrix to ",outputfilename
writer=CsvUnicodeWriter(open(outputfilename,"wb"))
writer.writerows(headers)
writer.writerows(outputdata)
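What zip does here can be seen in a tiny self-contained example with made-up data: it pairs up the n-th elements of the lists, so that each resulting tuple becomes one row of the output table. (In Python 3, zip returns an iterator, hence the list() call below.)

```python
tweet_list = ["tweet one", "tweet two"]
user_list = ["alice", "bob"]
search_list = [0, 2]

# Pair up the n-th elements of the three lists;
# each tuple is one row of the output table.
outputdata = list(zip(tweet_list, user_list, search_list))
```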
The script:
myoutput.csv
tweet,user,how often is Poland mentioned?
:-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,henklbr,0
Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,Europarl_NL,1
RT @greenami1: De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Kli...,LarsMoratis,1
De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Klimaschutz #FAZ,greenami1,1
Try it yourself!
We’ll help you get started. Please go to
http://beehub.nl/bigdata-cw/workshop and download
the files. Save the Python files
unicsv.py
myfirstscript.py as well as the dataset
mytweets.csv in a new folder called workshop on your
H-drive.
When you are done, start Python (GUI) from the
Windows Start Menu.
Recap
1 The data
    Recording tweets with yourTwapperkeeper
    CSV files
    Other ways to collect tweets
    Not that different: Facebook posts
2 The script
    Pseudo-code
    Python code
    The output
3 Your turn
4 Questions?
This afternoon
Your own script
Questions or comments?
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net