Analyzing social media with Python and other tools (2/4)
Hands-on workshop
Big (Twitter) Data
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net
Afdeling Communicatiewetenschap
Universiteit van Amsterdam
30 January 2014
10.45
#bigdata
In this session (2/4):
1 The data
    Recording tweets with yourTwapperkeeper
    CSV files
    Other ways to collect tweets
    Not that different: Facebook posts
2 The script
    Pseudo-code
    Python code
    The output
3 Your turn
4 Questions?
The data:
Recording tweets with yourTwapperkeeper
http://datacollection.followthenews-uva.cloudlet.sara.nl
yourTwapperkeeper
Storage
Continuously calls the Twitter API and saves all
tweets containing specific hashtags to a
MySQL database.
You tell it once which data to collect – and
then wait some months.
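The collection logic behind such a tool can be sketched in a few lines. This is an illustration only, not yourTwapperkeeper's actual code: fetch_new_tweets is a hypothetical stand-in for the call to the Twitter API, and SQLite replaces MySQL so the sketch runs on its own.

```python
import sqlite3

def fetch_new_tweets():
    # Hypothetical stand-in: a real collector would query the
    # Twitter API here and return the new tweets.
    return [("Op weg naar de #klimaattop in Warschau", "henklbr"),
            ("Nothing about Poland here", "someuser")]

# A small table for the tweets (in-memory SQLite instead of MySQL
# to keep the sketch self-contained).
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE tweets (text TEXT, from_user TEXT)")

# In the real tool this loop runs continuously for months;
# here we do a single pass.
for text, user in fetch_new_tweets():
    db.execute("INSERT INTO tweets VALUES (?, ?)", (text, user))
db.commit()

stored = db.execute("SELECT COUNT(*) FROM tweets").fetchone()[0]
```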
yourTwapperkeeper
Retrieving the data
You could access the MySQL database directly,
but yourTwapperkeeper has a nice interface
that lets you export the data to a format
we can use for the analysis.
The data:
CSV files
The format of our choice
• Virtually every program can read it
• Even human-readable in a simple text editor
• Plain text, with a comma (or a semicolon) denoting column breaks
• No limits regarding file size
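In Python, for instance, such a file can be read with the standard csv module. A minimal sketch, using two made-up columns instead of a file on disk:

```python
import csv
import io

# Two lines in the comma-separated format described above.
data = "text,from_user\nhello #bigdata,henklbr\n"

# csv.reader splits each line at the commas.
rows = list(csv.reader(io.StringIO(data)))

# The first row holds the column names, the rest the data.
header, records = rows[0], rows[1:]
```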
text,to_user_id,from_user,id,from_user_id,iso_language_code,source,profile_image_url,geo_type,geo_coordinates_0,geo_coordinates_1,created_at,time
:-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,,henklbr,407085917011079169,118374840,nl,web,http://pbs.twimg.com/profile_images/378800000673845195/b47785b1595e6a1c63b93e463f3d0ccc_normal.jpeg,,0,0,Sun Dec 01 09:57:00 +0000 2013,1385891820
Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,,Europarl_NL,406058792573730816,37623918,en,<a href="http://www.hootsuite.com" rel="nofollow">HootSuite</a>,http://pbs.twimg.com/profile_images/2943831271/b6631b23a86502fae808ca3efde23d0d_normal.png,,0,0,Thu Nov 28 13:55:35 +0000 2013,1385646935
The data:
Other ways to collect tweets
Other ways to collect tweets
Again, we want a CSV file...
• If you want tweets per person: www.allmytweets.net
• Up to six days back: www.scraperwiki.com
• Buy them from a commercial vendor
• TCAT (from the people at DMI/Media Studies)
• For specific purposes, write your own Python script to access the Twitter API
(if you want to, I can show you more about this tomorrow)
The data:
Not that different: Facebook posts
Not that different: Facebook posts
Have a look at netvizz
• Gephi files for network analysis
• ... and a tab-separated file (essentially the same as CSV) with the content
An alternative: Facepager
• A tool to query different APIs (among others, Twitter and Facebook) and store the results in a CSV table
• http://www.ls1.ifkw.uni-muenchen.de/personen/wiss_ma/keyling_till/software.html
The script:
Pseudo-code
Our task: Identify all tweets that include a reference to Poland
Let’s start with some pseudo-code!
open csv-table
for each line:
    append column 1 to a list of tweets
    append column 3 to a list of corresponding users
    look for searchstring in column 1
    append search result to a list of results
save lists to a new csv-file
The script:
Python code
#!/usr/bin/python
from unicsv import CsvUnicodeReader
from unicsv import CsvUnicodeWriter
import re
inputfilename="mytweets.csv"
outputfilename="myoutput.csv"
user_list=[]
tweet_list=[]
search_list=[]
searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')
print "Opening "+inputfilename
reader=CsvUnicodeReader(open(inputfilename,"r"))
for row in reader:
    tweet_list.append(row[0])
    user_list.append(row[2])
    matches1 = searchstring1.findall(row[0])
    matchcount1=0
    for word in matches1:
        matchcount1=matchcount1+1
    search_list.append(matchcount1)
print "Constructing data matrix"
outputdata=zip(tweet_list,user_list,search_list)
headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
print "Write data matrix to ",outputfilename
writer=CsvUnicodeWriter(open(outputfilename,"wb"))
writer.writerows(headers)
writer.writerows(outputdata)
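The script above targets Python 2 and the custom unicsv helper used in this workshop. As a point of comparison, the same analysis can be sketched in Python 3 with the standard csv module alone; the regular expression and the column positions are taken from the script above, while the helper function name and the inline example rows are my own.

```python
import csv
import re

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")

def count_poland_mentions(rows):
    # For each row, return (tweet, user, number of matches).
    # Column 0 holds the tweet text and column 2 the user,
    # as in the yourTwapperkeeper export.
    return [(row[0], row[2], len(searchstring1.findall(row[0])))
            for row in rows]

# With the real files this would be:
#   with open("mytweets.csv", newline="") as f:
#       results = count_poland_mentions(csv.reader(f))
# Here we use two inline rows so the sketch runs on its own.
results = count_poland_mentions([
    ["Wat waren de resultaten van de #klimaattop in #Warschau?", "", "Europarl_NL"],
    ["hello #bigdata", "", "henklbr"],
])
```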
#!/usr/bin/python
# We start with importing some modules:
from unicsv import CsvUnicodeReader
from unicsv import CsvUnicodeWriter
import re

# Let us define two variables that contain
# the names of the files we want to use
inputfilename="mytweets.csv"
outputfilename="myoutput.csv"
# We create some empty lists that we will use later on.
# A list can contain several variables
# and is denoted by square brackets.
user_list=[]
tweet_list=[]
search_list=[]
# What do we want to look for?
searchstring1 = re.compile(r'[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa')

# Enough preparation, let the program begin!
# We tell the user what is going on...
print "Opening "+inputfilename

# ... and call the module that reads the input file.
reader=CsvUnicodeReader(open(inputfilename,"r"))
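What does this pattern match? The four alternatives cover the Dutch words for Poland and Polish, plus the Dutch and Polish spellings of Warsaw, each with the first letter in upper or lower case. A quick illustration with a made-up sentence:

```python
import re

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")

# findall returns every match in the string, from left to right:
matches = searchstring1.findall("De klimaattop in Warschau gaat over Polen")
```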
# Now we read the file line by line.
# The indented block is repeated for each row
# (thus, each tweet)
for row in reader:
    # append data from the current row to our lists.
    # Note that we start counting with 0.
    tweet_list.append(row[0])
    user_list.append(row[2])

    # Let us count how often our searchstring is used
    # in this tweet
    matches1 = searchstring1.findall(row[0])
    matchcount1=0
    for word in matches1:
        matchcount1=matchcount1+1
    search_list.append(matchcount1)
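A side note on the counting loop: it is written out step by step for didactic reasons; len() applied to the list that findall returns gives the same number. A small check, with a made-up tweet:

```python
import re

searchstring1 = re.compile(r"[Pp]olen|[Pp]ool|[Ww]arschau|[Ww]arszawa")
tweet = "Polen, Polen, Warschau"

# The explicit loop from the slide ...
matches1 = searchstring1.findall(tweet)
matchcount1 = 0
for word in matches1:
    matchcount1 = matchcount1 + 1

# ... is equivalent to:
shortcut = len(searchstring1.findall(tweet))
```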
# Time to put all the data in one container
# and save it:
print "Constructing data matrix"
outputdata=zip(tweet_list,user_list,search_list)
headers=zip(["tweet"],["user"],["how often is Poland mentioned?"])
print "Write data matrix to ",outputfilename
writer=CsvUnicodeWriter(open(outputfilename,"wb"))
writer.writerows(headers)
writer.writerows(outputdata)
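What zip does here can be seen in a tiny self-contained example with made-up data: it pairs up the n-th elements of the lists, so that each resulting tuple becomes one row of the output table. (In Python 3, zip returns an iterator, hence the list() call below.)

```python
tweet_list = ["tweet one", "tweet two"]
user_list = ["alice", "bob"]
search_list = [0, 2]

# Pair up the n-th elements of the three lists;
# each tuple is one row of the output table.
outputdata = list(zip(tweet_list, user_list, search_list))
```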
The script:
myoutput.csv
tweet,user,how often is Poland mentioned?
:-) #Lectrr #wereldleiders #uitspraken #Wikileaks #klimaattop http://t.co/Udjpk48EIB,henklbr,0
Wat zijn de resulaten vd #klimaattop in #Warschau waard? @EP_Environment ontmoet voorzitter klimaattop @MarcinKorolec http://t.co/4Lmiaopf60,Europarl_NL,1
RT @greenami1: De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Kli...,LarsMoratis,1
De winnaars en verliezers van de lachwekkende #klimaattop in #Warschau (interview): http://t.co/DEYqnqXHdy #Misserfolg #Klimaschutz #FAZ,greenami1,1
Try it yourself!
We’ll help you get started. Please go to
http://beehub.nl/bigdata-cw/workshop and download
the files. Save the Python files
unicsv.py
myfirstscript.py as well as the dataset
mytweets.csv in a new folder called workshop on your
H-drive.
When you are done, start Python (GUI) from the
Windows Start Menu.
Recap
1 The data
    Recording tweets with yourTwapperkeeper
    CSV files
    Other ways to collect tweets
    Not that different: Facebook posts
2 The script
    Pseudo-code
    Python code
    The output
3 Your turn
4 Questions?
This afternoon
Your own script
Questions or comments?
Damian Trilling
d.c.trilling@uva.nl
@damian0604
www.damiantrilling.net