Like many Internet giants, Twitter makes money by selling ads, but it has an insidious infestation eroding its advertising credibility: bots. More than 23 million of them. Twitter bots are automatons that live in the Twittersphere and range wildly in capability. In their simplest form, they follow you, maybe fav-ing or retweeting your statuses. At their most complex, they troll and, ironically, troll trolls, using speech patterns that can, at times, fool humans. But when advertisers pay for engagement, they aren’t interested in a four-hour flame war between a gamergate bot and a Kanye bot. When advertisers analyze social data, they want to be sure their findings are the result of human activity. In Bot or Not, I describe an end-to-end data analysis that builds a bot classifier with Python.
13. Experimental Design
• Ingest data
  • python-twitter
• Clean and process data
  • pandas, NLTK, Seaborn, IPython Notebooks
• Create a classifier
  • scikit-learn
21.
def get_friends(self, screen_name, count=5000):
    '''
    GET friends/ids, i.e. the accounts you follow.
    Returns a list of user IDs.
    '''
    friends = self.api.GetFriendIDs(screen_name=screen_name,
                                    count=count)
    return friends
23.
# break query into bite-size chunks 🍔
def blow_chunks(self, data, max_chunk_size):
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]
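To make the chunking step concrete, here is a minimal standalone sketch of the same generator outside the class. The name `chunked` and the sample IDs are made up for illustration and are not from the talk.

```python
def chunked(data, max_chunk_size):
    """Yield successive slices of `data`, each at most `max_chunk_size` long."""
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]


# Hypothetical user IDs, only to show the shape of the output.
user_ids = [101, 102, 103, 104, 105, 106, 107]
chunks = list(chunked(user_ids, 3))
print(chunks)  # [[101, 102, 103], [104, 105, 106], [107]]
```

Because it is a generator, each slice is produced only when the caller asks for it, so the query loop can pause between chunks.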
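24.
# fragment from inside a method; assumes `import time` and `import twitter`
# (python-twitter) at module level, and that user_ids, max_query_size,
# and results are defined earlier
if len(user_ids) > max_query_size:
    chunks = self.blow_chunks(user_ids, max_chunk_size=max_query_size)
    for current_chunk in chunks:
        for user in current_chunk:
            try:
                user_data = self.api.GetUser(user_id=str(user))
                results.append(user_data.AsDict())
            except twitter.TwitterError:
                # skip this user and keep going
                print("got a twitter error! D:")
        # sleep 16 minutes, just past Twitter's 15-minute rate-limit window
        print("nap time. ZzZzZzzzzz...")
        time.sleep(60 * 16)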
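The fetch-then-nap pattern above can be tried without touching the Twitter API. The sketch below is an illustration under assumptions, not the talk's code: `fetch_in_chunks`, `fetch_user`, and `fake_fetch` are hypothetical names, and the sleep function is passed in so the example runs instantly.

```python
import time


def fetch_in_chunks(user_ids, fetch_user, max_query_size, pause_seconds,
                    sleep=time.sleep):
    """Fetch users chunk by chunk, pausing between chunks.

    `fetch_user` is a hypothetical callable standing in for the real API call.
    """
    results = []
    for start in range(0, len(user_ids), max_query_size):
        for user_id in user_ids[start:start + max_query_size]:
            try:
                results.append(fetch_user(user_id))
            except Exception as err:  # a real client would catch its own error type
                print("skipping %s: %s" % (user_id, err))
        # Pause only if another chunk is still waiting to be fetched.
        if start + max_query_size < len(user_ids):
            sleep(pause_seconds)
    return results


def fake_fetch(user_id):
    """Stand-in for an API call; fails for one ID to exercise the error path."""
    if user_id == 3:
        raise ValueError("user not found")
    return {"id": user_id}


naps = []  # records each requested pause instead of actually sleeping
users = fetch_in_chunks([1, 2, 3, 4, 5], fake_fetch, max_query_size=2,
                        pause_seconds=60 * 16, sleep=naps.append)
```

Unlike the slide's loop, this variant skips the pause after the final chunk, which saves one wait per run. Whether that is safe depends on what else the program requests afterwards.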
31. Who's ready to clean?
1. “Flatten” the JSON into one row per user.
2. Variable recodes, e.g. consistently denoting missing values, True/False into 1/0.
3. Select only desired features for modeling.
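The three cleaning steps can be sketched in plain Python as follows. This is an illustration under assumptions rather than the talk's code, which lists pandas for this stage: the helper names, the choice to treat empty strings as missing, and the sample records are all made up.

```python
def flatten(record, parent_key="", sep="_"):
    """Collapse nested dicts into one flat dict, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def recode(value):
    """Map True/False to 1/0 and empty strings to None; leave other values alone."""
    if isinstance(value, bool):
        return int(value)
    if value == "":
        return None
    return value


def clean(users, features):
    """Return one flat row per user, restricted to `features`."""
    rows = []
    for user in users:
        flat = flatten(user)
        # Features absent from a record come back as None (missing).
        rows.append({name: recode(flat.get(name)) for name in features})
    return rows


# Made-up records loosely shaped like user JSON; not real data.
users = [
    {"screen_name": "alice", "verified": True,
     "status": {"text": "hi", "retweet_count": 2}},
    {"screen_name": "bob", "verified": False, "description": ""},
]
features = ["screen_name", "verified", "status_retweet_count", "description"]
rows = clean(users, features)
```

Each row has the same keys in the same order, so the result can be loaded directly into a table for modeling.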
37. e.g. Lexical Diversity
• A token is a sequence of characters that we want to treat as a group.
• For instance, lol, #blessed, or 💉🔪💇
• Lexical diversity is the ratio of unique tokens to total tokens.
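The ratio defined above takes only a few lines to compute. In this minimal sketch, splitting on whitespace is a simplifying assumption. The talk lists NLTK among its tools, but which tokenizer it used is not stated here.

```python
def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens; 0.0 for an empty sequence."""
    if not tokens:
        return 0.0
    return len(set(tokens)) / float(len(tokens))


# Whitespace splitting is an assumption; a tweet-aware tokenizer would
# handle punctuation, hashtags, and emoji more carefully.
tokens = "lol lol #blessed so #blessed lol".split()
print(lexical_diversity(tokens))  # 3 unique / 6 total = 0.5
```

A score near 1.0 means almost every token is distinct, while a score near 0 means the same few tokens repeat.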
55. Python rules!
• The Python language is an incredibly powerful tool for end-to-end data analysis.
• Even so, some tasks are more work than they should be.
64. Clicks
• Twitter Is Plagued With 23 Million Automated Accounts: http://valleywag.gawker.com/twitter-is-riddled-with-23-million-bots-1620466086
• How Twitter Bots Fool You Into Thinking They Are Real People: http://www.fastcompany.com/3031500/how-twitter-bots-fool-you-into-thinking-they-are-real-people
• Rise of the Twitter bots: Social network admits 23 MILLION of its users tweet automatically without human input: http://www.dailymail.co.uk/sciencetech/article-2722677/Rise-Twitter-bots-Social-network-admits-23-MILLION-users-tweet-automatically-without-human-input.html
• Twitter Zombies: 24% of Tweets Created by Bots: http://mashable.com/2009/08/06/twitter-bots/
• How bots are taking over the world: http://www.theguardian.com/commentisfree/2012/mar/30/how-bots-are-taking-over-the-world
• That Time 2 Bots Were Talking, and Bank of America Butted In: http://www.theatlantic.com/technology/archive/2014/07/that-time-2-bots-were-talking-and-bank-of-america-butted-in/374023/
• The Rise of Twitter Bots: http://www.newyorker.com/tech/elements/the-rise-of-twitter-bots
• Olivia Taters, Robot Teenager: http://www.onthemedia.org/story/29-olivia-taters-robot-teenager/