Bot or Not?
End-to-end data analysis in Python
@erinshellman
PyData Seattle, July 26, 2015
PySpark Workshop @Tune, August 27, 6-8pm
Starting a new career in software @Moz, October 22, 6-8pm
Q: Why?
Bots are fun.
Q: How?
Python.
In 2009, 24% of tweets were generated by bots.
Last year (2014) Twitter disclosed that 23 million of its active users were bots.
Hypothesis: Bot behavior is differentiable from human behavior.
Experimental Design
• Ingest data
  • python-twitter
• Clean and process data
  • Pandas, NLTK, Seaborn, IPython Notebooks
• Create a classifier
  • Scikit-learn
Step 1: Get data.
Connecting to Twitter
https://github.com/bear/python-twitter
def get_friends(self, screen_name, count=5000):
    '''
    GET friends/ids i.e. people you follow
    returns a list of JSON blobs
    '''
    friends = self.api.GetFriendIDs(screen_name=screen_name,
                                    count=count)
    return friends

# break query into bite-size chunks 🍔
def blow_chunks(self, data, max_chunk_size):
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]
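As a standalone illustration of the chunking step, the sketch below applies the same slicing idea to a plain list. The function name `chunked` and the sample IDs are made up for the example; they are not from the talk.

```python
def chunked(data, max_chunk_size):
    """Yield successive slices of `data`, each at most `max_chunk_size` long."""
    for i in range(0, len(data), max_chunk_size):
        yield data[i:i + max_chunk_size]


# Hypothetical user IDs, just to show the shape of the output.
user_ids = [11, 22, 33, 44, 55, 66, 77]
batches = list(chunked(user_ids, 3))
print(batches)
```

Because `chunked` is a generator, the batches are produced lazily; wrapping it in `list` here is only for display.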
if len(user_ids) > max_query_size:
    chunks = self.blow_chunks(user_ids, max_chunk_size=max_query_size)
    while True:
        try:
            # next(chunks) works on Python 2.6+ and 3; chunks.next() is Python 2 only
            current_chunk = next(chunks)
            for user in current_chunk:
                try:
                    user_data = self.api.GetUser(user_id=str(user))
                    results.append(user_data.AsDict())
                except Exception:
                    # skip users whose lookup raises an API error
                    print("got a twitter error! D:")
            # sleep 16 minutes between chunks to stay under the rate limit
            print("nap time. ZzZzZzzzzz...")
            time.sleep(60 * 16)
        except StopIteration:
            break
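The fetch-then-nap pattern can be tried without network access by injecting the API call and the sleep function. This is a minimal sketch, not the talk's implementation: `fetch_in_batches`, `fake_fetch`, and the parameter names are hypothetical, and unlike the slide code it pauses only between batches and records failed IDs instead of only printing.

```python
import time


def fetch_in_batches(user_ids, fetch_user, batch_size, pause_seconds, sleep=time.sleep):
    """Fetch users batch by batch, pausing between batches.

    `fetch_user` and `sleep` are injected so the sketch can run offline;
    both are assumptions made for illustration.
    """
    results, failed = [], []
    for start in range(0, len(user_ids), batch_size):
        if start > 0:
            sleep(pause_seconds)  # wait before every batch except the first
        for user_id in user_ids[start:start + batch_size]:
            try:
                results.append(fetch_user(user_id))
            except Exception:
                failed.append(user_id)  # keep failed IDs so they can be retried
    return results, failed


def fake_fetch(user_id):
    """Stand-in for an API call; raises for one ID to exercise the error path."""
    if user_id == 3:
        raise ValueError("simulated API error")
    return {"id": user_id}


naps = []  # records requested pauses instead of actually sleeping
results, failed = fetch_in_batches([1, 2, 3, 4, 5], fake_fetch,
                                   batch_size=2, pause_seconds=16 * 60,
                                   sleep=naps.append)
print(results, failed, naps)
```

In real use, `sleep` would be left at its default and `fetch_user` would wrap the actual API client call.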
{
  "name": "Twitter API",
  "location": "San Francisco, CA",
  "created_at": "Wed May 23 06:01:13 +0000 2007",
  "default_profile": true,
  "favourites_count": 24,
  "url": "http://dev.twitter.com",
  "id": 6253282,
  "profile_use_background_image": true,
  "listed_count": 10713,
  "profile_text_color": "333333",
  "lang": "en",
  "followers_count": 1198334,
  "protected": false,
  "geo_enabled": true,
  "description": "The Real Twitter API.",
  "verified": true,
  "notifications": false,
  "time_zone": "Pacific Time (US & Canada)",
  "statuses_count": 3331,
  "status": {
    "coordinates": null,
    "created_at": "Fri Aug 24 16:15:49 +0000 2012",
    "favorited": false,
    "truncated": false,
sample size = 8509 accounts
Step 2: Preprocessing.
Who's ready to clean?
1. "Flatten" the JSON into one row per user.
2. Variable recodes, e.g. consistently denoting missing values, True/False into 1/0.
3. Select only desired features for modeling.
How to make data with this?
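The three cleaning steps can be sketched in plain Python. This is an illustration under assumptions, not the talk's code: the helper names `flatten` and `recode`, the underscore key separator, and the `wanted` feature list are all made up for the example.

```python
def flatten(record, parent_key="", sep="_"):
    """Flatten nested dicts into one level, joining keys with `sep`."""
    flat = {}
    for key, value in record.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def recode(value):
    """Map True/False to 1/0; leave everything else (including None) unchanged."""
    if isinstance(value, bool):
        return int(value)
    return value


# A trimmed-down user record in the shape of the Twitter JSON shown earlier.
user = {
    "id": 6253282,
    "verified": True,
    "followers_count": 1198334,
    "status": {"coordinates": None, "truncated": False},
}

wanted = ["id", "verified", "followers_count", "status_truncated"]  # assumed feature list
row = {key: recode(value) for key, value in flatten(user).items() if key in wanted}
print(row)
```

Applying the same steps to every account yields one flat row per user, which is the shape a modeling library expects.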
e.g. Lexical Diversity
• A token is a sequence of characters that we want to treat as a group.
• For instance, lol, #blessed, or 💉🔪💇
• Lexical diversity is the ratio of unique tokens to total tokens.
def lexical_diversity(text):
    # text is expected to be a sequence of tokens;
    # a raw string would be compared character by character
    if len(text) == 0:
        diversity = 0
    else:
        diversity = float(len(set(text))) / len(text)
    return diversity

# Easily compute summaries for each user!
grouped = tweets.groupby('screen_name')
diversity = grouped.apply(lexical_diversity)
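To make the per-user computation concrete, here is a dependency-free sketch. It assumes tweets arrive as (screen_name, text) pairs and uses a naive whitespace tokenizer; the sample tweets are invented, and the talk itself lists NLTK and Pandas for this step.

```python
from collections import defaultdict


def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens; 0 for an empty sequence."""
    if len(tokens) == 0:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical tweets as (screen_name, text) pairs.
tweets = [
    ("spam_bot", "buy now buy now buy now"),
    ("human", "coffee first then code"),
    ("human", "code review time"),
]

# Pool every user's tokens across all of their tweets.
tokens_by_user = defaultdict(list)
for screen_name, text in tweets:
    tokens_by_user[screen_name].extend(text.lower().split())  # naive whitespace tokenizer

diversity = {name: lexical_diversity(tokens) for name, tokens in tokens_by_user.items()}
print(diversity)
```

A repetitive account scores low (few unique tokens among many), which is the kind of signal the hypothesis relies on.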
Step 3: Classification.
from sklearn import metrics
from sklearn.ensemble import RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import GaussianNB

# Naive Bayes
bayes = GaussianNB().fit(train[features], y)
bayes_predict = bayes.predict(test[features])

# Logistic regression
logistic = LogisticRegression().fit(train[features], y)
logistic_predict = logistic.predict(test[features])

# Random Forest
rf = RandomForestClassifier().fit(train[features], y)
rf_predict = rf.predict(test[features])

# Classification Metrics
print(metrics.classification_report(test.bot, bayes_predict))
print(metrics.classification_report(test.bot, logistic_predict))
print(metrics.classification_report(test.bot, rf_predict))
Naive Bayes
             precision    recall  f1-score
        0.0       0.97      0.27      0.42
        1.0       0.20      0.95      0.33
avg / total       0.84      0.38      0.41

Logistic Regression
             precision    recall  f1-score
        0.0       0.85      1.00      0.92
        1.0       0.94      0.14      0.12
avg / total       0.87      0.85      0.79

Random Forest
             precision    recall  f1-score
        0.0       0.91      0.98      0.95
        1.0       0.86      0.51      0.64
avg / total       0.90      0.91      0.90
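The f1-score column is the harmonic mean of precision and recall, so it can be rechecked from the other two columns. A small sketch using the bot-class (1.0) rows reported for Naive Bayes and Random Forest; the helper name `f1_score` is ours, and because the inputs are already rounded to two decimals the recomputed values are approximate.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0 when both are 0."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Bot-class (1.0) rows as printed on the slides: (precision, recall).
reported = {
    "Naive Bayes": (0.20, 0.95),
    "Random Forest": (0.86, 0.51),
}
f1_by_model = {model: f1_score(p, r) for model, (p, r) in reported.items()}
for model, f1 in f1_by_model.items():
    print(model, round(f1, 2))
```

Naive Bayes flags almost every bot (high recall) but is wrong most of the time it says "bot" (low precision), which the harmonic mean penalizes heavily.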
from sklearn.model_selection import GridSearchCV

# construct parameter grid
param_grid = {'max_depth': [1, 3, 6, 9, 12, 15, None],
              'max_features': [1, 3, 6, 9, 12],
              # current scikit-learn requires an integer min_samples_split >= 2 (the slide used 1)
              'min_samples_split': [2, 3, 6, 9, 12, 15],
              'min_samples_leaf': [1, 3, 6, 9, 12, 15],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

# fit best classifier
grid_search = GridSearchCV(RandomForestClassifier(), param_grid=param_grid).fit(train[features], y)

# assess predictive accuracy
predict = grid_search.predict(test[features])
print(metrics.classification_report(test.bot, predict))
print(grid_search.best_params_)
Best parameter set for random forest
{'bootstrap': True,
 'min_samples_leaf': 15,
 'min_samples_split': 9,
 'criterion': 'entropy',
 'max_features': 6,
 'max_depth': 9}
Tuned Random Forest
             precision    recall  f1-score
        0.0       0.93      0.99      0.96
        1.0       0.89      0.59      0.71
avg / total       0.92      0.93      0.92

Default Random Forest
             precision    recall  f1-score
        0.0       0.91      0.98      0.95
        1.0       0.86      0.51      0.64
avg / total       0.90      0.91      0.90
Iterative model development
in Scikit-learn is laborious.
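One way to see the cost is to count how many candidate models an exhaustive grid search has to fit. The sketch below copies the random forest parameter grid from the slides and counts the combinations; the fold count `n_folds = 3` is an assumption for illustration, so substitute whatever your cross-validation setup uses.

```python
from itertools import product

# Parameter grid copied from the random forest grid search on the slides.
param_grid = {
    "max_depth": [1, 3, 6, 9, 12, 15, None],
    "max_features": [1, 3, 6, 9, 12],
    "min_samples_split": [1, 3, 6, 9, 12, 15],
    "min_samples_leaf": [1, 3, 6, 9, 12, 15],
    "bootstrap": [True, False],
    "criterion": ["gini", "entropy"],
}

combinations = list(product(*param_grid.values()))
n_folds = 3  # assumed number of cross-validation folds
total_fits = len(combinations) * n_folds
print(len(combinations), total_fits)
```

Every added value in any one list multiplies the total, which is why grids like this grow expensive quickly.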
logistic_model = train(bot ~ statuses_count + friends_count + followers_count,
                       data = train,
                       method = 'glm',
                       family = binomial,
                       preProcess = c('center', 'scale'))
> confusionMatrix(logistic_predictions, test$bot)
Confusion Matrix and Statistics

          Reference
Prediction   0   1
         0 394  22
         1 144  70

               Accuracy : 0.7365
                 95% CI : (0.7003, 0.7705)
    No Information Rate : 0.854
    P-Value [Acc > NIR] : 1
                  Kappa : 0.3183
 Mcnemar's Test P-Value : <2e-16
            Sensitivity : 0.7323
            Specificity : 0.7609
         Pos Pred Value : 0.9471
         Neg Pred Value : 0.3271
             Prevalence : 0.8540
         Detection Rate : 0.6254
   Detection Prevalence : 0.6603
      Balanced Accuracy : 0.7466
       'Positive' Class : 0
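Most of these statistics follow directly from the four cell counts. As a worked check in Python (the variable names are ours), with class 0 treated as the positive class as the output states:

```python
# Counts from the confusion matrix; class 0 is the 'positive' class.
tp, fp = 394, 22   # predicted 0: reference 0, reference 1
fn, tn = 144, 70   # predicted 1: reference 0, reference 1
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)
specificity = tn / (tn + fp)
pos_pred_value = tp / (tp + fp)
neg_pred_value = tn / (tn + fn)
prevalence = (tp + fn) / total
balanced_accuracy = (sensitivity + specificity) / 2

# Cohen's kappa: observed agreement versus agreement expected by chance.
expected = ((tp + fp) * (tp + fn) + (fn + tn) * (fp + tn)) / total ** 2
kappa = (accuracy - expected) / (1 - expected)

for name, value in [("accuracy", accuracy), ("sensitivity", sensitivity),
                    ("specificity", specificity), ("kappa", kappa)]:
    print(name, round(value, 4))
```

Note that the accuracy (0.7365) is below the no-information rate (0.854): always predicting class 0 would score higher on accuracy alone, which is why the balanced accuracy and kappa are worth reading alongside it.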
> summary(logistic_model)

Call:
NULL

Deviance Residuals:
    Min       1Q   Median       3Q      Max
-1.2620  -0.6323  -0.4834  -0.0610   6.0228

Coefficients:
                 Estimate Std. Error z value Pr(>|z|)
(Intercept)       -5.7136     0.7293  -7.835 4.71e-15 ***
statuses_count    -2.4120     0.5026  -4.799 1.59e-06 ***
friends_count     30.8238     3.2536   9.474  < 2e-16 ***
followers_count  -69.4496    10.7190  -6.479 9.22e-11 ***
---
(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 2172.3  on 2521  degrees of freedom
Residual deviance: 1858.3  on 2518  degrees of freedom
AIC: 1866.3

Number of Fisher Scoring iterations: 13
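Two of the reported quantities can be rechecked by hand: each z value is the estimate divided by its standard error, and for a 0/1 response the AIC equals the residual deviance plus twice the number of estimated coefficients. A short Python sketch of that arithmetic (variable names are ours; inputs are the rounded values from the summary, so the results match only to rounding):

```python
# (estimate, standard error) pairs copied from the coefficient table.
coefficients = {
    "(Intercept)": (-5.7136, 0.7293),
    "statuses_count": (-2.4120, 0.5026),
    "friends_count": (30.8238, 3.2536),
    "followers_count": (-69.4496, 10.7190),
}

# Wald z statistic: estimate divided by its standard error.
z_values = {name: est / se for name, (est, se) in coefficients.items()}

# For a binary response the residual deviance is -2 * log-likelihood,
# so AIC = residual deviance + 2 * (number of coefficients).
residual_deviance = 1858.3
aic = residual_deviance + 2 * len(coefficients)

for name, z in z_values.items():
    print(name, round(z, 2))
print("AIC", round(aic, 1))
```

The large coefficient magnitudes are on the centered and scaled predictors, so they describe the effect of a one-standard-deviation change rather than one additional follower or friend.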
# compare models
results = resamples(list(tree_model = tree_model,
                         bagged_model = bagged_model,
                         boost_model = boost_model))

# plot results
dotplot(results)
Step 5: Pontificate.
Python rules!
• The Python language is an incredibly powerful tool for end-to-end data analysis.
• Even so, some tasks are more work than they should be.
Lame bots
And now… the bots.
Clicks
• Twitter Is Plagued With 23 Million Automated Accounts: http://valleywag.gawker.com/twitter-is-riddled-with-23-million-bots-1620466086
• HOW TWITTER BOTS FOOL YOU INTO THINKING THEY ARE REAL PEOPLE: http://www.fastcompany.com/3031500/how-twitter-bots-fool-you-into-thinking-they-are-real-people
• Rise of the Twitter bots: Social network admits 23 MILLION of its users tweet automatically without human input: http://www.dailymail.co.uk/sciencetech/article-2722677/Rise-Twitter-bots-Social-network-admits-23-MILLION-users-tweet-automatically-without-human-input.html
• Twitter Zombies: 24% of Tweets Created by Bots: http://mashable.com/2009/08/06/twitter-bots/
• How bots are taking over the world: http://www.theguardian.com/commentisfree/2012/mar/30/how-bots-are-taking-over-the-world
• That Time 2 Bots Were Talking, and Bank of America Butted In: http://www.theatlantic.com/technology/archive/2014/07/that-time-2-bots-were-talking-and-bank-of-america-butted-in/374023/
• The Rise of Twitter Bots: http://www.newyorker.com/tech/elements/the-rise-of-twitter-bots
• OLIVIA TATERS, ROBOT TEENAGER: http://www.onthemedia.org/story/29-olivia-taters-robot-teenager/

Más contenido relacionado

Destacado

Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfErin Shellman
 
Developing effective data scientists
Developing effective data scientistsDeveloping effective data scientists
Developing effective data scientistsErin Shellman
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with PythonMaris Lemba
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Abhishek Mishra
 
Web Scraping is BS
Web Scraping is BSWeb Scraping is BS
Web Scraping is BSJohn D
 
Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4Eueung Mulyana
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With PythonRobert Dempsey
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoSammy Fung
 
Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupJim Chang
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with PythonPaul Schreiber
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python Viren Rajput
 

Destacado (13)

Assumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourselfAssumptions: Check yo'self before you wreck yourself
Assumptions: Check yo'self before you wreck yourself
 
Developing effective data scientists
Developing effective data scientistsDeveloping effective data scientists
Developing effective data scientists
 
Intro to web scraping with Python
Intro to web scraping with PythonIntro to web scraping with Python
Intro to web scraping with Python
 
Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010Scraping with Python for Fun and Profit - PyCon India 2010
Scraping with Python for Fun and Profit - PyCon India 2010
 
Web Scraping is BS
Web Scraping is BSWeb Scraping is BS
Web Scraping is BS
 
Python beautiful soup - bs4
Python beautiful soup - bs4Python beautiful soup - bs4
Python beautiful soup - bs4
 
Web Scraping With Python
Web Scraping With PythonWeb Scraping With Python
Web Scraping With Python
 
Python, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and DjangoPython, web scraping and content management: Scrapy and Django
Python, web scraping and content management: Scrapy and Django
 
Parse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful SoupParse The Web Using Python+Beautiful Soup
Parse The Web Using Python+Beautiful Soup
 
Web Scraping with Python
Web Scraping with PythonWeb Scraping with Python
Web Scraping with Python
 
Scraping the web with python
Scraping the web with pythonScraping the web with python
Scraping the web with python
 
Web scraping in python
Web scraping in python Web scraping in python
Web scraping in python
 
Beautiful soup
Beautiful soupBeautiful soup
Beautiful soup
 

Similar a Bot or Not

Terminological cluster trees for Disjointness Axiom Discovery
Terminological cluster trees for Disjointness Axiom DiscoveryTerminological cluster trees for Disjointness Axiom Discovery
Terminological cluster trees for Disjointness Axiom DiscoveryGiuseppe Rizzo
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoSri Ambati
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)Amazon Web Services
 
Experiments in genetic programming
Experiments in genetic programmingExperiments in genetic programming
Experiments in genetic programmingLars Marius Garshol
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyDatabricks
 
Machine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionMachine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionTe-Yen Liu
 
Practical machine learning: rational approach
Practical machine learning: rational approachPractical machine learning: rational approach
Practical machine learning: rational approachDzianis Pirshtuk
 
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가JaeCheolKim10
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015Fastly
 
SMART Seminar Series: "Blockchain and its Applications". Presented by Prof Wi...
SMART Seminar Series: "Blockchain and its Applications". Presented by Prof Wi...SMART Seminar Series: "Blockchain and its Applications". Presented by Prof Wi...
SMART Seminar Series: "Blockchain and its Applications". Presented by Prof Wi...SMART Infrastructure Facility
 
Towards a Practice of Token Engineering
Towards a Practice of Token EngineeringTowards a Practice of Token Engineering
Towards a Practice of Token EngineeringTrent McConaghy
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeWim Godden
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeWim Godden
 
H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14Sri Ambati
 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBigML, Inc
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538Krishna Sankar
 
My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755st and Proud!
My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755st and Proud!My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755st and Proud!
My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755st and Proud!Dhiana Deva
 

Similar a Bot or Not (20)

Terminological cluster trees for Disjointness Axiom Discovery
Terminological cluster trees for Disjointness Axiom DiscoveryTerminological cluster trees for Disjointness Axiom Discovery
Terminological cluster trees for Disjointness Axiom Discovery
 
Deep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry LarkoDeep Learning with MXNet - Dmitry Larko
Deep Learning with MXNet - Dmitry Larko
 
wendi_ppt
wendi_pptwendi_ppt
wendi_ppt
 
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
AWS re:Invent 2016: Getting to Ground Truth with Amazon Mechanical Turk (MAC201)
 
Explainable AI
Explainable AIExplainable AI
Explainable AI
 
Experiments in genetic programming
Experiments in genetic programmingExperiments in genetic programming
Experiments in genetic programming
 
High-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-AlchemyHigh-Performance Advanced Analytics with Spark-Alchemy
High-Performance Advanced Analytics with Spark-Alchemy
 
Machine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis IntroductionMachine Learning, Deep Learning and Data Analysis Introduction
Machine Learning, Deep Learning and Data Analysis Introduction
 
Practical machine learning: rational approach
Practical machine learning: rational approachPractical machine learning: rational approach
Practical machine learning: rational approach
 
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가
미움 받을 용기 : 저 팀은 뭘 안다고 추천한다고 들쑤시고 다니는건가
 
It Probably Works - QCon 2015
It Probably Works - QCon 2015It Probably Works - QCon 2015
It Probably Works - QCon 2015
 
Pandas application
Pandas applicationPandas application
Pandas application
 
SMART Seminar Series: "Blockchain and its Applications". Presented by Prof Wi...
SMART Seminar Series: "Blockchain and its Applications". Presented by Prof Wi...SMART Seminar Series: "Blockchain and its Applications". Presented by Prof Wi...
SMART Seminar Series: "Blockchain and its Applications". Presented by Prof Wi...
 
Towards a Practice of Token Engineering
Towards a Practice of Token EngineeringTowards a Practice of Token Engineering
Towards a Practice of Token Engineering
 
Beyond php - it's not (just) about the code
Beyond php - it's not (just) about the codeBeyond php - it's not (just) about the code
Beyond php - it's not (just) about the code
 
Beyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the codeBeyond PHP - it's not (just) about the code
Beyond PHP - it's not (just) about the code
 
H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14H2O Open Source Deep Learning, Arno Candel 03-20-14
H2O Open Source Deep Learning, Arno Candel 03-20-14
 
BSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly DetectionBSSML16 L3. Clusters and Anomaly Detection
BSSML16 L3. Clusters and Anomaly Detection
 
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
R, Data Wrangling & Predicting NFL with Elo like Nate SIlver & 538
 
My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755st and Proud!
My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755st and Proud!My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755st and Proud!
My First Attempt on Kaggle - Higgs Machine Learning Challenge: 755st and Proud!
 

Último

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProduct Anonymous
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationSafe Software
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherRemote DBA Services
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfsudhanshuwaghmare1
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...Martijn de Jong
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Enterprise Knowledge
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonAnna Loughnan Colquhoun
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoffsammart93
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdflior mazor
 

Último (20)

A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time AutomationFrom Event to Action: Accelerate Your Decision Making with Real-Time Automation
From Event to Action: Accelerate Your Decision Making with Real-Time Automation
 
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data DiscoveryTrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
TrustArc Webinar - Unlock the Power of AI-Driven Data Discovery
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
Boost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdfBoost Fertility New Invention Ups Success Rates.pdf
Boost Fertility New Invention Ups Success Rates.pdf
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...2024: Domino Containers - The Next Step. News from the Domino Container commu...
2024: Domino Containers - The Next Step. News from the Domino Container commu...
 
Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...Driving Behavioral Change for Information Management through Data-Driven Gree...
Driving Behavioral Change for Information Management through Data-Driven Gree...
 
Data Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt RobisonData Cloud, More than a CDP by Matt Robison
Data Cloud, More than a CDP by Matt Robison
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot TakeoffStrategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
Strategize a Smooth Tenant-to-tenant Migration and Copilot Takeoff
 
GenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdfGenAI Risks & Security Meetup 01052024.pdf
GenAI Risks & Security Meetup 01052024.pdf
 

Bot or Not

  • 1. Bot Not? @erinshellman PyData Seattle, July 26, 2015 orEnd-to-end data analysis in Python
  • 2.
  • 3. PySpark Workshop @Tune August 27,6-8pm Starting a new career in software @Moz October 22,6-8pm
  • 4. Q: Why? Bots are fun. Q: How? Python.
  • 5.
  • 6.
  • 7. In 2009,24% of tweets were generated by bots.
  • 8. Last year Twitter disclosed that 23 million of its active users were bots.
  • 9.
  • 10.
  • 12. ExperimentalDesign • Ingest data • Clean and process data • Create a classifier
  • 13. ExperimentalDesign • Ingest data • python-twitter • Clean and process data • Pandas,NLTK,Seaborn,iPython Notebooks • Create a classifier • Scikit-learn
  • 15.
  • 17.
  • 18.
  • 19.
  • 21. def get_friends(self, screen_name, count = 5000): ''' GET friends/ids i.e. people you follow returns a list of JSON blobs ''' friends = self.api.GetFriendIDs(screen_name = screen_name, count = count) return friends
  • 22.
  • 23. # break query into bite-size chunks 🍔 def blow_chunks(self, data, max_chunk_size): for i in range(0, len(data), max_chunk_size): yield data[i:i + max_chunk_size]
  • 24-27.
        if len(user_ids) > max_query_size:
            chunks = self.blow_chunks(user_ids, max_chunk_size=max_query_size)
            while True:
                try:
                    current_chunk = next(chunks)
                    for user in current_chunk:
                        try:
                            user_data = self.api.GetUser(user_id=str(user))
                            results.append(user_data.AsDict())
                        except Exception:
                            # skip users that cannot be fetched
                            print("got a twitter error! D:")
                    # sleep 16 minutes between chunks
                    print("nap time. ZzZzZzzzzz...")
                    time.sleep(60 * 16)
                    continue
                except StopIteration:
                    break
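
The fetch-then-nap loop above can be tried without a Twitter connection. The sketch below is not the speaker's code: `fetch_in_chunks`, `fetch_user`, and `fake_fetch` are hypothetical names, the API call is replaced by a stand-in function, and the pause is injectable so it can be skipped. Unlike the slide, it does not sleep after the final chunk.

```python
import time


def fetch_in_chunks(user_ids, fetch_user, chunk_size, pause_seconds, sleep=time.sleep):
    """Fetch users chunk by chunk, pausing between chunks.

    `fetch_user` is a hypothetical callable standing in for the API call;
    `sleep` is injectable so the pause can be replaced in tests.
    """
    results, failed = [], []
    chunks = [user_ids[i:i + chunk_size] for i in range(0, len(user_ids), chunk_size)]
    for index, chunk in enumerate(chunks):
        for user_id in chunk:
            try:
                results.append(fetch_user(user_id))
            except Exception:
                # Remember the failure instead of silently discarding it.
                failed.append(user_id)
        if index < len(chunks) - 1:
            sleep(pause_seconds)  # no pause needed after the last chunk
    return results, failed


def fake_fetch(user_id):
    """Stand-in for the real API call; fails for one ID to exercise error handling."""
    if user_id == 3:
        raise ValueError("simulated API error")
    return {"id": user_id}


naps = []  # records requested pauses instead of actually sleeping
results, failed = fetch_in_chunks([1, 2, 3, 4, 5], fake_fetch, chunk_size=2,
                                  pause_seconds=60 * 16, sleep=naps.append)
print(len(results), failed, naps)
```

Collecting failed IDs in a list makes it possible to retry them later, which a bare `print` does not allow.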
  • 28.
        {
          "name": "Twitter API",
          "location": "San Francisco, CA",
          "created_at": "Wed May 23 06:01:13 +0000 2007",
          "default_profile": true,
          "favourites_count": 24,
          "url": "http://dev.twitter.com",
          "id": 6253282,
          "profile_use_background_image": true,
          "listed_count": 10713,
          "profile_text_color": "333333",
          "lang": "en",
          "followers_count": 1198334,
          "protected": false,
          "geo_enabled": true,
          "description": "The Real Twitter API.",
          "verified": true,
          "notifications": false,
          "time_zone": "Pacific Time (US & Canada)",
          "statuses_count": 3331,
          "status": {
            "coordinates": null,
            "created_at": "Fri Aug 24 16:15:49 +0000 2012",
            "favorited": false,
            "truncated": false,
  • 29. sample size = 8509 accounts
  • 31. Who's ready to clean? 1. "Flatten" the JSON into one row per user. 2. Recode variables, e.g. denote missing values consistently and convert True/False to 1/0. 3. Select only the desired features for modeling.
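
The three cleaning steps on slide 31 can be sketched in plain Python. This is an illustration under assumptions, not the speaker's pipeline: the helper names `flatten` and `recode`, the underscore separator, and the chosen feature list are all made up here. The sample values are taken from the JSON on slide 28.

```python
def flatten(record, parent_key="", sep="_"):
    """Step 1: turn nested dicts into one flat dict (one row per user)."""
    flat = {}
    for key, value in record.items():
        name = parent_key + sep + key if parent_key else key
        if isinstance(value, dict):
            flat.update(flatten(value, name, sep))
        else:
            flat[name] = value
    return flat


def recode(value):
    """Step 2: True/False become 1/0; empty strings are treated as missing."""
    if isinstance(value, bool):
        return int(value)
    if value == "":
        return None
    return value


user = {
    "name": "Twitter API",
    "verified": True,
    "protected": False,
    "followers_count": 1198334,
    "status": {"coordinates": None, "truncated": False},
}

# Step 3: keep only the features wanted for modeling (hypothetical selection).
features = ["verified", "protected", "followers_count", "status_truncated"]
flat = {key: recode(value) for key, value in flatten(user).items()}
row = {name: flat[name] for name in features}
print(row)
```

Nested keys are joined with the parent name, so `status.truncated` becomes the column `status_truncated`.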
  • 37. e.g. Lexical Diversity • A token is a sequence of characters that we want to treat as a group, for instance lol, #blessed, or 💉🔪💇. • Lexical diversity is the ratio of unique tokens to total tokens.
  • 38.
        def lexical_diversity(text):
            if len(text) == 0:
                diversity = 0
            else:
                diversity = float(len(set(text))) / len(text)
            return diversity
  • 39.
        # Easily compute summaries for each user!
        grouped = tweets.groupby('screen_name')
        diversity = grouped.apply(lexical_diversity)
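
A small worked example may help make the lexical diversity ratio concrete. The sketch below assumes whitespace tokenization (a simplification, not necessarily what the talk used) and groups tweets with a plain dictionary; the account names and tweet texts are invented.

```python
from collections import defaultdict


def lexical_diversity(tokens):
    """Ratio of unique tokens to total tokens; 0.0 for an empty token list."""
    if len(tokens) == 0:
        return 0.0
    return len(set(tokens)) / len(tokens)


# Hypothetical (screen_name, tweet text) pairs, for illustration only.
tweets = [
    ("human_account", "lol that meeting could have been an email"),
    ("bot_account", "win win win free free prizes"),
    ("bot_account", "win free prizes now"),
]

# Pool every token a user has tweeted, then compute one ratio per user.
tokens_by_user = defaultdict(list)
for screen_name, text in tweets:
    tokens_by_user[screen_name].extend(text.lower().split())

diversity = {name: lexical_diversity(tokens)
             for name, tokens in tokens_by_user.items()}
print(diversity)  # {'human_account': 1.0, 'bot_account': 0.4}
```

The function must receive a list of tokens: passing a raw string would count unique characters instead of unique words.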
  • 43.
        from sklearn import metrics
        from sklearn.ensemble import RandomForestClassifier
        from sklearn.linear_model import LogisticRegression
        from sklearn.naive_bayes import GaussianNB

        # Naive Bayes
        bayes = GaussianNB().fit(train[features], y)
        bayes_predict = bayes.predict(test[features])

        # Logistic regression
        logistic = LogisticRegression().fit(train[features], y)
        logistic_predict = logistic.predict(test[features])

        # Random Forest
        rf = RandomForestClassifier().fit(train[features], y)
        rf_predict = rf.predict(test[features])

        # Classification Metrics
        print(metrics.classification_report(test.bot, bayes_predict))
        print(metrics.classification_report(test.bot, logistic_predict))
        print(metrics.classification_report(test.bot, rf_predict))
  • 44.
        Naive Bayes
                     precision    recall  f1-score
                0.0       0.97      0.27      0.42
                1.0       0.20      0.95      0.33
        avg / total       0.84      0.38      0.41

        Logistic Regression
                     precision    recall  f1-score
                0.0       0.85      1.00      0.92
                1.0       0.94      0.14      0.12
        avg / total       0.87      0.85      0.79

        Random Forest
                     precision    recall  f1-score
                0.0       0.91      0.98      0.95
                1.0       0.86      0.51      0.64
        avg / total       0.90      0.91      0.90
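
The f1-score column in these reports is the harmonic mean of precision and recall. As a rough check, the sketch below recomputes it for the bot class (1.0) of two of the models. The inputs are the rounded values printed on the slide, so the result can differ slightly from what scikit-learn computed from unrounded numbers.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall; 0.0 when both are zero."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


# Bot class (label 1.0), using the rounded precision and recall from the reports.
naive_bayes_bot = f1_score(0.20, 0.95)
random_forest_bot = f1_score(0.86, 0.51)

print(round(naive_bayes_bot, 2))    # 0.33
print(round(random_forest_bot, 2))  # 0.64
```

Naive Bayes finds nearly every bot (recall 0.95) but is wrong about most of its bot predictions (precision 0.20), which is why its f1-score stays low.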
  • 45.
        # GridSearchCV lived in sklearn.grid_search in 2015-era releases
        from sklearn.model_selection import GridSearchCV

        # construct parameter grid
        # note: recent scikit-learn releases require min_samples_split >= 2
        param_grid = {'max_depth': [1, 3, 6, 9, 12, 15, None],
                      'max_features': [1, 3, 6, 9, 12],
                      'min_samples_split': [1, 3, 6, 9, 12, 15],
                      'min_samples_leaf': [1, 3, 6, 9, 12, 15],
                      'bootstrap': [True, False],
                      'criterion': ['gini', 'entropy']}

        # fit best classifier
        grid_search = GridSearchCV(RandomForestClassifier(),
                                   param_grid=param_grid).fit(train[features], y)

        # assess predictive accuracy
        predict = grid_search.predict(test[features])
        print(metrics.classification_report(test.bot, predict))
  • 46. Best parameter set for random forest:
        print(grid_search.best_params_)
        {'bootstrap': True, 'min_samples_leaf': 15, 'min_samples_split': 9,
         'criterion': 'entropy', 'max_features': 6, 'max_depth': 9}
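
To get a feel for how much work this grid search involves, the sketch below counts the parameter combinations in the grid from slide 45 using only the standard library. It does not run scikit-learn; it only enumerates the candidates, each of which GridSearchCV fits once per cross-validation fold.

```python
from itertools import product

# The grid from the slide, copied verbatim.
param_grid = {'max_depth': [1, 3, 6, 9, 12, 15, None],
              'max_features': [1, 3, 6, 9, 12],
              'min_samples_split': [1, 3, 6, 9, 12, 15],
              'min_samples_leaf': [1, 3, 6, 9, 12, 15],
              'bootstrap': [True, False],
              'criterion': ['gini', 'entropy']}

# Number of candidates = product of the list lengths: 7 * 5 * 6 * 6 * 2 * 2.
n_candidates = 1
for values in param_grid.values():
    n_candidates *= len(values)

# Each candidate is one choice per parameter; show the first one.
first_candidate = dict(zip(param_grid, next(product(*param_grid.values()))))

print(n_candidates)  # 5040
print(first_candidate)
```

With 5,040 candidate settings, an exhaustive search trains thousands of random forests, so trimming the grid is usually the first thing to try when the search is too slow.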
  • 47.
        Default Random Forest
                     precision    recall  f1-score
                0.0       0.91      0.98      0.95
                1.0       0.86      0.51      0.64
        avg / total       0.90      0.91      0.90

        Tuned Random Forest
                     precision    recall  f1-score
                0.0       0.93      0.99      0.96
                1.0       0.89      0.59      0.71
        avg / total       0.92      0.93      0.92
  • 49. Iterative model development in Scikit-learn is laborious.
  • 50.
        logistic_model = train(bot ~ statuses_count + friends_count + followers_count,
                               data = train,
                               method = 'glm',
                               family = binomial,
                               preProcess = c('center', 'scale'))
  • 51.
        > confusionMatrix(logistic_predictions, test$bot)
        Confusion Matrix and Statistics

                  Reference
        Prediction   0   1
                 0 394  22
                 1 144  70

                       Accuracy : 0.7365
                         95% CI : (0.7003, 0.7705)
            No Information Rate : 0.854
            P-Value [Acc > NIR] : 1

                          Kappa : 0.3183
         Mcnemar's Test P-Value : <2e-16

                    Sensitivity : 0.7323
                    Specificity : 0.7609
                 Pos Pred Value : 0.9471
                 Neg Pred Value : 0.3271
                     Prevalence : 0.8540
                 Detection Rate : 0.6254
           Detection Prevalence : 0.6603
              Balanced Accuracy : 0.7466

               'Positive' Class : 0
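
Most of the statistics in this output follow directly from the four cell counts. The sketch below recomputes several of them in Python, treating class 0 as the positive class as the output states. The variable names are this example's own, not part of caret.

```python
# Cell counts from the confusion matrix (rows are predictions, columns are reference).
tp, fp = 394, 22   # predicted 0: reference 0 (true positive), reference 1 (false positive)
fn, tn = 144, 70   # predicted 1: reference 0 (false negative), reference 1 (true negative)
total = tp + fp + fn + tn

accuracy = (tp + tn) / total
sensitivity = tp / (tp + fn)        # share of reference 0 predicted as 0
specificity = tn / (tn + fp)        # share of reference 1 predicted as 1
pos_pred_value = tp / (tp + fp)
neg_pred_value = tn / (tn + fn)
prevalence = (tp + fn) / total      # share of reference 0 in the test set
balanced_accuracy = (sensitivity + specificity) / 2

for name, value in [("Accuracy", accuracy), ("Sensitivity", sensitivity),
                    ("Specificity", specificity), ("Pos Pred Value", pos_pred_value),
                    ("Neg Pred Value", neg_pred_value), ("Prevalence", prevalence),
                    ("Balanced Accuracy", balanced_accuracy)]:
    print(f"{name:>18} : {value:.4f}")
```

The accuracy (0.7365) is below the no-information rate (0.854), meaning that always predicting class 0 would score higher on raw accuracy than this model does.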
  • 52.
        > summary(logistic_model)

        Call:
        NULL

        Deviance Residuals:
            Min       1Q   Median       3Q      Max
        -1.2620  -0.6323  -0.4834  -0.0610   6.0228

        Coefficients:
                         Estimate Std. Error z value Pr(>|z|)
        (Intercept)       -5.7136     0.7293  -7.835 4.71e-15 ***
        statuses_count    -2.4120     0.5026  -4.799 1.59e-06 ***
        friends_count     30.8238     3.2536   9.474  < 2e-16 ***
        followers_count  -69.4496    10.7190  -6.479 9.22e-11 ***
        ---
        (Dispersion parameter for binomial family taken to be 1)

            Null deviance: 2172.3  on 2521  degrees of freedom
        Residual deviance: 1858.3  on 2518  degrees of freedom
        AIC: 1866.3

        Number of Fisher Scoring iterations: 13
  • 53.
        # compare models
        results = resamples(list(tree_model = tree_model,
                                 bagged_model = bagged_model,
                                 boost_model = boost_model))

        # plot results
        dotplot(results)
  • 55. Python rules! • The Python language is an incredibly powerful tool for end-to-end data analysis. • Even so, some tasks are more work than they should be.
  • 64. Clicks
      • Twitter Is Plagued With 23 Million Automated Accounts: http://valleywag.gawker.com/twitter-is-riddled-with-23-million-bots-1620466086
      • How Twitter Bots Fool You Into Thinking They Are Real People: http://www.fastcompany.com/3031500/how-twitter-bots-fool-you-into-thinking-they-are-real-people
      • Rise of the Twitter bots: Social network admits 23 MILLION of its users tweet automatically without human input: http://www.dailymail.co.uk/sciencetech/article-2722677/Rise-Twitter-bots-Social-network-admits-23-MILLION-users-tweet-automatically-without-human-input.html
      • Twitter Zombies: 24% of Tweets Created by Bots: http://mashable.com/2009/08/06/twitter-bots/
      • How bots are taking over the world: http://www.theguardian.com/commentisfree/2012/mar/30/how-bots-are-taking-over-the-world
      • That Time 2 Bots Were Talking, and Bank of America Butted In: http://www.theatlantic.com/technology/archive/2014/07/that-time-2-bots-were-talking-and-bank-of-america-butted-in/374023/
      • The Rise of Twitter Bots: http://www.newyorker.com/tech/elements/the-rise-of-twitter-bots
      • Olivia Taters, Robot Teenager: http://www.onthemedia.org/story/29-olivia-taters-robot-teenager/