These are the slides from my presentation to the NYC Python Meetup on July 28, 2009. The presentation was an overview of data analysis techniques and various Python tools and libraries, along with a practical example (with code and algorithms) of a Twitter spam filter implemented with NLTK.
4. Data Analysis on the Web
Data items change rapidly.
Data items are not independent.
There’s a lot of semi-structured data around.
There’s a LOT of data around.
==
Too many problems, few tools, and few experts.
7. Entity Disambiguation
This is important.
Company disambiguation is a very common problem: are “Microsoft”, “Microsoft Corporation”, and “MS” the same company?
This is a hard problem.
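To make the Microsoft example concrete, here is a minimal sketch of one naive approach: normalize each name, strip legal suffixes, and look up known aliases. The suffix list, the alias table, and the function name `normalize_company` are hypothetical and chosen only for illustration; they are not from the talk.

```python
import re

# Hypothetical suffix list and alias table, for illustration only.
LEGAL_SUFFIXES = {"corporation", "corp", "inc", "incorporated", "ltd", "llc", "co"}
ALIASES = {"ms": "microsoft"}

def normalize_company(name):
    """Map a raw company name to a canonical lowercase key."""
    # Lowercase and keep only alphanumeric tokens (drops punctuation).
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    # Strip trailing legal suffixes such as "Corporation" or "Inc".
    while len(tokens) > 1 and tokens[-1] in LEGAL_SUFFIXES:
        tokens.pop()
    key = " ".join(tokens)
    # Resolve known abbreviations to their canonical key.
    return ALIASES.get(key, key)

names = ["Microsoft", "Microsoft Corporation", "MS"]
keys = {normalize_company(n) for n in names}
print(keys)
```

A hand-maintained alias table does not scale, and a short alias such as "MS" can refer to more than one company depending on context, which is part of why the problem is hard.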
13. Python for Data Analysis
import why_python_is_awesome
Python is readable.
Easy to transition from Matlab or R.
Numerical computing support.
Growing set of machine learning libraries.
18. Data: Tweets
Hand-classified. For example, some spam:
| don't disrespect me. I just wanted yall to get a head start so don't feel bad when I have more followers in two days. http://xyyx.eu/a1ha |
| oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb |
| My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht |
| Yes, Twitter is doing some Follower/Following count corrections. Get it back at: http://xyyx.eu/a1h8 |
| man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4 |
20. Naïve Bayesian Classifier
P(A|B) = the conditional probability of A given B
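For reference, the conditional probability above is tied to its reverse by Bayes' theorem, and a naïve Bayes classifier applies it under the assumption that features are independent given the class. The symbols below are assumptions for illustration: $w_1, \dots, w_n$ stand for the features extracted from a tweet, and "spam" is the class label.

```latex
% Bayes' theorem
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}

% Naive Bayes: features are assumed conditionally independent given the class
P(\text{spam} \mid w_1, \dots, w_n) \;\propto\; P(\text{spam}) \prod_{i=1}^{n} P(w_i \mid \text{spam})
```

The classifier computes the same score for each class and picks the class with the larger one.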
http://yudkowsky.net/rational/bayes
http://blog.oscarbonilla.com/2009/05/visualizing-bayes-theorem/
classifier = nltk.NaiveBayesClassifier.train(train_set)
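The training call expects `train_set` to be a list of `(featureset, label)` pairs, where each featureset is a dict. A minimal sketch of how such a set might be built follows. The feature extractor `tweet_features` is hypothetical and is not necessarily the one used in the talk; the spam tweets are taken from the Data slide, and the "ham" tweets are made-up placeholders.

```python
import nltk

def tweet_features(text):
    """Hypothetical extractor: word-presence features plus a URL flag."""
    tokens = text.lower().split()
    features = {
        "contains(%s)" % tok: True
        for tok in tokens
        if not tok.startswith("http")
    }
    features["has_url"] = any(tok.startswith("http") for tok in tokens)
    return features

labeled_tweets = [
    # Spam examples from the Data slide.
    ("oh yay more new followers..hiii...if u want go to http://xyyx.eu/a1hb", "spam"),
    ("My friend made this new tool to get more twitter followers, http://xyyx.eu/a1ht", "spam"),
    ("man if i see one more person cry about losing followers!!! http://xyyx.eu/a1h4", "spam"),
    # Made-up non-spam placeholders, for illustration only.
    ("Heading to the Python meetup tonight", "ham"),
    ("Just finished reading a chapter on statistics", "ham"),
    ("Coffee first, then code", "ham"),
]

# NLTK expects a list of (featureset dict, label) pairs.
train_set = [(tweet_features(text), label) for text, label in labeled_tweets]
classifier = nltk.NaiveBayesClassifier.train(train_set)

# Classify an unseen (made-up) tweet.
label = classifier.classify(tweet_features("get more followers now http://example.com/x"))
print(label)
```

With only six training tweets the predicted label is not meaningful; a real filter needs a much larger hand-classified set.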
21. Classifier Accuracy
Use a hand-classified test set to measure the accuracy of the classifier:
nltk.classify.accuracy(classifier, test_set)
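As a rough illustration of that call, the sketch below trains on one hand-labeled set and scores on a separate held-out set. The featuresets are made-up toy data with hypothetical feature names (`has_url`, `mentions_followers`); they are not the features or data from the talk.

```python
import nltk

# Made-up toy featuresets, for illustration only.
train_set = [
    ({"has_url": True, "mentions_followers": True}, "spam"),
    ({"has_url": True, "mentions_followers": True}, "spam"),
    ({"has_url": False, "mentions_followers": False}, "ham"),
    ({"has_url": False, "mentions_followers": False}, "ham"),
]

# The test set must be kept separate from the training data.
test_set = [
    ({"has_url": True, "mentions_followers": True}, "spam"),
    ({"has_url": False, "mentions_followers": False}, "ham"),
    ({"has_url": True, "mentions_followers": False}, "ham"),
]

classifier = nltk.NaiveBayesClassifier.train(train_set)

# Fraction of test items whose predicted label matches the hand label.
accuracy = nltk.classify.accuracy(classifier, test_set)
print("accuracy: %.2f" % accuracy)
```

Scoring on the training data itself would overstate accuracy, which is why the test tweets are held out.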
24. Results
90% accuracy on spam tweets – not bad!
Other possibilities:
categorization – what do you tweet about?
human vs bot?
which celebrity tweeter are you?