Small Data Machine Learning
Andrei Zmievski

The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic.
Questions - now and later
WORK
We are all superheroes, because we help our customers keep their mission-critical apps
running smoothly. If interested, I can show you a demo of what I’m working on. Come find
me.
TRAVEL
TAKE PHOTOS
DRINK BEER
MAKE BEER
MATH
SOME MATH
AWESOME MATH
@a
For those of you who don’t know me: acquired in October 2008.
Had a different account earlier, but then @k asked if I wanted it.
Know many other single-letter Twitterers.
FAME
FORTUNE
FOLLOWERS

Advantages. Wall Street Journal?! lol, what?!
MAXIMUM REPLY SPACE!
140 − length("@a ") = 137
CONS

Disadvantages
Visual filtering is next to impossible.
Could be a set of hard-coded rules derived empirically.
I hate humanity.
A
D
D
Annoyance
Driven
Development
The best way to learn something is to be annoyed enough to build a solution with it.
Machine Learning
to the Rescue!
REPLYCLEANER

Even with false negatives, it reduces garbage to the point where visual filtering is possible.
- Uses a trained model to classify tweets into good/bad.
- Blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline.
I still hate humanity
Machine Learning

A branch of Artificial Intelligence
No widely accepted definition
“Field of study that gives computers the ability to learn without being explicitly programmed.”
— Arthur Samuel (1959)

Concerns the construction and study of systems that can learn from data.
SPAM FILTERING
RECOMMENDATIONS
TRANSLATION
CLUSTERING
And many more: medical diagnoses, detecting credit card fraud, etc.
supervised
Labeled dataset; training maps inputs to desired outputs.
Examples: regression (predicting house prices), classification (spam filtering).

unsupervised
No labels in the dataset; the algorithm needs to find structure on its own.
Example: clustering.
We will be talking about classification, a supervised learning process.
Feature
An individual measurable property of the phenomenon under observation; usually numeric.
Feature Vector
A set of features for an observation. Think of it as an array.
features · parameters = prediction

feature          parameter
1                45.7
# of rooms (2)   102.3
sq. m            0.94
house age        -10.1
yard?            83.0
                 → prediction: 758,013

Feature vector and weights vector. A 1 is added to pad the vector (it accounts for the initial offset / bias / intercept weight and simplifies calculation). The dot product of the two produces a linear predictor; here it predicts a price of 758,013.
dot product

X = [1  x1  x2  …]
θ = [θ0  θ1  θ2  …]

θ·X = θ0 + θ1·x1 + θ2·x2 + …

X is the input feature vector; θ (theta) holds the weights.
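To make the linear predictor concrete, here is a minimal PHP sketch (not from the original deck; the house's feature values are illustrative, since the slides only show the weights):

<?php
// Dot product of the weights vector and a padded feature vector.
// Assumes $x[0] is the padding 1, so $theta[0] acts as the bias/intercept.
function dot_product(array $theta, array $x): float {
    $z = 0.0;
    foreach ($theta as $i => $weight) {
        $z += $weight * $x[$i];
    }
    return $z;
}

// Weights from the house-price slide; the feature values here are made up.
echo dot_product([45.7, 102.3, 0.94, -10.1, 83.0], [1, 2, 150.0, 10, 1]);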
training data → learning algorithm → hypothesis

Hypothesis (decision function): what the system has learned so far.
The hypothesis is applied to new data.
input data → hθ(X) → prediction y
(θ: the parameters)

The task of our algorithm is to determine the parameters of the hypothesis.
LINEAR REGRESSION

[Scatter plot: whisky age (5–35 years) on the x-axis vs. whisky price ($40–200) on the y-axis, with a fitted line.]

Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price.
Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
LOGISTIC REGRESSION

g(z) = 1 / (1 + e^(−z)),  where z = θ·X

[Plot: the logistic curve, rising from 0 through 0.5 at z = 0 toward 1.]

Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at the origin.
z is just our old dot product, the linear predictor. The sigmoid transforms the unbounded output into a bounded one.
LOGISTIC REGRESSION

hθ(X) = 1 / (1 + e^(−θ·X))

The probability that y = 1 for input X.

If the hypothesis describes spam, then given X = the body of an email, hθ(X) = 0.7 means there’s a 70% chance it’s spam. Where to threshold that is up to you.
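A minimal PHP sketch of the hypothesis (an illustration, not the talk’s actual code):

<?php
// Logistic (sigmoid) function: squashes the unbounded linear predictor z
// into (0, 1) so it can be read as a probability.
function sigmoid(float $z): float {
    return 1.0 / (1.0 + exp(-$z));
}

// h_theta(X): probability that y = 1 for input X ($x[0] is the padding 1).
function hypothesis(array $theta, array $x): float {
    $z = 0.0;
    foreach ($theta as $i => $weight) {
        $z += $weight * $x[$i];
    }
    return sigmoid($z);
}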
Building the Tool
Corpus
A collection of source data used for training and testing the model.
Twitter → phirehose → MongoDB

phirehose hooks into the streaming API. Collected 8,500 tweets.
Feature Identification

independent & discriminant

Independent: feature A should not co-occur (correlate) with feature B highly.
Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
possible features
‣ @a at the end of the tweet
‣ @a...
‣ length < N chars
‣ # of user mentions in the tweet
‣ # of hashtags
‣ language!
‣ @a followed by punctuation and a word character (except for apostrophe)
‣ …and more
feature = extractor(tweet)

For each feature, write a small function that takes a tweet and returns a numeric (floating-point) value.
corpus → extractors → feature vectors

Run the set of these functions over the corpus to build up the feature vectors (an array of arrays), as in the sketch below, and save them to the DB.
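A minimal sketch of what this might look like in PHP; the feature functions and regexes are illustrative assumptions, not the tool’s actual rules:

<?php
// Hypothetical extractors: one small function per feature, tweet in, float out.
$extractors = [
    'num_mentions' => fn(string $t): float => (float) preg_match_all('/@\w+/', $t),
    'num_hashtags' => fn(string $t): float => (float) preg_match_all('/#\w+/', $t),
    'a_at_end'     => fn(string $t): float => preg_match('/@a\s*$/', $t) ? 1.0 : 0.0,
];

$corpus = ['@a great talk!', 'RT @a follow me #lol #wat'];  // stand-in for the MongoDB data

// Build one padded feature vector per tweet: an array of arrays.
$vectors = [];
foreach ($corpus as $tweet) {
    $vector = [1.0];                      // the padding 1 / bias term
    foreach ($extractors as $extract) {
        $vector[] = $extract($tweet);
    }
    $vectors[] = $vector;
}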
Language Matters

There is a high correlation between the language of the tweet and its category (good/bad).
Indonesian or Tagalog?
Garbage.
Top 12 Languages

id  Indonesian   3548
en  English      1804
tl  Tagalog       733
es  Spanish       329
so  Somali        305
ja  Japanese      300
pt  Portuguese    262
ar  Arabic        256
nl  Dutch         150
it  Italian       137
sw  Swahili       118
fr  French         92

I guarantee you people aren’t tweeting at me in Swahili.
Language Detection

pear/Text_LanguageDetect
pecl/textcat

Can’t trust the language field in the user’s profile data.
Used character N-grams and character sets for detection.
Has its own error rate, so it needs some post-processing.
EnglishNotEnglish

✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it’s English; else:
✓ If not a character set-determined language, try harder:
  ✓ Tokenize into words
  ✓ Diff with the English vocabulary
  ✓ If words remain, run a parts-of-speech tagger on each
  ✓ For NNS, VBZ, and VBD, run a stemming algorithm
  ✓ If the result is in the English vocabulary, remove it from the remaining list
  ✓ If the remaining list is not empty, calculate:
    unusual_word_ratio = size(remaining) / size(words)
  ✓ If the ratio < 20%, pretend it’s English

A lot of this is heuristic-based, arrived at after some trial and error.
Seems to help with my corpus.
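A rough PHP sketch of the tail end of that checklist, assuming $vocabulary holds an English word list loaded elsewhere; the POS-tagging and stemming passes are elided:

<?php
// Heuristic: does the text look English enough?
function looks_english(string $text, array $vocabulary): bool {
    // Clean up: strip mentions and links.
    $text = preg_replace('~@\w+|https?://\S+~', ' ', $text);

    // Tokenize into words and diff against the English vocabulary.
    preg_match_all("/[a-z']+/", strtolower($text), $m);
    $words = $m[0];
    if (count($words) === 0) {
        return true;  // nothing to judge: pretend it's English
    }
    $remaining = array_diff($words, $vocabulary);
    // (The real tool POS-tags and stems $remaining here before the ratio.)
    $unusual_word_ratio = count($remaining) / count($words);
    return $unusual_word_ratio < 0.20;
}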
BINARY CLASSIFICATION

Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.

INPUT: feature vectors
OUTPUT: labels (good/bad)

That gave me my input and output.
BIAS CORRECTION

One more thing to address: the corpus was 99% bad (fewer than 100 tweets were good).
Training a model as-is would not produce good results. Need to adjust the bias.
OVERSAMPLING

Use multiple copies of good tweets to equalize with the bad.
Problem: the bias is very high; each good tweet would have to be copied about 100 times, and would not contribute any variance to the good category.

UNDERSAMPLING

Drop most of the bad tweets to equalize with the good.
Problem: the total corpus ends up being < 200 tweets, not enough for training.
Synthetic OVERSAMPLING

Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.

chance   feature
90%      “good” language
70%      no hashtags
25%      1 hashtag
5%       2 hashtags
2%       @a at the end
85%      rand length > 10

(One sampled vector from the slides: 1, 2, 0, 77.)

The actual synthesis is somewhat more complex and was also trial-and-error based.
Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
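A hedged sketch of the weighted random selection, using the hashtag probabilities from the table above (the real synthesis was more complex):

<?php
// Pick a value according to a percentage-weighted table, e.g. hashtag counts.
function pick_weighted(array $weights): int {
    $r = mt_rand(1, 100);
    $cumulative = 0;
    foreach ($weights as $value => $percent) {
        $cumulative += $percent;
        if ($r <= $cumulative) {
            return $value;
        }
    }
    return (int) array_key_first($weights);  // fallback if weights sum < 100
}

// One synthetic "good" feature vector, using the slide's probabilities.
$synthetic = [
    'good_language' => mt_rand(1, 100) <= 90 ? 1 : 0,
    'hashtags'      => pick_weighted([0 => 70, 1 => 25, 2 => 5]),
    'a_at_end'      => mt_rand(1, 100) <= 2 ? 1 : 0,
    'length'        => mt_rand(1, 100) <= 85 ? mt_rand(11, 137) : mt_rand(1, 10),
];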
Model Training

We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?

COST FUNCTION

J(θ) = (1/m) · Σ_{i=1..m} Cost(hθ(x^(i)), y^(i))

Measures how far the PREDICTION of the system is from REALITY.
The cost depends on the parameters. The lower the cost, the closer we are to the ideal parameters for the model.
LOGISTIC COST

Cost(hθ(x), y) = −log(hθ(x))      if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))  if y = 0

[Plots: for y = 1, cost is 0 at hθ(x) = 1 and grows without bound as hθ(x) → 0; mirrored for y = 0.]

Correct guess: cost = 0. Incorrect guess: cost = huge.
When y = 1 and h(x) is 1 (a good guess), the cost is 0, but the closer h(x) gets to 0 (a wrong guess), the more we penalize the algorithm. Same for y = 0.
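In PHP, the per-example cost and the average J(θ) might look like this (a sketch; the predictions would come from the hypothesis() sketched earlier):

<?php
// Per-example logistic cost: 0 for a confident correct guess, huge for a
// confident wrong one. $h is the prediction in (0, 1); $y is the label.
function logistic_cost(float $h, int $y): float {
    return $y === 1 ? -log($h) : -log(1.0 - $h);
}

// J(theta): average cost over the m training examples.
function cost_j(array $predictions, array $labels): float {
    $total = 0.0;
    foreach ($predictions as $i => $h) {
        $total += logistic_cost($h, $labels[$i]);
    }
    return $total / count($predictions);
}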
minimize cost over θ

Find the values of θ that minimize the cost.
GRADIENT DESCENT
Random starting point.
Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step.
Repeat.
Imagine a ball rolling down from a hill.
θi := θi − α · ∂J(θ)/∂θi

Each step adjusts the parameters according to the slope.
θi: each parameter. They have to be updated simultaneously (the whole vector at a time).
α: the learning rate. Controls how big a step you take. If α is big, you have an aggressive gradient descent; if it’s too small, you take tiny steps and it takes too long; if it’s too big, you can overshoot the minimum and fail to converge.
∂J(θ)/∂θi: the derivative, aka “the slope”. The slope indicates the steepness and direction of the descent step for each weight.
Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus the number of iterations and see where it starts to approach 0; past that are diminishing returns.
THE UPDATE ALGORITHM

θi := θi − α · Σ_{j=1..m} (hθ(x^(j)) − y^(j)) · xi^(j)

The derivative for logistic regression simplifies to this term.
The weights have to be updated simultaneously!
X1 = [1 12.0]   y1 = 1
X2 = [1 -3.5]   y2 = 0
θ = [0.1 0.1]   α = 0.05

h(X1) = 1 / (1 + e^(−(0.1·1 + 0.1·12.0))) = 0.786
h(X2) = 1 / (1 + e^(−(0.1·1 + 0.1·(−3.5)))) = 0.438

T0 = 0.1 − 0.05 · ((h(X1) − y1)·X1[0] + (h(X2) − y2)·X2[0])
   = 0.1 − 0.05 · ((0.786 − 1)·1 + (0.438 − 0)·1)
   = 0.088

T1 = 0.1 − 0.05 · ((h(X1) − y1)·X1[1] + (h(X2) − y2)·X2[1])
   = 0.1 − 0.05 · ((0.786 − 1)·12.0 + (0.438 − 0)·(−3.5))
   = 0.305

The hypothesis is computed for each data point based on the current parameters; note that the hypotheses don’t change within the iteration. Each parameter is updated in order and the result is saved to a temporary.

θ = [T0 T1] = [0.088 0.305]

Replace the parameter (weights) vector with the temporaries, then do the next iteration.
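The whole update as a self-contained PHP sketch that reproduces the worked example above (hypothesis() is repeated here so the snippet runs on its own):

<?php
function hypothesis(array $theta, array $x): float {
    $z = 0.0;
    foreach ($theta as $i => $w) {
        $z += $w * $x[$i];
    }
    return 1.0 / (1.0 + exp(-$z));
}

$X = [[1, 12.0], [1, -3.5]];
$y = [1, 0];
$theta = [0.1, 0.1];
$alpha = 0.05;

// Hypotheses are computed once: they must not change within the iteration.
$h = array_map(fn(array $x): float => hypothesis($theta, $x), $X);

// Update every parameter into a temporary, then swap in simultaneously.
$temp = $theta;
foreach ($theta as $i => $t) {
    $gradient = 0.0;
    foreach ($X as $j => $x) {
        $gradient += ($h[$j] - $y[$j]) * $x[$i];
    }
    $temp[$i] = $t - $alpha * $gradient;
}
$theta = $temp;

print_r($theta);  // ≈ [0.0888, 0.3051], matching the slides' 0.088 and 0.305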
CROSS-VALIDATION

[Diagram: DATA split into TRAINING DATA and TEST DATA.]

Used to assess the results of the training.
Train the model on the training set, then test the results on the test set.
Rinse, lather, repeat feature selection/synthesis/training until the results are "good enough".
Pick the best parameters and save them (to the DB or elsewhere).
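A minimal train/test split sketch in PHP (the talk doesn’t give the split ratio; 80/20 is an assumption here):

<?php
// Shuffle the corpus and split it into training and test sets.
function split_corpus(array $vectors, array $labels, float $trainRatio = 0.8): array {
    $indices = range(0, count($vectors) - 1);
    shuffle($indices);
    $cut = (int) floor(count($indices) * $trainRatio);

    $train = ['X' => [], 'y' => []];
    $test  = ['X' => [], 'y' => []];
    foreach ($indices as $pos => $i) {
        if ($pos < $cut) {
            $train['X'][] = $vectors[$i];
            $train['y'][] = $labels[$i];
        } else {
            $test['X'][] = $vectors[$i];
            $test['y'][] = $labels[$i];
        }
    }
    return [$train, $test];
}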
Putting It All Together

Let’s put our model to use, finally.
The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though:

Load the model
The weights we have calculated via training. Easiest is to load them from the DB (which also makes it easy to test different models).
HARD-CODED RULES

We apply some hard-coded rules to filter out the tweets we are certain are good or bad (sketched below).

SKIP:
- truncated retweets: "RT @a ..."
- @ mentions of friends
- tweets from friends

The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
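A hedged sketch of such a pre-filter; the field names follow the Twitter Streaming API’s JSON payload, and $friendIds is assumed to hold the IDs of accounts I follow:

<?php
// Hard-coded skip rules: tweets we are certain about never reach the model.
function should_skip(array $tweet, array $friendIds): bool {
    // Truncated retweets ("RT @a ...") don't show up on the web anyway.
    if (preg_match('/^RT @a\b/i', $tweet['text'])) {
        return true;
    }
    // Tweets from friends are trusted.
    if (in_array($tweet['user']['id'], $friendIds, true)) {
        return true;
    }
    // Tweets that @-mention a friend are trusted too.
    foreach ($tweet['entities']['user_mentions'] ?? [] as $mention) {
        if (in_array($mention['id'], $friendIds, true)) {
            return true;
        }
    }
    return false;
}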
Classifying Tweets

GOOD or BAD: this is the moment we’ve been waiting for.
Remember this? First, our hypothesis:

hθ(X) = 1 / (1 + e^(−θ·X))

θ·X = θ0 + θ1·X1 + θ2·X2 + …
Finally

hθ(X) = 1 / (1 + e^(−(θ0 + θ1·X1 + θ2·X2 + …)))

If h > threshold, the tweet is bad; otherwise it’s good.

Remember that the output of h() is 0..1 (a probability). The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
3 simple steps

1. extract features: invoke the feature extractor to construct the feature vector for this tweet.
2. run the model: evaluate the decision function over the feature vector (feed the calculated feature values into the equation).
3. act on the result: use the output of the classifier.
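The classification step, wired together as a sketch (hypothesis() is the earlier logistic sketch, repeated so this is self-contained; the weights and features in the usage line are hypothetical):

<?php
function hypothesis(array $theta, array $x): float {
    $z = 0.0;
    foreach ($theta as $i => $w) {
        $z += $w * $x[$i];
    }
    return 1.0 / (1.0 + exp(-$z));
}

// h > threshold means "bad"; 0.9 is high on purpose, to reduce false positives.
function classify(array $theta, array $features, float $threshold = 0.9): string {
    return hypothesis($theta, $features) > $threshold ? 'bad' : 'good';
}

echo classify([0.1, 0.1], [1.0, 12.0]);  // hypothetical weights and features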
BAD? Block user!

Also save the tweet to the DB for future analysis.
Lessons Learned

Twitter API is a pain in the rear: connection handling, backoff in case of problems, undocumented API errors, etc.
Blocking is the only option (and is final): there is no way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
Streaming API delivery is incomplete: some tweets are shown on the website, but never seen through the API.
ReplyCleaner judged to be ~80% effective: lots of room for improvement.
PHP sucks at math-y stuff.
NEXT STEPS

★ Realtime feedback: click on the bad tweets and they are immediately incorporated into the model.
★ More features
★ Grammar analysis: to eliminate the common “@a bar” or “two @a time” occurrences.
★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
★ Clockwork Raven for manual classification: farm manual classification out to Mechanical Turk.
★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick alpha, often faster than gradient descent.
★ Wish pecl/scikit-learn existed
TOOLS

★ MongoDB (great fit for JSON data)
★ pear/Text_LanguageDetect
★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
★ Parts-Of-Speech tagging
★ SplFixedArray (memory savings and slightly faster)
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample
LEARN

★ Coursera.org ML course
★ Ian Barber’s blog
★ FastML.com
Questions?

Más contenido relacionado

Similar a Small Data Machine Learning Insights

Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Daniel Katz
 
A well-typed program never goes wrong
A well-typed program never goes wrongA well-typed program never goes wrong
A well-typed program never goes wrongJulien Wetterwald
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015lbishal
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep LearningCloudxLab
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learningknowbigdata
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep LearningShubhWadekar
 
Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Vincenzo Santopietro
 
Deep learning from scratch
Deep learning from scratch Deep learning from scratch
Deep learning from scratch Eran Shlomo
 
Types Working for You, Not Against You
Types Working for You, Not Against YouTypes Working for You, Not Against You
Types Working for You, Not Against YouC4Media
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11darwinrlo
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine LearningNarong Intiruk
 
Module 2: Machine Learning Deep Dive
Module 2:  Machine Learning Deep DiveModule 2:  Machine Learning Deep Dive
Module 2: Machine Learning Deep DiveSara Hooker
 
Software fundamentals
Software fundamentalsSoftware fundamentals
Software fundamentalsSusan Winters
 
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Codemotion
 
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 Deep Anomaly Detection from Research to Production Leveraging Spark and Tens... Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...Databricks
 

Similar a Small Data Machine Learning Insights (20)

Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
 
A well-typed program never goes wrong
A well-typed program never goes wrongA well-typed program never goes wrong
A well-typed program never goes wrong
 
Shap
ShapShap
Shap
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)
 
Deep learning from scratch
Deep learning from scratch Deep learning from scratch
Deep learning from scratch
 
Types Working for You, Not Against You
Types Working for You, Not Against YouTypes Working for You, Not Against You
Types Working for You, Not Against You
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine Learning
 
Module 2: Machine Learning Deep Dive
Module 2:  Machine Learning Deep DiveModule 2:  Machine Learning Deep Dive
Module 2: Machine Learning Deep Dive
 
c_tutorial_2.ppt
c_tutorial_2.pptc_tutorial_2.ppt
c_tutorial_2.ppt
 
Software fundamentals
Software fundamentalsSoftware fundamentals
Software fundamentals
 
supervised.pptx
supervised.pptxsupervised.pptx
supervised.pptx
 
OO Design
OO DesignOO Design
OO Design
 
Diving into Tensorflow.js
Diving into Tensorflow.jsDiving into Tensorflow.js
Diving into Tensorflow.js
 
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
 
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 Deep Anomaly Detection from Research to Production Leveraging Spark and Tens... Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 

Más de PHP Conference Argentina

2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...PHP Conference Argentina
 
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source 2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source PHP Conference Argentina
 
2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a ProPHP Conference Argentina
 
2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida realPHP Conference Argentina
 
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...PHP Conference Argentina
 

Más de PHP Conference Argentina (7)

2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
 
2013 - Mark story - Avoiding the Owasp
2013 - Mark story - Avoiding the Owasp2013 - Mark story - Avoiding the Owasp
2013 - Mark story - Avoiding the Owasp
 
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source 2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
 
2013 - Benjamin Eberlei - Doctrine 2
2013 - Benjamin Eberlei - Doctrine 22013 - Benjamin Eberlei - Doctrine 2
2013 - Benjamin Eberlei - Doctrine 2
 
2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro
 
2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real
 
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
 

Último

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Último (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Small Data Machine Learning Insights

  • 1. Small Data Machine Learning Andrei Zmievski The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions - now and later
  • 2. WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
  • 3. WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
  • 11. @a For those of you who don’t know me.. Acquired in October 2008 Had a different account earlier, but then @k asked if I wanted it.. Know many other single-letter Twitterers.
  • 18. CONS Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 19. CONS Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 20. CONS I hate humanity Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 21. A D D
  • 22. Annoyance Driven Development Best way to learn something is to be annoyed enough to create a solution based on the tech.
  • 24. REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
  • 25. REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
  • 30.
  • 31. I still hate humanity
  • 32. I still hate humanity I still hate humanity
  • 33. Machine Learning A branch of Artificial Intelligence No widely accepted definition
  • 34. “Field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel (1959) concerns the construction and study of systems that can learn from data
  • 38. CLUSTERING And many more: medical diagnoses, detecting credit card fraud, etc.
  • 39. supervised unsupervised Labeled dataset, training maps input to desired outputs Example: regression - predicting house prices, classification - spam filtering
  • 40. supervised unsupervised no labels in the dataset, algorithm needs to find structure Example: clustering We will be talking about classification, a supervised learning process.
  • 41. Feature individual measurable property of the phenomenon under observation usually numeric
  • 42.
  • 43. Feature Vector a set of features for an observation Think of it as an array
  • 44. features # of rooms 2 sq. m house age yard? feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 45. features parameters # of rooms 2 sq. m house age yard? 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 46. features parameters 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 47. features parameters 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 48. features parameters = prediction 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 49. features parameters = prediction 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 758,013 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 50. dot product ⇥ ⇤ X = 1 x1 x2 . . . ⇥ ⇤ ✓ = ✓0 ✓1 ✓2 . . . X - input feature vector theta - weights
  • 51. dot product ⇥ ⇤ X = 1 x1 x2 . . . ⇥ ⇤ ✓ = ✓0 ✓1 ✓2 . . . ✓·X = ✓0 + ✓1 x1 + ✓2 x2 + . . . X - input feature vector theta - weights
  • 52. training data learning algorithm hypothesis Hypothesis (decision function): what the system has learned so far Hypothesis is applied to new data
  • 53. hθ(X) The task of our algorithm is to determine the parameters of the hypothesis.
  • 54. input data hθ(X) The task of our algorithm is to determine the parameters of the hypothesis.
  • 55. input data hθ(X) parameters The task of our algorithm is to determine the parameters of the hypothesis.
  • 56. input data hθ(X) prediction y parameters The task of our algorithm is to determine the parameters of the hypothesis.
  • 57. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 58. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 59. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 60. 1 0.5 z 0 1 g(z) = 1+e z LOGISTIC REGRESSION Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at origin. z is just our old dot product, the linear predictor. Transforms unbounded output into bounded.
  • 61. 1 0.5 z 0 1 g(z) = 1+e z =✓·X z LOGISTIC REGRESSION Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at origin. z is just our old dot product, the linear predictor. Transforms unbounded output into bounded.
  • 62. 1 h✓ (X) = 1+e ✓·X Probability that y=1 for input X LOGISTIC REGRESSION If hypothesis describes spam, then given X=body of email, if y=0.7, means there’s 70% chance it’s spam. Thresholding on that is up to you.
  • 64. Corpus collection of source data used for training and testing the model
  • 68. independent & discriminant Independent: feature A should not co-occur (correlate) with feature B highly. Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
  • 69. possible features @a at the end of the tweet ‣ @a... ‣ length < N chars ‣ # of user mentions in the tweet ‣ # of hashtags ‣ language! ‣ @a followed by punctuation and a word character (except for apostrophe) ‣ …and more ‣
  • 70. feature = extractor(tweet) For each feature, write a small function that takes a tweet and returns a numeric value (floating-point).
  • 71. corpus extractors feature vectors Run the set of these functions over the corpus and build up feature vectors Array of arrays Save to DB
  • 72. Language Matters high correlation between the language of the tweet and its category (good/bad)
  • 74. Top 12 Languages id en tl es so ja pt ar nl it sw fr Indonesian English Tagalog Spanish Somalian Japanese Portuguese Arabic Dutch Italian Swahili French I guarantee you people aren’t tweeting at me in Swahili. 3548 1804 733 329 305 300 262 256 150 137 118 92
  • 75. Language Detection Can’t trust the language field in user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
  • 76. Language Detection pear / Text_LanguageDetect pecl / textcat Can’t trust the language field in user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
  • 77. EnglishNotEnglish ✓ ✓ ✓ ✓ Clean-up text (remove mentions, links, etc) Run language detection If unknown/low weight, pretend it’s English, else: If not a character set-determined language, try harder: ✓ Tokenize into words ✓ Difference with English vocabulary ✓ If words remain, run parts-of-speech tagger on each ✓ For NNS, VBZ, and VBD run stemming algorithm ✓ If result is in English vocabulary, remove from remaining ✓ If remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words) ✓ If ratio < 20%, pretend it’s English A lot of this is heuristic-based, after some trial-and-error. Seems to help with my corpus.
  • 78. BINARY CLASSIFICATION Grunt work Built a web-based tool to display tweets a page at a time and select good ones
  • 81. BIAS CORRECTION BAD 99% = bad (less < 100 tweets were good) Training a model as-is would not produce good results Need to adjust the bias GOOD
  • 83. OVER SAMPLING Oversampling: use multiple copies of good tweets to equalize with bad Problem: bias very high, each good tweet would have to be copied 100 times, and not contribute any variance to the good category
  • 84. OVER SAMPLING Oversampling: use multiple copies of good tweets to equalize with bad Problem: bias very high, each good tweet would have to be copied 100 times, and not contribute any variance to the good category
  • 85. OVER SAMPLING UNDER Undersampling: drop most of the bad tweets to equalize with good Problem: total corpus ends up being < 200 tweets, not enough for training
  • 86. SAMPLING UNDER Undersampling: drop most of the bad tweets to equalize with good Problem: total corpus ends up being < 200 tweets, not enough for training
  • 87. Synthetic OVERSAMPLING Synthesize feature vectors by determining what constitutes a good tweet and do weighted random selection of feature values.
  • 88. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 89. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 90. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 91. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 0 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 92. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 0 77 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 93. Model Training We have the hypothesis (decision function) and the training set, How do we actually determine the weights/parameters?
  • 94. COST FUNCTION Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 95. REALITY COST FUNCTION PREDICTION Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 96. COST FUNCTION m X 1 J(✓) = Cost(h✓ (x), y) m i=1 Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 97. LOGISTIC COST Cost(h✓ (x), y) = ( log (h✓ (x)) log (1 h✓ (x)) if y = 1 if y = 0
  • 98. LOGISTIC COST y=1 0 y=0 1 Correct guess Incorrect guess 0 1 Cost = 0 Cost = huge When y=1 and h(x) is 1 (good guess), cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y=0.
  • 99. minimize cost OVER θ Finding the best values of Theta that minimize the cost
  • 100. GRADIENT DESCENT Random starting point. Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down from a hill.
  • 101. $\theta_i := \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}$ GRADIENT DESCENT Each step adjusts the parameters according to the slope.
  • 102. $\theta_i := \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}$ — for each parameter $\theta_i$. Have to update them simultaneously (the whole vector at a time).
  • 103. learning rate $\alpha$: controls how big a step you take. If $\alpha$ is big, you have an aggressive gradient descent; if it is too small, you take tiny steps and training takes too long; if it is too big, you can overshoot the minimum and fail to converge.
  • 104. derivative $\frac{\partial J(\theta)}{\partial \theta_i}$, aka “the slope”: indicates the steepness and direction of the descent step for each weight. Keep going for a number of iterations or until the cost falls below a threshold (convergence). Graph the cost function versus # of iterations and see where it starts to level off; past that are diminishing returns.
  • 105. THE UPDATE ALGORITHM $\theta_i := \theta_i - \alpha \sum_{j=1}^{m} \left(h_\theta(x^{(j)}) - y^{(j)}\right) x_i^{(j)}$ The derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
  • 115. X1 = [1 12.0] X2 = [1 -3.5] y1 = 1 y2 = 0 θ = [0.1 0.1] α = 0.05
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X1₀ + (h(X2) - y2) • X2₀) = 0.1 - 0.05 • ((0.786 - 1) • 1 + (0.438 - 0) • 1) = 0.089
The hypothesis for each data point is based on the current parameters. Each parameter is updated in order and the result is saved to a temporary.
  • 116. T1 = 0.1 - 0.05 • ((h(X1) - y1) • X1₁ + (h(X2) - y2) • X2₁) = 0.1 - 0.05 • ((0.786 - 1) • 12.0 + (0.438 - 0) • -3.5) = 0.305 Note that the hypotheses don’t change within the iteration.
  • 117. X1 = [1 12.0] X2 = [1 -3.5] y1 = 1 y2 = 0 α = 0.05 θ = [T0 T1] Replace the parameter (weights) vector with the temporaries.
  • 118. X1 = [1 12.0] X2 = [1 -3.5] y1 = 1 y2 = 0 α = 0.05 θ = [0.089 0.305] Do the next iteration.
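The whole iteration above fits in a few lines of PHP. A sketch (not the tool’s actual code) that reproduces the numbers from the worked example:

<?php
function sigmoid(float $z): float {
    return 1.0 / (1.0 + exp(-$z));
}

// h_theta(x) = sigmoid(theta . x)
function hypothesis(array $theta, array $x): float {
    $z = 0.0;
    foreach ($x as $i => $xi) {
        $z += $theta[$i] * $xi;
    }
    return sigmoid($z);
}

// One batch gradient descent step. Every parameter is computed from the
// same (old) $theta, i.e. the whole vector is updated simultaneously.
function gradientStep(array $theta, array $X, array $y, float $alpha): array {
    $temp = $theta;
    for ($i = 0, $n = count($theta); $i < $n; $i++) {
        $sum = 0.0;
        foreach ($X as $j => $xj) {
            $sum += (hypothesis($theta, $xj) - $y[$j]) * $xj[$i];
        }
        $temp[$i] = $theta[$i] - $alpha * $sum;
    }
    return $temp;
}

$theta = gradientStep([0.1, 0.1], [[1, 12.0], [1, -3.5]], [1, 0], 0.05);
// $theta is now approximately [0.089, 0.305]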
  • 119. CROSS-VALIDATION Used to assess the results of the training.
  • 122. [diagram: dataset split into TRAINING DATA and TEST DATA] Train the model on the training set, then test the results on the test set. Rinse, lather, repeat feature selection/synthesis/training until the results are “good enough”. Pick the best parameters and save them (DB, other).
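A sketch of such a split, assuming a 70/30 ratio (the deck doesn’t state the actual ratio used):

<?php
// Shuffle, then cut the labeled examples into training and test sets.
function splitDataset(array $examples, float $trainRatio = 0.7): array {
    shuffle($examples); // randomize so the split isn't biased by insertion order
    $cut = (int) floor(count($examples) * $trainRatio);
    return [
        'train' => array_slice($examples, 0, $cut),
        'test'  => array_slice($examples, $cut),
    ];
}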
  • 123. Putting It All Together Let’s put our model to use, finally. The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet though..
  • 124. Load the model The weights we have calculated via training. Easiest is to load them from a DB (which also makes it easy to test different models).
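A sketch of loading the weights. The “models” table with a JSON-encoded weights column is a hypothetical schema, not necessarily what the tool uses (the deck mentions MongoDB, but any store works):

<?php
// Load a named weight vector from a relational DB via PDO (hypothetical schema).
function loadModel(PDO $db, string $name): array {
    $stmt = $db->prepare('SELECT weights FROM models WHERE name = ?');
    $stmt->execute([$name]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        throw new RuntimeException("no model named '$name'");
    }
    return json_decode($row['weights'], true); // e.g. [0.089, 0.305, ...]
}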
  • 128. HARD-CODED RULES SKIP: truncated retweets ("RT @A ..."), @ mentions of friends, tweets from friends. We apply some hard-coded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those. A sketch of these pre-filters follows.
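The pre-filters might look like this; the tweet array shape ('text', 'user_id', 'mentioned_ids') and the friends list are assumptions for illustration:

<?php
// Hard-coded rules: skip tweets we're already certain about.
function shouldSkip(array $tweet, array $friends): bool {
    // Truncated retweets ("RT @A ...") don't show up in clients anyway.
    if (preg_match('/^RT @a\b/i', $tweet['text'])) {
        return true;
    }
    // Tweets from friends are trusted.
    if (in_array($tweet['user_id'], $friends, true)) {
        return true;
    }
    // So are tweets that @-mention a friend.
    foreach ($tweet['mentioned_ids'] as $id) {
        if (in_array($id, $friends, true)) {
            return true;
        }
    }
    return false;
}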
  • 131. Classifying Tweets: GOOD or BAD This is the moment we’ve been waiting for.
  • 133. Remember this? $h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}$ where $\theta \cdot X = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots$ First is our hypothesis.
  • 134. Finally $h_\theta(X) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots)}}$ If h > threshold, the tweet is bad, otherwise good. Remember that the output of h() is 0..1 (a probability). The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
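As a sketch, the whole decision function is a dot product, a sigmoid, and a comparison (the function name is mine; 0.9 is the threshold mentioned above):

<?php
// theta . x -> sigmoid -> compare against the tolerance threshold.
function isBadTweet(array $theta, array $features, float $threshold = 0.9): bool {
    $z = 0.0;
    foreach ($features as $i => $xi) {
        $z += $theta[$i] * $xi;
    }
    $h = 1.0 / (1.0 + exp(-$z)); // probability in 0..1
    return $h > $threshold;      // above the threshold = bad tweet
}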
  • 137. 3 simple steps: extract features, run the model, act on the result. Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (plug the calculated feature values into the equation). Use the output of the classifier. Glue code for these steps is sketched below.
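The glue for the three steps might read like this; extractFeatures(), blockUser() and saveForAnalysis() are hypothetical helpers standing in for pieces described elsewhere in the deck:

<?php
// Per-tweet pipeline: hard-coded rules first, then the classifier.
function handleTweet(array $tweet, array $theta, array $friends): void {
    if (shouldSkip($tweet, $friends)) {
        return;                               // certain-good tweets bypass the model
    }
    $features = extractFeatures($tweet);      // 1. extract features
    if (isBadTweet($theta, $features)) {      // 2. run the model
        blockUser($tweet['user_id']);         // 3. act on the result
        saveForAnalysis($tweet);              //    and keep it for future training
    }
}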
  • 138. BAD? Block user! Also save the tweet to the DB for future analysis.
  • 144. Lessons Learned
Twitter API is a pain in the rear: connection handling, backoff in case of problems, undocumented API errors, etc.
Blocking is the only option (and it is final): there is no way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
Streaming API delivery is incomplete: some tweets are shown on the website but are never seen through the API.
ReplyCleaner judged to be ~80% effective: lots of room for improvement.
PHP sucks at math-y stuff.
  • 145. NEXT STEPS
★ Realtime feedback: click on the tweets that are bad and it immediately incorporates them into the model.
★ More features
★ Grammar analysis: to eliminate the common "@a bar" or "two @a time" occurrences.
★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
★ Clockwork Raven for manual classification: farm it out to Mechanical Turk.
★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick α, often faster than GD.
★ Wish pecl/scikit-learn existed
  • 146. TOOLS
★ MongoDB (great fit for JSON data)
★ pear/Text_LanguageDetect
★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
★ Parts-Of-Speech tagging
★ SplFixedArray in PHP (memory savings and slightly faster)
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample
  • 147. LEARN
★ Coursera.org ML course
★ Ian Barber’s blog
★ FastML.com