Small Data Machine Learning
Andrei Zmievski

The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic.
Questions - now and later
WORK
We are all superheroes, because we help our customers keep their mission-critical apps
running smoothly. If interested, I can show you a demo of what I’m working on. Come find
me.
TRAVEL
TAKE PHOTOS
DRINK BEER
MAKE BEER
MATH
SOME MATH
AWESOME MATH
@a
For those of you who don’t know me: acquired in October 2008.
Had a different account earlier, but then @k asked if I wanted it.
Know many other single-letter Twitterers.
FAME
FORTUNE
FOLLOWERS

Advantages. Wall Street Journal?! lol, what?!
MAXIMUM REPLY SPACE!
140 − length("@a ") = 137
CONS

Disadvantages
Visual filtering is next to impossible.
Could be a set of hard-coded rules derived empirically.
I hate humanity.
A
D
D
Annoyance
Driven
Development
The best way to learn something is to be annoyed enough to build a solution with it.
Machine Learning
to the Rescue!
REPLYCLEANER

Even with false negatives, it reduces garbage to the point where visual filtering is possible.
- Uses a trained model to classify tweets into good/bad.
- Blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline.
I still hate humanity
Machine Learning

A branch of Artificial Intelligence
No widely accepted definition
“Field of study that gives computers the ability to learn without being explicitly programmed.”
— Arthur Samuel (1959)

Concerns the construction and study of systems that can learn from data.
SPAM FILTERING
RECOMMENDATIONS
TRANSLATION
CLUSTERING
And many more: medical diagnoses, detecting credit card fraud, etc.
supervised
Labeled dataset; training maps inputs to desired outputs.
Examples: regression (predicting house prices), classification (spam filtering).

unsupervised
No labels in the dataset; the algorithm needs to find structure on its own.
Example: clustering.
We will be talking about classification, a supervised learning process.
Feature
An individual measurable property of the phenomenon under observation; usually numeric.
Feature Vector
A set of features for an observation. Think of it as an array.
features · parameters = prediction

feature          parameter
1                45.7
# of rooms (2)   102.3
sq. m            0.94
house age        -10.1
yard?            83.0
                 → prediction: 758,013

Feature vector and weights vector. A 1 is added to pad the vector (it accounts for the initial offset / bias / intercept weight and simplifies calculation). The dot product of the two produces a linear predictor; here it predicts a price of 758,013.
dot product

X = [1  x1  x2  …]
θ = [θ0  θ1  θ2  …]

θ·X = θ0 + θ1·x1 + θ2·x2 + …

X is the input feature vector; θ (theta) holds the weights.
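To make the linear predictor concrete, here is a minimal PHP sketch (not from the original deck; the house's feature values are illustrative, since the slides only show the weights):

<?php
// Dot product of the weights vector and a padded feature vector.
// Assumes $x[0] is the padding 1, so $theta[0] acts as the bias/intercept.
function dot_product(array $theta, array $x): float {
    $z = 0.0;
    foreach ($theta as $i => $weight) {
        $z += $weight * $x[$i];
    }
    return $z;
}

// Weights from the house-price slide; the feature values here are made up.
echo dot_product([45.7, 102.3, 0.94, -10.1, 83.0], [1, 2, 150.0, 10, 1]);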
training data → learning algorithm → hypothesis

Hypothesis (decision function): what the system has learned so far.
The hypothesis is applied to new data.
input data → hθ(X) → prediction y
(θ: the parameters)

The task of our algorithm is to determine the parameters of the hypothesis.
LINEAR REGRESSION

[Scatter plot: whisky age (5–35 years) on the x-axis vs. whisky price ($40–200) on the y-axis, with a fitted line.]

Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price.
Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
LOGISTIC REGRESSION

g(z) = 1 / (1 + e^(−z)),  where z = θ·X

[Plot: the logistic curve, rising from 0 through 0.5 at z = 0 toward 1.]

Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at the origin.
z is just our old dot product, the linear predictor. The sigmoid transforms the unbounded output into a bounded one.
LOGISTIC REGRESSION

hθ(X) = 1 / (1 + e^(−θ·X))

The probability that y = 1 for input X.

If the hypothesis describes spam, then given X = the body of an email, hθ(X) = 0.7 means there’s a 70% chance it’s spam. Where to threshold that is up to you.
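A minimal PHP sketch of the hypothesis (an illustration, not the talk’s actual code):

<?php
// Logistic (sigmoid) function: squashes the unbounded linear predictor z
// into (0, 1) so it can be read as a probability.
function sigmoid(float $z): float {
    return 1.0 / (1.0 + exp(-$z));
}

// h_theta(X): probability that y = 1 for input X ($x[0] is the padding 1).
function hypothesis(array $theta, array $x): float {
    $z = 0.0;
    foreach ($theta as $i => $weight) {
        $z += $weight * $x[$i];
    }
    return sigmoid($z);
}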
Building the Tool
Corpus
A collection of source data used for training and testing the model.
Twitter → phirehose → MongoDB

phirehose hooks into the streaming API. Collected 8,500 tweets.
Feature Identification

independent & discriminant

Independent: feature A should not co-occur (correlate) with feature B highly.
Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
possible features
‣ @a at the end of the tweet
‣ @a...
‣ length < N chars
‣ # of user mentions in the tweet
‣ # of hashtags
‣ language!
‣ @a followed by punctuation and a word character (except for apostrophe)
‣ …and more
feature = extractor(tweet)

For each feature, write a small function that takes a tweet and returns a numeric (floating-point) value.
corpus → extractors → feature vectors

Run the set of these functions over the corpus to build up the feature vectors (an array of arrays), as in the sketch below, and save them to the DB.
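A minimal sketch of what this might look like in PHP; the feature functions and regexes are illustrative assumptions, not the tool’s actual rules:

<?php
// Hypothetical extractors: one small function per feature, tweet in, float out.
$extractors = [
    'num_mentions' => fn(string $t): float => (float) preg_match_all('/@\w+/', $t),
    'num_hashtags' => fn(string $t): float => (float) preg_match_all('/#\w+/', $t),
    'a_at_end'     => fn(string $t): float => preg_match('/@a\s*$/', $t) ? 1.0 : 0.0,
];

$corpus = ['@a great talk!', 'RT @a follow me #lol #wat'];  // stand-in for the MongoDB data

// Build one padded feature vector per tweet: an array of arrays.
$vectors = [];
foreach ($corpus as $tweet) {
    $vector = [1.0];                      // the padding 1 / bias term
    foreach ($extractors as $extract) {
        $vector[] = $extract($tweet);
    }
    $vectors[] = $vector;
}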
Language Matters

There is a high correlation between the language of the tweet and its category (good/bad).
Indonesian or Tagalog?
Garbage.
Top 12 Languages

id  Indonesian   3548
en  English      1804
tl  Tagalog       733
es  Spanish       329
so  Somali        305
ja  Japanese      300
pt  Portuguese    262
ar  Arabic        256
nl  Dutch         150
it  Italian       137
sw  Swahili       118
fr  French         92

I guarantee you people aren’t tweeting at me in Swahili.
Language Detection

pear/Text_LanguageDetect
pecl/textcat

Can’t trust the language field in the user’s profile data.
Used character N-grams and character sets for detection.
Has its own error rate, so it needs some post-processing.
EnglishNotEnglish

✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it’s English; else:
✓ If not a character set-determined language, try harder:
  ✓ Tokenize into words
  ✓ Diff with the English vocabulary
  ✓ If words remain, run a parts-of-speech tagger on each
  ✓ For NNS, VBZ, and VBD, run a stemming algorithm
  ✓ If the result is in the English vocabulary, remove it from the remaining list
  ✓ If the remaining list is not empty, calculate:
    unusual_word_ratio = size(remaining) / size(words)
  ✓ If the ratio < 20%, pretend it’s English

A lot of this is heuristic-based, arrived at after some trial and error.
Seems to help with my corpus.
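A rough PHP sketch of the tail end of that checklist, assuming $vocabulary holds an English word list loaded elsewhere; the POS-tagging and stemming passes are elided:

<?php
// Heuristic: does the text look English enough?
function looks_english(string $text, array $vocabulary): bool {
    // Clean up: strip mentions and links.
    $text = preg_replace('~@\w+|https?://\S+~', ' ', $text);

    // Tokenize into words and diff against the English vocabulary.
    preg_match_all("/[a-z']+/", strtolower($text), $m);
    $words = $m[0];
    if (count($words) === 0) {
        return true;  // nothing to judge: pretend it's English
    }
    $remaining = array_diff($words, $vocabulary);
    // (The real tool POS-tags and stems $remaining here before the ratio.)
    $unusual_word_ratio = count($remaining) / count($words);
    return $unusual_word_ratio < 0.20;
}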
BINARY CLASSIFICATION

Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.

INPUT: feature vectors
OUTPUT: labels (good/bad)

That gave me my input and output.
BIAS CORRECTION

One more thing to address: the corpus was 99% bad (fewer than 100 tweets were good).
Training a model as-is would not produce good results. Need to adjust the bias.
OVERSAMPLING

Use multiple copies of good tweets to equalize with the bad.
Problem: the bias is very high; each good tweet would have to be copied about 100 times, and would not contribute any variance to the good category.

UNDERSAMPLING

Drop most of the bad tweets to equalize with the good.
Problem: the total corpus ends up being < 200 tweets, not enough for training.
Synthetic OVERSAMPLING

Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.

chance   feature
90%      “good” language
70%      no hashtags
25%      1 hashtag
5%       2 hashtags
2%       @a at the end
85%      rand length > 10

(One sampled vector from the slides: 1, 2, 0, 77.)

The actual synthesis is somewhat more complex and was also trial-and-error based.
Synthesized tweets + existing good tweets = 2/3 of the number of bad tweets in the training corpus (limited to 1000).
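A hedged sketch of the weighted random selection, using the hashtag probabilities from the table above (the real synthesis was more complex):

<?php
// Pick a value according to a percentage-weighted table, e.g. hashtag counts.
function pick_weighted(array $weights): int {
    $r = mt_rand(1, 100);
    $cumulative = 0;
    foreach ($weights as $value => $percent) {
        $cumulative += $percent;
        if ($r <= $cumulative) {
            return $value;
        }
    }
    return (int) array_key_first($weights);  // fallback if weights sum < 100
}

// One synthetic "good" feature vector, using the slide's probabilities.
$synthetic = [
    'good_language' => mt_rand(1, 100) <= 90 ? 1 : 0,
    'hashtags'      => pick_weighted([0 => 70, 1 => 25, 2 => 5]),
    'a_at_end'      => mt_rand(1, 100) <= 2 ? 1 : 0,
    'length'        => mt_rand(1, 100) <= 85 ? mt_rand(11, 137) : mt_rand(1, 10),
];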
Model Training

We have the hypothesis (decision function) and the training set. How do we actually determine the weights/parameters?

COST FUNCTION

J(θ) = (1/m) · Σ_{i=1..m} Cost(hθ(x^(i)), y^(i))

Measures how far the PREDICTION of the system is from REALITY.
The cost depends on the parameters. The lower the cost, the closer we are to the ideal parameters for the model.
LOGISTIC COST

Cost(hθ(x), y) = −log(hθ(x))      if y = 1
Cost(hθ(x), y) = −log(1 − hθ(x))  if y = 0

[Plots: for y = 1, cost is 0 at hθ(x) = 1 and grows without bound as hθ(x) → 0; mirrored for y = 0.]

Correct guess: cost = 0. Incorrect guess: cost = huge.
When y = 1 and h(x) is 1 (a good guess), the cost is 0, but the closer h(x) gets to 0 (a wrong guess), the more we penalize the algorithm. Same for y = 0.
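In PHP, the per-example cost and the average J(θ) might look like this (a sketch; the predictions would come from the hypothesis() sketched earlier):

<?php
// Per-example logistic cost: 0 for a confident correct guess, huge for a
// confident wrong one. $h is the prediction in (0, 1); $y is the label.
function logistic_cost(float $h, int $y): float {
    return $y === 1 ? -log($h) : -log(1.0 - $h);
}

// J(theta): average cost over the m training examples.
function cost_j(array $predictions, array $labels): float {
    $total = 0.0;
    foreach ($predictions as $i => $h) {
        $total += logistic_cost($h, $labels[$i]);
    }
    return $total / count($predictions);
}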
minimize cost over θ

Find the values of θ that minimize the cost.
GRADIENT DESCENT
Random starting point.
Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step.
Repeat.
Imagine a ball rolling down from a hill.
θi := θi − α · ∂J(θ)/∂θi

Each step adjusts the parameters according to the slope.
θi: each parameter. They have to be updated simultaneously (the whole vector at a time).
α: the learning rate. Controls how big a step you take. If α is big, you have an aggressive gradient descent; if it’s too small, you take tiny steps and it takes too long; if it’s too big, you can overshoot the minimum and fail to converge.
∂J(θ)/∂θi: the derivative, aka “the slope”. The slope indicates the steepness and direction of the descent step for each weight.
Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus the number of iterations and see where it starts to approach 0; past that are diminishing returns.
THE UPDATE ALGORITHM

θi := θi − α · Σ_{j=1..m} (hθ(x^(j)) − y^(j)) · xi^(j)

The derivative for logistic regression simplifies to this term.
The weights have to be updated simultaneously!
X1 = [1 12.0]   y1 = 1
X2 = [1 -3.5]   y2 = 0
θ = [0.1 0.1]   α = 0.05

h(X1) = 1 / (1 + e^(−(0.1·1 + 0.1·12.0))) = 0.786
h(X2) = 1 / (1 + e^(−(0.1·1 + 0.1·(−3.5)))) = 0.438

T0 = 0.1 − 0.05 · ((h(X1) − y1)·X1[0] + (h(X2) − y2)·X2[0])
   = 0.1 − 0.05 · ((0.786 − 1)·1 + (0.438 − 0)·1)
   = 0.088

T1 = 0.1 − 0.05 · ((h(X1) − y1)·X1[1] + (h(X2) − y2)·X2[1])
   = 0.1 − 0.05 · ((0.786 − 1)·12.0 + (0.438 − 0)·(−3.5))
   = 0.305

The hypothesis is computed for each data point based on the current parameters; note that the hypotheses don’t change within the iteration. Each parameter is updated in order and the result is saved to a temporary.

θ = [T0 T1] = [0.088 0.305]

Replace the parameter (weights) vector with the temporaries, then do the next iteration.
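The whole update as a self-contained PHP sketch that reproduces the worked example above (hypothesis() is repeated here so the snippet runs on its own):

<?php
function hypothesis(array $theta, array $x): float {
    $z = 0.0;
    foreach ($theta as $i => $w) {
        $z += $w * $x[$i];
    }
    return 1.0 / (1.0 + exp(-$z));
}

$X = [[1, 12.0], [1, -3.5]];
$y = [1, 0];
$theta = [0.1, 0.1];
$alpha = 0.05;

// Hypotheses are computed once: they must not change within the iteration.
$h = array_map(fn(array $x): float => hypothesis($theta, $x), $X);

// Update every parameter into a temporary, then swap in simultaneously.
$temp = $theta;
foreach ($theta as $i => $t) {
    $gradient = 0.0;
    foreach ($X as $j => $x) {
        $gradient += ($h[$j] - $y[$j]) * $x[$i];
    }
    $temp[$i] = $t - $alpha * $gradient;
}
$theta = $temp;

print_r($theta);  // ≈ [0.0888, 0.3051], matching the slides' 0.088 and 0.305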
CROSS-VALIDATION

[Diagram: DATA split into TRAINING DATA and TEST DATA.]

Used to assess the results of the training.
Train the model on the training set, then test the results on the test set.
Rinse, lather, repeat feature selection/synthesis/training until the results are "good enough".
Pick the best parameters and save them (to the DB or elsewhere).
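A minimal train/test split sketch in PHP (the talk doesn’t give the split ratio; 80/20 is an assumption here):

<?php
// Shuffle the corpus and split it into training and test sets.
function split_corpus(array $vectors, array $labels, float $trainRatio = 0.8): array {
    $indices = range(0, count($vectors) - 1);
    shuffle($indices);
    $cut = (int) floor(count($indices) * $trainRatio);

    $train = ['X' => [], 'y' => []];
    $test  = ['X' => [], 'y' => []];
    foreach ($indices as $pos => $i) {
        if ($pos < $cut) {
            $train['X'][] = $vectors[$i];
            $train['y'][] = $labels[$i];
        } else {
            $test['X'][] = $vectors[$i];
            $test['y'][] = $labels[$i];
        }
    }
    return [$train, $test];
}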
Putting It All Together

Let’s put our model to use, finally.
The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though:

Load the model
The weights we have calculated via training. Easiest is to load them from the DB (which also makes it easy to test different models).
HARD-CODED RULES

We apply some hard-coded rules to filter out the tweets we are certain are good or bad (sketched below).

SKIP:
- truncated retweets: "RT @a ..."
- @ mentions of friends
- tweets from friends

The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those.
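A hedged sketch of such a pre-filter; the field names follow the Twitter Streaming API’s JSON payload, and $friendIds is assumed to hold the IDs of accounts I follow:

<?php
// Hard-coded skip rules: tweets we are certain about never reach the model.
function should_skip(array $tweet, array $friendIds): bool {
    // Truncated retweets ("RT @a ...") don't show up on the web anyway.
    if (preg_match('/^RT @a\b/i', $tweet['text'])) {
        return true;
    }
    // Tweets from friends are trusted.
    if (in_array($tweet['user']['id'], $friendIds, true)) {
        return true;
    }
    // Tweets that @-mention a friend are trusted too.
    foreach ($tweet['entities']['user_mentions'] ?? [] as $mention) {
        if (in_array($mention['id'], $friendIds, true)) {
            return true;
        }
    }
    return false;
}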
Classifying Tweets

GOOD or BAD: this is the moment we’ve been waiting for.
Remember this? First, our hypothesis:

hθ(X) = 1 / (1 + e^(−θ·X))

θ·X = θ0 + θ1·X1 + θ2·X2 + …
Finally

hθ(X) = 1 / (1 + e^(−(θ0 + θ1·X1 + θ2·X2 + …)))

If h > threshold, the tweet is bad; otherwise it’s good.

Remember that the output of h() is 0..1 (a probability). The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
3 simple steps

1. extract features: invoke the feature extractor to construct the feature vector for this tweet.
2. run the model: evaluate the decision function over the feature vector (feed the calculated feature values into the equation).
3. act on the result: use the output of the classifier.
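The classification step, wired together as a sketch (hypothesis() is the earlier logistic sketch, repeated so this is self-contained; the weights and features in the usage line are hypothetical):

<?php
function hypothesis(array $theta, array $x): float {
    $z = 0.0;
    foreach ($theta as $i => $w) {
        $z += $w * $x[$i];
    }
    return 1.0 / (1.0 + exp(-$z));
}

// h > threshold means "bad"; 0.9 is high on purpose, to reduce false positives.
function classify(array $theta, array $features, float $threshold = 0.9): string {
    return hypothesis($theta, $features) > $threshold ? 'bad' : 'good';
}

echo classify([0.1, 0.1], [1.0, 12.0]);  // hypothetical weights and features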
BAD? Block user!

Also save the tweet to the DB for future analysis.
Lessons Learned

Twitter API is a pain in the rear: connection handling, backoff in case of problems, undocumented API errors, etc.
Blocking is the only option (and is final): there is no way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
Streaming API delivery is incomplete: some tweets are shown on the website, but never seen through the API.
ReplyCleaner judged to be ~80% effective: lots of room for improvement.
PHP sucks at math-y stuff.
NEXT STEPS

★ Realtime feedback: click on the bad tweets and they are immediately incorporated into the model.
★ More features
★ Grammar analysis: to eliminate the common “@a bar” or “two @a time” occurrences.
★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
★ Clockwork Raven for manual classification: farm manual classification out to Mechanical Turk.
★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick alpha, often faster than gradient descent.
★ Wish pecl/scikit-learn existed
TOOLS

★ MongoDB (great fit for JSON data)
★ pear/Text_LanguageDetect
★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
★ Parts-Of-Speech tagging
★ SplFixedArray (memory savings and slightly faster)
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample
LEARN

★ Coursera.org ML course
★ Ian Barber’s blog
★ FastML.com
Questions?

Más contenido relacionado

Similar a Small Data Machine Learning Insights

Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Daniel Katz
 
A well-typed program never goes wrong
A well-typed program never goes wrongA well-typed program never goes wrong
A well-typed program never goes wrongJulien Wetterwald
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015lbishal
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep LearningCloudxLab
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learningknowbigdata
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep LearningShubhWadekar
 
Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Vincenzo Santopietro
 
Deep learning from scratch
Deep learning from scratch Deep learning from scratch
Deep learning from scratch Eran Shlomo
 
Types Working for You, Not Against You
Types Working for You, Not Against YouTypes Working for You, Not Against You
Types Working for You, Not Against YouC4Media
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11darwinrlo
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine LearningNarong Intiruk
 
Module 2: Machine Learning Deep Dive
Module 2:  Machine Learning Deep DiveModule 2:  Machine Learning Deep Dive
Module 2: Machine Learning Deep DiveSara Hooker
 
Software fundamentals
Software fundamentalsSoftware fundamentals
Software fundamentalsSusan Winters
 
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Codemotion
 
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 Deep Anomaly Detection from Research to Production Leveraging Spark and Tens... Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...Databricks
 

Similar a Small Data Machine Learning Insights (20)

Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
 
A well-typed program never goes wrong
A well-typed program never goes wrongA well-typed program never goes wrong
A well-typed program never goes wrong
 
Shap
ShapShap
Shap
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)
 
Deep learning from scratch
Deep learning from scratch Deep learning from scratch
Deep learning from scratch
 
Types Working for You, Not Against You
Types Working for You, Not Against YouTypes Working for You, Not Against You
Types Working for You, Not Against You
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine Learning
 
Module 2: Machine Learning Deep Dive
Module 2:  Machine Learning Deep DiveModule 2:  Machine Learning Deep Dive
Module 2: Machine Learning Deep Dive
 
c_tutorial_2.ppt
c_tutorial_2.pptc_tutorial_2.ppt
c_tutorial_2.ppt
 
Software fundamentals
Software fundamentalsSoftware fundamentals
Software fundamentals
 
supervised.pptx
supervised.pptxsupervised.pptx
supervised.pptx
 
OO Design
OO DesignOO Design
OO Design
 
Diving into Tensorflow.js
Diving into Tensorflow.jsDiving into Tensorflow.js
Diving into Tensorflow.js
 
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
 
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 Deep Anomaly Detection from Research to Production Leveraging Spark and Tens... Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 

Más de PHP Conference Argentina

2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...PHP Conference Argentina
 
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source 2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source PHP Conference Argentina
 
2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a ProPHP Conference Argentina
 
2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida realPHP Conference Argentina
 
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...PHP Conference Argentina
 

Más de PHP Conference Argentina (7)

2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
 
2013 - Mark story - Avoiding the Owasp
2013 - Mark story - Avoiding the Owasp2013 - Mark story - Avoiding the Owasp
2013 - Mark story - Avoiding the Owasp
 
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source 2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
 
2013 - Benjamin Eberlei - Doctrine 2
2013 - Benjamin Eberlei - Doctrine 22013 - Benjamin Eberlei - Doctrine 2
2013 - Benjamin Eberlei - Doctrine 2
 
2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro
 
2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real
 
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
 

Último

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsRizwan Syed
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteDianaGray10
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfRankYa
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfPrecisely
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebUiPathCommunity
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsMiki Katsuragi
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionDilum Bandara
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024BookNet Canada
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationSlibray Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024Stephanie Beckett
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Enterprise Knowledge
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brandgvaughan
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clashcharlottematthew16
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Commit University
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsMark Billinghurst
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsPixlogix Infotech
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr BaganFwdays
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .Alan Dix
 

Último (20)

Scanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL CertsScanning the Internet for External Cloud Exposures via SSL Certs
Scanning the Internet for External Cloud Exposures via SSL Certs
 
Take control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test SuiteTake control of your SAP testing with UiPath Test Suite
Take control of your SAP testing with UiPath Test Suite
 
Search Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdfSearch Engine Optimization SEO PDF for 2024.pdf
Search Engine Optimization SEO PDF for 2024.pdf
 
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdfHyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
Hyperautomation and AI/ML: A Strategy for Digital Transformation Success.pdf
 
Dev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio WebDev Dives: Streamline document processing with UiPath Studio Web
Dev Dives: Streamline document processing with UiPath Studio Web
 
Vertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering TipsVertex AI Gemini Prompt Engineering Tips
Vertex AI Gemini Prompt Engineering Tips
 
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data PrivacyTrustArc Webinar - How to Build Consumer Trust Through Data Privacy
TrustArc Webinar - How to Build Consumer Trust Through Data Privacy
 
Advanced Computer Architecture – An Introduction
Advanced Computer Architecture – An IntroductionAdvanced Computer Architecture – An Introduction
Advanced Computer Architecture – An Introduction
 
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptxE-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
E-Vehicle_Hacking_by_Parul Sharma_null_owasp.pptx
 
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
Transcript: New from BookNet Canada for 2024: BNC CataList - Tech Forum 2024
 
Connect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck PresentationConnect Wave/ connectwave Pitch Deck Presentation
Connect Wave/ connectwave Pitch Deck Presentation
 
What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024What's New in Teams Calling, Meetings and Devices March 2024
What's New in Teams Calling, Meetings and Devices March 2024
 
Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024Designing IA for AI - Information Architecture Conference 2024
Designing IA for AI - Information Architecture Conference 2024
 
WordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your BrandWordPress Websites for Engineers: Elevate Your Brand
WordPress Websites for Engineers: Elevate Your Brand
 
Powerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time ClashPowerpoint exploring the locations used in television show Time Clash
Powerpoint exploring the locations used in television show Time Clash
 
Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!Nell’iperspazio con Rocket: il Framework Web di Rust!
Nell’iperspazio con Rocket: il Framework Web di Rust!
 
Human Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR SystemsHuman Factors of XR: Using Human Factors to Design XR Systems
Human Factors of XR: Using Human Factors to Design XR Systems
 
The Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and ConsThe Ultimate Guide to Choosing WordPress Pros and Cons
The Ultimate Guide to Choosing WordPress Pros and Cons
 
"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan"ML in Production",Oleksandr Bagan
"ML in Production",Oleksandr Bagan
 
From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .From Family Reminiscence to Scholarly Archive .
From Family Reminiscence to Scholarly Archive .
 

Small Data Machine Learning Insights

  • 1. Small Data Machine Learning Andrei Zmievski The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions - now and later
  • 2. WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
  • 3. WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
  • 11. @a For those of you who don’t know me.. Acquired in October 2008 Had a different account earlier, but then @k asked if I wanted it.. Know many other single-letter Twitterers.
  • 18. CONS Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 19. CONS Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 20. CONS I hate humanity Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 21. A D D
  • 22. Annoyance Driven Development Best way to learn something is to be annoyed enough to create a solution based on the tech.
  • 24. REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
  • 25. REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
  • 30.
  • 31. I still hate humanity
  • 32. I still hate humanity I still hate humanity
  • 33. Machine Learning A branch of Artificial Intelligence No widely accepted definition
  • 34. “Field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel (1959) concerns the construction and study of systems that can learn from data
  • 38. CLUSTERING And many more: medical diagnoses, detecting credit card fraud, etc.
  • 39. supervised unsupervised Labeled dataset, training maps input to desired outputs Example: regression - predicting house prices, classification - spam filtering
  • 40. supervised unsupervised no labels in the dataset, algorithm needs to find structure Example: clustering We will be talking about classification, a supervised learning process.
  • 41. Feature individual measurable property of the phenomenon under observation usually numeric
  • 42.
  • 43. Feature Vector a set of features for an observation Think of it as an array
  • 44. features # of rooms 2 sq. m house age yard? feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 45. features parameters # of rooms 2 sq. m house age yard? 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 46. features parameters 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 47. features parameters 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 48. features parameters = prediction 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 49. features parameters = prediction 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 758,013 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 50. dot product ⇥ ⇤ X = 1 x1 x2 . . . ⇥ ⇤ ✓ = ✓0 ✓1 ✓2 . . . X - input feature vector theta - weights
  • 51. dot product ⇥ ⇤ X = 1 x1 x2 . . . ⇥ ⇤ ✓ = ✓0 ✓1 ✓2 . . . ✓·X = ✓0 + ✓1 x1 + ✓2 x2 + . . . X - input feature vector theta - weights
  • 52. training data learning algorithm hypothesis Hypothesis (decision function): what the system has learned so far Hypothesis is applied to new data
  • 53. hθ(X) The task of our algorithm is to determine the parameters of the hypothesis.
  • 54. input data hθ(X) The task of our algorithm is to determine the parameters of the hypothesis.
  • 55. input data hθ(X) parameters The task of our algorithm is to determine the parameters of the hypothesis.
  • 56. input data hθ(X) prediction y parameters The task of our algorithm is to determine the parameters of the hypothesis.
  • 57. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 58. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 59. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 60. 1 0.5 z 0 1 g(z) = 1+e z LOGISTIC REGRESSION Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at origin. z is just our old dot product, the linear predictor. Transforms unbounded output into bounded.
  • 61. 1 0.5 z 0 1 g(z) = 1+e z =✓·X z LOGISTIC REGRESSION Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at origin. z is just our old dot product, the linear predictor. Transforms unbounded output into bounded.
  • 62. 1 h✓ (X) = 1+e ✓·X Probability that y=1 for input X LOGISTIC REGRESSION If hypothesis describes spam, then given X=body of email, if y=0.7, means there’s 70% chance it’s spam. Thresholding on that is up to you.
  • 64. Corpus collection of source data used for training and testing the model
  • 68. independent & discriminant Independent: feature A should not co-occur (correlate) with feature B highly. Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
  • 69. possible features @a at the end of the tweet ‣ @a... ‣ length < N chars ‣ # of user mentions in the tweet ‣ # of hashtags ‣ language! ‣ @a followed by punctuation and a word character (except for apostrophe) ‣ …and more ‣
  • 70. feature = extractor(tweet) For each feature, write a small function that takes a tweet and returns a numeric value (floating-point).
  • 71. corpus extractors feature vectors Run the set of these functions over the corpus and build up feature vectors Array of arrays Save to DB
  • 72. Language Matters high correlation between the language of the tweet and its category (good/bad)
  • 74. Top 12 Languages id en tl es so ja pt ar nl it sw fr Indonesian English Tagalog Spanish Somalian Japanese Portuguese Arabic Dutch Italian Swahili French I guarantee you people aren’t tweeting at me in Swahili. 3548 1804 733 329 305 300 262 256 150 137 118 92
  • 75. Language Detection Can’t trust the language field in user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
  • 76. Language Detection pear / Text_LanguageDetect pecl / textcat Can’t trust the language field in user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
  • 77. EnglishNotEnglish ✓ ✓ ✓ ✓ Clean-up text (remove mentions, links, etc) Run language detection If unknown/low weight, pretend it’s English, else: If not a character set-determined language, try harder: ✓ Tokenize into words ✓ Difference with English vocabulary ✓ If words remain, run parts-of-speech tagger on each ✓ For NNS, VBZ, and VBD run stemming algorithm ✓ If result is in English vocabulary, remove from remaining ✓ If remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words) ✓ If ratio < 20%, pretend it’s English A lot of this is heuristic-based, after some trial-and-error. Seems to help with my corpus.
  • 78. BINARY CLASSIFICATION Grunt work Built a web-based tool to display tweets a page at a time and select good ones
  • 81. BIAS CORRECTION BAD 99% = bad (less < 100 tweets were good) Training a model as-is would not produce good results Need to adjust the bias GOOD
  • 83. OVER SAMPLING Oversampling: use multiple copies of good tweets to equalize with bad Problem: bias very high, each good tweet would have to be copied 100 times, and not contribute any variance to the good category
  • 84. OVER SAMPLING Oversampling: use multiple copies of good tweets to equalize with bad Problem: bias very high, each good tweet would have to be copied 100 times, and not contribute any variance to the good category
  • 85. OVER SAMPLING UNDER Undersampling: drop most of the bad tweets to equalize with good Problem: total corpus ends up being < 200 tweets, not enough for training
  • 86. SAMPLING UNDER Undersampling: drop most of the bad tweets to equalize with good Problem: total corpus ends up being < 200 tweets, not enough for training
  • 87. Synthetic OVERSAMPLING Synthesize feature vectors by determining what constitutes a good tweet and do weighted random selection of feature values.
  • 88. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 89. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 90. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 91. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 0 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 92. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 0 77 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 93. Model Training We have the hypothesis (decision function) and the training set, How do we actually determine the weights/parameters?
  • 94. COST FUNCTION Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 95. REALITY COST FUNCTION PREDICTION Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 96. COST FUNCTION m X 1 J(✓) = Cost(h✓ (x), y) m i=1 Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 97. LOGISTIC COST Cost(h✓ (x), y) = ( log (h✓ (x)) log (1 h✓ (x)) if y = 1 if y = 0
  • 98. LOGISTIC COST y=1 0 y=0 1 Correct guess Incorrect guess 0 1 Cost = 0 Cost = huge When y=1 and h(x) is 1 (good guess), cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y=0.
  • 99. minimize cost OVER θ Finding the best values of Theta that minimize the cost
  • 100. GRADIENT DESCENT Random starting point. Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down from a hill.
  • 101. $\theta_i := \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}$ GRADIENT DESCENT Each step adjusts the parameters according to the slope.
  • 102. $\theta_i := \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}$ — for each parameter $\theta_i$. Have to update them simultaneously (the whole vector at a time).
  • 103. learning rate $\alpha$: controls how big a step you take. If $\alpha$ is big, you have an aggressive gradient descent; if it is too small, you take tiny steps and training takes too long; if it is too big, you can overshoot the minimum and fail to converge.
  • 104. derivative $\frac{\partial J(\theta)}{\partial \theta_i}$, aka “the slope”: indicates the steepness and direction of the descent step for each weight. Keep going for a number of iterations or until the cost falls below a threshold (convergence). Graph the cost function versus # of iterations and see where it starts to level off; past that are diminishing returns.
  • 105. THE UPDATE ALGORITHM $\theta_i := \theta_i - \alpha \sum_{j=1}^{m} \left(h_\theta(x^{(j)}) - y^{(j)}\right) x_i^{(j)}$ The derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
  • 115. X1 = [1 12.0] X2 = [1 -3.5] y1 = 1 y2 = 0 θ = [0.1 0.1] α = 0.05
h(X1) = 1 / (1 + e^-(0.1 • 1 + 0.1 • 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 • 1 + 0.1 • -3.5)) = 0.438
T0 = 0.1 - 0.05 • ((h(X1) - y1) • X1₀ + (h(X2) - y2) • X2₀) = 0.1 - 0.05 • ((0.786 - 1) • 1 + (0.438 - 0) • 1) = 0.089
The hypothesis for each data point is based on the current parameters. Each parameter is updated in order and the result is saved to a temporary.
  • 116. T1 = 0.1 - 0.05 • ((h(X1) - y1) • X1₁ + (h(X2) - y2) • X2₁) = 0.1 - 0.05 • ((0.786 - 1) • 12.0 + (0.438 - 0) • -3.5) = 0.305 Note that the hypotheses don’t change within the iteration.
  • 117. X1 = [1 12.0] X2 = [1 -3.5] y1 = 1 y2 = 0 α = 0.05 θ = [T0 T1] Replace the parameter (weights) vector with the temporaries.
  • 118. X1 = [1 12.0] X2 = [1 -3.5] y1 = 1 y2 = 0 α = 0.05 θ = [0.089 0.305] Do the next iteration.
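The whole iteration above fits in a few lines of PHP. A sketch (not the tool’s actual code) that reproduces the numbers from the worked example:

<?php
function sigmoid(float $z): float {
    return 1.0 / (1.0 + exp(-$z));
}

// h_theta(x) = sigmoid(theta . x)
function hypothesis(array $theta, array $x): float {
    $z = 0.0;
    foreach ($x as $i => $xi) {
        $z += $theta[$i] * $xi;
    }
    return sigmoid($z);
}

// One batch gradient descent step. Every parameter is computed from the
// same (old) $theta, i.e. the whole vector is updated simultaneously.
function gradientStep(array $theta, array $X, array $y, float $alpha): array {
    $temp = $theta;
    for ($i = 0, $n = count($theta); $i < $n; $i++) {
        $sum = 0.0;
        foreach ($X as $j => $xj) {
            $sum += (hypothesis($theta, $xj) - $y[$j]) * $xj[$i];
        }
        $temp[$i] = $theta[$i] - $alpha * $sum;
    }
    return $temp;
}

$theta = gradientStep([0.1, 0.1], [[1, 12.0], [1, -3.5]], [1, 0], 0.05);
// $theta is now approximately [0.089, 0.305]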
  • 119. CROSS-VALIDATION Used to assess the results of the training.
  • 122. [diagram: dataset split into TRAINING DATA and TEST DATA] Train the model on the training set, then test the results on the test set. Rinse, lather, repeat feature selection/synthesis/training until the results are “good enough”. Pick the best parameters and save them (DB, other).
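A sketch of such a split, assuming a 70/30 ratio (the deck doesn’t state the actual ratio used):

<?php
// Shuffle, then cut the labeled examples into training and test sets.
function splitDataset(array $examples, float $trainRatio = 0.7): array {
    shuffle($examples); // randomize so the split isn't biased by insertion order
    $cut = (int) floor(count($examples) * $trainRatio);
    return [
        'train' => array_slice($examples, 0, $cut),
        'test'  => array_slice($examples, $cut),
    ];
}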
  • 123. Putting It All Together Let’s put our model to use, finally. The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet though..
  • 124. Load the model The weights we have calculated via training. Easiest is to load them from a DB (which also makes it easy to test different models).
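A sketch of loading the weights. The “models” table with a JSON-encoded weights column is a hypothetical schema, not necessarily what the tool uses (the deck mentions MongoDB, but any store works):

<?php
// Load a named weight vector from a relational DB via PDO (hypothetical schema).
function loadModel(PDO $db, string $name): array {
    $stmt = $db->prepare('SELECT weights FROM models WHERE name = ?');
    $stmt->execute([$name]);
    $row = $stmt->fetch(PDO::FETCH_ASSOC);
    if ($row === false) {
        throw new RuntimeException("no model named '$name'");
    }
    return json_decode($row['weights'], true); // e.g. [0.089, 0.305, ...]
}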
  • 128. HARD-CODED RULES SKIP: truncated retweets ("RT @A ..."), @ mentions of friends, tweets from friends. We apply some hard-coded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don’t show up on the Web or other tools anyway, so it’s fine to skip those. A sketch of these pre-filters follows.
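The pre-filters might look like this; the tweet array shape ('text', 'user_id', 'mentioned_ids') and the friends list are assumptions for illustration:

<?php
// Hard-coded rules: skip tweets we're already certain about.
function shouldSkip(array $tweet, array $friends): bool {
    // Truncated retweets ("RT @A ...") don't show up in clients anyway.
    if (preg_match('/^RT @a\b/i', $tweet['text'])) {
        return true;
    }
    // Tweets from friends are trusted.
    if (in_array($tweet['user_id'], $friends, true)) {
        return true;
    }
    // So are tweets that @-mention a friend.
    foreach ($tweet['mentioned_ids'] as $id) {
        if (in_array($id, $friends, true)) {
            return true;
        }
    }
    return false;
}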
  • 131. Classifying Tweets: GOOD or BAD This is the moment we’ve been waiting for.
  • 133. Remember this? $h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}$ where $\theta \cdot X = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots$ First is our hypothesis.
  • 134. Finally $h_\theta(X) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 X_1 + \theta_2 X_2 + \dots)}}$ If h > threshold, the tweet is bad, otherwise good. Remember that the output of h() is 0..1 (a probability). The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
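As a sketch, the whole decision function is a dot product, a sigmoid, and a comparison (the function name is mine; 0.9 is the threshold mentioned above):

<?php
// theta . x -> sigmoid -> compare against the tolerance threshold.
function isBadTweet(array $theta, array $features, float $threshold = 0.9): bool {
    $z = 0.0;
    foreach ($features as $i => $xi) {
        $z += $theta[$i] * $xi;
    }
    $h = 1.0 / (1.0 + exp(-$z)); // probability in 0..1
    return $h > $threshold;      // above the threshold = bad tweet
}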
  • 137. 3 simple steps: extract features, run the model, act on the result. Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (plug the calculated feature values into the equation). Use the output of the classifier. Glue code for these steps is sketched below.
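The glue for the three steps might read like this; extractFeatures(), blockUser() and saveForAnalysis() are hypothetical helpers standing in for pieces described elsewhere in the deck:

<?php
// Per-tweet pipeline: hard-coded rules first, then the classifier.
function handleTweet(array $tweet, array $theta, array $friends): void {
    if (shouldSkip($tweet, $friends)) {
        return;                               // certain-good tweets bypass the model
    }
    $features = extractFeatures($tweet);      // 1. extract features
    if (isBadTweet($theta, $features)) {      // 2. run the model
        blockUser($tweet['user_id']);         // 3. act on the result
        saveForAnalysis($tweet);              //    and keep it for future training
    }
}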
  • 138. BAD? Block user! Also save the tweet to the DB for future analysis.
  • 144. Lessons Learned
Twitter API is a pain in the rear: connection handling, backoff in case of problems, undocumented API errors, etc.
Blocking is the only option (and it is final): there is no way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
Streaming API delivery is incomplete: some tweets are shown on the website but are never seen through the API.
ReplyCleaner judged to be ~80% effective: lots of room for improvement.
PHP sucks at math-y stuff.
  • 145. NEXT STEPS
★ Realtime feedback: click on the tweets that are bad and it immediately incorporates them into the model.
★ More features
★ Grammar analysis: to eliminate the common "@a bar" or "two @a time" occurrences.
★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
★ Clockwork Raven for manual classification: farm it out to Mechanical Turk.
★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick α, often faster than GD.
★ Wish pecl/scikit-learn existed
  • 146. TOOLS
★ MongoDB (great fit for JSON data)
★ pear/Text_LanguageDetect
★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
★ Parts-Of-Speech tagging
★ SplFixedArray in PHP (memory savings and slightly faster)
★ phirehose
★ Python’s scikit-learn (for validation)
★ Code sample
  • 147. LEARN
★ Coursera.org ML course
★ Ian Barber’s blog
★ FastML.com