Small Data Machine Learning
Andrei Zmievski

The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic.
Questions - now and later
WORK
We are all superheroes, because we help our customers keep their mission-critical apps
running smoothly. If interested, I can show you a demo of what I’m working on. Come find
me.
TRAVEL
TAKE PHOTOS
DRINK BEER
MAKE BEER
MATH
SOME MATH
AWESOME MATH
@a
For those of you who don’t know me..
Acquired in October 2008
Had a different account earlier, but then @k asked if I wanted it..
Know many other single-letter Twitterers.
FAME
FORTUNE
FOLLOWERS

Advantages
Wall Street Journal?!
lol, what?!
MAXIMUM REPLY SPACE!
140 − length(“@a ”) = 137
CONS
I hate humanity

Disadvantages
Visual filtering is next to impossible
Could be a set of hard-coded rules derived empirically
A.D.D.
Annoyance Driven Development
The best way to learn something is to be annoyed enough to create a solution based on the tech.
Machine Learning
to the Rescue!
REPLYCLEANER
Even with false negatives, reduces garbage to where visual filtering is possible
- uses trained model to classify tweets into good/bad
- blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline

I still hate humanity
Machine Learning

A branch of Artificial Intelligence
No widely accepted definition
“Field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel (1959)

Concerns the construction and study of systems that can learn from data.
SPAM FILTERING
RECOMMENDATIONS
TRANSLATION
CLUSTERING
And many more: medical diagnoses, detecting credit card fraud, etc.
supervised
Labeled dataset; training maps inputs to desired outputs.
Examples: regression (predicting house prices), classification (spam filtering).

unsupervised
No labels in the dataset; the algorithm needs to find structure on its own.
Example: clustering.

We will be talking about classification, a supervised learning process.
Feature
individual measurable property of the phenomenon under observation
usually numeric

Feature Vector
a set of features for an observation
Think of it as an array
features          parameters
1                 45.7
# of rooms: 2     102.3
sq. m             0.94
house age         -10.1
yard?             83.0

features · parameters = prediction = 758,013

feature vector and weights vector
1 added to pad the vector (accounts for the initial offset / bias / intercept weight, simplifies calculation)
dot product produces a linear predictor
dot product

X = [1 x1 x2 …]
θ = [θ0 θ1 θ2 …]

θ·X = θ0 + θ1x1 + θ2x2 + …

X - input feature vector
θ (theta) - weights
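A minimal PHP sketch of this (the tool in this talk is PHP): the weights below are the ones from the house-price slide, but the house feature values are made up for illustration.

function dotProduct(array $theta, array $x) {
    // θ·X = θ0·x0 + θ1·x1 + ... (x0 is the padded 1 / bias term)
    $sum = 0.0;
    foreach ($theta as $i => $weight) {
        $sum += $weight * $x[$i];
    }
    return $sum;
}

// hypothetical house: [1 (bias), 3 rooms, 96 sq. m, 12 years old, has yard]
$x     = [1, 3, 96, 12, 1];
$theta = [45.7, 102.3, 0.94, -10.1, 83.0]; // weights from the slide
echo dotProduct($theta, $x);               // the linear predictor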
training data → learning algorithm → hypothesis

Hypothesis (decision function): what the system has learned so far.
The hypothesis is applied to new data.
input data → hθ(X) → prediction y
(θ: the parameters)

The task of our algorithm is to determine the parameters of the hypothesis.
LINEAR REGRESSION
[scatter plot: whisky age (x-axis, 5–35 years) vs. whisky price $ (y-axis, 40–200)]

Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price.
Linear regression does not work well for classification because its output is unbounded.
Thresholding on some value is tricky and does not produce good results.
LOGISTIC REGRESSION
[plot of the sigmoid curve: g(z) rises from 0 to 1, crossing 0.5 at z = 0]

g(z) = 1 / (1 + e^(-z))
z = θ·X

Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at the origin.
z is just our old dot product, the linear predictor. Transforms the unbounded output into a bounded one.
LOGISTIC REGRESSION

hθ(X) = 1 / (1 + e^(-θ·X))

Probability that y=1 for input X.

If the hypothesis describes spam, then given X = body of an email, an output of 0.7 means there's a 70% chance it's spam. Thresholding on that is up to you.
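In PHP this is a one-liner on top of the dotProduct() helper from before (a sketch; the function names are mine):

function sigmoid($z) {
    // g(z) = 1 / (1 + e^(-z)), bounded to (0, 1)
    return 1.0 / (1.0 + exp(-$z));
}

function hypothesis(array $theta, array $x) {
    // hθ(X): probability that y = 1 for input X
    return sigmoid(dotProduct($theta, $x));
}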
Building the Tool
Corpus
collection of source data used for training and testing the model
Twitter → phirehose → MongoDB

phirehose hooks into the streaming API; 8,500 tweets collected.
Feature Identification

independent & discriminant

Independent: feature A should not co-occur (correlate) with feature B highly.
Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
possible features
‣ @a at the end of the tweet
‣ @a...
‣ length < N chars
‣ # of user mentions in the tweet
‣ # of hashtags
‣ language!
‣ @a followed by punctuation and a word character (except for apostrophe)
‣ …and more
feature = extractor(tweet)

For each feature, write a small function that takes a tweet and returns a numeric value (floating-point).
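For instance, a few extractors might look like this in PHP (hypothetical sketches of features from the list above, not the talk's actual code):

// Each extractor takes the tweet text and returns a float.
$extractors = [
    'a_at_end'     => function ($text) {
        return (float) preg_match('/@a\s*$/i', $text);   // @a at the end?
    },
    'num_hashtags' => function ($text) {
        return (float) preg_match_all('/#\w+/u', $text); // # of hashtags
    },
    'length'       => function ($text) {
        return (float) mb_strlen($text);                 // tweet length
    },
];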
corpus → extractors → feature vectors

Run the set of these functions over the corpus and build up the feature vectors.
Array of arrays. Save to DB.
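With extractors like those, the pipeline reduces to a loop (sketch; $corpus stands in for the tweets loaded from MongoDB):

$vectors = [];
foreach ($corpus as $tweet) {
    $vector = [1.0]; // the padded 1 / bias term from earlier
    foreach ($extractors as $extractor) {
        $vector[] = $extractor($tweet['text']);
    }
    $vectors[] = $vector; // array of arrays; save to DB
}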
Language Matters
High correlation between the language of the tweet and its category (good/bad).
Indonesian or Tagalog? Garbage.
Top 12 Languages

id   Indonesian   3548
en   English      1804
tl   Tagalog       733
es   Spanish       329
so   Somali        305
ja   Japanese      300
pt   Portuguese    262
ar   Arabic        256
nl   Dutch         150
it   Italian       137
sw   Swahili       118
fr   French         92

I guarantee you people aren't tweeting at me in Swahili.
Language Detection
pear / Text_LanguageDetect
pecl / textcat

Can't trust the language field in the user's profile data.
Used character N-grams and character sets for detection.
Has its own error rate, so needs some post-processing.
EnglishNotEnglish

✓ Clean up the text (remove mentions, links, etc.)
✓ Run language detection
✓ If unknown/low weight, pretend it's English; else:
✓ If not a character-set-determined language, try harder:
  ✓ Tokenize into words
  ✓ Diff with the English vocabulary
  ✓ If words remain, run a part-of-speech tagger on each
  ✓ For NNS, VBZ, and VBD, run a stemming algorithm
  ✓ If the result is in the English vocabulary, remove it from remaining
  ✓ If the remaining list is not empty, calculate:
    unusual_word_ratio = size(remaining) / size(words)
  ✓ If the ratio < 20%, pretend it's English

A lot of this is heuristic-based, after some trial and error.
Seems to help with my corpus.
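The vocabulary step of that checklist might be sketched in PHP like this (tokenize(), posTag(), and stem() are hypothetical stand-ins for the tokenizer, POS tagger, and stemmer actually used):

function unusualWordRatioSaysEnglish($text, array $englishVocab) {
    $words     = tokenize($text);                   // hypothetical tokenizer
    $remaining = array_diff($words, $englishVocab); // diff with English vocabulary
    foreach ($remaining as $i => $word) {
        // for NNS/VBZ/VBD, stem and re-check against the vocabulary
        if (in_array(posTag($word), ['NNS', 'VBZ', 'VBD'])
            && in_array(stem($word), $englishVocab)) {
            unset($remaining[$i]);
        }
    }
    if (count($words) === 0) {
        return true; // nothing left to judge; pretend it's English
    }
    return count($remaining) / count($words) < 0.20; // ratio < 20%
}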
BINARY CLASSIFICATION

Grunt work: built a web-based tool to display tweets a page at a time and select the good ones.

INPUT: feature vectors
OUTPUT: labels (good/bad)

Had my input and output.
BIAS CORRECTION

One more thing to address: 99% of the corpus = bad (fewer than 100 tweets were good).
Training a model as-is would not produce good results.
Need to adjust the bias.
OVERSAMPLING
Use multiple copies of the good tweets to equalize with the bad.
Problem: the bias is very high; each good tweet would have to be copied ~100 times and would not contribute any variance to the good category.

UNDERSAMPLING
Drop most of the bad tweets to equalize with the good.
Problem: the total corpus ends up being < 200 tweets, not enough for training.
Synthetic OVERSAMPLING
Synthesize feature vectors by determining what constitutes a good tweet and doing a weighted random selection of feature values.

chance   feature
90%      “good” language
70%      no hashtags
25%      1 hashtag
5%       2 hashtags
2%       @a at the end
85%      rand length > 10

(a synthesized feature vector being built up: 1, 2, 0, 77, …)

The actual synthesis is somewhat more complex and was also trial-and-error based.
Synthesized tweets + existing good tweets = 2/3 of the # of bad tweets in the training corpus (limited to 1000).
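The weighted random selection part can be sketched in a few lines of PHP (probabilities from the table above; per the notes, the real synthesis was more involved):

// Pick a value given [value => percent] weights, e.g. # of hashtags:
// 70% chance of 0, 25% of 1, 5% of 2.
function pickWeighted(array $weights) {
    $r   = mt_rand(1, 100);
    $sum = 0;
    foreach ($weights as $value => $percent) {
        $sum += $percent;
        if ($r <= $sum) {
            return $value;
        }
    }
    return $value; // falls through only if the percents sum to < 100
}

$numHashtags = pickWeighted([0 => 70, 1 => 25, 2 => 5]);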
Model Training
We have the hypothesis (decision function) and the training set.
How do we actually determine the weights/parameters?
COST FUNCTION

J(θ) = (1/m) · Σ (i=1..m) Cost(hθ(x^(i)), y^(i))

Measures how far the prediction of the system is from the reality.
The cost depends on the parameters.
The lower the cost, the closer we are to the ideal parameters for the model.
LOGISTIC COST

Cost(hθ(x), y) = { −log(hθ(x))       if y = 1
                 { −log(1 − hθ(x))   if y = 0
LOGISTIC COST
[plots of cost vs. hθ(x) for y=1 and y=0: cost = 0 for a correct guess, cost → huge for an incorrect one]

When y=1 and h(x) is 1 (good guess), the cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y=0.
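In PHP, on top of the hypothesis() sketch from earlier (function names mine):

function logisticCost($h, $y) {
    // -log(h) if y = 1, -log(1 - h) if y = 0
    return $y == 1 ? -log($h) : -log(1.0 - $h);
}

function costJ(array $theta, array $xs, array $ys) {
    // J(θ) = (1/m) Σ Cost(hθ(x), y) over the m training examples
    $sum = 0.0;
    foreach ($xs as $i => $x) {
        $sum += logisticCost(hypothesis($theta, $x), $ys[$i]);
    }
    return $sum / count($xs);
}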
minimize cost OVER θ
Finding the values of θ that minimize the cost.
GRADIENT DESCENT
Random starting point.
Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step.
Repeat.
Imagine a ball rolling down from a hill.
GRADIENT DESCENT

θi = θi − α · ∂J(θ)/∂θi

Each step adjusts each parameter θi according to the slope. Have to update them simultaneously (the whole vector at a time).

α is the learning rate: it controls how big a step you take.
If α is too small, you take tiny steps and it takes too long.
If α is too big, you can overshoot the minimum and fail to converge.

∂J(θ)/∂θi is the derivative, aka “the slope”: it indicates the steepness of the descent step for each weight, i.e. the direction.

Keep going for a number of iterations or until the cost is below a threshold (convergence).
Graph the cost function versus the # of iterations and see where it starts to approach 0; past that are diminishing returns.
THE UPDATE ALGORITHM

θi = θi − α · Σ (j=1..m) (hθ(x^(j)) − y^(j)) · xi^(j)

The derivative for logistic regression simplifies to this term.
Have to update the weights simultaneously!
X1 = [1 12.0]    y1 = 1
X2 = [1 -3.5]    y2 = 0
θ = [0.1 0.1]    α = 0.05

The hypothesis for each data point is based on the current parameters:

h(X1) = 1 / (1 + e^-(0.1 · 1 + 0.1 · 12.0)) = 0.786
h(X2) = 1 / (1 + e^-(0.1 · 1 + 0.1 · -3.5)) = 0.438

Each parameter is updated in order and the result is saved to a temporary:

T0 = 0.1 - 0.05 · ((h(X1) - y1) · X1[0] + (h(X2) - y2) · X2[0])
   = 0.1 - 0.05 · ((0.786 - 1) · 1 + (0.438 - 0) · 1)
   = 0.088

T1 = 0.1 - 0.05 · ((h(X1) - y1) · X1[1] + (h(X2) - y2) · X2[1])
   = 0.1 - 0.05 · ((0.786 - 1) · 12.0 + (0.438 - 0) · -3.5)
   = 0.305

Note that the hypotheses don't change within the iteration.
θ = [T0 T1] = [0.088 0.305]

Replace the parameter (weights) vector with the temporaries, then do the next iteration.
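One full update step in PHP, reproducing the numbers above (using the hypothesis() sketch from earlier):

$xs    = [[1, 12.0], [1, -3.5]];
$ys    = [1, 0];
$theta = [0.1, 0.1];
$alpha = 0.05;

// One iteration of batch gradient descent: compute all the new
// parameters into a temporary vector first, then swap them in at once.
$temp = $theta;
foreach ($theta as $i => $t) {
    $sum = 0.0;
    foreach ($xs as $j => $x) {
        $sum += (hypothesis($theta, $x) - $ys[$j]) * $x[$i];
    }
    $temp[$i] = $t - $alpha * $sum;
}
$theta = $temp; // ≈ [0.088 0.305], as above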
CROSS TRAINING
Used to assess the results of the training.

DATA → TRAINING DATA + TEST DATA

Train the model on the training set, then test the results on the test set.
Rinse, lather, repeat feature selection/synthesis/training until the results are "good enough".
Pick the best parameters and save them (DB, other).
Putting It All Together
Let's put our model to use, finally.
The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though:

Load the model
The weights we have calculated via training. Easiest is to load them from the DB (this can also be used to test different models).
HARD-CODED RULES

SKIP:
- truncated retweets: "RT @A ..."
- @ mentions of friends
- tweets from friends

We apply some hardcoded rules to filter out the tweets we are certain are good or bad.
The truncated RT ones don't show up on the Web or other tools anyway, so it's fine to skip those.
Classifying Tweets
GOOD / BAD

This is the moment we've been waiting for.
Remember this?

hθ(X) = 1 / (1 + e^(-θ·X))
θ·X = θ0 + θ1X1 + θ2X2 + …

First is our hypothesis.
Finally

hθ(X) = 1 / (1 + e^-(θ0 + θ1X1 + θ2X2 + …))

If h > threshold, the tweet is bad; otherwise it's good.

Remember that the output of h() is 0..1 (a probability).
The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
3 simple steps:
extract features
run the model
act on the result

Invoke the feature extractor to construct the feature vector for this tweet.
Evaluate the decision function over the feature vector (feed the calculated feature values into the equation).
Use the output of the classifier.
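In PHP the three steps collapse to a few lines (a sketch; extractFeatures(), blockUser(), and saveForAnalysis() are hypothetical wrappers around the pieces described above):

$threshold = 0.9; // high, to reduce false positives

$x = extractFeatures($tweet);  // 1. extract features
$h = hypothesis($theta, $x);   // 2. run the model
if ($h > $threshold) {         // 3. act on the result
    blockUser($tweet);         //    bad: block the author...
    saveForAnalysis($tweet);   //    ...and keep the tweet for future analysis
}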
BAD? → block user!
Also save the tweet to the DB for future analysis.
Lessons Learned
- Twitter API is a pain in the rear: connection handling, backoff in case of problems, undocumented API errors, etc.
- Blocking is the only option (and is final): there is no way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
- Streaming API delivery is incomplete: some tweets are shown on the website but never seen through the API.
- ReplyCleaner judged to be ~80% effective: lots of room for improvement.
- PHP sucks at math-y stuff.
NEXT STEPS
★ Realtime feedback: click on the bad tweets and they are immediately incorporated into the model.
★ More features
★ Grammar analysis: to eliminate the common "@a bar" or "two @a time" occurrences.
★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
★ Clockwork Raven for manual classification: farm out manual classification to Mechanical Turk.
★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick α, often faster than gradient descent.
★ Wish pecl/scikit-learn existed
TOOLS
★ MongoDB (great fit for JSON data)
★ pear/Text_LanguageDetect
★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
★ Parts-of-speech tagging
★ SplFixedArray (memory savings and slightly faster)
★ phirehose
★ Python's scikit-learn (for validation)
★ Code sample
LEARN
★ Coursera.org ML course
★ Ian Barber's blog
★ FastML.com
Questions?

Más contenido relacionado

Similar a Small Data Machine Learning Insights

Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Daniel Katz
 
A well-typed program never goes wrong
A well-typed program never goes wrongA well-typed program never goes wrong
A well-typed program never goes wrongJulien Wetterwald
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015lbishal
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep LearningCloudxLab
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learningknowbigdata
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep LearningShubhWadekar
 
Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Vincenzo Santopietro
 
Deep learning from scratch
Deep learning from scratch Deep learning from scratch
Deep learning from scratch Eran Shlomo
 
Types Working for You, Not Against You
Types Working for You, Not Against YouTypes Working for You, Not Against You
Types Working for You, Not Against YouC4Media
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11darwinrlo
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine LearningNarong Intiruk
 
Module 2: Machine Learning Deep Dive
Module 2:  Machine Learning Deep DiveModule 2:  Machine Learning Deep Dive
Module 2: Machine Learning Deep DiveSara Hooker
 
Software fundamentals
Software fundamentalsSoftware fundamentals
Software fundamentalsSusan Winters
 
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Codemotion
 
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 Deep Anomaly Detection from Research to Production Leveraging Spark and Tens... Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...Databricks
 

Similar a Small Data Machine Learning Insights (20)

Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
Quantitative Methods for Lawyers - Class #14 - R Boot Camp - Part 1 - Profess...
 
A well-typed program never goes wrong
A well-typed program never goes wrongA well-typed program never goes wrong
A well-typed program never goes wrong
 
Shap
ShapShap
Shap
 
Meetup 29042015
Meetup 29042015Meetup 29042015
Meetup 29042015
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Understanding computer vision with Deep Learning
Understanding computer vision with Deep LearningUnderstanding computer vision with Deep Learning
Understanding computer vision with Deep Learning
 
Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)Introduction to Tensor Flow for Optical Character Recognition (OCR)
Introduction to Tensor Flow for Optical Character Recognition (OCR)
 
Deep learning from scratch
Deep learning from scratch Deep learning from scratch
Deep learning from scratch
 
Types Working for You, Not Against You
Types Working for You, Not Against YouTypes Working for You, Not Against You
Types Working for You, Not Against You
 
Cs221 lecture5-fall11
Cs221 lecture5-fall11Cs221 lecture5-fall11
Cs221 lecture5-fall11
 
Begin with Machine Learning
Begin with Machine LearningBegin with Machine Learning
Begin with Machine Learning
 
Module 2: Machine Learning Deep Dive
Module 2:  Machine Learning Deep DiveModule 2:  Machine Learning Deep Dive
Module 2: Machine Learning Deep Dive
 
c_tutorial_2.ppt
c_tutorial_2.pptc_tutorial_2.ppt
c_tutorial_2.ppt
 
Software fundamentals
Software fundamentalsSoftware fundamentals
Software fundamentals
 
supervised.pptx
supervised.pptxsupervised.pptx
supervised.pptx
 
OO Design
OO DesignOO Design
OO Design
 
Diving into Tensorflow.js
Diving into Tensorflow.jsDiving into Tensorflow.js
Diving into Tensorflow.js
 
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
Paul Hofmann - Recruiting with Jenkins - How engineers can recruit engineers ...
 
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 Deep Anomaly Detection from Research to Production Leveraging Spark and Tens... Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
Deep Anomaly Detection from Research to Production Leveraging Spark and Tens...
 

Más de PHP Conference Argentina

2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...PHP Conference Argentina
 
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source 2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source PHP Conference Argentina
 
2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a ProPHP Conference Argentina
 
2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida realPHP Conference Argentina
 
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...PHP Conference Argentina
 

Más de PHP Conference Argentina (7)

2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
2013 - Nate Abele: HTTP ALL THE THINGS: Simplificando aplicaciones respetando...
 
2013 - Mark story - Avoiding the Owasp
2013 - Mark story - Avoiding the Owasp2013 - Mark story - Avoiding the Owasp
2013 - Mark story - Avoiding the Owasp
 
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source 2013 -  Janis Janovskis: Liderando equipos de desarrollo Open Source
2013 - Janis Janovskis: Liderando equipos de desarrollo Open Source
 
2013 - Benjamin Eberlei - Doctrine 2
2013 - Benjamin Eberlei - Doctrine 22013 - Benjamin Eberlei - Doctrine 2
2013 - Benjamin Eberlei - Doctrine 2
 
2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro2013 - Nate Abele Wield AngularJS like a Pro
2013 - Nate Abele Wield AngularJS like a Pro
 
2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real2013 - Dustin whittle - Escalando PHP en la vida real
2013 - Dustin whittle - Escalando PHP en la vida real
 
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
2013 - Igor Sysoev - NGINx: origen, evolución y futuro - PHP Conference Argen...
 

Último

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityPrincipled Technologies
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountPuma Security, LLC
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Servicegiselly40
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdfhans926745
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking MenDelhi Call girls
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024Results
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...Neo4j
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?Antenna Manufacturer Coco
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationMichael W. Hawkins
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxKatpro Technologies
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘RTylerCroy
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Igalia
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessPixlogix Infotech
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 

Último (20)

Boost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivityBoost PC performance: How more available memory can improve productivity
Boost PC performance: How more available memory can improve productivity
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Breaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path MountBreaking the Kubernetes Kill Chain: Host Path Mount
Breaking the Kubernetes Kill Chain: Host Path Mount
 
CNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of ServiceCNv6 Instructor Chapter 6 Quality of Service
CNv6 Instructor Chapter 6 Quality of Service
 
[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf[2024]Digital Global Overview Report 2024 Meltwater.pdf
[2024]Digital Global Overview Report 2024 Meltwater.pdf
 
08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men08448380779 Call Girls In Civil Lines Women Seeking Men
08448380779 Call Girls In Civil Lines Women Seeking Men
 
A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024A Call to Action for Generative AI in 2024
A Call to Action for Generative AI in 2024
 
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...Workshop - Best of Both Worlds_ Combine  KG and Vector search for  enhanced R...
Workshop - Best of Both Worlds_ Combine KG and Vector search for enhanced R...
 
What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?What Are The Drone Anti-jamming Systems Technology?
What Are The Drone Anti-jamming Systems Technology?
 
GenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day PresentationGenCyber Cyber Security Day Presentation
GenCyber Cyber Security Day Presentation
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptxFactors to Consider When Choosing Accounts Payable Services Providers.pptx
Factors to Consider When Choosing Accounts Payable Services Providers.pptx
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
🐬 The future of MySQL is Postgres 🐘
🐬  The future of MySQL is Postgres   🐘🐬  The future of MySQL is Postgres   🐘
🐬 The future of MySQL is Postgres 🐘
 
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
Raspberry Pi 5: Challenges and Solutions in Bringing up an OpenGL/Vulkan Driv...
 
Advantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your BusinessAdvantages of Hiring UIUX Design Service Providers for Your Business
Advantages of Hiring UIUX Design Service Providers for Your Business
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 

Small Data Machine Learning Insights

  • 1. Small Data Machine Learning Andrei Zmievski The goal is not a comprehensive introduction, but to plant a seed in your head to get you interested in this topic. Questions - now and later
  • 2. WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
  • 3. WORK We are all superheroes, because we help our customers keep their mission-critical apps running smoothly. If interested, I can show you a demo of what I’m working on. Come find me.
  • 11. @a For those of you who don’t know me.. Acquired in October 2008 Had a different account earlier, but then @k asked if I wanted it.. Know many other single-letter Twitterers.
  • 18. CONS Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 19. CONS Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 20. CONS I hate humanity Disadvantages Visual filtering is next to impossible Could be a set of hard-coded rules derived empirically
  • 21. A D D
  • 22. Annoyance Driven Development Best way to learn something is to be annoyed enough to create a solution based on the tech.
  • 24. REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
  • 25. REPLYCLEANER Even with false negatives, reduces garbage to where visual filtering is possible - uses trained model to classify tweets into good/bad - blocks the authors of the bad ones, since Twitter does not have a way to remove an individual tweet from the timeline
  • 30.
  • 31. I still hate humanity
  • 32. I still hate humanity I still hate humanity
  • 33. Machine Learning A branch of Artificial Intelligence No widely accepted definition
  • 34. “Field of study that gives computers the ability to learn without being explicitly programmed.” — Arthur Samuel (1959) concerns the construction and study of systems that can learn from data
  • 38. CLUSTERING And many more: medical diagnoses, detecting credit card fraud, etc.
  • 39. supervised unsupervised Labeled dataset, training maps input to desired outputs Example: regression - predicting house prices, classification - spam filtering
  • 40. supervised unsupervised no labels in the dataset, algorithm needs to find structure Example: clustering We will be talking about classification, a supervised learning process.
  • 41. Feature individual measurable property of the phenomenon under observation usually numeric
  • 42.
  • 43. Feature Vector a set of features for an observation Think of it as an array
  • 44. features # of rooms 2 sq. m house age yard? feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 45. features parameters # of rooms 2 sq. m house age yard? 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 46. features parameters 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 47. features parameters 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 48. features parameters = prediction 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 49. features parameters = prediction 1 # of rooms 2 sq. m house age yard? 45.7 102.3 0.94 -10.1 83.0 758,013 feature vector and weights vector 1 added to pad the vector (account for the initial offset / bias / intercept weight, simplifies calculation) dot product produces a linear predictor
  • 50. dot product ⇥ ⇤ X = 1 x1 x2 . . . ⇥ ⇤ ✓ = ✓0 ✓1 ✓2 . . . X - input feature vector theta - weights
  • 51. dot product ⇥ ⇤ X = 1 x1 x2 . . . ⇥ ⇤ ✓ = ✓0 ✓1 ✓2 . . . ✓·X = ✓0 + ✓1 x1 + ✓2 x2 + . . . X - input feature vector theta - weights
  • 52. training data learning algorithm hypothesis Hypothesis (decision function): what the system has learned so far Hypothesis is applied to new data
  • 53. hθ(X) The task of our algorithm is to determine the parameters of the hypothesis.
  • 54. input data hθ(X) The task of our algorithm is to determine the parameters of the hypothesis.
  • 55. input data hθ(X) parameters The task of our algorithm is to determine the parameters of the hypothesis.
  • 56. input data hθ(X) prediction y parameters The task of our algorithm is to determine the parameters of the hypothesis.
  • 57. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 58. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 59. whisky price $ 200 160 120 80 40 5 10 15 20 25 30 35 whisky age LINEAR REGRESSION Models the relationship between a scalar dependent variable y and one or more explanatory variables denoted X. Here X = whisky age, y = whisky price. Linear regression does not work well for classification because its output is unbounded. Thresholding on some value is tricky and does not produce good results.
  • 60. 1 0.5 z 0 1 g(z) = 1+e z LOGISTIC REGRESSION Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at origin. z is just our old dot product, the linear predictor. Transforms unbounded output into bounded.
  • 61. 1 0.5 z 0 1 g(z) = 1+e z =✓·X z LOGISTIC REGRESSION Logistic function (also sigmoid function). Asymptotes at 0 and 1. Crosses 0.5 at origin. z is just our old dot product, the linear predictor. Transforms unbounded output into bounded.
  • 62. 1 h✓ (X) = 1+e ✓·X Probability that y=1 for input X LOGISTIC REGRESSION If hypothesis describes spam, then given X=body of email, if y=0.7, means there’s 70% chance it’s spam. Thresholding on that is up to you.
  • 64. Corpus collection of source data used for training and testing the model
  • 68. independent & discriminant Independent: feature A should not co-occur (correlate) with feature B highly. Discriminant: a feature should provide uniquely classifiable data (what letter a tweet starts with is not a good feature).
  • 69. possible features @a at the end of the tweet ‣ @a... ‣ length < N chars ‣ # of user mentions in the tweet ‣ # of hashtags ‣ language! ‣ @a followed by punctuation and a word character (except for apostrophe) ‣ …and more ‣
  • 70. feature = extractor(tweet) For each feature, write a small function that takes a tweet and returns a numeric value (floating-point).
  • 71. corpus extractors feature vectors Run the set of these functions over the corpus and build up feature vectors Array of arrays Save to DB
  • 72. Language Matters high correlation between the language of the tweet and its category (good/bad)
  • 74. Top 12 Languages id en tl es so ja pt ar nl it sw fr Indonesian English Tagalog Spanish Somalian Japanese Portuguese Arabic Dutch Italian Swahili French I guarantee you people aren’t tweeting at me in Swahili. 3548 1804 733 329 305 300 262 256 150 137 118 92
  • 75. Language Detection Can’t trust the language field in user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
  • 76. Language Detection pear / Text_LanguageDetect pecl / textcat Can’t trust the language field in user’s profile data. Used character N-grams and character sets for detection. Has its own error rate, so needs some post-processing.
  • 77. EnglishNotEnglish ✓ ✓ ✓ ✓ Clean-up text (remove mentions, links, etc) Run language detection If unknown/low weight, pretend it’s English, else: If not a character set-determined language, try harder: ✓ Tokenize into words ✓ Difference with English vocabulary ✓ If words remain, run parts-of-speech tagger on each ✓ For NNS, VBZ, and VBD run stemming algorithm ✓ If result is in English vocabulary, remove from remaining ✓ If remaining list is not empty, calculate: unusual_word_ratio = size(remaining)/size(words) ✓ If ratio < 20%, pretend it’s English A lot of this is heuristic-based, after some trial-and-error. Seems to help with my corpus.
  • 78. BINARY CLASSIFICATION Grunt work Built a web-based tool to display tweets a page at a time and select good ones
  • 81. BIAS CORRECTION BAD 99% = bad (less < 100 tweets were good) Training a model as-is would not produce good results Need to adjust the bias GOOD
  • 83. OVER SAMPLING Oversampling: use multiple copies of good tweets to equalize with bad Problem: bias very high, each good tweet would have to be copied 100 times, and not contribute any variance to the good category
  • 84. OVER SAMPLING Oversampling: use multiple copies of good tweets to equalize with bad Problem: bias very high, each good tweet would have to be copied 100 times, and not contribute any variance to the good category
  • 85. OVER SAMPLING UNDER Undersampling: drop most of the bad tweets to equalize with good Problem: total corpus ends up being < 200 tweets, not enough for training
  • 86. SAMPLING UNDER Undersampling: drop most of the bad tweets to equalize with good Problem: total corpus ends up being < 200 tweets, not enough for training
  • 87. Synthetic OVERSAMPLING Synthesize feature vectors by determining what constitutes a good tweet and do weighted random selection of feature values.
  • 88. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 89. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 90. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 91. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 0 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 92. chance feature 90% “good” language 70% no hashtags 25% 1 hashtag 5% 2 hashtags 2% @a at the end 85% rand length > 10 1 2 0 77 The actual synthesis is somewhat more complex and was also trial-and-error based Synthesized tweets + existing good tweets = 2/3 of # of bad tweets in the training corpus (limited to 1000)
  • 93. Model Training We have the hypothesis (decision function) and the training set, How do we actually determine the weights/parameters?
  • 94. COST FUNCTION Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 95. REALITY COST FUNCTION PREDICTION Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 96. COST FUNCTION m X 1 J(✓) = Cost(h✓ (x), y) m i=1 Measures how far the prediction of the system is from the reality. The cost depends on the parameters. The less the cost, the closer we’re to the ideal parameters for the model.
  • 97. LOGISTIC COST Cost(h✓ (x), y) = ( log (h✓ (x)) log (1 h✓ (x)) if y = 1 if y = 0
  • 98. LOGISTIC COST y=1 0 y=0 1 Correct guess Incorrect guess 0 1 Cost = 0 Cost = huge When y=1 and h(x) is 1 (good guess), cost is 0, but the closer h(x) gets to 0 (wrong guess), the more we penalize the algorithm. Same for y=0.
  • 99. minimize cost OVER θ Finding the best values of Theta that minimize the cost
  • 100. GRADIENT DESCENT Random starting point. Pretend you’re standing on a hill. Find the direction of the steepest descent and take a step. Repeat. Imagine a ball rolling down from a hill.
  • 101-104. GRADIENT DESCENT

    \theta_i := \theta_i - \alpha \frac{\partial J(\theta)}{\partial \theta_i}

  Each step adjusts each parameter according to the slope. All parameters have to be updated simultaneously (the whole vector at a time).
  α is the learning rate and controls how big a step you take. If α is big, you have an aggressive gradient descent; if it is too small, you take tiny steps and it takes too long; if it is too big, you can overshoot the minimum and fail to converge.
  ∂J(θ)/∂θ_i is the derivative, aka "the slope": it indicates the steepness and direction of the descent step for each weight. Keep going for a number of iterations or until the cost is below a threshold (convergence). Graph the cost function versus the number of iterations and see where it starts to flatten out; past that are diminishing returns.
  • 105. THE UPDATE ALGORITHM

    \theta_i := \theta_i - \alpha \sum_{j=1}^{m} \left( h_\theta(x^{(j)}) - y^{(j)} \right) x_i^{(j)}

  The derivative for logistic regression simplifies to this term. Have to update the weights simultaneously!
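  A PHP sketch of one iteration of this update, reusing the hypothetical hypothesis() helper from the cost sketch. Note that every new parameter is computed from the same old θ before any of them is replaced:

    <?php
    // Sketch: one gradient-descent step for logistic regression.
    // All new parameters are computed from the old $theta, then
    // swapped in together (simultaneous update).

    function gradientStep(array $theta, array $X, array $y, float $alpha): array {
        $new = $theta;
        foreach ($theta as $i => $t) {
            $sum = 0.0;
            foreach ($X as $j => $x) {
                $sum += (hypothesis($theta, $x) - $y[$j]) * $x[$i];
            }
            $new[$i] = $t - $alpha * $sum; // theta_i := theta_i - alpha * sum
        }
        return $new;
    }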
  • 106-115. A worked iteration:

    X1 = [1 12.0]    X2 = [1 -3.5]
    y1 = 1           y2 = 0
    θ = [0.1 0.1]    α = 0.05

    h(X1) = 1 / (1 + e^(-(0.1·1 + 0.1·12.0)))  = 0.786
    h(X2) = 1 / (1 + e^(-(0.1·1 + 0.1·(-3.5)))) = 0.438

    T0 = 0.1 - 0.05 · ((h(X1) - y1) · X1_0 + (h(X2) - y2) · X2_0)
       = 0.1 - 0.05 · ((0.786 - 1) · 1 + (0.438 - 0) · 1)
       = 0.088

  The hypothesis for each data point is computed from the current parameters. Each parameter is updated in order and the result is saved to a temporary.
  • 116.

    T1 = 0.1 - 0.05 · ((h(X1) - y1) · X1_1 + (h(X2) - y2) · X2_1)
       = 0.1 - 0.05 · ((0.786 - 1) · 12.0 + (0.438 - 0) · (-3.5))
       = 0.305

  Note that the hypotheses don't change within the iteration.
  • 117. θ = [T0 T1] Replace the parameter (weights) vector with the temporaries.
  • 118. θ = [0.088 0.305] Do the next iteration.
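  As a sanity check, the hypothetical gradientStep() sketched above reproduces these numbers:

    <?php
    // Reproducing the worked iteration (assumes hypothesis() and
    // gradientStep() from the earlier sketches).
    $X     = [[1, 12.0], [1, -3.5]];
    $y     = [1, 0];
    $theta = [0.1, 0.1];

    $theta = gradientStep($theta, $X, $y, 0.05);
    print_r($theta); // ≈ [0.0888, 0.3051]; the slide truncates the first to 0.088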
  • 119. CROSS-VALIDATION Used to assess the results of the training.
  • 120-122. DATA is split into a TRAINING set and a TEST set. Train the model on the training set, then test the results on the test set. Rinse, lather, repeat feature selection/synthesis/training until the results are "good enough". Pick the best parameters and save them (to a DB or elsewhere).
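  A minimal sketch of such a split in PHP; the 70/30 ratio is an illustrative assumption, not the deck's exact number:

    <?php
    // Sketch: shuffle, then split the labeled corpus into training
    // and test sets so both get a mix of good and bad tweets.
    function trainTestSplit(array $data, float $trainRatio = 0.7): array {
        shuffle($data);
        $cut = (int) floor(count($data) * $trainRatio);
        return [
            'train' => array_slice($data, 0, $cut),
            'test'  => array_slice($data, $cut),
        ];
    }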
  • 123. Putting It All Together Let's put our model to use, finally. The tool hooks into the Twitter Streaming API, and naturally that comes with the need to do certain error handling, etc. Once we get the actual tweet, though...
  • 124. Load the model The weights we calculated via training. Easiest is to load them from a DB (which also makes it easy to test different models).
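  A sketch of the load step, assuming the legacy pecl/mongo driver (MongoDB appears in the tools list) and a hypothetical models collection; any storage would do:

    <?php
    // Sketch: load the trained weights. The database/collection names
    // and the 'name' field are illustrative assumptions.
    $mongo  = new MongoClient();
    $models = $mongo->selectDB('replycleaner')->selectCollection('models');

    // Selecting a model by name makes it easy to test different models.
    $doc   = $models->findOne(['name' => 'current']);
    $theta = $doc['weights'];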
  • 125-128. HARD-CODED RULES SKIP: truncated retweets ("RT @A ..."), @-mentions of friends, tweets from friends. We apply some hard-coded rules to filter out the tweets we are certain are good or bad. The truncated RT ones don't show up on the Web or in other tools anyway, so it's fine to skip those. A sketch of the rules follows below.
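  A rough PHP sketch of these skip rules; the function shape and the $friends list are illustrative, and in practice the friend list would come from the Twitter API:

    <?php
    // Sketch: hard-coded rules. Returns true when a tweet can be
    // skipped without running the classifier.
    function shouldSkip(string $text, string $author, array $friends): bool {
        // Truncated retweets never show up in clients anyway.
        if (stripos($text, 'RT @a') === 0) {
            return true;
        }
        // Tweets from friends are trusted.
        if (in_array($author, $friends, true)) {
            return true;
        }
        // So are tweets that @-mention a friend.
        foreach ($friends as $friend) {
            if (stripos($text, '@' . $friend) !== false) {
                return true;
            }
        }
        return false;
    }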
  • 129-131. Classifying Tweets: GOOD or BAD. This is the moment we've been waiting for.
  • 132-133. Remember this?

    h_\theta(X) = \frac{1}{1 + e^{-\theta \cdot X}}
    \theta \cdot X = \theta_0 + \theta_1 X_1 + \theta_2 X_2 + \ldots

  First is our hypothesis.
  • 134. Finally

    h_\theta(X) = \frac{1}{1 + e^{-(\theta_0 + \theta_1 X_1 + \theta_2 X_2 + \ldots)}}

  If h > threshold, the tweet is bad; otherwise good. Remember that the output of h() is 0..1 (a probability). The threshold is in [0, 1]; adjust it for your degree of tolerance. I used 0.9 to reduce false positives.
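  In code, the decision reduces to a one-liner (again using the hypothetical hypothesis() helper; the 0.9 default matches the threshold mentioned above):

    <?php
    // Sketch: classify one tweet. $features is the padded feature
    // vector [1, x1, x2, ...]; above the threshold means "bad".
    function isBad(array $theta, array $features, float $threshold = 0.9): bool {
        return hypothesis($theta, $features) > $threshold;
    }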
  • 135-137. 3 simple steps: extract features → run the model → act on the result. Invoke the feature extractor to construct the feature vector for this tweet. Evaluate the decision function over the feature vector (plug the calculated feature values into the equation). Use the output of the classifier. (A sketch follows below.)
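  Glued together, the per-tweet loop might look like this; extractFeatures(), blockUser(), and saveForAnalysis() are hypothetical stand-ins for the real helpers:

    <?php
    // Sketch: the three steps for each incoming tweet.
    $features = extractFeatures($tweet);   // 1. extract features
    if (isBad($theta, $features)) {        // 2. run the model
        blockUser($tweet['user']);         // 3. act on the result...
        saveForAnalysis($tweet);           //    ...and keep it for later
    }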
  • 138. BAD? Block the user! Also save the tweet to the DB for future analysis.
  • 139-144. Lessons Learned
  Twitter API is a pain in the rear: connection handling, backoff in case of problems, undocumented API errors, etc.
  Blocking is the only option (and it is final): there is no way for a blocked person to get ahold of you via Twitter anymore, so when training the model, err on the side of caution.
  Streaming API delivery is incomplete: some tweets are shown on the website but never seen through the API.
  ReplyCleaner judged to be ~80% effective: lots of room for improvement.
  PHP sucks at math-y stuff.
  • 145. NEXT STEPS
  ★ Realtime feedback: click on the tweets that are bad and it immediately incorporates them into the model.
  ★ More features
  ★ Grammar analysis: to eliminate the common "@a bar" or "two @a time" occurrences.
  ★ Support Vector Machines or decision trees: SVMs are more appropriate for biased data sets.
  ★ Clockwork Raven for manual classification: farm out manual classification to Mechanical Turk.
  ★ Other minimization algos (BFGS, conjugate gradient): may help avoid local minima, no need to pick α, often faster than gradient descent.
  ★ Wish pecl/scikit-learn existed
  • 146. TOOLS
  ★ MongoDB (great fit for JSON data)
  ★ pear/Text_LanguageDetect
  ★ English vocabulary corpus: http://corpus.byu.edu/ or http://www.keithv.com/software/wlist/
  ★ Parts-Of-Speech tagging
  ★ SplFixedArray (memory savings and slightly faster in PHP)
  ★ phirehose (Twitter Streaming API client)
  ★ Python's scikit-learn (for validation)
  ★ Code sample
  • 147. LEARN
  ★ Coursera.org ML course
  ★ Ian Barber's blog
  ★ FastML.com