Looking into the Future: Using Google's Prediction API

Looking into the Future
Using Google’s Prediction API
Justin Grammens
Recursive Awesome & IoT Weekly

What is Prediction?
• Defined by Wikipedia as: “A statement about an
uncertain event.”
• Continues on to read… “It is often, but not
always, based upon experience or knowledge.”
• In statistics, prediction is a part of Statistical
Inference.

Statistical Inference
• Statistical inference is the process of deducing
properties of an underlying distribution by analysis
of data.
• Two major paradigms used for statistical inference
• Frequentist Inference
• Bayesian Inference

Frequentist Inference
• Data is a repeatable random sample with a specific
probability
• Parameters and probabilities remain constant during
the test
• Results are independent of results from prior tests
• Q: Will the sun rise tomorrow? What’s the probability
of a sun dying, based on all the suns in the universe?

Bayesian Inference
• Take into account prior results and subjective
beliefs
• Update probabilities of occurrence based on new
data
• Tests are NOT run in isolation and affect one
another
• Q: Will the sun rise tomorrow? Depends on how
many times we have seen it rise in the past
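
The sunrise question is the classic example behind Laplace's rule of succession. A minimal sketch, assuming a uniform prior over the unknown probability (an assumption the slide does not state): after seeing the sun rise on s of n days, the estimated chance of another sunrise is (s + 1) / (n + 2). The function name is illustrative.

```python
from fractions import Fraction


def rule_of_succession(successes: int, trials: int) -> Fraction:
    """Posterior mean of a success rate, starting from a uniform prior."""
    if not 0 <= successes <= trials:
        raise ValueError("successes must be between 0 and trials")
    return Fraction(successes + 1, trials + 2)


# Every observed sunrise nudges the estimate upward.
for days in (0, 1, 10, 10_000):
    print(days, float(rule_of_succession(days, days)))
```

With no observations the estimate is 1/2, and it approaches (but never reaches) 1 as more sunrises are seen, which is the "update based on new data" idea in the bullets above.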

Predictions by Machines
• Could therefore define
prediction as an “informed
guess or opinion.”
• Software systems have to
be trained before they can
be effective.
source: reading.pppst.com

What is Prediction API?
• Announced at Google I/O in 2011
• Provides pattern-matching and machine learning
capabilities.
• Handles both numeric and text input
• Handles both classification and regression output
• Access from App Engine, client libraries and the command line
• Able to retrain the model on the fly - Bayesian?

What Do You Need?
• Google Account
• Google Platform Console project
• Google Prediction API Activated
• Google Cloud Storage API Activated

Steps Involved
• Define what you are trying to accomplish
• Find the training data and format to support your goal
(hardest part)
• Upload training data to Google Cloud Storage
• Train the system against the data you provide
• Send queries to your model
• Upload additional data with new information gained.

Hosted Model
• The Prediction API hosts a gallery of user-submitted
models
• Owners can charge for the use of the model
• Hosted models are versioned so they can be updated
easily
• Models are submitted in PMML format
• XML-based language to define statistical & data models
• Appears to currently be a waitlist

How To Train
• 3 ways to create and train the correct type of model
• CSV File - Lives on Google Cloud Storage
• Training data embedded in request
• Limited to the size of an HTTP Request < 2MB
• Empty model created and trained with update
calls
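
The three training routes above can be sketched as request bodies. This is a sketch only: the field names (`id`, `storageDataLocation`, `trainingInstances`, `modelType`, `output`, `csvInstance`) are assumed from the v1.6 REST reference and should be verified, the helper names are hypothetical, and the code builds JSON without making any network call.

```python
import json

MAX_EMBEDDED_BYTES = 2 * 1024 * 1024  # the slide's "< 2MB" request limit


def insert_from_csv(model_id: str, bucket_path: str) -> dict:
    """Route 1: train from a CSV file already in Google Cloud Storage."""
    return {"id": model_id, "storageDataLocation": bucket_path}


def insert_embedded(model_id: str, rows: list) -> dict:
    """Route 2: embed the examples; each row is [answer, feature, ...]."""
    instances = [{"output": row[0], "csvInstance": list(row[1:])} for row in rows]
    body = {"id": model_id, "trainingInstances": instances}
    size = len(json.dumps(body).encode("utf-8"))
    if size >= MAX_EMBEDDED_BYTES:
        raise ValueError(f"body is {size} bytes; use a CSV file instead")
    return body


def insert_empty(model_id: str, model_type: str) -> dict:
    """Route 3, step 1: create an empty model of a given type."""
    if model_type not in ("classification", "regression"):
        raise ValueError("model_type must be classification or regression")
    return {"id": model_id, "modelType": model_type}


def update_body(row: list) -> dict:
    """Route 3, step 2: one update call adds a single example."""
    return {"output": row[0], "csvInstance": list(row[1:])}


print(json.dumps(insert_embedded("demo", [["spam", "Lose weight now"]]), indent=2))
```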

CSV File Rules
• Maximum file size 2.5 GB
• No header row. Yes, to the system it’s irrelevant
• One example per line
• The first column indicates to the system the type of
model.
• Ideally remove punctuation (other than
apostrophes) from your data.

CSV File Rules
• Text Strings
• Double quotes around all text strings
• Text matching is case-sensitive
• Numeric Values
• Integer and decimals are supported
• Numbers: "1", "23", "999"
• Strings: "6 12", "colt 45"
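
The CSV rules can be applied with Python's standard `csv` module. A minimal sketch, assuming the rows are already in memory as lists with the answer in the first position; the helper names and sample rows are illustrative, not from the talk.

```python
import csv
import io
import re


def clean_text(text: str) -> str:
    """Drop punctuation other than apostrophes; keep word characters and spaces."""
    return re.sub(r"[^\w\s']", "", text)


def to_training_csv(rows) -> str:
    """Serialize rows with no header row, double-quoting every text value."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    for row in rows:
        writer.writerow([clean_text(v) if isinstance(v, str) else v for v in row])
    return buffer.getvalue()


print(to_training_csv([
    ["spam", "Lose weight now!!!"],
    ["ham", "Don't forget lunch, ok?"],
]))
```

`csv.QUOTE_NONNUMERIC` leaves Python ints and floats unquoted and quotes everything else, so whether the first column is a string or a number decides how each line is written.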

Structuring Data
• Example Value
• “The Answer”
• Features
• No limit on number of
features
• More features & examples
the better
• To train 16MB ~ 1 hour

Regression Model
Example Data
• Define your data to support numbers and strings
• A query of “Seattle, 288, sunny” might return a value of 62
• Queries don’t need to match any values in the dataset
• Fill the model with all columns, then query with the first column missing
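
Querying "with the first column missing" can be sketched as follows. The `input`/`csvInstance` body shape is assumed from the v1.6 predict reference, and the sample row is illustrative only (the slide does not say this row is in the dataset).

```python
def predict_body(features: list) -> dict:
    """Request body for a predict call: feature columns only, no answer."""
    return {"input": {"csvInstance": list(features)}}


# A training row carries the answer first, then the features.
training_row = [62, "Seattle", 288, "sunny"]

# The query drops the first column, which is the value to be predicted.
query = predict_body(training_row[1:])
print(query)
```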

Classification Model
Example Data
• For a query of “Lose weight now!” you would get a
result of “spam”
• Returns the category from the dataset
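
Reading the answer out of a predict response might look like this. The field names `outputLabel` (classification) and `outputValue` (regression) are assumed from the v1.6 reference, and the sample response is hypothetical.

```python
def read_prediction(response: dict):
    """Return the predicted category, or the predicted number for regression."""
    if "outputLabel" in response:
        return response["outputLabel"]
    if "outputValue" in response:
        return float(response["outputValue"])
    raise KeyError("response has neither outputLabel nor outputValue")


# Hypothetical response to the query "Lose weight now!"
print(read_prediction({"outputLabel": "spam"}))
```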

Authorization
• You must use OAuth 2.0 to authorize requests
• Can share your model with others
• View: User can call Analyze, Get, List and Predict on the
project and/or any model owned by the project.
• Edit: User has all the permissions of Can view, but can also
Delete, Insert, and Update any models owned by the
project.
• Is Owner: User has all the permissions of Can edit, but can
also grant permissions to other users to access the project.

Tips & Tricks
• The more examples & features, the better the results
• However - Adding more features doesn’t always give better
predictions
is_comedy is_drama is_action is_horror
Y N N N
VS
genre
Comedy

Tips & Tricks
• Need to add a numeric aspect to the genre?
• Add additional genre columns and weight it based
on count
genre genre genre genre genre
Drama Drama Drama Comedy Comedy
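
One way to produce weighted genre columns like the row above is to repeat each genre once per count. A minimal sketch; the function name and the idea of passing counts as a dict are assumptions for illustration.

```python
def weighted_genre_columns(counts: dict) -> list:
    """Repeat each genre once per count, most frequent genre first."""
    ordered = sorted(counts.items(), key=lambda item: -item[1])
    columns = []
    for genre, count in ordered:
        columns.extend([genre] * count)
    return columns


print(weighted_genre_columns({"Drama": 3, "Comedy": 2}))
```

Every row in the training file needs the same number of columns, so in practice the counts would probably be scaled to a fixed total (five in the slide's example).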

Tips & Tricks
• Always put something into each feature
• Include all the features that you know about
• For Regression:
• Make sure you take the time to ensure the values are
correct
• Conversely, if you have exact numbers use them
• Try to have at least a few hundred examples for each
category

Tips & Tricks
• Can only compare against known relationships
• Can’t feed an untrained title and user to get rating
• Solution is to break the title into genre, director,
actors
Rating  user_name  movie_title
9.5     Justin     Star Wars
2.2     Justin     Disaster Movie
5.0     Justin     Billy Madison

Let’s Talk Data!
• Nice Ride
• Based on the starting station, predict the ending station
• New York Cab Rides
• Given a starting GPS coordinate, predict where the cab
ride will end
• Sentiment Analysis
• Based on the State of the Union speech, define the
sentiment

Based on the starting
station, can we predict
the ending station?

Nice Ride Location Rides
• https://www.niceridemn.org/data/
• Offers a live XML
stream to update
along the way

Started with this:
Next: Ended with this:

Nice Ride Insert Data
ID & Location

Nice Ride Running
Prediction
Status

Lessons Learned
• I forgot to put the
values in quotes.
Treated it as
numerical
regression.
• Verify how it’s
interpreting your
data with “get” call.
Type
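
The quoting lesson can be sketched like this: force every station ID into quotes so the first column reads as a category rather than a number. The station IDs are hypothetical, and the `modelInfo`/`modelType` field names checked after a "get" call are assumed from the v1.6 reference.

```python
import csv
import io


def station_rows_to_csv(rides) -> str:
    """Write (end_station, start_station) pairs with every value quoted."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, quoting=csv.QUOTE_ALL, lineterminator="\n")
    for end_station, start_station in rides:
        # The first column is the value to predict: the ending station.
        writer.writerow([str(end_station), str(start_station)])
    return buffer.getvalue()


def is_classification(get_response: dict) -> bool:
    """Check the model type reported by a "get" call."""
    return get_response.get("modelInfo", {}).get("modelType") == "classification"


# Hypothetical station IDs.
print(station_rows_to_csv([(30005, 30012), (30012, 30005)]))
```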

Show Scripts, API & Results

Can we predict the
movement of NYC cabs?

NYC Cab Ride Data
Data Dictionary / Data Website

Sample Data
Contains pickup & drop off latitude and longitude

There’s A Problem
• Asking for 2 inputs and 2 outputs!
• Not possible with Prediction API as it only supports
one dependent variable. :(
• Change of plan…

Let’s predict the cost of
a NYC cab ride instead!

Prediction Demo
• Features are
distances (B)
• Examples are prices
(A)
• Is this accurate?
• Different fares
based on areas of
the city

Ok, not really… Let's
use location based
data instead

Prediction Demo
• Latitude /
Longitude are the
features (B, C, D, E)
• Price Is The
Example (A)
• Examples

NYC Cab Ride Location

Sentiment Analysis of
a Speech

Speech Sentiment
• Always Check Your Data!
• Website incorrectly
claimed positive(4),
negative(0) and
neutral(2) sentiment.
• Data had groups of
sentiment values.
• Source

Speech Sentiment
Feature / Example Value
Training
Examples

Sentiment Example
Obama State of the Union Speech - 1/16
Donald Trump Speech Des Moines, IA - 1/24

Smart Spreadsheets
Install Smart Autofill Add-on

Smart Spreadsheets
Prediction API used to fill in missing values

Smart Spreadsheets
Select columns to use for data training

Smart Spreadsheets
“Example Values” are populated

Final Thoughts - Overfitting
• Overfitting the model generally takes the form of
making an overly complex model to explain
idiosyncrasies in the data under study.
• Therefore, a model that has been overfit will
generally have poor predictive performance, as it
can exaggerate minor fluctuations in the data.
• Exact query should not return EXACT examples
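
A common way to notice overfitting is to hold some examples back from training and query the model with them afterwards. The sketch below shows only the split; it is a general technique rather than something the talk prescribes, and the rows are synthetic.

```python
import random


def holdout_split(rows, test_fraction=0.2, seed=0):
    """Shuffle rows and split them into training and held-out lists."""
    if not 0 < test_fraction < 1:
        raise ValueError("test_fraction must be between 0 and 1")
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - round(len(shuffled) * test_fraction)
    return shuffled[:cut], shuffled[cut:]


# Synthetic (price, distance) rows, for illustration only.
rows = [[i * 2.5, i] for i in range(100)]
train, held_out = holdout_split(rows)
print(len(train), len(held_out))
```

Train on `train` only, then compare predictions for the `held_out` features against their known answers; a model that does well on training rows but poorly on held-out rows is likely overfit.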

Thank You
Justin Grammens
justin@recursiveawesome.com
http://recursiveawesome.com
Check out my IoT Weekly Newsletter
http://iotweeklynews.com
