We would all like to predict the future at some point in our lives. Thanks to Google, we can now get one step closer! This talk gives an overview of what the Google Prediction API is, how you can use it to analyze data sets, and its strengths and weaknesses. It also runs open data sets through the system, covering both regression and categorization models.
Looking into the Future: Using Google's Prediction API
1. Looking into the Future
Using Google’s Prediction API
Justin Grammens
Recursive Awesome & IoT Weekly
2. What is Prediction?
• Defined by Wikipedia as: “A statement about an
uncertain event.”
• Continues on to read… “It is often, but not
always, based upon experience or knowledge.”
• In statistics, prediction is a part of Statistical
Inference.
3. Statistical Inference
• Statistical inference is the process of deducing
properties of an underlying distribution by analysis
of data.
• Two major paradigms used for statistical inference
• Frequentist Inference
• Bayesian Inference
4. Frequentist Inference
• Data is a repeatable random sample with a specific
probability
• Parameters and probabilities remain constant during
the test
• Results are independent of results from prior tests
• Q: Will the sun rise tomorrow? What’s the probability
of a sun dying, based on all the suns in the universe?
5. Bayesian Inference
• Take into account prior results and subjective
beliefs
• Update probabilities of occurrence based on new
data
• Tests are NOT run in isolation and affect one
another
• Q: Will the sun rise tomorrow? Depends on how
many times we have seen it rise in the past
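The sunrise question on the two slides above can be made concrete. The sketch below is not part of the Prediction API. It contrasts a frequentist point estimate (the observed frequency) with a Bayesian update, assuming a uniform Beta(1, 1) prior, which gives Laplace's rule of succession. The function names and the day counts are illustrative assumptions.

```python
from fractions import Fraction


def frequentist_estimate(successes, trials):
    """Observed relative frequency: successes / trials."""
    return Fraction(successes, trials)


def bayesian_update(alpha, beta, successes, failures):
    """Update a Beta(alpha, beta) belief with newly observed outcomes."""
    return alpha + successes, beta + failures


def posterior_mean(alpha, beta):
    """Expected probability of success under a Beta(alpha, beta) belief."""
    return Fraction(alpha, alpha + beta)


# Assumption: start from a uniform prior, Beta(1, 1).
alpha, beta = 1, 1

# Observe 10 sunrises and no failures (illustrative count).
alpha, beta = bayesian_update(alpha, beta, successes=10, failures=0)
print(posterior_mean(alpha, beta))  # 11/12

# A later batch of 90 sunrises updates the same belief; nothing is reset.
alpha, beta = bayesian_update(alpha, beta, successes=90, failures=0)
print(posterior_mean(alpha, beta))  # 101/102

# The frequentist estimate from the same 100 days is simply 100/100.
print(frequentist_estimate(100, 100))  # 1
```

The Bayesian estimate never reaches certainty, but it moves closer with every sunrise seen, which is the "depends on how many times we have seen it rise" idea from the slide.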
6. Predictions by Machines
• Could therefore define
prediction as an “informed
guess or opinion.”
• Software systems have to
be trained before they can
be effective.
source: reading.pppst.com
7. What is Prediction API?
• Announced at Google I/O in 2011
• Provides pattern-matching and machine learning
capabilities.
• Handles both numeric and text input
• Handles both classification and regression output
• Access from App Engine, client libs and command line
• Able to retrain the model on the fly - Bayesian?
9. What Do You Need?
• Google Account
• Google Cloud Platform Console project
• Google Prediction API Activated
• Google Cloud Storage API Activated
10. Steps Involved
• Define what you are trying to accomplish
• Find the training data and format to support your goal
(hardest part)
• Upload training data to Google Cloud Storage
• Train the system against the data you provide
• Send queries to your model
• Upload additional data with new information gained.
11. Hosted Model
• The Prediction API hosts a gallery of user-submitted
models
• Owners can charge for the use of the model
• Hosted models are versioned so they can be updated
easily
• Models are submitted in PMML format
• XML-based language to define statistical & data models
• Appears to currently be a waitlist
12. How To Train
• 3 ways to create and train the correct type of model
• CSV File - Lives on Google Cloud Storage
• Training data embedded in request
• Limited to the size of an HTTP Request < 2MB
• Empty model created and trained with update
calls
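The three training options can be sketched as request bodies. This is a sketch under assumptions, not a definitive client: the field names (`id`, `storageDataLocation`, `trainingInstances`, `csvInstance`, `output`, `modelType`) are recalled from the v1.6 REST reference and should be checked against it, the helper names are hypothetical, and no network call is made. The only rule enforced is the slide's 2MB limit on embedded data.

```python
import json

# Slide: embedded training data must fit in an HTTP request < 2MB.
MAX_EMBEDDED_BYTES = 2 * 1024 * 1024


def train_from_storage(model_id, bucket_path):
    """Option 1: point the model at a CSV file in Google Cloud Storage."""
    return {"id": model_id, "storageDataLocation": bucket_path}


def train_embedded(model_id, examples):
    """Option 2: embed (output, features) examples in the request itself."""
    body = {
        "id": model_id,
        "trainingInstances": [
            {"output": output, "csvInstance": list(features)}
            for output, features in examples
        ],
    }
    size = len(json.dumps(body).encode("utf-8"))
    if size >= MAX_EMBEDDED_BYTES:
        raise ValueError(
            f"embedded training data is {size} bytes; use a CSV file instead"
        )
    return body


def empty_model(model_id, model_type):
    """Option 3, step 1: create an empty model of the given type."""
    return {"id": model_id, "modelType": model_type}


def update_body(output, features):
    """Option 3, step 2: one update call adds one labelled example."""
    return {"output": output, "csvInstance": list(features)}


print(train_from_storage("weather", "my-bucket/weather.csv"))
print(train_embedded("weather", [(62, ["Seattle", 288, "sunny"])]))
print(empty_model("weather", "regression"))
print(update_body(62, ["Seattle", 288, "sunny"]))
```

Checking the payload size locally is cheaper than finding out from a rejected request; anything near the limit belongs in a CSV file on Cloud Storage.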
13. CSV File Rules
• Maximum file size 2.5 GB
• No header row. Yes, to the system it’s irrelevant
• One example per line
• The first column indicates to the system the type of
model.
• Ideally remove punctuation (other than
apostrophes) from your data.
14. CSV File Rules
• Text Strings
• Double quotes around all text strings
• Text matching is case-sensitive
• Numeric Values
• Integer and decimals are supported
• Numbers: "1", "23", "999"
• Strings: "6 12", "colt 45"
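A file that follows these rules can be produced with Python's standard `csv` module. The sketch below is one way to do it, not the talk's own tooling: `to_training_csv` and `strip_punctuation` are hypothetical helper names, and the sample rows are made up apart from the Seattle values borrowed from the regression slide.

```python
import csv
import io
import re


def strip_punctuation(text):
    """Drop punctuation other than apostrophes, as the slide suggests."""
    return re.sub(r"[^\w\s']", "", text)


def to_training_csv(rows):
    """Serialize rows as header-less CSV, one example per line.

    The first item of each row is the answer, the rest are features.
    QUOTE_NONNUMERIC wraps every string in double quotes and leaves
    ints and floats bare, so text and numbers stay distinguishable.
    """
    buf = io.StringIO()
    writer = csv.writer(buf, quoting=csv.QUOTE_NONNUMERIC, lineterminator="\n")
    writer.writerows(rows)
    return buf.getvalue()


rows = [
    (62, "Seattle", 288, "sunny"),
    (48, "Minneapolis", 301, strip_punctuation("cloudy, windy!")),
]
print(to_training_csv(rows), end="")
# 62,"Seattle",288,"sunny"
# 48,"Minneapolis",301,"cloudy windy"
```

A value such as `colt 45` is passed in as a Python string, so it comes out quoted and is not mistaken for a number.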
15. Structuring Data
• Example Value
• “The Answer”
• Features
• No limit on the number of
features
• The more features & examples,
the better
• To train 16MB ~ 1 hour
17. Regression Model
Example Data
• Define your data to support numbers and strings
• Query of “Seattle, 288, sunny”, might get back value of 62
• Don’t need to match any values in the dataset
• Fill model with all columns then query with first column missing
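The "query with first column missing" idea can be sketched as follows. The body shape (`input` and `csvInstance`) is an assumption recalled from the v1.6 reference, the helper names are hypothetical, and the row is only shaped like the slide's example rather than taken from a real data set.

```python
def split_example(row):
    """Split a full training row into (answer, features)."""
    answer, *features = row
    return answer, features


def predict_body(features):
    """Body for a predict call: features only, answer column omitted."""
    return {"input": {"csvInstance": list(features)}}


# A training row carries the answer in the first column...
training_row = [62, "Seattle", 288, "sunny"]
answer, features = split_example(training_row)

# ...while a query sends the same columns without it.
print(predict_body(features))
# {'input': {'csvInstance': ['Seattle', 288, 'sunny']}}
```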
19. Authorization
• You must use OAuth 2.0 to authorize requests
• Can share your model with others
• View: User can call Analyze, Get, List and Predict on the
project and/or any model owned by the project.
• Edit: User has all the permissions of Can view, but can also
Delete, Insert, and Update any models owned by the
project.
• Is Owner: User has all the permissions of Can edit, but can
also grant permissions to other users to access the project.
20. Tips & Tricks
• The more examples & features, the better the results
• However - Adding more features doesn’t always give better
predictions
is_comedy is_drama is_action is_horror
Y N N N
VS
genre
Comedy
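Collapsing the four Y/N columns into the single genre column is a small preprocessing step. A minimal sketch, assuming each row has exactly one flag set to "Y" (the helper name is hypothetical):

```python
def collapse_flags(row, prefix="is_"):
    """Replace Y/N flag columns with a single category value."""
    chosen = [
        name[len(prefix):]
        for name, flag in row.items()
        if name.startswith(prefix) and flag == "Y"
    ]
    if len(chosen) != 1:
        raise ValueError(f"expected exactly one flag set, got {chosen}")
    return chosen[0].capitalize()


row = {"is_comedy": "Y", "is_drama": "N", "is_action": "N", "is_horror": "N"}
print(collapse_flags(row))  # Comedy
```

Raising on zero or multiple flags keeps ambiguous rows from silently entering the training file.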
21. Tips & Tricks
• Need to add a numeric aspect to the genre?
• Add additional genre columns and weight it based
on count
genre genre genre genre genre
Drama Drama Drama Comedy Comedy
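The weighting trick above amounts to repeating each genre in proportion to its count. A minimal sketch, assuming a fixed width of five genre columns so that every row has the same number of features (the helper name and the width are illustrative):

```python
def genre_columns(genre_counts, width=5):
    """Expand {genre: count} into `width` repeated genre columns."""
    total = sum(genre_counts.values())
    if total != width:
        raise ValueError(f"counts sum to {total}, expected {width}")
    columns = []
    for genre, count in genre_counts.items():
        columns.extend([genre] * count)
    return columns


print(genre_columns({"Drama": 3, "Comedy": 2}))
# ['Drama', 'Drama', 'Drama', 'Comedy', 'Comedy']
```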
22. Tips & Tricks
• Always put something into each feature
• Include all the features that you know about
• For Regression:
• Make sure you have the time to ensure the values are
correct
• Conversely, if you have exact numbers use them
• Try to have at least a few hundred examples for each
category
23. Tips & Tricks
• Can only compare against known relationships
• Can’t feed an untrained title and user to get rating
• Solution is to break the title into genre, director,
actors
Rating user_name movie_title
9.5 Justin Star Wars
2.2 Justin Disaster Movie
5.0 Justin Billy Madison
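Breaking the title into features might look like the sketch below. The ratings come from the table above; the `MOVIE_FEATURES` lookup is a hypothetical stand-in for a real movie database, and its genre and director values are illustrative only.

```python
# Hypothetical metadata lookup (illustrative values, not from the talk).
MOVIE_FEATURES = {
    "Star Wars": ("Sci-Fi", "George Lucas"),
    "Disaster Movie": ("Comedy", "Jason Friedberg"),
    "Billy Madison": ("Comedy", "Tamra Davis"),
}


def expand_title(rating, user_name, movie_title):
    """Swap the opaque title for features the model can generalize from."""
    genre, director = MOVIE_FEATURES[movie_title]
    return (rating, user_name, genre, director)


ratings = [
    (9.5, "Justin", "Star Wars"),
    (2.2, "Justin", "Disaster Movie"),
    (5.0, "Justin", "Billy Madison"),
]
for row in ratings:
    print(expand_title(*row))
```

An unseen title can then still be queried, as long as its genre and director have appeared in training.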
24. Let’s Talk Data!
• Nice Ride
• Based on the starting station, predict the ending station
• New York Cab Rides
• Given a starting GPS coordinate, predict where the cab
ride will end
• Sentiment Analysis
• Based on the State of the Union speech, determine the
sentiment
25. Based on the starting
station, can we predict
the ending station?
26. Nice Ride Location Rides
• https://www.niceridemn.org/data/
• Offers a live XML
stream to update
along the way
30. Lessons Learned
• I forgot to put the
values in quotes.
Treated it as
numerical
regression.
• Verify how it’s
interpreting your
data with “get” call.
Type
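This mistake can be caught before uploading. The sketch below is a local heuristic that mirrors the rule "the first column indicates the type of model"; it is an assumption about how the file will be read, not the API's actual logic, and the station IDs are made up. It relies on `csv.QUOTE_NONNUMERIC`, which makes the reader convert every unquoted field to a float.

```python
import csv
import io


def guess_model_type(csv_text):
    """Guess the model type from the first column of training CSV text.

    Unquoted first-column values parse as floats (regression);
    quoted values stay strings (classification).
    """
    reader = csv.reader(io.StringIO(csv_text), quoting=csv.QUOTE_NONNUMERIC)
    first_values = [row[0] for row in reader if row]
    if all(isinstance(value, float) for value in first_values):
        return "regression"
    return "classification"


# Hypothetical station IDs: ending station first, starting station second.
unquoted = "30005,30001\n30012,30001\n"
quoted = '"30005","30001"\n"30012","30001"\n'

print(guess_model_type(unquoted))  # regression
print(guess_model_type(quoted))    # classification
```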
35. There’s A Problem
• Asking for 2 inputs and 2 outputs!
• Not possible with Prediction API as it only supports
one dependent variable. :(
• Change of plan…
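The slide does not spell out the new plan here. One possible workaround, shown only as an assumption, is to split the data into two single-output training sets and train one regression model per coordinate. The helper name and the coordinates are hypothetical.

```python
def split_outputs(rows):
    """Turn (start_lat, start_lon, end_lat, end_lon) rows into two
    training sets, each with one dependent variable in the first column."""
    lat_rows = [
        (end_lat, start_lat, start_lon)
        for start_lat, start_lon, end_lat, _end_lon in rows
    ]
    lon_rows = [
        (end_lon, start_lat, start_lon)
        for start_lat, start_lon, _end_lat, end_lon in rows
    ]
    return lat_rows, lon_rows


# Made-up cab rides: start latitude/longitude, end latitude/longitude.
rides = [
    (40.75, -73.99, 40.71, -74.01),
    (40.76, -73.98, 40.78, -73.95),
]
lat_rows, lon_rows = split_outputs(rides)
print(lat_rows)  # [(40.71, 40.75, -73.99), (40.78, 40.76, -73.98)]
print(lon_rows)  # [(-74.01, 40.75, -73.99), (-73.95, 40.76, -73.98)]
```

The two models are trained and queried independently, so any correlation between the predicted latitude and longitude is lost.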
42. Speech Sentiment
• Always Check Your Data!
• Website incorrectly
claimed positive(4),
negative(0) and
neutral(2) sentiment.
• Data had groups of
sentiment values.
• Source
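A quick count of the label column is one way to check a data set against its documentation. In this sketch the expected labels (0, 2, 4) come from the slide, while the sample rows and helper names are made up for illustration.

```python
from collections import Counter


def label_counts(rows):
    """Count how often each value appears in the first (label) column."""
    return Counter(row[0] for row in rows)


def unexpected_labels(rows, expected):
    """Return labels present in the data but not in the documented set."""
    return sorted(set(label_counts(rows)) - set(expected))


# Made-up rows: (sentiment, text).
rows = [
    (4, "a strong and hopeful future"),
    (0, "a crisis we cannot ignore"),
    (3, "steady progress this year"),
    (4, "our union is strong"),
]
print(label_counts(rows)[4])                # 2
print(unexpected_labels(rows, {0, 2, 4}))   # [3]
```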
50. Final Thoughts - Overfitting
• Overfitting the model generally takes the form of
making an overly complex model to explain
idiosyncrasies in the data under study.
• Therefore, a model that has been overfit will
generally have poor predictive performance, as it
can exaggerate minor fluctuations in the data.
• An exact query should not simply return the EXACT training examples
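A common guard against overfitting is to hold back some examples and query the model only with data it has never seen. The sketch below shows a plain holdout split; it is a general technique rather than a Prediction API feature, and the function name, 20% fraction and seed are illustrative choices.

```python
import random


def holdout_split(examples, test_fraction=0.2, seed=0):
    """Shuffle examples and split them into (train, test) lists."""
    shuffled = list(examples)
    random.Random(seed).shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    cut = len(shuffled) - n_test
    return shuffled[:cut], shuffled[cut:]


examples = list(range(10))
train, test = holdout_split(examples)
print(len(train), len(test))  # 8 2
```

Train only on `train`, then compare predictions for `test` with the known answers; a model that does well on the first set and poorly on the second is likely overfit.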