MLconf ATL!
Sept 23rd, 2016
Chris Fregly
Research Scientist @ PipelineIO
Who am I?
Chris Fregly, Research Scientist @ PipelineIO, San Francisco
Previously, Engineer @ Netflix, Databricks, and IBM Spark
Contributor @ Apache Spark, Committer @ Netflix OSS
Founder @ Advanced Spark and TensorFlow Meetup
Author @ Advanced Spark (advancedspark.com)
Advanced Spark and Tensorflow Meetup
ATL Spark Meetup (9/22)
http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016
ATL Hadoop Meetup (9/21)
http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
Confession #1
I Failed Linguistics in College!
Chose Pass/Fail Option
(90 (mid-term) + 70 (final)) / 2 = 80 = C+
How did a C+ turn into an F?
ZER0 (0) CLASS PARTICIPATION?!
Confession #2
I Hated Statistics in College
2 Degrees: Mechanical + Manufacturing Engineering
Approximations were Bad!
I Wasn’t a Fluffy Physics Major
Though, I Kinda Wish I Was!
Wait… Please Don’t Leave!
I’m Older and Wiser Now
Approximate is the New Exact
Computational Linguistics and NLP are My Jam!
Agenda
Tensorflow + Neural Nets
NLP Fundamentals
NLP Models
What is Tensorflow?
General Purpose Numerical Computation Engine
Happens to be good for neural nets!
Tooling
Tensorboard (port 6006 == `goog`)
DAG-based like Spark!
Computation graph is logical plan
Stored in Protobufs
TF converts logical -> physical plan
Lots of Libraries
TFLearn (Tensorflow’s Scikit-learn Impl)
Tensorflow Serving (Prediction Layer)
Distributed and GPU-Optimized
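The "graph as logical plan" idea can be sketched in plain Python. This is not Tensorflow's API: the `Node`, `constant`, `add`, `mul`, and `run` names are hypothetical and only illustrate that building the graph is separate from executing it.

```python
class Node:
    """A node in a tiny computation graph: an op name plus its input nodes."""

    def __init__(self, op, inputs=(), value=None):
        self.op = op
        self.inputs = tuple(inputs)
        self.value = value


def constant(v):
    return Node("const", value=v)


def add(a, b):
    return Node("add", (a, b))


def mul(a, b):
    return Node("mul", (a, b))


def run(node):
    """Walk the graph and compute a value (the 'physical' execution)."""
    if node.op == "const":
        return node.value
    args = [run(i) for i in node.inputs]
    if node.op == "add":
        return args[0] + args[1]
    if node.op == "mul":
        return args[0] * args[1]
    raise ValueError("unknown op: " + node.op)


# Building the graph performs no arithmetic; it only records the plan.
graph = add(mul(constant(2.0), constant(3.0)), constant(1.0))

# Execution happens only when the graph is run.
result = run(graph)
```

A real engine would be free to reorder, fuse, or place these ops on different devices before running them, which is the logical -> physical step.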
What are Neural Networks?
Like All ML, Goal is to Minimize Loss (Error)
Error relative to known outcome of labeled data
Mostly Supervised Learning Classification
Labeled training data
Training Steps
Step 1: Randomly Guess Input Weights
Step 2: Calculate Error Against Labeled Data
Step 3: Determine Gradient Value, +/- Direction
Step 4: Back-propagate Gradient to Update Each Input Weight
Step 5: Repeat from Step 2 with New Weights until Convergence
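The training steps above can be sketched for the simplest possible case: a single weight `w` fitting `y = w * x` with a squared-error loss. The data, learning rate, and step count are made-up assumptions, and with only one weight the "back-propagation" step reduces to a single gradient update.

```python
import random


def train(data, learning_rate=0.05, steps=200, seed=0):
    rng = random.Random(seed)
    w = rng.uniform(-1.0, 1.0)          # Step 1: random initial weight
    for _ in range(steps):              # Step 5: repeat (fixed step count here)
        grad = 0.0
        for x, y in data:
            error = w * x - y           # Step 2: error against the labeled value
            grad += 2.0 * error * x     # Step 3: gradient of the squared error
        grad /= len(data)
        w -= learning_rate * grad       # Step 4: move the weight against the gradient
    return w


# Toy labeled data generated from y = 2 * x (hypothetical).
data = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0)]
w = train(data)
```

Only the first pass uses the random guess; every later pass starts from the updated weight.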
[Figure label: Activation Function]
Activation Functions
Goal: Learn and Train a Model on Input Data
Non-Linear Functions
Find Non-Linear Fit of Input Data
Common Activation Functions
Sigmoid Function (sigmoid)
Output range: (0, 1)
Hyperbolic Tangent (tanh)
Output range: (-1, 1)
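A minimal sketch of the two activation functions using only the standard library. The sample inputs are arbitrary, and `sigmoid` is not hardened against overflow for very large negative inputs.

```python
import math


def sigmoid(x):
    """Squashes any real input into the open interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squashes any real input into the open interval (-1, 1)."""
    return math.tanh(x)


samples = [-4.0, 0.0, 4.0]
sigmoid_values = [sigmoid(x) for x in samples]
tanh_values = [tanh(x) for x in samples]
```

The two are closely related: tanh(x) equals 2 * sigmoid(2x) - 1, so tanh is a rescaled sigmoid centered on zero.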
Back Propagation
http://kratzert.github.io/2016/02/12/understanding-the-gradient-flow-through-the-batch-normalization-layer.html
Gradients Calculated by Comparing to Known Label
Use Gradients to Adjust Input Weights
Chain Rule
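As a rough illustration of the chain rule in back propagation, here is one sigmoid neuron with a squared-error loss. The weight, bias, input, and label values are arbitrary assumptions; the analytic gradient is compared against a finite-difference estimate.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def loss(w, b, x, y):
    """Squared error of a single sigmoid neuron."""
    return (sigmoid(w * x + b) - y) ** 2


def gradients(w, b, x, y):
    """Chain rule: dL/dw = dL/da * da/dz * dz/dw (and likewise for b)."""
    a = sigmoid(w * x + b)
    dloss_da = 2.0 * (a - y)        # derivative of the squared error
    da_dz = a * (1.0 - a)           # derivative of the sigmoid
    dloss_dz = dloss_da * da_dz
    return dloss_dz * x, dloss_dz   # dz/dw = x, dz/db = 1


w, b, x, y = 0.5, -0.2, 1.5, 1.0
grad_w, grad_b = gradients(w, b, x, y)

# Central finite difference as a sanity check on the analytic gradient.
eps = 1e-6
numeric_grad_w = (loss(w + eps, b, x, y) - loss(w - eps, b, x, y)) / (2 * eps)
```

In a deeper network the same product of local derivatives is extended layer by layer, from the loss back to each weight.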
Loss/Error Optimizers
Gradient Descent
Batch (entire dataset)
Per-record (don’t do this!)
Mini-batch (empirically 16 -> 512)
Stochastic (approximation)
Momentum (optimization)
AdaGrad
SGD with adaptive learning rates per feature
Set initial learning rate
More likely to incorrectly converge on local minima
http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
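A hedged sketch of three ideas from the optimizer list: mini-batching, momentum, and AdaGrad. The function names, hyperparameters, and the toy objective f(w) = (w - 3)^2 are assumptions chosen for illustration, not values from the talk.

```python
import math


def mini_batches(records, batch_size):
    """Yield consecutive slices of at most batch_size records."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


def sgd_momentum_step(w, velocity, grad, learning_rate=0.1, momentum=0.9):
    """Momentum keeps a decaying running sum of past gradient steps."""
    velocity = momentum * velocity - learning_rate * grad
    return w + velocity, velocity


def adagrad_step(w, grad_sq_sum, grad, learning_rate=0.5, eps=1e-8):
    """AdaGrad shrinks the step as squared gradients accumulate."""
    grad_sq_sum = grad_sq_sum + grad * grad
    w = w - learning_rate * grad / (math.sqrt(grad_sq_sum) + eps)
    return w, grad_sq_sum


def grad_f(w):
    # Gradient of the toy objective f(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)


w_m, v = 0.0, 0.0
w_a, g2 = 0.0, 0.0
for _ in range(500):
    w_m, v = sgd_momentum_step(w_m, v, grad_f(w_m))
    w_a, g2 = adagrad_step(w_a, g2, grad_f(w_a))

batches = list(mini_batches(list(range(5)), 2))
```

With several weights, AdaGrad keeps one accumulated sum per weight, which is what gives each feature its own effective learning rate.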
The Math
Linear Algebra
Matrix Multiplication
Very Parallelizable
Calculus
Derivatives
Chain Rule
Convolutional Neural Networks
Feed-forward
Do not form a cycle
Apply Many Layers (a.k.a. Filters) to Input
Each Layer/Filter Picks up on Features
Features not necessarily human-grokkable
Examples of Human-grokkable Filters
3 color filters: RGB
Moving AVG for time series
Brute Force
Try Different numLayers & layerSizes
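The moving-average example of a human-grokkable filter can be written as a 1-D convolution. This is a plain-Python sketch with hypothetical names and toy data; a CNN would learn the kernel values instead of having them fixed by hand.

```python
def convolve_1d(signal, kernel):
    """Slide the kernel over the signal with no padding.

    Strictly this is cross-correlation (the kernel is not flipped),
    which is also what CNN layers typically compute.
    """
    width = len(kernel)
    return [
        sum(signal[i + j] * kernel[j] for j in range(width))
        for i in range(len(signal) - width + 1)
    ]


series = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]

# A 3-point moving average is a filter with equal weights.
moving_avg_kernel = [1.0 / 3.0] * 3
smoothed = convolve_1d(series, moving_avg_kernel)

# A different hand-written filter that responds to change over time.
edge_kernel = [-1.0, 0.0, 1.0]
edges = convolve_1d(series, edge_kernel)
```

Swapping the kernel changes which feature the same sliding operation picks up, here the local average versus the local slope.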
CNN Use Case: Stitch Fix
Stitch Fix Also Uses NLP to Analyze Return/Reject Comments
Stitch Fix @ Strata Conf SF 2016:
Using Deep Learning to Create New Clothing Styles!
Recurrent Neural Networks
Forms a Cycle (vs. Feed-forward)
Maintains State over Time
Keep track of context
Learns sequential patterns
Decay over time
Use Cases
Speech
Text/NLP Prediction
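A minimal sketch of how a recurrent cell carries state forward, using a single scalar hidden state. The weights `w_x`, `w_h`, and `b` are arbitrary assumptions rather than learned values.

```python
import math


def rnn_step(x, h_prev, w_x=0.5, w_h=0.8, b=0.0):
    """Vanilla RNN update: the new state mixes the input with the old state."""
    return math.tanh(w_x * x + w_h * h_prev + b)


def run_rnn(inputs, h0=0.0):
    """Feed a sequence through the cell, returning the state after each step."""
    h = h0
    states = []
    for x in inputs:
        h = rnn_step(x, h)
        states.append(h)
    return states


# A single pulse followed by silence: the state remembers the pulse,
# but the memory fades at every step.
states = run_rnn([1.0, 0.0, 0.0, 0.0])
```

The feedback through `h_prev` is the cycle that distinguishes this from a feed-forward layer.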
RNN Sequences
Input: Image -> Output: Classification
Input: Image -> Output: Text (Captions)
Input: Text -> Output: Class (Sentiment)
Input: Text (English) -> Output: Text (Spanish)
http://karpathy.github.io/2015/05/21/rnn-effectiveness/
[Figure labels: Input Layer, Hidden Layer, Output Layer]
Character-based RNNs
Tokens are Characters vs. Words/Phrases
Microsoft trains every 3 characters
Fewer Combinations of Possible Neighbors
Only 26 alpha character tokens vs. millions of word tokens
Preserving state between the 1st and 2nd ‘l’ improves prediction
Long Short Term Memory (LSTM)
More Complex State Update Function than Vanilla RNN
LSTM State Update
http://colah.github.io/posts/2015-08-Understanding-LSTMs/
Cell State
Forget Gate Layer (Sigmoid)
Input Gate Layer (Sigmoid)
Candidate Gate Layer (tanh)
Output Layer
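For reference, the gate layers listed above are commonly written as the equations below. The notation (weight matrices W, biases b, concatenation [h_{t-1}, x_t], element-wise product) is the standard formulation and is assumed to match the linked post, not taken from the slide itself.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f\,[h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i\,[h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\left(W_C\,[h_{t-1}, x_t] + b_C\right) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state update} \\
o_t &= \sigma\left(W_o\,[h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh(C_t) && \text{hidden state / output}
\end{aligned}
```

The sigmoid gates produce values between 0 and 1, so they act as soft switches on what the cell state forgets, accepts, and exposes.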
Transfer Learning
Agenda
Tensorflow + Neural Nets
NLP Fundamentals
NLP Models
Use Cases
Document Summary
TextRank: TF/IDF + PageRank
Article Classification and Similarity
LDA: calculate top `k` topic distribution
Machine Translation
word2vec: compare word embedding vectors
Must Convert Text to Numbers!
Core Concepts
Corpus
Collection of text
e.g., documents, articles, genetic codes
Embeddings
Tokens represented/embedded in vector space
Learned, hidden features (~PCA, SVD)
Similar tokens cluster together, analogies cluster apart
k-skip-gram
Skip k neighbors when defining tokens
n-gram
Treat n consecutive tokens as a single token
Composable: 1-skip, bi-gram (every other word)
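A small sketch of n-grams and skip-grams over a token list. The function names and sentence are made up. `skip_bigrams` skips exactly `k` tokens, following the "every other word" reading above, whereas some definitions allow up to `k` skips.

```python
def ngrams(tokens, n):
    """Every run of n consecutive tokens, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def skip_bigrams(tokens, k):
    """Pairs of tokens with exactly k tokens skipped between them."""
    return [(tokens[i], tokens[i + k + 1]) for i in range(len(tokens) - k - 1)]


tokens = "the quick brown fox jumps".split()

bigrams = ngrams(tokens, 2)
one_skip_bigrams = skip_bigrams(tokens, 1)
```

With `k = 0` the skip-bigram function reduces to ordinary bigrams.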
Parsers and POS Taggers
Describe grammatical sentence structure
Requires context of entire sentence
Helps reason about sentence
80% obvious, simple token neighbors
Major bottleneck in NLP pipeline!
Pre-trained Parsers and Taggers
Penn Treebank
Parser and Part-of-Speech Tagger
Human-annotated (!)
Trained on 4.5 million words
Parsey McParseface
Trained by SyntaxNet
Feature Engineering
Lower-case
Preserve proper nouns using caret (`^`)
“MLconf” => “^m^lconf”
“Varsity” => “^varsity”
Encode Common N-grams (Phrases)
Create a single token using underscore (`_`)
“Senior Developer” => “senior_developer”
Stemming and Lemmatization
Try to avoid: let the neural network figure this out
Can preserve part of speech (POS) using “_noun”, “_verb”
“banking” => “banking_verb”
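The caret and underscore conventions above could be implemented roughly as follows. This is a simplified sketch with hypothetical function names: it marks every upper-case letter rather than detecting proper nouns, and the phrase list is supplied by hand.

```python
import re


def mark_capitals(token):
    """Lower-case a token, prefixing each originally upper-case letter with '^'."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in token)


def join_phrases(text, phrases):
    """Replace each known phrase with one lower-case, underscore-joined token."""
    for phrase in phrases:
        pattern = re.compile(re.escape(phrase), flags=re.IGNORECASE)
        text = pattern.sub(lambda m: m.group(0).lower().replace(" ", "_"), text)
    return text


def normalize(text, phrases=()):
    """Join phrases first, then caret-encode the remaining capitals."""
    text = join_phrases(text, phrases)
    return [mark_capitals(tok) for tok in text.split()]


tokens = normalize("Senior Developer at MLconf", phrases=["Senior Developer"])
```

Encoding capitals this way keeps the vocabulary lower-case while leaving the original casing recoverable from the token.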
Agenda
Tensorflow + Neural Nets
NLP Fundamentals
NLP Models
Count-based Models
Goal: Convert Text to Vector of Neighbor Co-occurrences
Bag of Words (BOW)
Simple hashmap with word counts
Loses neighbor context
Term Frequency / Inverse Document Frequency (TF/IDF)
Normalizes based on token frequency
GloVe
Matrix factorization on co-occurrence matrix
Highly parallelizable, reduce dimensions, capture global co-occurrence stats
Log smoothing of probability ratios
Stores word vector diffs for fast analogy lookups
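A minimal sketch of the two simplest count-based models. TF/IDF has many variants; this one uses relative term frequency times log(N / document frequency), which is an assumption rather than the formula used in the talk. The three toy documents are invented.

```python
import math
from collections import Counter


def bag_of_words(tokens):
    """Word counts only: order and neighbor context are discarded."""
    return Counter(tokens)


def tf_idf(docs):
    """docs is a list of token lists; returns one {token: weight} dict per doc."""
    n_docs = len(docs)
    # Number of documents each token appears in.
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = Counter(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights


docs = [
    "spark is fast".split(),
    "tensorflow is flexible".split(),
    "spark and tensorflow".split(),
]
weights = tf_idf(docs)
```

In the first document "fast" outweighs "is" because "is" also appears elsewhere in the corpus, which is the normalization the TF/IDF bullet refers to.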
Neural-based Predictive Models
Goal: Predict Text using Learned Embedding Vectors
word2vec
Shallow neural network
Local: nearby words predict each other
Fixed word embedding vector size (e.g., 300)
Optimizer: Mini-batch Stochastic Gradient Descent (SGD)
SyntaxNet
Deep(er) neural network
Global(er)
Not a Recurrent Neural Net (RNN)!
Can combine with BOW-based models (e.g., word2vec CBOW)
word2vec
CBOW word2vec
Predict target word from source context
A single source context is an observation
Loses useful distribution information
Good for small datasets
Skip-gram word2vec (Inverse of CBOW)
Predict source context words from target word
Each (source context, target word) tuple is observation
Better for large datasets
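The difference between the two word2vec variants shows up in how training observations are generated. The sketch below only builds the observations, with hypothetical function names and a window of one word on each side; it does not train a model.

```python
def context_window(tokens, index, window):
    """Tokens within `window` positions of `index`, excluding the target itself."""
    left = tokens[max(0, index - window):index]
    right = tokens[index + 1:index + 1 + window]
    return left + right


def cbow_examples(tokens, window=1):
    """CBOW: one observation per position, (whole context) -> target word."""
    return [(tuple(context_window(tokens, i, window)), tokens[i])
            for i in range(len(tokens))]


def skipgram_examples(tokens, window=1):
    """Skip-gram: one observation per (target word, single context word) pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]


tokens = "the quick brown fox".split()
cbow = cbow_examples(tokens)
skipgram = skipgram_examples(tokens)
```

The same four words give 4 CBOW observations but 6 skip-gram observations, which is one way to see why skip-gram extracts more training signal from a corpus.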
word2vec Libraries
gensim
Python only
Most popular
Spark ML
Python + Java/Scala
Supports only synonyms
*2vec
lda2vec
LDA (global) + word2vec (local)
From Chris Moody @ Stitch Fix
like2vec
Embedding-based Recommender
word2vec vs. GloVe
Both are Fundamentally Similar
Capture local co-occurrence statistics (neighbors)
Capture distance between embedding vectors (analogies)
GloVe
Count-based
Also captures global co-occurrence statistics
Requires upfront pass through entire dataset
SyntaxNet POS Tagging
Determine coarse-grained grammatical role of each word
Multiple contexts, multiple roles
Neural Net
Inputs: stack, buffer
Results: POS probability distribution
[Figure label: Already Tagged]
SyntaxNet Dependency Parser
Determine fine-grained roles using grammatical relationships
“Transition-based”, Incremental Dependency Parser
Globally Normalized using Beam Search with Early Update
Parsey McParseface: Pre-trained Parser/Tagger available in 40 languages
[Figure labels: Fine-grained, Coarse-grained]
SyntaxNet Use Case: Nutrition
Nutrition and Health Startup in SF (Stealth)
Using Google’s SyntaxNet
Rate Recipes and Menus by Nutritional Value
[Figure labels: Correct, Incorrect]
Model Validation
Unsupervised Learning Requires Validation
Google has Published Analogy Tests for Model Validation
Thanks, Google!
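An analogy test of the kind mentioned above can be sketched with vector arithmetic and cosine similarity. The 2-D vectors are hand-made and hypothetical, not learned embeddings, and `solve_analogy` is an illustrative name.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def solve_analogy(embeddings, a, b, c):
    """Answer 'a is to b as c is to ?' via the vector b - a + c."""
    target = [vb - va + vc
              for va, vb, vc in zip(embeddings[a], embeddings[b], embeddings[c])]
    candidates = (w for w in embeddings if w not in (a, b, c))
    return max(candidates, key=lambda w: cosine(embeddings[w], target))


# Hypothetical toy vectors: axis 0 ~ royalty, axis 1 ~ gender.
toy = {
    "king": [1.0, 1.0],
    "queen": [1.0, -1.0],
    "man": [0.0, 1.0],
    "woman": [0.0, -1.0],
    "apple": [-1.0, 0.1],
}
answer = solve_analogy(toy, "man", "king", "woman")
```

Validation then amounts to running many such questions against a trained model and reporting the fraction answered correctly.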
Thank You, Atlanta!
Chris Fregly, Research Scientist @ PipelineIO
All Source Code, Demos, and Docker Images @ pipeline.io
Join the Global Meetup for all Slides and Videos @ advancedspark.com

Artificial Intelligence Chap.5 : Uncertainty
 
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
Web Form Automation for Bonterra Impact Management (fka Social Solutions Apri...
 
Corporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptxCorporate and higher education May webinar.pptx
Corporate and higher education May webinar.pptx
 
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemkeProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
ProductAnonymous-April2024-WinProductDiscovery-MelissaKlemke
 
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​Elevate Developer Efficiency & build GenAI Application with Amazon Q​
Elevate Developer Efficiency & build GenAI Application with Amazon Q​
 
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
Apidays New York 2024 - APIs in 2030: The Risk of Technological Sleepwalk by ...
 
Exploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with MilvusExploring Multimodal Embeddings with Milvus
Exploring Multimodal Embeddings with Milvus
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)Introduction to Multilingual Retrieval Augmented Generation (RAG)
Introduction to Multilingual Retrieval Augmented Generation (RAG)
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
CNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In PakistanCNIC Information System with Pakdata Cf In Pakistan
CNIC Information System with Pakdata Cf In Pakistan
 
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, AdobeApidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
Apidays New York 2024 - Scaling API-first by Ian Reasor and Radu Cotescu, Adobe
 
Strategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a FresherStrategies for Landing an Oracle DBA Job as a Fresher
Strategies for Landing an Oracle DBA Job as a Fresher
 
Vector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptxVector Search -An Introduction in Oracle Database 23ai.pptx
Vector Search -An Introduction in Oracle Database 23ai.pptx
 
Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..Understanding the FAA Part 107 License ..
Understanding the FAA Part 107 License ..
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot ModelMcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
Mcleodganj Call Girls 🥰 8617370543 Service Offer VIP Hot Model
 

Chris Fregly, Research Scientist, PipelineIO at MLconf ATL 2016

  • 1. MLconf ATL! Sept 23rd, 2016 Chris Fregly Research Scientist @ PipelineIO
  • 2. Who am I? Chris Fregly, Research Scientist @ PipelineIO, San Francisco Previously, Engineer @ Netflix, Databricks, and IBM Spark Contributor @ Apache Spark, Committer @ Netflix OSS Founder @ Advanced Spark and TensorFlow Meetup Author @ Advanced Spark (advancedspark.com)
  • 3. Advanced Spark and Tensorflow Meetup
  • 4. ATL Spark Meetup (9/22) http://www.slideshare.net/cfregly/atlanta-spark-user-meetup-09-22-2016
  • 5. ATL Hadoop Meetup (9/21) http://www.slideshare.net/cfregly/atlanta-hadoop-users-meetup-09-21-2016
  • 6.
  • 7. Confession #1 I Failed Linguistics in College! Chose Pass/Fail Option (90 (mid-term) + 70 (final)) / 2 = 80 = C+ How did a C+ turn into an F? ZER0 (0) CLASS PARTICIPATION?!
  • 8. Confession #2 I Hated Statistics in College 2 Degrees: Mechanical + Manufacturing Engg Approximations were Bad! I Wasn’t a Fluffy Physics Major Though, I Kinda Wish I Was!
  • 9. Wait… Please Don’t Leave! I’m Older and Wiser Now Approximate is the New Exact Computational Linguistics and NLP are My Jam!
  • 10. Agenda Tensorflow + Neural Nets NLP Fundamentals NLP Models
  • 11. What is Tensorflow? General Purpose Numerical Computation Engine Happens to be good for neural nets! Tooling Tensorboard (port 6006 == `goog`) DAG-based like Spark! Computation graph is logical plan Stored in Protobufs TF converts logical -> physical plan Lots of Libraries TFLearn (Tensorflow’s Scikit-learn Impl) Tensorflow Serving (Prediction Layer) Distributed and GPU-Optimized
  • 12. What are Neural Networks? Like All ML, Goal is to Minimize Loss (Error) Error relative to known outcome of labeled data Mostly Supervised Learning Classification Labeled training data Training Steps Step 1: Randomly Guess Input Weights Step 2: Calculate Error Against Labeled Data Step 3: Determine Gradient Value, +/- Direction Step 4: Back-propagate Gradient to Update Each Input Weight Step 5: Repeat from Step 2 with New Weights until Convergence Activation Function
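
The five training steps on this slide can be sketched in a few lines of plain Python. This is a minimal illustration, not code from the talk: it assumes a single linear unit `y ≈ w * x + b` trained on mean squared error, so "back-propagation" collapses to one direct gradient per weight. The function name `train`, the learning rate, and the toy data are all hypothetical.

```python
import random

def train(xs, ys, lr=0.05, steps=1000, seed=0):
    """Fit y ≈ w * x + b by gradient descent on mean squared error."""
    rng = random.Random(seed)
    # Step 1: randomly guess the initial weights.
    w, b = rng.uniform(-1, 1), rng.uniform(-1, 1)
    n = len(xs)
    for _ in range(steps):
        # Step 2: calculate the error against the labeled data.
        errors = [(w * x + b) - y for x, y in zip(xs, ys)]
        # Step 3: gradient of the mean squared error for each weight
        # (its sign gives the +/- direction).
        grad_w = 2.0 / n * sum(e * x for e, x in zip(errors, xs))
        grad_b = 2.0 / n * sum(errors)
        # Step 4: move each weight a small step against its gradient.
        w -= lr * grad_w
        b -= lr * grad_b
        # Step 5: the loop repeats from Step 2 with the new weights.
    return w, b

# Toy labeled data generated from y = 2x + 1 (an assumption for illustration).
w, b = train([0.0, 1.0, 2.0, 3.0], [1.0, 3.0, 5.0, 7.0])
```

A fixed step count stands in for a real convergence check here; a practical loop would usually stop once the loss stops improving.
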
  • 13. Activation Functions Goal: Learn and Train a Model on Input Data Non-Linear Functions Find Non-Linear Fit of Input Data Common Activation Functions Sigmoid Function (sigmoid): output range (0, 1) Hyperbolic Tangent (tanh): output range (-1, 1)
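
As a rough illustration of the two activation functions named on this slide, here is a small pure-Python sketch. The branch in `sigmoid` is an implementation choice made here to avoid overflow for large negative inputs; it is not something the slides specify.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: squashes any real input into the open interval (0, 1)."""
    if x >= 0:
        return 1.0 / (1.0 + math.exp(-x))
    # For negative x, use the equivalent form that never exponentiates a large number.
    z = math.exp(x)
    return z / (1.0 + z)

def tanh(x):
    """Hyperbolic tangent: squashes any real input into the open interval (-1, 1)."""
    return math.tanh(x)
```

The two are closely related: `tanh(x)` equals `2 * sigmoid(2 * x) - 1`, so tanh is a rescaled sigmoid centered on zero.
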
  • 15. Loss/Error Optimizers Gradient Descent Batch (entire dataset) Per-record (don’t do this!) Mini-batch (empirically 16 -> 512) Stochastic (approximation) Momentum (optimization) AdaGrad SGD with adaptive learning rates per feature Set initial learning rate More likely to incorrectly converge on local minima http://www.slideshare.net/cfregly/gradient-descent-back-propagation-and-auto-differentiation-advanced-spark-and-tensorflow-meetup-08042016
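
The optimizer variants listed on this slide differ mainly in how much data feeds each gradient and how the update is scaled. The sketch below shows one common textbook form of each idea for a single scalar weight; the function names, default hyperparameters, and scalar simplification are assumptions for illustration, and real libraries apply the same rules per weight across whole tensors.

```python
import math

def minibatches(records, batch_size):
    """Yield consecutive slices of the dataset (the slide suggests sizes of 16 to 512)."""
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]

def momentum_step(w, grad, velocity, lr=0.1, beta=0.9):
    """Classical momentum: keep a decaying running sum of past updates."""
    velocity = beta * velocity - lr * grad
    return w + velocity, velocity

def adagrad_step(w, grad, accum, lr=0.1, eps=1e-8):
    """AdaGrad: divide the initial learning rate by the root of the summed squared gradients."""
    accum = accum + grad * grad
    return w - lr * grad / (math.sqrt(accum) + eps), accum
```

Because `accum` only grows, AdaGrad's effective learning rate for a feature shrinks over time, which is one reason the initial learning rate matters.
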
  • 16. The Math Linear Algebra Matrix Multiplication Very Parallelizable Calculus Derivatives Chain Rule
  • 17. Convolutional Neural Networks Feed-forward Do not form a cycle Apply Many Layers (a.k.a. Filters) to Input Each Layer/Filter Picks up on Features Features not necessarily human-grokkable Examples of Human-grokkable Filters 3 color filters: RGB Moving AVG for time series Brute Force Try Different numLayers & layerSizes
  • 18. CNN Use Case: Stitch Fix Stitch Fix Also Uses NLP to Analyze Return/Reject Comments StitchFix Strata Conf SF 2016: Using Deep Learning to Create New Clothing Styles!
  • 19. Recurrent Neural Networks Forms a Cycle (vs. Feed-forward) Maintains State over Time Keep track of context Learns sequential patterns Decay over time Use Cases Speech Text/NLP Prediction
  • 20. RNN Sequences Input: Image Output: Classification http://karpathy.github.io/2015/05/21/rnn-effectiveness/ Input: Image Output: Text (Captions) Input: Text Output: Class (Sentiment) Input: Text (English) Output: Text (Spanish) Input Layer Hidden Layer Output Layer
  • 21. Character-based RNNs Tokens are Characters vs. Words/Phrases Microsoft trains every 3 characters Fewer Combinations of Possible Neighbors Only 26 alpha character tokens vs. millions of word tokens Preserving state between the 1st and 2nd ‘l’ improves prediction
  • 22. Long Short Term Memory (LSTM) More Complex State Update Function than Vanilla RNN
  • 23. LSTM State Update http://colah.github.io/posts/2015-08-Understanding-LSTMs/ Cell State Forget Gate Layer (Sigmoid) Input Gate Layer (Sigmoid) Candidate Gate Layer (tanh) Output Layer
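
For reference, the gates named on this slide correspond to the standard LSTM update as presented in the linked colah.github.io post. The notation below follows that post's common form ($\sigma$ is the sigmoid, $\odot$ is element-wise multiplication, $[h_{t-1}, x_t]$ is the concatenation of the previous hidden state and the current input); the symbols are not taken from the slides themselves.

```latex
\begin{aligned}
f_t &= \sigma\left(W_f \cdot [h_{t-1}, x_t] + b_f\right) && \text{forget gate} \\
i_t &= \sigma\left(W_i \cdot [h_{t-1}, x_t] + b_i\right) && \text{input gate} \\
\tilde{C}_t &= \tanh\left(W_C \cdot [h_{t-1}, x_t] + b_C\right) && \text{candidate values} \\
C_t &= f_t \odot C_{t-1} + i_t \odot \tilde{C}_t && \text{cell state} \\
o_t &= \sigma\left(W_o \cdot [h_{t-1}, x_t] + b_o\right) && \text{output gate} \\
h_t &= o_t \odot \tanh\left(C_t\right) && \text{hidden state / output}
\end{aligned}
```

The sigmoid gates produce values in (0, 1) that act as soft switches, while tanh keeps the candidate and output values in (-1, 1).
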
  • 25. Agenda Tensorflow + Neural Nets NLP Fundamentals NLP Models
  • 26. Use Cases Document Summary TextRank: TF/IDF + PageRank Article Classification and Similarity LDA: calculate top `k` topic distribution Machine Translation word2vec: compare word embedding vectors Must Convert Text to Numbers!
  • 27. Core Concepts Corpus Collection of text, e.g. documents, articles, genetic codes Embeddings Tokens represented/embedded in vector space Learned, hidden features (~PCA, SVD) Similar tokens cluster together, analogies cluster apart k-skip-gram Skip k neighbors when defining tokens n-gram Treat n consecutive tokens as a single token Composable: 1-skip, bi-gram (every other word)
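
The n-gram and k-skip-gram definitions on this slide can be made concrete with a short sketch. The helper names are hypothetical, and `skip_bigrams` follows the slide's reading of "skip k neighbors" (exactly k tokens skipped); note that a widely used alternative definition allows up to k skips instead.

```python
def ngrams(tokens, n):
    """Treat every run of n consecutive tokens as a single token (a tuple)."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def skip_bigrams(tokens, k):
    """Pair each token with the one that follows after skipping exactly k neighbors."""
    step = k + 1
    return [(tokens[i], tokens[i + step]) for i in range(len(tokens) - step)]

words = ["approximate", "is", "the", "new", "exact"]
bigrams = ngrams(words, 2)              # consecutive pairs
every_other = skip_bigrams(words, 1)    # the "1-skip, bi-gram" case: every other word
```
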
  • 28. Parsers and POS Taggers Describe grammatical sentence structure Requires context of entire sentence Helps reason about sentence 80% obvious, simple token neighbors Major bottleneck in NLP pipeline!
  • 29. Pre-trained Parsers and Taggers Penn Treebank Parser and Part-of-Speech Tagger Human-annotated (!) Trained on 4.5 million words Parsey McParseface Trained by SyntaxNet
  • 30. Feature Engineering Lower-case Preserve proper nouns using caret (`^`) “MLconf” => “^m^lconf” “Varsity” => “^varsity” Encode Common N-grams (Phrases) Create a single token using underscore (`_`) “Senior Developer” => “senior_developer” Stemming and Lemmatization Try to avoid: let the neural network figure this out Can preserve part of speech (POS) using “_noun”, “_verb” “banking” => “banking_verb”
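
The caret and underscore encodings on this slide are simple string transforms. The sketch below reproduces the slide's own examples; the function names, the restriction to two-word phrases, and the assumption that the phrase list is supplied by the caller are all illustrative choices, not details from the talk.

```python
def caret_lower(text):
    """Lower-case the text, marking each originally upper-case letter with a caret."""
    return "".join("^" + ch.lower() if ch.isupper() else ch for ch in text)

def join_bigrams(tokens, phrases):
    """Merge known two-word phrases into a single underscore-joined token."""
    out, i = [], 0
    while i < len(tokens):
        pair = tuple(tokens[i:i + 2])
        if len(pair) == 2 and pair in phrases:
            out.append("_".join(pair))
            i += 2  # consume both words of the phrase
        else:
            out.append(tokens[i])
            i += 1
    return out

# Hypothetical phrase list; in practice it would come from frequent n-gram counts.
known_phrases = {("senior", "developer")}
tokens = join_bigrams("senior developer wanted".split(), known_phrases)
```

The caret encoding is reversible, so the original casing can be restored after the model has seen a smaller, lower-cased vocabulary.
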
  • 31. Agenda Tensorflow + Neural Nets NLP Fundamentals NLP Models
  • 32. Count-based Models Goal: Convert Text to Vector of Neighbor Co-occurrences Bag of Words (BOW) Simple hashmap with word counts Loses neighbor context Term Frequency / Inverse Document Frequency (TF/IDF) Normalizes based on token frequency GloVe Matrix factorization on co-occurrence matrix Highly parallelizable, reduce dimensions, capture global co-occurrence stats Log smoothing of probability ratios Stores word vector diffs for fast analogy lookups
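
To make the first two count-based models concrete, here is a minimal pure-Python sketch of bag of words and one common TF/IDF variant (term frequency normalized by document length, times the natural log of inverse document frequency). Several TF/IDF weightings are in use, so the exact formula here is an assumption rather than the one the talk had in mind.

```python
import math
from collections import Counter

def bag_of_words(tokens):
    """A simple hashmap of word counts; word order and neighbors are lost."""
    return Counter(tokens)

def tf_idf(docs):
    """Return one {token: weight} dict per document."""
    n_docs = len(docs)
    # Document frequency: in how many documents does each token appear?
    doc_freq = Counter(tok for doc in docs for tok in set(doc))
    weights = []
    for doc in docs:
        counts = bag_of_words(doc)
        weights.append({
            tok: (count / len(doc)) * math.log(n_docs / doc_freq[tok])
            for tok, count in counts.items()
        })
    return weights

docs = [["spark", "ml"], ["spark", "nlp"]]
weights = tf_idf(docs)
```

In this toy corpus "spark" appears in every document, so its weight is zero, while "ml" and "nlp" each get a positive weight in the one document that contains them.
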
  • 33. Neural-based Predictive Models Goal: Predict Text using Learned Embedding Vectors word2vec Shallow neural network Local: nearby words predict each other Fixed word embedding vector size (e.g. 300) Optimizer: Mini-batch Stochastic Gradient Descent (SGD) SyntaxNet Deep(er) neural network Global(er) Not a Recurrent Neural Net (RNN)! Can combine with BOW-based models (e.g. word2vec CBOW)
  • 34. word2vec CBOW word2vec Predict target word from source context A single source context is an observation Loses useful distribution information Good for small datasets Skip-gram word2vec (Inverse of CBOW) Predict source context words from target word Each (source context, target word) tuple is observation Better for large datasets
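
The difference between CBOW and skip-gram is easiest to see in the training examples each one generates from the same sentence. The sketch below only builds those examples (no neural network is trained); the helper names and the window size are assumptions for illustration.

```python
def context_window(tokens, i, window):
    """Tokens within `window` positions of index i, excluding the token itself."""
    lo = max(0, i - window)
    return tokens[lo:i] + tokens[i + 1:i + 1 + window]

def cbow_examples(tokens, window=1):
    """CBOW: one observation per position, (whole source context) -> target word."""
    return [(tuple(context_window(tokens, i, window)), tokens[i])
            for i in range(len(tokens))]

def skipgram_examples(tokens, window=1):
    """Skip-gram: one observation per (target word, single context word) pair."""
    return [(tokens[i], ctx)
            for i in range(len(tokens))
            for ctx in context_window(tokens, i, window)]

sentence = ["the", "quick", "fox"]
cbow = cbow_examples(sentence)
skipgram = skipgram_examples(sentence)
```

CBOW folds the whole context into a single observation per word, while skip-gram emits a separate observation for every context word, which is consistent with the slide's note that skip-gram suits larger datasets.
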
  • 35. word2vec Libraries gensim Python only Most popular Spark ML Python + Java/Scala Supports only synonyms
  • 36. *2vec lda2vec LDA (global) + word2vec (local) From Chris Moody @ Stitch Fix like2vec Embedding-based Recommender
  • 37. word2vec vs. GloVe Both are Fundamentally Similar Capture local co-occurrence statistics (neighbors) Capture distance between embedding vectors (analogies) GloVe Count-based Also captures global co-occurrence statistics Requires upfront pass through entire dataset
  • 38. SyntaxNet POS Tagging Determine coarse-grained grammatical role of each word Multiple contexts, multiple roles Neural Net Inputs: stack, buffer Results: POS probability distro Already Tagged
  • 39. SyntaxNet Dependency Parser Determine fine-grained roles using grammatical relationships “Transition-based”, Incremental Dependency Parser Globally Normalized using Beam Search with Early Update Parsey McParseface: Pre-trained Parser/Tagger avail in 40 langs Fine-grained Coarse-grained
  • 40. SyntaxNet Use Case: Nutrition Nutrition and Health Startup in SF (Stealth) Using Google’s SyntaxNet Rate Recipes and Menus by Nutritional Value Correct Incorrect
  • 41. Model Validation Unsupervised Learning Requires Validation Google has Published Analogy Tests for Model Validation Thanks, Google!
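
An analogy test of the kind mentioned on this slide asks whether "a is to b as c is to d" holds in the embedding space, typically by checking that the vector `b - a + c` lands closest to `d`. The sketch below uses hand-made two-dimensional vectors purely for illustration; they are not real embeddings, and the function names are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def solve_analogy(emb, a, b, c):
    """Return the word d that best completes 'a is to b as c is to d'."""
    target = [vb - va + vc for va, vb, vc in zip(emb[a], emb[b], emb[c])]
    candidates = (word for word in emb if word not in (a, b, c))
    return max(candidates, key=lambda word: cosine(emb[word], target))

# Toy vectors invented for this example only.
toy = {
    "man": [1.0, 0.0], "woman": [1.0, 1.0],
    "king": [3.0, 0.0], "queen": [3.0, 1.0],
    "apple": [0.0, -1.0],
}
answer = solve_analogy(toy, "man", "woman", "king")
```

Validation then amounts to running many such questions and reporting the fraction the model answers correctly.
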
  • 42. Thank You, Atlanta! Chris Fregly, Research Scientist @ PipelineIO All Source Code, Demos, and Docker Images @ pipeline.io Join the Global Meetup for all Slides and Videos @ advancedspark.com