National Aeronautics and Space
Administration
Jet Propulsion Laboratory
California Institute of Technology
Machine Learning and Instrument Autonomy Group (398E)
Contact (Supervisor): Tara Estlin
Presenter: Lukas Mandrake
Tara.a.estlin@jpl.nasa.gov
Lukas.mandrake@jpl.nasa.gov
© 2015 California Institute of Technology
Government sponsorship acknowledged
• Machine Learning Definition
• Amenable Data Types
• Curse of Dimensionality
• Common Techniques
• Example Applications
[Notional graph: fraction of thinking time spent vs. available data volume (kB, MB, GB, TB)]
Electrical sensors for data taking
Computers for analytics
ML for building models
Leaves humans to interpret
Scatter Data: what you do
Pick 2 dimensions and plot
Linear regressions
Correlation coefficients, R²
Higher-order curve fits
Separate & identify sub-populations
Make & compare models
Throw out outliers
Wonder if you picked the best X-axis
[Notional scatter plots: "what you hope for" (a clear trend in an important variable, with a potentially interesting sub-region) vs. "what you get"]

Spectral Data (counts vs. bins in λ, ν): what you do
Fit slopes
Subtract / divide backgrounds
Take ratios of known frequencies
Look for familiar peaks
Make & compare models
Hope you know which frequencies are most informative

Does it work? YES! It's how we got here.
But it takes a LOT of human time, and human time is the currency of merit.
• Replicate what humans do while analyzing data
– Builds and auto-tunes models
– Forces analyst to make metrics clear(er)
• Subjective decisions become principled (or at least
repeatable) and FAST!
• Higher dimensions than humans can visualize
• Replicate over huge datasets
• In the end, extends human analysis.
No one gets replaced, they get augmented!
Hyperspectral imagers: pictures with high-res spectra at each pixel
4000 frequencies x 5e6 pixels x 3000 images
A human can select a pixel to study its spectrum, or make an image
of a particular frequency (or ratio). Maybe take a dot product with a
desired spectrum.
Most of the images remain unexamined! A needle-in-a-haystack problem.
Image databases: How to search?
A Martian crater expert only wants images with craters, but it would take hundreds of
grad students to label them all. Even if you do, each student's criteria will be
different (fatigue).
This classification problem is ubiquitous in image and video processing
Maybe want to find list of “things that aren’t anything you taught me” to look for
interesting new landforms not expected.
Compare nearly overlapping images at different times and find “what’s changed”
for dynamics study.
Just too large for even a team to tackle.
Metadata-Rich Datasets:
Observations aren’t just numbers: whole vector of associated data
- Estimates of T, P, aerosol estimates, cloudiness, H2O content…
- 10 to 1000 such parameters
Look for correlated trends, sub-category description / separation,
anomaly discovery, and primary correlates to unwanted behavior
Fundamentally >2D relationships will be missed by simple plotting
Object databases: What’s in there?
Martian rock database records dozens of properties of each rock
examined… thousands or millions of them
What rock types group together?
What stands out as unique? Why was it unique?
Given we now know that, what’s the next most unique thing?
• Unit-sphere volume increases until dim=5, then goes to zero as d->inf
• A sphere within a cube eventually removes no volume from the cube
• A smaller sphere within a larger one removes no volume
• Nearly zero probability within 1 stdev of a high-D Gaussian (“Losing the Middle”)
…
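The first bullet is easy to verify numerically. A minimal sketch (mine, not from the presentation; the function name is illustrative) using the standard formula for the volume of the unit d-ball, V_d = π^(d/2) / Γ(d/2 + 1):

```python
import math

def unit_sphere_volume(d):
    """Volume of the unit d-ball: pi^(d/2) / Gamma(d/2 + 1)."""
    return math.pi ** (d / 2) / math.gamma(d / 2 + 1)

# Volume peaks near d = 5, then collapses toward zero.
volumes = [unit_sphere_volume(d) for d in range(1, 31)]
peak_dim = 1 + max(range(30), key=lambda i: volumes[i])
```

Running this shows the peak at d = 5 and a volume below 1e-4 by d = 30: almost all of a high-dimensional cube's volume lies outside any inscribed sphere.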
Example: Analyze a genetics dataset
• Data: 1e6 samples, each a gene snippet 1e5 base-pairs long
• “Find the gene locations that correlate with an observed condition”
• A dataset with only 1e2 base-pairs had already been processed just fine
Each base-pair location has ~10 samples. Dbig makes data coverage -> 0
Regressions detonate as (ever more likely) correlated genes cause singularities
Even without correlations, distance function diluted. Everybody is very far.
Dilutes meaning of all regressions, nearest neighbor comparisons
Space is too large to be searched. Requires exponential sampling.
Errors! Singular matrices! Meaningless results!
Distance² = D₁² + D₂² + D₃² + D₄² + D₅² + D₆² + D₇² + ···
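The distance dilution is easy to see directly. A minimal sketch (mine, purely illustrative): draw random points in a unit hypercube and watch the gap between nearest and farthest pairs vanish as the dimension grows.

```python
import math
import random

def pairwise_distance_spread(n_points, dim, seed=0):
    """(max - min) / min over all pairwise Euclidean distances.

    Large in low dimensions; shrinks toward 0 as dim grows, so
    'nearest' and 'farthest' neighbors become indistinguishable."""
    rng = random.Random(seed)
    pts = [[rng.random() for _ in range(dim)] for _ in range(n_points)]
    dists = [
        math.sqrt(sum((a - b) ** 2 for a, b in zip(pts[i], pts[j])))
        for i in range(n_points)
        for j in range(i + 1, n_points)
    ]
    return (max(dists) - min(dists)) / min(dists)
```

In a few dimensions the ratio is huge; by a few hundred dimensions every point is roughly equidistant from every other, which is exactly what breaks regressions and nearest-neighbor comparisons.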
• Identify most informative dimensions or Mixtures
“feature selection”
• Requires a search over number of dimensions D
Can take time
• But once informative features recognized,
everything else is faster and easier
• Fundamental to Machine Learning,
usually as a first step
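As a toy illustration (my sketch, not an algorithm from the talk), the simplest "filter" style of feature selection scores each dimension one at a time by its correlation with the quantity of interest:

```python
import random

def correlation(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sx = sum((x - mx) ** 2 for x in xs) ** 0.5
    sy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (sx * sy)

def rank_features(samples, targets):
    """Return dimension indices sorted by |correlation| with the target."""
    n_dims = len(samples[0])
    scored = [
        (abs(correlation([s[d] for s in samples], targets)), d)
        for d in range(n_dims)
    ]
    return [d for _, d in sorted(scored, reverse=True)]
```

A one-at-a-time filter misses dimensions that are only informative jointly, which is why real feature selection can require a genuine search over mixtures.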
Operational Scenarios
• Data > Transmission
– Autonomously prioritize by “interest”
– “interest” can be specified, calculated in-situ, or anomalous
• Comm Delay > Decision Time
– Autonomous decision making
– Pre-defined or anomalous triggers
– Plan/schedule response and follow-up
• Volume > Analysis Time
– Identify uninteresting / interesting data and sub-populations
– Identify anomalies, test models
• Data collection capability > data storage capability
– Autonomous decision what to retain
Unsupervised
“Data Mining”
Supervised
“Learning”
Other Methods
• Algorithm studies data independently
• User does not “help” algorithm understand
• No user assumptions to corrupt results
• No human expertise either
• Human provides labeled examples to “learn”
• Human selects algorithm / model for generalization
• Algorithm figures out how to generalize labels
• Resulting tuned model reveals structure of data
• Produces useful system for replicating the labeling
• Might involve humans as part of the learning cycle
• Might seek feedback to make new labeled data
• Might use evolution to figure out best ML parameters
• Might maintain multi-goal output space
Finding Hidden Structure In Unlabeled Data
Clustering: “Are there sub-populations in my data?”
Assigns every observation to one of n clusters
Easy to see in 2D, harder in Dbig
Sub-populations can guide analyst to independent analysis
May correspond to physically meaningful populations
Must provide distance metric, algorithm, parameters, data
filtration, parameter n
PCA: “What combination of dims explains my data Var?”
New axes based on linear combination of original dims
Axes ordered in terms of data variance they explain
Works if data variance is all “interesting” (rare)
Dimension reduction: take first n axes until 99% Var explained
Can’t handle dimensional correlations
HMM: “What statistical model produced my data?”
Pertains to time-series or sequential data
Assumes prob model that only depends on last state
Constructs most likely model that would explain dataset
Can reveal hidden relations and driving processes
Principal Component Analysis
Hidden Markov Model
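For concreteness, clustering in its simplest form (k-means, via Lloyd's algorithm) fits in a few lines. This is my illustrative sketch, not code from the talk; real use would go through a library such as scikit-learn:

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Minimal Lloyd's algorithm: assign each point to the nearest
    centre, recompute centres as group means, repeat."""
    rng = random.Random(seed)
    centres = list(rng.sample(points, k))
    for _ in range(iters):
        groups = [[] for _ in range(k)]
        for p in points:
            nearest = min(
                range(k),
                key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centres[c])),
            )
            groups[nearest].append(p)
        for i, g in enumerate(groups):
            if g:  # keep the old centre if a group went empty
                centres[i] = tuple(sum(col) / len(g) for col in zip(*g))
    return centres, groups
```

Note everything the user still had to supply: the distance metric (squared Euclidean here), the parameter k, and the iteration budget; exactly the choices the slide warns about.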
Finding Hidden Structure In Unlabeled Data
Rules Learning: “What events tend to co-occur?”
Discovers strong, simple rules between dimensions
Useful to figure out hidden relations in large datasets
Can also be helpful to remove correlated features
Gives potential rules for interpretation and investigation
Segmentation: “What regions best describe image?”
Groups pixels/samples into larger regions of similar nature
Expensive, slow analysis may then be done per region
Averaging across region may reduce noise in “super pixel”
Helps image recognition and classification tasks
Focuses analysis on complex areas vs boring stretches
DOGO: “Order my data by its quality / utility”
Specify a metric to max/minimize
Finds features that, via filtration, optimize metric
Constructs sliding filter that monotonically reduces metric
Inverts filter to produce data ordering of most to least trusted
Useful when data isn’t merely “good” or “bad”
Data Ordering through Genetic Optimization
Algorithms Learning from Humans
LDA: “What hyperplane would best separate my labels?”
Like PCA, but works to separate labels, not explain Variance
Returns vector of how useful each dimension was to separate
Vulnerable to correlated features
Surprisingly powerful for simple classification
SVM: “What set of samples best define label separation?”
Same idea as LDA: make a separating hyperplane
Pays attention only to the most confusing examples
Creates a “basis” set of support vectors, most informative samples
Gives idea of data importance: which samples change answer
Neural Network: “Predict my labels, I don’t care how”
Define layer geometry: # hidden layers, input types, output types
Train on input data and user labels. Defines weights.
Black-box predictor now online
Monte Carlo stimulate inputs to maximize output concept signal
Hard to get insight from network itself
Linear Discriminant Analysis
Support Vector Machine
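The "separating hyperplane" idea behind LDA and SVM can be shown with the simplest possible learner, a perceptron. This sketch is mine (it is neither LDA nor an SVM, which need more machinery); it only illustrates the shared geometric intuition:

```python
def train_perceptron(samples, labels, epochs=50, lr=0.1):
    """Find w, b with sign(w.x + b) matching +1/-1 labels on
    linearly separable data, by nudging the plane at each mistake."""
    w = [0.0] * len(samples[0])
    b = 0.0
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin <= 0:  # misclassified: move the plane toward x
                w = [wi + lr * y * xi for wi, xi in zip(w, x)]
                b += lr * y
    return w, b

def predict(w, b, x):
    """Classify by which side of the hyperplane x falls on."""
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b > 0 else -1
```

The trained (w, b) is the hyperplane; LDA chooses it to separate class means, while an SVM chooses it to maximize the margin around the most confusing samples.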
Algorithms Learning from Humans
Decision Tree: “Play 20 questions to separate my labels.”
Set # of tree branches allowed
Learns best series of questions that isolate provided labels
Directly interpretable by domain experts… no black box
Extremely fast to evaluate once trained
Naïve Bayes: “What’s the probability of belonging to each label?”
Assumes non-correlated dimensions (often works despite this)
Needs relatively small number of input labels
Learns distribution of training labels independently
Predicts probability of sample being in all label categories
Can easily have “I don’t know” response added
Nearest Neighbor: “Use comparables to predict label”
User picks how many neighbors to consider
For each sample to predict, scans all input training data
Finds k neighbors by distance metric, then averages labels
Learns no structure, uses no models, just distance metric
Can be slow to evaluate with lots of labeled data
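Nearest Neighbor is simple enough to show in full; a minimal sketch (mine, not from the presentation):

```python
import math
from collections import Counter

def knn_predict(train_x, train_y, query, k=3):
    """Vote among the k training samples nearest to the query."""
    neighbors = sorted(
        (math.dist(x, query), y) for x, y in zip(train_x, train_y)
    )[:k]
    votes = Counter(y for _, y in neighbors)
    return votes.most_common(1)[0][0]
```

For example, with three "rock" points near the origin and three "dune" points near (10, 10), a query at (0.5, 0.5) votes "rock". Every prediction scans all training data, which is where the slowness comes from.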
Algorithms Learning from Humans
Random Forest: “Make lots of little trees and take vote.”
Trains hundreds of small trees on label data subsets
In final prediction, let them vote for final output
Overcomes Decision Tree tendency to overfit data
Removes Decision Tree strength of interpretability
Boosting / Ensemble: “Combine algorithms to improve.”
Iteratively train lots of weak / simple methods (any mix will do)
Larger optimization (genetic) twiddles with all their parameters
Larger optimization learns weights to combine answers together
Takes a lot of processing power and input data, but great results
Can learn about data based on which predictors were selected
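A minimal sketch of the bag-and-vote idea (mine, and far simpler than a real Random Forest: one-question "stumps" instead of trees, and no feature subsampling):

```python
import random

def best_stump(samples, labels):
    """One-question tree: pick (dimension, threshold, sign) with the
    fewest training errors; labels are +1/-1."""
    best = None
    for d in range(len(samples[0])):
        for t in sorted({s[d] for s in samples}):
            for sign in (1, -1):
                err = sum(
                    (sign if s[d] > t else -sign) != y
                    for s, y in zip(samples, labels)
                )
                if best is None or err < best[0]:
                    best = (err, d, t, sign)
    _, d, t, sign = best
    return lambda s: sign if s[d] > t else -sign

def bagged_vote(samples, labels, n_stumps=25, seed=0):
    """Train each stump on a bootstrap resample; predict by majority vote."""
    rng = random.Random(seed)
    stumps = []
    for _ in range(n_stumps):
        idx = [rng.randrange(len(samples)) for _ in samples]
        stumps.append(
            best_stump([samples[i] for i in idx], [labels[i] for i in idx])
        )
    return lambda s: 1 if sum(f(s) for f in stumps) > 0 else -1
```

Each individual stump is weak and varies with its bootstrap sample; the vote averages out that variance, which is the same mechanism that lets a forest outperform a single overfit tree.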
Algorithms that work with or without labels
or that are interactive and can generate them
Genetic Algorithm: “Iteratively optimize my (gene) model”
Define a “gene” of all parameters you want GA to optimize
Define goal metric(s) GA should try to max/minimize
GA’s handle mixed input, arbitrary goal metrics
Not really learning, but useful in similar situations. Slow.
Active Learning: “What should I have labeled to help?”
Initialize system with supervised or unsupervised method
Have system predict and display to user
User corrects errors and addresses confusing examples
Iterate between prediction and feedback until results look good
Can easily over-fit, so hold-out tests are important here
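The Genetic Algorithm entry above can be made concrete with a toy GA over bit-string "genes" (my sketch; real GA libraries add elitism, better selection schemes, and mixed gene types):

```python
import random

def genetic_maximise(fitness, gene_len, pop_size=30, generations=40, seed=0):
    """Toy GA: tournament selection, one-point crossover, bit-flip mutation."""
    rng = random.Random(seed)
    population = [
        [rng.randint(0, 1) for _ in range(gene_len)] for _ in range(pop_size)
    ]

    def tournament():
        a, b = rng.sample(population, 2)
        return a if fitness(a) >= fitness(b) else b

    for _ in range(generations):
        children = []
        while len(children) < pop_size:
            p1, p2 = tournament(), tournament()
            cut = rng.randrange(1, gene_len)   # one-point crossover
            child = p1[:cut] + p2[cut:]
            if rng.random() < 0.1:             # occasional mutation
                i = rng.randrange(gene_len)
                child[i] ^= 1
            children.append(child)
        population = children
    return max(population, key=fitness)
```

On the "one-max" problem (fitness = number of 1 bits) this quickly drives the population toward the all-ones gene; note that nothing about the fitness function needs to be differentiable or even numeric-valued beyond being comparable.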
Some concepts to save you from harm
Over-Fitting
• Occurs when samples are not >> degrees of freedom
• Gets you great results… on your training set
• Can’t generalize! Predictor fails on new data.
• Use cross-validation and/or simpler model
• Train on 20% data, test on 80%? Vice versa? 50/50?
• Depends on data volume and algorithm need
• Data structure also determines: how heterogeneous
Train / Test Split
• Automated way to explore all possible train/test splits
• Second level withholds data from test & train
• Takes lots of data and time
• Actually tests generalization
Cross-Validation
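The mechanics are simple enough to sketch (my illustration; libraries such as scikit-learn provide this as `cross_val_score`):

```python
import random

def k_fold_scores(samples, labels, train_fn, k=5, seed=0):
    """Hold out each of k folds in turn: train on the rest, score on it.

    train_fn(train_x, train_y) must return a callable model: x -> label."""
    rng = random.Random(seed)
    order = list(range(len(samples)))
    rng.shuffle(order)
    folds = [order[i::k] for i in range(k)]
    scores = []
    for i, test_fold in enumerate(folds):
        train_idx = [j for f in folds[:i] + folds[i + 1:] for j in f]
        model = train_fn(
            [samples[j] for j in train_idx], [labels[j] for j in train_idx]
        )
        correct = sum(model(samples[j]) == labels[j] for j in test_fold)
        scores.append(correct / len(test_fold))
    return scores
```

A wide spread across the k scores is itself diagnostic: it suggests the data is too heterogeneous, or the model too unstable, for any single train/test split to be trusted.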
Label Imbalance
• If 5% of your labels are “yes” and 95% are “no”…
• Just guess no all the time, and you’re right 95%!
• This can imbalance some training algorithms
Normalization
• Should you normalize all inputs between 0-1?
• Perhaps they should have mean 0, STD=1 instead?
• Not if spectrum where relative intensity matters
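Both normalizations mentioned are one-liners; a sketch (mine) of the mean-0 / STD-1 version, usually called z-scoring:

```python
def zscore(column):
    """Rescale one feature column to mean 0, standard deviation 1."""
    n = len(column)
    mean = sum(column) / n
    std = (sum((x - mean) ** 2 for x in column) / n) ** 0.5
    return [(x - mean) / std for x in column]
```

As the slide warns, apply this per feature only when absolute scale is meaningless; for a spectrum where relative intensities between channels carry the physics, per-channel z-scoring destroys exactly the signal you want.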
• 4 metrics: TPOS, TNEG, FPOS, FNEG
• Application Specific
• Make trade-off curve by running algorithm w/ different parameters
• Receiver Operating Characteristic (ROC)
• Just means “How often do you miss” vs. “How often do you hallucinate?”
• The curve says what your options are. You pick what you can tolerate most.
[Notional ROC curve: true positives (y-axis) vs. false positives (x-axis)]
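An ROC curve is just a threshold sweep; a minimal sketch (mine, not from the presentation) over classifier scores and binary labels:

```python
def roc_points(scores, labels):
    """Sweep a decision threshold over scores (labels are 0/1);
    return (false-positive rate, true-positive rate) pairs."""
    pos = sum(labels)
    neg = len(labels) - pos
    points = []
    for t in sorted(set(scores), reverse=True):
        tp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 1)
        fp = sum(1 for s, y in zip(scores, labels) if s >= t and y == 0)
        points.append((fp / neg, tp / pos))
    return points
```

Each point is one choice of operating threshold; picking where to sit on the curve is the application-specific trade-off the slide describes.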
• Automatically Identifies and Classifies Rocks
Image Data Classification Random Forest
Dr. David Thompson
[Figure: state snapshots at March 11, 2011 0500/0530/0600/0630/0700 UTC and March 13, 2011 1300 UTC: nominal states before the rupture; rupture initiation and propagation (~1.5-hour timescale of state changes); rupture completion and return to nominal states; two days later, growth of a feature near the triple junction (near Tokyo)]
• Looks for Earthquake behavior people miss
Time Series Data Anomaly Detection Hidden Markov Models
Dr. Robert Granat
Full Pancam View
AEGIS autonomously delivers 13F Pancam image
• Notices “interesting” objects while driving/scanning
• Takes higher resolution images for later analysis
• Operational on MER
Real Time Image Data Anomaly Detection Segmentation
Dr. Tara Estlin
• Prioritizes incoming soundings on usefulness for further analysis
• Lets retrieval algorithms initially work only on cleanest data
• Will be operational in OCO-2 DAC to meet Level 1 requirements
• Advises scientists on which data to include in their analysis
Real-Time Soundings Data Prioritization Genetic Algorithm
Dr. Lukas Mandrake
• Automatically recognize, outline, and classify Martian landmarks
• HiRISE database = tens of thousands of huge-resolution images
• How to search for your field of interest?
• What are statistics on various landforms?
Image Database Anomaly Detection & Classification Boost + SVM
Dr. Kiri Wagstaff
[Figures: a transient signal, and decision boundaries for multi-station transient detectors separating Noise, RFI, and Transients in the parameter space (quadratic sum vs. robust ECDF features of (x1,d) and (x2,d))]
• Recognizes brand new Supernova in a few seconds
Fast, Real-Time Series Data Anomaly Detection Random Forests
Supernova &
Pulsars
Dr. Umaa Rebbapragada
• Machine Learning is for everyone!
• Relatively simple algorithms lying around for use
• Can help researcher understand their data initially
• Can help drill-down into sub-populations
• Can automate monotonous labeling tasks
• Available in
– Python (scikit-learn, Orange); Java (Weka)
– Matlab (Statistics, Neural Net, Fuzzy Logic Toolboxes)
– Most languages (OpenCV)
Or just jot us an email! We love to collaborate.
Machine Learning Summary for Caltech2

  • 1. National Aeronautics and Space Administration Jet Propulsion Laboratory California Institute of Technology Machine Learning and Instrument Autonomy Group (398E) Contact (Supervisor): Tara Estlin Presenter: Lukas Mandrake Tara.a.estlin@jpl.nasa.gov Lukas.mandrake@jpl.nasa.gov © 2015 California Institute of Technology Government sponsorship acknowledged
  • 2. • Machine Learning Definition • Amenable Data Types • Curse of Dimensionality • Common Techniques • Example Applications 2
  • 3. 3
  • 4. Notional Graphs Only %ThinkingSpent kB MB GB TB Available Data Electrical sensors for data taking Computers for analytics ML for building models Leaves humans to interpret 4
  • 5. Linear regressions Correlation coeff’s, R2 Higher curve fits Separate & identify sub-populations Make & Compare Models Throw out outliers Wonder if you picked the best X-axis Pick 2 dimensions and plot Potentially interesting Importantvariable What you hope for Scatter Data What you get What you do Spectral Data Fit slopes Subtract / divide backgrounds Take ratios of known frequencies Look for familiar peaks Make & Compare Models Hope you know which frequencies are most informative Does it work? YES! It’s how we got here. Takes a LOT of human time That’s the currency of merit Bins (λ,ν) Counts 5
  • 6. • Replicate what humans do while analyzing data – Builds and auto-tunes models – Forces analyst to make metrics clear(er) • Subjective decisions become principled (or at least repeatable) and FAST! • Higher dimensions than humans can visualize • Replicate over huge datasets • In the end, extends human analysis. No one gets replaced, they get augmented! 6
  • 7. 7
  • 8. Hyperspectral imagers: pictures with high-res spectra at each pixel 4000 frequencies x 5e6 pixels x 3000 images A human can select a pixel to study the spectra, or make an image of a particular frequency (or ratio). Maybe take a dot product with a desired spectra. Most of image(s) remain unexamined! Needle in haystack problem. Image databases: How to search? Martian crater expert only wants images with craters, but would take hundreds of grad students to label them all. Even if you do, each student’s criteria will be different (fatigue). This classification problem is ubiquitous in image and video processing Maybe want to find list of “things that aren’t anything you taught me” to look for interesting new landforms not expected. Compare nearly overlapping images at different times and find “what’s changed” for dynamics study. Just too large for even a team to tackle.8
  • 9. Metadata-Rich Datasets: Observations aren’t just numbers: whole vector of associated data - Estimates of T, P, aerosol estimates, cloudiness, H2O content… - 10 to 1000 such parameters Look for correlated trends, sub-category description / separation, anomaly discovery, and primary correlates to unwanted behavior Fundamentally >2D relationships will be missed by simple plotting Object databases: What’s in there? Martian rock database records dozens of properties of each rock examined… thousands or millions of them What rock types group together? What stands out as unique? Why was it unique? Given we now know that, what’s the next most unique thing? 9
  • 10. • Sphere vols increase until dim=5 then go to zero as d->inf • Sphere within cube eventually removes no volume from cube • A smaller sphere within a larger removes no volume • Zero prob within 1 stdev in Gaussian (“Losing the Middle”) … 10
  • 11. Example: Analyze a genetics dataset • Data: 1e6 samples, each is gene snippet 1e5 base-pair long • “Find the gene locations that correlate with an observed condition” • Already processed dataset with only 1e2 base-pairs just fine Each base-pair location has ~10 samples. Dbig makes data coverage -> 0 Regressions detonate as (ever more likely) correlated genes cause singularities Even without correlations, distance function diluted. Everybody is very far. Dilutes meaning of all regressions, nearest neighbor comparisons Space is too large to be searched. Requires exponential sampling. Errors! Singular matrices! Meaningless results! D1 2 + D2 2 +D3 2 + D4 2 + D5 2 + D6 2 +D7 2 +××× 11
  • 12. • Identify most informative dimensions or Mixtures “feature selection” • Requires a search over number of dimensions D Can take time • But once informative features recognized, everything else is faster and easier • Fundamental to Machine Learning, usually as a first step 12
  • 13. 13
  • 14. • Operational Scenarios• Data > Transmission – Autonomous prioritize by “interest” – “interest” can be specified, calculated in-situ, or anomalous • Comm Delay > Decision Time – Autonomous decision making – Pre-defined or anomalous triggers – Plan/schedule response and follow-up • Volume > Analysis Time – Identify uninteresting / interesting data and sub-populations – Identify anomalies, test models • Data collection capability > data storage capability – Autonomous decision what to retain 14
  • 15. 15
  • 16. Unsupervised “Data Mining” Supervised “Learning” Other Methods • Algorithm studies data independently • User does not “help” algorithm understand • No user assumptions to corrupt results • No human expertise either • Human provides labeled examples to “learn” • Human selects algorithm / model for generalization • Algorithm figures out how to generalize labels • Resulting tuned model reveals structure of data • Produces useful system for replicating the labeling • Might involve humans as part of the learning cycle • Might seek feedback to make new labeled data • Might use evolution to figure out best ML parameters • Might maintain multi-goal output space 16
  • 17. 17
  • 18. Finding Hidden Structure In Unlabeled Data Clustering: “Are there sub-populations in my data?” Defines n clusters to which all observations are members Easy to see in 2D, harder in Dbig Sub-populations can guide analyst to independent analysis May correspond to physically meaningful populations Must provide distance metric, algorithm, parameters, data filtration, parameter n PCA: “What combination of dims explains my data Var?” New axes based on linear combination of original dims Axes ordered in terms of data variance they explain Works if data variance is all “interesting” (rare) Dimension reduction: take first n axes until 99% Var explained Can’t handle dimensional correlations HMM: “What statistical model produced my data?” Pertains to time-series or sequential data Assumes prob model that only depends on last state Constructs most likely model that would explain dataset Can reveal hidden relations and driving processes Principle Component Analysis Hidden Markov Model 18
  • 19. Finding Hidden Structure In Unlabeled Data Rules Learning: “What events tend to co-occur?” Discovers strong, simple rules between dimensions Useful to figure out hidden relations in large datasets Can also be helpful to remove correlated features Gives potential rules for interpretation and investigation Segmentation: “What regions best describe image?” Groups pixels/samples into larger regions of similar nature Expensive, slow analysis may then be done per region Averaging across region may reduce noise in “super pixel” Helps image recognition and classification tasks Focuses analysis on complex areas vs boring stretches DOGO: “Order my data by its quality / utility” Specify a metric to max/minimize Finds features that, via filtration, optimize metric Constructs sliding filter that monotonically reduces metric Inverts filter to produce data ordering of most to least trusted Useful when data isn’t merely “good” or “bad” Data Ordering through Genetic Optimization 19
  • 20. 20
  • 21. Algorithms Learning from Humans LDA: “What hyperplane would best separate my labels?” Like PCA, but works to separate labels, not explain Variance Returns vector of how useful each dimension was to separate Vulnerable to correlated features Surprisingly powerful for simple classification SVM: “What set of samples best define label separation?” Same idea of LDA. Make a separating hyperplane Pays attention only to the most confusing examples Creates a “basis” set of support vectors, most informative samples Gives idea of data importance: which samples change answer Neural Network: “Predict my labels, I don’t care how” Define layer geometry: # hidden layers, input types, output types Train on input data and user labels. Defines weights. Black-box predictor now online Monte Carlo stimulate inputs to maximize output concept signal Hard to get insight from network itself Linear Discriminant Analysis Support Vector Machine 21
  • 22. Algorithms Learning from Humans Decision Tree: “Play 20 questions to separate my labels.” Set # of tree branches allowed Learns best series of questions that isolate provided labels Directly interpretable by domain experts… no black box Extremely fast to evaluate once trained Naïve Bayes: “What’s probability of belong to any label?” Assumes non-correlated dimensions (often works despite this) Needs relatively small number of input labels Learns distribution of training labels independently Predicts probability of sample being in all label categories Can easily have “I don’t know” response added Nearest Neighbor: “Use comparables to predict label” User picks how many neighbors to consider For each sample to predict, scans all input training data Finds k neighbors by distance metric, then averages labels Learns no structure, uses no models, just distance metric Can be slow to evaluate if lots of labeled data22
  • 23. Algorithms Learning from Humans Random Forest: “Make lots of little trees and take vote.” Trains hundreds of small trees on label data subsets In final prediction, let them vote for final output Overcomes Decision Tree tendency to overfit data Removes Decision Tree strength of interpretability Boosting / Ensemble: “Combine algorithms to improve.” Iteratively train lots of weak / simple methods (any mix will do) Larger optimization (genetic) twiddles with all their parameters Larger optimization learns weights to combine answers together Takes a lot of processing power and input data, but great results Can learn about data based on which predictors were selected 23
  • 24. 24
  • 25. Algorithms that work with or without labels or that are interactive and can generate them Genetic Algorithm: “Iteratively optimize my (gene) model” Define a “gene” of all parameters you want GA to optimize Define goal metric(s) GA should try to max/minimize GA’s handle mixed input, arbitrary goal metrics Not really learning, but useful in similar situations. Slow. Active Learning: “What should I have labeled to help?” Initialize system with supervised or unsupervised method Have system predict and display to user User corrects errors and addresses confusing examples Iterate between prediction and feedback until results look good Can easily over-fit, so hold-out tests are important here 25
• 26. Some concepts to save you from harm
Over-Fitting
– Number of samples is not >> degrees of freedom
– Gets you great results… on your training set
– Can’t generalize! Predictor fails on new data.
– Use cross-validation and/or a simpler model
Train / Test Split
– Train on 20% of the data, test on 80%? Vice versa? 50/50?
– Depends on data volume and algorithm needs
– Data structure also matters: how heterogeneous is it?
Cross-Validation
– Automated way to explore many possible train/test splits
– A second level withholds data from both test & train
– Takes lots of data and time
– Actually tests generalization
Label Imbalance
– If 5% of your labels are “yes” and 95% are “no”…
– Just guess “no” all the time, and you’re right 95% of the time!
– This can bias some training algorithms
Normalization
– Should you normalize all inputs between 0 and 1?
– Perhaps they should have mean 0, STD = 1 instead?
– Not if the data is a spectrum where relative intensity matters
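These safeguards are all one call each in scikit-learn. A sketch using the bundled iris dataset (the 50/50 split ratio and the choice of kNN are illustrative):

```python
# Hold-out split, cross-validation, and normalization in scikit-learn.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.preprocessing import StandardScaler
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Hold-out split; stratify guards against label imbalance in the split
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=0, stratify=y)

# Normalization: fit the scaling on training data ONLY, then apply to test
scaler = StandardScaler().fit(X_train)
clf = KNeighborsClassifier().fit(scaler.transform(X_train), y_train)
test_score = clf.score(scaler.transform(X_test), y_test)

# Cross-validation: automatically explore several train/test splits
cv_scores = cross_val_score(KNeighborsClassifier(), X, y, cv=5)

print(test_score, cv_scores.mean())  # both well above chance on iris
```

Fitting the scaler on the training set alone matters: letting the test set influence the normalization is itself a subtle form of peeking at held-out data.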
• 27.
• 4 metrics: TPOS, TNEG, FPOS, FNEG (true/false positives and negatives)
• Which trade-off matters is application specific
• Make a trade-off curve by running the algorithm with different parameters
• Receiver Operating Characteristic (ROC): plots False Positives vs. True Positives
• Just means “How often do you miss?” vs. “How often do you hallucinate?”
• The curve shows what your options are. You pick what you can tolerate most.
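Sweeping a classifier’s decision threshold and recording the four metrics at each setting is exactly what `roc_curve` does. A sketch with invented labels and scores:

```python
# Trace the ROC trade-off between false positives and true positives.
from sklearn.metrics import roc_curve, roc_auc_score

# Hypothetical true labels and classifier confidence scores (invented)
y_true = [0, 0, 0, 0, 1, 1, 1, 1]
y_score = [0.1, 0.2, 0.3, 0.6, 0.4, 0.7, 0.8, 0.9]

fpr, tpr, thresholds = roc_curve(y_true, y_score)
auc = roc_auc_score(y_true, y_score)

# Each (fpr, tpr) pair is one operating point you could choose to tolerate
for f, t in zip(fpr, tpr):
    print(f"miss rate {1 - t:.2f} at false-alarm rate {f:.2f}")
print(f"area under curve = {auc:.4f}")
```

Higher area under the curve means better options at every threshold, but the application still decides which point on the curve you operate at.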
• 28.
• 29. Automatically identifies and classifies rocks
Data: Images | Task: Classification | Method: Random Forest
Dr. David Thompson
• 30. Looks for earthquake behavior people miss
Data: Time Series | Task: Anomaly Detection | Method: Hidden Markov Models
Dr. Robert Granat
– March 11, 2011, 0500–0700 UTC: nominal states before the rupture, then rupture initiation and an ~1.5-hour propagation of state changes through to rupture completion
– March 13, 2011, 1300 UTC (two days later): growth of a feature near the triple junction (near Tokyo)
• 31. AEGIS
– Notices “interesting” objects while driving/scanning
– Takes higher-resolution images for later analysis (e.g., autonomously delivers a 13F Pancam image from the full Pancam view)
– Operational on MER
Data: Real-Time Images | Task: Anomaly Detection | Method: Segmentation
Dr. Tara Estlin
• 32.
– Prioritizes incoming soundings on usefulness for further analysis
– Lets retrieval algorithms initially work only on the cleanest data
– Will be operational in the OCO-2 DAC to meet Level 1 requirements
– Advises scientists on which data to include in their analysis
Data: Real-Time Soundings | Task: Prioritization | Method: Genetic Algorithm
Dr. Lukas Mandrake
• 33.
– Automatically recognize, outline, and classify Martian landmarks
– HiRISE database = tens of thousands of very-high-resolution images
– How do you search for your field of interest?
– What are the statistics on various landforms?
Data: Image Database | Task: Anomaly Detection & Classification | Method: Boosting + SVM
Dr. Kiri Wagstaff
• 34. Recognizes brand-new supernovae in a few seconds
Data: Fast, Real-Time Series | Task: Anomaly Detection | Method: Random Forests
Application: Supernovae & Pulsars
Dr. Umaa Rebbapragada
(Figure: decision boundaries for multi-station transient detectors, separating Noise, RFI, and Transient signals in parameter space)
• 35. Machine Learning is for everyone!
– Relatively simple algorithms are lying around, ready for use
– Can help researchers understand their data initially
– Can help drill down into sub-populations
– Can automate monotonous labeling tasks
– Available in:
– Python (scikit-learn, Orange)
– Java (Weka)
– Matlab (Statistics, Neural Net, Fuzzy Logic Toolboxes)
– Most languages (OpenCV)
Or just jot us an email! We love to collaborate.