5. What you get / What you do
Scatter Data (a potentially interesting variable vs. an important one; what you hope for):
Pick 2 dimensions and plot
Linear regressions: correlation coefficients, R²
Higher-order curve fits
Separate & identify sub-populations
Make & compare models
Throw out outliers
Wonder if you picked the best X-axis
Spectral Data (counts vs. bins (λ, ν)):
Fit slopes
Subtract / divide backgrounds
Take ratios of known frequencies
Look for familiar peaks
Make & compare models
Hope you know which frequencies are most informative
Does it work? YES! It’s how we got here.
But it takes a LOT of human time, and that’s the currency of merit.
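The manual scatter-data workflow above (pick two dimensions, fit a line, check the correlation coefficient and R²) can be sketched in a few lines of numpy. The data here is synthetic; in practice `x` and `y` would be two chosen columns of your dataset.

```python
# Sketch of the manual scatter-data workflow: fit a line, report r and R^2.
import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 10.0, 50)
y = 2.0 * x + 1.0 + rng.normal(0.0, 0.5, x.size)  # noisy linear trend

slope, intercept = np.polyfit(x, y, 1)            # linear regression
r = np.corrcoef(x, y)[0, 1]                       # correlation coefficient
residuals = y - (slope * x + intercept)
r2 = 1.0 - residuals.var() / y.var()              # R^2 of the fit

print(f"slope={slope:.2f} intercept={intercept:.2f} r={r:.3f} R2={r2:.3f}")
```

Every step after this (outlier rejection, higher-order fits, model comparison) repeats this loop by hand, which is exactly the human time the slide is counting.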
6. • Replicate what humans do while analyzing data
– Builds and auto-tunes models
– Forces analyst to make metrics clear(er)
• Subjective decisions become principled (or at least
repeatable) and FAST!
• Higher dimensions than humans can visualize
• Replicate over huge datasets
• In the end, extends human analysis.
No one gets replaced, they get augmented!
8. Hyperspectral imagers: pictures with high-res spectra at each pixel
4000 frequencies x 5e6 pixels x 3000 images
A human can select a pixel to study its spectrum, or make an image
at a particular frequency (or ratio). Maybe take a dot product with a
desired spectrum.
Most of image(s) remain unexamined! Needle in haystack problem.
Image databases: How to search?
Martian crater expert only wants images with craters, but would take hundreds of
grad students to label them all. Even if you do, each student’s criteria will be
different (fatigue).
This classification problem is ubiquitous in image and video processing
Maybe want to find list of “things that aren’t anything you taught me” to look for
interesting new landforms not expected.
Compare nearly overlapping images at different times and find “what’s changed”
for dynamics study.
Just too large for even a team to tackle.
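The "dot product with a desired spectrum" operation mentioned above can be sketched directly. A tiny synthetic cube stands in for a real hyperspectral image; the shapes and the target spectrum are illustrative only.

```python
# Per-pixel spectral matching: dot every pixel's spectrum with a target
# spectrum to make a single "match" image (here, cosine similarity).
import numpy as np

rng = np.random.default_rng(1)
cube = rng.random((4, 5, 100))       # (rows, cols, frequencies), tiny stand-in
target = rng.random(100)             # desired spectrum to search for

norms = np.linalg.norm(cube, axis=2) * np.linalg.norm(target)
match_image = cube @ target / norms  # (rows, cols) similarity map

best = np.unravel_index(np.argmax(match_image), match_image.shape)
print("best-matching pixel:", best)
```

At real scale (4000 frequencies × 5e6 pixels × 3000 images) this single operation is cheap, but deciding *which* targets to match is where the needle-in-haystack problem lives.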
9. Metadata-Rich Datasets:
Observations aren’t just numbers: whole vector of associated data
- Estimates of T, P, aerosols, cloudiness, H2O content…
- 10 to 1000 such parameters
Look for correlated trends, sub-category description / separation,
anomaly discovery, and primary correlates to unwanted behavior
Fundamentally >2D relationships will be missed by simple plotting
Object databases: What’s in there?
Martian rock database records dozens of properties of each rock
examined… thousands or millions of them
What rock types group together?
What stands out as unique? Why was it unique?
Given we now know that, what’s the next most unique thing?
• Sphere volumes increase up to dim = 5, then shrink toward zero as d → ∞
• A sphere inscribed in a cube eventually removes almost no volume from the cube
• A slightly smaller sphere within a larger one removes almost no volume
• Nearly zero probability mass within 1 stdev of a high-dim Gaussian’s mean (“Losing the Middle”)
…
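The "Losing the Middle" claim is easy to check empirically: sample a standard Gaussian in increasing dimension and measure how much mass sits within 1 standard deviation of the mean.

```python
# Fraction of a standard Gaussian's mass within 1 sigma of the mean,
# as dimension grows: it collapses toward zero.
import numpy as np

rng = np.random.default_rng(2)
results = {}
for d in (1, 2, 10, 50):
    x = rng.standard_normal((100_000, d))
    results[d] = (np.linalg.norm(x, axis=1) < 1.0).mean()
    print(f"d={d:3d}  P(|x| < 1 sigma) = {results[d]:.4f}")
```

In 1D this is the familiar ~68%; by d = 50 essentially no samples land inside the unit ball, which is why intuitions trained on 2D plots fail in high dimensions.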
11. Example: Analyze a genetics dataset
• Data: 1e6 samples, each is gene snippet 1e5 base-pair long
• “Find the gene locations that correlate with an observed condition”
• Already processed dataset with only 1e2 base-pairs just fine
Each base-pair location has ~10 samples. Dbig makes data coverage -> 0
Regressions detonate as (ever more likely) correlated genes cause singularities
Even without correlations, distance function diluted. Everybody is very far.
Dilutes meaning of all regressions, nearest neighbor comparisons
Space is too large to be searched. Requires exponential sampling.
Errors! Singular matrices! Meaningless results!
D1² + D2² + D3² + D4² + D5² + D6² + D7² + ···
12. • Identify the most informative dimensions or mixtures of them:
“feature selection”
• Requires a search over number of dimensions D
Can take time
• But once informative features recognized,
everything else is faster and easier
• Fundamental to Machine Learning,
usually as a first step
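A minimal feature-selection sketch, numpy only: score every dimension by its absolute correlation with the target and keep the top k. Real pipelines would typically use something like scikit-learn's `SelectKBest`; the data and the "2 informative features" setup here are synthetic.

```python
# Rank dimensions by |correlation with target|, keep the top k.
import numpy as np

rng = np.random.default_rng(3)
X = rng.standard_normal((200, 20))            # 200 samples, 20 dimensions
y = 3.0 * X[:, 4] - 2.0 * X[:, 11] + rng.normal(0, 0.1, 200)

scores = np.abs([np.corrcoef(X[:, j], y)[0, 1] for j in range(X.shape[1])])
top_k = np.argsort(scores)[::-1][:2]          # indices of the 2 best features
print("most informative dimensions:", sorted(int(i) for i in top_k))
```

Once the informative features are recognized, every downstream algorithm runs on 2 columns instead of 20, which is the "everything else is faster and easier" payoff.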
14. Operational Scenarios
• Data > Transmission
– Autonomous prioritize by “interest”
– “interest” can be specified, calculated in-situ, or anomalous
• Comm Delay > Decision Time
– Autonomous decision making
– Pre-defined or anomalous triggers
– Plan/schedule response and follow-up
• Volume > Analysis Time
– Identify uninteresting / interesting data and sub-populations
– Identify anomalies, test models
• Data collection capability > data storage capability
– Autonomous decision what to retain
16. Unsupervised “Data Mining”
• Algorithm studies data independently
• User does not “help” algorithm understand
• No user assumptions to corrupt results
• No human expertise either
Supervised “Learning”
• Human provides labeled examples to “learn”
• Human selects algorithm / model for generalization
• Algorithm figures out how to generalize labels
• Resulting tuned model reveals structure of data
• Produces useful system for replicating the labeling
Other Methods
• Might involve humans as part of the learning cycle
• Might seek feedback to make new labeled data
• Might use evolution to figure out best ML parameters
• Might maintain multi-goal output space
18. Finding Hidden Structure In Unlabeled Data
Clustering: “Are there sub-populations in my data?”
Defines n clusters to which all observations are members
Easy to see in 2D, harder in Dbig
Sub-populations can guide analyst to independent analysis
May correspond to physically meaningful populations
Must provide distance metric, algorithm, parameters, data
filtration, parameter n
PCA (Principal Component Analysis): “What combination of dims explains my data Var?”
New axes based on linear combination of original dims
Axes ordered in terms of data variance they explain
Works if data variance is all “interesting” (rare)
Dimension reduction: take first n axes until 99% Var explained
Linear only: can’t capture nonlinear dimensional relationships
HMM (Hidden Markov Model): “What statistical model produced my data?”
Pertains to time-series or sequential data
Assumes a probabilistic model that depends only on the previous state
Constructs most likely model that would explain dataset
Can reveal hidden relations and driving processes
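The PCA idea above (new axes ordered by explained variance) fits in a few lines of numpy via the SVD. The 3D dataset here is synthetic and deliberately dominated by one underlying axis.

```python
# PCA via SVD: axes ordered by the fraction of data variance explained.
import numpy as np

rng = np.random.default_rng(4)
t = rng.standard_normal(500)                       # one hidden driver
X = np.column_stack([t,
                     0.5 * t + 0.05 * rng.standard_normal(500),
                     0.05 * rng.standard_normal(500)])  # ~1 real axis in 3D

Xc = X - X.mean(axis=0)                  # center each dimension
U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
explained = s**2 / np.sum(s**2)          # variance fraction per new axis
print("variance explained per axis:", np.round(explained, 3))
```

Dimension reduction is then just keeping the first n axes; here one axis already explains nearly all the variance, so 3D collapses to 1D.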
19. Finding Hidden Structure In Unlabeled Data
Rules Learning: “What events tend to co-occur?”
Discovers strong, simple rules between dimensions
Useful to figure out hidden relations in large datasets
Can also be helpful to remove correlated features
Gives potential rules for interpretation and investigation
Segmentation: “What regions best describe image?”
Groups pixels/samples into larger regions of similar nature
Expensive, slow analysis may then be done per region
Averaging across region may reduce noise in “super pixel”
Helps image recognition and classification tasks
Focuses analysis on complex areas vs boring stretches
DOGO (Data Ordering through Genetic Optimization): “Order my data by its quality / utility”
Specify a metric to max/minimize
Finds features that, via filtration, optimize metric
Constructs sliding filter that monotonically reduces metric
Inverts filter to produce an ordering from most to least trusted data
Useful when data isn’t merely “good” or “bad”
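The rules-learning idea ("what events tend to co-occur?") can be sketched with simple co-occurrence counting: report pairs whose confidence P(B|A) is high. Real association-rule miners (e.g. Apriori) also track support and lift; the boolean "events" here are invented for illustration.

```python
# Toy rules learning: flag event pairs with high confidence P(B | A).
import numpy as np

rng = np.random.default_rng(5)
cloudy = rng.random(1000) < 0.4
rain = cloudy & (rng.random(1000) < 0.8)     # rain only when cloudy
windy = rng.random(1000) < 0.3               # independent event
events = {"cloudy": cloudy, "rain": rain, "windy": windy}

rules = []
for a in events:
    for b in events:
        if a != b and events[a].any():
            conf = (events[a] & events[b]).mean() / events[a].mean()
            if conf > 0.7:
                rules.append((a, b, conf))
                print(f"rule: {a} -> {b}  (confidence {conf:.2f})")
```

The planted relation is recovered in both directions (rain implies cloudy with confidence 1.0), while the independent event generates no rule, which is the "potential rules for interpretation" the slide describes.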
21. Algorithms Learning from Humans
LDA (Linear Discriminant Analysis): “What hyperplane would best separate my labels?”
Like PCA, but works to separate labels, not explain variance
Returns vector of how useful each dimension was to separate
Vulnerable to correlated features
Surprisingly powerful for simple classification
SVM (Support Vector Machine): “What set of samples best define label separation?”
Same idea as LDA: make a separating hyperplane
Pays attention only to the most confusing examples
Creates a “basis” set of support vectors, the most informative samples
Gives idea of data importance: which samples change the answer
Neural Network: “Predict my labels, I don’t care how”
Define layer geometry: # hidden layers, input types, output types
Train on input data and user labels. Defines weights.
Black-box predictor now online
Monte Carlo simulate inputs to maximize output concept signal
Hard to get insight from network itself
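A minimal two-class Fisher LDA can be written directly in numpy: find the direction that best separates two labeled clouds, then classify with a midpoint threshold. scikit-learn's `LinearDiscriminantAnalysis` is the practical tool; the two Gaussian classes here are synthetic.

```python
# Two-class Fisher LDA: separating direction + midpoint decision rule.
import numpy as np

rng = np.random.default_rng(6)
X0 = rng.standard_normal((200, 2))                 # class 0 around origin
X1 = rng.standard_normal((200, 2)) + [3.0, 1.0]    # class 1, shifted

m0, m1 = X0.mean(axis=0), X1.mean(axis=0)
Sw = np.cov(X0.T) + np.cov(X1.T)                   # within-class scatter
w = np.linalg.solve(Sw, m1 - m0)                   # Fisher direction
threshold = w @ (m0 + m1) / 2.0                    # midpoint decision rule

pred = np.vstack([X0, X1]) @ w > threshold
truth = np.r_[np.zeros(200, bool), np.ones(200, bool)]
accuracy = (pred == truth).mean()
print("training accuracy:", accuracy)
```

The weight vector `w` is exactly the "how useful was each dimension" output the slide mentions.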
22. Algorithms Learning from Humans
Decision Tree: “Play 20 questions to separate my labels.”
Set # of tree branches allowed
Learns best series of questions that isolate provided labels
Directly interpretable by domain experts… no black box
Extremely fast to evaluate once trained
Naïve Bayes: “What’s the probability of belonging to each label?”
Assumes non-correlated dimensions (often works despite this)
Needs relatively small number of input labels
Learns distribution of training labels independently
Predicts probability of sample being in all label categories
Can easily have “I don’t know” response added
Nearest Neighbor: “Use comparables to predict label”
User picks how many neighbors to consider
For each sample to predict, scans all input training data
Finds k neighbors by distance metric, then averages labels
Learns no structure, uses no models, just distance metric
Can be slow to evaluate if lots of labeled data
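Nearest neighbor is the simplest of the three to show end to end: no model is learned, and each query is labeled by a majority vote of its k closest training samples. The two training clouds below are synthetic.

```python
# k-nearest-neighbor prediction: distance metric + majority vote only.
import numpy as np

rng = np.random.default_rng(7)
X_train = np.r_[rng.standard_normal((100, 2)),
                rng.standard_normal((100, 2)) + 4.0]
y_train = np.r_[np.zeros(100, int), np.ones(100, int)]

def knn_predict(x, k=5):
    dists = np.linalg.norm(X_train - x, axis=1)   # the distance metric
    votes = y_train[np.argsort(dists)[:k]]        # labels of k nearest
    return np.bincount(votes).argmax()            # majority vote

print(knn_predict(np.array([4.0, 4.0])))
print(knn_predict(np.array([0.0, 0.0])))
```

Note the full scan of the training set per query: that is the "slow to evaluate if lots of labeled data" cost.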
23. Algorithms Learning from Humans
Random Forest: “Make lots of little trees and take vote.”
Trains hundreds of small trees on label data subsets
In final prediction, let them vote for final output
Overcomes Decision Tree tendency to overfit data
Removes Decision Tree strength of interpretability
Boosting / Ensemble: “Combine algorithms to improve.”
Iteratively train lots of weak / simple methods (any mix will do)
Larger optimization (genetic) twiddles with all their parameters
Larger optimization learns weights to combine answers together
Takes a lot of processing power and input data, but great results
Can learn about data based on which predictors were selected
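A toy "forest of stumps" shows the voting idea: hundreds of depth-1 trees, each trained on a bootstrap subset, vote on the final label. A real random forest (e.g. scikit-learn's `RandomForestClassifier`) grows deeper trees; data and thresholds here are illustrative.

```python
# Mini random forest: bootstrap-trained decision stumps + majority vote.
import numpy as np

rng = np.random.default_rng(8)
X = rng.standard_normal((400, 5))
y = (X[:, 2] > 0.3).astype(int)            # truth depends on one feature

def train_stump(Xs, ys):
    best = None
    for j in range(Xs.shape[1]):           # try each feature...
        for t in np.quantile(Xs[:, j], [0.25, 0.5, 0.75]):  # ...and threshold
            acc = ((Xs[:, j] > t).astype(int) == ys).mean()
            if best is None or acc > best[0]:
                best = (acc, j, t)
    return best[1], best[2]                # (feature, threshold)

stumps = []
for _ in range(200):                       # bootstrap + train each tree
    idx = rng.integers(0, len(X), len(X))
    stumps.append(train_stump(X[idx], y[idx]))

votes = np.mean([(X[:, j] > t).astype(int) for j, t in stumps], axis=0)
pred = (votes > 0.5).astype(int)
acc = (pred == y).mean()
print("forest training accuracy:", acc)
```

Each stump is weak on its own, but the averaged vote is stable, which is how the ensemble overcomes single-tree overfitting.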
25. Algorithms that work with or without labels
or that are interactive and can generate them
Genetic Algorithm: “Iteratively optimize my (gene) model”
Define a “gene” of all parameters you want GA to optimize
Define goal metric(s) GA should try to max/minimize
GA’s handle mixed input, arbitrary goal metrics
Not really learning, but useful in similar situations. Slow.
Active Learning: “What should I have labeled to help?”
Initialize system with supervised or unsupervised method
Have system predict and display to user
User corrects errors and addresses confusing examples
Iterate between prediction and feedback until results look good
Can easily over-fit, so hold-out tests are important here
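A minimal genetic-algorithm loop, to make the "gene + goal metric" vocabulary concrete: the gene is a parameter vector, the fitness is a simple quadratic to minimize, and each generation keeps the fittest parents and mutates them. Real GAs add crossover and mixed gene types; everything here is a synthetic sketch.

```python
# Minimal GA: selection + mutation on a population of parameter "genes".
import numpy as np

rng = np.random.default_rng(9)
target = np.array([1.5, -2.0, 0.5])              # unknown optimum
fitness = lambda g: -np.sum((g - target) ** 2)   # maximize = minimize error

pop = rng.standard_normal((50, 3))               # initial gene population
for generation in range(200):
    scores = np.array([fitness(g) for g in pop])
    parents = pop[np.argsort(scores)[-10:]]      # keep the 10 fittest
    children = (parents[rng.integers(0, 10, 40)]
                + 0.1 * rng.standard_normal((40, 3)))  # mutate copies
    pop = np.vstack([parents, children])         # next generation

best = pop[np.argmax([fitness(g) for g in pop])]
print("best gene:", np.round(best, 2))
```

As the slide says, this is optimization rather than learning, and it is slow: every generation re-evaluates the metric over the whole population.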
26. Some concepts to save you from harm
Over-Fitting
• Samples not >> degrees of freedom
• Gets you great results… on your training set
• Can’t generalize! Predictor fails on new data.
• Use cross-validation and/or a simpler model
Train / Test Split
• Train on 20% of data, test on 80%? Vice versa? 50/50?
• Depends on data volume and algorithm need
• Data structure also determines: how heterogeneous
Cross-Validation
• Automated way to explore all possible train/test splits
• Second level withholds data from both test & train
• Takes lots of data and time
• Actually tests generalization
Label Imbalance
• If 5% of your labels are “yes” and 95% are “no”…
• Just guess no all the time, and you’re right 95%!
• This can imbalance some training algorithms
Normalization
• Should you normalize all inputs between 0-1?
• Perhaps they should have mean 0, STD=1 instead?
• Not if it’s a spectrum where relative intensity matters
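A hand-rolled k-fold cross-validation sketch ties the split and generalization ideas together: every sample is held out exactly once. The classifier is a deliberately trivial nearest-class-mean rule on synthetic data, just to keep the sketch short.

```python
# 5-fold cross-validation around a trivial nearest-mean classifier.
import numpy as np

rng = np.random.default_rng(10)
X = np.r_[rng.standard_normal((60, 2)), rng.standard_normal((60, 2)) + 3.0]
y = np.r_[np.zeros(60, int), np.ones(60, int)]

idx = rng.permutation(len(X))                 # shuffle before splitting
folds = np.array_split(idx, 5)                # 5 roughly equal folds
accs = []
for i in range(5):
    test = folds[i]
    train = np.concatenate([folds[j] for j in range(5) if j != i])
    means = [X[train][y[train] == c].mean(axis=0) for c in (0, 1)]
    pred = np.argmin([np.linalg.norm(X[test] - m, axis=1) for m in means],
                     axis=0)                  # nearest class mean
    accs.append((pred == y[test]).mean())

print("per-fold accuracy:", np.round(accs, 2), " mean:", np.mean(accs).round(2))
```

Spread across the folds, not just the mean, is informative: large fold-to-fold variance is itself a warning about heterogeneous data.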
27. • 4 metrics: true/false positives and negatives (TPOS, TNEG, FPOS, FNEG)
• Application specific
• Make a trade-off curve by running the algorithm with different parameters
• Receiver Operating Characteristic (ROC)
• Just means “How often do you miss?” vs. “How often do you hallucinate?”
• The curve says what your options are. You pick what you can tolerate most.
[ROC plot: True Positives vs. False Positives]
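Tracing a ROC curve is just sweeping a score threshold and recording the true-positive vs. false-positive trade-off, the "miss" vs. "hallucinate" axes above. The positive/negative score distributions here are synthetic Gaussians.

```python
# ROC sweep: each threshold yields one (TPR, FPR) operating point.
import numpy as np

rng = np.random.default_rng(11)
scores = np.r_[rng.normal(0, 1, 500), rng.normal(2, 1, 500)]  # neg, pos
truth = np.r_[np.zeros(500, bool), np.ones(500, bool)]

curve = []
for thresh in (-1.0, 0.0, 1.0, 2.0, 3.0):
    pred = scores > thresh
    tpr = (pred & truth).sum() / truth.sum()       # detection rate
    fpr = (pred & ~truth).sum() / (~truth).sum()   # false-alarm rate
    curve.append((thresh, tpr, fpr))
    print(f"threshold={thresh:+.1f}  TPR={tpr:.2f}  FPR={fpr:.2f}")
```

No point on the curve is "correct": the application decides whether misses or false alarms hurt more.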
29. • Automatically Identifies and Classifies Rocks
Image Data / Classification / Random Forest
Dr. David Thompson
30. [Panels: March 11, 2011 at 0500, 0530, 0600, 0630, 0700 UTC; March 13, 2011 at 1300 UTC]
Before the rupture: nominal states
Rupture initiation and propagation: ~1.5-hour timescale propagation of state changes
Rupture completion: nominal states
Two days later: growth of feature near triple junction (near Tokyo)
• Looks for Earthquake behavior people miss
Time Series Data / Anomaly Detection / Hidden Markov Models
Dr. Robert Granat
31. Full Pancam View
AEGIS Autonomously
Delivers 13F Pancam image
• Notices “interesting” objects while driving/scanning
• Takes higher resolution images for later analysis
• Operational on MER
Real-Time Image Data / Anomaly Detection / Segmentation
Dr. Tara Estlin
32. • Prioritizes incoming soundings on usefulness for further analysis
• Lets retrieval algorithms initially work only on cleanest data
• Will be operational in OCO-2 DAC to meet Level 1 requirements
• Advises scientists on which data to include in their analysis
Real-Time Soundings Data / Prioritization / Genetic Algorithm
Dr. Lukas Mandrake
33. • Automatically recognize, outline, and classify Martian landmarks
• HiRISE database = tens of thousands of huge-resolution images
• How to search for your field of interest?
• What are statistics on various landforms?
Image Database / Anomaly Detection & Classification / Boost + SVM
Dr. Kiri Wagstaff
34. [Figure: a transient signal, and decision boundaries for multistation transient detectors separating Noise, RFI, and Transients in the parameter space (quadratic sum and robust ECDF statistics of x1 and x2)]
• Recognizes brand-new supernovae in a few seconds
Fast, Real-Time Series Data / Anomaly Detection / Random Forests
Supernovae & Pulsars
Dr. Umaa Rebbapragada
35. • Machine Learning is for everyone!
• Relatively simple algorithms lying around for use
• Can help researcher understand their data initially
• Can help drill-down into sub-populations
• Can automate monotonous labeling tasks
• Available in
– Python (scikit-learn, Orange), Java (Weka)
– Matlab (Statistics, Neural Net, Fuzzy Logic Toolboxes)
– Most languages (OpenCV)
Or just jot us an email! We love to collaborate.