© 2019 Samasource
Practical Approaches to
Training Data Strategy:
Bias, Legal and Ethical
Considerations
Audrey Jill Boguchwal
Samasource
May 2019
Training Data is the Soul of AI
Training data lays the
groundwork for model
performance.
– IBM, Microsoft, MIT CSAIL
Computer vision training data
may include: images, video,
lidar, radar and other sensor
data.
AI Development and Adoption Challenges
Training data presents the majority of challenges that can limit AI
development.
• Obtaining data sets
• Labeling training data
• Bias in training data, bias in algorithms, and bias in models
• Explaining why a decision was reached by an algorithm
• Carrying learnings from one algorithm model to another
“Notes from the AI frontier: Applications and value of deep learning,” McKinsey
Presentation Outline: Training Data Bias and Sourcing
Strategies to avoid data bias and obtain data ethically and legally.
• Common types of bias
• How unintended bias can creep into datasets
• Impact of biased training data
• Strategies to avoid many types of bias
• How to test for bias
• Legal and ethical data sourcing considerations, with real-world
examples and impact of problems
• Best practices to avoid and mitigate sourcing issues
Common Types
of Unintended
Training Data Bias
Sample Bias
Data is unrepresentative
of reality.
Example:
Data set has too few
examples of people with
darker skin tones.
Stock image example, not a real dataset
Historical Bias
Data reflects a prejudice or
stereotype that we do not
want to project into the future.
Example:
Data set has many images of
women in kitchens and men
in offices; but few of the
reverse.
Stock image example, not a real dataset
Measurement Bias
Systemic value distortion
from a problem with the
device capturing data.
Example:
Image data came from
one camera only, with an
overexposure problem.
Stock image example, not from a real dataset
How Unintended Bias
Can Creep into Datasets
Dataset Bias
Datasets used in training
have similar images and lack
diversity.
Example:
Car images from 5 datasets
have similar qualities
within each set.
From “An Unbiased Look at Dataset Bias,” citation in Resources.
Selection and Capture Biases
Selection:
Keyword search returns
similar images.
Capture:
Objects photographed in
similar ways that do not
generalize.
Example:
Google Image results for
“sunglasses” too similar.
Google Image search results for “Sunglasses,” all photographed in a similar way
Class Imbalance
Too few or too many
examples of a class.
Example:
Dataset for a dog
classifier has too many
German Shepherds and
no other dogs.
From Stanford Dogs Dataset.
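A class imbalance like the one in this example can be surfaced with a simple count before training. The sketch below is a minimal illustration, not a tool from the talk: the function name `class_balance`, the breed labels, and the counts are all hypothetical.

```python
from collections import Counter

def class_balance(labels, expected_classes):
    """Count examples per expected class and summarize the imbalance.

    Returns (per_class_counts, missing_classes, ratio), where ratio is the
    largest class size divided by the smallest non-empty class size.
    """
    counts = Counter(labels)
    per_class = {c: counts.get(c, 0) for c in expected_classes}
    missing = [c for c, n in per_class.items() if n == 0]
    present = [n for n in per_class.values() if n > 0]
    ratio = max(present) / min(present) if present else float("nan")
    return per_class, missing, ratio

# Hypothetical labels for a dog classifier (invented counts).
labels = ["german_shepherd"] * 90 + ["beagle"] * 6
expected = ["german_shepherd", "beagle", "poodle"]

per_class, missing, ratio = class_balance(labels, expected)
print(per_class)
print("classes with no examples:", missing)
print("largest/smallest ratio:", ratio)
```

Passing the expected class list explicitly matters here: a class with zero examples never shows up in a plain count, so it would otherwise go unnoticed.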
Negative Set Bias
Data of “the rest of the
world” is not well
represented or balanced.
Example:
Features that classify
“woman” are not on the
person, but in the
environment.
Stock image examples, not from a real dataset
Impact of Biased
Training Data:
Case Studies
Models Trained on Biased Data Can Be Less Accurate
Models can be
overconfident and not
discriminative.
Models will classify
based on the wrong
features, leading to
misclassifications.
Example:
Classifier uses scene,
not person, to identify
gender of person.
From “Men Also Like Shopping,” citation in Resources.
Biased Data Has Ethical, Legal, and Safety Implications
• Inability to detect presence, identity and/or correct gender
expression of people with darker skin tones
  • Causes problems for facial recognition used in identification, surveillance,
  and law enforcement – “Gender Shades”
  • Lack of visibility as seen by autonomous vehicles (potentially) – “Predictive
  Inequity in Object Detection”
• Perpetuating historical, negative stereotypes across race & gender
  • Stereotype: women belong in the kitchen, men in the office – “Men Also Like
  Shopping”
  • Google Photos wrongly labeled a black person as a gorilla – As posted on
  Twitter, discussed in popular press
Case: AVs More Likely to Hit People with Darker Skin?
A study tested whether object
detection systems, like
those used in self-driving
cars, detect pedestrians
of all skin tones equally
well, and if not, why.
Results indicate
detection accuracy is 5%
higher for lighter skin,
but many unaccounted
variables remain.
Stock image example, not from a real dataset
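A gap like the one in this case can be measured by scoring a detector separately for each group. The sketch below is only an illustration under stated assumptions: the records and group names are invented, and it uses a plain detection rate, which may differ from the metric in the cited study.

```python
def detection_rate_by_group(records):
    """records: iterable of (group, detected) pairs.

    Returns {group: fraction of that group's pedestrians that were detected}.
    """
    totals, hits = {}, {}
    for group, detected in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + (1 if detected else 0)
    return {g: hits[g] / totals[g] for g in totals}

# Invented evaluation results, NOT data from the study.
records = (
    [("lighter", True)] * 90 + [("lighter", False)] * 10
    + [("darker", True)] * 85 + [("darker", False)] * 15
)

rates = detection_rate_by_group(records)
gap = rates["lighter"] - rates["darker"]
print(rates)
print(f"gap in detection rate: {gap:.2f}")
```

A gap on its own does not explain the cause. Lighting, occlusion, and time of day would need to be held constant before attributing it to skin tone.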
Is All Training Data Bias Undesirable?
Unintended bias in data is undesirable.
• All datasets are biased because they are not the full visual world
• If data accurately represents reality and reality has a statistical bias,
then the data should share that bias
• Goals: understand, mitigate and manage bias
Strategies to Avoid
Training Data Bias
Strategies to Avoid the Effects of Training Data Bias
Offset dataset bias and capture bias by preprocessing data.
• For object classifiers, if images look similar, consider
transformations such as flipping or automatic cropping to add variety
Avoid negative set bias by varying data.
• Collect data that contains background scenes in addition to objects
of interest
Avoid selection bias by varying search terms and data sources.
• Vary keywords and search engines to retrieve different kinds of images
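The flip and crop transformations mentioned above can be sketched in a few lines. This is a minimal illustration that treats an image as a list of pixel rows; the function names are hypothetical, and a real pipeline would more likely use an image library.

```python
import random

def horizontal_flip(image):
    """Mirror an image (a list of pixel rows) left to right."""
    return [row[::-1] for row in image]

def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window from a random position in the image."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]

# Tiny 3x3 stand-in "image" with made-up pixel values.
image = [[1, 2, 3],
         [4, 5, 6],
         [7, 8, 9]]

flipped = horizontal_flip(image)
crop = random_crop(image, 2, 2, random.Random(0))  # seeded for repeatability
print(flipped)
print(crop)
```

Transformations only vary how existing images look. They do not add subjects or scenes that were never collected, so they complement varied sourcing rather than replace it.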
Ensure Reality is Always Represented in the Data
Avoid sample bias by sourcing and selecting training data with the end
training goal in mind.
• Ensure many diverse examples of all classes and edge cases
• Example: When classifying pedestrians, source city street data showing
people from all demographics. Highway data with few people isn’t a fit.
Avoid historical bias and measurement bias with diverse sources.
• Have multiple, diverse, varied data sources from many devices
• Example: Use more than one training set, especially if it’s a stock set
• Refresh data and retrain several times a year as the world changes
• Example: Refresh data often for a clothing classifier to keep up with fashion
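One way to check the sample-bias advice above is to compare each group's share of the dataset with the share expected where the model will be used. The sketch below is an assumption-laden illustration: the group names, the target shares, and the 0.05 tolerance are all hypothetical and would need to come from the actual training goal.

```python
def representation_gaps(dataset_groups, target_shares):
    """Return {group: observed share in the dataset minus target share}."""
    total = len(dataset_groups)
    gaps = {}
    for group, target in target_shares.items():
        observed = sum(1 for g in dataset_groups if g == group) / total
        gaps[group] = observed - target
    return gaps

def under_represented(gaps, tolerance=0.05):
    """List groups whose share falls short of the target by more than tolerance."""
    return [g for g, gap in gaps.items() if gap < -tolerance]

# Invented group labels for 100 training examples.
dataset_groups = ["group_a"] * 80 + ["group_b"] * 15 + ["group_c"] * 5
# Hypothetical shares expected in deployment.
target_shares = {"group_a": 0.5, "group_b": 0.3, "group_c": 0.2}

gaps = representation_gaps(dataset_groups, target_shares)
print(gaps)
print("under-represented:", under_represented(gaps))
```

Rerunning the same check each time the data is refreshed gives a simple record of whether the dataset is drifting away from the target.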
Case: “Gender Shades” on Facial Dataset Diversity
Joy Buolamwini, real and average faces to test and train facial recognition.
Tests to Detect Dataset Bias
Dataset Test: Name that Dataset
If the test classifier can identify
the source dataset, there may
be dataset bias.
Example:
Given 3 images from 12 popular
datasets, can you match images
with the set?
From “An Unbiased Look at Dataset Bias,”
citation in Resources
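The “Name that Dataset” test can be automated: train a classifier to predict which dataset an image came from, then check whether it beats chance on held-out images. The sketch below is a minimal illustration, not the method of the cited paper. A nearest-centroid classifier stands in for whatever classifier is actually used, and the two-number feature vectors and dataset names are invented.

```python
def centroid(vectors):
    """Element-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(column) / n for column in zip(*vectors)]

def squared_distance(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))

def name_that_dataset_accuracy(train, test):
    """train/test: {dataset_name: [feature_vector, ...]}.

    Fits one centroid per dataset on train, then returns the fraction of
    held-out test vectors assigned to their true source dataset.
    """
    centroids = {name: centroid(vecs) for name, vecs in train.items()}
    correct = total = 0
    for name, vecs in test.items():
        for vec in vecs:
            guess = min(centroids, key=lambda k: squared_distance(vec, centroids[k]))
            correct += int(guess == name)
            total += 1
    return correct / total

# Invented features per image, e.g. (mean brightness, aspect ratio).
train = {"set_a": [[0.8, 1.0], [0.9, 1.1]], "set_b": [[0.2, 1.0], [0.3, 0.9]]}
test = {"set_a": [[0.85, 1.0]], "set_b": [[0.25, 1.0]]}

accuracy = name_that_dataset_accuracy(train, test)
chance = 1 / len(train)
print(f"accuracy {accuracy:.2f} vs chance {chance:.2f}")
```

Accuracy well above chance suggests each dataset carries a recognizable signature, which is the warning sign this slide describes.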
Model Test: Cross Dataset Generalization
Test how well a typical object
detector trained on one
“native” dataset can
generalize when tested on
other, representative sets.
Example:
Can an object detector
trained on LabelMe cars
identify other cars? If not,
that indicates problems with
the LabelMe data.
From “An Unbiased Look at Dataset Bias,” citation in Resources.
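The cross-dataset test comes down to comparing a detector's score on its native dataset with its average score on the others. The sketch below assumes the scores have already been measured. The dataset names and numbers are invented, and `generalization_drop` is a hypothetical helper.

```python
def generalization_drop(scores):
    """scores[train_set][test_set] = detector score, e.g. average precision.

    Returns {train_set: (self_score, mean_other_score, percent_drop)}.
    """
    report = {}
    for train_set, row in scores.items():
        self_score = row[train_set]
        others = [s for test_set, s in row.items() if test_set != train_set]
        mean_others = sum(others) / len(others)
        percent_drop = 100.0 * (self_score - mean_others) / self_score
        report[train_set] = (self_score, mean_others, percent_drop)
    return report

# Invented scores: rows are training sets, columns are test sets.
scores = {
    "set_a": {"set_a": 0.80, "set_b": 0.40, "set_c": 0.60},
    "set_b": {"set_a": 0.50, "set_b": 0.50, "set_c": 0.50},
    "set_c": {"set_a": 0.30, "set_b": 0.30, "set_c": 0.60},
}

for name, (own, others, drop) in generalization_drop(scores).items():
    print(f"{name}: own={own:.2f} others={others:.2f} drop={drop:.1f}%")
```

A large percentage drop for one training set points to data that does not generalize. A drop near zero, as for the hypothetical `set_b`, suggests the set is more representative.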
Model Test: Negative Bias
Test that a model is using the right
features from the data to define
objects, and evaluate whether
background data is representative.
Example:
Test a model’s classification of “not
car” using “not car” examples from
other datasets it hasn’t been trained
on.
Stock image example, not from a real dataset
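The “not car” test above can be expressed as a false positive rate measured on negatives the model has never seen. The sketch below is purely illustrative. `predict_car` is a hypothetical stand-in for a trained model that has latched onto the background rather than the object, and the samples are invented.

```python
def false_positive_rate(predict, negatives):
    """Fraction of known negative samples that the model labels positive."""
    wrong = sum(1 for sample in negatives if predict(sample))
    return wrong / len(negatives)

def predict_car(sample):
    # Hypothetical stand-in model: it says "car" whenever it sees a road,
    # which is the kind of shortcut this test is meant to expose.
    return sample["background"] == "road"

# "Not car" examples from the model's own training distribution (invented).
own_negatives = [{"background": b} for b in ["sky", "forest", "beach", "field"]]
# "Not car" examples from other datasets (invented): empty roads included.
external_negatives = [{"background": b} for b in ["road", "sky", "road", "forest"]]

print("own negatives FPR:", false_positive_rate(predict_car, own_negatives))
print("external negatives FPR:", false_positive_rate(predict_car, external_negatives))
```

A much higher rate on external negatives than on the model's own negatives suggests the original negative set did not cover enough of “the rest of the world.”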
Legal and Ethical
Data Sourcing
Considerations
Check Local Privacy and Property Laws, Consult Experts
Governments move
more slowly than technology.
Laws can change.
Example:
IBM’s “Diversity in Faces”
used public Flickr photos
without explicit consent.
That use may not be legal in the
future, which could discredit the
dataset. That would be a shame.
Case: Compliant Facial Data Sourcing in East Africa
A tech company was legally
sourcing diverse facial
images from East Africa,
complete with consent
forms.
After collecting the data, it
realized that Kenyan privacy
laws were more rigid.
It used Uganda-sourced
data only, instead of
risking legal action in
Kenya.
Best Practices: Acquire Data Ethically and Legally
• Know the legal definition of data consent in the collection location
• If scraping (legally), consider images of celebrities who are already in
the public eye
• Data from private citizens, even if legal, is more likely to cause controversy
• Buy data from accredited sources that own and manage image rights
and know how to do business, such as Getty
• It may cost more, but it might save you legal fees and embarrassing headlines
• Document and credit sources
• Understand EU’s GDPR & other major laws
Best Practices: Evaluate Methodology for Ethics
Use fast.ai’s “Data Checklist” to help make fewer ethical mistakes:
• Have we tested our training data to ensure that it is fair and
representative?
• Have we studied and understood the possible sources of bias in
our data?
• Does our team reflect diversity of opinions, background and all
kinds of thought [enabling us to see and catch more bias]?
• What kinds of user consent do we need to collect or use the data?
• Do we have a mechanism for gathering consent from users?
• Have we clearly explained what users are consenting to?
Best Practices: Understand What Bias Truly Means
• Humans are inherently biased; eliminating all forms of bias is
impossible
• Understand cognitive bias, limitations and decision making (your
algorithm makes decisions)
• Challenge and test assumptions: weigh evidence, don’t jump to
conclusions
• Constantly, rigorously examine bias:
• Your own biases
• Biases of those providing data/information
Key Takeaways to Avoid Bias and Source Properly
• Clearly articulate your end training goal and know what data is
needed to get to it
• Map out the ways bias can enter data, and proactively source data to
avoid it
• Ensure data represents reality for your training goal in quantity
and diversity, replenish data often
• Test data before and after training on a wide range of data
• Be aware of ethics and laws, both current and potential
• Always get proper consent for data, even for public data
25% of the Fortune 50 trust Samasource
to Solve Their Training Data Challenges
Over one billion points annotated in 2018.
We’ve helped lift 50,000 people out of
poverty.
Meet
Samasource at
booth #621
Resources
Whose lives matter to self-driving cars?
https://www.consumeraffairs.com/amp/news/whose-lives-matter-to-self-driving-cars-043019.html
16 Things You Can Do to Make Tech More Ethical, part 1 (checklist for data projects)
https://www.fast.ai/2019/04/22/ethics-action-1/
When it comes to Gorillas, Google Photos Remains Blind
https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/
General Resources
MIT Tech Review: AI Bias
https://www.technologyreview.com/s/612876/this-is-how-ai-bias-really-happensand-why-its-so-hard-to-fix/
Challenges with AI
https://www.mckinsey.com/featured-insights/artificial-intelligence/notes-from-the-ai-frontier-applications-and-value-of-deep-learning
Stanford Dogs Dataset
http://vision.stanford.edu/aditya86/ImageNetDogs/
An Unbiased Look at Dataset Bias
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.208.2314&rep=rep1&type=pdf
Predictive Inequity in Object Detection
https://arxiv.org/pdf/1902.11097.pdf
Undoing the Damage of Dataset Bias
http://people.csail.mit.edu/khosla/papers/eccv2012_khosla.pdf
About Samasource
https://www.samasource.com
Papers & Studies Referenced
Men Also Like Shopping
https://arxiv.org/abs/1707.09457
Impact of Biases in Big Data
https://arxiv.org/pdf/1803.00897.pdf
Gender Shades & Update
http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
http://www.aies-conference.com/wp-content/uploads/2019/01/AIES-19_paper_223.pdf


Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024The Digital Insurer
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)wesley chun
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesBoston Institute of Analytics
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodJuan lago vázquez
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...apidays
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024Rafal Los
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUK Journal
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...apidays
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyKhushali Kathiriya
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...apidays
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Drew Madelung
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businesspanagenda
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAndrey Devyatkin
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationRadu Cotescu
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerThousandEyes
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...DianaGray10
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CVKhem
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024SynarionITSolutions
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?Igalia
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024The Digital Insurer
 

Último (20)

Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024Manulife - Insurer Innovation Award 2024
Manulife - Insurer Innovation Award 2024
 
Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)Powerful Google developer tools for immediate impact! (2023-24 C)
Powerful Google developer tools for immediate impact! (2023-24 C)
 
HTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation StrategiesHTML Injection Attacks: Impact and Mitigation Strategies
HTML Injection Attacks: Impact and Mitigation Strategies
 
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin WoodPolkadot JAM Slides - Token2049 - By Dr. Gavin Wood
Polkadot JAM Slides - Token2049 - By Dr. Gavin Wood
 
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
Apidays New York 2024 - The Good, the Bad and the Governed by David O'Neill, ...
 
The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024The 7 Things I Know About Cyber Security After 25 Years | April 2024
The 7 Things I Know About Cyber Security After 25 Years | April 2024
 
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdfUnderstanding Discord NSFW Servers A Guide for Responsible Users.pdf
Understanding Discord NSFW Servers A Guide for Responsible Users.pdf
 
Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...Apidays New York 2024 - The value of a flexible API Management solution for O...
Apidays New York 2024 - The value of a flexible API Management solution for O...
 
Artificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : UncertaintyArtificial Intelligence Chap.5 : Uncertainty
Artificial Intelligence Chap.5 : Uncertainty
 
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
Apidays Singapore 2024 - Building Digital Trust in a Digital Economy by Veron...
 
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
Strategies for Unlocking Knowledge Management in Microsoft 365 in the Copilot...
 
Why Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire businessWhy Teams call analytics are critical to your entire business
Why Teams call analytics are critical to your entire business
 
AWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of TerraformAWS Community Day CPH - Three problems of Terraform
AWS Community Day CPH - Three problems of Terraform
 
Scaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organizationScaling API-first – The story of a global engineering organization
Scaling API-first – The story of a global engineering organization
 
How to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected WorkerHow to Troubleshoot Apps for the Modern Connected Worker
How to Troubleshoot Apps for the Modern Connected Worker
 
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
Connector Corner: Accelerate revenue generation using UiPath API-centric busi...
 
Real Time Object Detection Using Open CV
Real Time Object Detection Using Open CVReal Time Object Detection Using Open CV
Real Time Object Detection Using Open CV
 
Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024Top 10 Most Downloaded Games on Play Store in 2024
Top 10 Most Downloaded Games on Play Store in 2024
 
A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?A Year of the Servo Reboot: Where Are We Now?
A Year of the Servo Reboot: Where Are We Now?
 
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
Bajaj Allianz Life Insurance Company - Insurer Innovation Award 2024
 

"Practical Approaches to Training Data Strategy: Bias, Legal and Ethical Considerations," a Presentation from Samasource

  • 1. © 2019 Samasource Practical Approaches to Training Data Strategy: Bias, Legal and Ethical Considerations Audrey Jill Boguchwal Samasource May 2019
  • 2. Training Data is the Soul of AI: Training data lays the groundwork for model performance (IBM, Microsoft, MIT CSAIL). Computer vision training data may include images, video, lidar, radar and other sensor data.
  • 3. AI Development and Adoption Challenges: Training data presents the majority of challenges that can limit AI development.
      • Obtaining data sets
      • Labeling training data
      • Bias in training data, bias in algorithms, and bias in models
      • Explaining why a decision was reached by an algorithm
      • Carrying learnings from one algorithm model to another
    Source: “Notes from the AI frontier: Applications and value of deep learning,” McKinsey
  • 4. Presentation Outline: Training Data Bias and Sourcing. Strategies to avoid data bias and obtain data ethically and legally.
      • Common types of bias
      • How unintended bias can creep into datasets
      • Impact of biased training data
      • Strategies to avoid many types of bias
      • How to test for bias
      • Legal and ethical data sourcing considerations, with real-world examples and impact of problems
      • Best practices to avoid and mitigate sourcing issues
  • 5. Common Types of Unintended Training Data Bias
  • 6. Sample Bias: Data is unrepresentative of reality. Example: The data set has too few examples of people with darker skin tones. (Stock image example, not a real dataset.)
  • 7. Historical Bias: Data reflects a prejudice or stereotype that we do not want to project into the future. Example: The data set has many images of women in kitchens and men in offices, but few of the reverse. (Stock image example, not a real dataset.)
  • 8. Measurement Bias: Systematic value distortion from a problem with the device capturing data. Example: Image data came from one camera only, with an overexposure problem. (Stock image example, not from a real dataset.)
  • 9. How Unintended Bias Can Creep into Datasets
  • 10. Dataset Bias: Datasets used in training have similar images and lack diversity. Example: Car images from 5 datasets have similar qualities within each set. (From “An Unbiased Look at Dataset Bias,” citation in Resources.)
  • 11. Selection and Capture Biases: Selection: A keyword search returns similar images. Capture: Objects are photographed in similar ways that do not generalize. Example: Google Image search results for “sunglasses” are too similar, all photographed in a similar way.
  • 12. Class Imbalance: Too few or too many examples of a class. Example: A dataset for a dog classifier has too many German Shepherds and no other dogs. (From Stanford Dogs Dataset.)
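A class imbalance like the one on slide 12 can be caught with a simple count before training. The following is a minimal sketch, not part of the presentation: the function name `class_balance`, the `expected_classes` argument and the `max_ratio` threshold of 10 are all assumptions chosen for illustration, and the label counts are made up.

```python
from collections import Counter


def class_balance(labels, expected_classes=(), max_ratio=10.0):
    """Count examples per class and flag classes that look under-represented.

    A class is flagged when it has no examples at all, or when the largest
    class has more than max_ratio times as many examples (max_ratio is a
    hypothetical threshold, tune it for your own task).
    """
    counts = Counter(labels)
    # Classes we expect but never saw would otherwise be invisible.
    for name in expected_classes:
        counts.setdefault(name, 0)
    if not counts:
        return counts, []
    largest = max(counts.values())
    flagged = sorted(
        name for name, n in counts.items()
        if n == 0 or largest / n > max_ratio
    )
    return counts, flagged


# Made-up label list for a dog classifier, for illustration only.
labels = ["german_shepherd"] * 950 + ["beagle"] * 40 + ["poodle"] * 10
counts, flagged = class_balance(labels, expected_classes=["husky"])
print(counts.most_common())
print("under-represented:", flagged)
```

Passing the classes you expect to see matters here: a breed with zero images never shows up in a plain count, which is exactly the "no other dogs" case.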
  • 13. Negative Set Bias: Data of “the rest of the world” is not well represented or balanced. Example: The features that classify “woman” are not on the person, but in the environment. (Stock image examples, not from a real dataset.)
  • 14. Impact of Biased Training Data: Case Studies
  • 15. Models Trained on Biased Data Can Be Less Accurate: Models can be overconfident and not discriminative. Models will classify based on the wrong features, leading to misclassifications. Example: The classifier uses the scene, not the person, to identify the gender of the person. (From “Men Also Like Shopping,” citation in Resources.)
  • 16. Biased Data has Ethical, Legal, and Safety Implications
      • Inability to detect presence, identity and/or correct gender expression of people with darker skin tones
          • Causes problems for facial recognition used in identification, surveillance, and law enforcement (“Gender Shades”)
          • Lack of visibility as seen by autonomous vehicles, potentially (“Predictive Inequity in Object Detection”)
      • Perpetuating historical, negative stereotypes across race & gender
          • Stereotype: women belong in the kitchen, men in the office (“Men Also Like Shopping”)
          • Google Photos wrongly labeled a black person as a gorilla (as posted on Twitter and discussed in the popular press)
  • 17. Case: AVs More Likely to Hit People with Darker Skin? Test data was used to determine whether object detection systems, like those seen in self-driving cars, detect pedestrians of all skin tones equitably, and if not, why. Results indicate detection accuracy is 5% higher for lighter skin, but many unaccounted variables remain. (Stock image example, not from a real dataset.)
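A comparison of this kind comes down to computing a detection rate per group and looking at the gap. The sketch below shows only that arithmetic. It is not the method of the cited study: the group names and the counts are invented for illustration, and a real evaluation would also have to control for the unaccounted variables the slide mentions (for example lighting, occlusion and object size).

```python
def detection_rate_by_group(records):
    """records: iterable of (group, detected) pairs, where detected is a bool.

    Returns a dict mapping each group to the fraction of its samples
    that the detector found.
    """
    totals, hits = {}, {}
    for group, detected in records:
        totals[group] = totals.get(group, 0) + 1
        hits[group] = hits.get(group, 0) + int(detected)
    return {group: hits[group] / totals[group] for group in totals}


# Invented evaluation results, NOT numbers from any paper.
records = (
    [("group_a", True)] * 90 + [("group_a", False)] * 10
    + [("group_b", True)] * 80 + [("group_b", False)] * 20
)
rates = detection_rate_by_group(records)
gap = rates["group_a"] - rates["group_b"]
print(rates)
print(f"gap: {gap:.1%}")
```

A nonzero gap on its own does not say why the detector behaves differently. It is a prompt to check how each group is represented in the training data.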
  • 18. Is All Training Data Bias Undesirable? Unintended bias in data is undesirable.
      • All datasets are biased because they are not the full visual world
      • If data accurately represents reality and reality has a statistical bias, then the data should share that bias
      • Goals: understand, mitigate and manage bias
  • 19. Strategies to Avoid Training Data Bias
  • 20. Strategies to Avoid the Effects of Training Data Bias
      • Offset dataset bias and capture bias by preprocessing data.
          • For object classifiers, if images look similar, consider transformations: flip or automatically crop to vary them
      • Avoid negative set bias by varying data.
          • Collect data that contains background scenes in addition to objects of interest
      • Avoid selection bias by varying search terms and data sources.
          • Vary keywords and search engines to retrieve different kinds of images
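The flip and crop transformations mentioned on slide 20 can be sketched in a few lines. This is an illustrative example only: it treats an image as a plain list of pixel rows so that it runs without any imaging library, and the function names `flip_horizontal` and `random_crop` are assumptions, not names from the presentation. In practice an image library or a training framework's augmentation utilities would do this work.

```python
import random


def flip_horizontal(image):
    """Mirror an image given as a list of rows (each row a list of pixels)."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng=random):
    """Cut a crop_h x crop_w window from a random position in the image."""
    height, width = len(image), len(image[0])
    if crop_h > height or crop_w > width:
        raise ValueError("crop is larger than the image")
    top = rng.randint(0, height - crop_h)    # randint is inclusive on both ends
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


# A 4x4 toy "image" whose pixel values encode their own position.
image = [[r * 4 + c for c in range(4)] for r in range(4)]
print(flip_horizontal(image)[0])  # [3, 2, 1, 0]

rng = random.Random(0)  # seeded so the crop is reproducible
crop = random_crop(image, 2, 3, rng)
print(crop)
```

Transformations like these vary how an object is framed. They do not add new kinds of objects or scenes, so they help with capture bias more than with sample bias.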
  • 21. Ensure Reality is Always Represented in the Data
      • Avoid sample bias by sourcing and selecting training data with the end training goal in mind.
          • Ensure many diverse examples of all classes and edge cases
          • Example: When classifying pedestrians, source city street data showing people from all demographics. Highway data with few people isn’t a fit.
      • Avoid historical bias and measurement bias with diverse sources.
          • Have multiple, diverse, varied data sources from many devices
          • Example: Use more than one training set, especially if it’s a stock set
          • Refresh data and retrain several times a year as the world changes
          • Example: Refresh data often for a clothing classifier to keep up with fashion
  • 22. Case: “Gender Shades” on Facial Dataset Diversity. Joy Buolamwini, real and average faces to test and train facial recognition.
  • 23. Tests to Detect Dataset Bias
  • 24. Dataset Test: Name that Dataset. If the test classifier can identify the source dataset, there may be dataset bias. Example: Given 3 images from 12 popular datasets, can you match images with the set? (From “An Unbiased Look at Dataset Bias,” citation in Resources.)
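The idea behind "Name that Dataset" is to train a classifier whose label is the source dataset rather than the object, then compare its accuracy with chance. The sketch below is a simplified stand-in, not the procedure from the cited paper: it uses a nearest-centroid classifier, and the two-number "feature vectors" are synthetic values generated in the code (one coordinate plays the role of a dataset-specific brightness). The names `name_that_dataset` and `fake_features` are hypothetical.

```python
import random


def centroid(vectors):
    """Component-wise mean of a list of equal-length feature vectors."""
    n = len(vectors)
    return [sum(v[i] for v in vectors) / n for i in range(len(vectors[0]))]


def nearest(centroids, vector):
    """Name of the dataset whose centroid is closest to the vector."""
    def dist2(name):
        return sum((a - b) ** 2 for a, b in zip(centroids[name], vector))
    return min(centroids, key=dist2)


def name_that_dataset(train, test):
    """train/test: dict mapping dataset name -> list of feature vectors.

    Returns the accuracy of guessing each test vector's source dataset.
    """
    centroids = {name: centroid(vecs) for name, vecs in train.items()}
    total = sum(len(vecs) for vecs in test.values())
    correct = sum(
        nearest(centroids, v) == name
        for name, vecs in test.items() for v in vecs
    )
    return correct / total


rng = random.Random(0)


def fake_features(brightness, n):
    # Synthetic features: the first value differs per dataset, the second does not.
    return [[rng.gauss(brightness, 0.05), rng.gauss(0.5, 0.05)] for _ in range(n)]


train = {"set_a": fake_features(0.3, 50), "set_b": fake_features(0.7, 50)}
test = {"set_a": fake_features(0.3, 20), "set_b": fake_features(0.7, 20)}
accuracy = name_that_dataset(train, test)
chance = 1 / len(train)
print(f"source-dataset accuracy {accuracy:.2f} vs chance {chance:.2f}")
```

Accuracy far above chance means each dataset carries a recognizable signature, which is the warning sign the slide describes.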
  • 25. Model Test: Cross Dataset Generalization. Test how well a typical object detector trained on one “native” dataset can generalize when tested on other, representative sets. Example: Can an object detector trained on LabelMe cars identify other cars? If not, that indicates problems with the LabelMe data. (From “An Unbiased Look at Dataset Bias,” citation in Resources.)
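One way to organize this test is as a train-on-one, test-on-all accuracy matrix. The sketch below assumes nothing about the real detector: `fit` and `score` are placeholders you would supply, and the toy threshold "detector" and the two tiny datasets are invented purely so the example runs end to end.

```python
def cross_dataset_matrix(datasets, fit, score):
    """datasets: name -> (train_split, test_split).

    fit(train_split) returns a model; score(model, test_split) returns an
    accuracy in [0, 1]. The result maps native dataset -> tested dataset -> accuracy.
    """
    matrix = {}
    for native, (train_split, _) in datasets.items():
        model = fit(train_split)
        matrix[native] = {
            other: score(model, test_split)
            for other, (_, test_split) in datasets.items()
        }
    return matrix


def generalization_drop(matrix, native):
    """Native accuracy minus the mean accuracy on all other datasets."""
    others = [acc for name, acc in matrix[native].items() if name != native]
    return matrix[native][native] - sum(others) / len(others)


def fit_threshold(train_split):
    """Toy 'detector': a threshold halfway between the two class means."""
    cars = [x for x, label in train_split if label == 1]
    others = [x for x, label in train_split if label == 0]
    return (sum(cars) / len(cars) + sum(others) / len(others)) / 2


def accuracy(threshold, test_split):
    hits = sum(1 for x, label in test_split if int(x > threshold) == label)
    return hits / len(test_split)


# Invented one-feature samples: (feature value, 1 = car / 0 = not car).
datasets = {
    "set_a": ([(0.9, 1), (0.8, 1), (0.2, 0), (0.1, 0)], [(0.85, 1), (0.15, 0)]),
    "set_b": ([(0.45, 1), (0.40, 1), (0.05, 0), (0.0, 0)], [(0.42, 1), (0.02, 0)]),
}
matrix = cross_dataset_matrix(datasets, fit_threshold, accuracy)
for native in matrix:
    print(native, matrix[native], "drop:", generalization_drop(matrix, native))
```

A large drop for one native dataset suggests that a model trained on it has learned that dataset's quirks rather than the object itself.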
  • 26. Model Test: Negative Bias. Test that a model is using the right features from the data to define objects, and evaluate whether the background data is representative. Example: Test a model’s classification of “not car” using “not car” examples from other datasets it hasn’t been trained on. (Stock image example, not from a real dataset.)
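In practice this test can be read as a false-positive comparison: how often does the model say “car” on negatives from its own dataset versus negatives from datasets it has never seen? The sketch below is a hypothetical illustration. The stand-in model `predict_is_car` deliberately keys on the background to show what a failure looks like, and the sample dictionaries are made up.

```python
def false_positive_rate(predict, negatives):
    """Fraction of known 'not car' samples that the model labels as 'car'."""
    if not negatives:
        raise ValueError("need at least one negative sample")
    return sum(1 for sample in negatives if predict(sample)) / len(negatives)


def predict_is_car(sample):
    # Hypothetical flawed model: it has learned "road background means car".
    return sample["background"] == "road"


# Made-up negatives: none of these samples contains a car.
native_negatives = [
    {"background": "field"}, {"background": "indoor"},
    {"background": "field"}, {"background": "sky"},
]
foreign_negatives = [
    {"background": "road"}, {"background": "road"},
    {"background": "field"}, {"background": "indoor"},
]

native_fpr = false_positive_rate(predict_is_car, native_negatives)
foreign_fpr = false_positive_rate(predict_is_car, foreign_negatives)
print(f"native FPR {native_fpr:.2f}, foreign FPR {foreign_fpr:.2f}")
```

A much higher false-positive rate on foreign negatives suggests the native negative set did not cover enough of “the rest of the world.”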
  • 27. Legal and Ethical Data Sourcing Considerations
  • 28. Check Local Privacy and Property Laws, Consult Experts: Governments move slower than technology. Laws can change. Example: IBM’s “Diversity in Faces” used public Flickr photos without explicit consent. This may not be legal in the future and could discredit the dataset, which would be a shame.
  • 29. Case: Compliant Facial Data Sourcing in East Africa. A tech company was legally sourcing diverse facial images from East Africa, complete with consent forms. It realized after collecting that Kenyan privacy laws were more rigid, so it used Uganda-sourced data only, instead of risking legal action in Kenya.
  • 30. Best Practices: Acquire Data Ethically and Legally
      • Know the legal definition of data consent in the collection location
      • If scraping (legally), consider images of celebrities who are already in the public eye
          • Data from private citizens, even if legal, is more likely to cause controversy
      • Buy data from accredited sources that own and manage image rights and know how to do business, such as Getty
          • It may cost more, but it might save you legal fees and embarrassing headlines
      • Document and credit sources
      • Understand the EU’s GDPR and other major laws
  • 31. Best Practices: Evaluate Methodology for Ethics. Use Fast.AI’s “Data Checklist” to work to make fewer ethical mistakes:
      • Have we tested our training data to ensure that it is fair and representative?
      • Have we studied and understood the possible sources of bias in our data?
      • Does our team reflect diversity of opinions, background and all kinds of thought [enabling us to see and catch more bias]?
      • What kinds of user consent do we need to collect or use the data?
      • Do we have a mechanism for gathering consent from users?
      • Have we clearly explained what users are consenting to?
  • 32. Best Practices: Understand What Bias Truly Means
      • Humans are inherently biased; eliminating all forms of bias is impossible
      • Understand cognitive bias, limitations and decision making (your algorithm makes decisions)
      • Challenge and test assumptions: weigh evidence, don’t jump to conclusions
      • Constantly, rigorously examine bias:
          • Your own biases
          • Biases of those providing data/information
  • 33. Key Takeaways to Avoid Bias and Source Properly
      • Clearly articulate your end training goal and know what data is needed to get to it
      • Map out ways bias can enter data, and proactively source data to avoid it
      • Ensure data represents reality for your training goal in quantity and diversity, and replenish data often
      • Test data before and after training on a wide range of data
      • Be aware of ethics and laws, both current and potential
      • Always get proper consent for data, even for public data
  • 34. 25% of the Fortune 50 trust Samasource to Solve Their Training Data Challenges: Over one billion points annotated in 2018. We’ve helped lift 50,000 people out of poverty. Meet Samasource at booth #621.
  • 35. Resources
  • 36. Resources: General Resources
      • MIT Tech Review: AI Bias. https://www.technologyreview.com/s/612876/this-is-how-ai-bias-really-happensand-why-its-so-hard-to-fix/
      • Challenges with AI. https://www.mckinsey.com/featured-insights/artificial-intelligence/notes-from-the-ai-frontier-applications-and-value-of-deep-learning
      • Stanford Dog Dataset. http://vision.stanford.edu/aditya86/ImageNetDogs/
      • Whose lives matter to self-driving cars? https://www.consumeraffairs.com/amp/news/whose-lives-matter-to-self-driving-cars-043019.html
      • 16 Things You Can Do to Make Tech More Ethical, part 1 (checklist for data projects). https://www.fast.ai/2019/04/22/ethics-action-1/
      • When it comes to Gorillas, Google Photos Remains Blind. https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/
  • 37. Resources: Papers & Studies Referenced
      • Unbiased Look at Dataset Bias. http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.208.2314&rep=rep1&type=pdf
      • Predictive Inequity in Object Detection. https://arxiv.org/pdf/1902.11097.pdf
      • Undoing the Damage of Dataset Bias. http://people.csail.mit.edu/khosla/papers/eccv2012_khosla.pdf
      • Men also like Shopping. https://arxiv.org/abs/1707.09457
      • Impact of Biases in Big Data. https://arxiv.org/pdf/1803.00897.pdf
      • Gender Shades & Update. http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf and http://www.aies-conference.com/wp-content/uploads/2019/01/AIES-19_paper_223.pdf
      • About Samasource. https://www.samasource.com