"Practical Approaches to Training Data Strategy: Bias, Legal and Ethical Considerations," a Presentation from Samasource

© 2019 Samasource
Practical Approaches to
Training Data Strategy:
Bias, Legal and Ethical
Considerations
Audrey Jill Boguchwal
Samasource
May 2019

Training Data is the Soul of AI
Training data lays the
groundwork for model
performance.
– IBM, Microsoft, MIT CSAIL
Computer vision training data
may include: images, video,
lidar, radar and other sensor
data.

AI Development and Adoption Challenges
Training data presents the majority of challenges that can limit AI
development.
• Obtaining data sets
• Labeling training data
• Bias in training data, bias in algorithms, and bias in models
• Explaining why a decision was reached by an algorithm
• Carrying learnings from one algorithm model to another
“Notes from the AI frontier: Applications and value of deep learning,” McKinsey

Presentation Outline: Training Data Bias and Sourcing
Strategies to avoid data bias and obtain data ethically and legally.
• Common types of bias
• How unintended bias can creep into datasets
• Impact of biased training data
• Strategies to avoid many types of bias
• How to test for bias
• Legal and ethical data sourcing considerations, with real-world
examples and impact of problems
• Best practices to avoid and mitigate sourcing issues

Common Types
of Unintended
Training Data Bias

Sample Bias
Data is unrepresentative
of reality.
Example:
Data set has too few
examples of people with
darker skin tones.
Stock image example, not a real dataset

Historical Bias
Data reflects a prejudice or
stereotype that we do not
want to project into the future.
Example:
Data set has many images of
women in kitchens and men
in offices, but few of the
reverse.
Stock image example, not a real dataset

Measurement Bias
Systematic value distortion
from a problem with the
device capturing data.
Example:
Image data came from
one camera only, with an
overexposure problem.
Stock image example, not from a real dataset
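One way to look for this kind of capture problem is to compare exposure statistics per device. The sketch below is a minimal illustration, not a method from the presentation: the camera IDs, the pixel values, and the saturation threshold of 250 are all assumptions.

```python
from collections import defaultdict
from statistics import mean


def saturated_fraction(pixels, threshold=250):
    """Share of 8-bit pixel values at or above `threshold` (close to pure white)."""
    return sum(1 for p in pixels if p >= threshold) / len(pixels)


def overexposure_by_camera(images):
    """Average the saturated-pixel share per camera.

    `images` is a list of (camera_id, pixels) pairs, where `pixels` is a
    flat list of 0-255 grayscale values.
    """
    per_camera = defaultdict(list)
    for camera_id, pixels in images:
        per_camera[camera_id].append(saturated_fraction(pixels))
    return {camera_id: mean(shares) for camera_id, shares in per_camera.items()}


# Made-up pixel values for two hypothetical cameras.
images = [
    ("cam_a", [255, 255, 250, 100]),
    ("cam_a", [255, 252, 90, 80]),
    ("cam_b", [120, 130, 90, 60]),
    ("cam_b", [200, 255, 10, 20]),
]
report = overexposure_by_camera(images)
print(report)
```

A camera whose share sits far above the others is worth inspecting before its images are used for training.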

How Unintended Bias
Can Creep into Datasets

Dataset Bias
Datasets used in training
have similar images and lack
diversity.
Example:
Car images from 5 datasets
have similar qualities
within each set.
From “Unbiased Look at Dataset Bias,” citation in Resources.

Selection and Capture Biases
Selection:
Keyword search returns
similar images.
Capture:
Objects photographed in
similar ways that do not
generalize.
Example:
Google Image results for
“sunglasses” are too similar.
Google Image search results for “Sunglasses,” all photographed in a similar way

Class Imbalance
Too few or too many
examples of a class.
Example:
Dataset for a dog
classifier has too many
German Shepherds and
no other dogs.
From Stanford Dogs Dataset.
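Class imbalance is easy to check before training by counting labels. The following is a minimal sketch under stated assumptions: the breed names and counts are made up, and `max_ratio=3.0` is an arbitrary illustrative threshold, not a recommended value.

```python
from collections import Counter


def class_balance_report(labels, expected_classes, max_ratio=3.0):
    """Count examples per expected class and flag under-represented ones.

    A class is flagged when it has no examples at all, or when the largest
    class has more than `max_ratio` times as many examples.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    report = {name: counts[name] for name in expected_classes}
    flagged = [
        name for name, n in report.items()
        if n == 0 or largest / n > max_ratio
    ]
    return report, flagged


# Made-up labels for a hypothetical dog classifier.
labels = ["german_shepherd"] * 90 + ["beagle"] * 6
expected = ["german_shepherd", "beagle", "poodle"]
counts, flagged = class_balance_report(labels, expected)
print(counts)
print(flagged)
```

Passing the expected classes explicitly matters because a class with zero examples never shows up in the label counts on its own.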

Negative Set Bias
Data of “the rest of the
world” is not well
represented or balanced.
Example:
Features that classify
“woman” are not on the
person, but in the
environment.
Stock image examples, not from a real dataset

Impact of Biased
Training Data:
Case Studies

Models Trained on Biased Data Can Be Less Accurate
Models can be
overconfident and not
discriminative.
Models will classify
based on the wrong
features, leading to
misclassifications.
Example:
Classifier uses scene,
not person, to identify
gender of person.
From “Men Also Like Shopping,” citation in Resources.

Biased Data Has Ethical, Legal, and Safety Implications
• Inability to detect presence, identity and/or correct gender
expression of people with darker skin tones
  • Causes problems for facial recognition used in identification, surveillance,
  and law enforcement – “Gender Shades”
  • Lack of visibility as seen by autonomous vehicles (potentially) – “Predictive
  Inequity in Object Detection”
• Perpetuating historical, negative stereotypes across race & gender
  • Stereotype: women belong in the kitchen, men in the office – “Men Also Like
  Shopping”
  • Google Photos wrongly labeled a black person as a gorilla – as posted on
  Twitter, discussed in popular press

Case: AVs More Likely to Hit People with Darker Skin?
Test data was used to determine whether object detection systems,
like those seen in self-driving cars, detect pedestrians of all
skin tones equally well, and if not, why.
Results indicate detection accuracy is 5% higher for lighter skin,
but many unaccounted variables remain.
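A comparison of this kind can be sketched as a per-group detection rate. This is an illustration only: the group names and counts are made up, and a plain detection rate may differ from the metric the cited study reports.

```python
from collections import defaultdict


def detection_rate_by_group(records):
    """Return {group: fraction of annotated pedestrians that were detected}.

    Each record is (group, detected), where `detected` is True when the
    detector found that annotated pedestrian.
    """
    found = defaultdict(int)
    total = defaultdict(int)
    for group, detected in records:
        total[group] += 1
        found[group] += int(detected)
    return {group: found[group] / total[group] for group in total}


# Made-up evaluation records; the groups and counts are illustrative only.
records = (
    [("lighter", True)] * 18 + [("lighter", False)] * 2
    + [("darker", True)] * 16 + [("darker", False)] * 4
)
rates = detection_rate_by_group(records)
gap = max(rates.values()) - min(rates.values())
print(rates, f"gap={gap:.2f}")
```

A gap found this way shows that a disparity exists, not why. Variables such as time of day or occlusion would need to be controlled before drawing conclusions.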

Is All Training Data Bias Undesirable?
Unintended bias in data is undesirable.
• All datasets are biased because they are not the full visual world
• If data accurately represents reality and reality has a statistical bias,
then the data should share that bias
• Goals: understand, mitigate and manage bias

Strategies to Avoid
Training Data Bias

Strategies to Avoid the Effects of Training Data Bias
Offset dataset bias and capture bias by preprocessing data.
• For object classifiers, if images look similar, consider
transformations: flip or automatically crop images to vary them
Avoid negative set bias by varying data.
• Collect data that contains background scenes in addition to objects
of interest
Avoid selection bias by varying search terms and data sources.
• Vary keywords, search engines to retrieve different kinds of images
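The flip and crop transformations mentioned above can be sketched in a few lines. This version uses plain nested lists to stay dependency-free. The tiny image, the crop size, and the 50% flip probability are assumptions for illustration, and a real pipeline would more likely use an image library.

```python
import random


def horizontal_flip(image):
    """Mirror an image left to right. `image` is a list of pixel rows."""
    return [row[::-1] for row in image]


def random_crop(image, crop_h, crop_w, rng):
    """Cut a crop_h x crop_w window from a random position in the image."""
    height, width = len(image), len(image[0])
    top = rng.randint(0, height - crop_h)   # randint includes both endpoints
    left = rng.randint(0, width - crop_w)
    return [row[left:left + crop_w] for row in image[top:top + crop_h]]


def augment(image, crop_h, crop_w, rng):
    """Randomly flip, then randomly crop, to vary how the object is framed."""
    if rng.random() < 0.5:
        image = horizontal_flip(image)
    return random_crop(image, crop_h, crop_w, rng)


# Tiny 3x4 grayscale "image" with made-up pixel values.
image = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
]
rng = random.Random(0)  # fixed seed so the sketch is repeatable
variants = [augment(image, 2, 3, rng) for _ in range(4)]
for variant in variants:
    print(variant)
```

Both helpers return new lists, so the original image is left untouched and can be augmented repeatedly.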

Ensure Reality is Always Represented in the Data
Avoid sample bias by sourcing and selecting training data with the end
training goal in mind.
• Ensure many diverse examples of all classes and edge cases
• Example: When classifying pedestrians, source city street data showing
people from all demographics. Highway data with few people isn’t a fit.
Avoid historical bias and measurement bias with diverse sources.
• Have multiple, diverse, varied data sources from many devices
• Example: Use more than one training set, especially if it’s a stock set
• Refresh data and retrain several times a year as the world changes
• Example: Refresh data often for a clothing classifier to keep up with fashion
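Whether data "represents reality" for a given training goal can be checked roughly by comparing each group's share of the data with the share expected in deployment. The sketch below assumes made-up scene labels, made-up target shares, and an arbitrary 5-point tolerance.

```python
from collections import Counter


def representation_gaps(samples, target_shares, tolerance=0.05):
    """Return groups whose share of the data differs from the target share.

    `samples` holds one group label per training example. `target_shares`
    maps each group to the share expected where the model will be used.
    """
    counts = Counter(samples)
    total = len(samples)
    gaps = {}
    for group, target in target_shares.items():
        actual = counts[group] / total
        if abs(actual - target) > tolerance:
            gaps[group] = (actual, target)
    return gaps


# Made-up scene labels and targets for a hypothetical pedestrian classifier.
samples = ["highway"] * 70 + ["city_street"] * 20 + ["suburban"] * 10
target_shares = {"city_street": 0.6, "highway": 0.3, "suburban": 0.1}

gaps = representation_gaps(samples, target_shares)
for group, (actual, target) in gaps.items():
    print(f"{group}: {actual:.0%} of data, target {target:.0%}")
```

Rerunning a check like this after each data refresh gives an early signal when the collected data drifts away from the intended use.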

Case: “Gender Shades” on Facial Dataset Diversity
Joy Buolamwini, real and average faces to test and train facial recognition.

Tests to Detect Dataset Bias

Dataset Test: Name that Dataset
If the test classifier can identify
the source dataset, there may
be dataset bias.
Example:
Given 3 images from 12 popular
datasets, can you match images
with the set?
From “Unbiased Look at Dataset Bias,” citation in Resources.
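The same test can be automated: train a classifier to guess which dataset an example came from, then compare its accuracy with chance. The sketch below is an assumption-heavy stand-in. The features are synthetic, the two dataset names are hypothetical, and the nearest-centroid rule is chosen only to keep the example dependency-free. A real test would use features from real images.

```python
import random


def centroid(vectors):
    """Average a list of equal-length feature vectors."""
    return [sum(column) / len(vectors) for column in zip(*vectors)]


def nearest_centroid(vector, centroids):
    """Return the dataset name whose centroid is closest to the vector."""
    def squared_distance(name):
        return sum((a - b) ** 2 for a, b in zip(vector, centroids[name]))
    return min(centroids, key=squared_distance)


def name_that_dataset_accuracy(train, test):
    """Fit on `train`, then guess the source dataset of each `test` vector.

    Both arguments map a dataset name to a list of feature vectors.
    """
    centroids = {name: centroid(vectors) for name, vectors in train.items()}
    hits = [
        nearest_centroid(vector, centroids) == name
        for name, vectors in test.items()
        for vector in vectors
    ]
    return sum(hits) / len(hits)


def fake_features(rng, brightness, n):
    """Synthetic 2-D features; `brightness` stands in for a dataset's 'look'."""
    return [[rng.gauss(brightness, 0.05), rng.gauss(0.5, 0.05)] for _ in range(n)]


rng = random.Random(0)
# Hypothetical datasets "A" and "B" that differ only in overall brightness.
train = {"A": fake_features(rng, 0.3, 50), "B": fake_features(rng, 0.7, 50)}
test = {"A": fake_features(rng, 0.3, 20), "B": fake_features(rng, 0.7, 20)}

accuracy = name_that_dataset_accuracy(train, test)
chance = 1 / len(train)
print(f"accuracy={accuracy:.2f}, chance={chance:.2f}")
```

Accuracy well above chance suggests each dataset has a recognizable signature, which is the warning sign this test looks for.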

Model Test: Cross-Dataset Generalization
Test how well a typical object
detector trained on one
“native” dataset can
generalize when tested on
other, representative sets.
Example:
Can an object detector
trained on LabelMe cars
identify other cars? If not,
that indicates problems with
LabelMe data.
From “Unbiased Look at Dataset Bias,” citation in Resources.
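Results from this test are often easiest to read as a table of scores indexed by training set and test set. The sketch below summarizes such a table. The dataset names and every number in it are placeholders, not measured results.

```python
def generalization_drop(scores):
    """Summarize a cross-dataset score table.

    `scores[train][test]` is a detector's score when trained on `train`
    and evaluated on `test`. For each training set, compare the score on
    its own test split with the mean score on every other dataset.
    """
    report = {}
    for train_name, row in scores.items():
        self_score = row[train_name]
        others = [s for test_name, s in row.items() if test_name != train_name]
        mean_others = sum(others) / len(others)
        percent_drop = 100 * (self_score - mean_others) / self_score
        report[train_name] = (self_score, mean_others, percent_drop)
    return report


# Placeholder numbers for three hypothetical car datasets.
scores = {
    "set_a": {"set_a": 0.80, "set_b": 0.40, "set_c": 0.40},
    "set_b": {"set_a": 0.50, "set_b": 0.60, "set_c": 0.40},
    "set_c": {"set_a": 0.45, "set_b": 0.45, "set_c": 0.50},
}
summary = generalization_drop(scores)
for name, (self_score, mean_others, drop) in summary.items():
    print(f"{name}: self={self_score:.2f} others={mean_others:.2f} drop={drop:.0f}%")
```

A large drop for one training set suggests that a model trained on it has learned that set's quirks and not the object itself.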

Model Test: Negative Set Bias
Test that a model is using the right
features from the data to define
objects, and evaluate whether
background data is representative.
Example:
Test a model’s classification of “not
car” using “not car” examples from
other datasets it hasn’t been trained
on.
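One way to run this check is to compare the false positive rate on the model's own "not car" examples with the rate on "not car" examples from elsewhere. In the sketch below, `toy_model` is a deliberately flawed stand-in classifier and all examples are made up.

```python
def false_positive_rate(predict, negatives):
    """Fraction of 'not car' examples that the model wrongly calls 'car'."""
    wrong = sum(1 for example in negatives if predict(example) == "car")
    return wrong / len(negatives)


def toy_model(example):
    """Stand-in classifier that keys on the background instead of the object."""
    return "car" if example["background"] == "street" else "not car"


# Made-up negatives. The native set has almost no car-free street scenes,
# so the shortcut goes unnoticed until external negatives are tried.
native_negatives = [{"background": "forest"}] * 9 + [{"background": "street"}]
external_negatives = [{"background": "street"}] * 6 + [{"background": "beach"}] * 4

native_fpr = false_positive_rate(toy_model, native_negatives)
external_fpr = false_positive_rate(toy_model, external_negatives)
print(f"native FPR={native_fpr:.2f}, external FPR={external_fpr:.2f}")
```

A much higher rate on external negatives hints that the model's idea of "not car" depends on backgrounds it happened to see during training.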

Legal and Ethical
Data Sourcing
Considerations

Check Local Privacy and Property Laws, Consult Experts
Governments move more
slowly than technology.
Laws can change.
Example:
IBM’s “Diversity in Faces”
used public Flickr photos
without explicit consent.
That may not be legal in the
future and could discredit the
dataset, which would be a shame.

Case: Compliant Facial Data Sourcing in East Africa
A tech company was legally
sourcing diverse facial
images from East Africa,
complete with consent
forms.
After collecting, it realized
that Kenyan privacy laws
were stricter.
It used Uganda-sourced
data only, instead of
risking legal action in
Kenya.

Best Practices: Acquire Data Ethically and Legally
• Know the legal definition of data consent in the collection location
• If scraping (legally), consider images of celebrities who are already in
the public eye
• Data from private citizens, even if legal, is more likely to cause controversy
• Buy data from accredited sources that own and manage image rights
and know how to do business, such as Getty
• It may cost more, but it might save you legal fees and embarrassing headlines
• Document and credit sources
• Understand the EU’s GDPR and other major laws

Best Practices: Evaluate Methodology for Ethics
Use fast.ai’s “Data Checklist” to help make fewer ethical mistakes:
• Have we tested our training data to ensure that it is fair and
representative?
• Have we studied and understood the possible sources of bias in
our data?
• Does our team reflect diversity of opinions, background and all
kinds of thought [enabling us to see and catch more bias]?
• What kinds of user consent do we need to collect or use the data?
• Do we have a mechanism for gathering consent from users?
• Have we clearly explained what users are consenting to?

Best Practices: Understand What Bias Truly Means
• Humans are inherently biased; eliminating all forms of bias is
impossible
• Understand cognitive bias, limitations and decision making (your
algorithm makes decisions)
• Challenge and test assumptions: weigh evidence, don’t jump to
conclusions
• Constantly, rigorously examine bias:
• Your own biases
• Biases of those providing data/information

Key Takeaways to Avoid Bias and Source Properly
• Clearly articulate your end training goal and know what data is
needed to get to it
• Map out ways bias can enter data, and proactively source data to
avoid it
• Ensure data represents reality for your training goal in quantity
and diversity, replenish data often
• Test data before and after training on a wide range of data
• Be aware of ethics and laws, both current and potential
• Always get proper consent for data, even for public data

25% of the Fortune 50 Trust Samasource
to Solve Their Training Data Challenges
Over one billion points annotated in 2018.
We’ve helped lift 50,000 people out of
poverty.
Meet
Samasource at
booth #621

Resources

General Resources
Whose lives matter to self-driving cars?
https://www.consumeraffairs.com/amp/news/whose-lives-matter-to-self-driving-cars-043019.html
16 Things You Can Do to Make Tech More Ethical, part 1 (checklist for data projects)
https://www.fast.ai/2019/04/22/ethics-action-1/
When it comes to Gorillas, Google Photos Remains Blind
https://www.wired.com/story/when-it-comes-to-gorillas-google-photos-remains-blind/
MIT Tech Review: AI Bias
https://www.technologyreview.com/s/612876/this-is-how-ai-bias-really-happensand-why-its-so-hard-to-fix/
Challenges with AI
https://www.mckinsey.com/featured-insights/artificial-intelligence/notes-from-the-ai-frontier-applications-and-value-of-deep-learning
Stanford Dogs Dataset
http://vision.stanford.edu/aditya86/ImageNetDogs/

Papers & Studies Referenced
Unbiased Look at Dataset Bias
http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.208.2314&rep=rep1&type=pdf
Predictive Inequity in Object Detection
https://arxiv.org/pdf/1902.11097.pdf
Undoing the Damage of Dataset Bias
http://people.csail.mit.edu/khosla/papers/eccv2012_khosla.pdf
Men Also Like Shopping
https://arxiv.org/abs/1707.09457
Impact of Biases in Big Data
https://arxiv.org/pdf/1803.00897.pdf
Gender Shades & Update
http://proceedings.mlr.press/v81/buolamwini18a/buolamwini18a.pdf
http://www.aies-conference.com/wp-content/uploads/2019/01/AIES-19_paper_223.pdf
About Samasource
https://www.samasource.com