[NEW LAUNCH!] Amazon SageMaker Ground Truth – A Deep Dive with an Interactive Workshop to Build High-Quality Training Datasets (AIM370) - AWS re:Invent 2018
© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon SageMaker Ground Truth – A Deep Dive
with an Interactive Workshop to Build High-Quality
Training Datasets
Vikram Madan
Product Manager
Amazon Web Services
AIM370
Arvind Jayasundar
Engineering Manager
Amazon Web Services
Fedor Zhdanov
Applied Scientist
Amazon Web Services
Agenda
1) Introduction to data labeling and Amazon SageMaker Ground Truth (15 minutes)
2) Demo setup of manifest and creation of labeling job (15 minutes)
3) Activity 1: Set up data in Amazon S3 (15 minutes)
4) How to write effective labeling task instructions (10 minutes)
5) Activity 2: Create labeling jobs (40 minutes)
6) How auto labeling and annotation consolidation work (20 minutes)
7) Activity 3: Understand results from labeling job (20 minutes)
For training machine learning (ML) models
• Text analysis
• Precision agriculture
• Manufacturing efficiency
• Food safety
• Self-driving cars
• Inventory cataloging
and many more use cases…
Basics of data labeling
Why is data labeling difficult?
DL models need large labeled datasets
Large number of humans to perform labeling
Difficult to achieve high accuracy for labels
Consumes up to 80% of time to deploy ML
Amazon SageMaker
Label machine learning training data easily and accurately
New
Ground Truth key features
High accuracy labeling
• Improve accuracy with annotation consolidation and UI templates with built-in labeling UX best practices
Automated labeling
• Prioritize which data goes to humans first (“not all data is created equal”)
• Get part of your data labeled automatically (reduces redundant / unnecessary labeling)
Data labeling jobs
• Use pre-built templates for image and text labeling tasks
• Create customized tasks for your specific image and text labeling requirements
Dataset and label management
• Query and analyze the results of your labeling jobs
• Track and manage your datasets and enable easy integration with your data lake
Multiple workforce options
• Scale out labeling easily with the public Mechanical Turk workforce
• Direct work to your own workers or use vendor workforces listed on AWS Marketplace
Supported data labeling use cases
• Bounding boxes
• Image classification
• Semantic segmentation
• Text classification
• Custom tasks
Multiple workforce options
Public
An on-demand 24x7 workforce of over 500,000 independent contractors worldwide, powered by Amazon Mechanical Turk
Private
A team of workers that you have sourced yourself, including your own employees or contractors, for handling data that needs to stay within your organization
Vendors
A curated list of third-party vendors that specialize in providing data labeling services, available via the AWS Marketplace
Key ideas: Machine with human in the loop
• Consolidate annotations from multiple workers
• Send to humans only the examples that are hard for machines to label well
Common
Insight
Set up input dataset
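As a minimal sketch of the manifest setup step: a Ground Truth input manifest is a JSON Lines file in which each line points at one data object through a "source-ref" key. The bucket name and object keys below are hypothetical placeholders, and the helper name `build_manifest` is introduced here only for illustration.

```python
import json


def build_manifest(s3_uris):
    """Return manifest text: one JSON object per line, each with a "source-ref" key."""
    return "".join(json.dumps({"source-ref": uri}) + "\n" for uri in s3_uris)


# Hypothetical bucket and keys, used only for illustration.
image_uris = [
    "s3://example-bucket/images/bird-0001.jpg",
    "s3://example-bucket/images/bird-0002.jpg",
]

manifest_text = build_manifest(image_uris)
print(manifest_text, end="")
```

The resulting text would then be uploaded to Amazon S3 and referenced when the labeling job is created.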
16. © 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Createlabeling job
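The workshop creates the job in the console. For readers who prefer the API, the sketch below only assembles the request that could be passed to the SageMaker `create_labeling_job` operation (for example through boto3). Every ARN, URI, and name is a hypothetical placeholder, the helper `labeling_job_request` is not part of any SDK, and the field list is a minimal subset rather than a complete configuration.

```python
def labeling_job_request(job_name, manifest_uri, output_uri, role_arn,
                         workteam_arn, ui_template_uri, pre_lambda_arn,
                         consolidation_lambda_arn, workers_per_object=3):
    """Assemble keyword arguments for a create_labeling_job call (illustrative subset)."""
    return {
        "LabelingJobName": job_name,
        # Name of the key under which labels appear in the output manifest.
        "LabelAttributeName": job_name,
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_uri}}
        },
        "OutputConfig": {"S3OutputPath": output_uri},
        "RoleArn": role_arn,
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": ui_template_uri},
            "PreHumanTaskLambdaArn": pre_lambda_arn,
            "TaskTitle": "Draw bounding boxes around birds",
            "TaskDescription": "Draw a box around every bird in the image.",
            # Several workers per object so annotations can be consolidated.
            "NumberOfHumanWorkersPerDataObject": workers_per_object,
            "TaskTimeLimitInSeconds": 300,
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": consolidation_lambda_arn
            },
        },
    }


# All values below are hypothetical placeholders.
request = labeling_job_request(
    job_name="example-bird-boxes",
    manifest_uri="s3://example-bucket/manifests/input.manifest",
    output_uri="s3://example-bucket/output/",
    role_arn="arn:aws:iam::111122223333:role/ExampleGroundTruthRole",
    workteam_arn="arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/example-team",
    ui_template_uri="s3://example-bucket/templates/bounding-box.liquid.html",
    pre_lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-pre-task",
    consolidation_lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-consolidation",
)
# With credentials configured, the request could be submitted with:
#   boto3.client("sagemaker").create_labeling_job(**request)
print(sorted(request))
```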
Access data here: https://justpaste.it/3dkdp
Why good instructions?
Take an example: How can we improve
• Task “Draw a bounding box” makes people think. “Draw bounding boxes around objects of the specified class in this image.”
• Panel instruction is too long and should be shortened
What are good instructions?
More recommendations on task setup
• Limit the scope of the labeling task to a single, narrow, simple-to-describe use case
• If the dataset contains objects from dramatically different contexts, workers may not be clear on what to label
• Consider order of operations and job chaining for complex tasks. If you have two classes of objects that need to be bounded, consider a classification job as a first step, then an object detection job for each object class.
• Avoid bounding-box tasks with a large number of target boxes (aim for 1 to 5 boxes)
• Avoid instructions that rely on cultural awareness or biases, or that use complex language
Why consolidate?
Image classification
Worker quality estimate
True label estimate
* Dawid AP, Skene AM. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics. 1979;28(1):20-28.
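The slide cites Dawid and Skene's EM approach, which alternates between estimating each worker's quality and estimating each item's true label. The sketch below is a small pure-Python version of that idea for illustration only; it is not Ground Truth's implementation, and the function name, smoothing constant, worker names, and votes are all assumptions.

```python
from collections import defaultdict


def dawid_skene(votes, classes, iterations=20, smoothing=0.01):
    """votes: list of (item, worker, label). Returns (posteriors, confusion)."""
    items = sorted({item for item, _, _ in votes})
    workers = sorted({worker for _, worker, _ in votes})
    by_item = defaultdict(list)
    for item, worker, label in votes:
        by_item[item].append((worker, label))

    # Initialise true-label estimates with per-item vote fractions.
    post = {}
    for item in items:
        counts = {c: 0.0 for c in classes}
        for _, label in by_item[item]:
            counts[label] += 1.0
        total = sum(counts.values())
        post[item] = {c: counts[c] / total for c in classes}

    conf = {}
    for _ in range(iterations):
        # M-step: class priors and per-worker confusion matrices
        # conf[worker][true_class][given_label], smoothed to avoid zeros.
        prior = {c: sum(post[i][c] for i in items) / len(items) for c in classes}
        conf = {w: {t: {g: smoothing for g in classes} for t in classes}
                for w in workers}
        for item in items:
            for worker, label in by_item[item]:
                for t in classes:
                    conf[worker][t][label] += post[item][t]
        for w in workers:
            for t in classes:
                row_total = sum(conf[w][t].values())
                for g in classes:
                    conf[w][t][g] /= row_total

        # E-step: posterior over each item's true label given worker quality.
        for item in items:
            scores = {}
            for t in classes:
                score = prior[t]
                for worker, label in by_item[item]:
                    score *= conf[worker][t][label]
                scores[t] = score
            norm = sum(scores.values())
            post[item] = {t: scores[t] / norm for t in classes}
    return post, conf


# Hypothetical votes: workers "a" and "b" always agree, "c" is noisy.
votes = [
    ("img1", "a", "bird"), ("img1", "b", "bird"), ("img1", "c", "bird"),
    ("img2", "a", "cat"), ("img2", "b", "cat"), ("img2", "c", "bird"),
    ("img3", "a", "bird"), ("img3", "b", "bird"), ("img3", "c", "cat"),
    ("img4", "a", "cat"), ("img4", "b", "cat"), ("img4", "c", "cat"),
]
posteriors, confusion = dawid_skene(votes, classes=["bird", "cat"])
consolidated = {item: max(p, key=p.get) for item, p in posteriors.items()}
print(consolidated)
```

The diagonal of each worker's confusion matrix plays the role of the worker quality estimate, and the per-item posterior plays the role of the true label estimate.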
Object detection
Semantic segmentation
High-level view of Amazon SageMaker Ground Truth logic
Main loop
Deep neural networks: Classification
Deep neural networks: Object detection
Find the bird
Active learning and auto-annotation
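One way to picture the idea of sending humans only the hard examples is a confidence threshold on model predictions. The sketch below is a simplified illustration, not the service's actual logic: the threshold value, the function name `route_by_confidence`, and the prediction scores are assumptions.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split items into auto-annotated labels and a human queue.

    predictions: dict mapping item id to a dict of class -> probability.
    Returns (auto, human): auto maps item id to the predicted label;
    human lists item ids ordered from least to most confident.
    """
    auto = {}
    uncertain = []
    for item_id, class_probs in predictions.items():
        label = max(class_probs, key=class_probs.get)
        confidence = class_probs[label]
        if confidence >= threshold:
            auto[item_id] = label
        else:
            uncertain.append((confidence, item_id))
    # Hardest examples (lowest top confidence) go to humans first.
    human = [item_id for _, item_id in sorted(uncertain)]
    return auto, human


# Hypothetical model outputs for three images.
predictions = {
    "img1": {"hawk": 0.97, "pigeon": 0.03},
    "img2": {"hawk": 0.55, "pigeon": 0.45},
    "img3": {"hawk": 0.20, "pigeon": 0.80},
}
auto, human = route_by_confidence(predictions)
print(auto, human)
```

In a full loop, the human labels collected for the uncertain items would be used to retrain the model before the next round of routing.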
Reducing costs
Cost savings on Birds8 classification of 12K images:
3834 auto-annotated images, cost $17.75
8166 human-annotated images, cost $408.30
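The figures above can be checked with a little arithmetic. The sketch below assumes, for comparison only, that labeling all 12K images by hand would cost the same per-image rate as the human-annotated portion; that baseline is an assumption, not a number from the slide.

```python
auto_images, auto_cost = 3834, 17.75
human_images, human_cost = 8166, 408.30

total_images = auto_images + human_images
actual_cost = auto_cost + human_cost

# Per-image human rate implied by the slide's numbers.
human_rate = human_cost / human_images

# Assumed baseline: every image labeled by humans at that same rate.
all_human_cost = human_rate * total_images
savings = all_human_cost - actual_cost
savings_fraction = savings / all_human_cost

print(f"total images: {total_images}")
print(f"actual cost: ${actual_cost:.2f}")
print(f"all-human baseline: ${all_human_cost:.2f}")
print(f"savings: ${savings:.2f} ({savings_fraction:.1%})")
```

Under that assumption the job costs $426.05 instead of about $600.00, a saving of roughly 29%.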
Sorting by difficulty: Image classification
Sorting by difficulty: Object detection
Object detection: mAP vs mIoU
https://storage.googleapis.com/openimages/web/factsfigures.html
The annotations are licensed by Google Inc. under CC BY 4.0 license. The images are listed as having a CC BY 2.0 license.
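For reference, intersection over union (IoU) for two axis-aligned boxes is the area of their overlap divided by the area of their union. The sketch below computes it, plus a mean over box pairs. Reading mIoU as the mean IoU over already-matched predicted and reference boxes is an assumption about the slide, and the (left, top, right, bottom) box format and sample coordinates are hypothetical.

```python
def iou(box_a, box_b):
    """IoU of two boxes given as (left, top, right, bottom)."""
    a_left, a_top, a_right, a_bottom = box_a
    b_left, b_top, b_right, b_bottom = box_b
    inter_w = max(0.0, min(a_right, b_right) - max(a_left, b_left))
    inter_h = max(0.0, min(a_bottom, b_bottom) - max(a_top, b_top))
    intersection = inter_w * inter_h
    area_a = (a_right - a_left) * (a_bottom - a_top)
    area_b = (b_right - b_left) * (b_bottom - b_top)
    union = area_a + area_b - intersection
    return intersection / union if union > 0 else 0.0


def mean_iou(box_pairs):
    """Average IoU over (predicted, reference) pairs that are already matched."""
    if not box_pairs:
        return 0.0
    return sum(iou(pred, ref) for pred, ref in box_pairs) / len(box_pairs)


# Hypothetical boxes: one perfect match and one partial overlap.
pairs = [
    ((0, 0, 2, 2), (0, 0, 2, 2)),
    ((0, 0, 2, 2), (1, 1, 3, 3)),
]
print(mean_iou(pairs))
```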
Thank you!
Vikram Madan
Arvind Jayasundar
Fedor Zhdanov
Editor's notes
Create a simple illustration of your task that your workers can immediately understand.
Workers are incentivized per image, so will aim to label each image quickly. Expecting too many boxes may lower overall quality.
Simple solution: test them on data you know the labels for.
Hurdles: not always available, not economical especially if the tested workers only do a small number of jobs.
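The "test them on data you know the labels for" idea can be sketched as a per-worker accuracy check against a small gold set. The worker names, items, and labels below are hypothetical, and the helper `worker_accuracy` is introduced only for illustration.

```python
def worker_accuracy(answers, gold):
    """Score each worker on items whose true label is known.

    answers: list of (worker, item, label) tuples.
    gold: dict mapping item -> known true label.
    Returns a dict mapping worker -> fraction of gold items answered correctly.
    """
    correct = {}
    total = {}
    for worker, item, label in answers:
        if item not in gold:
            continue  # Only items with a known label count toward the score.
        total[worker] = total.get(worker, 0) + 1
        correct[worker] = correct.get(worker, 0) + (1 if label == gold[item] else 0)
    return {worker: correct[worker] / total[worker] for worker in total}


# Hypothetical gold labels and worker answers.
gold = {"img1": "hawk", "img2": "pigeon"}
answers = [
    ("alice", "img1", "hawk"), ("alice", "img2", "pigeon"),
    ("bob", "img1", "hawk"), ("bob", "img2", "hawk"),
    ("bob", "img3", "pigeon"),  # img3 has no gold label, so it is skipped.
]
print(worker_accuracy(answers, gold))
```

As the note says, this only works when gold data exists, and a worker who completes just a few tasks yields a noisy estimate.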
Engaging audience slide: can you see a hawk? A pigeon?
This is hard as there is no single confidence for an image.