© 2018, Amazon Web Services, Inc. or its affiliates. All rights reserved.
Amazon SageMaker Ground Truth – A Deep Dive
with an Interactive Workshop to Build High-Quality
Training Datasets
Vikram Madan
Product Manager
Amazon Web Services
AIM370
Arvind Jayasundar
Engineering Manager
Amazon Web Services
Fedor Zhdanov
Applied Scientist
Amazon Web Services
Agenda
1) Introduction to data labeling and Amazon SageMaker Ground Truth (15
minutes)
2) Demo setup of manifest and creation of labeling job (15 minutes)
3) Activity 1: Set up data in Amazon S3 (15 minutes)
4) How to write effective labeling task instructions (10 minutes)
5) Activity 2: Create labeling jobs (40 minutes)
6) How auto labeling and annotation consolidation work (20 minutes)
7) Activity 3: Understand results from labeling job (20 minutes)
For training machine learning (ML) models
• Text analysis
• Precision agriculture
• Manufacturing efficiency
• Food safety
• Self-driving cars
• Inventory cataloging
and many more use cases…
Basics of data labeling
Why is data labeling difficult?
DL models need large labeled datasets
Large number of humans to perform labeling
Difficult to achieve high accuracy for labels
Consumes up to 80% of time to deploy ML
Amazon SageMaker
Label machine learning training data easily and accurately
New
Ground Truth key features
High accuracy labeling
• Improve accuracy with annotation consolidation and UI templates with built-in labeling UX best practices
Automated labeling
• Prioritize which data goes to humans first (“not all data is created equal”)
• Get part of your data labeled automatically (reduces redundant / unnecessary labeling)
Data labeling jobs
• Use pre-built templates for image and text labeling tasks
• Create customized tasks for your specific image and text labeling requirements
Dataset and label management
• Query and analyze the results of your labeling jobs
• Track and manage your datasets and enable easy integration with your data lake
Multiple workforce options
• Scale out labeling easily with the public Mechanical Turk workforce
• Direct work to your own workers or use vendor workforces listed on AWS Marketplace
Supported data labeling use cases
• Bounding boxes
• Image classification
• Semantic segmentation
• Text classification
• Custom tasks
Multiple workforce options
Public
An on-demand 24x7 workforce of over 500,000 independent contractors worldwide, powered by Amazon Mechanical Turk
Private
A team of workers that you have sourced yourself, including your own employees or contractors, for handling data that needs to stay within your organization
Vendors
A curated list of third-party vendors that specialize in providing data labeling services, available via the AWS Marketplace
Key ideas: Machine with human in the loop
• Consolidate annotations from multiple workers
• Only send to humans the examples that are hard for the machines to label well
Common
Insight
Set up input dataset
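As a rough illustration of this step: a Ground Truth input manifest is a JSON Lines file in which each line is a JSON object pointing at one data object through a "source-ref" key. The sketch below builds such a manifest as text; the bucket name and object keys are hypothetical placeholders, and uploading the result to Amazon S3 is left out.

```python
import json


def build_manifest(s3_uris):
    """Return manifest text: one {"source-ref": uri} JSON object per line."""
    return "\n".join(json.dumps({"source-ref": uri}) for uri in s3_uris) + "\n"


# Hypothetical bucket and keys, for illustration only.
manifest_text = build_manifest([
    "s3://example-bucket/birds/img_0001.jpg",
    "s3://example-bucket/birds/img_0002.jpg",
])
print(manifest_text, end="")
```

The resulting text would be saved as a .manifest file and uploaded to the S3 location that the labeling job reads from.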
Create labeling job
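The workshop creates the job through the console, but the same job can be described programmatically. The sketch below only assembles a request dictionary using parameter names from the SageMaker CreateLabelingJob API; every ARN, S3 URI, task title, and time limit shown is a hypothetical placeholder, and optional fields (such as a label category file) are omitted.

```python
def build_labeling_job_request(job_name, manifest_uri, output_uri, role_arn,
                               workteam_arn, ui_template_uri, pre_lambda_arn,
                               consolidation_lambda_arn, workers_per_item=3):
    """Assemble keyword arguments for the CreateLabelingJob API call."""
    return {
        "LabelingJobName": job_name,
        # Key under which labels appear in the output manifest.
        "LabelAttributeName": job_name,
        "InputConfig": {
            "DataSource": {"S3DataSource": {"ManifestS3Uri": manifest_uri}}
        },
        "OutputConfig": {"S3OutputPath": output_uri},
        "RoleArn": role_arn,
        "HumanTaskConfig": {
            "WorkteamArn": workteam_arn,
            "UiConfig": {"UiTemplateS3Uri": ui_template_uri},
            "PreHumanTaskLambdaArn": pre_lambda_arn,
            "TaskTitle": "Draw bounding boxes around birds",
            "TaskDescription": "Draw a tight box around every bird in the image.",
            # Several workers per item so their annotations can be consolidated.
            "NumberOfHumanWorkersPerDataObject": workers_per_item,
            "TaskTimeLimitInSeconds": 300,
            "AnnotationConsolidationConfig": {
                "AnnotationConsolidationLambdaArn": consolidation_lambda_arn
            },
        },
    }


# All identifiers below are made-up placeholders.
request = build_labeling_job_request(
    job_name="birds-bounding-box",
    manifest_uri="s3://example-bucket/birds/input.manifest",
    output_uri="s3://example-bucket/birds/output/",
    role_arn="arn:aws:iam::111122223333:role/ExampleGroundTruthRole",
    workteam_arn="arn:aws:sagemaker:us-east-1:111122223333:workteam/private-crowd/example",
    ui_template_uri="s3://example-bucket/birds/template.liquid",
    pre_lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-pre",
    consolidation_lambda_arn="arn:aws:lambda:us-east-1:111122223333:function:example-acs",
)
```

With boto3 installed and credentials configured, a dictionary like this could be passed as boto3.client("sagemaker").create_labeling_job(**request); that call is not made here.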
Access data here: https://justpaste.it/3dkdp
Why good instructions?
Take an example: How can we improve
• Task “Draw a bounding box” makes people think.
“Draw bounding boxes around objects of the specified class in this image.”
• Panel instruction is too long and should be shortened
What are good instructions?
More recommendations on task setup
• Limit the scope of the labeling task to a single, narrow, simple-to-describe use case
• If the dataset contains objects from dramatically different contexts, workers may not be clear on what to label
• Consider order of operations and job chaining for complex tasks. If you have two classes of objects that need to be bounded, consider a classification job as a first step, then object detection jobs for each object class.
• Avoid bounding-box tasks with a large number of target boxes (aim for 1 to 5 boxes)
• Avoid instructions that lean on cultural awareness or biases, or that use complex language
Why consolidate?
Image classification
Worker quality estimate
True label estimate
* Dawid AP, Skene AM. Maximum likelihood estimation of observer error-rates using the EM algorithm. Applied Statistics. 1979;28(1):20-28.
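To make the alternation between "worker quality estimate" and "true label estimate" concrete, here is a minimal sketch of the EM scheme from the Dawid and Skene reference. It is an illustration under simplifying assumptions (every item has at least one vote, a small smoothing constant eps, a fixed iteration count), not a description of how Ground Truth implements consolidation; the function name and the toy votes are made up.

```python
import math


def dawid_skene(votes, n_classes, n_iter=20, eps=1e-6):
    """votes[i][w] is worker w's label for item i, or None if unlabeled.

    Returns (post, conf): post[i][k] estimates P(true label of item i is k),
    conf[w][k][l] estimates P(worker w answers l | true label is k).
    Assumes every item has at least one vote.
    """
    n_items, n_workers = len(votes), len(votes[0])
    classes = range(n_classes)
    # Initialise posteriors with per-item vote fractions (soft majority vote).
    post = []
    for row in votes:
        counts = [sum(1 for v in row if v == k) for k in classes]
        total = sum(counts)
        post.append([c / total for c in counts])
    for _ in range(n_iter):
        # M-step: class prior and smoothed per-worker confusion matrices.
        prior = [sum(p[k] for p in post) / n_items for k in classes]
        conf = [[[eps] * n_classes for _ in classes] for _ in range(n_workers)]
        for p, row in zip(post, votes):
            for w, v in enumerate(row):
                if v is not None:
                    for k in classes:
                        conf[w][k][v] += p[k]
        for w in range(n_workers):
            for k in classes:
                s = sum(conf[w][k])
                conf[w][k] = [x / s for x in conf[w][k]]
        # E-step: posterior over the true label of each item.
        post = []
        for row in votes:
            logp = [math.log(prior[k] + eps) for k in classes]
            for w, v in enumerate(row):
                if v is not None:
                    for k in classes:
                        logp[k] += math.log(conf[w][k][v])
            m = max(logp)
            unnorm = [math.exp(x - m) for x in logp]
            z = sum(unnorm)
            post.append([u / z for u in unnorm])
    return post, conf


# Toy data: workers 0 and 1 always agree; worker 2 disagrees on two items.
reliable = [0, 0, 0, 1, 1, 1]
noisy = [0, 1, 0, 1, 0, 1]
votes = [[r, r, n] for r, n in zip(reliable, noisy)]
post, conf = dawid_skene(votes, n_classes=2)
labels = [max(range(2), key=lambda k: p[k]) for p in post]
```

On this toy input the consolidated labels follow the two agreeing workers, and the estimated confusion matrix for worker 2 shows a lower chance of answering class 0 correctly than for worker 0.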
Object detection
Semantic segmentation
High-level view of Amazon SageMaker Ground Truth logic
Main loop
Deep neural networks: Classification
Deep neural networks: Object detection
Find the bird
Active learning and auto-annotation
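One simple way to picture the idea of sending only hard examples to humans is confidence thresholding: items the model is confident about are auto-annotated, and the rest are queued for human labeling with the least confident first. The sketch below is an assumption-laden illustration, not Ground Truth's actual selection logic; the function name, the 0.9 threshold, and the sample probabilities are all hypothetical.

```python
def route_by_confidence(predictions, threshold=0.9):
    """Split items into auto-annotated labels and a human labeling queue.

    predictions maps item id -> list of class probabilities from a model.
    Returns (auto, human_queue): auto maps item id -> predicted class index,
    human_queue lists the remaining item ids, least confident first.
    """
    auto, uncertain = {}, []
    for item_id, probs in predictions.items():
        best = max(range(len(probs)), key=lambda k: probs[k])
        if probs[best] >= threshold:
            auto[item_id] = best
        else:
            uncertain.append((probs[best], item_id))
    uncertain.sort()  # lowest top-class probability first
    return auto, [item_id for _, item_id in uncertain]


# Made-up model outputs for three images.
predictions = {"a": [0.97, 0.03], "b": [0.55, 0.45], "c": [0.20, 0.80]}
auto, human_queue = route_by_confidence(predictions)
```

Here image "a" would be auto-annotated with class 0, while "b" and then "c" would go to human workers.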
Reducing costs
Cost savings on Birds8 classification of 12K images:
3834 auto-annotated images, cost $17.75
8166 human-annotated images, cost $408.30
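The figures above can be turned into a rough savings estimate. The calculation below assumes, as a simplification not stated on the slide, that without auto-annotation all 12,000 images would have cost the same per-image rate as the human-annotated ones.

```python
auto_images, auto_cost = 3834, 17.75
human_images, human_cost = 8166, 408.30

total_images = auto_images + human_images          # 12,000 images
per_human_image = human_cost / human_images        # about $0.05 per image
all_human_cost = per_human_image * total_images    # about $600 if all were human-labeled
actual_cost = auto_cost + human_cost               # $426.05 with auto-annotation
savings = all_human_cost - actual_cost             # about $173.95
savings_pct = 100 * savings / all_human_cost       # about 29%

print(f"Estimated savings: ${savings:.2f} ({savings_pct:.0f}%)")
```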
Sorting by difficulty: Image classification
Sorting by difficulty: Object detection
Object detection: mAP vs mIoU
https://storage.googleapis.com/openimages/web/factsfigures.html
The annotations are licensed by Google Inc. under CC BY 4.0 license. The images are listed as having a CC BY 2.0 license.
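For readers unfamiliar with the IoU part of mIoU, the sketch below computes intersection over union for two axis-aligned boxes and averages it over matched pairs. The slide does not spell out its exact metric definitions, so the (x_min, y_min, x_max, y_max) box format, the function names, and the way boxes are paired are assumptions for illustration.

```python
def box_iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    inter_w = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    inter_h = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = inter_w * inter_h
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def mean_iou(pairs):
    """Average IoU over already-matched (predicted, reference) box pairs."""
    return sum(box_iou(p, r) for p, r in pairs) / len(pairs)


# Two 2x2 boxes overlapping in a 1x1 square: IoU = 1 / (4 + 4 - 1).
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
miou = mean_iou([((0, 0, 2, 2), (0, 0, 2, 2)), ((0, 0, 2, 2), (1, 1, 3, 3))])
```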
Thank you!
Vikram Madan
Arvind Jayasundar
Fedor Zhdanov
[NEW LAUNCH!] Amazon SageMaker Ground Truth – A Deep Dive with an Interactive Workshop to Build High-Quality Training Datasets (AIM370) - AWS re:Invent 2018


Editor's notes

  1. Create a simple illustration of your task that your workers can immediately understand.
  2. Create a simple illustration of your task that your workers can immediately understand. Workers are incentivized per image, so will aim to label each image quickly. Expecting too many boxes may lower overall quality.
  3. Simple solution: test workers on data for which you already know the labels. Hurdles: such data is not always available, and testing is not economical, especially if the tested workers do only a small number of jobs.
  4. Audience engagement slide: can you see a hawk? A pigeon?
  5. This is hard because there is no single confidence value for an image.