We describe why and how to be mindful about designing your data annotation pipeline so that it scales and delivers consistent, high-quality results regardless of domain.
5. Software 2.0
“A large portion of programmers of tomorrow … collect, clean,
manipulate, label, analyze and visualize data that feeds neural
networks.”
Andrej Karpathy, Tesla
6. The data is an intrinsic part of the algorithm
Outcome depends as much on the data as on the code
TLDR: There are ways to be as mindful about your data
strategy as you are about your algorithm strategy
Algorithm Training is Algorithm Design
8. Data Annotation Takes Time
Figure-Eight estimates 80% of
development time spent on Data Prep
and Labeling
Cognilytica estimates 25% of time spent
on Data Labeling
9. Data Annotation Needs Are Substantial
Automotive Customer
● 250 k – 500 k frames per month
● Average 10 objects/frame for object detection
● Average 45 mins per frame for full segmentation
● Multiple judgements (3-5) on each data piece
Medical Image Customer
● 200 k endoscopic scans
● Average 2 anomalies per scan
● Multiple judgements (3-5) on each data piece
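To see why these volumes are substantial, the automotive figures above can be turned into rough labeling hours. This is a back-of-the-envelope sketch, not a number from the presentation. It assumes every frame receives full segmentation at the 45-minute average and that each additional judgement costs as much as the first. The function name and its parameters are illustrative.

```python
def monthly_annotation_hours(frames_per_month, minutes_per_frame, judgements_per_frame):
    """Rough labeling effort in person-hours per month."""
    total_minutes = frames_per_month * minutes_per_frame * judgements_per_frame
    return total_minutes / 60


# Low and high ends of the ranges quoted for the automotive customer.
# Assumption: all frames get full segmentation and every judgement is a full pass.
low = monthly_annotation_hours(250_000, 45, 3)
high = monthly_annotation_hours(500_000, 45, 5)
print(f"{low:,.0f} to {high:,.0f} person-hours per month")
```

Under those assumptions the range is 562,500 to 1,875,000 person-hours per month. Real projects would likely mix cheaper task types, so treat this as an upper-bound illustration.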
10. Data Annotation is Increasingly Complex
● Bounding Boxes: simple boxes (+secs/task)
● Polygons: precise boundaries on some objects (+mins/task)
● Segmentation: all objects precisely marked (+30 mins/task)
● Panoptic Segmentation: all objects precisely marked and grouped by type (+45 mins/task)
● Tracking: objects marked and tracked across frames of video (+30 mins/task)
● LIDAR: thousands of points grouped into objects (+90 mins/task)
● Multi-Sensor Fusion: combine LIDAR and images from multiple angles (+90 mins/task)
11. Data Annotation involves Domains
Domain knowledge (skill tiers shown: GENERAL, SKILLED, SPECIALIZED, EXPERT)
● Complex Subject Matter: healthcare, finance, law
● Jargon-Rich Domain: image editing, e-commerce (brand jargon)
● Specific World Knowledge: current events, fashion
● General Knowledge: travel AI assistant
Labeling task (LABELING) and the knowledge it requires (DOMAIN)
● Diagnosis & Treatment: clinical history, epidemiology, contextual analysis
● Classification: pathophysiology, multiple dependency decision tree
● Identification: anatomy & physiology, pattern recognition, ontological understanding
● Segmentation: navigation & tool familiarity
12. Data Security and Audit Trail
Quality and Consistency
Custom Tooling and Insights
Domain Knowledge & Targeted Skilling
Retained Learnings across Iterations
The Case For Enterprise Annotation
14. iMerit is a tech-enabled data services company that leverages human intelligence in
data, content, and machine learning.
We deliver high-quality, managed services while effecting
positive social and economic change.
Our data experts work full-time onsite at our secure delivery facilities.
We are iMerit
● 24x7 operations
● < 5% attrition
● 9 centers
● 200 M+ data points delivered
● 130+ clients
● SOC 2 certified
● 2,600 employees
16. Capture Video during game
Mark joint positions of pitcher
Build 3D skeleton for analytics
Expand to multiple teams
Extend to batters, fielders
HELPING CHICAGO CUBS WIN WORLD SERIES
17. • Street scenes for Autonomous Vehicles - Images + LiDAR
• Named Entities/Salience in Financial Documents
• Aerial Imagery of healthy and diseased crops
• Peril Assessment for Property Insurance
• Identification of tumors and lesions in medical scans
• Risk Assessment of Power Assets
Experience and Expertise
21. For generalists
Narrow and Deep
Example Rich, requires time to
train, practice, and iterate
2. Guidelines & Training
22. Data and QC Pipeline
UI optimizations
Crawl (calibration)
Walk (soft production, rapid feedback)
Run (production, internal QA)
Supports scale, ensures quality
3. Workflow Customization
23. Collaboration: SMEs, PM,
engineer, generalists
Insights into unanticipated
deviations
No penalty for challenging
assumptions
Improve model by identifying biases
Ensure reliability of annotations
4. Feedback Cycle
24. Key metrics & thresholds
Share responsibility
Test against gold set
Measure inter-rater reliability
Increase rigor over project life
Minimize rework iterations
Ensures quality
Validates assumptions
5. Evaluation
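Two of the checks listed on this slide, testing against a gold set and measuring inter-rater reliability, can be sketched in a few lines. The presentation does not say which reliability statistic is used. Cohen's kappa for two raters is shown here as one common choice, and with 3-5 judgements per item a multi-rater statistic such as Fleiss' kappa may fit better. The function names and the sample labels are made up for illustration.

```python
from collections import Counter


def gold_accuracy(labels, gold):
    """Fraction of items where the annotator matches the gold label."""
    if len(labels) != len(gold) or not gold:
        raise ValueError("labels and gold must be non-empty and the same length")
    return sum(a == g for a, g in zip(labels, gold)) / len(gold)


def cohens_kappa(rater_a, rater_b):
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("raters must label the same non-empty set of items")
    n = len(rater_a)
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    # Chance agreement from each rater's own label frequencies.
    expected = sum(counts_a[c] * counts_b[c] for c in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1 - expected)


# Illustrative labels only, not data from the presentation.
gold = ["person", "vehicle", "person", "vehicle"]
rater_a = ["person", "vehicle", "person", "person"]
rater_b = ["person", "vehicle", "vehicle", "person"]

accuracy = gold_accuracy(rater_a, gold)
kappa = cohens_kappa(rater_a, rater_b)
print(f"gold accuracy: {accuracy:.2f}, kappa: {kappa:.2f}")
```

Tracking both numbers against agreed thresholds is one way to "increase rigor over project life": the thresholds can start loose during calibration and tighten in production.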
27. Good Annotation Design: Context Matters
Are you trying to avoid
hitting people or are you
counting vehicles?
Person or Vehicle?
28. Good Annotation Design: UI Matters
I want bounding boxes no
smaller than 1.5 cm in
any dimension
Go for it!
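One reading of this exchange is that a requirement stated in centimeters has to be translated into something an annotation tool can enforce. The sketch below assumes the 1.5 cm refers to on-screen size and converts it to pixels using a display density. The 96 pixels-per-inch value, the box format, and the function names are all assumptions for illustration, not part of the presentation.

```python
CM_PER_INCH = 2.54


def cm_to_pixels(size_cm, pixels_per_inch):
    """Convert an on-screen length in centimeters to pixels."""
    return size_cm / CM_PER_INCH * pixels_per_inch


def box_is_large_enough(box, min_side_px):
    """box is (x_min, y_min, x_max, y_max) in pixels."""
    x_min, y_min, x_max, y_max = box
    return (x_max - x_min) >= min_side_px and (y_max - y_min) >= min_side_px


# Assumed display density; a real tool would need the annotator's actual setup.
min_side = cm_to_pixels(1.5, pixels_per_inch=96)

print(box_is_large_enough((0, 0, 60, 60), min_side))
print(box_is_large_enough((0, 0, 60, 40), min_side))
```

Because the pixel threshold changes with screen and zoom level, a rule like this is easier to apply when the tool checks it automatically than when annotators have to judge it by eye.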
29. iMerit Solution Architect + Customer
Expert:
Unpack the jargon
Create a deep and narrow training
curriculum (docs, videos, video
conferences)
Retain learnings across time
Good Annotation Design: Domain Specific
30. Good Annotation Design: Allow Open Feedback
● Conversation around quality
Are some errors more important than others?
How will you sample quality?
● Safe space to Iterate without penalty
● Small discovery and calibration pilots
● Ask your labeling force to question edge cases
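The two quality questions above, which errors matter most and how to sample, can be made concrete with a small sketch. It draws a reproducible random sample of tasks for review and scores the reviewed tasks with severity weights. The error types, the weights, and the function names are hypothetical, and the right values would come out of the conversation with the annotation team.

```python
import random


def sample_for_review(task_ids, sample_size, seed=0):
    """Pick a reproducible random subset of completed tasks for QA review."""
    task_ids = list(task_ids)
    rng = random.Random(seed)
    return rng.sample(task_ids, min(sample_size, len(task_ids)))


def weighted_error_rate(review_results, severity_weights):
    """Score reviewed tasks, each recorded as an error type or None if correct.

    Returns the summed severity divided by the worst case, in which every
    reviewed task has the most severe error.
    """
    if not review_results:
        raise ValueError("review_results must not be empty")
    worst = max(severity_weights.values())
    total = sum(severity_weights[e] for e in review_results if e is not None)
    return total / (worst * len(review_results))


# Hypothetical error types and weights, chosen only for illustration.
severity_weights = {"missed_object": 1.0, "wrong_class": 0.5, "loose_boundary": 0.1}

reviewed = sample_for_review(range(1000), sample_size=50)
results = [None, "missed_object", None, "loose_boundary", "wrong_class"]
score = weighted_error_rate(results, severity_weights)
print(f"reviewing {len(reviewed)} tasks, weighted error rate {score:.2f}")
```

A fixed seed keeps the sample auditable. Changing the weights is a cheap way to test whether a quality target is sensitive to how errors are ranked.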
31. Summary – Mindful Data Annotation
Data strategy as mindful as your
algorithm strategy
● Ask the right questions
● Plan time and budget
● Plan for increased skill needs
● Partner with your annotation
team
● Create an environment where
insight is possible
● Build a long-term, secure, scalable
pipeline