The acquisition of labeled, unbiased, high-quality remote sensing information for training AI systems is expensive, error-prone, and sometimes impossible or dangerous. The efficacy of Remote Sensing and Imagery Analysis tools that use AI depends directly on the data used for training and validation, meaning that the cost and availability of data limit the application of AI for imagery exploitation. Synthetic Computer Vision (CV) data has become a strategy to reduce the cost and limitations of using real-world data for detection problems in data-sparse domains. Focusing on remote sensing data including visible and non-visible electromagnetic spectra, attendees will learn about the expanding options for generating synthetic data that are being used in commercial and academic domains, the technology options available for users who want to create CV content of a variety of types, and patterns of creating synthetic data to support
Learning Objectives
- Describe synthetic data including different types such as Generative AI and physics-based data
- Identify the opportunities for applying synthetic data in place of real sensor data
- Describe the steps required to generate synthetic data for computer vision workflows, from concept to production, for training and validating AI.
- The intent of this class is to introduce the concepts and mechanisms behind the creation of synthetic data and to expose students to approaches for generating synthetic data using tools currently on the market.
- Familiarity with concepts around AI training and validation using remotely sensed data will be helpful for attendees.
2023 GEOINT Tutorial - Synthetic Data Tools for Computer Vision-Based AI - Rendered.ai
2. Synthetic Data Tools for
Computer Vision-Based AI
Chris Andrews
COO, Rendered.ai
chris@rendered.ai
Dan Hedges
Lead Solution Architect, Rendered.ai
dan@rendered.ai
3. Presenters
Chris Andrews
• COO & Head of Product, Rendered.ai
• 25 years experience in commercial and government geospatial-related products and technologies
• 3D, enterprise integration, BIM-GIS, defense-related apps & solutions
• 15 years experience with product development at companies including Esri, IBM, and Autodesk
Dan Hedges
• Lead Solution Engineer, Rendered.ai
• 11+ years experience building geospatial solutions for industry verticals including urban planning, local government, federal government
• Subject matter expertise in remote sensing, 3D data, and feature extraction
4. Rendered.ai’s cloud-hosted platform for synthetic
data enables customers to overcome the costs
and challenges of acquiring and using real data
for training and validating computer vision ML and
AI systems and algorithms
• Established in 2019 in Bellevue, WA
• Inclusive subscription encompasses 2D & 3D content
creation, simulation design, data generation
• Rapid setup and configuration for shortest path to
synthetic data generation for multiple applications
• Available on the AWS Marketplace
Synthetic Data experts with experience in:
• Remote sensing – Satellite, Aerial
• Ground-based imagery & video
• Non-visible EM spectra
• 2D and 3D modeling and simulation
• GAN training and dataset post processing
• Dataset comparison and validation
The Platform as a Service for Synthetic Data
5. The AI Data Problem
BIAS
Rare objects and scenarios are hard to capture
COST & TIME
Real data is expensive and time-consuming to acquire and label
INNOVATION
Without data it is impossible to explore new sensors and data types
PRIVACY/SECURITY
Real data can have security or high-risk information concerns that limit usage
7. Synthetic Data solves the AI data problem
Rendered.ai is a PaaS and developer
framework for synthetic data
Synthetic Data is engineered data that
AI interprets as real data
60% of data used for AI and data analytics
projects will be synthetic, and by 2030, synthetic
data will have completely overtaken real data in AI
models.
- Gartner, September 2021
Imagine if it were possible to produce infinite
amounts of the world’s most valuable resource,
cheaply and quickly…
This is a reality today. It is called synthetic data.
- Forbes, July 2022
8. What do we mean by Synthetic Data?
Synthetic data can be created for any type of data used to train or
validate AI/ML systems, even for sensors or systems that don’t exist
CV-based synthetic data simulates bitmap sensor data capture
whether from sensors, recorded spatial patterns, or other CV input
content
Physics-based synthetic data includes creation of 2D/3D/4D output
based on ‘digital twins’ of physical sensors, the sensor platform, and
the scene in which the sensor would operate
Rendered.ai can be used to generate any kind of synthetic data
Initial focus has been on physics-based synthetic data generation for
CV workflows
• RGB imagery and video, RGB microscopy, IR imagery, X-ray, SAR,…
Source: Wikipedia
9. Today’s AI workflow relies on finding or acquiring data
Acquire or find data → Train algorithm → Test algorithm → Accept/Reject result
Expensive, unpredictable data acquisition costs
Difficulty training algorithms on inconsistent data
Testing requires reuse of real datasets
Results are limited to what can be achieved with real datasets
10. Tomorrow’s AI workflow incorporates synthetic data
Inexpensive, unlimited data generation
100% accurate labeling, consistent data
Real datasets used for comparison and post processing
Data can be designed for edge or impossible cases and for removing bias
Create data → Train algorithm → Test algorithm → Compare datasets
11. Hypothetical workflow
Simulator → Synthetic Dataset → AI Model
Real-world workflow
Components: Asset Acquisition / Integration, World building and procedural gen, Simulator, Managed Compute, Dataset and metadata, Post processing / Domain adaptation, Quality assessment, AI Model, Improved and explainable outcomes
Roles: Synthetic Data Engineers, Data Scientist, Platform Automation
For more information: info@rendered.ai
12. Synthetic data generation steps
1. Scenario characterization - Data output, variability, problem(s) addressed or tested
2. World building - Asset and scene content composition and aggregation
3. Sensor modeling & simulation - Rendering, visual effects, environmental effects
4. Annotation & mask calculation
5. Job execution & dataset compilation
6. Annotation mapping
7. Domain Adaptation post-processing
8. Dataset characterization and comparison
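Steps 1 through 6 above can be illustrated with a toy sketch. This is not Rendered.ai code: it assumes a hypothetical scenario in which rectangular "objects" are placed at random on a small grid, and every name in it (`sample_scene`, `render_instance_mask`, `boxes_from_mask`, `generate_dataset`) is made up for illustration. It shows why simulator labels are exact: the generator knows which object produced each pixel, so masks and `[x, y, width, height]` boxes are computed rather than drawn by hand.

```python
import random


def sample_scene(rng, width=32, height=24, max_objects=3):
    """Steps 1-2: draw random scene parameters (object count, size, position)."""
    objects = []
    for instance_id in range(1, rng.randint(1, max_objects) + 1):
        w, h = rng.randint(3, 8), rng.randint(3, 8)
        x, y = rng.randint(0, width - w), rng.randint(0, height - h)
        objects.append({"id": instance_id, "x": x, "y": y, "w": w, "h": h})
    return {"width": width, "height": height, "objects": objects}


def render_instance_mask(scene):
    """Steps 3-4: 'render' the scene into a per-pixel instance-id mask (0 = background)."""
    mask = [[0] * scene["width"] for _ in range(scene["height"])]
    for obj in scene["objects"]:  # later objects occlude earlier ones
        for row in range(obj["y"], obj["y"] + obj["h"]):
            for col in range(obj["x"], obj["x"] + obj["w"]):
                mask[row][col] = obj["id"]
    return mask


def boxes_from_mask(mask):
    """Step 6: map the mask to [x, y, width, height] boxes around visible pixels."""
    extents = {}
    for row, line in enumerate(mask):
        for col, instance_id in enumerate(line):
            if instance_id == 0:
                continue
            x0, y0, x1, y1 = extents.get(instance_id, (col, row, col, row))
            extents[instance_id] = (min(x0, col), min(y0, row), max(x1, col), max(y1, row))
    return {i: [x0, y0, x1 - x0 + 1, y1 - y0 + 1] for i, (x0, y0, x1, y1) in extents.items()}


def generate_dataset(num_images, seed=0):
    """Step 5: run the 'job' and compile each mask with its annotations."""
    rng = random.Random(seed)  # a fixed seed makes the dataset reproducible
    dataset = []
    for _ in range(num_images):
        scene = sample_scene(rng)
        mask = render_instance_mask(scene)
        dataset.append({"scene": scene, "mask": mask, "boxes": boxes_from_mask(mask)})
    return dataset
```

Calling `generate_dataset(100, seed=42)` twice returns identical data, which is one reason seeded generation is convenient for comparing dataset variants. A real channel would replace the rectangle "renderer" with a sensor model and 3D scene content.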
13. New AI job: Synthetic Data Engineer
If most data used to train AI will be synthetic…
…who will be engineering the data?
Design & engineer datasets to achieve specific AI outcomes
Software development-oriented
• Python, data science, 3D, game engines
Domain or industry expertise
Expert in specific data types & technologies
• Sensors, Renderers, Modeling, Simulation
14. What about Generative AI?
Physics-based synthetic data
• Starts from a 3D simulation
• Can add wide variation including absurd,
unnatural, or extremely rare phenomena
• Can generate multiple ‘maps’ for depth,
instances, surfaces, normals, motion
• Can generate fully pixel-labeled content
• Can incorporate accurate physics-based
models for imagery generation
Generative AI (2023)
• Starts with large, known datasets
• Can add variation, but must be driven by
additional training data
• Cannot generate extra maps with
information in the scene
• Cannot label at the pixel level
• Does not incorporate physics-based
models
Generative AI is moving fast and we see it as another tool for both
content generation and post processing or consuming other synthetic data
15. New AI job: Prompt Engineer
In the world of Generative AI, someone needs to tell the AI what to
produce!
Design & engineer inputs to Generative AI systems to achieve specific
outcomes
Narrative-oriented
• Good at defining context, describing problems
Domain or industry expertise
Expert in specific data types & technologies
• Sensors, Renderers, Modeling, Simulation
16. Common gaps when introducing customers to synthetic data
• Hyper focus on the bounds of found or acquired data only
• Most data scientists aren’t sensor experts
• Concern about ‘good data’
• Concern about one-off datasets vs. investment in data
• Belief that human perception is good enough to judge data quality
• Confusion over Generative AI vs. simulation techniques
… Note that the biggest hurdle is that customers rarely stop to ask what the
ideal dataset would be that would address their business problem!
17. Synthetic data generation is an empirical process
Identify the problem → Describe the (ideal) data → Generate data → Can I achieve any training? → Refine data generation → Can I improve training?
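The loop on this slide can be sketched as a small driver function. This is a minimal illustration under stated assumptions, not a Rendered.ai API: `generate`, `train_and_score`, and `refine` are hypothetical callables supplied by the user, and the toy stand-ins below exist only so the sketch runs end to end.

```python
def empirical_loop(params, generate, train_and_score, refine, max_rounds=10, min_gain=0.01):
    """Generate data, train, and refine generation until the score stops improving."""
    best_params = params
    best_score = train_and_score(generate(params))  # "Can I achieve any training?"
    history = [best_score]
    for _ in range(max_rounds):
        candidate = refine(best_params, best_score)
        score = train_and_score(generate(candidate))
        history.append(score)
        if score - best_score < min_gain:  # "Can I improve training?" -> no, stop
            break
        best_params, best_score = candidate, score
    return best_params, best_score, history


# Toy stand-ins (assumptions): the "dataset" is just its parameters, and the
# score peaks when object_density is 0.6. Real use would call a simulator
# and a model trainer here.
def toy_generate(params):
    return dict(params)


def toy_score(dataset):
    return 1.0 - abs(dataset["object_density"] - 0.6)


def toy_refine(params, score):
    return {"object_density": round(params["object_density"] + 0.1, 2)}


best_params, best_score, history = empirical_loop(
    {"object_density": 0.2}, toy_generate, toy_score, toy_refine
)
```

The stopping rule here (a minimum gain per round) is one simple design choice; in practice the decision to keep refining usually also weighs generation cost and validation on real data.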
18. Supporting GEOINT workflows with continuously evolving AI
Channel development (GIS Developer, Database Engineer, Synthetic Data Engineer)
Model digital sensor → Aggregate & create scene content → Create Channel configuration → Publish to Rendered.ai
Graph configuration and job execution (GIS Analyst, Computer Vision Engineers, Data Scientists & Automated Workflows)
Add Channel to Workspace → Create & configure Graphs → Run Jobs → Datasets → Train and Evaluate AI
Feedback: Change graph configuration; Add/update sensor configuration, scene content, scene configuration
Dataset contents: Annotation, Images, Masks, Statistics
Tools: GIS tools, Data Science toolkits, Embedded AI tools
20. Don’t rebuild everything for every AI application
Applications
• Remote Sensing
• Supply Chain
• Object detection
• Automotive
• Economic monitoring
• Medical Imaging
• Security
• …
Sensors
• Radar Imagery
• RGB Camera
• Panchromatic
• Infrared
• High-Definition Radar
• Microscopy
• X-Ray/CT Scan
• MRI
• …
Reusable modular architecture in the cloud
• Content pipelines
• Sensor models
• Analytics toolsets
• AI integrations
Enabling access to synthetic data as an enterprise capability
21. Channel Development | Blender
Content Code: SATRDEMO
- Dependencies installed:
- Blender and Python (versions harmonized),
OpenCV, GPU drivers, Ana, Anatools SDK
- Can Edit and Deploy Channels with SDK
- Offered as AMI or from git with
.devcontainer for VS-code
- ArcGIS integration for 2D raster
backgrounds
Custom Code
Available now
22. Case study slide: EO scenarios
Searching for cranes and crane trucks as an economic indicator in satellite imagery
Objects are rare relative to other features in overhead imagery, which means very large labeling campaigns are needed to collect examples. The original dataset had only ~100 examples of each class.
Objects are difficult to label. Inconsistent sizing of crane bounding boxes and similarities between crane trucks and cement pumps were two notable challenges in the real datasets.
Synthetic and real datasets
2-3x improvement in AP scores over peak performance without synthetic data
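For readers unfamiliar with the AP (average precision) metric cited here, the following is a generic sketch of how AP can be computed for one class on one image. The slide does not say which AP variant or IoU threshold was used in the case study, so the 0.5 threshold, the all-point interpolation, and the function names are assumptions; matching conventions also differ between benchmarks.

```python
def iou(a, b):
    """Intersection over union of two boxes given as (x_min, y_min, x_max, y_max)."""
    ix = max(0.0, min(a[2], b[2]) - max(a[0], b[0]))
    iy = max(0.0, min(a[3], b[3]) - max(a[1], b[1]))
    inter = ix * iy
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def average_precision(detections, ground_truth, iou_threshold=0.5):
    """AP for one class on one image.

    detections: list of (confidence, box); ground_truth: list of boxes.
    """
    if not ground_truth:
        return 0.0
    matched = set()
    hits = []  # 1 = true positive, 0 = false positive, in descending confidence
    for _, box in sorted(detections, key=lambda d: d[0], reverse=True):
        best_iou, best_idx = 0.0, None
        for idx, gt in enumerate(ground_truth):
            overlap = iou(box, gt)
            if idx not in matched and overlap > best_iou:
                best_iou, best_idx = overlap, idx
        if best_idx is not None and best_iou >= iou_threshold:
            matched.add(best_idx)  # each ground-truth box can be matched once
            hits.append(1)
        else:
            hits.append(0)
    # Precision and recall after each ranked detection.
    precisions, recalls, true_positives = [], [], 0
    for rank, hit in enumerate(hits, start=1):
        true_positives += hit
        precisions.append(true_positives / rank)
        recalls.append(true_positives / len(ground_truth))
    # Make precision non-increasing, then integrate it over recall.
    for i in range(len(precisions) - 2, -1, -1):
        precisions[i] = max(precisions[i], precisions[i + 1])
    ap, previous_recall = 0.0, 0.0
    for precision, recall in zip(precisions, recalls):
        ap += (recall - previous_recall) * precision
        previous_recall = recall
    return ap
```

Because a detection only counts when its box overlaps a labeled box by at least the threshold, inconsistent box sizing in real labels (one of the challenges noted above) directly lowers measured AP.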
23. Channel Development | DIRSIG™
Content Code: DIRSIGDEMO
- DIRSIG accessed through
python and web interface
- Can Edit and Deploy Channels
with SDK
- No RIT DIRSIG training required!
Custom Code
Available now
24. Example Applications:
Hyper-spectral Imaging,
Multi-spectral Imaging
Unique relationship with RIT allows
Rendered.ai to package DIRSIG in synthetic
data channels for customers
MSI, HSI, other radiometrically complex
imagery output
Validation possible with calibration panels,
3rd party consulting
Pixel-level geospatial accuracy
Geospatially accurate, high resolution scene
content used in cloud-based generation for
very large datasets
RGB bands from MSI, HSI images created with
DIRSIG and Rendered.ai
25. Channel Development | Omniverse
Available on request
• Preinstalled dependencies:
• USD, Python, OpenCV, GPU drivers,
Ana, Anatools SDK
• Edit and Deploy Channels to
Rendered.ai with SDK
• Offered as AMI or from git with
.devcontainer for VS-code
Custom Code
26. Example Applications:
Omniverse Replicator
channel
Use industry-leading 3D toolkit in the cloud
Configurable in a web-based SaaS experience
Starting place for users who may already
have some experience or investment in
NVIDIA tools
Familiar architecture that extends to
multiple use cases
Synthetic imagery chips generated with Omniverse Replicator running inside Rendered.ai on AWS
27. Example Application:
Synthetic Aperture
Radar
Enterprise & Developer Subscription
Customers
Experimental, cutting-edge Synthetic
Aperture Radar simulation built by
Rendered.ai
SAR output is not human readable,
making human labeling impossible
Emerging commercial SAR industry
seeking better tools for exploitation,
value creation
Applications in defense, disaster
response, Earth observation &
monitoring, insurtech
Synthetic SAR images generated using Rendered.ai
Identical object shown with several image capture scenarios
28. Example
Application: Marine
Imagery
Enterprise tier customer
Vessel detection in open ocean
scenarios for defense and
contraband interdiction
Supporting edge-based, onboard
object detection systems
Variable weather, wave, obstruction
characteristics
Variable object placement
generators
Synthetic RGB images simulating marine UAV imagery capture
29. Examples of synthetic CV content
Satellite Visible
Synthetic IR (MWIR)
Synthetic SAR
Security Imaging
FLIR Camera
X-Ray and CT scans
Urban & natural environments
Industrial and residential settings
Over 1.2TB of synthetic images produced, with channel coverage growing
30. And after you have your imagery… compare it!
Creating datasets is a starting point
Training and Validation are next
Compare datasets to explore similarity
• Real-synthetic, synthetic-synthetic
Use tools such as UMAP, FID
Use inference results to change synthetic data generation (SDG)
Try again!
UMAP analysis enables data scientists to explore similarities and differences in the parameter embedding space of multiple datasets
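The dataset comparison step named on this slide can be illustrated with a simplified Fréchet distance. FID proper fits full-covariance Gaussians to image features taken from a pretrained Inception network; the sketch below instead assumes the features are already extracted (the feature vectors are hypothetical) and treats feature dimensions as independent, which reduces the distance to a sum of squared mean and standard-deviation gaps. It is a sketch of the idea, not a replacement for a standard FID implementation.

```python
from statistics import fmean, pstdev


def diagonal_frechet_distance(features_a, features_b):
    """Squared Fréchet distance between two feature sets, assuming diagonal Gaussians.

    Each argument is a list of equal-length feature vectors, one per image.
    """
    distance = 0.0
    # zip(*features) regroups the per-image vectors into per-dimension columns.
    for column_a, column_b in zip(zip(*features_a), zip(*features_b)):
        mean_gap = fmean(column_a) - fmean(column_b)
        std_gap = pstdev(column_a) - pstdev(column_b)
        distance += mean_gap ** 2 + std_gap ** 2
    return distance


# Toy feature vectors (assumptions): the "synthetic" set is the "real" set
# shifted by 3 along the first feature dimension.
real_features = [[0.0, 0.0], [2.0, 0.0], [0.0, 2.0], [2.0, 2.0]]
synthetic_features = [[x + 3.0, y] for x, y in real_features]
gap = diagonal_frechet_distance(real_features, synthetic_features)
```

A distance of zero means the two sets have matching per-dimension means and spreads; larger values indicate that the synthetic set has drifted from the real one and that the generation parameters may need another pass.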
32. Typical challenges adopting synthetic data
Internal
• Past experience with cost or failure of one-off synthetic data experiments
• Unprepared for experimentation
• Effort to achieve acceptable level of realism
• Complexity/difficulty with physics-based modeling
External
• Information about emerging tools
• TCO of yet-another-IT project
• Talent shortage
• Lack of benchmarks/standards
─ Need for analytic tools
─ Need for sensitivity analysis
• Lack of industry collaboration
33. Opportunity of Synthetic Data
Supplement real data
Evaluate and remove bias
Reduce expensive dataset
labeling and reacquisition
Explore scenarios
Simulate sensor models
and collection techniques
Create novel data with
zero PII or
security concerns
34. Synthetic data as a Standard
Synthetic data is rapidly moving from uncertain value to required tool. Synthetic
data has the opportunity to be used as part of regulatory and ethical frameworks
around bias reduction, demonstrable sensitivity analysis, and reducing the need
for human curation of training data.
Regulatory & compliance
• Bias reduction and testing
• Sensitivity analysis
• Efficacy demonstration
• Removing human-in-the-loop from ethical/harmful scenarios
35. Synthetic data as an enabler for innovation
As synthetic data generation capabilities improve and become more
accessible, users will have expanded opportunity to experiment,
innovate, and build AI without expensive or impossible real sensor
dataset collection.
Innovation
• Complex sensor fusion
• New & hard-to-acquire sensors
• New dataset combinations
• Digital Twins
36. Synthetic data driving sustainability
Synthetic data is 100% reliably labeled, has been shown to reduce the size
of training datasets, and potentially reduces the need for real sensor-based
data collection.
Cost and impact
• Reducing labeling costs
• Reducing collection costs
• Reducing environmental footprint of real sensor data collection
• Enabling innovation without physical material consumption/investment
37. Wrap up
For slides and supporting content:
https://bit.ly/GEOINT2023
Try it at:
https://rendered.ai/getstarted.html