Standard annotation tools strip geospatial metadata. Standard IoU thresholds fail production geospatial systems. Standard crowdsourcing platforms lack annotators who can interpret spectral signatures beyond visible light. Satellite image annotation is not computer vision annotation applied to overhead imagery; it is a fundamentally different discipline that requires different tools, different annotator expertise, and different quality standards.
The evidence is blunt: ResNet trained on natural images hits only 55% accuracy on aerial crops, versus 82% when trained on aerial data from the start. The distribution mismatch between ground-level imagery and satellite imagery is not trivial; it is catastrophic. And that gap is before accounting for the coordinate system loss, spectral band errors, and oriented bounding box failures that standard annotation workflows introduce into geospatial training data.
This guide covers what makes satellite image annotation structurally different, the four annotation types that geospatial AI pipelines require, the accuracy standard that separates production-grade from benchmark-grade labels, and the quality validation process that catches systematic errors before they enter training.
Three ways satellite annotation differs from standard CV
Coordinate reference systems - the metadata that standard tools destroy
Every satellite image carries geospatial metadata linking pixels to locations on Earth's surface. Annotations must embed this coordinate information to map back to Earth: it is the difference between a bounding box that means "object at these pixel coordinates" and one that means "object at 12.97°N, 77.59°E." Standard annotation tools strip this metadata during export, forcing teams to manually reconstruct the spatial reference afterward, a process that introduces errors and breaks downstream GIS integration.
Common coordinate systems in production workflows include WGS84 (EPSG:4326) for global latitude/longitude reference and UTM for zone-based metric projections suited to local analysis. The solution is to use workflows that accept Cloud-Optimized GeoTIFF (COG) input and embed CRS information in every annotation export, rather than retrofitting geospatial metadata after annotation is complete.
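As a minimal sketch of what embedding coordinate information can look like, the snippet below maps a pixel-space bounding box to map coordinates with a six-coefficient affine geotransform (GDAL ordering) and wraps it in a GeoJSON-style feature tagged with its CRS. The function names, the example geotransform values, and the "crs" property are illustrative assumptions, not the export format of any particular annotation tool.

```python
import json

def pixel_to_map(gt, col, row):
    """Apply a GDAL-ordered affine geotransform to a pixel position.

    gt = (x_origin, pixel_width, row_rotation,
          y_origin, col_rotation, pixel_height)
    pixel_height is negative for north-up images.
    """
    x = gt[0] + col * gt[1] + row * gt[2]
    y = gt[3] + col * gt[4] + row * gt[5]
    return x, y

def bbox_to_feature(gt, epsg, col_min, row_min, col_max, row_max, label):
    """Turn a pixel bounding box into a GeoJSON-style polygon feature."""
    corners = [
        (col_min, row_min), (col_max, row_min),
        (col_max, row_max), (col_min, row_max),
        (col_min, row_min),  # close the ring
    ]
    ring = [list(pixel_to_map(gt, c, r)) for c, r in corners]
    return {
        "type": "Feature",
        # Storing the CRS as a property is an assumption of this sketch.
        "properties": {"label": label, "crs": f"EPSG:{epsg}"},
        "geometry": {"type": "Polygon", "coordinates": [ring]},
    }

# Hypothetical north-up image: top-left corner at 77.59°E, 12.97°N,
# pixels 0.0001 degrees on a side.
GT = (77.59, 0.0001, 0.0, 12.97, 0.0, -0.0001)
feature = bbox_to_feature(GT, 4326, 100, 200, 150, 260, "building")
print(json.dumps(feature, indent=2))
```

In a real pipeline the geotransform and EPSG code would be read from the source COG rather than typed in by hand.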
Multi-spectral complexity - annotating beyond visible light
Sentinel-2 provides 13 spectral bands. RGB is only 3 of them. The remaining 10 bands encode information invisible to human eyes but critical to what geospatial AI needs to learn, including near-infrared (NIR) for vegetation health and short-wave infrared (SWIR) for moisture content and burn detection. Other sensors, such as Landsat 8 and 9, add thermal bands for heat signatures. Annotators labeling only on visible-light composites systematically misclassify features that are distinguishable only in non-visible bands: healthy crops versus stressed crops, wet soil versus dry soil, active fires versus recent burn scars.
Generic crowdsourcing platforms lack annotator training in spectral signatures and false-color composites. The result is labels that are visually plausible but spectrally incorrect: invisible in QA reviews but catastrophic during model training on multi-spectral inputs.
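To make the point about non-visible bands concrete, here is a small sketch of two widely used normalized-difference indices: NDVI (NIR versus red) and NBR (NIR versus SWIR). For Sentinel-2 these are commonly computed from bands 8, 4, and 12. The reflectance values are made-up illustrations, not measurements.

```python
def normalized_difference(a, b):
    """Generic (a - b) / (a + b) index; returns 0.0 when both inputs are 0."""
    denom = a + b
    return 0.0 if denom == 0 else (a - b) / denom

def ndvi(nir, red):
    """Normalized Difference Vegetation Index: high for healthy vegetation."""
    return normalized_difference(nir, red)

def nbr(nir, swir):
    """Normalized Burn Ratio: drops sharply over recently burned areas."""
    return normalized_difference(nir, swir)

# Illustrative surface reflectances (assumed values).
healthy_crop = ndvi(nir=0.45, red=0.05)
stressed_crop = ndvi(nir=0.25, red=0.12)
print(f"healthy NDVI={healthy_crop:.2f}, stressed NDVI={stressed_crop:.2f}")
```

Two fields that look similarly green in an RGB composite can separate clearly on an index like this, which is why annotators need access to the extra bands or to composites derived from them.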
Oriented bounding boxes - what standard detectors cannot handle
Standard object detection frameworks expect axis-aligned bounding boxes: fixed horizontal and vertical boundaries. Objects in satellite imagery appear at arbitrary orientations: a ship at 37 degrees, an aircraft at 215 degrees, a storage tank at any angle. Axis-aligned boxes around rotated objects include large amounts of background, confusing the model with irrelevant context. Oriented bounding boxes (OBB) add a rotation angle alongside position and size parameters, which requires annotation tools and export formats that rotated-detection frameworks and models such as MMRotate, R3Det, and S2ANet can consume, not standard COCO or VOC format tools. A 5-degree rotation error in annotations on small objects (covering ~15 pixels) compounds across thousands of training examples and produces systematic heading errors in production predictions.
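The background problem can be quantified with a little geometry. The sketch below computes what fraction of an axis-aligned box is background when it encloses a rotated rectangular object; the function name and the 100 × 20 pixel ship are assumptions chosen for illustration.

```python
import math

def aabb_background_fraction(width, height, angle_deg):
    """Fraction of the enclosing axis-aligned box NOT covered by the object.

    The object is modeled as a width x height rectangle rotated by angle_deg.
    """
    t = math.radians(angle_deg)
    c, s = abs(math.cos(t)), abs(math.sin(t))
    aabb_w = width * c + height * s  # horizontal extent of the rotated box
    aabb_h = width * s + height * c  # vertical extent of the rotated box
    return 1.0 - (width * height) / (aabb_w * aabb_h)

# A hypothetical 100 x 20 pixel ship heading 37 degrees.
fraction = aabb_background_fraction(100, 20, 37)
print(f"{fraction:.0%} of the axis-aligned box is background")
```

For this elongated example roughly 71% of the axis-aligned box is background, while an oriented box around the same object has essentially none.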
The four annotation types geospatial AI pipelines require
Semantic segmentation for land cover classification
Pixel-level classification across large areas, assigning every pixel in a scene to a land cover class (forest, agriculture, urban, water, bare soil). Single satellite images can reach 10,000 × 10,000 pixels covering 100+ square kilometres, making tiling essential. The annotation challenge is tile boundary management: clean semantic segmentation masks must handle overlap between tiles consistently, keep polygon boundaries smooth at tile edges, and maintain classification consistency across the full scene. Staircase artifacts on polygon edges caused by rushed annotation create prediction artifacts that appear systematically along tile boundaries during inference.
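One way to tile a large scene with consistent overlap is sketched below. The tile size of 1024 and overlap of 128 pixels are assumed defaults, and the function names are hypothetical; the final tile in each row and column is shifted back so it ends exactly at the scene edge instead of running past it.

```python
def tile_origins(length, tile, overlap):
    """Start offsets along one axis so tiles overlap and cover the full length."""
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than the tile")
    if length <= tile:
        return [0]
    stride = tile - overlap
    origins = list(range(0, length - tile, stride))
    origins.append(length - tile)  # last tile is flush with the scene edge
    return origins

def tile_windows(width, height, tile=1024, overlap=128):
    """Return (col_offset, row_offset, tile_width, tile_height) windows."""
    return [
        (x, y, min(tile, width), min(tile, height))
        for y in tile_origins(height, tile, overlap)
        for x in tile_origins(width, tile, overlap)
    ]

windows = tile_windows(10_000, 10_000)
print(len(windows), "tiles; first:", windows[0], "last:", windows[-1])
```

Because every tile shares a strip with its neighbours, masks annotated per tile can be checked for agreement inside the overlap before they are merged back into a scene-level mask.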
Object detection with oriented bounding boxes
Identifying specific targets in satellite imagery (vehicles, ships, aircraft, storage tanks, buildings) using bounding boxes that capture rotation angle alongside position and size. Small objects in high-resolution imagery (sub-meter WorldView-3 data) may cover as few as 15 pixels, making annotation precision critical. A 5-degree rotation error on a 15-pixel object adds significant background noise to the training example. This annotation type requires familiarity with OBB annotation interfaces and the frameworks and models that consume them (MMRotate, R3Det, S2ANet), not standard detection annotation tools designed for axis-aligned boxes.
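To measure how much an angle error costs on a specific object, one option is to compute the IoU between the reference oriented box and the mis-rotated one. The sketch below does this in pure Python with Sutherland-Hodgman polygon clipping; the (cx, cy, w, h, angle) box layout and all function names are assumptions of this example, not the API of any detection framework.

```python
import math

def obb_corners(cx, cy, w, h, angle_deg):
    """Corners of an oriented box, counter-clockwise in the x/y plane."""
    t = math.radians(angle_deg)
    c, s = math.cos(t), math.sin(t)
    offsets = [(-w / 2, -h / 2), (w / 2, -h / 2), (w / 2, h / 2), (-w / 2, h / 2)]
    return [(cx + dx * c - dy * s, cy + dx * s + dy * c) for dx, dy in offsets]

def polygon_area(points):
    """Absolute polygon area via the shoelace formula."""
    if len(points) < 3:
        return 0.0
    pairs = zip(points, points[1:] + points[:1])
    return abs(sum(x1 * y2 - x2 * y1 for (x1, y1), (x2, y2) in pairs)) / 2

def clip_convex(subject, clipper):
    """Sutherland-Hodgman: clip `subject` against a convex CCW `clipper`."""
    output = list(subject)
    for (ax, ay), (bx, by) in zip(clipper, clipper[1:] + clipper[:1]):
        if not output:
            break
        current, output = output, []

        def side(p):
            # Positive when p lies to the left of edge a -> b (inside).
            return (bx - ax) * (p[1] - ay) - (by - ay) * (p[0] - ax)

        for p, q in zip(current, current[1:] + current[:1]):
            sp, sq = side(p), side(q)
            if sp >= 0:
                output.append(p)
            if sp * sq < 0:  # edge p -> q crosses the clip line
                t = sp / (sp - sq)
                output.append((p[0] + t * (q[0] - p[0]), p[1] + t * (q[1] - p[1])))
    return output

def rotated_iou(box_a, box_b):
    """IoU of two oriented boxes given as (cx, cy, w, h, angle_deg)."""
    pa, pb = obb_corners(*box_a), obb_corners(*box_b)
    inter = polygon_area(clip_convex(pa, pb))
    union = polygon_area(pa) + polygon_area(pb) - inter
    return inter / union

reference = (0.0, 0.0, 15.0, 5.0, 0.0)  # hypothetical 15 x 5 pixel vehicle
for error_deg in (0, 5, 10):
    mislabeled = (0.0, 0.0, 15.0, 5.0, float(error_deg))
    print(error_deg, "deg error -> IoU", round(rotated_iou(reference, mislabeled), 3))
```

Running a check like this over a sample of double-annotated objects gives a per-size picture of how tight the angle tolerance in the guidelines needs to be.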
Change detection annotation
Temporal annotation workflows that pair images of the same location at different timestamps with masks tracking what changed between them: new construction, deforestation, flood extent, agricultural harvest cycles. The annotation challenge is three-fold: maintaining precise spatial alignment between image pairs (even a shift of a pixel or so produces spurious change along every object edge), enforcing temporal consistency (the same location at two timestamps must share a coherent semantic context), and distinguishing change types semantically rather than just flagging difference (a label of "change" is less valuable than "newly built" or "demolished" or "flooded").
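A minimal sketch of semantic change labeling, assuming both dates have already been co-registered and classified on the same pixel grid. The class names and the transition table are hypothetical examples; a real project would define its own taxonomy.

```python
# Hypothetical mapping from (class before, class after) to a change type.
CHANGE_TYPES = {
    ("bare soil", "urban"): "newly built",
    ("urban", "bare soil"): "demolished",
    ("forest", "bare soil"): "deforested",
    ("agriculture", "water"): "flooded",
}

def label_change(before, after):
    """Per-pixel change type from two aligned class maps (lists of rows)."""
    same_shape = len(before) == len(after) and all(
        len(rb) == len(ra) for rb, ra in zip(before, after)
    )
    if not same_shape:
        raise ValueError("class maps must be aligned to the same grid")
    return [
        [
            "no change" if b == a else CHANGE_TYPES.get((b, a), "other change")
            for b, a in zip(row_before, row_after)
        ]
        for row_before, row_after in zip(before, after)
    ]

before = [["forest", "bare soil"], ["agriculture", "urban"]]
after = [["bare soil", "urban"], ["water", "urban"]]
print(label_change(before, after))
```

Keeping the transition table explicit makes it easy to audit which class pairs annotators are allowed to produce and which fall into a catch-all category.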
SAR annotation - through clouds and darkness
Synthetic Aperture Radar (SAR) imagery is collected by active sensors that emit microwave pulses and record backscatter, enabling imaging through clouds, rain, and darkness. SAR annotation requires annotators who understand radar signatures: metallic structures produce characteristic bright returns independent of visual appearance, calm water appears dark because it reflects the pulse away from the sensor regardless of how it looks in optical imagery, and vegetation produces distinctive texture patterns that differ from its optical appearance. Generic annotators applying visual pattern recognition to SAR data produce systematically incorrect labels. This represents a genuine market gap: defense programs have SAR expertise; most commercial annotation teams do not.
Resolution requirements by application - matching training to deployment
Training on 10-meter data and deploying on sub-meter imagery produces significant accuracy drops. The reverse is also true: sub-meter training data used on lower-resolution deployment imagery wastes annotation cost on precision that the deployment sensor cannot resolve. Resolution matching between training data and deployment conditions is a prerequisite for production accuracy.
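A quick way to sanity-check resolution matching is to ask how many pixels an object spans at a given ground sample distance (GSD). In the sketch below, the 4.5 m car length and the two GSD values are assumed for illustration.

```python
def pixels_across(object_size_m, gsd_m):
    """Approximate number of pixels an object spans at a given GSD."""
    if gsd_m <= 0:
        raise ValueError("GSD must be positive")
    return object_size_m / gsd_m

CAR_LENGTH_M = 4.5  # assumed typical passenger car length

for name, gsd in (("10 m imagery", 10.0), ("0.3 m imagery", 0.3)):
    print(f"{name}: car spans about {pixels_across(CAR_LENGTH_M, gsd):.2f} pixels")
```

The same car that covers about 15 pixels at 0.3 m is less than half a pixel at 10 m, so labels and features learned at one resolution have no direct counterpart at the other.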
Edge cases that break geospatial models and how to annotate them
Cloud cover and cloud shadows
Cloud cover is the most common data quality issue in optical satellite imagery and the most commonly mishandled in annotation workflows. The four classes (clear sky, thin cloud, thick cloud, and cloud shadow) require different treatment in training data. Thin cloud detection is the hardest annotation task in satellite imagery: the difference between "clear with atmospheric haze" and "thin cirrus cloud" is subtle in RGB and only reliably distinguishable in specific spectral bands. Cloud shadows are even more problematic: they appear as dark regions that naive models confuse with water or dark soil. Systematic cloud misclassification in training data teaches models incorrect spectral relationships that persist across geographic regions.
Off-nadir angle distortion
Satellites collect imagery at varying look angles: nadir (straight down) and off-nadir (at an angle). Off-nadir imagery shows building lean, tree lean, and geometric distortion that changes the apparent shape and size of annotated objects. A building footprint annotated from nadir imagery will not match the same building's appearance in off-nadir imagery. Models trained on nadir-only data and deployed on off-nadir imagery experience systematic spatial misalignment. Annotation guidelines must specify how to handle building shadows, lean, and distortion consistently across viewing angles.
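The size of the lean effect can be estimated with simple trigonometry. The sketch below assumes flat terrain and treats the off-nadir angle as the viewing angle at the ground, which ignores Earth curvature; the building height, angle, and GSD are illustrative values.

```python
import math

def roof_offset_m(building_height_m, off_nadir_deg):
    """Approximate ground distance between a roof edge and its footprint."""
    return building_height_m * math.tan(math.radians(off_nadir_deg))

def roof_offset_px(building_height_m, off_nadir_deg, gsd_m):
    """The same offset expressed in image pixels."""
    return roof_offset_m(building_height_m, off_nadir_deg) / gsd_m

# Hypothetical 30 m building imaged 25 degrees off-nadir at 0.5 m GSD.
offset = roof_offset_px(30.0, 25.0, 0.5)
print(f"roof is displaced by about {offset:.0f} pixels from the footprint")
```

An offset of this size (about 28 pixels in the example) is why guidelines need to state whether annotators trace the roof outline or the ground footprint.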
Geographic domain shift
Models trained on annotated European data and deployed in Africa show accuracy drops from 90% to below 50% for identical land cover features, because the spectral signatures, vegetation species, building materials, and agricultural patterns differ between regions. This geographic domain shift is one of the most common causes of production failures in geospatial AI. It cannot be fixed by augmentation or fine-tuning alone; it requires annotation data from the actual deployment region. Indian agricultural AI trained on USDA or European farming datasets fails on Indian varietals, cropping patterns, and field sizes.
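Geographic domain shift is easy to miss when accuracy is reported as a single number. A minimal sketch of a per-region breakdown follows; the record layout (region, true label, predicted label) and the sample data are assumptions for illustration only.

```python
from collections import defaultdict

def accuracy_by_region(records):
    """Accuracy per region from (region, true_label, predicted_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for region, truth, predicted in records:
        total[region] += 1
        correct[region] += int(truth == predicted)
    return {region: correct[region] / total[region] for region in total}

# Made-up validation records, not real results.
records = [
    ("europe", "forest", "forest"),
    ("europe", "urban", "urban"),
    ("europe", "water", "water"),
    ("europe", "agriculture", "forest"),
    ("africa", "forest", "agriculture"),
    ("africa", "urban", "bare soil"),
    ("africa", "water", "water"),
    ("africa", "agriculture", "agriculture"),
]
print(accuracy_by_region(records))
```

Reporting validation metrics this way, with a held-out set from each intended deployment region, surfaces the gap before the model ships.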
Seasonal spectral variation
Winter and summer imagery of the same agricultural field differ dramatically in every spectral band. A model trained on monsoon-season imagery of Indian paddy fields will misclassify the same fields during the rabi (winter) season, when they are fallow or planted with a different crop. Temporal annotation programs must explicitly capture seasonal variation by labeling the same geographic areas across multiple seasons, or accept that the model's accuracy will degrade significantly in seasons not represented in the training data.
Foundation models and AI-assisted annotation - where the time savings actually are
SAM2 (Segment Anything Model 2) accepts point prompts or rough boxes to generate initial masks, reducing polygon annotation time by roughly 3×. NASA and IBM's Prithvi model, pre-trained on Harmonized Landsat Sentinel-2 (HLS) data, enables training with 90% fewer labels than training from scratch on some geospatial tasks.
The practical workflow: the foundation model generates pre-annotations, a human expert refines boundaries and corrects errors, and QC validates the output before delivery. For datasets above 500 images with standard feature types, the time economics are compelling: 10 hours of manual annotation versus 2 hours of SAM2 refinement plus 1 hour of QC.
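The time comparison above reduces to a one-line ratio. The sketch below uses the hours quoted in the text; the function name is hypothetical.

```python
def assisted_speedup(manual_hours, refinement_hours, qc_hours):
    """How many times faster the assisted workflow is than manual annotation."""
    return manual_hours / (refinement_hours + qc_hours)

speedup = assisted_speedup(manual_hours=10, refinement_hours=2, qc_hours=1)
print(f"assisted workflow is about {speedup:.1f}x faster")
```

With the quoted figures the ratio is about 3.3. It falls toward 1 as refinement time approaches manual annotation time, and below 1 once QC overhead is added on top of near-manual correction effort.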
The limitations are important. Foundation models struggle with the hardest annotation tasks: thin cloud versus haze distinction, building versus greenhouse confusion in low-resolution imagery, and seasonal variation in agricultural signatures. They also amplify bad annotations: errors in small initial labeled sets propagate systematically during fine-tuning, making the quality of the seed annotation set critical.
Skip AI pre-annotation for datasets under 500 images or highly specialized features (SAR annotation, specific crop varietals, rare infrastructure types) where the model's pre-training distribution provides little useful signal. The time savings evaporate when the model's initial predictions are so poor that correction time approaches manual annotation time.
Quality validation before model deployment - the five checks that prevent training on bad labels
Re-annotation vs augmentation - how to decide
When model accuracy is insufficient after training, teams face a choice: re-annotate the existing dataset or augment it with new data. The decision depends on the source of the error. Re-annotate when: model accuracy falls below 70%, geographic variance is high (domain shift), edge cases fail completely, or training data contains fundamental labeling errors. Augment when: labels are correct but volume is insufficient, class balance requires improvement, or active learning identifies high-uncertainty predictions that need coverage. The false economy is augmenting a dataset with systematic labeling errors: generating more examples of incorrect labels at lower cost produces a larger, more confidently wrong model.
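The decision criteria above can be written down as a small rule function. This is a sketch of one possible encoding, with hypothetical parameter names; only the 70% threshold comes from the text, and re-annotation is given precedence because augmenting on top of bad labels is the failure mode described.

```python
def recommend_action(
    accuracy,
    *,
    domain_shift=False,
    edge_cases_fail=False,
    labeling_errors=False,
    low_volume=False,
    class_imbalance=False,
    high_uncertainty_regions=False,
):
    """Return 're-annotate', 'augment', or 'no action indicated'."""
    needs_reannotation = (
        accuracy < 0.70 or domain_shift or edge_cases_fail or labeling_errors
    )
    if needs_reannotation:
        # Label problems come first: more data on bad labels compounds them.
        return "re-annotate"
    if low_volume or class_imbalance or high_uncertainty_regions:
        return "augment"
    return "no action indicated"

print(recommend_action(0.64))
print(recommend_action(0.83, class_imbalance=True))
print(recommend_action(0.83, class_imbalance=True, labeling_errors=True))
```

The three calls print "re-annotate", "augment", and "re-annotate" respectively; the last case shows label errors overriding an otherwise valid reason to augment.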