Auto-labeling tools achieve 92% accuracy on fully visible objects. On objects that are 60–80% occluded — hidden behind other vehicles, poles, or barriers — that accuracy drops to 38%. This is the gap where human-in-the-loop annotation is not optional. It is the only thing that works.
In real urban traffic, partial occlusion is not an edge case. It is the norm. A pedestrian stepping out from between parked cars is 70–80% occluded until they are in the lane. A motorcycle filtering between two trucks is visible only as a handlebar and a helmet. A child behind a bus is a pair of shoes below the bumper line. These are precisely the scenarios where an ADAS system must perform correctly.
Why occlusion breaks automated annotation
Object detection in urban driving is a solved problem — for visible objects. Modern auto-labeling pipelines using YOLO, Detectron2, or proprietary models from annotation platforms can draw tight bounding boxes around cars, pedestrians, cyclists, and trucks when those objects are clearly visible in the frame. The problem begins the moment an object is partially hidden.
A missed detection at 80% occlusion becomes a safety-critical failure 500 milliseconds later when the object is 40% occluded and the vehicle has not begun braking because the perception system never registered the object's presence. Auto-labeling tools fail on occluded objects for three specific, measurable reasons.
Visible footprint mismatch
The object's visible portion does not match any template in the detection model's training data. A car that is 80% hidden behind a truck presents as a thin vertical slice of door panel and a wheel — the model has learned to recognise "car" as a horizontal rectangle with a roof line, windshield, and four wheels. The visible slice does not trigger a detection.
Ambiguous pixel boundary
The boundary between the occluding object and the occluded object is ambiguous at the pixel level. Where does the truck end and the hidden car begin? The colour similarity between two vehicles of similar shade makes the boundary almost invisible to automated segmentation. A human with scene understanding knows there are two separate objects. The algorithm sees one continuous shape.
Broken temporal continuity
If a tracking algorithm (ByteTrack, DeepSORT) loses track of an object when it becomes occluded, it assigns a new ID when the object re-emerges. The training data now contains two objects — one that disappeared and one that appeared — rather than one continuous object that was temporarily hidden. A model trained on this data learns that objects cease to exist when occluded and new objects appear spontaneously.
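One common mitigation is to re-link fragmented track IDs in post-processing when the gap is short and the object reappears near where it vanished. The sketch below is illustrative only: the function name, thresholds, and data layout are assumptions, not part of ByteTrack or DeepSORT.

```python
def stitch_tracks(tracks, max_gap_frames=15, max_center_shift=60.0):
    """Re-link track fragments separated by a short occlusion gap (sketch).

    tracks: dict of track_id -> list of (frame_idx, (cx, cy)) box centers,
    sorted by frame. Returns a mapping from each original track_id to a
    merged id, so the 'disappeared' and 'reappeared' fragments of one
    occluded object share a single identity in the training data.
    """
    merged = {tid: tid for tid in tracks}
    order = sorted(tracks, key=lambda t: tracks[t][0][0])  # by start frame
    for i, a in enumerate(order):
        end_frame, (ex, ey) = tracks[a][-1]
        for b in order[i + 1:]:
            start_frame, (sx, sy) = tracks[b][0]
            gap = start_frame - end_frame
            shift = ((sx - ex) ** 2 + (sy - ey) ** 2) ** 0.5
            if 0 < gap <= max_gap_frames and shift <= max_center_shift:
                merged[b] = merged[a]  # fragment b continues fragment a
                break
    return merged
```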
The Indian road context multiplies the problem
Standard occlusion datasets — Cityscapes (Germany), nuScenes (Boston/Singapore), KITTI (Karlsruhe) — contain European and American traffic patterns. Occlusion in these datasets is dominated by car-behind-car scenarios at traffic lights and intersections with clear lane structure. Indian urban traffic introduces occlusion patterns that these datasets do not contain.
Auto-rickshaws weaving between cars create three-layer occlusion stacks (rickshaw behind car behind truck) that Western datasets never show. Two-wheelers filtering through traffic at sub-lane widths are 60–90% occluded at any given frame. Pedestrians crossing multi-lane roads without crosswalks create occlusion from unexpected angles. Cattle standing between vehicles create non-rigid occlusion boundaries that differ from vehicle-to-vehicle occlusion.
The approach — dataset and scope
For this methodology demonstration, we used the Cityscapes dataset — 5,000 finely annotated urban driving images from 50 European cities — as the baseline. We selected 500 images containing at least one object with 50% or greater occlusion, categorised by level: 50–60% (moderate), 60–80% (heavy), and 80%+ (extreme). We then added 200 images from Indian urban traffic — sourced from publicly available Indian driving datasets — containing occlusion patterns specific to Indian roads.
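A minimal sketch of how the occlusion bins could be applied when selecting images. Cityscapes does not ship per-object occlusion percentages, so the occlusion fractions here are assumed to come from a separate estimate (for example, visible mask area versus estimated full extent); the function and variable names are illustrative.

```python
def occlusion_bin(occlusion_fraction):
    """Map an object's occlusion fraction (0.0-1.0) to the study's bins."""
    if occlusion_fraction >= 0.80:
        return "extreme"    # 80%+
    if occlusion_fraction >= 0.60:
        return "heavy"      # 60-80%
    if occlusion_fraction >= 0.50:
        return "moderate"   # 50-60%
    return "visible"        # below the 50% selection threshold


def select_images(annotations):
    """Keep images containing at least one object at >= 50% occlusion.

    annotations: dict of image_id -> list of per-object occlusion fractions,
    estimated separately since Cityscapes does not record them directly.
    """
    return [
        img for img, fracs in annotations.items()
        if any(f >= 0.50 for f in fracs)
    ]
```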
Step 1 — Baseline auto-labeling performance
We ran the full 700-image set through three auto-labeling pipelines to establish baseline performance on occluded objects specifically.
The pattern is consistent across all three tools: performance degrades sharply beyond 50% occlusion. SAM2 performs best due to its foundation model architecture, but still fails on nearly half of heavily occluded objects. On Indian road scenes with heavy occlusion, even the best automated tool misses more than half of the objects that a human annotator correctly identifies.
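For reference, the detection rate reported per occlusion bin can be computed as recall on the occluded subset: the fraction of ground-truth occluded objects matched by at least one prediction. The sketch below assumes axis-aligned boxes and an IoU matching threshold of 0.5, which are our assumptions, not details stated above.

```python
def detection_rate(gt_boxes, pred_boxes, iou_thresh=0.5):
    """Fraction of ground-truth occluded objects matched by any prediction.

    gt_boxes, pred_boxes: lists of (x1, y1, x2, y2) tuples. A ground-truth
    box counts as detected if at least one prediction overlaps it at or
    above iou_thresh.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    if not gt_boxes:
        return 1.0
    hits = sum(1 for g in gt_boxes
               if any(iou(g, p) >= iou_thresh for p in pred_boxes))
    return hits / len(gt_boxes)
```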
Step 2 — Human-in-the-loop correction protocol
We did not ask annotators to simply "fix the auto-labels." We designed a structured human-in-the-loop protocol with four specific phases.
SAM2 pre-annotation as starting point
Every image was first processed through SAM2 to generate initial segmentation masks. For clearly visible objects, SAM2's masks were used as-is with minor boundary corrections. For occluded objects, SAM2's masks served only as a starting point: annotators knew an object was there but had to determine the correct boundary themselves. This hybrid approach meant annotators spent almost no time on the roughly 97% of clearly visible objects that SAM2 handles well; their effort was concentrated on the occluded objects where human judgment is required.
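A sketch of the routing logic implied by this split. The PreAnnotation fields, the occlusion flag, and the 0.9 confidence cutoff are assumptions for illustration, not SAM2 API output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreAnnotation:
    """One pre-annotation mask plus routing metadata (assumed schema)."""
    mask_id: str
    confidence: float   # model confidence for the proposed mask
    occluded: bool      # flagged by a depth-ordering / overlap heuristic
    needs_review: bool = False


def route(pre_annotations: List[PreAnnotation],
          min_confidence: float = 0.9) -> List[PreAnnotation]:
    """Send only occluded or low-confidence masks to human annotators.

    Clearly visible, high-confidence masks are accepted as-is, so annotator
    time concentrates on the occluded objects.
    """
    for p in pre_annotations:
        p.needs_review = p.occluded or p.confidence < min_confidence
    return pre_annotations
```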
Occlusion-specific annotation guidelines
Standard annotation guidelines say "draw a bounding box around the object." For occluded objects, this instruction is ambiguous. Our guidelines specified precisely: annotate the estimated full extent of the object, including the occluded portion. Use the visible portion as an anchor and extend the boundary based on contextual cues — the angle of the visible roofline, the position of visible wheels, the expected size of the vehicle class. Mark the occlusion boundary separately so the training data distinguishes between the visible portion and the inferred portion.
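One way to represent such a label record is sketched below. The field names and polygon encoding are illustrative, not the exact schema used in the project; the point is that the visible boundary, the inferred full extent, and the occlusion boundary are stored as separate fields.

```python
from dataclasses import dataclass
from typing import List, Tuple

Polygon = List[Tuple[float, float]]


@dataclass
class OccludedObjectLabel:
    """Annotation record separating visible and inferred extents (assumed format)."""
    object_class: str            # car / pedestrian / cyclist / truck / other
    visible_polygon: Polygon     # traces only the pixels actually visible
    full_extent_box: Tuple[float, float, float, float]  # estimated amodal box
    occlusion_boundary: Polygon  # where the occluder cuts across the object
    occlusion_fraction: float    # 0.0 (fully visible) to 1.0 (fully hidden)
    occluder_class: str          # what is doing the hiding, e.g. "truck"
```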
Annotator calibration on occlusion tasks specifically
Before any live annotation, all annotators completed a 30-task calibration set consisting exclusively of heavily occluded objects. Each annotator independently annotated the same 30 images. We then measured Cohen's kappa on three dimensions: object detection (did they find the occluded object?), class assignment (car / pedestrian / cyclist / other?), and boundary accuracy (does the segmentation mask accurately trace the visible boundary and the estimated full extent?).
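Cohen's kappa for the detection and classification dimensions can be computed directly with scikit-learn. The annotator responses below are illustrative values, not calibration data from the project; with more than two annotators, kappa is computed pairwise.

```python
from sklearn.metrics import cohen_kappa_score

# Per-task detection outcomes for two annotators on the calibration set:
# 1 = found the occluded object, 0 = missed it. Values are illustrative.
annotator_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
annotator_b = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(f"detection kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Class assignment uses the same function on categorical labels.
classes_a = ["car", "car", "pedestrian", "cyclist", "car", "truck"]
classes_b = ["car", "car", "pedestrian", "cyclist", "truck", "truck"]
print(f"classification kappa: {cohen_kappa_score(classes_a, classes_b):.2f}")
```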
Three-tier QA with occlusion-specific checks
Tier 1, automated checks (a code sketch follows this list): Did the annotator mark any occluded objects in a scene that statistically should contain them? Is the estimated full-extent bounding box physically plausible? Are the occlusion boundaries consistent with the depth ordering of the scene?
Tier 2, peer review: A second annotator reviewed all heavy and extreme occlusion labels. Disagreements were adjudicated by a third annotator.
Tier 3, expert review: Our senior ML engineer reviewed all images where the auto-labeling tool detected zero objects but the human annotator found one or more; these were the highest-value annotations.
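A sketch of what the Tier 1 automated checks might look like in code. It reuses the assumed label record fields from the schema sketch above; the size priors, margins, and check names are illustrative.

```python
def tier1_checks(label, image_width, image_height, class_size_priors):
    """Automated plausibility checks run before peer review (illustrative).

    class_size_priors: assumed dict of class -> (min_area, max_area) in
    pixels, derived from the fully visible objects in the dataset.
    Returns a list of issue strings; an empty list means the label passes.
    """
    issues = []
    x1, y1, x2, y2 = label.full_extent_box

    # The estimated full extent must be a valid box roughly inside the frame.
    if x2 <= x1 or y2 <= y1:
        issues.append("degenerate full-extent box")
    if x1 < -0.25 * image_width or x2 > 1.25 * image_width:
        issues.append("full-extent box far outside the frame")

    # The amodal box must be physically plausible for the assigned class.
    area = (x2 - x1) * (y2 - y1)
    lo, hi = class_size_priors.get(label.object_class, (0, float("inf")))
    if not lo <= area <= hi:
        issues.append("implausible size for class")

    # The recorded occlusion fraction must be a valid proportion.
    if not 0.0 <= label.occlusion_fraction <= 1.0:
        issues.append("occlusion fraction out of range")

    return issues
```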
The calibration thresholds required before live annotation were:
Detection kappa ≥ 0.85 (binary: detected or missed)
Classification kappa ≥ 0.90 (car / pedestrian / cyclist / truck)
Boundary IoU ≥ 0.72 (intersection over union on mask)
The detection and classification thresholds are high because those judgments are less subjective: either you found the object or you did not. The boundary IoU threshold is lower because estimating the boundary of a heavily occluded object inherently involves judgment about its full extent.
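The boundary IoU itself is a straightforward mask overlap measure; a minimal NumPy sketch, with the empty-union convention as an assumption:

```python
import numpy as np


def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two boolean segmentation masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union else 1.0
```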
The results
Detection rate improvement
The improvement is most dramatic in the range that matters most for safety — 80%+ occlusion, where auto-labeling catches only 31% of objects but human annotators catch 87%. This is not a marginal improvement. It is the difference between a perception system that sees the pedestrian stepping out from behind a parked car and one that does not.
Boundary accuracy improvement
Cohen's kappa (detection): 0.88
Cohen's kappa (classification): 0.93
Mean boundary IoU (all occlusion levels): 0.84
Gold-standard accuracy: 94%
Annotation throughput with SAM2 pre-annotation: 42 images/hour
Annotation throughput without SAM2 pre-annotation: ~18 images/hour
SAM2 speed advantage: 2.3× faster than fully manual
Downstream model impact
To validate that the improved annotation quality translates to model performance, we trained two identical Mask R-CNN models — one on auto-labeled data alone and one on the human-in-the-loop corrected data.
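A sketch of how a per-occlusion-bin comparison could be run with pycocotools, restricting evaluation to the images in one bin. Grouping by image is an approximation of the per-object breakdown reported below, and the bin split itself is assumed to come from the occlusion fractions recorded during annotation.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def map_for_occlusion_bin(gt_json, det_json, image_ids):
    """Mask AP restricted to the images of one occlusion bin (sketch).

    gt_json: COCO-format ground truth; det_json: model detections in COCO
    results format; image_ids: images assigned to this occlusion bin.
    """
    coco_gt = COCO(gt_json)
    coco_dt = coco_gt.loadRes(det_json)
    ev = COCOeval(coco_gt, coco_dt, iouType="segm")
    ev.params.imgIds = image_ids
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[0]  # AP averaged over IoU 0.50:0.95
```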
An 8% improvement on overall mAP is solid but modest. The 38-percentage-point improvement on 80%+ occluded objects, from 0.14 (essentially random) to 0.52 (usable in production), is the metric that determines whether an ADAS system is safe for deployment. Because the two models were identical in architecture and training, that gain is attributable entirely to annotation quality.