Auto-labeling tools achieve 92% accuracy on fully visible objects. On objects that are 60–80% occluded — hidden behind other vehicles, poles, or barriers — that accuracy drops to 38%. This is the gap where human-in-the-loop annotation is not optional. It is the only thing that works.
In real urban traffic, partial occlusion is not an edge case. It is the norm. A pedestrian stepping out from between parked cars is 70–80% occluded until they are in the lane. A motorcycle filtering between two trucks is visible only as a handlebar and a helmet. A child behind a bus is a pair of shoes below the bumper line. These are precisely the scenarios where an ADAS system must perform correctly.
Why occlusion breaks automated annotation
Object detection in urban driving is a solved problem — for visible objects. Modern auto-labeling pipelines using YOLO, Detectron2, or proprietary models from annotation platforms can draw tight bounding boxes around cars, pedestrians, cyclists, and trucks when those objects are clearly visible in the frame. The problem begins the moment an object is partially hidden.
A missed detection at 80% occlusion becomes a safety-critical failure 500 milliseconds later when the object is 40% occluded and the vehicle has not begun braking because the perception system never registered the object's presence. Auto-labeling tools fail on occluded objects for three specific, measurable reasons.
Visible footprint mismatch
The object's visible portion does not match any template in the detection model's training data. A car that is 80% hidden behind a truck presents as a thin vertical slice of door panel and a wheel — the model has learned to recognise "car" as a horizontal rectangle with a roof line, windshield, and four wheels. The visible slice does not trigger a detection.
Ambiguous pixel boundary
The boundary between the occluding object and the occluded object is ambiguous at the pixel level. Where does the truck end and the hidden car begin? The colour similarity between two vehicles of similar shade makes the boundary almost invisible to automated segmentation. A human with scene understanding knows there are two separate objects. The algorithm sees one continuous shape.
Broken temporal continuity
If a tracking algorithm (ByteTrack, DeepSORT) loses track of an object when it becomes occluded, it assigns a new ID when the object re-emerges. The training data now contains two objects — one that disappeared and one that appeared — rather than one continuous object that was temporarily hidden. A model trained on this data learns that objects cease to exist when occluded and new objects appear spontaneously.
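One common mitigation is to re-link fragmented track IDs in post-processing when the gap is short and the object reappears near where it vanished. The sketch below is illustrative only: the function name, thresholds, and data layout are assumptions, not part of ByteTrack or DeepSORT.

```python
def stitch_tracks(tracks, max_gap_frames=15, max_center_shift=60.0):
    """Re-link track fragments separated by a short occlusion gap (sketch).

    tracks: dict of track_id -> list of (frame_idx, (cx, cy)) box centers,
    sorted by frame. Returns a mapping from each original track_id to a
    merged id, so the 'disappeared' and 'reappeared' fragments of one
    occluded object share a single identity in the training data.
    """
    merged = {tid: tid for tid in tracks}
    order = sorted(tracks, key=lambda t: tracks[t][0][0])  # by start frame
    for i, a in enumerate(order):
        end_frame, (ex, ey) = tracks[a][-1]
        for b in order[i + 1:]:
            start_frame, (sx, sy) = tracks[b][0]
            gap = start_frame - end_frame
            shift = ((sx - ex) ** 2 + (sy - ey) ** 2) ** 0.5
            if 0 < gap <= max_gap_frames and shift <= max_center_shift:
                merged[b] = merged[a]  # fragment b continues fragment a
                break
    return merged
```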
The Indian road context multiplies the problem
Standard occlusion datasets — Cityscapes (Germany), nuScenes (Boston/Singapore), KITTI (Karlsruhe) — contain European and American traffic patterns. Occlusion in these datasets is dominated by car-behind-car scenarios at traffic lights and intersections with clear lane structure. Indian urban traffic introduces occlusion patterns that these datasets do not contain.
Auto-rickshaws weaving between cars create three-layer occlusion stacks (rickshaw behind car behind truck) that Western datasets never show. Two-wheelers filtering through traffic at sub-lane widths are 60–90% occluded at any given frame. Pedestrians crossing multi-lane roads without crosswalks create occlusion from unexpected angles. Cattle standing between vehicles create non-rigid occlusion boundaries that differ from vehicle-to-vehicle occlusion.
The approach — dataset and scope
For this methodology demonstration, we used the Cityscapes dataset — 5,000 finely annotated urban driving images from 50 European cities — as the baseline. We selected 500 images containing at least one object with 50% or greater occlusion, categorised by level: 50–60% (moderate), 60–80% (heavy), and 80%+ (extreme). We then added 200 images from Indian urban traffic — sourced from publicly available Indian driving datasets — containing occlusion patterns specific to Indian roads.
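A minimal sketch of how the occlusion bins could be applied when selecting images. Cityscapes does not ship per-object occlusion percentages, so the occlusion fractions here are assumed to come from a separate estimate (for example, visible mask area versus estimated full extent); the function and variable names are illustrative.

```python
def occlusion_bin(occlusion_fraction):
    """Map an object's occlusion fraction (0.0-1.0) to the study's bins."""
    if occlusion_fraction >= 0.80:
        return "extreme"    # 80%+
    if occlusion_fraction >= 0.60:
        return "heavy"      # 60-80%
    if occlusion_fraction >= 0.50:
        return "moderate"   # 50-60%
    return "visible"        # below the 50% selection threshold


def select_images(annotations):
    """Keep images containing at least one object at >= 50% occlusion.

    annotations: dict of image_id -> list of per-object occlusion fractions,
    estimated separately since Cityscapes does not record them directly.
    """
    return [
        img for img, fracs in annotations.items()
        if any(f >= 0.50 for f in fracs)
    ]
```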
Step 1 — Baseline auto-labeling performance
We ran the full 700-image set through three auto-labeling pipelines to establish baseline performance on occluded objects specifically.
The pattern is consistent across all three tools: performance degrades sharply beyond 50% occlusion. SAM2 performs best due to its foundation model architecture, but still fails on nearly half of heavily occluded objects. On Indian road scenes with heavy occlusion, even the best automated tool misses more than half of the objects that a human annotator correctly identifies.
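For reference, the detection rate reported per occlusion bin can be computed as recall on the occluded subset: the fraction of ground-truth occluded objects matched by at least one prediction. The sketch below assumes axis-aligned boxes and an IoU matching threshold of 0.5, which are our assumptions, not details stated above.

```python
def detection_rate(gt_boxes, pred_boxes, iou_thresh=0.5):
    """Fraction of ground-truth occluded objects matched by any prediction.

    gt_boxes, pred_boxes: lists of (x1, y1, x2, y2) tuples. A ground-truth
    box counts as detected if at least one prediction overlaps it at or
    above iou_thresh.
    """
    def iou(a, b):
        x1, y1 = max(a[0], b[0]), max(a[1], b[1])
        x2, y2 = min(a[2], b[2]), min(a[3], b[3])
        inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
        union = ((a[2] - a[0]) * (a[3] - a[1])
                 + (b[2] - b[0]) * (b[3] - b[1]) - inter)
        return inter / union if union else 0.0

    if not gt_boxes:
        return 1.0
    hits = sum(1 for g in gt_boxes
               if any(iou(g, p) >= iou_thresh for p in pred_boxes))
    return hits / len(gt_boxes)
```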
Step 2 — Human-in-the-loop correction protocol
We did not ask annotators to simply "fix the auto-labels." We designed a structured human-in-the-loop protocol with four specific phases.
SAM2 pre-annotation as starting point
Every image was first processed through SAM2 to generate initial segmentation masks. For clearly visible objects, SAM2's masks were used as-is with minor boundary corrections. For occluded objects, SAM2's masks served only as a starting point: annotators knew an object was there but had to determine the correct boundary themselves. This hybrid approach meant annotators spent almost no time on the roughly 97% of clearly visible objects that SAM2 handles well; their effort was concentrated on the occluded objects where human judgment is required.
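A sketch of the routing logic implied by this split. The PreAnnotation fields, the occlusion flag, and the 0.9 confidence cutoff are assumptions for illustration, not SAM2 API output.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class PreAnnotation:
    """One pre-annotation mask plus routing metadata (assumed schema)."""
    mask_id: str
    confidence: float   # model confidence for the proposed mask
    occluded: bool      # flagged by a depth-ordering / overlap heuristic
    needs_review: bool = False


def route(pre_annotations: List[PreAnnotation],
          min_confidence: float = 0.9) -> List[PreAnnotation]:
    """Send only occluded or low-confidence masks to human annotators.

    Clearly visible, high-confidence masks are accepted as-is, so annotator
    time concentrates on the occluded objects.
    """
    for p in pre_annotations:
        p.needs_review = p.occluded or p.confidence < min_confidence
    return pre_annotations
```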
Occlusion-specific annotation guidelines
Standard annotation guidelines say "draw a bounding box around the object." For occluded objects, this instruction is ambiguous. Our guidelines specified precisely: annotate the estimated full extent of the object, including the occluded portion. Use the visible portion as an anchor and extend the boundary based on contextual cues — the angle of the visible roofline, the position of visible wheels, the expected size of the vehicle class. Mark the occlusion boundary separately so the training data distinguishes between the visible portion and the inferred portion.
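One way to represent such a label record is sketched below. The field names and polygon encoding are illustrative, not the exact schema used in the project; the point is that the visible boundary, the inferred full extent, and the occlusion boundary are stored as separate fields.

```python
from dataclasses import dataclass
from typing import List, Tuple

Polygon = List[Tuple[float, float]]


@dataclass
class OccludedObjectLabel:
    """Annotation record separating visible and inferred extents (assumed format)."""
    object_class: str            # car / pedestrian / cyclist / truck / other
    visible_polygon: Polygon     # traces only the pixels actually visible
    full_extent_box: Tuple[float, float, float, float]  # estimated amodal box
    occlusion_boundary: Polygon  # where the occluder cuts across the object
    occlusion_fraction: float    # 0.0 (fully visible) to 1.0 (fully hidden)
    occluder_class: str          # what is doing the hiding, e.g. "truck"
```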
Annotator calibration on occlusion tasks specifically
Before any live annotation, all annotators completed a 30-task calibration set consisting exclusively of heavily occluded objects. Each annotator independently annotated the same 30 images. We then measured Cohen's kappa on three dimensions: object detection (did they find the occluded object?), class assignment (car / pedestrian / cyclist / other?), and boundary accuracy (does the segmentation mask accurately trace the visible boundary and the estimated full extent?).
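Cohen's kappa for the detection and classification dimensions can be computed directly with scikit-learn. The annotator responses below are illustrative values, not calibration data from the project; with more than two annotators, kappa is computed pairwise.

```python
from sklearn.metrics import cohen_kappa_score

# Per-task detection outcomes for two annotators on the calibration set:
# 1 = found the occluded object, 0 = missed it. Values are illustrative.
annotator_a = [1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1, 0, 1, 1, 1]
annotator_b = [1, 1, 0, 1, 0, 1, 0, 1, 1, 1, 1, 1, 1, 1, 1]
print(f"detection kappa: {cohen_kappa_score(annotator_a, annotator_b):.2f}")

# Class assignment uses the same function on categorical labels.
classes_a = ["car", "car", "pedestrian", "cyclist", "car", "truck"]
classes_b = ["car", "car", "pedestrian", "cyclist", "truck", "truck"]
print(f"classification kappa: {cohen_kappa_score(classes_a, classes_b):.2f}")
```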
Three-tier QA with occlusion-specific checks
Tier 1, automated checks (a code sketch follows this list): Did the annotator mark any occluded objects in a scene that statistically should contain them? Is the estimated full-extent bounding box physically plausible? Are the occlusion boundaries consistent with the depth ordering of the scene?
Tier 2, peer review: A second annotator reviewed all heavy and extreme occlusion labels. Disagreements were adjudicated by a third annotator.
Tier 3, expert review: Our senior ML engineer reviewed all images where the auto-labeling tool detected zero objects but the human annotator found one or more; these were the highest-value annotations.
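A sketch of what the Tier 1 automated checks might look like in code. It reuses the assumed label record fields from the schema sketch above; the size priors, margins, and check names are illustrative.

```python
def tier1_checks(label, image_width, image_height, class_size_priors):
    """Automated plausibility checks run before peer review (illustrative).

    class_size_priors: assumed dict of class -> (min_area, max_area) in
    pixels, derived from the fully visible objects in the dataset.
    Returns a list of issue strings; an empty list means the label passes.
    """
    issues = []
    x1, y1, x2, y2 = label.full_extent_box

    # The estimated full extent must be a valid box roughly inside the frame.
    if x2 <= x1 or y2 <= y1:
        issues.append("degenerate full-extent box")
    if x1 < -0.25 * image_width or x2 > 1.25 * image_width:
        issues.append("full-extent box far outside the frame")

    # The amodal box must be physically plausible for the assigned class.
    area = (x2 - x1) * (y2 - y1)
    lo, hi = class_size_priors.get(label.object_class, (0, float("inf")))
    if not lo <= area <= hi:
        issues.append("implausible size for class")

    # The recorded occlusion fraction must be a valid proportion.
    if not 0.0 <= label.occlusion_fraction <= 1.0:
        issues.append("occlusion fraction out of range")

    return issues
```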
The calibration thresholds required before live annotation were:
Detection kappa ≥ 0.85 (binary: detected or missed)
Classification kappa ≥ 0.90 (car / pedestrian / cyclist / truck)
Boundary IoU ≥ 0.72 (intersection over union on mask)
The detection and classification thresholds are high because those judgments are less subjective: either you found the object or you did not. The boundary IoU threshold is lower because estimating the boundary of a heavily occluded object inherently involves judgment about its full extent.
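The boundary IoU itself is a straightforward mask overlap measure; a minimal NumPy sketch, with the empty-union convention as an assumption:

```python
import numpy as np


def mask_iou(mask_a: np.ndarray, mask_b: np.ndarray) -> float:
    """Intersection over union of two boolean segmentation masks."""
    intersection = np.logical_and(mask_a, mask_b).sum()
    union = np.logical_or(mask_a, mask_b).sum()
    return float(intersection) / float(union) if union else 1.0
```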
The results
Detection rate improvement
The improvement is most dramatic in the range that matters most for safety — 80%+ occlusion, where auto-labeling catches only 31% of objects but human annotators catch 87%. This is not a marginal improvement. It is the difference between a perception system that sees the pedestrian stepping out from behind a parked car and one that does not.
Boundary accuracy improvement
Cohen's kappa (detection): 0.88
Cohen's kappa (classification): 0.93
Mean boundary IoU (all occlusion levels): 0.84
Gold-standard accuracy: 94%
Annotation throughput with SAM2 pre-annotation: 42 images/hour
Annotation throughput without SAM2 pre-annotation: ~18 images/hour
SAM2 speed advantage: 2.3× faster than fully manual
Downstream model impact
To validate that the improved annotation quality translates to model performance, we trained two identical Mask R-CNN models — one on auto-labeled data alone and one on the human-in-the-loop corrected data.
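A sketch of how a per-occlusion-bin comparison could be run with pycocotools, restricting evaluation to the images in one bin. Grouping by image is an approximation of the per-object breakdown reported below, and the bin split itself is assumed to come from the occlusion fractions recorded during annotation.

```python
from pycocotools.coco import COCO
from pycocotools.cocoeval import COCOeval


def map_for_occlusion_bin(gt_json, det_json, image_ids):
    """Mask AP restricted to the images of one occlusion bin (sketch).

    gt_json: COCO-format ground truth; det_json: model detections in COCO
    results format; image_ids: images assigned to this occlusion bin.
    """
    coco_gt = COCO(gt_json)
    coco_dt = coco_gt.loadRes(det_json)
    ev = COCOeval(coco_gt, coco_dt, iouType="segm")
    ev.params.imgIds = image_ids
    ev.evaluate()
    ev.accumulate()
    ev.summarize()
    return ev.stats[0]  # AP averaged over IoU 0.50:0.95
```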
An 8% improvement on overall mAP is solid but modest. The 38-percentage-point improvement on 80%+ occluded objects, from 0.14 (essentially random) to 0.52 (usable in production), is the metric that determines whether an ADAS system is safe for deployment. Because the two models were identical in architecture and training, that gain is attributable entirely to annotation quality.