Consider annotating a pedestrian in a busy street scene. In frame 1, the pedestrian is fully visible and easy to box. By frame 47, they are partially occluded by a parked car. By frame 68, they briefly leave the frame and re-enter from behind the car. By frame 94, they overlap with another pedestrian for 12 frames. A naive tracking system will either lose the ID during these events or swap IDs with the other pedestrian. Both errors corrupt your training data.
Our video annotation pipeline uses ByteTrack for initial multi-object tracking (a strong choice for maintaining IDs through occlusions), combined with temporal interpolation between human-annotated keyframes (typically 1 keyframe every 10–30 frames, depending on scene complexity). Rather than manually annotating every frame, human annotators focus their effort on keyframes and occlusion events, where tracking is hardest and errors are most consequential.
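To make the keyframe idea concrete, here is a minimal sketch of filling in boxes between two human-annotated keyframes. It assumes simple linear interpolation of box corners, which is an illustrative choice and not necessarily the method our pipeline uses. The `Box` type and the `interpolate_boxes` function are hypothetical names introduced for this example.

```python
from typing import Dict, Tuple

# Hypothetical box format for this sketch: (x_min, y_min, x_max, y_max) in pixels.
Box = Tuple[float, float, float, float]


def interpolate_boxes(keyframes: Dict[int, Box]) -> Dict[int, Box]:
    """Fill every frame between consecutive keyframes by linear interpolation.

    `keyframes` maps a frame index to the human-annotated box for one track.
    Assumption: the object moves roughly linearly between keyframes.
    """
    if not keyframes:
        return {}

    frames = sorted(keyframes)
    result: Dict[int, Box] = {frames[0]: keyframes[frames[0]]}

    for start, end in zip(frames, frames[1:]):
        box_a, box_b = keyframes[start], keyframes[end]
        span = end - start
        for frame in range(start + 1, end + 1):
            t = (frame - start) / span  # 0 at the start keyframe, 1 at the end
            result[frame] = tuple(a + t * (b - a) for a, b in zip(box_a, box_b))

    return result


# Usage: two keyframes 10 frames apart; frames 1 to 9 are filled in.
track = interpolate_boxes({0: (0, 0, 10, 10), 10: (10, 20, 20, 30)})
print(track[5])
```

Linear interpolation tends to break down exactly where the paragraph above says human effort belongs: during occlusions, re-entries, and overlaps, motion between keyframes is rarely linear, which is one reason to place keyframes more densely in those segments.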
For India-specific autonomous vehicle data, our annotators are familiar with the particular challenges of Indian traffic: two-wheelers weaving between lanes, auto-rickshaws making unexpected turns, pedestrians crossing at unstructured intersections, cattle on roads, and the distinctive right-of-way patterns that Western-trained models do not understand. This domain familiarity is the difference between annotation that actually represents the scene and annotation that imposes Western traffic patterns onto Indian footage.