Consider annotating a pedestrian in a busy street scene. In frame 1, the pedestrian is fully visible and easy to box. By frame 47, they are partially occluded by a parked car. By frame 68, they briefly leave the frame and re-enter from behind the car. By frame 94, they overlap with another pedestrian for 12 frames. A naive tracker will either lose the ID during these events or swap it with the other pedestrian, and either error corrupts your training data.
Our video annotation pipeline uses ByteTrack for initial multi-object tracking (state-of-the-art at maintaining IDs through occlusions, largely because it associates low-confidence detections instead of discarding them), combined with temporal interpolation between human-annotated keyframes (typically one keyframe every 10–30 frames, depending on scene complexity). Human annotators focus their effort on keyframes and occlusion events, where tracking is hardest and most consequential, rather than manually annotating every frame.
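The interpolation step can be sketched in a few lines. This is a minimal illustration, not our production code: it assumes axis-aligned boxes stored as `(x1, y1, x2, y2)` tuples and simple linear motion between keyframes, and the function names are hypothetical.

```python
def lerp_box(f0, box0, f1, box1, f):
    """Linearly interpolate a box (x1, y1, x2, y2) at frame f,
    where f0 <= f <= f1 are human-annotated keyframes."""
    t = (f - f0) / (f1 - f0)
    return tuple(a + t * (b - a) for a, b in zip(box0, box1))

def interpolate_track(keyframes):
    """Fill every frame of one track from its keyframes.

    keyframes: {frame_index: (x1, y1, x2, y2)} placed by annotators.
    Returns a dict with a box for every frame in the keyframe range.
    """
    frames = sorted(keyframes)
    filled = {}
    for f0, f1 in zip(frames, frames[1:]):
        for f in range(f0, f1):
            filled[f] = lerp_box(f0, keyframes[f0], f1, keyframes[f1], f)
    filled[frames[-1]] = keyframes[frames[-1]]  # last keyframe itself
    return filled
```

Linear interpolation is only safe when the keyframe spacing matches the motion: a pedestrian walking in a straight line tolerates 30-frame gaps, while the occlusion and re-entry events above need keyframes placed directly on the event, which is exactly where annotator effort is spent.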
For Indian-specific autonomous vehicle data, we have annotators who are familiar with the specific challenges of Indian traffic: two-wheelers weaving between lanes, auto-rickshaws making unexpected turns, pedestrians crossing at unstructured intersections, cattle on roads, and the distinctive right-of-way patterns that Western-trained models do not understand. This domain familiarity is the difference between annotation that actually represents the scene and annotation that imposes Western traffic patterns onto Indian footage.