The legal domain has a fundamental AI annotation problem. Legal text is dense with ambiguity — jurisdiction matters, precedent matters, the difference between "shall" and "may" matters. Generic annotators, even well-educated ones, cannot reliably annotate legal documents without domain training. Yet most legal AI training datasets are produced by exactly these annotators.
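To make the "shall" versus "may" point concrete, here is a minimal, hypothetical labeling example: two clauses that differ by a single modal verb and therefore deserve different labels. The clause texts and label names are illustrative assumptions, not drawn from any real annotation schema.

```python
# Hypothetical clause/label pairs (illustrative only): one modal verb
# flips an otherwise identical clause from a binding duty to an option.
clauses = [
    # "shall" imposes a binding obligation on the tenant
    ("The tenant shall maintain the premises in good repair.", "OBLIGATION"),
    # "may" merely grants a discretionary right, same wording otherwise
    ("The tenant may maintain the premises in good repair.", "DISCRETION"),
]

for text, label in clauses:
    print(f"{label:<10} {text}")
```

An annotator without legal training will often collapse both clauses into a single "maintenance" label, and a model trained on that data inherits the error.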
The result: models that confidently misclassify contract clauses, hallucinate case citations, confuse regulatory frameworks across jurisdictions, and apply common law precedent to civil law contexts. These are not model architecture failures. They are training data failures — annotation by people who did not know what they were reading.
India's legal AI landscape is particularly complex. The country has 28 states and eight union territories, multiple court hierarchies, and overlapping regulatory frameworks across SEBI, RBI, IRDAI, and MCA, all layered over a legal system that spans English common law heritage, Indian constitutional law, and personal law statutes. Legal annotation for Indian AI therefore requires annotators who are practising Indian lawyers, not just law graduates.
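One way to make that requirement operational is to carry jurisdiction and regulator context on every annotation record, so a label is never divorced from the legal framework it was assigned under. The sketch below is a hypothetical schema, assuming a clause-classification task; all field names, label strings, and the example values are assumptions, not an existing standard.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class LegalAnnotation:
    """Hypothetical annotation record; every field name is illustrative."""
    doc_id: str
    clause_text: str
    label: str                       # e.g. "DISCLOSURE_OBLIGATION"
    jurisdiction: str                # e.g. a state, or "Union of India"
    court_level: Optional[str]       # e.g. "High Court", if case law
    regulator: Optional[str]         # e.g. "SEBI", "RBI", "IRDAI", "MCA"
    legal_tradition: str             # e.g. "common_law", "personal_law"
    annotator_bar_id: Optional[str]  # ties the label to a practising lawyer


# A record lacking jurisdiction or regulator context is exactly the kind
# of training example that produces the cross-framework confusions above.
record = LegalAnnotation(
    doc_id="hypothetical-001",
    clause_text="The issuer shall disclose material changes within 24 hours.",
    label="DISCLOSURE_OBLIGATION",
    jurisdiction="Union of India",
    court_level=None,
    regulator="SEBI",
    legal_tradition="common_law",
    annotator_bar_id="BAR/MH/12345",
)
print(record)
```

The design choice worth noting is the `annotator_bar_id` field: making annotator credentials a first-class part of the record lets dataset audits check that jurisdiction-sensitive labels actually came from qualified annotators.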