🔄 Continuous Model Quality Retainer

Your model stays aligned as the world changes

A standing team of expert annotators permanently assigned to your model — evaluating live outputs weekly, detecting drift monthly, and delivering a curated retraining batch before problems become visible to users.

500–2K: live outputs evaluated per week
48h: priority turnaround on urgent drift alerts
Monthly: curated retraining batch delivered
SLA: contractually backed quality guarantees
[Dashboard mockup: IAA κ tracking, drift signal detection, eval volume, error rate KPI, quality trend (Wk 5–12), monthly batch delivered · 2,400]
Continuous Monitoring

Weekly signals. Monthly retraining. Zero surprises.

Your dedicated team evaluates 500–2,000 live outputs every week, tracks quality trends across sprints, and delivers a curated retraining batch — before drift becomes visible to your users.

Get a Retainer Quote →
Live Quality Dashboard

Model Quality Monitoring Interface

Your retainer team surfaces this dashboard every Friday — showing inter-annotator agreement, output quality trends, drift signals, and retraining batch status.

ConcaveLabel Quality Monitor · Model: HealthBot v3.1 · Week 12 of Retainer · Sprint Summary
Cohen's κ (IAA): 0.87 (↑ +0.03 vs last week)
Output Quality Score: 4.6/5 (↑ +0.2 vs baseline)
Drift Alert Status: MEDIUM (medical queries drifting)
Retraining Batch: 2,400 (✓ delivered this month)
WEEKLY QUALITY SCORE TREND (last 8 weeks)
Week 5: 2.6 · Week 6: 3.2 · Week 7: 3.6 · Week 8: 3.8 · Week 9: 4.1 · Week 10: 4.3 · Week 11: 4.4 · Week 12: 4.6
THIS WEEK'S ALERTS
Medical dosage queries: 14% error rate spike vs 6% baseline — flag for retraining priority
Appointment booking category: 4.9/5 quality — stable, no intervention needed
IAA κ improved to 0.87 following calibration session on Week 11
Monthly retraining batch (2,400 samples) delivered — model update scheduled
The Problem

Your model was great at launch.
Then user behaviour changed.

Every language model degrades over time. Not because the model changes — it doesn't — but because the world does. New slang enters the language. Policy topics evolve. Your users start asking questions your training data never anticipated. Edge cases pile up.

Most AI teams catch this late: a spike in negative feedback, a customer complaint that goes viral, a safety incident. By that point, you need a fast retraining scramble under pressure — expensive, disruptive, and rushed.

The alternative is a standing quality retainer: a dedicated team that monitors your live model every week, surfaces drift before it becomes visible, and delivers a clean retraining batch every month — so your model improves continuously rather than degrades silently.

Think of it as a quality engineering function embedded in your model pipeline, without the overhead of hiring, managing, and retaining annotation staff internally.

Without a retainer

Drift discovered via user complaints. Emergency retraining cycle at high cost. No systematic coverage of what went wrong or why. Model quality lurches rather than improves steadily. Safety incidents handled reactively.

With a Concave AI retainer

Drift detected in weekly evaluation. Corrective data already being prepared. Monthly retraining batch delivers targeted improvement. Safety issues surfaced before users see them. Model quality improves quarter-over-quarter, measurably.

What's Included

Six deliverables, every month

Every retainer engagement includes a fixed set of deliverables on a weekly and monthly cadence. Nothing is ad hoc — you know exactly what you're getting.

📊 Weekly Quality Report
Detailed breakdown of 500–2,000 live outputs evaluated that week. Includes error rate by category, any anomalies flagged, inter-annotator agreement score, and comparison against your baseline from onboarding. Delivered every Friday.
🚨 Drift Alerts
When any quality dimension drops more than 8% week-over-week, you receive a same-day alert with specific examples, a severity classification, and a recommended corrective action. 48-hour response SLA for P1 drift events.
📦 Monthly Retraining Batch
200–1,000 curated annotation pairs specifically targeting your model's weakest areas identified that month. Includes preference pairs (for RLHF), corrective examples (for SFT), or flagged-and-corrected outputs — whichever format your pipeline requires.
🎯 Monthly Quality Digest
Executive-readable summary of the month's quality performance: trend lines, top failure categories, drift events and resolutions, retraining batch composition, and a recommended priority for the next month's evaluation focus.
📈 Quarterly Benchmark Review
Every 90 days, we run a full benchmark against your agreed-upon evaluation set and provide a longitudinal quality trend report. Tracks whether retraining batches are producing measurable improvement and recalibrates the evaluation rubric if needed.
🔒 Dedicated Annotator Team
A fixed team of 3–8 domain-expert annotators assigned exclusively to your model. They accumulate deep familiarity with your model's specific failure modes, style, and domain over time — eliminating the cold-start quality dip every new annotator batch introduces.
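To make the drift-alert rule concrete, here is a minimal sketch of the "more than 8% week-over-week" check described under Drift Alerts. The dimension names and scores below are illustrative, not part of any real pipeline.

```python
# Illustrative sketch only — not Concave AI's actual alerting code.
# Flags a drift alert when any quality dimension drops more than 8%
# relative, week-over-week.

DRIFT_THRESHOLD = 0.08  # 8% relative week-over-week drop

def drift_alerts(last_week: dict, this_week: dict) -> list:
    """Return the quality dimensions whose score fell by more than 8%."""
    alerts = []
    for dimension, previous in last_week.items():
        current = this_week.get(dimension)
        if current is None or previous == 0:
            continue  # no comparable score for this dimension
        relative_drop = (previous - current) / previous
        if relative_drop > DRIFT_THRESHOLD:
            alerts.append(dimension)
    return alerts

# Hypothetical weekly scores (0–1 scale):
scores_wk11 = {"factual_accuracy": 0.89, "medical_accuracy": 0.84, "safety": 0.97}
scores_wk12 = {"factual_accuracy": 0.88, "medical_accuracy": 0.74, "safety": 0.97}
print(drift_alerts(scores_wk11, scores_wk12))  # -> ['medical_accuracy']
```

A medium-severity drop like the medical-accuracy dip above is exactly what triggers the same-day alert and the 48-hour corrective SLA.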
Delivery Cadence

What happens every week

A structured, predictable rhythm so your team always knows what's coming and when.

Monday

Live output sample extraction

We pull the agreed sample of your model's live outputs from the previous week — via API, log export, or shared storage. Format agreed at onboarding. No manual work required from your team.

Automated ingestion · Encrypted transfer
Mon–Thu

Expert evaluation & annotation

Your dedicated annotator team evaluates each output against your agreed rubric. Flagged outputs are escalated to senior review. Anomalies trigger a same-day alert. Gold standard tasks injected for calibration throughout.

Rubric-based scoring · Senior escalation · Gold injection
Thursday

QA review & anomaly resolution

Internal QA pass on the week's annotations. Any inter-annotator disagreements resolved. Drift alerts drafted if triggered. Data verified against previous week's baseline for trend analysis.

Kappa verification · Drift comparison
Friday

Weekly quality report delivered

Structured report delivered to your designated contact: error rate by category, kappa scores, drift indicators, top failure examples (anonymised where needed), and a brief commentary from our ML lead.

PDF + JSON delivery · Slack ping optional
Month-end

Retraining batch + monthly digest

Curated retraining batch assembled from the month's flagged outputs. Monthly digest written. Batch delivered in your preferred format (JSONL, CSV, or a Hugging Face dataset with a full data card). Digest includes next month's recommended evaluation priority.

JSONL / CSV · Hugging Face ready · Data card included
Live quality dashboard — sample view
Live monitoring active
Overall quality score: 94.2% (↑ +1.8% vs last week)
Inter-annotator kappa: 0.78 (within SLA: ≥0.72)
Outputs evaluated this week: 1,240 (on target)
Active drift alerts: 3 (medical domain drifting)
Quality by category this week
General helpfulness: 92%
Factual accuracy: 89%
Safety & tone: 97%
Medical domain accuracy: 74% ⚠
Code generation quality: 85%
Service Level Agreements

Backed by contract, not promises

Every retainer tier comes with explicit, contractually binding SLAs. If we miss them, credits apply automatically — no need to chase us.

Starter · ₹4L/mo (1–3 person teams, focused product AI)
Weekly volume: 500 outputs/week · Report SLA: every Friday, 5pm IST · Drift alert SLA: 72 hours · Retraining batch: 200 pairs/month · Team: 3 dedicated annotators

Growth · ₹8L/mo (scale-ups with multiple model endpoints)
Weekly volume: 1,000 outputs/week · Report SLA: every Friday, 5pm IST · Drift alert SLA: 48 hours · Retraining batch: 500 pairs/month · Team: 5 dedicated annotators

Enterprise · ₹12–15L/mo (large AI teams, regulated industries)
Weekly volume: 2,000 outputs/week · Report SLA: Friday + Tuesday · Drift alert SLA: 24 hours (P1: 4 hours) · Retraining batch: 1,000 pairs/month · Team: 8 annotators + ML lead
Pricing

Predictable monthly cost,
measurable model improvement

All retainer tiers are billed monthly, with a minimum 3-month commitment. No surprise usage charges — volume is fixed at the tier level.

Starter
Ideal for focused AI products with a single primary model endpoint and a small team.
₹4L / month
3-month minimum · ₹12L total minimum
500 live outputs evaluated per week
Weekly quality report every Friday
72-hour drift alert SLA
200 retraining pairs per month
3 dedicated domain-expert annotators
Quarterly benchmark review
Growth
For scale-ups with multiple model endpoints, or a single model with high traffic and diverse use cases.
₹8L / month
3-month minimum · ₹24L total minimum
1,000 live outputs evaluated per week
Weekly report + monthly digest
48-hour drift alert SLA
500 retraining pairs per month
5 dedicated domain-expert annotators
Quarterly benchmark + trend analysis
Slack integration for drift alerts
Enterprise
For large AI teams in regulated industries — healthcare, legal, finance — requiring full-coverage monitoring and fast SLAs.
₹12–15L / month
Custom contract · HIPAA/DPDP compliant
2,000 live outputs evaluated per week
Twice-weekly reports + monthly digest
24h drift SLA · P1 events: 4-hour response
1,000 retraining pairs per month
8 annotators + dedicated ML lead
Custom benchmark + regulatory reporting
On-site calibration session (quarterly)
FAQ

Common questions about retainers

How do you get access to our live model outputs?
At onboarding, we agree on an ingestion method. Common approaches: (1) we call your inference API on a fixed test prompt set each week, (2) you export a sample of production logs to a shared encrypted S3 bucket weekly, or (3) you push samples to us via webhook. We work around whatever is easiest for your engineering team — we have never needed direct production access.
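As a rough illustration of option (2), a weekly log-sampling export might look like the sketch below. The field names, sample size, and fixed seed are all hypothetical; the actual schema is agreed at onboarding.

```python
# Hypothetical sketch of a weekly production-log sample export.
# Field names ("prompt", "output") and n=500 are illustrative only.
import json
import random

def weekly_sample(log_lines, n=500, seed=42):
    """Uniformly sample n logged model outputs for the weekly evaluation."""
    records = [json.loads(line) for line in log_lines]
    rng = random.Random(seed)  # fixed seed keeps the export reproducible
    return rng.sample(records, min(n, len(records)))

# Fake production log: one JSON object per line.
logs = [json.dumps({"prompt": f"q{i}", "output": f"a{i}"}) for i in range(10_000)]
batch = weekly_sample(logs, n=500)
print(len(batch))  # -> 500
```

The sampled batch would then be written to the shared encrypted bucket; no production credentials ever leave your side.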
What if our model's domain is highly specialised — say, cardiology or derivatives trading?
That's where the retainer model pays off most. We onboard your annotator team with a 2-week domain calibration before the engagement begins: structured reading, worked examples, supervised annotation rounds, kappa calibration. By week 3, your team understands your model's specific failure patterns. A new project-based annotator batch never gets that depth. If we cannot source sufficient domain expertise internally, we tell you before signing — not after delivery.
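For reference, the Cohen's kappa statistic used for IAA tracking and calibration compares observed agreement to the agreement expected by chance. A minimal two-annotator sketch (the labels below are invented examples):

```python
# Minimal sketch of Cohen's kappa for two annotators labelling the
# same outputs: kappa = (p_o - p_e) / (1 - p_e).
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    assert len(labels_a) == len(labels_b)
    n = len(labels_a)
    # Observed agreement: fraction of items both annotators labelled the same.
    p_o = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    # Chance agreement: product of each annotator's marginal label rates.
    p_e = sum((freq_a[k] / n) * (freq_b[k] / n) for k in freq_a)
    return (p_o - p_e) / (1 - p_e)

a = ["pass", "pass", "fail", "pass", "fail", "pass", "pass", "fail"]
b = ["pass", "pass", "fail", "fail", "fail", "pass", "pass", "fail"]
print(round(cohens_kappa(a, b), 2))  # -> 0.75
```

Values near 1.0 mean annotators agree far beyond chance; the 0.72 SLA floor and 0.87 figures elsewhere on this page are points on this same scale.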
What format is the monthly retraining batch delivered in?
JSONL is default (compatible with most fine-tuning frameworks including OpenAI, Hugging Face, and Axolotl). We can deliver in CSV, Parquet, or as a Hugging Face dataset with a full data card at no extra cost. The data card documents: annotation rubric, annotator counts, kappa scores, collection period, domain coverage, and known limitations.
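As a hypothetical illustration, a single preference-pair record in the JSONL batch could look like the following. The exact field names are agreed per engagement; this is not a fixed Concave AI schema.

```python
# Hypothetical preference-pair record — field names are illustrative.
import json

record = {
    "prompt": "What is the maximum daily dose of ibuprofen for adults?",
    "chosen": "For most healthy adults, OTC guidance caps ibuprofen at "
              "1,200 mg/day; follow the label and consult a clinician.",
    "rejected": "You can take as much ibuprofen as you need for the pain.",
    "category": "medical_dosage",
    "annotator_count": 3,
}
line = json.dumps(record)            # one JSON object per line = JSONL
print(json.loads(line)["category"])  # -> medical_dosage
```

Corrective SFT examples follow the same one-object-per-line convention, with a single corrected output in place of the chosen/rejected pair.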
Can we increase volume mid-contract if our traffic spikes?
Yes. We build a 30% burst capacity buffer into every retainer team. If you need to evaluate more outputs than your tier allows for a given week, we can handle it — we will flag it and invoice any significant overage at the per-output rate. If the spike becomes consistent, we recommend upgrading tiers at the next billing cycle.
What's your offboarding process if we need to end the retainer?
30-day written notice after the minimum commitment period. We deliver a full off-boarding package: complete historical quality dataset, longitudinal trend report, annotator calibration notes (useful if you move to an internal team), and a summary of all drift events and resolutions. You own all data produced during the engagement — nothing is retained by Concave AI after offboarding.
How is this different from hiring an in-house annotation team?
Three core differences. First, speed: hiring, onboarding, and calibrating annotation staff takes 3–6 months. Our retainer team is ready in 2 weeks. Second, quality infrastructure: the QA pipeline, gold standard sets, kappa tracking, and delivery tooling are all built and maintained by us — your engineering team doesn't build or manage any of it. Third, flexibility: you can scale up or down each quarter; in-house teams are fixed costs. The retainer is typically 30–40% cheaper than an equivalent in-house function once you account for salaries, tooling, and management overhead.