Contract Compliance Demo

Automating Engineering Contract Compliance Review

Act 1 · The manual process

Every consulting engagement is governed by a contract that defines scope, deliverables, methodology, and acceptance criteria. Verifying that work product conforms to those terms is currently manual, expensive, and error-prone. This demo identifies that manual process, presents an automated solution, and measures the impact.

Section B

The manual process today

A senior manager at any professional services firm performs the following sequence today, by hand, for every engagement that requires a compliance review.

  1. 0:00

    Receives executed contract from engagement team.

    Typically 20–50 pages of negotiated scope, deliverables, methodology, acceptance criteria, exclusions, and special terms.

  2. 0:15

    Reads through the contract, identifying clauses relevant to the work product under review.

    Notes specific methodology requirements, scope inclusions and exclusions, and deliverable specifications.

  3. 1:00

    Receives draft work product from delivery team.

    Typically a technical report, analysis, or recommendation document.

  4. 1:30

    Begins systematic cross-check.

    Reads each section of the work product against the relevant contract clauses, looking for scope drift, methodology divergence, missing deliverables, or undisclosed substitutions.

  5. 4:00

    Compiles findings into a compliance memo.

    Each finding cites the contract clause it relates to and the work-product item it concerns.

  6. 5:00

    Reviews findings with engagement partner; drafts revisions request to the delivery team.

    The end of one review; the beginning of the next.

Total time per review: 4–8 hours of senior manager time, depending on contract complexity and work-product size.

Why this is a problem

  • It is expensive. Senior managers cost $200–300 per hour fully loaded. A single review is a $1,000–2,000 line item.
  • It is inconsistent. Different reviewers catch different issues; the same reviewer on different days catches different issues.
  • It scales poorly. Engagement volume grows; senior reviewer capacity does not.
  • It is fatigue-sensitive. The kind of error that matters most — methodology substitutions, subtle scope drift — is the kind most easily missed by a tired reviewer on the third such review of the day.

A firm doing professional services work may run hundreds of these reviews per year. Aggregate cost runs into the hundreds of thousands of dollars annually, with material risk exposure from missed compliance issues.

Section C

Manual vs. automated, side by side

Manual

A senior manager, by hand

  • 0:00 — receives contract, reads it
  • 1:00 — receives work product
  • 1:30 — cross-checks section by section
  • 4:00 — compiles findings memo
  • 5:00 — partner review and revision request

Total: 4–8 hours.

Automated

The pipeline below

  • 0:00 — documents ingested
  • 0:04 — contract indexed for retrieval
  • 0:09 — deterministic checks pass
  • 0:22 — conformance findings produced with citations
  • 0:30 — compliance memo drafted; flagged items routed to a human

Total: 30–45 minutes, mostly in human review of flagged items.

Time per review: 4–8 hours → 30–45 minutes.

Section D

How the automation actually works

The system uses two architectural patterns worth understanding before watching it run.

Retrieval-augmented validation. When the system reviews a work product, it does not ask a language model “is this correct?” in the abstract. Instead, for every substantive item in the work product, it retrieves the most relevant passages from the governing contract and asks the model to evaluate the item against those specific passages. This is a deliberate inversion of how most people have encountered AI tools, where models generate content from prompts. Here, the model judges existing content against retrieved specifications. The shift matters because compliance is, by definition, a question of conformance to specification — and grounding judgment in the specification itself produces traceable, auditable verdicts rather than opinions.
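
A minimal sketch of the pattern in Python, for concreteness. Nothing here is the demo's actual code: the keyword-overlap retriever stands in for embedding similarity over the contract index, and the judge callable stands in for the LLM call.

    from dataclasses import dataclass
    from typing import Callable

    @dataclass
    class Clause:
        ref: str    # e.g. "Sec. 4.3"
        text: str

    def retrieve(claim: str, clauses: list[Clause], k: int = 3) -> list[Clause]:
        # Stand-in retriever: rank clauses by keyword overlap with the claim.
        # The real pipeline would use embedding similarity over a contract index.
        words = set(claim.lower().split())
        return sorted(clauses,
                      key=lambda c: len(words & set(c.text.lower().split())),
                      reverse=True)[:k]

    def validate_claim(claim: str, clauses: list[Clause],
                       judge: Callable[[str, list[str]], str]) -> dict:
        # Judge one work-product claim against its retrieved contract passages;
        # judge wraps an LLM call and returns a verdict string.
        passages = retrieve(claim, clauses)
        verdict = judge(claim, [c.text for c in passages])
        # The verdict is auditable: it is tied to the exact clauses retrieved.
        return {"claim": claim,
                "citations": [c.ref for c in passages],
                "verdict": verdict}

The return value is the point of the structure: every verdict carries the clause citations it was judged against, which is what makes the output traceable rather than opinionated.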

Multi-step verification. No single check, deterministic or AI-based, is sufficient on its own. Schema checks catch structural problems but miss semantic ones. Arithmetic checks catch reconciliation failures but miss methodology divergences. Language-model judgment catches semantic issues but can be misled by ambiguous text. The system therefore layers six verification stages, each catching a different category of error, with deterministic checks running first and AI-based judgment running on the structurally validated remainder. An item is certified compliant only if it clears every layer; a failure at any layer surfaces as a finding. The architecture's trustworthiness comes not from any individual stage but from the discipline of layering them.
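
The layering is easiest to see as code. The sketch below is illustrative rather than the demo's implementation; the stage names mirror the pipeline that runs in Act 2 (indexing, the sixth stage, happens before any checks run), and each check body is a placeholder.

    from typing import Callable

    # A check inspects the work product and returns findings; empty list = clean.
    Check = Callable[[dict], list[str]]

    def run_pipeline(work_product: dict,
                     stages: list[tuple[str, Check]]) -> list[dict]:
        findings: list[dict] = []
        for name, check in stages:
            for f in check(work_product):
                findings.append({"stage": name, "finding": f})
            # Deterministic failures short-circuit: there is no point running
            # AI judgment on a document that is structurally invalid.
            if name == "schema" and findings:
                break
        return findings

    stages: list[tuple[str, Check]] = [
        ("schema",        lambda wp: []),  # structural validation
        ("grounding",     lambda wp: []),  # claims trace to source data
        ("sanity",        lambda wp: []),  # arithmetic / reconciliation checks
        ("rag_validate",  lambda wp: []),  # per-claim judgment vs. the contract
        ("quality_judge", lambda wp: []),  # holistic LLM review of the draft
    ]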

The pipeline that runs next is an instance of these two patterns applied to contract compliance review. The patterns themselves generalize.

Run the demo →

Three sample engagements; pick one to walk through the pipeline.

Bridge to Act 2

Choose a sample engagement

Each sample is a real-shaped model-validation services engagement — contract plus draft validation report. The system runs the same six-stage pipeline against all three. The findings differ.

Illustrative engagements · client names use Microsoft sample-data placeholders (Northwind, Contoso, Fabrikam) and are not real institutions.

Replacing approximately 4–8 hours of senior manager review. Estimated time for this run: ~30 seconds.


Engagement

 

 

Act 2 deliverable

Compliance Memo

 

Generated in   seconds. Manual equivalent: 4–8 hours of senior manager time.

Executive summary

 

Findings

Items flagged for human reviewer judgment

Verification coverage

Bridge to Act 3

Monitoring & governance

The dashboard below shows what production deployment of the automation looks like across a synthetic corpus of recent compliance reviews. Coverage and edge-case routing are first-class concerns: the system is trustworthy when it knows what to defer, not only when it knows what to decide.

Quality drift — weekly average confidence

Mean RAG-validate verdict confidence per ISO week.

Findings per review — weekly

Mean findings produced per review per ISO week.
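
For concreteness, both weekly series could be computed from run logs along these lines. The log schema here (one row per RAG-validate verdict, with run_ts, review_id, verdict_confidence, and is_finding columns) is an assumption; the dashboard's actual source is not shown.

    import pandas as pd

    # Hypothetical log file: one JSON record per RAG-validate verdict.
    runs = pd.read_json("review_runs.jsonl", lines=True)

    iso = pd.to_datetime(runs["run_ts"]).dt.isocalendar()
    runs["iso_week"] = (iso["year"].astype(str) + "-W"
                        + iso["week"].astype(str).str.zfill(2))

    # Quality drift: mean verdict confidence per ISO week.
    confidence_by_week = runs.groupby("iso_week")["verdict_confidence"].mean()

    # Findings per review: count flagged verdicts per review, then average
    # within each ISO week.
    findings_by_week = (
        runs[runs["is_finding"]]
            .groupby(["iso_week", "review_id"]).size()
            .groupby("iso_week").mean()
    )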

Edge cases flagged for review

Each row is a past review whose output was routed for additional human judgment.

Model and prompt versioning

Recent runs

Act 3 · The measured impact

What changed

The architecture above is interesting on its own terms. Whether it is useful is a separate question, and one that requires real numbers. The sections below quantify the change against the manual process described in Act 1.

                                            Manual                 Automated
  Time per review                           4–8 hours              30–45 minutes
  Cost per review (loaded)                  $1,000–2,000           $100–180
  Reviewer fatigue at high volume           Significant            None
  Audit trail                               Variable, narrative    Comprehensive, structured
  Catches subtle methodology divergences    Inconsistent           Systematic

Same review, both ways

Sample 3 — the methodology divergence case.

Manual track

  • 0:00 — Receives contract.
  • 0:30 — Reads contract.
  • 1:00 — Receives work product.
  • 1:30 — Begins cross-check.
  • 4:00 — Finds methodology divergence (if attentive).
  • 5:00 — Drafts memo.

Automated track

  • 0:00 — Indexing begins.
  • 0:04 — Indexing complete.
  • 0:09 — Schema validation passes.
  • 0:11 — Grounding passes.
  • 0:14 — Sanity passes.
  • 0:18 — RAG validation begins.
  • 0:22 — Methodology divergence flagged with citations to Sec. 4.3 and 4.4.
  • 0:26 — Quality judge complete.
  • 0:30 — Compliance memo drafted.

Same finding, ~960× faster, with a complete audit trail.

What gets caught

Categories of findings that the automation reliably surfaces and that manual review tends to miss.

Methodology substitutions

The Sample 3 case study. Subtle replacements of contracted analytical methods with alternatives that look superficially compliant.

Manual review tends to miss this because it's the kind of error that only surfaces if the reviewer reads the methodology section of the contract immediately before reading the methodology section of the work product, every time, fully attentive — which the third such review of the day rarely receives.

Scope drift

Minor inclusions of out-of-scope analysis or recommendations not authorized by the contract.

Manual review tends to miss this because reviewers are looking for what's missing, not what's quietly extra.

Quiet exclusions

Contracted deliverables silently omitted from the work product.

Manual review tends to miss this when the contract's deliverable list isn't actively in front of the reviewer during the read.

Defensibility failures

Claims in the work product that don't trace to source data.

Manual review tends to miss this on numerical claims that look plausible on first read.

Cost model and ROI

Manual cost is one number multiplied by another: senior manager hours per review, times the loaded hourly rate. There is no API or compute component — the work is done by hand.

Automated cost has two components. The first is a per-review fee for the LLM and embedding API calls plus the compute the pipeline runs on. The second is a block of senior reviewer time — the same reviewer who would have done the entire manual cross-check, now adjudicating only the items the system flagged. The pipeline does not eliminate senior judgment; it compresses the senior reviewer's involvement from the full review to the flagged items. That compression is where the savings live.

Manual cost per review

 

No API or compute component.

 

Automated cost per review

 

 

 

Savings per review

 

Senior hours recovered / yr

 

Annual savings

 

How the API + compute figure is built

Each pipeline run does roughly:

  • One contract index pass — ~20 Voyage embedding calls at $0.12 / MTok ≈ $0.02.
  • ~15 RAG-validate LLM calls (one per substantive claim), each ~3K input + ~600 output tokens, at Sonnet 4.6 list pricing ($3 / MTok input, $15 / MTok output) ≈ $0.27.
  • One quality-judge LLM call, ~10K input + ~1.5K output tokens ≈ $0.05.
  • Compute infrastructure (FastAPI process, amortized) ≈ $0.05.

Subtotal: $0.39. The constant used here ($1.50) is rounded up generously to absorb retries, larger contracts, and the quality-judge running on the full work product text. Holds for typical model-validation engagements; very large contracts (100+ pages) would push it higher.
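
The same arithmetic, spelled out (list prices as quoted above, token counts as estimated):

    MTOK = 1_000_000
    embed   = 0.02                                        # ~20 embedding calls
    rag     = 15 * (3_000 * 3 / MTOK + 600 * 15 / MTOK)   # = 0.27
    judge   = 10_000 * 3 / MTOK + 1_500 * 15 / MTOK       # ≈ 0.05
    compute = 0.05                                        # amortized FastAPI host
    print(round(embed + rag + judge + compute, 2))        # 0.39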

Time constants: 6.0 hours of senior reviewer time per manual review (midpoint of the 4–8 hour range); 0.5 hours of senior reviewer time per automated review, spent only on items the system flagged for human judgment. The pipeline itself runs in ~30 seconds and is not billed against human time.
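
Putting the constants together, the per-review and annual arithmetic works out as below. The hourly rate midpoint and the annual review count are assumptions consistent with the ranges stated in Act 1 ("hundreds of these reviews per year"); swap in your own figures.

    SENIOR_RATE      = 250.0  # $/hr, midpoint of the $200–300 loaded range
    MANUAL_HOURS     = 6.0    # midpoint of the 4–8 hour range
    AUTO_HOURS       = 0.5    # senior time on flagged items only
    API_COMPUTE      = 1.50   # per-run constant from the breakdown above
    REVIEWS_PER_YEAR = 300    # assumption: "hundreds of reviews per year"

    manual_cost = MANUAL_HOURS * SENIOR_RATE                          # $1,500.00
    auto_cost   = AUTO_HOURS * SENIOR_RATE + API_COMPUTE              # $126.50
    savings     = manual_cost - auto_cost                             # $1,373.50
    annual_savings  = savings * REVIEWS_PER_YEAR                      # ~$412,000
    hours_recovered = (MANUAL_HOURS - AUTO_HOURS) * REVIEWS_PER_YEAR  # 1,650 h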

Beyond hours saved

Closing

The three beats, in order

We identified a manual process: senior-reviewer-led contract compliance review, 4–8 hours per review, expensive, inconsistent, fatigue-sensitive. We designed an automated solution: a retrieval-augmented multi-step verification pipeline producing reviewer-ready compliance reports with full audit trails, where the senior reviewer remains in the loop on flagged items only. We measured the impact: ~960× faster, more than 90% lower loaded cost per review, more consistent, and demonstrably better at catching subtle classes of error.

The architecture generalizes. Any workflow at a professional services firm where AI-generated artifacts must conform to formal specifications — engagement scope monitoring, methodology adherence reviews, deliverable acceptance, compliance memo drafting — is a candidate for the same approach. The pattern is constant; the specifications and verification layers adapt to the workflow.