Methodology & Quality Assurance

Every commentary article passes through a multi-stage pipeline: generation by specialized AI agents, automatic quality review by an independent evaluator agent, and translation into four languages. This page documents the entire process and quality criteria.

Generation Pipeline

Three commentary layers are generated for each article of law:

Layer 1 — Overview
Accessible summary at B1 language level (150–300 words). Answers: What does the provision regulate? Who is affected? What are the legal consequences? Includes at least one concrete example. Maximum sentence length: 25 words.
Layer 2 — Doctrine
Academic analysis with marginal numbers (N. 1, N. 2, …). Required sections: legislative history, systematic context, elements of the provision, legal consequences, doctrinal debates, practical guidance. Minimum 3 secondary sources from different authors. The legislative message (Botschaft) must be cited.
Layer 3 — Case Law
Complete digest of Federal Supreme Court decisions from the opencaselaw.ch database (956,000+ decisions). Grouped by topic, ordered by importance. For each decision: BGE reference, date, core holding, relevance, block quote of the decisive passage.

The doctrine layer is enriched with structured reference data from leading commentary literature: authors, marginal numbers, doctrinal positions, and controversies are fed into the generation as context.

Quality Evaluation

After each generation, an independent evaluator agent reviews the draft using a two-stage process. Only drafts that meet all criteria are published.

Non-Negotiable Criteria (binary)

A single failure leads to rejection:

  1. No unsourced legal claims — Every statement about applicable law must be supported by a primary source. Cited authors and marginal numbers must match the reference data.
  2. No factual errors — Cited holdings must match the actual decisions.
  3. No missing leading cases — All BGE available on opencaselaw for the article must appear.
  4. Correct legal terminology — Legal terms must match the SR text.
  5. Structural completeness — All required sections must be present.

Scored Dimensions (0–1 scale)

| Dimension | Threshold | Criteria |
| --- | --- | --- |
| Precision | ≥ 0.95 | Citation accuracy, terminological exactness, temporal accuracy |
| Concision | ≥ 0.90 | No redundancy, word count 150–300 (overview), no filler phrases |
| Accessibility | ≥ 0.90 | No untranslated jargon, sentence length ≤ 25 words, concrete example (layer 1 only) |
| Relevance | ≥ 0.90 | Practical significance before theory, recent developments included |
| Academic Rigor | ≥ 0.95 | ≥ 3 secondary sources, Botschaft cited, doctrinal debates with named authors |
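The threshold check for the scored dimensions can be sketched as a simple lookup; dimension keys are illustrative:

```python
# Publication thresholds for the five scored dimensions (0–1 scale).
THRESHOLDS = {
    "precision": 0.95,
    "concision": 0.90,
    "accessibility": 0.90,
    "relevance": 0.90,
    "academic_rigor": 0.95,
}


def failing_dimensions(scores: dict[str, float]) -> list[str]:
    """Return every dimension scoring below its threshold; a draft is
    publishable on this axis only when the list is empty."""
    return [dim for dim, threshold in THRESHOLDS.items()
            if scores.get(dim, 0.0) < threshold]
```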

Retry Process

If a draft is rejected, the generation agent receives detailed feedback from the evaluator and produces an improved draft. After a maximum of 3 attempts, the article is flagged for manual review.
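The retry loop described above can be sketched as follows; `generate` and `evaluate` are hypothetical callables standing in for the two agents:

```python
MAX_ATTEMPTS = 3


def generate_with_feedback(generate, evaluate, max_attempts: int = MAX_ATTEMPTS):
    """Generation/evaluation loop: rejected drafts are regenerated with the
    evaluator's feedback; after max_attempts the article goes to manual review."""
    feedback = None
    for attempt in range(1, max_attempts + 1):
        draft = generate(feedback)          # first call receives no feedback
        verdict = evaluate(draft)
        if verdict["passed"]:
            return {"status": "published", "attempts": attempt, "draft": draft}
        feedback = verdict["feedback"]      # drives the next attempt
    return {"status": "manual_review", "attempts": max_attempts}
```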

Pipeline flow: Generation → Evaluation → passed? → Translation & Publication. On rejection: retry with feedback (max 3×); after 3 failures: flagged for manual review.

Translation

Published commentary is automatically translated into French, Italian, and English; the German version is authoritative. Legal terminology follows the official Fedlex TERMDAT translations. Structure, formatting, and marginal numbering are preserved.

For article titles and statute texts: where Fedlex provides official translations (e.g., the Federal Constitution is available in all four national languages plus English), these are used and marked as "Fedlex". Where no official translation exists, the platform produces its own AI translation.
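The fallback logic for titles and statute texts is a simple preference order. A sketch, where `fedlex_lookup` and `ai_translate` are hypothetical callables standing in for the real services:

```python
def pick_translation(article_id: str, lang: str, fedlex_lookup, ai_translate):
    """Prefer the official Fedlex translation when one exists; otherwise
    fall back to the platform's own AI translation."""
    official = fedlex_lookup(article_id, lang)
    if official is not None:
        return {"text": official, "source": "Fedlex"}
    return {"text": ai_translate(article_id, lang), "source": "AI"}
```

Tagging each string with its `source` lets the UI mark official translations as "Fedlex", as described above.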

Data Sources

  • Statute texts: Fedlex (Classified Compilation)
  • Case law: opencaselaw.ch (956,000+ decisions, 8.77M citation edges)
  • Doctrine references: Leading commentary literature, structurally processed
  • AI model: Claude (Anthropic)

Technical Architecture of the Quality Gate

The quality assurance system is implemented as a multi-agent pipeline where the generator and evaluator are strictly separated. The evaluator has no access to the generator's internal state, prompts, or reasoning. This architectural separation prevents self-confirming quality assessments.

Generator Agent

For each layer, a specialized law agent receives:

  • System prompt: Global authoring guidelines + law-specific guidelines (key commentaries, cross-reference patterns, special procedural notes for each law)
  • Article text: The full statutory text from Fedlex, injected directly into the prompt
  • Doctrinal reference data: Structured extracts from leading commentary literature identifying authors, Randziffern, doctrinal positions, controversies, cross-references, and key literature for the specific article
  • Tool access: find_leading_cases, search_decisions, get_case_brief, get_decision (full text), write_layer_content — all via the opencaselaw MCP server

The agent autonomously researches case law, reads full decision texts, and writes the commentary in a tool-use loop with up to 25 turns per layer.
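Such a tool-use loop can be sketched generically; `call_model` and `execute_tool` are hypothetical adapters, and the message shapes are simplified stand-ins for the real API:

```python
MAX_TURNS = 25   # per-layer turn budget, as described above


def run_tool_loop(call_model, execute_tool, max_turns: int = MAX_TURNS):
    """Drive the generator's tool-use loop: keep calling the model and
    executing its requested tools until it answers without tool calls
    or the turn budget runs out."""
    messages: list = []
    for _ in range(max_turns):
        reply = call_model(messages)
        messages.append(reply)
        tool_calls = reply.get("tool_calls", [])
        if not tool_calls:
            return reply                         # final commentary draft
        for call in tool_calls:
            messages.append(execute_tool(call))  # feed results back to the model
    raise RuntimeError("turn budget exhausted before a final answer")
```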

Evaluator Agent

A separate evaluator agent receives the generated text and independently verifies it:

  • Read-only tools: Can verify content against primary sources but cannot modify it
  • Citation cross-check: Verifies that cited authors and Randziffern actually appear in the reference data. Fabricated citations are rejected.
  • Completeness check: Queries opencaselaw for all leading cases (BGE) on the article and verifies that every one appears in the case law layer

The evaluator returns structured JSON: binary non-negotiable results, scored dimensions (0.0–1.0 each), blocking issues, and improvement suggestions. On rejection, the full feedback is passed back to the generator for the next attempt.
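A verdict of this shape is easy to sanity-check before acting on it. A minimal sketch, assuming illustrative key names rather than the platform's actual JSON schema:

```python
import json

REQUIRED_KEYS = {"non_negotiables", "scores", "blocking_issues", "suggestions"}


def parse_verdict(raw: str) -> dict:
    """Parse the evaluator's JSON verdict and sanity-check its shape:
    booleans for non-negotiables, 0.0-1.0 floats for dimension scores."""
    verdict = json.loads(raw)
    missing = REQUIRED_KEYS - verdict.keys()
    if missing:
        raise ValueError(f"verdict missing keys: {sorted(missing)}")
    if not all(isinstance(v, bool) for v in verdict["non_negotiables"].values()):
        raise ValueError("non-negotiable results must be booleans")
    if not all(0.0 <= s <= 1.0 for s in verdict["scores"].values()):
        raise ValueError("dimension scores must lie in [0, 1]")
    return verdict
```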

Doctrinal Reference Integration

Commentary literature is processed into structured reference data that flows into the generation pipeline: authors with edition years, Randziffern maps, named doctrinal positions, identified controversies, cross-references, and key literature. This grounds the AI commentary in actual scholarly sources instead of letting the model generate citations from training data.
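One record of such reference data, and the evaluator's citation cross-check against it, might look like this; field names are illustrative:

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DoctrineReference:
    """One structured extract from commentary literature (illustrative fields)."""
    author: str
    edition_year: int
    randziffer: str                     # marginal number, e.g. "N. 12"
    position: str                       # named doctrinal position
    cross_references: tuple[str, ...] = ()


def citation_exists(refs, author: str, randziffer: str) -> bool:
    """Evaluator cross-check: a cited author + Randziffer pair must appear
    in the reference data, otherwise the citation is treated as fabricated."""
    return any(r.author == author and r.randziffer == randziffer for r in refs)
```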

Cost Tracking and Resumability

Every API call is tracked with token-level cost estimation. The pipeline supports budget limits and crash-safe resumability — state is persisted after each article completion.
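Crash-safe persistence is typically achieved with an atomic write-then-rename. A sketch of the idea, with an illustrative state shape (not the pipeline's actual file format):

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Write to a temporary file in the same directory, then atomically
    replace the old state file, so a crash never leaves a half-written state."""
    directory = os.path.dirname(path) or "."
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as f:
            json.dump(state, f)
        os.replace(tmp, path)   # atomic rename
    except BaseException:
        os.unlink(tmp)
        raise


def load_state(path: str) -> dict:
    """Resume from the last completed article, or start fresh."""
    if not os.path.exists(path):
        return {"completed_articles": [], "spent_usd": 0.0}
    with open(path) as f:
        return json.load(f)
```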

Production Costs — BV (Federal Constitution)

Transparent cost data from the BV bootstrap run (232 articles):

| Metric | Value |
| --- | --- |
| Average cost per article | $2.58 |
| Median cost per article | $2.40 |
| Range | $0.84 – $7.77 |
| Cost per layer (generation + evaluation) | ~$0.50 |
| Cost per translation | ~$0.15 per language |
| Evaluation retry overhead | +$0.80 per rejected attempt |
| Estimated total for BV (232 articles) | ~$600 |
| Estimated total for all 8 laws (4,800+ articles) | ~$12,000 |

Higher-cost articles are those with extensive case law (more tool calls to opencaselaw) or repeated evaluation rejections (up to 3 attempts per layer).
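As a sanity check, the totals in the table follow directly from the measured per-article average:

```python
AVG_COST_PER_ARTICLE = 2.58   # USD, measured over the 232-article BV run


def projected_total(n_articles: int) -> float:
    """Linear projection from the measured average; real costs vary with
    case-law volume and retry overhead, as noted above."""
    return n_articles * AVG_COST_PER_ARTICLE

# 232 BV articles  -> about $599   (table estimate: ~$600)
# 4,800 articles   -> about $12,384 (table estimate: ~$12,000)
```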

Open Source & License

The complete source code is available on GitHub (MIT license). All commentary content is licensed under CC BY-SA 4.0. The full dataset is available on HuggingFace.