AI agent skills for the Claude Code CLI — a peer of ARC & ARS, with the audit-and-falsify layer added.

Audit‑and‑falsify for quantum‑computing research.

QuantumNovelty is eighteen AI agent skills — running on the Claude Code CLI by default, with Codex available for cross-vendor falsifiability checks — that write and review quantum‑computing papers. The agents refuse to declare a result novel until the strict-domination comparator has run against current published baselines, every ratio has been recomputed from raw JSON, small-sample rates carry Wilson 95 % CIs, and the failure modes have their own section in the manuscript.

Tested on real arXiv papers. Every agent call ships its token + USD cost ledger in the PDF report. The framework refuses to silently fall back to a different LLM backend.

13 agent skills
24 modes
11 journals
7 quantum libs
151 tests

Two ways to use it

Pick a workflow.

Both share the same skill catalog and audit-and-falsify framework. Pick by what you need today.

A

Write a new quantum paper

From a research question to a submission-ready manuscript with audit script attached.

  1. Surface the literature — CrossRef + arXiv + Semantic Scholar feed a Pareto-shaped baseline catalog of current published methods.
  2. Drive a discovery loop — the LLM proposes ansatz / circuit candidates; a strict-domination archive at calibrated ε keeps the non-dominated points.
  3. Falsifiability check — cross-LLM amplitude prediction across two distinct vendors; predictions persisted to disk before truth is computed.
  4. Novelty audit — augmented-baseline merge surfaces rediscoveries the LLM didn't catch; honest negatives become a required Failure Modes section.
  5. Draft the paper — venue-aware section ordering, abstract limits, and required statements per the 11-journal registry.
  6. Review your own draft — 5-voice panel + 11 quantum-CS-specific fallacy checks + Stage-6 CQE before submission.
$ chain/run.sh --pipeline full-pipeline \
    --topic   "LLM-driven UCCSD pruning on H2O" \
    --hamiltonian H2O_4e_4o_8q \
    --journal     npj-quantum-information \
    --quantum-lib mlxq

B

Review an existing paper

Point QuantumNovelty at any published or preprint paper. Get a real reviewer report.

  1. Deep research review — the 8-item audit-and-falsify checklist (Augmented baselines / Strict-domination / Recompute-from-raw / Wilson CIs / Cross-LLM / Honest negatives / Simulator precision / Auditable claims), with PASS / PARTIAL / FAIL per item.
  2. 5-voice reviewer panel — EIC + R1 Physics + R2 Novelty + R3 Evidence + Devil's Advocate. Each writes ≥4 paragraphs; the EIC produces a numbered must-fix list.
  3. Logical-fallacy scan — standard fallacies plus 11 quantum-CS-specific (cherry-picked-baseline, ad-hoc-precision-floor, simulator-laundering, pareto-cherry-picked-axes, cross-llm-theatre, …).
  4. Stage-6 CQE — 6-dimension Collaboration Quality Evaluation with geometric-mean composite. A 30/100 on one dimension cannot be averaged away by 95s on the other five.
$ chain/run.sh --pipeline chat \
    --prompt "Review this paper" \
    --paper draft.pdf \
    --journal npj-quantum-information

The catalog

Eighteen skills. One contract. Deterministic gates included.

Every skill is a directory with a SKILL.md, a run.sh, and a Python driver. Drop a new one in — the chain discovers it on the filesystem walk.
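As a rough illustration of that filesystem walk, the sketch below treats any directory holding both a SKILL.md and a run.sh as a skill. The function name `discover_skills` is hypothetical, and the Python driver is not checked because its filename is not given here; the chain's actual discovery code may differ.

```python
import tempfile
from pathlib import Path

# Assumption: a directory counts as a skill when it holds both of these files.
REQUIRED_FILES = ("SKILL.md", "run.sh")


def discover_skills(root: Path) -> list[str]:
    """Return the sorted names of directories under root that look like skills."""
    found = set()
    for skill_md in root.rglob("SKILL.md"):
        skill_dir = skill_md.parent
        if all((skill_dir / name).is_file() for name in REQUIRED_FILES):
            found.add(skill_dir.name)
    return sorted(found)


if __name__ == "__main__":
    # Throwaway tree: two complete skills and one directory missing run.sh.
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for name in ("novelty_audit", "chat", "half_finished"):
            (root / name).mkdir()
            (root / name / "SKILL.md").write_text("# skill\n")
        for name in ("novelty_audit", "chat"):
            (root / name / "run.sh").write_text("#!/bin/sh\n")
        print(discover_skills(root))
```

A directory that lacks run.sh is skipped rather than reported as a broken skill; a stricter walk could raise instead.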

novelty_audit

marquee

The audit-and-falsify framework. Augmented-baseline merge → strict-domination at ε_abs=10⁻¹², ε_rel=10⁻⁹ → recompute-from-raw → Wilson 95 % CIs → honest-negatives enforcement → re-runnable audit_claims.py.
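The recompute-from-raw step can be pictured with a minimal sketch that re-derives a reported ratio from raw counts and compares it at the precision the manuscript quotes. The JSON keys (`baseline_gates`, `pruned_gates`), the function name, and the numbers are hypothetical; audit_claims.py itself may work differently.

```python
import json


def recompute_ratio(raw_json: str, claimed: float, decimals: int = 2) -> dict:
    """Re-derive baseline/pruned from raw counts and compare with the claimed ratio."""
    raw = json.loads(raw_json)
    recomputed = raw["baseline_gates"] / raw["pruned_gates"]
    return {
        "claimed": claimed,
        "recomputed": recomputed,
        # Compare at the precision the manuscript quotes, not at full float precision.
        "matches": round(recomputed, decimals) == round(claimed, decimals),
    }


raw = '{"baseline_gates": 120, "pruned_gates": 52}'  # illustrative raw data
print(recompute_ratio(raw, claimed=2.31)["matches"])  # 120 / 52 rounds to 2.31
print(recompute_ratio(raw, claimed=2.50)["matches"])  # does not survive the recompute
```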

deep_research

7 modes

full · quick · systematic-review · socratic · fact-check · lit-review · review

quantum_paper

10 modes

full · plan · outline-only · revision · revision-coach · abstract-only · lit-review · format-convert · citation-check · disclosure

quantum_reviewer

6 modes

full (EIC + R1 + R2 + R3 + Devil's Advocate) · quick · guided · methodology-focus · re-review · calibration

logical_fallacies

+11 quantum-CS

cherry-picked-baseline · ad-hoc-precision-floor · conflated-regimes · active-space-handwave · hardware-irrelevant-comparison · asymptotic-only-claim · unit-inflation · simulator-laundering · mapping-by-convenience · pareto-cherry-picked-axes · cross-llm-theatre

cross_llm_prediction

Falsifiability rubric. Refuses single-vendor "consensus". Predictions persisted before truth.
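One way to make "predictions persisted before truth" checkable is to hash the predictions file at write time and re-verify the hash after the true values are computed. The sketch below assumes that approach; the file name `predictions.json`, the vendor labels, and both function names are hypothetical.

```python
import hashlib
import json
import tempfile
from pathlib import Path


def persist_predictions(predictions: dict[str, list[float]], outdir: Path) -> str:
    """Write per-vendor predictions to disk and return a SHA-256 digest of the file."""
    if len(predictions) < 2:
        # A single vendor agreeing with itself is not a falsifiability check.
        raise ValueError("need predictions from at least two distinct vendors")
    payload = json.dumps(predictions, sort_keys=True).encode("utf-8")
    (outdir / "predictions.json").write_bytes(payload)
    return hashlib.sha256(payload).hexdigest()


def predictions_unchanged(outdir: Path, digest: str) -> bool:
    """Confirm the stored predictions still match the digest taken earlier."""
    payload = (outdir / "predictions.json").read_bytes()
    return hashlib.sha256(payload).hexdigest() == digest


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        outdir = Path(tmp)
        digest = persist_predictions(
            {"vendor_a": [0.12, -0.03], "vendor_b": [0.10, -0.05]}, outdir
        )
        truth = [0.11, -0.04]  # computed only after the predictions are on disk
        print(predictions_unchanged(outdir, digest), truth)
```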

pareto_explorer

LLM-in-loop discovery. Bring your own evaluator. Strict-domination archive at calibrated tolerances.

ablation_designer

4 standard axes: LLM-mutator on/off · commutation-hint · Pareto seeding · cross-vendor.

literature_surfacer

CrossRef + arXiv + Semantic Scholar live HTTP. Emits the Pareto-shaped baseline catalog.

book_acquirer

Anna's Archive client for citations not on arXiv or CrossRef.

process_summary

Stage-6 CQE. Geometric-mean composite of Novelty Rigour · Reproducibility · Methodological Rigour · Falsifiability · Domain Depth · Communication.

chat

Natural-language frontend. Pattern-first dispatch; LLM fallback. "Review this paper" → the right skill + mode + flags.
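Pattern-first dispatch with an LLM fallback could look roughly like this. The skill and mode names come from the catalog, but the regular expressions, the mapping of each phrase to a skill, and the `llm_fallback` callable are illustrative assumptions rather than the frontend's real routing table.

```python
import re
from typing import Callable, Optional

# Hypothetical routing table, checked in order: (pattern, skill, mode).
ROUTES = [
    (re.compile(r"\bquick\b.*\b(assess|review)", re.I), "quantum_reviewer", "quick"),
    (re.compile(r"\bfact[- ]check\b", re.I), "deep_research", "fact-check"),
    (re.compile(r"\breview\b", re.I), "quantum_reviewer", "full"),
]


def dispatch(
    prompt: str,
    llm_fallback: Optional[Callable[[str], tuple[str, str]]] = None,
) -> dict:
    """Route a prompt by pattern first; ask the LLM only when nothing matches."""
    for pattern, skill, mode in ROUTES:
        if pattern.search(prompt):
            return {"skill": skill, "mode": mode, "via": "pattern"}
    if llm_fallback is None:
        raise LookupError("no pattern matched and no LLM fallback configured")
    skill, mode = llm_fallback(prompt)
    return {"skill": skill, "mode": mode, "via": "llm"}


print(dispatch("Review this paper"))
print(dispatch("Quick assessment of this paper"))
```

Recording `via` in the result keeps it visible whether a route was deterministic or chosen by the model.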

audit_falsify

Strict-domination primitives surfaced so you can compose custom audit pipelines.
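A minimal sketch of such primitives follows. It assumes all objectives are minimized, takes "strictly dominates" to mean better on every objective by more than a combined absolute-plus-relative tolerance, and reuses the ε values quoted for novelty_audit. The function names and the exact tolerance formula are assumptions; the framework may define domination differently.

```python
EPS_ABS = 1e-12  # tolerance values quoted in the novelty_audit description
EPS_REL = 1e-9

Point = tuple[float, ...]


def beats(a: float, b: float) -> bool:
    """True if a is smaller than b by more than the combined tolerance."""
    tol = EPS_ABS + EPS_REL * max(abs(a), abs(b))
    return (b - a) > tol


def strictly_dominates(a: Point, b: Point) -> bool:
    """True if a beats b on every objective (all objectives minimized)."""
    if len(a) != len(b):
        raise ValueError("points must have the same number of objectives")
    return all(beats(x, y) for x, y in zip(a, b))


def update_archive(archive: list[Point], candidate: Point) -> list[Point]:
    """Keep only non-dominated points after offering a new candidate."""
    if any(strictly_dominates(p, candidate) for p in archive):
        return archive  # candidate is dominated, archive is unchanged
    survivors = [p for p in archive if not strictly_dominates(candidate, p)]
    return survivors + [candidate]


archive: list[Point] = []
for point in [(2.0, 20.0), (1.0, 10.0), (0.5, 30.0), (3.0, 40.0)]:
    archive = update_archive(archive, point)
print(archive)
```

With these tolerances, a difference of 1e-13 on an objective of order 1 does not count as an improvement, which is the point of calibrating ε to the simulator's precision.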

journals + quantum_libs

Registries: 11 venues (npj-QI, PRX-Quantum, Quantum, PRL, PRA, Nature Comms…) + 7 libraries (Qiskit, PennyLane, QuTiP, mlxq, Cirq, OpenFermion, no-code).

The chain

Six stages. Resumable. Honest about what didn't run.

1

Literature surface

Multi-source pull. Books via Anna's Archive when needed. No RAG; queries hit live.

2

Discovery

LLM-in-loop Pareto archive with strict-domination at calibrated ε.

3

Novelty audit

Augmented-baseline merge. Ratio recompute. Wilson CIs. Honest-negatives gate.
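For the Wilson CIs in this stage, the textbook Wilson score interval for a binomial proportion is enough to show why small-sample rates need it. This is a minimal sketch; the function name and the default z for 95 % coverage are choices made here, not taken from the framework's code.

```python
import math


def wilson_interval(successes: int, trials: int, z: float = 1.959963984540054) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion (default z gives 95 %)."""
    if trials <= 0:
        raise ValueError("trials must be positive")
    if not 0 <= successes <= trials:
        raise ValueError("successes must lie between 0 and trials")
    p = successes / trials
    z2 = z * z
    denom = 1 + z2 / trials
    center = (p + z2 / (2 * trials)) / denom
    half = z * math.sqrt(p * (1 - p) / trials + z2 / (4 * trials * trials)) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the edges.
    return max(0.0, center - half), min(1.0, center + half)


low, high = wilson_interval(3, 10)
print(f"3/10 -> [{low:.3f}, {high:.3f}]")
```

Three successes in ten trials give an interval of roughly 0.108 to 0.603, far wider than the bare "30 %" suggests.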

4

Draft + cross-LLM

Venue-aware draft. Falsifiable amplitude rubric across two vendors.

5

Review panel

5 voices. Quantum-CS fallacy taxonomy applied. Verdict + must-fix list.

6

Process summary

6-dim Collaboration Quality Evaluation. Geometric mean. Honest about gaps.

Resumable

Re-run with the same --outdir: stages whose output exists are skipped. --force re-runs everything.
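The skip-if-output-exists rule can be sketched in a few lines. The per-stage file naming (`<stage>.json`) and the `run_stages` function are hypothetical; only the behaviour (skip existing outputs, re-run everything under force) comes from the description above.

```python
import tempfile
from pathlib import Path
from typing import Callable

Stage = tuple[str, Callable[[], str]]


def run_stages(stages: list[Stage], outdir: Path, force: bool = False) -> list[str]:
    """Run each stage unless its output already exists; return the stages actually run."""
    outdir.mkdir(parents=True, exist_ok=True)
    ran = []
    for name, stage_fn in stages:
        output = outdir / f"{name}.json"  # hypothetical naming scheme
        if output.exists() and not force:
            continue  # resumable: keep the earlier result
        output.write_text(stage_fn())
        ran.append(name)
    return ran


if __name__ == "__main__":
    demo = [("literature", lambda: "{}"), ("discovery", lambda: "{}")]
    with tempfile.TemporaryDirectory() as tmp:
        outdir = Path(tmp) / "run1"
        print(run_stages(demo, outdir))              # first run: both stages
        print(run_stages(demo, outdir))              # second run: nothing to do
        print(run_stages(demo, outdir, force=True))  # force: both stages again
```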

Token + USD ledger

Every LLM call records input_tokens + output_tokens + cost_usd + model_id via claude --output-format json. Estimated counts (codex) are flagged honestly.
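Aggregating those per-call records into a per-model receipt might look like the sketch below. The four field names are the ones listed above; the `estimated` flag, the record layout, and the model identifiers are assumptions made for illustration.

```python
def summarize_ledger(records: list[dict]) -> dict[str, dict]:
    """Aggregate per-call ledger rows into per-model totals."""
    totals: dict[str, dict] = {}
    for rec in records:
        row = totals.setdefault(
            rec["model_id"],
            {"calls": 0, "input_tokens": 0, "output_tokens": 0,
             "cost_usd": 0.0, "estimated": False},
        )
        row["calls"] += 1
        row["input_tokens"] += rec["input_tokens"]
        row["output_tokens"] += rec["output_tokens"]
        row["cost_usd"] += rec["cost_usd"]
        # One estimated call is enough to flag the whole row.
        row["estimated"] = row["estimated"] or rec.get("estimated", False)
    return totals


# Illustrative records with placeholder model identifiers.
calls = [
    {"model_id": "model-a", "input_tokens": 1000, "output_tokens": 400, "cost_usd": 0.5},
    {"model_id": "model-a", "input_tokens": 2000, "output_tokens": 600, "cost_usd": 0.25},
    {"model_id": "model-b", "input_tokens": 500, "output_tokens": 100, "cost_usd": 0.0,
     "estimated": True},
]
for model, row in summarize_ledger(calls).items():
    print(model, row)
```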

Backend-honest

If claude fails, the call surfaces as an error. No silent codex fallback. backend_actually_used is recorded in every marker file.

Worked examples — in the repo

Real papers. Real reviewers. Real cost ledger.

Every example here lives at examples/end_to_end/ and contains the input paper, every stage's LLM output, the backend markers (token counts + USD), and the compiled PDF.

⚠ Disclaimer. All reviews below are generated end-to-end by AI, for academic and demonstration purposes only. We make no claims about the papers under review — four passed real peer review at their journals and one is a public arXiv preprint; nothing here is criticism of the authors or a substitute for human peer review. The combined five-paper report is REVIEWS.pdf.

PRX Quantum

LCU-Trotter — Trotter error compensation with LCU

Zeng, Sun, Jiang, Zhao · arXiv:2212.04566 · PRX Quantum 6, 010359

Panel 6.67/10, minor-revisions; 5 fallacy findings. The physics voice traced boundary-condition gaps in the lattice analysis; the Devil's Advocate pressed the observable-estimation-vs-coherent-simulation framing. Full claude run, $1.94 real USD.

Read the report (PDF)

npj-QI

Flow-VQE — Generative flow-based VQE warm-start

Zou, Rahm, Kockum, Olsson · arXiv:2507.01726

Panel 6.0/10, major-revisions; 7 fallacy findings from the quantum-CS taxonomy (cherry-picked-baseline, pareto-cherry-picked-axes, …). Full claude run, $0.97 real USD.

Read the report (PDF)

Quantum · QML

HW-QML — Hamming-weight preserving circuits for ML

Monbroussou, Mamon, Landman, Grilo, Kukla, Kashefi · arXiv:2309.15547 · Quantum 9, 1745

The quantum-machine-learning entry: trainability, expressivity, barren-plateau analysis of subspace-preserving ansätze. Panel 5.5/10, major-revisions; 6 fallacy findings. Full claude run, $1.63 real USD.

Read the report (PDF)

PRX Quantum · QML

QCNN — Quantum CNNs are classically simulable

Bermejo, Braccia, Rudolph, Holmes, Cincio, Cerezo · arXiv:2408.12739 · PRX Quantum 7, 020304

The QML dequantization landmark: Pauli-path simulation of QCNNs. Panel 6.0/10, major-revisions; 7 fallacy findings. Full claude run, $1.36 real USD.

Read the report (PDF)

preprint · hardware

Majorana tetron — 20-second parity lifetime

Microsoft Quantum · arXiv:2606.03884 · preprint

The hardware entry — first non-gate-model paper through the pipeline, and the showcase for the full verification layer: argument-structure audit, deterministic claims registry, 16-point disclosure audit, and a paragraph-anchored revision plan. Panel 6.33/10, major-revisions; the Devil's Advocate flagged the single-wire scope vs the multi-tetron-array claims. Full claude run, $1.78 real USD across 8 stages.

Read the report (PDF)

scaffolded

Three quantum avenues from a blank page

Surface code · QML kernels · QAOA on MaxCut

Three different topics, three target journals (PRX Quantum / npj-QI / Quantum), three quantum libraries (Qiskit / PennyLane / QuTiP). One drafted with codex when claude hit a nested-CLI collision (since root-caused — nested calls now disable tool use). All three got reject / major-revisions — because they were scaffolded with no real experiments. The reviewer panel is brutal about that, by design.

Open the run

vs. ARC and ARS

What QuantumNovelty adds.

QN is a peer of AutoResearchClaw and academic-research-skills. We cherry-picked the good parts and added a quantum-specific spine. The bottom line first — all three frameworks were run head-to-head on the same PRX Quantum paper (LCU-Trotter, arXiv:2212.04566) with the same backend; the measured rows come from that run (full 40-page comparison PDF):

QN (QuantumNovelty) · ARS (academic-research-skills) · ARC (AutoResearchClaw)

Review steps
  • QN: 4 default stages (+3 opt-in: novelty-audit, cross-llm, synthesizer → up to 7)
  • ARS: 7 agents in 3 phases
  • ARC: 2 review stages (of a 23-stage generative pipeline)

Reviewer "agents"
  • QN: 5 voices in one LLM call (EIC + Physics + Novelty + Evidence + Devil's Advocate)
  • ARS: 7 independent LLM calls (field-analyst, EIC, methodology, domain, perspective, Devil's Advocate, synthesizer)
  • ARC: 1 call simulating Reviewer A + B

Logical fallacies
  • QN: ✓ dedicated stage, the only one with a domain taxonomy: 11 quantum-CS fallacies (simulator-laundering, cherry-picked-baseline, cross-llm-theatre, …) + standard
  • ARS: partial, folded into Devil's Advocate + methodology reviewer
  • ARC: (fabrication / evidence-consistency check instead)

Devil's Advocate
  • QN: Voice 4 of the panel
  • ARS: dedicated agent

Numeric quality gate
  • QN: deterministic _quality_gate.json (ARC's shape, zero LLM cost)
  • ARC: the original (score_1_to_10 + threshold)

Editorial decision package
  • QN: opt-in --with-synthesizer (+ CONSENSUS-0 tags for fallacy-only findings)
  • ARS: Phase-2 synthesizer (consensus tags, roadmap, response template)
  • ARC: partial (required_actions list)

Measured cost (same paper)
  • QN: 4 calls, $2.15 real USD
  • ARS: 7 calls, $3.10 real USD
  • ARC: 2 calls, $0.66 real USD

Quantum-specific?
  • QN: ✓ the only one
  • ARS: general academic
  • ARC: general (ML-leaning)

All numbers are from a single all-claude run (Claude Code CLI, exact per-call USD). On LCU-Trotter the three architectures landed in the same revisions band — QN gate 6.0/10 (major revisions), ARC gate 7/10 (accept with minor revisions), ARS editorial decision (Minor Revision); QN's quantum-CS-specific checks make it the strictest. The concrete differences in detail:

Quantum-specific pipeline

The audit-and-falsify framework targets the failure modes that appear in quantum-computing papers: simulator precision artefacts, complex64 vs float64 swaps, Pareto cherry-picking on convenient axes, JW/BK/parity mapping chosen for cosmetic gate counts. ARC's gate stack is venue-agnostic; QN's is quantum-shaped.

Tested on real published papers

Run end-to-end on Flow-VQE (arXiv:2507.01726, npj-QI) and LCU-Trotter (arXiv:2212.04566, PRX Quantum) with the PDF report shipped in examples/ — plus HW-QML, QCNN, and the Majorana-tetron preprint, and three scaffolded avenues across different libraries and venues.

Used to generate a new paper

Not just review. The quantum_paper skill with 10 modes (full / plan / outline-only / revision / revision-coach / abstract-only / lit-review / format-convert / citation-check / disclosure) takes a topic to a venue-formatted draft. cross_llm_prediction lets you bake falsifiability into the paper.

Quantum-CS logical fallacies

11 fallacies that appear in quantum-CS work specifically — not in the general taxonomy. Cherry-picked-baseline. Ad-hoc-precision-floor. Simulator-laundering. Mapping-by-convenience. Pareto-cherry-picked-axes. Cross-llm-theatre. The taxonomy was extracted from real reviewer concerns raised against the framework's motivating study (paper in development).

Venue + library selectors

11-journal registry (npj-QI · PRX Quantum · Quantum · PRL · PRA · PR Applied · Nature Comms · Comms Physics · QST · Physics Letters A · IEEE TQE) with abstract limits, section orderings, required statements. 7-library registry (Qiskit · PennyLane · QuTiP · mlxq · Cirq · OpenFermion · no-code) emitting library-specific code skeletons.

Token + USD ledger in the PDF

Every LLM call's input/output tokens, model snapshot ID, USD cost, and elapsed time get aggregated into the comparison PDF. Two-paper analysis cost $<X> on claude; the PDF shows you the receipt. Codex / estimated rows flagged honestly.

Nested-CLI isolation playbook

QN scrubs CLAUDE_CODE_* and ANTHROPIC_* env from every subprocess, runs from a neutral cwd, passes --no-session-persistence. Hardened against the "tool_use ids must be unique" 400 that bites nested claude calls. When isolation isn't enough, QN refuses to silently swap backends.
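The environment scrub and neutral working directory can be sketched with the standard library alone. The two prefixes come from the paragraph above; the function names are hypothetical, and the demo runs a harmless Python child process instead of the claude CLI, so no CLI flags are exercised here.

```python
import os
import subprocess
import sys
import tempfile
from typing import Mapping, Optional

SCRUB_PREFIXES = ("CLAUDE_CODE_", "ANTHROPIC_")


def scrubbed_env(env: Optional[Mapping[str, str]] = None) -> dict[str, str]:
    """Copy the environment, dropping every variable with a scrubbed prefix."""
    source = os.environ if env is None else env
    return {k: v for k, v in source.items() if not k.startswith(SCRUB_PREFIXES)}


def run_isolated(cmd: list[str], env: Optional[Mapping[str, str]] = None) -> subprocess.CompletedProcess:
    """Run cmd from a throwaway cwd with the scrubbed environment."""
    with tempfile.TemporaryDirectory() as neutral_cwd:
        # check=True makes a failing child raise instead of being papered over.
        return subprocess.run(
            cmd, env=scrubbed_env(env), cwd=neutral_cwd,
            capture_output=True, text=True, check=True,
        )


if __name__ == "__main__":
    parent_env = dict(os.environ, ANTHROPIC_API_KEY="placeholder")
    probe = "import os; print('ANTHROPIC_API_KEY' in os.environ)"
    print(run_isolated([sys.executable, "-c", probe], env=parent_env).stdout.strip())
```

Because the child raises on a non-zero exit, a failed call surfaces as an error to the caller, which is the precondition for refusing a silent backend swap.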

Stage-6 CQE with geometric mean

Six dimensions composed by geometric (not arithmetic) mean. A run that scores 95 on five dimensions and 30 on Falsifiability gets an unweighted geometric-mean composite of about 78, against an arithmetic mean of about 84. The publication-blocking weakness cannot be averaged away.
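The arithmetic is easy to check directly. The sketch below assumes an unweighted geometric mean over the six CQE dimensions; the framework's actual weighting is not stated here, so treat the numbers as an illustration of the principle, not as the tool's exact output.

```python
from statistics import fmean, geometric_mean

# Dimension names from the process_summary description; scores are illustrative.
scores = {
    "Novelty Rigour": 95,
    "Reproducibility": 95,
    "Methodological Rigour": 95,
    "Falsifiability": 30,
    "Domain Depth": 95,
    "Communication": 95,
}


def composite(dimension_scores: dict[str, float]) -> float:
    """Unweighted geometric mean of the dimension scores (assumption)."""
    return geometric_mean(dimension_scores.values())


print(f"geometric:  {composite(scores):.1f}")
print(f"arithmetic: {fmean(scores.values()):.1f}")
```

The weak dimension costs about six points relative to the arithmetic mean in this unweighted form; a scheme that weights Falsifiability more heavily would pull the composite lower still.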

From ARC: gate-stack pattern · citation-integrity · compilation-quality · paper-verification · quality-gate · research-decision · knowledge-archive
From ARS: modular skill-as-folder pattern · multi-mode skill drivers · literature-surfacer triad (CrossRef + arXiv + Semantic Scholar)
New in QN: novelty_audit · audit_falsify primitives · quantum-CS fallacy taxonomy · cross-LLM falsifiability rubric · venue + library registries · token + USD ledger · Stage-6 CQE

Get started

Three commands.

1

Clone

$ git clone <repo> QuantumNovelty
$ cd QuantumNovelty
$ pip install -e .

2

Verify a backend

$ which claude   # default: Claude Code CLI (subscription)
$ which codex    # optional: cross-LLM falsifiability

3

Run anything

$ chain/run.sh --list-skills

$ chain/run.sh --pipeline chat \
    --prompt "Quick assessment of this paper" \
    --paper paper.pdf --execute

Requirements

  • claude CLI (default) or codex CLI
  • Python ≥ 3.11
  • pdftotext (poppler) and pdflatex (any TeX distribution)
  • Optional: ANNAS_ARCHIVE_KEY for book downloads; SERPER_KEY for Google Scholar