QuantumNovelty is eighteen AI agent skills — running on the Claude Code CLI by default, with Codex available for cross-vendor falsifiability checks — that write and review quantum‑computing papers. The agents refuse to declare a result novel until the strict-domination comparator has run against current published baselines, every ratio has been recomputed from raw JSON, small-sample rates carry Wilson 95 % CIs, and the failure modes have their own section in the manuscript.
Tested on real arXiv papers. Every agent call ships its token + USD cost ledger in the PDF report. The framework refuses to silently fall back to a different LLM backend.
Both workflows, writing a paper and reviewing one, share the same skill catalog and audit-and-falsify framework. Pick by what you need today.
From a research question to a submission-ready manuscript with audit script attached and a Failure Modes section.
$ chain/run.sh --pipeline full-pipeline \
--topic "LLM-driven UCCSD pruning on H2O" \
--hamiltonian H2O_4e_4o_8q \
--journal npj-quantum-information \
--quantum-lib mlxq
Point QuantumNovelty at any published or preprint paper. Get a real reviewer report.
$ chain/run.sh --pipeline chat \
--prompt "Review this paper" \
--paper draft.pdf \
--journal npj-quantum-information
Every skill is a directory with a SKILL.md, a run.sh, and a Python driver. Drop a new one in — the chain discovers it on the filesystem walk.
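As a rough picture of the filesystem walk described above, here is a minimal sketch. The function name `discover_skills` and the rule "a directory is a skill if it holds both SKILL.md and run.sh" are assumptions for illustration; the chain's actual discovery logic may check more than this.

```python
from pathlib import Path


def discover_skills(root: Path) -> list[str]:
    """Return the names of directories under root that look like skills.

    Assumption: a directory counts as a skill when it contains both
    SKILL.md and run.sh. The real chain may apply further checks.
    """
    skills = []
    for skill_md in sorted(root.rglob("SKILL.md")):
        skill_dir = skill_md.parent
        if (skill_dir / "run.sh").is_file():
            skills.append(skill_dir.name)
    return skills
```

Because the walk is recomputed on each call, dropping a new directory into the tree is enough for it to show up on the next run; no registry file has to be edited.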
The audit-and-falsify framework. Augmented-baseline merge → strict-domination at ε_abs=10⁻¹², ε_rel=10⁻⁹ → recompute-from-raw → Wilson 95 % CIs → honest-negatives enforcement → re-runnable audit_claims.py.
full · quick · systematic-review · socratic · fact-check · lit-review · review
full · plan · outline-only · revision · revision-coach · abstract-only · lit-review · format-convert · citation-check · disclosure
full (EIC + R1 + R2 + R3 + Devil's Advocate) · quick · guided · methodology-focus · re-review · calibration
cherry-picked-baseline · ad-hoc-precision-floor · conflated-regimes · active-space-handwave · hardware-irrelevant-comparison · asymptotic-only-claim · unit-inflation · simulator-laundering · mapping-by-convenience · pareto-cherry-picked-axes · cross-llm-theatre
Falsifiability rubric. Refuses single-vendor "consensus". Predictions persisted before truth.
LLM-in-loop discovery. Bring your own evaluator. Strict-domination archive at calibrated tolerances.
4 standard axes: LLM-mutator on/off · commutation-hint · Pareto seeding · cross-vendor.
CrossRef + arXiv + Semantic Scholar live HTTP. Emits the Pareto-shaped baseline catalog.
Anna's Archive client for citations not on arXiv or CrossRef.
Stage-6 CQE. Geometric-mean composite of Novelty Rigour · Reproducibility · Methodological Rigour · Falsifiability · Domain Depth · Communication.
Natural-language frontend. Pattern-first dispatch; LLM fallback. "Review this paper" → the right skill + mode + flags.
Strict-domination primitives surfaced so you can compose custom audit pipelines.
Registries: 11 venues (npj-QI, PRX-Quantum, Quantum, PRL, PRA, Nature Comms…) + 7 libraries (Qiskit, PennyLane, QuTiP, mlxq, Cirq, OpenFermion, no-code).
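The strict-domination check that recurs in the catalog above can be pictured with a small sketch. Only the two tolerance values come from the text; the function names, the "all axes are minimised" convention, and the way the absolute and relative tolerances are combined are assumptions, not the framework's confirmed definition.

```python
EPS_ABS = 1e-12  # absolute tolerance quoted in the text
EPS_REL = 1e-9   # relative tolerance quoted in the text


def strictly_better(a: float, b: float) -> bool:
    """True if a is lower than b by more than the combined tolerance."""
    tolerance = EPS_ABS + EPS_REL * max(abs(a), abs(b))
    return (b - a) > tolerance


def strictly_dominates(candidate, baseline) -> bool:
    """Candidate dominates baseline when it is not worse on any axis and
    strictly better on at least one. All axes are treated as minimised,
    which is an assumption of this sketch.
    """
    if len(candidate) != len(baseline):
        raise ValueError("candidate and baseline must have the same axes")
    pairs = list(zip(candidate, baseline))
    not_worse = all(not strictly_better(b, c) for c, b in pairs)
    better_somewhere = any(strictly_better(c, b) for c, b in pairs)
    return not_worse and better_somewhere
```

With this convention, a candidate that differs from a baseline only by floating-point noise is not counted as an improvement, which is the point of calibrating the tolerances rather than comparing with a bare `<`.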
Multi-source pull. Books via Anna's Archive when needed. No RAG; queries hit live.
LLM-in-loop Pareto archive with strict-domination at calibrated ε.
Augmented-baseline merge. Ratio recompute. Wilson CIs. Honest-negatives gate.
Venue-aware draft. Falsifiable amplitude rubric across two vendors.
5 voices. Quantum-CS fallacy taxonomy applied. Verdict + must-fix list.
6-dim Collaboration Quality Evaluation. Geometric mean. Honest about gaps.
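For readers unfamiliar with the Wilson interval used in the audit stage above, here is a minimal sketch of the standard Wilson score interval for a binomial rate. The function name is hypothetical, and this is not taken from the project's own audit code.

```python
import math

Z_95 = 1.959963984540054  # two-sided 95 % normal quantile


def wilson_interval(successes: int, trials: int, z: float = Z_95) -> tuple[float, float]:
    """Wilson score interval for a binomial proportion."""
    if trials <= 0 or not 0 <= successes <= trials:
        raise ValueError("need trials > 0 and 0 <= successes <= trials")
    p_hat = successes / trials
    denom = 1.0 + z * z / trials
    centre = (p_hat + z * z / (2 * trials)) / denom
    half_width = (
        z * math.sqrt(p_hat * (1 - p_hat) / trials + z * z / (4 * trials * trials))
    ) / denom
    # Clamp to [0, 1] to absorb floating-point rounding at the boundaries.
    return max(0.0, centre - half_width), min(1.0, centre + half_width)
```

For 7 successes in 10 trials this gives roughly (0.397, 0.892). The width of that interval is why a small-sample rate reported without a CI says very little.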
Re-run with the same --outdir: stages whose output exists are skipped. --force re-runs everything.
Every LLM call records input_tokens + output_tokens + cost_usd + model_id via claude --output-format json. Estimated counts (codex) are flagged honestly.
If claude fails, the call surfaces as an error. There is no silent codex fallback, and backend_actually_used is written to every marker file.
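The ledger described above can be sketched as a simple aggregation over per-call records. The field names `input_tokens`, `output_tokens`, `cost_usd` and `model_id` follow the text; the `stage` and `estimated` fields, the class name and the summary layout are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CallRecord:
    stage: str               # hypothetical label for the pipeline stage
    model_id: str
    input_tokens: int
    output_tokens: int
    cost_usd: float
    estimated: bool = False  # True when counts are estimates (e.g. codex rows)


def summarise(records: list[CallRecord]) -> dict:
    """Total the ledger, keeping measured and estimated costs apart."""
    measured = [r for r in records if not r.estimated]
    estimated = [r for r in records if r.estimated]
    return {
        "calls": len(records),
        "input_tokens": sum(r.input_tokens for r in records),
        "output_tokens": sum(r.output_tokens for r in records),
        "measured_cost_usd": round(sum(r.cost_usd for r in measured), 4),
        "estimated_cost_usd": round(sum(r.cost_usd for r in estimated), 4),
        "estimated_calls": len(estimated),
    }
```

Keeping estimated rows in a separate total means a report can state a real USD figure without quietly mixing in guessed numbers.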
Every example here lives at examples/end_to_end/ and contains the input paper, every stage's LLM output, the backend markers (token counts + USD), and the compiled PDF.
⚠ Disclaimer. All reviews below are generated end-to-end by AI, for academic and demonstration purposes only. We make no claims about the papers under review — four passed real peer review at their journals and one is a public arXiv preprint; nothing here is criticism of the authors or a substitute for human peer review. The combined five-paper report is REVIEWS.pdf.
Zeng, Sun, Jiang, Zhao · arXiv:2212.04566 · PRX Quantum 6, 010359
Panel 6.67/10, minor-revisions; 5 fallacy findings. The physics voice traced boundary-condition gaps in the lattice analysis; the Devil's Advocate pressed the observable-estimation-vs-coherent-simulation framing. Full claude run, $1.94 real USD.
Read the report (PDF)
Zou, Rahm, Kockum, Olsson · arXiv:2507.01726
Panel 6.0/10, major-revisions; 7 fallacy findings from the quantum-CS taxonomy (cherry-picked-baseline, pareto-cherry-picked-axes, …). Full claude run, $0.97 real USD.
Read the report (PDF)
Monbroussou, Mamon, Landman, Grilo, Kukla, Kashefi · arXiv:2309.15547 · Quantum 9, 1745
The quantum-machine-learning entry: trainability, expressivity, barren-plateau analysis of subspace-preserving ansätze. Panel 5.5/10, major-revisions; 6 fallacy findings. Full claude run, $1.63 real USD.
Read the report (PDF)
Bermejo, Braccia, Rudolph, Holmes, Cincio, Cerezo · arXiv:2408.12739 · PRX Quantum 7, 020304
The QML dequantization landmark: Pauli-path simulation of QCNNs. Panel 6.0/10, major-revisions; 7 fallacy findings. Full claude run, $1.36 real USD.
Read the report (PDF)
Microsoft Quantum · arXiv:2606.03884 · preprint
The hardware entry — first non-gate-model paper through the pipeline, and the showcase for the full verification layer: argument-structure audit, deterministic claims registry, 16-point disclosure audit, and a paragraph-anchored revision plan. Panel 6.33/10, major-revisions; the Devil's Advocate flagged the single-wire scope vs the multi-tetron-array claims. Full claude run, $1.78 real USD across 8 stages.
Read the report (PDF)
Surface code · QML kernels · QAOA on MaxCut
Three different topics, three target journals (PRX Quantum / npj-QI / Quantum), three quantum libraries (Qiskit / PennyLane / QuTiP). One drafted with codex when claude hit a nested-CLI collision (since root-caused — nested calls now disable tool use). All three got reject / major-revisions — because they were scaffolded with no real experiments. The reviewer panel is brutal about that, by design.
Open the run
QN is a peer of AutoResearchClaw and academic-research-skills. We cherry-picked the good parts and added a quantum-specific spine. The bottom line first: all three frameworks were run head-to-head on the same PRX Quantum paper (LCU-Trotter, arXiv:2212.04566) with the same backend, and the measured rows below come from that run (full 40-page comparison PDF).
| | QN (QuantumNovelty) | ARS (academic-research-skills) | ARC (AutoResearchClaw) |
|---|---|---|---|
| Review steps | 4 default stages (+3 opt-in: novelty-audit, cross-llm, synthesizer → up to 7) | 7 agents in 3 phases | 2 review stages (of a 23-stage generative pipeline) |
| Reviewer "agents" | 5 voices in one LLM call (EIC + Physics + Novelty + Evidence + Devil's Advocate) | 7 independent LLM calls (field-analyst, EIC, methodology, domain, perspective, Devil's Advocate, synthesizer) | 1 call simulating Reviewer A + B |
| Logical fallacies | ✓ dedicated stage — the only one with a domain taxonomy: 11 quantum-CS fallacies (simulator-laundering, cherry-picked-baseline, cross-llm-theatre, …) + standard | partial — folded into Devil's Advocate + methodology reviewer | ✗ (fabrication / evidence-consistency check instead) |
| Devil's Advocate | ✓ Voice 4 of the panel | ✓ dedicated agent | ✗ |
| Numeric quality gate | ✓ deterministic _quality_gate.json (ARC's shape, zero LLM cost) | ✗ | ✓ the original (score_1_to_10 + threshold) |
| Editorial decision package | ✓ opt-in --with-synthesizer (+ CONSENSUS-0 tags for fallacy-only findings) | ✓ Phase-2 synthesizer (consensus tags, roadmap, response template) | partial (required_actions list) |
| Measured cost (same paper) | 4 calls, $2.15 real USD | 7 calls, $3.10 real USD | 2 calls, $0.66 real USD |
| Quantum-specific? | ✓ the only one | ✗ general academic | ✗ general (ML-leaning) |
All numbers are from a single all-claude run (Claude Code CLI, exact per-call USD). On LCU-Trotter the three architectures landed in the same revisions band — QN gate 6.0/10 (major revisions), ARC gate 7/10 (accept with minor revisions), ARS editorial decision (Minor Revision); QN's quantum-CS-specific checks make it the strictest. The concrete differences in detail:
The audit-and-falsify framework targets the failure modes that appear in quantum-computing papers: simulator precision artefacts, complex64 vs float64 swaps, Pareto cherry-picking on convenient axes, JW/BK/parity mapping chosen for cosmetic gate counts. ARC's gate stack is venue-agnostic; QN's is quantum-shaped.
Run end-to-end on Flow-VQE (arXiv:2507.01726, npj-QI) and LCU-Trotter (arXiv:2212.04566, PRX Quantum) with the PDF report shipped in examples/ — plus HW-QML, QCNN, and the Majorana-tetron preprint, and three scaffolded avenues across different libraries and venues.
Not just review. The quantum_paper skill with 10 modes (full / plan / outline-only / revision / revision-coach / abstract-only / lit-review / format-convert / citation-check / disclosure) takes a topic to a venue-formatted draft. cross_llm_prediction lets you bake falsifiability into the paper.
11 fallacies that appear in quantum-CS work specifically — not in the general taxonomy. Cherry-picked-baseline. Ad-hoc-precision-floor. Simulator-laundering. Mapping-by-convenience. Pareto-cherry-picked-axes. Cross-llm-theatre. The taxonomy was extracted from real reviewer concerns raised against the framework's motivating study (paper in development).
11-journal registry (npj-QI · PRX Quantum · Quantum · PRL · PRA · PR Applied · Nature Comms · Comms Physics · QST · Physics Letters A · IEEE TQE) with abstract limits, section orderings, required statements. 7-library registry (Qiskit · PennyLane · QuTiP · mlxq · Cirq · OpenFermion · no-code) emitting library-specific code skeletons.
Every LLM call's input/output tokens, model snapshot ID, USD cost, and elapsed time get aggregated into the comparison PDF. Two-paper analysis cost $<X> on claude; the PDF shows you the receipt. Codex / estimated rows flagged honestly.
QN scrubs CLAUDE_CODE_* and ANTHROPIC_* env from every subprocess, runs from a neutral cwd, passes --no-session-persistence. Hardened against the "tool_use ids must be unique" 400 that bites nested claude calls. When isolation isn't enough, QN refuses to silently swap backends.
Six dimensions composed by geometric (not arithmetic) mean. A run that scored 95 on five dimensions and 30 on Falsifiability composites at about 78 under equal weights, not the arithmetic 84. The publication-blocking weakness cannot be averaged away.
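To make the composite concrete, here is a minimal sketch of an equal-weight geometric mean applied to the example scores. Equal weighting is an assumption, since the text does not state how the six dimensions are weighted, and the function name is hypothetical. Under that assumption the example evaluates to about 78.4, against an arithmetic mean of about 84.2.

```python
import math


def geometric_mean(scores: list[float]) -> float:
    """Equal-weight geometric mean of strictly positive scores."""
    if not scores or any(s <= 0 for s in scores):
        raise ValueError("scores must be non-empty and strictly positive")
    return math.exp(sum(math.log(s) for s in scores) / len(scores))


# Five dimensions at 95 and Falsifiability at 30, as in the example.
scores = [95, 95, 95, 95, 95, 30]
arithmetic = sum(scores) / len(scores)
geometric = geometric_mean(scores)
print(f"arithmetic mean: {arithmetic:.1f}")
print(f"geometric mean:  {geometric:.1f}")
```

The geometric mean always sits at or below the arithmetic mean, and the gap widens as one dimension falls further behind the rest, so a single weak score drags the composite down more than simple averaging would.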
$ git clone <repo> QuantumNovelty
$ cd QuantumNovelty
$ pip install -e .
$ which claude # default: Claude Code CLI (subscription)
$ which codex # optional: cross-LLM falsifiability
$ chain/run.sh --list-skills
$ chain/run.sh --pipeline chat \
--prompt "Quick assessment of this paper" \
--paper paper.pdf --execute
claude CLI (default) or codex CLI
pdftotext (poppler) and pdflatex (any TeX distribution)
ANNAS_ARCHIVE_KEY for book downloads; SERPER_KEY for Google Scholar