
Current Evidence

The current promoted run is runs/self-improve-v0.42/.

Older milestone evidence that is no longer part of the current-state table is preserved in Historical evidence archive.

| Signal | Current value |
| --- | --- |
| Product name | QuarkLM |
| Package / repository slug | quark-lm |
| Docs host | Read the Docs |
| Marketing host | GitHub Pages |
| RC spec | RC_SPEC.md |
| RC gap audit | RC_GAP_AUDIT.md |
| RC checklist | RC_CHECKLIST.md |
| Recommended RC track | Research Prototype RC |
| Research Prototype RC status | near: the closed-world self-improvement system is reproducible, auditable, documented, and honest about unpromoted transformer evidence. |
| Language Model RC status | not ready: the from-scratch transformer still fails branch_diversity_target after v0.115. |
| Next model bundle | Profile-Balanced Routing Repair with representation-separation acceptance checks. |
| Research grounding | sites/docs/docs/learn/research-grounding.md |
| Research reviewed | 2026-06-15 |
| Research pass | v0.71-v0.115 implementation evidence; v0.115 adds a bias-frozen hidden-projection margin candidate with exact retrieval memory and rejected neural promotion. |
| Research decision | QuarkLM should model self-improvement as a closed-loop lifecycle with ledgered admission, verified selection, auditable weight optimization, separate inference rails, and promotion gates that can reject regressions. |
| Research next step | Use the v0.115 hidden-projection candidate evidence to design a broader guarded routing repair that can lift coverage without collapsing profiles. |
| Open-source mechanics audit | MECHANICS_AUDIT.md and sites/docs/docs/learn/open-source-mechanics-audit.md added as the v0.66 deeper comparison of open-source LLM, tokenizer, continual-learning, transparency, and self-improvement mechanics. |
| Mechanics audit decision | The next bottleneck is trainer mechanics, not another global branch-loss term: direct-answer replay should be profile-aware, artifacted, coverage-constrained, and tested for profile isolation before the next full-stack repair run. |
| Mechanics audit next step | Improve target-token diversity for remaining memory-backed failures using v0.115 hidden-projection candidate evidence; do not count retrieval as weight learning. |
| Forward research plan | FORWARD_RESEARCH_PLAN.md and sites/docs/docs/learn/forward-research-plan.md added as the v0.69 cross-referenced implementation strategy. |
| Forward plan decision | Pause direct-answer objective churn until QuarkLM has the self-improvement operating system needed to decide which training changes are legitimate: experiment registry, replay extraction, corpus governance, candidate quarantine, closed-world verifier checks, recipe boundaries, and constraint-first promotion gates. |
| Forward plan next step | After v0.115, expand guarded hidden-projection repair only if it improves target coverage under the branch-diversity gate. |
| Deep research review | DEEP_RESEARCH_REVIEW.md and sites/docs/docs/learn/deep-research-review.md added as the v0.70 deeper cross-referenced research and implementation-gap review. |
| Deep review decision | No larger transformer repair screen should run until experiment intent, corpus plans, replay plans, verifier checks, recipes, and constraint-first promotion are explicit artifacts. |
| Deep review next step | After v0.115, tie the target-routing gap to a guarded hidden-projection and prompt-representation repair surface that can survive promotion constraints. |
| Research implementation map | RESEARCH_IMPLEMENTATION_MAP.md and sites/docs/docs/learn/research-implementation-map.md added as the v0.74 source-to-gap-to-version implementation map. |
| Implementation map decision | Deep cross-referenced research and open-source mechanics review are now a required implementation control: each next mechanic should cite its research pattern, name the closed-world boundary it protects, and produce acceptance evidence. |
| Implementation map next step | The mechanics from candidate quarantine through hidden-projection margin repair are implemented and screened through v0.115.0; broader guarded routing repair is next. |
| Experiment registry | src/closed_world_lm/experiment_registry.py and sites/docs/docs/operate/experiment-registry.md added as the v0.71 run-intent implementation. |
| Experiment registry decision | Self-improvement and transformer answer-training runs now declare hypothesis, allowed data, planned artifacts, recipe id, acceptance gates, failure criteria, notes, and final decision before their outputs are trusted as evidence. |
| Experiment registry next step | Use the registry as the required evidence wrapper for replay, corpus, verifier, recipe, and promotion-gate mechanics. |
| Replay planning | src/closed_world_lm/replay_plan.py added as the v0.72 standalone replay-planning module. |
| Replay planning decision | Transformer training still uses the existing profile-aware replay behavior, but replay record normalization, profile grouping, coverage floors, missing-target summaries, and JSON-safe plan shape now live outside the transformer monolith. |
| Replay planning next step | Use standalone replay planning as input to corpus hygiene, candidate quarantine, verifier, recipe, and promotion-gate reports. |
| Corpus hygiene | src/closed_world_lm/corpus_hygiene.py and sites/docs/docs/operate/corpus-hygiene.md added as the v0.73 corpus hygiene and training-plan artifact implementation. |
| Corpus hygiene decision | Self-improvement and transformer answer-training runs now write corpus_hygiene.json and training_plan.json with source mixtures, duplicate checks, train/eval prompt overlap, candidate ratios, rare-profile coverage, allowed data sources, planned artifacts, and replay-plan summaries when available. |
| Corpus hygiene next step | Use candidate ratios, quarantine summaries, overlap evidence, verifier summaries, recipe summaries, transformer responsibility surfaces, checkpoint metadata surfaces, and eval surfaces as inputs to objective-repair work. |
| Candidate quarantine | src/closed_world_lm/candidate_quarantine.py and sites/docs/docs/operate/candidate-quarantine.md added as the v0.75 candidate lifecycle implementation. |
| Candidate quarantine decision | Self-improvement and transformer answer-training runs now write candidate_quarantine.json, and training_plan.json records that candidate records are not training data until admitted into the ledgered corpus and converted into curriculum lessons. |
| Candidate quarantine next step | Use the candidate quarantine manifest as input to deterministic verifier checks, recipe artifacts, and future promotion gates. |
| Closed-world verifier | src/closed_world_lm/closed_world_verifier.py and sites/docs/docs/operate/closed-world-verifier.md added as the v0.76 deterministic verifier implementation. |
| Verifier decision | Self-improvement and transformer answer-training runs now write closed_world_verifier.json, embed verifier summaries in training_plan.json, and require verifier approval as a run-intent gate without using an external model. |
| Verifier next step | Use verifier evidence as an input to recipe objects, constraint-first promotion gates, transformer responsibility surfaces, model/checkpoint metadata, eval surfaces, and objective-repair work. |
| Training recipe | src/closed_world_lm/training_recipe.py and sites/docs/docs/operate/training-recipes.md added as the v0.77 recipe and constraint-first promotion implementation. |
| Training recipe decision | Self-improvement and transformer answer-training runs now write training_recipe.json and constraint_first_promotion.json. Transformer decisions cannot promote from loss, NLL, rank, top-k, or exact quality evidence unless closed-world constraints pass first. |
| Training recipe next step | Use recipe and constraint-first artifacts as the surfaces for transformer objective-repair work. |
| Transformer responsibility | src/closed_world_lm/transformer_experiment.py, src/closed_world_lm/transformer_training.py, src/closed_world_lm/transformer_objectives.py, and sites/docs/docs/build/transformer-responsibilities.md added as the v0.78 transformer responsibility implementation. |
| Transformer responsibility decision | Transformer answer-training now keeps artifact contracts, experiment intent, recipe creation, promotion decisions, JSONL snapshot writing, shuffled training cursors, loss averaging, and the direct-answer objective catalog behind narrow tested surfaces while preserving the public CLI. |
| Transformer responsibility next step | Use the v0.78 responsibility surfaces with the v0.115.0 hidden-projection candidate evidence before broader routing repair. |
| Transformer model surface | src/closed_world_lm/transformer_model.py and tests/test_transformer_model.py added as the v0.79 transformer model/config and checkpoint metadata implementation. |
| Transformer model decision | Transformer config, optimizer config, generation config, validation, checkpoint architecture, checkpoint format, tokenizer identity, closed-world dataset metadata, arg-to-config adapters, and run metadata now live outside transformer_char_model.py while remaining re-exported for compatibility. |
| Transformer model next step | Use model/checkpoint metadata surfaces with the v0.115.0 hidden-projection candidate evidence before broader routing repair. |
| Transformer eval surface | src/closed_world_lm/transformer_checkpoint.py, src/closed_world_lm/transformer_eval.py, tests/test_transformer_checkpoint.py, and tests/test_transformer_eval.py added as the v0.80 transformer eval/checkpoint-load implementation. |
| Transformer eval decision | Checkpoint payload loading and identity validation, checkpoint summaries, probe loading, candidate collection, generic transformer scoring, eval report assembly, samples JSONL writing, and eval JSON writing now live outside transformer_char_model.py while preserving CLI behavior and artifact shapes. |
| Transformer eval next step | v0.115.0 uses eval and promotion surfaces to screen a bias-frozen hidden-projection margin candidate; branch_diversity_target still blocks promotion. |
| Latest repository version | v0.115.0 |
| Latest version summary | bias-frozen hidden-projection margin candidate evidence |
| Current version | v0.42 |
| Admitted facts | 12 |
| Direct admission probes | 48/48 |
| Admission paraphrase probes | 84/84 |
| Glossary probes | 38/38 |
| QA exact | 8/8 |
| Admissions exact | 48/48 |
| Admission paraphrases exact | 84/84 |
| Glossary exact | 38/38 |
| Self exact | 7/7 |
| Learning exact | 4/4 |
| Forgetting audit | passed |
| Prompt leakage audit | passed |
| Exact eval audit | passed |
| Promotion gate | passed |
| Self-diagnosis | passed |
| Self-diagnosis external model | false |
| Self-diagnosis recommendation | promote_or_expand_corpus |
| Attempt archive | enabled |
| Transformer run | runs/transformer-answer-v0.42-branch-repair-contrast50-dim8-context32/ |
| Transformer validation NLL | answer target NLL 3.5850 -> 2.4129 |
| Transformer exact | 0/219 -> 0/219 direct greedy |
| Transformer candidate accuracy | 15/219 -> 37/219 eval-scoped |
| Direct transformer exact | 0/219 -> 0/219 direct greedy |
| Direct transformer loss | 3.4278 -> 2.2708 |
| Direct transformer mode | periodic-branch-repair-contrast-unlikelihood |
| Direct transformer failure pattern | short wrong ' te.' greedy completion after wider sparse branch contrast |
| Latest transformer screen | runs/transformer-answer-v0.115.0-hidden-projection-margin-candidate-step1-dim4-context80/ |
| Latest screen direct loss | 4.9050 one-step hidden-projection margin candidate screen |
| Latest screen direct exact | branch-only screen; direct greedy eval skipped |
| Latest screen post-direct candidate snapshot skipped | true |
| Latest retrieval memory report | runs/transformer-answer-v0.115.0-hidden-projection-margin-candidate-step1-dim4-context80/retrieval_memory_report.json |
| Retrieval memory artifact | retrieval_memory_report.json is now a transformer answer-training artifact declared in experiment intent and training plans. |
| Retrieval memory summary | 497 corpus-only memory cards; 219/219 exact retrieval evals; no external model, embeddings, pretrained retriever, or weight updates. |
| Retrieval memory status | memory-first evidence remains exact in v0.115.0 and is consumed only as source-plan evidence, not neural promotion. |
| Latest memory consolidation plan | runs/transformer-answer-v0.115.0-hidden-projection-margin-candidate-step1-dim4-context80/memory_consolidation_plan.json |
| Memory consolidation summary | v0.115 keeps retrieval exact, screens hidden-projection margin repair with output bias frozen, and still rejects neural promotion on branch_diversity_target. |
| Memory consolidation status | logit-prior representation evidence: 8 missing-token candidates, 24 attempts, 0 direct missing-token acceptances, 24 rejections, 8 fallbacks, 1 accepted profile-specific update shape, no external model, embeddings, or pretrained retriever; branch_diversity_target still blocks promotion with critical target_routing_gap, high output-bias escape risk, low representation separation across 9/9 profiles, and hidden-projection pressure across 9/9 multi-target profiles. |
| Latest transformer diagnostic run | runs/transformer-answer-v0.43-branch-profile-smoke-dim4-context16/ |
| Latest transformer diagnostic | direct-answer branch profiles from model logits |
| Latest diagnostic QA branch accuracy | 1/8 -> 1/8 |
| Latest diagnostic dominant prediction | all 'o' -> all 'y' |
| Latest transformer repair run | runs/transformer-answer-v0.43-periodic-branch-batch-smoke-dim4-context16/ |
| Latest transformer repair mode | periodic-branch-batch-contrast-unlikelihood |
| Latest transformer repair status | rejected: loss improved but prompt-independent branch collapse worsened |
| Latest representation screen | runs/transformer-answer-v0.43-prompt-attention-branch-repair-smoke-dim4-context16/ |
| Latest representation option | --use-prompt-attention-summary |
| Latest representation status | rejected: prompt-attention summary projection moved and lowered loss, but QA branch collapse still worsened |
| Latest branch-context diagnostic | runs/transformer-answer-v0.43-branch-context-coverage-smoke-dim4-context80/ |
| Branch context 16 QA | 0/8 semantic covered; 4 ambiguous QA branch contexts |
| Branch context 32 QA | 0/8 semantic covered; 0 ambiguous QA branch contexts |
| Branch context 80 all evals | 219/219 semantic covered; 0 ambiguous branch contexts |
| Latest branch-context gate run | runs/transformer-answer-v0.43-branch-context-gate-smoke-dim4-context80/ |
| Branch context gate at 16 | required gate failed; requested 5 direct steps, ran 0 |
| Branch context gate at 80 | required gate passed; requested 1 direct step, ran 1 |
| Latest branch-only screen | runs/transformer-answer-v0.43-branch-context-gated-branchonly-smoke-dim4-context80/ |
| Branch-only gate | passed; requested 5 direct steps, ran 5 |
| Branch-only eval skipping | direct greedy evals skipped in JSONL snapshots; branch profiles and branch-context gate retained |
| Latest branch-only repair screen | runs/transformer-answer-v0.43-branchonly-periodic-repair-contrast50-dim8-context80/ |
| Branch-only repair status | rejected screen: gate passed and 100/100 direct steps ran, but QA branch prediction collapsed from all space to all 'a' |
| Latest branch-only batch screen | runs/transformer-answer-v0.43-branchonly-branch-batch-dim8-context80/ |
| Branch-only batch status | rejected screen: gate passed and 50/50 direct steps ran, but QA branch prediction still collapsed to all 'a' |
| Latest branch-diversity target run | runs/transformer-answer-v0.43-branch-diversity-target-smoke-dim4-context80/ |
| Branch-diversity target | direct-answer snapshots include branch_diversity_target over multi-target eval profiles |
| Branch-diversity smoke | context gate passed; diversity target failed 0/9 multi-target profiles; QA target_unique 8, predicted_unique 1, dominant 'r' rate 1.0 |
| Latest branch-diversity training run | runs/transformer-answer-v0.43-branch-diversity-train-smoke-dim4-context80/ |
| Branch-diversity training mode | branch-diversity-unlikelihood |
| Branch-diversity training status | rejected smoke: gate passed and 10/10 direct steps ran, but diversity target still failed 0/9 multi-target profiles |
| Latest branch-diversity freeze-bias run | runs/transformer-answer-v0.43-branch-diversity-freezebias-smoke-dim4-context80/ |
| Branch-diversity freeze-bias mode | branch-diversity-unlikelihood with --direct-answer-freeze-output-bias |
| Branch-diversity freeze-bias status | rejected stabilizer: gate passed and 50/50 direct steps ran with output bias frozen, but diversity target still failed 0/9 multi-target profiles |
| Latest branch-target softmax run | runs/transformer-answer-v0.43-branch-target-softmax-freezebias-smoke-dim4-context80/ |
| Branch-target softmax mode | branch-target-softmax-unlikelihood with --direct-answer-freeze-output-bias |
| Branch-target softmax status | rejected target-set screen: gate passed and 50/50 direct steps ran, composite train loss moved 5.6671 -> 5.5820, but diversity target still failed 0/9 multi-target profiles |
| Latest branch restore run | runs/transformer-answer-v0.43-branch-target-softmax-restorebest-smoke-dim4-context80/ |
| Branch restore mode | branch-target-softmax-unlikelihood with --direct-answer-restore-best-branch-snapshot |
| Branch restore status | rejected guardrail: restored best aggregate branch snapshot from step 40 after 50/50 direct steps, but diversity target still failed 0/9 multi-target profiles |
| Latest prompt-prefix projection run | runs/transformer-answer-v0.43-prompt-prefix-target-softmax-restorebest-smoke-dim4-context80/ |
| Prompt-prefix projection option | --use-prompt-prefix-projection |
| Prompt-prefix projection status | rejected representation screen: all 20 prompt-prefix projection parameters moved and loss improved 5.6649 -> 5.5679, but diversity target still failed 0/9 multi-target profiles |
| Latest prompt-position projection run | runs/transformer-answer-v0.43-prompt-position-target-softmax-restorebest-smoke-dim4-context80/ |
| Prompt-position projection option | --use-prompt-position-projection |
| Prompt-position projection status | rejected representation screen: 1108/1284 prompt-position projection parameters moved and loss improved 5.6649 -> 5.5679, but diversity target still failed 0/9 multi-target profiles |
| Latest branch-target margin run | runs/transformer-answer-v0.43-branch-target-margin-prompt-position-smoke-dim4-context80/ |
| Branch-target margin mode | branch-target-margin-unlikelihood with --use-prompt-position-projection |
| Branch-target margin status | rejected target-margin screen: gate passed and 50/50 direct steps ran, train loss moved 4.8973 -> 4.7784, but diversity target still failed 0/9 multi-target profiles |
| Latest branch-representation contrast run | runs/transformer-answer-v0.43-branch-representation-contrast50-prompt-position-smoke-dim4-context80/ |
| Branch-representation contrast mode | branch-representation-contrast-unlikelihood with --direct-answer-contrast-weight 50.0 |
| Branch-representation contrast status | rejected representation-contrast screen: direct snapshots now record hidden-distance profiles, but high-weight contrast still failed diversity target 0/9 multi-target profiles |
| Latest branch-representation capacity run | runs/transformer-answer-v0.43-branch-representation-contrast50-prompt-position-smoke-dim8-context80-steps40/ |
| Branch-representation capacity mode | dim8 branch-representation-contrast-unlikelihood with --direct-answer-contrast-weight 50.0 |
| Branch-representation capacity status | rejected capacity screen: 40/40 direct steps ran after the 50-step dim8 screen proved too slow; hidden distance increased, but diversity target still failed 0/9 multi-target profiles |
| Latest prompt-position scale run | runs/transformer-answer-v0.43-prompt-position-scale32-repcontrast50-smoke-dim4-context80/ |
| Prompt-position scale mode | branch-representation-contrast-unlikelihood with --prompt-position-projection-scale 32.0 |
| Prompt-position scale status | rejected prompt-signal scale screen: 50/50 direct steps ran, 1108/1284 prompt-position projection parameters moved, hidden distance increased, but diversity target still failed 0/9 multi-target profiles |
| Transformer structure audit | STRUCTURE_AUDIT.md now gates the next transformer repair: study open-source model/trainer/tokenizer/checkpoint structure without importing external weights, tokenizers, embeddings, datasets, or training text |
| Transformer structure decision | implemented and screened an opt-in pre-layer-norm transformer block path with final normalization; target-balanced branch sampling was rejected, so the next target is prompt-to-answer binding for QA and heldout |
| Latest pre-layer-norm run | runs/transformer-answer-v0.44-prelayernorm-repcontrast50-prompt-position-smoke-dim4-context80/ |
| Pre-layer-norm mode | branch-representation-contrast-unlikelihood with --use-pre-layer-norm and --use-prompt-position-projection |
| Pre-layer-norm status | partial structural evidence: 50/50 direct steps ran, 1108/1284 prompt-position parameters and all 8 final-norm parameters moved, but diversity target still failed 0/9 multi-target profiles |
| Latest target-balanced run | runs/transformer-answer-v0.44-target-balanced-prelayernorm-repcontrast50-prompt-position-smoke-dim4-context80/ |
| Target-balanced mode | branch-balanced-representation-contrast-unlikelihood with --use-pre-layer-norm and target-bucket branch batches |
| Target-balanced status | rejected sampler evidence: 50/50 direct steps ran, but best-snapshot restoration returned to step 0 and all 9/9 multi-target profiles collapsed to global 'n' |
| Latest branch-rank diagnostic run | runs/transformer-answer-v0.45-branch-rank-diagnostic-smoke-dim4-context80/ |
| Branch-rank diagnostic | direct-answer branch profiles include average target rank, top-3/top-5 target coverage, and failed-record top predictions |
| Branch-rank QA | final QA collapsed to all 'n' with average target rank 14.25 and top-3/top-5 target coverage 0.125 |
| Branch-rank heldout | final heldout collapsed to all 'n' with average target rank 14.25 and top-3/top-5 target coverage 0.125 |
| Branch-rank status | diagnostic evidence: correct branch targets are usually buried, so the next repair should improve prompt-to-answer output binding |
| Latest output-binding run | runs/transformer-answer-v0.46-output-binding-rankscore-smoke-dim4-context80/ |
| Output-binding mode | branch-output-binding-unlikelihood with rank-aware best-snapshot scoring and frozen output bias |
| Output-binding QA | QA average target rank improved 17.375 -> 14.125 and top-5 coverage reached 0.25, but target-token coverage stayed 0.0 and top-3 coverage ended 0.0 |
| Output-binding heldout | heldout average target rank improved 17.25 -> 14.375 and top-5 coverage reached 0.25, but target-token coverage stayed 0.0 and top-3 coverage ended 0.0 |
| Output-binding status | rejected repair evidence: output binding cracked wrong-token diversity but still collapsed QA and heldout to wrong branch tokens |
| Latest rank-margin run | runs/transformer-answer-v0.47-rank-margin-steps50-smoke-dim4-context80/ |
| Rank-margin mode | branch-rank-margin-unlikelihood against top wrong branch tokens with frozen output bias |
| Rank-margin QA | QA average target rank improved 17.375 -> 9.0, target-token coverage reached 0.125, top-3 coverage reached 0.25, and top-5 coverage reached 0.5 |
| Rank-margin heldout | heldout average target rank improved 17.25 -> 9.0, target-token coverage reached 0.125, top-3 coverage reached 0.25, and top-5 coverage reached 0.375 |
| Rank-margin status | strongest rank-lift evidence so far, but rejected for promotion because predicted diversity stayed 1/8 and branches still collapsed to wrong 'n' |
| Latest balanced rank-margin run | runs/transformer-answer-v0.48-balanced-rank-margin-smoke-dim4-context80/ |
| Balanced rank-margin mode | branch-balanced-rank-margin-unlikelihood with target-balanced branch batches and top wrong-token margins |
| Balanced rank-margin QA | QA predicted diversity reached 2/8, target-token coverage stayed 0.125, average target rank reached 9.375, top-3 reached 0.375, and top-5 reached 0.5 |
| Balanced rank-margin heldout | heldout predicted diversity reached 2/8, target-token coverage stayed 0.125, average target rank reached 9.625, top-3 reached 0.25, and top-5 reached 0.5 |
| Balanced rank-margin status | rejected evidence: target-balanced rank margin improves wrong-token diversity and top-3/top-5 coverage, but top-1 branch choices are still wrong |
| Latest top-one rank-margin run | runs/transformer-answer-v0.49-balanced-rank-margin-top1-smoke-dim4-context80/ |
| Top-one rank-margin mode | branch-balanced-rank-margin-unlikelihood with one top wrong token |
| Top-one rank-margin QA | QA target-token coverage stayed 0.125, but average target rank regressed to 12.5, top-3 fell to 0.125, and top-5 fell to 0.25 |
| Top-one rank-margin heldout | heldout target-token coverage stayed 0.125, but average target rank regressed to 12.375, top-3 fell to 0.125, and top-5 fell to 0.25 |
| Top-one rank-margin status | rejected evidence: concentrating on one current top wrong token regressed rank/top-k evidence instead of converting targets into top-1 choices |
| Latest top-k softmax run | runs/transformer-answer-v0.50-balanced-topk-softmax-w5-smoke-dim4-context80/ |
| Top-k softmax mode | branch-balanced-topk-softmax-unlikelihood with target-balanced branch batches and restricted target-vs-top-wrong-token softmax |
| Top-k softmax QA | QA target-token coverage stayed 0.125, average target rank improved to 8.75, top-3 reached 0.375, and top-5 reached 0.5 |
| Top-k softmax heldout | heldout target-token coverage stayed 0.125, average target rank improved to 8.75, top-3 reached 0.375, and top-5 reached 0.5 |
| Top-k softmax status | rejected evidence: top-k softmax recovers rank/top-k evidence after v0.49 but still leaves QA and heldout collapsed to wrong 'u' top-1 branch choices |
| Latest foundation-stack run | runs/transformer-v0.51-foundation-stack-smoke/ |
| Foundation-stack mode | full mechanics stack: AdamW/SGD state, scheduling, accumulation, resume validation, multi-head/RMSNorm/gated/tied/rotary architecture options, generation traces, and replayable eval samples |
| Foundation-stack smoke | 2/2 language-model steps completed with AdamW, attention_heads 2, RMSNorm, gated MLP, tied output embeddings, rotary positions, and cache-aware generation metadata |
| Foundation-stack artifacts | quarklm-transformer-v2 checkpoint, optimizer_state.json, eval.json, and eval_samples.jsonl |
| Foundation-stack status | mechanics-readiness evidence only; not a promoted responder or direct-answer repair run |
| Latest full-stack top-k run | runs/transformer-answer-v0.52-fullstack-topk-softmax-smoke-dim4-context80/ |
| Full-stack top-k mode | branch-balanced-topk-softmax-unlikelihood under the full v0.51 stack |
| Full-stack top-k QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Full-stack top-k heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Full-stack top-k status | rejected evidence: full-stack baseline improves diversity, but unchanged top-k pressure collapses training to one wrong token; next repair should bind prompt contexts to target tokens |
| Latest bidirectional binding run | runs/transformer-answer-v0.53-fullstack-bidir-binding-smoke-dim4-context80/ |
| Bidirectional binding mode | branch-balanced-bidirectional-binding-unlikelihood under the full v0.51 stack |
| Bidirectional binding unit test | focused transformer tests pass; context-ownership regression verifies target tokens gain probability mass on their own prompt contexts |
| Bidirectional binding QA | restored step 40; QA predicted diversity 2/8, dominant wrong 'a', target-token coverage 0.125, average target rank 7.875, top-3 0.25, top-5 0.5 |
| Bidirectional binding heldout | restored step 40; heldout predicted diversity 2/8, dominant wrong 'a', target-token coverage 0.125, average target rank 9.0, top-3 0.25, top-5 0.375 |
| Bidirectional binding history | training step 50 briefly reached QA target-token coverage 0.25 with average target rank 8.375 before best-snapshot restore selected the rank-focused step 40 checkpoint |
| Bidirectional binding status | partial progress, rejected for promotion: bidirectional binding improves rank pressure under the full stack, but target coverage is not preserved and diversity target still fails 0/9 multi-target profiles |
| Latest coverage binding run | runs/transformer-answer-v0.54-fullstack-coverage-binding-smoke-dim4-context80/ |
| Coverage binding mode | branch-balanced-coverage-binding-unlikelihood under the full v0.51 stack |
| Coverage binding unit test | focused transformer tests pass; hard-wrong-token coverage regression verifies target-set mass and exact target probability improve in the restricted candidate set |
| Coverage binding QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Coverage binding heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Coverage binding history | training step 50 improved QA average target rank to 8.125, but target-token coverage collapsed to 0.0 with one wrong 'a' top-1 branch token |
| Coverage binding status | rejected evidence: best-snapshot scoring restored the baseline because bundled hard-negative coverage binding traded away target coverage for rank; next repair should preserve target-set coverage before exact-target sharpening |
| Latest target-set coverage run | runs/transformer-answer-v0.55-fullstack-target-set-coverage-smoke-dim4-context80/ |
| Target-set coverage mode | branch-balanced-target-set-coverage-unlikelihood under the full v0.51 stack with positive target CE disabled |
| Target-set coverage unit test | focused transformer tests pass; target-set-only coverage regression verifies target-set mass improves against hard wrong tokens without requiring exact-target sharpening |
| Target-set coverage QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Target-set coverage heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Target-set coverage history | training step 50 improved QA average target rank to 10.0, but target-token coverage collapsed to 0.0 with one wrong 'a' top-1 branch token |
| Target-set coverage status | rejected evidence: batch-local target-set mass still trades away eval target-token coverage; next repair should add explicit anti-collapse pressure over predicted target tokens |
| Latest target-diversity run | runs/transformer-answer-v0.57-fullstack-target-diversity-smoke-dim4-context80/ |
| Target-diversity mode | branch-balanced-target-diversity-unlikelihood under the full v0.51 stack with positive target CE disabled |
| Target-diversity unit test | focused transformer tests pass; target-diversity regression verifies restricted target-set mass and weakest target-share balance improve in a small branch batch |
| Target-diversity QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Target-diversity heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Target-diversity history | training step 50 improved QA average target rank to 10.0, but target-token coverage collapsed to 0.0 with one wrong 'a' top-1 branch token |
| Target-diversity status | rejected evidence: batch-local target-share diversity still trades away eval target-token coverage; next repair should preserve eval-wide target coverage directly |
| Latest target-replay coverage run | runs/transformer-answer-v0.58-fullstack-target-replay-coverage-smoke-dim4-context80/ |
| Target-replay coverage mode | branch-balanced-target-replay-coverage-unlikelihood under the full v0.51 stack with positive target CE disabled |
| Target-replay coverage unit test | focused transformer tests pass; target-replay regression verifies replay target-set mass and weakest missing-target share improve when the sampled branch batch omits admitted pool targets |
| Target-replay coverage QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Target-replay coverage heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Target-replay coverage history | training step 40 improved QA average target rank to 6.875 and top-5 coverage to 0.5; by step 50, QA/heldout top-1 collapsed to wrong 'n' and target-token coverage had hit 0.0 during training |
| Target-replay coverage status | rejected evidence: pool-owned replay target coverage still trades away context-specific target ownership; next repair should bind replay pressure to branch contexts |
| Latest context-replay coverage run | runs/transformer-answer-v0.59-fullstack-context-replay-coverage-smoke-dim4-context80/ |
| Context-replay coverage mode | branch-balanced-context-replay-coverage-unlikelihood under the full v0.51 stack with positive target CE disabled |
| Context-replay coverage unit test | focused transformer tests pass; context-replay regression verifies replay target-set mass and weakest owned-target share improve on fixed replay contexts |
| Context-replay coverage QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Context-replay coverage heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Context-replay coverage history | training step 40 improved QA average target rank to 7.375, top-3 to 0.375, and top-5 to 0.5; by step 50, QA predicted diversity was only 2/8 and target-token coverage had hit 0.0 during training |
| Context-replay coverage status | rejected evidence: context-owned replay improves rank/top-k snapshots but still does not preserve target-token coverage; next repair should strengthen target-preserving ownership or scoring gates |
| Latest coverage-floor run | runs/transformer-answer-v0.60-fullstack-context-replay-coverage-floor-metadata-smoke-dim4-context80/ |
| Coverage-floor mode | profile-wise target-token coverage floor before branch snapshot rank/top-k scoring |
| Coverage-floor unit test | focused transformer tests pass; coverage-floor regression rejects a rank-lifted candidate when QA target-token coverage falls below baseline |
| Coverage-floor QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Coverage-floor heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Coverage-floor history | clean v0.60 JSONL wrote 7 direct-answer rows with branch_target_coverage_by_profile; step 40 improved QA rank/top-k but was ineligible because profile coverage regressed |
Coverage-floor statusgate repair accepted, model behavior rejected: coverage floor prevents rank/top-k gains from promoting snapshots that regress target-token coverage
Latest coverage-anchor runruns/transformer-answer-v0.61-fullstack-context-coverage-anchor-smoke-dim4-context80/
Coverage-anchor modebranch-balanced-context-coverage-anchor-unlikelihood under the full v0.51 stack with the v0.60 coverage floor
Coverage-anchor unit testfocused transformer tests pass; anchor regression verifies covered-target probability is protected better than the same replay training without anchors
Coverage-anchor QArestored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375
Coverage-anchor heldoutrestored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375
Coverage-anchor historytraining snapshots over-anchored covered wrong 'i'; QA/heldout predicted diversity fell to 1/8 and target-token coverage to 0.125, while average target rank worsened to above 21
Coverage-anchor statusrejected evidence: global covered-target anchors protect one covered token but do not preserve coverage diversity; next repair should be target-balanced or profile-aware
Latest target-balanced anchor runruns/transformer-answer-v0.62-fullstack-target-balanced-anchor-smoke-dim4-context80/
Target-balanced anchor modebranch-balanced-context-target-balanced-anchor-unlikelihood under the full v0.51 stack with the v0.60 coverage floor
Target-balanced anchor unit testfocused transformer tests pass; singleton covered-target regression verifies target-balanced anchors skip the v0.61 one-token over-anchor
Target-balanced anchor QArestored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375
Target-balanced anchor heldoutrestored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375
Target-balanced anchor historytraining avoided the v0.61 hard 'i' attractor, but QA/heldout target-token coverage still collapsed to 0.0 and trained snapshots remained ineligible
Target-balanced anchor statusrejected evidence: target-balanced anchors prevent singleton over-anchoring but do not preserve profile coverage; next repair should train from profile-level coverage deficits
Latest coverage-deficit runruns/transformer-answer-v0.64-fullstack-coverage-deficit-smoke-dim4-context80/
Coverage-deficit modebranch-balanced-context-coverage-deficit-unlikelihood under the full v0.51 stack with the v0.60 coverage floor
Coverage-deficit unit testfocused transformer tests pass; deficit regression verifies missing replay targets gain restricted probability over the old context replay objective
Coverage-deficit QArestored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375
Coverage-deficit heldoutrestored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375
Coverage-deficit historytraining step 50 reached QA accuracy 1/8 and predicted diversity 4/8 with average target rank 10.0, but QA/heldout target-token coverage regressed to 0.125 and trained snapshots remained ineligible
Coverage-deficit statusrejected evidence: deficit pressure can crack the top-1 branch in training but still trades away coverage, so the next repair should combine deficit pressure with an explicit coverage-preserving constraint
Latest coverage-preserving deficit runruns/transformer-answer-v0.65-fullstack-coverage-preserving-deficit-smoke-dim4-context80/
Coverage-preserving deficit modebranch-balanced-context-coverage-preserving-deficit-unlikelihood under the full v0.51 stack with the v0.60 coverage floor
Coverage-preserving deficit unit testfocused transformer tests pass; preserving-deficit regression verifies missing targets still lift while represented target tokens are protected better than deficit-only training
Coverage-preserving deficit QArestored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375
Coverage-preserving deficit heldoutrestored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375
Coverage-preserving deficit historytraining step 50 reached QA/heldout branch accuracy 1/8, QA average target rank 7.75, heldout average target rank 7.125, and top-5 coverage 0.5, but both profiles collapsed to predicted_unique 1/8 with target-token coverage 0.125
Coverage-preserving deficit statusrejected evidence: predicted-target preservation over-preserved the represented 'i' token and improved rank while regressing coverage diversity; next repair should make the coverage constraint profile-aware instead of anchoring current predicted target tokens
Latest profile-aware replay runruns/transformer-answer-v0.67-profile-aware-replay-plan-smoke-dim4-context80/
Profile-aware replay modebranch-balanced-context-profile-coverage-preserving-deficit-unlikelihood under the full v0.51 stack with the v0.60 coverage floor and v0.67 per-profile replay plan
Profile-aware replay unit testfocused transformer tests pass; profile replay plan verifies profile deficits are not hidden by global target coverage, and profiled replay records preserve source keys for shared branch targets
Profile-aware replay plandirect_answer_replay_plan.json records 9144 branch/replay records across 21 profiles; example floors include qa:place 0.5 and qa:color 0.0
Profile-aware replay gatebranch-context gate passed 219/219 semantic records with 0 ambiguous contexts, 0 context collisions, and 0 skipped records
Profile-aware replay smokeone gated branch-only direct step ran, post-direct candidate snapshot was skipped by configuration, and the best branch snapshot restored from step 0
Profile-aware replay statusmechanics-readiness evidence: replay plan and profile-aware objective surface are implemented, but branch-diversity target still failed 0/9 multi-target profiles so no model-quality promotion
Latest profile-aware full-stack runruns/transformer-answer-v0.68-fullstack-profile-aware-preserving-deficit-smoke-dim4-context80/
Profile-aware full-stack modebranch-balanced-context-profile-coverage-preserving-deficit-unlikelihood under the full v0.51 stack with the v0.60 coverage floor and v0.67 replay-plan artifact
Profile-aware full-stack plandirect_answer_replay_plan.json records 9144 branch/replay records across 21 profiles; branch-context gate passed 219/219 semantic records
Profile-aware full-stack QArestored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375
Profile-aware full-stack heldoutrestored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375
Profile-aware full-stack historystep 40 improved QA average target rank to 6.5 and top-5 to 0.625, with heldout rank 6.875 and top-5 0.5, but QA/heldout target-token coverage regressed to 0.125 and predicted diversity collapsed to 1/8
Profile-aware full-stack statusrejected evidence: profile-aware preservation can improve rank under training, but best-snapshot scoring restored step 0 because trained snapshots still erase coverage and diversity
Profile target-share objectivesrc/closed_world_lm/transformer_char_model.py and src/closed_world_lm/transformer_objectives.py add branch-balanced-context-profile-target-share-preserving-deficit-unlikelihood as the v0.81 profile target-share objective implementation.
Profile target-share decisionProfile-aware replay can now add balanced owned target-share pressure across each profile's replay targets while retaining deficit focus, represented-target preservation, replay-plan artifacts, and recipe/promotion surfaces.
Profile target-share unit testfocused transformer tests pass; the minority replay target gains more share with balanced profile target-share pressure than under the previous profile-aware replay loss.
Latest profile target-share runruns/transformer-answer-v0.82-fullstack-profile-target-share-smoke-dim4-context80/
Profile target-share modebranch-balanced-context-profile-target-share-preserving-deficit-unlikelihood
Profile target-share artifactsexperiment_intent.json, corpus_hygiene.json, training_plan.json, candidate_quarantine.json, closed_world_verifier.json, training_recipe.json, direct_answer_replay_plan.json, constraint_first_promotion.json, metrics JSON/JSONL, tokenizer, optimizer, lessons, and checkpoint are written.
Profile target-share gatebranch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; deterministic verifier passed; purity gates include external_embeddings false.
Profile target-share history50/50 direct steps completed with 7 clean JSONL rows. Step 40 lowered train loss to 19.7378 and improved QA average rank to 9.125, but QA and heldout collapsed to one 'c' prediction with target-token coverage 0.0.
Profile target-share statusrejected evidence: best-snapshot scoring restored step 0, preserving QA/heldout target-token coverage at 0.25, but branch_diversity_target still failed across all 9 multi-target profiles.
Latest prompt-ownership runruns/transformer-answer-v0.83-fullstack-prompt-ownership-smoke-dim4-context80/
Prompt-ownership modebranch-balanced-context-profile-prompt-ownership-target-share-preserving-deficit-unlikelihood
Prompt-ownership unit testfocused transformer tests pass; prompt-specific ownership margins lift a context's own target above a sibling profile target more than the v0.82 profile target-share pressure.
Prompt-ownership gatebranch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; deterministic verifier passed; purity gates include external_embeddings false.
Prompt-ownership history50/50 direct steps completed with 7 clean JSONL rows. Step 50 improved QA average target rank to 8.625 and heldout average target rank to 8.5, but QA and heldout collapsed to one 'c' prediction with target-token coverage 0.0 during training.
Prompt-ownership statusrejected evidence: best-snapshot scoring restored step 0, preserving QA/heldout target-token coverage at 0.25, but branch_diversity_target still failed across all 9 multi-target profiles.
Latest baseline-anchor runruns/transformer-answer-v0.84-fullstack-baseline-anchored-prompt-ownership-smoke-dim4-context80/
Baseline-anchor modebranch-balanced-context-profile-baseline-anchored-prompt-ownership-target-share-preserving-deficit-unlikelihood
Baseline-anchor unit testfocused transformer tests pass; profiled replay batches can use baseline prediction overrides, and anchored replay preservation protects a covered target better than following current prediction drift.
Baseline-anchor gatebranch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors were recorded and active; deterministic verifier passed; purity gates include external_embeddings false.
Baseline-anchor history50/50 direct steps completed with 7 clean JSONL rows. Step 40 improved QA average target rank to 8.0 and heldout rank to 8.375, but QA and heldout collapsed to one 'i' prediction with target-token coverage 0.125 during training.
Baseline-anchor statusrejected evidence: anchoring improves over the v0.83 zero-coverage collapse, but best-snapshot scoring restored step 0 because trained snapshots still fell below the 0.25 QA/heldout coverage floor and branch_diversity_target failed across all 9 multi-target profiles.
Latest baseline-floor gate runruns/transformer-answer-v0.85-fullstack-baseline-floor-gated-prompt-ownership-smoke-dim4-context80/
Baseline-floor gate modebranch-balanced-context-profile-baseline-floor-gated-prompt-ownership-target-share-preserving-deficit-unlikelihood
Baseline-floor gate unit testfocused transformer tests pass; the new mode records baseline replay anchors, a baseline-floor update guard, and one-step accepted/rejected guard accounting.
Baseline-floor gatebranch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors were recorded and active; the baseline-floor update guard checked 50 attempted steps and rejected 50 unsafe updates; deterministic verifier passed; purity gates include external_embeddings false.
Baseline-floor gate history50/50 attempted direct steps completed with 7 clean JSONL rows. The guard preserved baseline/final QA and heldout target-token coverage at 0.25, predicted diversity at 3/8, QA average target rank at 13.25, and heldout average rank at 13.375, but accepted 0/50 attempted updates.
Baseline-floor gate statusrejected evidence: v0.85 prevents unsafe forgetting by refusing every update below the profile-wise baseline coverage floor, but branch_diversity_target still fails across all 9 multi-target profiles and no weight update is accepted.
Latest baseline-floor adaptive runruns/transformer-answer-v0.86-fullstack-baseline-floor-adaptive-prompt-ownership-smoke-dim4-context80/
Baseline-floor adaptive modebranch-balanced-context-profile-baseline-floor-adaptive-prompt-ownership-target-share-preserving-deficit-unlikelihood
Baseline-floor adaptive unit testfocused transformer tests pass; the adaptive mode records baseline replay anchors, adaptive learning-rate scales, checked steps, attempted updates, accepted attempts, and rejected attempts.
Baseline-floor adaptive gatebranch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors were recorded and active; adaptive scales were 1.0, 0.25, 0.05, and 0.01; the guard checked 50 steps, attempted 200 updates, and rejected 200 unsafe attempts; deterministic verifier passed; purity gates include external_embeddings false.
Baseline-floor adaptive history50/50 attempted direct steps completed with 7 clean JSONL rows. The guard preserved baseline/final QA and heldout target-token coverage at 0.25, predicted diversity at 3/8, QA average target rank at 13.25, and heldout average rank at 13.375, but accepted 0/200 scaled attempted updates.
Baseline-floor adaptive statusrejected evidence: v0.86 proves the unsafe-update problem is not fixed by four learning-rate scales; every scaled retry still falls below at least one profile-wise baseline coverage floor and branch_diversity_target still fails across all 9 multi-target profiles.
Latest baseline-floor repaired runruns/transformer-answer-v0.87-fullstack-baseline-floor-repaired-prompt-ownership-clean-smoke-dim4-context80/
Baseline-floor repaired modebranch-balanced-context-profile-baseline-floor-repaired-prompt-ownership-target-share-preserving-deficit-unlikelihood
Baseline-floor repaired unit testfocused transformer tests pass; the repaired mode records baseline replay anchors, adaptive learning-rate scales, repair-anchor counts, repair attempts, repaired attempts, accepted update-shape counts, and rejected samples.
Baseline-floor repaired gatebranch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors and 227 baseline-covered repair anchors were recorded; adaptive scales were 1.0, 0.25, 0.05, and 0.01; the guard checked 50 steps, attempted 200 updates, ran 200 one-step repairs, and rejected 200 unsafe attempts; deterministic verifier passed; purity gates include external_embeddings false.
Baseline-floor repaired history50/50 attempted direct steps completed with 7 clean JSONL rows. The guard preserved baseline/final QA and heldout target-token coverage at 0.25, predicted diversity at 3/8, QA average target rank at 13.25, and heldout average rank at 13.375, but accepted 0/200 repaired attempted updates.
Baseline-floor repaired statusrejected evidence: v0.87 proves one bounded baseline-covered anchor repair after an unsafe update is still not enough; every repaired retry falls below at least one profile-wise baseline coverage floor and branch_diversity_target still fails across all 9 multi-target profiles.
Latest baseline-floor objective runruns/transformer-answer-v0.88-fullstack-baseline-floor-objective-prompt-ownership-smoke-dim4-context80/
Baseline-floor objective modebranch-balanced-context-profile-baseline-floor-objective-prompt-ownership-target-share-preserving-deficit-unlikelihood
Baseline-floor objective unit testfocused transformer tests pass; the objective mode records baseline replay anchors, objective-side floor-anchor counts, anchor batch size, anchor weight, objective anchor batches, accepted attempts, and rejected attempts.
Baseline-floor objective gatebranch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors and 227 objective-side floor anchors were recorded; anchor batch size was 32, anchor weight was 10.0, adaptive scales were 1.0, 0.25, 0.05, and 0.01; the guard checked 50 steps, attempted 200 updates, ran 200 objective anchor batches covering 2400 anchor records, and rejected 200 unsafe attempts; deterministic verifier passed; purity gates include external_embeddings false.
Baseline-floor objective history50/50 attempted direct steps completed with 7 clean JSONL rows. The guard preserved baseline/final QA and heldout target-token coverage at 0.25, predicted diversity at 3/8, QA average target rank at 13.25, and heldout average rank at 13.375, but accepted 0/200 objective-shaped attempted updates.
Baseline-floor objective statusrejected evidence: v0.88 proves a balanced objective-side floor-anchor term is still not enough when coupled to branch-diversity pressure; every retry falls below at least one profile-wise baseline coverage floor and branch_diversity_target still fails across all 9 multi-target profiles.
Profile target-share nextUse the v0.115 hidden-projection candidate evidence with profile target-share and branch-diversity gates before promotion.
Transformer selector exact18/219 -> 219/219 selector-emitted
Transformer selector candidate accuracy18/219 -> 219/219 eval-scoped
Transformer-guided generator exact0/219 -> 219/219 no-candidate
Tokenizercorpus-trained character tokenizer
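
The coverage-floor rows above describe a gate that checks profile-wise target-token coverage before any rank/top-k scoring. A minimal sketch of that ordering follows. All names here (`BranchProfile`, `branch_profile`, `eligible`, `pick_snapshot`) are hypothetical, the coverage definition is inferred from the reported numbers rather than taken from the QuarkLM source, and scoring by mean average target rank is a simplification of the rank/top-k scoring.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BranchProfile:
    """Hypothetical per-profile branch metrics for one snapshot."""
    target_token_coverage: float
    average_target_rank: float


def branch_profile(targets, ranked_predictions):
    """Summarize one profile.

    targets: the correct first answer token for each branch context.
    ranked_predictions: for each context, every token ordered best-first.
    """
    target_set = set(targets)
    top1 = {ranking[0] for ranking in ranked_predictions}
    # Assumed definition: share of distinct targets that appear as a top-1 token.
    coverage = len(top1 & target_set) / len(target_set)
    ranks = [r.index(t) + 1 for t, r in zip(targets, ranked_predictions)]
    return BranchProfile(coverage, sum(ranks) / len(ranks))


def eligible(baseline, candidate):
    """Coverage floor: no profile may fall below its baseline coverage."""
    return all(
        candidate[name].target_token_coverage >= baseline[name].target_token_coverage
        for name in baseline
    )


def pick_snapshot(baseline, candidates):
    """Return the step of the best eligible snapshot, or 0 to restore step 0."""
    def mean_rank(profiles):
        return sum(p.average_target_rank for p in profiles.values()) / len(profiles)

    best_step, best_rank = 0, mean_rank(baseline)
    for step, profiles in candidates.items():
        # The floor is checked before rank is allowed to matter.
        if eligible(baseline, profiles) and mean_rank(profiles) < best_rank:
            best_step, best_rank = step, mean_rank(profiles)
    return best_step


# Step-0 and step-40 values reported for the v0.68 run.
baseline = {"qa": BranchProfile(0.25, 13.25), "heldout": BranchProfile(0.25, 13.375)}
step_40 = {"qa": BranchProfile(0.125, 6.5), "heldout": BranchProfile(0.125, 6.875)}
print(pick_snapshot(baseline, {40: step_40}))  # prints 0
```

With the v0.68 numbers, step 40 has a much better rank but lower coverage in both profiles, so the sketch restores step 0, matching the reported outcome.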

v0.42 Summary

QuarkLM v0.42 keeps the admitted corpus unchanged from v0.41 and widens the from-scratch transformer used by the sparse prompt-contrast branch repair path. The stable self-improvement run is runs/self-improve-v0.42/; the current transformer answer-lesson run is runs/transformer-answer-v0.42-branch-repair-contrast50-dim8-context32/.

The current corpus remains at 12 admitted facts. Direct admission probes pass 48/48, admission paraphrase probes pass 84/84, and glossary probes pass 38/38.

The transformer is a tiny decoder-only language model built with the Python standard library. It uses learned token and position embeddings, one causal self-attention block, a feed-forward block, and QuarkLM's corpus-trained character tokenizer. It starts from random weights and imports no pretrained model, vocabulary, or embeddings.
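
As an illustration of the architecture described above, the following sketch builds a character vocabulary from a corpus and runs one forward pass of a single-block decoder using only the standard library. It is not the QuarkLM implementation: `TinyDecoder`, its weight layout, the single attention head, the ReLU feed-forward block, the absence of biases, and the toy corpus are all assumptions. Because there is only one attention block, the next-character distribution needs only the last position's query, and attending over the current and earlier positions is exactly what the causal mask allows.

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a row-major matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init(rng, rows, cols, scale=0.1):
    """Random weights: nothing pretrained is loaded."""
    return [[rng.uniform(-scale, scale) for _ in range(cols)] for _ in range(rows)]


class TinyDecoder:
    """Hypothetical one-block, single-head decoder (no biases, no layer norm)."""

    def __init__(self, vocab_size, context=32, dim=8, ff_dim=16, seed=0):
        rng = random.Random(seed)
        self.context, self.dim = context, dim
        self.tok = init(rng, vocab_size, dim)   # learned token embeddings
        self.pos = init(rng, context, dim)      # learned position embeddings
        self.wq, self.wk, self.wv, self.wo = (init(rng, dim, dim) for _ in range(4))
        self.w1 = init(rng, ff_dim, dim)
        self.w2 = init(rng, dim, ff_dim)
        self.out = init(rng, vocab_size, dim)   # language-model head

    def next_token_probs(self, token_ids):
        """Distribution over the next character; token_ids must be non-empty."""
        ids = token_ids[-self.context:]
        xs = [[t + p for t, p in zip(self.tok[i], self.pos[n])]
              for n, i in enumerate(ids)]
        # Only the last position's query is needed for the next character.
        q = matvec(self.wq, xs[-1])
        keys = [matvec(self.wk, x) for x in xs]
        values = [matvec(self.wv, x) for x in xs]
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(self.dim) for k in keys]
        weights = softmax(scores)
        mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(self.dim)]
        h = [x + a for x, a in zip(xs[-1], matvec(self.wo, mixed))]  # residual
        hidden = [max(0.0, z) for z in matvec(self.w1, h)]           # ReLU
        h = [a + b for a, b in zip(h, matvec(self.w2, hidden))]      # residual
        return softmax(matvec(self.out, h))


# Toy corpus (hypothetical); the vocabulary comes only from its characters.
corpus = "question: where is the ball?\nanswer: in the box.\n"
vocab = sorted(set(corpus))
stoi = {ch: i for i, ch in enumerate(vocab)}
model = TinyDecoder(len(vocab))
probs = model.next_token_probs([stoi[c] for c in "answer: "])
```

The defaults mirror the context size, embedding dimension, and feed-forward dimension listed for v0.42; with untrained weights the resulting distribution is close to uniform.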

Transformer direct-answer evidence:

  • transformer checkpoint: runs/transformer-answer-v0.42-branch-repair-contrast50-dim8-context32/transformer_answer.json
  • v0.31 generator checkpoint retained for comparison: runs/transformer-answer-v0.31-generator-weighted-lr035-80k/answer_generator.json
  • selector checkpoint: runs/transformer-answer-v0.31-generator-weighted-lr035-80k/answer_selector.json
  • training steps: 80
  • context size: 32
  • embedding dimension: 8
  • feed-forward dimension: 16
  • direct answer steps: 1000
  • direct answer mode: periodic-branch-repair-contrast-unlikelihood
  • direct answer negative weight: 1.0
  • direct answer positive weight: 1.0
  • direct answer contrast weight: 1.0
  • branch position: 1
  • contrast interval: 50
  • direct answer training examples: 9144
  • answer target NLL: 3.5850 -> 2.4129
  • direct answer target loss: 3.4278 -> 2.2708
  • raw direct greedy exact answers: 0/219 -> 0/219
  • transformer-only eval-scoped candidate accuracy: 15/219 -> 37/219
  • selector-emitted exact answers: 18/219 -> 219/219
  • selector eval-scoped candidate accuracy: 18/219 -> 219/219
  • v0.31 generator exact answers without candidates: 0/219 -> 219/219
  • pretrained weights: false
  • pretrained tokenizer: false
  • external embeddings: false
  • direct path uses answer candidates: false
  • direct path uses auxiliary weights: false
  • generator uses answer candidates: false

This is real movement toward raw transformer answering, but with a clear boundary. v0.42 preserves v0.33's transformer-only candidate discrimination while testing whether a wider, randomly initialized transformer gives sparse branch contrast more room to represent prompt differences. The direct path improves answer-target NLL versus v0.41 and reduces runaway greedy looping, but raw greedy completions still fail exact answer generation: the dominant failure is now the short wrong completion " te.". The next structured repair should make the prompt representation more target-specific without losing these scored gains. v0.31's auxiliary no-candidate generator remains the best exact no-candidate answer evidence. The current reliable response gate still belongs to the responder, learned answer classifier, and generative answer decoder.

Unpromoted v0.43 Findings

v0.43 development added transformer-loop improvements, but did not replace the v0.42 promoted checkpoint.

  • The transformer forward pass now computes only the final position consumed by the language-model head, preserving the next-character objective while making longer-context experiments practical.
  • Transformer answer artifacts now include prompt context-coverage metrics. A context size of 80 covers all current semantic eval templates (219/219), while context size 32 drops complete template coverage for many prompts.
  • runs/transformer-answer-v0.43-hard-branch-contrast4-dim8-context32/ preserved candidate accuracy at 37/219, but regressed direct loss to 2.4225 and answer NLL to 2.5402, and collapsed greedy output to a repeated " a" loop.
  • runs/transformer-answer-v0.43-branch-repair-contrast50-dim8-context80/ achieved full context coverage and the shorter failure " t.", but still trailed v0.42 with direct loss 2.3122 and answer NLL 2.4546.
  • runs/transformer-answer-v0.43-branch-repair-contrast50-dim8-context80-1500/ reached 38/219 candidates, but regressed direct loss, answer NLL, and greedy output. It remains archived evidence rather than a promoted release.
  • runs/transformer-answer-v0.43-layernorm-screen-dim8-context80/ tested optional layer normalization with full context coverage. It preserved 37/219 candidates, but answer NLL regressed to 2.5881 and greedy output collapsed into repeated " y"/"e" loops, so it was not promoted.
  • runs/transformer-answer-v0.43-branch-span3-screen-dim8-context32/ tested branch repair over answer positions 1..3. It preserved 37/219 candidates, but answer NLL regressed to 2.7426 and greedy output became a long "neeee" loop, so it was not promoted.
  • runs/transformer-answer-v0.43-two-layer-screen-dim8-context32/ tested the new multi-layer transformer path. It was interrupted before final direct-answer metrics because two-layer full-block scalar autograd was too slow for the regular loop. The partial JSONL history is runtime evidence, not promotion evidence.
  • runs/transformer-answer-v0.43-two-layer-finalopt-screen-dim8-context32/ tested the optimized stacked path where the final layer computes only the last state. The optimization is covered by logit-equivalence tests, but the run was still interrupted before final metrics because the intermediate full-state layer remains too expensive for direct-answer repair updates.
  • runs/transformer-answer-v0.43-two-layer-toponly-skip-screen-dim8-context32/ tested top-layer-only direct-answer updates for a two-layer transformer and the explicit post-direct snapshot skip used for bounded screens. It completed and saved a checkpoint after 40 target-loss steps and 80 direct-answer steps, recorded the skipped post-direct candidate snapshot, improved direct-answer target loss 3.5186 -> 3.2436, but kept direct greedy exact at 0/219 -> 0/219 with repeated "a" output. It is training-loop completion evidence, not promotion evidence.
  • runs/transformer-answer-v0.43-branch-profile-smoke-dim4-context16/ verified direct-answer branch-profile metrics. The QA branch-position-1 profile stayed at 1/8 accuracy, moved from all "o" predictions to all "y" predictions after five tiny direct updates, and kept a negative average target margin. This is model-native self-diagnosis evidence for prompt-independent branch collapse, not promotion evidence.
  • runs/transformer-answer-v0.43-branch-collapse-smoke-dim4-context16/ tested full-dose dominant-branch-token suppression. It regressed direct loss and moved QA branch collapse from all "o" predictions to all "a" predictions.
  • runs/transformer-answer-v0.43-periodic-branch-collapse-smoke-dim4-context16/ tested sparse dominant-token suppression every five direct steps. It improved direct loss 3.5800 -> 3.5157, but QA branch accuracy stayed 1/8 -> 1/8 and the dominant prediction moved from all "o" to all "n". It remains rejected repair evidence because the branch stayed prompt-independent.
  • runs/transformer-answer-v0.43-branch-batch-smoke-dim4-context16/ tested full-dose distinct-target branch batching. It improved direct loss only slightly and moved QA branch collapse from all "o" predictions to all "y" predictions.
  • runs/transformer-answer-v0.43-periodic-branch-batch-smoke-dim4-context16/ tested sparse branch-batch contrast every five direct steps. It improved direct loss 3.5800 -> 3.5248, but QA branch accuracy regressed 1/8 -> 0/8 and the dominant prediction moved from all "o" to all "a". It is rejected evidence that distinct-target batching still does not force prompt-conditioned branch separation in the current representation.
  • runs/transformer-answer-v0.43-context-mean-branch-batch-smoke-dim4-context16/ added --use-context-mean, a representation-side option that adds the mean-pooled prompt context to the final transformer hidden state. With sparse branch-batch contrast it improved direct loss 3.5805 -> 3.5252, but QA branch accuracy regressed 1/8 -> 0/8 and the dominant prediction moved from all "o" to all "a".
  • runs/transformer-answer-v0.43-context-mean-branch-repair-smoke-dim4-context16/ tested the same context-mean representation with sparse branch repair. It improved direct loss 3.5805 -> 3.5310, but again regressed QA branch accuracy 1/8 -> 0/8 and collapsed to all "a" predictions. This is rejected representation evidence: prompt averaging alone is not enough to produce prompt-specific branch choices.
  • runs/transformer-answer-v0.43-context-projection-branch-repair-smoke-dim4-context16/ added --use-context-projection, a zero-initialized trainable projection of the mean-pooled context. It starts baseline-equivalent, moved all 20 projection parameters during training, and improved direct loss 3.5802 -> 3.5217, but QA branch accuracy regressed 1/8 -> 0/8 and the dominant prediction moved from all "o" to all "a".
  • runs/transformer-answer-v0.43-context-projection-branch-batch-smoke-dim4-context16/ tested the same learned projection with sparse branch-batch contrast. It moved all 20 projection parameters and improved direct loss 3.5802 -> 3.5252, but also regressed QA branch accuracy 1/8 -> 0/8 and collapsed to all "a" predictions. This keeps learned context projection in rejected representation evidence.
  • runs/transformer-answer-v0.43-prompt-attention-branch-repair-smoke-dim4-context16/ added --use-prompt-attention-summary, a trainable attention-pooled context summary with a zero-initialized output projection. It moved all 20 output projection parameters and improved direct loss 3.5802 -> 3.5217, but QA branch accuracy regressed 1/8 -> 0/8 and the dominant prediction moved from all "o" to all "a".
  • runs/transformer-answer-v0.43-prompt-attention-branch-batch-smoke-dim4-context16/ tested the same prompt-attention summary with sparse branch-batch contrast. It moved all 20 output projection parameters and improved direct loss 3.5802 -> 3.5252, but again regressed QA branch accuracy 1/8 -> 0/8 and collapsed to all "a" predictions. This keeps trainable prompt attention in rejected representation evidence.
  • runs/transformer-answer-v0.43-branch-context-coverage-smoke-dim4-context16/ added branch_context_coverage diagnostics to direct-answer snapshots. At context size 16, QA had 0/8 semantic coverage and 4 ambiguous branch contexts; for example, "s ball?\nanswer: " mapped to both the place and the color first target tokens.
  • runs/transformer-answer-v0.43-branch-context-coverage-smoke-dim4-context32/ removed QA branch ambiguity (0 ambiguous contexts), but still had 0/8 semantic coverage at the branch point because the prompt prefix was truncated.
  • runs/transformer-answer-v0.43-branch-context-coverage-smoke-dim4-context80/ reached complete branch-context coverage across all eval sets (219/219) with zero ambiguous branch contexts. This is diagnostic evidence for efficient longer-context branch repair.
  • runs/transformer-answer-v0.43-branch-context-gate-smoke-dim4-context16/ made that diagnostic actionable with --direct-answer-require-branch-context-gate. The required gate failed at context size 16, so the run recorded actual_steps: 0 for 5 requested direct-answer steps.
  • runs/transformer-answer-v0.43-branch-context-gate-smoke-dim4-context80/ passed the same required gate at context size 80 and recorded actual_steps: 1 for 1 requested direct-answer step.
  • runs/transformer-answer-v0.43-branch-context-gated-branchonly-smoke-dim4-context80/ added --direct-answer-snapshot-mode branch-only to keep longer-context branch screens bounded. The required context-80 gate passed across all 219/219 semantic records, all 5 requested direct-answer steps ran, and JSONL snapshots recorded evals_skipped: true while retaining branch profiles and branch-context gate evidence.
  • runs/transformer-answer-v0.43-branchonly-periodic-repair-contrast50-dim8-context80/ used branch-only snapshots for a dim8 context-80 version of the best prior sparse repair/contrast policy. The required gate passed and all 100 direct steps ran, but QA branch prediction collapsed to all "a" with final QA branch accuracy 0/8.
  • runs/transformer-answer-v0.43-branchonly-branch-batch-dim8-context80/ tested branch-batch contrast under the same complete context. It lowered interval train loss 3.4614 -> 3.1976, but final QA branch prediction still collapsed to all "a" with final QA branch accuracy 0/8.
  • runs/transformer-answer-v0.43-branch-diversity-target-smoke-dim4-context80/ added a first-class branch_diversity_target to direct-answer snapshots. The required branch-context gate passed and all 5 direct steps ran, but the diversity target failed across all 9 multi-target eval profiles. Final QA had target_unique: 8, predicted_unique: 1, dominant predicted token "r" at rate 1.0, and target-token coverage 0.125.
  • runs/transformer-answer-v0.43-branch-diversity-train-smoke-dim4-context80/ added branch-diversity-unlikelihood, which trains distinct branch targets while penalizing each branch context's current wrong prediction. The required branch-context gate passed and 10/10 direct steps ran, but the diversity target still failed across all 9 multi-target profiles. QA moved from all "x" to all "b" predictions, with target-token coverage 0.0 -> 0.125 and predicted_unique still 1/8.
  • runs/transformer-answer-v0.43-branch-diversity-freezebias-smoke-dim4-context80/ added --direct-answer-freeze-output-bias, which excludes the transformer output bias from direct-answer updates. The required branch-context gate passed and 50/50 direct steps ran with the output bias frozen. Loss moved 3.6149 -> 3.5016, but the diversity target still failed across all 9 multi-target profiles. QA moved from all "x" to all "w" predictions, final target-token coverage was 0.0, and predicted_unique stayed 1/8.
  • runs/transformer-answer-v0.43-branch-target-softmax-freezebias-smoke-dim4-context80/ added branch-target-softmax-unlikelihood, which applies a restricted softmax over the distinct branch targets in each batch. The required branch-context gate passed, output bias was frozen, and 50/50 direct steps ran. Composite train loss moved 5.6671 -> 5.5820, but the diversity target still failed across all 9 multi-target profiles. QA briefly reached predicted_unique: 2 at step 20, then collapsed back to all "w" by step 50.
  • runs/transformer-answer-v0.43-branch-target-softmax-restorebest-smoke-dim4-context80/ added --direct-answer-restore-best-branch-snapshot. The required branch-context gate passed, output bias was frozen, and 50/50 direct steps ran. The run restored the final checkpoint from step 40; final QA moved from the prior all-"w" endpoint to all "u" with target-token coverage 0.125, but predicted_unique stayed 1/8 and all 9 multi-target profiles still failed the diversity target.
  • runs/transformer-answer-v0.43-prompt-prefix-target-softmax-restorebest-smoke-dim4-context80/ added --use-prompt-prefix-projection, a zero-initialized trainable projection over non-padding prompt-prefix positions before the final answer token. All 20 projection parameters moved and composite train loss improved 5.6649 -> 5.5679, but the final checkpoint restored from step 40 to the same all-"u" QA collapse with target-token coverage 0.125.
  • runs/transformer-answer-v0.43-prompt-position-target-softmax-restorebest-smoke-dim4-context80/ added --use-prompt-position-projection, a position-specific trainable projection over non-padding prompt-prefix positions before the final answer token. 1108/1284 projection parameters moved and composite train loss improved 5.6649 -> 5.5679, but the final checkpoint restored from step 40 to the same all-"u" QA collapse with target-token coverage 0.125.
  • runs/transformer-answer-v0.43-branch-target-margin-prompt-position-smoke-dim4-context80/ added branch-target-margin-unlikelihood, a smooth pairwise target-margin loss over each batch's distinct branch targets. The prompt-position context-80 screen moved train loss 4.8973 -> 4.7784 and moved 1108/1284 prompt-position projection parameters, but the final checkpoint restored from step 40 to the same all-"u" QA collapse with target-token coverage 0.125.
  • runs/transformer-answer-v0.43-branch-representation-contrast50-prompt-position-smoke-dim4-context80/ added branch_representation_profiles and branch-representation-contrast-unlikelihood. The high-weight prompt-position context-80 screen used --direct-answer-contrast-weight 50.0 and moved QA different-target hidden distance only about 0.00097 -> 0.00107 at the restored checkpoint; the final branch profile still restored to the same all-"u" QA collapse.
  • runs/transformer-answer-v0.43-branch-representation-contrast50-prompt-position-smoke-dim8-context80-steps40/ tested the same high-weight representation-contrast path at embedding/feed-forward dimensions 8/16. The completed 40/40 step screen restored from step 10, moved QA different-target hidden distance to about 0.00209, and still restored to the same all-"u" QA collapse with target-token coverage 0.125.
  • runs/transformer-answer-v0.43-prompt-position-scale32-repcontrast50-smoke-dim4-context80/ added --prompt-position-projection-scale 32.0 to test whether the prompt-position residual was simply too quiet. The completed 50/50 step screen moved 1108/1284 prompt-position projection parameters and restored from step 40; restored QA different-target hidden distance rose to about 0.01235, but QA still collapsed to all "u" with target-token coverage 0.125.
  • STRUCTURE_AUDIT.md now records the next transformer checkpoint: study open-source model, trainer, tokenizer, checkpoint, and transparency patterns before adding another repair objective, while keeping all external weights, tokenizers, embeddings, datasets, and training text outside QuarkLM's closed-world boundary. The completed comparison table chooses an opt-in pre-layer-norm transformer block path with final normalization as the next structural implementation target.
  • runs/transformer-answer-v0.44-prelayernorm-repcontrast50-prompt-position-smoke-dim4-context80/ implemented that path with --use-pre-layer-norm. The bounded context-80 screen ran 50/50 direct steps, moved 1108/1284 prompt-position parameters and all 8 final-norm parameters, and broke full collapse in 7/9 multi-target profiles. The formal diversity target still failed, with 0/9 profiles passing, and QA stayed collapsed to all "y" with target-token coverage 0.125.
  • runs/transformer-answer-v0.44-target-balanced-prelayernorm-repcontrast50-prompt-position-smoke-dim4-context80/ added target-bucket branch batch sampling through branch-balanced-representation-contrast-unlikelihood. The screen ran 50/50 direct steps, but best-snapshot restoration returned to step 0 because every trained snapshot scored worse than baseline. All 9/9 multi-target profiles collapsed to "n", so target balancing is rejected as a standalone repair.
  • runs/transformer-answer-v0.45-branch-rank-diagnostic-smoke-dim4-context80/ adds target-rank diagnostics to branch profiles. The smoke used the pre-layer-norm prompt-position path and recorded QA and heldout both collapsed to "n" with average target rank 14.25 and top-3/top-5 target coverage 0.125. The correct branch target is usually buried behind several global alternatives, so this is output-binding evidence rather than a near-miss rank problem.
  • runs/transformer-answer-v0.46-output-binding-rankscore-smoke-dim4-context80/ adds branch-output-binding-unlikelihood, combining branch target softmax with representation contrast, and makes best-snapshot scoring rank-aware. It ran 20/20 direct steps with output bias frozen. QA average target rank improved 17.375 -> 14.125, and QA/heldout top-5 coverage reached 0.25. Target-token coverage stayed 0.0, top-3 coverage ended 0.0, and the branch prediction still collapsed to wrong tokens, so the repair is rejected for promotion.
  • runs/transformer-answer-v0.47-rank-margin-steps50-smoke-dim4-context80/ adds branch-rank-margin-unlikelihood, which pushes each branch target above the model's own top wrong tokens. The screen ran 50/50 direct steps, restored the rank-aware best snapshot from step 40, and improved QA average target rank 17.375 -> 9.0. QA target-token coverage rose to 0.125, top-3 coverage rose to 0.25, and top-5 coverage rose to 0.5. It is still rejected because predicted diversity stayed 1/8 and QA/heldout remained collapsed to wrong "n".
  • runs/transformer-answer-v0.48-balanced-rank-margin-smoke-dim4-context80/ combines target-balanced branch batches with the same rank-margin repair. It ran 50/50 direct steps and reached QA predicted diversity 2/8, target-token coverage 0.125, average target rank 9.375, top-3 coverage 0.375, and top-5 coverage 0.5. It is still rejected because QA and heldout remain wrong top-1 branch choices.
  • runs/transformer-answer-v0.49-balanced-rank-margin-top1-smoke-dim4-context80/ tests the same balanced rank-margin path with --direct-answer-hard-negatives 1, concentrating margin pressure on only the current top wrong token. It restored from step 10; QA target-token coverage stayed 0.125, but average target rank regressed to 12.5, top-3 coverage fell to 0.125, and top-5 coverage fell to 0.25. This is rejected evidence.
  • runs/transformer-answer-v0.50-balanced-topk-softmax-w5-smoke-dim4-context80/ adds branch-balanced-topk-softmax-unlikelihood, where each correct branch target competes in a restricted softmax against the model's current top wrong tokens. It restored from step 40; QA target-token coverage stayed 0.125, average target rank improved to 8.75, top-3 coverage reached 0.375, and top-5 coverage reached 0.5. This recovers rank/top-k evidence after v0.49, but prediction diversity stayed 1/8 and top-1 branch choices remained wrong, so it is rejected repair evidence.
  • runs/transformer-v0.51-foundation-stack-smoke/ verifies the full transformer foundation stack before the next direct-answer repair run. It ran 2/2 language-model steps with AdamW, gradient accumulation, two attention heads, RMSNorm, gated MLPs, tied output embeddings, rotary positions, and cache-aware generation metadata. The run wrote a quarklm-transformer-v2 checkpoint, optimizer_state.json, eval.json, and replayable eval_samples.jsonl traces. This is mechanics-readiness evidence, not model-quality promotion evidence.
  • runs/transformer-answer-v0.52-fullstack-topk-softmax-smoke-dim4-context80/ reruns the v0.50 top-k branch objective under the full v0.51 stack. It completed 50/50 direct steps and restored to step 0: the full-stack baseline had QA and heldout predicted diversity 3/8 and target-token coverage 0.25, but training collapsed to one wrong token at later snapshots. This rejects unchanged top-k pressure under the full stack and points the next repair toward prompt-context-to-target-token binding.
  • runs/transformer-answer-v0.53-fullstack-bidir-binding-smoke-dim4-context80/ adds branch-balanced-bidirectional-binding-unlikelihood. The objective trains each prompt context to choose its own branch target and each target token to assign cross-context probability mass back to its own prompt contexts. The focused transformer unit test verifies that context-ownership signal on a small branch batch. The full-stack screen completed 50/50 direct steps and restored from step 40: QA average target rank improved to 7.875 with top-5 coverage 0.5, but target-token coverage ended at 0.125 and the diversity target still failed 0/9 multi-target profiles. This is partial rank-pressure evidence, not promotion evidence.
  • runs/transformer-answer-v0.54-fullstack-coverage-binding-smoke-dim4-context80/ adds branch-balanced-coverage-binding-unlikelihood, which makes every branch target compete against sibling branch targets and hard wrong tokens while adding a target-set mass coverage guard. The focused transformer test verifies that this pressure lifts target-set mass against hard wrong tokens. The full-stack screen completed 50/50 direct steps but restored from step 0: training snapshots improved QA average target rank to 8.125, but target-token coverage collapsed to 0.0 and top-1 predictions collapsed to wrong "a". This rejects the bundled coverage-binding loss under the full stack.
  • runs/transformer-answer-v0.55-fullstack-target-set-coverage-smoke-dim4-context80/ isolates target-set coverage with branch-balanced-target-set-coverage-unlikelihood, positive target CE disabled, and no exact-target row or cross-context ownership losses. The focused transformer test verifies that target-set mass can increase against hard wrong tokens without asserting exact-target sharpening. The full-stack screen completed 50/50 direct steps and restored from step 0: training snapshots improved QA average target rank to 10.0, but target-token coverage still collapsed to 0.0 with wrong "a" top-1 predictions. This rejects batch-local target-set mass as a sufficient coverage repair.
  • runs/transformer-answer-v0.57-fullstack-target-diversity-smoke-dim4-context80/ adds target-share anti-collapse pressure with branch-balanced-target-diversity-unlikelihood, positive target CE disabled, and hard wrong-token competition. The focused transformer test verifies that restricted target-set mass and weakest target-share balance can both improve in a small branch batch. The full-stack screen completed 50/50 direct steps and restored from step 0: training snapshots improved QA average target rank to 10.0, but target-token coverage again collapsed to 0.0 with wrong "a" top-1 predictions. This rejects batch-local target sharing as a sufficient eval-wide anti-collapse repair.
  • runs/transformer-answer-v0.58-fullstack-target-replay-coverage-smoke-dim4-context80/ extends the repair from batch-local target sharing to closed-world replay targets with branch-balanced-target-replay-coverage-unlikelihood, positive target CE disabled, and hard wrong-token competition. The focused transformer test verifies that replay target-set mass and weakest missing-target share can both improve when the sampled branch batch omits some admitted pool targets. The full-stack screen completed 50/50 direct steps and restored from step 0: training snapshots improved QA average target rank as far as 6.875 and top-5 coverage to 0.5, but target-token coverage still hit 0.0 during training and QA/heldout top-1 predictions collapsed to wrong "n" by step 50. This rejects pool-owned replay coverage as a sufficient context-specific target-ownership repair.
  • runs/transformer-answer-v0.59-fullstack-context-replay-coverage-smoke-dim4-context80/ makes replay context-owned with branch-balanced-context-replay-coverage-unlikelihood, positive target CE disabled, and hard wrong-token competition. The focused transformer test verifies that replay target-set mass and weakest owned-target share can both improve on fixed replay contexts. The full-stack screen completed 50/50 direct steps and restored from step 0: training snapshots improved QA average target rank as far as 7.375, QA top-3 to 0.375, QA top-5 to 0.5, and admissions top-5 to 0.5208 by step 50, but target-token coverage still hit 0.0 during training and the diversity target failed 0/9. This rejects context-owned replay coverage as implemented.
  • runs/transformer-answer-v0.60-fullstack-context-replay-coverage-floor-metadata-smoke-dim4-context80/ adds a profile-wise target-token coverage floor to branch snapshot selection: rank/top-k gains are eligible only when every multi-target profile preserves its baseline coverage. Direct-answer JSONL snapshots now write branch_target_coverage_by_profile, and the focused transformer test rejects a rank-lifted candidate that regresses QA coverage. The clean full-stack screen completed 50/50 direct steps, wrote 7 JSONL rows, and restored from step 0: the baseline coverage floor remained visible in the final row (qa 0.25, heldout 0.25, admissions 0.1429, minimum profile 0.0714). This accepts the self-improvement gate repair while still rejecting the trained model behavior.
  • runs/transformer-answer-v0.61-fullstack-context-coverage-anchor-smoke-dim4-context80/ adds a covered-target anchor to context replay: replay branches whose own target is already top-1 receive extra target-vs-replay-target/hard-wrong pressure. The focused transformer test verifies that the anchor protects a covered branch better than identical replay training without the anchor. The full-stack screen completed 50/50 direct steps and restored from step 0 under the v0.60 coverage floor, but trained snapshots over-anchored the already-covered wrong "i" token: QA/heldout predicted diversity fell to 1/8, target-token coverage to 0.125, and average target rank above 21. This rejects global covered-target anchoring as implemented.
  • runs/transformer-answer-v0.62-fullstack-target-balanced-anchor-smoke-dim4-context80/ makes covered-target anchoring target-balanced: anchor losses are averaged by covered target and skipped when only one covered target is present. The focused transformer test verifies that this singleton guard skips the v0.61 one-token over-anchor while the old global anchor still raises that token. The full-stack screen completed 50/50 direct steps and restored from step 0 under the v0.60 coverage floor. It avoided the hard "i" attractor, but QA/heldout target-token coverage still collapsed to 0.0 during training. This rejects target-balanced anchoring as sufficient.
  • runs/transformer-answer-v0.64-fullstack-coverage-deficit-smoke-dim4-context80/ adds branch-balanced-context-coverage-deficit-unlikelihood, which computes replay target tokens that are absent from the current replay predictions and adds target pressure only for those missing targets. The focused transformer test verifies that the deficit term lifts a missing replay target above the old context replay objective. The full-stack screen completed 50/50 direct steps and restored from step 0 under the v0.60 coverage floor. Step 50 cracked QA top-1 behavior enough to reach 1/8 branch accuracy and predicted diversity 4/8, but QA/heldout target-token coverage regressed to 0.125, so the trained snapshots remained ineligible. This rejects deficit pressure by itself.
  • runs/transformer-answer-v0.65-fullstack-coverage-preserving-deficit-smoke-dim4-context80/ adds branch-balanced-context-coverage-preserving-deficit-unlikelihood, which balances missing-target deficit pressure with preservation anchors for target tokens currently represented in replay predictions. Focused tests pass and verify both effects in isolation. The full-stack screen completed 50/50 direct steps and restored from step 0. Step 50 improved QA average target rank to 7.75, heldout average target rank to 7.125, and top-5 coverage to 0.5, but both profiles collapsed to one predicted target token with target-token coverage 0.125. This rejects current-prediction preservation as implemented.
  • runs/transformer-answer-v0.67-profile-aware-replay-plan-smoke-dim4-context80/ adds profile-aware replay records and direct_answer_replay_plan.json for the preserving-deficit path. Focused tests verify that global target coverage cannot hide a profile-local missing target and that profiled replay records keep their admitted source keys even when branch target tokens are shared. The bounded smoke wrote a plan for 9144 branch/replay records across 21 profiles, passed the branch-context gate across 219/219 semantic records, and showed profile-specific coverage floors such as qa:place at 0.5 and qa:color at 0.0. It ran one branch-only direct step and restored from step 0; branch diversity still failed 0/9 multi-target profiles. This is mechanics-readiness evidence, not model-quality promotion evidence.
  • runs/transformer-answer-v0.68-fullstack-profile-aware-preserving-deficit-smoke-dim4-context80/ spends that replay plan on the comparable full-stack repair screen. The run completed 50/50 direct steps, wrote 7 direct-answer JSONL rows, passed the branch-context gate, and used a replay plan for 9144 branch records across 21 profiles. Training step 40 improved QA average target rank to 6.5 and top-5 coverage to 0.625; heldout average rank improved to 6.875 with top-5 coverage 0.5. Those rank gains came with QA/heldout target-token coverage regressing to 0.125 and predicted diversity collapsing to 1/8, so best-snapshot scoring restored step 0. This is rejected evidence.
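
The branch-diversity fields quoted throughout the list above (target_unique, predicted_unique, dominant predicted rate, and target-token coverage) can be illustrated with a minimal sketch. The function name, the dictionary keys, and the exact definitions are assumptions made for illustration, not QuarkLM's implementation; in particular, target-token coverage is assumed here to be the share of distinct target tokens that appear among the predictions. The token lists are hypothetical.

```python
from collections import Counter


def branch_diversity_summary(targets, predictions):
    """Summarize first-token branch diversity for one eval profile (assumed definitions)."""
    if len(targets) != len(predictions) or not targets:
        raise ValueError("targets and predictions must be non-empty and aligned")
    target_set = set(targets)
    counts = Counter(predictions)
    dominant_token, dominant_count = counts.most_common(1)[0]
    # Assumption: coverage counts distinct target tokens that were predicted at least once.
    covered = target_set & set(predictions)
    return {
        "target_unique": len(target_set),
        "predicted_unique": len(counts),
        "dominant_predicted_token": dominant_token,
        "dominant_predicted_rate": dominant_count / len(predictions),
        "target_token_coverage": len(covered) / len(target_set),
    }


# Hypothetical QA profile: eight distinct first tokens, every prediction is "r".
qa_targets = list("rbgkmpst")
qa_predictions = ["r"] * 8
summary = branch_diversity_summary(qa_targets, qa_predictions)
print(summary)
```

Under these assumed definitions, a fully collapsed profile whose single predicted token happens to be one of eight targets reports predicted_unique 1, dominant rate 1.0, and coverage 0.125, which is the shape of the collapsed results reported above. A collapse onto a token outside the target set would report coverage 0.0.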

v0.81 keeps the context-coverage audit, profile-wise coverage floor, and replay-plan artifact, then adds balanced profile target-share pressure inside the profile-local direct-answer objective. Focused tests verify the minority replay target gains more share than under the previous profile-aware replay loss. v0.82 then screens that objective in runs/transformer-answer-v0.82-fullstack-profile-target-share-smoke-dim4-context80/. The run records the modern artifact stack and passes the verifier, branch-context, purity, and coverage-preservation gates, but it still fails branch diversity. Step 40 improves QA average target rank to 9.125, yet does so by collapsing QA and heldout to one "c" prediction with 0.0 target-token coverage. Best-snapshot scoring restores step 0, so this is rejected evidence.
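
Average target rank and top-k coverage can move independently of top-1 behavior, which is why a rank gain can coexist with a collapse. The sketch below is a hedged illustration of one plausible way to compute those diagnostics. The helper names, the 1-based rank convention, and the reading of top-k coverage as "share of records whose target is within the top k" are assumptions; the logits are made up.

```python
def target_rank(logits, target_index):
    """1-based rank of the target token (1 means the target is the top-1 choice)."""
    target_logit = logits[target_index]
    return 1 + sum(1 for value in logits if value > target_logit)


def rank_summary(records, ks=(3, 5)):
    """Average target rank plus the share of records whose target is within the top k."""
    ranks = [target_rank(logits, target) for logits, target in records]
    summary = {"average_target_rank": sum(ranks) / len(ranks)}
    for k in ks:
        summary[f"top{k}_coverage"] = sum(rank <= k for rank in ranks) / len(ranks)
    return summary


# Hypothetical branch records: (logits over a 6-token vocabulary, target index).
records = [
    ([0.1, 0.9, 0.3, 0.2, 0.0, 0.5], 1),  # target is top-1
    ([0.9, 0.1, 0.8, 0.7, 0.6, 0.5], 1),  # target is last
    ([0.4, 0.3, 0.9, 0.8, 0.1, 0.2], 0),  # target is third
    ([0.2, 0.3, 0.4, 0.5, 0.6, 0.1], 3),  # target is second
]
result = rank_summary(records)
print(result)
```

In this toy batch only one of four targets is the top-1 choice, yet three of four sit inside the top 3. A repair can therefore improve rank and top-k numbers while the top-1 prediction stays wrong.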

v0.83 adds prompt-specific sibling-target ownership margins on top of that profile target-share objective. Focused tests show the new term lifts a context-specific target more than v0.82 target-share pressure. The full screen in runs/transformer-answer-v0.83-fullstack-prompt-ownership-smoke-dim4-context80/ writes the modern artifacts and passes the verifier, branch-context, and purity gates. It still fails branch diversity: step 50 improves QA average target rank to 8.625, but QA and heldout collapse to one "c" prediction with 0.0 target-token coverage during training. Best-snapshot scoring restores step 0.

v0.84 anchors replay preservation to the baseline profile-aware replay predictions captured before direct-answer training. Focused tests show replay batches can use those baseline prediction overrides and that anchored preservation protects a covered target better than following current prediction drift. The full screen in runs/transformer-answer-v0.84-fullstack-baseline-anchored-prompt-ownership-smoke-dim4-context80/ records 562 active baseline prediction anchors, passes the verifier, branch-context, and purity gates, and avoids the v0.83 zero-coverage collapse. Step 40 improves QA average target rank to 8.0, but QA and heldout still collapse to one "i" prediction with target-token coverage 0.125, below the baseline 0.25 floor. Best-snapshot scoring restores step 0, so the next repair must preserve the full baseline target-token floor.
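
The restore-to-step-0 outcome follows from combining best-snapshot scoring with the profile-wise coverage floor. A minimal sketch of that selection rule is below. The snapshot field names and the use of average target rank as the only score are assumptions for illustration; the real rank-aware score is not specified here, and the numbers are hypothetical.

```python
def select_best_snapshot(snapshots, baseline_step=0):
    """Return the step of the best floor-preserving snapshot, else the baseline step."""
    by_step = {snap["step"]: snap for snap in snapshots}
    floor = by_step[baseline_step]["coverage_by_profile"]

    def preserves_floor(snap):
        coverage = snap["coverage_by_profile"]
        return all(coverage.get(profile, 0.0) >= value for profile, value in floor.items())

    # The baseline always preserves its own floor, so this list is never empty.
    eligible = [snap for snap in snapshots if preserves_floor(snap)]
    # Assumed score: lower average target rank wins; ties prefer the earlier step.
    best = min(eligible, key=lambda snap: (snap["average_target_rank"], snap["step"]))
    return best["step"]


# Hypothetical snapshots: step 40 has a better rank but drops below the 0.25 floor.
snapshots = [
    {"step": 0, "average_target_rank": 12.0, "coverage_by_profile": {"qa": 0.25, "heldout": 0.25}},
    {"step": 40, "average_target_rank": 8.0, "coverage_by_profile": {"qa": 0.125, "heldout": 0.125}},
]
restored = select_best_snapshot(snapshots)

# A floor-preserving candidate with a better rank would be selected instead.
safe_candidate = {"step": 20, "average_target_rank": 10.0, "coverage_by_profile": {"qa": 0.25, "heldout": 0.375}}
restored_with_safe = select_best_snapshot(snapshots + [safe_candidate])
print(restored, restored_with_safe)
```

Filtering on the floor before scoring means a rank gain can never buy back lost coverage, which matches the rejection pattern described in this paragraph.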

v0.85 adds a baseline-floor update guard around the baseline-anchored prompt-ownership mode. The full screen in runs/transformer-answer-v0.85-fullstack-baseline-floor-gated-prompt-ownership-smoke-dim4-context80/ records 562 active baseline prediction anchors and checks 50/50 attempted direct-answer updates. The guard rejects all 50 unsafe updates, preserving QA and heldout target-token coverage at the baseline 0.25 floor in every recorded snapshot. It is still rejected evidence: no weight update is accepted and branch diversity still fails across all 9 multi-target profiles.
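
A baseline-floor update guard of this kind can be sketched as "apply the update to a copy, re-measure coverage per profile, and keep the old weights if any profile falls below its baseline". The code below is an illustrative sketch only: the function names, the flat list of weights, and the toy coverage evaluator are assumptions and do not reflect the trainer's actual interfaces.

```python
def floor_violations(baseline_coverage, candidate_coverage):
    """Profiles whose candidate coverage fell below the baseline floor."""
    return sorted(
        profile
        for profile, floor in baseline_coverage.items()
        if candidate_coverage.get(profile, 0.0) < floor
    )


def guarded_update(weights, update, evaluate_coverage, baseline_coverage):
    """Apply the update only if every profile keeps its baseline coverage."""
    candidate = [w + u for w, u in zip(weights, update)]
    violations = floor_violations(baseline_coverage, evaluate_coverage(candidate))
    if violations:
        return weights, {"accepted": False, "violations": violations}
    return candidate, {"accepted": True, "violations": []}


# Toy evaluator (hypothetical): QA coverage drops once the first weight grows too large.
def toy_coverage(weights):
    return {"qa": 0.25 if weights[0] < 0.5 else 0.125, "heldout": 0.25}


baseline = {"qa": 0.25, "heldout": 0.25}
kept, unsafe_report = guarded_update([0.0, 0.0], [1.0, 0.0], toy_coverage, baseline)
moved, safe_report = guarded_update([0.0, 0.0], [0.1, 0.2], toy_coverage, baseline)
print(unsafe_report, safe_report)
```

When every attempted update lands in the rejected branch, coverage stays at the floor but the weights never move, which is the situation this paragraph reports.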

v0.86 adds adaptive retries around that update guard. The full screen in runs/transformer-answer-v0.86-fullstack-baseline-floor-adaptive-prompt-ownership-smoke-dim4-context80/ records 562 active baseline prediction anchors and attempts 200 scaled updates across 50 checked direct-answer steps. Scales 1.0, 0.25, 0.05, and 0.01 all still violate at least one profile-wise baseline coverage floor, so the guard rejects 200/200 attempts. It is still rejected evidence: step-size retry alone does not produce accepted safe updates.
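
The retry accounting above (four scales per step, 50 steps, 200 attempts) can be reproduced with a small sketch. The scale schedule comes from the text; everything else, including the function name, the safety predicate, and the toy weights, is an assumption for illustration.

```python
SCALES = (1.0, 0.25, 0.05, 0.01)  # scale schedule quoted in the text


def adaptive_guarded_step(weights, update, is_safe, scales=SCALES):
    """Try the update at shrinking scales and keep the first candidate the guard accepts."""
    attempts = []
    for scale in scales:
        candidate = [w + scale * u for w, u in zip(weights, update)]
        accepted = is_safe(candidate)
        attempts.append({"scale": scale, "accepted": accepted})
        if accepted:
            return candidate, attempts
    return weights, attempts  # every scale was unsafe, so the weights stay unchanged


# Toy guard (hypothetical): any movement larger than 0.005 breaks a coverage floor.
def toy_is_safe(weights):
    return abs(weights[0]) <= 0.005


weights = [0.0]
rejected = 0
for _ in range(50):
    weights, attempts = adaptive_guarded_step(weights, [1.0], toy_is_safe)
    rejected += sum(not attempt["accepted"] for attempt in attempts)
print(rejected, weights)
```

In this toy setup the smallest scale still overshoots the safe region, so all 50 × 4 = 200 attempts are rejected and the weights never change. Shrinking the step helps only if the schedule reaches a scale the guard can accept.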

v0.87 adds one bounded baseline-covered anchor repair after each unsafe adaptive retry. The clean full screen in runs/transformer-answer-v0.87-fullstack-baseline-floor-repaired-prompt-ownership-clean-smoke-dim4-context80/ records 562 active baseline prediction anchors, 227 repair anchors, 200 repair attempts, and 200/200 rejected update attempts. QA and heldout coverage remain at 0.25, but no repaired update is accepted, so post-update repair is also rejected as the missing mechanic.

v0.88 moves balanced baseline-floor anchors into the direct-answer objective itself. The full screen in runs/transformer-answer-v0.88-fullstack-baseline-floor-objective-prompt-ownership-smoke-dim4-context80/ records 562 active baseline prediction anchors, 227 objective-side floor anchors, 200 objective anchor batches, 2400 anchor records, and 200/200 rejected update attempts. QA and heldout coverage remain at 0.25, but no objective-shaped update is accepted, so branch-pressure coupling is rejected as the missing mechanic.

v0.89 isolates baseline-floor stabilization updates. The full screen in runs/transformer-answer-v0.89-fullstack-baseline-floor-stabilization-smoke-dim4-context80/ records 562 active baseline prediction anchors, 227 stabilization anchors, 200 stabilization anchor batches, 2400 anchor records, and 200/200 rejected update attempts. QA and heldout coverage remain at 0.25, but no stabilization-only update is accepted, so the next repair should diagnose why floor-only updates still violate the baseline floor.

v0.90 adds the missing rejection diagnosis. The full screen in runs/transformer-answer-v0.90-fullstack-baseline-floor-stabilization-diagnostics-smoke-dim4-context80/ records rejected update-shape counts of stabilization: 200, 50 rejected attempts at each adaptive scale, violation profile counts of heldout: 200, and a worst floor deficit of 0.25 on the learning profile. Promotion still rejects the transformer, but the next repair now has measured profile-level floor evidence.

v0.91 applies that evidence by covering the full baseline-covered profile-target floor surface. The full screen in runs/transformer-answer-v0.91-fullstack-baseline-floor-profile-targeted-stabilization-smoke-dim4-context80/ records 227 floor anchors, 12 profile-target groups, profile_targeted_stabilization: 200 rejected attempts, and the same violation profile counts as v0.90. Promotion still rejects the transformer, and the next repair must change the floor repair shape rather than only broaden anchor coverage.

v0.92 changes that shape to sequential source-profile floor repair. The full screen in runs/transformer-answer-v0.92-fullstack-baseline-floor-sequential-profile-stabilization-smoke-dim4-context80/ records 10 source-profile groups, 2000 profile-local repair attempts, 2000 profile-local rejections, and 200 no-effective-update outer attempts. Promotion still rejects the transformer, and the next repair must isolate floor-preserving weight movement rather than only broaden coverage or reorder profiles.

v0.93 calibrates that movement below 0.01. The diagnostic screen in runs/transformer-answer-v0.93-baseline-floor-calibrated-sequential-profile-stabilization-step1-dim4-context80/ records calibrated scales down to 0.0001, 50 profile-local repair attempts, 49 profile-local rejections, and one accepted nonzero bridge:owner update at scale 0.0025. Promotion still rejects the transformer on branch_diversity_target, but the baseline floor guard has now accepted real weight movement.

v0.94 adds profile-scale memory to that calibrated path. The diagnostic screen in runs/transformer-answer-v0.94-baseline-floor-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer profile-scale update, 60 profile-scale attempts, 8 accepted source-profile updates, and 52 rejected profile-scale attempts. Promotion still rejects the transformer on branch_diversity_target, but safe floor-preserving movement now spans multiple source profiles.

v0.95 adds diversity-aware profile-scale memory to that path. The diagnostic screen in runs/transformer-answer-v0.95-baseline-floor-diversity-profile-scale-calibrated-sequential-stabilization-configured-step1-dim4-context80/ records one accepted outer diversity-aware profile-scale update, 58 profile-scale attempts, 5 score-improving accepted source-profile updates, 42 floor regressions, and 11 floor-preserving diversity-score regressions. Promotion still rejects the transformer on branch_diversity_target, but the training loop now records which safe movements are non-regressive for branch diversity.

v0.96 adds missing-target frontier anchors to that path. The diagnostic screen in runs/transformer-answer-v0.96-baseline-floor-diversity-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records 52 frontier anchors, one accepted outer frontier profile-scale update, 43 profile-scale attempts, 9 score-improving accepted source-profile updates, 28 floor regressions, and 6 floor-preserving diversity-score regressions. Promotion still rejects the transformer on branch_diversity_target, but max dominant predicted rate improves to 0.9 and minimum target-token coverage improves to 0.1667.

v0.97 adds coverage-frontier acceptance to that path. The diagnostic screen in runs/transformer-answer-v0.97-baseline-floor-diversity-coverage-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer coverage-frontier profile-scale update, 68 profile-scale attempts, 1 coverage-gaining accepted source-profile update, 50 floor regressions, 15 coverage ties, and 2 coverage regressions. Promotion still rejects the transformer on branch_diversity_target, but the update guard now records accepted coverage deltas and proves the strict monotonic screen is currently too conservative for full missing-target repair.

v0.98 adds coverage-prep frontier acceptance to that path. The diagnostic screen in runs/transformer-answer-v0.98-baseline-floor-diversity-coverage-prep-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer coverage-prep profile-scale update, 43 profile-scale attempts, 9 accepted source-profile updates, 3 coverage gains, 6 coverage-preparation moves, 28 floor regressions, 4 coverage ties without score gain, and 2 coverage regressions. Promotion still rejects the transformer on branch_diversity_target, but the update guard now separates direct coverage gains from safe preparation moves.
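
The five outcome counts reported for this screen (3 coverage gains, 6 coverage-preparation moves, 28 floor regressions, 4 coverage ties without score gain, and 2 coverage regressions) sum to the 43 profile-scale attempts, so they read as a partition of attempt outcomes. The sketch below shows one way such a classifier could be structured. The category names, the comparison order, and the distinction between the baseline floor and the current coverage are assumptions inferred from the counts, not the project's code.

```python
ACCEPTED = {"coverage_gain", "coverage_preparation"}


def classify_update(floor, current, candidate, score_delta):
    """Label one candidate update by its effect on per-profile coverage (assumed rules)."""
    profiles = list(floor)
    if any(candidate[p] < floor[p] for p in profiles):
        return "floor_regression"  # fell below the baseline floor
    if any(candidate[p] < current[p] for p in profiles):
        return "coverage_regression"  # above the floor, but worse than the current state
    if any(candidate[p] > current[p] for p in profiles):
        return "coverage_gain"
    # Coverage is unchanged: accept only if the diversity score improved.
    return "coverage_preparation" if score_delta > 0 else "coverage_tie_without_score_gain"


# Hypothetical two-profile state: the current coverage already exceeds the baseline floor on qa.
floor = {"qa": 0.25, "heldout": 0.25}
current = {"qa": 0.375, "heldout": 0.25}

labels = [
    classify_update(floor, current, {"qa": 0.125, "heldout": 0.25}, 0.1),
    classify_update(floor, current, {"qa": 0.25, "heldout": 0.25}, 0.1),
    classify_update(floor, current, {"qa": 0.5, "heldout": 0.25}, 0.0),
    classify_update(floor, current, {"qa": 0.375, "heldout": 0.25}, 0.02),
    classify_update(floor, current, {"qa": 0.375, "heldout": 0.25}, 0.0),
]
accepted = [label for label in labels if label in ACCEPTED]
print(labels)
```

Keeping preparation moves as their own accepted category lets the log distinguish updates that directly gain coverage from updates that only improve the score at unchanged coverage.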

v0.99 adds coverage-recovery frontier retry to that path. The diagnostic screen in runs/transformer-answer-v0.99-baseline-floor-diversity-coverage-recovery-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer coverage-recovery profile-scale update, 54 profile-scale attempts, 6 accepted source-profile updates, 6 prepared recovery candidates, 15 recovery retries, 2 direct coverage recoveries, 4 preparation fallbacks, 38 floor regressions, 7 coverage ties without score gain, and 3 coverage regressions. Promotion still rejects the transformer on branch_diversity_target, but the guard now proves preparation can be tested as direct missing-target recovery before it is admitted as self-improvement evidence.

v0.100.0 adds branch-stable coverage-recovery acceptance to that path. The diagnostic screen in runs/transformer-answer-v0.100.0-baseline-floor-diversity-branch-stable-coverage-recovery-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer branch-stable recovery update, 54 profile-scale attempts, 6 accepted source-profile updates, 6 prepared recovery candidates, 15 branch-stability checks, 2 branch-stable coverage recoveries, 4 preparation fallbacks, 7 floor-regressed recovery retries, 5 coverage-tied retries, and 1 branch-score regression rejection. Promotion still rejects the transformer on branch_diversity_target, but the guard now proves recovery can be checked against the prepared branch-diversity score instead of coverage alone.

v0.101.0 adds branch-diversity recovery after already-safe profile updates. The diagnostic screen in runs/transformer-answer-v0.101.0-baseline-floor-diversity-branch-diversity-recovery-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer branch-diversity recovery update, 52 profile-scale attempts, 6 accepted source-profile updates, 6 branch-diversity recovery candidates, 9 branch-diversity recovery attempts, 5 branch-score-improving refinements, 1 fallback, 1 floor-regression rejection, 1 score-regression rejection, and 2 score-tie rejections. Promotion still rejects the transformer on branch_diversity_target, but the guard now proves local branch-diversity score can improve without weakening the coverage floor.

v0.102.0 adds collapsed-profile binding after branch-diversity recovery. The diagnostic screen in runs/transformer-answer-v0.102.0-baseline-floor-diversity-collapsed-profile-binding-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer collapsed-profile binding update, 54 profile-scale attempts, 11 accepted source-profile updates, 11 branch-diversity recovery candidates, 26 branch-diversity recovery attempts, 4 branch-score refinements, 31 collapsed-profile binding attempts, 1 accepted binding update, 10 binding fallbacks, 27 collapsed-profile ties, 1 floor-regression rejection, and 2 score-regression rejections. Promotion still rejects the transformer on branch_diversity_target, but the guard now proves a targeted binding update can survive while the final collapse count narrows from 9/9 eval profiles at baseline to 3/9: learning, owner, and paraphrases.

v0.103.0 adds remaining-profile binding after collapsed-profile binding. The diagnostic screen in runs/transformer-answer-v0.103.0-baseline-floor-diversity-remaining-profile-binding-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer remaining-profile binding update, 56 profile-scale attempts, 11 accepted source-profile updates, 21 prioritized remaining-profile attempts, 6 prioritized acceptances, 15 prioritized rejections, 3 branch-diversity refinements, and 2 collapsed-profile binding updates. Promotion still rejects the transformer on branch_diversity_target, but the guard proves the remaining-profile curriculum can improve learning coverage from 0.0 to 0.25 without target coverage regression.

v0.104.0 adds owner/paraphrase residual binding after remaining-profile binding. The diagnostic screen in runs/transformer-answer-v0.104.0-baseline-floor-diversity-owner-paraphrase-binding-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/ records one accepted outer owner/paraphrase binding update, 16 owner/paraphrase-prioritized attempts, 6 prioritized acceptances, 10 prioritized rejections, 75 learning-preservation checks, 24 preservation failures, and 33 narrowed collapsed-profile binding rejections. Promotion still rejects the transformer on branch_diversity_target, but learning finishes non-collapsed with coverage 0.25 and predicted diversity 2.
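To make "coverage 0.25 and predicted diversity 2" concrete, here is a small sketch under stated assumptions. It assumes coverage is the fraction of distinct target first tokens that appear among the predicted first tokens, that predicted diversity is the number of distinct predicted tokens, and that a profile counts as collapsed when that diversity is at most 1. These definitions, the function name, and the toy tokens are assumptions for illustration and may differ from the metrics QuarkLM computes.

```python
def profile_summary(target_tokens, predicted_tokens):
    """Summarize one eval profile under the assumed metric definitions."""
    targets = set(target_tokens)
    predicted = set(predicted_tokens)
    # Assumed: coverage = share of distinct targets the model predicts at all.
    coverage = len(targets & predicted) / len(targets) if targets else 0.0
    # Assumed: diversity = number of distinct predicted tokens.
    diversity = len(predicted)
    return {
        "coverage": coverage,
        "predicted_diversity": diversity,
        "collapsed": diversity <= 1,
    }


# Toy profile: four distinct targets, one of which is predicted; the rest
# of the predictions fall back to a single dominant token.
summary = profile_summary(["r", "l", "g", "w"], ["r", "n", "n", "n"])
print(summary)
```

Under these assumed definitions the toy profile has coverage 0.25 and predicted diversity 2, so it is not collapsed even though most predictions still share one token.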

v0.105.0 adds corpus-only retrieval memory as a separate evidence rail before weight consolidation. The diagnostic screen in runs/transformer-answer-v0.105.0-retrieval-memory-owner-paraphrase-frontier-profile-scale-step1-dim4-context80/ writes retrieval_memory_report.json, builds 497 memory cards from the closed-world corpus, answers 219/219 eval probes exactly, and records uses_external_model: false, external_embeddings: false, pretrained_retriever: false, and updates_weights: false. Promotion still rejects the neural transformer on branch_diversity_target; v0.105.0 therefore proves immediate memory serving, not completed weight learning.
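The four flags recorded in retrieval_memory_report.json can be checked mechanically. The sketch below assumes the flags sit at the top level of the JSON object; the actual report schema is not shown in this document, so the layout, the sample payload, and the helper name are assumptions.

```python
import json

# Flag names as quoted in the v0.105.0 paragraph; each must be exactly false.
REQUIRED_FALSE_FLAGS = (
    "uses_external_model",
    "external_embeddings",
    "pretrained_retriever",
    "updates_weights",
)


def isolation_violations(report):
    """Return the flags that are missing or not exactly False."""
    return [flag for flag in REQUIRED_FALSE_FLAGS if report.get(flag) is not False]


# Hypothetical report payload with an assumed flat layout.
sample_report = json.loads(
    '{"uses_external_model": false, "external_embeddings": false,'
    ' "pretrained_retriever": false, "updates_weights": false}'
)
violations = isolation_violations(sample_report)
print(violations)
```

Treating a missing flag as a violation is a deliberate choice in this sketch: an absent field should not be read as evidence that the retrieval rail is isolated.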

v0.106.0 adds memory-guided consolidation planning. The diagnostic screen in runs/transformer-answer-v0.106.0-memory-guided-consolidation-owner-paraphrase-frontier-profile-scale-step1-dim4-context80/ writes memory_consolidation_plan.json, keeps retrieval at 219/219, records 9 memory-backed neural failed profiles, and ranks owner, paraphrases, glossary, admission_paraphrases, and admissions as the top consolidation priorities. The collapsed memory-backed profiles are owner, paraphrases, and glossary; neural promotion still rejects on branch_diversity_target.

v0.107.0 adds gated memory-consolidation training. The diagnostic screen in runs/transformer-answer-v0.107.0-gated-memory-consolidation-owner-paraphrase-glossary-frontier-profile-scale-step1-dim4-context80/ consumes the v0.106.0 plan, targets owner, paraphrases, and glossary, records 26 memory-consolidation prioritized attempts with 8 acceptances and 18 rejections, and keeps retrieval at 219/219. Neural promotion still rejects the transformer on branch_diversity_target, so this is plan-guided weight-consolidation evidence rather than promoted model evidence.

v0.108.0 expands that consolidation window. The diagnostic screen in runs/transformer-answer-v0.108.0-expanded-memory-consolidation-owner-paraphrase-heldout-qa-glossary-frontier-profile-scale-step1-dim4-context80/ consumes the v0.107.0 plan, targets owner, paraphrases, heldout, qa, and glossary, maps target-only profiles back to admitted source labels, and keeps retrieval at 219/219. Branch-diversity still blocks promotion, which means the next repair needs direct missing first-token diversity pressure.

v0.109.0 adds that direct missing first-token pressure. The diagnostic screen in runs/transformer-answer-v0.109.0-missing-first-token-memory-consolidation-owner-paraphrase-heldout-qa-glossary-frontier-profile-scale-step1-dim4-context80/ consumes the v0.108.0 plan, extracts missing first-token target maps for owner, paraphrases, heldout, qa, and glossary, and records 8 missing-token candidates, 22 attempts, 1 accepted guarded coverage-gain update, 21 rejections, and 7 fallback acceptances. Retrieval remains exact at 219/219; branch-diversity still blocks promotion, and the next plan narrows the collapsed memory-backed profiles to owner, paraphrases, and learning.

v0.110.0 makes that narrowed plan the explicit training contract. The diagnostic screen in runs/transformer-answer-v0.110.0-remaining-collapsed-missing-first-token-memory-consolidation-owner-paraphrase-learning-frontier-profile-scale-step1-dim4-context80/ consumes the v0.109.0 plan, requires source-plan collapsed_memory_backed_profiles, targets only owner, paraphrases, and learning, and records no unconsumed collapsed targets. Retrieval remains exact at 219/219; the missing-token phase records 6 candidates, 16 attempts, 1 accepted guarded coverage-gain update, 15 rejections, and 5 fallback acceptances. Branch-diversity still blocks promotion.

v0.111.0 makes that pressure profile-specific. The diagnostic screen in runs/transformer-answer-v0.111.0-profile-specific-missing-first-token-memory-consolidation-owner-paraphrase-learning-frontier-profile-scale-step1-dim4-context80/ consumes the v0.110.0 plan, keeps targets owner, paraphrases, and learning, and records the target map learning -> learning, owner -> owner/paraphrases, and color/place/training_data -> paraphrases. Retrieval remains exact at 219/219; memory-prioritized consolidation records 16 attempts with 6 acceptances and 10 rejections, and the missing-token phase records 6 candidates, 18 attempts, 0 direct missing-token acceptances, 18 rejections, and 6 fallbacks. The guard records 1 accepted profile-specific update shape, but branch-diversity still blocks promotion.
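One way to read the v0.111.0 target map is as a mapping from admitted source labels to the target profiles they feed. The sketch below encodes that reading and inverts it, so that each target profile lists its sources. Interpreting "owner -> owner/paraphrases" as one source feeding two targets, and "color/place/training_data -> paraphrases" as three sources feeding one target, is an assumption, as are the variable and function names.

```python
from collections import defaultdict

# Assumed reading of the target map quoted in the v0.111.0 paragraph.
source_to_targets = {
    "learning": ["learning"],
    "owner": ["owner", "paraphrases"],
    "color": ["paraphrases"],
    "place": ["paraphrases"],
    "training_data": ["paraphrases"],
}


def invert_target_map(mapping):
    """Group source labels under each target profile they feed."""
    inverted = defaultdict(list)
    for source, targets in mapping.items():
        for target in targets:
            inverted[target].append(source)
    return {target: sorted(sources) for target, sources in inverted.items()}


target_to_sources = invert_target_map(source_to_targets)
print(target_to_sources)
```

Under this reading, the inverted map covers exactly the three targets named in the paragraph (owner, paraphrases, and learning), with paraphrases drawing on the most source labels.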

v0.112.0 adds branch-diversity root-cause diagnostics before another repair objective. The diagnostic screen in runs/transformer-answer-v0.112.0-branch-diversity-root-cause-profile-specific-memory-consolidation-step1-dim4-context80/ consumes the v0.111.0 plan, targets owner, paraphrases, and glossary, keeps retrieval exact at 219/219, records 24 profile-specific missing-token attempts with 0 direct acceptances and 8 fallbacks, and classifies the final branch-diversity failure as a critical target_routing_gap. The root-cause report records 9/9 failed profiles, 3 collapsed profiles, 1 zero-coverage profile, 6 buried-target profiles, and reused dominant tokens "n" and "a". Branch-diversity still blocks promotion.

v0.113.0 adds branch routing audit diagnostics to the same branch-only screen surface. The diagnostic screen in runs/transformer-answer-v0.113.0-branch-routing-audit-profile-specific-memory-consolidation-step1-dim4-context80/ consumes the v0.112.0 plan, targets owner, paraphrases, and learning, keeps retrieval exact at 219/219, records 18 profile-specific missing-token attempts with 0 direct acceptances and 6 fallbacks, and keeps branch-diversity as the blocker. The routing audit reports high output-bias escape risk ("n" bias rank 2), low representation separation across 9/9 multi-target profiles, and a glossary target-imbalance hotspot.

v0.114.0 adds logit-prior and centroid-separation instrumentation to the same screen surface. The diagnostic screen in runs/transformer-answer-v0.114.0-logit-prior-representation-instrumentation-profile-specific-memory-consolidation-step1-dim4-context80/ consumes the v0.113.0 plan, targets owner, paraphrases, and glossary, keeps retrieval exact at 219/219, records 24 profile-specific missing-token attempts with 0 direct acceptances and 8 fallbacks, and keeps branch-diversity as the blocker. The new logit-prior profiles report hidden-projection pressure across 9/9 multi-target profiles, while centroid separation remains poor.

v0.115.0 adds a bias-frozen hidden-projection margin candidate. The candidate screen in runs/transformer-answer-v0.115.0-hidden-projection-margin-candidate-step1-dim4-context80/ introduces branch-hidden-projection-margin-unlikelihood and tests one direct-answer step that compares target-token hidden * output_weight contributions directly. It lowers the average collapsed-token hidden advantage from about 0.0842 to about 0.0736, but promotion remains blocked before quality metrics are evaluated: 10/11 constraints pass, branch_diversity_target fails, 9/9 multi-target profiles still collapse to "n", and 2 profiles still have zero target-token coverage.
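The hidden * output_weight comparison can be illustrated with a small sketch. It assumes an output layer of the form logit = hidden . weight_row + bias, so that freezing the bias leaves the dot product as the only trainable part of each token's logit. It further assumes that "collapsed-token hidden advantage" means the collapsed token's contribution minus the target token's contribution. The hinge penalty is a generic margin form chosen for illustration; it is not the project's branch-hidden-projection-margin-unlikelihood implementation, and all names and numbers below are hypothetical.

```python
def hidden_contribution(hidden, output_weight, token_id):
    """Dot product of the hidden state with one token's output weight row."""
    return sum(h * w for h, w in zip(hidden, output_weight[token_id]))


def collapsed_token_advantage(hidden, output_weight, collapsed_id, target_id):
    """Assumed definition: collapsed-token contribution minus target-token contribution."""
    return (hidden_contribution(hidden, output_weight, collapsed_id)
            - hidden_contribution(hidden, output_weight, target_id))


def margin_penalty(advantage, margin=0.0):
    """Generic hinge: positive while the collapsed token leads by more than -margin."""
    return max(0.0, advantage + margin)


# Toy values: one hidden state and one output weight row per token id.
hidden = [0.5, -0.25, 0.0, 1.0]
output_weight = {
    0: [0.5, 0.0, 0.25, 0.25],   # hypothetical collapsed token
    1: [0.25, 1.0, 0.0, 0.5],    # hypothetical target token
}

advantage = collapsed_token_advantage(hidden, output_weight, collapsed_id=0, target_id=1)
penalty = margin_penalty(advantage, margin=0.125)
print(advantage, penalty)
```

Comparing contributions with the bias excluded isolates what the hidden representation itself prefers, which is the quantity a bias-frozen candidate can actually change.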

The v0.42 self-improvement run passed:

  • direct admission-probe audit
  • admission-paraphrase audit
  • glossary-probe audit
  • exact eval audit
  • promotion gate
  • forgetting audit against v0.41
  • protected prompt leakage audit
  • responder exact evals
  • learned answer classifier exact evals
  • generative answer decoder exact evals
  • rule-based self-diagnosis with no external model

Admission probes now pass 48/48 direct records and 84/84 paraphrase records. Glossary probes now pass 38/38 records. The passing attempt is archived at runs/self-improve-v0.42/attempts/attempt-001/ before the top-level latest report is updated.

v0.23 added attempt archives. A deliberately undertrained attempt failed at runs/self-improve-v0.23/attempts/attempt-001/, and the repaired passing attempt remains at runs/self-improve-v0.23/attempts/attempt-002/.