Current Evidence
The current promoted run is runs/self-improve-v0.42/.
Older milestone evidence that is no longer part of the current-state table is preserved in the Historical evidence archive.
| Signal | Current value |
|---|---|
| Product name | QuarkLM |
| Package / repository slug | quark-lm |
| Docs host | Read the Docs |
| Marketing host | GitHub Pages |
| RC spec | RC_SPEC.md |
| RC gap audit | RC_GAP_AUDIT.md |
| RC checklist | RC_CHECKLIST.md |
| Recommended RC track | Research Prototype RC |
| Research Prototype RC status | near: the closed-world self-improvement system is reproducible, auditable, documented, and honest about unpromoted transformer evidence. |
| Language Model RC status | not ready: the from-scratch transformer still fails branch_diversity_target after v0.115. |
| Next model bundle | Profile-Balanced Routing Repair with representation-separation acceptance checks. |
| Research grounding | sites/docs/docs/learn/research-grounding.md |
| Research reviewed | 2026-06-15 |
| Research pass | v0.71-v0.115 implementation evidence; v0.115 adds a bias-frozen hidden-projection margin candidate with exact retrieval memory and rejected neural promotion. |
| Research decision | QuarkLM should model self-improvement as a closed-loop lifecycle with ledgered admission, verified selection, auditable weight optimization, separate inference rails, and promotion gates that can reject regressions. |
| Research next step | Use the v0.115 hidden-projection candidate evidence to design a broader guarded routing repair that can lift coverage without collapsing profiles. |
| Open-source mechanics audit | MECHANICS_AUDIT.md and sites/docs/docs/learn/open-source-mechanics-audit.md added as the v0.66 deeper comparison of open-source LLM, tokenizer, continual-learning, transparency, and self-improvement mechanics. |
| Mechanics audit decision | The next bottleneck is trainer mechanics, not another global branch-loss term: direct-answer replay should be profile-aware, artifacted, coverage-constrained, and tested for profile isolation before the next full-stack repair run. |
| Mechanics audit next step | Improve target-token diversity for remaining memory-backed failures using v0.115 hidden-projection candidate evidence; do not count retrieval as weight learning. |
| Forward research plan | FORWARD_RESEARCH_PLAN.md and sites/docs/docs/learn/forward-research-plan.md added as the v0.69 cross-referenced implementation strategy. |
| Forward plan decision | Pause direct-answer objective churn until QuarkLM has the self-improvement operating system needed to decide which training changes are legitimate: experiment registry, replay extraction, corpus governance, candidate quarantine, closed-world verifier checks, recipe boundaries, and constraint-first promotion gates. |
| Forward plan next step | After v0.115, expand guarded hidden-projection repair only if it improves target coverage under the branch-diversity gate. |
| Deep research review | DEEP_RESEARCH_REVIEW.md and sites/docs/docs/learn/deep-research-review.md added as the v0.70 deeper cross-referenced research and implementation-gap review. |
| Deep review decision | No larger transformer repair screen should run until experiment intent, corpus plans, replay plans, verifier checks, recipes, and constraint-first promotion are explicit artifacts. |
| Deep review next step | After v0.115, tie the target-routing gap to a guarded hidden-projection and prompt-representation repair surface that can survive promotion constraints. |
| Research implementation map | RESEARCH_IMPLEMENTATION_MAP.md and sites/docs/docs/learn/research-implementation-map.md added as the v0.74 source-to-gap-to-version implementation map. |
| Implementation map decision | Deep cross-referenced research and open-source mechanics review are now a required implementation control: each next mechanic should cite its research pattern, name the closed-world boundary it protects, and produce acceptance evidence. |
| Implementation map next step | The mechanics from candidate quarantine through hidden-projection margin repair are implemented and screened through v0.115.0; broader guarded routing repair is next. |
| Experiment registry | src/closed_world_lm/experiment_registry.py and sites/docs/docs/operate/experiment-registry.md added as the v0.71 run-intent implementation. |
| Experiment registry decision | Self-improvement and transformer answer-training runs now declare hypothesis, allowed data, planned artifacts, recipe id, acceptance gates, failure criteria, notes, and final decision before their outputs are trusted as evidence. |
| Experiment registry next step | Use the registry as the required evidence wrapper for replay, corpus, verifier, recipe, and promotion-gate mechanics. |
| Replay planning | src/closed_world_lm/replay_plan.py added as the v0.72 standalone replay-planning module. |
| Replay planning decision | Transformer training still uses the existing profile-aware replay behavior, but replay record normalization, profile grouping, coverage floors, missing-target summaries, and JSON-safe plan shape now live outside the transformer monolith. |
| Replay planning next step | Use standalone replay planning as input to corpus hygiene, candidate quarantine, verifier, recipe, and promotion-gate reports. |
| Corpus hygiene | src/closed_world_lm/corpus_hygiene.py and sites/docs/docs/operate/corpus-hygiene.md added as the v0.73 corpus hygiene and training-plan artifact implementation. |
| Corpus hygiene decision | Self-improvement and transformer answer-training runs now write corpus_hygiene.json and training_plan.json with source mixtures, duplicate checks, train/eval prompt overlap, candidate ratios, rare-profile coverage, allowed data sources, planned artifacts, and replay-plan summaries when available. |
| Corpus hygiene next step | Use candidate ratios, quarantine summaries, overlap evidence, verifier summaries, recipe summaries, transformer responsibility surfaces, checkpoint metadata surfaces, and eval surfaces as inputs to objective-repair work. |
| Candidate quarantine | src/closed_world_lm/candidate_quarantine.py and sites/docs/docs/operate/candidate-quarantine.md added as the v0.75 candidate lifecycle implementation. |
| Candidate quarantine decision | Self-improvement and transformer answer-training runs now write candidate_quarantine.json, and training_plan.json records that candidate records are not training data until admitted into the ledgered corpus and converted into curriculum lessons. |
| Candidate quarantine next step | Use the candidate quarantine manifest as input to deterministic verifier checks, recipe artifacts, and future promotion gates. |
| Closed-world verifier | src/closed_world_lm/closed_world_verifier.py and sites/docs/docs/operate/closed-world-verifier.md added as the v0.76 deterministic verifier implementation. |
| Verifier decision | Self-improvement and transformer answer-training runs now write closed_world_verifier.json, embed verifier summaries in training_plan.json, and require verifier approval as a run-intent gate without using an external model. |
| Verifier next step | Use verifier evidence as an input to recipe objects, constraint-first promotion gates, transformer responsibility surfaces, model/checkpoint metadata, eval surfaces, and objective-repair work. |
| Training recipe | src/closed_world_lm/training_recipe.py and sites/docs/docs/operate/training-recipes.md added as the v0.77 recipe and constraint-first promotion implementation. |
| Training recipe decision | Self-improvement and transformer answer-training runs now write training_recipe.json and constraint_first_promotion.json. Transformer decisions cannot promote from loss, NLL, rank, top-k, or exact quality evidence unless closed-world constraints pass first. |
| Training recipe next step | Use recipe and constraint-first artifacts as the surfaces for transformer objective-repair work. |
| Transformer responsibility | src/closed_world_lm/transformer_experiment.py, src/closed_world_lm/transformer_training.py, src/closed_world_lm/transformer_objectives.py, and sites/docs/docs/build/transformer-responsibilities.md added as the v0.78 transformer responsibility implementation. |
| Transformer responsibility decision | Transformer answer-training now keeps artifact contracts, experiment intent, recipe creation, promotion decisions, JSONL snapshot writing, shuffled training cursors, loss averaging, and the direct-answer objective catalog behind narrow tested surfaces while preserving the public CLI. |
| Transformer responsibility next step | Use the v0.78 responsibility surfaces with the v0.115.0 hidden-projection candidate evidence before broader routing repair. |
| Transformer model surface | src/closed_world_lm/transformer_model.py and tests/test_transformer_model.py added as the v0.79 transformer model/config and checkpoint metadata implementation. |
| Transformer model decision | Transformer config, optimizer config, generation config, validation, checkpoint architecture, checkpoint format, tokenizer identity, closed-world dataset metadata, arg-to-config adapters, and run metadata now live outside transformer_char_model.py while remaining re-exported for compatibility. |
| Transformer model next step | Use model/checkpoint metadata surfaces with the v0.115.0 hidden-projection candidate evidence before broader routing repair. |
| Transformer eval surface | src/closed_world_lm/transformer_checkpoint.py, src/closed_world_lm/transformer_eval.py, tests/test_transformer_checkpoint.py, and tests/test_transformer_eval.py added as the v0.80 transformer eval/checkpoint-load implementation. |
| Transformer eval decision | Checkpoint payload loading and identity validation, checkpoint summaries, probe loading, candidate collection, generic transformer scoring, eval report assembly, samples JSONL writing, and eval JSON writing now live outside transformer_char_model.py while preserving CLI behavior and artifact shapes. |
| Transformer eval next step | v0.115.0 uses eval and promotion surfaces to screen a bias-frozen hidden-projection margin candidate; branch_diversity_target still blocks promotion. |
| Latest repository version | v0.115.0 |
| Latest version summary | bias-frozen hidden-projection margin candidate evidence |
| Current version | v0.42 |
| Admitted facts | 12 |
| Direct admission probes | 48/48 |
| Admission paraphrase probes | 84/84 |
| Glossary probes | 38/38 |
| QA exact | 8/8 |
| Admissions exact | 48/48 |
| Admission paraphrases exact | 84/84 |
| Glossary exact | 38/38 |
| Self exact | 7/7 |
| Learning exact | 4/4 |
| Forgetting audit | passed |
| Prompt leakage audit | passed |
| Exact eval audit | passed |
| Promotion gate | passed |
| Self-diagnosis | passed |
| Self-diagnosis external model | false |
| Self-diagnosis recommendation | promote_or_expand_corpus |
| Attempt archive | enabled |
| Transformer run | runs/transformer-answer-v0.42-branch-repair-contrast50-dim8-context32/ |
| Transformer validation NLL | answer target NLL 3.5850 -> 2.4129 |
| Transformer exact | 0/219 -> 0/219 direct greedy |
| Transformer candidate accuracy | 15/219 -> 37/219 eval-scoped |
| Direct transformer exact | 0/219 -> 0/219 direct greedy |
| Direct transformer loss | 3.4278 -> 2.2708 |
| Direct transformer mode | periodic-branch-repair-contrast-unlikelihood |
| Direct transformer failure pattern | short wrong ' te.' greedy completion after wider sparse branch contrast |
| Latest transformer screen | runs/transformer-answer-v0.115.0-hidden-projection-margin-candidate-step1-dim4-context80/ |
| Latest screen direct loss | 4.9050 one-step hidden-projection margin candidate screen |
| Latest screen direct exact | branch-only screen; direct greedy eval skipped |
| Latest screen post-direct candidate snapshot skipped | true |
| Latest retrieval memory report | runs/transformer-answer-v0.115.0-hidden-projection-margin-candidate-step1-dim4-context80/retrieval_memory_report.json |
| Retrieval memory artifact | retrieval_memory_report.json is now a transformer answer-training artifact declared in experiment intent and training plans. |
| Retrieval memory summary | 497 corpus-only memory cards; 219/219 exact retrieval evals; no external model, embeddings, pretrained retriever, or weight updates. |
| Retrieval memory status | memory-first evidence remains exact in v0.115.0 and is consumed only as source-plan evidence, not neural promotion. |
| Latest memory consolidation plan | runs/transformer-answer-v0.115.0-hidden-projection-margin-candidate-step1-dim4-context80/memory_consolidation_plan.json |
| Memory consolidation summary | v0.115 keeps retrieval exact, screens hidden-projection margin repair with output bias frozen, and still rejects neural promotion on branch_diversity_target. |
| Memory consolidation status | logit-prior representation evidence: 8 missing-token candidates, 24 attempts, 0 direct missing-token acceptances, 24 rejections, 8 fallbacks, 1 accepted profile-specific update shape, no external model, embeddings, or pretrained retriever; branch_diversity_target still blocks promotion with critical target_routing_gap, high output-bias escape risk, low representation separation across 9/9 profiles, and hidden-projection pressure across 9/9 multi-target profiles. |
| Latest transformer diagnostic run | runs/transformer-answer-v0.43-branch-profile-smoke-dim4-context16/ |
| Latest transformer diagnostic | direct-answer branch profiles from model logits |
| Latest diagnostic QA branch accuracy | 1/8 -> 1/8 |
| Latest diagnostic dominant prediction | all 'o' -> all 'y' |
| Latest transformer repair run | runs/transformer-answer-v0.43-periodic-branch-batch-smoke-dim4-context16/ |
| Latest transformer repair mode | periodic-branch-batch-contrast-unlikelihood |
| Latest transformer repair status | rejected: loss improved but prompt-independent branch collapse worsened |
| Latest representation screen | runs/transformer-answer-v0.43-prompt-attention-branch-repair-smoke-dim4-context16/ |
| Latest representation option | --use-prompt-attention-summary |
| Latest representation status | rejected: prompt-attention summary projection moved and lowered loss, but QA branch collapse still worsened |
| Latest branch-context diagnostic | runs/transformer-answer-v0.43-branch-context-coverage-smoke-dim4-context80/ |
| Branch context 16 QA | 0/8 semantic covered; 4 ambiguous QA branch contexts |
| Branch context 32 QA | 0/8 semantic covered; 0 ambiguous QA branch contexts |
| Branch context 80 all evals | 219/219 semantic covered; 0 ambiguous branch contexts |
| Latest branch-context gate run | runs/transformer-answer-v0.43-branch-context-gate-smoke-dim4-context80/ |
| Branch context gate at 16 | required gate failed; requested 5 direct steps, ran 0 |
| Branch context gate at 80 | required gate passed; requested 1 direct step, ran 1 |
| Latest branch-only screen | runs/transformer-answer-v0.43-branch-context-gated-branchonly-smoke-dim4-context80/ |
| Branch-only gate | passed; requested 5 direct steps, ran 5 |
| Branch-only eval skipping | direct greedy evals skipped in JSONL snapshots; branch profiles and branch-context gate retained |
| Latest branch-only repair screen | runs/transformer-answer-v0.43-branchonly-periodic-repair-contrast50-dim8-context80/ |
| Branch-only repair status | rejected screen: gate passed and 100/100 direct steps ran, but QA branch prediction collapsed from all ' ' (space) to all 'a' |
| Latest branch-only batch screen | runs/transformer-answer-v0.43-branchonly-branch-batch-dim8-context80/ |
| Branch-only batch status | rejected screen: gate passed and 50/50 direct steps ran, but QA branch prediction still collapsed to all 'a' |
| Latest branch-diversity target run | runs/transformer-answer-v0.43-branch-diversity-target-smoke-dim4-context80/ |
| Branch-diversity target | direct-answer snapshots include branch_diversity_target over multi-target eval profiles |
| Branch-diversity smoke | context gate passed; diversity target failed 0/9 multi-target profiles; QA target_unique 8, predicted_unique 1, dominant 'r' rate 1.0 |
| Latest branch-diversity training run | runs/transformer-answer-v0.43-branch-diversity-train-smoke-dim4-context80/ |
| Branch-diversity training mode | branch-diversity-unlikelihood |
| Branch-diversity training status | rejected smoke: gate passed and 10/10 direct steps ran, but diversity target still failed 0/9 multi-target profiles |
| Latest branch-diversity freeze-bias run | runs/transformer-answer-v0.43-branch-diversity-freezebias-smoke-dim4-context80/ |
| Branch-diversity freeze-bias mode | branch-diversity-unlikelihood with --direct-answer-freeze-output-bias |
| Branch-diversity freeze-bias status | rejected stabilizer: gate passed and 50/50 direct steps ran with output bias frozen, but diversity target still failed 0/9 multi-target profiles |
| Latest branch-target softmax run | runs/transformer-answer-v0.43-branch-target-softmax-freezebias-smoke-dim4-context80/ |
| Branch-target softmax mode | branch-target-softmax-unlikelihood with --direct-answer-freeze-output-bias |
| Branch-target softmax status | rejected target-set screen: gate passed and 50/50 direct steps ran, composite train loss moved 5.6671 -> 5.5820, but diversity target still failed 0/9 multi-target profiles |
| Latest branch restore run | runs/transformer-answer-v0.43-branch-target-softmax-restorebest-smoke-dim4-context80/ |
| Branch restore mode | branch-target-softmax-unlikelihood with --direct-answer-restore-best-branch-snapshot |
| Branch restore status | rejected guardrail: restored best aggregate branch snapshot from step 40 after 50/50 direct steps, but diversity target still failed 0/9 multi-target profiles |
| Latest prompt-prefix projection run | runs/transformer-answer-v0.43-prompt-prefix-target-softmax-restorebest-smoke-dim4-context80/ |
| Prompt-prefix projection option | --use-prompt-prefix-projection |
| Prompt-prefix projection status | rejected representation screen: all 20 prompt-prefix projection parameters moved and loss improved 5.6649 -> 5.5679, but diversity target still failed 0/9 multi-target profiles |
| Latest prompt-position projection run | runs/transformer-answer-v0.43-prompt-position-target-softmax-restorebest-smoke-dim4-context80/ |
| Prompt-position projection option | --use-prompt-position-projection |
| Prompt-position projection status | rejected representation screen: 1108/1284 prompt-position projection parameters moved and loss improved 5.6649 -> 5.5679, but diversity target still failed 0/9 multi-target profiles |
| Latest branch-target margin run | runs/transformer-answer-v0.43-branch-target-margin-prompt-position-smoke-dim4-context80/ |
| Branch-target margin mode | branch-target-margin-unlikelihood with --use-prompt-position-projection |
| Branch-target margin status | rejected target-margin screen: gate passed and 50/50 direct steps ran, train loss moved 4.8973 -> 4.7784, but diversity target still failed 0/9 multi-target profiles |
| Latest branch-representation contrast run | runs/transformer-answer-v0.43-branch-representation-contrast50-prompt-position-smoke-dim4-context80/ |
| Branch-representation contrast mode | branch-representation-contrast-unlikelihood with --direct-answer-contrast-weight 50.0 |
| Branch-representation contrast status | rejected representation-contrast screen: direct snapshots now record hidden-distance profiles, but high-weight contrast still failed diversity target 0/9 multi-target profiles |
| Latest branch-representation capacity run | runs/transformer-answer-v0.43-branch-representation-contrast50-prompt-position-smoke-dim8-context80-steps40/ |
| Branch-representation capacity mode | dim8 branch-representation-contrast-unlikelihood with --direct-answer-contrast-weight 50.0 |
| Branch-representation capacity status | rejected capacity screen: 40/40 direct steps ran after the 50-step dim8 screen proved too slow; hidden distance increased, but diversity target still failed 0/9 multi-target profiles |
| Latest prompt-position scale run | runs/transformer-answer-v0.43-prompt-position-scale32-repcontrast50-smoke-dim4-context80/ |
| Prompt-position scale mode | branch-representation-contrast-unlikelihood with --prompt-position-projection-scale 32.0 |
| Prompt-position scale status | rejected prompt-signal scale screen: 50/50 direct steps ran, 1108/1284 prompt-position projection parameters moved, hidden distance increased, but diversity target still failed 0/9 multi-target profiles |
| Transformer structure audit | STRUCTURE_AUDIT.md now gates the next transformer repair: study open-source model/trainer/tokenizer/checkpoint structure without importing external weights, tokenizers, embeddings, datasets, or training text |
| Transformer structure decision | implemented and screened an opt-in pre-layer-norm transformer block path with final normalization; target-balanced branch sampling was rejected, so the next target is prompt-to-answer binding for QA and heldout |
| Latest pre-layer-norm run | runs/transformer-answer-v0.44-prelayernorm-repcontrast50-prompt-position-smoke-dim4-context80/ |
| Pre-layer-norm mode | branch-representation-contrast-unlikelihood with --use-pre-layer-norm and --use-prompt-position-projection |
| Pre-layer-norm status | partial structural evidence: 50/50 direct steps ran, 1108/1284 prompt-position parameters and all 8 final-norm parameters moved, but diversity target still failed 0/9 multi-target profiles |
| Latest target-balanced run | runs/transformer-answer-v0.44-target-balanced-prelayernorm-repcontrast50-prompt-position-smoke-dim4-context80/ |
| Target-balanced mode | branch-balanced-representation-contrast-unlikelihood with --use-pre-layer-norm and target-bucket branch batches |
| Target-balanced status | rejected sampler evidence: 50/50 direct steps ran, but best-snapshot restoration returned to step 0 and 9/9 multi-target profiles collapsed to global 'n' |
| Latest branch-rank diagnostic run | runs/transformer-answer-v0.45-branch-rank-diagnostic-smoke-dim4-context80/ |
| Branch-rank diagnostic | direct-answer branch profiles include average target rank, top-3/top-5 target coverage, and failed-record top predictions |
| Branch-rank QA | final QA collapsed to all 'n' with average target rank 14.25 and top-3/top-5 target coverage 0.125 |
| Branch-rank heldout | final heldout collapsed to all 'n' with average target rank 14.25 and top-3/top-5 target coverage 0.125 |
| Branch-rank status | diagnostic evidence: correct branch targets are usually buried, so the next repair should improve prompt-to-answer output binding |
| Latest output-binding run | runs/transformer-answer-v0.46-output-binding-rankscore-smoke-dim4-context80/ |
| Output-binding mode | branch-output-binding-unlikelihood with rank-aware best-snapshot scoring and frozen output bias |
| Output-binding QA | QA average target rank improved 17.375 -> 14.125 and top-5 coverage reached 0.25, but target-token coverage stayed at 0.0 and top-3 coverage ended at 0.0 |
| Output-binding heldout | heldout average target rank improved 17.25 -> 14.375 and top-5 coverage reached 0.25, but target-token coverage stayed at 0.0 and top-3 coverage ended at 0.0 |
| Output-binding status | rejected repair evidence: output binding cracked wrong-token diversity but still collapsed QA and heldout to wrong branch tokens |
| Latest rank-margin run | runs/transformer-answer-v0.47-rank-margin-steps50-smoke-dim4-context80/ |
| Rank-margin mode | branch-rank-margin-unlikelihood against top wrong branch tokens with frozen output bias |
| Rank-margin QA | QA average target rank improved 17.375 -> 9.0, target-token coverage reached 0.125, top-3 coverage reached 0.25, and top-5 coverage reached 0.5 |
| Rank-margin heldout | heldout average target rank improved 17.25 -> 9.0, target-token coverage reached 0.125, top-3 coverage reached 0.25, and top-5 coverage reached 0.375 |
| Rank-margin status | strongest rank-lift evidence so far, but rejected for promotion because predicted diversity stayed 1/8 and branches still collapsed to wrong 'n' |
| Latest balanced rank-margin run | runs/transformer-answer-v0.48-balanced-rank-margin-smoke-dim4-context80/ |
| Balanced rank-margin mode | branch-balanced-rank-margin-unlikelihood with target-balanced branch batches and top wrong-token margins |
| Balanced rank-margin QA | QA predicted diversity reached 2/8, target-token coverage stayed 0.125, average target rank reached 9.375, top-3 reached 0.375, and top-5 reached 0.5 |
| Balanced rank-margin heldout | heldout predicted diversity reached 2/8, target-token coverage stayed 0.125, average target rank reached 9.625, top-3 reached 0.25, and top-5 reached 0.5 |
| Balanced rank-margin status | rejected evidence: target-balanced rank margin improves wrong-token diversity and top-3/top-5 coverage, but top-1 branch choices are still wrong |
| Latest top-one rank-margin run | runs/transformer-answer-v0.49-balanced-rank-margin-top1-smoke-dim4-context80/ |
| Top-one rank-margin mode | branch-balanced-rank-margin-unlikelihood with one top wrong token |
| Top-one rank-margin QA | QA target-token coverage stayed 0.125, but average target rank regressed to 12.5, top-3 fell to 0.125, and top-5 fell to 0.25 |
| Top-one rank-margin heldout | heldout target-token coverage stayed 0.125, but average target rank regressed to 12.375, top-3 fell to 0.125, and top-5 fell to 0.25 |
| Top-one rank-margin status | rejected evidence: concentrating on one current top wrong token regressed rank/top-k evidence instead of converting targets into top-1 choices |
| Latest top-k softmax run | runs/transformer-answer-v0.50-balanced-topk-softmax-w5-smoke-dim4-context80/ |
| Top-k softmax mode | branch-balanced-topk-softmax-unlikelihood with target-balanced branch batches and restricted target-vs-top-wrong-token softmax |
| Top-k softmax QA | QA target-token coverage stayed 0.125, average target rank improved to 8.75, top-3 reached 0.375, and top-5 reached 0.5 |
| Top-k softmax heldout | heldout target-token coverage stayed 0.125, average target rank improved to 8.75, top-3 reached 0.375, and top-5 reached 0.5 |
| Top-k softmax status | rejected evidence: top-k softmax recovers rank/top-k evidence after v0.49 but still leaves QA and heldout collapsed to wrong 'u' top-1 branch choices |
| Latest foundation-stack run | runs/transformer-v0.51-foundation-stack-smoke/ |
| Foundation-stack mode | full mechanics stack: AdamW/SGD state, scheduling, accumulation, resume validation, multi-head/RMSNorm/gated/tied/rotary architecture options, generation traces, and replayable eval samples |
| Foundation-stack smoke | 2/2 language-model steps completed with AdamW, attention_heads 2, RMSNorm, gated MLP, tied output embeddings, rotary positions, and cache-aware generation metadata |
| Foundation-stack artifacts | quarklm-transformer-v2 checkpoint, optimizer_state.json, eval.json, and eval_samples.jsonl |
| Foundation-stack status | mechanics-readiness evidence only; not a promoted responder or direct-answer repair run |
| Latest full-stack top-k run | runs/transformer-answer-v0.52-fullstack-topk-softmax-smoke-dim4-context80/ |
| Full-stack top-k mode | branch-balanced-topk-softmax-unlikelihood under the full v0.51 stack |
| Full-stack top-k QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Full-stack top-k heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Full-stack top-k status | rejected evidence: full-stack baseline improves diversity, but unchanged top-k pressure collapses training to one wrong token; next repair should bind prompt contexts to target tokens |
| Latest bidirectional binding run | runs/transformer-answer-v0.53-fullstack-bidir-binding-smoke-dim4-context80/ |
| Bidirectional binding mode | branch-balanced-bidirectional-binding-unlikelihood under the full v0.51 stack |
| Bidirectional binding unit test | focused transformer tests pass; context-ownership regression verifies target tokens gain probability mass on their own prompt contexts |
| Bidirectional binding QA | restored step 40; QA predicted diversity 2/8, dominant wrong 'a', target-token coverage 0.125, average target rank 7.875, top-3 0.25, top-5 0.5 |
| Bidirectional binding heldout | restored step 40; heldout predicted diversity 2/8, dominant wrong 'a', target-token coverage 0.125, average target rank 9.0, top-3 0.25, top-5 0.375 |
| Bidirectional binding history | training step 50 briefly reached QA target-token coverage 0.25 with average target rank 8.375 before best-snapshot restore selected the rank-focused step 40 checkpoint |
| Bidirectional binding status | partial progress, rejected for promotion: bidirectional binding improves rank pressure under the full stack, but target coverage is not preserved and diversity target still fails 0/9 multi-target profiles |
| Latest coverage binding run | runs/transformer-answer-v0.54-fullstack-coverage-binding-smoke-dim4-context80/ |
| Coverage binding mode | branch-balanced-coverage-binding-unlikelihood under the full v0.51 stack |
| Coverage binding unit test | focused transformer tests pass; hard-wrong-token coverage regression verifies target-set mass and exact target probability improve in the restricted candidate set |
| Coverage binding QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Coverage binding heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Coverage binding history | training step 50 improved QA average target rank to 8.125, but target-token coverage collapsed to 0.0 with one wrong 'a' top-1 branch token |
| Coverage binding status | rejected evidence: best-snapshot scoring restored the baseline because bundled hard-negative coverage binding traded away target coverage for rank; next repair should preserve target-set coverage before exact-target sharpening |
| Latest target-set coverage run | runs/transformer-answer-v0.55-fullstack-target-set-coverage-smoke-dim4-context80/ |
| Target-set coverage mode | branch-balanced-target-set-coverage-unlikelihood under the full v0.51 stack with positive target CE disabled |
| Target-set coverage unit test | focused transformer tests pass; target-set-only coverage regression verifies target-set mass improves against hard wrong tokens without requiring exact-target sharpening |
| Target-set coverage QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Target-set coverage heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Target-set coverage history | training step 50 improved QA average target rank to 10.0, but target-token coverage collapsed to 0.0 with one wrong 'a' top-1 branch token |
| Target-set coverage status | rejected evidence: batch-local target-set mass still trades away eval target-token coverage; next repair should add explicit anti-collapse pressure over predicted target tokens |
| Latest target-diversity run | runs/transformer-answer-v0.57-fullstack-target-diversity-smoke-dim4-context80/ |
| Target-diversity mode | branch-balanced-target-diversity-unlikelihood under the full v0.51 stack with positive target CE disabled |
| Target-diversity unit test | focused transformer tests pass; target-diversity regression verifies restricted target-set mass and weakest target-share balance improve in a small branch batch |
| Target-diversity QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Target-diversity heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Target-diversity history | training step 50 improved QA average target rank to 10.0, but target-token coverage collapsed to 0.0 with one wrong 'a' top-1 branch token |
| Target-diversity status | rejected evidence: batch-local target-share diversity still trades away eval target-token coverage; next repair should preserve eval-wide target coverage directly |
| Latest target-replay coverage run | runs/transformer-answer-v0.58-fullstack-target-replay-coverage-smoke-dim4-context80/ |
| Target-replay coverage mode | branch-balanced-target-replay-coverage-unlikelihood under the full v0.51 stack with positive target CE disabled |
| Target-replay coverage unit test | focused transformer tests pass; target-replay regression verifies replay target-set mass and weakest missing-target share improve when the sampled branch batch omits admitted pool targets |
| Target-replay coverage QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Target-replay coverage heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Target-replay coverage history | training step 40 improved QA average target rank to 6.875 and top-5 coverage to 0.5; by step 50, QA/heldout top-1 collapsed to wrong 'n' and target-token coverage had hit 0.0 during training |
| Target-replay coverage status | rejected evidence: pool-owned replay target coverage still trades away context-specific target ownership; next repair should bind replay pressure to branch contexts |
| Latest context-replay coverage run | runs/transformer-answer-v0.59-fullstack-context-replay-coverage-smoke-dim4-context80/ |
| Context-replay coverage mode | branch-balanced-context-replay-coverage-unlikelihood under the full v0.51 stack with positive target CE disabled |
| Context-replay coverage unit test | focused transformer tests pass; context-replay regression verifies replay target-set mass and weakest owned-target share improve on fixed replay contexts |
| Context-replay coverage QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Context-replay coverage heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Context-replay coverage history | training step 40 improved QA average target rank to 7.375, top-3 to 0.375, and top-5 to 0.5; by step 50, QA predicted diversity was only 2/8 and target-token coverage had hit 0.0 during training |
| Context-replay coverage status | rejected evidence: context-owned replay improves rank/top-k snapshots but still does not preserve target-token coverage; next repair should strengthen target-preserving ownership or scoring gates |
| Latest coverage-floor run | runs/transformer-answer-v0.60-fullstack-context-replay-coverage-floor-metadata-smoke-dim4-context80/ |
| Coverage-floor mode | profile-wise target-token coverage floor before branch snapshot rank/top-k scoring |
| Coverage-floor unit test | focused transformer tests pass; coverage-floor regression rejects a rank-lifted candidate when QA target-token coverage falls below baseline |
| Coverage-floor QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Coverage-floor heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Coverage-floor history | clean v0.60 JSONL wrote 7 direct-answer rows with branch_target_coverage_by_profile; step 40 improved QA rank/top-k but was ineligible because profile coverage regressed |
| Coverage-floor status | gate repair accepted, model behavior rejected: coverage floor prevents rank/top-k gains from promoting snapshots that regress target-token coverage |
| Latest coverage-anchor run | runs/transformer-answer-v0.61-fullstack-context-coverage-anchor-smoke-dim4-context80/ |
| Coverage-anchor mode | branch-balanced-context-coverage-anchor-unlikelihood under the full v0.51 stack with the v0.60 coverage floor |
| Coverage-anchor unit test | focused transformer tests pass; anchor regression verifies covered-target probability is protected better than the same replay training without anchors |
| Coverage-anchor QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Coverage-anchor heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Coverage-anchor history | training snapshots over-anchored covered wrong 'i'; QA/heldout predicted diversity fell to 1/8 and target-token coverage to 0.125, while average target rank rose above 21 |
| Coverage-anchor status | rejected evidence: global covered-target anchors protect one covered token but do not preserve coverage diversity; next repair should be target-balanced or profile-aware |
| Latest target-balanced anchor run | runs/transformer-answer-v0.62-fullstack-target-balanced-anchor-smoke-dim4-context80/ |
| Target-balanced anchor mode | branch-balanced-context-target-balanced-anchor-unlikelihood under the full v0.51 stack with the v0.60 coverage floor |
| Target-balanced anchor unit test | focused transformer tests pass; singleton covered-target regression verifies target-balanced anchors skip the v0.61 one-token over-anchor |
| Target-balanced anchor QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Target-balanced anchor heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Target-balanced anchor history | training avoided the v0.61 hard 'i' attractor, but QA/heldout target-token coverage still collapsed to 0.0 and trained snapshots remained ineligible |
| Target-balanced anchor status | rejected evidence: target-balanced anchors prevent singleton over-anchoring but do not preserve profile coverage; next repair should train from profile-level coverage deficits |
| Latest coverage-deficit run | runs/transformer-answer-v0.64-fullstack-coverage-deficit-smoke-dim4-context80/ |
| Coverage-deficit mode | branch-balanced-context-coverage-deficit-unlikelihood under the full v0.51 stack with the v0.60 coverage floor |
| Coverage-deficit unit test | focused transformer tests pass; deficit regression verifies missing replay targets gain restricted probability over the old context replay objective |
| Coverage-deficit QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Coverage-deficit heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Coverage-deficit history | training step 50 reached QA accuracy 1/8 and predicted diversity 4/8 with average target rank 10.0, but QA/heldout target-token coverage regressed to 0.125 and trained snapshots remained ineligible |
| Coverage-deficit status | rejected evidence: deficit pressure can crack the top-1 branch in training but still trades away coverage, so the next repair should combine deficit pressure with an explicit coverage-preserving constraint |
| Latest coverage-preserving deficit run | runs/transformer-answer-v0.65-fullstack-coverage-preserving-deficit-smoke-dim4-context80/ |
| Coverage-preserving deficit mode | branch-balanced-context-coverage-preserving-deficit-unlikelihood under the full v0.51 stack with the v0.60 coverage floor |
| Coverage-preserving deficit unit test | focused transformer tests pass; preserving-deficit regression verifies missing targets still lift while represented target tokens are protected better than deficit-only training |
| Coverage-preserving deficit QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Coverage-preserving deficit heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Coverage-preserving deficit history | training step 50 reached QA/heldout branch accuracy 1/8, QA average target rank 7.75, heldout average target rank 7.125, and top-5 coverage 0.5, but both profiles collapsed to predicted_unique 1/8 with target-token coverage 0.125 |
| Coverage-preserving deficit status | rejected evidence: predicted-target preservation over-preserved the represented 'i' token and improved rank while regressing coverage diversity; next repair should make the coverage constraint profile-aware instead of anchoring current predicted target tokens |
| Latest profile-aware replay run | runs/transformer-answer-v0.67-profile-aware-replay-plan-smoke-dim4-context80/ |
| Profile-aware replay mode | branch-balanced-context-profile-coverage-preserving-deficit-unlikelihood under the full v0.51 stack with the v0.60 coverage floor and v0.67 per-profile replay plan |
| Profile-aware replay unit test | focused transformer tests pass; profile replay plan verifies profile deficits are not hidden by global target coverage, and profiled replay records preserve source keys for shared branch targets |
| Profile-aware replay plan | direct_answer_replay_plan.json records 9144 branch/replay records across 21 profiles; example floors include qa:place 0.5 and qa:color 0.0 |
| Profile-aware replay gate | branch-context gate passed 219/219 semantic records with 0 ambiguous contexts, 0 context collisions, and 0 skipped records |
| Profile-aware replay smoke | one gated branch-only direct step ran, post-direct candidate snapshot was skipped by configuration, and the best branch snapshot restored from step 0 |
| Profile-aware replay status | mechanics-readiness evidence: replay plan and profile-aware objective surface are implemented, but the branch-diversity target still failed across all 9 multi-target profiles (0/9 passed), so there is no model-quality promotion |
| Latest profile-aware full-stack run | runs/transformer-answer-v0.68-fullstack-profile-aware-preserving-deficit-smoke-dim4-context80/ |
| Profile-aware full-stack mode | branch-balanced-context-profile-coverage-preserving-deficit-unlikelihood under the full v0.51 stack with the v0.60 coverage floor and v0.67 replay-plan artifact |
| Profile-aware full-stack plan | direct_answer_replay_plan.json records 9144 branch/replay records across 21 profiles; branch-context gate passed 219/219 semantic records |
| Profile-aware full-stack QA | restored step 0; QA predicted diversity 3/8, target-token coverage 0.25, average target rank 13.25, top-3 0.25, top-5 0.375 |
| Profile-aware full-stack heldout | restored step 0; heldout predicted diversity 3/8, target-token coverage 0.25, average target rank 13.375, top-3 0.25, top-5 0.375 |
| Profile-aware full-stack history | step 40 improved QA average target rank to 6.5 and top-5 to 0.625, with heldout rank 6.875 and top-5 0.5, but QA/heldout target-token coverage regressed to 0.125 and predicted diversity collapsed to 1/8 |
| Profile-aware full-stack status | rejected evidence: profile-aware preservation can improve rank under training, but best-snapshot scoring restored step 0 because trained snapshots still erase coverage and diversity |
| Profile target-share objective | src/closed_world_lm/transformer_char_model.py and src/closed_world_lm/transformer_objectives.py add branch-balanced-context-profile-target-share-preserving-deficit-unlikelihood as the v0.81 profile target-share objective implementation. |
| Profile target-share decision | Profile-aware replay can now add balanced owned target-share pressure across each profile's replay targets while retaining deficit focus, represented-target preservation, replay-plan artifacts, and recipe/promotion surfaces. |
| Profile target-share unit test | focused transformer tests pass; the minority replay target gains more share with balanced profile target-share pressure than under the previous profile-aware replay loss. |
| Latest profile target-share run | runs/transformer-answer-v0.82-fullstack-profile-target-share-smoke-dim4-context80/ |
| Profile target-share mode | branch-balanced-context-profile-target-share-preserving-deficit-unlikelihood |
| Profile target-share artifacts | experiment_intent.json, corpus_hygiene.json, training_plan.json, candidate_quarantine.json, closed_world_verifier.json, training_recipe.json, direct_answer_replay_plan.json, constraint_first_promotion.json, metrics JSON/JSONL, tokenizer, optimizer, lessons, and checkpoint are written. |
| Profile target-share gate | branch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; deterministic verifier passed; purity gates include external_embeddings false. |
| Profile target-share history | 50/50 direct steps completed with 7 clean JSONL rows. Step 40 lowered train loss to 19.7378 and improved QA average rank to 9.125, but QA and heldout collapsed to one 'c' prediction with target-token coverage 0.0. |
| Profile target-share status | rejected evidence: best-snapshot scoring restored step 0, preserving QA/heldout target-token coverage at 0.25, but branch_diversity_target still failed across all 9 multi-target profiles. |
| Latest prompt-ownership run | runs/transformer-answer-v0.83-fullstack-prompt-ownership-smoke-dim4-context80/ |
| Prompt-ownership mode | branch-balanced-context-profile-prompt-ownership-target-share-preserving-deficit-unlikelihood |
| Prompt-ownership unit test | focused transformer tests pass; prompt-specific ownership margins lift a context's own target above a sibling profile target more than the v0.82 profile target-share pressure. |
| Prompt-ownership gate | branch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; deterministic verifier passed; purity gates include external_embeddings false. |
| Prompt-ownership history | 50/50 direct steps completed with 7 clean JSONL rows. Step 50 improved QA average target rank to 8.625 and heldout average target rank to 8.5, but QA and heldout collapsed to one 'c' prediction with target-token coverage 0.0 during training. |
| Prompt-ownership status | rejected evidence: best-snapshot scoring restored step 0, preserving QA/heldout target-token coverage at 0.25, but branch_diversity_target still failed across all 9 multi-target profiles. |
| Latest baseline-anchor run | runs/transformer-answer-v0.84-fullstack-baseline-anchored-prompt-ownership-smoke-dim4-context80/ |
| Baseline-anchor mode | branch-balanced-context-profile-baseline-anchored-prompt-ownership-target-share-preserving-deficit-unlikelihood |
| Baseline-anchor unit test | focused transformer tests pass; profiled replay batches can use baseline prediction overrides, and anchored replay preservation protects a covered target better than following current prediction drift. |
| Baseline-anchor gate | branch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors were recorded and active; deterministic verifier passed; purity gates include external_embeddings false. |
| Baseline-anchor history | 50/50 direct steps completed with 7 clean JSONL rows. Step 40 improved QA average target rank to 8.0 and heldout rank to 8.375, but QA and heldout collapsed to one 'i' prediction with target-token coverage 0.125 during training. |
| Baseline-anchor status | rejected evidence: anchoring improves over the v0.83 zero-coverage collapse, but best-snapshot scoring restored step 0 because trained snapshots still fell below the 0.25 QA/heldout coverage floor and branch_diversity_target failed across all 9 multi-target profiles. |
| Latest baseline-floor gate run | runs/transformer-answer-v0.85-fullstack-baseline-floor-gated-prompt-ownership-smoke-dim4-context80/ |
| Baseline-floor gate mode | branch-balanced-context-profile-baseline-floor-gated-prompt-ownership-target-share-preserving-deficit-unlikelihood |
| Baseline-floor gate unit test | focused transformer tests pass; the new mode records baseline replay anchors, a baseline-floor update guard, and one-step accepted/rejected guard accounting. |
| Baseline-floor gate | branch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors were recorded and active; the baseline-floor update guard checked 50 attempted steps and rejected 50 unsafe updates; deterministic verifier passed; purity gates include external_embeddings false. |
| Baseline-floor gate history | 50/50 attempted direct steps completed with 7 clean JSONL rows. The guard preserved baseline/final QA and heldout target-token coverage at 0.25, predicted diversity at 3/8, QA average target rank at 13.25, and heldout average rank at 13.375, but accepted 0/50 attempted updates. |
| Baseline-floor gate status | rejected evidence: v0.85 prevents unsafe forgetting by refusing every update below the profile-wise baseline coverage floor, but branch_diversity_target still fails across all 9 multi-target profiles and no weight update is accepted. |
| Latest baseline-floor adaptive run | runs/transformer-answer-v0.86-fullstack-baseline-floor-adaptive-prompt-ownership-smoke-dim4-context80/ |
| Baseline-floor adaptive mode | branch-balanced-context-profile-baseline-floor-adaptive-prompt-ownership-target-share-preserving-deficit-unlikelihood |
| Baseline-floor adaptive unit test | focused transformer tests pass; the adaptive mode records baseline replay anchors, adaptive learning-rate scales, checked steps, attempted updates, accepted attempts, and rejected attempts. |
| Baseline-floor adaptive gate | branch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors were recorded and active; adaptive scales were 1.0, 0.25, 0.05, and 0.01; the guard checked 50 steps, attempted 200 updates, and rejected 200 unsafe attempts; deterministic verifier passed; purity gates include external_embeddings false. |
| Baseline-floor adaptive history | 50/50 attempted direct steps completed with 7 clean JSONL rows. The guard preserved baseline/final QA and heldout target-token coverage at 0.25, predicted diversity at 3/8, QA average target rank at 13.25, and heldout average rank at 13.375, but accepted 0/200 scaled attempted updates. |
| Baseline-floor adaptive status | rejected evidence: v0.86 proves the unsafe-update problem is not fixed by four learning-rate scales; every scaled retry still falls below at least one profile-wise baseline coverage floor and branch_diversity_target still fails across all 9 multi-target profiles. |
| Latest baseline-floor repaired run | runs/transformer-answer-v0.87-fullstack-baseline-floor-repaired-prompt-ownership-clean-smoke-dim4-context80/ |
| Baseline-floor repaired mode | branch-balanced-context-profile-baseline-floor-repaired-prompt-ownership-target-share-preserving-deficit-unlikelihood |
| Baseline-floor repaired unit test | focused transformer tests pass; the repaired mode records baseline replay anchors, adaptive learning-rate scales, repair-anchor counts, repair attempts, repaired attempts, accepted update-shape counts, and rejected samples. |
| Baseline-floor repaired gate | branch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors and 227 baseline-covered repair anchors were recorded; adaptive scales were 1.0, 0.25, 0.05, and 0.01; the guard checked 50 steps, attempted 200 updates, ran 200 one-step repairs, and rejected 200 unsafe attempts; deterministic verifier passed; purity gates include external_embeddings false. |
| Baseline-floor repaired history | 50/50 attempted direct steps completed with 7 clean JSONL rows. The guard preserved baseline/final QA and heldout target-token coverage at 0.25, predicted diversity at 3/8, QA average target rank at 13.25, and heldout average rank at 13.375, but accepted 0/200 repaired attempted updates. |
| Baseline-floor repaired status | rejected evidence: v0.87 proves one bounded baseline-covered anchor repair after an unsafe update is still not enough; every repaired retry falls below at least one profile-wise baseline coverage floor and branch_diversity_target still fails across all 9 multi-target profiles. |
| Latest baseline-floor objective run | runs/transformer-answer-v0.88-fullstack-baseline-floor-objective-prompt-ownership-smoke-dim4-context80/ |
| Baseline-floor objective mode | branch-balanced-context-profile-baseline-floor-objective-prompt-ownership-target-share-preserving-deficit-unlikelihood |
| Baseline-floor objective unit test | focused transformer tests pass; the objective mode records baseline replay anchors, objective-side floor-anchor counts, anchor batch size, anchor weight, objective anchor batches, accepted attempts, and rejected attempts. |
| Baseline-floor objective gate | branch-context gate passed 219/219 semantic records; replay plan records 9144 branch/replay records across 21 profiles; 562 baseline prediction anchors and 227 objective-side floor anchors were recorded; anchor batch size was 32, anchor weight was 10.0, adaptive scales were 1.0, 0.25, 0.05, and 0.01; the guard checked 50 steps, attempted 200 updates, ran 200 objective anchor batches covering 2400 anchor records, and rejected 200 unsafe attempts; deterministic verifier passed; purity gates include external_embeddings false. |
| Baseline-floor objective history | 50/50 attempted direct steps completed with 7 clean JSONL rows. The guard preserved baseline/final QA and heldout target-token coverage at 0.25, predicted diversity at 3/8, QA average target rank at 13.25, and heldout average rank at 13.375, but accepted 0/200 objective-shaped attempted updates. |
| Baseline-floor objective status | rejected evidence: v0.88 proves a balanced objective-side floor-anchor term is still not enough when coupled to branch-diversity pressure; every retry falls below at least one profile-wise baseline coverage floor and branch_diversity_target still fails across all 9 multi-target profiles. |
| Profile target-share next | Use the v0.115 hidden-projection candidate evidence with profile target-share and branch-diversity gates before promotion. |
| Transformer selector exact | 18/219 -> 219/219 selector-emitted |
| Transformer selector candidate accuracy | 18/219 -> 219/219 eval-scoped |
| Transformer-guided generator exact | 0/219 -> 219/219 no-candidate |
| Tokenizer | corpus-trained character tokenizer |
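
The coverage-floor rows in the table (v0.60 onward) describe a two-stage decision: a trained snapshot must first hold every profile's baseline target-token coverage, and only then is it compared on rank or top-k. The sketch below illustrates that ordering under stated assumptions. It is not QuarkLM's implementation: `Snapshot`, `passes_coverage_floor`, and `select_best_snapshot` are hypothetical names, average target rank is the only score used, and the example values mirror the v0.68 QA/heldout rows.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass(frozen=True)
class Snapshot:
    """Hypothetical record of one direct-answer branch snapshot."""
    step: int
    coverage_by_profile: Dict[str, float]  # target-token coverage per profile
    average_target_rank: float             # lower is better


def passes_coverage_floor(candidate: Snapshot, baseline: Snapshot) -> bool:
    """True only if no profile falls below its baseline coverage."""
    return all(
        candidate.coverage_by_profile.get(profile, 0.0) >= floor
        for profile, floor in baseline.coverage_by_profile.items()
    )


def select_best_snapshot(baseline: Snapshot, candidates: List[Snapshot]) -> Snapshot:
    """Apply the floor first, then prefer the lowest average target rank."""
    best = baseline
    for candidate in candidates:
        if not passes_coverage_floor(candidate, baseline):
            continue  # ineligible, however good its rank looks
        if candidate.average_target_rank < best.average_target_rank:
            best = candidate
    return best


# Values mirror the v0.68 QA/heldout rows; the profile keys are illustrative.
baseline = Snapshot(0, {"qa": 0.25, "heldout": 0.25}, 13.25)
step_40 = Snapshot(40, {"qa": 0.125, "heldout": 0.125}, 6.5)
restored = select_best_snapshot(baseline, [step_40])
print(restored.step)  # 0
```

Because the floor is checked before any score comparison, the step-40 snapshot is discarded despite its better rank, which matches the "restored step 0" outcome recorded in the rejected rows.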
v0.42 Summary
QuarkLM v0.42 keeps the admitted corpus unchanged from v0.41 and widens the
from-scratch transformer used by the sparse prompt-contrast branch repair path.
The stable self-improvement run is runs/self-improve-v0.42/; the current
transformer answer-lesson run is
runs/transformer-answer-v0.42-branch-repair-contrast50-dim8-context32/.
The current corpus remains at 12 admitted facts. Direct admission probes pass
48/48, admission paraphrase probes pass 84/84, and glossary probes pass
38/38.
The transformer is a tiny decoder-only language model built in the Python standard library. It uses learned token and position embeddings, one causal self-attention block, a feed-forward block, and QuarkLM's corpus-trained character tokenizer. It starts from random weights and imports no pretrained model, vocabulary, or embeddings.
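
As a rough picture of the architecture just described, the following standard-library sketch runs one forward pass through token and position embeddings, a single causal self-attention block, and a feed-forward block. It is an illustration under assumptions, not the project's code: the function names, the single attention head, the ReLU activation, the absence of biases and layer normalization, the uniform initialization, and the vocabulary size of 40 are all placeholders. The default sizes follow the v0.42 settings (context size 32, embedding dimension 8, feed-forward dimension 16).

```python
import math
import random


def matvec(matrix, vector):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def init_params(vocab_size, context_size=32, dim=8, ff_dim=16, seed=0):
    """Random initial weights only; nothing pretrained is loaded."""
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]

    return {
        "tok": mat(vocab_size, dim), "pos": mat(context_size, dim),
        "wq": mat(dim, dim), "wk": mat(dim, dim),
        "wv": mat(dim, dim), "wo": mat(dim, dim),
        "ff1": mat(ff_dim, dim), "ff2": mat(dim, ff_dim),
        "out": mat(vocab_size, dim),
    }


def next_token_logits(params, token_ids):
    """Logits for the token that follows token_ids (final position only)."""
    xs = [
        [t + p for t, p in zip(params["tok"][tok], params["pos"][i])]
        for i, tok in enumerate(token_ids)
    ]
    # The last position may attend to every earlier position and itself,
    # so no explicit causal mask is needed when only its output is computed.
    query = matvec(params["wq"], xs[-1])
    keys = [matvec(params["wk"], x) for x in xs]
    values = [matvec(params["wv"], x) for x in xs]
    scale = math.sqrt(len(query))
    weights = softmax([sum(q * k for q, k in zip(query, key)) / scale for key in keys])
    mixed = [sum(w * v[d] for w, v in zip(weights, values)) for d in range(len(query))]
    hidden = [x + a for x, a in zip(xs[-1], matvec(params["wo"], mixed))]  # residual
    ff = matvec(params["ff2"], [max(0.0, z) for z in matvec(params["ff1"], hidden)])
    hidden = [h + f for h, f in zip(hidden, ff)]  # residual
    return matvec(params["out"], hidden)


params = init_params(vocab_size=40)  # vocabulary size is an assumed placeholder
logits = next_token_logits(params, [3, 17, 5])
print(len(logits))  # 40
```

The sketch computes only the final position because that is the only state a next-character head consumes in a one-layer model; training would additionally need gradients, which are omitted here.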
Transformer direct-answer evidence:
- transformer checkpoint: runs/transformer-answer-v0.42-branch-repair-contrast50-dim8-context32/transformer_answer.json
- v0.31 generator checkpoint retained for comparison: runs/transformer-answer-v0.31-generator-weighted-lr035-80k/answer_generator.json
- selector checkpoint: runs/transformer-answer-v0.31-generator-weighted-lr035-80k/answer_selector.json
- training steps: 80
- context size: 32
- embedding dimension: 8
- feed-forward dimension: 16
- direct answer steps: 1000
- direct answer mode: periodic-branch-repair-contrast-unlikelihood
- direct answer negative weight: 1.0
- direct answer positive weight: 1.0
- direct answer contrast weight: 1.0
- branch position: 1
- contrast interval: 50
- direct answer training examples: 9144
- answer target NLL: 3.5850 -> 2.4129
- direct answer target loss: 3.4278 -> 2.2708
- raw direct greedy exact answers: 0/219 -> 0/219
- transformer-only eval-scoped candidate accuracy: 15/219 -> 37/219
- selector-emitted exact answers: 18/219 -> 219/219
- selector eval-scoped candidate accuracy: 18/219 -> 219/219
- v0.31 generator exact answers without candidates: 0/219 -> 219/219
- pretrained weights: false
- pretrained tokenizer: false
- external embeddings: false
- direct path uses answer candidates: false
- direct path uses auxiliary weights: false
- generator uses answer candidates: false
This is real movement toward raw transformer answering, but with a clear boundary.
v0.42 preserves v0.33's transformer-only candidate discrimination while testing
whether a wider random transformer gives sparse branch contrast more room to
represent prompt differences. The direct path improves answer-target NLL versus
v0.41 and reduces runaway greedy looping, but raw greedy completions still fail
exact answer generation: the dominant failure is now the short wrong completion
" te.". The next structured repair should make the prompt representation more
target-specific without losing these scored gains.
v0.31's auxiliary no-candidate generator remains the best exact no-candidate
answer evidence. The current reliable response gate still belongs to the
responder, learned answer classifier, and generative answer decoder.
Unpromoted v0.43 Findings
v0.43 development added transformer-loop improvements, but did not replace the v0.42 promoted checkpoint.
- The transformer forward pass now computes only the final position consumed by the language-model head, preserving the next-character objective while making longer-context experiments practical.
- Transformer answer artifacts now include prompt context-coverage metrics. A
context size of
80covers all current semantic eval templates (219/219), while context size32drops complete template coverage for many prompts. runs/transformer-answer-v0.43-hard-branch-contrast4-dim8-context32/preserved candidate accuracy at37/219, but regressed direct loss to2.4225, answer NLL to2.5402, and collapsed greedy output to a repeated" a"loop.runs/transformer-answer-v0.43-branch-repair-contrast50-dim8-context80/achieved full context coverage and the shorter failure" t.", but still trailed v0.42 with direct loss2.3122and answer NLL2.4546.runs/transformer-answer-v0.43-branch-repair-contrast50-dim8-context80-1500/reached38/219candidates, but regressed direct loss, answer NLL, and greedy output. It remains archived evidence rather than a promoted release.runs/transformer-answer-v0.43-layernorm-screen-dim8-context80/tested optional layer normalization with full context coverage. It preserved37/219candidates, but answer NLL regressed to2.5881and greedy output collapsed into repeated" y"/"e"loops, so it was not promoted.runs/transformer-answer-v0.43-branch-span3-screen-dim8-context32/tested branch repair over answer positions1..3. It preserved37/219candidates, but answer NLL regressed to2.7426and greedy output became a long"neeee"loop, so it was not promoted.runs/transformer-answer-v0.43-two-layer-screen-dim8-context32/tested the new multi-layer transformer path. It was interrupted before final direct-answer metrics because two-layer full-block scalar autograd was too slow for the regular loop. The partial JSONL history is runtime evidence, not promotion evidence.runs/transformer-answer-v0.43-two-layer-finalopt-screen-dim8-context32/tested the optimized stacked path where the final layer computes only the last state. 
The optimization is covered by logit-equivalence tests, but the run was still interrupted before final metrics because the intermediate full-state layer remains too expensive for direct-answer repair updates.runs/transformer-answer-v0.43-two-layer-toponly-skip-screen-dim8-context32/tested top-layer-only direct-answer updates for a two-layer transformer and the explicit post-direct snapshot skip used for bounded screens. It completed and saved a checkpoint after40target-loss steps and80direct-answer steps, recorded the skipped post-direct candidate snapshot, improved direct-answer target loss3.5186 -> 3.2436, but kept direct greedy exact at0/219 -> 0/219with repeated"a"output. It is training-loop completion evidence, not promotion evidence.runs/transformer-answer-v0.43-branch-profile-smoke-dim4-context16/verified direct-answer branch-profile metrics. The QA branch-position-1 profile stayed at1/8accuracy, moved from all"o"predictions to all"y"predictions after five tiny direct updates, and kept a negative average target margin. This is model-native self-diagnosis evidence for prompt-independent branch collapse, not promotion evidence.runs/transformer-answer-v0.43-branch-collapse-smoke-dim4-context16/tested full-dose dominant-branch-token suppression. It regressed direct loss and moved QA branch collapse from all"o"predictions to all"a"predictions.runs/transformer-answer-v0.43-periodic-branch-collapse-smoke-dim4-context16/tested sparse dominant-token suppression every five direct steps. It improved direct loss3.5800 -> 3.5157, but QA branch accuracy stayed1/8 -> 1/8and the dominant prediction moved from all"o"to all"n". It remains rejected repair evidence because the branch stayed prompt-independent.runs/transformer-answer-v0.43-branch-batch-smoke-dim4-context16/tested full-dose distinct-target branch batching. 
It improved direct loss only slightly and moved QA branch collapse from all"o"predictions to all"y"predictions.runs/transformer-answer-v0.43-periodic-branch-batch-smoke-dim4-context16/tested sparse branch-batch contrast every five direct steps. It improved direct loss3.5800 -> 3.5248, but QA branch accuracy regressed1/8 -> 0/8and the dominant prediction moved from all"o"to all"a". It is rejected evidence that distinct-target batching still does not force prompt-conditioned branch separation in the current representation.runs/transformer-answer-v0.43-context-mean-branch-batch-smoke-dim4-context16/added--use-context-mean, a representation-side option that adds the mean-pooled prompt context to the final transformer hidden state. With sparse branch-batch contrast it improved direct loss3.5805 -> 3.5252, but QA branch accuracy regressed1/8 -> 0/8and the dominant prediction moved from all"o"to all"a".runs/transformer-answer-v0.43-context-mean-branch-repair-smoke-dim4-context16/tested the same context-mean representation with sparse branch repair. It improved direct loss3.5805 -> 3.5310, but again regressed QA branch accuracy1/8 -> 0/8and collapsed to all"a"predictions. This is rejected representation evidence: prompt averaging alone is not enough to produce prompt-specific branch choices.runs/transformer-answer-v0.43-context-projection-branch-repair-smoke-dim4-context16/added--use-context-projection, a zero-initialized trainable projection of the mean-pooled context. It starts baseline-equivalent, moved all20projection parameters during training, and improved direct loss3.5802 -> 3.5217, but QA branch accuracy regressed1/8 -> 0/8and the dominant prediction moved from all"o"to all"a".runs/transformer-answer-v0.43-context-projection-branch-batch-smoke-dim4-context16/tested the same learned projection with sparse branch-batch contrast. 
It moved all20projection parameters and improved direct loss3.5802 -> 3.5252, but also regressed QA branch accuracy1/8 -> 0/8and collapsed to all"a"predictions. This keeps learned context projection in rejected representation evidence.runs/transformer-answer-v0.43-prompt-attention-branch-repair-smoke-dim4-context16/added--use-prompt-attention-summary, a trainable attention-pooled context summary with a zero-initialized output projection. It moved all20output projection parameters and improved direct loss3.5802 -> 3.5217, but QA branch accuracy regressed1/8 -> 0/8and the dominant prediction moved from all"o"to all"a".runs/transformer-answer-v0.43-prompt-attention-branch-batch-smoke-dim4-context16/tested the same prompt-attention summary with sparse branch-batch contrast. It moved all20output projection parameters and improved direct loss3.5802 -> 3.5252, but again regressed QA branch accuracy1/8 -> 0/8and collapsed to all"a"predictions. This keeps trainable prompt attention in rejected representation evidence.runs/transformer-answer-v0.43-branch-context-coverage-smoke-dim4-context16/addedbranch_context_coveragediagnostics to direct-answer snapshots. At context size16, QA had0/8semantic coverage and4ambiguous branch contexts; for example"s ball?\nanswer: "mapped both place and color first target tokens.runs/transformer-answer-v0.43-branch-context-coverage-smoke-dim4-context32/removed QA branch ambiguity (0ambiguous contexts), but still had0/8semantic coverage at the branch point because the prompt prefix was truncated.runs/transformer-answer-v0.43-branch-context-coverage-smoke-dim4-context80/reached complete branch-context coverage across all eval sets (219/219) with zero ambiguous branch contexts. This is diagnostic evidence for efficient longer-context branch repair.runs/transformer-answer-v0.43-branch-context-gate-smoke-dim4-context16/made that diagnostic actionable with--direct-answer-require-branch-context-gate. 
The required gate failed at context size 16, so the run recorded actual_steps: 0 for 5 requested direct-answer steps.
runs/transformer-answer-v0.43-branch-context-gate-smoke-dim4-context80/ passed the same required gate at context size 80 and recorded actual_steps: 1 for 1 requested direct-answer step.
runs/transformer-answer-v0.43-branch-context-gated-branchonly-smoke-dim4-context80/ added --direct-answer-snapshot-mode branch-only to keep longer-context branch screens bounded. The required context-80 gate passed across all 219/219 semantic records, all 5 requested direct-answer steps ran, and JSONL snapshots recorded evals_skipped: true while retaining branch profiles and branch-context gate evidence.
runs/transformer-answer-v0.43-branchonly-periodic-repair-contrast50-dim8-context80/ used branch-only snapshots for a dim8 context-80 version of the best prior sparse repair/contrast policy. The required gate passed and all 100 direct steps ran, but QA branch prediction collapsed to all "a" with final QA branch accuracy 0/8.
runs/transformer-answer-v0.43-branchonly-branch-batch-dim8-context80/ tested branch-batch contrast under the same complete context. It lowered interval train loss 3.4614 -> 3.1976, but final QA branch prediction still collapsed to all "a" with final QA branch accuracy 0/8.
runs/transformer-answer-v0.43-branch-diversity-target-smoke-dim4-context80/ added a first-class branch_diversity_target to direct-answer snapshots. The required branch-context gate passed and all 5 direct steps ran, but the diversity target failed across all 9 multi-target eval profiles. Final QA had target_unique: 8, predicted_unique: 1, dominant predicted token "r" at rate 1.0, and target-token coverage 0.125.
runs/transformer-answer-v0.43-branch-diversity-train-smoke-dim4-context80/ added branch-diversity-unlikelihood, which trains distinct branch targets while penalizing each branch context's current wrong prediction.
The required branch-context gate passed and 10/10 direct steps ran, but the diversity target still failed across all 9 multi-target profiles. QA moved from all "x" to all "b" predictions, with target-token coverage 0.0 -> 0.125 and predicted_unique still 1/8.
runs/transformer-answer-v0.43-branch-diversity-freezebias-smoke-dim4-context80/ added --direct-answer-freeze-output-bias, which excludes the transformer output bias from direct-answer updates. The required branch-context gate passed and 50/50 direct steps ran with the output bias frozen. Loss moved 3.6149 -> 3.5016, but the diversity target still failed across all 9 multi-target profiles. QA moved from all "x" to all "w" predictions, final target-token coverage was 0.0, and predicted_unique stayed 1/8.
runs/transformer-answer-v0.43-branch-target-softmax-freezebias-smoke-dim4-context80/ added branch-target-softmax-unlikelihood, which applies a restricted softmax over the distinct branch targets in each batch. The required branch-context gate passed, output bias was frozen, and 50/50 direct steps ran. Composite train loss moved 5.6671 -> 5.5820, but the diversity target still failed across all 9 multi-target profiles. QA briefly reached predicted_unique: 2 at step 20, then collapsed back to all "w" by step 50.
runs/transformer-answer-v0.43-branch-target-softmax-restorebest-smoke-dim4-context80/ added --direct-answer-restore-best-branch-snapshot. The required branch-context gate passed, output bias was frozen, and 50/50 direct steps ran. The run restored the final checkpoint from step 40; final QA moved from the prior all-"w" endpoint to all "u" with target-token coverage 0.125, but predicted_unique stayed 1/8 and all 9 multi-target profiles still failed the diversity target.
runs/transformer-answer-v0.43-prompt-prefix-target-softmax-restorebest-smoke-dim4-context80/ added --use-prompt-prefix-projection, a zero-initialized trainable projection over non-padding prompt-prefix positions before the final answer token.
All 20 projection parameters moved and composite train loss improved 5.6649 -> 5.5679, but the final checkpoint restored from step 40 to the same all-"u" QA collapse with target-token coverage 0.125.
runs/transformer-answer-v0.43-prompt-position-target-softmax-restorebest-smoke-dim4-context80/ added --use-prompt-position-projection, a position-specific trainable projection over non-padding prompt-prefix positions before the final answer token. 1108/1284 projection parameters moved and composite train loss improved 5.6649 -> 5.5679, but the final checkpoint restored from step 40 to the same all-"u" QA collapse with target-token coverage 0.125.
runs/transformer-answer-v0.43-branch-target-margin-prompt-position-smoke-dim4-context80/ added branch-target-margin-unlikelihood, a smooth pairwise target-margin loss over each batch's distinct branch targets. The prompt-position context-80 screen moved train loss 4.8973 -> 4.7784 and moved 1108/1284 prompt-position projection parameters, but the final checkpoint restored from step 40 to the same all-"u" QA collapse with target-token coverage 0.125.
runs/transformer-answer-v0.43-branch-representation-contrast50-prompt-position-smoke-dim4-context80/ added branch_representation_profiles and branch-representation-contrast-unlikelihood. The high-weight prompt-position context-80 screen used --direct-answer-contrast-weight 50.0 and moved QA different-target hidden distance only about 0.00097 -> 0.00107 at the restored checkpoint; the final branch profile still restored to the same all-"u" QA collapse.
runs/transformer-answer-v0.43-branch-representation-contrast50-prompt-position-smoke-dim8-context80-steps40/ tested the same high-weight representation-contrast path at embedding/feed-forward dimensions 8/16.
The completed 40/40 step screen restored from step 10, moved QA different-target hidden distance to about 0.00209, and still restored to the same all-"u" QA collapse with target-token coverage 0.125.
runs/transformer-answer-v0.43-prompt-position-scale32-repcontrast50-smoke-dim4-context80/ added --prompt-position-projection-scale 32.0 to test whether the prompt-position residual was simply too quiet. The completed 50/50 step screen moved 1108/1284 prompt-position projection parameters and restored from step 40; restored QA different-target hidden distance rose to about 0.01235, but QA still collapsed to all "u" with target-token coverage 0.125.
STRUCTURE_AUDIT.md now records the next transformer checkpoint: study open-source model, trainer, tokenizer, checkpoint, and transparency patterns before adding another repair objective, while keeping all external weights, tokenizers, embeddings, datasets, and training text outside QuarkLM's closed-world boundary. The completed comparison table chooses an opt-in pre-layer-norm transformer block path with final normalization as the next structural implementation target.
runs/transformer-answer-v0.44-prelayernorm-repcontrast50-prompt-position-smoke-dim4-context80/ implemented that path with --use-pre-layer-norm. The bounded context-80 screen ran 50/50 direct steps, moved 1108/1284 prompt-position parameters and all 8 final-norm parameters, and cracked full collapse in 7/9 multi-target profiles. The formal diversity target still failed 0/9, and QA stayed collapsed to all "y" with target-token coverage 0.125.
runs/transformer-answer-v0.44-target-balanced-prelayernorm-repcontrast50-prompt-position-smoke-dim4-context80/ added target-bucket branch batch sampling through branch-balanced-representation-contrast-unlikelihood. The screen ran 50/50 direct steps, but best-snapshot restoration returned to step 0 because every trained snapshot scored worse than baseline.
All 9/9 multi-target profiles collapsed to "n", so target balancing is rejected as a standalone repair.
runs/transformer-answer-v0.45-branch-rank-diagnostic-smoke-dim4-context80/ adds target-rank diagnostics to branch profiles. The smoke used the pre-layer-norm prompt-position path and recorded QA and heldout both collapsed to "n" with average target rank 14.25 and top-3/top-5 target coverage 0.125. The correct branch target is usually buried behind several global alternatives, so this is output-binding evidence rather than a near-miss rank problem.
runs/transformer-answer-v0.46-output-binding-rankscore-smoke-dim4-context80/ adds branch-output-binding-unlikelihood, combining branch target softmax with representation contrast, and makes best-snapshot scoring rank-aware. It ran 20/20 direct steps with output bias frozen. QA average target rank improved 17.375 -> 14.125, and QA/heldout top-5 coverage reached 0.25. Target-token coverage stayed 0.0, top-3 coverage ended 0.0, and the branch prediction still collapsed to wrong tokens, so the repair is rejected for promotion.
runs/transformer-answer-v0.47-rank-margin-steps50-smoke-dim4-context80/ adds branch-rank-margin-unlikelihood, which pushes each branch target above the model's own top wrong tokens. The screen ran 50/50 direct steps, restored the rank-aware best snapshot from step 40, and improved QA average target rank 17.375 -> 9.0. QA target-token coverage rose to 0.125, top-3 coverage rose to 0.25, and top-5 coverage rose to 0.5. It is still rejected because predicted diversity stayed 1/8 and QA/heldout remained collapsed to wrong "n".
runs/transformer-answer-v0.48-balanced-rank-margin-smoke-dim4-context80/ combines target-balanced branch batches with the same rank-margin repair. It ran 50/50 direct steps and reached QA predicted diversity 2/8, target-token coverage 0.125, average target rank 9.375, top-3 coverage 0.375, and top-5 coverage 0.5.
It is still rejected because QA and heldout remain wrong top-1 branch choices.
runs/transformer-answer-v0.49-balanced-rank-margin-top1-smoke-dim4-context80/ tests the same balanced rank-margin path with --direct-answer-hard-negatives 1, concentrating margin pressure on only the current top wrong token. It restored from step 10; QA target-token coverage stayed 0.125, but average target rank regressed to 12.5, top-3 coverage fell to 0.125, and top-5 coverage fell to 0.25. This is rejected evidence.
runs/transformer-answer-v0.50-balanced-topk-softmax-w5-smoke-dim4-context80/ adds branch-balanced-topk-softmax-unlikelihood, where each correct branch target competes in a restricted softmax against the model's current top wrong tokens. It restored from step 40; QA target-token coverage stayed 0.125, average target rank improved to 8.75, top-3 coverage reached 0.375, and top-5 coverage reached 0.5. This recovers rank/top-k evidence after v0.49, but prediction diversity stayed 1/8 and top-1 branch choices remained wrong, so it is rejected repair evidence.
runs/transformer-v0.51-foundation-stack-smoke/ verifies the full transformer foundation stack before the next direct-answer repair run. It ran 2/2 language-model steps with AdamW, gradient accumulation, two attention heads, RMSNorm, gated MLPs, tied output embeddings, rotary positions, and cache-aware generation metadata. The run wrote a quarklm-transformer-v2 checkpoint, optimizer_state.json, eval.json, and replayable eval_samples.jsonl traces. This is mechanics-readiness evidence, not model-quality promotion evidence.
runs/transformer-answer-v0.52-fullstack-topk-softmax-smoke-dim4-context80/ reruns the v0.50 top-k branch objective under the full v0.51 stack. It completed 50/50 direct steps and restored to step 0: the full-stack baseline had QA and heldout predicted diversity 3/8 and target-token coverage 0.25, but training collapsed to one wrong token at later snapshots.
This rejects unchanged top-k pressure under the full stack and points the next repair toward prompt-context-to-target-token binding.
runs/transformer-answer-v0.53-fullstack-bidir-binding-smoke-dim4-context80/ adds branch-balanced-bidirectional-binding-unlikelihood. The objective trains each prompt context to choose its own branch target and each target token to assign cross-context probability mass back to its own prompt contexts. The focused transformer unit test verifies that context-ownership signal on a small branch batch. The full-stack screen completed 50/50 direct steps and restored from step 40: QA average target rank improved to 7.875 with top-5 coverage 0.5, but target-token coverage ended at 0.125 and the diversity target still failed 0/9 multi-target profiles. This is partial rank-pressure evidence, not promotion evidence.
runs/transformer-answer-v0.54-fullstack-coverage-binding-smoke-dim4-context80/ adds branch-balanced-coverage-binding-unlikelihood, which makes every branch target compete against sibling branch targets and hard wrong tokens while adding a target-set mass coverage guard. The focused transformer test verifies that this pressure lifts target-set mass against hard wrong tokens. The full-stack screen completed 50/50 direct steps but restored from step 0: training snapshots improved QA average target rank to 8.125, but target-token coverage collapsed to 0.0 and top-1 predictions collapsed to wrong "a". This rejects the bundled coverage-binding loss under the full stack.
runs/transformer-answer-v0.55-fullstack-target-set-coverage-smoke-dim4-context80/ isolates target-set coverage with branch-balanced-target-set-coverage-unlikelihood, positive target CE disabled, and no exact-target row or cross-context ownership losses. The focused transformer test verifies that target-set mass can increase against hard wrong tokens without asserting exact-target sharpening.
The full-stack screen completed 50/50 direct steps and restored from step 0: training snapshots improved QA average target rank to 10.0, but target-token coverage still collapsed to 0.0 with wrong "a" top-1 predictions. This rejects batch-local target-set mass as a sufficient coverage repair.
runs/transformer-answer-v0.57-fullstack-target-diversity-smoke-dim4-context80/ adds target-share anti-collapse pressure with branch-balanced-target-diversity-unlikelihood, positive target CE disabled, and hard wrong-token competition. The focused transformer test verifies that restricted target-set mass and weakest target-share balance can both improve in a small branch batch. The full-stack screen completed 50/50 direct steps and restored from step 0: training snapshots improved QA average target rank to 10.0, but target-token coverage again collapsed to 0.0 with wrong "a" top-1 predictions. This rejects batch-local target sharing as a sufficient eval-wide anti-collapse repair.
runs/transformer-answer-v0.58-fullstack-target-replay-coverage-smoke-dim4-context80/ extends the repair from batch-local target sharing to closed-world replay targets with branch-balanced-target-replay-coverage-unlikelihood, positive target CE disabled, and hard wrong-token competition. The focused transformer test verifies that replay target-set mass and weakest missing-target share can both improve when the sampled branch batch omits some admitted pool targets. The full-stack screen completed 50/50 direct steps and restored from step 0: training snapshots improved QA average target rank as far as 6.875 and top-5 coverage to 0.5, but target-token coverage still hit 0.0 during training and QA/heldout top-1 predictions collapsed to wrong "n" by step 50.
This rejects pool-owned replay coverage as a sufficient context-specific target-ownership repair.
runs/transformer-answer-v0.59-fullstack-context-replay-coverage-smoke-dim4-context80/ makes replay context-owned with branch-balanced-context-replay-coverage-unlikelihood, positive target CE disabled, and hard wrong-token competition. The focused transformer test verifies that replay target-set mass and weakest owned-target share can both improve on fixed replay contexts. The full-stack screen completed 50/50 direct steps and restored from step 0: training snapshots improved QA average target rank as far as 7.375, QA top-3 to 0.375, QA top-5 to 0.5, and admissions top-5 to 0.5208 by step 50, but target-token coverage still hit 0.0 during training and the diversity target failed 0/9. This rejects context-owned replay coverage as implemented.
runs/transformer-answer-v0.60-fullstack-context-replay-coverage-floor-metadata-smoke-dim4-context80/ adds a profile-wise target-token coverage floor to branch snapshot selection: rank/top-k gains are eligible only when every multi-target profile preserves its baseline coverage. Direct-answer JSONL snapshots now write branch_target_coverage_by_profile, and the focused transformer test rejects a rank-lifted candidate that regresses QA coverage. The clean full-stack screen completed 50/50 direct steps, wrote 7 JSONL rows, and restored from step 0: the baseline coverage floor remained visible in the final row (qa 0.25, heldout 0.25, admissions 0.1429, minimum profile 0.0714). This accepts the self-improvement gate repair while still rejecting the trained model behavior.
runs/transformer-answer-v0.61-fullstack-context-coverage-anchor-smoke-dim4-context80/ adds a covered-target anchor to context replay: replay branches whose own target is already top-1 receive extra target-vs-replay-target/hard-wrong pressure. The focused transformer test verifies that the anchor protects a covered branch better than identical replay training without the anchor.
The full-stack screen completed 50/50 direct steps and restored from step 0 under the v0.60 coverage floor, but trained snapshots over-anchored the already-covered wrong "i" token: QA/heldout predicted diversity fell to 1/8, target-token coverage to 0.125, and average target rank above 21. This rejects global covered-target anchoring as implemented.
runs/transformer-answer-v0.62-fullstack-target-balanced-anchor-smoke-dim4-context80/ makes covered-target anchoring target-balanced: anchor losses are averaged by covered target and skipped when only one covered target is present. The focused transformer test verifies that this singleton guard skips the v0.61 one-token over-anchor while the old global anchor still raises that token. The full-stack screen completed 50/50 direct steps and restored from step 0 under the v0.60 coverage floor. It avoided the hard "i" attractor, but QA/heldout target-token coverage still collapsed to 0.0 during training. This rejects target-balanced anchoring as sufficient.
runs/transformer-answer-v0.64-fullstack-coverage-deficit-smoke-dim4-context80/ adds branch-balanced-context-coverage-deficit-unlikelihood, which computes replay target tokens that are absent from the current replay predictions and adds target pressure only for those missing targets. The focused transformer test verifies that the deficit term lifts a missing replay target above the old context replay objective. The full-stack screen completed 50/50 direct steps and restored from step 0 under the v0.60 coverage floor. Step 50 cracked QA top-1 behavior enough to reach 1/8 branch accuracy and predicted diversity 4/8, but QA/heldout target-token coverage regressed to 0.125, so the trained snapshots remained ineligible.
This rejects deficit pressure by itself.
runs/transformer-answer-v0.65-fullstack-coverage-preserving-deficit-smoke-dim4-context80/ adds branch-balanced-context-coverage-preserving-deficit-unlikelihood, which balances missing-target deficit pressure with preservation anchors for target tokens currently represented in replay predictions. Focused tests pass and verify both effects in isolation. The full-stack screen completed 50/50 direct steps and restored from step 0. Step 50 improved QA average target rank to 7.75, heldout average target rank to 7.125, and top-5 coverage to 0.5, but both profiles collapsed to one predicted target token with target-token coverage 0.125. This rejects current-prediction preservation as implemented.
runs/transformer-answer-v0.67-profile-aware-replay-plan-smoke-dim4-context80/ adds profile-aware replay records and direct_answer_replay_plan.json for the preserving-deficit path. Focused tests verify that global target coverage cannot hide a profile-local missing target and that profiled replay records keep their admitted source keys even when branch target tokens are shared. The bounded smoke wrote a plan for 9144 branch/replay records across 21 profiles, passed the branch-context gate across 219/219 semantic records, and showed profile-specific coverage floors such as qa:place at 0.5 and qa:color at 0.0. It ran one branch-only direct step and restored from step 0; branch diversity still failed 0/9 multi-target profiles. This is mechanics-readiness evidence, not model-quality promotion evidence.
runs/transformer-answer-v0.68-fullstack-profile-aware-preserving-deficit-smoke-dim4-context80/ spends that replay plan on the comparable full-stack repair screen. The run completed 50/50 direct steps, wrote 7 direct-answer JSONL rows, passed the branch-context gate, and used a replay plan for 9144 branch records across 21 profiles. Training step 40 improved QA average target rank to 6.5 and top-5 coverage to 0.625; heldout average rank improved to 6.875 with top-5 coverage 0.5.
Those rank gains came with QA/heldout target-token coverage regressing to 0.125 and predicted diversity collapsing to 1/8, so best-snapshot scoring restored step 0. This is rejected evidence.
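The branch metrics quoted throughout these entries (predicted diversity, target-token coverage, average target rank, top-k coverage) can be illustrated with a minimal Python sketch. The function name branch_profile, its input shapes, and the metric definitions are assumptions inferred from how the numbers are reported here, not QuarkLM's actual implementation; the data is a toy example, not output from any run.

```python
from collections import Counter


def branch_profile(targets, predictions, ranks, ks=(3, 5)):
    """Summarize first-token branch behavior for one eval profile.

    Hypothetical helper. Assumed inputs, one entry per branch context:
      targets:     the correct first answer token
      predictions: the model's top-1 token
      ranks:       1-based rank of the correct token in the model's ordering
    """
    assert len(targets) == len(predictions) == len(ranks) > 0
    target_set = set(targets)
    counts = Counter(predictions)
    dominant_token, dominant_count = counts.most_common(1)[0]
    profile = {
        "target_unique": len(target_set),
        "predicted_unique": len(counts),
        "dominant_predicted_token": dominant_token,
        "dominant_predicted_rate": dominant_count / len(predictions),
        # Assumed definition: share of distinct targets that appear
        # anywhere among the top-1 predictions.
        "target_token_coverage": len(target_set & set(predictions)) / len(target_set),
        "average_target_rank": sum(ranks) / len(ranks),
    }
    for k in ks:
        profile[f"top{k}_coverage"] = sum(r <= k for r in ranks) / len(ranks)
    return profile


# Toy collapsed profile: eight distinct targets, one repeated prediction.
toy = branch_profile(
    targets=list("abcdefgh"),
    predictions=["a"] * 8,
    ranks=[1, 4, 7, 2, 9, 12, 5, 16],
)
```

Under these assumed definitions, a profile that collapses onto a single token which happens to be one of eight targets scores predicted_unique 1 and target-token coverage 0.125 even when rank and top-k numbers look healthier, which is the pattern the rejected runs above keep reporting.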
v0.81 keeps the context-coverage audit, profile-wise coverage floor, and
replay-plan artifact, then adds balanced profile target-share pressure inside
the profile-local direct-answer objective. Focused tests verify the minority
replay target gains more share than under the previous profile-aware replay
loss. v0.82 then screens that objective in
runs/transformer-answer-v0.82-fullstack-profile-target-share-smoke-dim4-context80/.
The run records the modern artifact stack and passes the verifier,
branch-context, purity, and coverage-preservation gates, but it still fails
branch diversity. Step 40 improves QA average target rank to 9.125, yet
does so by collapsing QA and heldout to one "c" prediction with 0.0
target-token coverage. Best-snapshot scoring restores step 0, so this is
rejected evidence.
v0.83 adds prompt-specific sibling-target ownership margins on top of that
profile target-share objective. Focused tests show the new term lifts a
context-specific target more than v0.82 target-share pressure. The full screen
in
runs/transformer-answer-v0.83-fullstack-prompt-ownership-smoke-dim4-context80/
writes the modern artifacts and passes the verifier, branch-context, and purity
gates. It still fails branch diversity: step 50 improves QA average target
rank to 8.625, but QA and heldout collapse to one "c" prediction with
0.0 target-token coverage during training. Best-snapshot scoring restores
step 0.
v0.84 anchors replay preservation to the baseline profile-aware replay
predictions captured before direct-answer training. Focused tests show replay
batches can use those baseline prediction overrides and that anchored
preservation protects a covered target better than following current prediction
drift. The full screen in
runs/transformer-answer-v0.84-fullstack-baseline-anchored-prompt-ownership-smoke-dim4-context80/
records 562 active baseline prediction anchors, passes the verifier,
branch-context, and purity gates, and avoids the v0.83 zero-coverage collapse.
Step 40 improves QA average target rank to 8.0, but QA and heldout still
collapse to one "i" prediction with target-token coverage 0.125, below the
baseline 0.25 floor. Best-snapshot scoring restores step 0, so the next
repair must preserve the full baseline target-token floor.
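The interaction between rank gains and the profile-wise coverage floor can be sketched as follows. This is a hedged illustration only: preserves_floor, select_snapshot, the snapshot dictionary layout, and the baseline average rank of 12.0 are hypothetical, and lower average rank stands in for whatever rank-aware score the real best-snapshot selection uses.

```python
def preserves_floor(baseline_coverage, candidate_coverage):
    """True when every profile keeps at least its baseline target-token coverage."""
    return all(
        candidate_coverage.get(profile, 0.0) >= floor
        for profile, floor in baseline_coverage.items()
    )


def select_snapshot(snapshots):
    """Pick the snapshot to restore.

    Assumed layout: snapshots[0] is the untrained baseline (step 0); each
    snapshot is {"step": int, "coverage": {profile: float}, "avg_rank": float}.
    A trained snapshot is eligible only if it preserves the baseline floor;
    among eligible snapshots the lowest average target rank wins.
    """
    baseline = snapshots[0]
    best = baseline
    for snap in snapshots[1:]:
        if not preserves_floor(baseline["coverage"], snap["coverage"]):
            continue  # rank gains do not count below the floor
        if snap["avg_rank"] < best["avg_rank"]:
            best = snap
    return best


# Toy history shaped like the v0.84 narrative: rank improves to 8.0 at
# step 40 while coverage drops from 0.25 to 0.125 (baseline rank is made up).
history = [
    {"step": 0, "coverage": {"qa": 0.25, "heldout": 0.25}, "avg_rank": 12.0},
    {"step": 40, "coverage": {"qa": 0.125, "heldout": 0.125}, "avg_rank": 8.0},
]
restored = select_snapshot(history)

# The same rank gain becomes eligible once the floor is preserved.
history_ok = [
    history[0],
    {"step": 40, "coverage": {"qa": 0.25, "heldout": 0.25}, "avg_rank": 8.0},
]
restored_ok = select_snapshot(history_ok)
```

In the first toy history the step-40 snapshot is skipped and step 0 is restored; in the second the step-40 snapshot is selected.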
v0.85 adds a baseline-floor update guard around the baseline-anchored
prompt-ownership mode. The full screen in
runs/transformer-answer-v0.85-fullstack-baseline-floor-gated-prompt-ownership-smoke-dim4-context80/
records 562 active baseline prediction anchors and checks 50/50 attempted
direct-answer updates. The guard rejects all 50 unsafe updates, preserving QA
and heldout target-token coverage at the baseline 0.25 floor in every recorded
snapshot. It is still rejected evidence: no weight update is accepted and branch
diversity still fails across all 9 multi-target profiles.
v0.86 adds adaptive retries around that update guard. The full screen in
runs/transformer-answer-v0.86-fullstack-baseline-floor-adaptive-prompt-ownership-smoke-dim4-context80/
records 562 active baseline prediction anchors and attempts 200 scaled
updates across 50 checked direct-answer steps. Scales 1.0, 0.25, 0.05,
and 0.01 all still violate at least one profile-wise baseline coverage floor,
so the guard rejects 200/200 attempts. It is still rejected evidence:
step-size retry alone does not produce accepted safe updates.
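A minimal sketch of the retry loop described above, assuming a flat list of weights and a caller-supplied safety predicate. The names guarded_step and is_safe are hypothetical; only the four scales and the 50-step, 200-attempt arithmetic come from the text.

```python
SCALES = (1.0, 0.25, 0.05, 0.01)  # retry scales listed for v0.86


def guarded_step(weights, update, is_safe, scales=SCALES):
    """Try one update at successively smaller scales.

    Returns (new_weights, accepted_scale, attempts). When every scale is
    rejected, the original weights are returned unchanged and
    accepted_scale is None.
    """
    attempts = 0
    for scale in scales:
        attempts += 1
        candidate = [w + scale * u for w, u in zip(weights, update)]
        if is_safe(candidate):
            return candidate, scale, attempts
    return weights, None, attempts


# A guard that rejects everything reproduces the attempt count in the text:
# 50 checked steps x 4 scales = 200 rejected attempts, no weight movement.
weights = [0.0, 0.0]
total_attempts = 0
for _ in range(50):
    weights, accepted, attempts = guarded_step(weights, [1.0, -1.0], lambda c: False)
    total_attempts += attempts

# A toy guard that tolerates small movement accepts at the third scale.
moved, accepted_scale, used = guarded_step([0.0], [1.0], lambda c: abs(c[0]) <= 0.06)
```

The sketch only shows the control flow; in the recorded run the predicate would be the profile-wise baseline coverage floor evaluated on the candidate weights.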
v0.87 adds one bounded baseline-covered anchor repair after each unsafe adaptive
retry. The clean full screen in
runs/transformer-answer-v0.87-fullstack-baseline-floor-repaired-prompt-ownership-clean-smoke-dim4-context80/
records 562 active baseline prediction anchors, 227 repair anchors, 200
repair attempts, and 200/200 rejected update attempts. QA and heldout coverage
remain at 0.25, but no repaired update is accepted, so post-update repair is
also rejected as the missing mechanic.
v0.88 moves balanced baseline-floor anchors into the direct-answer objective
itself. The full screen in
runs/transformer-answer-v0.88-fullstack-baseline-floor-objective-prompt-ownership-smoke-dim4-context80/
records 562 active baseline prediction anchors, 227 objective-side floor
anchors, 200 objective anchor batches, 2400 anchor records, and 200/200
rejected update attempts. QA and heldout coverage remain at 0.25, but no
objective-shaped update is accepted, so branch-pressure coupling is rejected as
the missing mechanic.
v0.89 isolates baseline-floor stabilization updates. The full screen in
runs/transformer-answer-v0.89-fullstack-baseline-floor-stabilization-smoke-dim4-context80/
records 562 active baseline prediction anchors, 227 stabilization anchors,
200 stabilization anchor batches, 2400 anchor records, and 200/200
rejected update attempts. QA and heldout coverage remain at 0.25, but no
stabilization-only update is accepted, so the next repair should diagnose why
floor-only updates still violate the baseline floor.
v0.90 adds the missing rejection diagnosis. The full screen in
runs/transformer-answer-v0.90-fullstack-baseline-floor-stabilization-diagnostics-smoke-dim4-context80/
records stabilization: 200 rejected update-shape counts, 50 rejected
attempts at each adaptive scale, heldout: 200 violation counts, and a worst
floor deficit of 0.25 on the learning profile. Promotion still rejects the
transformer, but the next repair now has measured profile-level floor evidence.
v0.91 applies that evidence by covering the full baseline-covered profile-target
floor surface. The full screen in
runs/transformer-answer-v0.91-fullstack-baseline-floor-profile-targeted-stabilization-smoke-dim4-context80/
records 227 floor anchors, 12 profile-target groups,
profile_targeted_stabilization: 200 rejected attempts, and the same violation
profile counts as v0.90. Promotion still rejects the transformer, and the next
repair must change the floor repair shape rather than only broaden anchor
coverage.
v0.92 changes that shape to sequential source-profile floor repair. The full
screen in
runs/transformer-answer-v0.92-fullstack-baseline-floor-sequential-profile-stabilization-smoke-dim4-context80/
records 10 source-profile groups, 2000 profile-local repair attempts,
2000 profile-local rejections, and 200 no-effective-update outer attempts.
Promotion still rejects the transformer, and the next repair must isolate
floor-preserving weight movement rather than only broaden coverage or reorder
profiles.
v0.93 calibrates that movement below 0.01. The diagnostic screen in
runs/transformer-answer-v0.93-baseline-floor-calibrated-sequential-profile-stabilization-step1-dim4-context80/
records calibrated scales down to 0.0001, 50 profile-local repair attempts,
49 profile-local rejections, and one accepted nonzero bridge:owner update at
scale 0.0025. Promotion still rejects the transformer on
branch_diversity_target, but the baseline floor guard has now accepted real
weight movement.
v0.94 adds profile-scale memory to that calibrated path. The diagnostic screen
in
runs/transformer-answer-v0.94-baseline-floor-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer profile-scale update, 60 profile-scale attempts,
8 accepted source-profile updates, and 52 rejected profile-scale attempts.
Promotion still rejects the transformer on branch_diversity_target, but safe
floor-preserving movement now spans multiple source profiles.
v0.95 adds diversity-aware profile-scale memory to that path. The diagnostic
screen in
runs/transformer-answer-v0.95-baseline-floor-diversity-profile-scale-calibrated-sequential-stabilization-configured-step1-dim4-context80/
records one accepted outer diversity-aware profile-scale update, 58
profile-scale attempts, 5 score-improving accepted source-profile updates,
42 floor regressions, and 11 floor-preserving diversity-score regressions.
Promotion still rejects the transformer on branch_diversity_target, but the
training loop now records which safe movements are non-regressive for branch
diversity.
v0.96 adds missing-target frontier anchors to that path. The diagnostic screen
in
runs/transformer-answer-v0.96-baseline-floor-diversity-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records 52 frontier anchors, one accepted outer frontier profile-scale
update, 43 profile-scale attempts, 9 score-improving accepted source-profile
updates, 28 floor regressions, and 6 floor-preserving diversity-score
regressions. Promotion still rejects the transformer on
branch_diversity_target, but max dominant predicted rate improves to 0.9
and minimum target-token coverage improves to 0.1667.
v0.97 adds coverage-frontier acceptance to that path. The diagnostic screen in
runs/transformer-answer-v0.97-baseline-floor-diversity-coverage-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer coverage-frontier profile-scale update, 68
profile-scale attempts, 1 coverage-gaining accepted source-profile update,
50 floor regressions, 15 coverage ties, and 2 coverage regressions.
Promotion still rejects the transformer on branch_diversity_target, but the
update guard now records accepted coverage deltas and proves the strict
monotonic screen is currently too conservative for full missing-target repair.
v0.98 adds coverage-prep frontier acceptance to that path. The diagnostic
screen in
runs/transformer-answer-v0.98-baseline-floor-diversity-coverage-prep-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer coverage-prep profile-scale update, 43
profile-scale attempts, 9 accepted source-profile updates, 3 coverage
gains, 6 coverage-preparation moves, 28 floor regressions, 4 coverage
ties without score gain, and 2 coverage regressions. Promotion still rejects
the transformer on branch_diversity_target, but the update guard now
separates direct coverage gains from safe preparation moves.
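One plausible reading of how the guard separates these outcomes is sketched below. The function classify_attempt, its argument names, and the precedence of the checks are assumptions; only the category counts are taken from the v0.98 description, and the per-attempt deltas are placeholders chosen to land in each category.

```python
from collections import Counter


def classify_attempt(floor_ok, coverage_delta, score_delta):
    """Label one profile-scale attempt (hypothetical precedence of checks)."""
    if not floor_ok:
        return "floor_regression"
    if coverage_delta < 0:
        return "coverage_regression"
    if coverage_delta > 0:
        return "coverage_gain"
    if score_delta > 0:
        return "coverage_prep"  # coverage tied, branch score improved
    return "coverage_tie_without_score_gain"


ACCEPTED = {"coverage_gain", "coverage_prep"}

# Placeholder attempts arranged to match the counts reported for v0.98.
attempts = (
    [(True, 0.125, 0.0)] * 3      # coverage gains
    + [(True, 0.0, 0.1)] * 6      # coverage-preparation moves
    + [(False, 0.0, 0.0)] * 28    # floor regressions
    + [(True, 0.0, 0.0)] * 4      # coverage ties without score gain
    + [(True, -0.125, 0.0)] * 2   # coverage regressions
)
tally = Counter(classify_attempt(*a) for a in attempts)
accepted_total = sum(n for label, n in tally.items() if label in ACCEPTED)
```

The tally adds up to the 43 reported profile-scale attempts, and the two accepted categories together give the 9 accepted source-profile updates.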
v0.99 adds coverage-recovery frontier retry to that path. The diagnostic screen
in
runs/transformer-answer-v0.99-baseline-floor-diversity-coverage-recovery-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer coverage-recovery profile-scale update, 54
profile-scale attempts, 6 accepted source-profile updates, 6 prepared
recovery candidates, 15 recovery retries, 2 direct coverage recoveries,
4 preparation fallbacks, 38 floor regressions, 7 coverage ties without
score gain, and 3 coverage regressions. Promotion still rejects the
transformer on branch_diversity_target, but the guard now proves preparation
can be tested as direct missing-target recovery before it is admitted as
self-improvement evidence.
v0.100.0 adds branch-stable coverage-recovery acceptance to that path. The
diagnostic screen in
runs/transformer-answer-v0.100.0-baseline-floor-diversity-branch-stable-coverage-recovery-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer branch-stable recovery update, 54 profile-scale
attempts, 6 accepted source-profile updates, 6 prepared recovery
candidates, 15 branch-stability checks, 2 branch-stable coverage
recoveries, 4 preparation fallbacks, 7 floor-regressed recovery retries,
5 coverage-tied retries, and 1 branch-score regression rejection. Promotion
still rejects the transformer on branch_diversity_target, but the guard now
proves recovery can be checked against the prepared branch-diversity score
instead of coverage alone.
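A minimal sketch of branch-stable acceptance with a preparation fallback follows. It assumes a retry is accepted only when the floor holds, coverage rises, and the branch-diversity score does not drop below the prepared score; the `State` fields, the outcome labels, and the toy numbers are hypothetical.

```python
from typing import NamedTuple, Sequence, Tuple


class State(NamedTuple):
    """Hypothetical metrics for one model state; names are illustrative."""
    coverage: float
    branch_score: float
    floor_ok: bool


def is_branch_stable(prepared: State, retry: State) -> bool:
    """Check a recovery retry against the prepared state, not coverage alone."""
    return (
        retry.floor_ok
        and retry.coverage > prepared.coverage
        and retry.branch_score >= prepared.branch_score
    )


def resolve_candidate(prepared: State, retries: Sequence[State]) -> Tuple[str, State]:
    """Return the first branch-stable retry, else fall back to the prepared state."""
    for retry in retries:
        if is_branch_stable(prepared, retry):
            return "branch_stable_recovery", retry
    return "preparation_fallback", prepared


prepared = State(coverage=0.25, branch_score=0.5, floor_ok=True)
retries = [
    State(0.5, 0.4, True),    # coverage up, branch score down: rejected
    State(0.5, 0.6, False),   # floor regressed: rejected
    State(0.25, 0.7, True),   # coverage tied: rejected
    State(0.5, 0.5, True),    # coverage up, branch score held: accepted
]
outcome, state = resolve_candidate(prepared, retries)
print(outcome, state.coverage)
```

This shape is consistent with the reported counts, where 6 prepared candidates resolve into 2 branch-stable recoveries and 4 preparation fallbacks.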
v0.101.0 adds branch-diversity recovery after already-safe profile updates. The
diagnostic screen in
runs/transformer-answer-v0.101.0-baseline-floor-diversity-branch-diversity-recovery-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer branch-diversity recovery update, 52 profile-scale
attempts, 6 accepted source-profile updates, 6 branch-diversity recovery
candidates, 9 branch-diversity recovery attempts, 5 branch-score-improving
refinements, 1 fallback, 1 floor-regression rejection, 1
score-regression rejection, and 2 score-tie rejections. Promotion still
rejects the transformer on branch_diversity_target, but the guard now proves
local branch-diversity score can improve without weakening the coverage floor.
v0.102.0 adds collapsed-profile binding after branch-diversity recovery. The
diagnostic screen in
runs/transformer-answer-v0.102.0-baseline-floor-diversity-collapsed-profile-binding-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer collapsed-profile binding update, 54
profile-scale attempts, 11 accepted source-profile updates, 11
branch-diversity recovery candidates, 26 branch-diversity recovery attempts,
4 branch-score refinements, 31 collapsed-profile binding attempts, 1
accepted binding update, 10 binding fallbacks, 27 collapsed-profile ties,
1 floor-regression rejection, and 2 score-regression rejections. Promotion
still rejects the transformer on branch_diversity_target, but the guard now
proves a targeted binding update can survive while the final collapse narrows
from 9/9 eval profiles at baseline to 3/9: learning, owner, and paraphrases.
v0.103.0 adds remaining-profile binding after collapsed-profile binding. The
diagnostic screen in
runs/transformer-answer-v0.103.0-baseline-floor-diversity-remaining-profile-binding-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer remaining-profile binding update, 56
profile-scale attempts, 11 accepted source-profile updates, 21
prioritized remaining-profile attempts, 6 prioritized acceptances, 15
prioritized rejections, 3 branch-diversity refinements, and 2
collapsed-profile binding updates. Promotion still rejects the transformer on
branch_diversity_target, but the guard proves the remaining-profile
curriculum can improve learning coverage from 0.0 to 0.25 without target
coverage regression.
v0.104.0 adds owner/paraphrase residual binding after remaining-profile
binding. The diagnostic screen in
runs/transformer-answer-v0.104.0-baseline-floor-diversity-owner-paraphrase-binding-frontier-profile-scale-calibrated-sequential-stabilization-step1-dim4-context80/
records one accepted outer owner/paraphrase binding update, 16
owner/paraphrase-prioritized attempts, 6 prioritized acceptances, 10
prioritized rejections, 75 learning-preservation checks, 24 preservation
failures, and 33 narrowed collapsed-profile binding rejections. Promotion
still rejects the transformer on branch_diversity_target, but learning
finishes non-collapsed with coverage 0.25 and predicted diversity 2.
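The coverage and predicted-diversity figures can be illustrated with a short sketch. The definitions here are assumptions, since the text does not spell them out: coverage is taken as the share of distinct target first tokens that appear among the predictions, predicted diversity as the number of distinct predicted first tokens, and a profile counts as collapsed when that number is at most 1. The tokens are invented.

```python
def profile_metrics(target_tokens, predicted_tokens):
    """Summarise one eval profile from its target and predicted first tokens."""
    targets = set(target_tokens)
    predicted = set(predicted_tokens)
    coverage = len(targets & predicted) / len(targets) if targets else 0.0
    diversity = len(predicted)
    return {
        "coverage": coverage,
        "predicted_diversity": diversity,
        "collapsed": diversity <= 1,
    }


# Toy profile: four distinct target tokens, predictions mostly stuck on "n".
metrics = profile_metrics(
    target_tokens=["t", "s", "c", "p"],
    predicted_tokens=["n", "t", "n", "n"],
)
print(metrics)
```

The toy inputs are chosen so the result has coverage 0.25 and predicted diversity 2, the same shape as the learning figures above.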
v0.105.0 adds corpus-only retrieval memory as a separate evidence rail before
weight consolidation. The diagnostic screen in
runs/transformer-answer-v0.105.0-retrieval-memory-owner-paraphrase-frontier-profile-scale-step1-dim4-context80/
writes retrieval_memory_report.json, builds 497 memory cards from the
closed-world corpus, answers 219/219 eval probes exactly, and records
uses_external_model: false, external_embeddings: false,
pretrained_retriever: false, and updates_weights: false. Promotion still
rejects the neural transformer on branch_diversity_target; v0.105.0
therefore proves immediate memory serving, not completed weight learning.
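As a rough illustration of a corpus-only retrieval rail, the sketch below builds exact-match memory cards from prompt/answer pairs and scores eval probes without touching any model weights. The four boolean report keys come from the text above; the other keys, the whitespace and case normalisation, and the toy corpus are assumptions.

```python
import json


def normalize(prompt: str) -> str:
    """Lower-case and collapse whitespace so trivial variants share a key."""
    return " ".join(prompt.lower().split())


def build_memory_cards(corpus):
    """Map each normalised prompt to its first recorded answer."""
    cards = {}
    for prompt, answer in corpus:
        cards.setdefault(normalize(prompt), answer)
    return cards


def retrieval_report(cards, probes):
    """Count exact answers served from memory; no weights are read or updated."""
    exact = sum(
        1 for prompt, expected in probes
        if cards.get(normalize(prompt)) == expected
    )
    return {
        "memory_cards": len(cards),    # hypothetical key
        "exact_answers": exact,        # hypothetical key
        "eval_probes": len(probes),    # hypothetical key
        "uses_external_model": False,
        "external_embeddings": False,
        "pretrained_retriever": False,
        "updates_weights": False,
    }


# Toy closed-world corpus and probes, invented for this example.
corpus = [
    ("What is the project name?", "QuarkLM"),
    ("Who approves promotion?", "the promotion gate"),
]
probes = [
    ("what is the  project name?", "QuarkLM"),
    ("Who approves promotion?", "the promotion gate"),
]
report = retrieval_report(build_memory_cards(corpus), probes)
print(json.dumps(report, indent=2))
```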
v0.106.0 adds memory-guided consolidation planning. The diagnostic screen in
runs/transformer-answer-v0.106.0-memory-guided-consolidation-owner-paraphrase-frontier-profile-scale-step1-dim4-context80/
writes memory_consolidation_plan.json, keeps retrieval at 219/219, records
9 memory-backed neural failed profiles, and ranks owner, paraphrases,
glossary, admission_paraphrases, and admissions as the top consolidation
priorities. The collapsed memory-backed profiles are owner, paraphrases,
and glossary; promotion still rejects the neural transformer on
branch_diversity_target.
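One way to picture the planning step is as a ranking over profiles that memory answers exactly but the neural model still fails. The ordering rule below (collapsed profiles first, then lowest coverage) is an assumption, as are the field names; the profile names and numbers are placeholders rather than values from the run.

```python
def rank_consolidation_targets(profiles):
    """Rank memory-backed neural failures: collapsed first, then lowest coverage."""
    failed = [p for p in profiles if p["memory_exact"] and not p["neural_pass"]]
    return sorted(
        failed,
        key=lambda p: (not p["collapsed"], p["coverage"], p["name"]),
    )


# Placeholder profiles with invented metrics.
profiles = [
    {"name": "alpha", "memory_exact": True, "neural_pass": False, "collapsed": False, "coverage": 0.5},
    {"name": "beta", "memory_exact": True, "neural_pass": False, "collapsed": True, "coverage": 0.25},
    {"name": "gamma", "memory_exact": True, "neural_pass": False, "collapsed": True, "coverage": 0.0},
    {"name": "delta", "memory_exact": True, "neural_pass": True, "collapsed": False, "coverage": 1.0},
]
priorities = [p["name"] for p in rank_consolidation_targets(profiles)]
print(priorities)
```

Profiles the neural model already passes drop out of the plan, so only memory-backed failures compete for consolidation priority.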
v0.107.0 adds gated memory-consolidation training. The diagnostic screen in
runs/transformer-answer-v0.107.0-gated-memory-consolidation-owner-paraphrase-glossary-frontier-profile-scale-step1-dim4-context80/
consumes the v0.106.0 plan, targets owner, paraphrases, and glossary,
records 26 memory-consolidation prioritized attempts with 8 acceptances and
18 rejections, and keeps retrieval at 219/219. Promotion still rejects the
neural transformer on branch_diversity_target, so this is plan-guided
weight-consolidation evidence rather than promoted model evidence.
v0.108.0 expands that consolidation window. The diagnostic screen in
runs/transformer-answer-v0.108.0-expanded-memory-consolidation-owner-paraphrase-heldout-qa-glossary-frontier-profile-scale-step1-dim4-context80/
consumes the v0.107.0 plan, targets owner, paraphrases, heldout, qa,
and glossary, maps target-only profiles back to admitted source labels, and
keeps retrieval at 219/219. Branch-diversity still blocks promotion, so the
next repair needs diversity pressure applied directly to the missing first
tokens.
v0.109.0 adds that direct missing first-token pressure. The diagnostic screen in
runs/transformer-answer-v0.109.0-missing-first-token-memory-consolidation-owner-paraphrase-heldout-qa-glossary-frontier-profile-scale-step1-dim4-context80/
consumes the v0.108.0 plan, extracts missing first-token target maps for
owner, paraphrases, heldout, qa, and glossary, and records 8
missing-token candidates, 22 attempts, 1 accepted guarded coverage-gain
update, 21 rejections, and 7 fallback acceptances. Retrieval remains exact
at 219/219; branch-diversity still blocks promotion, and the next plan
narrows the collapsed memory-backed profiles to owner, paraphrases, and
learning.
v0.110.0 makes that narrowed plan the explicit training contract. The diagnostic
screen in
runs/transformer-answer-v0.110.0-remaining-collapsed-missing-first-token-memory-consolidation-owner-paraphrase-learning-frontier-profile-scale-step1-dim4-context80/
consumes the v0.109.0 plan, requires source-plan
collapsed_memory_backed_profiles, targets only owner, paraphrases, and
learning, and records no unconsumed collapsed targets. Retrieval remains exact
at 219/219; the missing-token phase records 6 candidates, 16 attempts,
1 accepted guarded coverage-gain update, 15 rejections, and 5 fallback
acceptances. Branch-diversity still blocks promotion.
v0.111.0 makes that pressure profile-specific. The diagnostic screen in
runs/transformer-answer-v0.111.0-profile-specific-missing-first-token-memory-consolidation-owner-paraphrase-learning-frontier-profile-scale-step1-dim4-context80/
consumes the v0.110.0 plan, keeps targets owner, paraphrases, and
learning, and records the target map learning -> learning,
owner -> owner/paraphrases, and color/place/training_data -> paraphrases.
Retrieval remains exact at 219/219; memory-prioritized consolidation records
16 attempts with 6 acceptances and 10 rejections, and the missing-token
phase records 6 candidates, 18 attempts, 0 direct missing-token
acceptances, 18 rejections, and 6 fallbacks. The guard records 1
accepted profile-specific update shape, but branch-diversity still blocks
promotion.
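The recorded target map can be written as a small data structure and checked against the plan's targets. This sketch reads "color/place/training_data -> paraphrases" as three source labels that each map to paraphrases, which is an interpretation; the variable and function names are hypothetical.

```python
# Source label -> target profiles, following the map recorded for v0.111.0.
TARGET_MAP = {
    "learning": ["learning"],
    "owner": ["owner", "paraphrases"],
    "color": ["paraphrases"],
    "place": ["paraphrases"],
    "training_data": ["paraphrases"],
}
PLAN_TARGETS = {"owner", "paraphrases", "learning"}


def sources_by_target(target_map):
    """Invert the map to see which source labels feed each target profile."""
    inverted = {}
    for source, targets in target_map.items():
        for target in targets:
            inverted.setdefault(target, []).append(source)
    return inverted


def unconsumed_targets(target_map, plan_targets):
    """Plan targets that no source label maps to."""
    covered = {t for targets in target_map.values() for t in targets}
    return sorted(plan_targets - covered)


feeds = sources_by_target(TARGET_MAP)
print(feeds["paraphrases"], unconsumed_targets(TARGET_MAP, PLAN_TARGETS))
```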
v0.112.0 adds branch-diversity root-cause diagnostics before another repair
objective. The diagnostic screen in
runs/transformer-answer-v0.112.0-branch-diversity-root-cause-profile-specific-memory-consolidation-step1-dim4-context80/
consumes the v0.111.0 plan, targets owner, paraphrases, and glossary,
keeps retrieval exact at 219/219, records 24 profile-specific
missing-token attempts with 0 direct acceptances and 8 fallbacks, and
classifies the final branch-diversity failure as a critical
target_routing_gap. The root-cause report records 9/9 failed profiles,
3 collapsed profiles, 1 zero-coverage profile, 6 buried-target profiles,
and reused dominant tokens "n" and "a". Branch-diversity still blocks
promotion.
v0.113.0 adds branch routing audit diagnostics to the same branch-only screen
surface. The diagnostic screen in
runs/transformer-answer-v0.113.0-branch-routing-audit-profile-specific-memory-consolidation-step1-dim4-context80/
consumes the v0.112.0 plan, targets owner, paraphrases, and learning,
keeps retrieval exact at 219/219, records 18 profile-specific
missing-token attempts with 0 direct acceptances and 6 fallbacks, and
keeps branch-diversity as the blocker. The routing audit reports high
output-bias escape risk ("n" bias rank 2), low representation separation
across 9/9 multi-target profiles, and a glossary target-imbalance hotspot.
v0.114.0 adds logit-prior and centroid-separation instrumentation to the same
screen surface. The diagnostic screen in
runs/transformer-answer-v0.114.0-logit-prior-representation-instrumentation-profile-specific-memory-consolidation-step1-dim4-context80/
consumes the v0.113.0 plan, targets owner, paraphrases, and glossary,
keeps retrieval exact at 219/219, records 24 profile-specific
missing-token attempts with 0 direct acceptances and 8 fallbacks, and keeps
branch-diversity as the blocker. The new logit-prior profiles report
hidden-projection pressure across 9/9 multi-target profiles, while centroid
separation remains poor.
v0.115.0 adds a bias-frozen hidden-projection margin candidate. The candidate
screen in
runs/transformer-answer-v0.115.0-hidden-projection-margin-candidate-step1-dim4-context80/
introduces branch-hidden-projection-margin-unlikelihood and tests one
direct-answer step that compares target-token hidden * output_weight
contributions directly. It lowers average collapsed-token hidden advantage from
about 0.0842 to 0.0736, but promotion remains blocked before quality
metrics: 10/11 constraints pass, branch_diversity_target fails, all 9/9
multi-target profiles still collapse to "n", and 2 profiles still have
zero target-token coverage.
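To make the hidden * output_weight comparison concrete, the sketch below separates the hidden-projection part of a logit from the output bias and computes how far the collapsed token leads the target token on that part alone. It is an illustration under assumptions, not the branch-hidden-projection-margin-unlikelihood objective itself: the hinge-style penalty, the margin value, and all vectors are invented.

```python
def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def hidden_contribution(hidden, output_weight, token):
    """Logit share that comes from hidden * output_weight, with bias excluded."""
    return dot(hidden, output_weight[token])


def collapsed_hidden_advantage(hidden, output_weight, collapsed, target):
    """How far the collapsed token leads the target on hidden projection alone."""
    return (
        hidden_contribution(hidden, output_weight, collapsed)
        - hidden_contribution(hidden, output_weight, target)
    )


def margin_penalty(advantage, margin=0.1):
    """Hinge-style penalty (assumed form): zero once the target leads by `margin`."""
    return max(0.0, advantage + margin)


# Toy values, invented for this example.
hidden = [0.5, -0.25, 1.0, 0.0]
output_weight = {"n": [0.2, 0.0, 0.1, 0.3], "a": [0.1, 0.4, 0.2, 0.0]}
output_bias = {"n": 1.0, "a": 0.0}  # frozen: part of the logit, not of the margin

logits = {
    token: hidden_contribution(hidden, output_weight, token) + output_bias[token]
    for token in output_weight
}
advantage = collapsed_hidden_advantage(hidden, output_weight, collapsed="n", target="a")
print(max(logits, key=logits.get), round(advantage, 4), round(margin_penalty(advantage), 4))
```

Keeping the bias out of the margin means a step on this penalty can only move the hidden projection, which is the quantity the reported hidden-advantage averages measure.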
The v0.42 self-improvement run passed:
- direct admission-probe audit
- admission-paraphrase audit
- glossary-probe audit
- exact eval audit
- promotion gate
- forgetting audit against v0.41
- protected prompt leakage audit
- responder exact evals
- learned answer classifier exact evals
- generative answer decoder exact evals
- rule-based self-diagnosis with no external model
Admission probes now pass 48/48 direct records and 84/84 paraphrase records.
Glossary probes now pass 38/38 records. The passing attempt is archived at
runs/self-improve-v0.42/attempts/attempt-001/ before the top-level latest
report is updated.
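The pass list and probe counts above behave like an all-must-pass gate, which can be sketched as follows. The check names and the report shape are hypothetical; the probe counts are the v0.42 figures, and the blocked example uses placeholder constraint names to show how a single failing constraint such as branch_diversity_target yields a 10/11 rejection.

```python
def probes_pass(passed: int, total: int) -> bool:
    """A probe audit passes only when every record passes."""
    return total > 0 and passed == total


def promotion_gate(checks: dict) -> dict:
    """Promote only when no check fails; report which checks blocked."""
    failed = sorted(name for name, ok in checks.items() if not ok)
    return {
        "promote": not failed,
        "passed": len(checks) - len(failed),
        "total": len(checks),
        "failed": failed,
    }


v042 = promotion_gate({
    "admission_direct": probes_pass(48, 48),
    "admission_paraphrase": probes_pass(84, 84),
    "glossary": probes_pass(38, 38),
})

# Ten placeholder constraints pass; one named constraint fails.
blocked_checks = {f"placeholder_constraint_{i:02d}": True for i in range(1, 11)}
blocked_checks["branch_diversity_target"] = False
blocked = promotion_gate(blocked_checks)
print(v042["promote"], blocked["passed"], blocked["total"], blocked["failed"])
```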
v0.23 added attempt archives. A deliberately undertrained attempt failed and is
archived at runs/self-improve-v0.23/attempts/attempt-001/, and the repaired
passing attempt remains at runs/self-improve-v0.23/attempts/attempt-002/.
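A minimal sketch of the archive-then-update order follows. The attempts/attempt-NNN/ layout matches the paths above; the file names `report.json` and `latest_report.json`, the function names, and the demo run directory are hypothetical. This sketch refreshes the top-level report after every attempt, whereas a real pipeline might do so only for passing attempts.

```python
import json
import tempfile
from pathlib import Path


def next_attempt_dir(run_dir: Path) -> Path:
    """Pick the next attempts/attempt-NNN/ directory under a run."""
    attempts = run_dir / "attempts"
    attempts.mkdir(parents=True, exist_ok=True)
    existing = [
        int(p.name.split("-")[1])
        for p in attempts.glob("attempt-*")
        if p.is_dir()
    ]
    return attempts / f"attempt-{max(existing, default=0) + 1:03d}"


def record_attempt(run_dir: Path, report: dict) -> Path:
    """Archive the attempt first, then refresh the top-level report."""
    attempt_dir = next_attempt_dir(run_dir)
    attempt_dir.mkdir()
    payload = json.dumps(report, indent=2)
    (attempt_dir / "report.json").write_text(payload)
    (run_dir / "latest_report.json").write_text(payload)
    return attempt_dir


with tempfile.TemporaryDirectory() as tmp:
    run_dir = Path(tmp) / "self-improve-demo"
    first = record_attempt(run_dir, {"passed": False})
    second = record_attempt(run_dir, {"passed": True})
    latest = json.loads((run_dir / "latest_report.json").read_text())
    print(first.name, second.name, latest["passed"])
```

Because each attempt gets its own directory, a failed attempt stays on disk next to the repaired one instead of being overwritten.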