Corpus
The current launch-family corpus is the v16 dgen1-r5-synth-300k corpus. It is
generated from structured BOTCOIN challenge worlds, entities, relations,
temporal updates, traps, and hard negatives. The generation path creates
retrieval-evaluation records from structured challenge ingredients.
Each record carries:
| Field | Purpose |
|---|---|
| Query text | The retrieval task |
| Truth documents | Answer-bearing memory documents |
| Hard negatives | Plausible documents that should rank below truth |
| Graded qrels | Relevance labels for nDCG@10, MRR, recall, and audit metrics |
| Split | train_visible, calibration, eval_hidden, or canary |
| Public intent metadata | Temporal, relation, evidence, conflict, scope, entity, and abstention hints available to all miners |
| Embeddings | Bundle-layout-compatible BGE-M3 query and document vectors |
| Provenance | Source domain, seed, generator path, and deterministic roots |
The production qrel path uses synthesizer-category labels. The generator knows why a negative exists, such as stale fact, entity swap, relation neighbor, attribute swap, lexical distractor, or unrelated filler. The bundle maps those categories into graded relevance, so production corpus growth avoids a heavier relabeling pass over every pair. Larger audit rerankers remain useful for calibration checks.
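As an illustration of the category-to-grade mapping, here is a minimal sketch. The numeric grades, the snake_case category keys, and the function names are assumptions made for the example; the source names the categories but does not publish the grade table or the exact nDCG gain formula.

```python
import math

# Hypothetical grade table: category names follow the text above,
# but the numeric grades are illustrative assumptions.
CATEGORY_GRADE = {
    "truth": 3,
    "stale_fact": 1,
    "relation_neighbor": 1,
    "entity_swap": 0,
    "attribute_swap": 0,
    "lexical_distractor": 0,
    "unrelated_filler": 0,
}


def graded_qrels(doc_categories):
    """Map {doc_id: synthesizer category} to {doc_id: relevance grade}."""
    return {doc_id: CATEGORY_GRADE[cat] for doc_id, cat in doc_categories.items()}


def dcg(grades):
    """Discounted cumulative gain with the common exponential gain 2^g - 1."""
    return sum((2 ** g - 1) / math.log2(i + 2) for i, g in enumerate(grades))


def ndcg_at_k(ranking, qrels, k=10):
    """nDCG@k of a ranked list of doc ids against graded qrels."""
    gains = [qrels.get(doc_id, 0) for doc_id in ranking[:k]]
    ideal = sorted(qrels.values(), reverse=True)[:k]
    best = dcg(ideal)
    return dcg(gains) / best if best > 0 else 0.0
```

Because the grade comes from the generator's own label, no reranker has to score each query-document pair before a metric such as nDCG@10 can be computed.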
Corpus growth is published through signed deltas. Validators retain the launch
base corpus and can reconstruct historical corpus roots by walking the signed
delta chain forward. A manual --corpus-for-root 0x...=path shortcut exists
for operators, while the normal validator path auto-resolves and verifies
historical roots before post-reveal rescoring.
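The forward walk over the delta chain can be sketched as follows. This is not the validator implementation: the root derivation (SHA-256 over the previous root plus the delta payload), the HMAC stand-in for the real signature scheme, and every function name are assumptions chosen to keep the example self-contained.

```python
import hashlib
import hmac


def next_root(prev_root: bytes, payload: bytes) -> bytes:
    """Hypothetical root derivation: hash the previous root with the delta."""
    return hashlib.sha256(prev_root + payload).digest()


def sign_delta(key: bytes, prev_root: bytes, payload: bytes) -> bytes:
    """Stand-in signature (HMAC); the real scheme is not described in the source."""
    return hmac.new(key, prev_root + payload, hashlib.sha256).digest()


def resolve_root(base_root, deltas, target_root, key):
    """Walk the chain forward from the launch base root.

    `deltas` is an ordered list of (payload, signature) pairs. Returns the
    payloads that must be applied to the base corpus to reach `target_root`.
    """
    root, applied = base_root, []
    if root == target_root:
        return applied
    for payload, signature in deltas:
        expected = sign_delta(key, root, payload)
        if not hmac.compare_digest(signature, expected):
            raise ValueError("delta signature does not verify")
        root = next_root(root, payload)
        applied.append(payload)
        if root == target_root:
            return applied
    raise LookupError("target root not found in the delta chain")
```

Each signature is verified before its delta is applied, so a tampered delta stops the walk instead of producing a root that merely fails to match.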
Miner-facing corpus access is coordinator-proxied. Start from:
| Method | Path | Purpose |
|---|---|---|
| GET | /coretex/public-corpus/manifest | Public corpus manifest, model IDs, corpus root, paging limits, split policy, and endpoint templates |
| GET | /coretex/public-corpus/events?offset=N&limit=M | Paged public visible events and public qrels |
| GET | /coretex/public-corpus/events?offset=N&limit=M&includeEmbeddings=true | Same event page with canonical public embedding hex; lower page limit |
| GET | /coretex/public-corpus/event/:eventId | One public visible event by id |
| GET | /coretex/public-corpus/entities?offset=N&limit=M | Paged public entity table |
| GET | /coretex/public-corpus/family-summary | Query-family counts and bounded representative public examples |
| GET | /coretex/public-corpus/relation-summary | Public relation edge-type counts and bounded representative public examples |
| GET | /coretex/public-corpus/query-examples?surface=...&family=...&relation=... | Bounded public examples filtered by intended surface, family, or relation |
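A paging loop over the events endpoint might look like the sketch below. The path and the `offset`/`limit` parameters come from the table above; the response shape (a JSON object with an `events` list), the coordinator base URL, and the helper names are assumptions, so check the manifest for the real paging limits and field names.

```python
import json
from urllib.parse import urlencode
from urllib.request import urlopen

EVENTS_PATH = "/coretex/public-corpus/events"


def http_get_json(base_url, path, params):
    """Fetch one JSON document from the coordinator proxy."""
    url = f"{base_url}{path}?{urlencode(params)}"
    with urlopen(url) as resp:
        return json.load(resp)


def iter_events(base_url, limit=100, get_json=http_get_json):
    """Yield public events page by page until an empty page is returned.

    Assumes each page is a JSON object with an "events" list (hypothetical).
    """
    offset = 0
    while True:
        page = get_json(base_url, EVENTS_PATH, {"offset": offset, "limit": limit})
        events = page.get("events", [])
        if not events:
            return
        yield from events
        # Advance by what was actually returned, in case the server
        # enforces a lower page limit than the one requested.
        offset += len(events)
```

The loop stops on an empty page rather than on a short page, because a short page can simply mean the server capped the limit, as it does when `includeEmbeddings=true` is set.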
The public corpus proxy serves unprotected train_visible rows. Calibration,
hidden eval, canary, protected, or nonexistent event IDs are not served through
these endpoints. Use the coordinator proxy for miner research even when S3
artifact links are also advertised, because bucket ACLs and publication timing
can differ from the miner-facing corpus API.
/coretex/schema advertises the current public artifact base URL, S3 URL
templates, public corpus links, and eval-report URL template under
publicArtifacts. Validators use those artifact URLs to hydrate launch files,
historical corpus material, signing keys, and post-reveal evaluation reports.
Useful publicArtifacts fields:
| Field | Purpose |
|---|---|
| artifactBaseUrl | Base URL for launch-family public artifacts |
| epochSigningPublicKeyUrl | Public key material for verifying epoch-signed artifacts |
| evalReportUrlTemplate | Template for post-reveal eval reports by artifact hash |
| publicCorpus | Coordinator-proxied manifest, event, entity, family, relation, and query-example endpoints |
| s3Hints | Operator notes about which artifacts are best fetched through S3 versus coordinator proxy |
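A client can check that a fetched schema document advertises these fields before relying on them. The sketch below assumes the `/coretex/schema` response is a JSON object with a top-level `publicArtifacts` key; the function name is hypothetical, and no assumption is made about the value types or the template placeholder syntax.

```python
# Field names taken from the table above.
EXPECTED_FIELDS = (
    "artifactBaseUrl",
    "epochSigningPublicKeyUrl",
    "evalReportUrlTemplate",
    "publicCorpus",
    "s3Hints",
)


def missing_public_artifact_fields(schema):
    """Return the expected publicArtifacts fields absent from a schema dict."""
    artifacts = schema.get("publicArtifacts") or {}
    return [name for name in EXPECTED_FIELDS if name not in artifacts]
```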
Corpus evolution is part of the memory model. New information enters, older information becomes stale, conflicts appear, retired hidden tasks leave the scoring pool, and new hidden tasks are added. Each evolve event is calibrated against its own corpus root, query pack, baseline, and pinned scorer context.
The calibration path also checks whether useful substrate changes generalize across corpus generations. A representative test starts with substrate design A on corpus/query pack A, evolves into corpus/query pack B, accepts a miner patch that beats the newly calibrated baseline, then backtests the resulting substrate design B against the pre-evolve corpus/query pack A. When design B preserves pre-evolve performance while improving the evolved context, the result shows a retrieval-routing improvement rather than corpus-specific indexing churn.
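The backtest decision can be expressed as a small check. This is a sketch under stated assumptions: a single scalar score per (design, context) pair, a `tolerance` parameter for acceptable pre-evolve regression, and the three outcome labels are all hypothetical and not part of the described calibration path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DesignScores:
    """Scores of one substrate design in both contexts (higher is better)."""
    context_a: float  # pre-evolve corpus/query pack A
    context_b: float  # evolved corpus/query pack B


def classify_patch(design_a, design_b, baseline_b, tolerance=0.0):
    """Classify a miner patch that turned design A into design B.

    `baseline_b` is the newly calibrated baseline on the evolved context.
    """
    improves_b = design_b.context_b > baseline_b
    preserves_a = design_b.context_a >= design_a.context_a - tolerance
    if improves_b and preserves_a:
        return "generalizing"      # retrieval-routing improvement
    if improves_b:
        return "corpus-specific"   # gain on B bought with a regression on A
    return "rejected"              # does not beat the evolved baseline
```

For example, a design B that scores 0.71 on context A (versus 0.70 for design A) and 0.66 on context B (versus a baseline of 0.62) would be classified as generalizing under these assumptions.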