# How Evaluation Works
Evaluation asks a simple question:
Does this patch make the current CoreTexState better than the live state it is based on?
The exact flow is:
- CoreTex loads the current 32 KB parent state.
- CoreTex checks that the patch's `parentStateRoot` equals the current live state root.
- CoreTex decodes the patch and confirms it changes only 1-4 allowed words.
- CoreTex scores the current state against CoreTexBench V0.
- CoreTex applies the patch to produce a candidate state.
- CoreTex checks that the candidate state is structurally valid.
- CoreTex scores the candidate state against the same corpus.
- CoreTex computes `scoreDelta = candidateScore - baselineScore`.
- If `scoreDelta > threshold`, the patch can advance the live state and earn credits.
- If not, the patch is rejected with a stable error code and earns nothing.
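The final settlement step of this flow can be sketched as follows. Only the `scoreDelta = candidateScore - baselineScore` computation and the `scoreDelta > threshold` rule come from the flow above; the result shape and the error code strings are illustrative assumptions, not CoreTex's actual implementation.

```typescript
// Sketch of the accept/reject decision at the end of the evaluation flow.
// ERR_* strings are hypothetical; the text only says error codes are stable.
type EvalOutcome =
  | { accepted: true; scoreDelta: number }
  | { accepted: false; errorCode: string };

function settlePatch(
  patchParentRoot: string,
  liveStateRoot: string,
  baselineScore: number,
  candidateScore: number,
  threshold: number
): EvalOutcome {
  // Reject patches built against a stale parent state.
  if (patchParentRoot !== liveStateRoot) {
    return { accepted: false, errorCode: "ERR_STALE_PARENT" };
  }
  const scoreDelta = candidateScore - baselineScore;
  // Only a strict improvement over the threshold advances the live state.
  if (scoreDelta > threshold) {
    return { accepted: true, scoreDelta };
  }
  return { accepted: false, errorCode: "ERR_NO_IMPROVEMENT" };
}
```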
## Score Formula
The V0 score is:

```
S = 0.30 * exact_retrieval
  + 0.15 * stale_memory_rejection
  + 0.15 * temporal_update_correctness
  + 0.30 * compression_survival
  + 0.05 * routing_accuracy
  - latency_penalty
```
Latency penalty starts after 10 ms and reaches its cap by 50 ms.
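The formula and the latency ramp can be written out directly. The weights come from the score formula above; the linear shape of the ramp between 10 ms and 50 ms and the cap value of 0.05 are assumptions, since the text states only where the penalty starts and where it saturates.

```typescript
// Sketch of the V0 score. LATENCY_MAX_PENALTY = 0.05 is an assumed cap;
// the text does not state the cap's magnitude.
interface ComponentScores {
  exactRetrieval: number;            // each component in [0, 1]
  staleMemoryRejection: number;
  temporalUpdateCorrectness: number;
  compressionSurvival: number;
  routingAccuracy: number;
}

const LATENCY_FREE_MS = 10;        // penalty starts after 10 ms
const LATENCY_CAP_MS = 50;         // penalty fully applied by 50 ms
const LATENCY_MAX_PENALTY = 0.05;  // assumed cap value

function latencyPenalty(latencyMs: number): number {
  // Linear ramp from 0 at 10 ms to the cap at 50 ms, clamped on both ends.
  const t = (latencyMs - LATENCY_FREE_MS) / (LATENCY_CAP_MS - LATENCY_FREE_MS);
  return LATENCY_MAX_PENALTY * Math.min(1, Math.max(0, t));
}

function v0Score(c: ComponentScores, latencyMs: number): number {
  return (
    0.30 * c.exactRetrieval +
    0.15 * c.staleMemoryRejection +
    0.15 * c.temporalUpdateCorrectness +
    0.30 * c.compressionSurvival +
    0.05 * c.routingAccuracy -
    latencyPenalty(latencyMs)
  );
}
```

Note that the component weights sum to 0.95, so even a perfect, fast state scores below 1.0.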
## What Each Component Checks
| Component | Weight | Concrete Check |
|---|---|---|
| Exact retrieval | 0.30 | Does a near-collision event's key ID appear in an active retrieval-key slot? |
| Stale memory rejection | 0.15 | Does a stale temporal event appear as a revoked memory object? |
| Temporal update correctness | 0.15 | Does a current temporal event appear as active and not revoked? |
| Compression survival | 0.30 | Does a long-horizon event survive in an active memory-index slot? |
| Routing accuracy | 0.05 | Are relation/routing entries populated with non-zero routing weights? |
In concrete terms, the scorer derives deterministic 128-bit IDs from corpus event IDs. It then checks whether those IDs appear in the right typed regions of the 1024-word state.
For example:

```
near-collision event id -> retrieval key id  -> active RetrievalKeys slot
long-horizon event id   -> memory object id  -> active MemoryIndex slot
stale temporal event id -> memory object id  -> active MemoryIndex slot with REVOKED set
current temporal id     -> memory object id  -> active MemoryIndex slot with REVOKED clear
```
A patch is an improvement when it makes more of these checks true, without breaking structural validity or protected regression checks.
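The ID derivation step can be sketched as follows. The text specifies only that the IDs are deterministic and 128 bits wide; the choice of SHA-256 truncated to 16 bytes here is an assumption for illustration, not CoreTex's actual derivation.

```typescript
import { createHash } from "node:crypto";

// Hypothetical 128-bit ID derivation: hash the corpus event ID and keep
// the first 16 bytes (32 hex characters). Deterministic by construction.
function eventId128(corpusEventId: string): string {
  return createHash("sha256").update(corpusEventId).digest("hex").slice(0, 32);
}
```

Because the derivation is deterministic, the scorer can recompute every expected ID from the corpus alone and simply scan the typed regions of the state for matches.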
## Example Improvement
Suppose the current state has no slot for long-horizon event lh-0042.
A miner proposes a patch that fills one MemoryIndex slot with:
- `EVENT_ID = id(lh-0042)`
- `OBJ_TYPE = MEMORY_EVENT`
- `VALIDITY_FLAGS = active`
- `CHECKSUM = checksum(lh-0042)`
- Corpus metadata and payload words
CoreTex applies the patch and rescans the corpus. If lh-0042 is now counted as a compression-survival hit, the candidate state's score increases. If the increase is positive and the patch is otherwise valid, the miner can receive a screener-pass work receipt. If the patch also advances the live root and passes the model no-regression gate, it receives heavier state-advance work credit.
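The slot fill above can be modeled as a small record. The field names come from the example; the TypeScript types, the flag layout, and the `id()`/`checksum()` stand-ins are assumptions, since the real state packs these fields into fixed-width words of the 1024-word layout.

```typescript
// Hypothetical model of one MemoryIndex slot. Field names are from the
// example; their encodings are assumed for illustration.
interface MemoryIndexSlot {
  eventId: string;                               // EVENT_ID
  objType: "MEMORY_EVENT";                       // OBJ_TYPE
  flags: { active: boolean; revoked: boolean };  // VALIDITY_FLAGS
  checksum: string;                              // CHECKSUM
}

// Fill a slot the way the lh-0042 patch does: active and not revoked.
function fillSlot(eventId: string, checksum: string): MemoryIndexSlot {
  return {
    eventId,
    objType: "MEMORY_EVENT",
    flags: { active: true, revoked: false },
    checksum,
  };
}
```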
## Example Rejection
A patch is rejected if:
- It points at an old parent root
- It changes more than 4 words
- It writes reserved bits
- It is a no-op
- It improves a public-looking shard but fails protected regression checks
- It does not improve the current live state
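The structural rejections in this list can be sketched as a single check that returns the first failing condition. The `ERR_*` strings are illustrative; the text says only that rejection codes are stable. Score-based rejections (protected regressions, no improvement) need the full scorer and are omitted here.

```typescript
// Sketch of the structural pre-checks from the rejection list above.
interface PatchInfo {
  parentStateRoot: string;
  changedWordIndices: number[];
  writesReservedBits: boolean;
  isNoOp: boolean;
}

function rejectionCode(patch: PatchInfo, liveStateRoot: string): string | null {
  // Old parent root: the patch was built against a stale state.
  if (patch.parentStateRoot !== liveStateRoot) return "ERR_STALE_PARENT";
  // More than 4 distinct words exceeds the V0 patch budget.
  const changed = new Set(patch.changedWordIndices).size;
  if (changed > 4) return "ERR_BUDGET_EXCEEDED";
  // Reserved bits must never be written.
  if (patch.writesReservedBits) return "ERR_RESERVED_BITS";
  // A patch that changes nothing earns nothing.
  if (patch.isNoOp || changed === 0) return "ERR_NOOP";
  return null; // structurally acceptable; scoring gates still apply
}
```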
The important rule is that a miner does not earn credits for raw "suggestions." They earn 1x for qualified, auditable screener passes and materially more for state advances that CoreTex measures as real improvements.
## Local Model-Assisted Eval For Elevated Proposals
The structural scorer proves that the state contains the right compact memory handles. That is necessary for deterministic settlement, but it is not the whole story. To test whether the memory would actually help a model, production CoreTex also runs a local model-assisted evaluator unless explicitly disabled for non-reward drills.
When `CORETEX_LOCAL_MODEL_EVAL` is not `0`, an elevated patch must pass two checks:
- Deterministic structural check — Did the 1024-word state encode better memory handles?
- Local model retrieval check — Given the actual benchmark query text, does a small open-weight local model rank the right memory text at least as well after the patch, with no model-facing regression?
The default model path is:
`Xenova/multi-qa-MiniLM-L6-cos-v1`
This is intentionally an embedding/retrieval model rather than a generative chatbot. It is much smaller and faster than even a small LLM, and it tests the exact thing CoreTex is responsible for: whether the memory index causes the right memory content to be retrievable for a query. The model runs locally through `@huggingface/transformers`; no closed API is called.
Example:
Query:
What is the complete final shopping list across all sessions?
Candidate memory texts available from CoreTexState:
- Milk, eggs, bread, apples
- Cheese and yogurt
- Milk, eggs, bread, apples, cheese, yogurt
Local model check:
Embed the query and candidate memory texts, rank candidates by similarity,
and count the task as a hit only if the correct cumulative memory ranks first.
For temporal memory, stale facts must not be surfaced as active candidates. If a stale memory is correctly revoked and the current memory is retrieved instead, the temporal score improves.
This model-assisted evaluator is still local and open-weight. It does not call an external API and does not depend on a closed frontier model. The deterministic structural scorer remains the consensus-safe base layer; the local model scorer is the empirical layer that shows signal for downstream agent usefulness.
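The rank-first check can be made concrete with the shopping-list example. The real gate embeds text with `Xenova/multi-qa-MiniLM-L6-cos-v1` via `@huggingface/transformers`; the trivial bag-of-words embedding below is a stand-in so the ranking logic is visible and runnable without model weights, and it only catches lexical overlap, not the semantic matches the real model handles.

```typescript
// Toy sketch of the local model ranking check: embed query and candidates,
// rank by cosine similarity, count a hit only if the correct candidate
// ranks first. Bag-of-words replaces the real MiniLM embeddings here.
function tokenize(text: string): string[] {
  return text.toLowerCase().split(/\W+/).filter(Boolean);
}

function bow(tokens: string[], vocab: string[]): number[] {
  return vocab.map(w => tokens.filter(t => t === w).length);
}

function cosine(a: number[], b: number[]): number {
  let dot = 0, na = 0, nb = 0;
  for (let i = 0; i < a.length; i++) {
    dot += a[i] * b[i];
    na += a[i] * a[i];
    nb += b[i] * b[i];
  }
  const denom = Math.sqrt(na) * Math.sqrt(nb);
  return denom === 0 ? 0 : dot / denom;
}

function topRankedIndex(query: string, candidates: string[]): number {
  // Build a shared vocabulary over the query and all candidates.
  const vocabSet = new Set<string>();
  for (const t of tokenize(query)) vocabSet.add(t);
  for (const c of candidates) for (const t of tokenize(c)) vocabSet.add(t);
  const vocab = Array.from(vocabSet);
  const q = bow(tokenize(query), vocab);
  let best = 0;
  let bestSim = -Infinity;
  candidates.forEach((c, i) => {
    const sim = cosine(q, bow(tokenize(c), vocab));
    if (sim > bestSim) { bestSim = sim; best = i; }
  });
  return best;
}
```

With the three candidate memory texts from the example, a query covering all six items ranks the cumulative list first, which is exactly the hit condition the gate counts.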
Operator flags:
```
CORETEX_LOCAL_MODEL_EVAL=1
CORETEX_LOCAL_MODEL=Xenova/multi-qa-MiniLM-L6-cos-v1
CORETEX_LOCAL_MODEL_MIN_DELTA=0
CORETEX_LOCAL_MODEL_PREWARM=1
```
`CORETEX_LOCAL_MODEL_PREWARM=1` embeds the corpus query and memory texts once so warm proposal checks only embed new or changed candidates.
In production this gate is on by default. A patch can receive credits only if the deterministic structural score improves and the local model gate shows no regression. Equality is acceptable for the model gate by default because the structural score already has to improve; operators can raise `CORETEX_LOCAL_MODEL_MIN_DELTA` if testnet data shows the model gate should require a positive model delta too.
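The combined rule reduces to a two-condition gate. The `minModelDelta` default of 0 mirrors the default `CORETEX_LOCAL_MODEL_MIN_DELTA=0`; wiring the parameter to the environment variable is left out of this sketch.

```typescript
// The two-gate credit rule: the structural score must strictly improve,
// and the model delta must meet the configured minimum (default 0, so
// model-gate equality passes).
function passesBothGates(
  structuralDelta: number,
  modelDelta: number,
  minModelDelta: number = 0
): boolean {
  return structuralDelta > 0 && modelDelta >= minModelDelta;
}
```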
## Patch Budget
V0 keeps the patch budget at 1-4 words. This is conservative, but it protects the system:
- Small patches make credit attribution clear.
- Small patches keep calldata and replay cheap.
- Small patches reduce hidden regressions inside large bundles.
- Small patches force miners to expose improvements incrementally.
- Mid-epoch live state advances mean miners can still land many improvements during the same 24-hour epoch.
Opening the budget too early would increase the search space for gaming and make it harder to tell which part of a patch actually improved memory. Larger macro-patches should be introduced only after testnet data shows the local model gate remains reliable at larger patch sizes.
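The budget rule itself is a one-line check over the set of touched words. Representing a patch as a list of word indices is an assumption of this sketch; the 1024-word bound follows from the state layout described earlier.

```typescript
// V0 budget check: a patch must touch at least 1 and at most 4 distinct
// words, all inside the 1024-word state.
const STATE_WORDS = 1024;

function withinPatchBudget(changedWordIndices: number[]): boolean {
  const distinct = new Set(changedWordIndices);
  return (
    distinct.size >= 1 &&
    distinct.size <= 4 &&
    Array.from(distinct).every(
      i => Number.isInteger(i) && i >= 0 && i < STATE_WORDS
    )
  );
}
```

Counting distinct indices matters: writing the same word twice in one patch still consumes only one unit of budget.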