Event Clustering Layers
Artery's universal event registry is built bottom-up across three clustering layers. Each layer is strictly more precise than the next; we run them in order and stop at the first that produces a parse.
┌──────────────────────────────────────────────────────────────────┐
│ Layer 0 — Structured ticker / description │
│ • HL HIP-4 description KV ("class:priceBinary|...") │
│ • Kalshi crypto ticker ("KXBTCD-26MAY0917-T68000") │
│ • Polymarket question regex ("Will Bitcoin be above $X on Y?") │
│ Precision: 100% — wins for all crypto price-binary │
│ Volume: ~80% of Kalshi+HL today, ~10% of Polymarket │
└──────────────────────────────────────────────────────────────────┘
↓ no match
┌──────────────────────────────────────────────────────────────────┐
│ Layer 1 — Entity NER (roadmap) │
│ Built-in dictionary: politicians, NFL/NBA teams, awards │
│ Match policy: longest alias wins; word-boundary aware │
│ Outcome polarity: "wins"|"loses" via verb regex │
│ Precision: ~95% on dictionary hits │
│ Volume: ~30% of Polymarket non-crypto, ~10% of Kalshi events │
└──────────────────────────────────────────────────────────────────┘
↓ no match
┌──────────────────────────────────────────────────────────────────┐
│ Layer 2 — BM25 long-tail (roadmap — opt-in) │
│ Tokenize + stop-word strip + plural stem │
│ Cross-provider top-N candidate by BM25 score │
│ Manual confirmation queue (roadmap) │
│ Precision: ~70-80% before human review │
│ Volume: catches the long tail │
└──────────────────────────────────────────────────────────────────┘
↓ no match
(drop)
Layer 0: structured
Three providers, three transports, all 100% precision:
| Provider | Source | Example |
|---|---|---|
| HL HIP-4 | description KV | `class:priceBinary\|underlying:BTC\|expiry:20260509-0600\|targetPrice:79583` |
| Kalshi crypto | ticker | KXBTCD-26MAY0917-T68000 |
| Polymarket crypto | question regex | "Will Bitcoin be above $68,000 on May 9?" + endDateIso |
Same canonical key → automatic cluster.
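As a rough illustration of two of the Layer 0 sources, the sketch below splits an HL HIP-4 description into key/value pairs and pulls the asset and strike out of a Polymarket-style question. The helper names `parseHip4Description` and `parseAboveQuestion`, the exact regex, and the returned shapes are assumptions made for this example; they are not taken from the Artery codebase, and the canonical key layout for Layer 0 is not shown because the page does not specify it.

```typescript
// Hypothetical helper: split "k1:v1|k2:v2|..." into a key/value record.
function parseHip4Description(description: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const part of description.split('|')) {
    const sep = part.indexOf(':');
    if (sep <= 0) continue; // skip segments that have no key
    fields[part.slice(0, sep).trim()] = part.slice(sep + 1).trim();
  }
  return fields;
}

interface AboveQuestion {
  asset: string;    // e.g. "Bitcoin"
  strike: number;   // e.g. 68000
  dateText: string; // e.g. "May 9" (still needs endDateIso to resolve the year)
}

// Hypothetical helper: match questions shaped like "Will <asset> be above $X on Y?".
function parseAboveQuestion(question: string): AboveQuestion | null {
  const m = /^Will (\w+) be above \$([\d,]+(?:\.\d+)?) on (.+)\?$/i.exec(question.trim());
  if (m === null) return null;
  const [, asset = '', rawStrike = '', dateText = ''] = m;
  return { asset, strike: Number(rawStrike.replace(/,/g, '')), dateText };
}
```

Both helpers return plain data or `null`, so a caller can treat `null` as "no match" and fall through to the next layer.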
Layer 1: entity NER
For non-crypto markets we use a curated entity dictionary (in `apps/api/src/events/ner/entity-dictionary.ts`).

Each entity has:

```ts
{
  slug: 'DONALD-TRUMP',
  aliases: ['donald trump', 'trump', 'djt', 'donald j. trump'],
  category: 'election',
}
```

Matching:
- Lowercase the question text
- For each dictionary entity, run a `\bAlias\b` regex
- Pick the longest matching alias (so "donald trump" beats "trump")
- Extract polarity ("wins" / "loses" / default "wins")
- Combine with `endDateIso` → canonical key: `entity|DONALD-TRUMP|wins|2024-11-05`
Same canonical key across providers → cluster.
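The matching steps can be sketched roughly as follows. This is a minimal illustration, not the code in `entity-dictionary.ts`: the function names, the `Entity` interface, the verb list used for polarity, and the choice to truncate `endDateIso` to its first ten characters are all assumptions.

```typescript
interface Entity {
  slug: string;
  aliases: string[];
  category: string;
}

const DICTIONARY: Entity[] = [
  { slug: 'DONALD-TRUMP', aliases: ['donald trump', 'trump', 'djt', 'donald j. trump'], category: 'election' },
];

// Escape regex metacharacters so aliases like "donald j. trump" match literally.
function escapeRegExp(text: string): string {
  return text.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Return the entity whose longest alias appears in the question on word boundaries.
function matchEntity(question: string, dictionary: Entity[]): { entity: Entity; alias: string } | null {
  const text = question.toLowerCase();
  let best: { entity: Entity; alias: string } | null = null;
  for (const entity of dictionary) {
    for (const alias of entity.aliases) {
      const pattern = new RegExp(`\\b${escapeRegExp(alias)}\\b`);
      if (pattern.test(text) && (best === null || alias.length > best.alias.length)) {
        best = { entity, alias };
      }
    }
  }
  return best;
}

// Assumed verb list; the page only says polarity comes from a verb regex and defaults to "wins".
function extractPolarity(question: string): 'wins' | 'loses' {
  return /\b(lose|loses|lost|losing)\b/i.test(question) ? 'loses' : 'wins';
}

// Build "entity|<slug>|<polarity>|<date>", or null when no dictionary entity matches.
function canonicalKey(question: string, endDateIso: string, dictionary: Entity[]): string | null {
  const hit = matchEntity(question, dictionary);
  if (hit === null) return null;
  const date = endDateIso.slice(0, 10); // assumes an ISO 8601 string starting with YYYY-MM-DD
  return `entity|${hit.entity.slug}|${extractPolarity(question)}|${date}`;
}
```

Escaping each alias before building the regex matters for entries such as "donald j. trump", where an unescaped `.` would match any character.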
What's in the dictionary today
| Category | Coverage |
|---|---|
| election | Trump, Biden, Harris, Vance, Newsom, DeSantis |
| sports (NFL) | Chiefs, 49ers, Eagles, Bills, Cowboys |
| sports (NBA) | Lakers, Celtics, Warriors, Nuggets |
| award | Oscar Best Picture, Nobel Peace |
| tech | Elon Musk, OpenAI |
| finance | Tesla |
The dictionary is deliberately narrow in the current release — adding entities is a single PR. Future versions will pull in LLM-driven entity discovery for the long tail.
Layer 2: BM25 long-tail
For markets that hit no entity in Layer 1 (custom markets, niche events), we tokenize the question + description and rank cross-provider neighbours by BM25 similarity. Pure code, no LLM API.
```ts
import { BM25Index, tokenize } from './ner/bm25';

const idx = new BM25Index<{ provider: string; eventDate: string }>();
for (const market of allParsedMarkets) {
  idx.add({ id: market.id, text: market.question, payload: { ... } });
}

const candidates = idx.search('Will GTA VI release before EOY 2026?', {
  predicate: (id, p) => p.provider !== 'polymarket' && sameWeek(p.eventDate, target),
  minScore: 3.0,
  limit: 5,
});
```

Tokenization
- Lowercase
- Strip non-alphanumeric
- Drop stop words (`the`, `is`, `will`, `by`, `for`, …)
- Plural / `-ies` / `-ing` stem (lite)
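A minimal sketch of such a tokenizer is shown below. It is not the `tokenize` exported from `./ner/bm25`: the name `tokenizeSketch`, the five-word stop list (the page elides the full set), and the length thresholds in the stemmer are assumptions chosen for illustration.

```typescript
// Assumed stop-word list; only the words named on this page are included.
const STOP_WORDS = new Set(['the', 'is', 'will', 'by', 'for']);

// Light suffix stripping: "-ies" becomes "y", "-ing" is dropped, a trailing plural "s" is dropped.
function stem(token: string): string {
  if (token.length > 4 && token.endsWith('ies')) return token.slice(0, -3) + 'y';
  if (token.length > 5 && token.endsWith('ing')) return token.slice(0, -3);
  if (token.length > 3 && token.endsWith('s') && !token.endsWith('ss')) return token.slice(0, -1);
  return token;
}

function tokenizeSketch(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, ' ') // strip non-alphanumeric characters
    .split(' ')
    .filter((t) => t.length > 0 && !STOP_WORDS.has(t))
    .map(stem);
}
```

Under these assumptions, "Will the Lakers win the Finals?" reduces to the tokens `laker`, `win`, `final`.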
Score formula
Standard Okapi BM25 with k1=1.5, b=0.75:
score(q, d) = Σ idf(t) × (tf(t,d) × (k1+1)) / (tf(t,d) + k1×(1−b+b×|d|/avgdl))
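The formula can be exercised with a small standalone scorer. This is a sketch, not the `BM25Index` implementation: the function name `bm25Scores` is hypothetical, and because the page does not define `idf(t)`, the code assumes the common non-negative variant `ln(1 + (N − n_t + 0.5) / (n_t + 0.5))`, where `N` is the number of documents and `n_t` the number containing term `t`. It also sums over distinct query terms, which is a design choice rather than something the page states.

```typescript
const K1 = 1.5;
const B = 0.75;

// Score every tokenized document against the tokenized query with Okapi BM25.
function bm25Scores(queryTokens: string[], docs: string[][]): number[] {
  const n = docs.length;
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / Math.max(n, 1);

  // Document frequency: how many documents contain each term at least once.
  const df = new Map<string, number>();
  for (const doc of docs) {
    for (const term of Array.from(new Set(doc))) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }

  const queryTerms = Array.from(new Set(queryTokens));
  return docs.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue;
      const nt = df.get(term) ?? 0;
      // Assumed idf variant; the page does not specify one.
      const idf = Math.log(1 + (n - nt + 0.5) / (nt + 0.5));
      score += (idf * (tf * (K1 + 1))) / (tf + K1 * (1 - B + (B * doc.length) / avgdl));
    }
    return score;
  });
}
```

With equal term frequencies, the `|d|/avgdl` term makes a shorter document score higher than a longer one, which is the length normalization that `b` controls.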
Coverage
Layer 2 is deliberately not auto-clustered in the current release: running it unattended on 10K+ markets produces too many false positives. The BM25 index ships as a primitive; a future release adds a candidate-review queue (human or LLM verifier).
Precision policy
We stop at the first layer that produces a parse. This means:
| Market type | Layer used | Source label |
|---|---|---|
| HL HIP-4 outcome | Layer 0 | structured_description |
| Kalshi KXBTCD | Layer 0 | structured_ticker |
| Polymarket "above $X on Y" | Layer 0 | regex_title |
| Polymarket "Will Trump win?" | Layer 1 | regex_title (entity match) |
| Polymarket "Will Lakers win Finals?" | Layer 1 | regex_title |
| Polymarket "Will GTA VI release before…" | Layer 2 (roadmap) | embedding (later) |
| Anything unmatched | (dropped) | — |
A market only enters the registry when it has a parse. We'd rather miss markets than register noise — clusters with false members destroy arbitrage signals downstream.
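The stop-at-first-parse policy can be sketched as a simple cascade. The types `Market`, `Parse`, and `Layer` and the functions `firstParse` and `buildRegistry` are hypothetical names for this illustration; the real registry may be structured differently.

```typescript
interface Market {
  id: string;
  question: string;
}

interface Parse {
  canonicalKey: string;
  source: string; // e.g. "structured_ticker" or "regex_title"
}

// A layer either produces a parse or returns null to defer to the next layer.
type Layer = (market: Market) => Parse | null;

// Run layers in order and stop at the first one that produces a parse.
function firstParse(market: Market, layers: Layer[]): Parse | null {
  for (const layer of layers) {
    const parse = layer(market);
    if (parse !== null) return parse;
  }
  return null; // no layer matched
}

// Group market ids by canonical key; markets without a parse never enter the registry.
function buildRegistry(markets: Market[], layers: Layer[]): Map<string, string[]> {
  const clusters = new Map<string, string[]>();
  for (const market of markets) {
    const parse = firstParse(market, layers);
    if (parse === null) continue; // dropped rather than registered as noise
    const members = clusters.get(parse.canonicalKey) ?? [];
    members.push(market.id);
    clusters.set(parse.canonicalKey, members);
  }
  return clusters;
}
```

Because the layers are passed as an ordered array, leaving Layer 2 out of the array is enough to keep it opt-in.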
See also
- Event correlation — how the registry surfaces clusters
- Settlement tracking — where the three-source ingest pattern repeats