Event Clustering Layers

Artery's universal event registry is built bottom-up across three clustering layers. Each layer is strictly more precise than the next; we run them in order and stop at the first that produces a parse.

┌──────────────────────────────────────────────────────────────────┐
│ Layer 0 — Structured ticker / description                         │
│   • HL HIP-4 description KV     ("class:priceBinary|...")          │
│   • Kalshi crypto ticker        ("KXBTCD-26MAY0917-T68000")        │
│   • Polymarket question regex   ("Will Bitcoin be above $X on Y?") │
│   Precision: 100% — wins for all crypto price-binary               │
│   Volume: ~80% of Kalshi+HL today, ~10% of Polymarket              │
└──────────────────────────────────────────────────────────────────┘
                                ↓ no match
┌──────────────────────────────────────────────────────────────────┐
│ Layer 1 — Entity NER (roadmap)                                   │
│   Built-in dictionary: politicians, NFL/NBA teams, awards          │
│   Match policy: longest alias wins; word-boundary aware            │
│   Outcome polarity: "wins"|"loses" via verb regex                  │
│   Precision: ~95% on dictionary hits                               │
│   Volume: ~30% of Polymarket non-crypto, ~10% of Kalshi events     │
└──────────────────────────────────────────────────────────────────┘
                                ↓ no match
┌──────────────────────────────────────────────────────────────────┐
│ Layer 2 — BM25 long-tail (roadmap — opt-in)                      │
│   Tokenize + stop-word strip + plural stem                         │
│   Cross-provider top-N candidate by BM25 score                     │
│   Manual confirmation queue (roadmap)                            │
│   Precision: ~70-80% before human review                           │
│   Volume: catches the long tail                                    │
└──────────────────────────────────────────────────────────────────┘
                                ↓ no match
                            (drop)

Layer 0: structured

Three providers, three source formats, all at 100% precision:

Provider	Source	Example
HL HIP-4	`description` KV	`class:priceBinary|underlying:BTC|expiry:20260509-0600|targetPrice:79583`
Kalshi crypto	ticker	`KXBTCD-26MAY0917-T68000`
Polymarket crypto	`question` regex	`"Will Bitcoin be above $68,000 on May 9?"` + `endDateIso`

Same canonical key → automatic cluster.
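
As a rough illustration of how two of these sources could be parsed, here is a minimal sketch. The function names, the returned field shapes, and the exact Polymarket regex are assumptions made for this example, not the registry's actual implementation; only the input formats come from the table above.

```typescript
// Split an HL HIP-4 description of the form "k1:v1|k2:v2|..." into a key/value record.
function parseHip4Description(description: string): Record<string, string> {
  const fields: Record<string, string> = {};
  for (const pair of description.split('|')) {
    const sep = pair.indexOf(':'); // split on the first colon only
    if (sep <= 0) continue; // skip malformed segments
    fields[pair.slice(0, sep).trim()] = pair.slice(sep + 1).trim();
  }
  return fields;
}

// Hypothetical pattern for Polymarket "Will X be above $N on D?" questions.
const ABOVE_QUESTION = /^Will (.+?) be above \$([\d,]+(?:\.\d+)?) on (.+?)\?$/i;

interface AboveQuestion {
  underlying: string;  // e.g. "Bitcoin" (not yet normalised to a symbol)
  targetPrice: number; // thousands separators removed
  dateText: string;    // raw date text; the real date would come from endDateIso
}

function parseAboveQuestion(question: string): AboveQuestion | null {
  const m = ABOVE_QUESTION.exec(question.trim());
  if (m === null) return null;
  return {
    underlying: m[1],
    targetPrice: Number(m[2].replace(/,/g, '')),
    dateText: m[3],
  };
}
```

Both parsers return `null` or an incomplete record rather than guessing, which fits the Layer 0 goal of only emitting a parse when the input matches exactly.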

Layer 1: entity NER

For non-crypto markets we use a curated entity dictionary (in apps/api/src/events/ner/entity-dictionary.ts). Each entity has:

```ts
{
  slug:     'DONALD-TRUMP',
  aliases:  ['donald trump', 'trump', 'djt', 'donald j. trump'],
  category: 'election',
}
```

Matching:

1. Lowercase the question text.
2. For each dictionary entity, run a `\balias\b` regex for every alias.
3. Pick the longest matching alias (so "donald trump" beats "trump").
4. Extract polarity ("wins" / "loses"; default "wins").
5. Combine with `endDateIso` → canonical key:

entity|DONALD-TRUMP|wins|2024-11-05

Same canonical key across providers → cluster.
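
The matching steps above can be sketched as follows. This is an illustrative sketch only: the `Entity` shape mirrors the dictionary entry shown earlier, but `matchEntity`, `extractPolarity`, `entityKey`, the verb regex, and the one-entry dictionary are assumptions made for the example.

```typescript
interface Entity {
  slug: string;
  aliases: string[];
  category: string;
}

// One-entry dictionary for the example; the real one lives in entity-dictionary.ts.
const DICTIONARY: Entity[] = [
  {
    slug: 'DONALD-TRUMP',
    aliases: ['donald trump', 'trump', 'djt', 'donald j. trump'],
    category: 'election',
  },
];

// Escape regex metacharacters so aliases like "donald j. trump" match literally.
const escapeRegex = (s: string): string => s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');

// Steps 1-3: lowercase, word-boundary match per alias, longest alias wins.
function matchEntity(question: string): { entity: Entity; alias: string } | null {
  const text = question.toLowerCase();
  let best: { entity: Entity; alias: string } | null = null;
  for (const entity of DICTIONARY) {
    for (const alias of entity.aliases) {
      const re = new RegExp(`\\b${escapeRegex(alias)}\\b`);
      if (re.test(text) && (best === null || alias.length > best.alias.length)) {
        best = { entity, alias };
      }
    }
  }
  return best;
}

// Step 4: hypothetical verb regex; anything that is not a "lose" form defaults to "wins".
function extractPolarity(question: string): 'wins' | 'loses' {
  return /\b(lose|loses|lost)\b/i.test(question) ? 'loses' : 'wins';
}

// Step 5: combine slug, polarity, and the date part of endDateIso.
function entityKey(question: string, endDateIso: string): string | null {
  const hit = matchEntity(question);
  if (hit === null) return null; // no dictionary hit: fall through to the next layer
  return `entity|${hit.entity.slug}|${extractPolarity(question)}|${endDateIso.slice(0, 10)}`;
}
```

Calling `entityKey('Will Donald Trump win the 2024 election?', '2024-11-05')` in this sketch yields the key format shown above; a question with no dictionary hit returns `null`.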

What's in the dictionary today

Category	Coverage
`election`	Trump, Biden, Harris, Vance, Newsom, DeSantis
`sports` (NFL)	Chiefs, 49ers, Eagles, Bills, Cowboys
`sports` (NBA)	Lakers, Celtics, Warriors, Nuggets
`award`	Oscar Best Picture, Nobel Peace
`tech`	Elon Musk, OpenAI
`finance`	Tesla

The dictionary is deliberately narrow in the current release — adding entities is a single PR. Future versions will pull in LLM-driven entity discovery for the long tail.

Layer 2: BM25 long-tail

For markets that hit no entity in Layer 1 (custom markets, niche events), we tokenize the question + description and rank cross-provider neighbours by BM25 similarity. Pure code, no LLM API.

```ts
import { BM25Index, tokenize } from './ner/bm25';

// The payload travels with each indexed document and is used to filter at search time.
const idx = new BM25Index<{ provider: string; eventDate: string }>();
for (const market of allParsedMarkets) {
  idx.add({ id: market.id, text: market.question, payload: { ... } });
}

// Keep only candidates from other providers, in the same event week, above a minimum score.
const candidates = idx.search('Will GTA VI release before EOY 2026?', {
  predicate: (id, p) => p.provider !== 'polymarket' && sameWeek(p.eventDate, target),
  minScore: 3.0,
  limit: 5,
});
```

Tokenization

1. Lowercase
2. Strip non-alphanumeric
3. Drop stop words (the, is, will, by, for, …)
4. Plural / -ies / -ing stem (lite)
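
A minimal sketch of these four steps, for illustration. The name `tokenizeSketch`, the abbreviated stop-word list, and the specific suffix rules and length thresholds in `stemLite` are assumptions; the shipped `tokenize` in `./ner/bm25` may differ in its details.

```typescript
// Abbreviated stop-word list for the example; the real list is longer.
const STOP_WORDS = new Set<string>(['the', 'is', 'will', 'by', 'for']);

// "Lite" stemming: a few suffix rules with length guards, not a full stemmer.
function stemLite(token: string): string {
  if (token.length > 4 && token.endsWith('ies')) return token.slice(0, -3) + 'y';
  if (token.length > 5 && token.endsWith('ing')) return token.slice(0, -3);
  if (token.length > 3 && token.endsWith('s') && !token.endsWith('ss')) {
    return token.slice(0, -1);
  }
  return token;
}

function tokenizeSketch(text: string): string[] {
  return text
    .toLowerCase()                      // 1. lowercase
    .replace(/[^a-z0-9]+/g, ' ')        // 2. strip non-alphanumeric
    .split(' ')
    .filter((t) => t.length > 0 && !STOP_WORDS.has(t)) // 3. drop stop words
    .map(stemLite);                     // 4. lite stem
}
```

The length guards keep short words such as "is" or "king" from being mangled, at the cost of missing some plurals; that trade-off is a choice made for this sketch.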

Score formula

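Standard Okapi BM25 with k1=1.5, b=0.75. The sum runs over the query terms t; tf(t,d) is the number of times t occurs in document d, |d| is the length of d in tokens, and avgdl is the average document length across the index: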

score(q, d) = Σ idf(t) × (tf(t,d) × (k1+1)) / (tf(t,d) + k1×(1−b+b×|d|/avgdl))
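
The formula above can be sketched as a small scoring function. The text does not specify which idf variant is used, so this example assumes the common smoothed form `ln(1 + (N − n + 0.5) / (n + 0.5))`, where N is the number of documents and n the number containing the term; `bm25Scores` is a hypothetical name and takes pre-tokenized input.

```typescript
const K1 = 1.5;
const B = 0.75;

// Score every document against the query; inputs are already tokenized.
function bm25Scores(query: string[], docs: string[][]): number[] {
  const N = docs.length;
  if (N === 0) return [];
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / N;

  // Document frequency: in how many documents does each term appear?
  const df = new Map<string, number>();
  for (const d of docs) {
    new Set(d).forEach((t) => df.set(t, (df.get(t) ?? 0) + 1));
  }

  return docs.map((d) => {
    let score = 0;
    for (const t of query) {
      const tf = d.filter((x) => x === t).length;
      if (tf === 0) continue; // term absent: contributes nothing
      const n = df.get(t) ?? 0;
      const idf = Math.log(1 + (N - n + 0.5) / (n + 0.5)); // assumed idf variant
      score += (idf * (tf * (K1 + 1))) / (tf + K1 * (1 - B + (B * d.length) / avgdl));
    }
    return score;
  });
}
```

With equal term frequencies, the length normalization term `b×|d|/avgdl` makes a shorter document score higher than a longer one, which is the behaviour the b parameter controls.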

Coverage

Layer 2 is deliberately not auto-clustered in the current release: running it unattended on 10K+ markets produces too many false positives. The BM25 index ships as a primitive; a future release adds a candidate-review queue (human or LLM verifier).

Precision policy

We stop at the first layer that produces a parse. This means:

Market type	Layer used	Source label
HL HIP-4 outcome	Layer 0	`structured_description`
Kalshi `KXBTCD`	Layer 0	`structured_ticker`
Polymarket "above $X on Y"	Layer 0	`regex_title`
Polymarket "Will Trump win?"	Layer 1	`regex_title` (entity match)
Polymarket "Will Lakers win Finals?"	Layer 1	`regex_title`
Polymarket "Will GTA VI release before…"	Layer 2 (roadmap)	`embedding` (later)
Anything unmatched	(dropped)	—

A market only enters the registry when it has a parse. We'd rather miss markets than register noise — clusters with false members destroy arbitrage signals downstream.
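
The stop-at-first-parse policy and the clustering by canonical key can be sketched as below. The `Market`, `Parse`, and `LayerParser` types and both function names are hypothetical, introduced only to show the control flow; the layer parsers themselves are passed in by the caller.

```typescript
interface Market {
  id: string;
  provider: string;
  question: string;
}

interface Parse {
  canonicalKey: string;
  source: string; // e.g. 'structured_ticker', 'regex_title'
}

// A layer either produces a parse or returns null to defer to the next layer.
type LayerParser = (market: Market) => Parse | null;

// Run layers in order and stop at the first one that produces a parse.
function firstParse(market: Market, layers: LayerParser[]): Parse | null {
  for (const layer of layers) {
    const parse = layer(market);
    if (parse !== null) return parse;
  }
  return null;
}

// Group markets by canonical key; markets with no parse never enter the registry.
function buildRegistry(markets: Market[], layers: LayerParser[]): Map<string, Market[]> {
  const clusters = new Map<string, Market[]>();
  for (const market of markets) {
    const parse = firstParse(market, layers);
    if (parse === null) continue; // unmatched: dropped
    const members = clusters.get(parse.canonicalKey) ?? [];
    members.push(market);
    clusters.set(parse.canonicalKey, members);
  }
  return clusters;
}
```

Because layers are ordered from most to least precise, a market that more than one layer could parse is always keyed by the most precise one.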
