
Event Clustering Layers

Artery's universal event registry is built bottom-up across three clustering layers. Each layer is strictly more precise than the next; we run them in order and stop at the first that produces a parse.

┌──────────────────────────────────────────────────────────────────┐
│ Layer 0 — Structured ticker / description                         │
│   • HL HIP-4 description KV     ("class:priceBinary|...")          │
│   • Kalshi crypto ticker        ("KXBTCD-26MAY0917-T68000")        │
│   • Polymarket question regex   ("Will Bitcoin be above $X on Y?") │
│   Precision: 100% — wins for all crypto price-binary               │
│   Volume: ~80% of Kalshi+HL today, ~10% of Polymarket              │
└──────────────────────────────────────────────────────────────────┘
                                ↓ no match
┌──────────────────────────────────────────────────────────────────┐
│ Layer 1 — Entity NER (roadmap)                                   │
│   Built-in dictionary: politicians, NFL/NBA teams, awards          │
│   Match policy: longest alias wins; word-boundary aware            │
│   Outcome polarity: "wins"|"loses" via verb regex                  │
│   Precision: ~95% on dictionary hits                               │
│   Volume: ~30% of Polymarket non-crypto, ~10% of Kalshi events     │
└──────────────────────────────────────────────────────────────────┘
                                ↓ no match
┌──────────────────────────────────────────────────────────────────┐
│ Layer 2 — BM25 long-tail (roadmap — opt-in)                      │
│   Tokenize + stop-word strip + plural stem                         │
│   Cross-provider top-N candidate by BM25 score                     │
│   Manual confirmation queue (roadmap)                            │
│   Precision: ~70-80% before human review                           │
│   Volume: catches the long tail                                    │
└──────────────────────────────────────────────────────────────────┘
                                ↓ no match
                            (drop)

Layer 0: structured

Three providers, three source formats, all parsed at 100% precision:

Provider           Source           Example
-----------------  ---------------  -----------------------------------------------------------------------
HL HIP-4           description KV   class:priceBinary|underlying:BTC|expiry:20260509-0600|targetPrice:79583
Kalshi crypto      ticker           KXBTCD-26MAY0917-T68000
Polymarket crypto  question regex   "Will Bitcoin be above $68,000 on May 9?" + endDateIso

Same canonical key → automatic cluster.
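As a rough illustration of the first two rows, the sketch below splits an HL HIP-4 description into key/value pairs and breaks a Kalshi crypto ticker into its segments. The function names (`parseHip4Description`, `parseKalshiTicker`), the `KalshiTickerParts` shape, and the reading of the ticker segments (two-digit year, month, day, hour, then a `T`-prefixed threshold) are assumptions made for this example, not Artery's actual parser.

```typescript
// Hypothetical helper: split "key:value|key:value" into a plain record.
function parseHip4Description(description: string): Record<string, string> | null {
  const fields: Record<string, string> = {};
  for (const pair of description.split('|')) {
    const sep = pair.indexOf(':');
    if (sep <= 0) return null; // not a well-formed key:value pair
    fields[pair.slice(0, sep)] = pair.slice(sep + 1);
  }
  // Only price-binary descriptions count as a Layer 0 parse in this sketch.
  return fields['class'] === 'priceBinary' ? fields : null;
}

// Assumed segment layout: SERIES-YYMONDDHH-T<threshold>.
interface KalshiTickerParts {
  series: string;
  year: string;
  month: string;
  day: string;
  hour: string;
  threshold: number;
}

function parseKalshiTicker(ticker: string): KalshiTickerParts | null {
  const m = /^([A-Z]+)-(\d{2})([A-Z]{3})(\d{2})(\d{2})-T(\d+(?:\.\d+)?)$/.exec(ticker);
  if (m === null) return null; // no match: fall through to the next layer
  return {
    series: m[1],
    year: m[2],
    month: m[3],
    day: m[4],
    hour: m[5],
    threshold: Number(m[6]),
  };
}
```

Both functions return `null` when the input does not fit the expected shape, which is what lets a caller fall through to the next layer.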

Layer 1: entity NER

For non-crypto markets we use a curated entity dictionary (in apps/api/src/events/ner/entity-dictionary.ts). Each entity has:

{
  slug:     'DONALD-TRUMP',
  aliases:  ['donald trump', 'trump', 'djt', 'donald j. trump'],
  category: 'election',
}

Matching:

  1. Lowercase the question text
  2. For each dictionary entity, run a \bAlias\b regex
  3. Pick the longest matching alias (so "donald trump" beats "trump")
  4. Extract polarity ("wins" / "loses" / default "wins")
  5. Combine with endDateIso → canonical key:
entity|DONALD-TRUMP|wins|2024-11-05

Same canonical key across providers → cluster.
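The five matching steps can be sketched roughly as follows. This is a minimal illustration, not the code in `entity-dictionary.ts`: the names `matchEntity`, `extractPolarity`, and `entityKey`, the exact verb regex, and the choice to keep only the first ten characters of `endDateIso` are all assumptions.

```typescript
interface Entity {
  slug: string;
  aliases: string[];
  category: string;
}

interface EntityMatch {
  entity: Entity;
  alias: string;
}

// Escape regex metacharacters so aliases like "donald j. trump" match literally.
function escapeRegExp(s: string): string {
  return s.replace(/[.*+?^${}()|[\]\\]/g, '\\$&');
}

// Steps 1-3: lowercase, test each alias with word boundaries, keep the longest hit.
function matchEntity(question: string, dictionary: Entity[]): EntityMatch | null {
  const text = question.toLowerCase();
  let best: EntityMatch | null = null;
  for (const entity of dictionary) {
    for (const alias of entity.aliases) {
      const re = new RegExp(`\\b${escapeRegExp(alias)}\\b`);
      if (re.test(text) && (best === null || alias.length > best.alias.length)) {
        best = { entity, alias };
      }
    }
  }
  return best;
}

// Step 4: hypothetical verb regex; anything that is not a "lose" form defaults to "wins".
function extractPolarity(question: string): 'wins' | 'loses' {
  return /\b(lose|loses|lost)\b/.test(question.toLowerCase()) ? 'loses' : 'wins';
}

// Step 5: combine slug, polarity, and the date part of endDateIso (assumed YYYY-MM-DD prefix).
function entityKey(question: string, endDateIso: string, dictionary: Entity[]): string | null {
  const match = matchEntity(question, dictionary);
  if (match === null) return null;
  return `entity|${match.entity.slug}|${extractPolarity(question)}|${endDateIso.slice(0, 10)}`;
}

const demoDictionary: Entity[] = [
  {
    slug: 'DONALD-TRUMP',
    aliases: ['donald trump', 'trump', 'djt', 'donald j. trump'],
    category: 'election',
  },
];
```

With this dictionary, `entityKey('Will Donald Trump win the 2024 election?', '2024-11-05', demoDictionary)` produces `entity|DONALD-TRUMP|wins|2024-11-05`, the key shown above.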

What's in the dictionary today

Category       Coverage
-------------  -----------------------------------------------
election       Trump, Biden, Harris, Vance, Newsom, DeSantis
sports (NFL)   Chiefs, 49ers, Eagles, Bills, Cowboys
sports (NBA)   Lakers, Celtics, Warriors, Nuggets
award          Oscar Best Picture, Nobel Peace
tech           Elon Musk, OpenAI
finance        Tesla

The dictionary is deliberately narrow in the current release — adding entities is a single PR. Future versions will pull in LLM-driven entity discovery for the long tail.

Layer 2: BM25 long-tail

For markets that hit no entity in Layer 1 (custom markets, niche events), we tokenize the question + description and rank cross-provider neighbours by BM25 similarity. Pure code, no LLM API.

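import { BM25Index, tokenize } from './ner/bm25';

// The payload type carries the fields the search predicate filters on.
const idx = new BM25Index<{ provider: string; eventDate: string }>();
for (const market of allParsedMarkets) {
  idx.add({ id: market.id, text: market.question, payload: { ... } });
}

// `target` and `sameWeek` are assumed to be defined elsewhere by the caller.
const candidates = idx.search('Will GTA VI release before EOY 2026?', {
  // Keep only markets from other providers that fall in the same week.
  predicate: (id, p) => p.provider !== 'polymarket' && sameWeek(p.eventDate, target),
  minScore: 3.0, // discard weak matches
  limit: 5,      // return at most the top five candidates
});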

Tokenization

  • Lowercase
  • Strip non-alphanumeric
  • Drop stop words (the, is, will, by, for, …)
  • Plural / -ies / -ing stem (lite)
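A tokenizer along these lines might look like the sketch below. It is not the `tokenize` exported from `./ner/bm25`: the names `tokenizeLite` and `stemLite`, the length thresholds in the stemmer, and the stop-word list (limited here to the five words named above; the real list is longer) are assumptions for illustration.

```typescript
// Only the stop words named in the list above; the real list is longer.
const STOP_WORDS: string[] = ['the', 'is', 'will', 'by', 'for'];

// "Lite" stemming: -ies becomes -y, then -ing and a trailing plural -s are stripped.
// The length guards are hypothetical and only there to protect very short tokens.
function stemLite(token: string): string {
  if (token.length > 4 && token.endsWith('ies')) return token.slice(0, -3) + 'y';
  if (token.length > 5 && token.endsWith('ing')) return token.slice(0, -3);
  if (token.length > 3 && token.endsWith('s') && !token.endsWith('ss')) {
    return token.slice(0, -1);
  }
  return token;
}

function tokenizeLite(text: string): string[] {
  return text
    .toLowerCase()
    .replace(/[^a-z0-9]+/g, ' ') // strip non-alphanumeric characters
    .split(' ')
    .filter((t) => t.length > 0 && STOP_WORDS.indexOf(t) === -1)
    .map(stemLite);
}
```

Under these rules, `tokenizeLite('Will the Lakers win the NBA Finals?')` yields `['laker', 'win', 'nba', 'final']`.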

Score formula

Standard Okapi BM25 with k1=1.5, b=0.75:

score(q, d) = Σ idf(t) × (tf(t,d) × (k1+1)) / (tf(t,d) + k1×(1−b+b×|d|/avgdl))
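To make the formula concrete, here is a small self-contained scorer over pre-tokenized documents. It is a sketch, not the `BM25Index` implementation: the function name `bm25Scores` is hypothetical, and because the idf term is not spelled out above, the code assumes the common smoothed variant `ln(1 + (N − n + 0.5) / (n + 0.5))`, where `N` is the number of documents and `n` the number containing the term.

```typescript
// Scores every document against the query; documents are arrays of tokens.
function bm25Scores(query: string[], docs: string[][], k1 = 1.5, b = 0.75): number[] {
  const n = docs.length;
  if (n === 0) return [];
  const avgdl = docs.reduce((sum, d) => sum + d.length, 0) / n;

  // Document frequency: in how many documents does each term appear?
  const df = new Map<string, number>();
  for (const doc of docs) {
    for (const term of Array.from(new Set(doc))) {
      df.set(term, (df.get(term) ?? 0) + 1);
    }
  }

  const queryTerms = Array.from(new Set(query));
  return docs.map((doc) => {
    let score = 0;
    for (const term of queryTerms) {
      const tf = doc.filter((t) => t === term).length;
      if (tf === 0) continue; // term absent: contributes nothing
      const nt = df.get(term) ?? 0;
      const idf = Math.log(1 + (n - nt + 0.5) / (nt + 0.5)); // assumed idf variant
      score += (idf * (tf * (k1 + 1))) / (tf + k1 * (1 - b + (b * doc.length) / avgdl));
    }
    return score;
  });
}
```

A document that shares no term with the query scores exactly 0, and rarer terms weigh more than common ones, which is why a `minScore` cutoff can filter out incidental overlaps.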

Coverage

Layer 2 is deliberately not auto-clustered in the current release: running it unattended on 10K+ markets produces too many false positives. The BM25 index ships as a primitive; a future release will add a candidate-review queue (human or LLM verifier).

Precision policy

We stop at the first layer that produces a parse. This means:

Market type                                Layer used          Source label
-----------------------------------------  ------------------  ---------------------------
HL HIP-4 outcome                           Layer 0             structured_description
Kalshi KXBTCD                              Layer 0             structured_ticker
Polymarket "above $X on Y"                 Layer 0             regex_title
Polymarket "Will Trump win?"               Layer 1             regex_title (entity match)
Polymarket "Will Lakers win Finals?"       Layer 1             regex_title
Polymarket "Will GTA VI release before…"   Layer 2 (roadmap)   embedding (later)
Anything unmatched                         (dropped)

A market only enters the registry when it has a parse. We'd rather miss markets than register noise — clusters with false members destroy arbitrage signals downstream.
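The stop-at-first-parse policy and the "no parse, no registry entry" rule can be sketched as a small cascade. Everything named here (`Market`, `Parse`, `Layer`, `firstParse`, `buildRegistry`) is hypothetical and only illustrates the control flow, not Artery's registry code.

```typescript
interface Market {
  id: string;
  provider: string;
  question: string;
}

interface Parse {
  canonicalKey: string;
  source: string; // e.g. 'structured_ticker' or 'regex_title'
}

// A layer either produces a parse or returns null to pass the market on.
type Layer = (market: Market) => Parse | null;

// Run layers in order and stop at the first one that produces a parse.
function firstParse(market: Market, layers: Layer[]): Parse | null {
  for (const layer of layers) {
    const parse = layer(market);
    if (parse !== null) return parse;
  }
  return null;
}

// Markets with the same canonical key form one cluster; unparsed markets are dropped.
function buildRegistry(markets: Market[], layers: Layer[]): Map<string, Market[]> {
  const clusters = new Map<string, Market[]>();
  for (const market of markets) {
    const parse = firstParse(market, layers);
    if (parse === null) continue; // no parse: never enters the registry
    const members = clusters.get(parse.canonicalKey) ?? [];
    members.push(market);
    clusters.set(parse.canonicalKey, members);
  }
  return clusters;
}
```

Because a later layer never runs once an earlier one has matched, a lower-precision layer cannot override a higher-precision parse.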
