Cross-Source Event Linking

A real intrusion almost never confines itself to one telemetry boundary. An attacker who phishes a credential touches the identity provider, the endpoint, the VPN concentrator, and the cloud control plane within minutes — yet each of those systems emits its own log stream, its own field names, and its own clock. Cross-source event linking is the technique that fuses those fragments back into a single incident narrative so a downstream rule can reason about the campaign rather than six unrelated alerts. This page is part of the broader Alert Correlation & Rule Engines discipline, and it sits directly upstream of scoring and routing: nothing can be prioritized correctly until the events that belong together have actually been joined. Get linking wrong and you either fracture one breach into noise or fuse unrelated activity into a phantom incident.

Problem Framing

Consider a concrete failure scenario. A mid-size SOC ingests roughly 40,000 events per second across four sources: Windows Security logs (EVTX forwarded as JSON, ~12k EPS), Okta system logs (JSON via the API, ~800 EPS), Zscaler web proxy (CEF over syslog, ~18k EPS), and AWS CloudTrail (JSON via Kinesis, ~9k EPS). An analyst investigating a CloudTrail AssumeRole flagged as anomalous has no automated way to see that the same human, eleven minutes earlier, completed an Okta MFA push from a never-before-seen ASN and, two minutes after that, triggered Windows Event 4624 (logon type 10) on a jump host. The three records share no common field verbatim: Okta calls the user actor.alternateId, Windows calls it TargetUserName, and CloudTrail calls it userIdentity.arn. Their timestamps live in three timezones with up to 90 seconds of skew between sources.

The job of a linking pipeline is to make those three records resolve to one entity and land in one time-bounded session, deterministically, at 40k EPS, without unbounded memory growth. That requires three things working together: a canonical schema so fields are comparable (handled upstream by JSON event normalization), an entity-resolution function that maps source-specific identifiers to a stable principal key, and a stateful join that tolerates out-of-order arrival and clock skew. The rest of this page builds exactly that.

Prerequisites & Environment

The reference implementation targets Python 3.11+ (it relies on asyncio.TaskGroup, which was added in 3.11, and on the str | None union syntax in annotations). The third-party dependencies are redis, for shared window state, and pydantic, for the typed event contract; everything else is standard library.

python3 -m venv .venv
source .venv/bin/activate
pip install "redis>=5.0,<6.0" "pydantic>=2.6,<3.0"

Infrastructure assumptions:

A Redis 7 instance (or cluster) reachable from every linker worker, used as the partitioned window state store. Single-node Redis is fine up to ~50k EPS; shard by entity-key hash beyond that.
An upstream normalized event stream. Linking assumes events already conform to the site’s Elastic Common Schema (ECS) model — if yours do not, run them through the schema validation pipelines first. Linking on un-normalized data is the single most common cause of silent join misses.
A dead-letter queue (a Redis stream, Kafka topic, or S3 prefix) for events that cannot be linked, so failures are replayable rather than dropped.

All timestamps entering the linker must already be converted to UTC epoch seconds (floats). Residual clock skew is tolerated inside the window through the grace period, but timezone normalization is an ingestion-layer responsibility.
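As a rough illustration of that ingestion-side step, the sketch below converts ISO 8601 timestamps carrying different offsets to UTC epoch seconds. The function name to_utc_epoch and the sample timestamps are assumptions made for this example, not part of the reference implementation; naive timestamps are rejected rather than guessed at.

```python
from datetime import datetime, timezone


def to_utc_epoch(iso_timestamp: str) -> float:
    """Convert an offset-bearing ISO 8601 string to UTC epoch seconds.

    Hypothetical helper: raises ValueError for naive timestamps, because a
    missing offset cannot be repaired safely once the event reaches the linker.
    """
    parsed = datetime.fromisoformat(iso_timestamp)
    if parsed.tzinfo is None:
        raise ValueError(f"timestamp has no UTC offset: {iso_timestamp!r}")
    return parsed.astimezone(timezone.utc).timestamp()


# The same instant expressed with three different offsets.
samples = [
    "2024-05-01T12:00:00+00:00",
    "2024-05-01T08:00:00-04:00",
    "2024-05-01T17:30:00+05:30",
]
epochs = [to_utc_epoch(s) for s in samples]
assert len(set(epochs)) == 1  # all three collapse to one epoch value
```

Rejecting naive timestamps at this point keeps the failure visible at ingestion instead of surfacing later as an unexplained late-arrival rejection.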

Architecture Overview

The linker is a single logical stage with three internal phases — resolve, window, emit — fed by the normalization layer and feeding the rule engine. Each phase fails closed into the DLQ rather than passing malformed state downstream.

The pivotal design decision is that windows are keyed by resolved principal, not by raw identifier. Two records only meet if entity resolution maps them to the same key, which is why resolution quality bounds linking quality. The emit condition requires cross-source diversity (events from two or more distinct source_type values) so that ten repeated logons from one endpoint do not masquerade as a multi-stage incident.

Step-by-Step Implementation

Step 1 — Define a typed, validated event contract

Linking on raw dictionaries invites KeyError cascades under load. A Pydantic model gives every event a validated, comparable shape and a single rejection point.

from __future__ import annotations

from pydantic import BaseModel, Field, field_validator


class LinkableEvent(BaseModel):
    """Canonical, ECS-aligned event accepted by the linker."""

    event_id: str = Field(min_length=1)
    timestamp_utc: float = Field(gt=0)          # UTC epoch seconds (float)
    source_type: str = Field(min_length=1)      # endpoint | identity | proxy | cloud
    src_ip: str | None = None
    user_principal: str | None = None
    host_id: str | None = None
    technique_id: str | None = None             # e.g. "T1078" Valid Accounts

    @field_validator("source_type")
    @classmethod
    def known_source(cls, v: str) -> str:
        allowed = {"endpoint", "identity", "proxy", "cloud"}
        if v not in allowed:
            raise ValueError(f"unknown source_type: {v}")
        return v

Step 2 — Resolve a deterministic principal key

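Entity resolution collapses source-specific identifiers into one stable key. Identifiers are ranked in a fixed priority order (user, then host, then IP) and only the strongest one present is hashed, so the same entity hashes identically regardless of which additional fields a given source populates. Hashing every populated field together would defeat the join: an identity record carrying only a user and an endpoint record carrying the same user plus a host would land on two different keys. Resolution is deterministic (SHA-256 over the canonical identifier). Never use a probabilistic match at this layer, because non-determinism makes joins irreproducible during incident review.

import hashlib


def resolve_entity_key(event: LinkableEvent) -> str | None:
    """Map heterogeneous identifiers to one canonical principal key.

    Only the strongest identifier present is hashed (user, then host, then
    IP), so extra fields on one source cannot change the key.
    """
    identifiers = [
        ("user", (event.user_principal or "").strip().lower()),
        ("host", (event.host_id or "").strip().lower()),
        ("ip", (event.src_ip or "").strip()),
    ]
    # First non-empty identifier in priority order; IP is the last resort.
    canonical = next((f"{k}={v}" for k, v in identifiers if v), None)
    if canonical is None:
        return None
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]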

A production resolver also consults an identity-graph lookup (CMDB, IdP directory) so that user_principal=jdoe, [email protected], and a known host-to-owner mapping all collapse to one key. Keep that lookup table authoritative and versioned — a stale alias table silently fractures sessions.
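One way to picture that lookup is a small alias table consulted before hashing. The sketch below is illustrative only: ALIAS_TABLE, its entries, ALIAS_TABLE_VERSION, canonical_principal, and principal_key are hypothetical names, and a real deployment would load a versioned table from the CMDB or IdP directory rather than hard-coding it.

```python
import hashlib

# Hypothetical, hand-written alias table; a real one would be loaded from an
# authoritative, versioned source such as the IdP directory.
ALIAS_TABLE_VERSION = "example-v1"
ALIAS_TABLE: dict[str, str] = {
    "jdoe": "user:jdoe",
    "corp\\jdoe": "user:jdoe",
    "arn:aws:iam::123456789012:user/jdoe": "user:jdoe",
}


def canonical_principal(raw_identifier: str) -> str:
    """Return the canonical principal for an identifier, or the cleaned input."""
    cleaned = raw_identifier.strip().lower()
    return ALIAS_TABLE.get(cleaned, cleaned)


def principal_key(raw_identifier: str) -> str:
    """Hash the canonical principal so every known alias yields the same key."""
    canonical = canonical_principal(raw_identifier)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:32]


keys = {principal_key(alias) for alias in ("jdoe", "CORP\\jdoe", " JDoe ")}
assert len(keys) == 1  # all three aliases resolve to one key
```

Recording the table version alongside each emitted session would let an investigator reproduce a join later with the alias data that was live at the time.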

Step 3 — Build the skew-tolerant sliding window

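The window store evicts expired events, tolerates a bounded lateness grace period, and caps per-key cardinality so one noisy entity cannot exhaust memory. The class below keeps its state in an in-process dict so the logic stays easy to unit-test; a multi-worker deployment would back the same operations with the shared Redis store described in the prerequisites.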

import time
from collections import defaultdict


class SlidingWindowLinker:
    def __init__(
        self,
        window_seconds: int = 900,
        grace_seconds: int = 120,
        max_events_per_key: int = 500,
    ) -> None:
        self.window_seconds = window_seconds
        self.grace_seconds = grace_seconds
        self.max_events_per_key = max_events_per_key
        self._store: dict[str, list[LinkableEvent]] = defaultdict(list)

    def admit(self, event: LinkableEvent, now: float) -> str:
        """Insert an event; return a status code for DLQ routing."""
        key = resolve_entity_key(event)
        if key is None:
            return "ERR_LINK_001"  # no resolvable identifier

        # Reject events later than the grace period rather than reopening
        # a closed window — this blocks delay-based sequence-rule evasion.
        if event.timestamp_utc < now - self.window_seconds - self.grace_seconds:
            return "ERR_LINK_011"  # arrived beyond lateness grace

        bucket = self._store[key]
        cutoff = event.timestamp_utc - self.window_seconds
        bucket[:] = [e for e in bucket if e.timestamp_utc > cutoff]
        bucket.append(event)

        if len(bucket) > self.max_events_per_key:
            # Keep the most recent N; shed oldest to bound memory.
            bucket[:] = sorted(bucket, key=lambda e: e.timestamp_utc)[
                -self.max_events_per_key :
            ]
            return "ERR_LINK_021"  # cardinality cap hit (warn, not fatal)
        return "OK"

    def linked_session(self, key: str) -> list[LinkableEvent] | None:
        """Return events if >= 2 distinct sources share the window."""
        events = self._store.get(key, [])
        sources = {e.source_type for e in events}
        if len(sources) >= 2:
            return sorted(events, key=lambda e: e.timestamp_utc)
        return None
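The lateness check in admit is easy to misread, so a worked boundary case may help. The helper below is a standalone restatement of that comparison using the default window_seconds=900 and grace_seconds=120; is_too_late is a name introduced for this illustration only.

```python
def is_too_late(
    event_ts: float,
    now: float,
    window_seconds: int = 900,
    grace_seconds: int = 120,
) -> bool:
    """Mirror the admit() comparison: older than window + grace is rejected."""
    return event_ts < now - window_seconds - grace_seconds


now = 10_000.0
oldest_admissible = now - 900 - 120  # 8980.0

assert not is_too_late(oldest_admissible, now)      # exactly on the boundary: admitted
assert is_too_late(oldest_admissible - 0.001, now)  # just past the boundary: rejected
assert not is_too_late(now - 60, now)               # one minute old: admitted
```

Because the comparison is strict, an event sitting exactly on the boundary is still admitted; only events older than window plus grace are routed to the DLQ as ERR_LINK_011.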

Step 4 — Drive it from an async ingestion loop

A bounded asyncio consumer applies backpressure so an upstream surge cannot outrun the window store. Each event is admitted, evaluated, and either emitted to the rule engine or routed to the DLQ by status code.

import asyncio
import json
import logging

logger = logging.getLogger("soc.linker")


async def run_linker(
    incoming: asyncio.Queue[dict],
    emit: asyncio.Queue[list[dict]],
    dlq: asyncio.Queue[tuple[str, dict]],
    linker: SlidingWindowLinker,
    concurrency: int = 8,
) -> None:
    sem = asyncio.Semaphore(concurrency)

    async def handle(raw: dict) -> None:
        # The consumer loop acquires `sem` before spawning this task; release
        # it here so the slot frees even when validation or routing fails.
        try:
            try:
                event = LinkableEvent.model_validate(raw)
            except Exception as exc:  # validation failure -> DLQ
                await dlq.put(("ERR_LINK_031", {"raw": raw, "error": str(exc)}))
                return

            status = linker.admit(event, now=time.time())
            if status not in ("OK", "ERR_LINK_021"):  # 021 is a non-fatal warning
                await dlq.put((status, raw))
                return

            key = resolve_entity_key(event)
            session = linker.linked_session(key) if key else None
            if session:
                await emit.put([e.model_dump() for e in session])
                logger.info(
                    json.dumps(
                        {
                            "msg": "linked_session_emitted",
                            "correlation_key": key,
                            "sources": sorted({e.source_type for e in session}),
                            "event_count": len(session),
                        }
                    )
                )
        finally:
            sem.release()
            incoming.task_done()

    # Handlers run concurrently, but at most `concurrency` are in flight:
    # acquiring the semaphore before create_task is what applies backpressure.
    async with asyncio.TaskGroup() as tg:
        while True:
            raw = await incoming.get()
            await sem.acquire()
            tg.create_task(handle(raw))

This loop is intentionally decoupled from transport: swap the asyncio.Queue for a Kafka or Kinesis consumer without touching linking logic, which keeps the engine unit-testable against recorded fixtures.

Schema & Validation Integration

Cross-source linking is only as reliable as the field consistency it inherits. Every comparison in resolve_entity_key — user_principal, host_id, src_ip — assumes those fields already carry canonical ECS names and types (user.name, host.id, source.ip). That mapping is owned upstream by JSON event normalization, which collapses source dialects (Okta actor.alternateId, Windows TargetUserName, CloudTrail userIdentity.arn) onto one principal field before the event ever reaches the linker.

The LinkableEvent model is the contract enforcement point. It rejects any record that lost its event_id or timestamp during normalization, or that carries an unknown source_type, and it does so with a typed code (ERR_LINK_031). Identifier fields are optional in the model, so a record that validates but carries no identifier at all is rejected one step later by the resolver as ERR_LINK_001. Because both rejections are typed, a rising rejection rate is a direct signal of upstream schema drift rather than a silent linking gap. Wire that signal into the error categorization frameworks so drift is triaged at the source instead of being absorbed as missed correlations. The technique_id field is optional at this layer but, when present, lets the linked session carry MITRE ATT&CK context straight through to scoring: for example, tagging a session with T1078 (Valid Accounts) and T1021 (Remote Services) to mark probable lateral movement.

Error Handling & DLQ Routing

Every failure mode produces a stable code so the DLQ is queryable and the failure is replayable. Codes follow the established ERR_CATEGORY_NNN convention.

Code	Meaning	Recovery action
`ERR_LINK_001`	No resolvable entity identifier on the event	Replay after enrichment adds an identifier; if structurally absent, the source needs a normalization rule
`ERR_LINK_011`	Event arrived beyond the window + grace period	Drop from live join; retain for forensic replay. Persistent volume signals upstream buffering or clock skew
`ERR_LINK_021`	Per-key cardinality cap hit (non-fatal warning)	Keep newest N events; investigate the entity for a scan/flood that inflates one key
`ERR_LINK_031`	Event failed `LinkableEvent` validation	Route raw payload to DLQ with the validation error; fix the normalization mapping
`ERR_LINK_041`	Redis window store unreachable	Trip circuit breaker; buffer locally with bounded backpressure and replay on reconnect

The cardinal rule is fail closed, never silently. A record that cannot be linked is preserved in the DLQ keyed by its code, not discarded — discarding it would create an invisible blind spot exactly where an attacker wants one. Late events (ERR_LINK_011) are deliberately not allowed to reopen a closed window, because a sequence rule that reopens on late data can be evaded by an adversary who intentionally delays one stage of the attack.

Performance Tuning

At the 40k EPS target the dominant cost is window-store I/O, not Python compute. Tune in this order:

Batch the Redis round-trips. Admit events in micro-batches of 256–512 per pipeline call rather than one round-trip per event; this is the single largest throughput win and pairs naturally with async log batching on the ingest side.
Size the window to the threat, not the default. A 900-second window holding ~500 events per key at 50 bytes each is ~25 KB per active entity; 200k concurrent entities is ~5 GB. Multi-stage lateral-movement rules may need 1–6 hour windows — when they do, shard Redis by key[:2] and provision memory against peak active-entity count, not average EPS.
Cap per-key cardinality (max_events_per_key) so a single scanning host cannot balloon one window. 500 is a safe default; raise it only for keys you have profiled.
Bound concurrency with the semaphore (8–16 in-flight handlers per worker is typical) so an enrichment or Redis latency spike applies backpressure instead of unbounded task growth.
Target sub-second linking latency end to end. Emit-path p99 above ~1s usually means the window store is the bottleneck — shard before adding workers.
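The sizing arithmetic in the list above can be reproduced in a few lines, which makes it easy to rerun with local numbers. The figures are the ones quoted there; shard_for_key is a hypothetical helper showing one way to turn the key[:2] prefix into a shard index, assuming the key is the hex digest produced by the resolver.

```python
EVENTS_PER_KEY = 500        # max_events_per_key default
BYTES_PER_EVENT = 50        # figure quoted above; measure your own payloads
ACTIVE_ENTITIES = 200_000   # peak concurrent entity keys

bytes_per_entity = EVENTS_PER_KEY * BYTES_PER_EVENT
total_bytes = bytes_per_entity * ACTIVE_ENTITIES

assert bytes_per_entity == 25_000      # ~25 KB per active entity
assert total_bytes == 5_000_000_000    # ~5 GB at peak


def shard_for_key(entity_key: str, shard_count: int) -> int:
    """Map the first two hex characters of the key (0-255) onto a shard."""
    if shard_count < 1:
        raise ValueError("shard_count must be at least 1")
    return int(entity_key[:2], 16) % shard_count


assert shard_for_key("00ab", 4) == 0
assert shard_for_key("ff12", 4) == 3    # 255 % 4
assert shard_for_key("1f00", 16) == 15  # 31 % 16
```

A two-character hex prefix yields only 256 distinct values, so shard counts that divide 256 evenly spread keys most uniformly.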

Verification & Observability

Confirm correct operation with three signals: a structured log line per emitted session, counters per error code, and replayable test assertions.

def test_links_across_sources() -> None:
    linker = SlidingWindowLinker(window_seconds=900)
    base = time.time()
    okta = LinkableEvent(event_id="e1", timestamp_utc=base,
                         source_type="identity", user_principal="[email protected]")
    win = LinkableEvent(event_id="e2", timestamp_utc=base + 120,
                        source_type="endpoint", user_principal="[email protected]")
    aws = LinkableEvent(event_id="e3", timestamp_utc=base + 300,
                        source_type="cloud", user_principal="[email protected]")

    for ev in (okta, win, aws):
        assert linker.admit(ev, now=base + 300) == "OK"

    key = resolve_entity_key(okta)
    session = linker.linked_session(key)
    assert session is not None
    assert {e.source_type for e in session} == {"identity", "endpoint", "cloud"}


def test_single_source_does_not_link() -> None:
    linker = SlidingWindowLinker()
    base = time.time()
    for i in range(5):
        ev = LinkableEvent(event_id=f"e{i}", timestamp_utc=base + i,
                           source_type="endpoint", host_id="HOST-1")
        linker.admit(ev, now=base + 5)
    key = resolve_entity_key(
        LinkableEvent(event_id="x", timestamp_utc=base, source_type="endpoint", host_id="HOST-1")
    )
    assert linker.linked_session(key) is None  # diversity not met

Operationally, emit metrics for: linked_sessions_emitted_total, dlq_events_total{code=...}, active_entity_keys (a proxy for memory), and link_latency_seconds (histogram). A sudden drop in linked_sessions_emitted_total with steady ingest is the canary for upstream normalization breakage. A climbing dlq_events_total{code="ERR_LINK_011"} points at clock skew or buffering, not at the linker itself.
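For local testing, the per-code counters can be approximated with a plain Counter before a real metrics client is wired in. This is only a sketch: LinkerMetrics and its method names are assumptions, while the metric names are the ones listed above.

```python
from collections import Counter


class LinkerMetrics:
    """In-process stand-in for a metrics client (hypothetical, for tests only)."""

    def __init__(self) -> None:
        self.linked_sessions_emitted_total = 0
        self.dlq_events_total: Counter[str] = Counter()

    def record_status(self, status: str) -> None:
        """Count anything other than OK or the non-fatal cap warning as a DLQ event."""
        if status not in ("OK", "ERR_LINK_021"):
            self.dlq_events_total[status] += 1

    def record_emit(self) -> None:
        """Count one linked session handed to the rule engine."""
        self.linked_sessions_emitted_total += 1

    def late_ratio(self) -> float:
        """Share of DLQ events that are late arrivals (ERR_LINK_011)."""
        total = sum(self.dlq_events_total.values())
        return self.dlq_events_total["ERR_LINK_011"] / total if total else 0.0


metrics = LinkerMetrics()
for status in ("OK", "ERR_LINK_011", "ERR_LINK_011", "ERR_LINK_031", "ERR_LINK_021"):
    metrics.record_status(status)
assert metrics.dlq_events_total["ERR_LINK_011"] == 2
assert metrics.late_ratio() == 2 / 3
```

A late ratio that dominates the DLQ points toward clock skew or buffering, consistent with the reading of ERR_LINK_011 given above.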

Troubleshooting

Sessions never link even though events clearly relate. Root cause: entity resolution maps the records to different keys (e.g. jdoe vs [email protected]). Fix: normalize the principal field upstream and add the alias to the identity-graph lookup before hashing.
Phantom incidents from unrelated activity. Root cause: the resolver keys on src_ip alone, so NAT or a shared egress IP fuses many users. Fix: prefer user/host identifiers over IP, and treat IP as a tiebreaker only.
Memory climbs without bound. Root cause: a scanning or service account inflates one key past the cap, or the window is far larger than the threat dwell time. Fix: lower max_events_per_key, shorten the window, or shard Redis by key prefix.
High ERR_LINK_011 rate. Root cause: clock skew between sources exceeds the grace period, or an upstream buffer is releasing batches late. Fix: widen grace_seconds modestly and fix source NTP; do not paper over skew by reopening windows.
Linked sessions arrive but score wrong. Root cause: linking succeeds but technique_id and asset context never propagate. Fix: carry ATT&CK and criticality fields through emit and verify them in dynamic severity scoring.

FAQ

How is cross-source event linking different from ordinary correlation?

Ordinary correlation often means matching events against a rule within a single source or a single field. Cross-source linking specifically joins events that share no common field verbatim across heterogeneous systems — identity, endpoint, proxy, cloud — by first resolving their disparate identifiers to one canonical entity key, then windowing on that key. Linking is the join step that produces the multi-source session; rule evaluation and scoring happen afterward on that session.

Should entity resolution be deterministic or probabilistic?

At the linking layer, keep it deterministic: a fixed-order hash over canonical identifiers plus an authoritative alias lookup. Determinism makes every join reproducible during incident review, which matters for forensics and audit. Probabilistic or ML-based identity stitching has a place, but run it upstream to populate the canonical identifier, not inside the join — non-deterministic joins are nearly impossible to explain to an investigator after the fact.

How do I size the correlation window for multi-stage attacks?

Start from the dwell time of the behavior you are detecting. Fast credential-replay chains link inside 5–15 minutes; slow lateral movement may need 1–6 hours. Size the window to that dwell time, then cap per-key cardinality and memory against peak active-entity count. Validate the chosen size by replaying historical incidents, and feed the results into your threshold tuning strategies so the window is grounded in real attack timing rather than a guess.
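A replay-based sizing pass could look roughly like the sketch below. The incident spans are made-up placeholder values, not measurements, and suggest_window_seconds is a hypothetical helper: it takes the first-to-last event span of each historical incident and picks a window that would have covered a chosen fraction of them.

```python
import math


def suggest_window_seconds(spans: list[float], coverage: float = 0.95) -> int:
    """Return the smallest observed span that covers `coverage` of incidents."""
    if not spans:
        raise ValueError("need at least one incident span")
    if not 0 < coverage <= 1:
        raise ValueError("coverage must be in (0, 1]")
    ordered = sorted(spans)
    # Nearest-rank percentile: position of the span covering the requested share.
    rank = math.ceil(coverage * len(ordered))
    return math.ceil(ordered[rank - 1])


# Placeholder spans in seconds (first event to last event, one per incident).
example_spans = [240.0, 310.0, 420.0, 660.0, 780.0, 850.0, 900.0, 1500.0, 3600.0, 5400.0]

assert suggest_window_seconds(example_spans, coverage=0.5) == 780
assert suggest_window_seconds(example_spans, coverage=0.9) == 3600
assert suggest_window_seconds(example_spans, coverage=1.0) == 5400
```

The jump between the 50% and 90% figures in this toy data shows why the coverage target matters: a few slow incidents can multiply the window, and with it the memory provisioned per active entity.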

What happens to events that arrive after their window closes?

They are routed to the DLQ as ERR_LINK_011 and retained for forensic replay, but they do not reopen the closed window. Allowing late events to reopen a window lets an attacker evade a sequence rule simply by delaying one stage of the attack. Track the late-arrival rate as a metric: a steady climb signals clock skew or upstream buffering that should be fixed at the source.
