Threat Intel Feed Mapping

A threat feed is worthless until an indicator inside it lines up, byte for byte, with a field in your live telemetry. The gap between “we subscribe to twelve feeds” and “an analyst was alerted the instant a known C2 domain appeared in proxy logs” is an engineering problem: heterogeneous feed formats must be parsed, every indicator must be coerced into one canonical, typed shape, and that shape must be indexed so it can be matched against millions of events per second without melting the pipeline. This page is part of the broader SOC Log Architecture & Taxonomy discipline, and it sits at the seam between external context and internal detection — feed mapping is the deterministic transform that turns a vendor’s CSV row or STIX object into a correlation signal your rule engine can actually fire on. Get the mapping wrong and you either drown analysts in stale matches or, worse, miss the one indicator that mattered because a domain arrived with a trailing dot.

Problem Framing

Consider a concrete failure scenario. A SOC subscribes to four representative sources: an OSINT CSV blocklist (~90k IPv4/CIDR rows, refreshed hourly), a commercial STIX/TAXII feed of domains and file hashes (~15k objects/day), a MISP instance exporting JSON events with per-attribute confidence, and an internal abuse-team spreadsheet of phishing URLs. Each source names the same primitive differently: the CSV header is indicator, MISP calls it value with a sibling type, and STIX wraps it in a pattern like [domain-name:value = 'evil.example.com']. Confidence is a 0–100 integer in MISP, an optional confidence property in STIX, and entirely absent from the CSV. Timestamps span three formats and two timezones.
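To make the timestamp problem concrete, here is a minimal sketch of coercing mixed feed timestamps into timezone-aware UTC. The three accepted shapes (ISO 8601 with an offset, a naive "YYYY-MM-DD HH:MM:SS" string, and Unix epoch seconds) and the helper name to_utc are assumptions for illustration, as is the policy of treating naive values as UTC; substitute whatever your feeds actually emit.

```python
from datetime import datetime, timezone

# Assumed formats for illustration only; real feeds will differ.
_FORMATS = (
    "%Y-%m-%dT%H:%M:%S%z",  # ISO 8601 with offset, e.g. 2025-01-01T12:00:00+02:00
    "%Y-%m-%d %H:%M:%S",    # naive string; the policy below decides its zone
)


def to_utc(raw: str, assume_tz: timezone = timezone.utc) -> datetime:
    """Parse a feed timestamp into a timezone-aware UTC datetime."""
    text = raw.strip()
    if text.isdigit():  # Unix epoch seconds
        return datetime.fromtimestamp(int(text), tz=timezone.utc)
    for fmt in _FORMATS:
        try:
            parsed = datetime.strptime(text, fmt)
        except ValueError:
            continue
        if parsed.tzinfo is None:
            # Naive value: attach the feed's documented zone (assumed UTC here).
            parsed = parsed.replace(tzinfo=assume_tz)
        return parsed.astimezone(timezone.utc)
    raise ValueError(f"unrecognized timestamp format: {raw!r}")
```

Rejecting anything unrecognized, rather than guessing, keeps a silent timezone shift from stretching or shrinking an indicator's lifetime.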

When that mess lands in the matcher unnormalized, three things break. First, EVIL.EXAMPLE.COM. from one feed never matches evil.example.com observed in a proxy log, because nobody lowercased or stripped the trailing dot. Second, a hash arrives as SHA256:ABCD... from one feed and a bare hex string from another, so the same malware sample registers as two distinct indicators. Third, an IP blocked two years ago still fires today because no feed carried an expiry and nothing decayed its confidence. The job of a feed mapping pipeline is to make every indicator resolve to one canonical, typed, time-bounded record that can be looked up in O(1) against normalized telemetry. The rest of this page builds exactly that.

Prerequisites & Environment

The reference implementation targets Python 3.11+ (it uses match statements and X | None typing). The third-party dependencies are pydantic for validated models and redis for the shared matchable index; stix2 is optional and only needed for the STIX adapter.

python3 -m venv .venv
source .venv/bin/activate
pip install "pydantic>=2.6,<3.0" "redis>=5.0,<6.0" "stix2>=3.0,<4.0"

Infrastructure assumptions:

A Redis 7 instance reachable from every matcher worker, holding the indicator index as hashes keyed by ioc:{type}:{value}. Single-node Redis comfortably indexes a few million live indicators; shard by indicator-type prefix beyond that.
An upstream normalized event stream. Matching assumes telemetry already conforms to the site’s Elastic Common Schema (ECS) model so that source.ip, dns.question.name, and file.hash.sha256 are populated consistently. If yours are not, run events through JSON event normalization first — matching against un-normalized fields is the single most common cause of silent misses.
A dead-letter queue (a Redis stream, Kafka topic, or object-store prefix) for indicators that cannot be parsed or validated, so feed-format drift is replayable rather than dropped.

Feed credentials (TAXII basic auth, MISP API keys) must come from a secrets manager or short-lived tokens, never static config — a leaked feed key is an attacker’s map of your detections.

Architecture Overview

The mapper is one logical stage with four internal phases — adapt, normalize, validate, index — fed by raw feeds and feeding the matcher that joins indicators against live telemetry. Each phase fails closed into the DLQ rather than passing a malformed indicator downstream.

The pivotal design decision is that the index is keyed by the canonical indicator, not by the feed it came from. Two feeds reporting the same domain collapse to one record whose confidence is the max (or a policy-defined merge) of the contributors, so a hit carries the strongest available context rather than firing twice. Adapters are the only feed-aware code in the system; everything after normalization operates on one shape, which is what keeps the matcher both fast and feed-agnostic.

Step-by-Step Implementation

Step 1 — Define a typed, validated indicator contract

Mapping onto raw dictionaries invites KeyError cascades and silent type drift. A Pydantic model gives every indicator a validated, comparable shape, canonicalizes its value, and provides a single rejection point.

from __future__ import annotations

import hashlib
import re
from datetime import datetime, timezone
from enum import Enum

from pydantic import BaseModel, Field, field_validator, model_validator


class IOCType(str, Enum):
    IPV4 = "ipv4-addr"
    DOMAIN = "domain-name"
    URL = "url"
    SHA256 = "file-sha256"


_SHA256_RE = re.compile(r"^[a-f0-9]{64}$")


class Indicator(BaseModel):
    """Canonical, ECS-mappable threat indicator."""

    ioc_type: IOCType
    value: str = Field(min_length=1)
    source_feed: str = Field(min_length=1)
    confidence: int = Field(ge=0, le=100)
    first_seen: datetime
    expires_at: datetime

    @field_validator("first_seen", "expires_at")
    @classmethod
    def must_be_utc(cls, v: datetime) -> datetime:
        if v.tzinfo is None:
            raise ValueError("timestamps must be timezone-aware UTC")
        return v.astimezone(timezone.utc)

    @model_validator(mode="after")
    def canonicalize(self) -> "Indicator":
        v = self.value.strip()
        match self.ioc_type:
            case IOCType.DOMAIN:
                v = v.lower().rstrip(".")
            case IOCType.SHA256:
                v = v.split(":")[-1].lower()
                if not _SHA256_RE.match(v):
                    raise ValueError("invalid sha256 hex")
            case IOCType.URL | IOCType.IPV4:
                pass  # already stripped above; no further normalization here
        object.__setattr__(self, "value", v)
        return self

    @property
    def index_key(self) -> str:
        return f"ioc:{self.ioc_type.value}:{self.value}"

    @property
    def fingerprint(self) -> str:
        return hashlib.sha256(self.index_key.encode()).hexdigest()[:16]

Canonicalization happens inside the model, so it is impossible to construct an Indicator whose value is not already normalized — the trailing-dot and SHA256: prefix bugs from the problem statement cannot occur downstream.
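The model above only strips whitespace from IPv4 and URL values. If stricter handling is wanted, one possible extension is sketched below using only the standard library. The helper names canonical_ipv4 and canonical_url are hypothetical, and the URL rules shown (lowercase scheme and host, drop the host's trailing dot, discard userinfo and fragment, preserve path and query case) are an assumed policy, not something the model enforces.

```python
import ipaddress
from urllib.parse import urlsplit, urlunsplit


def canonical_ipv4(raw: str) -> str:
    """Return the dotted-quad form; raise ValueError for anything else (including CIDR)."""
    return str(ipaddress.IPv4Address(raw.strip()))


def canonical_url(raw: str) -> str:
    """Normalize the parts of a URL that are case-insensitive, leaving the rest alone."""
    parts = urlsplit(raw.strip())
    if not parts.scheme or not parts.hostname:
        raise ValueError(f"not an absolute URL: {raw!r}")
    host = parts.hostname.rstrip(".")  # .hostname is already lowercased
    if ":" in host:  # IPv6 literal: restore the brackets urlsplit removed
        host = f"[{host}]"
    netloc = host if parts.port is None else f"{host}:{parts.port}"
    # Userinfo and fragment are dropped on purpose (assumed policy).
    return urlunsplit((parts.scheme.lower(), netloc, parts.path, parts.query, ""))
```

Note that canonical_ipv4 rejects CIDR rows such as 10.0.0.0/8; an exact-match index cannot represent a range, so those rows need a separate decision (expand, store in a range structure, or quarantine).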

Step 2 — Map each feed dialect with a dedicated adapter

Adapters are the only feed-aware code. Each one converts a source-specific record into the canonical model, applying that feed’s own confidence convention and defaulting an expiry when the feed omits one.

from datetime import timedelta
from typing import Any, Iterable


class AdapterError(ValueError):
    """Raised when a source record cannot be mapped to an Indicator."""


def adapt_misp(attr: dict[str, Any], feed: str) -> Indicator:
    type_map = {
        "ip-dst": IOCType.IPV4, "ip-src": IOCType.IPV4,
        "domain": IOCType.DOMAIN, "url": IOCType.URL, "sha256": IOCType.SHA256,
    }
    ioc_type = type_map.get(attr.get("type", ""))
    if ioc_type is None:
        raise AdapterError(f"unmapped MISP type: {attr.get('type')}")
    now = datetime.now(timezone.utc)
    return Indicator(
        ioc_type=ioc_type,
        value=attr["value"],
        source_feed=feed,
        confidence=int(attr.get("confidence", 50)),
        first_seen=now,
        expires_at=now + timedelta(days=int(attr.get("ttl_days", 30))),
    )


def adapt_osint_csv(row: dict[str, str], feed: str) -> Indicator:
    now = datetime.now(timezone.utc)
    # CSV blocklist carries no confidence or expiry -> apply feed policy.
    return Indicator(
        ioc_type=IOCType.IPV4,
        value=row["indicator"],
        source_feed=feed,
        confidence=60,
        first_seen=now,
        expires_at=now + timedelta(days=7),
    )


ADAPTERS = {"misp": adapt_misp, "osint_csv": adapt_osint_csv}


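def map_records(feed: str, kind: str, records: Iterable[dict]) -> Iterable[Indicator]:
    # Lazy generator: an exception from an adapter ends iteration, so callers
    # that must survive bad rows should catch per record and route to the DLQ.
    adapt = ADAPTERS[kind]  # KeyError here means the feed kind is unregistered
    for record in records:
        yield adapt(record, feed)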

A new feed format means writing one adapter, not touching the matcher. STIX pattern extraction lives in its own adapter detailed under integrating STIX/TAXII feeds into SIEM, and tabular sources reuse the column-handling rules from CSV ingestion patterns.
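For orientation, the simplest STIX case from the problem statement can be handled with a regular expression. This is a rough sketch, not the STIX adapter itself: it accepts only a single equality comparison inside one observation expression, and both parse_simple_pattern and the path-to-type table are hypothetical. Compound patterns (AND, OR, FOLLOWEDBY, qualifiers) need a real pattern parser.

```python
import re

# One equality comparison in one observation expression, nothing more.
_SIMPLE_PATTERN = re.compile(
    r"^\[\s*(?P<path>[a-z0-9-]+:[A-Za-z0-9_.'-]+)\s*=\s*'(?P<value>(?:[^'\\]|\\.)*)'\s*\]$"
)

# Assumed mapping from STIX object paths to this page's type strings.
_PATH_TO_TYPE = {
    "domain-name:value": "domain-name",
    "ipv4-addr:value": "ipv4-addr",
    "url:value": "url",
    "file:hashes.'SHA-256'": "file-sha256",
}


def parse_simple_pattern(pattern: str) -> tuple[str, str]:
    """Return (ioc_type, raw_value) or raise ValueError for anything unsupported."""
    m = _SIMPLE_PATTERN.match(pattern.strip())
    if m is None:
        raise ValueError(f"unsupported or compound pattern: {pattern!r}")
    ioc_type = _PATH_TO_TYPE.get(m["path"])
    if ioc_type is None:
        raise ValueError(f"unmapped object path: {m['path']!r}")
    value = re.sub(r"\\(.)", r"\1", m["value"])  # undo \' and \\ escapes
    return ioc_type, value
```

The returned value is still raw; it should pass through the same canonicalization as every other feed before it is indexed.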

Step 3 — Apply age-based confidence decay

A static confidence score makes a two-year-old IP as trustworthy as one seen this morning. Decay reduces an indicator’s effective confidence as it ages toward expiry, so the matcher can suppress weak, stale hits.

def effective_confidence(ind: Indicator, now: datetime) -> int:
    """Linearly decay confidence across the indicator's lifetime."""
    if now >= ind.expires_at:
        return 0
    lifetime = (ind.expires_at - ind.first_seen).total_seconds()
    if lifetime <= 0:
        return ind.confidence
    elapsed = (now - ind.first_seen).total_seconds()
    remaining = max(0.0, 1.0 - elapsed / lifetime)
    return round(ind.confidence * remaining)

The decay curve is deliberately simple and explainable; an exponential half-life is a drop-in replacement when a feed warrants it, but linear decay is auditable, which matters when an analyst asks why a hit scored 18 instead of 60.
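For comparison, a half-life variant might look like the sketch below. It takes plain fields rather than an Indicator so that it stands alone, and the function name and the seven-day default half-life are assumptions; adapting it to the effective_confidence signature is a matter of passing ind.confidence, ind.first_seen, and ind.expires_at.

```python
from datetime import datetime, timedelta, timezone


def half_life_confidence(
    confidence: int,
    first_seen: datetime,
    expires_at: datetime,
    now: datetime,
    half_life: timedelta = timedelta(days=7),  # assumed default, tune per feed
) -> int:
    """Halve confidence every `half_life`; still drop to zero at expiry."""
    if now >= expires_at:
        return 0
    elapsed = max(0.0, (now - first_seen).total_seconds())
    return round(confidence * 0.5 ** (elapsed / half_life.total_seconds()))
```

Unlike the linear curve, this one loses confidence fastest while the indicator is young and then flattens out, so the hard cutoff at expires_at is what finally silences it.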

Step 4 — Index and match from a bounded async loop

The indexer writes canonical records to Redis; the matcher looks up normalized telemetry fields against that index. A concurrency-limited asyncio loop applies backpressure so a feed-refresh burst cannot outrun Redis.

import asyncio
import json
import logging

import redis.asyncio as redis

logger = logging.getLogger("soc.tifm")


async def index_indicator(r: redis.Redis, ind: Indicator) -> None:
    """Upsert, keeping the highest confidence when feeds overlap.

    The read-then-write is not atomic; concurrent writers to one key can race.
    """
    ttl = int((ind.expires_at - datetime.now(timezone.utc)).total_seconds())
    if ttl <= 0:
        return  # already expired: indexing it would leave a key with no TTL
    key = ind.index_key
    existing = await r.hget(key, "confidence")
    conf = ind.confidence
    if existing is not None:
        conf = max(conf, int(existing))
    # Provenance and timestamps follow the most recent writer, even when the
    # stored confidence came from an earlier feed.
    await r.hset(key, mapping={
        "confidence": conf,
        "source_feed": ind.source_feed,
        "first_seen": ind.first_seen.isoformat(),
        "expires_at": ind.expires_at.isoformat(),
    })
    await r.expire(key, ttl)


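async def match_event(r: redis.Redis, event: dict, now: datetime) -> dict | None:
    """Probe an ECS event's IOC-bearing fields against the index.

    Assumes the client was created with decode_responses=True (so hash fields
    come back as str, not bytes) and that events use flat dotted ECS keys.
    """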
    probes = {
        IOCType.IPV4: event.get("source.ip"),
        IOCType.DOMAIN: (event.get("dns.question.name") or "").lower().rstrip("."),
        IOCType.SHA256: (event.get("file.hash.sha256") or "").lower(),
    }
    for ioc_type, value in probes.items():
        if not value:
            continue
        record = await r.hgetall(f"ioc:{ioc_type.value}:{value}")
        if not record:
            continue
        ind = Indicator(
            ioc_type=ioc_type, value=value, source_feed=record["source_feed"],
            confidence=int(record["confidence"]),
            first_seen=datetime.fromisoformat(record["first_seen"]),
            expires_at=datetime.fromisoformat(record["expires_at"]),
        )
        score = effective_confidence(ind, now)
        if score <= 0:
            continue
        return {**event, "threat.indicator.matched": value,
                "threat.indicator.confidence": score,
                "threat.feed": ind.source_feed}
    return None


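async def run_matcher(
    incoming: asyncio.Queue[dict],
    emit: asyncio.Queue[dict],
    dlq: asyncio.Queue[tuple[str, dict]],
    r: redis.Redis,
    concurrency: int = 16,
) -> None:
    sem = asyncio.Semaphore(concurrency)
    in_flight: set[asyncio.Task[None]] = set()  # strong refs so tasks are not GC'd

    async def handle(event: dict) -> None:
        try:
            enriched = await match_event(r, event, datetime.now(timezone.utc))
            if enriched is not None:
                await emit.put(enriched)
                logger.info(json.dumps({
                    "msg": "ioc_match",
                    "indicator": enriched["threat.indicator.matched"],
                    "confidence": enriched["threat.indicator.confidence"],
                    "feed": enriched["threat.feed"],
                }))
        except (KeyError, ValueError) as exc:
            await dlq.put(("ERR_TIFM_031", {"event": event, "error": str(exc)}))
        except Exception:
            # e.g. Redis connection errors; hand these to ERR_TIFM_041 handling.
            logger.exception("unexpected matcher failure")
        finally:
            sem.release()
            incoming.task_done()

    while True:
        event = await incoming.get()
        # Acquire before spawning: the loop stops pulling events once
        # `concurrency` handlers are in flight, which is the backpressure.
        await sem.acquire()
        task = asyncio.create_task(handle(event))
        in_flight.add(task)
        task.add_done_callback(in_flight.discard)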

The loop is transport-agnostic: swap the asyncio.Queue for a Kafka or Kinesis consumer without touching mapping or matching logic, which keeps both unit-testable against recorded fixtures.

Schema & Validation Integration

Feed mapping is only as reliable as the field consistency it inherits on both sides. The matcher probes source.ip, dns.question.name, and file.hash.sha256 — canonical ECS names that only exist because telemetry was normalized upstream by JSON event normalization. Indicators arriving from network appliances embedded in Syslog RFC standards messages must have their PRI/HEADER stripped and the IOC extracted before adaptation, or the matcher compares an indicator against a raw syslog line and silently misses.

The Indicator model is the contract enforcement point on the feed side. Because rejection is typed, a rising validation-failure rate is a direct signal of upstream feed-format drift rather than a quiet mapping gap — wire expires_at parse failures and unmapped-type errors into the error categorization frameworks so a vendor changing their schema is triaged at the source. For high-assurance pipelines, run adapter output through the same schema validation pipelines used elsewhere on the site, asserting every indexed indicator carries a valid type, a 0–100 confidence, and a future expires_at. The optional ATT&CK linkage is also resolved here: tagging a C2 domain with T1071 (Application Layer Protocol) lets a match carry technique context straight into MITRE ATT&CK integration and downstream scoring.
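The three assertions named above (valid type, 0–100 confidence, future expiry) can be expressed as a small post-adapter check. This is a standalone sketch over plain dictionaries; validation_errors is a hypothetical helper, and the field names simply mirror the Indicator model.

```python
from datetime import datetime, timedelta, timezone
from typing import Any

VALID_TYPES = ("ipv4-addr", "domain-name", "url", "file-sha256")


def validation_errors(record: dict[str, Any], now: datetime) -> list[str]:
    """Return human-readable problems; an empty list means the record passes."""
    errors: list[str] = []
    if record.get("ioc_type") not in VALID_TYPES:
        errors.append("ioc_type is not a known indicator type")
    conf = record.get("confidence")
    if type(conf) is not int or not 0 <= conf <= 100:
        errors.append("confidence must be an integer in 0..100")
    expires = record.get("expires_at")
    if not isinstance(expires, datetime) or expires.tzinfo is None:
        errors.append("expires_at must be a timezone-aware datetime")
    elif expires <= now:
        errors.append("expires_at must be in the future")
    return errors
```

Returning a list instead of raising on the first problem makes the failure counters more useful, since one bad record can contribute to several error categories at once.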

Error Handling & DLQ Routing

Every failure mode produces a stable code so the DLQ is queryable and the failure is replayable. Codes follow the established ERR_CATEGORY_NNN convention.

Code	Meaning	Recovery action
`ERR_TIFM_001`	Source record has no mappable indicator type	Add or correct the type mapping in the feed’s adapter; replay the record
`ERR_TIFM_011`	Indicator value failed canonicalization (e.g. malformed sha256)	Route raw record to DLQ; if systemic, the feed is emitting corrupt values — escalate to the provider
`ERR_TIFM_021`	Missing or non-UTC timestamp; expiry could not be derived	Apply the feed’s default-TTL policy in the adapter and re-map
`ERR_TIFM_031`	Telemetry event failed probe-field extraction	Fix upstream normalization mapping; the event reached the matcher un-normalized
`ERR_TIFM_041`	Redis index unreachable	Trip circuit breaker; buffer indexing locally with bounded backpressure and replay on reconnect

The cardinal rule is fail closed, never silently. An indicator that cannot be mapped is preserved in the DLQ keyed by its code, not discarded — discarding it creates an invisible blind spot exactly where an attacker benefits. A spike in ERR_TIFM_001 almost always means a feed introduced a new attribute type, and the fix is one adapter line plus a replay of the quarantined batch.
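One way to apply the fail-closed rule on the mapping side is to wrap adaptation so that every rejected record lands in the DLQ with a code. The sketch below is illustrative only: classify, map_with_dlq, and demo_adapt are hypothetical, a plain list stands in for the DLQ, and the exception-to-code mapping is an assumption (it does not attempt to distinguish timestamp failures for ERR_TIFM_021).

```python
from typing import Any, Callable, Iterable, Iterator


class AdapterError(ValueError):
    """Stand-in for the adapter error type used on this page."""


def classify(exc: Exception) -> str:
    """Map an exception to a stable DLQ code (assumed mapping)."""
    if isinstance(exc, AdapterError):
        return "ERR_TIFM_001"  # no mappable indicator type
    return "ERR_TIFM_011"      # missing or malformed value


def map_with_dlq(
    records: Iterable[dict[str, Any]],
    adapt: Callable[[dict[str, Any]], Any],
    dlq: list[dict[str, Any]],
) -> Iterator[Any]:
    """Yield adapted records; quarantine failures instead of aborting the batch."""
    for record in records:
        try:
            yield adapt(record)
        except (KeyError, ValueError) as exc:
            dlq.append({"code": classify(exc), "record": record, "error": str(exc)})


def demo_adapt(record: dict[str, Any]) -> str:
    """Toy adapter: accepts only domains and canonicalizes them."""
    if record.get("type") != "domain":
        raise AdapterError(f"unmapped type: {record.get('type')}")
    return record["value"].lower().rstrip(".")


dlq: list[dict[str, Any]] = []
rows = [
    {"type": "domain", "value": "EVIL.example.com."},
    {"type": "mutex", "value": "x"},   # unmapped type
    {"type": "domain"},                # missing value field
]
mapped = list(map_with_dlq(rows, demo_adapt, dlq))
```

Keeping the raw record alongside the code is what makes the quarantined batch replayable once the adapter is fixed.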

Performance Tuning

At high event rates the dominant cost is index I/O, not Python compute. Tune in this order:

Pipeline the Redis probes. Send the three per-event HGETALL lookups through a single Redis pipeline so they share one round trip (MGET does not apply here, because the records are hashes), and pipeline indexing in micro-batches of 256–512 indicators per refresh rather than one round trip each. This is the largest throughput win, and it pairs naturally with async log batching on the ingest side.
Throttle feed refreshes so an hourly 90k-row blocklist reload does not stall live matching; apply rate limiting strategies to the indexer’s write path and stagger feed pulls.
Let Redis TTL do expiry work. Setting per-key TTL from expires_at means stale indicators evict themselves; decay is computed at match time by effective_confidence, so no background job has to sweep the index for deletions and the hot path stays lean.
Bound concurrency with the semaphore (16–32 in-flight handlers per worker is typical) so a Redis latency spike applies backpressure instead of unbounded task growth.
Budget enrichment latency. For real-time correlation, target a sub-millisecond index lookup and a p99 match-path under ~5 ms; if matching can’t keep pace, shard the index by indicator-type prefix before adding workers.
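The micro-batching mentioned in the first item can be sketched independently of Redis. The helper name micro_batches is hypothetical; each yielded list is what would be queued on one pipeline before it is executed.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def micro_batches(items: Iterable[T], size: int = 256) -> Iterator[list[T]]:
    """Yield consecutive lists of at most `size` items without loading everything."""
    if size < 1:
        raise ValueError("size must be >= 1")
    it = iter(items)
    while batch := list(islice(it, size)):
        yield batch
```

Because it consumes the source lazily, a 90k-row reload never needs to be held in memory in full, and a short sleep between batches is a natural place to apply the write-path throttle.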

A 3-million-indicator index at ~120 bytes per record is roughly 360 MB in Redis — provision memory against peak live-indicator count, not daily feed volume, since TTL caps what stays resident.
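The sizing arithmetic is simple enough to keep next to the capacity plan. The function below is a back-of-envelope sketch; the 120-byte default is the per-record figure quoted above, not a measured value, so check it against your own instance before relying on it.

```python
def index_memory_mb(live_indicators: int, bytes_per_record: int = 120) -> float:
    """Estimate resident index size in megabytes (1 MB = 1,000,000 bytes)."""
    return live_indicators * bytes_per_record / 1_000_000


estimate = index_memory_mb(3_000_000)  # 3,000,000 * 120 bytes
```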

Verification & Observability

Confirm correct operation with three signals: a structured log line per match, counters per error code, and replayable test assertions.

def test_domain_canonicalization_matches() -> None:
    now = datetime.now(timezone.utc)
    ind = Indicator(
        ioc_type=IOCType.DOMAIN, value="EVIL.EXAMPLE.COM.",
        source_feed="misp", confidence=80,
        first_seen=now, expires_at=now + timedelta(days=30),
    )
    assert ind.value == "evil.example.com"
    assert ind.index_key == "ioc:domain-name:evil.example.com"


def test_confidence_decays_to_zero_at_expiry() -> None:
    start = datetime(2025, 1, 1, tzinfo=timezone.utc)
    ind = Indicator(
        ioc_type=IOCType.IPV4, value="203.0.113.10",
        source_feed="osint_csv", confidence=60,
        first_seen=start, expires_at=start + timedelta(days=10),
    )
    midpoint = start + timedelta(days=5)
    assert effective_confidence(ind, midpoint) == 30
    assert effective_confidence(ind, ind.expires_at) == 0

Operationally, emit metrics for: ioc_matches_total{feed=...}, indicators_indexed_total, dlq_events_total{code=...}, live_indicator_count (a proxy for memory), and match_latency_seconds (histogram). A sudden drop in ioc_matches_total with steady ingest is the canary for upstream telemetry normalization breakage. A climbing dlq_events_total{code="ERR_TIFM_001"} points at a feed schema change, not at the matcher.

Troubleshooting

A known-bad domain in a feed never matches proxy logs. Root cause: case or trailing-dot mismatch, or the telemetry field is un-normalized. Fix: confirm canonicalization ran on both sides — the indicator via the Indicator model and the event via upstream event normalization.
The same indicator fires two alerts. Root cause: two feeds indexed under different keys (e.g. a SHA256: prefix on one). Fix: ensure all sha256 values flow through the model’s canonicalizer so they collapse to one index_key.
Old indicators keep alerting. Root cause: no expires_at was set, or decay is not applied at match time. Fix: enforce a default TTL in the adapter (ERR_TIFM_021 guards this) and verify effective_confidence gates the emit.
Redis memory climbs without bound. Root cause: per-key TTL is not being set, so expired indicators never evict. Fix: confirm r.expire runs on every upsert and that expires_at is in the future.
Match latency spikes during feed refresh. Root cause: a bulk reload contends with live matching on the same connection. Fix: throttle the indexer write path and stagger feed pulls off-peak.

FAQ

Should I normalize indicators on ingest or match against raw feed values?

Always normalize on ingest. Matching raw feed values forces the matcher to reconcile every feed’s quirks at query time — lowercasing, trailing dots, hash prefixes — for every event, which is both slow and error-prone. Canonicalizing once inside the indicator model means the hot match path is a single O(1) lookup against a value that is already in its comparable form. The same canonical rule must also be applied to the telemetry side, which is why upstream event normalization is a hard prerequisite.

How do I stop stale indicators from generating false positives?

Give every indicator an explicit expires_at and decay its confidence as it ages. Set a Redis TTL from the expiry so the record self-evicts, and gate the emit on an effective (decayed) confidence rather than the raw feed score. An IP that scored 60 when fresh might score 18 near expiry, below your alerting threshold, so it stops firing without anyone manually pruning the list. Feeds that omit an expiry get a default TTL applied in their adapter.

How should overlapping indicators from multiple feeds be merged?

Key the index by the canonical indicator, not the feed, and merge on upsert. The simplest defensible policy is to keep the maximum confidence so a hit carries the strongest available context, while recording which feed contributed it for provenance. More sophisticated policies weight feeds by historical precision, but start with max-confidence merge — it is explainable, and explainability matters when an analyst asks why an indicator scored the way it did.

Where does feed mapping sit relative to alert correlation?

Feed mapping runs after normalization and before correlation: it enriches normalized events with threat.indicator.* fields, and the rule engine then reasons over those fields. Keeping enrichment separate means a correlation rule can treat “matched a high-confidence C2 domain” as just another condition alongside behavioral signals, which is exactly how it feeds into alert correlation and rule engines and severity scoring.
