JSON Event Normalization for SOC Pipelines

JSON event normalization is the deterministic transformation layer that converts heterogeneous, vendor-specific telemetry into a single query-ready schema. It is the stage that decides whether your detection rules fire reliably or fragment across a dozen vendor field names. This page is part of the broader SOC Log Architecture & Taxonomy pipeline, and it focuses on one technique: taking arbitrary JSON payloads from EDR agents, cloud control planes, and network sensors and emitting Elastic Common Schema (ECS) records that the correlation engine can trust. Without strict normalization, raw JSON introduces parsing drift, false-positive inflation, and silent detection gaps the moment a vendor renames a key or nests a field one level deeper.

Problem Framing

Consider a SOC ingesting roughly 40,000 events per second across three sources: a CrowdStrike-style EDR emitting deeply nested JSON, AWS CloudTrail records wrapped in an outer Records[] array, and an identity provider streaming flat sign-in events. Each source uses a different key for the same concept — LocalIP, sourceIPAddress, and ipAddress all mean source.ip. Three concrete failures follow:

Field-name fragmentation. A correlation rule that joins endpoint and identity events on source.ip matches nothing, because each source spells the field differently. The rule reports zero hits and the gap is invisible until an incident is missed in review.
Nested-structure mismatch. CloudTrail nests the actor under userIdentity.arn; the EDR nests it under actor.user.name. A flat detection query referencing user.name silently returns null, dropping the event out of behavioral baselines.
Type drift. One source emits source.port as the string "443", another as the integer 443. Range comparisons and aggregations break, and the SIEM either rejects the document or coerces it inconsistently across shards.

The fix is a normalization stage that flattens nested objects, unwinds arrays, maps every vendor key onto a canonical ECS field, and enforces types at a validation gate before anything reaches the correlation layer. Everything below builds that stage.
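
To make the first and third failure modes concrete, here is a small sketch built on three made-up events shaped like the sources above. The values and the helper name naive_join_key are illustrative assumptions, not real vendor output.

```python
from typing import Any

# Hypothetical sample events: the IP keys follow the names used above,
# every value is invented for illustration.
edr_event: dict[str, Any] = {"LocalIP": "10.0.0.5", "sourcePort": "443"}
cloudtrail_event: dict[str, Any] = {"sourceIPAddress": "10.0.0.5", "sourcePort": 443}
idp_event: dict[str, Any] = {"ipAddress": "10.0.0.5"}


def naive_join_key(event: dict[str, Any]) -> Any:
    """What a rule keyed on the canonical name sees in a raw event."""
    return event.get("source.ip")


events = [edr_event, cloudtrail_event, idp_event]

# Field-name fragmentation: every lookup misses, so the join is empty.
join_keys = [naive_join_key(e) for e in events]

# Type drift: the same port compares unequal across sources.
ports_equal = edr_event["sourcePort"] == cloudtrail_event["sourcePort"]
```

All three events carry the same address, yet join_keys holds only None values and ports_equal is False, which is exactly the silent miss described above.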

Prerequisites & Environment

Python 3.11+ — pattern matching and faster json parsing help on hot paths.
pydantic v2 (pip install "pydantic>=2.6") — typed validation and coercion at the parser boundary.
orjson (pip install orjson) — optional, 2–5x faster JSON decode for high-volume streams.
A canonical schema definition aligned to Elastic Common Schema. This is the same field contract enforced by the site’s schema validation pipelines, so normalization output and validation gate agree by construction.
A dead-letter sink (Kafka topic, SQS queue, or an append-only file) for rejected payloads.

Treat the field-mapping table itself as infrastructure-as-code: version it, review it, and ship it through CI so a schema change can never be applied out of band.

Architecture Overview

The flow is a chain of gates: decode, unwind arrays into discrete events, flatten nested objects into dotted keys, apply the vendor-to-ECS map, then validate and coerce. An event either clears the validation gate and advances to enrichment and correlation, or it is diverted to the dead-letter path with a typed error code. The state of any event is always knowable from the last gate it cleared.

Step-by-Step Implementation

Step 1 — Decode and unwind array envelopes

Cloud audit feeds wrap many events in an outer array (Records[] for CloudTrail). Unwind it deterministically so each element becomes one normalized event, and decode defensively.

import orjson
from typing import Any, Iterator


def decode_and_unwind(raw: bytes) -> Iterator[dict[str, Any]]:
    """Decode a raw payload and yield one event dict per logical event.

    Handles both single-object payloads and array envelopes such as
    CloudTrail's {"Records": [...]} without loading the stream twice.
    """
    payload = orjson.loads(raw)

    if isinstance(payload, dict) and isinstance(payload.get("Records"), list):
        yield from (rec for rec in payload["Records"] if isinstance(rec, dict))
    elif isinstance(payload, list):
        yield from (rec for rec in payload if isinstance(rec, dict))
    elif isinstance(payload, dict):
        yield payload
    else:
        raise ValueError(f"unsupported top-level JSON type: {type(payload).__name__}")
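
Since orjson is listed as optional in the prerequisites, one way to keep the decoder swappable is an import fallback. This is a sketch under that assumption; the names loads and DECODER are illustrative and do not appear in the function above.

```python
import json
from typing import Any, Callable

try:
    import orjson  # optional accelerator

    loads: Callable[[bytes], Any] = orjson.loads
    DECODER = "orjson"
except ImportError:
    # The standard library accepts bytes too, so callers need no changes.
    loads = json.loads
    DECODER = "json"


def is_valid_json(raw: bytes) -> bool:
    """True if the payload decodes; both decoders raise ValueError subclasses."""
    try:
        loads(raw)
    except ValueError:
        return False
    return True
```

Catching ValueError covers both decoders' decode errors, so the calling code does not need to know which one is active.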

Step 2 — Flatten nested objects to dotted keys

Recursive flattening turns {"userIdentity": {"arn": "..."}} into {"userIdentity.arn": "..."}, with deterministic prefixing so deeply nested keys never collide. Arrays of scalars are preserved; arrays of objects are left for explicit handling in the map step.

from typing import Any


def flatten(obj: dict[str, Any], parent: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten a nested JSON object into single-level dotted keys.

    Scalar arrays are kept intact; nested dicts are recursed with a
    stable prefix so identical leaf names under different parents do
    not overwrite one another.
    """
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        compound = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            flat.update(flatten(value, compound, sep))
        else:
            # Scalars and lists (of scalars or of objects) pass through unchanged.
            flat[compound] = value
    return flat
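
One edge case the prefixing does not cover: if a vendor key already contains the separator (a literal "a.b" key next to a nested a then b), both paths flatten to the same dotted name and the later value silently wins. A hedged variant that refuses such payloads could look like the following; flatten_checked is a hypothetical name, not part of the pipeline above.

```python
from typing import Any


def flatten_checked(obj: dict[str, Any], parent: str = "", sep: str = ".") -> dict[str, Any]:
    """Flatten nested dicts to dotted keys, raising if two paths collide."""
    flat: dict[str, Any] = {}
    for key, value in obj.items():
        compound = f"{parent}{sep}{key}" if parent else key
        if isinstance(value, dict):
            items = flatten_checked(value, compound, sep)
        else:
            items = {compound: value}  # scalars and lists are kept as-is
        for dotted, leaf in items.items():
            if dotted in flat:
                raise ValueError(f"flattened key collision: {dotted}")
            flat[dotted] = leaf
    return flat


# Identical leaf names under different parents stay distinct.
distinct = flatten_checked({"userIdentity": {"arn": "x"}, "actor": {"arn": "y"}})
```

Whether to reject or merely log a collision is a policy choice; raising keeps the failure visible instead of letting one value overwrite the other.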

Step 3 — Map vendor keys onto canonical ECS fields

The mapping table is the heart of normalization. Each source declares an ordered list of candidate source keys per ECS target; the first present, non-null value wins. Keeping the table data-driven means adding a vendor is a config change, not a code change.

from typing import Any

# Ordered candidates: first present value wins. Drives all sources.
ECS_FIELD_MAP: dict[str, tuple[str, ...]] = {
    "event.id":        ("eventID", "id", "event_id"),
    "event.action":    ("eventName", "action", "event_type"),
    "@timestamp":      ("eventTime", "timestamp", "@timestamp"),
    "source.ip":       ("sourceIPAddress", "LocalIP", "ipAddress", "src_ip", "source.address"),
    "source.port":     ("sourcePort", "src_port", "source.port"),
    "user.name":       ("userIdentity.userName", "actor.user.name", "user", "identity.username"),
    "cloud.provider":  ("eventSource", "cloud.provider"),
    "process.name":    ("process.name", "executable.name", "process"),
}


def map_to_ecs(flat: dict[str, Any]) -> dict[str, Any]:
    """Project a flattened event onto canonical ECS field names."""
    out: dict[str, Any] = {}
    for ecs_field, candidates in ECS_FIELD_MAP.items():
        for candidate in candidates:
            value = flat.get(candidate)
            if value not in (None, ""):
                out[ecs_field] = value
                break
    return out

Step 4 — Validate, coerce, and emit at the gate

A Pydantic model is the final gate. It coerces types (string ports to integers, raw strings to IPvAnyAddress), rejects malformed records, and stamps a deterministic correlation_id so replays are idempotent.

import hashlib
import logging
import sys
from typing import Any, Iterator

from pydantic import BaseModel, ConfigDict, Field, IPvAnyAddress, ValidationError

logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s | %(levelname)s | %(name)s | %(message)s",
    handlers=[logging.StreamHandler(sys.stdout)],
)
logger = logging.getLogger("soc_normalizer")


class NormalizedEvent(BaseModel):
    """Canonical ECS-aligned event emitted to the correlation engine."""

    model_config = ConfigDict(populate_by_name=True, extra="ignore")

    event_id: str = Field(alias="event.id")
    event_action: str = Field(alias="event.action")
    timestamp_utc: str = Field(alias="@timestamp")
    source_ip: IPvAnyAddress | None = Field(default=None, alias="source.ip")
    source_port: int | None = Field(default=None, alias="source.port")
    user_name: str | None = Field(default=None, alias="user.name")
    cloud_provider: str | None = Field(default=None, alias="cloud.provider")
    process_name: str | None = Field(default=None, alias="process.name")
    correlation_id: str | None = None


def normalize(raw: bytes) -> Iterator[dict[str, Any]]:
    """End-to-end: decode → unwind → flatten → map → validate → emit."""
    for event in decode_and_unwind(raw):
        mapped = map_to_ecs(flatten(event))
        try:
            validated = NormalizedEvent.model_validate(mapped)
        except ValidationError as exc:
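            # Put the error list in the message itself: the formatter configured
            # above does not render fields passed via `extra`.
            logger.warning(
                "ERR_SCHEMA_002 normalization rejected event: %s",
                exc.errors(include_url=False),
            )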
            yield {"status": "rejected", "code": "ERR_SCHEMA_002", "raw": mapped}
            continue

        seed = f"{validated.event_id}|{validated.timestamp_utc}".encode()
        validated.correlation_id = hashlib.sha256(seed).hexdigest()[:32]
        # event_id goes in the message so the configured format actually shows it.
        logger.info("event normalized event_id=%s", validated.event_id)
        yield validated.model_dump(by_alias=True, mode="json")

The pipeline rejects malformed payloads before they contaminate correlation, and because correlation_id is derived only from stable input fields, reprocessing the same raw payload yields byte-identical output.
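
The idempotency claim rests entirely on how the ID is seeded. The derivation can be isolated and checked on its own with the standard library; the function name and sample values below are illustrative.

```python
import hashlib


def correlation_id(event_id: str, timestamp_utc: str) -> str:
    """Derive a 32-hex-character ID from stable input fields only."""
    seed = f"{event_id}|{timestamp_utc}".encode()
    return hashlib.sha256(seed).hexdigest()[:32]


first = correlation_id("abc-123", "2024-01-01T00:00:00Z")
replay = correlation_id("abc-123", "2024-01-01T00:00:00Z")
other = correlation_id("abc-123", "2024-01-01T00:00:01Z")
```

first and replay are equal because nothing volatile (ingest time, worker ID) enters the seed, while other differs because the source timestamp changed.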

Schema & Validation Integration

Normalization does not own the schema — it conforms to it. The NormalizedEvent model here is the same ECS contract enforced downstream by schema validation pipelines, so a field that normalizes cleanly here will not be rejected later. Two integration rules keep the two stages aligned:

Single source of truth for field names. ECS_FIELD_MAP keys and the model aliases must reference identical ECS targets. Drift between them is the most common cause of “normalized but rejected” events.
Coercion belongs at this gate, not downstream. Convert "443" to 443 and a raw string to a validated IP here. Once an event passes, every consumer can assume types are correct, which is what makes cross-source joins on source.ip and user.name reliable.
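
The first rule lends itself to an automated check in CI. The sketch below compares two plain sets of names; alignment_drift is a hypothetical helper, and the sets are hand-written stand-ins. With Pydantic v2 the alias side should be collectable from the model's model_fields mapping, but that wiring is left out here as an assumption.

```python
def alignment_drift(map_targets: set[str], model_aliases: set[str]) -> dict[str, set[str]]:
    """Report ECS names present on one side of the contract but not the other."""
    return {
        "mapped_not_modeled": map_targets - model_aliases,
        "modeled_not_mapped": model_aliases - map_targets,
    }


# Stand-in data: "user.name" is deliberately misspelled on the map side.
map_targets = {"event.id", "event.action", "@timestamp", "source.ip", "user.nam"}
model_aliases = {"event.id", "event.action", "@timestamp", "source.ip", "user.name"}

drift = alignment_drift(map_targets, model_aliases)
```

A CI job could fail the build whenever either set in drift is non-empty, which catches the "normalized but rejected" class of bug before deployment.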

When a vendor introduces an array of objects — for example process.children[] — decide explicitly whether to unwind it into discrete events or aggregate it into a canonical list field. Leaving array handling implicit is how correlation rules break silently under a new integration.
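
Both options can be written as small explicit helpers that run on the flattened event. The following is a sketch only: the helper names, the leaf key name, and the target key process.children.name are assumptions chosen for illustration, not established field names.

```python
from typing import Any


def unwind_children(event: dict[str, Any], key: str) -> list[dict[str, Any]]:
    """Fan out: one event per array element, parent fields copied onto each."""
    children = event.get(key)
    if not isinstance(children, list) or not children:
        return [event]
    parent = {k: v for k, v in event.items() if k != key}
    return [{**parent, key: child} for child in children]


def aggregate_children(event: dict[str, Any], key: str, leaf: str, target: str) -> dict[str, Any]:
    """Aggregate: collapse the array into a flat list of one leaf value."""
    children = event.get(key)
    out = {k: v for k, v in event.items() if k != key}
    if isinstance(children, list):
        out[target] = [c[leaf] for c in children if isinstance(c, dict) and leaf in c]
    return out


flat_event = {
    "process.name": "cmd.exe",
    "process.children": [{"name": "whoami.exe"}, {"name": "net.exe"}],
}
unwound = unwind_children(flat_event, "process.children")
aggregated = aggregate_children(flat_event, "process.children", "name", "process.children.name")
```

Unwinding suits rules that alert per child process (each result can be flattened again before mapping), while aggregation keeps one event per parent and suits set-membership queries.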

Error Handling & DLQ Routing

Every rejection carries a typed error code so the dead-letter queue is queryable and the failure mode is unambiguous. These codes feed the same error categorization framework used across the ingestion stack.

Error code	Trigger	Recovery path
`ERR_PARSE_001`	`orjson.loads` fails — truncated or non-JSON payload	DLQ with raw bytes; replay after transport fix
`ERR_SCHEMA_002`	Pydantic validation fails — missing required field or uncoercible type	DLQ with mapped dict; patch `ECS_FIELD_MAP` and replay
`ERR_MAP_003`	Required ECS target had no candidate present	DLQ; add the vendor’s key to the map’s candidate list
`ERR_SIZE_004`	Payload exceeds the configured byte ceiling	Reject without parsing to protect memory

Route rejects to a forensic dead-letter sink that retains the raw payload alongside the code — never auto-correct critical-feed events in place. Because normalization is idempotent, replaying a DLQ batch after a map fix is safe and produces no duplicates downstream.

MAX_PAYLOAD_BYTES = 1_048_576  # 1 MiB ceiling per event


def guard_size(raw: bytes) -> None:
    """Reject oversized payloads before any parse work is done."""
    if len(raw) > MAX_PAYLOAD_BYTES:
        raise ValueError("ERR_SIZE_004 payload exceeds 1 MiB ceiling")

Performance Tuning

Normalization sits on the hot path, so its cost is paid on every event. Practical levers:

Decode with orjson. On nested cloud audit records, orjson.loads typically runs 2–5x faster than the standard library and is the single biggest win on JSON-heavy feeds.
Batch, don’t trickle. Normalize in batches of 500–1,000 events per worker invocation to amortize logging and bus-publish overhead. Larger batches raise tail latency without improving throughput once the CPU is saturated.
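Cap concurrency to cores. Size the pool at min(32, os.cpu_count() or 1) and use worker processes rather than threads. JSON parsing and Pydantic validation are CPU-bound, so in CPython threads contend for the interpreter lock and oversubscribing beyond the core count yields no gain. The "or 1" guards against os.cpu_count() returning None.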
Hold a memory ceiling. With the 1 MiB per-event guard and streaming (generator-based) normalization, resident memory stays bounded regardless of file size — the pipeline never materializes a whole stream.
Target sub-millisecond per event. A flat IdP event should normalize in well under 1 ms; a deeply nested CloudTrail record in 1–3 ms. Sustained times above that point to an oversized map or redundant flattening.
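
The batching lever needs nothing beyond the standard library. A minimal sketch, assuming events arrive as any iterable; the helper name batched is illustrative, and the batch size used in the demo is tiny only to keep the example readable.

```python
from itertools import islice
from typing import Iterable, Iterator, TypeVar

T = TypeVar("T")


def batched(items: Iterable[T], size: int) -> Iterator[list[T]]:
    """Yield lists of up to `size` items without materializing the stream."""
    if size < 1:
        raise ValueError("size must be >= 1")
    iterator = iter(items)
    # islice pulls at most `size` items per pass; an empty list ends the loop.
    while batch := list(islice(iterator, size)):
        yield batch


demo = list(batched(range(5), 2))
```

Because batched is itself a generator, it composes with streaming normalization and keeps the memory ceiling intact: only one batch is resident at a time.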

Verification & Observability

Confirm correct operation with concrete signals, not vibes:

Round-trip assertion. For a known fixture, assert that normalize() output contains the expected source.ip as a validated IP and that correlation_id is stable across two runs.
Rejection-rate metric. Emit a counter per error code to your TSDB. A sudden rise in ERR_MAP_003 means a vendor changed a key — an early warning of schema drift.
Field-coverage gauge. Track the percentage of events populating each optional ECS field. A drop to zero on user.name for one source flags a broken nested path before detections degrade.
Structured log lines. Every accept logs event normalized with event_id; every reject logs the code and Pydantic error list, so the DLQ is fully explainable from logs alone.

def assert_normalizes(raw: bytes, expected_ip: str) -> None:
    """Minimal verification harness for a known-good fixture."""
    results = list(normalize(raw))
    assert results and results[0].get("status") != "rejected", "event was rejected"
    first = results[0]
    assert first["source.ip"] == expected_ip, first["source.ip"]
    assert first["correlation_id"] == list(normalize(raw))[0]["correlation_id"]
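
The rejection-rate and field-coverage signals can both be computed from the dicts that normalize() yields. The sketch below aggregates them in memory; summarize is a hypothetical helper, and publishing the numbers to a TSDB is left out because the client API depends on your stack.

```python
from collections import Counter
from typing import Any, Iterable

OPTIONAL_FIELDS = ("source.ip", "source.port", "user.name", "cloud.provider", "process.name")


def summarize(results: Iterable[dict[str, Any]]) -> tuple[Counter, dict[str, float]]:
    """Return (rejections per code, percent of accepted events populating each field)."""
    rejects: Counter = Counter()
    populated: Counter = Counter()
    accepted = 0
    for result in results:
        if result.get("status") == "rejected":
            rejects[result.get("code", "UNKNOWN")] += 1
            continue
        accepted += 1
        for name in OPTIONAL_FIELDS:
            if result.get(name) is not None:  # unset optional fields dump as None
                populated[name] += 1
    coverage = {
        name: (100.0 * populated[name] / accepted if accepted else 0.0)
        for name in OPTIONAL_FIELDS
    }
    return rejects, coverage


sample = [
    {"source.ip": "10.0.0.5", "user.name": None},
    {"source.ip": None, "user.name": None},
    {"status": "rejected", "code": "ERR_SCHEMA_002"},
]
reject_counts, field_coverage = summarize(sample)
```

Computing coverage per source rather than globally makes a broken nested path for a single vendor stand out instead of being averaged away.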

Troubleshooting

All events land in the DLQ with ERR_SCHEMA_002. A required field alias does not match any ECS_FIELD_MAP target. Confirm event.id, event.action, and @timestamp resolve a candidate for this source; required fields with no mapping reject every record.
source.ip is always null. The vendor nests the address (source.address, userIdentity.ipAddress) and the flat candidate list misses it. Inspect flatten() output and add the dotted key to the candidate tuple.
Type errors on source.port. The source emits the port as a string the validator cannot coerce (e.g. "N/A"). Add a sentinel-stripping step in the map, or widen the candidate list to the numeric field.
Duplicate events downstream after a replay. correlation_id is being seeded from a volatile field (ingest time) instead of stable input. Seed only from event.id and the source timestamp so replays stay idempotent.
CloudTrail records arrive as one giant event. The Records[] envelope was not unwound. Confirm decode_and_unwind runs before flatten; otherwise the whole array is carried through as a single list value under Records instead of being fanned out into one event per record.
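
For the source.port case, a sentinel-stripping step could run on the flattened event before mapping, so a placeholder never wins over a real candidate. The sentinel list below is an assumption and should be tuned per vendor; strip_sentinels is a hypothetical name.

```python
from typing import Any

# Assumed placeholder strings, compared case-insensitively after trimming.
SENTINELS = frozenset({"", "n/a", "na", "none", "null", "-", "unknown"})


def strip_sentinels(flat: dict[str, Any]) -> dict[str, Any]:
    """Drop string values that are placeholders rather than data."""
    return {
        key: value
        for key, value in flat.items()
        if not (isinstance(value, str) and value.strip().lower() in SENTINELS)
    }


cleaned = strip_sentinels({"src_port": "N/A", "sourcePort": 443, "user": " - "})
```

Only strings are inspected, so numeric zero and boolean False survive untouched.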

FAQ

Should I normalize to ECS or OCSF?

Pick one canonical schema and enforce it everywhere; mixing them reintroduces the field fragmentation you are trying to remove. This site standardizes on Elastic Common Schema because most SIEM correlation content assumes ECS field names, but the same ECS_FIELD_MAP pattern works unchanged for OCSF — only the target keys change. What matters is that normalization output and the downstream validation contract reference the identical names.

How do I handle a vendor that nests the same field at different depths across event types?

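List every known path as an ordered candidate for that ECS target. Because map_to_ecs takes the first present, non-null value, ("userIdentity.userName", "actor.user.name", "user") resolves correctly whether the source put the username two levels deep or at the root. When a genuinely new path appears, a required target fails the gate (ERR_MAP_003 in the error table, or ERR_SCHEMA_002 if no separate mapping check runs before validation), while an optional target simply goes unpopulated and shows up as a drop in the field-coverage gauge. Either signal tells you which candidate list to extend.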

Where should enrichment happen — before or after normalization?

Always after. Threat-intel lookups, geo-IP, and asset tagging must operate on canonical field names, not vendor aliases, or you maintain a separate enrichment rule per source. Normalize first so source.ip is consistent, then feed the clean event into threat intel feed mapping and append matches as nested objects such as threat.indicator.matched.
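
A minimal sketch of enrichment keyed on the canonical field, assuming a normalized event dict as input. The indicator set is a hypothetical stand-in for a real feed lookup, and the sub-keys inside the match object are illustrative rather than a confirmed schema.

```python
from typing import Any

# Hypothetical indicator set; a real feed lookup would replace this.
KNOWN_BAD_IPS = {"203.0.113.7"}


def enrich(event: dict[str, Any]) -> dict[str, Any]:
    """Append a threat match based on the canonical source.ip field."""
    out = dict(event)  # copy so the normalized event is not mutated
    ip = out.get("source.ip")
    if ip in KNOWN_BAD_IPS:
        out["threat.indicator.matched"] = {"type": "ipv4-addr", "value": ip}
    return out


hit = enrich({"source.ip": "203.0.113.7"})
miss = enrich({"source.ip": "10.0.0.5"})
```

The function reads exactly one field name regardless of which vendor produced the event, which is the payoff of normalizing first.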

Is it safe to log the raw payload when an event is rejected?

Only into the forensic DLQ, never into general application logs. Raw payloads routinely contain credentials, tokens, and PII. The rejection logger here emits the Pydantic error list and the mapped (not raw) dict; the raw bytes go only to the access-controlled dead-letter sink, where redaction and retention policy apply.

