Normalizing JSON Logs from Cloud Providers

Three clouds describe the same login three different ways, so a correlation rule written against one provider’s JSON silently misses the other two. This page builds a provider-agnostic normalizer that collapses AWS CloudTrail, Azure Activity, and GCP Audit events into one canonical model, as a single technique within JSON event normalization and the broader SOC log architecture and taxonomy it sits under.

Root-Cause Context

Cloud audit streams are structurally inconsistent by design, because each provider optimizes its payload for service-specific debugging and billing rather than cross-source security correlation. The same logical fact — which principal did what, from where, when — is encoded under entirely different paths. The actor lives at userIdentity.arn in CloudTrail, at identity.claims.name (or caller) in Azure Activity, and at protoPayload.authenticationInfo.principalEmail in a GCP Audit record. The action is eventName in AWS, operationName in Azure, and protoPayload.methodName in GCP. The source IP is sourceIPAddress, callerIpAddress, and protoPayload.requestMetadata.callerIp respectively. None of these align, and the nesting depth differs by provider.
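The "first path that resolves wins" idea behind that mapping can be sketched with the standard library alone. This is a minimal illustration, not the page's implementation: `dig`, `first_present`, and the sample payloads are hypothetical names introduced here, and the emptiness rule is an assumption modeled on how a JMESPath `a || b || c` expression skips null and empty values.

```python
from typing import Any

# Provider-native actor paths from the paragraph above, tried in order.
ACTOR_PATHS = (
    "userIdentity.arn",                                # AWS CloudTrail
    "identity.claims.name",                            # Azure Activity
    "caller",                                          # Azure Activity (alternate)
    "protoPayload.authenticationInfo.principalEmail",  # GCP Audit
)


def dig(payload: dict, dotted: str) -> Any:
    """Walk a dotted path; return None as soon as a segment is missing."""
    node: Any = payload
    for key in dotted.split("."):
        if not isinstance(node, dict) or key not in node:
            return None
        node = node[key]
    return node


def _is_empty(value: Any) -> bool:
    # Assumption: treat null, False, and empty containers/strings as "absent".
    return value is None or value is False or value == "" or value == [] or value == {}


def first_present(payload: dict, paths: tuple[str, ...]) -> Any:
    """Return the first non-empty value found along the ordered paths."""
    for path in paths:
        value = dig(payload, path)
        if not _is_empty(value):
            return value
    return None


aws = {"userIdentity": {"arn": "arn:aws:iam::1:user/ana"}}
gcp = {"protoPayload": {"authenticationInfo": {"principalEmail": "ana@example.com"}}}
print(first_present(aws, ACTOR_PATHS))
print(first_present(gcp, ACTOR_PATHS))
```

A rule written against the result of `first_present` no longer cares which provider produced the record, which is the property the rest of this page builds on.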

The danger is not the verbosity but the failure mode it creates downstream. When a correlation rule keys off userIdentity.arn, every Azure and GCP event is invisible to it — a false negative that looks like a quiet tenant rather than a coverage gap. Worse, the timestamp formats disagree at the millisecond: CloudTrail emits ISO 8601 with a trailing Z, Azure carries an explicit offset like +00:00, and some services append microsecond precision that a strict SIEM date mapper rejects outright. A single unhandled format drops the event into the wrong correlation window or quarantines it entirely. And because cloud providers ship schema changes without notice, a brittle JMESPath query that worked yesterday can start returning null after an upstream update, severing detection coverage with no error raised. The remedy is to treat the raw cloud payload as a transport envelope — not a final format — and normalize every event to a canonical model keyed on actor.id, action.category, source.ip, and a single timestamp standard before it reaches the correlation engine.
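The timestamp disagreement described above can be made concrete with a small standard-library sketch. The function name `to_utc_seconds` is hypothetical, and rejecting offset-less values is an assumption made here so that a naive timestamp is never silently interpreted as local time.

```python
import re
from datetime import datetime, timezone

FRACTION = re.compile(r"\.\d+")  # fractional seconds of any length


def to_utc_seconds(raw: str) -> str:
    """Coerce a provider timestamp to seconds-precision UTC ISO 8601."""
    cleaned = FRACTION.sub("", raw.strip())
    if cleaned.endswith("Z"):
        cleaned = cleaned[:-1] + "+00:00"  # explicit offset parses on older Pythons too
    parsed = datetime.fromisoformat(cleaned)
    if parsed.tzinfo is None:
        # Assumption: refuse to guess a zone for naive values.
        raise ValueError(f"timestamp has no offset: {raw!r}")
    return parsed.astimezone(timezone.utc).isoformat()


for sample in (
    "2026-06-27T09:14:02Z",              # trailing Z
    "2026-06-27T09:14:05.118+00:00",     # explicit offset with milliseconds
    "2026-06-27T11:14:09.123456+02:00",  # microseconds and a non-UTC offset
):
    print(to_utc_seconds(sample))
```

All three inputs land in the same format, so they fall into the same correlation window regardless of which provider emitted them.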

Prerequisites

This technique needs only pydantic for the validated output contract and jmespath for resilient path extraction; everything else is standard library. Keeping the dependency surface small matters because this normalizer runs at the ingestion edge, often inside a serverless function or sidecar.

Python 3.11+ for enum.StrEnum, datetime.fromisoformat with broader ISO 8601 parsing (including a trailing Z), and modern asyncio semantics.
pip install "pydantic>=2.5" jmespath — Pydantic v2 for fast typed validation, JMESPath for declarative a || b || c fallback extraction across providers.
A bounded dead-letter sink (Kafka topic, object-store prefix, or on-disk spool) to receive quarantined events with their error codes for triage and replay.
An agreement that the canonical model is the contract: downstream rules, threat intel feed mapping, and enrichment all read actor.id / action.category / source.ip, never the provider-native paths.
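For the on-disk spool option, one possible shape of a bounded dead-letter sink is sketched below. `BoundedSpool`, its `max_bytes` cap, and the JSON Lines layout are assumptions made for illustration; a Kafka topic or object-store prefix would replace this class entirely.

```python
import json
from pathlib import Path
from typing import Iterator


class BoundedSpool:
    """Append-only JSONL dead-letter spool that refuses writes past max_bytes."""

    def __init__(self, path: Path, max_bytes: int) -> None:
        self.path = path
        self.max_bytes = max_bytes

    def put(self, error_code: str, raw: dict) -> bool:
        """Append one quarantined event; return False when the spool is full."""
        encoded = (json.dumps({"error_code": error_code, "raw": raw}) + "\n").encode("utf-8")
        current = self.path.stat().st_size if self.path.exists() else 0
        if current + len(encoded) > self.max_bytes:
            return False  # caller decides whether to alert, rotate, or shed load
        with self.path.open("ab") as handle:
            handle.write(encoded)
        return True

    def replay(self) -> Iterator[dict]:
        """Yield quarantined entries in arrival order for re-normalization."""
        if not self.path.exists():
            return
        with self.path.open("r", encoding="utf-8") as handle:
            for line in handle:
                yield json.loads(line)
```

Returning `False` instead of raising keeps the bound explicit: a full spool is an operational condition to alert on, not an exception that should take down the ingestion worker.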

Production-Ready Implementation

The module below detects the provider from a stable structural marker, extracts each canonical field through an ordered JMESPath fallback so no provider needs its own rule set, coerces every timestamp to UTC ISO 8601 with seconds precision, and maps provider-specific verbs to a unified action.category. Pydantic validates the result; anything that cannot be normalized is dead-lettered with an ERR_CATEGORY_NNN code rather than dropped or allowed to fire a false alert. The pipeline is async so a single corrupt feed cannot block healthy ones.

from __future__ import annotations

import asyncio
import logging
import re
from datetime import datetime, timezone
from enum import StrEnum
from typing import Optional

import jmespath
from pydantic import BaseModel, ValidationError, field_validator

logger = logging.getLogger("soc.cloud_normalize")

IPV4 = re.compile(r"^(?:(?:25[0-5]|2[0-4]\d|1?\d?\d)\.){3}(?:25[0-5]|2[0-4]\d|1?\d?\d)$")
MICROS = re.compile(r"\.\d+")


class Provider(StrEnum):
    AWS = "aws"
    AZURE = "azure"
    GCP = "gcp"
    UNKNOWN = "unknown"


# Ordered, provider-agnostic extraction paths — first non-null wins.
ACTOR_PATHS = "userIdentity.arn || identity.claims.name || caller || protoPayload.authenticationInfo.principalEmail"
ACTION_PATHS = "eventName || operationName || protoPayload.methodName"
TIME_PATHS = "eventTime || time || timestamp || protoPayload.requestMetadata.requestAttributes.time"
IP_PATHS = "sourceIPAddress || callerIpAddress || protoPayload.requestMetadata.callerIp"
ID_PATHS = "eventID || correlationId || insertId || id"

# Verb fragments -> canonical action.category (extend per detection taxonomy).
CATEGORY_RULES: tuple[tuple[frozenset[str], str], ...] = (
    (frozenset({"runinstances", "create", "deploy", "start", "write", "insert"}), "compute.create"),
    (frozenset({"delete", "terminate", "stop", "remove", "destroy"}), "compute.delete"),
    (frozenset({"login", "consolelogin", "authenticate", "assumerole", "getcredentials"}), "identity.access"),
    (frozenset({"getobject", "list", "describe", "get", "read"}), "data.read"),
    (frozenset({"putbucketpolicy", "setiampolicy", "updateacl", "addpermission"}), "identity.policy_change"),
)


class NormalizedEvent(BaseModel):
    event_id: str
    timestamp_utc: datetime
    provider: Provider
    actor_id: str
    actor_type: str
    action: str
    action_category: str
    source_ip: Optional[str] = None
    raw_severity: str = "informational"

    @field_validator("timestamp_utc", mode="before")
    @classmethod
    def coerce_timestamp(cls, v: object) -> datetime:
        if isinstance(v, datetime):
            return v.astimezone(timezone.utc)
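        if isinstance(v, str) and v:
            # Normalize a trailing Z to an explicit offset and drop fractional seconds.
            cleaned = MICROS.sub("", v.replace("Z", "+00:00"))
            try:
                return datetime.fromisoformat(cleaned).astimezone(timezone.utc)
            except ValueError as exc:
                # Re-raise with the code so _error_code() does not fall back to ERR_TYPE_003.
                raise ValueError("ERR_TIME_002") from exc
        raise ValueError("ERR_TIME_002")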

    @field_validator("source_ip", mode="before")
    @classmethod
    def validate_ip(cls, v: object) -> Optional[str]:
        return v if isinstance(v, str) and IPV4.match(v) else None


def detect_provider(payload: dict) -> Provider:
    if "userIdentity" in payload or "eventSource" in payload:
        return Provider.AWS
    if "operationName" in payload and "resourceId" in payload:
        return Provider.AZURE
    if "protoPayload" in payload:
        return Provider.GCP
    return Provider.UNKNOWN


def classify_action(action: str) -> str:
    needle = action.lower()
    for fragments, category in CATEGORY_RULES:
        if any(frag in needle for frag in fragments):
            return category
    return "unknown"


def normalize_event(payload: dict) -> NormalizedEvent:
    """Collapse one cloud audit record into the canonical model or raise ValueError(code)."""
    provider = detect_provider(payload)
    if provider is Provider.UNKNOWN:
        raise ValueError("ERR_SCHEMA_001")

    actor = jmespath.search(ACTOR_PATHS, payload)
    if not actor:
        raise ValueError("ERR_ACTOR_004")
    action = jmespath.search(ACTION_PATHS, payload)
    if not action:
        raise ValueError("ERR_SCHEMA_001")

    actor_str = str(actor)
    return NormalizedEvent(
        event_id=str(jmespath.search(ID_PATHS, payload) or "unknown"),
        timestamp_utc=jmespath.search(TIME_PATHS, payload),  # validator coerces / raises ERR_TIME_002
        provider=provider,
        actor_id=actor_str,
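        # Heuristic: GCP service accounts use a gserviceaccount.com address; any other principal is treated as a user.
        actor_type="service_account" if actor_str.endswith("gserviceaccount.com") else "iam_user",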
        action=str(action),
        action_category=classify_action(str(action)),
        source_ip=jmespath.search(IP_PATHS, payload),
        raw_severity=str(payload.get("severity", "informational")),
    )


async def run_pipeline(
    raw: list[dict],
    clean: asyncio.Queue[NormalizedEvent],
    dlq: asyncio.Queue[dict],
) -> dict[str, int]:
    """Normalize a batch, routing every record to the clean queue or the error-coded DLQ."""
    stats: dict[str, int] = {"processed": 0, "normalized": 0, "quarantined": 0}
    for payload in raw:
        stats["processed"] += 1
        try:
            await clean.put(normalize_event(payload))
            stats["normalized"] += 1
        except (ValueError, ValidationError) as exc:
            code = _error_code(exc)
            stats["quarantined"] += 1
            stats[code] = stats.get(code, 0) + 1
            await dlq.put({"error_code": code, "raw": payload})
        if stats["processed"] % 5_000 == 0:
            await asyncio.sleep(0)  # cooperative yield so consumers drain
    return stats


def _error_code(exc: Exception) -> str:
    text = str(exc)
    for code in ("ERR_SCHEMA_001", "ERR_TIME_002", "ERR_ACTOR_004"):
        if code in text:
            return code
    return "ERR_TYPE_003"


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    sample = [
        {"userIdentity": {"arn": "arn:aws:iam::1:user/ana"}, "eventName": "ConsoleLogin",
         "eventTime": "2026-06-27T09:14:02Z", "sourceIPAddress": "203.0.113.7", "eventID": "a1"},
        {"operationName": "Microsoft.Compute/virtualMachines/write", "resourceId": "/vm/01",
         "caller": "[email protected]", "time": "2026-06-27T09:14:05.118+00:00", "callerIpAddress": "203.0.113.7"},
        {"protoPayload": {"methodName": "storage.objects.get",
                          "authenticationInfo": {"principalEmail": "[email protected]"},
                          "requestMetadata": {"callerIp": "198.51.100.4"}},
         "timestamp": "2026-06-27T09:14:09Z", "insertId": "g7"},
    ]

    async def _demo() -> None:
        clean: asyncio.Queue[NormalizedEvent] = asyncio.Queue()
        dlq: asyncio.Queue[dict] = asyncio.Queue()
        stats = await run_pipeline(sample, clean, dlq)
        logger.info("stats=%s", stats)
        while not clean.empty():
            logger.info("normalized=%s", (await clean.get()).model_dump(mode="json"))

    asyncio.run(_demo())

The output is exactly the type-enforced shape that the schema validation pipelines model expects, and once action_category is canonical a single rule can detect a compromised credential reused across clouds — identity.access from an anomalous source.ip followed by data.read on storage, the cross-provider signature of T1078 (Valid Accounts) escalating into T1530 (Data from Cloud Storage Object). Mapping that canonical action surface to detection logic is the job of the MITRE ATT&CK integration layer downstream.

Error-Code Reference

Codes follow the ERR_CATEGORY_NNN convention so the dead-letter sink stays triageable by category, and each maps to a disposition handled by the downstream error categorization frameworks.

| Code | Meaning | Action |
| --- | --- | --- |
| `ERR_SCHEMA_001` | Provider could not be detected, or no action verb resolved from any known path | Quarantine and inspect: a new service or an upstream schema change; extend `detect_provider` / `ACTION_PATHS` and replay |
| `ERR_TIME_002` | Timestamp missing or unparseable after `Z`/offset/microsecond normalization | Route to a timestamp-repair worker; if a whole feed spikes, the provider changed its date format |
| `ERR_TYPE_003` | Pydantic validation failure on a coerced field (non-string actor, bad enum) | Auto-coerce in a normalization worker; persistent failures mean a structural shift in the payload |
| `ERR_ACTOR_004` | No principal resolved at any actor path | Quarantine: usually an unauthenticated/anonymous or service-internal event; decide whether to synthesize `anonymous` or drop per policy |
| `ERR_DEPTH_005` | Nested object exceeds the SIEM index depth ceiling after flattening | Truncate or pipe-serialize the deep branch before re-emitting; alert on a sustained rate |

Categorizing this way enables disposition routing rather than blanket dropping: ERR_TIME_002 and ERR_TYPE_003 are candidates for auto-repair, while a spike in ERR_SCHEMA_001 is the signature of an upstream provider format change worth paging on.
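Disposition routing can be as small as a lookup table. The sketch below is one way to express it; the disposition labels (`quarantine`, `auto_repair`, `truncate`) and the function names are hypothetical, chosen to mirror the Action column of the table rather than any existing framework API.

```python
from collections import Counter

# Assumed mapping from error code to a handling lane, following the table above.
DISPOSITIONS = {
    "ERR_SCHEMA_001": "quarantine",
    "ERR_TIME_002": "auto_repair",
    "ERR_TYPE_003": "auto_repair",
    "ERR_ACTOR_004": "quarantine",
    "ERR_DEPTH_005": "truncate",
}


def route(entry: dict) -> str:
    """Pick a lane for one DLQ entry; unknown codes are quarantined, never dropped."""
    return DISPOSITIONS.get(entry.get("error_code", ""), "quarantine")


def summarize(entries: list[dict]) -> Counter:
    """Count entries per lane so a shift in the mix is visible at a glance."""
    return Counter(route(entry) for entry in entries)


dlq_sample = [
    {"error_code": "ERR_TIME_002", "raw": {}},
    {"error_code": "ERR_SCHEMA_001", "raw": {}},
    {"error_code": "ERR_SCHEMA_001", "raw": {}},
    {"error_code": "ERR_NEW_999", "raw": {}},  # unrecognized code
]
lanes = summarize(dlq_sample)
```

Defaulting unknown codes to quarantine keeps the conservation property intact: a code added later cannot cause records to vanish before the table is updated.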

Operational Notes

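Memory profile. Each event is normalized independently with no cross-record state, so once both queues are created with a maxsize, a consumer drains them concurrently, and the input is streamed, resident memory is bounded by the two queues (about queue_max × avg_event_size each) and stays flat whether the batch is 5,000 or 5,000,000 records. The demo above uses unbounded asyncio.Queue() instances and an in-memory list for brevity; in production, set maxsize and stream records rather than calling json.load on an entire multi-GB CloudTrail digest file.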
Throughput. JMESPath compilation dominates per-event cost; pre-compile each expression once (jmespath.compile(...)) at module load when running hot paths rather than re-parsing the string per record, which roughly doubles throughput on a single core.
Vendor quirks. CloudTrail wraps records in a top-level {"Records": [...]} array and emits eventTime to whole seconds, while GCP nests everything under protoPayload and can ship microsecond timestamp values; Azure sometimes delivers callerIpAddress as an empty string rather than omitting it — the IP validator returns None for both, which is the correct canonical absence.
Actor typing is heuristic. The iam_user vs service_account split here keys on ARN/email shape; for high-fidelity detection, resolve it against the identity provider rather than trusting the string, since a service principal acting like a human is itself a signal worth weighting in dynamic severity scoring.
Legacy egress. When a normalized event must reach an on-prem SIEM over syslog, serialize the canonical fields into RFC 5424 STRUCTURED-DATA rather than re-flattening ad hoc — see the syslog RFC standards guidance so fields survive transport without truncation.
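For the legacy egress case, the following sketch shows how canonical fields could be packed into a single RFC 5424 SD-ELEMENT. The function names are hypothetical, and the SD-ID `soc@32473` is a placeholder: 32473 is the enterprise number reserved for documentation examples, so a real deployment would substitute its own registered number.

```python
def sd_escape(value: str) -> str:
    """Escape the three characters RFC 5424 requires inside a PARAM-VALUE."""
    # Backslash first, so the escapes added for '"' and ']' are not doubled.
    return value.replace("\\", "\\\\").replace('"', '\\"').replace("]", "\\]")


def to_structured_data(event: dict, sd_id: str = "soc@32473") -> str:
    """Render non-null canonical fields as one SD-ELEMENT: [id name="value" ...]."""
    params = " ".join(
        f'{name}="{sd_escape(str(value))}"'
        for name, value in event.items()
        if value is not None
    )
    return f"[{sd_id} {params}]" if params else f"[{sd_id}]"


element = to_structured_data({
    "actor_id": "arn:aws:iam::1:user/ana",
    "action_category": "identity.access",
    "source_ip": None,  # omitted rather than emitted as an empty value
})
print(element)
```

Field names are passed through unchanged here, which assumes they already satisfy the SD-NAME rules (printable ASCII, at most 32 characters, no space, `=`, `]`, or `"`); the canonical names on this page do.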

Verification Checklist

One AWS, one Azure, and one GCP record describing the same login all produce identical action_category and matching source_ip.
processed equals normalized plus quarantined — conservation holds and no record is silently lost.
A timestamp with microsecond precision and a +00:00 offset normalizes to seconds-precision UTC, not ERR_TIME_002.
A payload with no resolvable principal yields ERR_ACTOR_004, and an unrecognized envelope yields ERR_SCHEMA_001 — neither crashes the batch.
Every DLQ entry carries a populated error_code that resolves to a row in the reference table.
Resident memory stays flat between a 40 MB and a 40 GB CloudTrail digest under the same queue maxsize.

FAQ

Why normalize at ingestion instead of writing per-provider correlation rules in the SIEM?

Because per-provider rules multiply: every detection you author has to be written, tested, and maintained three times, and a fourth cloud or a renamed field silently breaks one copy without touching the others. Normalizing to a canonical actor.id / action.category / source.ip model at the edge means each detection is written once against the canonical surface, and onboarding a new provider is a change to one extraction map — ACTOR_PATHS, ACTION_PATHS, detect_provider — rather than a rewrite of the rule set. It also closes the false-negative gap where an Azure event is simply invisible to a rule keyed on a CloudTrail-only path.

How do I keep up when a cloud provider changes its JSON without notice?

Treat extraction failure as a monitored signal, not a silent null. The ERR_SCHEMA_001 and ERR_TIME_002 codes exist so that an upstream format change surfaces as a measurable spike in the dead-letter rate rather than as a quiet drop in detection coverage. Page on a sustained DLQ rate above your baseline, sample the quarantined raw payloads to see the new shape, extend the ordered JMESPath fallbacks (which are additive — old paths keep working), and replay the quarantine. Because the fallbacks are a || b || c, adding a new path never regresses the providers you already handle.
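The "sustained rate above baseline" condition can be sketched as a small windowed check fed by the stats dict that each batch returns. `DlqRateMonitor`, the multiplier of 3, and the three-window sustain period are assumptions to tune against your own traffic, not recommended values.

```python
from collections import deque


class DlqRateMonitor:
    """Flag a dead-letter rate that stays above baseline for consecutive windows."""

    def __init__(self, baseline: float, factor: float = 3.0, sustain: int = 3) -> None:
        self.limit = baseline * factor               # rate that counts as elevated
        self.recent: deque[bool] = deque(maxlen=sustain)

    def observe(self, processed: int, quarantined: int) -> bool:
        """Record one window; return True only when every recent window was elevated."""
        rate = quarantined / processed if processed else 0.0
        self.recent.append(rate > self.limit)
        return len(self.recent) == self.recent.maxlen and all(self.recent)


monitor = DlqRateMonitor(baseline=0.01)
windows = [(1000, 5), (1000, 100), (1000, 120), (1000, 90)]
decisions = [monitor.observe(processed, quarantined) for processed, quarantined in windows]
```

Requiring several consecutive elevated windows is what separates a provider format change from a single malformed batch, so the page fires on the former and stays quiet on the latter.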

Related guides

JSON Event Normalization (parent topic)
SOC Log Architecture & Taxonomy (parent architecture)
How to Map CEF to ECS Schema
Schema Validation Pipelines
Error Categorization Frameworks