Quick Definition
Complete mediation is the security and access-control principle that every access request must be checked against authorization policy every time, not just once. Analogy: like a tollbooth that checks every car at every entry, not just once per day. Formal: ensure authorization enforcement occurs at every access decision point.
What is Complete Mediation?
Complete mediation is a principle from access control and security engineering: every access to a resource must be checked for permission. It is NOT a one-time check, implicit trust, or a purely network-layer routing rule. It applies across identity, sessions, caching, tokens, and service-to-service calls.
Key properties and constraints:
- Checks at every access point, including internal calls.
- Fresh authorization decision or safely validated cache entry.
- Scalable in cloud-native environments via policy caches and PDP/PAP patterns.
- Tolerant to latency constraints with bounded cache TTLs and revocation signals.
- Requires observability for enforcement effectiveness.
Where it fits in modern cloud/SRE workflows:
- Identity-aware proxies at the edge.
- Service mesh and sidecar-level enforcement.
- API gateways and function-level checks in serverless.
- CI/CD policy gates and runtime enforcement for zero-trust architectures.
- Part of SRE reliability responsibilities: prevents incidents caused by unauthorized actions and reduces blast radius.
Diagram description (text-only):
- Requester (user or service) sends request -> Identity provider validates identity -> Request passes through ingress policy enforcer (edge) -> If allowed, forward to service sidecar policy evaluator -> Sidecar checks attributes and policy -> Service receives request and re-checks for sensitive actions -> Logging and telemetry emitted to observability backend -> PDP updates policy changes and revokes caches via push/pull.
Complete Mediation in one sentence
Every access attempt to a resource must be authorized at the time of access by an enforced policy, not assumed based on previous checks.
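As a minimal sketch of that sentence, the hypothetical `Document` class below consults its authorization callback on every `read`, so a grant revoked between two accesses takes effect on the second one. The class, the `grants` set, and the principal name are illustrative assumptions, not part of any real library.

```python
from typing import Callable


class Document:
    """A resource that re-checks authorization on every read (hypothetical example)."""

    def __init__(self, body: str, is_allowed: Callable[[str], bool]):
        self._body = body
        self._is_allowed = is_allowed  # consulted on every access, never cached here

    def read(self, principal: str) -> str:
        if not self._is_allowed(principal):
            raise PermissionError(f"{principal} may not read this document")
        return self._body


# Permissions live outside the resource and can change at any time.
grants = {"alice"}
doc = Document("quarterly numbers", lambda p: p in grants)

first_read = doc.read("alice")   # allowed: alice currently holds a grant
grants.discard("alice")          # permission revoked between accesses
try:
    doc.read("alice")
    second_read_blocked = False
except PermissionError:
    second_read_blocked = True   # the second access is re-checked and denied
```

A check-once design would have stored the first "allowed" result on the object and kept serving reads after the revocation.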
Complete Mediation vs related terms
| ID | Term | How it differs from Complete Mediation | Common confusion |
|---|---|---|---|
| T1 | Authentication | Confirms identity only | Often conflated with authorization |
| T2 | Authorization | Broader category that includes mediation | Mediation is enforcement practice |
| T3 | Least Privilege | Principle on permission scope | Not about checking frequency |
| T4 | Role-Based Access Control | Policy model not enforcement timing | RBAC can be used without mediation |
| T5 | Attribute-Based Access Control | Policy model using attributes | ABAC requires enforcement too |
| T6 | Caching | Performance optimization | Caching can break mediation if stale |
| T7 | Session Tokens | Mechanism for identity claims | A revoked token can still pass local validation until it expires |
| T8 | Service Mesh | Transport-level controls | A mesh can enforce mediation but is not required for it |
| T9 | Network ACLs | Coarse network filtering | Not sufficient for resource-level checks |
| T10 | Zero Trust | Security model aligned with mediation | Zero Trust includes more than mediation |
Why does Complete Mediation matter?
Business impact:
- Revenue: Prevents fraud, data exfiltration, and uptime loss due to unauthorized actions.
- Trust: Preserves customer and partner trust by enforcing access policies reliably.
- Risk: Limits regulatory exposure and breach impact by ensuring access decisions are enforced.
Engineering impact:
- Incident reduction: Eliminates classes of incidents where stale permissions allowed bad actions.
- Velocity: Clear, enforced policies reduce ad hoc fixes and developer uncertainty.
- Trade-offs: Needs tooling to avoid latency and operational burdens.
SRE framing:
- SLIs/SLOs: Authorization success rate, policy evaluation latency, enforcement coverage.
- Error budgets: Allow limited policy sync failures but not silent bypasses.
- Toil: Automation of policy distribution and revocation reduces manual toil.
- On-call: Include authorization failures as actionable alerts.
What breaks in production — realistic examples:
- Stale token bug allows deprovisioned employee to modify billing for hours.
- Cache invalidation failure prevents revocation of third-party API keys.
- Sidecar policy mismatch allows elevated-read operations on a data service.
- CI/CD pipeline lacks policy gate, pushes configuration that disables checks.
- Temporary network partition causes PDP unreachable and services operate in permissive mode.
Where is Complete Mediation used?
| ID | Layer/Area | How Complete Mediation appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge ingress | Policy check per request at gateway | Request authz latency and decision logs | API gateway sidecars |
| L2 | Service mesh | Sidecar enforces per-call policies | mTLS, authz decision traces | Service mesh control planes |
| L3 | Application | Inline checks before sensitive ops | Audit logs and deny counters | Middleware libraries |
| L4 | Database | Row/column access enforcement | DB audit and slow denies | DB proxy or RLS |
| L5 | IAM | User and service permission checks | Token issuance and revocation metrics | IAM systems |
| L6 | Serverless | Function-level authz per invocation | Invocation authz metrics | Serverless gateways |
| L7 | CI/CD | Policy gates on deploy and config | Pipeline policy pass/fail counts | Policy-as-code tools |
| L8 | Observability | Enforcement telemetry and traces | Events, alerts, traces | Logging and APM tools |
| L9 | Network | Microsegmentation and ACLs per flow | Flow logs and deny counts | Network policy managers |
| L10 | Data plane | Storage and stream enforcement | Access patterns and deny rates | Data access proxies |
When should you use Complete Mediation?
When it’s necessary:
- Systems handling sensitive data, financial transactions, or PII.
- Multi-tenant platforms where owner boundaries must be enforced.
- Environments requiring regulatory compliance and auditability.
- Zero-trust or high-assurance architectures.
When it’s optional:
- Public read-only datasets with low risk.
- Internal tooling where developer velocity outweighs strict controls (short term).
- Prototyping phases where strict checks are intentionally relaxed with mitigation.
When NOT to use / overuse it:
- Over-enforcing non-sensitive operations causing latency and complexity.
- Applying verbose policy checks to high-throughput internal telemetry without benefit.
- Using complete mediation as an excuse for poor API design and coupling.
Decision checklist:
- If handling sensitive data AND external access -> enforce complete mediation.
- If internal-only low-risk service AND performance critical -> consider sampled checks.
- If you need rapid deprovisioning -> use enforcement with immediate revocation signals.
- If subject to compliance audits -> implement strict per-access logs.
Maturity ladder:
- Beginner: API gateway checks + RBAC, logging allow/deny.
- Intermediate: Service mesh sidecar enforcement + short TTL caches + policy-as-code.
- Advanced: Distributed PDP with streaming revocation, ABAC policies, observability-driven alerts, automated remediation.
How does Complete Mediation work?
Step-by-step components and workflow:
- Caller identity established (authentication) via tokens or mTLS.
- Request arrives at first enforcement point (edge/API gateway).
- Enforcer performs policy check against a Policy Decision Point (PDP) or local cache.
- PDP evaluates policy using attributes and returns permit/deny/conditional.
- Enforcer enforces the decision, logs outcome, and forwards or rejects.
- Downstream services re-check as needed for sensitive operations.
- Policy updates flow from Policy Administration Point (PAP) to PDP and enforcers.
- Revocation signals and cache invalidations ensure freshness.
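The workflow above can be sketched with two toy classes: an enforcer that asks a PDP for a decision, logs the outcome, and then forwards or rejects. Everything here (`Request`, `PolicyDecisionPoint`, `Enforcer`, the rule tuples, the status strings) is a hypothetical illustration under simplified assumptions, not the API of a real policy engine.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class Request:
    principal: str
    action: str
    resource: str


@dataclass
class PolicyDecisionPoint:
    """Toy PDP: permits only (principal, action, resource) triples in its rule set."""
    rules: set = field(default_factory=set)
    version: int = 1

    def evaluate(self, req: Request) -> str:
        key = (req.principal, req.action, req.resource)
        return "permit" if key in self.rules else "deny"


@dataclass
class Enforcer:
    """Toy enforcement point: asks the PDP, logs the outcome, forwards or rejects."""
    pdp: PolicyDecisionPoint
    audit_log: list = field(default_factory=list)

    def handle(self, req: Request, forward):
        decision = self.pdp.evaluate(req)
        self.audit_log.append({
            "principal": req.principal,
            "action": req.action,
            "resource": req.resource,
            "decision": decision,
            "policy_version": self.pdp.version,
        })
        if decision != "permit":
            return "403 Forbidden"
        return forward(req)


pdp = PolicyDecisionPoint(rules={("svc-billing", "read", "invoice/42")})
edge = Enforcer(pdp)
ok = edge.handle(Request("svc-billing", "read", "invoice/42"), lambda r: "200 OK")
blocked = edge.handle(Request("svc-billing", "delete", "invoice/42"), lambda r: "200 OK")
```

Both outcomes are logged before the request is forwarded or rejected, which keeps the audit trail complete even for denies.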
Data flow and lifecycle:
- Identity creation -> token issuance -> request -> evaluation -> enforcement -> audit log -> metrics -> policy change -> revocation -> cache invalidation.
Edge cases and failure modes:
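- PDP unavailable -> enforcers need a predetermined fail mode: deny (fail-closed, the safer default for sensitive operations) or allow (fail-open) with the risk explicitly accepted.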
- Token replay -> short TTLs and nonce checks.
- Latency-sensitive flows -> local cache with bounded TTL and revocation push.
- Intermittent network partitions -> ensure deterministic fail mode and monitoring.
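Several of these edge cases can be combined in one small sketch: a local decision cache with a bounded TTL, a revocation hook that drops cached permits, and a fail-closed path when the PDP cannot be reached. The `CachingEnforcer` class, the `PdpUnavailable` exception, and the 30-second TTL are assumptions chosen for illustration.

```python
import time


class PdpUnavailable(Exception):
    """Raised by the (hypothetical) PDP lookup when the PDP cannot be reached."""


class CachingEnforcer:
    def __init__(self, pdp_lookup, ttl_seconds=30.0, clock=time.monotonic):
        self._pdp_lookup = pdp_lookup  # callable(principal, resource) -> bool
        self._ttl = ttl_seconds
        self._clock = clock
        self._cache = {}               # (principal, resource) -> (allowed, expires_at)

    def revoke(self, principal):
        """Revocation push: drop every cached decision for the principal."""
        for key in [k for k in self._cache if k[0] == principal]:
            del self._cache[key]

    def is_allowed(self, principal, resource):
        key = (principal, resource)
        now = self._clock()
        entry = self._cache.get(key)
        if entry is not None and now < entry[1]:
            return entry[0]            # cached decision is still within its TTL
        try:
            allowed = self._pdp_lookup(principal, resource)
        except PdpUnavailable:
            return False               # fail-closed: no fresh decision means deny
        self._cache[key] = (allowed, now + self._ttl)
        return allowed


# Usage with a fake clock and a PDP that can be switched off.
fake_now = [0.0]
pdp_up = [True]


def lookup(principal, resource):
    if not pdp_up[0]:
        raise PdpUnavailable()
    return principal == "alice"


enforcer = CachingEnforcer(lookup, ttl_seconds=30.0, clock=lambda: fake_now[0])
fresh = enforcer.is_allowed("alice", "db/orders")    # PDP consulted
pdp_up[0] = False
cached = enforcer.is_allowed("alice", "db/orders")   # within TTL, served from cache
fake_now[0] = 31.0
expired = enforcer.is_allowed("alice", "db/orders")  # TTL elapsed and PDP down: denied
```

The TTL bounds how long a stale permit can survive without a revocation signal; `revoke` closes that window early when a signal does arrive.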
Typical architecture patterns for Complete Mediation
- Edge-first enforcement: API gateway as first check; useful for public APIs.
- Sidecar enforcement: service mesh enforces per-call checks; good for microservices.
- Library/middleware enforcement: application enforces inside code for domain-specific checks.
- Hybrid PDP + caches: centralized PDP with local caches and push invalidations for scale.
- Policy-as-code in CI/CD gates: static checks prevent policy-violating deployments.
- Database row-level policy enforcement (RLS) coexisting with service-level checks for defense in depth.
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Stale cache | Access granted after revocation | Cache TTL too long | Reduce TTL and push revocations | Stale cache hits metric |
| F2 | PDP outage | High deny-or-allow fallback events | PDP unreachable | Circuit breaker and fail-safe deny | PDP latency and error rate |
| F3 | Policy mismatch | Some services allow, others deny | Out-of-sync policies | Policy distribution verification | Policy version drift metric |
| F4 | Token replay | Duplicate actions from same token | Missing nonce checks | Use nonce and short TTLs | Duplicate request pattern |
| F5 | Performance regression | Increased request latency | Excessive sync calls to PDP | Cache and async evaluation | Authz latency SLI spike |
| F6 | Missing coverage | Unauthorized access by design gap | Unchecked code paths | Audit and add enforcers | Access control coverage metric |
| F7 | False positives | Legitimate requests denied | Overly strict policy rules | Tweak policy or exceptions | Deny rate and user reports |
| F8 | Audit log loss | Missing history for decisions | Logging pipeline failure | Durable logging and retries | Log ingestion drop count |
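For the token replay row (F4), a nonce check might look like the sketch below. It assumes each request carries a token identifier and a caller-chosen nonce, and that tokens expire within the tracking window, so forgetting old nonces does not reopen a replay path. `ReplayGuard` and its parameters are hypothetical.

```python
import time


class ReplayGuard:
    """Rejects a (token_id, nonce) pair that was already seen within the window."""

    def __init__(self, window_seconds=300.0, clock=time.monotonic):
        self._window = window_seconds
        self._clock = clock
        self._seen = {}  # (token_id, nonce) -> time first seen

    def accept(self, token_id, nonce):
        now = self._clock()
        # Forget nonces older than the window so the state stays bounded.
        self._seen = {k: t for k, t in self._seen.items() if now - t < self._window}
        key = (token_id, nonce)
        if key in self._seen:
            return False  # same token and nonce seen before: treat as a replay
        self._seen[key] = now
        return True


guard = ReplayGuard(window_seconds=300.0, clock=lambda: 0.0)
first = guard.accept("tok-1", "n-001")       # new nonce: accepted
replayed = guard.accept("tok-1", "n-001")    # identical pair: rejected
next_call = guard.accept("tok-1", "n-002")   # fresh nonce on the same token: accepted
```

This is the state management cost that the glossary entry for Nonce mentions: the guard has to remember what it has seen for at least as long as a token stays valid.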
Key Concepts, Keywords & Terminology for Complete Mediation
Each glossary entry below follows the same order: term, definition, why it matters, and a common pitfall.
- Access Control — Mechanism to permit or deny resource access — core of mediation — assumes enforcement exists
- Access Token — Credential proving identity — used to make authz decisions — stale tokens can be abused
- Active Revocation — Immediate invalidation of rights — reduces window of risk — requires signaling to caches
- Attribute-Based Access Control — Policies based on attributes — flexible for cloud contexts — complex policy authoring
- Authorization — Decision process allowing actions — the intent of mediation — mistaken for authentication
- Audit Log — Immutable record of access events — required for forensics — can be incomplete if pipeline fails
- Backup PDP — Redundant policy decision point — resilience — adds complexity to sync
- Baseline Policy — Minimal permitted actions — safety net — can block legitimate workflows
- Bindings — Link between principal and role — simplifies rules — stale bindings cause issues
- Cache TTL — Time cache entries live — performance tactic — too-long TTL violates mediation
- Central Policy Store — Single source of truth for rules — consistency benefit — single point of failure if mismanaged
- Challenge-Response — Mechanism to verify freshness — mitigates replay — extra round-trip latency
- Conditional Access — Policies based on context — reduces risk — complexity in evaluation logic
- Deny by Default — Default posture of refusal — secure baseline — may block users initially
- Delegation — Allowing actors to act for others — needed for workflows — mis-scoped delegation is risky
- Fine-Grained Authorization — Resource-level checks — limits blast radius — can be heavy to maintain
- Identity Provider — Issues credentials — starting point for authz — trust boundary to validate
- Immutable Audit — Tamper-proof logs — essential for compliance — hard to retroactively add
- Implicit Trust — Trust without re-verification — anti-pattern for mediation — leads to breaches
- JWT — Token format with claims — common in distributed systems — long TTLs problematic
- Least Privilege — Give minimum rights needed — reduces exposure — can slow feature delivery
- Legal Hold — Prevent revocation for compliance — affects mediation windows — needs exceptions handling
- Multi-Cloud Policy — Policies that span providers — necessary in 2026 cloud stacks — increased integration effort
- Nonce — One-time value to prevent replay — improves security — requires state management
- Observability — Metrics, logs, traces for authz — proves enforcement works — often incomplete coverage
- PDP — Policy Decision Point evaluates policies — core runtime evaluator — scaling needs care
- PAP — Policy Administration Point manages policies — governance function — can be bottleneck
- Policy-as-Code — Policies defined and tested in code — CI/CD integration — requires testing discipline
- Policy Cache — Local copy of decisions or rules — reduces latency — invalidation complexity
- RBAC — Role-based access control model — simple to reason about — coarse for modern needs
- Revocation List — Records revoked tokens or grants — needed for rapid deprovisioning — must be checked frequently
- Service Mesh — Network layer with sidecars — convenient enforcement point — can be bypassed if misconfigured
- Shadow Mode — Simulate enforcement without blocking — safe rollout method — must monitor outcomes
- Single Sign-On — Unified identity across apps — simplifies auth — reliance centralizes risk
- Session — Authenticated context for a user — often assumed safe — session hijack risk
- Sidecar — Proxy co-located with service — enforces per-call checks — deployment and observability needed
- Token Exchange — Swap token types for scopes — supports delegation — increases complexity
- Tracing — Distributed traces of authz paths — helps debug enforcement — sampling may hide issues
- Two-Phase Enforcement — Initial gate, then operation-level check — balance safety and latency — more implementation work
- Zero Trust — Security posture of no implicit trust — natural home for complete mediation — requires orchestration
How to Measure Complete Mediation (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Authorization success rate | Fraction of allowed decisions vs requests | allow_count / total_requests per minute | 99.9% for non-sensitive ops | False positives mask real issues |
| M2 | Authorization deny rate | Fraction of denies indicating policy blocks | deny_count / total_requests per minute | Varies by app; alert on spikes | Spikes may be expected after deploys |
| M3 | Authz decision latency | Time to evaluate and enforce a decision | p95 latency from request start to decision | p95 < 50ms for APIs | Network to PDP adds variance |
| M4 | Policy distribution lag | Time from PAP change to enforcer update | time policy_updated -> enforcer version | <30s for high-sensitivity | Large fleets need push infra |
| M5 | Cache stale window | Time between revocation and last enforcement of old permit | max time an old permit was still honored after revocation | <60s for sensitive systems | Complex to measure accurately |
| M6 | PDP error rate | PDP internal failures rate | errors / total_requests to PDP | <0.1% | Transient errors must be tracked |
| M7 | Enforcement coverage | Fraction of access paths checked | checked_paths / total_paths | 100% for sensitive resources | Discovery of paths is hard |
| M8 | Unauthorized access events | Incidents where unauthorized actions occurred | count of confirmed unauthorized ops | 0 | Detection depends on logging |
| M9 | Audit log completeness | Fraction of decisions logged and ingested | logged_decisions / total_decisions | 100% | Logging pipeline drops can hide gaps |
| M10 | Revocation propagation time | Time for revocation to be enforced globally | time from revoke -> no further access | <5s for critical systems | Dependent on network and caches |
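A rough sketch of how M1, M2, and M3 could be derived from a batch of decision records follows. The record shape (`decision`, `latency_ms`) and the nearest-rank percentile method are assumptions; a production system would more likely compute these in its metrics backend.

```python
def authz_slis(decisions):
    """decisions: list of dicts with 'decision' ('permit' or 'deny') and 'latency_ms'."""
    total = len(decisions)
    if total == 0:
        return {"success_rate": None, "deny_rate": None, "p95_latency_ms": None}
    permits = sum(1 for d in decisions if d["decision"] == "permit")
    latencies = sorted(d["latency_ms"] for d in decisions)
    # Nearest-rank p95: smallest sample with at least 95% of samples at or below it.
    rank = (95 * total + 99) // 100
    return {
        "success_rate": permits / total,
        "deny_rate": (total - permits) / total,
        "p95_latency_ms": latencies[rank - 1],
    }


# Twenty sample decisions: latencies 1..20 ms, the slowest one is a deny.
sample = [
    {"decision": "deny" if i == 20 else "permit", "latency_ms": i}
    for i in range(1, 21)
]
slis = authz_slis(sample)
```

Note that M1 as defined counts permits, so a high value says nothing about whether the permits were correct; it is best read together with M7 and M8.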
Best tools to measure Complete Mediation
Tool — OpenTelemetry
- What it measures for Complete Mediation: Distributed traces and spans including authz decision timings.
- Best-fit environment: Cloud-native microservices and service mesh.
- Setup outline:
- Instrument services and sidecars.
- Capture authz decision spans.
- Propagate trace context across calls.
- Export to chosen backend.
- Strengths:
- Standardized telemetry.
- Rich context for root cause.
- Limitations:
- Requires instrumentation effort.
- Sampling may miss authz anomalies.
Tool — Policy Decision Point (PDP) solutions
- What it measures for Complete Mediation: Decision counts, latency, error rates.
- Best-fit environment: Centralized policy evaluation with distributed enforcers.
- Setup outline:
- Deploy redundant PDPs.
- Expose metrics endpoint.
- Integrate with policy store.
- Strengths:
- Centralized visibility.
- Consistent decisions.
- Limitations:
- Scaling needs careful design.
- Network latency concerns.
Tool — Service Mesh control planes
- What it measures for Complete Mediation: Per-call enforcement, deny/allow metrics at sidecar.
- Best-fit environment: Kubernetes microservices.
- Setup outline:
- Enable authz policies in mesh.
- Collect mesh metrics and logs.
- Configure policy sync.
- Strengths:
- Transparent enforcement.
- Fine-grained telemetry.
- Limitations:
- Mesh complexity.
- Bypass risk if sidecars removed.
Tool — API Gateways
- What it measures for Complete Mediation: Edge-level authz rates and latencies.
- Best-fit environment: Public APIs and ingress control.
- Setup outline:
- Configure authz plugins.
- Enable decision and latency metrics.
- Integrate with PDP or local policies.
- Strengths:
- First-line defense.
- Easy to observe externally.
- Limitations:
- Not sufficient for intra-service checks.
Tool — SIEM / Logging pipelines
- What it measures for Complete Mediation: Audit log ingestion, correlation of authz events.
- Best-fit environment: Organizations with compliance needs.
- Setup outline:
- Forward authz logs with structured fields.
- Create dashboards and alerts.
- Strengths:
- Centralized forensic view.
- Long-term retention.
- Limitations:
- High volume and cost.
- Latency for analysis.
Recommended dashboards & alerts for Complete Mediation
Executive dashboard:
- Panels: Authorization success rate, deny rate trend, unauthorized events, policy distribution lag.
- Why: High-level health and risk metrics for leadership.
On-call dashboard:
- Panels: Recent deny spike, PDP error rate, authz latency p95/p99, revocation propagation times, top denied users.
- Why: Rapid triage of enforcement incidents.
Debug dashboard:
- Panels: Per-service authz traces, last 100 decisions, cache hit/miss ratio, policy version per host, relevant logs stream.
- Why: Deep troubleshooting for engineers.
Alerting guidance:
- Page vs ticket:
- Page when unauthorized access events or PDP outage impacts production.
- Ticket for policy drift warnings or minor deny spikes.
- Burn-rate guidance:
- If SLO error budget consumption > 20% per hour, page and investigate.
- Noise reduction tactics:
- Deduplicate similar authz alerts by user/service.
- Group alerts by root cause (policy version, PDP endpoint).
- Use suppression windows during planned deployments.
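The burn-rate rule above can be sketched as a small calculation. It assumes the error budget is expressed as a count of bad authorization events over the whole SLO window, and the function names, the 99.9% target, and the traffic figures are illustrative assumptions.

```python
def budget_consumed_fraction(bad_events_last_hour, expected_events_in_window, slo_target):
    """Fraction of the whole-window error budget spent during the last hour."""
    budget_events = (1.0 - slo_target) * expected_events_in_window
    if budget_events <= 0:
        raise ValueError("SLO target leaves no error budget")
    return bad_events_last_hour / budget_events


def should_page(bad_events_last_hour, expected_events_in_window, slo_target, threshold=0.20):
    """Page when more than `threshold` of the budget was consumed in one hour."""
    consumed = budget_consumed_fraction(
        bad_events_last_hour, expected_events_in_window, slo_target
    )
    return consumed > threshold


# A 99.9% SLO over 10 million expected decisions leaves about 10,000 bad events of budget.
page_now = should_page(2_500, 10_000_000, 0.999)    # roughly 25% of budget in an hour
ticket_only = should_page(1_000, 10_000_000, 0.999)  # roughly 10% of budget in an hour
```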
Implementation Guide (Step-by-step)
1) Prerequisites
- Inventory of sensitive resources and access paths.
- Identity provider and token strategy defined.
- Observability platform and logging pipeline available.
- Policy language selected and governance process.
2) Instrumentation plan
- Identify enforcement points: edge, sidecars, app code, DB proxies.
- Standardize authz request and response schema.
- Instrument decision latency and outcome metrics.
3) Data collection
- Centralize audit logs with structured fields: timestamp, principal, resource, action, decision, policy_version.
- Capture distributed traces including PDP calls.
4) SLO design
- Define SLIs: decision success, latency, coverage.
- Set SLOs based on risk appetite and performance needs.
5) Dashboards
- Build executive, on-call, debug dashboards.
- Add policy distribution and revocation panels.
6) Alerts & routing
- Alerts for PDP errors, high deny rates, and failures to log.
- Route to security or SRE on-call based on runbook.
7) Runbooks & automation
- Runbook for PDP outage: verify redundancy, switch fail-mode.
- Automation: policy rollout via CI, automated revocation push.
8) Validation (load/chaos/game days)
- Load test PDP and enforcers.
- Chaos test network partitions and validate fail-safe behavior.
- Run game days simulating rapid deprovisioning.
9) Continuous improvement
- Review denies weekly for false positives.
- Audit policy complexity and remove stale rules.
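As one possible shape for the structured audit record named in step 3, the sketch below emits a JSON line with the listed fields. The `request_id` field is an added assumption (a correlation identifier), and the function name is hypothetical.

```python
import json
from datetime import datetime, timezone


def audit_record(principal, resource, action, decision, policy_version,
                 request_id, now=None):
    """Serialize one authorization decision as a JSON line."""
    if decision not in ("permit", "deny"):
        raise ValueError(f"unknown decision: {decision!r}")
    ts = now or datetime.now(timezone.utc)
    return json.dumps({
        "timestamp": ts.isoformat(),
        "principal": principal,
        "resource": resource,
        "action": action,
        "decision": decision,
        "policy_version": policy_version,
        "request_id": request_id,  # assumption: correlation id shared with traces
    }, sort_keys=True)


line = audit_record(
    principal="svc-billing",
    resource="invoice/42",
    action="read",
    decision="permit",
    policy_version="v17",
    request_id="req-0001",
    now=datetime(2024, 1, 1, tzinfo=timezone.utc),
)
parsed = json.loads(line)
```

Keeping `policy_version` in every record makes it possible to answer, after the fact, which version of the policy produced a given decision.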
Pre-production checklist:
- All enforcement points instrumented.
- Policy tests in CI passing.
- Audit logs forwarded and ingested.
- PDP redundancy and fail-mode tested.
Production readiness checklist:
- SLOs defined and dashboards operational.
- Alerts set and on-call trained.
- Revocation propagation validated in staging.
Incident checklist specific to Complete Mediation:
- Identify scope of affected requests.
- Check policy version history and distribution lag.
- Verify PDP health and error logs.
- Confirm audit logs for timeline.
- Apply mitigation: rollback policy, adjust TTLs, or switch fail-mode.
Use Cases of Complete Mediation
1) Multi-tenant SaaS
- Context: Many tenants share services.
- Problem: Prevent cross-tenant data access.
- Why it helps: Enforces tenant-aware policies per request.
- What to measure: Enforcement coverage, unauthorized events.
- Typical tools: Service mesh, DB row-level enforcement.
2) Payroll processing
- Context: Financial transactions with strict compliance.
- Problem: Unauthorized adjustments cause legal issues.
- Why it helps: Ensures a check per transaction and timely revocation.
- What to measure: Revocation propagation time, decision latency.
- Typical tools: PDP, audit logging, shadow mode.
3) Admin portals
- Context: Elevated privileges for support staff.
- Problem: Privilege misuse or overreach.
- Why it helps: Fine-grained checks on each admin action.
- What to measure: Admin deny rate, last action audit trails.
- Typical tools: Middleware enforcement, policy-as-code.
4) IoT fleets
- Context: Devices with intermittent connectivity.
- Problem: Device tokens repeatedly used after compromise.
- Why it helps: Short TTLs, nonce checks, and revocation propagation reduce the exposure window.
- What to measure: Cache stale window, revocation fail rate.
- Typical tools: Edge enforcers with offline policies.
5) Platform engineering (internal APIs)
- Context: Many internal services interacting.
- Problem: Lateral movement risk during breach.
- Why it helps: Per-call enforcement limits blast radius.
- What to measure: Enforcement coverage, sidecar deny counts.
- Typical tools: Service mesh, mutual TLS.
6) Healthcare records
- Context: PHI access controls required.
- Problem: Ensuring patient consent and context at access time.
- Why it helps: Attribute-based checks per resource access.
- What to measure: Unauthorized access events, audit completeness.
- Typical tools: ABAC engines, audit pipelines.
7) CI/CD secret access
- Context: Build jobs access secrets.
- Problem: Stolen credentials enabling pipeline abuse.
- Why it helps: Short-lived credentials and per-access checks reduce risk.
- What to measure: Secrets usage events, revocation time.
- Typical tools: Short-lived token manager, PDP integration.
8) Serverless functions
- Context: High concurrency ephemeral compute.
- Problem: Avoid stale permissions in scaled functions.
- Why it helps: Enforces per-invocation checks and token refresh.
- What to measure: Authz latency p95, invocation deny rate.
- Typical tools: Serverless gateways, inline middleware.
9) Third-party integrations
- Context: External apps call your APIs.
- Problem: OAuth tokens retained after partnership ends.
- Why it helps: Immediate revocation and scope checks per call.
- What to measure: Token exchange audit, revoke propagation.
- Typical tools: OAuth token manager, gateway enforcement.
10) Data pipelines
- Context: Streaming ETL with access to multiple datasets.
- Problem: A compromised job exfiltrates data.
- Why it helps: Segment-level checks for each pipeline step.
- What to measure: Access patterns, deny rates, data egress events.
- Typical tools: Data proxy, RLS, policy engines.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes microservice authorization
Context: Multi-tenant microservices running in Kubernetes.
Goal: Ensure per-tenant resource isolation with minimal latency.
Why Complete Mediation matters here: Prevent cross-tenant reads and writes on service calls.
Architecture / workflow: Ingress -> API gateway -> service mesh sidecars -> services -> DB with RLS.
Step-by-step implementation:
- Deploy API gateway for edge checks.
- Inject sidecars with authz plugin.
- Use PDP with tenant attribute evaluation.
- Enable DB RLS for row enforcement.
- Instrument traces and authz metrics.
What to measure: Enforcement coverage, authz p95 latency, unauthorized events.
Tools to use and why: Service mesh for sidecar enforcement; PDP for policy centralization; DB RLS for data protection.
Common pitfalls: Sidecar bypass during deployment, stale cache TTLs.
Validation: Run simulated tenant deprovisioning and verify no further access.
Outcome: Reduced cross-tenant incidents and audit trail.
Scenario #2 — Serverless payment validation (serverless/managed-PaaS)
Context: Payment processing using managed functions.
Goal: Validate authorization per invocation without adding significant latency.
Why Complete Mediation matters here: Payments are high-risk; each invocation must be authorized.
Architecture / workflow: API gateway -> authz middleware -> serverless function -> payment gateway.
Step-by-step implementation:
- Place authz at gateway with token verification.
- Use token exchange to scope tokens for function invocation.
- Employ short-lived tokens and immediate revocation push on compromise.
- Monitor function authz latency.
What to measure: Authz decision latency, revoke propagation, deny rate.
Tools to use and why: API gateway for first check; token manager for short TTLs; logging for audits.
Common pitfalls: Long-running function state assuming old token permissions.
Validation: Run load tests and revocation drills.
Outcome: Secure, low-latency payments with auditable access decisions.
Scenario #3 — Incident response and postmortem (incident-response/postmortem)
Context: Breach where a deprovisioned account still made changes.
Goal: Identify root cause and close the gap.
Why Complete Mediation matters here: A missing enforcement check allowed the action.
Architecture / workflow: Audit log ingestion -> trace correlation -> policy version history.
Step-by-step implementation:
- Triage timeline from logs.
- Check policy distribution and cache TTLs.
- Reproduce the path with shadow mode to confirm fix.
- Apply remediation: Reduce TTL and push active revokes.
- Update runbooks and policy tests.
What to measure: Time between deprovision and last access, audit completeness.
Tools to use and why: SIEM for log correlation; tracing for request paths.
Common pitfalls: Missing logs hinder root cause identification.
Validation: Postmortem drills and automated alerts for revocation failures.
Outcome: Root cause fixed and SLO adjusted.
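The "time between deprovision and last access" measurement for this scenario could be derived from audit events roughly as follows. The event shape (dicts with `principal`, `decision`, and a timezone-aware `timestamp`) and the sample data are assumptions made for the sketch.

```python
from datetime import datetime, timezone


def exposure_window_seconds(audit_events, principal, deprovisioned_at):
    """Seconds between deprovisioning and the principal's last permitted access
    after it. Returns 0.0 when no permitted access followed deprovisioning."""
    late = [
        e["timestamp"]
        for e in audit_events
        if e["principal"] == principal
        and e["decision"] == "permit"
        and e["timestamp"] > deprovisioned_at
    ]
    if not late:
        return 0.0
    return (max(late) - deprovisioned_at).total_seconds()


def at(hour, minute):
    """Helper for the sample data: a UTC time on an arbitrary fixed day."""
    return datetime(2024, 1, 1, hour, minute, tzinfo=timezone.utc)


events = [
    {"principal": "carol", "decision": "permit", "timestamp": at(9, 0)},
    {"principal": "carol", "decision": "permit", "timestamp": at(10, 5)},
    {"principal": "carol", "decision": "permit", "timestamp": at(10, 20)},
    {"principal": "carol", "decision": "deny", "timestamp": at(10, 30)},
    {"principal": "dave", "decision": "permit", "timestamp": at(11, 0)},
]
window = exposure_window_seconds(events, "carol", deprovisioned_at=at(10, 0))
```

In the sample data the account was deprovisioned at 10:00 and was last permitted at 10:20, so the exposure window is 20 minutes. The result is only as trustworthy as the audit log is complete.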
Scenario #4 — Cost vs performance trade-off (cost/performance trade-off)
Context: High-throughput API where PDP calls add cost and latency.
Goal: Maintain secure enforcement while controlling costs.
Why Complete Mediation matters here: Need to balance per-request checks and system scaling.
Architecture / workflow: Hybrid PDP with local policy cache and revocation push.
Step-by-step implementation:
- Move static allow rules to local cache with very short TTLs for sensitive paths.
- Keep high-risk checks routed to PDP.
- Use sampled PDP verification for low-risk paths to validate cache accuracy.
- Monitor costs from PDP calls and authz latency.
What to measure: PDP call rate, authz latency, unauthorized events chart.
Tools to use and why: PDP for dynamic checks; cache with push invalidation for scale control.
Common pitfalls: Excessive TTL causing stale access; under-sampling misses regressions.
Validation: Load testing and chaos experiments.
Outcome: Reduced PDP costs and acceptable latency while preserving security.
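A tiered strategy like the one in this scenario might be sketched as below: high-risk actions always go to the PDP, low-risk actions are served from a local allow-list, and a random sample of low-risk requests is re-verified to detect cache drift. `TieredEnforcer`, its counters, and the sample data are hypothetical; TTL handling and revocation push are omitted to keep the sketch short.

```python
import random


class TieredEnforcer:
    """Always asks the PDP for high-risk actions; serves low-risk ones from a
    local allow-list and re-verifies a random sample against the PDP."""

    def __init__(self, pdp_check, cached_allows, high_risk_actions,
                 sample_rate=0.05, rng=None):
        self._pdp_check = pdp_check               # callable(principal, action) -> bool
        self._cached_allows = set(cached_allows)  # {(principal, action), ...}
        self._high_risk = set(high_risk_actions)
        self._sample_rate = sample_rate
        self._rng = rng or random.Random()
        self.pdp_calls = 0
        self.cache_mismatches = 0

    def _ask_pdp(self, principal, action):
        self.pdp_calls += 1
        return self._pdp_check(principal, action)

    def is_allowed(self, principal, action):
        if action in self._high_risk:
            return self._ask_pdp(principal, action)
        cached = (principal, action) in self._cached_allows
        if self._rng.random() < self._sample_rate:
            fresh = self._ask_pdp(principal, action)
            if fresh != cached:
                self.cache_mismatches += 1  # signal that the cache has drifted
            return fresh                    # prefer the fresh answer when available
        return cached


# Hypothetical PDP: bob lost "read" after the local allow-list was built.
def pdp(principal, action):
    return (principal, action) in {("alice", "read"), ("alice", "refund")}


allows = {("alice", "read"), ("bob", "read")}
unsampled = TieredEnforcer(pdp, allows, {"refund"}, sample_rate=0.0)
always_sampled = TieredEnforcer(pdp, allows, {"refund"}, sample_rate=1.0)

stale_allow = unsampled.is_allowed("bob", "read")     # cache answer, PDP not consulted
caught = always_sampled.is_allowed("bob", "read")     # PDP consulted, drift detected
refund = unsampled.is_allowed("alice", "refund")      # high-risk: always goes to the PDP
```

The `stale_allow` result shows the trade-off directly: with no sampling, the revoked permit for bob keeps being served from the cache, which is the under-sampling pitfall listed for this scenario.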
Common Mistakes, Anti-patterns, and Troubleshooting
Each item follows the pattern Symptom -> Root cause -> Fix.
1) Symptom: Unauthorized action observed -> Root cause: Stale cache -> Fix: Reduce TTL and implement push revocations.
2) Symptom: High authz latency -> Root cause: Sync calls to centralized PDP -> Fix: Add local cache and async evaluation.
3) Symptom: Policy drift between services -> Root cause: Manual policy updates -> Fix: Policy-as-code and CI validation.
4) Symptom: Missing audit entries -> Root cause: Logging pipeline misconfiguration -> Fix: Fix pipeline and backfill where possible.
5) Symptom: False positive denies -> Root cause: Overly strict ABAC rules -> Fix: Adjust policies and use shadow mode during rollout.
6) Symptom: PDP outage leads to permissive mode -> Root cause: Fail-open default -> Fix: Change to fail-closed for sensitive ops.
7) Symptom: Excessive cost from PDP calls -> Root cause: PDP per-request for all flows -> Fix: Cache and tiered decision strategy.
8) Symptom: Sidecar bypass during scaling -> Root cause: Deployment mis-injection -> Fix: Admission controller enforcement and CI checks.
9) Symptom: Latent revocations -> Root cause: Revocation queue backlog -> Fix: Prioritize revocations and monitor queue length.
10) Symptom: Unclear ownership -> Root cause: Security and platform teams misaligned -> Fix: Define clear ownership and runbooks.
11) Symptom: Incomplete telemetry -> Root cause: Instrumentation gaps -> Fix: Instrument at every enforcement point.
12) Symptom: Alert storm on deploy -> Root cause: Policy version change causing denies -> Fix: Suppression window during rollout and preflight tests.
13) Symptom: Inconsistent decision outcomes -> Root cause: Multiple PDP versions -> Fix: Version gates and canary policies.
14) Symptom: Overcomplicated policies -> Root cause: Excessive condition branching -> Fix: Simplify and modularize policies.
15) Symptom: Developer friction -> Root cause: Poorly documented policy model -> Fix: Provide policy libraries and examples.
16) Symptom: Observability missing context -> Root cause: Logs lack request ids -> Fix: Add correlation ids to authz logs.
17) Symptom: Unauthorized events go undetected -> Root cause: Sampling hides issues -> Fix: Reduce sampling for authz paths.
18) Symptom: Database-level bypass -> Root cause: Direct DB access ignored enforcement -> Fix: Enforce DB proxy or RLS.
19) Symptom: Token misuse -> Root cause: Long-lived JWTs -> Fix: Use short-lived tokens and refresh flows.
20) Symptom: Audit storage costs -> Root cause: Verbose logs with high retention -> Fix: Tiered retention and archive policies.
21) Symptom: Shadow mode ignored -> Root cause: No owner for analysis -> Fix: Assign owner to review shadow results.
22) Symptom: Inadequate testing -> Root cause: No policy tests in CI -> Fix: Add policy unit and integration tests.
23) Symptom: Revocation not immediate -> Root cause: No push mechanism -> Fix: Implement push invalidation or subscribe model.
24) Symptom: Lack of RBAC granularity -> Root cause: Flat role scopes -> Fix: Introduce scoped roles and ABAC for nuance.
25) Symptom: Missing incident playbook -> Root cause: No runbook for authz incidents -> Fix: Create targeted runbooks and drills.
Observability pitfalls in the list above: missing audit entries (4), incomplete telemetry coverage (11), logs lacking request or correlation IDs (16), and sampling that hides failures (17).
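Several of the fixes above (bounded TTLs, push revocation) come down to how the local decision cache behaves. The following is a minimal sketch, not a reference implementation: the `DecisionCache` class, its method names, and the injectable `clock` parameter are all hypothetical names chosen for illustration.

```python
import time
from typing import Callable, Dict, Optional, Tuple

Key = Tuple[str, str, str]  # (principal, resource, action)


class DecisionCache:
    """Hypothetical local cache: bounded TTL plus push-style revocation."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self._ttl = ttl_seconds
        self._clock = clock  # injectable so expiry can be tested deterministically
        self._entries: Dict[Key, Tuple[bool, float]] = {}

    def get(self, principal: str, resource: str, action: str) -> Optional[bool]:
        """Return the cached decision, or None if absent or expired."""
        key = (principal, resource, action)
        entry = self._entries.get(key)
        if entry is None:
            return None
        decision, expires_at = entry
        if self._clock() >= expires_at:
            # Expired: drop the entry so the caller must ask the PDP again.
            del self._entries[key]
            return None
        return decision

    def put(self, principal: str, resource: str, action: str,
            decision: bool) -> None:
        """Store a fresh PDP decision with a bounded lifetime."""
        expires_at = self._clock() + self._ttl
        self._entries[(principal, resource, action)] = (decision, expires_at)

    def revoke_principal(self, principal: str) -> int:
        """Handle a pushed revocation: evict every entry for the principal."""
        stale = [key for key in self._entries if key[0] == principal]
        for key in stale:
            del self._entries[key]
        return len(stale)
```

A `None` result from `get` means the enforcer has no valid cached answer and should fall back to a fresh PDP decision, which is what keeps the cache compatible with complete mediation.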
Best Practices & Operating Model
Ownership and on-call:
- Assign policy ownership to a platform or security team.
- Include authoring, testing, and rollout responsibilities.
- On-call rotations should include an authorization responder.
Runbooks vs playbooks:
- Runbooks: step-by-step for known errors (PDP outage, revocation failure).
- Playbooks: broader strategies for incidents requiring coordination.
Safe deployments:
- Canary policies: roll to small percentage first.
- Shadow deployments: evaluate denies without blocking.
- Automated rollback if deny spikes exceed threshold.
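The shadow and rollback items above can be sketched roughly as follows. This is an illustration under stated assumptions: `shadow_evaluate`, `should_rollback`, `ShadowReport`, and the 1% default threshold are hypothetical, and policies are modelled as plain functions rather than calls to any particular policy engine.

```python
from dataclasses import dataclass
from typing import Callable, Iterable, Tuple

Request = Tuple[str, str, str]  # (principal, resource, action)
Policy = Callable[[Request], bool]


@dataclass
class ShadowReport:
    total: int
    new_denies: int  # allowed by the current policy, denied by the candidate

    @property
    def new_deny_rate(self) -> float:
        return self.new_denies / self.total if self.total else 0.0


def shadow_evaluate(requests: Iterable[Request], current: Policy,
                    candidate: Policy) -> ShadowReport:
    """Evaluate the candidate policy alongside the current one without blocking."""
    total = 0
    new_denies = 0
    for request in requests:
        total += 1
        enforced = current(request)   # the decision that is actually enforced
        shadow = candidate(request)   # recorded for analysis only
        if enforced and not shadow:
            new_denies += 1
    return ShadowReport(total=total, new_denies=new_denies)


def should_rollback(report: ShadowReport,
                    max_new_deny_rate: float = 0.01) -> bool:
    """Flag the candidate if it would deny more traffic than the threshold allows."""
    return report.new_deny_rate > max_new_deny_rate
```

The threshold is a tuning choice: a candidate that intentionally tightens access will produce new denies, so the value should reflect the expected change rather than a fixed default.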
Toil reduction and automation:
- Policy-as-code with tests in CI/CD.
- Automated distribution and verification of policy versions.
- Auto-remediation for common misconfigurations.
Security basics:
- Fail-closed for sensitive operations.
- Short token TTLs and immediate revocations.
- Defense in depth: enforce at multiple layers.
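As an illustration of the fail-closed item, the sketch below wraps a PDP call so that an outage never turns into an implicit allow for sensitive actions. The names `authorize`, `PDPUnavailable`, and the contents of `SENSITIVE_ACTIONS` are assumptions made for this example; a real deployment would derive the sensitive set from policy.

```python
from typing import Callable

# Hypothetical set of actions that must never be allowed without a live decision.
SENSITIVE_ACTIONS = frozenset({"delete", "grant", "export"})


class PDPUnavailable(Exception):
    """Raised by the PDP client when no decision can be obtained."""


def authorize(principal: str, resource: str, action: str,
              query_pdp: Callable[[str, str, str], bool],
              low_risk_default: bool = False) -> bool:
    """Return the PDP decision; on PDP failure, deny sensitive actions."""
    try:
        return query_pdp(principal, resource, action)
    except PDPUnavailable:
        if action in SENSITIVE_ACTIONS:
            return False  # fail-closed: an outage must not widen access
        # Low-risk flows use an explicit, configurable default (deny unless overridden).
        return low_risk_default
```

Keeping `low_risk_default` at `False` makes the whole system fail-closed; setting it to `True` is a deliberate availability trade-off that should be limited to flows classified as low risk.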
Weekly/monthly routines:
- Weekly: Review deny spikes and false positives.
- Monthly: Audit policy drift and stale rules.
- Quarterly: Revocation drills and game days.
Postmortem review items:
- Was complete mediation enforced at every access point?
- Time to detect and remediate any bypass.
- Policy distribution lag during the incident.
- Audit log completeness for the incident window.
- Changes to SLOs or runbooks.
Tooling & Integration Map for Complete Mediation
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | PDP | Evaluates policy decisions | Enforcers and logging | Central decision engine |
| I2 | PAP | Manages policy lifecycle | CI/CD and PDP | Policy-as-code source |
| I3 | Service Mesh | Sidecar enforcement | Kubernetes and PDP | Fine-grained controls |
| I4 | API Gateway | Edge enforcement | IdP and PDP | First-line defense |
| I5 | IAM | Identity issuance and management | IdP, tokens, revocation | Source of truth for identities |
| I6 | DB Proxy | Enforces DB access rules | DB and PAP | Works with RLS |
| I7 | Logging/SIEM | Stores auditable logs | PDP and enforcers | Forensics and alerts |
| I8 | Observability | Traces and metrics | Tracing and metric backends | Perf and root cause |
| I9 | Policy Testing | Unit and integration tests for policies | CI/CD | Prevents regressions |
| I10 | Token Manager | Issues short-lived creds | IdP and gateways | Reduces token lifetime |
Frequently Asked Questions (FAQs)
What is the difference between complete mediation and least privilege?
Complete mediation is about checking every access. Least privilege is about minimizing granted rights. Both are complementary.
Does complete mediation mean blocking all cached checks?
No. Caching is allowed but must honor short TTLs and revocation signals to preserve mediation guarantees.
How do we balance latency with per-request authorization?
Use local caches with bounded TTLs, tiered PDP checks, and async validation for low-risk flows.
Is service mesh required for complete mediation?
No. A service mesh is a convenient enforcement point, but mediation can also be implemented with API gateways, standalone sidecars, or in-application authorization libraries.
How should revocations be propagated?
Push invalidation messages to enforcers, prioritize critical revocations, and monitor propagation times.
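One way to monitor propagation time is to compare the moment a revocation was issued with the moment each enforcer acknowledged it. The sketch below assumes enforcers report acknowledgement timestamps; the function name `propagation_report` and the enforcer names in the usage note are hypothetical.

```python
from typing import Dict, List, Optional, Set, Tuple


def propagation_report(revoked_at: float,
                       acked_at: Dict[str, float],
                       enforcers: Set[str]) -> Tuple[Optional[float], List[str]]:
    """Return (worst observed lag in seconds, enforcers that have not acknowledged)."""
    lags = [acked_at[name] - revoked_at for name in enforcers if name in acked_at]
    pending = sorted(enforcers - set(acked_at))
    worst_lag = max(lags) if lags else None
    return worst_lag, pending
```

For example, calling it with `revoked_at=100.0`, acknowledgements from `gateway` at `100.5` and `sidecar-a` at `102.0`, and a third enforcer `db-proxy` that never acknowledged, yields a worst lag of `2.0` seconds and `["db-proxy"]` as pending. A non-empty pending list is the signal to alert on, since it means some enforcement point may still honor the revoked principal.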
What logs are essential for mediation audits?
Structured decision logs with request ID, principal, resource, action, decision, and policy version.
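A decision log record with those fields might look like the sketch below. The `DecisionLog` class and its `to_json` method are illustrative names only; timestamps are omitted here on the assumption that the logging pipeline adds them.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class DecisionLog:
    """One authorization decision, carrying the fields needed for an audit."""
    request_id: str       # correlation ID shared with traces and application logs
    principal: str
    resource: str
    action: str
    decision: str         # "allow" or "deny"
    policy_version: str   # the policy version that produced the decision

    def to_json(self) -> str:
        """Serialize with stable key order so log lines are easy to diff and index."""
        return json.dumps(asdict(self), sort_keys=True)
```

Recording `policy_version` on every line is what makes it possible to tell, after the fact, whether a surprising decision came from a policy change or from changed request attributes.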
How often should policies be tested?
Every change must pass unit tests in CI; integration tests in staging before rollout.
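A table-driven policy unit test is one simple way to meet that bar. The sketch below uses a toy role-based policy; `ROLE_ACTIONS`, `is_allowed`, `CASES`, and `run_policy_cases` are hypothetical and stand in for whatever policy language and test runner a team actually uses.

```python
from typing import Callable, Dict, List, Set, Tuple

# Toy policy under test: which actions each role may perform.
ROLE_ACTIONS: Dict[str, Set[str]] = {
    "viewer": {"read"},
    "editor": {"read", "write"},
    "admin": {"read", "write", "delete"},
}


def is_allowed(role: str, action: str) -> bool:
    """Deny by default: unknown roles and unlisted actions are rejected."""
    return action in ROLE_ACTIONS.get(role, set())


# (role, action, expected decision), including explicit deny cases.
CASES: List[Tuple[str, str, bool]] = [
    ("viewer", "read", True),
    ("viewer", "write", False),
    ("editor", "write", True),
    ("editor", "delete", False),
    ("admin", "delete", True),
    ("unknown", "read", False),
]


def run_policy_cases(policy: Callable[[str, str], bool],
                     cases: List[Tuple[str, str, bool]]) -> List[Tuple[str, str, bool]]:
    """Return the cases whose actual decision differs from the expectation."""
    return [(role, action, expected)
            for role, action, expected in cases
            if policy(role, action) is not expected]
```

In CI, a non-empty result from `run_policy_cases(is_allowed, CASES)` would fail the build. Including deny cases matters as much as allow cases, because an over-permissive change passes any test suite that only checks allows.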
What failure mode is most dangerous?
A silently stale cache that lets revoked principals keep acting is among the most dangerous, because nothing fails visibly and detection is delayed.
Can shadow mode replace enforcement?
Shadow mode is for safe rollout and detection but must be followed by enforcement when validated.
How do we measure enforcement coverage?
Enumerate access paths and instrument each enforcement point to track whether decisions are logged.
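A rough coverage calculation, assuming the access-path inventory and the set of paths that emitted decision logs are both available as sets of names (the function name `enforcement_coverage` and the path names in the usage note are hypothetical):

```python
from typing import List, Set, Tuple


def enforcement_coverage(access_paths: Set[str],
                         paths_with_decision_logs: Set[str]) -> Tuple[float, List[str]]:
    """Return (fraction of known access paths that log decisions, uncovered paths)."""
    if not access_paths:
        # An empty inventory means the enumeration step has not been done yet.
        raise ValueError("access path inventory is empty")
    covered = access_paths & paths_with_decision_logs
    missing = sorted(access_paths - paths_with_decision_logs)
    return len(covered) / len(access_paths), missing
```

With an inventory of `{"api", "mesh", "db", "batch"}` and decision logs seen from `{"api", "mesh", "db"}`, this returns `0.75` and `["batch"]`. The list of uncovered paths is usually more actionable than the ratio itself, since each entry is a potential bypass.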
Should PDP be centralized or distributed?
Hybrid: central PAP/PDP logic with distributed PDP instances or caches to balance consistency and latency.
What SLOs are typical for authorization latency?
Start with p95 < 50ms for APIs; adjust based on service requirements.
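To check such a target against recorded decision latencies, a nearest-rank percentile is often sufficient. This is a minimal sketch: `percentile` and `meets_latency_slo` are hypothetical helper names, and production systems would more likely read the percentile from their metrics backend.

```python
from typing import Sequence


def percentile(samples_ms: Sequence[float], pct: int) -> float:
    """Nearest-rank percentile for an integer pct in 1..100."""
    if not samples_ms:
        raise ValueError("no latency samples")
    if not 1 <= pct <= 100:
        raise ValueError("pct must be between 1 and 100")
    ordered = sorted(samples_ms)
    # Ceiling of pct * n / 100, computed with integers to avoid float rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[rank - 1]


def meets_latency_slo(samples_ms: Sequence[float],
                      threshold_ms: float = 50.0) -> bool:
    """True if the p95 decision latency is strictly below the threshold."""
    return percentile(samples_ms, 95) < threshold_ms
```

The strict comparison mirrors the "p95 < 50ms" wording; the threshold parameter is there so the same check can be reused once the target is adjusted per service.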
How do you handle third-party integrations?
Treat external callers as untrusted; enforce per-call authz and use short-lived scopes with revocation.
What policies should be in CI/CD?
Policy validation, syntax checks, unit tests, and integration tests in a staging environment.
How to reduce alert noise from policy rollouts?
Use suppression windows, group alerts by policy version, and adopt canary rollouts.
Who owns post-incident policy changes?
Policy authors with cross-functional review; tie to platform or security ownership.
Is complete mediation required for compliance?
Often required or strongly recommended for regulated systems; depends on the regulation and context.
Conclusion
Complete mediation is a foundational security and reliability practice for cloud-native systems. It requires disciplined policy management, instrumentation, fail-safe behavior, and continuous validation. Implemented correctly, it reduces incidents, limits blast radius, and supports compliance.
Next 7 days plan (practical):
- Day 1: Inventory all enforcement points and list sensitive resources.
- Day 2: Ensure structured audit logs are emitted from each enforcement point.
- Day 3: Define 3 SLIs (decision rate, deny rate, decision latency) and create dashboards.
- Day 4: Add policy checks to CI and run policy unit tests.
- Day 5: Implement short TTL cache strategy and revocation push in staging.
- Day 6: Run a shadow-mode rollout for a high-risk policy and analyze denies.
- Day 7: Run a mini game day: revoke a test principal and measure propagation.
Appendix — Complete Mediation Keyword Cluster (SEO)
- Primary keywords
- Complete mediation
- Authorization enforcement
- Per-request authorization
- Policy decision point
- Policy administration point
- Authorization SLO
- Authorization SLIs
- Revocation propagation
- Authorization audit logs
- Policy-as-code
- Secondary keywords
- Access control enforcement
- Token revocation
- Sidecar authorization
- Service mesh authorization
- API gateway authz
- ABAC for cloud
- RBAC and mediation
- Shadow mode rollout
- Fail-closed authorization
- Authz latency metrics
- Long-tail questions
- What does complete mediation mean in cloud-native systems
- How to implement complete mediation in Kubernetes
- How to measure authorization decision latency
- How to push policy revocations to caches
- Best practices for authorization SLIs and SLOs
- How to test policy-as-code in CI/CD
- When to use PDP vs local policy
- How to prevent stale token access after deprovisioning
- How to balance authz latency and throughput
- How to debug authorization denials in microservices
- Related terminology
- Policy cache invalidation
- Token exchange pattern
- Nonce and replay protection
- Row level security RLS
- Audit log completeness
- Authorization coverage
- Enforcement point telemetry
- PDP redundancy
- Shadow mode testing
- Revocation queue monitoring
- Authorization decision tracing
- Zero trust authorization
- Least privilege enforcement
- Fine-grained access control
- Authorization failover strategy
- Per-invocation authz
- CI/CD policy gates
- Authz decision sampling
- Authorization policy complexity
- Policy distribution lag
- Enforcement coverage metric
- Unauthorized access incident
- Data plane enforcement
- Service-to-service authz
- Authentication vs authorization
- Token refresh lifecycle
- Policy version drift
- Admission controller for sidecars
- Immutable audit storage
- Authorization deny spike
- Authorization error budget
- Tracing authz decision paths
- Observability for authz
- Authorization decision cache
- Policy test harness
- Revocation push notifications
- Authorization runbooks
- Authorization playbooks
- Authorization incident postmortem