Quick Definition
Pod Security Policy (PSP) is a Kubernetes admission control mechanism that enforces pod-level security constraints. Analogy: PSP is like airport security rules for containers. Formal: PSP defines allowed pod spec features and validates pods at admission time against policy objects.
What is PSP?
What it is / what it is NOT
- PSP is a Kubernetes admission control resource model used to restrict pod capabilities, e.g., privileged mode, hostPath, running as root.
- PSP is NOT a runtime enforcement engine for already-running containers; it prevents creation rather than introspecting existing pods.
- PSP is NOT a replacement for broader cluster security like network policies, workload identity, or image scanning.
Key properties and constraints
- Admission-time enforcement: evaluates pod requests before creation.
- Policy granularity: operates on pod spec fields and security context attributes.
- RBAC binding: a policy takes effect only when the requesting user or the pod's service account is granted the `use` verb on it through a Role or ClusterRole binding.
- Deprecated upstream: the built-in PodSecurityPolicy API was deprecated in Kubernetes v1.21 and removed in v1.25; clusters on newer versions use PodSecurity admission or third-party controllers.
- Compatibility constraints: behavior varies by Kubernetes version and vendor managed control planes.
Where it fits in modern cloud/SRE workflows
- Preventive security gate in CI/CD pipeline and admission control.
- Complement to runtime detection, image scanning, and network controls.
- Integrated into shift-left security: admission control policies are tested in pre-prod to avoid CI failures.
- Used by platform teams to enforce organizational minimal privileges.
A text-only “diagram description” readers can visualize
- Developer -> CI builds image -> Developer submits deployment -> API server admission chain: first webhook checks -> PSP evaluates pod spec -> If allowed, write to etcd -> Scheduler places pod -> Kubelet runs pod -> Observability and runtime security tools monitor.
PSP in one sentence
PSP is an admission-time policy model for validating Kubernetes pod specs to enforce security constraints before pods are created.
PSP vs related terms
| ID | Term | How it differs from PSP | Common confusion |
|---|---|---|---|
| T1 | PodSecurity admission | Built-in successor that enforces the Pod Security Standards per namespace | Often assumed identical to PSP |
| T2 | Gatekeeper | Policy engine using OPA not PSP | People think Gatekeeper modifies PSP |
| T3 | PodSecurityPolicy API | The deprecated PSP API object | Confused with current admission models |
| T4 | NetworkPolicy | Controls networking not pod security | Some expect it blocks privileged containers |
| T5 | Runtime security | Detects behavior post-start | Assumed to prevent pod creation like PSP |
| T6 | Image scanning | Examines images not pod specs | Expected to block hostPath like PSP |
| T7 | RBAC | Authz for subjects not pod constraints | Mistaken for policy application method |
| T8 | Admission webhook | Mechanism not policy model | Believed to be a PSP replacement |
Why does PSP matter?
Business impact (revenue, trust, risk)
- Prevents privilege escalation and data exfiltration risks that can lead to breaches and regulatory fines.
- Reduces blast radius from attacks, protecting customer trust and uptime.
- Enables consistent enforcement across teams, lowering compliance audit costs.
Engineering impact (incident reduction, velocity)
- Reduces production incidents due to insecure pod configurations.
- Improves developer velocity by preventing security rework earlier in the lifecycle.
- Lowers on-call load by removing a class of configuration-induced failures.
SRE framing (SLIs/SLOs/error budgets/toil/on-call)
- SLIs: percent of pods compliant with baseline security policy; time-to-detect policy violations in CI.
- SLOs: maintain compliance SLO versus audit requirements, e.g., 99.9% of production pods compliant.
- Error budget: violations consume policy compliance budget; repeated violations trigger remediation.
- Toil: manual review of pod specs becomes toil; automation via admission reduces it.
- On-call: alerts for policy admission failures should be routed to platform or CI owners, not app on-call.
Realistic “what breaks in production” examples
- A deployment uses hostPath to mount host directories leading to data corruption across nodes.
- Containers run as root and write to node filesystems, enabling escape vectors.
- Privileged containers, or containers granted CAP_SYS_ADMIN, break security assumptions in multi-tenant clusters.
- Use of hostNetwork unexpectedly exposes sensitive service endpoints to external traffic.
- A missing or misconfigured seccomp profile causes noisy kernel logs and performance degradation.
Where is PSP used?
| ID | Layer/Area | How PSP appears | Typical telemetry | Common tools |
|---|---|---|---|---|
| L1 | Edge / Ingress | Prevents hostNetwork hostPort usage | Admission deny logs | Admission webhooks |
| L2 | Node / Kubelet | Disallow privileged pods | Kube-apiserver audit | kube-apiserver audit |
| L3 | Service / App | Block hostPath and running as root | Pod creation failures | PodSecurity admission |
| L4 | Data / Storage | Restrict volume types | PVC bind failures | StorageClass policies |
| L5 | Kubernetes control plane | Enforce RBAC-bound policies | Authz audit events | OPA Gatekeeper |
| L6 | Serverless / FaaS | Limit container capabilities | Platform invocation errors | Platform admission hooks |
| L7 | CI/CD pipeline | Pre-commit or admission testing | CI job pass/fail rates | Policy-as-code in CI |
| L8 | Observability / Security | Feed to SIEM for compliance | Alert counts and dashboards | Falco, Kyverno |
When should you use PSP?
When it’s necessary
- Multi-tenant clusters where isolation is required.
- Regulated environments with compliance requirements.
- Platform teams enforcing minimal privileges across teams.
When it’s optional
- Single-team clusters with trusted developers and tight review processes.
- Short-lived experimental clusters that are isolated and ephemeral.
When NOT to use / overuse it
- Avoid overly strict global policies that block legitimate Dev workflows.
- Don’t use PSP as the only security control; combine with runtime and network controls.
- Avoid per-pod micromanagement that creates constant friction for developers.
Decision checklist
- If multi-tenant AND compliance required -> enforce baseline policies at admission.
- If single-team AND rapid experimentation -> start with advisory policies in CI.
- If many legacy workloads break on first rollout -> use graduated enforcement (audit -> enforce).
Maturity ladder: Beginner -> Intermediate -> Advanced
- Beginner: Add an admission gate that denies privileged and hostPath.
- Intermediate: Implement policy-as-code in CI and enforce minimal runAsUser and seccomp.
- Advanced: Combine PodSecurity, OPA/Gatekeeper, runtime enforcement, and automated remediations.
How does PSP work?
Step-by-step
- Policy authoring: Define constraints (e.g., allowPrivilegeEscalation: false).
- Policy binding: Bind policy to service accounts, groups or namespaces via RBAC.
- Admission-time evaluation: API server or admission controller evaluates pod spec against policies.
- Decision: Admit, deny, or mutate (depending on controller capability).
- Audit and reporting: Log admission decisions to kube-apiserver audit and SIEM.
- Remediation: CI tests or automated tools fix violations or notify owners.
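The evaluation and decision steps above can be sketched in a few lines of Python. This is a minimal illustration, not the PodSecurityPolicy schema or any controller's implementation: the `POLICY` dictionary, its keys, and the `audit`/`enforce` mode switch are assumptions made for the example.

```python
from typing import Any

# Hypothetical policy shape, invented for this sketch.
POLICY = {
    "allow_privileged": False,
    "allow_host_network": False,
    "allowed_volume_types": {"configMap", "secret", "emptyDir", "persistentVolumeClaim"},
    "require_run_as_non_root": True,
}


def find_violations(pod: dict[str, Any], policy: dict[str, Any]) -> list[str]:
    """Compare a pod manifest (as a dict) against the policy and list violations."""
    spec = pod.get("spec", {})
    violations = []
    if spec.get("hostNetwork") and not policy["allow_host_network"]:
        violations.append("hostNetwork is not allowed")
    for volume in spec.get("volumes", []):
        # A volume entry has a "name" plus one key naming its type.
        for vtype in volume.keys() - {"name"}:
            if vtype not in policy["allowed_volume_types"]:
                violations.append(f"volume type {vtype} is not allowed")
    pod_sc = spec.get("securityContext", {})
    for c in spec.get("containers", []) + spec.get("initContainers", []):
        sc = c.get("securityContext", {})
        if sc.get("privileged") and not policy["allow_privileged"]:
            violations.append(f"container {c['name']} is privileged")
        # Container-level setting wins; otherwise fall back to the pod level.
        non_root = sc.get("runAsNonRoot", pod_sc.get("runAsNonRoot", False))
        if policy["require_run_as_non_root"] and not non_root:
            violations.append(f"container {c['name']} must set runAsNonRoot")
    return violations


def admit(pod: dict[str, Any], policy: dict[str, Any], mode: str = "enforce"):
    """Return (allowed, violations); audit mode records violations but admits."""
    violations = find_violations(pod, policy)
    allowed = not violations or mode == "audit"
    return allowed, violations
```

Running the same function in CI and at admission keeps the two gates consistent, and the `audit` mode gives a way to collect violations before anything is blocked.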
Components and workflow
- Policy storage: Policy objects stored in etcd or external Git (policy-as-code).
- Admission chain: kube-apiserver calls controllers/webhooks in order.
- Matchers: Rules match namespaces, service accounts, labels.
- Action: deny, audit, or mutate pod specs.
- Observability: Audit logs, metrics, and dashboards feed SRE workflows.
Data flow and lifecycle
- Developer pushes manifest -> CI runs policy checks -> Developer deploys -> API server admission checks -> Pod admitted/denied -> Runtime monitoring observes behavior.
Edge cases and failure modes
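- Admission webhook outage can block all pod creations if the webhook's failurePolicy is Fail (fail-closed) and no timeout or namespace exclusion is configured.
- Version skew: the PodSecurityPolicy API is no longer served from Kubernetes v1.25 onward, so existing PSP objects are not enforced after an upgrade.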
- RBAC misconfiguration leads to over- or under-enforcement.
- Exceptions: Some system pods require elevated privileges; misclassifying them breaks control plane.
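The webhook outage case is easier to reason about with a small simulation. The sketch below models how a failure policy of `Fail` or `Ignore` (the two values Kubernetes accepts for a webhook's `failurePolicy`) changes the outcome when the webhook call errors. The `call_webhook` helper and the `review` callable are hypothetical stand-ins for the API server's behavior, not real client code.

```python
from typing import Callable


def call_webhook(review: Callable[[dict], bool], pod: dict,
                 failure_policy: str = "Fail") -> tuple[bool, str]:
    """Return (allowed, reason), applying the failure policy if the webhook errors."""
    try:
        return review(pod), "webhook decision"
    except Exception as exc:  # stands in for timeouts and connection errors
        if failure_policy == "Ignore":
            # Fail-open: the pod is admitted without being checked.
            return True, f"fail-open: {exc}"
        # Fail-closed: the request is rejected until the webhook recovers.
        return False, f"fail-closed: {exc}"


def broken_webhook(pod: dict) -> bool:
    """Simulates an unavailable webhook."""
    raise TimeoutError("webhook timed out")


print(call_webhook(broken_webhook, {}, "Fail"))
print(call_webhook(broken_webhook, {}, "Ignore"))
```

Fail-closed protects the policy at the cost of availability; fail-open does the opposite, which is why it is usually paired with audit logging so that unchecked admissions can be reviewed afterwards.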
Typical architecture patterns for PSP
- Baseline enforcement pattern
  - Use for: quick minimal security across all namespaces.
  - Implementation: deny privileged, enforce non-root.
- Namespace-tiered pattern
  - Use for: multi-tenant clusters with dev/prod tiers.
  - Implementation: different policies per namespace tier.
- GitOps policy-as-code pattern
  - Use for: teams using GitOps and automated reviews.
  - Implementation: policies stored in Git, validated by CI, applied via controllers.
- Advisory-to-enforce pattern
  - Use for: migrations from permissive to strict enforcement.
  - Implementation: audit first, then enforce after remediation windows.
- Mutating + validating pattern
  - Use for: automatic hardening (e.g., adding seccomp profiles).
  - Implementation: mutating webhook injects defaults, validating webhook enforces.
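As a rough illustration of the mutating + validating pattern, the sketch below injects a default seccomp profile and then validates the result. The function names are hypothetical, and the choice to accept only `RuntimeDefault` and `Localhost` is an assumption for the example; the point is the ordering, with mutation running before validation.

```python
import copy


def mutate(pod: dict) -> dict:
    """Return a copy of the pod with a default seccomp profile if none is set."""
    patched = copy.deepcopy(pod)
    sc = patched.setdefault("spec", {}).setdefault("securityContext", {})
    # setdefault leaves an explicitly chosen profile untouched.
    sc.setdefault("seccompProfile", {"type": "RuntimeDefault"})
    return patched


def validate(pod: dict) -> list[str]:
    """Reject pods whose pod-level seccomp profile is missing or Unconfined."""
    profile = pod.get("spec", {}).get("securityContext", {}).get("seccompProfile", {})
    if profile.get("type") not in {"RuntimeDefault", "Localhost"}:
        return ["seccompProfile must be RuntimeDefault or Localhost"]
    return []


def mutate_then_validate(pod: dict):
    """Apply defaults first, then enforce; returns (allowed, patched_pod, errors)."""
    patched = mutate(pod)
    errors = validate(patched)
    return (not errors), patched, errors
```

A pod that omits the profile is hardened and admitted, while a pod that explicitly asks for `Unconfined` is still rejected because the mutation step does not override explicit settings.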
Failure modes & mitigation
| ID | Failure mode | Symptom | Likely cause | Mitigation | Observability signal |
|---|---|---|---|---|---|
| F1 | Webhook outage | Pod creation blocked | Webhook down with a fail-closed failure policy | Set a short timeout; fail open with audit logging where acceptable | Increased admission errors |
| F2 | Overly strict policy | Many deployment failures | Broad deny rules | Audit mode then incrementally enforce | Spike in deny audit logs |
| F3 | RBAC misbind | Policy not applied | Incorrect role binding | Correct bindings and test in staging | Discrepancy in expected vs actual denies |
| F4 | Version incompatibility | PSP ignored or errors | Kubernetes API removal | Migrate to PodSecurity or OPA | API errors in controller logs |
| F5 | Privileged system pods blocked | Control plane degraded | Policy applied to system ns | Exclude system namespaces | Control plane pod restarts |
| F6 | Silent drift | Policies diverge from Git | Manual edits in-cluster | Enforce GitOps reconciliation | Config drift alerts |
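
Silent drift (F6) can be detected by comparing a digest of each policy as stored in Git with the copy running in the cluster. The sketch below assumes both sides have already been loaded into dictionaries keyed by policy name; in practice, server-populated fields such as `resourceVersion`, `managedFields`, and `status` would need to be stripped from the cluster copy first. The function names are illustrative.

```python
import hashlib
import json


def digest(policy: dict) -> str:
    """Stable SHA-256 digest of a policy, independent of key order."""
    canonical = json.dumps(policy, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode()).hexdigest()


def find_drift(git_policies: dict, cluster_policies: dict) -> dict:
    """Map each drifted policy name to the kind of drift detected."""
    drift = {}
    for name in sorted(git_policies.keys() | cluster_policies.keys()):
        if name not in cluster_policies:
            drift[name] = "missing in cluster"
        elif name not in git_policies:
            drift[name] = "not in Git"
        elif digest(git_policies[name]) != digest(cluster_policies[name]):
            drift[name] = "modified"
    return drift
```

The size of the returned dictionary is one possible source for a drift-frequency metric.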
Key Concepts, Keywords & Terminology for PSP
Each entry: Term — definition — why it matters — common pitfall
- PodSecurityPolicy — Deprecated Kubernetes API for pod admission controls — Central historic model — Pitfall: removed in newer K8s.
- PodSecurity admission — Built-in admission controller that replaces PSP and enforces the Pod Security Standards — Important as current recommended model — Pitfall: behavior differs from PSP.
- Admission controller — Component that intercepts API requests — Core enforcement point — Pitfall: misconfigured webhook can block cluster.
- Admission webhook — External service called during admission — Enables custom policies — Pitfall: availability impacts pod creation.
- OPA Gatekeeper — Policy engine using Open Policy Agent — Flexible policy-as-code — Pitfall: complexity and performance considerations.
- Kyverno — Kubernetes native policy engine — Simpler policy syntax for K8s — Pitfall: version compatibility.
- RBAC — Role-based access control for subjects — Defines who can create pods — Pitfall: over-permissive roles.
- Namespace — K8s logical partition — Allows per-namespace policies — Pitfall: forgetting system namespaces.
- ServiceAccount — Identity for workloads — Bind policies to SA for least privilege — Pitfall: default SA surprises.
- seccomp — Kernel syscall filtering for containers — Reduces attack surface — Pitfall: missing profile causes permissive syscalls.
- runAsUser — Security context setting for the UID the container process runs as — Lets workloads avoid running as root — Pitfall: legacy images require root.
- runAsNonRoot — Enforce non-root container processes — Simple safety check — Pitfall: false positives in init containers.
- allowPrivilegeEscalation — Controls whether a process can gain more privileges than its parent — Blocks escalation through setuid binaries — Pitfall: needed for some debuggers.
- hostPath — Mount host filesystem into pod — Dangerous for isolation — Pitfall: used for convenience in prod.
- hostNetwork — Shares node network namespace — Exposes node ports — Pitfall: unexpected external exposure.
- hostPID — Shares node process namespace — Security risk for node introspection — Pitfall: needed by some debugging tools.
- capabilities — Linux capabilities granting fine-grained privileges — Controls powerful ops like NET_ADMIN — Pitfall: granting CAP_SYS_ADMIN is near-root.
- privileged container — Full host access like root — Highest risk — Pitfall: used for convenience in init workloads.
- SELinux — Mandatory access control for processes — Adds defense layer — Pitfall: complex labels and policy tuning.
- AppArmor — Kernel security module for confinement — Reduces program actions — Pitfall: profile maintenance overhead.
- Mutating webhook — Alters requests, e.g., inject seccomp — Used for auto-hardening — Pitfall: unexpected changes to manifests.
- Validating webhook — Accept/deny admission requests — Enforces policies — Pitfall: blocks without clear remediation.
- GitOps — Policy-as-code workflows stored in Git — Enables reproducibility — Pitfall: delayed reconciliation can cause drift.
- Policy-as-code — Express policies in versioned code — Improves reviewability — Pitfall: overcomplex rules.
- Audit logs — Records of admission decisions — Required for compliance — Pitfall: noisy logs if policy too verbose.
- SIEM — Security information and event management — Centralizes alerts — Pitfall: low signal-to-noise if unfiltered.
- Least privilege — Principle to minimize permissions — Core security idea — Pitfall: too strict may break apps.
- Mutate-and-validate pattern — Inject defaults then enforce — Reduces friction — Pitfall: order of webhooks matters.
- Admission latency — Time added by webhooks — Affects deployment speed — Pitfall: slow webhooks slow CI.
- Fail-open vs fail-closed — Webhook failure behavior — Decides blocking behavior — Pitfall: fail-open may permit bad pods.
- PodSecurity standard levels — e.g., privileged, baseline, restricted — Defines graded constraints — Pitfall: mislabeling namespaces.
- Scanning vs enforcement — Image scanning looks at images, PSP checks pod specs — Complementary controls — Pitfall: relying on one alone.
- Runtime security (Falco) — Detects behavioral anomalies — Covers runtime gaps — Pitfall: alerts without context.
- Immutable infrastructure — Avoid manual in-cluster edits — Promotes reproducibility — Pitfall: manual fixes create drift.
- Canary policies — Gradual enforcement approach — Useful for migration — Pitfall: partial enforcement complexity.
- Policy templates — Reusable rule patterns — Aid consistency — Pitfall: hidden complexity in templates.
- Compliance baseline — Organization policy requirements — Guides PSP design — Pitfall: baselines too generic.
- Policy reconciliation — Ensure desired state applied — Keeps clusters consistent — Pitfall: reconciliation lag.
- Cluster-wide vs namespace policies — Different scope impacts — Pitfall: cluster policies can break system components.
- Emergency allowlist — Temporary exemptions for critical fixes — Operational necessity — Pitfall: abused and left in place.
- Capability bounding — Limit set of Linux capabilities — Prevent escalation — Pitfall: misidentifying required caps.
- Pod security context — Aggregated security settings per pod — Central to PSP checks — Pitfall: omissions cause denials.
How to Measure PSP (Metrics, SLIs, SLOs)
| ID | Metric/SLI | What it tells you | How to measure | Starting target | Gotchas |
|---|---|---|---|---|---|
| M1 | Pod compliance rate | Percent pods meeting policy | Count compliant pods / total pods | 99% for prod | Some system pods excluded |
| M2 | Admission deny rate | Fraction of admissions denied | Deny events / total admissions | <1% after rollout | Deny spikes indicate dev friction |
| M3 | Time to remediate violation | Time from deny to fix | Time tracked in ticketing | <48 hours for prod | Long lead due to cross-team handoffs |
| M4 | Audit denial alerts | Number of denied events alerting ops | Count denies from audit logs | Configurable threshold | High noise if policy verbose |
| M5 | Policy drift frequency | Number of in-cluster edits not in Git | Drift events per week | 0 for GitOps | Requires detection tooling |
| M6 | Admission latency | Extra ms added by policy checks | Median webhook latency | <200ms | Long latencies slow CI |
| M7 | Unauthorized privilege escalations | Runtime detections post-admit | Runtime alerts correlated to pod | 0 for prod | Runtime tools needed |
| M8 | Exceptions count | Number of emergency allowlist uses | Count per time window | Low and audited | Abuse of allowlist possible |
| M9 | CI policy failure rate | CI jobs failing policy checks | Failures / CI policy jobs | <2% post stabilization | Early migration may spike |
| M10 | Coverage of namespaces | Percent namespaces covered by PSP | CoveredNamespaces / totalNamespaces | 100% for regulated clusters | System namespaces may be exempt |
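
The ratios behind M1 and M2 are simple enough to compute directly. The sketch below uses the starting targets from the table (99% compliance, deny rate under 1%) as defaults. The function names and the handling of empty inputs are assumptions for the example.

```python
def pod_compliance_rate(compliant_pods: int, total_pods: int) -> float:
    """M1: compliant pods / total pods (an empty cluster counts as compliant)."""
    return 1.0 if total_pods == 0 else compliant_pods / total_pods


def admission_deny_rate(denied: int, total_admissions: int) -> float:
    """M2: deny events / total admissions."""
    return 0.0 if total_admissions == 0 else denied / total_admissions


def check_targets(compliance: float, deny_rate: float,
                  compliance_target: float = 0.99,
                  deny_target: float = 0.01) -> dict[str, bool]:
    """Report whether each metric meets its starting target."""
    return {
        "M1_pod_compliance": compliance >= compliance_target,
        "M2_admission_deny": deny_rate < deny_target,
    }


compliance = pod_compliance_rate(995, 1000)
deny_rate = admission_deny_rate(12, 1000)
print(check_targets(compliance, deny_rate))
```

With 995 of 1000 pods compliant and 12 of 1000 admissions denied, the compliance target is met and the deny-rate target is not, which matches the table's note that deny spikes deserve a look at developer friction.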
Best tools to measure PSP
Tool — Prometheus + kube-state-metrics
- What it measures for PSP: Pod counts, admission events, webhook metrics.
- Best-fit environment: Kubernetes clusters with metrics stack.
- Setup outline:
- Deploy kube-state-metrics.
- Instrument admission controllers to expose metrics.
- Create Prometheus rules to compute compliance rates.
- Configure Alertmanager with alarms for deny spikes.
- Strengths:
- Flexible queries and alerting.
- Widely used in cloud-native stacks.
- Limitations:
- Requires metric exposition from webhooks.
- Not a SIEM replacement.
Tool — Fluentd / Fluent Bit + ELK
- What it measures for PSP: Collects audit logs and denial events.
- Best-fit environment: Clusters with centralized logging.
- Setup outline:
- Enable kube-apiserver audit logs.
- Forward logs to Elasticsearch.
- Create dashboards for deny events.
- Strengths:
- Rich search across logs.
- Good for compliance audits.
- Limitations:
- Storage costs for large logs.
- Requires field normalization.
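Before building dashboards, it can help to check what the audit stream contains. The sketch below counts rejected pod-create requests per namespace from JSON-lines audit events, using the `verb`, `stage`, `objectRef`, and `responseStatus` fields of the Kubernetes audit event format. Treating HTTP 403 as a policy deny is an assumption: RBAC authorization failures also return 403, so a real pipeline would likely inspect the response message as well.

```python
import json
from collections import Counter


def count_pod_denies(audit_lines) -> Counter:
    """Count pod-create requests rejected with 403, keyed by namespace."""
    denies = Counter()
    for line in audit_lines:
        try:
            event = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip malformed lines rather than abort the scan
        ref = event.get("objectRef") or {}
        status = event.get("responseStatus") or {}
        if (event.get("stage") == "ResponseComplete"
                and event.get("verb") == "create"
                and ref.get("resource") == "pods"
                and not ref.get("subresource")  # ignore pods/exec, pods/eviction, ...
                and status.get("code") == 403):
            denies[ref.get("namespace", "")] += 1
    return denies
```

The resulting counter maps directly onto a "denied admissions by namespace" panel.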
Tool — OPA Gatekeeper
- What it measures for PSP: Policy violations and audit reports.
- Best-fit environment: Policy-as-code users.
- Setup outline:
- Install Gatekeeper & ConstraintTemplates.
- Create Constraints for desired rules.
- Use audit mode and capture reports.
- Strengths:
- Expressive Rego policies.
- Audit capabilities.
- Limitations:
- Rego learning curve.
- Performance tuning may be needed.
Tool — Kyverno
- What it measures for PSP: Validation, mutation, and policy audit events.
- Best-fit environment: Kubernetes-native policy needs.
- Setup outline:
- Install Kyverno.
- Define policies in YAML.
- Use mutate to inject defaults and validate to enforce.
- Strengths:
- K8s-like policy syntax.
- Easier onboarding.
- Limitations:
- May lack some advanced Rego features.
Tool — Falco
- What it measures for PSP: Runtime violations that indicate admission gaps.
- Best-fit environment: Runtime security observability.
- Setup outline:
- Deploy Falco as DaemonSet.
- Configure rules for privilege escalation patterns.
- Forward alerts to SIEM/Alertmanager.
- Strengths:
- Detects behavioral anomalies.
- Complements admission controls.
- Limitations:
- False positives if rules not tuned.
Recommended dashboards & alerts for PSP
Executive dashboard
- Panels:
- Pod compliance rate (trend).
- Number of denied admissions by namespace.
- Time-to-remediate median.
- Policy drift count.
- Why:
- Shows compliance health to leadership.
On-call dashboard
- Panels:
- Recent admission denies with the deny message and offending field.
- Admission webhook latency and error rate.
- Namespaces with repeated denies.
- Active exceptions/allowlist entries.
- Why:
- Rapid triage during incidents and deployment failures.
Debug dashboard
- Panels:
- Raw audit log stream filtered for policy events.
- Per-webhook latency and error logs.
- Pod spec differences between requested and mutated.
- Timeline of CI fail rate for policy checks.
- Why:
- Deep troubleshooting for policy failures.
Alerting guidance
- What should page vs ticket:
- Page: Admission webhook down or high error rate impacting pod creation.
- Ticket: Individual deployment denies for developers.
- Burn-rate guidance:
- If deny rate consumes more than 25% of weekly change-related tolerance, trigger review.
- Noise reduction tactics:
- Deduplicate identical denies.
- Group alerts by namespace or service account.
- Suppress during maintenance windows and known rollouts.
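The noise reduction tactics above can be sketched as a small grouping step that runs before alerts are emitted. The deny record fields (`namespace`, `service_account`, `reason`) and the `suppressed_namespaces` parameter are hypothetical names chosen for this example.

```python
from collections import defaultdict


def group_denies(denies, suppressed_namespaces=frozenset()):
    """Collapse identical denies into counts, grouped by namespace.

    Returns {namespace: {(service_account, reason): count}}.
    """
    grouped = defaultdict(dict)
    for deny in denies:
        ns = deny["namespace"]
        if ns in suppressed_namespaces:
            continue  # e.g. namespaces in a maintenance window or known rollout
        key = (deny["service_account"], deny["reason"])
        grouped[ns][key] = grouped[ns].get(key, 0) + 1
    return dict(grouped)


events = [
    {"namespace": "shop", "service_account": "deployer", "reason": "privileged"},
    {"namespace": "shop", "service_account": "deployer", "reason": "privileged"},
    {"namespace": "shop", "service_account": "deployer", "reason": "hostPath"},
    {"namespace": "batch", "service_account": "runner", "reason": "hostNetwork"},
]
print(group_denies(events, suppressed_namespaces={"batch"}))
```

One alert per namespace with a count per distinct deny is usually easier to act on than one alert per rejected pod.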
Implementation Guide (Step-by-step)
1) Prerequisites
- Cluster admin privileges or platform team involvement.
- CI and GitOps pipelines in place.
- Observability stack capturing audit logs and metrics.
2) Instrumentation plan
- Add metrics to admission controllers.
- Ensure audit logging is enabled on kube-apiserver.
- Plan policies in Git with review workflows.
3) Data collection
- Forward audit logs to central logging.
- Export metrics to Prometheus.
- Store policy state in Git.
4) SLO design
- Define SLOs for compliance rate and time-to-remediate.
- Map SLOs to services and namespaces.
5) Dashboards
- Build executive, on-call, and debug dashboards.
- Provide drill-down links from exec panels to debug panels.
6) Alerts & routing
- Create alerts for webhook health and deny spikes.
- Route to platform on-call for blocking issues.
- Route policy violations to application owners via ticketing.
7) Runbooks & automation
- Create runbooks for webhook outages, policy denial investigations, and emergency allowlist processes.
- Automate remediation for common fixes (e.g., add runAsUser where safe).
8) Validation (load/chaos/game days)
- Run load tests for admission latency impacts.
- Conduct chaos tests that simulate webhook failure.
- Run game days to validate incident response.
9) Continuous improvement
- Weekly policy reviews.
- Quarterly audits and SLO reviews.
- Postmortem-driven policy adjustments.
Pre-production checklist
- Policies authored and stored in Git.
- CI policy tests added to pipeline.
- Staging cluster mirrors production policy enforcement.
- Observability capturing admission and audit logs.
- Runbooks for expected failures.
Production readiness checklist
- Canary rollout with audit-mode first.
- Metrics and alerts configured and tested.
- Emergency allowlist process documented and limited.
- Training for app teams on common fixes.
Incident checklist specific to PSP
- Confirm scope: which namespaces and service accounts affected.
- Check webhook health and API server logs.
- Determine if deny is expected or due to policy drift.
- If webhook down, assess fail-open configuration and restore service.
- Apply temporary allowlist if safe and document.
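A temporary allowlist is safer when every entry carries an expiry. The sketch below shows one way to model that; the `Exemption` record and its fields are assumptions for illustration, not a feature of any specific policy engine.

```python
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class Exemption:
    namespace: str
    service_account: str
    reason: str          # incident or ticket reference, kept for audit
    expires_at: datetime


def is_exempt(exemptions, namespace, service_account, now=None) -> bool:
    """True if an unexpired exemption covers this namespace and service account."""
    now = now or datetime.now(timezone.utc)
    return any(
        e.namespace == namespace
        and e.service_account == service_account
        and now < e.expires_at
        for e in exemptions
    )


def expired(exemptions, now=None):
    """Entries past their expiry, to be removed and reviewed in the postmortem."""
    now = now or datetime.now(timezone.utc)
    return [e for e in exemptions if e.expires_at <= now]
```

Listing expired entries on a schedule gives the periodic review a concrete input and keeps exemptions from being left in place indefinitely.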
Use Cases of PSP
1) Multi-tenant SaaS cluster
- Context: Multiple customers share a cluster.
- Problem: Isolation breaches risk data leaks.
- Why PSP helps: Enforce least privilege for tenants.
- What to measure: Pod compliance rate, unauthorized privileges.
- Typical tools: PodSecurity, Gatekeeper, Prometheus.
2) Regulated environment (PCI/ISO)
- Context: Compliance auditing required.
- Problem: Inconsistent security posture across teams.
- Why PSP helps: Standardize enforcement and produce audit logs.
- What to measure: Policy drift, compliance SLOs.
- Typical tools: PodSecurity, audit logging, SIEM.
3) Platform-as-a-Service team
- Context: Platform team provides managed namespaces.
- Problem: Developers bypassing guidelines.
- Why PSP helps: Prevent risky pods before they run.
- What to measure: CI policy failure rates.
- Typical tools: Kyverno, GitOps.
4) CI/CD hardening
- Context: Deployments automated via pipelines.
- Problem: Broken deployments due to runtime privilege assumptions.
- Why PSP helps: Fail early in CI to avoid prod incidents.
- What to measure: CI policy failures, remediation time.
- Typical tools: Policy-as-code, CI plugins.
5) Securing edge workloads
- Context: Edge nodes run untrusted workloads.
- Problem: An attack on one edge node can affect the fleet.
- Why PSP helps: Block hostNetwork and hostPath on edge pods.
- What to measure: HostPath denies, hostNetwork usage.
- Typical tools: PodSecurity admission, Falco.
6) Legacy migration
- Context: Moving older workloads to K8s.
- Problem: Many containers require root.
- Why PSP helps: Gradual enforcement to modernize apps.
- What to measure: Number of exemptions and trend.
- Typical tools: Audit-mode policies, canary enforcement.
7) Serverless platform constraints
- Context: Managed FaaS running on K8s.
- Problem: Function runtimes gaining unintended capabilities.
- Why PSP helps: Enforce minimal syscall surfaces.
- What to measure: Runtime detections and denials.
- Typical tools: Kyverno, seccomp profiles.
8) Incident containment automation
- Context: Post-breach containment required.
- Problem: Need to quickly limit new risky pods.
- Why PSP helps: Quickly apply stricter policies cluster-wide.
- What to measure: Time to apply emergency policy, deny rate.
- Typical tools: GitOps for fast policy deployment.
9) Cost control (indirect)
- Context: Privileged pods accessing node-level resources.
- Problem: Unintended resource reservations and scheduling inefficiencies.
- Why PSP helps: Prevent host-level resource claims (for example hostPort or hostPath) that remove capacity.
- What to measure: Host-bound deployments and node utilization.
- Typical tools: Admission policies and scheduler metrics.
10) Platform onboarding
- Context: New team joining shared cluster.
- Problem: Lack of standardized practices increases risk.
- Why PSP helps: Provide baseline constraints and onboarding templates.
- What to measure: First-week compliance rate and ROX.
- Typical tools: Templates in Git, CI tests.
Scenario Examples (Realistic, End-to-End)
Scenario #1 — Kubernetes: Multi-tenant baseline enforcement
Context: A SaaS provider runs multiple customer namespaces in one cluster.
Goal: Enforce baseline security without breaking existing workloads.
Why PSP matters here: Prevents privilege escalation and protects shared nodes.
Architecture / workflow: GitOps for policy definitions, Kyverno for mutation/validation, Prometheus for metrics, Fluentd for audits.
Step-by-step implementation:
- Inventory current pod specs in prod.
- Create baseline policies denying privileged, hostPath, hostNetwork.
- Deploy policies in audit mode for 2 weeks.
- Fix violations and provide developer guidance.
- Switch to enforce mode for non-system namespaces.
What to measure: Pod compliance rate, deny events per namespace, time-to-remediate.
Tools to use and why: Kyverno for easy K8s-style policies; Prometheus for metrics; Fluentd for audit logs.
Common pitfalls: Failing to exempt kube-system, which causes control plane failures.
Validation: Run canary deployments with known-good manifests.
Outcome: Baseline enforced with minimal disruptions and measurable compliance.
Scenario #2 — Serverless / Managed-PaaS: Function sandboxing
Context: A company operates an internal FaaS platform on K8s.
Goal: Ensure functions cannot use host resources or escalate privileges.
Why PSP matters here: Functions are highly dynamic and riskier if permissive.
Architecture / workflow: Platform admission webhooks that mutate function pods to include seccomp and drop capabilities; Gatekeeper validates.
Step-by-step implementation:
- Define seccomp and capability baselines.
- Mutating webhook injects defaults into function pods.
- Gatekeeper validates no hostPath or privileged flags.
- CI validates functions against policies before deployment.
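The injection step can be sketched as a function that turns an incoming `AdmissionReview` request into a response carrying a base64-encoded JSONPatch, which is how mutating webhooks return changes to the API server. The policy choice here (add `capabilities.drop: ["ALL"]` to containers that set no capabilities) and the function names are assumptions for the example; serving this over HTTPS is left out.

```python
import base64
import json


def build_patch(pod: dict) -> list[dict]:
    """JSONPatch operations that drop all capabilities where none are declared."""
    ops = []
    for i, container in enumerate(pod["spec"].get("containers", [])):
        base = f"/spec/containers/{i}/securityContext"
        sc = container.get("securityContext")
        if sc is None:
            # The parent object must exist before a child can be added.
            ops.append({"op": "add", "path": base,
                        "value": {"capabilities": {"drop": ["ALL"]}}})
        elif "capabilities" not in sc:
            ops.append({"op": "add", "path": f"{base}/capabilities",
                        "value": {"drop": ["ALL"]}})
    return ops


def admission_response(review: dict) -> dict:
    """Build the AdmissionReview response for a mutating webhook."""
    request = review["request"]
    ops = build_patch(request["object"])
    response = {"uid": request["uid"], "allowed": True}
    if ops:
        response["patchType"] = "JSONPatch"
        response["patch"] = base64.b64encode(json.dumps(ops).encode()).decode()
    return {"apiVersion": "admission.k8s.io/v1", "kind": "AdmissionReview",
            "response": response}
```

Containers that already declare capabilities are left alone, so the validating step still has to reject any that add capabilities outside the baseline.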
What to measure: Function pod compliance and runtime detections.
Tools to use and why: Mutating webhook for injection, Gatekeeper for validation, Falco for runtime.
Common pitfalls: Breakage of native libs requiring specific capabilities.
Validation: Canary with synthetic functions and runtime checks.
Outcome: Functions run in tighter sandboxes with reduced blast radius.
Scenario #3 — Incident response / Postmortem: Privilege exploit mitigation
Context: A container escape vulnerability exploited by an attacker to read node files.
Goal: Contain ongoing exploitation and prevent new risky pods.
Why PSP matters here: Quickly restrict new pods from using hostPath or privileged modes.
Architecture / workflow: Emergency policy pushed via GitOps; admission validates new pods; Falco watches for post-admit anomalies.
Step-by-step implementation:
- Declare incident and notify platform on-call.
- Apply emergency deny policy cluster-wide excluding kube-system.
- Monitor for new deny events and retroactive runtime alerts.
- Remediate running risky pods via orchestration.
- Postmortem to remove allowlists and refine policies.
What to measure: Time to apply emergency policy and reduction in risky pods.
Tools to use and why: GitOps for quick policy rollouts, Falco for runtime monitoring.
Common pitfalls: Overly broad emergency rules breaking legitimate jobs.
Validation: Confirm new pods are denied and runtime anomalies decline.
Outcome: Containment of the exploit vector while follow-up patches are deployed.
Scenario #4 — Cost / Performance trade-off: Limiting node-affinity privileged workloads
Context: Privileged workloads were allowed to reserve host devices causing scheduling hotspots.
Goal: Reduce node contention and improve cost efficiency.
Why PSP matters here: Prevents pods from requesting host resources unnecessarily.
Architecture / workflow: Enforcement via policy to forbid hostPath and hostNetwork for non-admin namespaces; scheduler metrics track node load.
Step-by-step implementation:
- Identify pods using hostPath/hostNetwork.
- Create policy disallowing host bindings in app namespaces.
- Educate teams on alternatives (CSI drivers, local PVs with eviction).
- Enforce policy and monitor node utilization and costs.
What to measure: Host-bound pod count, node utilization, cost per workload.
Tools to use and why: Prometheus for node metrics, Gatekeeper for enforcement.
Common pitfalls: Legacy storage requirements that need migration effort.
Validation: Observe reduced node saturation and lower costs.
Outcome: Improved packing and cost reduction while maintaining app functionality.
Common Mistakes, Anti-patterns, and Troubleshooting
Each item: Symptom -> Root cause -> Fix
- Symptom: Many deployments suddenly fail. -> Root cause: Enforced policy applied cluster-wide without audit stage. -> Fix: Roll back to audit mode and stage enforcement.
- Symptom: Control plane pods restart. -> Root cause: Policy applied to the kube-system namespace. -> Fix: Exempt kube-system or bind policy selectively.
- Symptom: CI pipeline fails for legacy apps. -> Root cause: No migration path or advisory checks. -> Fix: Add migration tasks and config transforms in CI.
- Symptom: Webhook timeout blocking deploys. -> Root cause: Synchronous webhook slow response. -> Fix: Increase timeouts, optimize webhook, use caching, fail-open if acceptable.
- Symptom: High alert noise for denies. -> Root cause: Broad deny rules catching benign patterns. -> Fix: Tweak rules, add exceptions, group alerts.
- Symptom: Runtime escape detected after admission. -> Root cause: Admission misses runtime behavior. -> Fix: Add runtime security tools like Falco and correlate events.
- Symptom: Policy drift between Git and cluster. -> Root cause: Manual in-cluster edits. -> Fix: Enforce GitOps reconciliation and audit.
- Symptom: Developers request privileges frequently. -> Root cause: Missing capabilities/incompatible images. -> Fix: Provide developer guidance, alternative images, or safe allowlists.
- Symptom: Slow troubleshooting for denied pods. -> Root cause: Poor audit log indexing. -> Fix: Improve logging pipeline and searchable fields.
- Symptom: Too many exceptions. -> Root cause: Emergency allowlist overused. -> Fix: Time-bound allowlists and post-incident review.
- Symptom: False positives in runtime alerts. -> Root cause: Un-tuned rules. -> Fix: Tune Falco/IDS rules for environment.
- Symptom: Admission controller memory pressure. -> Root cause: Complex policy evaluation. -> Fix: Simplify policies or scale controller replicas.
- Symptom: Unauthorized privilege escapes not detected. -> Root cause: No runtime coverage. -> Fix: Deploy additional runtime sensors and process baselines.
- Symptom: Policy regressions after upgrade. -> Root cause: API behavior changes across K8s versions. -> Fix: Test policies during cluster upgrades in staging.
- Symptom: Too many one-off policies. -> Root cause: Lack of reuse and templates. -> Fix: Create reusable policy templates.
- Symptom: Missing seccomp profiles. -> Root cause: OS/container runtime mismatch. -> Fix: Standardize runtimes and maintain profiles.
- Symptom: App failures masked by admission denies. -> Root cause: Poor error messaging in deny responses. -> Fix: Provide detailed deny messages and remediation steps.
- Symptom: Observability blind spots. -> Root cause: Not collecting admission metrics. -> Fix: Instrument webhooks and export metrics.
- Symptom: Performance regressions with mutation. -> Root cause: Mutating webhook injects heavy sidecars. -> Fix: Re-evaluate injected artifacts and tune.
- Symptom: Security policy conflicts. -> Root cause: Multiple policy engines with overlapping rules. -> Fix: Consolidate or document precedence.
- Symptom: Unmonitored allowlist usage. -> Root cause: Lack of audit for exemptions. -> Fix: Log and review all allowlist entries periodically.
- Symptom: Poor developer adoption. -> Root cause: No training and unclear guidance. -> Fix: Provide examples, templates, and office hours.
- Symptom: Excessive manual remediation work. -> Root cause: No automation for common fixes. -> Fix: Create automation playbooks and PR bots.
Observability pitfalls covered above: noisy logs, missing metrics, indexing gaps, false positives, and lack of correlation between admission and runtime events.
Best Practices & Operating Model
Ownership and on-call
- Platform team owns policy lifecycle and on-call for blocking webhook failures.
- Application teams own remediation for their violations.
- Create a clear escalation path when enforcement blocks critical business workflows.
Runbooks vs playbooks
- Runbooks: Step-by-step operational procedures for common issues (webhook down, emergency allowlist).
- Playbooks: Higher-level decision guides for incidents requiring judgment (policy rollback vs enforce).
Safe deployments (canary/rollback)
- Start in audit mode, then small-namespace canary, then full enforce.
- Keep quick rollback paths and automated tests in CI to detect breakages.
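The audit, canary, and full-enforce stages above can be sketched as a small planning function. This is a minimal sketch that assumes the built-in PodSecurity admission is the enforcing engine and uses its namespace labels; the function names, the stage names, and the choice of the `restricted` level are illustrative assumptions, not something this guide prescribes.

```python
# Namespace label prefix used by the built-in PodSecurity admission.
PSA_PREFIX = "pod-security.kubernetes.io"


def namespace_labels(enforce: bool, level: str = "restricted") -> dict:
    """Build PodSecurity labels for one namespace.

    audit and warn are always set so violations stay visible;
    enforce is added only when the namespace should reject pods.
    """
    labels = {
        f"{PSA_PREFIX}/audit": level,
        f"{PSA_PREFIX}/warn": level,
    }
    if enforce:
        labels[f"{PSA_PREFIX}/enforce"] = level
    return labels


def plan_rollout(namespaces, canary, stage):
    """Map each namespace to its labels for a rollout stage.

    stage is a hypothetical name: 'audit' (nothing enforced),
    'canary' (only canary namespaces enforced), or 'full'.
    """
    if stage not in ("audit", "canary", "full"):
        raise ValueError(f"unknown stage: {stage}")
    plan = {}
    for ns in namespaces:
        enforce = stage == "full" or (stage == "canary" and ns in canary)
        plan[ns] = namespace_labels(enforce)
    return plan


if __name__ == "__main__":
    for ns, labels in plan_rollout(["team-a", "team-b"], {"team-a"}, "canary").items():
        print(ns, labels)
```

Keeping the stage as a single input means a rollback is just a re-run with the previous stage, which fits the quick-rollback advice above.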
Toil reduction and automation
- Automate remediation PRs for simple fixes (e.g., inject runAsUser).
- Use policy templates and GitOps to avoid manual edits.
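As an illustration of an automated remediation, the sketch below fills in pod-level non-root settings on a manifest that has already been parsed into a Python dict. The function name and the default UID `10001` are assumptions made for the example; a real remediation bot would also need to confirm that the image actually runs as that user.

```python
import copy


def add_non_root(pod: dict, uid: int = 10001) -> dict:
    """Return a copy of the pod manifest with non-root defaults filled in.

    Existing values are never overwritten, so the change is safe to
    propose repeatedly (for example from a PR bot).
    """
    patched = copy.deepcopy(pod)
    ctx = patched.setdefault("spec", {}).setdefault("securityContext", {})
    ctx.setdefault("runAsNonRoot", True)
    ctx.setdefault("runAsUser", uid)  # hypothetical default UID
    return patched


if __name__ == "__main__":
    pod = {"spec": {"containers": [{"name": "app", "image": "example/app:1.0"}]}}
    print(add_non_root(pod)["spec"]["securityContext"])
```

Returning a copy rather than mutating the input makes it easy to diff the original and patched manifests when opening the remediation PR.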
Security basics
- Default deny for capabilities and privileged flags.
- Enforce non-root where possible.
- Apply seccomp and AppArmor profiles.
- Monitor runtime for deviations.
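The basics above can be approximated as a pre-admission check, for example in CI. The following is a rough sketch over a pod manifest parsed into a dict; the function name and message wording are assumptions, it inspects only `containers` (not `initContainers`), and it omits AppArmor, so it is not a substitute for a real policy engine.

```python
def violations(pod: dict) -> list:
    """List basic security violations found in a pod manifest dict."""
    found = []
    spec = pod.get("spec", {})
    pod_ctx = spec.get("securityContext") or {}
    for container in spec.get("containers", []):
        name = container.get("name", "<unnamed>")
        ctx = container.get("securityContext") or {}

        # Privileged flag must be absent or false.
        if ctx.get("privileged"):
            found.append(f"{name}: privileged must not be true")

        # Default deny for capabilities: every container drops ALL.
        drop = (ctx.get("capabilities") or {}).get("drop") or []
        if "ALL" not in drop:
            found.append(f"{name}: capabilities.drop must include ALL")

        # Container-level setting wins over the pod-level one.
        non_root = ctx.get("runAsNonRoot", pod_ctx.get("runAsNonRoot"))
        if non_root is not True:
            found.append(f"{name}: runAsNonRoot must be true")

        seccomp = ctx.get("seccompProfile") or pod_ctx.get("seccompProfile") or {}
        if seccomp.get("type") not in ("RuntimeDefault", "Localhost"):
            found.append(f"{name}: seccompProfile must be RuntimeDefault or Localhost")
    return found


if __name__ == "__main__":
    risky = {"spec": {"containers": [{"name": "app", "securityContext": {"privileged": True}}]}}
    for problem in violations(risky):
        print(problem)
```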
Weekly/monthly routines
- Weekly: Review deny spikes and open remediation tickets.
- Monthly: Audit allowlist entries and drift reports.
- Quarterly: SLO review and policy effectiveness report.
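The monthly allowlist audit lends itself to a small script. This sketch assumes a hypothetical allowlist format in which each entry is a dict with `workload`, `reason`, and an optional `expires` date; entries with no expiry are flagged as well, since the guidance here is that exemptions should be time-bound.

```python
from datetime import date


def entries_to_review(allowlist: list, today: date) -> list:
    """Return allowlist entries that are expired or have no expiry date."""
    flagged = []
    for entry in allowlist:
        expires = entry.get("expires")
        if expires is None or expires < today:
            flagged.append(entry)
    return flagged


if __name__ == "__main__":
    allowlist = [
        {"workload": "legacy-batch", "reason": "needs root", "expires": date(2024, 1, 31)},
        {"workload": "node-agent", "reason": "hostPath", "expires": None},
        {"workload": "migrating-app", "reason": "privileged", "expires": date(2099, 1, 1)},
    ]
    for entry in entries_to_review(allowlist, date.today()):
        print(entry["workload"], entry["reason"])
```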
What to review in postmortems related to PSP
- Whether policy prevented or caused the incident.
- Time to detect and remediate policy violations.
- Any changes to allowlists and their justification.
- Lessons to tighten or relax policies.
Tooling & Integration Map for PSP
| ID | Category | What it does | Key integrations | Notes |
|---|---|---|---|---|
| I1 | Policy engine | Validates and mutates pod specs | Kubernetes admission, GitOps | Gatekeeper and Kyverno common choices |
| I2 | Audit logging | Collects admission decisions | SIEM, ELK, cloud logging | kube-apiserver audit must be enabled |
| I3 | Metrics store | Stores compliance and latency metrics | Prometheus, Alertmanager | Needs metrics from controllers |
| I4 | Runtime security | Detects runtime anomalies | Falco, runtime scanners | Complements admission controls |
| I5 | GitOps | Manages policy-as-code | ArgoCD, Flux | Ensures reconciliation |
| I6 | CI integration | Runs policy checks pre-deploy | Jenkins, GitHub Actions | Prevents violations before admission |
| I7 | Dashboarding | Visualizes compliance | Grafana, Kibana | Executive and debug views |
| I8 | Identity / AuthN | Maps service accounts and users | OIDC, IAM | Critical for correct policy binding |
| I9 | Secrets & config | Securely store seccomp/AppArmor files | Vault, K8s Secrets | Sensitive artifacts storage |
| I10 | Incident mgmt | Routes alerts and tickets | PagerDuty, Opsgenie | On-call routing for platform team |
Frequently Asked Questions (FAQs)
H3: What does PSP stand for in Kubernetes?
PSP stands for Pod Security Policy, a historical Kubernetes admission API. Newer Kubernetes versions replace it with PodSecurity admission and third-party policy engines.
H3: Is PSP still supported in Kubernetes 1.27+?
No. The built-in PSP API was deprecated in Kubernetes 1.21 and removed in 1.25. Use PodSecurity, OPA Gatekeeper, or Kyverno.
H3: What’s the difference between PodSecurity and PSP?
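PodSecurity is the newer built-in admission controller; it applies the predefined Pod Security Standards levels (privileged, baseline, restricted) per namespace. PSP was a more flexible API that has since been deprecated and removed.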
H3: Can PSP mutate pod specs?
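Only in a limited way. The original PSP admission controller could apply defaults to some pod fields in addition to validating them. The built-in PodSecurity admission is validating-only, so mutation now requires a mutating admission webhook, for example Kyverno mutate rules or a custom controller.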
H3: Should I use Gatekeeper or Kyverno?
Depends on team skills: Gatekeeper is powerful with Rego, Kyverno is Kubernetes-native and simpler for many use cases.
H3: How do I migrate from PSP to PodSecurity?
Inventory PSP usage, map rules to PodSecurity levels or Gatekeeper constraints, test in staging, and roll out audit-first.
H3: What are common PSP-equivalent policy rules?
Disallow privileged, disallow hostPath, enforce runAsNonRoot, require seccomp/AppArmor.
H3: How do I test policies without breaking prod?
Use audit mode and CI policy checks, and deploy to staging mirroring prod.
H3: Who should own PSP policies?
Platform or security team owns policies; application teams own remediation and exceptions.
H3: How do I measure the impact of PSP?
Track compliance rate, deny rate, remediation time, and runtime anomalies.
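A minimal sketch of how the first three of these measures could be computed from raw counts. The input names are assumptions; in practice the counts would come from your metrics store or audit logs rather than being passed in by hand.

```python
from statistics import median


def rate(part: int, total: int) -> float:
    """Safe ratio that returns 0.0 when there is nothing to measure."""
    return part / total if total else 0.0


def summarize(pods_total, pods_compliant, requests_total, requests_denied, remediation_hours):
    """Summarize compliance rate, deny rate, and median remediation time."""
    return {
        "compliance_rate": rate(pods_compliant, pods_total),
        "deny_rate": rate(requests_denied, requests_total),
        "median_remediation_hours": median(remediation_hours) if remediation_hours else None,
    }


if __name__ == "__main__":
    print(summarize(200, 190, 1000, 25, [2.0, 8.0, 5.0]))
```

The median is used for remediation time so that a single long-running exception does not dominate the figure; a percentile would be an equally reasonable choice.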
H3: What happens if admission webhook fails?
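If the webhook is configured fail-closed (failurePolicy: Fail), pod creation is blocked while the webhook is unavailable; if it is fail-open (failurePolicy: Ignore), requests are admitted without the check. Choose based on risk.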
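The trade-off can be shown with a toy model of the API server's decision. This is only an illustration, not Kubernetes code: `admit` and `call_webhook` are hypothetical names, and the two policy values mirror the `failurePolicy` options `Fail` and `Ignore` on webhook configurations.

```python
def admit(call_webhook, failure_policy: str = "Fail") -> bool:
    """Decide whether a request is admitted.

    call_webhook is a hypothetical callable that returns True (allow)
    or False (deny), or raises TimeoutError when the webhook is down.
    """
    if failure_policy not in ("Fail", "Ignore"):
        raise ValueError(f"unknown failure policy: {failure_policy}")
    try:
        return bool(call_webhook())
    except TimeoutError:
        # Fail-closed rejects on error; fail-open lets the request through.
        return failure_policy == "Ignore"


def unavailable_webhook():
    raise TimeoutError("webhook did not respond")


if __name__ == "__main__":
    print("fail-closed:", admit(unavailable_webhook, "Fail"))
    print("fail-open:", admit(unavailable_webhook, "Ignore"))
```

Note that fail-open only changes the outcome when the webhook errors; an explicit deny is still a deny under either policy.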
H3: Are PSPs enough for security?
No, they are preventive controls; combine with image scanning, network policy, and runtime security.
H3: How to handle legacy workloads requiring root?
Provide an exemption path with strict auditing and time-bound allowlists while modernizing.
H3: Can I auto-remediate policy violations?
Yes, for safe, deterministic changes such as setting a non-root user, but verify the consequences for each workload before automating.
H3: How to avoid noisy deny alerts?
Tune policies, aggregate alerts, use grouping and suppression windows.
H3: What telemetry is most valuable for PSP?
Admission deny events, webhook latencies, policy drift, and runtime detections.
H3: How to handle cross-cluster policies?
Use GitOps and central policy templates with per-cluster overrides.
H3: Do managed Kubernetes providers enforce PSP?
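It varies by provider and Kubernetes version. Because the PSP API was removed upstream in Kubernetes 1.25, clusters on that version or later cannot use built-in PSP; check your provider's documentation for the replacement it supports.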
Conclusion
PSP historically provided pod-level admission controls in Kubernetes, and its concepts remain central to enforcing pod security. The PSP API has since been deprecated and removed upstream, but the principles persist via PodSecurity, OPA Gatekeeper, Kyverno, mutating and validating admission webhooks, and runtime tools. A robust approach combines preventive admission checks, policy-as-code in GitOps, runtime detection, and clear SRE ownership and observability.
Next 7 days plan
- Day 1: Inventory current pod specs and identify risky pod attributes.
- Day 2: Enable kube-apiserver audit logging and forward to central logs.
- Day 3: Create baseline policies in Git in audit mode.
- Day 4: Add CI checks to run policy validations for merge requests.
- Day 5: Build Prometheus/Grafana panels for basic compliance metrics.
- Day 6: Run a small canary enforcement in a non-critical namespace.
- Day 7: Review results, open remediation tickets, and plan next-week enforcement.
Appendix — PSP Keyword Cluster (SEO)
- Primary keywords
- Pod Security Policy
- PSP Kubernetes
- PodSecurity admission
- Kubernetes pod security
- Pod security best practices
- Secondary keywords
- Kubernetes admission controllers
- PodSecurityPolicy deprecation
- OPA Gatekeeper policies
- Kyverno pod policies
- seccomp profiles Kubernetes
- AppArmor Kubernetes
- runAsNonRoot enforcement
- hostPath policy Kubernetes
- Long-tail questions
- How to migrate from PSP to PodSecurity
- What replaces PodSecurityPolicy in Kubernetes
- How to enforce non-root containers in Kubernetes
- How to audit PSP in Kubernetes clusters
- How to prevent privileged containers in K8s
- How to measure pod security compliance
- How to design pod admission policies
- How to use Gatekeeper for pod validation
- How to use Kyverno to mutate pod specs
- How to integrate pod security with CI/CD
- What is the impact of admission webhook latency
- How to handle legacy apps with PSP rules
- How to author seccomp profiles for pods
- How to monitor admission deny rates
- How to create policy-as-code for Kubernetes
- Related terminology
- Admission webhook
- Mutating webhook
- Validating webhook
- Audit logs
- Policy-as-code
- GitOps policy management
- Runtime security
- Falco rules
- kube-apiserver audit
- Prometheus metrics for admission
- Alertmanager deny alerts
- Emergency allowlist
- Policy drift
- Compliance baseline
- Cluster role binding
- ServiceAccount policies
- Canary policy rollout
- Fail-open webhook
- Fail-closed webhook
- Policy reconciliation
- Seccomp profile injection
- AppArmor profile injection
- Capability bounding
- Least privilege enforcement
- Pod security context
- Node hostPath restrictions
- HostNetwork prevention
- Privileged container prevention
- Mutate-and-validate pattern
- Admission latency monitoring
- Policy audit reports
- SIEM integration for denies
- Kube-state-metrics compliance
- Kubernetes policy templates
- Policy testing in CI
- Postmortem policy review
- Emergency policy rollout
- Policy exclusion lists
- Policy coverage by namespace
- Policy SLOs and SLIs