Home of runbooks, playbooks, standards, and day‑to‑day operations for
planetcloud.cloud. Keep this page crisp. Link out to details; don’t duplicate.
Status Dashboard: https://status.planetcloud.cloud
Prod Console: https://system.planetcloud.cloud
Monitoring (Grafana/Uptime Kuma): https://grafana.planetcloud.cloud · https://kuma.system.planetcloud.cloud
Ticketing (Zammad): https://zammad.planetcloud.cloud
Logs: https://graylog.planetcloud.cloud
Docs Search: /search?q=
Change Calendar: /changes
Inventory / CMDB: /cmdb
🔧 Update the links above to match your actual hostnames.
Primary: @oncall-primary
Secondary: @oncall-secondary
Duty Hours (ICT/Asia/Bangkok): 24×7
Escalation: Pager / Hotline +66-XXX-XXX-XXX → SRE Lead → Head of Operations
Rotate: see /oncall/rotation
Declare severity (S0–S3) → open ticket MI-YYYYMMDD-##.
Create bridge: meet/bridge link and Slack channel #mi-YYYYMMDD-##.
Assign roles: Incident Commander (IC), Comms, Scribe, Tech Leads.
Stabilize service → capture timeline.
Comms: internal every 30 min; external via status page.
Close with RCA (≤ 72h), action items, and owner per item.
Runbook: /runbooks/incident/declare
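
The ticket and channel naming in the steps above can be generated rather than typed by hand. The following is a minimal sketch, assuming the `##` slot is a two-digit, per-day sequence number; the function name `incident_ids` is hypothetical and not part of any existing tooling.

```python
from datetime import date


def incident_ids(day: date, seq: int) -> tuple[str, str]:
    """Return (ticket_id, slack_channel) for the seq-th major incident of a day.

    Assumption: the "##" placeholder is a zero-padded sequence from 01 to 99.
    """
    if not 1 <= seq <= 99:
        raise ValueError("seq must fit the two-digit ## slot (1-99)")
    stamp = f"{day:%Y%m%d}-{seq:02d}"
    # Ticket IDs are upper case (MI-...), Slack channels lower case (#mi-...).
    return f"MI-{stamp}", f"#mi-{stamp}"


if __name__ == "__main__":
    ticket, channel = incident_ids(date(2025, 10, 6), 1)
    print(ticket, channel)
```

Deriving both names from one stamp keeps the ticket and the bridge channel in lockstep, which makes the timeline easier to reconstruct for the RCA.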
Severity Matrix
| Sev | Impact | Example | Target Response | Target Restore |
|---|---|---|---|---|
| S0 | Full outage, safety/compliance risk | All customers down | 5 min | 60 min |
| S1 | Critical partial outage | Region down / Data loss risk | 15 min | 4 h |
| S2 | Degraded/non‑critical outage | Single service down | 1 h | 24 h |
| S3 | Minor issue | Cosmetic / Low risk | 1 biz day | As planned |
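
If the matrix is also needed by tooling (for example, to show response and restore deadlines in the incident channel), it could be encoded as data. This is a sketch only: the names `SEVERITY_TARGETS` and `deadlines` are hypothetical, and the S3 targets ("1 biz day", "As planned") are left as `None` because they are not fixed clock durations.

```python
from datetime import datetime, timedelta
from typing import NamedTuple, Optional


class Targets(NamedTuple):
    respond: Optional[timedelta]
    restore: Optional[timedelta]


# Values copied from the severity matrix above.
SEVERITY_TARGETS = {
    "S0": Targets(timedelta(minutes=5), timedelta(minutes=60)),
    "S1": Targets(timedelta(minutes=15), timedelta(hours=4)),
    "S2": Targets(timedelta(hours=1), timedelta(hours=24)),
    # S3 targets depend on business calendars and planning, so no fixed value.
    "S3": Targets(None, None),
}


def deadlines(sev: str, declared_at: datetime):
    """Return (respond_by, restore_by); an entry is None when no fixed target exists."""
    t = SEVERITY_TARGETS[sev]
    respond_by = declared_at + t.respond if t.respond is not None else None
    restore_by = declared_at + t.restore if t.restore is not None else None
    return respond_by, restore_by


if __name__ == "__main__":
    print(deadlines("S1", datetime(2025, 10, 6, 10, 0)))
```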
Runbooks → /runbooks (step‑by‑step, production‑safe)
Playbooks → /playbooks (diagnostics & options)
Standards → /standards (naming, tagging, security baselines)
Architecture → /architecture (diagrams, data flows)
Environments → /env (Prod / Staging / Lab)
Networks → /net (IPAM, VLANs, VPNs, BGP/OSPF)
Platforms → /platforms (OpenStack, VMware, Kubernetes, HyperFlex)
Observability → /obs (logging, metrics, tracing, alerts)
Security → /sec (IAM, keys, certs, SOC runbooks)
Backups/DR → /dr (RPO/RTO, restore drills)
Change Mgmt → /change (CAB, templates, risk matrix)
Service Catalog → /catalog (SLO/SLI, owners)
Open S0/S1 Incidents: X
Active Changes (24h): Y
Error Budget (Top 3 services): [svc-a 98.95% | svc-b 99.91% | svc-c 99.99%]
Last Deployment: svc-a @ 2025‑10‑06 10:12 ICT (commit abcd1234)
Tip: Drive this from an automated widget or a small cron that updates a JSON include.
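
One way the cron job mentioned in the tip could write its JSON include is sketched below. Everything here is an assumption for illustration: the function name `write_snapshot`, the field names, and the output path are hypothetical, and the counts would come from your ticketing and change systems rather than being passed in by hand.

```python
import json
import os
import tempfile
from datetime import datetime, timezone


def write_snapshot(path, open_incidents, active_changes_24h, last_deployment):
    """Write the ops snapshot as JSON, replacing the old file atomically."""
    payload = {
        "open_incidents": open_incidents,
        "active_changes_24h": active_changes_24h,
        "last_deployment": last_deployment,
        "generated_at": datetime.now(timezone.utc).isoformat(timespec="seconds"),
    }
    directory = os.path.dirname(os.path.abspath(path))
    # Write to a temp file in the same directory, then rename over the target,
    # so the wiki never reads a half-written include.
    fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w", encoding="utf-8") as fh:
            json.dump(payload, fh, indent=2)
        os.chmod(tmp, 0o644)  # mkstemp defaults to owner-only permissions
        os.replace(tmp, path)
    except BaseException:
        os.unlink(tmp)
        raise
    return payload


if __name__ == "__main__":
    # Hypothetical output location; adjust to wherever the wiki reads includes.
    write_snapshot("ops-snapshot.json", 0, 3, "svc-a @ 2025-10-06 10:12 ICT")
```

The `generated_at` field lets the widget show how stale the numbers are if the cron job stops running.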
OpenStack: Create/Access Instances · Neutron Troubleshooting
VMware/vCenter: Datastore Alerts · vMotion Failures
Network: WireGuard site‑to‑site · BGP session flaps
Security: Cert renewal (Traefik/LE) · Rotate secrets
Monitoring: Alert floods · Agent install
Keep the list short; rotate monthly.
Naming: pnc-<env>-<role>-<seq> (e.g., pnc-prod-api-03)
Tagging: owner, env, cost-center, data-class, backup-profile
Access: SSO + MFA, least privilege; break‑glass flow documented.
Backups: Daily (DB), hourly WAL, weekly full; 30‑day retention; test restores monthly.
Patching: Critical ≤ 7d; High ≤ 14d; kernel ≤ 14d.
Certificates: Let’s Encrypt via Traefik HTTP‑01; expiry alerts at T‑30 and T‑7 days.
→ Details: /standards
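
The naming and tagging rules above lend themselves to an automated check. The sketch below assumes lower-case alphanumeric `env` and `role` segments and a two-digit `seq` (inferred from the `pnc-prod-api-03` example, not stated as a rule); `check_resource` is a hypothetical helper name.

```python
import re

# Pattern for pnc-<env>-<role>-<seq>; segment character sets are assumptions.
NAME_RE = re.compile(r"^pnc-(?P<env>[a-z0-9]+)-(?P<role>[a-z0-9]+)-(?P<seq>\d{2})$")

# Required tag keys, taken from the Tagging standard above.
REQUIRED_TAGS = frozenset({"owner", "env", "cost-center", "data-class", "backup-profile"})


def check_resource(name: str, tags: dict) -> list:
    """Return a list of human-readable problems; an empty list means compliant."""
    problems = []
    match = NAME_RE.match(name)
    if match is None:
        problems.append(f"name {name!r} does not match pnc-<env>-<role>-<seq>")
    elif "env" in tags and tags["env"] != match["env"]:
        problems.append(f"env tag {tags['env']!r} disagrees with name segment {match['env']!r}")
    missing = sorted(REQUIRED_TAGS - set(tags))
    if missing:
        problems.append("missing tags: " + ", ".join(missing))
    return problems


if __name__ == "__main__":
    print(check_resource("pnc-prod-api-03", {"owner": "sre", "env": "prod"}))
```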
Core DCs: SPDC‑DC1, SPDC‑DC2
WAN: MPLS + Internet failover; IPsec/ZeroTier to remote sites.
IPAM: 10.40.0.0/16 (core), 172.18.0.0/16 (docker‑bridge), 172.19.0.0/16 (internal docker)
DNS: BIND9 internal; Cloudflare public.
Diagrams: /architecture/diagrams
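
Before allocating a new subnet, it helps to check it against the ranges listed under IPAM. The sketch below uses Python's standard `ipaddress` module; the dictionary keys and the helper names `classify` and `conflicts` are hypothetical labels, and only the three CIDR blocks come from this page.

```python
import ipaddress

# Ranges copied from the IPAM line above; the labels are illustrative.
RANGES = {
    "core": ipaddress.ip_network("10.40.0.0/16"),
    "docker-bridge": ipaddress.ip_network("172.18.0.0/16"),
    "internal-docker": ipaddress.ip_network("172.19.0.0/16"),
}


def classify(addr: str):
    """Return the label of the documented range containing addr, or None."""
    ip = ipaddress.ip_address(addr)
    for label, net in RANGES.items():
        if ip in net:
            return label
    return None


def conflicts(candidate: str) -> list:
    """Return labels of documented ranges that overlap the candidate CIDR."""
    net = ipaddress.ip_network(candidate)
    return [label for label, known in RANGES.items() if net.overlaps(known)]


if __name__ == "__main__":
    print(classify("10.40.12.7"))
    print(conflicts("172.18.128.0/17"))
```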
Traefik reverse proxy
Zammad helpdesk
Grafana / Uptime Kuma monitoring
Graylog / Elasticsearch logs
MediaMTX streaming
Eocortex VMS
Keep owner + upgrade guide per tool under /tools.
Report an incident: open ticket INC-… and page @oncall
Request access: /requests/access
New service onboarding: /catalog/new-service
Security report: security@planetcloud.cloud
# <Runbook Title>
**Service:** <name>
**Owner:** <team/people>
**Last Reviewed:** <YYYY‑MM‑DD>
## When to use
<symptoms, scope>
## Preconditions
- [ ] Access to <systems>
- [ ] Change window required? <Yes/No>
## Steps (safe)
1. <step>
2. <step>
## Rollback
<steps>
## Validation
- [ ] Metric X recovered
- [ ] Error rate < threshold
## Links
- Dashboards: <links>
- Source: <repo/commit>
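
A runbook written from the template above can be checked for completeness before review. This is a small sketch, assuming runbooks are stored as Markdown and keep the template's `##` section names verbatim; `missing_sections` is a hypothetical helper.

```python
import re

# Section headings taken from the runbook template above.
REQUIRED_SECTIONS = (
    "When to use",
    "Preconditions",
    "Steps (safe)",
    "Rollback",
    "Validation",
    "Links",
)


def missing_sections(text: str) -> list:
    """Return required level-2 headings that are absent from the runbook text."""
    found = {m.group(1).strip() for m in re.finditer(r"^##\s+(.+)$", text, re.MULTILINE)}
    return [section for section in REQUIRED_SECTIONS if section not in found]


if __name__ == "__main__":
    draft = "# Restart API\n## When to use\n...\n## Steps (safe)\n1. ...\n"
    print(missing_sections(draft))
```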
# Change: <short title>
**ID:** CHG-YYYYMMDD-##
**Risk:** Low/Med/High
**Window:** <start–end ICT>
**Services:** <list>
## Plan
<steps>
## Backout
<steps>
## Comms
- Pre‑notice link
- Post‑notice link
# <Service Name>
**Owner:** <team> · **Tier:** 1/2/3 · **SLO:** <99.9%>
## Overview
<what it does>
## Dependencies
<upstream/downstream>
## SLI/SLOs
- Availability: <target>
- Latency P95: <target>
## Runbooks
- <links>
## Dashboards
- <links>
MFA required for all admin access
Break‑glass creds vaulted and tested quarterly
SSH via bastion; no public SSH on prod
Secrets rotated on schedule; audit trails enabled
2025‑10‑06 – Initial homepage scaffold created.
MediaWiki (homepage skeleton)
= PlanetCloud Operations Wiki =
''Home of runbooks, playbooks, and standards for planetcloud.cloud.''
== Quick Links ==
* '''Status''': https://status.planetcloud.cloud
* '''Prod Console''': https://system.planetcloud.cloud
* '''Monitoring''': https://grafana.planetcloud.cloud
* '''Helpdesk''': https://zammad.planetcloud.cloud
== On-Call (Today) ==
; Primary : @oncall-primary
; Secondary : @oncall-secondary
; Escalation : +66-XXX-XXX-XXX
== Major Incident ==
# Declare severity (S0–S3); open MI ticket
# Create bridge and Slack channel
# Assign roles; stabilize; communications; RCA
== Navigation ==
* [[Runbooks]] · [[Playbooks]] · [[Standards]] · [[Architecture]] · [[Observability]] · [[Security]] · [[Change Management]] · [[Service Catalog]]
== Templates ==
* [[Template:Runbook]] · [[Template:Change]] · [[Template:Service]]
If you use Confluence, paste the Markdown above into a Markdown macro; for rich Confluence pages, mirror the sections and convert tables to native ones.