[Project] StatefulSet Backup Operator v0.0.2 - Lightweight snapshot-based backups for StatefulSets

Hi everyone! :waving_hand:

I’ve been working on a Kubernetes operator focused on a specific use case: automated backup and restore for StatefulSets using native VolumeSnapshot APIs.

GitHub: https://github.com/federicolepera/statefulset-backup-operator

What it does

The operator provides:

  • Scheduled snapshots via cron expressions

  • Coordinated backups across all StatefulSet replicas

  • Pre/post backup hooks for application consistency (e.g., database flush operations)

  • Per-replica retention policies (keeps N most recent snapshots per PVC)

  • Point-in-time recovery (to the moment of any retained snapshot) with a simple declarative CRD

  • 100% CRD-based - no CLI tools required, GitOps-friendly

Why not just use Velero?

Velero is excellent for full-cluster disaster recovery, but this operator targets a narrower use case:

StatefulSet Backup Operator is better if you need:

  • Fast snapshot-based backups (typically seconds rather than minutes, depending on the CSI driver)

  • Minimal setup (2 minutes vs 15-30 minutes)

  • No external object storage dependency

  • Cost-effective snapshots (incremental where the storage backend supports it)

  • Per-replica granular restore

Velero is better if you need:

  • Cross-cluster disaster recovery

  • Full cluster migrations

  • Backup of cluster-scoped resources

  • Multi-cloud portability

Think of it as “the right tool for the right job” - lightweight and focused vs comprehensive.

Example Usage

Scheduled PostgreSQL backup with consistency hooks:

yaml

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: postgres-backup
spec:
  statefulSetRef:
    name: postgresql
    namespace: production
  schedule: "0 2 * * *"  # Daily at 2 AM
  retentionPolicy:
    keepLast: 7  # Keep 7 backups per replica
  preBackupHook:
    command: ["psql", "-U", "postgres", "-c", "CHECKPOINT"]
  volumeSnapshotClass: csi-hostpath-snapclass
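
For readers less familiar with the native API: a backup like the one above would presumably result in one standard VolumeSnapshot object per replica PVC. The manifest below is only a sketch of that underlying Kubernetes resource, not output captured from the operator. The snapshot name and the PVC name `data-postgresql-0` (which assumes a volumeClaimTemplate called `data`) are assumptions for illustration.

```yaml
# Hypothetical per-replica snapshot; names are illustrative, not taken from the operator.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-backup-data-postgresql-0   # assumed naming scheme
  namespace: production
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    # StatefulSet PVCs are named <claimTemplate>-<statefulset>-<ordinal>
    persistentVolumeClaimName: data-postgresql-0
```

Once the snapshot controller has processed an object like this, `kubectl get volumesnapshot -n production` should report it as ready to use.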

Point-in-time restore:

yaml

apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetRestore
metadata:
  name: restore-postgres
spec:
  statefulSetRef:
    name: postgresql
  useLatestBackup: true
  scaleDown: true
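
With plain Kubernetes, restoring from a VolumeSnapshot generally means creating a new PVC whose `dataSource` points at the snapshot, which is presumably why the StatefulSet gets scaled down first. Here is a rough sketch for replica 0. The namespace, storage class, size, snapshot name, and the `data` claim template name are all assumptions, and the operator may well build this differently.

```yaml
# Sketch of a PVC pre-populated from a snapshot; all names and sizes are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql-0                    # must match the name the StatefulSet expects
  namespace: production                      # assumed; the snapshot must live in the same namespace
spec:
  storageClassName: csi-hostpath-sc          # assumed storage class
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: postgres-backup-data-postgresql-0  # hypothetical snapshot name
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi                          # assumed; must be at least the snapshot's restore size
```

When the StatefulSet scales back up, the pod for ordinal 0 binds to the PVC with the matching name and starts on the restored data.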

Current State (v0.0.2-alpha)

:white_check_mark: What works:

  • 26 unit tests with 42.5% coverage

  • GitHub Actions CI integration

  • Manual and scheduled backups

  • Pre/post hook execution

  • Retention management (per-replica)

  • Restore workflow with automatic scale down/up

  • Tested on Minikube and Kind

:warning: Known limitations:

  • Pre/post hooks run in the pod's first container only (a containerName option is in progress)

  • VolumeSnapshotClass is currently hardcoded (work is underway to make it configurable)

  • Cross-namespace operations not yet supported

  • No webhook validation yet
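
On the hardcoded VolumeSnapshotClass: until that is configurable, the cluster needs a class the operator can use. Assuming the expected name is the `csi-hostpath-snapclass` from the example above (not confirmed here) and the CSI hostpath driver used in testing, a minimal class might look like this:

```yaml
# Assumed class for the CSI hostpath driver; change the driver for other storage providers.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshotClass
metadata:
  name: csi-hostpath-snapclass
driver: hostpath.csi.k8s.io
deletionPolicy: Delete   # deleting a VolumeSnapshot also deletes the underlying snapshot
```

With `deletionPolicy: Delete`, removing a VolumeSnapshot object also removes the snapshot on the storage side, which is usually what retention cleanup wants. `Retain` keeps the underlying snapshot around instead.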

Questions for the Community

I’m trying to figure out what would make this genuinely useful beyond a learning project:

  1. Application-aware backups: Would integration with specific databases (MySQL, PostgreSQL, MongoDB) via custom hooks be valuable?

  2. Backup verification: Automated snapshot validation/integrity checks?

  3. Observability: Prometheus metrics for backup success/failure rates?

  4. Multi-cluster restore: Using snapshots for cross-cluster migrations?

  5. Or is Velero already solving these problems well enough that narrow-scope alternatives don’t make sense?

Looking for Feedback

  • Has anyone faced similar backup challenges with StatefulSets?

  • Are there specific features that would make this production-ready for your use cases?

  • Any storage providers you’d like to see tested? (currently tested: CSI hostpath; GKE/EKS/AKS are planned)

  • Thoughts on the architectural approach?

Roadmap

  • Helm chart (next release)

  • Webhook validation

  • Configurable hook containers

  • Backup verification

  • Prometheus metrics

  • Multi-cluster restore support

Appreciate any feedback, criticism, or suggestions! Also happy to collaborate if anyone finds this interesting.

Thanks! :folded_hands: