Hi everyone!
I’ve been working on a Kubernetes operator focused on a specific use case: automated backup and restore for StatefulSets using native VolumeSnapshot APIs.
GitHub: https://github.com/federicolepera/statefulset-backup-operator
What it does
The operator provides:
- Scheduled snapshots via cron expressions
- Coordinated backups across all StatefulSet replicas
- Pre/post backup hooks for application consistency (e.g., database flush operations)
- Per-replica retention policies (keeps N most recent snapshots per PVC)
- Point-in-time recovery with a simple declarative CRD
- 100% CRD-based - no CLI tools required, GitOps-friendly
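
For anyone less familiar with the VolumeSnapshot API, here is a rough sketch of the kind of native object a snapshot-based backup creates for a single replica's PVC. It is not taken from the operator's code: the snapshot name and the label are hypothetical, and the PVC name assumes a volumeClaimTemplate called `data` on a StatefulSet called `postgresql` (StatefulSet PVCs are named `<template>-<statefulset>-<ordinal>`).

```yaml
# Illustrative only: names and labels are assumptions, not the operator's actual output.
apiVersion: snapshot.storage.k8s.io/v1
kind: VolumeSnapshot
metadata:
  name: postgres-backup-data-postgresql-0-example  # hypothetical naming scheme
  namespace: production
  labels:
    example.com/statefulset: postgresql  # hypothetical label for grouping snapshots
spec:
  volumeSnapshotClassName: csi-hostpath-snapclass
  source:
    persistentVolumeClaimName: data-postgresql-0  # PVC of replica 0, assuming a template named "data"
```

With three replicas there would be one such object per PVC (`data-postgresql-0`, `data-postgresql-1`, `data-postgresql-2`), which is what makes per-replica retention and restore possible.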
Why not just use Velero?
Great question! Velero is excellent for full-cluster disaster recovery, but this operator targets a different use case:
StatefulSet Backup Operator is better if you need:
- Fast snapshot-based backups (seconds vs minutes)
- Minimal setup (2 minutes vs 15-30 minutes)
- No external object storage dependency
- Cost-effective incremental snapshots
- Per-replica granular restore
Velero is better if you need:
- Cross-cluster disaster recovery
- Full cluster migrations
- Backup of cluster-scoped resources
- Multi-cloud portability
Think of it as “the right tool for the right job” - lightweight and focused vs comprehensive.
Example Usage
Scheduled PostgreSQL backup with consistency hooks:
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetBackup
metadata:
  name: postgres-backup
spec:
  statefulSetRef:
    name: postgresql
    namespace: production
  schedule: "0 2 * * *"  # Daily at 2 AM
  retentionPolicy:
    keepLast: 7  # Keep 7 backups per replica
  preBackupHook:
    command: ["psql", "-U", "postgres", "-c", "CHECKPOINT"]
  volumeSnapshotClass: csi-hostpath-snapclass
```
Point-in-time restore:
```yaml
apiVersion: backup.sts-backup.io/v1alpha1
kind: StatefulSetRestore
metadata:
  name: restore-postgres
spec:
  statefulSetRef:
    name: postgresql
  useLatestBackup: true
  scaleDown: true
```
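
For context, restoring from a snapshot in plain Kubernetes means creating a PVC whose `dataSource` points at a `VolumeSnapshot`. A restore would typically involve something along these lines for each replica while the StatefulSet is scaled down. This is a sketch under assumptions, not the operator's actual implementation: the snapshot name, storage class, and size below are all hypothetical.

```yaml
# Illustrative only: snapshot name, storage class, and size are assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: data-postgresql-0  # must match the PVC name the StatefulSet expects for replica 0
  namespace: production
spec:
  storageClassName: csi-hostpath-sc  # hypothetical storage class
  dataSource:
    apiGroup: snapshot.storage.k8s.io
    kind: VolumeSnapshot
    name: postgres-backup-data-postgresql-0-example  # hypothetical snapshot name
  accessModes:
    - ReadWriteOnce
  resources:
    requests:
      storage: 10Gi  # must be at least the size of the snapshotted volume
```

Because the PVC keeps the name the StatefulSet expects, the replica's pod binds to the restored volume when the set is scaled back up.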
Current State (v0.0.2-alpha)
What works:
- 26 unit tests with 42.5% coverage
- GitHub Actions CI integration
- Manual and scheduled backups
- Pre/post hooks execution
- Retention management (per-replica)
- Restore workflow with automatic scale down/up
- Tested on Minikube and Kind
Known limitations:
- Pre/post hooks execute on first container only (working on `containerName` specification)
- VolumeSnapshotClass currently hardcoded (making it configurable)
- Cross-namespace operations not yet supported
- No webhook validation yet
Questions for the Community
I’m trying to figure out what would make this genuinely useful beyond a learning project:
- Application-aware backups: Would integration with specific databases (MySQL, PostgreSQL, MongoDB) via custom hooks be valuable?
- Backup verification: Automated snapshot validation/integrity checks?
- Observability: Prometheus metrics for backup success/failure rates?
- Multi-cluster restore: Using snapshots for cross-cluster migrations?
- Or is Velero already solving these problems well enough that narrow-scope alternatives don’t make sense?
Looking for Feedback
- Has anyone faced similar backup challenges with StatefulSets?
- Are there specific features that would make this production-ready for your use cases?
- Any storage providers you’d like to see tested? (currently tested: CSI hostpath, planning GKE/EKS/AKS)
- Thoughts on the architectural approach?
Roadmap
- Helm chart (next release)
- Webhook validation
- Configurable hook containers
- Backup verification
- Prometheus metrics
- Multi-cluster restore support
Appreciate any feedback, criticism, or suggestions! Also happy to collaborate if anyone finds this interesting.
Thanks!