Comparing a Kubernetes Job-Based Architecture vs. Celery Worker Pods for High-Traffic Resilience and Performance

I am configuring a Kubernetes cluster to support a microservice with the following workflow:

Current Setup:

  • Ingress: Manages external traffic.
  • FastAPI Service: Routes requests to a FastAPI Deployment, which enqueues tasks.
  • Redis Pod: Acts as the Celery broker and handles task queuing.
  • Celery Worker Pods: Process tasks from the Redis queue and scale based on queue length.

To manage scaling, I am using KEDA to autoscale based on the length of the Redis queue.
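For reference, my current KEDA configuration looks roughly like this (the Deployment name, queue key, and thresholds are simplified placeholders, not my exact values):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: celery-worker-scaler
spec:
  scaleTargetRef:
    name: celery-worker            # Celery worker Deployment (placeholder name)
  minReplicaCount: 1
  maxReplicaCount: 20
  triggers:
    - type: redis
      metadata:
        address: redis.default.svc.cluster.local:6379
        listName: celery           # Redis list used as the Celery queue
        listLength: "10"           # target queue depth per worker replica
```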

This architecture works, but I am exploring whether a Kubernetes Job-based workflow might provide better resilience during high traffic and for batch processing — for example, replacing long-running Celery workers with Kubernetes Jobs that are spawned per task (or per batch).
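The Job-based alternative I have in mind would use something like KEDA's ScaledJob, which creates a Job per queued item instead of scaling a Deployment (image name, queue key, and thresholds below are placeholders):

```yaml
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: task-processor
spec:
  jobTargetRef:
    backoffLimit: 3                # retry a failed task pod up to 3 times
    template:
      spec:
        restartPolicy: Never
        containers:
          - name: processor
            image: my-org/task-processor:latest   # placeholder image
  pollingInterval: 10              # seconds between queue checks
  maxReplicaCount: 50
  triggers:
    - type: redis
      metadata:
        address: redis.default.svc.cluster.local:6379
        listName: tasks            # placeholder queue key
        listLength: "1"            # roughly one Job per queued task
```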

Specific Questions:

  1. Performance: How does the performance of Celery workers compare to Kubernetes Jobs in handling high-throughput scenarios?
  2. Scalability: Are Kubernetes Jobs more effective at scaling during traffic spikes compared to a KEDA+Celery worker setup?
  3. Failure Recovery and Observability: Which approach provides better fault tolerance and task retry mechanisms in the event of pod or node failures? And which allows better monitoring when failures occur?
  4. Resource Efficiency: Does using Kubernetes Jobs result in better resource utilization, considering overhead and runtime behavior?

I want to understand the trade-offs between my current workflow and a Job-based architecture in terms of scalability, resilience, and performance.

What are the key factors to consider when deciding between these two architectures? Are there benchmarks or real-world examples that highlight the differences?