Your idea is interesting, but I wonder if in the long run you’d end up reinventing something like Celery anyway.
Celery describes itself as a distributed task queue. It provides several capabilities:
- distributing workloads to workers (via an external broker such as RabbitMQ)
- queuing up jobs that are not immediately assignable to workers (more jobs than workers)
- a mechanism to return results from workers to the requester
- job persistence
- job state tracking (and I think it can retry jobs whose worker failed; see the sketch below)
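For concreteness, here’s a minimal sketch of what those capabilities look like from the application side. The broker/backend URLs and the `resize_image` task are placeholders I made up for illustration:

```python
from celery import Celery

# The broker distributes work to workers; the result backend stores
# task state and return values. (URLs are illustrative placeholders.)
app = Celery(
    "tasks",
    broker="amqp://guest@localhost//",
    backend="redis://localhost:6379/0",
)

class TransientError(Exception):
    """Stand-in for a recoverable failure (network blip, etc.)."""

def do_resize(url):
    """Stand-in for the actual work."""
    return f"resized {url}"

# bind=True gives the task access to `self` so it can request a retry;
# max_retries caps how many times a failed run is re-attempted.
@app.task(bind=True, max_retries=3)
def resize_image(self, url):
    try:
        return do_resize(url)
    except TransientError as exc:
        raise self.retry(exc=exc, countdown=5)

# Caller side: .delay() enqueues the job and returns immediately.
result = resize_image.delay("https://example.com/cat.png")
print(result.state)            # PENDING, STARTED, RETRY, SUCCESS, ...
print(result.get(timeout=30))  # blocks until the worker finishes
```

That’s queuing, distribution, result return, state tracking, and retries in a couple of dozen lines, with durability delegated to the broker.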
Your proposed alternative could work for some types of workloads, but I think it will fall short on long-running tasks, or when tasks arrive faster than you can scale up the number of handlers. There’s also some limit to how many handlers you can run. And what process will scale them back down?
Where will jobs be queued while waiting for a handler? I suppose a service mesh can accept some backlogged connections, but how many? Are those waiting jobs durable? Will they survive a restart of the mesh proxy, or the node?
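For comparison, Celery pushes the durability question down to the broker and exposes explicit knobs for it. A sketch of the usual settings (defaults vary by version, so treat this as illustrative):

```python
from celery import Celery

app = Celery("tasks", broker="amqp://guest@localhost//")

# Acknowledge a message only after the task finishes, so a worker crash
# puts the job back on the queue instead of silently dropping it.
app.conf.task_acks_late = True

# Requeue tasks whose worker process died mid-run.
app.conf.task_reject_on_worker_lost = True

# The queues Celery declares are durable by default, so backlogged jobs
# can survive a broker restart (given persistent messages broker-side).
```

A mesh proxy’s connection backlog gives you none of that for free.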
Celery tasks are async, but I think an HTTP/gRPC service mesh is mostly for synchronous calls. How long will the service mesh wait for a response from the handler before timing out?
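To make the call-semantics difference concrete (the mesh-routed endpoint here is hypothetical, and `resize_image` is the task from the earlier sketch):

```python
import requests

from tasks import resize_image  # the Celery task sketched above

# Synchronous: the caller holds a connection open for the duration of the
# job and eats the timeout risk. A ten-minute job means a ten-minute open
# connection, or a gateway timeout, whichever comes first.
resp = requests.post(
    "http://resize-service/jobs",  # hypothetical mesh-routed endpoint
    json={"url": "https://example.com/cat.png"},
    timeout=30,  # this client (and likely the mesh) gives up after 30s
)

# Asynchronous: hand the broker a message and return immediately;
# the worker can take as long as it needs.
async_result = resize_image.delay("https://example.com/cat.png")
```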
What process or controller is going to autoscale your handlers? One way would be to send all requests through a single process so that requests per second can be counted, time waiting in the queue tracked, and so on. Is that process/controller going to retain state for jobs not yet handled by workers? Will it store that state so it can survive a restart without losing jobs?
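For what it’s worth, the single-process controller you’d end up writing already looks like a small piece of a task queue. A hypothetical sketch, assuming the backlog lives in a Redis list, with the scaling heuristic and the orchestrator call both invented:

```python
import time

import redis  # assumes the job backlog lives in a Redis list

r = redis.Redis(host="localhost", port=6379)

QUEUE = "pending-jobs"  # hypothetical queue name
JOBS_PER_HANDLER = 10   # invented heuristic: backlog per handler
MIN_HANDLERS, MAX_HANDLERS = 1, 50

def desired_handlers(backlog: int) -> int:
    want = -(-backlog // JOBS_PER_HANDLER)  # ceiling division
    return max(MIN_HANDLERS, min(MAX_HANDLERS, want))

def scale_handlers_to(n: int) -> None:
    # Placeholder: in reality this would call your orchestrator's API,
    # e.g. patching a Deployment's replica count.
    print(f"scaling handlers to {n}")

while True:
    # Backlog state lives in Redis, not in this process, so the
    # controller itself can restart without losing jobs.
    backlog = r.llen(QUEUE)
    scale_handlers_to(desired_handlers(backlog))
    time.sleep(5)
```

Notice that the moment you put the backlog in Redis to survive restarts, you’ve adopted an external broker, which is half of Celery already.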
Don’t forget day-2 operations: upgrading handlers, adding new attributes to jobs, supporting new types of jobs.
(oops, I forgot to submit this comment last week)