Cloud being used: Google Cloud

I’m developing a Python project that will be hosted on Kubernetes on Google Cloud. The idea is to read a file of millions of rows, where each row is the input key for a query against an API.

I want to run my application on several Kubernetes pods for scalability, that is, multiple queries running at the same time. However, with this code structure (reading lines from a txt file), each pod would end up iterating over the file from the beginning, re-reading lines that have already been queried, which is not what I want. Two ideas came up:

  1. Split the file and distribute the splits among the pods (example: for a 100-line file with 10 pods, each pod would read 10 lines).
  2. Before running the application, load the lines of the file into a consumption queue, so that all pods read from the queue rather than from the file directly.

Option 2 seems more scalable and faster to me, but I would like suggestions on the best way to run queries that use a file as their source. I may want to run, for example, 1 million queries in 24 hours.

Hi @Guilherme_Duarte

I prefer the second option: since you are running on GCP, you can have one pod that reads the file and publishes each line to a Pub/Sub topic, and other pods that consume lines from that Pub/Sub subscription.
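
As a rough sketch of the publisher side, assuming a hypothetical project id `my-project`, topic `query-keys`, and input file `keys.txt` (one key per line) — not names from your setup — the loader pod could look something like this with the `google-cloud-pubsub` client:

```python
from concurrent import futures
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"   # assumed project id
TOPIC_ID = "query-keys"     # assumed topic name
INPUT_FILE = "keys.txt"     # assumed input file, one key per line

publisher = pubsub_v1.PublisherClient()
topic_path = publisher.topic_path(PROJECT_ID, TOPIC_ID)

publish_futures = []
with open(INPUT_FILE) as infile:
    for line in infile:
        key = line.strip()
        if not key:
            continue
        # Each line becomes one Pub/Sub message; the client batches sends internally.
        publish_futures.append(publisher.publish(topic_path, key.encode("utf-8")))

# Block until every message has been accepted by Pub/Sub.
futures.wait(publish_futures, return_when=futures.ALL_COMPLETED)
print(f"Published {len(publish_futures)} keys to {topic_path}")
```

For millions of lines you would probably want to wait on the futures in chunks instead of keeping them all in memory, but the overall flow is the same.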

You can also run the consumer pods as Kubernetes Jobs, which is a typical use case for them: Jobs | Kubernetes. A sketch of what such a worker could do is below.
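
A minimal sketch of a worker that fits the Job model, assuming a hypothetical subscription `keys-sub` attached to the topic above and a placeholder `query_api` function standing in for your real API call: it pulls batches synchronously, acknowledges them after processing, and exits once the backlog appears drained, so the Job completes on its own.

```python
from google.api_core import exceptions
from google.cloud import pubsub_v1

PROJECT_ID = "my-project"     # assumed project id
SUBSCRIPTION_ID = "keys-sub"  # assumed subscription name

subscriber = pubsub_v1.SubscriberClient()
subscription_path = subscriber.subscription_path(PROJECT_ID, SUBSCRIPTION_ID)


def query_api(key: str) -> None:
    """Placeholder for the real API call made with each key."""
    ...


with subscriber:
    while True:
        try:
            # Pull up to 100 messages at a time.
            response = subscriber.pull(
                request={"subscription": subscription_path, "max_messages": 100},
                timeout=30,
            )
        except exceptions.DeadlineExceeded:
            break  # nothing arrived within the timeout: assume the queue is drained

        if not response.received_messages:
            break

        ack_ids = []
        for received in response.received_messages:
            query_api(received.message.data.decode("utf-8"))
            ack_ids.append(received.ack_id)

        # Acknowledge only after the keys were processed successfully,
        # so unprocessed keys are redelivered if the pod dies mid-batch.
        subscriber.acknowledge(
            request={"subscription": subscription_path, "ack_ids": ack_ids}
        )
```

With the Job's `parallelism` set to the number of workers you want, each pod runs this loop independently and Pub/Sub takes care of distributing the keys among them.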
