I’m trying to find a solution so that, instead of limiting my servers to a certain worker count, I can leave the worker count unlimited, measure the average CPU usage, and then spin up a new machine to distribute the jobs between them.
The problem is that I can’t find a way to do this using only what Hangfire provides out of the box. Can someone help me with the questions below?
Considering that I set my server’s worker count to 9999 (a number high enough that it can never be reached), I already know I will have a huge number of jobs running beyond the capacity of my CPU. How can I then redistribute them, given that the jobs are already enqueued or scheduled to start in a predetermined queue?
Considering that I limit my server to a worker count of 5, I know that concurrent processing will be limited to that amount, but the jobs will already have been enqueued on that same server based on a queue name. How can I handle this limit without interrupting jobs already in progress or already enqueued?
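For context, this is the kind of setup I mean, a minimal sketch using Hangfire’s standard .NET server options (queue name and worker count are just examples):

```csharp
using Hangfire;

// The worker count is fixed when the BackgroundJobServer starts;
// it cannot be changed for a running server.
var options = new BackgroundJobServerOptions
{
    WorkerCount = 5,                 // the limit in question
    Queues = new[] { "default" }     // all jobs land in one named queue
};

using var server = new BackgroundJobServer(options);
// Jobs enqueued with BackgroundJob.Enqueue(...) are pulled by these 5 workers.
```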
The first option could have far too many jobs running at a time. If you had 1000 jobs running at the same time, they would all be processing concurrently, with the system constantly time-slicing the CPU between them. Each job could take much longer to finish, causing a massive backlog. I would not recommend this option.
The question is how scaling works on your end. In AWS ECS you can specify CPU targets; if a target is hit, a new instance is spun up. If you have everything going to the same queue, Hangfire can already manage that: Hangfire will pull a job while ensuring no other running instance also gets that job.
Using dynamic queues seems unnecessary. It’s just a matter of finding the number of workers that maxes out the CPU. This also assumes a job is CPU bound and jobs are similar in nature (not waiting on external resources, etc.).
If you can specify or handle custom rules for spinning up a machine, you could also look at the number of jobs in the queue: if it is high, spin up another instance; when it gets low, power the instance down.
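The queue-length check above can be sketched with Hangfire’s monitoring API. This is a rough illustration; the thresholds are arbitrary, and `ScaleOut`/`ScaleIn` are hypothetical hooks into whatever your cloud provider exposes:

```csharp
using Hangfire;

// Poll the queue depth and decide whether to scale.
var monitoring = JobStorage.Current.GetMonitoringApi();
long enqueued = monitoring.EnqueuedCount("default");

if (enqueued > 500)
{
    // ScaleOut();  // hypothetical: e.g. bump the ECS service's desired count
}
else if (enqueued < 50)
{
    // ScaleIn();   // hypothetical: power an instance down once the backlog drains
}
```

Run this on a timer (or feed the count to a CloudWatch custom metric) and let the scaling rule react to it.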
@aschenta, first of all thanks for the help!
I understand that creating multiple dynamic queues is unnecessary for what I’m looking for, and so are long-running processes.
My problem is that, before I can start a new server (EC2, Azure, etc.), all jobs have already been picked up by the only current server. This happens because I purposely left my worker count unlimited so that I could reach the maximum possible CPU usage, considering that each of my jobs consumes a different amount of CPU and it would be impossible to calculate a fixed number of jobs to limit per server.
On the one hand, each server uses the maximum of what it is capable of, and I can create a dynamic rule to bring up a new server when CPU > 70%. But this is not working, because my 1000 jobs that have already been fired are all being processed by the unlimited workers on the current server. When I bring up a new server, the jobs don’t rebalance; they all end up failing due to the overload of the 1000 jobs running on the first server.
I can’t think of a solution for this.
Maybe I have to invest in some technique to move jobs already picked up by one server over to another?
Jobs aren’t assigned to a server; they are actively being processed by one. If they fail, another server can pick them up on retry. You would be telling Hangfire to abort the processing of running jobs, which doesn’t sound like a good idea.
I would consider implementing a messaging, unit-of-work system: move all short-lived work to handlers and leave only the long-running jobs in Hangfire. You could also have a fast queue and a slow queue in Hangfire. This may help you achieve higher CPU usage without needing a 9999 worker count.
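The fast/slow split can be expressed with two servers, each listening on its own queue. A sketch, assuming the queue names "fast" and "slow" and illustrative worker counts:

```csharp
using System;
using Hangfire;

// One server dedicated to short-lived jobs, another to long-running ones.
// Queue names and worker counts here are examples, not recommendations.
var fastServer = new BackgroundJobServer(new BackgroundJobServerOptions
{
    ServerName = "fast-worker",
    Queues = new[] { "fast" },
    WorkerCount = Environment.ProcessorCount * 5  // many workers for cheap jobs
});

var slowServer = new BackgroundJobServer(new BackgroundJobServerOptions
{
    ServerName = "slow-worker",
    Queues = new[] { "slow" },
    WorkerCount = 2  // few workers for heavy, CPU-bound jobs
});

// Route jobs by decorating the job method with [Queue("fast")] or [Queue("slow")].
```

With this split you can scale the slow-queue servers on CPU while the fast queue keeps draining, instead of one unlimited pool fighting over the same machine.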