We have a database of around 200,000 webpages that we’d like to crawl at recurring intervals. If each webpage is scheduled as a recurring job in Hangfire, and each job takes around 4 seconds to complete, then only about 300 jobs can be completed per minute on a single quad-core server with 20 workers (20 workers ÷ 4 seconds per job = 5 jobs per second).
Now, is Hangfire a suitable solution for something like this (i.e. crawling webpages at set intervals)? If so, how can the throughput be improved (say, from a few hundred fetch jobs per minute to 5,000), besides adding more servers?
Any suggestions or advice would be much appreciated.
I wouldn’t recommend creating 200,000 recurring jobs, simply because they are very hard to manage. Instead, you can have one or several recurring jobs, based on your check intervals, that create a swarm of other background jobs aimed at crawling websites (I hope you are not a spammer ;)).
Website crawling is a two-step process: fetch data, then process data. The first step is I/O-bound, the second is CPU-bound. For I/O-bound operations you can always increase the number of workers, since most of the time they will be waiting for an I/O result (async workers are not supported yet). However, this also increases context switching, which may degrade the performance of the second phase.
But there are a number of tweaks to improve this process. For example, run two servers that listen on different queues: one fetches the data and puts it, say, into Redis; the other processes the data stored in Redis (you can add a continuation to a job from the fetch stage). The first server would have a large number of workers, the second a small number.
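A rough sketch of that two-queue setup, using Hangfire’s `[Queue]` attribute for routing and `ContinueJobWith` for the continuation (the class, method, and queue names here are illustrative, not from the original post):

```csharp
public class Crawler
{
    [Queue("fetch")]
    public void FetchSite(int siteId) { /* HTTP GET, store raw page in Redis */ }

    [Queue("process")]
    public void ProcessSite(int siteId) { /* parse the page stored in Redis */ }
}

// Each server listens only on its own queue:
var fetchServer = new BackgroundJobServer(
    new BackgroundJobServerOptions { Queues = new[] { "fetch" } });
var processServer = new BackgroundJobServer(
    new BackgroundJobServerOptions { Queues = new[] { "process" } });

// The processing job runs only after the fetch job for that site succeeds;
// it is routed to the "process" queue by the attribute on ProcessSite.
var fetchId = BackgroundJob.Enqueue<Crawler>(c => c.FetchSite(siteId));
BackgroundJob.ContinueJobWith<Crawler>(fetchId, c => c.ProcessSite(siteId));
```

(In older Hangfire versions the continuation method is called `ContinueWith` rather than `ContinueJobWith`.)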
Not a spammer, just trying to come up with a crawling solution for a client/retailer.
Do you mean running one-off background jobs to fetch the webpages, spawned by the main recurring job(s)? Or do you mean just spawning background threads in C# (via Parallel.ForEach or async/await)?
I suppose concurrent processing will still be limited by the number of cores, right? If the server has 4 cores, then only 4 workers will be able to run simultaneously.
On a past project we used Sidekiq and were able to scale it to a few thousand background jobs (i.e. fetching URLs). Is Hangfire capable of a similar volume? If not, do you think Quartz would be a more suitable alternative in this case?
public void CrawlWebsites()
{
    // You could also use Parallel.ForEach, but that takes
    // ThreadPool threads and increases connection usage (you may
    // need to increase the connection pool size).
    foreach (var siteId in _siteIds)
    {
        BackgroundJob.Enqueue(() => CrawlSite(siteId));
    }
}
Hangfire creates a dedicated thread for each worker. Some threads may sleep while waiting for the completion of an I/O operation (an HTTP GET, for example). You can think of Hangfire as a Sidekiq for the .NET world: they both use message queues and thus share the same scaling techniques.
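Since those worker threads mostly sleep on I/O during fetches, you can configure far more workers than cores; a minimal sketch (the multiplier here is an illustrative guess, not a recommendation):

```csharp
var options = new BackgroundJobServerOptions
{
    // Workers are plain threads that spend most of their time
    // waiting on HTTP responses, so the count can safely exceed
    // the number of cores on an I/O-bound fetch server.
    WorkerCount = Environment.ProcessorCount * 25 // e.g. 100 on a quad-core box
};

using (var server = new BackgroundJobServer(options))
{
    Console.WriteLine("Hangfire server started. Press any key to exit...");
    Console.ReadKey();
}
```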
I’m not sure what you’ve suggested will work for me. I want to be able to fetch the webpages at separate intervals: some every 10 minutes, some every 60 minutes, some every 24 hours, and so on. Also, looping through thousands of sites (as in your example) will be a synchronous/blocking operation, and the whole point of fetching the webpages in the background is to make it asynchronous/non-blocking.
Ideally, I just want to be able to do this (not necessarily in a loop) from the client app:
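Something along these lines, say, with one recurring job per crawl interval (the job ids, cron expressions, the `CrawlInterval` enum, and the `CrawlSites` method are only illustrative):

```csharp
enum CrawlInterval { TenMinutes, Hourly, Daily }

// One recurring job per interval bucket; each run fans out
// one background job per site in that bucket.
RecurringJob.AddOrUpdate("crawl-every-10-mins",
    () => CrawlSites(CrawlInterval.TenMinutes), "*/10 * * * *");
RecurringJob.AddOrUpdate("crawl-every-60-mins",
    () => CrawlSites(CrawlInterval.Hourly), Cron.Hourly());
RecurringJob.AddOrUpdate("crawl-every-24-hours",
    () => CrawlSites(CrawlInterval.Daily), Cron.Daily());
```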
Is it possible to ensure that recurring or background jobs are not processed simultaneously (using some sort of locking), so that if the example ‘CrawlWebsites’ method has been running for more than 10 minutes, the ‘crawl-every-10-mins’ job isn’t triggered again until the previous run has completed?
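For what it’s worth, Hangfire ships a `DisableConcurrentExecution` filter that takes a distributed lock for the duration of the job; a sketch (the timeout value is an arbitrary example):

```csharp
// While one instance holds the lock, a newly triggered run waits for it
// (and fails with a timeout exception if it cannot acquire the lock
// within the given number of seconds) rather than running concurrently.
[DisableConcurrentExecution(timeoutInSeconds: 10 * 60)]
public void CrawlWebsites()
{
    // ... fan out per-site background jobs ...
}
```

Note that the filter serializes runs rather than silently skipping them, so an overdue trigger is delayed (or fails), not dropped.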