I didn't expect the Advent calendar in our laboratory to end safely. This time I will write about Splash's memory problem.
** If Splash eats too much memory it crashes on its own, so restart it **
I previously wrote an article saying that Scrapy is good. But Scrapy has one problem: as you would expect, it does not execute JavaScript. A tool called Splash is often used with Scrapy when scraping pages that rely on JavaScript.
Splash is a server that renders JavaScript. By calling its web API, you can get the source of a given page after its JavaScript has run. Splash itself is written in Python and appears to use Twisted and Qt 5.
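As a rough illustration of that web API, here is a minimal sketch that asks Splash's render.html endpoint for a page after its JavaScript has run. It assumes Splash is listening on localhost:8050; the names SPLASH_URL, build_render_url and fetch_rendered_html are made up for this example.

```python
from urllib.parse import urlencode
from urllib.request import urlopen

# Assumption: Splash is reachable on its default port on this machine.
SPLASH_URL = "http://localhost:8050"


def build_render_url(target_url, wait=0.5, splash_url=SPLASH_URL):
    """Build the request URL for Splash's render.html endpoint."""
    # "url" is the page to render; "wait" is how many seconds Splash
    # waits after the page loads before returning the HTML.
    query = urlencode({"url": target_url, "wait": wait})
    return f"{splash_url}/render.html?{query}"


def fetch_rendered_html(target_url, wait=0.5, timeout=30):
    """Return the page source after Splash has executed its JavaScript."""
    with urlopen(build_render_url(target_url, wait), timeout=timeout) as resp:
        return resp.read().decode("utf-8", errors="replace")
```

With a running Splash instance, fetch_rendered_html("https://example.com") would return the rendered HTML as a string. In a real Scrapy project you would more likely go through the scrapy-splash plugin than call the endpoint by hand.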
As convenient as Splash is, it has one big problem: it consumes a lot of memory. The figure below is a graph of Splash's memory consumption. The more requests you send, the more memory it uses. At this rate, no matter how much memory the machine has, it is gone in the blink of an eye. There is a similar issue on GitHub, and it seems that, because of how Python manages memory, the memory cannot be released. https://github.com/scrapinghub/splash/issues/674
I have not looked into the Python side of the problem. If anyone knows more, please let me know. As mentioned in the issue above, the only way to free the memory is to stop Splash once. Stopping it by hand every time is not practical, so it has to happen automatically. Writing a script that restarts it every few minutes with cron would work, but that is a hassle. What should you do in such a case?
** Wait until it crashes from lack of memory, then let it restart automatically **
It's not a very smart solution, but it's the easiest.
This assumes Splash is running in a Docker container managed with docker-compose.
First, set an upper limit on the memory that Splash's Docker container can use. With docker-compose, use file format version 2, because mem_limit was removed in version 3.
Also add restart: always so that the container restarts automatically when it is killed.
version: "2"
services:
  splash:
    image: "scrapinghub/splash:3.3"
    ports:
      - "8050:8050"
    mem_limit: 2g
    restart: always
    command: --disable-browser-caches --maxrss 4000
Now Splash cannot eat more memory than the limit you set, and it comes back up by itself after it is killed.
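One side effect of this setup is that any request sent while the container is restarting will fail, so the client side probably needs to retry. Below is a minimal sketch of a retry wrapper with exponential backoff. The name call_with_retry and its default values are assumptions for illustration, not part of Splash or Scrapy.

```python
import time


def call_with_retry(func, attempts=5, delay=1.0, backoff=2.0,
                    exceptions=(OSError,), sleep=time.sleep):
    """Call func(), retrying on the given exceptions with growing delays.

    OSError covers refused connections and urllib's URLError, which is
    what a client typically sees while Splash is down.
    """
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the error
            sleep(delay)      # give the container time to come back
            delay *= backoff  # wait longer before the next attempt
```

You would wrap whatever function sends the request to Splash, for example call_with_retry(lambda: send_request(url)), where send_request is your own function. If you use Scrapy's built-in retry middleware, that may already be enough and a wrapper like this is unnecessary.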
As I said earlier, it's not a very smart solution. Please let me know if there is a smarter and easier way to do it.