Today I want to share a small but very helpful tool with you. If you ever have to download thousands of little files for later use in a parser/scraper/whatever, you will find these fewer than 60 lines of bash very useful.
Update: revised following a comment from Bob.
What is it good for?
Here at adeven we have some heavy data-processing tasks. One of them is parsing more than 30K little (<900KB) files that we download twice a day.
So we needed something to download 30 thousand files, totaling more than 20GB, in the shortest time possible. Also, in order to parse the files in one continuous stream (30K separate file handles are no fun in any language), the result should be one huge file with specified delimiters.
The “todo list” for the downloader should be one file with all the URLs separated by line breaks.
A big plus would be a program that is easy to deploy, with very few dependencies.
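For illustration, such a todo file might look like this (the URLs here are made up):

```text
http://example.com/data/file-0001.json
http://example.com/data/file-0002.json
http://example.com/data/file-0003.json
```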
To cut to the chase, this is how I did it:
(The 56-line script listing is omitted here; see the GitHub link below.)
You can also find the code here: github.com/adeven/hydra-curl, if you're happy to just download a lot of stuff really fast.
For those of you who want to know what that thing actually does, bear with me.
How does it work?
The basic idea is to use the strengths of bash by forking lots of curl processes really fast (curl being a great Linux tool for downloading).
Looking at the code, we see some parameters you can set. MAX_FORKS controls how many curls this script should try to fork. Due to the nature of bash (no communication between forks) and curl, it is not easy to maintain a steady number of processes. MIN_SIZE in line 13 sets a lower limit for your files, to catch broken downloads.
To understand the strategy, let's look at line 35 ff.
First off, we create the temporary target folder and start reading in the todo file. Then we check whether the CURRENT_TODO counter has arrived at 0 (or less). If yes (the start condition), we count the number of currently running curls via ps aux. If that count is too big, we wait 500ms and count again. When it is lower than our MAX_FORKS target, we add a new block of todos to the counter. We use blocks of todos rather than single todos because the call to ps aux takes too long if we recount after every fork.
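As a rough sketch of that block-based throttling (this is not the original script; BLOCK_SIZE and the helper names are made up for illustration):

```shell
#!/usr/bin/env bash
# Sketch of the "blocks of todos" throttle: only pay the cost of
# counting processes once per block of BLOCK_SIZE forks.
MAX_FORKS=4
BLOCK_SIZE=10
CURRENT_TODO=0

count_curls() {
  ps aux | grep -c '[c]url'   # bracket trick excludes the grep itself
}

next_block() {
  if [ "$CURRENT_TODO" -le 0 ]; then
    # Block exhausted: wait until the curl count drops below MAX_FORKS.
    while [ "$(count_curls)" -ge "$MAX_FORKS" ]; do
      sleep 0.5
    done
    CURRENT_TODO=$BLOCK_SIZE   # grant a fresh block of forks
  fi
  CURRENT_TODO=$((CURRENT_TODO - 1))
}

# Each iteration spends one todo from the block, forking without re-counting.
for i in 1 2 3; do
  next_block
  echo "would fork curl for item $i (remaining in block: $CURRENT_TODO)"
done
```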
Now that we have todos in our counter, we skip the ps aux part and fork a function called download. This is just a small wrapper around curl that ensures our downloads have a certain minimum size (some webservers deliver empty files from time to time). It also attaches our delimiter to every downloaded file for later concatenation.
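A minimal sketch of such a wrapper (the variable names and delimiter are illustrative, and the demo fetches a local file:// URL so it runs without a network):

```shell
#!/usr/bin/env bash
# Sketch of the download wrapper: fetch with curl, discard files below
# MIN_SIZE (empty 200 responses / broken downloads), append a delimiter.
MIN_SIZE=5                      # bytes
DELIMITER="-----FILE-END-----"

download() {
  local url=$1 out=$2
  curl -s -o "$out" "$url" || return 1
  if [ "$(wc -c < "$out")" -lt "$MIN_SIZE" ]; then
    rm -f "$out"                # too small: treat as a broken download
    return 1
  fi
  printf '%s\n' "$DELIMITER" >> "$out"   # delimiter for later concatenation
}

# Demo: "download" a local file via curl's file:// protocol.
src=$(mktemp)
echo "payload data" > "$src"
out=$(mktemp)
download "file://$src" "$out"
tail -n 1 "$out"
```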
Once everything is downloaded, we cat it all into one big target file and clean up after ourselves.
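That final step can be sketched like this (with stand-in part files in place of real downloads):

```shell
#!/usr/bin/env bash
# Sketch of the final step: cat all downloaded parts into one big
# target file, then remove the temporary folder.
TMP_DIR=$(mktemp -d)
TARGET=$(mktemp)

# Stand-ins for downloaded parts (each already ends with its delimiter).
printf 'part one\n-----FILE-END-----\n' > "$TMP_DIR/part_0"
printf 'part two\n-----FILE-END-----\n' > "$TMP_DIR/part_1"

cat "$TMP_DIR"/part_* > "$TARGET"   # one continuous stream for the parser
rm -rf "$TMP_DIR"                   # clean up after ourselves
```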
Thanks to a comment from Bob, I adapted the script a little. Although wget would work fine in many other cases, we need to check for a minimum file size, since our source sometimes delivers status 200 with almost empty answers/broken content.
So I adopted the idea of using xargs and mixed it with my download function. Now it's much nicer to your CPU and stays closer to the MAX_FORKS target.
(The updated 38-line script listing is omitted here; see github.com/adeven/hydra-curl.)
That makes it less than 40 lines now.
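The xargs approach can be sketched roughly like this (again not the original script; the demo feeds local file:// URLs so it runs without a network):

```shell
#!/usr/bin/env bash
# Sketch of the xargs variant: xargs -P manages the pool of parallel
# curl workers, so no manual ps-based process counting is needed.
MIN_SIZE=5
DELIMITER="-----FILE-END-----"
TMP_DIR=$(mktemp -d)
export MIN_SIZE DELIMITER TMP_DIR

# Demo todo list built from local file:// URLs.
echo "first payload" > "$TMP_DIR/a.txt"
echo "second payload" > "$TMP_DIR/b.txt"
printf 'file://%s\nfile://%s\n' "$TMP_DIR/a.txt" "$TMP_DIR/b.txt" > "$TMP_DIR/todo.txt"

# Number each URL so every worker gets a "<n> <url>" pair and a unique
# part file. -P 4: up to four parallel workers; -n 2: one pair each.
nl -ba "$TMP_DIR/todo.txt" | xargs -P 4 -n 2 sh -c '
  n=$1 url=$2
  out="$TMP_DIR/part_$n"
  curl -s -o "$out" "$url" || exit 1
  [ "$(wc -c < "$out")" -ge "$MIN_SIZE" ] || { rm -f "$out"; exit 1; }
  printf "%s\n" "$DELIMITER" >> "$out"
' sh

cat "$TMP_DIR"/part_* > "$TMP_DIR/all.txt"   # one big file for the parser
```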
Btw. this downloads 21GB in less than 25 minutes ;)
That’s it, I hope you enjoyed the read.