A Massively Parallel Downloader in Bash

Today I want to share a small but very helpful tool with you. If you ever have to download thousands of little files for later use in a parser/scraper/whatever, you will find these fewer than 60 lines of bash very useful.

Updated following a suggestion from Bob (see the update below).

What is it good for?

Here at adeven we have some heavy data processing tasks. One of them is parsing more than 30K little (<900KB) files that we download twice a day.

So we needed something to download 30 thousand files totalling more than 20GB in the shortest time possible. Also, in order to parse the files in one continuous stream (30K separate file handles are no fun in any language), the result should be one huge file with specified delimiters.

The “todo list” for the downloader should be one file with all the URLs separated by line breaks.
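
For illustration, such a todo file might look like this (the URLs are made up):

http://example.com/export/part-0001.json
http://example.com/export/part-0002.json
http://example.com/export/part-0003.json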

A big plus would be a program that is easy to deploy, with very few dependencies.

Enter hydra-curl

To cut to the chase, this is how I did it:

#!/bin/bash
# usage: ./hydra-curl.sh <todo-file> <target-file>
MAX_FORKS=100
TODO_FILE=$1
TARGET=$2
DOWNLOAD_FOLDER="/tmp/hydra-curl"
TIME_STAMP=$(date +%s)
let "BLOCK_SIZE = MAX_FORKS/5"
CURRENT_TODO=0
COUNTER=0

# download <url-with-cache-buster> <output-file>
# retries until the answer reaches MIN_SIZE or MAX_RETRIES is hit
function download {
  ANSWER_SIZE=0
  MIN_SIZE=10000
  MAX_RETRIES=5
  DELIMITER="###DELIMITER###"
  ANSWER=""
  TRIES=0
  while [ "$TRIES" -lt "$MAX_RETRIES" ] && [ "$ANSWER_SIZE" -lt "$MIN_SIZE" ]
    do
      # append the retry counter to the cache buster so every retry gets a fresh URL
      ANSWER=$(curl --retry 3 -s "${1}0${TRIES}" --max-time 60)
      if [ "$?" -ne "0" ]; then
        ANSWER=""
      else
        ANSWER_SIZE=${#ANSWER}
      fi
      let "TRIES += 1"
    done

  # only keep answers that reached the minimum size; attach the delimiter for later concatenation
  if [ "$ANSWER_SIZE" -ge "$MIN_SIZE" ]; then
    ANSWER=$ANSWER"$DELIMITER"
    echo "$ANSWER" > "$2"
  fi
}

mkdir -p "$DOWNLOAD_FOLDER"

# read all URLs from the todo file into an array
filecontent=( $(cat "$TODO_FILE") )

for t in "${filecontent[@]}"
do
  # once the current block of todos is used up, wait until the number of
  # running curls drops below MAX_FORKS, then hand out a new block
  if [ "$CURRENT_TODO" -le 0 ]; then
    while [ $(ps aux | grep curl | wc -l) -ge "$MAX_FORKS" ]
     do
       sleep 0.5
     done
    let "CURRENT_TODO += BLOCK_SIZE"
  fi
  # fork one download per URL; the timestamp acts as a cache buster
  download "$t?$TIME_STAMP" "$DOWNLOAD_FOLDER/$COUNTER" &
  let "COUNTER += 1"
  let "CURRENT_TODO -= 1"
done

# wait for the remaining forks to finish
wait

# concatenate everything into one big file and clean up
cat "$DOWNLOAD_FOLDER"/* > "$TARGET"
rm -rf "$DOWNLOAD_FOLDER"

You can also find the code here: github.com/adeven/hydra-curl.

If you’re happy to just download a lot of stuff really fast, that’s all you need to know.
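
Assuming you saved the script as hydra-curl.sh (the file names here are just examples), a run looks like this:

chmod +x hydra-curl.sh
./hydra-curl.sh urls.txt /data/all_answers.dump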

Have fun.

For those of you who want to know what that thing actually does, bear with me.

How does it work?

The basic idea is to use bash’s strength at forking processes really quickly to run many instances of curl (a great Linux tool for downloading) in parallel.
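
Stripped of all bookkeeping, the pattern is just background jobs plus wait; a minimal sketch with made-up URLs:

for url in http://example.com/a http://example.com/b
do
  curl -s "$url" > "/tmp/$(basename "$url")" &   # fork one curl per URL
done
wait                                             # block until all forks are done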

So if we look at the code, we see some parameters you can set. MAX_FORKS controls how many curls this script should try to fork. Due to the nature of bash (no communication between forks) and curl, it is not easy to maintain a steady number of processes. MIN_SIZE in the download function sets a lower limit on the size of each file to catch broken downloads.
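
As a side note: if you write the response straight to disk instead of a variable, curl can report the downloaded size itself. A sketch of that alternative check (URL and file names are placeholders, not what the script above does):

SIZE=$(curl -s -o "$OUT_FILE" -w '%{size_download}' --max-time 60 "$URL")
if [ "$SIZE" -lt "$MIN_SIZE" ]; then
  echo "broken download: $URL" >&2   # too small, treat as failed
fi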

So to understand the strategy, let’s look at the main loop below the download function.

First off we create the temporary target folder and start reading in the todo file. Then we check whether the CURRENT_TODO counter has arrived at 0 (or less). If yes (the start condition), we count the number of currently running curls using ps aux. If it’s too big we wait for 500ms and count again. When it’s lower than our MAX_FORKS target we add a new block of todos to the counter. We use blocks of todos and not single todos since the call to ps aux takes too long if we recount after every fork.
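
By the way, ps aux | grep curl also counts the grep process itself; counting with pgrep is a bit cleaner. A small sketch of the same throttle, not what the script above uses:

while [ "$(pgrep -c curl)" -ge "$MAX_FORKS" ]   # number of running curl processes
do
  sleep 0.5
done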

So now that we have todos in our counter we skip the ps aux part and fork a function called download. This is just a small wrapper around curl that ensures our downloads have a certain minimum size (some web servers deliver empty files from time to time). We also attach our delimiter to every downloaded file for later concatenation.
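
The delimiter is what later lets the parser treat the one big file as separate records again. With GNU awk (which accepts a multi-character record separator) you could split it roughly like this; the file name is just a placeholder:

gawk 'BEGIN { RS = "###DELIMITER###" }
      NF    { print "record", NR, "has", length($0), "characters" }' all_answers.dump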

Once everything is downloaded we cat it all into one big target file and clean up after ourselves.

Update

Thanks to a comment from Bob I adapted the script a little. Although wget would work fine in many other cases, we need to check for a minimum file size, since our source sometimes delivers status 200 with almost empty answers/broken content.

So I adopted the idea of using xargs and mixed it with my download function. Now it’s much nicer to your CPU and keeps closer to the MAX_FORKS target.
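
Before the full script, here is a minimal sketch of the pattern: an exported function run in parallel by xargs -P (urls.txt and the demo function are placeholders):

demo() { echo "item $1 -> $2"; }
export -f demo
cat -n urls.txt | xargs -P 4 -n 2 bash -c 'demo "$@"' _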

#!/bin/bash
# usage: ./hydra-curl.sh <todo-file> <target-file>
MAX_FORKS=100
TODO_FILE=$1
TARGET=$2
DOWNLOAD_FOLDER="/tmp/hydra-curl"

# curl_download <line-number> <url> <download-folder>
# retries until the answer reaches MIN_SIZE or MAX_RETRIES is hit
function curl_download {
  ANSWER_SIZE=0
  MIN_SIZE=10000
  MAX_RETRIES=5
  DELIMITER="###DELIMITER###"
  ANSWER=""
  TRIES=0
  while [ "$TRIES" -lt "$MAX_RETRIES" ] && [ "$ANSWER_SIZE" -lt "$MIN_SIZE" ]
    do
      # the line number plus the retry counter acts as a cache buster
      ANSWER=$(curl --retry 3 -s "$2?$1$TRIES" --max-time 60)
      if [ "$?" -ne "0" ]; then
        ANSWER=""
      else
        ANSWER_SIZE=${#ANSWER}
      fi
      let "TRIES += 1"
    done

  # only keep answers that reached the minimum size; attach the delimiter for later concatenation
  if [ "$ANSWER_SIZE" -ge "$MIN_SIZE" ]; then
    ANSWER=$ANSWER"$DELIMITER"
    echo "$ANSWER" > "$3/$1"
  fi
}

# make the function visible to the bash instances spawned by xargs
export -f curl_download

mkdir -p "$DOWNLOAD_FOLDER"

# cat -n prefixes every URL with its line number; xargs keeps up to MAX_FORKS
# downloads running in parallel, passing "<number> <url> <folder>" to curl_download
cat -n "$TODO_FILE" | xargs -P "$MAX_FORKS" -n 3 -I{} bash -c curl_download\ \{\}\ "$DOWNLOAD_FOLDER"

# concatenate everything into one big file and clean up
cat "$DOWNLOAD_FOLDER"/* > "$TARGET"
rm -rf "$DOWNLOAD_FOLDER"

That makes it less than 40 lines now.

Btw. this downloads 21GB in less than 25 minutes ;)

That’s it, I hope you enjoyed the read.

Have fun.
