Download a List of Files with wget on Mac


Apr 17, 2020. The wget command can be used to download files using the Linux, Mac, and Windows command lines. Wget can download single files, lists of URLs (wget -i filelist.txt, where filelist.txt contains the list of URLs you want to download), and even entire websites with their accompanying files. When downloading a single file, wget's -O option specifies the output file name.

The wget command is an internet file downloader that can download anything from files and web pages all the way through to entire websites.

Basic Usage

The wget command is in the format of:
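The general form, with bracketed placeholders standing in for your own options and URLs, is:

```shell
wget [option]... [URL]...
```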

For example, in its most basic form, you would write a command something like this:
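Matching the description that follows (www.domain.com and filename.zip are placeholder names):

```shell
wget http://www.domain.com/filename.zip
```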

This will download the filename.zip file from www.domain.com and place it in your current directory.

Redirecting Output

The -O option sets the output file name. If the file was called filename-4.0.1.zip and you wanted to save it directly to filename.zip you would use a command like this:
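A sketch of that command (placeholder URL and filenames):

```shell
wget -O filename.zip http://www.domain.com/filename-4.0.1.zip
```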

The wget program can operate on many different protocols with the most common being ftp:// and http://.

Downloading in the background

If you want to download a large file and close your connection to the server you can use the command:
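A sketch of such a command, using a placeholder URL: -b backgrounds the download, -q suppresses output, and -c lets it resume if interrupted.

```shell
wget -bqc http://www.domain.com/filename.zip
```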

Downloading Multiple Files

If you want to download multiple files you can create a text file with the list of target files. Each filename should be on its own line. You would then run the command:
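A sketch, assuming your list is saved as filelist.txt (a placeholder name) with one URL per line:

```shell
wget -i filelist.txt
```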

You can also do this with an HTML file. If you have an HTML file on your server and you want to download all the links within that page, you need to add --force-html to your command.

To use this, all the links in the file must be full links; if they are relative links you will need to add a tag such as <base href='/support/knowledge_base/'> to the HTML file before running the command:
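A sketch, assuming the HTML file is saved as filelist.html (a placeholder name):

```shell
wget --force-html -i filelist.html
```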

Limiting the download speed

Usually, you want your downloads to be as fast as possible. However, if you want to continue working while a download runs, you may want to throttle its speed.

To do this use the --limit-rate option. You would use it like this:
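A sketch that caps the download at 200 KB/s (the rate and URL are placeholder values):

```shell
wget --limit-rate=200k http://www.domain.com/filename.zip
```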

Continuing a failed download

If you are downloading a large file and it fails part way through, you can continue the download in most cases by using the -c option.

For example:
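A sketch of resuming a partial download with -c (placeholder URL):

```shell
wget -c http://www.domain.com/filename.zip
```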

Normally when you restart a download of the same filename, it will append a number starting with .1 to the downloaded file and start from the beginning again.

Downloading in the background

If you want to download in the background use the -b option. An example of this is:
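A sketch of a backgrounded download (placeholder URL); progress is written to a wget-log file in the current directory:

```shell
wget -b http://www.domain.com/filename.zip
```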

Checking if remote files exist before a scheduled download

If you want to schedule a large download ahead of time, it is worth checking that the remote files exist. The option to run a check on files is --spider.

In circumstances such as this, you will usually have a file with the list of files to download inside. An example of how this command will look when checking for a list of files is:
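A sketch, assuming the list is in filelist.txt (a placeholder name):

```shell
wget --spider -i filelist.txt
```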

However, if it is just a single file you want to check, then you can use this formula:
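A sketch for a single file (placeholder URL):

```shell
wget --spider http://www.domain.com/filename.zip
```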

Copy an entire website

If you want to copy an entire website you will need to use the --mirror option. As this can be a complicated task there are other options you may need to use such as -p, -P, --convert-links, --reject and --user-agent.

-p  This option is necessary if you want all additional files necessary to view the page such as CSS files and images
-P  This option sets the download directory. Example: -P downloaded
--convert-links  This option will fix any links in the downloaded files. For example, it will change any links that refer to other files that were downloaded to local ones.
--reject  This option prevents certain file types from downloading. If for instance, you wanted all files except flash video files (flv) you would use --reject=flv
--user-agent  This option is for when a site has protection in place to prevent scraping. You would use this to set your user agent to make it look like you were a normal web browser and not wget.

Using all these options to download a website would look like this:
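A sketch combining these options; the download directory, user-agent string, and URL are placeholder values:

```shell
wget --mirror -p --convert-links -P ./downloaded --reject=flv \
     --user-agent="Mozilla/5.0 (X11; Linux x86_64) Firefox/115.0" \
     http://www.domain.com
```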

TIP: Being Nice

It is always best to ask permission before downloading a site belonging to someone else and even if you have permission it is always good to play nice with their server. These two additional options will ensure you don’t harm their server while downloading.
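A sketch of such a polite mirror command (placeholder URL):

```shell
wget --mirror --wait=15 --limit-rate=50K http://www.domain.com/
```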

This will wait 15 seconds between each page and limit the download speed to 50K/sec.

Downloading using FTP

If you want to download a file via FTP and a username and password are required, then you will need to use the --ftp-user and --ftp-password options.

An example of this might look like:
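A sketch; the username, password, and URL are placeholders:

```shell
wget --ftp-user=username --ftp-password=password ftp://ftp.domain.com/filename.zip
```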

Retry

If you are getting failures during a download, you can use the -t option to set the number of retries. Such a command may look like this:
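A sketch that retries up to 5 times (placeholder values):

```shell
wget -t 5 http://www.domain.com/filename.zip
```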

You could also set it to infinite retries using -t inf.

Recursive down to level X

If you want to get only the first level of a website, then you would use the -r option combined with the -l option.

For example, if you wanted only the first level of a website you would use:
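A sketch (placeholder URL):

```shell
wget -r -l 1 http://www.domain.com/
```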

Setting the username and password for authentication

If you need to authenticate an HTTP request you use the command:
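A sketch using HTTP authentication; the username, password, and URL are placeholders:

```shell
wget --http-user=username --http-password=password http://www.domain.com/filename.zip
```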

wget is a very complicated and complete downloading utility. It has many more options and multiple combinations to achieve a specific task. For more details, you can use the man wget command in your terminal/command prompt to bring up the wget manual. You can also find the wget manual here in webpage format.


If you’ve ever wanted to download files from many different archive.org items in an automated way, here is one method to do it.

____________________________________________________________

Here’s an overview of what we’ll do:

1. Confirm or install a terminal emulator and wget
2. Create a list of archive.org item identifiers
3. Craft a wget command to download files from those identifiers
4. Run the wget command.

____________________________________________________________

Requirements

Required: a terminal emulator and wget installed on your computer. Below are instructions to determine if you already have these.
Recommended but not required: understanding of basic unix commands and archive.org items structure and terminology.

____________________________________________________________

Section 1. Determine if you have a terminal emulator and wget.
If not, they need to be installed (they’re free)


1. Check to see if you already have wget installed
If you already have a terminal emulator such as Terminal (Mac) or Cygwin (Windows) you can check whether wget is also installed. If you do not have them both installed, see step 2 below for how to install them. Here’s how to check to see if you have wget using your terminal emulator:

1. Open Terminal (Mac) or Cygwin (Windows)
2. Type “which wget” after the $ sign
3. If you have wget the result should show what directory it’s in such as /usr/bin/wget. If you don’t have it there will be no results.

2. To install a terminal emulator and/or wget:
Windows: To install a terminal emulator along with wget please read the Installing Cygwin Tutorial. Be sure to choose the wget package when prompted.

Mac OS X: Mac OS X comes with Terminal installed. You should find it in the Utilities folder (Applications > Utilities > Terminal). For wget, there are no official binaries available for Mac OS X; instead, you must either build wget from source code or download an unofficial binary created elsewhere. The following links may be helpful for getting a working copy of wget on Mac OS X.
Prebuilt binary for Mac OSX Lion and Snow Leopard
wget for Mac OSX leopard

Building from source for MacOSX: Skip this step if you are able to install from the above links.
To build from source, you must first Install Xcode. Once Xcode is installed there are many tutorials online to guide you through building wget from source. Such as, How to install wget on your Mac.
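Those tutorials generally follow the standard GNU build procedure, sketched below; the version number is only an example, so check the GNU mirror for the latest release:

```shell
# Fetch and unpack the source (example version; replace with the current release)
curl -O https://ftp.gnu.org/gnu/wget/wget-1.21.4.tar.gz
tar -xzf wget-1.21.4.tar.gz
cd wget-1.21.4

# Configure, compile, and install
./configure
make
sudo make install
```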

____________________________________________________________

Section 2. Now you can use wget to download lots of files


The method for using wget to download files is:

  1. Create a folder (a directory) to hold the downloaded files
  2. Generate a list of archive.org item identifiers (the tail end of the url for an archive.org item page) from which you wish to grab files
  3. Construct your wget command to retrieve the desired files
  4. Run the command and wait for it to finish

Step 1: Create a folder (directory) for your downloaded files
1. Create a folder named “Files” on your computer Desktop. This is where the downloaded files will go. Create it the usual way by using either command-shift-n (Mac) or control-shift-n (Windows)

Step 2: Create a file with the list of identifiers
You’ll need a text file with the list of archive.org item identifiers from which you want to download files. This file will be used by wget to download the files.

If you already have a list of identifiers you can paste or type the identifiers into a file. There should be one identifier per line. The other option is to use the archive.org search engine to create a list based on a query. To do this we will use advanced search to create the list and then download the list in a file.

First, determine your search query using the search engine. In this example, I am looking for items in the Prelinger collection with the subject “Health and Hygiene.” There are currently 41 items that match this query. Once you’ve figured out your query:

1. Go to the advanced search page on archive.org. Use the “Advanced Search returning JSON, XML, and more.” section to create a query.
2. Once you have a query that delivers the results you want, click the back button to return to the advanced search page.
3. Select “identifier” from the “Fields to return” list.
4. Optionally sort the results (sorting by “identifier asc” is handy for arranging them in alphabetical order.)
5. In the “Number of results” box, enter a number that matches (or is higher than) the number of results your query returns.
6. Choose the “CSV format” radio button.
This image shows what the advanced query would look like for our example:

7. Click the search button (may take a while depending on how many results you have.) An alert box will ask if you want your results – click “OK” to proceed. You’ll then see a prompt to download the “search.csv” file to your computer. The downloaded file will be in your default download location (often your Desktop or your Downloads folder).
8. Rename the “search.csv” file “itemlist.txt” (no quotes.)
9. Drag or move the itemlist.txt file into your “Files” folder that you previously created
10. Open the file in a text program such as TextEdit (Mac) or Notepad (Windows). Delete the first line, which reads “identifier”. Be sure you delete the entire line so that the first line is not left blank. Then remove all the quotation marks by doing a search and replace, replacing the " character with nothing.

The contents of the itemlist.txt file should now look like this:
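For illustration, here is how such a file could be built from the shell; the identifiers below are hypothetical placeholders, not real archive.org items:

```shell
# Each line is one archive.org item identifier: no header line, no quotes
# (the identifiers below are placeholders for illustration)
printf '%s\n' \
  'example_item_01' \
  'example_item_02' \
  'example_item_03' > itemlist.txt
cat itemlist.txt
```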


Using wget to download files

…………………………………………………………………………………………………………………………
NOTE: You can use this advanced search method to create lists of thousands of identifiers, although we don’t recommend using it to retrieve more than 10,000 or so items at once (it will time out at a certain point).
………………………………………………………………………………………………………………………...

Step 3: Create a wget command
The wget command uses unix terminology. Each symbol, letter, or word represents a different option that wget will apply.

Below are three typical wget commands for downloading from the identifiers listed in your itemlist.txt file.

To get all files from your identifier list:
wget -r -H -nc -np -nH --cut-dirs=1 -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

If you want to download only certain file formats (in this example pdf and epub) you should include the -A option, which stands for “accept”. In this example we would download only the pdf and epub files:
wget -r -H -nc -np -nH --cut-dirs=1 -A .pdf,.epub -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

To download all files except specific formats (in this example tar and zip) you should include the -R option, which stands for “reject”. In this example we would download all files except tar and zip files:
wget -r -H -nc -np -nH --cut-dirs=1 -R .tar,.zip -e robots=off -l1 -i ./itemlist.txt -B 'http://archive.org/download/'

If you want to modify one of these or craft a new one, you may find it easier to do so in a text editing program (TextEdit or Notepad) rather than in the terminal emulator.

…………………………………………………………………………………………………………………………
NOTE: To craft a wget command for your specific needs you might need to understand the various options. It can get complicated, so try to get a thorough understanding before experimenting. You can learn more about unix commands at Basic unix commands

An explanation of each option used in our example wget commands is as follows:

-r recursive download; required in order to move from the item identifier down into its individual files

-H enable spanning across hosts when doing recursive retrieving (the initial URL for the directory will be on archive.org, and the individual file locations will be on a specific datanode)

-nc no clobber; if a local copy already exists of a file, don’t download it again (useful if you have to restart the wget at some point, as it avoids re-downloading all the files that were already done during the first pass)

-np no parent; ensures that the recursion doesn’t climb back up the directory tree to other items (by, for instance, following the “../” link in the directory listing)

-nH no host directories; when using -r, wget will create a directory tree to stick the local copies in, starting with the hostname ({datanode}.us.archive.org/), unless -nH is provided

--cut-dirs=1 completes what -nH started by skipping the hostname; when saving files on the local disk (from a URL like http://{datanode}.us.archive.org/{drive}/items/{identifier}/{identifier}.pdf), skip the /{drive}/items/ portion of the URL, too, so that all {identifier} directories appear together in the current directory, instead of being buried several levels down in multiple {drive}/items/ directories

-e robots=off archive.org datanodes contain robots.txt files telling robotic crawlers not to traverse the directory structure; in order to recurse from the directory to the individual files, we need to tell wget to ignore the robots.txt directive

-i ./itemlist.txt location of input file listing all the URLs to use; “./itemlist.txt” means the list of items is in the current directory, in a file called “itemlist.txt” (you can call the file anything you want, so long as you specify its actual name after -i)

-B 'http://archive.org/download/' base URL; gets prepended to the text read from the -i file (this is what allows us to have just the identifiers in the itemlist file, rather than the full URL on each line)

Additional options that may be needed sometimes:

-l depth --level=depth Specify recursion maximum depth level depth. The default maximum depth is 5. This option is helpful when you are downloading items that contain external links or URLs in either the item’s metadata or other text files within the item. Here’s an example command to avoid downloading external links contained in an item’s metadata:
wget -r -H -nc -np -nH --cut-dirs=1 -l 1 -e robots=off -i ./itemlist.txt -B 'http://archive.org/download/'

-A -R accept-list and reject-list, either limiting the download to certain kinds of files, or excluding certain kinds of files; for instance, adding the following options to your wget command would download all files except those whose names end with _orig_jp2.tar or _jpg.pdf:
wget -r -H -nc -np -nH --cut-dirs=1 -R _orig_jp2.tar,_jpg.pdf -e robots=off -i ./itemlist.txt -B 'http://archive.org/download/'

And adding the following options would download all files containing zelazny in their names, except those ending with .ps:
wget -r -H -nc -np -nH --cut-dirs=1 -A '*zelazny*' -R .ps -e robots=off -i ./itemlist.txt -B 'http://archive.org/download/'

See http://www.gnu.org/software/wget/manual/html_node/Types-of-Files.html for a fuller explanation.
…………………………………………………………………………………………………………………………

Step 4: Run the command
1. Open your terminal emulator (Terminal or Cygwin)
2. In your terminal emulator window, move into your folder/directory. To do this:
For Mac: type cd Desktop/Files
For Windows type in Cygwin after the $ cd /cygdrive/c/Users/archive/Desktop/Files
3. Hit return. You have now moved into the “Files” folder.
4. In your terminal emulator, enter or paste your wget command. If you are using one of the commands on this page, be sure to copy the entire command, which may span two lines. On Mac you can simply copy and paste. For Cygwin, copy the command, click the Cygwin logo in the upper left corner, select Edit, then select Paste.
5. Hit return to run the command.

You will see your progress on the screen. If you have sorted your itemlist.txt alphabetically, you can estimate how far through the list you are based on the screen output. Depending on how many files you are downloading and their size, it may take quite some time for this command to finish running.

…………………………………………………………………………………………………………………………
NOTE: We strongly recommend trying this process with just ONE identifier first as a test to make sure you download the files you want before you try to download files from many items.
…………………………………………………………………………………………………………………………

Tips:

  • You can terminate the command by pressing “control” and “c” on your keyboard simultaneously while in the terminal window.
  • If your command will take a while to complete, make sure your computer is set to never sleep and turn off automatic updates.
  • If you think you missed some items (e.g. due to machines being down), you can simply rerun the command after it finishes. The “no clobber” option in the command will prevent already retrieved files from being overwritten, so only missed files will be retrieved.