
I'm going to explain how you can use a command line tool called wget to download an entire copy of a website to your local machine. In this example I'm working on an Ubuntu 16.04 machine, but you should be able to install wget on Mac OS X and even on Windows (Cygwin). OK, so legally I need to say: if you're not the owner of the website you're going to copy, make sure you're not breaking any copyright rules etc. Play nicely and respect other people's hard work please.
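
If you don't already have wget, it's normally a one-line package install. Roughly speaking (the Mac line assumes you use Homebrew; on Windows the Cygwin installer has a wget package):

# Ubuntu / Debian
sudo apt-get update && sudo apt-get install wget

# Mac OS X with Homebrew
brew install wget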

You want to make a copy of your website so you can view it offline.

OK, fair enough: you want to view a website offline, or perhaps you just want a copy as a backup. So here's the interesting (if you're a geek) bit. The website is probably built using a scripting language like PHP, JSP or ASP, and each page is probably assembled from lots of fragments of pages and data in a database. So how do you get a working local copy with so many moving parts in the background? Easy! What your browser sees is the finished page with everything put together nicely. So wget basically pretends to be a browser and pulls down the website one page at a time, saving it locally and making sure that all the bits it needs are there too.
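
To see what I mean, here's what grabbing just a single page plus the bits it needs looks like (using example.com as a stand-in address):

# fetch one page plus its images/CSS, rewriting links so it works offline
wget --page-requisites --convert-links --adjust-extension https://example.com/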

Howto: wget to back up an entire site in one command line:

The following is exactly what to type to get your website copied onto your local machine. Like I said, I'm working on an Ubuntu machine, but there are ways of getting wget installed on Mac OS X and even Windows. In this example I'm going to pull down a copy of my website 'www.pureenergymeditation.com', but obviously you want to change this in the example below. Note it's in two places!

nohup wget \
     --recursive \
     --adjust-extension \
     --no-clobber \
     --page-requisites \
     --convert-links \
     --restrict-file-names=windows \
     --domains www.pureenergymeditation.com \
     --no-parent \
     --wait 4 \
     --trust-server-names \
     www.pureenergymeditation.com &

OK so let's break down what we're looking at:

nohup = keep the process running even if the terminal window gets closed or the remote SSH login session is dropped. This is needed because pulling down a whole website can take quite a long time.
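
If your SSH session does drop and you log back in later, a quick way to check the download is still running is:

# check whether the wget process is still alive (the [w] stops grep matching itself)
ps aux | grep '[w]get'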

wget = this is the main program doing the work. Most of the options are self-explanatory. If you want to know more, type "man wget" into the terminal and you'll see the wget manual. I'll explain some of the key options below:

--adjust-extension = make sure downloaded HTML pages are saved with file names ending in .html, even if the original URL doesn't (e.g. a .php page or a URL with query parameters).

--page-requisites = make sure you download all the images, style sheets and other required files so that you have everything you need to show the page locally.

--convert-links = after the download finishes, rewrite the links inside the downloaded pages so they point at your local copies rather than back at the live site. This is what makes the site browsable offline.

--wait 4 = respect the website you're scraping. This makes wget wait 4 seconds between calls to the server for the next page or page component (image, style sheet etc.). If you don't have a wait then you'll fire off requests as fast as possible and hammer the server. This could be seen as a denial of service attack and get your IP address blocked. Either way it's not cool, so don't do it. Put a wait in (and that wait is exactly why the whole job takes long enough to need nohup).
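
If you want to be even gentler, wget also has --random-wait (which varies the delay around the --wait value) and --limit-rate (which caps your bandwidth). I haven't included them in the main command above, but here's a cut-down sketch just to show where they'd go:

# politeness extras: randomised delay plus a 200 KB/s bandwidth cap
wget --recursive --no-parent --wait 4 --random-wait --limit-rate=200k www.pureenergymeditation.com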

\ = What's with the backslashes at the end of each line? They tell the shell that the command carries on onto the next line, which just makes a long command easier to read.

& = This is not a wget thing. It's a shell directive meaning "run this process in the background". Since you're running with nohup, all output (stdout) is captured in the log file ./nohup.out. Running in the background means your terminal is free to be used for other things, and when the wget command finishes the shell prints a message saying the job has completed. You could use wget's own -b flag instead, but that writes to a separate log file (wget-log).
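
For completeness, here's roughly what that -b version looks like, squashed onto fewer lines:

# -b puts wget in the background itself; progress goes to ./wget-log instead of ./nohup.out
wget -b --recursive --adjust-extension --no-clobber --page-requisites --convert-links \
     --restrict-file-names=windows --domains www.pureenergymeditation.com --no-parent \
     --wait 4 --trust-server-names www.pureenergymeditation.com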

How do you follow the progress of the download?

Here are several ways to see the progress:

# follow the nohup.out log file:
tail -f ./nohup.out

# see how much file space is being used. This number will grow as the site is downloaded
# note wget has automatically created a folder for the website it's downloading
cd www.pureenergymeditation.com
du -kd1 .
# note: du -kd1 . means Disk Usage (du) in kilobytes (-k), going one folder deep (-d1), from the current folder (.)
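
Another rough-and-ready check, from inside the site folder, is simply to count how many files have been saved so far; the number keeps climbing while wget is working:

# count the files downloaded so far
find . -type f | wc -l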

 If you are following the nohup.out log file you'll see something like this for each file downloaded:

--2017-04-08 15:16:14--  http://www.pureenergymeditation.com/blog/page=2
Reusing existing connection to www.pureenergymeditation.com:80.
HTTP request sent, awaiting response... 200 OK
Length: unspecified [text/html]
Saving to: ‘www.pureenergymeditation.com/blog/page=2’

     0K .......... ..........                                   135K=0.2s

What that's showing you is the timestamp and the page that's being pulled. You then see that a 200 code was returned, which basically means everything is OK. It tells you where it's saving the page to, and each dot on the progress line represents 1 KB downloaded. The figures at the end are the average download speed (135K per second) and how long the download took (0.2s).
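
Because every response line in the log follows that same pattern, you can also scan for anything that didn't come back as 200 OK, for example:

# list any responses in the log that weren't 200 OK (redirects, 404s etc.)
grep 'awaiting response' ./nohup.out | grep -v '200 OK'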

How do you view the downloaded website?

The website now exists on your local machine as a collection of static HTML files. You can simply open your favorite browser like Firefox and go to File -> Open and select index.html in the main folder of the downloaded website.
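
If you'd rather do it from the command line, or you want the pages served over http:// instead of file://, something like this also works (the folder name assumes wget created one matching your domain, and Python 3 is just one easy way to get a throwaway local web server):

# open the local copy straight from the terminal
firefox ./www.pureenergymeditation.com/index.html

# or serve it at http://localhost:8000/ instead
cd ./www.pureenergymeditation.com
python3 -m http.server 8000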

FIX: WGET <a href=""> Links Don't Work in Local Copy?

In some cases the original site is using a scripting language, for example PHP, so the pages have the extension ".php". We've asked wget to append ".html" to the end of those files so that they can be opened in the browser. If you try to open a .php file directly, your browser won't know what to do with it.

The problem is that these files may contain links to other pages in the site. A link might look like this:

<a href="/about.php">Click here</a>

When you click on the link, the browser tries to open the file about.php, but there are two problems: 1. about.php doesn't exist locally because it's now named about.php.html (remember the --adjust-extension flag), and 2. even if it did exist, the browser can't process a .php file.

So the solution is to go through the files and change any link references ending in ".php" to ".php.html". That could be thousands of links! But don't panic.
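
If you're curious how many links are affected before touching anything, a rough count looks like this (the path assumes wget created a folder named after the domain):

# count link targets that still point at bare .php pages
grep -ro '\.php"' ./www.pureenergymeditation.com | wc -l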

# Firstly move into the directory which contains the website. Change it to your domain
cd ./www.pureenergymeditation.com

# look in files ending .php.html and replace .php" with .php.html" making backup files ending in .bak
find . -type f -name "*.php.html" -exec sed -i.bak 's#\.php"#.php.html"#g' {} \;

# Once you're satisfied that it's all worked as intended, remove the backup files
find . -name "*.bak" -type f -exec rm -f {} \;
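
You can also sanity-check the rewrite itself; if the replacement worked, this should print nothing (--include is a GNU grep option, which is what Ubuntu ships, and it means the check works whether or not you've removed the .bak backups yet):

# look for any remaining bare .php links in the rewritten pages
grep -rn --include='*.php.html' '\.php"' .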

Tags: Website, wget

Posted by Mark Zaretti at 17:13
