In my last article, I introduced you to the internet download tool, Wget. We have already covered the installation procedure and a few basic commands.
Today, we are going to explore more custom uses of this simple yet powerful tool.
The method we will follow is simple: for each use case, we'll look at the command that gets it done, along with the tweaks you can make to suit your requirements.
So let's get started…
Say you have a file.txt containing any number of links, one per line, and you want to download them all. Use:
wget --input-file=file.txt
Here the “--input-file” option (short form “-i”) tells wget to download every link listed in file.txt.
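As a quick sketch (the URLs below are placeholders), here is how such a list might be built and handed to wget:

```shell
# Build a list of links, one URL per line (placeholder URLs):
printf '%s\n' \
  'https://example.com/photos.zip' \
  'https://example.com/notes.pdf' > links.txt

# wget reads the file and fetches each entry in order; the network
# call is commented out so this sketch runs offline:
# wget --input-file=links.txt
```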
If you are a web developer, you might want to look at a website's HTML, CSS, or JS to figure out some cool new UI implementations. Here too wget comes in handy: it can download and store all of a webpage's HTML, CSS, JS, and other assets such as images locally, which is enough to make the page work offline.
To do this, use:
wget --page-requisites --span-hosts --convert-links --adjust-extension http://sitename.com/dir/file
| Option | What it does |
|---|---|
| --page-requisites | Causes wget to download all the files necessary to properly display a given HTML page, including inlined images, sounds, and referenced stylesheets. |
| --span-hosts | Enables spanning across hosts, so assets served from other domains (CDNs, image hosts) are fetched too. |
| --convert-links | After the download completes, rewrites the links in the documents so they point at the local copies, making the page viewable offline. |
| --adjust-extension | Takes care of file extensions: appends the proper suffix (such as .html or .css) to downloaded files whose URLs lack one, so they open correctly. |
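To picture what --convert-links does, here is a toy before/after of the rewriting: an absolute asset URL becomes a relative local one. wget performs this internally; the sed below is only to illustrate the transformation.

```shell
# Illustrative only: turn an absolute reference into a local-relative
# one, the same kind of rewrite --convert-links applies to saved pages.
echo '<img src="http://sitename.com/img/logo.png">' \
  | sed 's|http://sitename.com/||'
```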
What we just saw can be extended to mirror entire websites, something every blogger should do regularly to keep a backup of their WordPress files in case the hosting service messes up the online blog. To mirror a website, use:
wget --execute robots=off --recursive --no-parent --continue --no-clobber http://sitename.com
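As an aside, wget bundles the usual mirroring flags behind a single --mirror switch, which is shorthand for --recursive --timestamping --level=inf --no-remove-listing. A sketch with a placeholder URL, printed instead of executed so it stays offline:

```shell
# --mirror = --recursive --timestamping --level=inf --no-remove-listing
cmd='wget --mirror --no-parent --convert-links --adjust-extension'
cmd="$cmd http://sitename.com"

# Print the command rather than running it (no network in this sketch):
echo "$cmd"
```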
Let's download all the mp3 files within a subdirectory of a website:
wget --level=1 --recursive --no-parent --accept mp3,MP3 http://sitename.com/mp3/
This downloads all the mp3 files contained in the /mp3/ directory. We have set the recursion depth to 1 using the “--level” option.
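The --accept list filters on the file suffix. Here is a tiny sketch of that same matching rule, purely illustrative and not wget's actual code:

```shell
# Simulate the suffix check that --accept mp3,MP3 performs:
matches_accept() {
  case "$1" in
    *.mp3|*.MP3) return 0 ;;  # suffix is on the accept list: keep it
    *)           return 1 ;;  # anything else is skipped
  esac
}

matches_accept 'song.mp3'  && echo "kept: song.mp3"
matches_accept 'cover.jpg' || echo "skipped: cover.jpg"
```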
The exact command above can be adapted for images:
wget --directory-prefix=file/images --no-directories --recursive --no-clobber --accept jpg,gif,png,jpeg http://sitename.com/images/
Here “--accept” tells wget to only keep files with extensions matching jpg, gif, png, and jpeg.
The following command downloads files from a website that checks the User-Agent and the HTTP Referer:
wget --referer=http://google.com --user-agent="Mozilla/5.0 Firefox/4.0.1" http://sitename.com
Download files from a password-protected site:
wget --http-user=labnol --http-password=hello123 http://sitename.com/secret/file.zip
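Putting a password on the command line leaves it in your shell history. wget can instead read credentials from a ~/.netrc file. A sketch, using the article's placeholder credentials and writing to a local file so it doesn't touch your home directory:

```shell
# wget falls back to ~/.netrc when --http-user/--http-password are
# absent. Write an example entry to a local file for illustration:
cat > netrc.example <<'EOF'
machine sitename.com
  login labnol
  password hello123
EOF

# Credentials files should be readable only by you:
chmod 600 netrc.example
```

Alternatively, --ask-password makes wget prompt for the password interactively instead of taking it on the command line.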
Some other options that might come in handy:
| Option | What it does |
|---|---|
| --wait=10 | Waits 10 seconds between downloads so you don't consume all the site's bandwidth; it is good practice not to put a heavy load on a website. |
| --random-wait | Used together with the above, waits a random amount of time (based on the --wait value) instead of exactly 10 seconds. |
| --domains=xyz.com,docs.abc.com,files.abc.com | Restricts fetching to any number of domain names, separated by commas. |
| --limit-rate=200k | Limits the bandwidth used to 200 KB per second. |
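Putting the courtesy options together, a polite recursive download might look like this (placeholder URL; the sketch prints the command instead of running it):

```shell
# Pause between requests, randomize the pause, and cap bandwidth.
# sitename.com is a placeholder host.
cmd='wget --recursive --no-parent --wait=10 --random-wait --limit-rate=200k'
cmd="$cmd http://sitename.com/docs/"

# Print instead of executing, so the sketch needs no network:
echo "$cmd"
```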
By now you have mastered using the [options] field to suit your requirements. You can combine and tweak the options in the commands above to get the result set you need. The options listed here will cover most of your needs, but if you are stuck you can always refer to the wget manual at https://www.gnu.org/software/wget/ or ask in the comments section below.
Image Source: lintut.com