Using wget in interesting ways

Request a web page together with all the resources it needs to display correctly, but delete everything immediately after it has been downloaded:

wget -H -p -e robots=off --delete-after http://www.google.com 
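  • -H [or] --span-hosts [...also fetch requisites hosted on other domains]
  • -p [or] --page-requisites [...download the images, CSS, etc. needed to display the page]
  • -e robots=off [or] --execute robots=off [...ignore robots.txt]
  • --delete-after [...delete each file locally once it has been downloaded]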
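
The same request can be spelled with long options, which is easier to read in scripts. The sketch below is only an illustration: the function name fetch_and_discard and the DRY_RUN switch are made up for this example and are not part of wget.

```shell
# Hypothetical helper: fetch a page plus its requisites, then discard it all.
# Set DRY_RUN=1 to print the command instead of contacting the network.
fetch_and_discard() {
    url=$1
    # Long-option spelling of: wget -H -p -e robots=off --delete-after URL
    set -- wget --span-hosts --page-requisites --execute robots=off --delete-after "$url"
    if [ "${DRY_RUN:-0}" = "1" ]; then
        echo "$*"
    else
        "$@"
    fi
}

# Preview only; drop DRY_RUN=1 to actually send the request.
DRY_RUN=1 fetch_and_discard "http://www.google.com"
```

Keeping the preview mode around makes it easy to check the exact command line before pointing it at a real server.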

Other useful options

  • --no-dns-cache [...Turn off caching of DNS lookups]
  • --no-cache [...to disable server-side cache so as to always get the latest page]
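
As a rough sketch, these two options can simply be added to the earlier command. The variable name cache_opts is an arbitrary choice for this example, and the echo makes it a preview rather than a real request.

```shell
# Both cache-related options from the list above, kept together in one variable.
cache_opts="--no-dns-cache --no-cache"

# $cache_opts is left unquoted on purpose so it splits into two arguments.
# "echo" makes this a preview; remove it to send the request.
echo wget $cache_opts -H -p -e robots=off --delete-after "http://www.google.com"
```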

If you want to keep the pages instead, the -E and -K options may be useful:

wget -E -H -K -p -e robots=off http://www.google.com 
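  • -E [or] --adjust-extension [...append .html to saved HTML pages whose URL lacks that suffix]
  • -K [or] --backup-converted [...keep a .orig copy of any file that wget converts; only relevant when links are converted with -k]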

If the web server you are fetching pages from blocks automated requests based on the user-agent, you can fool it with the following option:

  • -U "agent-string" [or] --user-agent="agent-string"
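
A minimal sketch of passing a browser-like agent string on the command line. The string is the one used in the sample .wgetrc in this article; the helper name ua_args is hypothetical and only there to show how the option is quoted.

```shell
# Browser-like agent string (same one as in the sample .wgetrc).
ua='Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36'

# Hypothetical helper: prefix any wget arguments with the agent option
# and print them one per line.
ua_args() {
    printf '%s\n' "--user-agent=$ua" "$@"
}

# The agent string contains spaces and semicolons, so it must stay quoted;
# the listing shows it arriving as a single argument.
ua_args -p "http://www.google.com"

# Real run (needs network access):
# wget --user-agent="$ua" -p http://www.google.com
```

Note that an empty value, --user-agent="", tells wget to send no User-Agent header at all, which is usually not what you want here.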

If you don’t want to pass the --user-agent option every time, you can create a .wgetrc file in your home directory so that wget always uses the pre-configured user-agent.

Example ~/.wgetrc

### Sample Wget initialization file .wgetrc by http://www.askapache.com
## Local settings (for a user to set in his $HOME/.wgetrc).  It is
## *highly* undesirable to put these settings in the global file, since
## they are potentially dangerous to "normal" users.
## Even when setting up your own ~/.wgetrc, you should know what you
## are doing before doing so.
header = Accept-Language: en-us,en;q=0.5
header = Accept: text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8
header = Connection: keep-alive
user_agent = Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36
referer = /
robots = off
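
To try settings like these without touching your real ~/.wgetrc, one option is to put them in a scratch file and point wget at it through the WGETRC environment variable. This is a sketch under assumptions: the scratch file comes from mktemp, and only two of the settings above are copied.

```shell
# Write a trimmed copy of the settings to a scratch file.
rc=$(mktemp)
trap 'rm -f "$rc"' EXIT   # clean up the scratch file when the shell exits

cat > "$rc" <<'EOF'
## scratch wgetrc for testing
user_agent = Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/48.0.2564.109 Safari/537.36
robots = off
EOF

# List the active settings (skip comments and blank lines).
grep -v -e '^#' -e '^[[:space:]]*$' "$rc"

# WGETRC makes wget read this file instead of ~/.wgetrc for one run.
# Uncomment to try it (needs network access):
# WGETRC="$rc" wget -p --delete-after http://www.google.com
```

Once the settings behave as expected, they can be copied into ~/.wgetrc.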


