The Core

Topic started by: Adam C on April 11, 2013, 03:20:58 PM

Title: Periodically save copy of webpage
Post by: Adam C on April 11, 2013, 03:20:58 PM
Wondering if there's a tool / service out there that will do this - I expect so - or some easy to configure process...

Basically I want something that will save a copy of certain pages of competitor websites on a periodic basis - say weekly.

Set it and forget it - until such time as you have a need to look.

What do you / would you use for this?
Title: Re: Periodically save copy of webpage
Post by: BoL on April 11, 2013, 04:03:01 PM
I'd just use curl or wget; go with wget if you want to fetch the CSS/JS/images on the page too.

This small hacky bash script should do the trick. I'd put it in a cron job to run once a week.

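i=0
folder='/home/richard/Desktop/urlfolder'
# Read one URL per line from the file passed as the first argument
while read -r url; do
  # -L follows redirects, -A sets the user agent, -o names the dated output file
  curl -L -A "Mozilla 6.0" -o "$folder/weekly_${i}_$(date +%Y%m%d).txt" "$url"
  i=$((i + 1))
done < "$1"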

It'll take a command along the lines of
sh urllist.sh /home/richard/Desktop/urllist.txt

where urllist.txt is just a list of URLs, one per line. The script may need editing, but it worked OK when I tested it.
Title: Re: Periodically save copy of webpage
Post by: Rooftop on April 11, 2013, 05:26:11 PM
We've got a tool that does this and highlights when pages change in certain ways: title changes, number of links, words on page - that sort of thing. The eventual plan is to show those changes against some sort of visibility index and link data. The hope is to be able to visualise some of what is helping pages rank in particular sectors.

Don't suppose that is close to what you are looking at?  (this is the most cack-handed market research ever - in case you wondered)
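
For a rough idea of the kind of per-page numbers mentioned above (title, number of links, words on page), here is a minimal sketch that pulls them out of one saved HTML file. It is not the tool described in this post; page_metrics is a made-up helper name, and the tag matching is deliberately crude.

```shell
# Hypothetical helper (not the tool from this post): print the title, link
# count and word count of one saved HTML file, separated by tabs.
page_metrics() {
    file="$1"
    # Title: flatten newlines first so a title split across lines still matches.
    # Assumes a lowercase <title> tag.
    title="$(tr '\n' ' ' < "$file" | sed -n 's/.*<title[^>]*>\([^<]*\)<\/title>.*/\1/p')"
    # Links: count opening <a ... href occurrences, case-insensitively.
    links="$(grep -o -i '<a [^>]*href' "$file" | wc -l | tr -d ' ')"
    # Words: replace tags with spaces, then count whitespace-separated words.
    # This also counts the title text and any inline script or style text.
    words="$(sed 's/<[^>]*>/ /g' "$file" | wc -w | tr -d ' ')"
    printf '%s\t%s\t%s\n' "$title" "$links" "$words"
}
```

Running it against two weekly copies of the same page and comparing the output lines would show whether any of the three values moved.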
Title: Re: Periodically save copy of webpage
Post by: littleman on April 11, 2013, 05:36:00 PM
Something like BoL's script could be set up to run automatically via a cron job.
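
A crontab entry for that might look roughly like the sketch below. It assumes the script was saved as /home/richard/Desktop/urllist.sh (the thread only shows it being run as urllist.sh, so that location is a guess), and the log file name is made up for illustration.

```
# Edit the current user's crontab with: crontab -e
# Fields: minute hour day-of-month month day-of-week command
# Runs at 03:00 every Monday; the script and log paths are assumptions.
0 3 * * 1 sh /home/richard/Desktop/urllist.sh /home/richard/Desktop/urllist.txt >> /home/richard/Desktop/urllist.log 2>&1
```

Redirecting output to a log means there is something to look at if a fetch quietly starts failing.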
Title: Re: Periodically save copy of webpage
Post by: ergophobe on April 11, 2013, 10:06:35 PM
perhaps simpler

wget -p -k -E http://example.com/page.html

-p: grab page requisites (images, JS, CSS)
-k: convert links. So if you have src="/images/image.jpg" and you're downloading example.com/dir/page, it will convert that link to src="../images/image.jpg"
-E: add the .html extension. If you're downloading example.com/dir/page it will save it as page.html so you can double-click to open it in your browser.

Pipe it to gzip and save it in an archive with a timestamp in the filename and you're done!
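
A minimal sketch of that last step, under some assumptions: wget -p writes a directory tree rather than a single stream, so this uses tar with gzip compression instead of a literal pipe. ARCHIVE_ROOT and snapshot_page are hypothetical names, and the archive location is just a placeholder.

```shell
#!/bin/sh
# Sketch: save a dated, compressed snapshot of one page plus its requisites.
# ARCHIVE_ROOT and snapshot_page are hypothetical names, not from the thread.
ARCHIVE_ROOT="${ARCHIVE_ROOT:-$HOME/page-archive}"

snapshot_page() {
    url="$1"
    stamp="$(date +%Y%m%d)"
    workdir="$ARCHIVE_ROOT/snapshot_$stamp"
    mkdir -p "$workdir" || return 1
    # -q quiet, -p page requisites, -k convert links, -E add .html,
    # -P download into workdir. wget exits non-zero if any requisite fails,
    # so warn instead of aborting the whole snapshot.
    wget -q -p -k -E -P "$workdir" "$url" || echo "wget reported errors for $url" >&2
    # Bundle the downloaded tree into one gzip-compressed tarball,
    # then remove the uncompressed copy.
    tar -czf "$workdir.tar.gz" -C "$ARCHIVE_ROOT" "snapshot_$stamp" && rm -rf "$workdir"
}
```

Calling snapshot_page http://example.com/page.html would leave a snapshot_YYYYMMDD.tar.gz under ARCHIVE_ROOT. As written it keeps one archive per day, so snapshotting several pages would need something extra in the file name, such as the counter in BoL's script.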
Title: Re: Periodically save copy of webpage
Post by: bill on April 14, 2013, 01:35:31 AM
Are there any recommended WGET clients for Windows?
Title: Re: Periodically save copy of webpage
Post by: ergophobe on April 15, 2013, 06:43:56 PM
>>wget clients

Hmm... well, I have git on my Windows machines, and git for Windows comes with git bash, which has grep, wget, less and a lot of other things you'd expect from bash. That's how I would use it on Windows (though actually, I don't really use it on Windows).

You can also get it as a standalone from gnu: http://gnuwin32.sourceforge.net/packages/wget.htm
Title: Re: Periodically save copy of webpage
Post by: bill on April 16, 2013, 02:13:23 AM
The GNU tools might be the way to go. I was holding off on Cygwin on my work machine, which would have been the other option I know of.
Title: Re: Periodically save copy of webpage
Post by: Rupert on April 16, 2013, 07:58:50 AM
If you are looking for something PC-based, how about iopus? You can get that to run macros to automate anything like that.
Title: Re: Periodically save copy of webpage
Post by: Rumbas on April 16, 2013, 08:35:01 AM
http://www.changedetection.com/
Title: Re: Periodically save copy of webpage
Post by: bill on April 16, 2013, 09:00:33 AM
Quote from: Rumbas on April 16, 2013, 08:35:01 AM
http://www.changedetection.com/
I use that all the time. Forgot about its ability to cache pages.