ExpressionEngine CMS

This is an archived forum and the content is probably no longer relevant, but is provided here for posterity.

How to create an html snapshot / archive of your site at a particular date and time

April 23, 2011 4:59pm

  • #1 / Apr 23, 2011 4:59pm

    talbina

    20 posts

    Let me explain this using an example.

    You all know the Drudge Report, which is a news link aggregator.

    You can also see the Drudge Report at a specific date and time. For example, here it is on April 1st: http://www.drudgereportarchives.com/data/2011/04/01/20110401_130016.htm

    Can anyone please tell me how I would go about doing something like this for an EE site?

    What languages do I need and how complicated is something like this to implement for a beginner?

    [Mod Edit: Moved to the CodeShare Corner forum]

  • #2 / Apr 23, 2011 5:58pm

    narration

    773 posts

    talbina, it’s an interesting proposition, even if I have to quit looking at the forums and go on about a Saturday 😉

    - my first thought was to do this by constructing an extension, which would allow modifying the EE-internal idea of what date-time it was. This alone would probably be more than a beginner would want to become involved with, unless they already had good experience with PHP and with programming in general.

    - but then, you can see that this would not be enough. Working from the modified date-time, you’d have to alter every database query EE makes, to disallow any kind of content after the cutoff moment.

    This would be a pretty daunting concept, and there aren’t ‘hooks’ to allow you to do it in the fashion that extensions are developed, so you’d have to make an altered copy of the EE code. The work would potentially have to be re-done each time a new version of EE was released.
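    To make the cutoff idea concrete: filtering each query down to content that existed at the chosen moment amounts to a date comparison against every record. A toy sketch in plain Python (the entry list and function names are mine, not EE internals):

```python
from datetime import datetime

# Hypothetical in-memory stand-in for the result of an EE channel-entries query.
entries = [
    {"title": "Old post", "entry_date": datetime(2011, 3, 30, 9, 0)},
    {"title": "New post", "entry_date": datetime(2011, 4, 2, 14, 0)},
]

def as_of(entries, cutoff):
    """Return only the entries that already existed at the cutoff moment."""
    return [e for e in entries if e["entry_date"] <= cutoff]

snapshot = as_of(entries, datetime(2011, 4, 1, 23, 59))
# Only "Old post" survives an April 1st cutoff.
```

    Note the limitation Clive is pointing at: a simple date filter like this would have to be applied to every query EE makes, and it still can't undo later edits made to older entries, which is part of why a true look-back is so hard.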

    - The fact that EE is built on CodeIgniter would let you build a ‘view-only and time limited’ version of EE’s data, but you’d have to replicate things like the template parser (very complex indeed) accurately, and again change this each time EE were to change in the future.

    So, unless I’m missing something fundamental, I think we’re seeing that the code behind the Drudge Report is a very custom and special-purpose piece of work. With their traffic, likely they can afford that. And equally, if EE were to add the ability, it would pretty much need to be taken on and agreed to as a Feature Request, so that EllisLab persons would create and maintain it, because it was valuable to enough customers.

    talbina, your proposal is particularly interesting because looking at it shows that although EE is in fact very flexible as far as specializing and adding abilities, there are some ways in which it really can’t be so flexible—an aspect of the larger world.

    A nice weekend, and regards,
    Clive

  • #3 / Apr 23, 2011 6:18pm

    narration

    773 posts

    Just a note or two more, as yes, the concept could be interesting for some special uses.

    - however, another roadblock would be that to have a true ‘backwards window’, EE would have to somehow save not only every entry you ever made in its database, but all the content which doesn’t come from the database, such as images, uploaded files, etc. This isn’t part of its design, and again would be very specialist.

    - I thought after a moment of the Wayback Machine, which crawls the web and does take snapshots of websites, often daily. I’d always wanted to look at it, so just did. Clearly it takes immense resources to do this, and it can be very useful, perhaps for your purposes.

    Issues noted:

    1. It only publishes after 6 months, so that recent changes aren’t available.

    2. There’s a limitation, probably also controlled by the individual sites, on how deep the site links are recorded. You can’t look at forums on EE, for example.

    3. However, I was interested that for http://www.cnn.com, links into story areas did work, and that actually individual stories (pages) could be linked to beyond this. That’s because the links could still be reached live, as CNN has itself kept them. The look of the story and site might well be different, however, as you are now no longer strictly looking back in time.

    Enough to learn on this today 😉

    C.

  • #4 / Apr 24, 2011 6:58pm

    handyman

    509 posts

    Let me explain this using an example.

    You all know the Drudge Report, which is a news link aggregator.

    You can also see the Drudge Report at a specific date and time. For example, here it is on April 1st: http://www.drudgereportarchives.com/data/2011/04/01/20110401_130016.htm

    Can anyone please tell me how I would go about doing something like this for an EE site?

    What languages do I need and how complicated is something like this to implement for a beginner?

    [Mod Edit: Moved to the CodeShare Corner forum]

    This is fairly easy - assuming you don’t want to do it daily or anything!

    Use a “sitesucker” piece of software - like sitesucker, web devil, httrack, etc.

    These will pull down every linked page of a site and turn them into static html sites (with relative links) that work (mostly).
    There are options you have to check off - for instance, you tell sitesucker to “add .html” extension to each filename - that way it will work without php and a database!

    Does that sound like what you want to do?
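    The “add .html” option handyman mentions boils down to mapping each crawled URL path onto a filename a plain web server can deliver without PHP or a database. A rough sketch of that mapping (the function name is mine, not a sitesucker/HTTrack internal):

```python
from urllib.parse import urlparse

def static_filename(url):
    """Map a dynamic URL onto a static .html path, roughly what the
    'add .html extension' crawler option produces."""
    path = urlparse(url).path
    if path in ("", "/"):
        return "index.html"
    path = path.strip("/")
    if path.endswith(".html"):
        return path
    return path + ".html"

print(static_filename("http://example.com/"))            # index.html
print(static_filename("http://example.com/blog/entry"))  # blog/entry.html
```

    The mirroring tools also rewrite the links inside each page to match these new filenames, which is the part a manual “view source” copy doesn’t do.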

  • #5 / Apr 25, 2011 10:36pm

    talbina

    20 posts

    handyman,

    Thank you very much. I used HTTrack and I like it. It worked well with my site. However, it downloads so many files and directories.

    I just tried something that may make me look really stupid for asking. All I do is “view source”, copy and paste it into Notepad, and save as HTML. What am I missing? I can put this HTML page on my website, such as http://www.mywebsite.com/2011-04-01.html. Am I missing something troubling?

    Clive,

    Thank you very much for those writings, I really appreciate them.

    Another idea that can be implemented is to make a URL http://www.mywebsite.com/2011-04-1, which would show what the page looked like at the end of that day. Just to let you know, I only need this once a day, not every hour or anything.

    Then I would use the filters in the channel to only have posts that were posted between 12:01 am and 11:59pm. The problem is

  • #6 / Apr 25, 2011 10:48pm

    handyman

    509 posts

    Something like view source would work if only a few pages had to be copied… and it does not change the internal links, so you have to do it all manually.

    There are a number of easier ways to do this - even Firefox has a free extension called DownThemAll which downloads any page or pages. You can usually set these to download only as many levels “deep” as you want, so you will not download as many files.

    A “view source” will NOT download images or other referenced objects in the page, just the text! If I were you, I would fiddle with the tools available.
    http://www.downthemall.net/
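    The “levels deep” setting these tools expose is just a breadth-first crawl with a depth cap. A toy sketch over an in-memory link graph (hypothetical pages, no real network access):

```python
from collections import deque

# Toy link graph standing in for a site's pages.
links = {
    "/": ["/news", "/about"],
    "/news": ["/news/story-1", "/news/story-2"],
    "/about": [],
    "/news/story-1": [],
    "/news/story-2": [],
}

def crawl(start, max_depth):
    """Collect pages reachable from start within max_depth link hops."""
    seen = {start}
    queue = deque([(start, 0)])
    while queue:
        page, depth = queue.popleft()
        if depth == max_depth:
            continue  # depth cap reached: don't follow this page's links
        for nxt in links.get(page, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, depth + 1))
    return seen

print(sorted(crawl("/", 1)))  # home page plus its direct links only
```

    Capping the depth at 1 grabs just the front page and what it links to, which is why a lower setting downloads far fewer files.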

  • #7 / Apr 25, 2011 10:51pm

    talbina

    20 posts

    Something like view source would work if only a few pages had to be copied… and it does not change the internal links, so you have to do it all manually.

    There are a number of easier ways to do this - even Firefox has a free extension called DownThemAll which downloads any page or pages. You can usually set these to download only as many levels “deep” as you want, so you will not download as many files.

    A “view source” will NOT download images or other referenced objects in the page, just the text! If I were you, I would fiddle with the tools available.
    http://www.downthemall.net/

    But the images still point to where they are located; the image locations did not change.

    If I have an image at http://www.mywebsite.com/images/lolcats.jpeg, then when I do view source, it will still point to http://www.mywebsite.com/images/lolcats.jpeg. My point is to host this HTML file, not to download the site.
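    talbina’s point is that view-source HTML keeps absolute image URLs, so a rehosted copy still loads them from the live site. Python’s `urljoin` shows why the copy’s new location only matters for relative references (the `/archive/` path here is a hypothetical variation on the URL above):

```python
from urllib.parse import urljoin

# Hypothetical location where the saved HTML copy is rehosted.
archive_page = "http://www.mywebsite.com/archive/2011-04-01.html"

# An absolute reference resolves the same no matter where the copy lives.
absolute = urljoin(archive_page, "http://www.mywebsite.com/images/lolcats.jpeg")
print(absolute)  # http://www.mywebsite.com/images/lolcats.jpeg

# A relative reference resolves against the copy's own location -- this is
# what breaks when a crawler hasn't rewritten the links.
relative = urljoin(archive_page, "images/lolcats.jpeg")
print(relative)  # http://www.mywebsite.com/archive/images/lolcats.jpeg
```

    So as long as the original images stay live at their original URLs, the pasted view-source HTML keeps working; it only breaks for any links written relatively.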

    I guess I should have been more clear in my original post.

  • #8 / Apr 25, 2011 11:00pm

    narration

    773 posts

    Hmm. Well, the thinking behind this HTML-only site look back is that you wouldn’t have to store the images, etc., so long as you were willing to keep them live at the site.

    I was just mulling this over, in fact, after the earlier look at how to ‘really’ make a site that could look back in all aspects, and what the cost and resources would be.

    I think, talbina, that you’re narrowing down your requirements here in a way that may produce a workable design as you keep at it.

    I believe you’re now saying:

    - once a day snapshot
    - one page (site home) only
    - willing to keep the resources such as images, javascript, css etc. undisturbed at the site. However, complications may enter here, as libraries like jquery, and perhaps your css as well, don’t tend to remain stable, though they may remain compatible enough for the page to keep working adequately.
    - unlike the Wayback Machine, doesn’t purport to be a solution dependable over the ages.
    - may not need always or over time to deliver picture-perfect results.

    You could make a first try of such an arrangement by putting the html into a textfield of a simple EE channel, and displaying it using a template that had nothing in it but an {exp:channel:entries} tag finding the page and naming that text field. I think 😉  And that’s what feasibility tests are for.

    If this works adequately, you could do a manual version by simply copy-pasting the site view source html once a day. Something more automatic could be pretty readily constructed as an extension or module, along with using something like ee-cron to run it on schedule.
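    The once-a-day routine Clive describes could be prototyped outside EE entirely before building an extension. A minimal sketch (all names here are mine; the dict is a stand-in for the EE channel that would actually hold the snapshots, and a system cron or ee-cron would trigger the run daily):

```python
from datetime import date
from urllib.request import urlopen

def fetch_homepage(url):
    """Grab the rendered HTML once -- the automated 'view source' step.
    (Requires network access when actually called.)"""
    with urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

def file_snapshot(store, html, day=None):
    """File the HTML under a date key: one snapshot per day, where a
    later run on the same day simply overwrites the earlier one."""
    day = day or date.today()
    store[day.isoformat()] = html
    return day.isoformat()

snapshots = {}
key = file_snapshot(snapshots, "<html>...April 1st front page...</html>",
                    date(2011, 4, 1))
print(key)  # 2011-04-01
```

    The date key maps naturally onto the kind of URL talbina proposed, with each day’s entry served back as-is.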

    Anyway, an interesting idea, talbina, and you’re most welcome for the pokes into it above.

    Regards,
    Clive

  • #9 / Apr 25, 2011 11:01pm

    narration

    773 posts

    p.s. interesting other viewpoints, Craig, and am sure it all helps here.

    Regards,
    Clive

  • #10 / Apr 25, 2011 11:26pm

    narration

    773 posts

    p.p.s. Here’s one more chicken for the pot.

    Adobe Acrobat 9 or so, if you have that, seems to do a bang-up job of downloading a page (or could be a site depth of your choosing).

    The pictures are captured (and available) at screen resolution, the links work, the overall site page presentation looks exactly right, and the package is nicely compressed.

    Note that this is via capture from within the Acrobat application; printing to Acrobat from your web browser doesn’t work nearly as well, usually losing much of the layout and making a mess where menus are, etc. Though I have been using that to capture pages for several years, I have been thinking of moving to the full Acrobat route, now that an upgrade lets me see how it’s matured.

    I think this, plus the details necessary to get the PDFs and thumbnails uploaded to a directory for EE, may be workable if you have something like an Adobe Creative Suite, though I haven’t researched this point. Once on the server, again an ee-cron’d add-on could put them into a channel for viewing.

    This may not be at all how you’d like to proceed, and I rather like the html skeleton approach, assuming we’ve read your needs correctly, but it would provide a nice sharable archive which really does hold the page view for the day. Stranger things have been automated, I am pretty sure.

    C.
