Unixronin

Sunday, October 10th, 2004 04:38 pm

rbos just mentioned the robots.txt file on http://www.whitehouse.gov, and I went and took a look at it myself.  (For those who don't know, this is a text file that tells search engine crawlers which locations on the site they are not to spider or index.)  It makes interesting reading, particularly when considered as evidence of clinical paranoia.
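For anyone who wants to see how a crawler reads such a file, here is a minimal sketch using Python's standard `urllib.robotparser`. The sample rules, the `ExampleBot` user agent, the example.com URLs, and the `is_allowed` helper are all made up for illustration; none of them are copied from the real whitehouse.gov file.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical excerpt: these paths are illustrative, not from the real file.
SAMPLE_ROBOTS_TXT = """\
User-agent: *
Disallow: /example/private/
Disallow: /example/photoessays/
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if the given rules let user_agent fetch url."""
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


# A path under a Disallow rule is off limits to a compliant crawler.
print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot",
                 "http://www.example.com/example/private/page.html"))

# Anything not matched by a rule is fair game.
print(is_allowed(SAMPLE_ROBOTS_TXT, "ExampleBot",
                 "http://www.example.com/index.html"))
```

Note that the file only works on crawlers that choose to honor it. Nothing in it stops a browser, or a badly behaved bot, from requesting the listed paths directly.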

Search engines are forbidden to index, for example:

  • Huge numbers of directories mentioning Iraq; OK, reasonable enough I suppose, if there's classified material in there that you were stupid enough to put on a publicly accessible website.
  • Baseball photoessays?
  • Climate change fact sheets?  Do you have something to hide there, Mr. Bush?
  • /easter/2004/eggsbystate/text ..... whatever that is ....
  • All the First Lady's news speeches.... er, hello???
  • Ditto, all the First Lady's photos and photoessays....
  • Art history links?  History of First Ladies?  History of the grounds?
  • All the historical photoessays?
  • All the Independence Day 2004 photoessays?
  • Several HUNDRED directories of in-focus links on everything from education to tax relief to rural America to US veterans to small businesses?
  • All of the press releases for the past four years, including the State of the Union addresses?!?!?  Why on earth would you exclude your own press releases from being indexed by search engines?  (Unless, of course, you want to be able to do a little historical revisionism on them later without anyone noticing....)

Find the file here.  I don't know what there is in the 1931 lines of exclusions in this file that the White House thinks needs to be hidden from search engines, but as rbos pointed out, they'd look a whole lot less guilty (not to mention less paranoid) if they just blocked search engines from the entire site.  I mean, wouldn't the sensible thing be to put all the sensitive material on a different site that's not publicly accessible?  Or for heaven's sake, just move all those 1931 directories into a new /restricted directory, exclude just /restricted, and make it accessible to whitehouse.gov hosts only.
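To make the comparison concrete, here is a rough sketch of the three approaches: a long per-directory list, a single /restricted rule, and a whole-site block. The `LONG_FORM` paths, the `ExampleBot` user agent, and the helper names `count_disallow_rules` and `blocks` are assumptions made for illustration; only the /restricted directory comes from the suggestion above.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical stand-in for a long exclusion list; not the real file's contents.
LONG_FORM = """\
User-agent: *
Disallow: /topic-a/text
Disallow: /topic-b/text
Disallow: /topic-c/text
"""

# The consolidated alternative: one rule covering one directory.
CONSOLIDATED = """\
User-agent: *
Disallow: /restricted/
"""

# Blocking crawlers from the entire site also takes a single rule.
WHOLE_SITE = """\
User-agent: *
Disallow: /
"""


def count_disallow_rules(robots_txt: str) -> int:
    """Count non-empty Disallow rules, ignoring comments and blank lines."""
    count = 0
    for raw_line in robots_txt.splitlines():
        line = raw_line.split("#", 1)[0].strip()
        field, _, value = line.partition(":")
        if field.strip().lower() == "disallow" and value.strip():
            count += 1
    return count


def blocks(robots_txt: str, path: str) -> bool:
    """Return True if a compliant crawler would be told to skip path."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return not parser.can_fetch("ExampleBot", path)


print(count_disallow_rules(LONG_FORM))
print(count_disallow_rules(CONSOLIDATED))
print(blocks(CONSOLIDATED, "/restricted/memo.html"))
print(blocks(WHOLE_SITE, "/index.html"))
```

The robots.txt rule only keeps compliant crawlers out. Limiting /restricted to whitehouse.gov hosts would have to be done separately in the web server's access controls, which this sketch does not cover.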

I still can't wrap my brain around the kind of mental convulsions it must take to come up with a reason to exclude all of your own public press releases and State of the Union addresses, publicly posted on your own site, from search engine indexing.

Sunday, October 10th, 2004 08:07 pm (UTC)
Project for you, my friend. Write a script to mirror the site and make it look like several casual browsers. Better, write something that works like SETI@home so that we can mirror the entire site in a distributed fashion, and upload it to a central CVS repository via SSH.

We'll figure out what the fsck he's up to.
Sunday, October 10th, 2004 08:30 pm (UTC)
Hmm. Basically not difficult. Selection of a safe central repository, though ..... I wonder if HavenCo would donate a chunk of space?
Sunday, October 10th, 2004 08:33 pm (UTC)
Define safe?

As for HavenCo... ask?
Sunday, October 10th, 2004 08:47 pm (UTC)
"Safe," in this context, probably means (a) offshore, and (b) not likely to roll over at the first suggestion from US authorities that they'd be pleased if such-and-such a site went away and all names connected with it mysteriously appeared in their mailbox one day.
Wednesday, October 13th, 2004 11:49 pm (UTC)
I have done something better ..... I have unleashed the Archive on the problem.