The idea is interesting and even kind of tech-funny.
But man, you really have to explain how it works a bit better. At first I thought we should redirect 404s to your website, and I was like: "??".
What I understood:
With each iteration of the website, you archive the old one on a specific subdomain. Then you redirect all 404s on the new website to the old one. That way, no link is broken.
Yeah, I didn't understand and jumped straight to the comments to see if there was an easy explanation here. Guess I should have clicked "How?" at the top.
Even after reading that page, I didn't understand that this was a suggestion to start regularly archiving old versions of the site and only sending visitors there if their link didn't point to anything on the current site. Instead, I thought the idea was that, for software reasons I don't understand, web developers commonly changed the subdomain name for the main site and this was just a method for reducing the number of broken links when such a change was made.
FWIW Brave does that. If you reach a page that returns a 404, a banner appears at the top with a button to try to navigate to the latest archived version of the page you're looking for.
There's a fantastic extension for Firefox, Chrome, etc. called Web Archives which will attempt to find a copy of the missing page from the Wayback Machine, Google Cache, Archive.is and many more.
Ignoring database-defined URLs, which make this harder or at least different, you could automate this to some degree with snapshots (for filesystems that support them): use the snapshot time as the subdomain or path prefix that references an older version, and have the 404 page note the current domain, find the next-oldest snapshot, and redirect to that subdomain or path prefix.
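A rough sketch of what that automated 404 handler could look like, assuming the snapshots are exposed as dated subdomains (the snapshot names and example.com are invented for illustration):

    <?php
    // Hypothetical 404 handler: redirect to the next-oldest snapshot subdomain.
    // Snapshot names and example.com are invented for illustration.

    $snapshots = ['2023-01-01', '2021-06-15', '2019-03-02'];   // newest first

    $host    = $_SERVER['HTTP_HOST'];              // e.g. "2021-06-15.example.com"
    $current = explode('.', $host)[0];             // which snapshot are we on, if any?
    $index   = array_search($current, $snapshots, true);

    // On the live site, fall back to the newest snapshot; on a snapshot,
    // fall back to the one just older than it.
    $next = ($index === false) ? 0 : $index + 1;

    if ($next < count($snapshots)) {
        $target = 'https://' . $snapshots[$next] . '.example.com' . $_SERVER['REQUEST_URI'];
        header('Location: ' . $target, true, 302);
    } else {
        http_response_code(404);                   // no older snapshot left: a real 404
        echo 'Not found in any snapshot.';
    }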
404s are useful though... how annoying would it be if you wanted to get to a specific part of the site, and you kept getting redirected somewhere else without being informed that the part you're trying to access doesn't actually exist. 404 errors exist for a reason.
This is right - 404 pages are the right UX when you have a 'stateful' resource that can be deleted: you need to show that the URL (or ID param within it) is correct and once pointed to a resource, but that resource has since been permanently deleted and can't be shown any more.
In a sense this information conveyed by the 404 page is now the immutable 'resource' that will stay permanently at that URL. Doing a redirect breaks this, it's lossy and usually a bad UX.
It's used in APIs, just not user-facing documents (or rather would-be documents). Too late for that though, while laymen pretty much get '404', introducing a subtly different numeric code to co-exist would be a bit much, I think. And also pretty worthless, anecdotally I think actually displaying big '404' text is on a downward trend, probably because of the prevalence of SPAs/webapps in general.
Yeah, I mean, for the lay public to have to know what "404" means feels like a side effect of the early web being a frontier. What's important to tell the user is "there's nothing here", and that can and should be done without invoking any magic numbers at all.
A super useful distinction for APIs for similar reasons. Interesting that these statuses are often remapped on the web, where a 410 is often implemented as a 404 (as in the scenario I describe) and the 'true' 404 is often implemented as a 302.
Quite the contrary: if the _resource_ still exists but in the older site, the redirection will point the browser to the other location. OP's stratagem is doing exactly what it's supposed to do (if the identifier of a page is its path and not the full URL of course)
The subpage called "How"[1] covers this though. The idea is to redirect you through all the historical versions and then to a 404 if no match is found on any of them.
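As I read that page, each archived version only needs a tiny 404 handler that hands the request one step further back in history. A minimal sketch of one link in that chain, assuming PHP and a made-up subdomain (not the site's actual config):

    <?php
    // Minimal 404 handler for one version of the site: pass the request one
    // step further back in history with a 302. "2015.example.com" is a
    // placeholder; the oldest version in the chain would return a real 404
    // instead of redirecting further.

    $previous = 'https://2015.example.com';   // next-older version of this site
    $path     = $_SERVER['REQUEST_URI'];

    header('Location: ' . $previous . $path, true, 302);
    exit;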
doesn't this mean when they end up on your 1997 site (without realizing it, because all they did was click a link), and then try to navigate around, they're stuck in the old version of your site?
edit: maybe not, because the old site was written to assume it's running at the current site's subdomain? i guess it depends on how much you've changed your URL structure since then. that thought makes me a little squeamish.
it seems like a nice approach would be to return the 404, and make your 404 page render a link that says "try an archived version?". you gotta let the user know that what they're about to see might be stale.
Yeah. I really like this idea of warning the user. Keep the usability without losing the functionality.
Also, not sure if anyone is thinking of this, but there are security concerns if you serve up old pages on a new domain. If an old page has a vulnerability, it now has access to data on the new domain. And since this is more about archiving, the old pages won't get patched.
These ideas don't scale though - a 404 handles everything, not just textual content pages (meaning like images, css, js, etc. - the underlying plumbing for most sites). Nobody is going to sit there and write 404/410 content for moved assets one by one when they rebrand their site, that's madness.
In my travels, most folks who actually care about this (SEO links) build an extensive alias/redirect map for the old/previous core URLs they want to remain functional as part of their rebrand, when the new site goes live the redirects are dropped in as well. This is especially true if they've ever published the URLs on physical media (mailed postcards, e.g.).
The home page is slightly misleading, the nginx config is of course the easy part, the hard part is to correctly archive all the previous versions of your website every time you make a change. I thought this website provided an archival service but it does not.
People who care enough about preserving history and current links probably already do that. People who don't care aren't going to start now because of this page. Especially those who have dynamic content and probably don't want to keep running a million different versions of their backend forever.
If you like something on the web then make a copy.
Doing this as a consumer and doing this as a webmaster are different processes.
As a webmaster, if it's at all possible to go static (whatever your flavor of that is), then do that. A static website is easy to host and keep forever, and it's usually easy for consumers to archive too.
Not to hate on PHP, but keeping older PHP sites around securely has become a major undertaking. You can't safely run a WordPress site that hasn't been updated in 5 years, because its security vulnerabilities are exposed to the wide web. If your static site generator has security flaws... well, that doesn't affect your current build artifacts, and you can still run the thing in secure ways.
A better solution might be to start the archiving process from the beginning. Having a main page that links to content stored on 2020.website.com. The following year, publish new content on 2021.website.com, etc.
Remember "Cool URIs Don't Change"? I think about it a lot.
It was written toward the end of the BOFH's reign, when a technical specialist of the web had quite a lot of sway, when their decisions about a site's information architecture and how it was run were, if not the law, at least a very heavy hand on the tiller.
Those days are long past. Now Ted in Marketing wants a URL and who are you not to give it to him? I remember the pain of creating vanity top-level URLs in SharePoint 2003 because some functionary wanted them, and then they would promptly forget what they demanded. Yes, I used to use 410 Gones where appropriate.
That sort of thing has not been in our hands for quite a while, even if it is probably the best thing to do. After all, has the product URL changed? Or will it be back? Or has it been discontinued? The correct HTTP response, properly and widely used, would be very helpful in moving so much of the web forward, but that is not under our control. Hasn't been for a while.
First, it unduly burdens the server, in sending multiple redirects to cover the entire search space of possible versions of a URL - for a mature site, this could be a lot of redirects. It also unduly burdens the client, in following them, and the network between the two.
Second, 302 is the wrong type of redirect to use here, because it is temporary; a well-tempered user agent will treat it as such, necessitating the same cascade of redirects on followup visits. The right way to do this is with a 301, which has a semantic of permanency, and is treated as such by user agents. But it's still the wrong thing to do.
Maintaining access to older versions of websites is, again, an entirely desirable thing to do. But if you're going to do it in a way that requires work on the server (as this design also does), you're better off just having the server maintain version information and serve the latest available page at a given URL, in a 200 response, when the URL is accessed.
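Concretely, something along these lines - keep each version's files in a dated directory and have the server walk them newest-to-oldest, serving the first hit with a 200 (the directory layout is invented for the sake of the sketch):

    <?php
    // Sketch: resolve the URL server-side instead of cascading redirects, and
    // serve the newest archived copy directly with a 200.
    // The /var/www/versions/* layout is invented for illustration.

    $root     = '/var/www/versions';
    $versions = ['current', '2021', '2018', '2015'];   // newest first
    $path     = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

    foreach ($versions as $version) {
        // realpath() also guards against ../ escapes out of the version directory.
        $file = realpath("$root/$version" . $path);
        if ($file !== false && strpos($file, "$root/$version/") === 0 && is_file($file)) {
            http_response_code(200);
            readfile($file);    // a real handler would also set Content-Type here
            exit;
        }
    }

    http_response_code(404);
    echo 'Not found in any version of the site.';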
I don't have any opinion on whether 302 or 301 is the better choice, I'd just like to point out that using 302 seems to have been a deliberate decision:
Google treats 302s and 301s differently when it crawls. A 302 is effectively ignored and the search index isn't updated - the original URL that threw the 302, is left in. Whereas 301s result in the new redirect target URL replacing the original in the search index.
Filling up your site with nested 302s (following this to its conclusion, in ~10 years' time) is not only a management headache, but may fall foul of Google (I'm not sure nested 302s will send positive signals) and result in your whole site being de-indexed.
So I'm aware of at least the BBC using this approach, where opening really old articles reveals they still have their 90s site up. I'm also aware that at my last employer, an unmaintained wiki with open security issues, which nonetheless held vital information for still-in-use legacy internal software, was replaced by saved static HTML grabs so the information wasn't lost.
But for a lot of medium-sized companies with dynamic websites, this isn't always practical. They may not have the know-how to dump their 2000s Drupal install to static HTML files, and don't have the IT staff to upgrade and secure it.
I wish people would change their language regarding tech debt. The companies of your second paragraph choose not to upgrade and secure their websites. It's not something unavoidable.
The point your parent poster is making is that these companies instead choose to shut the website down, which is a legitimate alternative to updating and securing it -- unlike just leaving it up without maintenance, which is not.
That's why I specifically indicated the companies referred to in the second paragraph of the parent post. The BBC and the unnamed company where Macha was working did it correctly.
Have to agree with all of the disagreements to this. The 404 is a 404 for a reason, just like 301 and 302 are different for a reason. It's not uncommon, though, for WordPress (or blogs in general) to do things like this. If an author changes the title or date of their post, and the URL structure relies on those two pieces of data, then the URL will change. The old URL is preserved in a DB and, if accessed again, 301s to the newly named resource. Others will throw a 404 and give a cutesy Levenshtein message, "did you mean x?", at which point the user can decide to go to the new resource. It's all circumstantial... It shouldn't be enforced.
Re: Google and PageRank, I'm pretty certain they've addressed this and now recognize 302s and 301s and treat them the same. Previously, this was an issue.
1. Create a 404.php (or whatever your preferred back-end is)
2. mod_rewrite real 404s to serve that script.
3. In that script have a lookup table/db/file that lists all the redirects you need.
4. Extract the requested URL from the server variables.
5. Use the lookup table to find the correct URL, and issue a 302 for it to the user's browser.
It's kinda seamless, and I've been doing it for years.
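Roughly, the script itself only ends up being a few lines. A sketch (the lookup-table format here is just an example):

    <?php
    // 404.php - invoked by mod_rewrite / ErrorDocument for anything that would
    // otherwise 404. The redirects.php map format is just an example.

    $redirects = include __DIR__ . '/redirects.php';
    // redirects.php: return ['/old-post.html' => '/blog/new-post/', ...];

    $requested = parse_url($_SERVER['REQUEST_URI'], PHP_URL_PATH);

    if (isset($redirects[$requested])) {
        // Known old URL: send the browser to its new home.
        header('Location: ' . $redirects[$requested], true, 302);
    } else {
        // Genuinely gone: fall through to a real 404 page.
        http_response_code(404);
        include __DIR__ . '/404.html';
    }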
There was a post here that was deleted by the author before I could write my reply. But he ended his comment with this:
> You know why the web is broken? Because nobody cares.
I agree with this.
404s aren't a technical problem, they are a maintenance problem. If there was time, budget or interest to fix it, the 404 wouldn't have existed in the first place.
I appreciate the author’s work on so many projects, but the quote doesn’t resonate with me. There is a new post almost every day on HN about “saving the Web” that gets voted to the top - and it has been this way for several years now.
But I don’t think fixing broken links is at the top of our priority. Many links need to be broken. And I’m glad my old stuff isn’t around. Hopefully much of my new stuff will go away too. I think this is just humanity’s process - to sift through it all and hang on to the pieces people want to save.
Not for enterprise SEO or inbound marketing it isn't - when I worked for part of Reed Elsevier, we found 12-15% of links to the home page went to the "wrong" page.
You can recover a lot of value from recovering old links that would take a lot of effort to replace.
> 404s aren't a technical problem, they are a maintenance problem. If there was time, budget or interest to fix it, the 404 wouldn't have existed in the first place.
Doesn't that assume that all things on the internet are permanent? Why should that be the case? If I have a page on my website and I decide to delete it then I should be able to do that. Having links that pointed to it return a 404 is correct. 404s are useful. They convey real information.
Sites that do things like 302 redirection to the home page when the link is apparently incorrect are annoying - you can never tell if the page is really gone or if the website has incorrectly bounced you to the home page.
> Indicates that the resource requested is no longer available and will not be available again. This should be used when a resource has been intentionally removed and the resource should be purged.
Ok, that would be a better response. The point still holds true for things that a link is pointing at but aren't found for other reasons though - for example, if a storage server has failed.
It looks like they want websites to do the following: whenever your website goes under a drastic change, serve the previous version of the website under a subdomain, and redirect all 404s on the newer version to the subdomain. More drastic changes in the future may potentially redirect 404s to previous subdomains/versions like a chain. At least that's my understanding of it.
It is poorly explained. I had to read the about page.
Instead of 404s, you redirect to a previous version of your webserver (that is still running), which then instead of 404s redirects to a previous version of your site (that is still running), and eventually it tries the wayback machine.
So never pull down a previous version of your website, and issue 302s to that?
I’d like to be able to show any eventual 404 from the current version of the site though, which means there may need to be a wrapper around the terminal site (or more reasonably, code that runs locally on the server to find the right page URL and 302 directly to that rather than a client round trip for every version searched).
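Something like this is what I have in mind: the server probes each archived host itself and sends the browser on a single hop at most, or falls through to the current site's own 404 page (hostnames are placeholders):

    <?php
    // Sketch: probe each archived version server-side and 302 the browser
    // straight to the first one that has the page. Hostnames are placeholders.

    $versions = ['https://2021.example.com', 'https://2016.example.com'];
    $path     = $_SERVER['REQUEST_URI'];

    foreach ($versions as $base) {
        // get_headers() issues a GET by default; a HEAD request would be lighter.
        $headers = @get_headers($base . $path);
        if ($headers !== false && strpos($headers[0], ' 200') !== false) {
            header('Location: ' . $base . $path, true, 302);
            exit;
        }
    }

    // No archived version has it either: show the current site's own 404 page.
    http_response_code(404);
    include __DIR__ . '/404.html';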
You are not alone there. I tried multiple times to understand what the author is trying to convey, without really being sure I grasped it.
At first I thought that it was simply some advocacy for taking the time to route legacy content using 302, but it also seems to be some sub-domain trick with years…?
The idea is to keep all old versions of your website, and when somebody requests a URL, try to find that URL on all versions of your website (starting with the newest), and only return 404 Not Found if the URL does not exist in any of the archived versions.
In a nutshell, it simply gives you the ability to present fallback content with 302s - a gentler redirect than a 301, which would tell you the content has permanently moved.
I've maintained link integrity for a long time on my personal site using a similar tactic; however, for me the new WordPress version of the site had very little URL overlap with the old version, so I just converted the old PHP version to static HTML and set up nginx to serve both. When I migrate off WordPress, I expect I'll maintain a similar approach. No reason to maintain an older version of the site which could have security issues.
If I'm reading this correctly, this concept works only if you never take down old sites, and you have a full archive.org copy out there.
This is not realistic for large SaaS apps. I run a SaaS app that has been online for over a decade, with millions of public documents. While we do strive to minimize URLs changes, and we do have robust redirects in place for old URL formats... it is simply not feasible to keep old versions online.
My apologies if I am repeating what others have said, but just yesterday I chose to give a 404 for an operation that could not be concluded because a resource is not available. A 404 is not about pages but about resources. It is perfectly legitimate that a resource is not available anymore or is temporarily down in a non internal error kind of way.
A REST request to an IoT device can return 404 if the device is not available at the moment. Any redirect is meaningless and actually breaks the semantics of 404. My understanding is that 404 has become so connected with file-like, persistent resources that people forgot the elegance of the wording in the standard.
I can understand the popular usage changing, but I think the standard then actually becomes incomplete, because it only fits web pages or document-type resources, when an HTTP status code is about the Hypertext Transfer Protocol.
An alternate view, disagreeing with what I postulated above, is that HTTP has actually been abused and is being used in scenarios that have nothing to do with hypertext - looking at you, JSON REST :D
I don't think 404 is the right code for "resource not available". If you want to send a request to a device, and it's not online, that is "503 Service Unavailable". The device exists, after all, it just isn't available to handle the request.
Ultimately, HTTP has a very weak set of status codes for application layer concerns. The vast majority of the ones that sound good mean something about the HTTP layer rather than the application layer -- for example, it makes sense to return "precondition failed" if you request deletion of a directory that isn't empty (the precondition for deleting directories is that they are empty), but "412 Precondition Failed" means that the client supplied an HTTP precondition like "If-Match: abc123" and it failed. Trap for the unwary.
For this reason, I think REST largely fails to provide what API authors and users desire. If your API server can successfully convey an error to the client, you might as well say '200 OK {"error": "The Raspberry Pi you were looking for exists, but it's turned off or someone used its Ethernet cable to test their scissors and the test went very well."}' Now the client actually knows what's going on.
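i.e. something along these lines, where HTTP only says "the exchange itself worked" and the body carries the application-level outcome (the payload shape is made up for illustration):

    <?php
    // Sketch: the transport says the HTTP exchange worked; the body carries
    // the application-level outcome. The payload shape is made up.

    header('Content-Type: application/json');
    http_response_code(200);

    echo json_encode([
        'ok'    => false,
        'error' => [
            'code'    => 'device_unreachable',
            'message' => 'The device exists, but it is not responding right now.',
        ],
    ]);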
The Ethernet cable scissor test success made me laugh :D.
We agree on the general weak suitability of HTTP codes to convey application-layer status, but I kind of disagree on the 503, because it implies an overload or an internal error, which is not the case. Also, I really believe "resource not available" is as accurate as it gets.
In the end I agree with you we may be splitting hairs with a very unsuitable status report and should just send 200 with the error. I have done this before actually, but it tickles my neck a lot to see an OK, then error.
I recently dropped my last shared hosting in favor of Vercel/Netlify and some content was lost. This solution wouldn’t work for me because the very reason why the content was lost is that I don’t want to pay for hosting I barely use.
A better solution would be an intermediary part that never changes — say, CloudFlare — that caches HTML pages forever, automatically adds a “Archived content” header to the page, and warns the author so that they can either allow the archived version or make it a 404/410 instead.
Nobody wants to maintain servers forever, but serving static/frozen pages is much easier and cheaper.
On a similar note, my blog is hosted on github pages.
Due to laziness and inertia I've kinda just left it there as it just works, but the lack of control makes me uneasy.
So at some point I want to move it, hosted by myself (with CDN) and under my own domain. Except all the links that people have used to link to my GH pages blog are still out there and I'm not sure how to cleanly redirect visitors to my own hosted version of the blog.
window.location hack? Has anyone else done this (I appreciate my lack of foresight on my behalf is a problem of my own making...)
Well, if it stays as a static site, you could always set your server to push any changes to GitHub as well, so you effectively keep your GitHub pages site up as a mirror of your personal domain. You could also make all of your navigation links absolute links that redirect to your new domain.
Or, like you suggested, make a skeleton site out of the current version of the site where each page just has a bit of JS to redirect a visitor.
This is a good idea, but I'm afraid it's rather unworkable in practice. If you're going to remove a page, you either didn't want it up, or you moved it somewhere else, or your entire site is unavailable. If you moved it somewhere else, you should add a 301, but for all other cases it's not for lack of will that the link died.
The nice thing about IPFS is that it has this out of the box. Pages can never die as long as at least one node has them in their cache, even if the original owner went away.
At least you can look up what 404 means and reason about what's going on. What does this change solve? I mean if you moved your site then you would issue a 301 from the old host showing people it had moved. I don't know when exactly people make lots of different versions of the same website where it makes sense to fail over to the old one if the new one doesn't work. Baffling
It has a very specific use case, and will keep giving strong visibility to the old, unmaintained site for people who had old links to it.
A more comprehensive 404 page (one that says that what you were looking for is not there, and links to the current and the old site), or redirects in the new site for the most-accessed URLs of the old one, are better approaches in my opinion.
I remember learning one of the golden rules of the Web is you don't break links. It's a shame it's regressed this far. Now it's normal to get many 404s every day.
I thought this was about a service that tries to find the right url and redirect back to your site at first. That would be a nice idea, like a spell correcting url finder for GET requests.
If you crawl the old version of the site and save off static html, then the hosting costs would be minimal.
If you added a banner at top saying it's "archived content" then that would also solve the issue with people being confused by the redirect that other comments have had.