The good old days
Every once in a while we long for times gone by, but not so in TYPO3 development. I assume here that we use RealURL to generate those nice URLs. RealURL is one of those improvements we cannot live without, as it shows us where we are in a website in a logical fashion (if configured correctly). Let's hit a non-existent page in TYPO3 3.8. We are presented with the following error message in good old-fashioned style.
[image 1]
OK, so now what? It's nice to know the page does not exist. If correctly configured in the install tool, the server sends a correct header (HTTP/1.x 404 Not Found), letting the client (your browser or Google) know that this page is not here. I presume Google is intelligent enough to use this information and not list this URL in its search results.
A step into the future
TYPO3 4.0 provided a 'hidden/undocumented' feature that lets you define your own 404 page as a file or URL using a configuration setting ([FE][pageNotFound_handling] = /404/). One option is to create a "static" HTML file and set the path to it in the configuration, prefixed with "READFILE:"; TYPO3 will then show it as the 404 page. However, static files have their limits: you cannot display, for example, an automatically updated sitemap with one. This is where URLs come into play. If you set a URL (relative or absolute) for pageNotFound_handling, TYPO3 will fetch it and display it as the 404 page. The problem with this solution was that TYPO3 (3.x and 4.0) redirected you to this 404 page. Technically, this means that a search engine like Google was not able to recognize the response as a 404, so it would never remove the page from its database. Also, the HTTP error codes section of a statistics program like AWStats would never indicate that you have missing pages on your website.
[image 2]
Finally not found
Since TYPO3 4.0.1 the same setup is possible, but a real 404 is sent to the client. Now Google knows what is going on and will remove the page from its database. It may be nice to know that during development a bug caused the error handling to loop while trying to find a 404 page that was not there, which brought the test server (actually production) to its knees several times. This is all solved now, and the following error is shown instead.
[image 3]
Unfortunately, the error message is not complete yet, but it gives a good indication of what is going on. Of course you should provide your 404 page with useful information and I will show you later how to do this.
My statistics program also gives me some correct feedback now.
[image 4]
In this case you can also click this 404 and see which pages visitors attempted to reach.
How can you make this work for your site
To set up a 404 page for your site, go to the install tool and find the setting named [pageNotFound_handling]. Enter the URL where your 404 page is located. If the page is on the same domain, use the URL without the domain name (e.g. /404/). If it is on another server, the URL must be absolute.
Create a hidden page with the title 404 in the root of your website. Multiple domains in one install are supported, so you can create a 404 page for every (sub)domain you have. Now provide your visitor with meaningful information. Maybe you have recently moved whole parts of your website. Maybe you have thrown away some pages by accident. Let your visitor know what is going on. Provide a link to the search engine on your website and perhaps provide a sitemap.
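Once the page is in place, it is worth checking that a missing URL really answers with a 404 status and not with a redirect to the error page. The following is a minimal sketch in Python, not part of TYPO3; the function names fetch_status and describe are made up for this illustration. It sends a single GET request without following redirects and reports what came back.

```python
import http.client
from urllib.parse import urlsplit


def fetch_status(url, timeout=10.0):
    """Send one GET request and return (status, Location header).

    http.client does not follow redirects, so a redirect to the error
    page shows up as 301/302 here instead of being hidden.
    """
    parts = urlsplit(url)
    if parts.scheme == "https":
        conn = http.client.HTTPSConnection(parts.hostname, parts.port, timeout=timeout)
    else:
        conn = http.client.HTTPConnection(parts.hostname, parts.port, timeout=timeout)
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("GET", path)
        response = conn.getresponse()
        response.read()  # drain the body so the connection closes cleanly
        return response.status, response.getheader("Location")
    finally:
        conn.close()


def describe(status, location):
    """Turn the raw result into a short human-readable verdict."""
    if status == 404:
        return "real 404"
    if status in (301, 302, 303, 307, 308):
        return f"redirect to {location}"
    return f"unexpected status {status}"
```

For example, describe(*fetch_status("http://www.example.com/no-such-page/")) should yield "real 404" on a correctly configured site; the host name is a placeholder for your own domain. A "redirect to ..." verdict corresponds to the older behaviour described earlier, where search engines never see the 404.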
Under the hood
When TYPO3 detects that a page does not exist, it executes special functions that determine how this situation should be handled. The most common situation is to fetch a dynamic page. Since the error page can be in another domain, TYPO3 puts a proper <base> tag into the page ensuring that the page will display images and links properly.
One problem that arises from fetching the error page is logging. If you use Webalizer, you can tell it to ignore this page in reports. However, if your log analysis software does not allow excluding certain pages, you may want to exclude this page using conditional logging as described on the Apache web site (http://httpd.apache.org/docs/2.0/logs.html#conditional).
For AWStats you can use this next line to exclude the 404 pages:
SkipFiles="/404/ REGEX[^\/typo3]"
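If none of these options is available, the access log can also be cleaned before it reaches the statistics program. The sketch below is a plain Python filter, not a feature of Apache, Webalizer or AWStats. It assumes log lines in Common Log Format and an error page at /404/; the function names are made up for this illustration. It drops the requests for the error page itself and keeps the visitor's original 404 entry.

```python
import re

# Matches the quoted request part of a Common Log Format line,
# e.g. "GET /404/ HTTP/1.1", and captures the requested path.
REQUEST_RE = re.compile(r'"[A-Z]+ (\S+) [^"]*"')


def is_error_page_fetch(line, error_path="/404/"):
    """Return True if the log line is a request for the error page itself."""
    match = REQUEST_RE.search(line)
    if match is None:
        return False
    # Ignore any query string when comparing paths.
    return match.group(1).split("?", 1)[0] == error_path


def filter_log(lines, error_path="/404/"):
    """Return the log lines without the requests for the error page."""
    return [line for line in lines if not is_error_page_fetch(line, error_path)]
```

Only an exact match on the path is removed, so a visitor's request for a missing page such as /no-such-page/ stays in the log with its 404 status.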