Version 5 (modified by Daniel Kahn Gillmor, 9 years ago)

Validating your web site

If you post content on the web, and you want it to be readable by everyone, it makes sense to try to post that content in a universally-readable format. This page is about helping web site authors and administrators ensure that the data they offer is presented in a universally-readable format.

Declaring a Standard

The first step i take when trying to debug cross-browser compatibility issues for web sites is to make sure that the pages i'm generating explicitly specify the standard they're targeting. It's better to say "please use XHTML 1.0 Strict" than to ask the browser to guess, in which case it might choose something else, like "HTML 4.01 Transitional". This is done with a DOCTYPE declaration at the very start of each page. If you're using a dynamic page generation tool, there should be a way to ask it to emit a DOCTYPE declaration for you.
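A quick way to see whether a generated page declares a standard at all is to look for the declaration at the top of the output. The following is a minimal sketch in Python; the function name and the sample page are made up for illustration, and the pattern only covers the common case where nothing but an optional byte-order mark, XML declaration, and whitespace precede the DOCTYPE.

```python
import re

# Matches a DOCTYPE declaration at the start of a document, allowing an
# optional byte-order mark, XML declaration, and whitespace before it.
DOCTYPE_RE = re.compile(
    r'^\ufeff?\s*(?:<\?xml[^>]*\?>\s*)?<!DOCTYPE\s+([^>]+)>',
    re.IGNORECASE,
)


def declared_doctype(page_text):
    """Return the text inside <!DOCTYPE ...>, or None if the page has none."""
    match = DOCTYPE_RE.match(page_text)
    return match.group(1).strip() if match else None


# Illustrative sample page (hypothetical content) targeting XHTML 1.0 Strict.
XHTML_STRICT_PAGE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>t</title></head>'
    '<body><p>hello</p></body></html>'
)

if __name__ == "__main__":
    print(declared_doctype(XHTML_STRICT_PAGE))
    print(declared_doctype('<html><body>no doctype</body></html>'))
```

A page for which this returns None is leaving the choice of standard up to the browser.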

Meeting the Standard

After i've specified a doctype, i make sure that the pages i'm emitting are syntactically valid according to the standard i've chosen as a target. One easy way to do this is to ask the W3C's validator to check the syntax of the pages.

The goal is to get the site to a state where the validator reports no errors. In principle this should be done for every page of a site, since each page might be different, but checking a few representative pages is a good start.

While this process can be frustrating initially, it really does help to ensure cross-browser compatibility. If your page doesn't syntactically match the standard you specify (or if you don't specify one at all), you're asking each browser to take its best guess at what you meant. Since browsers are written by different people, when you diverge from the defined common language, you'll run into different assumptions and have to deal with different quirks. Sticking to rigorous syntactic validity will help you minimize these kinds of surprises.
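For pages that target an XHTML doctype, one cheap pre-check before asking a validator is whether the page is well-formed XML at all. The sketch below uses Python's standard xml.etree.ElementTree parser; the function name and sample pages are illustrative assumptions. Well-formedness is necessary but not sufficient: this does not check the page against the DTD, so it is no substitute for a real validator. Since no DTD is loaded, named entities such as &nbsp; may also be reported as undefined.

```python
import xml.etree.ElementTree as ET


def well_formedness_error(xhtml_text):
    """Return None if the text parses as XML, else the parser's error message."""
    try:
        ET.fromstring(xhtml_text)
    except ET.ParseError as err:
        # The message includes the line and column where parsing failed.
        return str(err)
    return None


# Illustrative sample pages (hypothetical content).
GOOD_PAGE = (
    '<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" '
    '"http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">\n'
    '<html xmlns="http://www.w3.org/1999/xhtml">'
    '<head><title>t</title></head>'
    '<body><p>hello</p></body></html>'
)
BAD_PAGE = '<html><body><p>this paragraph is never closed</body></html>'

if __name__ == "__main__":
    print(well_formedness_error(GOOD_PAGE))
    print(well_formedness_error(BAD_PAGE))
```

A browser would probably render the second page without complaint, which is exactly the kind of guesswork that a strict check brings to the surface.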

Character Sets

In addition to making sure your site has valid syntax, you might also want to consider making sure that it uses Unicode (the most popular encoding of Unicode is UTF-8). This choice of character set encoding (or "charset") is crucial for sites that contain (or may one day contain) characters from outside the standard alphabet used in your native language. If you're not sure which character set your site uses, the W3C's validator will also report the charset your site defaults to. You can read up on declaring a new charset if you're interested in making this change.
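A page's charset is usually announced in one of two places: the HTTP Content-Type header, or a meta element inside the page. The following sketch shows one way to read both in Python. The function names are made up for this example, and the regular expression is a rough heuristic rather than a full HTML parser.

```python
import re
from email.message import Message


def charset_from_content_type(header_value):
    """Extract the charset parameter from a Content-Type header value.

    Returns the charset in lowercase, or None if the header has none.
    """
    msg = Message()
    msg["Content-Type"] = header_value
    return msg.get_content_charset()


# Rough heuristic: the first charset=... found inside a <meta ...> tag.
META_CHARSET_RE = re.compile(
    r'<meta[^>]+charset\s*=\s*["\']?([A-Za-z0-9._:-]+)', re.IGNORECASE
)


def charset_from_meta(page_text):
    """Return the charset named in a meta element (lowercased), or None."""
    match = META_CHARSET_RE.search(page_text)
    return match.group(1).lower() if match else None


if __name__ == "__main__":
    print(charset_from_content_type("text/html; charset=UTF-8"))
    print(charset_from_meta(
        '<meta http-equiv="Content-Type" '
        'content="text/html; charset=iso-8859-1" />'
    ))
```

If the header and the meta element disagree, that mismatch is worth fixing before anything else.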

Offline Validators

If you run a Debian-derived operating system (including Ubuntu), you might be interested in w3c-markup-validator and wdg-html-validator, both of which are tools designed to let you run similar validation tests on your own machine.

Running a local validator gives you more control over what you do with the output, and makes it easier to incorporate the tools into scripted or automated tests.

It also lets you run the validator on sites that you might not want to grant access to over the internet (for example, if you're working on a local staging or development copy of a site that is only available over a loopback device).
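As a sketch of what a scripted check might look like, the Python function below runs some validator command once per page and records whether it succeeded. The command is passed in as a parameter because the exact invocation depends on which tool you install. Both the assumption that the tool takes a file path as its last argument and the assumption that it exits with a non-zero status on invalid input need to be checked against your validator's documentation. The stand-in command here is only a placeholder so that the example runs without any validator installed.

```python
import subprocess
import sys


def check_pages(validator_cmd, page_paths):
    """Run validator_cmd once per page.

    Returns a dict mapping each path to True if the command exited with
    status 0. Treating status 0 as "valid" is an assumption about the tool.
    """
    results = {}
    for path in page_paths:
        proc = subprocess.run(
            list(validator_cmd) + [path],
            capture_output=True,
            text=True,
        )
        results[path] = (proc.returncode == 0)
    return results


# Placeholder "validator" (NOT a real one): succeeds for paths ending in
# .html and fails for everything else. Swap in your real command.
STAND_IN_CMD = [
    sys.executable,
    "-c",
    "import sys; sys.exit(0 if sys.argv[1].endswith('.html') else 1)",
]

if __name__ == "__main__":
    for path, ok in check_pages(STAND_IN_CMD, ["index.html", "notes.txt"]).items():
        print(path, "ok" if ok else "FAILED")
```

A wrapper like this is easy to call from a test suite or a pre-publish hook, including against a staging copy that is only reachable locally.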

Validation and Content Management Systems

Many Content Management Systems (or CMSes) already declare a DOCTYPE and charset for you, and use them in their templates and static pages. This usually means that any code you write that is published through such a CMS needs to be carefully checked to ensure that it meets the standard (and charset!) selected by the CMS. Some CMSes will help you meet that standard by providing filters to run user-submitted content through. While there is still guesswork about how to translate non-standard markup into standards-compliant markup, using the CMS's provided filters moves the guesswork to the server side, where it can be made once in a canonical fashion, instead of asking each viewer's web browser to do its own guesswork.

An example of this is Drupal's Input Formats provided by the "filter" module.
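To give a feel for what a server-side filter does, here is a deliberately simple sketch in Python. It is not Drupal's code and does not mirror any particular CMS; the function name and behaviour are assumptions chosen for illustration. It treats user input as plain text, escapes the characters that have special meaning in markup, and wraps the result in paragraph elements so that the output is always well-formed.

```python
import html


def plain_text_filter(user_text):
    """Turn plain user text into well-formed paragraph markup.

    Chunks separated by blank lines become <p> elements; single newlines
    inside a chunk become <br /> elements.
    """
    normalized = user_text.replace("\r\n", "\n")
    paragraphs = [chunk.strip() for chunk in normalized.split("\n\n")]
    return "\n".join(
        "<p>" + html.escape(chunk).replace("\n", "<br />") + "</p>"
        for chunk in paragraphs
        if chunk
    )


if __name__ == "__main__":
    print(plain_text_filter("5 < 6 & 7 > 2\n\nsecond paragraph\nwith a line break"))
```

Because the decision about how to handle stray angle brackets is made once on the server, every browser receives the same well-formed result.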

Also note that sometimes you might paste data from one system (e.g. Microsoft Word) into another system (e.g. an input box on a Drupal-based web site), and the two systems might be using different character sets. Since common Latin characters (a, b, c, etc.) are handled identically in most character sets, you'll usually notice the problem first in things like unusual punctuation (e.g. smart quotes, em dashes, the interrobang (‽), etc.), unusual characters, new symbols, or characters with unusual diacritics (e.g. ß, €, ẍ, etc.). If characters like these are showing up in some garbled state, you should consider the choice of character set as a possible culprit.
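The garbling described above is easy to reproduce. The sketch below, in Python, writes text out in one charset and reads the bytes back in another; the function name and the choice of UTF-8 and Windows-1252 as the two charsets are just illustrative assumptions.

```python
def garble(text, written_as="utf-8", read_as="cp1252"):
    """Encode text in one charset and decode the bytes as if in another.

    Bytes that are invalid in the reading charset become U+FFFD, the
    replacement character many browsers show as a question mark in a box.
    """
    return text.encode(written_as).decode(read_as, errors="replace")


if __name__ == "__main__":
    # Plain Latin letters survive, which is why the problem hides at first.
    print(garble("abc"))
    # Accented letters and newer symbols turn into multi-character junk.
    print(garble("é"))
    print(garble("€"))
    # A Windows-1252 smart quote is not valid UTF-8 at all.
    print(garble("“", written_as="cp1252", read_as="utf-8"))
```

If the junk on your site looks like the output of the second and third lines, the content is probably UTF-8 being read as a single-byte charset; if it looks like the last line, the reverse is likely.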

Conclusion

Once you've got a syntactically valid page that does what you want in one browser, the differences in how other modern browsers handle it should be relatively small, and you can focus your time on resolving those differences. Regular checks of page validity when you make changes are also a good idea, just to make sure the page keeps working.