Anatomy of an HTML Document


xhtml
Sun., Apr. 9, 2006, 2:03 am
This is a rather lengthy posting, but it contains some very important information that everyone creating web pages should know, and it's nothing complicated. I've frequently touched on some of these issues in other postings, but it all bears repeating (in one place?). Hang in there - here we go!

Here is a minimal/basic document (HTML 4.01 Strict) that includes some sample content. Consider it a starting point for any HTML document. Although in some places, I mention XHTML, this tutorial applies to an HTML document. Every HTML document must contain at least the following:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Document Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<h1>Heading Level 1</h1>
<!-- Some example content -->
<p><cite>Homer Simpson</cite> said; <strong>Operator!</strong> <em>Quick!</em> Get me the number for 9-1-1!</p>
<p><img src="someimage.gif" width="100" height="100" alt="Alternate Text"></p>
</body>
</html>

And now the (often lengthy) explanations for the declarations/elements used:

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">

1) The DOCTYPE (short for Document Type Declaration, which points to a DTD or Document Type Definition) is required at the start of every page for two reasons: A) so that you can validate your document to ensure that it contains no errors, and B) to make sure the web browser renders the page correctly. The DOCTYPE is a declaration, not an element.

Without a valid DOCTYPE, modern browsers go into what is known as Quirks Mode, in which they imitate the non-standard behavior of older browsers instead of following the specifications. The results may not be what you want and can cause you to have a "hissy-fit" trying to figure out why your CSS (Cascading Style Sheets) won't work correctly. Among other things, Internet Explorer in Quirks Mode uses its old box model, in which the CSS width and height include padding and borders - the cause of many "hissy-fits" for people trying to learn CSS.
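
To make the box model difference concrete, here is a small test page. It is only a sketch: the class name and the pixel values are made up for illustration. Under the W3C box model the box is 250px wide overall; in IE's Quirks Mode the same rule gives a box that is 200px wide overall.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Box Model Test</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<style type="text/css">
/* Hypothetical class name; the values are arbitrary */
.box {
  width: 200px;            /* width of the content area under the W3C box model */
  padding: 20px;
  border: 5px solid #000;
}
</style>
</head>
<body>
<h1>Box Model Test</h1>
<!-- Standards mode: 200 + (2 x 20) + (2 x 5) = 250px wide overall -->
<!-- IE Quirks Mode: 200px wide overall, leaving 150px for the content -->
<div class="box">Measure me</div>
</body>
</html>
```

Remove the DOCTYPE line and reload the page in IE to see the box shrink.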

Those of you using Firefox (http://www.mozilla.com/) can right-click inside an (X)HTML page - select View Page Info - and instantly see if the page you're viewing is in Standards compliance mode or Quirks mode. (Handy huh?).
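
You can also ask the browser directly from a script. The page below is a minimal sketch that assumes a browser supporting the document.compatMode property (IE 6 and Firefox do); the id "mode" is just a name I picked for the example. The property reads "CSS1Compat" in standards mode and "BackCompat" in Quirks Mode.

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Rendering Mode Check</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
</head>
<body>
<h1>Rendering Mode Check</h1>
<p id="mode">Scripting is off, so the rendering mode cannot be shown.</p>
<script type="text/javascript">
// "CSS1Compat" means standards mode; anything else is treated here as Quirks Mode
var mode = document.compatMode;
var label = (mode === "CSS1Compat") ? "Standards compliance mode" : "Quirks Mode";
// Replace the fallback text in the paragraph above with the result
document.getElementById("mode").firstChild.nodeValue = label + " (" + mode + ")";
</script>
</body>
</html>
```

A browser that does not support the property at all will be reported as Quirks Mode by this sketch, so treat the result as a hint rather than proof.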

The DOCTYPE is case sensitive and must be quoted and written as shown above. Also note that this DOCTYPE is fully qualified as it should be, although HTML 4 will permit a shorter version like:
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
If the shortened version is used the document is assumed to be HTML 4.01 Strict, so keep that in mind. An XHTML document should NEVER use a shortened version of a DOCTYPE declaration.

Nothing may appear in the document before the DOCTYPE declaration (wouldn't be inside the document would it?). When I say nothing may appear before the DOCTYPE, I'm not talking about PHP/ASP server scripts which don't appear in the document when served. However, <script> or <style> elements must never appear before the DOCTYPE! Your page may render correctly because of the browser's built-in error handling, but don't rely on that! A standards compliant browser or user agent may actually ignore such improper markup.

Make sure your <script> and <style> elements are inside the <head> element where they belong. Remember too, that although <script> elements may be embedded within the document <body> content, <style> elements may only appear in the <head>. I'm talking about the <style></style> element and not inline styles such as <p style="text-align:justify;">, etc.
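
As a sketch of that placement (the file name site.js and the style rule are placeholders of mine, not part of the example document above):

```html
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<title>Document Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<!-- Style sheets belong here and nowhere else -->
<style type="text/css">
p { text-align: justify; }
</style>
<!-- Scripts normally go here too (hypothetical file name) -->
<script type="text/javascript" src="site.js"></script>
</head>
<body>
<h1>Heading Level 1</h1>
<p>Nothing but the DOCTYPE comes before the opening html tag.</p>
</body>
</html>
```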

<html>

2) Opening tag for the root element html

<head>

3) Opening tag for the head element. The <head> element may contain metadata, scripts and/or style declarations, but must always contain:

<title>Document Title</title>

4) The title element is required for every page and should contain a meaningful title for each document. Some folks simply put the site title here on every page, but the actual page title is better - especially for search engines. One school of thought is to show the page title followed by a dash or vertical bar and the site title. For example: <title>Some Page | My Web Site</title>.

Some folks seem to forget the <title> element and leave it in its default - usually <title>Untitled Document</title>. If you're really bored one day, do a Google search on "Untitled Document"; take the family out to dinner and when you return, see how many hits you got!

<meta http-equiv="Content-Type" content="text/html; charset=utf-8">

5) Every page should declare the charset (character set or encoding), although this is really the responsibility of the server, which can send it in the HTTP Content-Type header. UTF-8 is the recommended charset encoding. The content value text/html is the default MIME type for an HTML document. MIME stands for Multipurpose Internet Mail Extensions.

The correct MIME type for XHTML is application/xhtml+xml. I won't go into the details of what happens when your XHTML page is rendered with the correct MIME type (I'll save that for another tutorial), but you should know that Internet Explorer (all versions including the forthcoming IE 7) cannot deal with a document served as application/xhtml+xml.

Ouch! I hear some of you screaming at me right now that you have been marking up all your pages in XHTML and they work just fine in IE - but remember - you aren't serving your XHTML documents with the correct MIME type. You can test this for yourself by trying to open this link in IE, then open it in Firefox, Opera or another standards compliant browser. What happened? Can I come out from under my desk now?

</head>

6) The closing tag for the <head> element.

<body>

7) The opening tag for the body element. All subsequent elements result in the content displayed within the viewport.

<h1>Heading Level 1</h1>

8) The h1 element. Every page should have an H1 element (but only one). It is the most important heading on any page and is often used by search engines. There are two schools of thought about its usage. A) to contain the site title on every page or B) to contain the page title, and with that usage it would be the same as your <title> element.

<!-- Some example content -->
<p><cite>Homer Simpson</cite> said; <strong>Operator!</strong> <em>Quick!</em> Get me the number for 9-1-1!</p>
<p><img src="someimage.gif" width="100" height="100" alt="Alternate Text"></p>

9) Two examples of paragraph elements. The first one includes some elements that give meaning to parts of the paragraph. The second one contains an image element. Every element should be properly terminated, but note that the <img> element has no closing tag. Other things to note about this <img> element example: all attributes are properly quoted, the width and height of the image are specified, and it contains the alt attribute, whose text is displayed if the browser cannot load the image. If the image is presentational only, then you should use an empty alt attribute - alt="". The alt attribute is required for all images.
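
Here is a short sketch of the two kinds of alt text. The file names, dimensions and wording are invented for the example:

```html
<!-- Informative image: the alt text stands in for the picture -->
<p><img src="sales-chart.gif" width="300" height="200" alt="Sales rose 20% between January and March"></p>

<!-- Presentational image: empty alt, so text browsers and screen readers can skip it -->
<p><img src="divider.gif" width="300" height="10" alt=""></p>
```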

</body>

10) The closing tag for the <body> element.

</html>

11) The closing tag for the <html> root element.

Notes/Remarks:

The <body> element should contain proper semantic markup elements - for example if it's a list of items then use the <ul> unordered list or <ol> ordered list element. If you need to display tabular data then use the <table> element - for a form use the <form> element, etc. Never use or select an HTML element based on what it looks like just for the sake of design/layout. All elements can be styled in all sorts of ways using CSS. The elements used should always convey meaning.
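
For instance, here is a rough sketch with made-up content. The first paragraph only looks like a list; the markup after it actually says what the content is:

```html
<!-- Avoid: a "list" faked with line breaks -->
<p>Apples<br>Oranges<br>Pears</p>

<!-- Prefer: a real unordered list -->
<ul>
<li>Apples</li>
<li>Oranges</li>
<li>Pears</li>
</ul>

<!-- Tabular data goes in a table, with header cells marked as such -->
<table summary="Price per pound for each fruit">
<caption>Fruit prices</caption>
<tr><th scope="col">Fruit</th><th scope="col">Price</th></tr>
<tr><td>Apples</td><td>$1.20</td></tr>
<tr><td>Oranges</td><td>$0.90</td></tr>
</table>
```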

Note that I use the proper term element and not tag. An (X)HTML element is composed of an opening tag with any needed or required attributes - the content of the element - and a closing tag (if required). In HTML, element and attribute names are not case sensitive, so <BODY>, <Body> and <body> are all the same. Not so with XHTML, where element and attribute names are case sensitive and must be written in lowercase.

If you aren't aware of it - in HTML 4 (and previous versions) the start and end tags of the <html>, <head> and <body> elements are optional and may be omitted. In addition, HTML permits omitting quotes from many attribute values, and some end tags are also optional and may be omitted. Examples: </p> </li> </dt> </dd>, etc. See the Index of HTML 4 Elements (http://www.w3.org/TR/html4/index/elements.html) for which elements have optional start or end tags. But remember, this only applies to HTML - you may never omit any of these tags in XHTML!

Here is how our minimal/basic example could be marked up in HTML 4 - including uppercase element tags (and yes this example will validate as HTML 4.01 Strict):

<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN">
<TITLE>Document Title</TITLE>
<META HTTP-EQUIV="Content-Type" CONTENT="text/html; charset=utf-8">
<H1>Heading Level 1</H1>
<!-- Some example content -->
<P><CITE>Homer Simpson</CITE> said; <STRONG>Operator!</STRONG> <EM>Quick!</EM> Get me the number for 9-1-1!
<P><IMG SRC=someimage.gif WIDTH=100 HEIGHT=100 ALT="Alternate Text">

If you were to choose this type of minimal markup - you'd be in for some serious editing if you ever decided to convert your document to XHTML! So don't do it! Close all elements that have closing tags - and quote all attributes. If you use XHTML style markup as in our minimal/basic example at the top of the page - you can easily convert a valid HTML 4.01 document to XHTML 1.0, with few changes required.
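
To show how few changes that is, here is a sketch of the same minimal document converted to XHTML 1.0 Strict. The language code "en" is my assumption (the HTML example did not declare a language). The changes are the DOCTYPE, the xmlns attribute on the root element, and the " />" on the two empty elements:

```html
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en" lang="en">
<head>
<title>Document Title</title>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />
</head>
<body>
<h1>Heading Level 1</h1>
<!-- Some example content -->
<p><cite>Homer Simpson</cite> said; <strong>Operator!</strong> <em>Quick!</em> Get me the number for 9-1-1!</p>
<p><img src="someimage.gif" width="100" height="100" alt="Alternate Text" /></p>
</body>
</html>
```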

Always use the Strict DOCTYPE whether HTML or XHTML. Using a Transitional DTD simply allows you to include deprecated or presentational markup on your pages and that's not what you want. Markup not permitted under the Strict DOCTYPE includes the <font> element, the <embed> element (which is not in any HTML spec) and the target attribute on links (target="..."), to name a few.
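
As a sketch of what moving from Transitional habits to Strict looks like (the class name and the style values are placeholders, and the CSS is only a rough visual match for the font attributes):

```html
<!-- Transitional only: presentation mixed into the markup -->
<p><font color="red" size="4">Sale ends Friday!</font></p>

<!-- Strict: this rule goes inside the head element... -->
<style type="text/css">
.notice { color: red; font-size: 120%; }
</style>

<!-- ...and the body markup only says what the text is -->
<p class="notice"><strong>Sale ends Friday!</strong></p>
```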

Be sure your elements are properly nested (well-formed). Most browsers, because of their error handling, would render this example correctly as Some text:
<p><b><i>Some text</b></i></p>
However, you'll note that the <b><i> elements are improperly nested (not well-formed). The example should look like:
<p><b><i>Some text</i></b></p>
You can't get away with improper nesting in XHTML. Besides, well-formed pages render more predictably, because the browser doesn't have to guess at what you meant.

David Gillaspey
Sun., Apr. 9, 2006, 5:10 pm
Hi Ed,

Thanks for taking time to compose and post your guidelines about HTML 4.01. I'm already using the information you provided this weekend as I progress toward making my own website standards compliant.

Sincerely,

David Gillaspey
President
Great Church Websites

GuruGreg
Wed., Aug. 9, 2006, 9:38 am
Here's one I've seen: You need to check the appearance of your site in multiple browsers.

Those who only test in IE can often have nice sites in IE, but in Firefox or other browsers, they look shoddy.

Conversely, those who test only in other browsers may see some discrepancies when their site is viewed in IE. One example I can think of immediately is using percentages for the width of nested page elements. Starting with the following code,


<div style="width: 300px">
<table style="width: 100%">
...
</table>
</div>


in Firefox or other standards compliant browsers, the table width should be 100% of the width of the element in which it is contained, so the table will take up the full width of the DIV element. However, in IE, I have seen percentage widths calculated based on the size of the window instead, in which case the TABLE element in this example ends up as wide as the browser window. If you are using a multi-column layout, this will cause problems (which I've experienced in the past).

This is just one of many differences in browsers that we need to account for as designers/developers.

generalhavok
Wed., Aug. 9, 2006, 6:17 pm
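One minor item: all elements must be "closed" in an XHTML document. That's easy for things like <p>, which have </p>. There are, however, a few empty elements that have no closing tag, and these are written in self-closing form:

<br> becomes <br />
<hr> becomes <hr />
<img> becomes <img />

(Elements such as <option> and <li> are a different case: they are not empty, so in XHTML they simply need their explicit closing tags, </option> and </li>.)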

So, the example above...
<p><img src="someimage.gif" width="100" height="100" alt="Alternate Text"></p>

...should be...
<p><img src="someimage.gif" width="100" height="100" alt="Alternate Text" /></p>

...but I'd go even further:
<p><img src="someimage.gif" alt="Alternate Text" /></p>

...I'd remove the height and width, because I'd create the image with the dimensions I want. I wouldn't have the browser size them.

David Gillaspey
Fri., Aug. 11, 2006, 5:17 pm
Note: I moved a number of posts in this thread that actually have to do with images in HTML to here:

http://www.greatchurchwebsites.org/forums/showthread.php?t=455

Members are welcome to continue to post in this thread if discussing HTML rights and wrongs.

Sincerely,

David Gillaspey
Forum administrator