Performance, Implementation, and Design Notes 

Contents

  1. Notes on helping search engines index your Web site
    1. Search robots
  2. Notes on tables
    1. Design rationale
    2. Recommended Layout Algorithms
  3. Notes on styles
    1. New media types
  4. Notes on forms
    1. Incremental display
    2. Future projects
  5. Notes on scripting
    1. Reserved syntax for future script macros

The following notes are informative, not normative.

Notes on helping search engines index your Web site 

This section provides some simple suggestions that will make your documents more accessible to search engines.

Define the document language
In the global context of the Web it is important to know which human language a page was written in. This is discussed in the section on language information.
Specify language variants of this document If you have prepared translations of this document into other languages, you should use the LINK element to reference these. This allows an indexing engine to offer users search results in the user's preferred language, regardless of how the query was written. For instance, the following links offer French and German alternatives to a search engine:
<LINK rel="alternate" href="mydoc-fr.html"
         lang="fr" title="La vie souterrainne">
<LINK rel="alternate" href="mydoc-de.html"
         lang="de" title="Das Leben im Untergrund">
Provide keywords and descriptions
Some indexing engines look for META elements that define a comma-separated list of keywords/phrases, or that give a short description. Search engines may present these keywords as the result of a search. The value of the name attribute sought by a search attribute is not defined by this specification. Consider these examples,
 <META name="keywords" content="vacation,Greece,sunshine">
 <META name="description" content="Idylic European vacations">
Indicate the beginning of a collection
Collections of word processing documents or presentations are frequently translated into collections of HTML documents. It is helpful for search results to reference the beginning of the collection in addition to the page hit by the search. You may help search engines by using the LINK element with rel="begin" along with a TITLE, as in:
 
<LINK rel="begin" 
         href="page1.html" 
         title="General Theory of Relativity">
Provide robots with indexing instructions
People may be surprised to find that their site has been indexed by an indexing robot and that the robot should not have been permitted to visit a sensitive part of the site. Many Web robots offer facilities for Web site administrators and content providers to limit what the robot does. This is achieved through two mechanisms: a "robots.txt" file and the META element in HTML documents, described below.

Search robots 

The robots.txt file 

When a Robot vists a Web site, say http://www.foobar.com/, it firsts checks for http://www.foobar.com/robots.txt. If it can find this document, it will analyze its contents to see if it is allowed to retrieve the document. You can customize the robots.txt file to apply only to specific robots, and to disallow access to specific directories or files.

Here is a sample robots.txt file that prevents all robots from visiting the entire site:

        User-agent: *    # applies to all robots
        Disallow: /      # disallow indexing of all pages

The Robot will simply look for a "/robots.txt" URL on your site, where a site is defined as a HTTP server running on a particular host and port number. Here are some sample locations for robots.txt:

Site URLURL for robots.txt
http://www.w3.org/ http://www.w3.org/robots.txt
http://www.w3.org:80/ http://www.w3.org:80/robots.txt
http://www.w3.org:1234/ http://www.w3.org:1234/robots.txt
http://w3.org/ http://w3.org/robots.txt

There can only be a single "/robots.txt" on a site. Specifically, you should not put "robots.txt" files in user directories, because a robot will never look at them. If you want your users to be able to create their own "robots.txt", you will need to merge them all into a single "/robots.txt". If you don't want to do this your users might want to use the Robots META Tag instead.

Some tips: URL's are case-sensitive, and "/robots.txt" string must be all lower-case. Blank lines are not permitted.

There must be exactly one "User-agent" field. The robot should be liberal in interpreting this field. A case insensitive substring match of the name without version information is recommended.

If the value is "*", the record describes the default access policy for any robot that has not matched any of the other records. It is not allowed to have multiple such records in the "/robots.txt" file.

The "Disallow" field specifies a partial URL that is not to be visited. This can be a full path, or a partial path; any URL that starts with this value will not be retrieved. For example,

    Disallow: /help disallows both /help.html and /help/htmlweb.html, whereas
    Disallow: /help/ would disallow /help/htmlweb.html but allow /help.html. 

An empty value for "Disallow", indicates that all URLs can be retrieved. At least one "Disallow" field must be present in the robots.txt file.

Robots and the META element 

The META element allows HTML authors to tell visiting robots whether a document may be indexed, or used to harvest more links. No server administrator action is required.

In the following example a robot should neither index this document, nor analyze it for links.

<META name="ROBOTS" content="NOINDEX, NOFOLLOW">

The list of terms in the content is ALL, INDEX, NOFOLLOW, NOINDEX. The name and the content attribute values are case-insensitive.

Note: In early 1998 only a few robots implement this, but this is expected to change as more public attention is given to controlling indexing robots.

Notes on tables 

Design rationale 

The HTML table model has evolved from studies of existing SGML tables models, the treatment of tables in common word processing packages, and a wide range of tabular layout techniques in magazines, books and other paper-based documents. The model was chosen to allow simple tables to be expressed simply with extra complexity available when needed. This makes it practical to create the markup for HTML tables with everyday text editors and reduces the learning curve for getting started. This feature has been very important to the success of HTML to date.

Increasingly, people are creating tables by converting from other document formats or by creating them directly with WYSIWYG editors. It is important that the HTML table model fit well with these authoring tools. This affects how the cells that span multiple rows or columns are represented, and how alignment and other presentation properties are associated with groups of cells.

Dynamic reformatting 

A major consideration for the HTML table model is that the author does not control how a user will size a table, what fonts he or she will use, etc. This makes it risky to rely on column widths specified in terms of absolute pixel units. Instead, tables must be able to change sizes dynamically to match the current window size and fonts. Authors can provide guidance as to the relative widths of columns, but user agents should ensure that columns are wide enough to render the width of the largest element of the cell's content. If the author's specification must be overridden, relative widths of individual columns should not be changed drastically.

Incremental display 

For large tables or slow network connections, incremental table display is important to user satisfaction. User agents should be able to begin displaying a table before all of the data has been received. The default window width for most user agents shows about 80 characters, and the graphics for many HTML pages are designed with these defaults in mind. By specifying the number of columns, and including provision for control of table width and the widths of different columns, authors can give hints to user agents that allow the incremental display of table contents.

For incremental display, the browser needs the number of columns and their widths. The default width of the table is the current window size (width="100%"). This can be altered by setting the width-TABLE attribute of the TABLE element. By default, all columns have the same width, but you can specify column widths with one or more COL elements before the table data starts.

The remaining issue is the number of columns. Some people have suggested waiting until the first row of the table has been received, but this could take a long time if the cells have a lot of content. On the whole it makes more sense, when incremental display is desired, to get authors to explicitly specify the number of columns in the TABLE element.

Authors still need a way of telling user agents whether to use incremental display or to size the table automatically to fit the cell contents. In the two pass auto-sizing mode, the number of columns is determined by the first pass. In the incremental mode, the number of columns must be stated up front. It makes more sense to set the cols attribute to the number of columns rather than using some "layout" attribute (e.g., layout="fixed" or layout="auto").

Structure and presentation 

HTML distinguishes structural markup such as paragraphs and quotations from rendering idioms such as margins, fonts, colors, etc. How does this distinction affect tables? From the purist's point of view, the alignment of text within table cells and the borders between cells is a rendering issue, not one of structure. In practice, though, it is useful to group these with the structural information, as these features are highly portable from one application to the next. The HTML table model leaves most rendering information to associated style sheets. The model presented in this specification is designed to take advantage of such style sheets but not to require them.

Current desktop publishing packages provide very rich control over the rendering of tables, and it would be impractical to reproduce this in HTML, without making HTML into a bulky rich text format like RTF or MIF. This specification does, however, offer authors the ability to choose from a set of commonly used classes of border styles. The frame attribute controls the appearence of the border frame around the table while the rules attribute determines the choice of rulings within the table. A finer level of control will be supported via rendering annotations. The style attribute can be used for specifying rendering information for individual elements. Further rendering information can be given with the STYLE element in the document head or via linked style sheets.

During the development of this specification, a number of avenues were investigated for specifying the ruling patterns for tables. One issue concerns the kinds of statements that can be made. Including support for edge subtraction as well as edge addition leads to relatively complex algorithms. For instance, work on allowing the full set of table elements to include the frame and rules attributes led to an algorithm involving some 24 steps to determine whether a particular edge of a cell should be ruled or not. Even this additional complexity doesn't provide enough rendering control to meet the full range of needs for tables. The current specification deliberately sticks to a simple intuitive model, sufficient for most purposes. Further experimental work is needed before a more complex approach is standardized.

Row and column groups 

This specification provides a superset of the simpler model presented in earlier work on HTML+. Tables are considered as being formed from an optional caption together with a sequence of rows, which in turn consist of a sequence of table cells. The model further differentiates header and data cells, and allows cells to span multiple rows and columns.

Following the CALS table model (see [CALS]), this specification allows table rows to be grouped into head and body and foot sections. This simplifies the representation of rendering information and can be used to repeat table head and foot rows when breaking tables across page boundaries, or to provide fixed headers above a scrollable body panel. In the markup, the foot section is placed before the body sections. This is an optimization shared with CALS for dealing with very long tables. It allows the foot to be rendered without having to wait for the entire table to be processed.

Accessibility 

For the visually impaired, HTML offers the hope of setting to rights the damage caused by the adoption of windows based graphical user interfaces. The HTML table model includes attributes for labeling each cell, to support high quality text to speech conversion. The same attributes can also be used to support automated import and export of table data to databases or spreadsheets.

Recommended Layout Algorithms 

If the cols attribute on the TABLE element specifies the number of columns, then the table may be rendered using a fixed layout, otherwise the autolayout algorithm described below should be used.

If the width attribute is not specified, visual user agents should assume a default value of 100% for formatting.

We recommended that user agents increase table widths beyond the value specified by width in cases when cell contents would otherwise overflow. User agents that override the specified width should do so within reason. User agents may elect to split words across lines to avoid the need for excessive horizontal scrolling or when such scrolling is impractical or undesired.

Fixed Layout Algorithm 

For this algorithm, it is assumed that the number of columns is known. The column widths by default should be set to the same size. Authors may override this by specifying relative or absolute column widths, using the COLGROUP or COL elements. The default table width is the space between the current left and right margins, but may be overridden by the width attribute on the TABLE element, or determined from absolute column widths. To deal with mixtures of absolute and relative column widths, the first step is to allocate space from the table width to columns with absolute widths. After this, the space remaining is divided up between the columns with relative widths.

The table syntax alone is insufficient to guarantee the consistency of attribute values. For instance, the number of columns specified by the cols attribute may be inconsistent with the number of columns implied by the COL elements. This in turn, may be inconsistent with the number of columns implied by the table cells. A further problem occurs when the columns are too narrow to avoid overflow of cell contents. The width of the table as specified by the TABLE element or COL elements may result in overflow of cell contents. It is recommended that user agents attempt to recover gracefully from these situations, e.g., by hyphenating words and resorting to splitting words if hyphenation points are unknown.

In the event that an indivisible element causes cell overflow, the user agent may consider adjusting column widths and re-rendering the table. In the worst case, clipping may be considered if column width adjustments and/or scrollable cell content are not feasible. In any case, if cell content is split or clipped this should be indicated to the user in an appropriate manner.

Autolayout Algorithm 

If the COLS attribute is missing from the table start tag, then the user agent should use the following autolayout algorithm. It uses two passes through the table data and scales linearly with the size of the table.

In the first pass, line wrapping is disabled, and the user agent keeps track of the minimum and maximum width of each cell. The maximum width is given by the widest line. Since line wrap has been disabled, paragraphs are treated as long lines unless broken by BR elements. The minimum width is given by the widest text element (word, image, etc.) taking into account leading indents and list bullets, etc. In other words, it is necessary to determine the minimum width a cell would require in a window of its own before the cell begins to overflow. Allowing user agents to split words will minimize the need for horizontal scrolling or in the worst case, clipping the cell contents.

This process also applies to any nested tables occuring in cell content. The minimum and maximum widths for cells in nested tables are used to determine the minimum and maximum widths for these tables and hence for the parent table cell itself. The algorithm is linear with aggregate cell content, and broadly speaking, independent of the depth of nesting.

To cope with character alignment of cell contents, the algorithm keeps three running min/max totals for each column: Left of align char, right of align char and un-aligned. The minimum width for a column is then: max(min_left + min_right, min_non-aligned).

The minimum and maximum cell widths are then used to determine the corresponding minimum and maximum widths for the columns. These in turn, are used to find the minimum and maximum width for the table. Note that cells can contain nested tables, but this doesn't complicate the code significantly. The next step is to assign column widths according to the available space (i.e., the space between the current left and right margins).

For cells that span multiple columns, a simple approach (as used by Arena) consists of apportioning the min/max widths evenly to each of the constituent columns. A slightly more complex approach is to use the min/max widths of unspanned cells to weight how spanned widths are apportioned. Experiments suggest that a blend of the two approaches gives good results for a wide range of tables.

The table borders and intercell margins need to be included in assigning column widths. There are three cases:

  1. The minimum table width is equal to or wider than the available space. In this case, assign the minimum widths and allow the user to scroll horizontally. For conversion to braille, it will be necessary to replace the cells by references to notes containing their full content. By convention these appear before the table.
  2. The maximum table width fits within the available space. In this case, set the columns to their maximum widths.
  3. The maximum width of the table is greater than the available space, but the minimum table width is smaller. In this case, find the difference between the available space and the minimum table width, lets call it W. Lets also call D the difference between maximum and minimum width of the table.

    For each column, let d be the difference between maximum and minimum width of that column. Now set the column's width to the minimum width plus d times W over D. This makes columns with large differences between minimum and maximum widths wider than columns with smaller differences.

This assignment step is then repeated for nested tables using the minimum and maximum widths derived for all such tables in the first pass. In this case, the width of the parent (i.e., englobing) table cell plays the role of the current window size in the above description. This process is repeated recursively for all nested tables. The topmost table is then rendered using the assigned widths. Nested tables are subsequently rendered as part of the parent table's cell contents.

If the table width is specified with the width attribute, the user agent attempts to set column widths to match. The width attribute is not binding if this results in columns having less than their minimum (i.e., indivisible) widths.

If relative widths are specified with the COL element, the algorithm is modified to increase column widths over the minimum width to meet the relative width constraints. The COL elements should be taken as hints only, so columns shouldn't be set to less than their minimum width. Similarly, columns shouldn't be made so wide that the table stretches well beyond the extent of the window. If a COL element specifies a relative width of zero, the column should always be set to its minimum width.

When using the two pass layout algorithm, the default alignment position in the absence of an explicit or inherited charoff attribute can be determined by choosing the position that would center lines for which the widths before and after the alignment character are at the maximum values for any of the lines in the column for which align="char". For incremental table layout the suggested default is charoff="50%". If several cells in different rows for the same column use character alignment, then by default, all such cells should line up, regardless of which character is used for alignment. Rules for handling objects too large for column apply when the explicit or implied alignment results in a situation where the data exceeds the assigned width of the column.

Choice of attribute names. It would have been preferable to choose values for the frame attribute consistent with the rules attribute and the values used for alignment. For instance: none, top, bottom, topbot, left, right, leftright, all. Unfortunately, SGML requires enumerated attribute values to be unique for each element, independent of the attribute name. This causes immediate problems for "none", "left", "right" and "all". The values for theframe attribute have been chosen to avoid clashes with the rules, align and valign-COLGROUP attributes. This provides a measure of future proofing, as it is anticipated that that the frame and rules attributes will be added to other table elements in future revisions to this specification. An alternative would be to make frame a CDATA attribute. The consensus of the W3C HTML Working Group was that the benefits of being able to use SGML validation tools to check attributes based on enumerated values outweighs the need for consistent names.

Notes on styles 

Some people have voiced concerns over performance issues for style sheets. For instance, retrieving an external style sheet may delay the full presentation for the user. A similar situation arises if the document head includes a lengthy set of style rules.

The current proposal addresses these issues by allowing authors to include rendering instructions within each HTML element. The rendering information is then always available by the time the user agent wants to render each element.

In many cases, authors will take advantage of a common style sheet for a group of documents. In this case, distributing style rules throughout the document will actually lead to worse performance than using a linked style sheet, since for most documents, the style sheet will already be present in the local cache. The public availability of good style sheets will encourage this effect.

New media types 

It is likely that the list of media types will grow in the future. To enable such extensions to be introduced smoothly, user agents conforming to this specification must be able to parse the media type attribute value as follows:

  1. Comma characters (Unicode decimal 44) are used to break the media attribute value into a list of entries. For example,
        media="screen, 3d-glasses, print and resolution > 90dpi"
    

    is mapped to:

        "screen"
        "3d-glasses"
        "print and resolution > 90dpi"
    
  2. Each entry is truncated just before the first character that isn't a US ASCII letter [a-zA-Z] (Unicode decimal 65-90, 97-122), or hyphen (Unicode decimal 45). In our example, this gives:
        "screen"
        "3d-glasses"
        "print"
    
  3. A case-insensitive match is then made with the set of media types defined above. Entries that don't match should be ignored. In the example we are left with screen and print.

Note: Style sheets may include media dependent variations within them. For instance the proposed CSS @media construct. In such cases it may be appropriate to use the default value "all".

Notes on forms 

Incremental display 

The incremental display of documents being received from the network gives rise to certain problems with respect to forms. User agents should prevent forms from being submitted until all of the form's elements have been received.

The incremental display of documents raises some issues with respect to tabbing navigation. The heuristic of giving focus to the lowest valued tabindex in the document seems reasonable enough at first glance. However this implies having to wait until all of the document's text is received, since until then, the lowest valued tabindex may stil change. If the user hits the tab key before then, it is reasonable for user agents to move the focus to the lowest currently available tabindex.

If forms are associated with client-side scripts, there is further potential for problems. For instance, a script handler for a given field may refer to a field that doesn't yet exist.

Future projects 

This specification defines a set of elements and attributes powerful enough to fulfill the general need for producing forms. However there is still room for many possible improvements. For instance the following problems could be addressed in the future:

Notes on scripting 

Reserved syntax for future script macros 

This specification reserves syntax for the future support of script macros in HTML CDATA attributes. The intention is to allow attributes to be set depending on the properties of objects that appear earlier on the page. The syntax is:

   attribute = "... &{ macro body }; ... "

Current Practice for Script Macros 

The macro body is made up of one or more statements in the default scripting language (as per instrinsic event attributes). The semicolon following the right brace is always needed, as otherwise the right brace character "}" is treated as being part of the macro body. Its also worth noting that quote marks are always needed for attributes containing script macros.

The processing of CDATA attributes proceeds as follows:

  1. The SGML parser evaluates any SGML entities (e.g., "&gt;").
  2. Next the script macros are evaluated by the script engine.
  3. Finally the resultant character string is passed to the application for subsequent processing.

Macro processing takes place when the document is loaded (or reloaded) but does not reoccur when the document is resized, repainted, etc.

Here are some examples using JavaScript. The first one randomizes the document background color:

    
<BODY bgcolor='&{randomrbg()};'>

Perhaps you want to dim the background for evening viewing:

    
<BACKGROUND src='&{if(Date.getHours > 18)...};'>

The next example uses JavaScript to set the coordinates for a client-side image map:

    
<MAP NAME=foo>
   <AREA shape="rect" coords="&{myrect(imageurl)};" href="&{myurl};">
 </MAP>

This example sets the size of an image based upon document properties:

    
<IMG src="bar.gif" width='&{document.banner.width/2};' height='50%'>

You can set the URL for a link or image by script:

 <SCRIPT>
   function manufacturer(widget) {
       ...
   }
   function location(manufacturer) {
       ...
   }
   function logo(manufacturer) {
       ...
   }
 </SCRIPT>
  <A href='&{location(manufacturer("widget"))};'>widget</A>
  <IMG src='&{logo(manufacturer("widget"))};'>

This last example shows how SGML CDATA attributes can be quoted using single or double quote marks. If you use single quotes around the attribute string then you can include double quote marks as part of the attribute string. Another approach is use &quot; for double quote marks:

   
<IMG src="&{logo(manufacturer(&quot;widget&quot;))};">