CSE134A LECTURE NOTES

November 18, 2002
 
 

ANNOUNCEMENTS

For good short XML tutorials see here.  Parts of the lectures last week and this week are based on this site.
 

XML AND XHTML

XML is not a presentation language (like HTML), not a programming language (like PHP), not a database language (like SQL), and not a communication protocol (like HTTP).

One simple application of XML is to redefine HTML.  This is called XHTML.  All the changes can be automated, but they are extensive.  Many HTML ambiguities and errors must be removed.  In particular, elements must be properly nested and always closed, tag and attribute names must be lowercase, and attribute values must be quoted.

Fortunately, software such as HTML Tidy will translate HTML to XHTML automatically.

Recent browsers (IE 5.5 and Mozilla, i.e. Netscape 6.0) handle XHTML, though not all do so perfectly.  However, XHTML is still not very useful in the real world, because developers must still cater to older browsers.
 
 

WELL-FORMED VERSUS VALID

An XML document is well-formed if it satisfies the XML syntax rules.  If a document also satisfies a document type definition (DTD), then it is valid.
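
As a rough illustration of well-formedness, the sketch below asks Python's standard XML parser whether a few small fragments are acceptable.  The function name is_well_formed and the sample fragments are made up for this example.  The parser does not validate, so it checks only the XML syntax rules and never consults a DTD.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text):
    """Return True if text parses as XML, False if the parser rejects it."""
    try:
        ET.fromstring(text)
    except ET.ParseError:
        return False
    return True


# Each hypothetical sample is paired with the syntax rule it illustrates.
SAMPLES = {
    "<p>Hello <b>world</b></p>": "properly nested and closed",
    "<p>Hello <b>world</p></b>": "overlapping elements",
    "<p>Line one<br>Line two</p>": "element never closed",
    "<img src=photo.gif />": "attribute value not quoted",
    '<img src="a.gif" src="b.gif" />': "same attribute given twice",
}

if __name__ == "__main__":
    for sample, rule in SAMPLES.items():
        print(is_well_formed(sample), "-", rule)
```

Only the first fragment is accepted.  The others are typical of HTML that browsers tolerate but that is not legal XML, and so not legal XHTML either.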

A DTD specifies application-specific syntax.  It cannot specify constraints like "this piece of data is a year after 2000" or even "this piece of data is a number."  XML schemas can specify data types, but they are more complex and less widely used.

In an XML document, the DTD to use is given by something like a special tag, for example

<!DOCTYPE person SYSTEM "http://www.ucsd.edu/person.dtd">

A DTD can be external, internal, or both.  The syntax for an external DTD is

<!DOCTYPE NAME SYSTEM "file">

For an internal DTD, write

<!DOCTYPE NAME [ ... ]>

To have an external part and an internal part, write

<!DOCTYPE NAME SYSTEM "file" [ ... ]>

where [ ... ] indicates the internal DTD.
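
To see how a parser views these declarations, here is a minimal sketch using the expat parser in Python's standard library.  The helper read_doctype, the nickname element, and the sample document are assumptions made for illustration.  By default expat reports the external file name but does not fetch it, so nothing is downloaded.

```python
import xml.parsers.expat

# Hypothetical document with both an external and an internal DTD part.
DOCUMENT = """<!DOCTYPE person SYSTEM "http://www.ucsd.edu/person.dtd" [
  <!ELEMENT nickname (#PCDATA)>
]>
<person><nickname>Pat</nickname></person>"""


def read_doctype(document):
    """Return (name, system_id, has_internal_subset), or None if no DOCTYPE."""
    found = []

    def on_doctype(name, system_id, public_id, has_internal_subset):
        found.append((name, system_id, bool(has_internal_subset)))

    parser = xml.parsers.expat.ParserCreate()
    parser.StartDoctypeDeclHandler = on_doctype
    parser.Parse(document, True)
    return found[0] if found else None


if __name__ == "__main__":
    print(read_doctype(DOCUMENT))
```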

One limitation of DTDs is that there is no simple way to make them modular by combining several DTDs into one.  The DTD for addresses below includes a DTD for names, which has to be repeated explicitly.  It would be better to include it by referring to it in some way.

XML schemas are a newer alternative to DTDs.  For an explanation of XML schemas see here.
 
 

A DTD FOR POSTAL ADDRESSES

Here is the DTD for postal addresses developed by the HR-XML consortium, from page 15 of this PDF document:
<!-- Copyright 2000 The HR-XML Consortium (TM) -->
<!-- version 1.0  October 17 2000 -->
<!-- 11/05/2000 -->

<!ELEMENT PostalAddress  (CountryCode , PostalCode? , Region* , Municipality? , DeliveryAddress? , Recipient* )>
<!ATTLIST PostalAddress  type  (postOfficeBoxAddress | streetAddress | undefined )  'undefined' >
<!ELEMENT PostalCode  (#PCDATA )>
<!ELEMENT CountryCode  (#PCDATA )>
<!ELEMENT Region  (#PCDATA )>
<!ELEMENT Municipality  (#PCDATA )>
<!ELEMENT DeliveryAddress  (AddressLine* )>
<!ELEMENT AddressLine  (#PCDATA )>
<!ELEMENT PersonName  (FormattedName* , GivenName* , PreferredGivenName? , MiddleName? , FamilyName* , Affix* )>
<!ELEMENT FormattedName  (#PCDATA )>
<!ATTLIST FormattedName  type  (presentation | legal | sortOrder )  'presentation' >
<!ELEMENT GivenName  (#PCDATA )>
<!ELEMENT PreferredGivenName  (#PCDATA )>
<!ELEMENT MiddleName  (#PCDATA )>
<!ELEMENT FamilyName  (#PCDATA )>
<!ATTLIST FamilyName  primary  (true | false | undefined )  'undefined' >
<!ELEMENT Affix  (#PCDATA )>
<!ATTLIST Affix  type  (academicGrade |
                        aristocraticPrefix |
                        aristocraticTitle |
                        familyNamePrefix |
                        familyNameSuffix |
                        formOfAddress |
                        generation )  #REQUIRED >
<!ELEMENT AdditionalText  (#PCDATA )>
<!ELEMENT Organization  (#PCDATA )>
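
A content model such as the one for PostalAddress is essentially a regular expression over child element names.  The sketch below makes that concrete by translating the PostalAddress model by hand into a Python regular expression.  It is not a DTD validator: it checks only the order and number of the top-level children, and the function name and the sample address data are made up for illustration.

```python
import re
import xml.etree.ElementTree as ET

# The PostalAddress content model, rewritten by hand as a regular expression
# over a comma-terminated list of child element names.
POSTAL_ADDRESS_MODEL = re.compile(
    r"CountryCode,(PostalCode,)?(Region,)*(Municipality,)?"
    r"(DeliveryAddress,)?(Recipient,)*"
)


def children_match_model(xml_text):
    """Check only the order and count of the children of PostalAddress."""
    root = ET.fromstring(xml_text)
    if root.tag != "PostalAddress":
        return False
    child_names = "".join(child.tag + "," for child in root)
    return POSTAL_ADDRESS_MODEL.fullmatch(child_names) is not None


# Illustrative sample data, not taken from the HR-XML document.
GOOD = """<PostalAddress type="streetAddress">
  <CountryCode>US</CountryCode>
  <PostalCode>92093</PostalCode>
  <Region>CA</Region>
  <Municipality>La Jolla</Municipality>
  <DeliveryAddress>
    <AddressLine>9500 Gilman Drive</AddressLine>
  </DeliveryAddress>
</PostalAddress>"""

# The same kinds of elements, but PostalCode precedes the required CountryCode.
BAD = """<PostalAddress>
  <PostalCode>92093</PostalCode>
  <CountryCode>US</CountryCode>
</PostalAddress>"""

if __name__ == "__main__":
    print(children_match_model(GOOD), children_match_model(BAD))
```

Both documents are well-formed, but only the first satisfies the content model, because the DTD fixes the order of the children and makes CountryCode mandatory.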


 

ATTRIBUTE DECLARATIONS

Each element named in a DTD can have one or more ATTLIST declarations.  An empty element can still have attributes.  For example
<!ATTLIST image source     CDATA       #REQUIRED
                width      NMTOKEN     #IMPLIED
                height     NMTOKEN     #IMPLIED
                format     CDATA       #FIXED "jpeg"
                alt        CDATA       "No caption provided."
                catalogno  ID          #REQUIRED
                owner      IDREF       "Unknown_owner"
>
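#REQUIRED means the attribute must be given a value in every instance of the element.  #IMPLIED means the attribute is optional, and no default value is provided.  #FIXED means the attribute always has the value given in the declaration.  A literal value on its own is a default, used when the attribute is not given a value.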
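
The effect of default values can be observed with a parser.  The sketch below is an illustration only: it uses the expat parser from Python's standard library on a cut-down version of the image declaration, placed in an internal DTD, and the helper name collect_attributes is hypothetical.  Expat is a non-validating parser, so it fills in declared defaults but would not complain if a #REQUIRED attribute were missing.

```python
import xml.parsers.expat

# A shortened, hypothetical variant of the image declaration, in an internal DTD.
DOCUMENT = """<!DOCTYPE image [
  <!ELEMENT image EMPTY>
  <!ATTLIST image source CDATA   #REQUIRED
                  width  NMTOKEN #IMPLIED
                  format CDATA   #FIXED "jpeg"
                  alt    CDATA   "No caption provided.">
]>
<image source="photo.gif"/>"""


def collect_attributes(document):
    """Map each element name to the attributes the parser reports for it."""
    seen = {}

    def on_start(name, attributes):
        seen[name] = dict(attributes)

    parser = xml.parsers.expat.ParserCreate()
    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return seen


if __name__ == "__main__":
    print(collect_attributes(DOCUMENT)["image"])
```

The document supplies only source, yet the parser also reports format and alt with their declared values.  The #IMPLIED attribute width is simply absent.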

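CDATA means that an attribute value can be arbitrary text inside quotation marks, while NMTOKEN means the value must be a name token: a string made up only of characters that are legal in XML names, with no whitespace.  ID means the value is a name that uniquely identifies the element within the document, and IDREF means the value must match the ID of some element in the document.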

XML defines no syntax for structure inside attribute values, so nested elements are preferable for structured data.  Also, an attribute name can appear at most once in each tag instance.
 
 

ENTITIES IN A DTD

An internal entity is simply an abbreviation that can be used in an XML document.  For example in the DTD you can write:
<!ENTITY notice "Copyright Regents of the University of California, 2001.  All rights reserved.">
Then in every document using this DTD you can just write
<header>&notice;</header>
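
A small sketch of the expansion, using Python's standard ElementTree parser.  For the sake of a self-contained example the entity is declared in an internal DTD rather than an external one, and the function name header_text is made up.

```python
import xml.etree.ElementTree as ET

# The entity declaration from the notes, placed in an internal DTD.
DOCUMENT = """<!DOCTYPE header [
  <!ENTITY notice "Copyright Regents of the University of California, 2001.  All rights reserved.">
]>
<header>&notice;</header>"""


def header_text(document):
    """Return the text of the root element after entity expansion."""
    return ET.fromstring(document).text


if __name__ == "__main__":
    print(header_text(DOCUMENT))
```

The application sees the replacement text, not the reference.  Without the declaration, the same reference makes the parser reject the document.
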
External entities are useful for including non-XML data in an XML document.  You do this indirectly, by declaring the external data to be an "entity" in the DTD for the document, for example
<!ENTITY pic  SYSTEM "http://www.w3schools.com/entities/photo.gif">
Then in the document you can write
<author>&pic;</author>
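It is the job of the software that parses the XML document to refer back to the DTD and to do something with the URL it finds there.  Strictly speaking, an entity referenced as &pic; must contain parseable text; non-XML data such as an image is properly declared as an unparsed entity (using NDATA) and named in an attribute of type ENTITY.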
 
 

NAME SPACES

Conflicts between the names of XML elements are resolved using a prefix.  A namespace attribute declares a prefix for an element and all elements nested inside it, for example:
<f:table xmlns:f="http://www.w3schools.com/furniture">
The URL that identifies the namespace is just a placeholder.  No corresponding file has to exist, and no information is looked up at this URL.  (Technically it is a URI, not a URL.)
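
The sketch below shows how a namespace-aware parser treats prefixes, using ElementTree from Python's standard library.  The sample document, the second namespace URI for HTML tables, and the function name child_tags are assumptions made for illustration.  ElementTree reports each element name as {URI}localname, so the prefix itself disappears.

```python
import xml.etree.ElementTree as ET

# Two different kinds of "table", distinguished only by namespace.
# The h: namespace URI is an illustrative placeholder.
DOCUMENT = """<root>
  <h:table xmlns:h="http://www.w3.org/TR/html4/">
    <h:tr><h:td>Apples</h:td></h:tr>
  </h:table>
  <f:table xmlns:f="http://www.w3schools.com/furniture">
    <f:name>Coffee table</f:name>
  </f:table>
</root>"""


def child_tags(document):
    """Return the expanded names of the root element's children."""
    return [child.tag for child in ET.fromstring(document)]


if __name__ == "__main__":
    for tag in child_tags(DOCUMENT):
        print(tag)
```

Because only the URI matters, an element written as g:table with xmlns:g="http://www.w3schools.com/furniture" has exactly the same expanded name as f:table above.  No request is ever made to either URI.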

In an XSL document, the root element is

<xsl:stylesheet version="1.0" xmlns:xsl="http://www.w3.org/1999/XSL/Transform">
All tags that are XSL commands then start with the prefix xsl.

Unfortunately, DTDs know nothing about namespaces.  If you have a DTD for a document that uses namespaces, the DTD has to use the same prefixes as the document.  Ideally, each namespace would be associated with its own DTD, which would be included automatically whenever the namespace is used, but this is not the case.
 
 



Copyright (c) by Charles Elkan, 2002.