XML Bible, Gold Edition 0764548190, 9780764548192

This fast-paced and thorough tutorial/reference contains everything an experienced web developer needs to put XML to wor


English Pages 856 Year 2001


Chapter 1

An Eagle’s Eye View of XML









In This Chapter

✦ What is XML?
✦ Why are developers excited about XML?
✦ The life of an XML document
✦ Related technologies

This first chapter introduces you to XML. It explains in general what XML is and how it is used. It shows you how the different pieces of the XML equation fit together, and how an XML document is created and delivered to readers.

What Is XML?

XML stands for Extensible Markup Language (often written as eXtensible Markup Language to justify the acronym). XML is a set of rules for defining semantic tags that break a document into parts and identify the different parts of the document. It is a meta-markup language that defines a syntax used to define other domain-specific, semantic, structured markup languages.

XML Is a Meta-Markup Language

The first thing you need to understand about XML is that it isn’t just another markup language like the Hypertext Markup Language (HTML) or troff. These languages define a fixed set of tags that describe a fixed number of elements. If the markup language you use doesn’t contain the tag you need, you’re out of luck. You can wait for the next version of the markup language, hoping that it includes the tag you need; but then you’re really at the mercy of what the vendor chooses to include.

XML, however, is a meta-markup language. It’s a language in which you make up the tags you need as you go along. These tags must be organized according to certain general principles, but they’re quite flexible in their meaning. For instance, if you’re working on genealogy and need to describe people, births, deaths, burial sites, families, marriages, divorces, and so on, you can create tags for each of these. You don’t have to force your data to fit into paragraphs, list items, strong emphasis, or other very general categories.










Part I ✦ Introducing XML

The tags you create can be documented in a Document Type Definition (DTD). You’ll learn more about DTDs in Part II of this book. For now, think of a DTD as a vocabulary and a syntax for certain kinds of documents. For example, the MOL.DTD in Peter Murray-Rust’s Chemical Markup Language (CML) describes a vocabulary and a syntax for the molecular sciences: chemistry, crystallography, solid state physics, and the like. It includes tags for atoms, molecules, bonds, spectra, and so on. This DTD can be shared by many different people in the molecular sciences field. Other DTDs are available for other fields, and you can also create your own.

XML defines a meta syntax that domain-specific markup languages like MusicML, MathML, and CML must follow. If an application understands this meta syntax, it automatically understands all the languages built from this meta language. A browser does not need to know in advance each and every tag that might be used by thousands of different markup languages. Instead it discovers the tags used by any given document as it reads the document or its DTD. The detailed instructions about how to display the content of these tags are provided in a separate style sheet that is attached to the document.

For example, consider Schrödinger’s equation:

$$ i\hbar \frac{\partial \psi(r, t)}{\partial t} = -\frac{\hbar^{2}}{2m} \frac{\partial^{2} \psi(r, t)}{\partial x^{2}} + V(r)\,\psi(r, t) $$

Scientific papers are full of equations like this, but scientists have been waiting eight years for the browser vendors to support the tags needed to write even the most basic math. Musicians are in a similar bind, since Netscape Navigator and Internet Explorer don’t support sheet music. XML means you don’t have to wait for browser vendors to catch up with what you want to do. You can invent the tags you need, when you need them, and tell the browsers how to display these tags.

XML Describes Structure and Semantics, Not Formatting

The second thing to understand about XML is that XML markup describes a document’s structure and meaning. It does not describe the formatting of the elements on the page. Formatting can be added to a document with a style sheet. The document itself only contains tags that say what is in the document, not what the document looks like.

Chapter 1 ✦ An Eagle’s Eye View of XML

By contrast, HTML encompasses formatting, structural, and semantic markup. <B> is a formatting tag that makes its content bold. <STRONG> is a semantic tag that means its contents are especially important. <TD> is a structural tag that indicates that the contents are a cell in a table. In fact, some tags can have all three kinds of meaning. An <H1> tag can simultaneously mean 20 point Helvetica bold, a level-1 heading, and the title of the page.

For example, in HTML a song might be described using a definition title, definition data, an unordered list, and list items. But none of these elements actually has anything to do with music. The HTML might look something like this:

    <dt>Hot Cop</dt>
    <dd>by Jacques Morali, Henri Belolo, and Victor Willis
      <ul>
        <li>Producer: Jacques Morali</li>
        <li>Publisher: PolyGram Records</li>
        <li>Length: 6:20</li>
        <li>Written: 1978</li>
        <li>Artist: Village People</li>
      </ul>
    </dd>

In XML the same data might be marked up like this:

    <SONG>
      <TITLE>Hot Cop</TITLE>
      <COMPOSER>Jacques Morali</COMPOSER>
      <COMPOSER>Henri Belolo</COMPOSER>
      <COMPOSER>Victor Willis</COMPOSER>
      <PRODUCER>Jacques Morali</PRODUCER>
      <PUBLISHER>PolyGram Records</PUBLISHER>
      <LENGTH>6:20</LENGTH>
      <YEAR>1978</YEAR>
      <ARTIST>Village People</ARTIST>
    </SONG>

Instead of generic tags like <dt> and <li>, this listing uses meaningful tags like <SONG>, <TITLE>, <COMPOSER>, and <YEAR>. This has a number of advantages, including that it’s easier for a human to read the source code to determine what the author intended.

XML markup also makes it easier for automated robots to locate all of the songs in the document. In HTML, robots can’t tell more than that an element is a dt. They cannot determine whether that dt represents a song title, a definition, or just some designer’s favorite means of indenting text. In fact, a single document may well contain dt elements with all three meanings.

XML element names can be chosen such that they have extra meaning in additional contexts. For instance, they might be the field names of a database. XML is far more flexible and amenable to varied uses than HTML because a limited number of tags don’t have to serve many different purposes.
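To make the point about robots concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree module. It is not from the book: the CATALOG wrapper element, the NOTE element, and the function name list_songs are assumptions made up for illustration. The sketch only shows that a program can pick out every song by element name, which it could not do with generic dt and li elements.

```python
import xml.etree.ElementTree as ET

# Sample document. The CATALOG and NOTE elements are illustrative assumptions.
CATALOG = """
<CATALOG>
  <SONG>
    <TITLE>Hot Cop</TITLE>
    <COMPOSER>Jacques Morali</COMPOSER>
    <COMPOSER>Henri Belolo</COMPOSER>
    <COMPOSER>Victor Willis</COMPOSER>
    <YEAR>1978</YEAR>
    <ARTIST>Village People</ARTIST>
  </SONG>
  <NOTE>Not a song, so the robot skips it.</NOTE>
</CATALOG>
"""


def list_songs(xml_text):
    """Return a (title, composers, year) tuple for every SONG element."""
    root = ET.fromstring(xml_text)
    songs = []
    for song in root.iter("SONG"):  # visits SONG elements at any depth
        title = song.findtext("TITLE")
        composers = [c.text for c in song.findall("COMPOSER")]
        year = song.findtext("YEAR")
        songs.append((title, composers, year))
    return songs


if __name__ == "__main__":
    for title, composers, year in list_songs(CATALOG):
        print(title, composers, year)
```

Because the element names carry the meaning, the same function would keep working if a style sheet later changed how songs are displayed.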


Why Are Developers Excited about XML?

XML makes easy many Web-development tasks that are extremely painful using only HTML, and it makes possible some tasks that are impossible with HTML. Because XML is eXtensible, developers like it for many reasons. Which ones most interest you depends on your individual needs. But once you learn XML, you’re likely to discover that it’s the solution to more than one problem you’re already struggling with. This section investigates some of the generic uses of XML that excite developers. In Chapter 2, you’ll see some of the specific applications that have already been developed with XML.

Design of Domain-Specific Markup Languages

XML allows various professions (e.g., music, chemistry, math) to develop their own domain-specific markup languages. This allows individuals in the field to trade notes, data, and information without worrying about whether or not the person on the receiving end has the particular proprietary payware that was used to create the data. They can even send documents to people outside the profession with a reasonable confidence that the people who receive them will at least be able to view the documents.

Furthermore, the creation of markup languages for individual domains does not lead to bloatware or unnecessary complexity for those outside the profession. You may not be interested in electrical engineering diagrams, but electrical engineers are. You may not need to include sheet music in your Web pages, but composers do. XML lets the electrical engineers describe their circuits and the composers notate their scores, mostly without stepping on each other’s toes. Neither field will need special support from the browser manufacturers or complicated plug-ins, as is true today.

Self-Describing Data

Much computer data from the last 40 years is lost, not because of natural disaster or decaying backup media (though those are problems too, ones XML doesn’t solve), but simply because no one bothered to document how one actually reads the data media and formats. A Lotus 1-2-3 file on a 10-year-old 5.25-inch floppy disk may be irretrievable in most corporations today without a huge investment of time and resources. Data in a less-known binary format like Lotus Jazz may be gone forever.

XML is, at a basic level, an incredibly simple data format. It can be written in 100 percent pure ASCII text as well as in a few other well-defined formats. ASCII text is reasonably resistant to corruption. The removal of bytes or even large sequences of bytes does not noticeably corrupt the remaining text. This starkly contrasts with many other formats, such as compressed data or serialized Java objects, where the corruption or loss of even a single byte can render the entire remainder of the file unreadable.


At a higher level, XML is self-describing. Suppose you’re an information archaeologist in the 23rd century and you encounter this chunk of XML code on an old floppy disk that has survived the ravages of time:

    <PERSON ID="p1100" SEX="M">
      <NAME>
        <GIVEN>Judson</GIVEN>
        <SURNAME>McDaniel</SURNAME>
      </NAME>
      <BIRTH>
        <DATE>21 Feb 1834</DATE>
      </BIRTH>
      <DEATH>
        <DATE>9 Dec 1905</DATE>
      </DEATH>
    </PERSON>

Even if you’re not familiar with XML, assuming you speak a reasonable facsimile of 20th century English, you’ve got a pretty good idea that this fragment describes a man named Judson McDaniel, who was born on February 21, 1834 and died on December 9, 1905. In fact, even with gaps in, or corruption of the data, you could probably still extract most of this information. The same could not be said for some proprietary spreadsheet or word-processor format.

Furthermore, XML is very well documented. The W3C’s XML 1.0 specification and numerous paper books like this one tell you exactly how to read XML data. There are no secrets waiting to trip up the unwary.
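As a rough illustration of what "self-describing" buys a program, the following Python sketch walks a person record without knowing its vocabulary in advance and reports every piece of text together with the path of element names that labels it. The element and attribute names in FRAGMENT and the helper name describe are assumptions chosen for this example, and xml.etree.ElementTree is simply the parser that ships with Python.

```python
import xml.etree.ElementTree as ET

# Illustrative record; the element and attribute names are assumptions.
FRAGMENT = """<PERSON ID="p1100" SEX="M">
  <NAME><GIVEN>Judson</GIVEN><SURNAME>McDaniel</SURNAME></NAME>
  <BIRTH><DATE>21 Feb 1834</DATE></BIRTH>
  <DEATH><DATE>9 Dec 1905</DATE></DEATH>
</PERSON>"""


def describe(element, path=""):
    """Yield (path, text) pairs for every element that carries text."""
    here = f"{path}/{element.tag}"
    text = (element.text or "").strip()  # ignore whitespace-only content
    if text:
        yield here, text
    for child in element:
        yield from describe(child, here)


facts = list(describe(ET.fromstring(FRAGMENT)))

if __name__ == "__main__":
    for path, text in facts:
        print(path, "=", text)
```

The paths that come out, such as /PERSON/BIRTH/DATE, are the document's own labels; nothing in the code had to be told what a birth date is.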

Interchange of Data Among Applications

Since XML is non-proprietary and easy to read and write, it’s an excellent format for the interchange of data among different applications. One such format under current development is the Open Financial Exchange Format (OFX). OFX is designed to let personal finance programs like Microsoft Money and Quicken trade data. The data can be sent back and forth between programs and exchanged with banks, brokerage houses, and the like.

Cross-Reference: OFX is discussed in Chapter 2.

As noted above, XML is a non-proprietary format, not encumbered by copyright, patent, trade secret, or any other sort of intellectual property restriction. It has been designed to be extremely powerful, while at the same time being easy for both human beings and computer programs to read and write. Thus it’s an obvious choice for exchange languages.

By using XML instead of a proprietary data format, you can use any tool that understands XML to work with your data. You can even use different tools for different purposes, one program to view and another to edit, for instance. XML keeps you from getting locked into a particular program simply because that’s what your data is already written in, or because that program’s proprietary format is all your correspondent can accept.

For example, many publishers require submissions in Microsoft Word. This means that most authors have to use Word, even if they would rather use WordPerfect or Nisus Writer. So it’s extremely difficult for any other company to publish a competing word processor unless they can read and write Word files. Since doing so requires a developer to reverse-engineer the undocumented Word file format, it’s a significant investment of limited time and resources. Most other word processors have a limited ability to read and write Word files, but they generally lose track of graphics, macros, styles, revision marks, and other important features. The problem is that Word’s document format is undocumented, proprietary, and constantly changing. Word tends to end up winning by default, even when writers would prefer to use other, simpler programs. If a common word-processing format were developed in XML, writers could use the program of their choice.

Structured and Integrated Data

XML is ideal for large and complex documents because the data is structured. It not only lets you specify a vocabulary that defines the elements in the document; it also lets you specify the relations between elements. For example, if you’re putting together a Web page of sales contacts, you can require that every contact have a phone number and an email address. If you’re inputting data for a database, you can make sure that no fields are missing. You can require that every book have an author. You can even provide default values to be used when no data is entered.

XML also provides a client-side include mechanism that integrates data from multiple sources and displays it as a single document. The data can even be rearranged on the fly. Parts of it can be shown or hidden depending on user actions. This is extremely useful when you’re working with large information repositories like relational databases.
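In XML itself, a rule such as "every contact has a phone number and an email address" would normally be declared in a DTD and enforced by a validating parser. Python's standard library parser does not validate against DTDs, so the sketch below swaps in a hand-written check to show the idea only. The element names CONTACTS, CONTACT, NAME, PHONE, and EMAIL, the sample data, and the function name missing_fields are all assumptions made for this illustration.

```python
import xml.etree.ElementTree as ET

# Children that every CONTACT must have (an assumed rule for this sketch).
REQUIRED = ("PHONE", "EMAIL")

# Illustrative data; the second contact deliberately lacks an EMAIL element.
CONTACTS = """<CONTACTS>
  <CONTACT><NAME>A. Example</NAME><PHONE>555-0100</PHONE><EMAIL>a@example.com</EMAIL></CONTACT>
  <CONTACT><NAME>B. Example</NAME><PHONE>555-0101</PHONE></CONTACT>
</CONTACTS>"""


def missing_fields(xml_text):
    """Map each incomplete contact's name to the required children it lacks."""
    problems = {}
    for contact in ET.fromstring(xml_text).iter("CONTACT"):
        missing = [tag for tag in REQUIRED if contact.find(tag) is None]
        if missing:
            name = contact.findtext("NAME", default="(unnamed)")
            problems[name] = missing
    return problems


if __name__ == "__main__":
    print(missing_fields(CONTACTS))
```

A DTD expresses the same constraint declaratively, so the check travels with the document type rather than living in each program that reads it.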

The Life of an XML Document

XML is, at the root, a document format. It is a series of rules about what XML documents look like. There are two levels of conformity to the XML standard. The first is well-formedness and the second is validity. Part I of this book shows you how to write well-formed documents. Part II shows you how to write valid documents.

HTML is a document format designed for use on the Internet and inside Web browsers. XML can certainly be used for that, as this book demonstrates. However, XML is far more broadly applicable. As previously discussed, it can be used as a storage format for word processors, as a data interchange format for different programs, as a means of enforcing conformity with intranet templates, and as a way to preserve data in a human-readable fashion.


However, like all data formats, XML needs programs and content before it’s useful. So it isn’t enough to understand only XML itself, which is little more than a specification for what data should look like. You also need to know how XML documents are edited, how processors read XML documents and pass the information they read on to applications, and what these applications do with that data.

Editors

XML documents are most commonly created with an editor. This may be a basic text editor like Notepad or vi that doesn’t really understand XML at all. On the other hand, it may be a completely WYSIWYG editor like Adobe FrameMaker that insulates you almost completely from the details of the underlying XML format. Or it may be a structured editor like JUMBO that displays XML documents as trees. For the most part, the fancy editors aren’t very useful yet, so this book concentrates on writing raw XML by hand in a text editor.

Other programs can also create XML documents. For example, later in this book, in the chapter on designing a new DTD, you’ll see some XML data that came straight out of a FileMaker database. In this case, the data was first entered into the FileMaker database. Then a FileMaker calculation field converted that data to XML. In general, XML works extremely well with databases.

Cross-Reference: Specifically, you’ll see this in Chapter 23, Designing a New XML Application.

In any case, the editor or other program creates an XML document. More often than not this document is an actual file on some computer’s hard disk, but it doesn’t absolutely have to be. For example, the document may be a record or a field in a database, or it may be a stream of bytes received from a network.

Parsers and Processors

An XML parser (also known as an XML processor) reads the document and verifies that the XML it contains is well formed. It may also check that the document is valid, though this test is not required. The exact details of these tests will be covered in Part II. But assuming the document passes the tests, the processor converts the document into a tree of elements.
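The parser's two jobs described above, rejecting text that is not well formed and otherwise handing back a tree, can be seen in a few lines of Python. This is only a sketch using the standard xml.etree.ElementTree module, which checks well-formedness but does not validate; the GREETING and B element names and the function name parse_or_report are assumptions for illustration.

```python
import xml.etree.ElementTree as ET


def parse_or_report(xml_text):
    """Return the root element, or None if the text is not well formed."""
    try:
        return ET.fromstring(xml_text)
    except ET.ParseError as err:
        # The parser refuses to build a tree from malformed input.
        print("not well formed:", err)
        return None


# Every start tag has a matching end tag, so this parses into a tree.
good = parse_or_report("<GREETING>Hello <B>XML</B></GREETING>")

# Here <B> is never closed, so the parser reports an error instead.
bad = parse_or_report("<GREETING>Hello <B>XML</GREETING>")

if __name__ == "__main__":
    print(good.tag, [child.tag for child in good])
    print(bad)
```

The application that receives the tree never sees the angle brackets; it works with elements, their children, and their text.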

Browsers and Other Tools

Finally the parser passes the tree or individual nodes of the tree to the end application. This application may be a browser like Mozilla or some other program that understands what to do with the data. If it’s a browser, the data will be displayed to the user. But other programs may also receive the data. For instance, the data might be interpreted as input to a database, a series of musical notes to play, or a Java program that should be launched. XML is extremely flexible and can be used for many different purposes.


The Process Summarized

To summarize, an XML document is created in an editor. The XML parser reads the document and converts it into a tree of elements. The parser passes the tree to the browser that displays it. Figure 1-1 shows this process.

Figure 1-1: XML Document Life Cycle

It’s important to note that all of these pieces are independent and decoupled from each other. The only thing that connects them all is the XML document. You can change the editor program independently of the end application. In fact you may not always know what the end application is. It may be an end user reading your work, or it may be a database sucking in data, or it may even be something that hasn’t been invented yet. It may even be all of these. The document is independent of the programs that read it.

Note: HTML is also somewhat independent of the programs that read and write it, but it’s really only suitable for browsing. Other uses, like database input, are outside its scope. For example, HTML does not provide a way to force an author to include certain required content, like requiring that every book have an ISBN number. In XML you can require this. You can even enforce the order in which particular elements appear (for example, that level-2 headers must always follow level-1 headers).

Related Technologies

XML doesn’t operate in a vacuum. Using XML as more than a data format requires interaction with a number of related technologies. These technologies include HTML for backward compatibility with legacy browsers, the CSS and XSL style-sheet languages, URLs and URIs, the XLL linking language, and the Unicode character set.

Hypertext Markup Language

Mozilla 5.0 and Internet Explorer 5.0 are the first Web browsers to provide some (albeit incomplete) support for XML, but it takes about two years before most users have upgraded to a particular release of the software. (In 1999, my wife Beth is still using Netscape 1.1.) So you’re going to need to convert your XML content into classic HTML for some time to come.

Therefore, before you jump into XML, you should be completely comfortable with HTML. You don’t need to be an absolutely snazzy graphical designer, but you should know how to link from one page to the next, how to include an image in a document, how to make text bold, and so forth. Since HTML is the most common output format of XML, the more familiar you are with HTML, the easier it will be to create the effects you want.

On the other hand, if you’re accustomed to using tables or single-pixel GIFs to arrange objects on a page, or if you start to make a Web site by sketching out its appearance rather than its content, then you’re going to have to unlearn some bad habits. As previously discussed, XML separates the content of a document from the appearance of the document. The content is developed first; then a format is attached to that content with a style sheet. Separating content from style is an extremely effective technique that improves both the content and the appearance of the document. Among other things, it allows authors and designers to work more independently of each other. However, it does require a different way of thinking about the design of a Web site, and perhaps even the use of different project-management techniques when multiple people are involved.

Cascading Style Sheets

Since XML allows arbitrary tags to be included in a document, there isn’t any way for the browser to know in advance how each element should be displayed. When you send a document to a user you also need to send along a style sheet that tells the browser how to format individual elements. One kind of style sheet you can use is a Cascading Style Sheet (CSS).

CSS, initially designed for HTML, defines formatting properties like font size, font family, font weight, paragraph indentation, paragraph alignment, and other styles that can be applied to particular elements. For example, CSS allows HTML documents to specify that all H1 elements should be formatted in 32 point centered Helvetica bold. Individual styles that override the browser’s defaults can be applied to most HTML tags. Multiple style sheets can be applied to a single document, and multiple styles can be applied to a single element. The styles then cascade according to a particular set of rules.

Cross-Reference: CSS rules and properties are explored in more detail in Chapter 12, Cascading Style Sheets Level 1, and Chapter 13, Cascading Style Sheets Level 2.

It’s easy to apply CSS rules to XML documents. You simply change the names of the tags you’re applying the rules to. Mozilla 5.0 directly supports CSS style sheets combined with XML documents, though at present, it crashes rather too frequently.


Extensible Style Language

The Extensible Style Language (XSL) is a more advanced style-sheet language specifically designed for use with XML documents. XSL documents are themselves well-formed XML documents.

XSL documents contain a series of rules that apply to particular patterns of XML elements. An XSL processor reads an XML document and compares what it sees to the patterns in a style sheet. When a pattern from the XSL style sheet is recognized in the XML document, the rule outputs some combination of text. Unlike cascading style sheets, this output text is somewhat arbitrary and is not limited to the input text plus formatting information.

CSS can only change the format of a particular element, and it can only do so on an element-wide basis. XSL style sheets, on the other hand, can rearrange and reorder elements. They can hide some elements and display others. Furthermore, they can choose the style to use not just based on the tag, but also on the contents and attributes of the tag, on the position of the tag in the document relative to other elements, and on a variety of other criteria.

CSS has the advantage of broader browser support. However, XSL is far more flexible and powerful, and better suited to XML documents. Furthermore, XML documents with XSL style sheets can be easily converted to HTML documents with CSS style sheets.

Cross-Reference: XSL style sheets will be explored in great detail in Chapter 14, XSL Transformations, and Chapter 15, XSL Formatting Objects.
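The rule-matching idea behind XSL can be imitated in a few lines of Python. To be clear, this is not XSL and not an XSL processor: it is a toy stand-in, with a hypothetical RULES table and transform function, that only shows how matching an element to a rule can produce arbitrary output text and how elements without a rule can be hidden. The SONG, TITLE, COMPOSER, and YEAR element names are assumptions for the example.

```python
import xml.etree.ElementTree as ET
from html import escape

# Hypothetical rule table: element name -> function that builds output text
# from the already-transformed content of that element.
RULES = {
    "SONG": lambda inner: f"<div>{inner}</div>",
    "TITLE": lambda inner: f"<h1>{inner}</h1>",
    "COMPOSER": lambda inner: f"<p>Composer: {inner}</p>",
}

SOURCE = """<SONG>
  <TITLE>Hot Cop</TITLE>
  <COMPOSER>Jacques Morali</COMPOSER>
  <YEAR>1978</YEAR>
</SONG>"""


def transform(element):
    """Apply the rule matching this element; elements with no rule are hidden."""
    rule = RULES.get(element.tag)
    if rule is None:
        return ""
    # Escape the element's own text, then append the transformed children.
    inner = escape((element.text or "").strip())
    inner += "".join(transform(child) for child in element)
    return rule(inner)


html_output = transform(ET.fromstring(SOURCE))

if __name__ == "__main__":
    print(html_output)
```

Note that YEAR disappears from the output because no rule matches it, and COMPOSER gains a label that was never in the source, two things a formatting-only language like CSS Level 1 does not set out to do.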

URLs and URIs

XML documents can live on the Web, just like HTML and other documents. When they do, they are referred to by Uniform Resource Locators (URLs), just like HTML files. For example, at the URL http://www.hypermedic.com/style/xml/tempest.xml you’ll find the complete text of Shakespeare’s Tempest marked up in XML.

Although URLs are well understood and well supported, the XML specification uses the more general Uniform Resource Identifier (URI). URIs are a more general architecture for locating resources on the Internet; they focus a little more on the resource and a little less on the location. In theory, a URI can find the closest copy of a mirrored document or locate a document that has been moved from one site to another. In practice, URIs are still an area of active research, and the only kinds of URIs that are actually supported by current software are URLs.


XLinks and XPointers

As long as XML documents are posted on the Internet, you’re going to want to be able to address them and hot link between them. Standard HTML link tags can be used in XML documents, and HTML documents can link to XML documents. For example, this HTML link points to the aforementioned copy of the Tempest rendered in XML:

    <a href="http://www.hypermedic.com/style/xml/tempest.xml">The Tempest by Shakespeare</a>

Note: Whether the browser can display this document if you follow the link depends on just how well the browser handles XML files. Most current browsers don’t handle them very well.

However, XML lets you go further with XLinks for linking to documents and XPointers for addressing individual parts of a document.

XLinks enable any element to become a link, not just an A element. Furthermore, links can be bi-directional, multidirectional, or even point to multiple mirror sites from which the nearest is selected. XLinks use normal URLs to identify the site they’re linking to.

Cross-Reference: XLinks are discussed in Chapter 16, XLinks.

XPointers enable links to point not just to a particular document at a particular location, but to a particular part of a particular document. An XPointer can refer to a particular element of a document, to the first, the second, or the 17th such element, to the first element that’s a child of a given element, and so on. XPointers provide extremely powerful connections between documents that do not require the targeted document to contain additional markup just so its individual pieces can be linked to.

Furthermore, unlike HTML anchors, XPointers don’t just refer to a point in a document. They can point to ranges or spans. Thus an XPointer might be used to select a particular part of a document, perhaps so that it can be copied or loaded into a program.

Cross-Reference: XPointers are discussed in Chapter 17, XPointers.
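The flavor of this kind of addressing ("the second such element", "the first child of a given element") can be tried out in Python, with one loud caveat: the sketch below uses the small XPath subset built into xml.etree.ElementTree, not XPointer itself, and it selects nodes in an already-parsed tree rather than following a link. The PLAY, ACT, SCENE, and TITLE element names and the sample text are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# Illustrative document; the element names are assumptions.
PLAY = """<PLAY>
  <TITLE>The Tempest</TITLE>
  <ACT><TITLE>Act 1</TITLE><SCENE>Scene 1</SCENE><SCENE>Scene 2</SCENE></ACT>
  <ACT><TITLE>Act 2</TITLE><SCENE>Scene 1</SCENE></ACT>
</PLAY>"""

root = ET.fromstring(PLAY)

# "The second such element": positions in path expressions count from 1.
second_act = root.find("ACT[2]")

# "The second SCENE inside the first ACT".
act1_scene2 = root.find("ACT[1]/SCENE[2]")

# "The first element that's a child of a given element": plain indexing
# on an element returns its children in document order.
first_child_of_act1 = root.find("ACT")[0]

if __name__ == "__main__":
    print(second_act.findtext("TITLE"))
    print(act1_scene2.text)
    print(first_child_of_act1.tag)
```

None of these selections needed an ID or anchor inside the target document, which is the property the text attributes to XPointers.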


The Unicode Character Set

The Web is international, yet most of the text you’ll find on it is in English. XML is starting to change that. XML provides full support for the Unicode character set and its encodings, including compact ones such as UTF-8. This character set supports almost every character commonly used in every modern script on Earth.

Unfortunately, XML alone is not enough. To read a script you need three things:

1. A character set for the script
2. A font for the character set
3. An operating system and application software that understand the character set

If you want to write in the script as well as read it, you’ll also need an input method for the script. However, XML defines character references that allow you to use pure ASCII to encode characters not available in your native character set. This is sufficient for an occasional quote in Greek or Chinese, though you wouldn’t want to rely on it to write a novel in another language.

Cross-Reference: In Chapter 7, Foreign Languages and non-Roman Text, you’ll explore how international text is represented in computers, how XML understands text, and how you can use the software you have to read and write in languages other than English.
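Character references are easy to see in action. In this Python sketch the source text is pure ASCII, yet after parsing it contains Greek and Chinese characters; the QUOTE element name and the variable names are assumptions for illustration. The last step goes the other way, using Python's standard "xmlcharrefreplace" error handler to turn non-ASCII characters back into decimal character references.

```python
import xml.etree.ElementTree as ET

# Pure ASCII source: decimal references for three Greek letters and
# hexadecimal references for two Chinese characters.
ASCII_SOURCE = "<QUOTE>&#945;&#946;&#947; and &#x4E2D;&#x6587;</QUOTE>"

# The parser resolves each reference to the Unicode character it names.
decoded = ET.fromstring(ASCII_SOURCE).text

# Going back: replace every non-ASCII character with a decimal reference.
reencoded = decoded.encode("ascii", "xmlcharrefreplace").decode("ascii")

if __name__ == "__main__":
    print(ASCII_SOURCE.isascii())
    print(decoded)
    print(reencoded)
```

Whether the decoded characters actually show up correctly on screen still depends on the font and software available, which is the three-item list above.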

How the Technologies Fit Together

XML defines a grammar for tags you can use to mark up a document. An XML document is marked up with XML tags. The default character set for XML documents is Unicode.

Among other things, an XML document may contain hypertext links to other documents and resources. These links are created according to the XLink specification. XLinks identify the documents they’re linking to with URIs (in theory) or URLs (in practice). An XLink may further specify the individual part of a document it’s linking to. These parts are addressed via XPointers.

If an XML document is intended to be read by human beings (and not all XML documents are), then a style sheet provides instructions about how individual elements are formatted. The style sheet may be written in any of several style-sheet languages. CSS and XSL are the two most popular style-sheet languages, though there are others, including DSSSL (the Document Style Semantics and Specification Language), on which XSL is based.


Caution: I’ve outlined a lot of exciting stuff in this chapter. However, honesty compels me to tell you that I haven’t discussed all of it yet. In fact, much of what I’ve described is the promise of XML rather than the current reality. XML has a lot of people in the software industry very excited, and a lot of programmers are working very hard to turn these dreams into reality. New software is released every day that brings us closer to XML nirvana, but this is all very new, and some of the software isn’t fully cooked yet. Throughout the rest of this book, I’ll be careful to point out not only what is supposed to happen, but what actually does happen. Depressingly these are all too often not the same thing. Nonetheless, with a little caution you can do real work right now with XML.

Summary

In this chapter, you have learned some of the things that XML can do for you. In particular, you have learned:

✦ XML is a meta-markup language that enables the creation of markup languages for particular documents and domains.

✦ XML tags describe the structure and semantics of a document’s content, not the format of the content. The format is described in a separate style sheet.

✦ XML grew out of many users’ frustration with the complexity of SGML and the inadequacies of HTML.

✦ XML documents are created in an editor, read by a parser, and displayed by a browser.

✦ XML on the Web rests on the foundations provided by HTML, Cascading Style Sheets, and URLs.

✦ Numerous supporting technologies layer on top of XML, including XSL style sheets, XLinks, and XPointers. These let you do more than you can accomplish with just CSS and URLs.

✦ Be careful. XML isn’t completely finished. It will change and expand, and you will encounter bugs in current XML software.

In the next chapter, you’ll see a number of XML applications, and learn about some ways XML is being used in the real world today. Examples include vector graphics, music notation, mathematics, chemistry, human resources, Webcasting, and more.








    2

    C H A P T E R

    An Introduction to XML Applications









    In This Chapter

    What is an XML application?

    XML for XML

    Behind-the-scene uses of XML

    In this chapter we’ll be looking at some examples of XML applications, markup languages used to further refine XML, and behind-the-scene uses of XML. It is inspiring to look at some of the uses to which XML has already been put, even in this early stage of its development. This chapter will give you some idea of the wide applicability of XML. Many more XML applications are being created or ported from other formats as I write this.

    CrossReference

    Part V covers some of the XML applications discussed in this chapter in more detail.

    What Is an XML Application?

    XML is a meta-markup language for designing domain-specific markup languages. Each XML-based markup language is called an XML application. This is not an application that uses XML like the Mozilla Web browser, the Gnumeric spreadsheet, or the XML Pro editor, but rather an application of XML to a specific domain such as Chemical Markup Language (CML) for chemistry or GedML for genealogy. Each XML application has its own syntax and vocabulary. This syntax and vocabulary adheres to the fundamental rules of XML. This is much like human languages, which each have their own vocabulary and grammar, while at the same time adhering to certain fundamental rules imposed by human anatomy and the structure of the brain.










    Part I ✦ Introducing XM L

    XML is an extremely flexible format for text-based data. The reason XML was chosen as the foundation for the wildly different applications discussed in this chapter (aside from the hype factor) is that XML provides a sensible, well-documented format that’s easy to read and write. By using this format for its data, a program can offload a great quantity of detailed processing to a few standard free tools and libraries. Furthermore, it’s easy for such a program to layer additional levels of syntax and semantics on top of the basic structure XML provides.
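To make the point about offloading work to standard tools concrete, here is a minimal sketch in Python using the standard library's xml.etree.ElementTree parser. The genealogy vocabulary (family, person, name, birth, death) is invented for this illustration and is not taken from GedML or any other published XML application; only the parser calls are real.

```python
import xml.etree.ElementTree as ET

# A made-up genealogy vocabulary. These tag names are assumptions for
# illustration only, not part of GedML or any other real XML application.
DOCUMENT = """\
<family>
  <person id="p1">
    <name>Ada Example</name>
    <birth year="1850"/>
    <death year="1921"/>
  </person>
  <person id="p2">
    <name>Ben Example</name>
    <birth year="1848"/>
  </person>
</family>
"""


def summarize(xml_text):
    """Return (name, birth year, death year or None) for each person."""
    # The generic parser handles all tokenizing and well-formedness checks.
    root = ET.fromstring(xml_text)
    people = []
    for person in root.findall("person"):
        name = person.findtext("name")
        birth = person.find("birth").get("year")
        death_element = person.find("death")
        death = death_element.get("year") if death_element is not None else None
        people.append((name, birth, death))
    return people


if __name__ == "__main__":
    for name, birth, death in summarize(DOCUMENT):
        print(name, birth, death or "no death recorded")
```

The program itself only knows what a person, a birth, and a death mean; everything about angle brackets, attributes, and nesting is left to the library, which is the division of labor the paragraph above describes.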

    Chemical Markup Language

    Peter Murray-Rust’s Chemical Markup Language (CML) may have been the first XML application. CML was originally developed as an SGML application, and gradually transitioned to XML as the XML standard developed. In its most simplistic form, CML is “HTML plus molecules”, but it has applications far beyond the limited confines of the Web.

    Molecular documents often contain thousands of different, very detailed objects. For example, a single medium-sized organic molecule may contain hundreds of atoms, each with several bonds. CML seeks to organize these complex chemical objects in a straightforward manner that can be understood, displayed, and searched by a computer. CML can be used for molecular structures and sequences, spectrographic analysis, crystallography, publishing, chemical databases, and more. Its vocabulary includes molecules, atoms, bonds, crystals, formulas, sequences, symmetries, reactions, and other chemistry terms. For instance, Listing 2-1 is a basic CML document for water (H2O):

    Listing 2-1: The water molecule H2O



    H

    1 2 1



    O H 2 3 1

    The biggest improvement CML offers over traditional approaches to managing chemical data is ease of searching. CML also enables complex molecular data to be sent over the Web. Because the underlying XML is platform-independent, the problem of platform-dependency that plagues the binary formats used by traditional chemical software and documents like the Protein Data Bank (PDB) format or MDL Molfiles is avoided.

    Murray-Rust also created JUMBO, the first general-purpose XML browser. Figure 2-1 shows JUMBO displaying a CML file. JUMBO works by assigning each XML element to a Java class that knows how to render that element. To allow JUMBO to support new elements, you simply write Java classes for those elements. JUMBO is distributed with classes for displaying the basic set of CML elements including molecules, atoms, and bonds, and is available at http://www.xml-cml.org/.

    Figure 2-1: The JUMBO browser displaying a CML file

    Mathematical Markup Language

    Legend claims that Tim Berners-Lee invented the World Wide Web and HTML at CERN so that high-energy physicists could exchange papers and preprints. Personally I’ve never believed that. I grew up in physics; and while I’ve wandered back and forth between physics, applied math, astronomy, and computer science over the years, one thing the papers in all of these disciplines had in common was lots and lots of equations. Until now, nine years after the Web was invented, there hasn’t been any good way to include equations in Web pages.

    There have been a few hacks (Java applets that parse a custom syntax, converters that turn LaTeX equations into GIF images, custom browsers that read TeX files) but none of these have produced high quality results, and none of them have caught on with Web authors, even in scientific fields. Finally, XML is starting to change this.


    The Mathematical Markup Language (MathML) is an XML application for mathematical equations. MathML is sufficiently expressive to handle pretty much all forms of math, from grammar-school arithmetic through calculus and differential equations. It can handle many considerably more advanced topics as well, though there are definite gaps in some of the more advanced and obscure notations used by certain sub-fields of mathematics. While there are limits to MathML on the high end of pure mathematics and theoretical physics, it is eloquent enough to handle almost all educational, scientific, engineering, business, economics, and statistics needs. And MathML is likely to be expanded in the future, so even the purest of the pure mathematicians and the most theoretical of the theoretical physicists will be able to publish and do research on the Web. MathML completes the development of the Web into a serious tool for scientific research and communication (despite its long digression to make it suitable as a new medium for advertising brochures).

    Netscape Navigator and Internet Explorer do not yet support MathML. Nonetheless, it is the fervent hope of many mathematicians that they soon will. The W3C has integrated some MathML support into their test-bed browser, Amaya. Figure 2-2 shows Amaya displaying the covariant form of Maxwell’s equations written in MathML.

    On the CD-ROM

    Amaya is on the CD-ROM in the browsers/amaya directory.

    Figure 2-2: The Amaya browser displaying the covariant form of Maxwell’s equations written in MathML

    The XML file the Amaya browser is displaying is given in Listing 2-2:

    Listing 2-2: Maxwell’s Equations in MathML

    Fiat Lux

    And God said,



    δ α

    F αβ

    =

    4 π

    c

    J

    β



    and there was light



    Listing 2-2 is an example of a mixed HTML/XML page. The headers and paragraphs of text (“Fiat Lux”, “Maxwell’s Equations”, “And God said”, “and there was light”) are given in classic HTML. The actual equations are written in MathML, an application of XML.


    In general, such mixed pages require special support from the browser, as is the case here, or perhaps plug-ins, ActiveX controls, or JavaScript programs that parse and display the embedded XML data. Ultimately, of course, you want a browser like Mozilla 5.0 or Internet Explorer 5.0 that can parse and display pure XML files without an HTML intermediary.

    Channel Definition Format

    Microsoft’s Channel Definition Format (CDF) is an XML application for defining channels. Web sites use channels to upload information to readers who subscribe to the site rather than waiting for them to come and get it. This is alternately called Webcasting or push. CDF was first introduced in Internet Explorer 4.0.

    A CDF document is an XML file, separate from, but linked to, an HTML document on the site being pushed. The channel defined in the CDF document determines which pages are sent to the readers, how the pages are transported, and how often the pages are sent. Pages can either be pushed by sending notifications, or even whole Web sites, to subscribers; or pulled down by the readers at their convenience.

    You can add CDF to your site without changing any of the existing content. You simply add an invisible link to a CDF file on your home page. Then when a reader visits the page, the browser displays a dialog box asking them if they want to subscribe to the channel. If the reader chooses to subscribe, the browser downloads a copy of the CDF document describing the channel. The browser then combines the parameters specified in the CDF document with the user’s own preferences to determine when to check back with the server for new content. This isn’t true push, because the client has to initiate the connection, but it still happens without an explicit request by the reader. Figure 2-3 shows the IDG Active Channel in Internet Explorer 4.0.

    Figure 2-3: The IDG Active Channel in Internet Explorer 4.0

    CrossReference

    CDF is covered in more detail in Chapter 21, Pushing Web Sites with CDF.

    On the CD-ROM

    Internet Explorer 4.0 is on the CD-ROM in the browsers/ie4 directory.

    Classic Literature

    Jon Bosak has translated the complete plays of Shakespeare into XML. The complete text of the plays is included, and XML markup is used to distinguish between titles, subtitles, stage directions, speeches, lines, speakers, and more.

    On the CD-ROM

    The complete set of plays is on the CD-ROM in the examples/shakespeare directory.

    You may ask yourself what this offers over a book, or even a plain text file. To a human reader, the answer is not much. But to a computer doing textual analysis, it offers the opportunity to easily distinguish between the different elements into which the plays have been divided. For instance, it makes it quite simple for the computer to go through the text and extract all of Romeo’s lines. Furthermore, by altering the style sheet with which the document is formatted, an actor could easily print a version of the document in which all their lines were formatted in bold face, and the lines immediately before and after theirs were italicized. Anything else you might imagine that requires separating a play into the lines uttered by different speakers is much more easily accomplished with the XML-formatted versions than with the raw text.

    Bosak has also marked up English translations of the Old and New Testaments, the Koran, and the Book of Mormon in XML. The markup in these is a little different. For instance, it doesn’t distinguish between speakers. Thus you couldn’t use these particular XML documents to create a red-letter Bible, for example, although a different set of tags might allow you to do that. (A red-letter Bible prints words spoken by Jesus in red.) And because these files are in English rather than the original languages, they are not as useful for scholarly textual analysis. Still, time and resources permitting, those are exactly the sorts of things XML would allow you to do if you wanted to. You’d simply need to invent a different vocabulary and syntax than the one Bosak used that would still describe the same data.
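The idea of pulling out all of one speaker's lines can be sketched in a few lines of Python. This is an illustration under stated assumptions, not a description of the files on the CD-ROM: the element names SPEECH, SPEAKER, and LINE are assumed here, and the tiny sample scene stands in for a complete play.

```python
import xml.etree.ElementTree as ET

# A tiny stand-in for one marked-up play. The element names (SPEECH,
# SPEAKER, LINE) are assumptions; check them against the real files.
SAMPLE = """\
<SCENE>
  <SPEECH>
    <SPEAKER>ROMEO</SPEAKER>
    <LINE>But, soft! what light through yonder window breaks?</LINE>
    <LINE>It is the east, and Juliet is the sun.</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>JULIET</SPEAKER>
    <LINE>O Romeo, Romeo! wherefore art thou Romeo?</LINE>
  </SPEECH>
  <SPEECH>
    <SPEAKER>ROMEO</SPEAKER>
    <LINE>Shall I hear more, or shall I speak at this?</LINE>
  </SPEECH>
</SCENE>
"""


def lines_spoken_by(xml_text, speaker):
    """Collect the text of every LINE inside a SPEECH by the given speaker."""
    root = ET.fromstring(xml_text)
    lines = []
    for speech in root.iter("SPEECH"):
        # A speech may name more than one speaker, so check them all.
        speakers = [element.text for element in speech.findall("SPEAKER")]
        if speaker in speakers:
            lines.extend(line.text for line in speech.findall("LINE"))
    return lines


if __name__ == "__main__":
    for line in lines_spoken_by(SAMPLE, "ROMEO"):
        print(line)
```

Doing the same job on a raw text file would mean guessing where each speech starts and ends from capitalization and indentation; with the markup, the structure is stated outright.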


    On the CD-ROM

    The XML-ized Bible, Koran, and Book of Mormon are all on the CD-ROM in the examples/religion directory.

    Synchronized Multimedia Integration Language

    The Synchronized Multimedia Integration Language (SMIL, pronounced “smile”) is a W3C-recommended XML application for writing “TV-like” multimedia presentations for the Web. SMIL documents don’t describe the actual multimedia content (that is, the video and sound that are played) but rather when and where they are played.

    For instance, a typical SMIL document for a film festival might say that the browser should simultaneously play the sound file beethoven9.mid, show the video file corange.mov, and display the HTML file clockwork.htm. Then, when it’s done, it should play the video file 2001.mov, the audio file zarathustra.mid, and display the HTML file aclarke.htm. This eliminates the need to embed low bandwidth data like text in high bandwidth data like video just to combine them. Listing 2-3 is a simple SMIL file that does exactly this.

    Listing 2-3: A SMIL film festival









    Furthermore, as well as specifying the time sequencing of data, a SMIL document can position individual graphics elements on the display and attach links to media objects. For instance, at the same time the movie and sound are playing, the text of the respective novels could be subtitling the presentation.
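The "play these together, then play those together" structure of the film festival can be sketched by building the document tree in Python. This is not a reproduction of Listing 2-3. It assumes the SMIL 1.0 element names smil, body, seq (children play one after another), par (children play at the same time), audio, video, and text, and it reuses the file names from the description above.

```python
import xml.etree.ElementTree as ET


def build_festival():
    """Build a SMIL-style tree: two programs in sequence, each played in parallel."""
    smil = ET.Element("smil")
    body = ET.SubElement(smil, "body")
    seq = ET.SubElement(body, "seq")      # children play one after another
    programs = [
        ("beethoven9.mid", "corange.mov", "clockwork.htm"),
        ("zarathustra.mid", "2001.mov", "aclarke.htm"),
    ]
    for audio, video, text in programs:
        par = ET.SubElement(seq, "par")   # children play at the same time
        ET.SubElement(par, "audio", src=audio)
        ET.SubElement(par, "video", src=video)
        ET.SubElement(par, "text", src=text)
    return smil


if __name__ == "__main__":
    print(ET.tostring(build_festival(), encoding="unicode"))
```

Note that the tree only names the media files and says when they play; none of the sound, video, or text content is embedded in it.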

    HTML+TIME

    SMIL operates independently of the Web page. The streaming media pushed through SMIL has its own pane in the browser frame, but it doesn’t really have any interaction with the content in the HTML on the rest of the page. For instance, SMIL only lets you time SMIL elements like audio, video, and text. It doesn’t let you add timing information to basic HTML tags like <P>, <LI>, or <IMG>. And SMIL duplicates some aspects of HTML, such as how elements are positioned on the page.

    Microsoft, along with Macromedia and Compaq, has proposed a semi-competing XML application called Timed Interactive Multimedia Extensions for HTML (or HTML+TIME for short). HTML+TIME builds on SMIL to support timing for traditional HTML elements and features much closer integration with the HTML on the Web page. For example, HTML+TIME lets you write a countdown Web page like Listing 2-4 that adds to the page as time progresses.

    Listing 2-4: A countdown Web page using HTML+TIME

    Countdown

    10

    9

    8

    7

    6

    5

    4

    3

    2

    1

    Blast Off!



    This is useful for slide shows, timed quizzes, and the like. In HTML+TIME, the film festival example of Listing 2-3 looks like the following:








    It’s close to, though not quite exactly the same as, the SMIL version. The major difference is that the SMIL version is intended to be stored in separate files and rendered by special players like RealPlayer, whereas the HTML+TIME version is supposed to be included in the Web page and rendered by the browser. Another key difference is that there are several products that can play SMIL files now, including RealPlayer G2, whereas HTML+TIME-enabled Web browsers do not exist at the moment. However, it’s likely that future versions of Internet Explorer will include HTML+TIME support.

    There are some nice features and some good ideas in HTML+TIME. However, the W3C had already given its blessing to SMIL several months before Microsoft proposed HTML+TIME, and SMIL has a lot more momentum and support in the third-party, content creator community. Thus it seems we’re in for yet another knockdown, drag-out, Microsoft-vs.-everybody-else-in-the-known-universe battle which will only leave third-party developers bruised and confused. One can only hope that the W3C has the will and energy to referee this fight fairly. Web development really would be a lot simpler if Microsoft didn’t pick up its toys and go home every time it doesn’t get its way.

    Open Software Description

    The Open Software Description (OSD) format is an XML application co-developed by Marimba and Microsoft for updating software automatically. OSD defines XML tags that describe software components. The description of a component includes the version of the component, its underlying structure, and its relationships to and dependencies on other components. This provides enough information for OSD to decide whether a user needs a particular update or not. If they do need the update, it can be automatically pushed to users, rather than requiring them to manually download and install it. Listing 2-5 is an example of an OSD file for an update to WhizzyWriter 1000:

    Listing 2-5: An OSD file for an update to WhizzyWriter 1000

    WhizzyWriter 1000 Update Channel

    WhizzyWriter 1000

    Abstract: WhizzyWriter 1000: now with tint control!






    Only information about the update is kept in the OSD file. The actual update files are stored in a separate CAB archive or executable and downloaded when needed.

    There is considerable controversy about whether or not this is actually a good thing. Many software companies, Microsoft not least among them, have a long history of releasing updates that cause more problems than they fix. Many users prefer to stay away from new software for a while until other, more adventurous souls have given it a shakedown.
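The decision about whether a user needs a particular update comes down to comparing the version recorded in the description with the version already installed. The following Python sketch shows one way such a check could work. It is not taken from any OSD client: the function names are hypothetical, and it assumes version strings written as comma-separated numbers such as "1,0,0,0".

```python
def parse_version(text):
    """Turn a version string such as '1,2,0,0' into a tuple of integers."""
    return tuple(int(field) for field in text.split(","))


def needs_update(installed, available):
    """Return True when the available version is strictly newer.

    Both arguments are assumed to be comma-separated numeric strings.
    Shorter versions are padded with zeros so '1,0' equals '1,0,0,0'.
    """
    old = parse_version(installed)
    new = parse_version(available)
    width = max(len(old), len(new))
    old += (0,) * (width - len(old))
    new += (0,) * (width - len(new))
    return new > old


if __name__ == "__main__":
    print(needs_update("1,0,0,0", "1,1,0,0"))
```

Comparing tuples of integers rather than raw strings matters: as text, "1,10,0,0" sorts before "1,9,0,0", which would make a newer release look older.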

    Scalable Vector Graphics

    Vector graphics are superior to the bitmap GIF and JPEG images currently used on the Web for many pictures including flow charts, cartoons, advertisements, and similar images. However, many traditional vector graphics formats like PDF, PostScript, and EPS were designed with ink on paper in mind rather than electrons on a screen. (This is one reason PDF on the Web is such an inferior replacement for HTML, despite PDF’s much larger collection of graphics primitives.) A vector graphics format for the Web should support a lot of features that don’t make sense on paper like transparency, anti-aliasing, additive color, hypertext, animation, and hooks to enable search engines and audio renderers to extract text from graphics. None of these features are needed for the ink-on-paper world of PostScript and PDF.

    Several vendors have made a variety of proposals to the W3C for XML applications for vector graphics. These include:

    ✦ The Precision Graphics Markup Language (PGML) from IBM, Adobe, Netscape, and Sun

    ✦ The Vector Markup Language (VML) from Microsoft, Macromedia, Autodesk, Hewlett-Packard, and Visio

    ✦ Schematic Graphics on the World Wide Web from the Central Laboratory of the Research Councils

    ✦ DrawML from Excosoft AB

    ✦ Hyper Graphics Markup Language (HGML) from PRP and Orange PCSL

    Each of these reflects the interests and experience of its authors. For example, not surprisingly given Adobe’s participation, PGML has the flavor of PostScript but with XML element-attribute syntax rather than PostScript’s reverse Polish notation. Listing 2-6 demonstrates the embedding of a pink triangle in PGML.


    Listing 2-6: A pink triangle in PGML









    The W3C has formed a working group with representatives from the above vendors to decide on a single, unified, scalable vector graphics specification called SVG. SVG is an XML application for describing two-dimensional graphics. It defines three basic types of graphics: shapes, images, and text. A shape is defined by its outline, also known as its path, and may have various strokes or fills. An image is a bitmapped file like a GIF or a JPEG. Text is defined as a string of text in a particular font, and may be attached to a path, so it’s not restricted to horizontal lines of text like the ones that appear on this page. All three kinds of graphics can be positioned on the page at a particular location, rotated, scaled, skewed, and otherwise manipulated.

    Since SVG is a text format, it’s easy for programs to generate automatically; and it’s easy for programs to manipulate. In particular you can combine it with DHTML and ECMAScript to make the pictures on a Web page animated and responsive to user action.

    Since SVG describes graphics rather than text, unlike most of the other XML applications discussed in this chapter, it will probably need special display software. All of the proposed style-sheet languages assume they’re displaying fundamentally text-based data, and none of them can support the heavy graphics requirements of an application like SVG. It’s possible SVG support may be added to future browsers, especially since Mozilla is open source code; and it would be even easier for a plug-in to be written. However, for the time being, the prime benefit of SVG is that it is likely to be used as an exchange format between different programs like Adobe Illustrator and CorelDraw, which use different native binary formats.

    SVG is not fully fleshed out at the time of this writing, and there are exactly zero implementations of it. The first working draft of SVG was released by the World Wide Web Consortium in February of 1999. Compared to other working drafts, however, it is woefully incomplete. It’s really not much more than an outline of graphics elements that need to be included, without any details about how exactly those elements will be encoded in XML. I wouldn’t be surprised if this draft got pushed out the door a little early to head off the adoption of competing efforts like VML.
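To illustrate why a text-based graphics format is easy for programs to generate, here is a small Python sketch that writes out a pink triangle. An important caveat: the draft described above did not yet define an XML encoding, so this sketch assumes the svg and polygon elements and the namespace URI from the SVG specification as it was later published. Treat those names as assumptions relative to this chapter.

```python
import xml.etree.ElementTree as ET

# Namespace URI and element names are assumed from the later SVG
# specification, not from the February 1999 working draft.
SVG_NAMESPACE = "http://www.w3.org/2000/svg"


def pink_triangle(size=100):
    """Return an SVG document, as a string, containing one pink triangle."""
    svg = ET.Element("svg", xmlns=SVG_NAMESPACE,
                     width=str(size), height=str(size))
    # Three corners: top middle, bottom right, bottom left.
    points = f"{size // 2},0 {size},{size} 0,{size}"
    ET.SubElement(svg, "polygon", points=points, fill="pink")
    return ET.tostring(svg, encoding="unicode")


if __name__ == "__main__":
    print(pink_triangle())
```

Because the picture is just a tree of elements and attributes, changing the size or color is a one-line edit to the generating program rather than a trip through a drawing tool.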

    Vector Markup Language

    Microsoft has developed their own XML application for vector graphics called the Vector Markup Language (VML). VML is more finished than SVG, and is already supported by Internet Explorer 5.0 and Microsoft Office 2000. Listing 2-7 is an HTML file with embedded VML that draws the pink triangle. Figure 2-4 shows this file displayed in Internet Explorer 5.0. However, VML is not nearly as ambitious a format as SVG, and leaves out a lot of advanced features SVG includes such as clipping, masking, and compositing.

    Listing 2-7: The pink triangle in VML

    A Pink Triangle, Listing 2-7 from the XML Bible









    There’s really no reason for there to be two separate, mutually incompatible vector graphics standards for the Web, and Microsoft will probably grudgingly support SVG in the end. However, VML is available today, even if its use is limited to Microsoft products, whereas SVG is only an incomplete draft specification. Web artists would prefer to have a single standard, but having two is not unheard of (think GIF and JPEG). As long as the formats are documented and non-proprietary, it’s not out of the question for Web browsers to support both. At the least, the underlying XML makes it a lot easier for programmers to write converters that translate files from one format to the other.

    Figure 2-4: The pink triangle created with VML

    CrossReference

    VML is discussed in more detail in Chapter 22, The Vector Markup Language.

    MusicML

    The Connection Factory has created an XML application for sheet music called MusicML. MusicML includes notes, beats, clefs, staffs, rows, rhythms, rests, beams, chords, and more. Listing 2-8 shows the first bar from Beth Anderson’s Flute Swale in MusicML.

    Listing 2-8: The first bar of Beth Anderson’s Flute Swale




















    The Connection Factory has also written a Java applet that can parse and display MusicML. Figure 2-5 shows the above example rendered by this applet. The applet has a few bugs (for instance, the last note is missing) but overall it’s a surprisingly good rendition.

    Figure 2-5: The first bar of Beth Anderson’s Flute Swale in MusicML

    MusicML isn’t going to replace Finale or Nightingale anytime soon. And it really seems like more of a proof of concept than a polished product. MusicML has a lot of discrepancies that will drive musicians nuts (e.g., rhythm is misspelled, treble and bass clefs are reversed, segments should really be measures, and so forth).


    Nonetheless, something like this is a reasonable output format for music notation programs that enables sheet music to be displayed on the Web. Furthermore, if the various notation programs all support MusicML or something like it, then it can be used as an interchange format to move data from one program to the other, something composers desperately need to be able to do now.

    VoxML

    Motorola’s VoxML (http://www.voxml.com/) is an XML application for the spoken word. In particular, it’s intended for those annoying voice mail and automated phone response systems (“If your hair turned green after using our product, please press one. If your hair turned purple after using our product, please press two. If you found an unidentifiable insect in the product, please press 3. Otherwise, please stay on the line until your hair grows back to its natural color.”).

    VoxML enables the same data that’s used on a Web site to be served up via telephone. It’s particularly useful for information that’s created by combining small nuggets of data, such as stock prices, sports scores, weather reports, and test results. The Weather Channel and CBS MarketWatch.com are considering using VoxML to provide more information over regular voice phones.

    A small VoxML file for a shampoo company’s automated phone response system might look something like the code in Listing 2-9.

    Listing 2-9: A VoxML file

    Welcome to TIC consumer products division. For shampoo information, say shampoo now.

    Welcome to Wonder Shampoo

    Which color did Wonder Shampoo turn your hair?

    green purple bald exit


    If Wonder Shampoo turned your hair green and you wish to return it to its natural color, simply shampoo seven times with three parts soap, seven parts water, four parts kerosene, and two parts iguana bile.



    If Wonder Shampoo turned your hair purple and you wish to return it to its natural color, please walk widdershins around your local cemetery three times while chanting “Surrender Dorothy”.



    If you went bald as a result of using Wonder Shampoo, please purchase and apply a three months supply of our Magic Hair Growth Formula(TM). Please do not consult an attorney as doing so would violate the license agreement printed on inside fold of the Wonder Shampoo box in 3 point type which you agreed to by opening the package.



    Thank you for visiting TIC Corp. Goodbye.



    I can’t show you a screen shot of this example, because it’s not intended to be shown in a Web browser. Instead, you would listen to it on a telephone.


    Open Financial Exchange

    Software cannot be changed willy-nilly. The data that software knows how to read has inertia. The more data you have in a given program’s proprietary, undocumented format, the harder it is to change programs. For example, my personal finances for the last five years are stored in Quicken. How likely is it that I will change to Microsoft Money even if Money has features I need that Quicken doesn’t have? Unless Money can read and convert Quicken files with zero loss of data, the answer is “NOT BLOODY LIKELY!”

    The problem can even occur within a single company or a single company’s products. Microsoft Word 97 for Windows can’t read documents created by some earlier versions of Word. And earlier versions of Word can’t read Word 97 files at all. And Microsoft Word 98 for the Mac can’t quite read everything that’s in a Word 97 for Windows file, even though Word 98 for the Mac came out a year later!

    As noted in Chapter 1, the Open Financial Exchange Format (OFX) is an XML application used to describe financial data of the type you’re likely to store in a personal finance product like Money or Quicken. Any program that understands OFX can read OFX data. And since OFX is fully documented and non-proprietary (unlike the binary formats of Money, Quicken, and other programs) it’s easy for programmers to write the code to understand OFX.

    OFX not only allows Money and Quicken to exchange data with each other. It allows other programs that use the same format to exchange the data as well. For instance, if a bank wants to deliver statements to customers electronically, it only has to write one program to encode the statements in the OFX format rather than several programs to encode the statement in Quicken’s format, Money’s format, Managing Your Money’s format, and so forth.

    The more programs that use a given format, the greater the savings in development cost and effort. For example, six programs reading and writing their own and each other’s proprietary format require 36 different converters. Six programs reading and writing the same OFX format require only six converters. Effort is reduced to O(n) rather than O(n²). Figure 2-6 depicts six programs reading and writing their own and each other’s proprietary format. Figure 2-7 depicts six programs reading and writing the same OFX format. Every arrow represents a converter that has to trade files and data between programs. In Figure 2-6, you can see the connections for six different programs reading and writing each other’s proprietary binary format. In Figure 2-7, you can see the same six different programs reading and writing one open XML format. The XML-based exchange is much simpler and cleaner than the binary-format exchange.
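The arithmetic behind the 36-versus-six comparison can be checked with a short Python sketch. The counting convention is an assumption made to match the figures in the text: a "converter" here is one program's reader/writer for one format, with each program's own format included. The program names are the six shown in the accompanying figures.

```python
from itertools import product

PROGRAMS = [
    "Quicken",
    "Money",
    "CheckFree",
    "Mutual Fund Program",
    "Managing Your Money",
    "Proprietary Bank System",
]


def proprietary_converters(programs):
    """Every program handles every program's format, its own included."""
    return [(reader, f"{owner} format")
            for reader, owner in product(programs, repeat=2)]


def shared_converters(programs, shared_format="OFX"):
    """Every program handles just the one shared format."""
    return [(program, shared_format) for program in programs]


if __name__ == "__main__":
    print(len(proprietary_converters(PROGRAMS)))  # grows as n * n
    print(len(shared_converters(PROGRAMS)))       # grows as n
```

Adding a seventh program to the list raises the first count by 13 but the second by only one, which is the practical meaning of O(n) versus O(n²).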

    Figure 2-6: Six different programs reading and writing their own and each other’s formats (programs shown: Quicken, Money, CheckFree, Mutual Fund Program, Managing Your Money, Proprietary Bank System)

    35

    36

Part I ✦ Introducing XML

Figure 2-7: Six programs reading and writing the same OFX format (Quicken, Money, CheckFree, a mutual fund program, Managing Your Money, and a proprietary bank system, all connected through the OFX format)

Extensible Forms Description Language

I went down to my local bookstore today and bought a copy of Armistead Maupin’s novel Sure of You. I paid for that purchase with a credit card, and when I did so I signed a piece of paper agreeing to pay the credit card company $14.07 when billed. Eventually they will send me a bill for that purchase, and I’ll pay it. If I refuse to pay it, then the credit card company can take me to court to collect, and they can use my signature on that piece of paper to prove to the court that on October 15, 1998 I really did agree to pay them $14.07.

The same day I also ordered Anne Rice’s The Vampire Armand from the online bookstore amazon.com. Amazon charged me $16.17 plus $3.95 shipping and handling, and again I paid for that purchase with a credit card. But the difference is that Amazon never got a signature on a piece of paper from me. Eventually the credit card company will send me a bill for that purchase, and I’ll pay it. But if I did refuse to pay the bill, they don’t have a piece of paper with my signature on it showing that I agreed to pay $20.12 on October 15, 1998. If I claim that I never made the purchase, the credit card company will bill the charges back to Amazon. Before Amazon or any other online or phone-order merchant is allowed to accept credit card purchases without a signature in ink on paper, they have to agree that they will take responsibility for all disputed transactions.

Exact numbers are hard to come by, and of course vary from merchant to merchant, but probably a little under 10% of Internet transactions get billed back to the originating merchant because of credit card fraud or disputes. This is a huge amount! Consumer businesses like Amazon simply accept this as a cost of doing business on the Net and work it into their price structure, but obviously this isn’t going to work for six-figure business-to-business transactions. Nobody wants to send out $200,000 of masonry supplies only to have the purchaser claim they never made or received the order. Before business-to-business transactions can move onto the Internet, a method needs to be developed that can verify that an order was in fact made by a particular person and that this person is who he or she claims to be. Furthermore, this has to be enforceable in court. (It’s a sad fact of American business that many companies won’t do business with anyone they can’t sue.)

Part of the solution to the problem is digital signatures, the electronic equivalent of ink on paper.
To digitally sign a document, you calculate a hash code for the document using a known algorithm, encrypt the hash code with your private key, and attach the encrypted hash code to the document. Correspondents can decrypt the hash code using your public key and verify that it matches the document. However, they can’t sign documents on your behalf because they don’t have your private key. The exact protocol followed is a little more complex in practice, but the bottom line is that your private key is merged with the data you’re signing in a verifiable fashion. No one who doesn’t know your private key can sign the document.

The scheme isn’t foolproof (it’s vulnerable to your private key being stolen, for example), but it’s probably as hard to forge a digital signature as it is to forge a real ink-on-paper signature. However, there are also a number of less obvious attacks on digital signature protocols. One of the most important is changing the data that’s signed. Changing the data that’s signed should invalidate the signature, but it doesn’t if the changed data wasn’t included in the first place. For example, when you submit an HTML form, the only things sent are the values that you fill into the form’s fields and the names of the fields. The rest of the HTML markup is not included. You may agree to pay $1500 for a new 450 MHz Pentium II PC running Windows NT, but the only thing sent on the form is the $1500. Signing this number signifies what you’re paying, but not what you’re paying for. The merchant can then send you two gross of flushometers and claim that’s what you bought for your $1500. Obviously, if digital signatures are to be useful, all details of the transaction must be included. Nothing can be omitted.
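The hash-then-encrypt idea described above can be sketched with textbook RSA in Python. This is a toy under loud assumptions: the key is built from two well-known Mersenne primes, there is no padding, and the function names are invented for the example, so it is nowhere near secure and is not how any real signing product works. It only shows why a signature over the full order text fails to verify once the description of the goods changes.

```python
import hashlib

# Toy key for illustration only: small, unpadded, and NOT secure.
P = 2**61 - 1                       # a Mersenne prime
Q = 2**89 - 1                       # another Mersenne prime
N = P * Q                           # public modulus
E = 65537                           # public exponent
D = pow(E, -1, (P - 1) * (Q - 1))   # private exponent (needs Python 3.8+)


def hash_code(document: bytes) -> int:
    """Hash the document and reduce the digest so it fits under the modulus."""
    return int.from_bytes(hashlib.sha256(document).digest(), "big") % N


def sign(document: bytes) -> int:
    """Encrypt the hash code with the private key."""
    return pow(hash_code(document), D, N)


def verify(document: bytes, signature: int) -> bool:
    """Decrypt the signature with the public key and compare hash codes."""
    return pow(signature, E, N) == hash_code(document)


order = b"Pay $1500 for one 450 MHz Pentium II PC running Windows NT"
tampered = b"Pay $1500 for two gross of flushometers"
signature = sign(order)

print(verify(order, signature))     # the signed order checks out
print(verify(tampered, signature))  # the altered order does not
```

Had only the bytes `b"$1500"` been signed, both orders would carry the same valid signature, which is exactly the attack described above.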


The problem gets worse if you have to deal with the U.S. federal government. Government regulations for purchase orders and requisitions often spell out the contents of forms in minute detail, right down to the font face and type size. Failure to adhere to the exact specifications can lead to your invoice for $20,000,000 worth of depleted uranium artillery shells being rejected. Therefore, you need to establish not only exactly what was agreed to but also that you met all legal requirements for the form. HTML’s forms just aren’t sophisticated enough to handle these needs.

XML, however, can. It is almost always possible to use XML to develop a markup language with the right combination of power and rigor to meet your needs, and this example is no exception. In particular, UWI.COM has proposed an XML application called the Extensible Forms Description Language (XFDL) for forms with extremely tight legal requirements that are to be signed with digital signatures. XFDL further offers the option to do simple mathematics in the form, for instance to automatically fill in the sales tax and shipping and handling charges and total up the price.

UWI.COM has submitted XFDL to the W3C, but it’s really overkill for Web browsers, and thus probably won’t be adopted there. The real benefit of XFDL, if it becomes widely adopted, is in business-to-business and business-to-government transactions. XFDL can become a key part of electronic commerce, which is not to say that it will. It’s still early, and there are other players in this space.

Human Resources Markup Language

HireScape’s Human Resources Markup Language (HRML) is an XML application that provides a simple vocabulary for describing job openings. It defines elements matching the parts of a typical classified want ad such as companies, divisions, recruiters, contact information, terms, experience, and more. A job listing in HRML might look something like the code in Listing 2-10.

Listing 2-10: A Job Listing in HRML

    IDG Books

    http://www.idgbooks.com/

    http://www.idgbooks.com/cgibin/gatekeeper.pl?uidg4841:%2Fcompany%2Fjobs%2Findex.html




    09/10/1998

    http://www.idgbooks.com/cgibin/gatekeeper.pl?uidg4841:%2Fcompany%2Fjobs%2Findex.html

    Web Development Manager 1 3

    This position is responsible for the technical and production functions of the Online group as well as strategizing and implementing technology to improve the IDG Books web sites. Skills must include Perl, C/C++, HTML, SQL, JavaScript, Windows NT 4, mod-perl, CGI, TCP/IP, Netscape servers and Apache server. You must also have excellent communication skills, project management, the ability to communicate technical solutions to non-technical people and management experience.

    Perl, C/C++, HTML, SQL, JavaScript, Windows NT 4, mod-perl, CGI, TCP/IP, Netscape server, Apache server

    $60,000



    [email protected]

Dee Harris, HR Manager 919 E. Hillsdale Blvd. Suite 400 Foster City CA 94404



Although you could certainly define a style sheet for HRML, and use it to place job listings on Web pages, that’s not its main purpose. Instead HRML is designed to automate the exchange of job information between companies, applicants, recruiters, job boards, and other interested parties. There are hundreds of job boards on the Internet today as well as numerous Usenet newsgroups and mailing lists. It’s impossible for one individual to search them all, and it’s hard for a computer to search them all because they all use different formats for salaries, locations, benefits, and the like.

But if many sites adopt HRML, then it becomes relatively easy for a job seeker to search with criteria like “all the jobs for Java programmers in New York City paying more than $100,000 a year with full health benefits.” The IRS could enter a search for all full-time, on-site, freelance openings so they’d know which companies to go after for failure to withhold tax and pay unemployment insurance. In practice, these searches would likely be mediated through an HTML form just like current Web searches. The main difference is that such a search would return far more useful results because it can use the structure in the data and the semantics of the markup rather than relying on imprecise English text.
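A structured search of the kind described above might be sketched as follows in Python. The element names (JOBS, JOB, TITLE, CITY, SALARY) and the sample listings are invented for this illustration and are not taken from the HRML vocabulary; the point is only that criteria can be tested against marked-up fields instead of free text.

```python
import xml.etree.ElementTree as ET

# Hypothetical markup and made-up sample data; these are NOT real HRML elements.
LISTINGS = """
<JOBS>
  <JOB><TITLE>Java Programmer</TITLE><CITY>New York</CITY><SALARY>120000</SALARY></JOB>
  <JOB><TITLE>Java Programmer</TITLE><CITY>Boston</CITY><SALARY>110000</SALARY></JOB>
  <JOB><TITLE>Web Development Manager</TITLE><CITY>Foster City</CITY><SALARY>60000</SALARY></JOB>
  <JOB><TITLE>Java Programmer</TITLE><CITY>New York</CITY><SALARY>85000</SALARY></JOB>
</JOBS>
""".strip()


def find_jobs(xml_text, title, city, min_salary):
    """Return JOB elements matching the title and city and paying more than min_salary."""
    root = ET.fromstring(xml_text)
    matches = []
    for job in root.findall("JOB"):
        salary = int(job.findtext("SALARY"))
        if (job.findtext("TITLE") == title
                and job.findtext("CITY") == city
                and salary > min_salary):
            matches.append(job)
    return matches


results = find_jobs(LISTINGS, "Java Programmer", "New York", 100000)
for job in results:
    print(job.findtext("TITLE"), job.findtext("CITY"), job.findtext("SALARY"))
```

Because the salary is its own element holding a plain number, the comparison is a numeric test rather than a guess about how each site happens to write dollar amounts.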

Resource Description Framework

XML adds structure to documents. The Resource Description Framework (RDF) is an XML application that adds semantics. RDF can be used to specify anything from the author and abstract of a Web page to the version and dependencies of a software package to the director, screenwriter, and actors in a movie. What links all of these uses is that what’s being encoded in RDF is not the data itself (the Web page, the software, the movie) but information about the data. This data about data is called meta-data, and is RDF’s raison d’être.


An RDF vocabulary defines a set of elements and their permitted content that’s appropriate for meta-data in a given domain. RDF enables communities of interest to standardize their vocabularies and share those vocabularies with others who may extend them. For example, the Dublin Core is an RDF vocabulary specifically designed for meta-data about Web pages. Educom’s Instructional Metadata System (IMS) builds on the Dublin Core by adding elements that are useful when describing school-related content like learning level, educational objectives, and price.

Of course, although RDF can be used for print-publishing systems, video-store catalogs, automated software updates, and much more, it’s likely to be adopted first for embedding meta-data in Web pages. RDF has the potential to synchronize the current hodgepodge of tags used for site maps, content rating, automated indexing, and digital libraries into a unified collection that all of these tools understand. Once RDF meta-data becomes a standard part of Web pages, search engines will be able to return more focused, useful results. Intelligent agents can more easily traverse the Web to find information you want or conduct business for you. The Web can go from its current state as an unordered sea of information to a structured, searchable, understandable store of data.

As the name implies, RDF describes resources. A resource is anything that can be addressed with a URI. The description of a resource is composed of a number of properties. Each property has a type and a value. For example, a property describing the format of a Web page has the type “DC:Format” and the value “HTML”. Values may be text strings, numbers, dates, and so forth, or they may be other resources. These other resources can have their own descriptions in RDF. For example, the code in Listing 2-11 uses the Dublin Core vocabulary to describe the Cafe con Leche Web site.

    Listing 2-11: An RDF description of the Cafe con Leche home page using the Dublin Core vocabulary

    Elliotte Rusty Harold en HTML 1999-08-19 home page Cafe con Leche
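The resource, property type, and property value model described above can be sketched as a small Python data structure. This is not RDF syntax and not any RDF library’s API. The class, the example.com URIs, and every property type other than “DC:Format” are assumptions made for illustration, with the values borrowed from Listing 2-11.

```python
from dataclasses import dataclass, field


@dataclass
class Resource:
    """Anything addressable by a URI, described by (type, value) properties."""
    uri: str
    properties: list = field(default_factory=list)

    def add(self, prop_type, value):
        """Attach one property; the value may be text or another Resource."""
        self.properties.append((prop_type, value))
        return self

    def values(self, prop_type):
        """Return every value recorded for the given property type."""
        return [value for kind, value in self.properties if kind == prop_type]


# Hypothetical URIs; only "DC:Format" is a property type named in the text.
author = Resource("http://www.example.com/people/erh")
author.add("DC:Title", "Elliotte Rusty Harold")

page = Resource("http://www.example.com/cafe-con-leche/")
page.add("DC:Creator", author)      # a value that is itself a resource
page.add("DC:Format", "HTML")       # a value that is a plain text string
page.add("DC:Date", "1999-08-19")
page.add("DC:Title", "Cafe con Leche")

print(page.values("DC:Format"))
print(page.values("DC:Creator")[0].values("DC:Title"))
```

The creator property shows the one non-obvious point in the model: a value can be another resource that carries a description of its own.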


RDF will be used for version 2.0 of the Platform for Internet Content Selection (PICS) and the Platform for Privacy Preferences (P3P) as well as for many other areas where meta-data is needed to describe Web pages and other kinds of content.

XML for XML

XML is an extremely general-purpose format for text data. Some of the things it is used for are further refinements of XML itself. These include the XSL style-sheet language, the XLL linking language, and the Document Content Description for XML.

XSL

XSL, the Extensible Style Language, is itself an XML application. XSL has two major parts. The first part defines a vocabulary for transforming XML documents. This part of XSL includes XML tags for trees, nodes, patterns, templates, and other elements needed for matching and transforming XML documents from one markup vocabulary to another (or even to the same one in a different order). The second part of XSL defines an XML vocabulary for formatting the transformed XML document produced by the first part. This includes XML tags for formatting objects including pagination, blocks, characters, lists, graphics, boxes, fonts, and more. A typical XSL style sheet is shown in Listing 2-12:

    Listing 2-12: An XSL style sheet










We’ll explore XSL in great detail in Chapters 14 and 15.

XLL

The Extensible Linking Language, XLL, defines a new, more general kind of link called an XLink. XLinks accomplish everything possible with HTML’s URL-based hyperlinks and anchors. However, any element can become a link, not just A elements. For instance, a footnote element can link directly to the text of the note like this:

7

Furthermore, XLinks can do a lot of things HTML links can’t. XLinks can be bidirectional so readers can return to the page they came from. XLinks can link to arbitrary positions in a document. XLinks can embed text or graphic data inside a document rather than requiring the user to activate the link (much like HTML’s <IMG> tag but more flexible). In short, XLinks make hypertext even more powerful.

Cross-Reference: XLinks are discussed in more detail in Chapter 16, XLinks.

DCD

XML’s facilities for declaring how the contents of an XML element should be formatted are weak to nonexistent. For example, suppose as part of a date, you set up MONTH elements like this:

<MONTH>9</MONTH>

All you can say is that the contents of the MONTH element should be character data. You cannot say that the month should be given as an integer between 1 and 12. A number of schemes have been proposed to use XML itself to more tightly restrict what can appear in the contents of any given element. One such proposal is the Document Content Description (DCD). For example, here’s a DCD that declares that MONTH elements may only contain an integer between 1 and 12:



There are more examples I could show you of XML used for XML, but the ones I’ve already discussed demonstrate the basic point: XML is powerful enough to describe and extend itself. Among other things, this means that the XML specification can remain small and simple. There may well never be an XML 2.0 because any major additions that are needed can be built out of raw XML rather than becoming new features of XML. People and programs that need these enhanced features can use them. Others who don’t need them can ignore them. You don’t need to know about what you don’t use. XML provides the bricks and mortar from which you can build simple huts or towering castles.
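To make the MONTH constraint from the DCD discussion concrete, here is a minimal Python sketch of the check such a schema would express. It does not read or implement DCD itself; it simply applies the rule “an integer between 1 and 12” to a parsed element. The function name is an assumption for the example.

```python
import re
import xml.etree.ElementTree as ET


def month_is_valid(xml_text: str) -> bool:
    """Check that a MONTH element holds an integer between 1 and 12."""
    element = ET.fromstring(xml_text)
    if element.tag != "MONTH":
        return False
    text = (element.text or "").strip()
    # Plain XML can only promise character data; the digits test is the extra rule.
    if not re.fullmatch(r"[0-9]+", text):
        return False
    return 1 <= int(text) <= 12


print(month_is_valid("<MONTH>9</MONTH>"))
print(month_is_valid("<MONTH>13</MONTH>"))
print(month_is_valid("<MONTH>September</MONTH>"))
```

All three inputs are equally acceptable to a plain XML parser, which is the gap that content-description proposals such as DCD aim to close.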

Behind-the-Scene Uses of XML

Not all XML applications are public, open standards. A lot of software vendors are moving to XML for their own data simply because it’s a well-understood, general-purpose format for structured data that can be manipulated with easily available cheap and free tools.

Microsoft Office 2000 promotes HTML to a coequal file format with its native binary formats. However, HTML 4.0 doesn’t provide support for all of the features Office requires, such as revision tracking, footnotes, comments, index and glossary entries, and more. Additional data that can’t be written as HTML is embedded in the file in small chunks of XML. Word’s vector graphics will be stored in VML. In this case, embedded XML’s invisibility in standard browsers is the crucial factor.

Federal Express uses detailed tracking information as a competitive advantage over other shippers like UPS and the Post Office. First that information was available through custom software, then through the Web. More recently FedEx has begun beta testing an API/library that third-party and in-house developers can use to integrate their software and systems with FedEx’s. The data format used for this service is XML.

Netscape Navigator 5.0 supports direct display of XML in the Web browser, but Netscape actually started using XML internally as early as version 4.5. When you ask Netscape to show you a list of sites related to the current one you’re looking at, your browser connects to a CGI program running on a Netscape server. The data that server sends back is XML. Listing 2-13 shows the XML data for sites related to http://metalab.unc.edu/.

Listing 2-13: XML data for sites related to http://metalab.unc.edu/























This all happens completely behind the scenes. The users never know that the data is being transferred in XML. The actual display is a menu in Netscape Navigator, not an XML or HTML page.

This really just scratches the surface of the use of XML for internal data. Many other projects that use XML are just getting started, and many more will be started over the next year. Most of these won’t receive any publicity or write-ups in the trade press, but they nonetheless have the potential to save their companies thousands of dollars in development costs over the life of the project. The self-documenting nature of XML can be as useful for a company’s internal data as for its external data. For instance, many companies right now are scrambling to figure out whether programmers who retired 20 years ago used two-digit dates. If that were your job, would you rather be poring over data that looked like this:

3c 79 65 61 72 3e 39 39 3c 2f 79 65 61 72 3e

or like this:

<year>99</year>

Unfortunately many programmers are now stuck trying to clean up data in the first format. XML even makes the mistakes easier to find and fix.
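As a small aside, the hexadecimal dump above is ordinary ASCII, and a few lines of Python (chosen here purely for illustration) show what those bytes spell out.

```python
HEX_DUMP = "3c 79 65 61 72 3e 39 39 3c 2f 79 65 61 72 3e"

# bytes.fromhex skips the spaces between the two-digit byte values.
decoded = bytes.fromhex(HEX_DUMP).decode("ascii")
print(decoded)
```

The decoded text is a two-digit year wrapped in a year element, which is far easier to spot in marked-up form than in a byte dump.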

Summary

This chapter has just begun to touch on the many and varied applications to which XML has been and will be put. Some of these applications, like CML, MathML, and MusicML, are clear extensions to HTML for Web browsers. But many others, like OFX, XFDL, and HRML, go in completely new directions. And all of these applications have their own semantics and syntax that sit on top of the underlying XML. In some cases, the XML roots are obvious. In other cases, you could easily spend months working with an application and only hear of XML tangentially. In this chapter, you explored the following applications to which XML has been put to use:

✦ Molecular sciences with CML
✦ Science and math with MathML
✦ Webcasting with CDF
✦ Classic literature
✦ Multimedia with SMIL and HTML+TIME
✦ Software updates through OSD
✦ Vector graphics with both PGML and VML
✦ Music notation in MusicML
✦ Automated voice responses with VoxML
✦ Financial data with OFX
✦ Legally binding forms with XFDL
✦ Human resources job information with HRML
✦ Meta-data through RDF
✦ XML itself, including XSL, XLL, and DCD, to refine XML
✦ Internal use of XML by various companies, including Microsoft, Federal Express, and Netscape

In the next chapter, you will begin writing your own XML documents and displaying them in Web browsers.








    3

    C H A P T E R

Your First XML Document









In This Chapter

Creating a simple XML document
Exploring the simple XML document
Assigning meaning to XML tags
Writing style sheets for XML documents
Attaching style sheets to XML documents

This chapter teaches you how to create simple XML documents with tags you define that make sense for your document. You’ll learn how to write a style sheet for the document that describes how the content of those tags should be displayed. Finally, you’ll learn how to load the documents into a Web browser so that they can be viewed.

Since this chapter will teach you by example, and not from first principles, it will not cross all the t’s and dot all the i’s. Experienced readers may notice a few exceptions and special cases that aren’t discussed here. Don’t worry about these; you’ll get to them over the course of the next several chapters. For the most part, you don’t need to worry about the technical rules right up front. As with HTML, you can learn and do a lot by copying simple examples that others have prepared and modifying them to fit your needs.

Toward that end I encourage you to follow along by typing in the examples I give in this chapter and loading them into the different programs discussed. This will give you a basic feel for XML that will make the technical details in future chapters easier to grasp in the context of these specific examples.

Hello XML

This section follows an old programmer’s tradition of introducing a new language with a program that prints “Hello World” on the console. XML is a markup language, not a programming language; but the basic principle still applies. It’s easiest to get started if you begin with a complete, working example you can expand on rather than trying to start with more fundamental pieces that by themselves don’t do anything. And if you do encounter problems with the basic tools, those problems are a lot easier to debug and fix in the context of the short, simple documents used here rather than in the context of the more complex documents developed in the rest of the book.

In this section, you’ll learn how to create a simple XML document and save it in a file. We’ll then take a closer look at the code and what it means.

Creating a Simple XML Document

In this section, you will learn how to type an actual XML document. Let’s start with about the simplest XML document I can imagine. Here it is in Listing 3-1:

Listing 3-1: Hello XML

<?xml version="1.0" standalone="yes"?>
<FOO>
Hello XML!
</FOO>

That’s not very complicated, but it is a good XML document. To be more precise, it’s a well-formed XML document. (XML has special terms for documents that it considers “good” depending on exactly which set of rules they satisfy. “Well-formed” is one of those terms, but we’ll get to that later in the book.) This document can be typed in any convenient text editor like Notepad, BBEdit, or emacs.

Cross-Reference: Well-formedness is covered in Chapter 6, Well-Formed XML Documents.
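If you would like a quick way to confirm that a document you typed is well-formed, any XML parser will do. The sketch below uses Python’s standard xml.etree.ElementTree module, which is simply one convenient choice and not a tool this chapter relies on. The helper name is made up, and the sample strings assume the single FOO root element that the chapter discusses for Listing 3-1.

```python
import xml.etree.ElementTree as ET


def is_well_formed(xml_text: str) -> bool:
    """Return True if a parser accepts the text as well-formed XML."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        return False
    return True


good = '<?xml version="1.0" standalone="yes"?>\n<FOO>\nHello XML!\n</FOO>'
bad = '<?xml version="1.0" standalone="yes"?>\n<FOO>\nHello XML!\n</BAR>'

print(is_well_formed(good))
print(is_well_formed(bad))
```

The second string fails because its closing tag does not match its opening tag, one of the most common slips when typing examples by hand.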

Saving the XML File

Once you’ve typed the preceding code, save the document in a file called hello.xml, HelloWorld.xml, MyFirstDocument.xml, or some other name. The three-letter extension .xml is fairly standard. However, do make sure that you save it in plain text format, and not in the native format of some word processor like WordPerfect or Microsoft Word.

Note: If you’re using Notepad on Windows 95/98 to edit your files, when saving the document be sure to enclose the file name in double quotes, e.g. “Hello.xml”, not merely Hello.xml, as shown in Figure 3-1. Without the quotes, Notepad will append the .txt extension to your file name, naming it Hello.xml.txt, which is not what you want at all.

Chapter 3 ✦ Your First XML Document

Figure 3-1: A saved XML document in Notepad with the file name in quotes

The Windows NT version of Notepad gives you the option to save the file in Unicode. Surprisingly this will work too, though for now you should stick to basic ASCII. XML files may be encoded in either UTF-16 or UTF-8, two encodings of Unicode. UTF-8 is a strict superset of ASCII, so pure ASCII files are also valid XML files.

Cross-Reference: UTF-8 and ASCII are discussed in more detail in Chapter 7, Foreign Languages and non-Roman Text.

Loading the XML File into a Web Browser

Now that you’ve created your first XML document, you’re going to want to look at it. The file can be opened directly in a browser that supports XML such as Internet Explorer 5.0. Figure 3-2 shows the result.

What you see will vary from browser to browser. In this case it’s a nicely formatted and syntax-colored view of the document’s source code. However, whatever it is, it’s not likely to be particularly attractive. The problem is that the browser doesn’t really know what to do with the FOO element. You have to tell the browser what it’s expected to do with each element by using a style sheet. We’ll cover that shortly, but first let’s look a little more closely at your first XML document.


Figure 3-2: hello.xml in Internet Explorer 5.0

Exploring the Simple XML Document

Let’s examine the simple XML document in Listing 3-1 to better understand what each line of code means. The first line is the XML declaration:

<?xml version="1.0" standalone="yes"?>

This takes the form of an XML processing instruction. Processing instructions begin with <? and end with ?>. The first word after the <? is the name of the processing instruction, which is xml in this example.

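Listing 3-2: greeting.xml

<?xml version="1.0" standalone="yes"?>
<GREETING>
Hello XML!
</GREETING>

Listing 3-3: paragraph.xml

<?xml version="1.0" standalone="yes"?>
<P>
Hello XML!
</P>

Listing 3-4: document.xml

<?xml version="1.0" standalone="yes"?>
<DOCUMENT>
Hello XML!
</DOCUMENT>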

The four XML documents in Listings 3-1 through 3-4 have tags with different names. However, they are all equivalent, since they have the same structure and content.
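One way to see what “same structure and content” means is to compare the documents while ignoring their tag names. The Python sketch below does this for four one-element documents. The root names FOO, GREETING, P, and DOCUMENT are assumed from the discussion in this chapter, and the shape helper is invented for the example.

```python
import xml.etree.ElementTree as ET


def shape(element):
    """Describe an element by its text and its children's shapes, ignoring tag names."""
    text = (element.text or "").strip()
    return (text, tuple(shape(child) for child in element))


documents = [
    "<FOO>\nHello XML!\n</FOO>",
    "<GREETING>\nHello XML!\n</GREETING>",
    "<P>\nHello XML!\n</P>",
    "<DOCUMENT>\nHello XML!\n</DOCUMENT>",
]

shapes = {shape(ET.fromstring(text)) for text in documents}
print(shapes)
```

Because all four documents collapse to a single shape, the only thing that distinguishes them is the name chosen for the root element.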


Assigning Meaning to XML Tags

Markup tags can have three kinds of meaning: structure, semantics, and style. Structure divides documents into a tree of elements. Semantics relates the individual elements to the real world outside of the document itself. Style specifies how an element is displayed.

Structure merely expresses the form of the document, without regard for differences between individual tags and elements. For instance, the four XML documents shown in Listings 3-1 through 3-4 are structurally the same. They all specify documents with a single non-empty root element. The different names of the tags have no structural significance.

Semantic meaning exists outside the document, in the mind of the author or reader or in some computer program that generates or reads these files. For instance, a Web browser that understands HTML, but not XML, would assign the meaning “paragraph” to the tags <P> and </P> but not to the tags <GREETING> and </GREETING>, <FOO> and </FOO>, or <DOCUMENT> and </DOCUMENT>. An English-speaking human would be more likely to understand <GREETING> and </GREETING> or <DOCUMENT> and </DOCUMENT> than <FOO> and </FOO> or <P> and </P>. Meaning, like beauty, is in the mind of the beholder.

Computers, being relatively dumb machines, can’t really be said to understand the meaning of anything. They simply process bits and bytes according to predetermined formulas (albeit very quickly). A computer is just as happy to use <FOO> or <P> as it is to use the more meaningful <GREETING> or <DOCUMENT> tags. Even a Web browser can’t be said to really understand what a paragraph is. All the browser knows is that when a paragraph is encountered a blank line should be placed before the next element.

Naturally, it’s better to pick tags that more closely reflect the meaning of the information they contain. Many disciplines like math and chemistry are working on creating industry-standard tag sets. These should be used when appropriate. However, most tags are made up as you need them. Here are some other possible tags:


























The third kind of meaning that can be associated with a tag is style meaning. Style meaning specifies how the content of a tag is to be presented on a computer screen or other output device. Style meaning says whether a particular element is bold, italic, green, 24 points, or what have you. Computers are better at understanding style than semantic meaning. In XML, style meaning is applied through style sheets.

Writing a Style Sheet for an XML Document

XML allows you to create any tags you need. Of course, since you have almost complete freedom in creating tags, there’s no way for a generic browser to anticipate your tags and provide rules for displaying them. Therefore, you also need to write a style sheet for your XML document that tells browsers how to display particular tags. Like tag sets, style sheets can be shared between different documents and different people, and the style sheets you create can be integrated with style sheets others have written.

As discussed in Chapter 1, there is more than one style-sheet language available. The one used here is called Cascading Style Sheets (CSS). CSS has the advantages of being an established W3C standard, being familiar to many people from HTML, and being supported in the first wave of XML-enabled Web browsers.

Note: As noted in Chapter 1, another possibility is the Extensible Style Language. XSL is currently the most powerful and flexible style-sheet language, and the only one designed specifically for use with XML. However, XSL is more complicated than CSS, not yet as well supported, and not finished either.

Cross-Reference: XSL will be discussed in Chapters 5, 14, and 15.

The greeting.xml example shown in Listing 3-2 only contains one tag, <GREETING>, so all you need to do is define the style for the GREETING element. Listing 3-5 is a very simple style sheet that specifies that the contents of the GREETING element should be rendered as a block-level element in 24-point bold type.


Listing 3-5: greeting.css

GREETING {display: block; font-size: 24pt; font-weight: bold;}

Listing 3-5 should be typed in a text editor and saved in a new file called greeting.css in the same directory as Listing 3-2. The .css extension stands for Cascading Style Sheet. Once again the extension, .css, is important, although the exact file name is not. However, if a style sheet is to be applied only to a single XML document, it’s often convenient to give it the same name as that document with the extension .css instead of .xml.

    Attaching a Style Sheet to an XML Document

    After you’ve written an XML document and a CSS style sheet for that document, you need to tell the browser to apply the style sheet to the document. In the long term there are likely to be a number of different ways to do this, including browser-server negotiation via HTTP headers, naming conventions, and browser-side defaults. However, right now the only way that works is to include another processing instruction in the XML document to specify the style sheet to be used. The processing instruction is <?xml-stylesheet?> and it has two attributes, type and href. The type attribute specifies the style-sheet language used, and the href attribute specifies a URL, possibly relative, where the style sheet can be found. In Listing 3-6, the xml-stylesheet processing instruction specifies that the style sheet named greeting.css, written in the CSS style-sheet language, is to be applied to this document.

    Listing 3-6: styledgreeting.xml with an xml-stylesheet processing instruction

    <?xml version="1.0" standalone="yes"?>
    <?xml-stylesheet type="text/css" href="greeting.css"?>
    <GREETING>
    Hello XML!
    </GREETING>
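To see what a program makes of the xml-stylesheet processing instruction, here is a minimal sketch using Python's standard xml.dom.minidom module. The document text mirrors Listing 3-6, but the exact type value (text/css) and the helper name stylesheet_info are assumptions made for illustration. A processing instruction has no real attributes, only a target and a data string, so the type and href pairs are pulled out of that string with a regular expression.

```python
import re
from xml.dom import minidom

# Assumed reconstruction of Listing 3-6; the type value is an assumption.
DOC = """<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="greeting.css"?>
<GREETING>
Hello XML!
</GREETING>"""


def stylesheet_info(xml_text):
    """Return the pseudo-attributes of the first xml-stylesheet instruction."""
    document = minidom.parseString(xml_text)
    for node in document.childNodes:
        is_pi = node.nodeType == node.PROCESSING_INSTRUCTION_NODE
        if is_pi and node.target == "xml-stylesheet":
            # node.data holds the raw text: type="..." href="..."
            return dict(re.findall(r'(\w+)\s*=\s*"([^"]*)"', node.data))
    return {}


info = stylesheet_info(DOC)
print(info["type"], info["href"])
```

A browser does essentially the same lookup before it fetches the style sheet named by href.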

    Chapter 3 ✦ Your First XML Document

    Now that you’ve created your first XML document and style sheet, you’re going to want to look at them. All you have to do is load Listing 3-6 into Mozilla or Internet Explorer 5.0. Figure 3-3 shows styledgreeting.xml in Internet Explorer 5.0. Figure 3-4 shows styledgreeting.xml in an early developer build of Mozilla.

    Figure 3-3: styledgreeting.xml in Internet Explorer 5.0

    Figure 3-4: styledgreeting.xml in an early developer build of Mozilla


    Summary

    In this chapter you learned how to create a simple XML document. In particular you learned:

    ✦ How to write and save simple XML documents.

    ✦ How to assign to XML tags the three kinds of meaning: structure, semantics, and style.

    ✦ How to write a CSS style sheet for an XML document that tells browsers how to display particular tags.

    ✦ How to attach a CSS style sheet to an XML document with an xml-stylesheet processing instruction.

    ✦ How to load XML documents into a Web browser.

    In the next chapter, we’ll develop a much larger example of an XML document that demonstrates more of the practical considerations involved in choosing XML tags.







    CHAPTER 4

    Structuring Data

    In this chapter, we will develop a longer example that shows how a large list of baseball statistics and other similar data might be stored in XML. A document like this has several potential uses. Most obviously it can be displayed on a Web page. It can also be used as input to other programs that want to analyze particular seasons or lineups. Along the way, you’ll learn, among other things, how to mark up the data in XML, why particular XML tags are chosen, and how to prepare a CSS style sheet for a document.

    Examining the Data

    As I write this (October, 1998), the New York Yankees have just won their 24th World Series by sweeping the San Diego Padres in four games. The Yankees finished the regular season with an American League record 114 wins. Overall, 1998 was an astonishing season. The St. Louis Cardinals’ Mark McGwire and the Chicago Cubs’ Sammy Sosa dueled through September for the record, previously held by Roger Maris, for most home runs hit in a single season since baseball was integrated. (The all-time major league record for home runs in a single season is still held by catcher Josh Gibson, who hit 75 home runs in the Negro league in 1931. Admittedly, Gibson didn’t have to face the sort of pitching Sosa and McGwire faced in today’s integrated league. Then again, neither did Babe Ruth, who was widely (and incorrectly) believed to have held the record until Roger Maris hit 61 in 1961.)

    What exactly made 1998 such an exciting season? A cynic would tell you that 1998 was an expansion year with two new teams, and consequently much weaker pitching overall. This gave outstanding batters like Sosa and McGwire and outstanding teams like the Yankees a chance to really shine because, although they were as strong as they’d been in 1997, the average opponent they faced was a lot weaker. Of course, true baseball fanatics know the real reason: statistics.









    In This Chapter

    Examining the data
    XMLizing the data
    The advantages of the XML format
    Preparing a style sheet for document display










    That’s a funny thing to say. In most sports you hear about heart, guts, ability, skill, determination, and more. But only in baseball do the fans get so worked up about raw numbers: batting average, earned run average, slugging average, on-base average, fielding percentage, batting average against right-handed pitchers, batting average against left-handed pitchers, batting average against right-handed pitchers when batting left-handed, batting average against right-handed pitchers in Cleveland under a full moon, and so on. Baseball fans are obsessed with numbers; the more numbers the better. Every season the Internet is host to thousands of rotisserie leagues in which avid netizens manage teams, trade players with each other, and calculate how their fantasy teams are doing based on the real-world performance of the players on their fantasy rosters. STATS, Inc. tracks the results of each and every pitch made in a major league game, so it’s possible to figure out that one batter does better than his average with men in scoring position while another does worse.

    In the next two sections, for the benefit of the less baseball-obsessed reader, we will examine the commonly available statistics that describe an individual player’s batting and pitching. Fielding statistics are also available, but I’ll omit them to restrict the examples to a more manageable size. The specific example I’m using is the New York Yankees, but the same statistics are available for any team.

    Batters

    A few years ago, Bruce Bukiet, Jose Palacios, and I wrote a paper called “A Markov Chain Approach to Baseball” (Operations Research, Volume 45, Number 1, January-February, 1997, pp. 14-23, http://www.math.njit.edu/~bukiet/Papers/ball.pdf). In this paper we analyzed all possible batting orders for all teams in the 1989 National League. The results of that paper were mildly interesting: the worst batter on the team, generally the pitcher, should bat eighth rather than in the customary ninth position, at least in the National League. What concerns me here, however, is the work that went into producing this paper. As low grad student on the totem pole, it was my job to manually re-key the complete batting history of each and every player in the National League. That summer would have been a lot more pleasant if I had had the data available in a convenient format like XML.

    Right now, I’m going to concentrate on data for individual players. Typically this data is presented in rows of numbers as shown in Table 4-1 for the 1998 Yankees offense (batters). Since pitchers rarely bat in the American League, only players who actually batted are listed. Each column effectively defines an element. Thus there need to be elements for player, position, games played, at bats, runs, hits, doubles, triples, home runs, runs batted in, walks, strike outs, and times hit by pitch. Singles are generally not reported separately. Rather they’re calculated by subtracting the total number of doubles, triples, and home runs from the number of hits.
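The singles calculation described above is simple enough to sketch in a few lines of Python. The function name singles is an illustrative choice; the figures are the hits, doubles, triples, and home runs that Table 4-1 reports for Joe Girardi and Bernie Williams.

```python
def singles(hits, doubles, triples, home_runs):
    """Singles are whatever hits remain after removing extra-base hits."""
    return hits - doubles - triples - home_runs


# Joe Girardi, 1998: 70 hits, 11 doubles, 4 triples, 3 home runs
girardi_singles = singles(hits=70, doubles=11, triples=4, home_runs=3)

# Bernie Williams, 1998: 169 hits, 30 doubles, 5 triples, 26 home runs
williams_singles = singles(hits=169, doubles=30, triples=5, home_runs=26)

print(girardi_singles, williams_singles)
```

Because the value can always be recomputed this way, there is no need for a separate singles element in the markup.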

    Table 4-1 The 1998 Yankees Offense

    Name               Position           Games Played  At Bats  Runs  Hits  Doubles  Triples  Home Runs  Runs Batted In  Walks  Strike Outs  Hit by Pitch
    Scott Brosius      Third Base         152           530      86    159   34       0        19         98              52     97           10
    Homer Bush         Second Base        45            71       17    27    3        0        1          5               5      19           0
    Chad Curtis        Outfield           151           456      79    111   21       1        10         56              75     80           7
    Chili Davis        Designated Hitter  35            103      11    30    7        0        3          9               14     18           0
    Mike Figga         Catcher            1             4        1     1     0        0        0          0               0      1            0
    Joe Girardi        Catcher            78            254      31    70    11       4        3          31              14     38           2
    Derek Jeter        Shortstop          149           626      127   203   25       8        19         84              57     119          5
    Chuck Knoblauch    Second Base        150           603      117   160   25       4        17         64              76     70           18
    Ricky Ledee        Outfield           42            79       13    19    5        2        1          12              7      29           0
    Mike Lowell        Third Base         8             15       1     4     0        0        0          0               0      1            0
    Tino Martinez      First Base         142           531      92    149   33       1        28         123             61     83           6
    Paul O’Neill       Outfield           152           602      95    191   40       2        24         116             57     103          2
    Jorge Posada       Catcher            111           358      56    96    23       0        17         63              47     92           0
    Tim Raines         Outfield           109           321      53    93    13       1        5          47              55     49           3
    Luis Sojo          Shortstop          54            147      16    34    3        1        0          14              4      15           0
    Shane Spencer      Outfield           27            67       18    25    6        0        10         27              5      12           0
    Darryl Strawberry  Designated Hitter  101           295      44    73    11       2        24         57              46     90           3
    Dale Sveum         First Base         30            58       6     9     0        0        0          3               4      16           0
    Bernie Williams    Outfield           128           499      101   169   30       5        26         97              74     81           1

    Chapter 4 ✦ Structuring Data


    Note

    The data in the previous table and the pitcher data in the next section is actually a somewhat limited list that only begins to specify the data collected on a typical baseball game. There are a lot more elements, including throwing arm, batting arm, number of times the pitcher balked (rare), fielding percentage, college attended, and more. However, I’ll stick to this basic information to keep the examples manageable.

    Pitchers

    Pitchers are not expected to be home-run hitters or base stealers. Indeed a pitcher who can reach first on occasion is a surprise bonus for a team. Instead pitchers are judged on a whole different set of numbers, shown in Table 4-2. Each column of this table also defines an element. Some of these elements, such as name and position, are the same for batters and pitchers. Others, like saves and shutouts, only apply to pitchers. And a few (like runs and home runs) have the same name as a batter statistic, but have different meanings. For instance, the number of runs for a batter is the number of runs the batter scored. The number of runs for a pitcher is the number of runs scored by the opposing teams against this pitcher.

    Organization of the XML Data

    XML is based on a containment model. Each XML element can contain text or other XML elements called its children. A few XML elements may contain both text and child elements, though in general this is bad form and should be avoided wherever possible. However, there’s often more than one way to organize the data, depending on your needs. One of the advantages of XML is that it makes it fairly straightforward to write a program that reorganizes the data in a different form. We’ll discuss this when we talk about XSL transformations in Chapter 14.

    To get started, the first question you’ll have to address is what contains what? For instance, it is fairly obvious that a league contains divisions that contain teams that contain players. Although teams can change divisions when moving from one city to another, and players are routinely traded, at any given moment in time each player belongs to exactly one team and each team belongs to exactly one division. Similarly, a season contains games, which contain innings, which contain at bats, which contain pitches or plays.

    However, does a season contain leagues or does a league contain a season? The answer isn’t so obvious, and indeed there isn’t one unique answer. Whether it makes more sense to make season elements children of league elements or league elements children of season elements depends on the use to which the data will be put. You can even create a new root element that contains both seasons and leagues, neither of which is a child of the other (though doing so effectively would require some advanced techniques that won’t be discussed for several chapters yet). You can organize the data as you like.

    Table 4-2 The 1998 Yankees Pitchers

    Name               P                 W   L   S   G   GS  CG  SHO  ERA    IP     H    HR  R    ER   HB  WP  BK  WB  SO
    Joe Borowski       Relief Pitcher    1   0   0   8   0   0   0    6.52   9.2    11   0   7    7    0   0   0   4   7
    Ryan Bradley       Relief Pitcher    2   1   0   5   1   0   0    5.68   12.2   12   2   9    8    1   0   0   9   13
    Jim Bruske         Relief Pitcher    1   0   0   3   1   0   0    3      9      9    2   3    3    0   0   0   1   3
    Mike Buddie        Relief Pitcher    4   1   0   24  2   0   0    5.62   41.2   46   5   29   26   3   2   1   13  20
    David Cone         Starting Pitcher  20  7   0   31  31  3   0    3.55   207.2  186  20  89   82   15  6   0   59  209
    Todd Erdos         Relief Pitcher    0   0   0   2   0   0   0    9      2      5    0   2    2    0   0   0   1   0
    Orlando Hernandez  Starting Pitcher  12  4   0   21  21  3   1    3.13   141    113  11  53   49   6   5   2   52  131
    Darren Holmes      Relief Pitcher    0   3   2   34  0   0   0    3.33   51.1   53   4   19   19   2   1   0   14  31
    Hideki Irabu       Starting Pitcher  13  9   0   29  28  2   1    4.06   173    148  27  79   78   9   6   1   76  126
    Mike Jerzembeck    Starting Pitcher  0   1   0   3   2   0   0    12.79  6.1    9    2   9    9    0   1   1   4   1
    Graeme Lloyd       Relief Pitcher    3   0   0   50  0   0   0    1.67   37.2   26   3   10   7    2   2   0   6   20
    Ramiro Mendoza     Relief Pitcher    10  2   1   41  14  1   1    3.25   130.1  131  9   50   47   9   3   0   30  56
    Jeff Nelson        Relief Pitcher    5   3   3   45  0   0   0    3.79   40.1   44   1   18   17   8   2   0   22  35
    Andy Pettitte      Starting Pitcher  16  11  0   33  32  5   0    4.24   216.1  226  20  110  102  6   5   0   87  146
    Mariano Rivera     Relief Pitcher    3   0   36  54  0   0   0    1.91   61.1   48   3   13   13   1   0   0   17  36
    Mike Stanton       Relief Pitcher    4   1   6   67  0   0   0    5.47   79     71   13  51   48   4   0   0   26  69
    Jay Tessmer        Relief Pitcher    1   0   0   7   0   0   0    3.12   8.2    4    1   3    3    0   1   0   4   6
    David Wells        Starting Pitcher  18  4   0   30  30  8   5    3.49   214.1  195  29  86   83   1   2   0   29  163

    Note

    Readers familiar with database theory may recognize XML’s model as essentially a hierarchical database, and consequently recognize that it shares all the disadvantages (and a few advantages) of that data model. There are certainly times when a table-based relational approach makes more sense. This example certainly looks like one of those times. However, XML doesn’t follow a relational model. On the other hand, it is completely possible to store the actual data in multiple tables in a relational database, then generate the XML on the fly. Indeed, the larger examples on the CD-ROM were created in that fashion. This enables one set of data to be presented in multiple formats. Transforming the data with style sheets provides still more possible views of the data.

    Since my personal interests lie in analyzing player performance within a single season, I’m going to make season the root of my documents. Each season will contain leagues, which will contain divisions, which will contain teams, which will contain players. I’m not going to granularize my data all the way down to the level of individual games, innings, or plays because, while useful, such examples would be excessively long. You, however, may have other interests. If you choose to divide the data in some other fashion, that works too. There’s almost always more than one way to organize data in XML. In fact, we’ll return to this example in several upcoming chapters where we’ll explore alternative markup vocabularies.

    XMLizing the Data

    Let’s begin the process of marking up the data for the 1998 Major League season in XML with tags that you define. Remember that in XML we’re allowed to make up the tags as we go along. We’ve already decided that the fundamental element of our document will be a season. Seasons will contain leagues. Leagues will contain divisions. Divisions will contain teams. Teams contain players. Players will have statistics including games played, at bats, runs, hits, doubles, triples, home runs, runs batted in, walks, and hits by pitch.
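The containment chain just described can be sketched with Python's standard xml.etree.ElementTree module. This is only an illustration of the nesting, not the chapter's actual markup: SEASON, LEAGUE, TEAM, PLAYER, and HITS are names the text uses, while DIVISION and the helper function depth are assumptions introduced for the example.

```python
import xml.etree.ElementTree as ET

# Build the nesting one level at a time; each SubElement call
# makes the new element a child of the element passed first.
season = ET.Element("SEASON")
league = ET.SubElement(season, "LEAGUE")
division = ET.SubElement(league, "DIVISION")  # hypothetical tag name
team = ET.SubElement(division, "TEAM")
player = ET.SubElement(team, "PLAYER")
ET.SubElement(player, "HITS").text = "70"


def depth(element):
    """Number of nesting levels at and below this element."""
    return 1 + max((depth(child) for child in element), default=0)


print(depth(season))
print(ET.tostring(season, encoding="unicode"))
```

Each statistic ends up six levels deep, which is why the path from the root to a single number reads like a sentence: season, league, division, team, player, hits.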

    Starting the Document: XML Declaration and Root Element

    XML documents may be recognized by the XML declaration. This is a processing instruction placed at the start of all XML files that identifies the version in use. The only version currently understood is 1.0.

    <?xml version="1.0"?>

    Every good XML document (where the word good has a very specific meaning to be discussed in the next chapter) must have a root element. This is an element that completely contains all other elements of the document. The root element’s start tag comes before all other elements’ start tags, and the root element’s end tag comes after all other elements’ end tags. For our root element, we will use SEASON with a start tag of <SEASON> and an end tag of </SEASON>. The document now looks like this:

    <?xml version="1.0"?>
    <SEASON>
    </SEASON>

    The XML declaration is not an element or a tag. It is a processing instruction. Therefore, it does not need to be contained inside the root element, SEASON. But every element we put in this document will go in between the <SEASON> start tag and the </SEASON> end tag. This choice of root element means that we will not be able to store multiple seasons in a single file. If you want to do that, however, you can define a new root element that contains seasons.
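The single-root rule is easy to check with any XML parser. The sketch below uses Python's standard xml.etree.ElementTree module; the wrapper name SEASONS and the helper root_tag_or_error are made-up placeholders for illustration, not names taken from this chapter.

```python
import xml.etree.ElementTree as ET

# Two SEASON elements side by side: there is no single root element.
two_roots = '<?xml version="1.0"?><SEASON></SEASON><SEASON></SEASON>'

# The same two seasons wrapped in a hypothetical SEASONS root element.
wrapped = ('<?xml version="1.0"?>'
           '<SEASONS><SEASON></SEASON><SEASON></SEASON></SEASONS>')


def root_tag_or_error(xml_text):
    """Return the root element's name, or a marker if parsing fails."""
    try:
        return ET.fromstring(xml_text).tag
    except ET.ParseError:
        return "not well-formed"


print(root_tag_or_error(two_roots))
print(root_tag_or_error(wrapped))
```

The parser refuses the first string as soon as it meets the second top-level element, and accepts the second string because everything sits inside one root.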





    Naming Conventions

    Before we begin, I’d like to say a few words about naming conventions. As you’ll see in the next chapter, XML element names are quite flexible and can contain any number of letters and digits in either upper- or lowercase. You have the option of writing XML tags that look like any of the following:

    <SEASON>
    <Season>
    <season>

    There are several thousand more variations. I don’t really care (nor does XML) whether you use all uppercase, all lowercase, mixed-case with internal capitalization, or some other convention. However, I do recommend that you choose one convention and stick to it.

    Of course we will want to identify which season we’re talking about. To do that, we should give the SEASON element a YEAR child. For example:

    <?xml version="1.0"?>
    <SEASON>
      <YEAR>
        1998
      </YEAR>
    </SEASON>

    I’ve used indentation here and in other examples to indicate that the YEAR element is a child of the SEASON element and that the text 1998 is the contents of the YEAR element. This is good coding style, but it is not required. White space in XML is not especially significant. The same example could have been written like this:

    <?xml version="1.0"?>
    <SEASON>
    <YEAR>
    1998
    </YEAR>
    </SEASON>

    Indeed, I’ll often compress elements to a single line when they’ll fit and space is at a premium. You can compress the document still further, even down to a single line, but with a corresponding loss of clarity. For example:

    <?xml version="1.0"?><SEASON><YEAR>1998</YEAR></SEASON>

    Of course this version is much harder to read and understand, which is why I didn’t write it that way. The tenth goal listed in the XML 1.0 specification is “Terseness in XML markup is of minimal importance.” The baseball example reflects this goal throughout.
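A short Python sketch, using the standard xml.etree.ElementTree module, shows what "not especially significant" means in practice. The helper name year_of is an assumption for illustration. Note that the parser itself hands the white space inside YEAR to the program unchanged; it is the program that decides to trim it, after which the indented and single-line versions carry the same data.

```python
import xml.etree.ElementTree as ET

indented = """<?xml version="1.0"?>
<SEASON>
  <YEAR>
    1998
  </YEAR>
</SEASON>"""

single_line = '<?xml version="1.0"?><SEASON><YEAR>1998</YEAR></SEASON>'


def year_of(xml_text):
    """Return the YEAR content with surrounding white space trimmed."""
    return ET.fromstring(xml_text).findtext("YEAR").strip()


raw_text = ET.fromstring(indented).findtext("YEAR")
print(repr(raw_text))
print(year_of(indented), year_of(single_line))
```

Both layouts produce the same element structure; only the untrimmed text differs.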

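    XMLizing League, Division, and Team Data

    Major league baseball is divided into two leagues, the American League and the National League. Each league has a name. The two names could be encoded like this:

    <?xml version="1.0"?>
    <SEASON>
      <YEAR>1998</YEAR>
      <LEAGUE>
        <LEAGUE_NAME>National League</LEAGUE_NAME>
      </LEAGUE>
      <LEAGUE>
        <LEAGUE_NAME>American League</LEAGUE_NAME>
      </LEAGUE>
    </SEASON>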


    I’ve chosen to define the name of a league with a LEAGUE_NAME element, rather than simply a NAME element, because NAME is too generic and it’s likely to be used in other contexts. For instance, divisions, teams, and players also have names.

    Cross-Reference

    Elements from different domains with the same name can be combined using namespaces. Namespaces will be discussed in Chapter 18. However, even with namespaces, you wouldn’t want to give multiple items in the same domain (for example, TEAM and LEAGUE in this example) the same name.

    Each league can be divided into east, west, and central divisions, which can be encoded as follows:

    National League

    East

    Central

    West

    American League

    East

    Central

    West

    The true value of an element depends on its parent, that is, on the elements that contain it as well as on the element itself. Both the American and National Leagues have an East division, but these are not the same thing.

    Each division is divided into teams. Each team has a name and a city. For example, data that pertains to the American League East can be encoded as follows:
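The point about parents can be made concrete with a small Python sketch built on the standard xml.etree.ElementTree module. LEAGUE and LEAGUE_NAME are names used in this chapter; DIVISION, DIVISION_NAME, and the function qualified_divisions are assumptions introduced only for this illustration.

```python
import xml.etree.ElementTree as ET

# A cut-down season with one East division in each league.
# DIVISION and DIVISION_NAME are hypothetical tag names.
DOC = """<SEASON>
  <LEAGUE>
    <LEAGUE_NAME>National League</LEAGUE_NAME>
    <DIVISION><DIVISION_NAME>East</DIVISION_NAME></DIVISION>
  </LEAGUE>
  <LEAGUE>
    <LEAGUE_NAME>American League</LEAGUE_NAME>
    <DIVISION><DIVISION_NAME>East</DIVISION_NAME></DIVISION>
  </LEAGUE>
</SEASON>"""


def qualified_divisions(xml_text):
    """Pair every division name with the name of the league containing it."""
    pairs = []
    for league in ET.fromstring(xml_text).findall("LEAGUE"):
        league_name = league.findtext("LEAGUE_NAME")
        for division in league.findall("DIVISION"):
            pairs.append((league_name, division.findtext("DIVISION_NAME")))
    return pairs


divisions = qualified_divisions(DOC)
print(divisions)
```

The two divisions have identical text content, and only the enclosing league tells them apart.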

    East

    Baltimore Orioles

    Boston Red Sox

    New York Yankees

    Tampa Bay Devil Rays

    Toronto Blue Jays

    XMLizing Player Data

    Each team is composed of players. Each player has a first name and a last name. It’s important to separate the first and last names so that you can sort by either one. The data for the starting pitchers in the 1998 Yankees lineup can be encoded as follows:

    New York Yankees

    Orlando Hernandez

    David Cone

    David Wells

    Andy Pettitte

    Hideki Irabu

    Note

    The tags <GIVEN_NAME> and <SURNAME> are preferable to the more obvious <FIRST_NAME> and <LAST_NAME> or <FIRST_NAME> and <FAMILY_NAME>. Whether the family name or the given name comes first or last varies from culture to culture. Furthermore, surnames aren’t necessarily family names in all cultures.
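Keeping the two parts of a name in separate elements is what makes sorting by either one trivial. The following Python sketch uses the standard xml.etree.ElementTree module and the five starting pitchers listed above. The tag names PLAYER, GIVEN_NAME, and SURNAME are assumed here for illustration, as is the function names_sorted_by.

```python
import xml.etree.ElementTree as ET

# Assumed markup for the five starting pitchers named in the text.
TEAM = """<TEAM>
  <PLAYER><GIVEN_NAME>Orlando</GIVEN_NAME><SURNAME>Hernandez</SURNAME></PLAYER>
  <PLAYER><GIVEN_NAME>David</GIVEN_NAME><SURNAME>Cone</SURNAME></PLAYER>
  <PLAYER><GIVEN_NAME>David</GIVEN_NAME><SURNAME>Wells</SURNAME></PLAYER>
  <PLAYER><GIVEN_NAME>Andy</GIVEN_NAME><SURNAME>Pettitte</SURNAME></PLAYER>
  <PLAYER><GIVEN_NAME>Hideki</GIVEN_NAME><SURNAME>Irabu</SURNAME></PLAYER>
</TEAM>"""


def names_sorted_by(xml_text, key_tag):
    """Return full names ordered by the text of the given child element."""
    players = ET.fromstring(xml_text).findall("PLAYER")
    players.sort(key=lambda player: player.findtext(key_tag))
    return [
        player.findtext("GIVEN_NAME") + " " + player.findtext("SURNAME")
        for player in players
    ]


by_surname = names_sorted_by(TEAM, "SURNAME")
by_given_name = names_sorted_by(TEAM, "GIVEN_NAME")
print(by_surname)
print(by_given_name)
```

Had each name been stored as a single string, the program would first have to guess where one part ends and the other begins.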


    XMLizing Player Statistics

    The next step is to provide statistics for each player. Statistics look a little different for pitchers and batters, especially in the American League in which few pitchers bat. Below are Joe Girardi’s 1998 statistics. He’s a catcher, so we use batting statistics:

    Joe Girardi Catcher 78 76 254 31 70 11 4 3 31 2 4 8 1 3 14 38 2

    Now let’s look at the statistics for a pitcher. Although pitchers occasionally bat in the American League, and frequently bat in the National League, they do so far less often than all other players do. Pitchers are hired and fired, cheered and booed, based on their pitching performance. If they can actually hit the ball on occasion too, that’s pure gravy. Pitching statistics include games played, wins, losses, innings pitched, earned runs, shutouts, hits against, walks given up, and more. Here are Hideki Irabu’s 1998 statistics encoded in XML:

    Hideki Irabu Starting Pitcher 13 9 0 29 28 2 1 4.06 173 148 27 79 78 9 6 1 76

    Terseness in XML Markup is of Minimal Importance

    Throughout this example, I’ve been following the explicit XML principle that “Terseness in XML markup is of minimal importance.” This certainly assists non-baseball-literate readers who may not recognize baseball arcana such as the standard abbreviation for a walk: BB (base on balls), not W as you might expect. If document size is truly an issue, it’s easy to compress the files with zip or some other standard tool. However, this does mean XML documents tend to be quite long, and relatively tedious to type by hand. I confess that this example sorely tempts me to use abbreviations, clarity be damned. If I were to do so, a typical PLAYER element might look like this:

    Joe Girardi

    C

    78 254 31 70 11 4 3 31 14 38 2 4 2


    Putting the XML Document Back Together Again

    Until now, I’ve been showing the XML document in pieces, element by element. However, it’s now time to put all the pieces together and look at the complete document containing the statistics for the 1998 Major League season. Listing 4-1 demonstrates the complete XML document with two leagues, six divisions, thirty teams, and nine players.

    Listing 4-1: A complete XML document

    1998

    National League

    East

    Atlanta Braves

    Malloy Marty Second Base 11 8 28 3 5 1 0 1 1 0 0 0 0 0 2 2 0

    Guillen Ozzie Shortstop 83 59 264 35 73 15 1 1 22 1 4 4 2 6 24 25 1

    Bautista Danny Outfield 82 27 144 17 36 11 0 3 17 1 0 3 2 2 7 21 0

    Williams Gerald Outfield 129 51 266 46 81 18 3 10 44 11 5 2 1 5 17 48 3

    Glavine Tom Starting Pitcher 20 6 0 33 33 4 3 2.47 229.1 202 13 67 63 2 3 0 74

    Lopez Javier Catcher 133 124 489 73 139 21 1 34 106 5 3 1 8 5 30 85 6

    Klesko Ryan Outfield 129 124 427 69 117 29 1 18 70 5 3 0 4 2 56 66 3

    Galarraga Andres First Base 153 151 555 103 169 27 1 44 121 7 6 0 5 11 63 146 25

    Helms Wes Third Base 7 2 13 2 4 1 0 1 2 0 0 0 0 1 0 4 0

    Florida Marlins

    Montreal Expos

    New York Mets

    Philadelphia Phillies

    Central

    Chicago Cubs

    Cincinnati Reds

    Houston Astros

    Milwaukee Brewers

    Pittsburgh Pirates

    St. Louis Cardinals




    West

    Arizona Diamondbacks

    Colorado Rockies

    Los Angeles Dodgers

    San Diego Padres

    San Francisco Giants



    American League

    East

    Baltimore Orioles

    Boston Red Sox

    New York Yankees

    Tampa Bay Devil Rays

    Toronto Blue Jays

    Central

    Chicago White Sox

    Kansas City Royals

    Detroit Tigers

    Cleveland Indians

    Minnesota Twins

    West

    Anaheim Angels

    Oakland Athletics

    Seattle Mariners

    Texas Rangers



    Figure 4-1 shows this document loaded into Internet Explorer 5.0.

    Figure 4-1: The 1998 major league statistics displayed in Internet Explorer 5.0

    Even now this document is incomplete. It only contains players from one team (the Atlanta Braves) and only nine players from that team. Showing more than that would make the example too long to include in this book.

    On the CD-ROM

    A more complete XML document called 1998statistics.xml with statistics for all players in the 1998 major league is on the CD-ROM in the examples/baseball directory.

    Furthermore, I’ve deliberately limited the data included to make this a manageable example within the confines of this book. In reality there are far more details you could include. I’ve already alluded to the possibility of arranging the data game by game, pitch by pitch. Even without going to that extreme, there are a lot of details that could be added to individual elements. Teams also have coaches, managers, owners (How can you think of the Yankees without thinking of George Steinbrenner?), home stadiums, and more.

    I’ve also deliberately omitted numbers that can be calculated from other numbers given here, such as batting average (number of hits divided by number of at bats). Nonetheless, players have batting arms, throwing arms, heights, weights, birth dates, positions, numbers, nicknames, colleges attended, and much more. And of course there are many more players than I’ve shown here. All of this is equally easy to include in XML. But we will stop the XMLification of the data here so we can move on: first to a brief discussion of why this data format is useful, then to the techniques that can be used for actually displaying it in a Web browser.
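Derived numbers such as batting average are left out of the markup precisely because a program can recompute them on demand. A minimal Python sketch follows; the function name batting_average and the choice to return None for a player with no at bats are assumptions made for illustration. The figures are the 1998 hits and at bats given earlier for Joe Girardi and Mike Figga.

```python
def batting_average(hits, at_bats):
    """Hits divided by at bats, rounded to the customary three places."""
    if at_bats == 0:
        return None  # no at bats, so the average is undefined
    return round(hits / at_bats, 3)


girardi_average = batting_average(hits=70, at_bats=254)
figga_average = batting_average(hits=1, at_bats=4)

print(girardi_average, figga_average)
```

Storing only the raw counts also avoids the risk of a stored average drifting out of step with the numbers it was computed from.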


    The Advantages of the XML Format

    Table 4-1 does a pretty good job of displaying the batting data for a team in a comprehensible and compact fashion. What exactly have we gained by rewriting that table as the much longer XML document of Listing 4-1? There are several benefits. Among them:

    ✦ The data is self-describing

    ✦ The data can be manipulated with standard tools

    ✦ The data can be viewed with standard tools

    ✦ Different views of the same data are easy to create with style sheets

    The first major benefit of the XML format is that the data is self-describing. The meaning of each number is clearly and unmistakably associated with the number itself. When reading the document, you know that the 121 in <HITS>121</HITS> refers to hits and not runs batted in or strikeouts. If the person typing in the document skips a number, that doesn’t mean that every number after it is misinterpreted. HITS is still HITS even if the preceding RUNS element is missing.

    Cross-Reference

    In Part II you’ll see that XML can even use DTDs to enforce constraints that certain elements like HITS or RUNS must be present.
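The difference between positional and self-describing data can be sketched in Python with the standard xml.etree.ElementTree module. RUNS and HITS are the element names used in the text; the column list, the helper functions, and the runs value of 60 are made up for illustration.

```python
import xml.etree.ElementTree as ET

COLUMNS = ["RUNS", "HITS"]  # meaning depends purely on position


def read_positional(row):
    """Interpret a whitespace-separated row by column position."""
    return dict(zip(COLUMNS, row.split()))


def read_xml(fragment):
    """Interpret a PLAYER element by the names of its children."""
    return {child.tag: child.text for child in ET.fromstring(fragment)}


complete_row = read_positional("60 121")  # 60 is an invented runs value
short_row = read_positional("121")        # the runs value was skipped
xml_record = read_xml("<PLAYER><HITS>121</HITS></PLAYER>")

print(complete_row, short_row, xml_record)
```

In the shortened row, 121 silently becomes the runs value, while the XML record still reports it as hits and simply has no RUNS entry.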

The second benefit to providing the data in XML is that it enables the data to be manipulated in a wide range of XML-enabled tools, from expensive payware like Adobe FrameMaker to free open-source software like Python and Perl. The data may be bigger, but the extra redundancy allows more tools to process it.

The same is true when the time comes to view the data. The XML document can be loaded into Internet Explorer 5.0, Mozilla, FrameMaker 5.5.6, and many other tools, all of which provide unique, useful views of the data. The document can even be loaded into simple, bare-bones text editors like vi, BBEdit, and TextPad. So it’s at least marginally viewable on most platforms.

Using new software isn’t the only way to get a different view of the data either. In the next section, we’ll build a style sheet for baseball statistics that provides a completely different way of looking at the data than what you see in Figure 4-1. Every time you apply a different style sheet to the same document you see a different picture.

Lastly, you should ask yourself if the size is really that important. Modern hard drives are quite big, and can hold a lot of data, even if it’s not stored very efficiently. Furthermore, XML files compress very well. The complete major league 1998 statistics document is 653K. However, compressing the file with gzip gets that all the way down to 66K, almost 90 percent less. Advanced HTTP servers like Jigsaw can actually send compressed files rather than the uncompressed files, so that the network bandwidth used by a document like this is fairly close to its actual information content.

Finally, you should not assume that binary file formats, especially general-purpose ones, are necessarily more efficient. A Microsoft Excel file that contains the same data as 1998statistics.xml actually takes up 2.37 MB, more than three times as much space. Although you can certainly create more efficient file formats and encodings of this data, in practice that simply isn’t often necessary.
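The compression claim is easy to try for yourself. The following Python sketch uses the standard gzip module on a synthetic document; the player records are generated filler rather than the real 1998 statistics, so the exact ratio will differ from the 653K and 66K figures quoted above.

```python
import gzip

# Build a synthetic document in the style of this chapter: many PLAYER
# elements that repeat the same tag names. The content is made up.
players = "".join(
    f"<PLAYER><SURNAME>Player{i}</SURNAME><AT_BATS>{i % 600}</AT_BATS>"
    f"<HITS>{i % 200}</HITS><WALKS>{i % 90}</WALKS></PLAYER>\n"
    for i in range(1000)
)
document = (
    f'<?xml version="1.0"?>\n<SEASON>\n{players}</SEASON>\n'.encode("utf-8")
)

# Compress in memory and compare sizes.
compressed = gzip.compress(document)
ratio = len(compressed) / len(document)
print(len(document), len(compressed), f"{ratio:.0%}")

# Decompression restores the document byte for byte.
restored = gzip.decompress(compressed)
```

The repeated tag names are exactly the kind of redundancy a general-purpose compressor removes, which is why verbose markup costs far less on disk or on the wire than its raw size suggests.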

Preparing a Style Sheet for Document Display

The view of the raw XML document shown in Figure 4-1 is not bad for some uses. For instance, it allows you to collapse and expand individual elements so you see only those parts of the document you want to see. However, most of the time you’d probably like a more finished look, especially if you’re going to display it on the Web. To provide a more polished look, you must write a style sheet for the document.

In this chapter, we’ll use CSS style sheets. A CSS style sheet associates particular formatting with each element of the document. The complete list of elements used in our XML document is:

SEASON
YEAR
LEAGUE
LEAGUE_NAME
DIVISION
DIVISION_NAME
TEAM
TEAM_CITY
TEAM_NAME
PLAYER
SURNAME
GIVEN_NAME
POSITION
GAMES
GAMES_STARTED
AT_BATS
RUNS
HITS
DOUBLES
TRIPLES
HOME_RUNS
RBI
STEALS
CAUGHT_STEALING
SACRIFICE_HITS
SACRIFICE_FLIES
ERRORS
WALKS
STRUCK_OUT
HIT_BY_PITCH

Generally, you’ll want to follow an iterative procedure, adding style rules for each of these elements one at a time, checking that they do what you expect, then moving on to the next element. In this example, such an approach also has the advantage of introducing CSS properties one at a time for those who are not familiar with them.

Linking to a Style Sheet

The style sheet can be named anything you like. If it’s only going to apply to one document, then it’s customary to give it the same name as the document but with the three-letter extension .css instead of .xml. For instance, the style sheet for the XML document 1998shortstats.xml might be called 1998shortstats.css. On the other hand, if the same style sheet is going to be applied to many documents, then it should probably have a more generic name like baseballstats.css.

Cross-Reference

Since CSS style sheets cascade, more than one can be applied to the same document. Thus it’s possible that baseballstats.css would apply some general formatting rules, while 1998shortstats.css would override a few to handle specific details in the one document 1998shortstats.xml. We’ll discuss this procedure in Chapter 12, Cascading Style Sheets Level 1.

To attach a style sheet to the document, you simply add an additional processing instruction between the XML declaration and the root element, like this:

<?xml version="1.0"?>
<?xml-stylesheet type="text/css" href="baseballstats.css"?>
<SEASON>
...

This tells a browser reading the document to apply the style sheet found in the file baseballstats.css to this document. This file is assumed to reside in the same directory and on the same server as the XML document itself. In other words, baseballstats.css is a relative URL. Complete URLs may also be used. For example:

...

You can begin by simply placing an empty file named baseballstats.css in the same directory as the XML document. Once you’ve done this and added the necessary processing instruction to 1998shortstats.xml (Listing 4-1), the document now appears as shown in Figure 4-2. Only the element content is shown. The collapsible outline view of Figure 4-1 is gone. The formatting of the element content uses the browser’s defaults, black 12-point Times Roman on a white background in this case.

Figure 4-2: The 1998 major league statistics displayed after a blank style sheet is applied

Note

You’ll also see a view much like Figure 4-2 if the style sheet named by the xml-stylesheet processing instruction can’t be found in the specified location.
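If you have many documents to link, the processing instruction can be added by a small script rather than by hand. The Python sketch below is one possible way to do it, not part of any XML toolkit: the function name attach_stylesheet is hypothetical, the absolute URL uses the placeholder host www.example.com, and the href value is assumed not to contain characters that need escaping.

```python
import re

def attach_stylesheet(xml_text: str, href: str) -> str:
    """Insert an xml-stylesheet processing instruction for a CSS file.

    The instruction goes right after the XML declaration if there is
    one, otherwise at the very start of the document.
    """
    pi = f'<?xml-stylesheet type="text/css" href="{href}"?>'
    # Match "<?xml " followed by the rest of the declaration. The
    # required whitespace keeps this from matching "<?xml-stylesheet".
    declaration = re.match(r"\s*<\?xml\s[^?]*\?>", xml_text)
    if declaration:
        end = declaration.end()
        return xml_text[:end] + "\n" + pi + xml_text[end:]
    return pi + "\n" + xml_text

doc = '<?xml version="1.0"?>\n<SEASON>\n  <YEAR>1998</YEAR>\n</SEASON>\n'

# A relative URL: the style sheet sits next to the document.
relative = attach_stylesheet(doc, "baseballstats.css")

# A complete URL (placeholder address, for illustration only).
absolute = attach_stylesheet(
    doc, "http://www.example.com/styles/baseballstats.css"
)

print(relative)
```

Either way the instruction ends up between the XML declaration and the root element, which is where a browser expects to find it.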


Assigning Style Rules to the Root Element

You do not have to assign a style rule to each element in the list. Many elements can simply allow the styles of their parents to cascade down. The most important style, therefore, is the one for the root element, which is SEASON in this example. This defines the default for all the other elements on the page. Computer monitors at roughly 72 dpi don’t have as high a resolution as paper at 300 or more dpi. Therefore, Web pages should generally use a larger point size than is customary. Let’s make the default 14-point type, black on a white background, as shown below:

SEASON {font-size: 14pt; background-color: white; color: black; display: block}

Place this statement in a text file, save the file with the name baseballstats.css in the same directory as Listing 4-1, 1998shortstats.xml, and open 1998shortstats.xml in your browser. You should see something like what is shown in Figure 4-3.

Figure 4-3: Baseball statistics in 14-point type with a black-on-white background

The default font size changed between Figure 4-2 and Figure 4-3. The text color and background color did not. Indeed, it was not absolutely required to set them, since black foreground and white background are the defaults. Nonetheless, nothing is lost by being explicit regarding what you want.


Assigning Style Rules to Titles

The YEAR element is more or less the title of the document. Therefore, let’s make it appropriately large and bold; 32 points should be big enough. Furthermore, it should stand out from the rest of the document rather than simply running together with the rest of the content, so let’s make it a centered block element. All of this can be accomplished by the following style rule.

YEAR {display: block; font-size: 32pt; font-weight: bold; text-align: center}

Figure 4-4 shows the document after this rule has been added to the style sheet. Notice in particular the line break after “1998.” That’s there because YEAR is now a block-level element. Everything else in the document is an inline element. You can only center (or left-align, right-align, or justify) block-level elements.

Figure 4-4: Stylizing the YEAR element as a title

In this document with this style rule, YEAR duplicates the functionality of HTML’s H1 header element. Since this document is so neatly hierarchical, several other elements serve the role of H2 headers, H3 headers, etc. These elements can be formatted by similar rules with only a slightly smaller font size.

For instance, SEASON is divided into two LEAGUE elements. The name of each LEAGUE (that is, the LEAGUE_NAME element) has the same role as an H2 element in HTML. Each LEAGUE element is divided into three DIVISION elements. The name of each DIVISION (that is, the DIVISION_NAME element) has the same role as an H3 element in HTML. These two rules format them accordingly:

LEAGUE_NAME {display: block; text-align: center; font-size: 28pt; font-weight: bold}
DIVISION_NAME {display: block; text-align: center; font-size: 24pt; font-weight: bold}

Figure 4-5 shows the resulting document.

Figure 4-5: Stylizing the LEAGUE_NAME and DIVISION_NAME elements as headings

Note

One crucial difference between HTML and XML is that in HTML there’s generally no one element that contains both the title of a section (the H2, H3, H4, etc., header) and the complete contents of the section. Instead the contents of a section have to be implied as everything between the end of one level of header and the start of the next header at the same level. This is particularly important for software that has to parse HTML documents, for instance to generate a table of contents automatically.

Divisions are divided into TEAM elements. Formatting these is a little trickier because the title of a team is not simply the TEAM_NAME element but rather the TEAM_CITY concatenated with the TEAM_NAME. Therefore these need to be inline elements rather than separate block-level elements. However, they are still titles, so we set them to bold, italic, 20-point type. Figure 4-6 shows the results of adding these two rules to the style sheet.

TEAM_CITY {font-size: 20pt; font-weight: bold; font-style: italic}
TEAM_NAME {font-size: 20pt; font-weight: bold; font-style: italic}

Figure 4-6: Stylizing team names

At this point it would be nice to arrange the team names and cities as a combined block-level element. There are several ways to do this. You could, for instance, add an additional TEAM_TITLE element to the XML document whose sole purpose is merely to contain the TEAM_NAME and TEAM_CITY. For instance:

<TEAM_TITLE>
  <TEAM_CITY>Colorado</TEAM_CITY>
  <TEAM_NAME>Rockies</TEAM_NAME>
</TEAM_TITLE>

Next, you would add a style rule that applies block-level formatting to TEAM_TITLE:

TEAM_TITLE {display: block; text-align: center}

However, you really should never reorganize an XML document just to make the style sheet easier to write. After all, the whole point of a style sheet is to keep formatting information out of the document itself. Fortunately, you can achieve much the same effect by making the immediately preceding and following elements block-level elements; that is, TEAM and PLAYER respectively. This places the TEAM_NAME and TEAM_CITY in an implicit block-level element of their own. Figure 4-7 shows the result.

TEAM {display: block}
PLAYER {display: block}

Figure 4-7: Stylizing team names and cities as headers

Assigning Style Rules to Player and Statistics Elements

The trickiest formatting this document requires is for the individual players and statistics. Each team has a couple of dozen players. Each player has statistics. You could think of a TEAM element as being divided into PLAYER elements, and place each player in his own block-level section as you did for previous elements. However, a more attractive and efficient way to organize this is to use a table. The style rules that accomplish this look like this:

TEAM {display: table}
TEAM_CITY {display: table-caption}
TEAM_NAME {display: table-caption}
PLAYER {display: table-row}
SURNAME {display: table-cell}
GIVEN_NAME {display: table-cell}
POSITION {display: table-cell}
GAMES {display: table-cell}
GAMES_STARTED {display: table-cell}
AT_BATS {display: table-cell}
RUNS {display: table-cell}
HITS {display: table-cell}
DOUBLES {display: table-cell}
TRIPLES {display: table-cell}
HOME_RUNS {display: table-cell}
RBI {display: table-cell}
STEALS {display: table-cell}
CAUGHT_STEALING {display: table-cell}
SACRIFICE_HITS {display: table-cell}
SACRIFICE_FLIES {display: table-cell}
ERRORS {display: table-cell}
WALKS {display: table-cell}
STRUCK_OUT {display: table-cell}
HIT_BY_PITCH {display: table-cell}

Unfortunately, table properties are only supported in CSS Level 2, and this is not yet supported by Internet Explorer 5.0 or any other browser available at the time of this writing. Instead, since table formatting doesn’t yet work, I’ll settle for just making TEAM and PLAYER block-level elements, and leaving all the rest with the default formatting.

Summing Up

Listing 4-2 shows the finished style sheet. CSS style sheets don’t have a lot of structure beyond the individual rules. In essence, this is just a list of all the rules I introduced separately above. Because no two of these rules set the same property on the same element, reordering them wouldn’t make any difference as long as they’re all present.

Listing 4-2: baseballstats.css

SEASON {font-size: 14pt; background-color: white; color: black; display: block}
YEAR {display: block; font-size: 32pt; font-weight: bold; text-align: center}
LEAGUE_NAME {display: block; text-align: center; font-size: 28pt; font-weight: bold}
DIVISION_NAME {display: block; text-align: center; font-size: 24pt; font-weight: bold}
TEAM_CITY {font-size: 20pt; font-weight: bold; font-style: italic}
TEAM_NAME {font-size: 20pt; font-weight: bold; font-style: italic}
TEAM {display: block}
PLAYER {display: block}


This completes the basic formatting for baseball statistics. However, work clearly remains to be done. Browsers that support real table formatting would definitely help. However, there are some other pieces as well. They are noted below in no particular order:

✦ The numbers are presented raw with no indication of what they represent. Each number should be identified by a caption that names it, like “RBI” or “At Bats.”

✦ Interesting data like batting average that could be calculated from the data presented here is not included.

✦ Some of the titles are a little short. For instance, it would be nice if the title of the document were “1998 Major League Baseball” instead of simply “1998”.

✦ If all players in the Major League were included, this document would be so long it would be hard to read. Something similar to Internet Explorer’s collapsible outline view for documents with no style sheet would be useful in this situation.

✦ Because pitcher statistics are so different from batter statistics, it would be nice to sort them separately in the roster.

Many of these points could be addressed by adding more content to the document. For instance, to change the title “1998” to “1998 Major League Baseball,” all you have to do is rewrite the YEAR element like this:

<YEAR>1998 Major League Baseball</YEAR>

Captions can be added to the player stats with a phantom player at the top of each roster, like this:

<PLAYER>
  <SURNAME>Surname</SURNAME>
  <GIVEN_NAME>Given name</GIVEN_NAME>
  <POSITION>Position</POSITION>
  <GAMES>Games</GAMES>
  <GAMES_STARTED>Games Started</GAMES_STARTED>
  <AT_BATS>At Bats</AT_BATS>
  <RUNS>Runs</RUNS>
  <HITS>Hits</HITS>
  <DOUBLES>Doubles</DOUBLES>
  <TRIPLES>Triples</TRIPLES>
  <HOME_RUNS>Home Runs</HOME_RUNS>
  <RBI>Runs Batted In</RBI>
  <STEALS>Steals</STEALS>
  <CAUGHT_STEALING>Caught Stealing</CAUGHT_STEALING>
  <SACRIFICE_HITS>Sacrifice Hits</SACRIFICE_HITS>
  <SACRIFICE_FLIES>Sacrifice Flies</SACRIFICE_FLIES>
  <ERRORS>Errors</ERRORS>
  <WALKS>Walks</WALKS>
  <STRUCK_OUT>Struck Out</STRUCK_OUT>
  <HIT_BY_PITCH>Hit By Pitch</HIT_BY_PITCH>
</PLAYER>


Still, there’s something fundamentally troublesome about such tactics. The year is 1998, not “1998 Major League Baseball.” The caption “At Bats” is not the same as a number of at bats. (It’s the difference between the name of a thing and the thing itself.) You can encode still more markup like this:

    Surname Given name Position Games Games Started At Bats Runs Hits Doubles Triples Home Runs Runs Batted In Steals Caught Stealing Sacrifice Hits Sacrifice Flies Errors Walks Struck Out Hit By Pitch

However, this basically reinvents HTML, and returns us to the point of using markup for formatting rather than meaning. Furthermore, we’re still simply repeating the information that’s already contained in the names of the elements. The full document is large enough as is. We’d prefer not to make it larger.

Adding batting and other averages is easy. Just include the data as additional elements. For example, here’s a player with batting, slugging, and on-base averages:

Malloy Marty Second Base 11 8 .233 .321 .179 28 3 5 1 0 1 1 0 0 0 0 0 2 2 0

However, this information is redundant because it can be calculated from the other information already included in a player’s listing. Batting average, for example, is simply the number of base hits divided by the number of at bats; that is, HITS/AT_BATS. Redundant data makes maintaining and updating the document considerably more difficult. A simple change or addition to a single element requires changes and recalculations in multiple locations.

What’s really needed is a different style-sheet language that enables you to add certain boilerplate content to elements and to perform transformations on the element content that is present. Such a language exists: the Extensible Style Language (XSL).

Cross-Reference

Extensible Style Language (XSL) is covered in Chapters 5, 14, and 15.
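To see why the averages are redundant, here is a short Python sketch that derives them from the counting statistics instead of storing them. The record is an abbreviated, assumed reading of the Marty Malloy example (the mapping of each number to an element name is an inference), batting average follows the HITS/AT_BATS definition given above, and the slugging and on-base formulas are the conventional baseball definitions rather than anything specified in this chapter.

```python
import xml.etree.ElementTree as ET

# Abbreviated player record; element names follow this chapter's list,
# and the values are an assumed reading of the sample player.
PLAYER_XML = """
<PLAYER>
  <SURNAME>Malloy</SURNAME>
  <GIVEN_NAME>Marty</GIVEN_NAME>
  <AT_BATS>28</AT_BATS>
  <HITS>5</HITS>
  <DOUBLES>1</DOUBLES>
  <TRIPLES>0</TRIPLES>
  <HOME_RUNS>1</HOME_RUNS>
  <SACRIFICE_FLIES>0</SACRIFICE_FLIES>
  <WALKS>2</WALKS>
  <HIT_BY_PITCH>0</HIT_BY_PITCH>
</PLAYER>
"""

def stat(player, name):
    """Return the integer content of the named child element."""
    return int(player.findtext(name))

def batting_average(player):
    # Hits divided by at bats. A player with zero at bats would need
    # special handling, which this sketch omits.
    return stat(player, "HITS") / stat(player, "AT_BATS")

def slugging_average(player):
    # Total bases divided by at bats (conventional definition).
    doubles = stat(player, "DOUBLES")
    triples = stat(player, "TRIPLES")
    homers = stat(player, "HOME_RUNS")
    singles = stat(player, "HITS") - doubles - triples - homers
    total_bases = singles + 2 * doubles + 3 * triples + 4 * homers
    return total_bases / stat(player, "AT_BATS")

def on_base_average(player):
    # Times on base divided by plate appearances that count toward
    # the average (conventional definition).
    reached = (stat(player, "HITS") + stat(player, "WALKS")
               + stat(player, "HIT_BY_PITCH"))
    chances = (stat(player, "AT_BATS") + stat(player, "WALKS")
               + stat(player, "HIT_BY_PITCH")
               + stat(player, "SACRIFICE_FLIES"))
    return reached / chances

player = ET.fromstring(PLAYER_XML)
batting = f"{batting_average(player):.3f}"
slugging = f"{slugging_average(player):.3f}"
on_base = f"{on_base_average(player):.3f}"
print(batting, slugging, on_base)
```

Because every average is recomputed from the raw counts each time, a correction to HITS or AT_BATS never leaves a stale derived number behind in the document.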

CSS is simpler than XSL and works well for basic Web pages and reasonably straightforward documents. XSL is considerably more complex, but also more powerful. XSL builds on the simple CSS formatting you’ve learned about here, but also provides transformations of the source document into various forms the reader can view. It’s often a good idea to make a first pass at a problem using CSS while you’re still debugging your XML, then move to XSL to achieve greater flexibility.

Summary

In this chapter, you saw examples demonstrating the creation of an XML document from scratch. In particular, you learned:

✦ How to examine the data you’ll include in your XML document to identify the elements.

✦ How to mark up the data with XML tags you define.

✦ The advantages XML formats provide over traditional formats.

✦ How to write a style sheet that says how the document should be formatted and displayed.

This chapter was full of seat-of-the-pants/back-of-the-envelope coding. The document was written without more than minimal concern for details. In the next chapter, we’ll explore some additional means of embedding information in XML documents including attributes, comments, and processing instructions, and look at an alternative way of encoding baseball statistics in XML.








    5

    C H A P T E R

    Attributes, Empty Tags, and XSL









In This Chapter

Attributes
Attributes versus elements
Empty tags
XSL

You can encode a given set of data in XML in nearly an infinite number of ways. There’s no one right way to do it, although some ways are more right than others, and some are more appropriate for particular uses. In this chapter, we explore a different solution to the problem of marking up baseball statistics in XML, carrying over the baseball example from the previous chapter. Specifically, we will address the use of attributes to store information and empty tags to define element positions. In addition, since CSS doesn’t work well with content-less XML elements of this form, we’ll examine an alternative, and more powerful, style sheet language called XSL.

Attributes

In the last chapter, all data was categorized into the name of a tag or the contents of an element. This is a straightforward and easy-to-understand approach, but it’s not the only one. As in HTML, XML elements may have attributes. An attribute is a name-value pair associated with an element. The name and the value are each strings, and no element may contain two attributes with the same name. You’re already familiar with attribute syntax from HTML. For example, consider this tag:

<IMG SRC=cup.gif WIDTH=89 HEIGHT=67 ALT="Cup of coffee">

It has four attributes: the SRC attribute whose value is cup.gif, the WIDTH attribute whose value is 89, the HEIGHT attribute whose value is 67, and the ALT attribute whose value is Cup of coffee. However, in XML, unlike HTML, attribute values must always be quoted and start tags must have matching close tags. Thus, the XML equivalent of this tag is:

<IMG SRC="cup.gif" WIDTH="89" HEIGHT="67" ALT="Cup of coffee"/>

Note

Another difference between HTML and XML is that XML assigns no particular meaning to the IMG tag and its attributes. In particular, there’s no guarantee that an XML browser will interpret this tag as an instruction to load and display the image in the file cup.gif.

You can apply attribute syntax to the baseball example quite easily. This has the advantage of making the markup somewhat more concise. For example, instead of containing a YEAR child element, the SEASON element only needs a YEAR attribute.

<SEASON YEAR="1998">

On the other hand, LEAGUE should be a child of the SEASON element rather than an attribute. For one thing, there are two leagues in a season. Anytime there’s likely to be more than one of something, child elements are called for. Attribute names must be unique within an element. Thus you should not, for example, write a SEASON element like this:

<SEASON YEAR="1998" LEAGUE="National" LEAGUE="American">

The second reason LEAGUE is naturally a child element rather than an attribute is that it has substructure; it is subdivided into DIVISION elements. Attribute values are flat text. XML elements can conveniently encode structure; attribute values cannot. However, the name of a league is unstructured, flat text, and there’s only one name per league, so LEAGUE elements can easily have a NAME attribute instead of a LEAGUE_NAME child element:

<LEAGUE NAME="National League">

Since an attribute is more closely tied to its element than a child element is, you don’t run into problems by using NAME instead of LEAGUE_NAME for the name of the attribute. Divisions and teams can also have NAME attributes without any fear of confusion with the name of a league. Since a tag can have more than one attribute (as long as the attributes have different names), you can make a team’s city an attribute as well, as shown below:







Players will have a lot of attributes if you choose to make each statistic an attribute. For example, here are Joe Girardi’s 1998 statistics as attributes:

Listing 5-1 uses this new attribute style for a complete XML document containing the baseball statistics for the 1998 major league season. It displays the same information (i.e., two leagues, six divisions, 30 teams, and nine players) as does Listing 4-1 in the last chapter. It is merely marked up differently. Figure 5-1 shows this document loaded into Internet Explorer 5.0 without a style sheet.

Figure 5-1: The 1998 major league baseball statistics using attributes for most information
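The attribute rules described in this section can be checked with any conforming parser. The Python sketch below uses the standard xml.etree.ElementTree module; the two small documents are made-up illustrations (including the league names and empty LEAGUE elements), not excerpts from Listing 5-1.

```python
import xml.etree.ElementTree as ET

# A SEASON with a YEAR attribute and LEAGUE children that each carry
# a NAME attribute. The sample values are for illustration only.
good = """
<SEASON YEAR="1998">
  <LEAGUE NAME="National League"/>
  <LEAGUE NAME="American League"/>
</SEASON>
"""
season = ET.fromstring(good)

# Attributes are read by name from the element that owns them, so
# NAME on LEAGUE cannot be confused with NAME on some other element.
year = season.get("YEAR")
league_names = [league.get("NAME") for league in season.findall("LEAGUE")]
print(year, league_names)

# Repeating an attribute name on one element is a well-formedness
# error, so the parser rejects the whole document.
bad = '<SEASON YEAR="1998" LEAGUE="National" LEAGUE="American"/>'
try:
    ET.fromstring(bad)
    rejected = False
except ET.ParseError as error:
    rejected = True
    print("not well-formed:", error)
```

This is the practical reason repeated items such as leagues belong in child elements: an element can have any number of same-named children, but only one attribute of a given name.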


Listing 5-1: A complete XML document that uses attributes to store baseball statistics

































































Listing 5-1 uses only attributes for player information. Listing 4-1 used only element content. There are intermediate approaches as well. For example, you could make the player’s name part of element content while leaving the rest of the statistics as attributes, like this:

    On Tuesday Joe Girardi struck out twice and...

This would include Joe Girardi’s name in the text of a page while still making his statistics available to readers who want to look deeper, as a hypertext footnote or tool tip. There’s always more than one way to encode the same data. Which way you pick generally depends on the needs of your specific application.

Attributes versus Elements

There are no hard and fast rules about when to use child elements and when to use attributes. Generally, you’ll use whichever suits your application. With experience, you’ll gain a feel for when attributes are easier than child elements and vice versa. Until then, one good rule of thumb is that the data itself should be stored in elements. Information about the data (meta-data) should be stored in attributes. And when in doubt, put the information in the elements.

To differentiate between data and meta-data, ask yourself whether someone reading the document would want to see a particular piece of information. If the answer is yes, then the information probably belongs in a child element. If the answer is no, then the information probably belongs in an attribute. If all tags were stripped from the document along with all the attributes, the basic information should still be present. Attributes are good places to put ID numbers, URLs, references, and other information not directly or immediately relevant to the reader. However, there are many exceptions to the basic principle of storing meta-data as attributes. These include:

✦ Attributes can’t hold structure well.

✦ Elements allow you to include meta-meta-data (information about the information about the information).

✦ Not everyone always agrees on what is and isn’t meta-data.

✦ Elements are more extensible in the face of future changes.

Structured Meta-data

One important principle to remember is that elements can have substructure and attributes can’t. This makes elements far more flexible, and may convince you to encode meta-data as child elements. For example, suppose you’re writing a paper and you want to include a source for a fact. It might look something like this:

    Josh Gibson is the only person in the history of baseball to hit a pitch out of Yankee Stadium.

Clearly the information “The Biographical History of Baseball, Donald Dewey and Nicholas Acocella (New York: Carroll & Graf Publishers, Inc. 1995) p. 169” is meta-data. It is not the fact itself. Rather it is information about the fact. However, the SOURCE attribute contains a lot of implicit substructure. You might find it more useful to organize the information like this:

    Donald Dewey Nicholas Acocella

    The Biographical History of Baseball 169 1995

Furthermore, using elements instead of attributes makes it straightforward to include additional information like the authors’ e-mail addresses, a URL where an electronic copy of the document can be found, the title or theme of the particular issue of the journal, and anything else that seems important.

Dates are another common example. One common piece of meta-data about scholarly articles is the date the article was first received. This is important for establishing priority of discovery and invention. It’s easy to include a DATE attribute in an ARTICLE tag like this:

    Polymerase Reactions in Organic Compounds

However, the DATE attribute has substructure signified by the /. Getting that structure out of the attribute value is much more difficult than reading child elements of a DATE element, as shown below:

<DATE>
  <YEAR>1969</YEAR>
  <MONTH>06</MONTH>
  <DAY>28</DAY>
</DATE>

For instance, with CSS or XSL, it’s easy to format the day and month invisibly so that only the year appears. For example, using CSS:

YEAR {display: inline}
MONTH {display: none}
DAY {display: none}

If the DATE is stored as an attribute, however, there’s no easy way to access only part of it. You must write a separate program in a programming language like ECMAScript or Java that can parse your date format. It’s easier to use the standard XML tools and child elements.

Furthermore, the attribute syntax is ambiguous. What does the date “10/11/1999” signify? In particular, is it October 11th or November 10th? Readers from different countries will interpret this data differently. Even if your parser understands one format, there’s no guarantee the people entering the data will enter it correctly. The XML, by contrast, is unambiguous.

Finally, using DATE children rather than attributes allows more than one date to be associated with an element. For instance, scholarly articles are often returned to the author for revisions. In these cases, it can also be important to note when the revised article was received. For example:

    Maximum Projectile Velocity in an Augmented Railgun

    Elliotte Harold Bruce Bukiet William Peter

    1992 10 29

    1993 10 26
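The difference between structured and flat dates discussed above can be made concrete with a short Python sketch. It assumes a DATE element with YEAR, MONTH, and DAY children as described in the text; the ARTICLE element carrying a DATE attribute is a hypothetical stand-in for the flat form.

```python
import xml.etree.ElementTree as ET
from datetime import date, datetime

# Structured form: each part of the date is its own child element,
# so no date-format parsing is needed.
structured = ET.fromstring(
    "<DATE><YEAR>1969</YEAR><MONTH>06</MONTH><DAY>28</DAY></DATE>"
)
received = date(
    int(structured.findtext("YEAR")),
    int(structured.findtext("MONTH")),
    int(structured.findtext("DAY")),
)

# Flat form: the whole date is one attribute string. The program has
# to guess which convention the person entering the data used.
flat = ET.fromstring('<ARTICLE DATE="10/11/1999"/>').get("DATE")
us_reading = datetime.strptime(flat, "%m/%d/%Y").date()
european_reading = datetime.strptime(flat, "%d/%m/%Y").date()

print(received)
print(us_reading, european_reading)
```

Both readings of the flat string parse without error, which is exactly the problem: nothing in the attribute value says which one was intended, whereas the element names in the structured form settle the question.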


As another example, consider the ALT attribute of an IMG tag in HTML. This is limited to a single string of text. However, given that a picture is worth a thousand words, you might well want to replace an IMG with marked up text. For instance, consider the pie chart shown in Figure 5-2.

[Pie chart: Major League Baseball Positions. Slices: Starting Pitcher, Relief Pitcher, Catcher, Outfield, First Base, Shortstop, Second Base, Third Base.]

Figure 5-2: Distribution of positions in major league baseball

Using an ALT attribute, the best description of this picture you can provide is:

However, with an ALT child element, you have more flexibility because you can embed markup. For example, you might provide a table of the relevant numbers instead of a pie chart.

    Starting Pitcher 242 20%
    Relief Pitcher 336 27%
    Catcher 104 9%
    Outfield 235 19%
    First Base 67 6%
    Shortstop 67 6%
    Second Base 88 7%
    Third Base 67 6%


You might even provide the actual PostScript, SVG, or VML code to render the picture in the event that the bitmap image is not available.

Meta-Meta-Data

Using elements for meta-data also easily allows for meta-meta-data, or information about the information about the information. For example, the author of a poem may be considered to be meta-data about the poem. The language in which that author’s name is written is data about the meta-data about the poem. This isn’t a trivial concern, especially for distinctly non-Roman languages. For instance, is the author of the Odyssey Homer or ______? If you use elements, it’s easy to write:

    Homer ______

However, if POET is an attribute rather than a child element, you’re stuck with unwieldy constructs like this:

    Homer Tell me, O Muse, of the cunning man...

    And it’s even more bulky if you want to provide both the poet’s English and Greek names.

    Homer Tell me, O Muse, of the cunning man...
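
    The two approaches can be sketched side by side. This is only an illustration: POEM and POET come from the discussion above, but the LANGUAGE and POET_LANGUAGE attribute names are assumptions, and the Greek form of the name is left out.

```xml
<!-- Child-element form: the poet's name carries its own meta-data.
     The LANGUAGE attribute name is an assumption. -->
<POEM>
  <POET LANGUAGE="English">Homer</POET>
  <!-- a second POET element could hold the name in Greek script -->
  Tell me, O Muse, of the cunning man...
</POEM>

<!-- Attribute form: the language of the name needs a second,
     separately named attribute (POET_LANGUAGE is hypothetical). -->
<POEM POET="Homer" POET_LANGUAGE="English">
  Tell me, O Muse, of the cunning man...
</POEM>
```

    In the element form, adding a second name means adding a second POET element; in the attribute form, every new fact about the poet needs yet another attribute on POEM.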

    What’s Your Meta-data Is Someone Else’s Data

    “Metaness” is in the mind of the beholder. Who is reading your document and why they are reading it determines what they consider to be data and what they consider to be meta-data. For example, if you’re simply reading an article in a scholarly journal, then the author of the article is tangential to the information it contains. However, if you’re sitting on a tenure and promotions committee scanning a journal to see who is publishing and who is not, then the names of the authors and the number of articles they’ve published may be more important to you than what they wrote (sad but true).

    In fact, you may change your mind about what’s meta and what’s data. What’s only tangentially relevant today may become crucial to you next week. You can use style sheets to hide unimportant elements today, and change the style sheets to reveal them later. However, it’s more difficult to later reveal information that was first stored in an attribute. Usually, this requires rewriting the document itself rather than simply changing the style sheet.

    Elements Are More Extensible

    Attributes are certainly convenient when you only need to convey one or two words of unstructured information. In these cases, there may genuinely be no current need for a child element. However, this doesn’t preclude such a need in the future. For instance, you may now only need to store the name of the author of an article, and you may not need to distinguish between the first and last names. However, in the future you may uncover a need to store first and last names, e-mail addresses, institution, snail mail address, URL, and more. If you’ve stored the author of the article as an element, then it’s easy to add child elements to include this additional information.


    Although any such change will probably require some revision of your documents, style sheets, and associated programs, it’s still much easier to change a simple element to a tree of elements than it is to make an attribute a tree of elements. However, if you used an attribute, then you’re stuck. It’s quite difficult to extend your attribute syntax beyond the region it was originally designed for.
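
    A minimal sketch of this kind of growth follows. The author, the child element names, and the e-mail address are all hypothetical; the point is only that an element can acquire children later, while an attribute value cannot.

```xml
<!-- Today: the author is a single unstructured string. -->
<AUTHOR>Jane Doe</AUTHOR>

<!-- Later: the same element grows child elements.
     All names and values below are made up for illustration. -->
<AUTHOR>
  <FIRST_NAME>Jane</FIRST_NAME>
  <LAST_NAME>Doe</LAST_NAME>
  <INSTITUTION>Example University</INSTITUTION>
  <EMAIL>jane.doe@example.com</EMAIL>
</AUTHOR>
```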

    Good Times to Use Attributes

    Having exhausted all the reasons why you should use elements instead of attributes, I feel compelled to point out that there are nonetheless some times when attributes make sense. First of all, as previously mentioned, attributes are fully appropriate for very simple data without substructure that the reader is unlikely to want to see. One example is the HEIGHT and WIDTH attributes of an IMG. Although the values of these attributes may change if the image changes, it’s hard to imagine how the data in the attribute could be anything more than a very short string of text. HEIGHT and WIDTH are one-dimensional quantities (in more ways than one) so they work well as attributes.

    Furthermore, attributes are appropriate for simple information about the document that has nothing to do with the content of the document. For example, it is often useful to assign an ID attribute to each element. This is a unique string possessed only by one element in the document. You can then use this string for a variety of tasks including linking to particular elements of the document, even if the elements move around as the document changes over time. For example:

    Donald Dewey Nicholas Acocella

    The Biographical History of Baseball

    169 1995
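
    One way such a citation might be marked up is sketched below. The element names and the ID values are assumptions made for illustration; only the text content comes from the example above.

```xml
<!-- Element names and ID values are hypothetical; each ID value
     must be unique within the document. -->
<SOURCE ID="S1">
  <AUTHOR ID="A1">Donald Dewey</AUTHOR>
  <AUTHOR ID="A2">Nicholas Acocella</AUTHOR>
  <BOOK_TITLE ID="B1">The Biographical History of Baseball</BOOK_TITLE>
  <PAGE ID="P1">169</PAGE>
  <YEAR ID="Y1">1995</YEAR>
</SOURCE>
```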

    ID attributes make links to particular elements in the document possible. In this way, they can serve the same purpose as the NAME attribute of HTML’s A elements. Other data associated with linking (HREFs to link to, SRCs to pull images and binary data from, and so forth) also work well as attributes.

    Cross-Reference: You’ll see more examples of this when XLL, the Extensible Linking Language, is discussed in Chapter 16, XLinks, and Chapter 17, XPointers.


    Attributes are also often used to store document-specific style information. For example, if TITLE elements are generally rendered as bold text but you want to make just one TITLE element both bold and italic, you might write something like this:

    Significant Others

    This enables the style information to be embedded without changing the tree structure of the document. While ideally you’d like to use a separate element, this scheme gives document authors somewhat more control when they cannot add elements to the tag set they’re working with. For example, the Webmaster of a site might require the use of a particular DTD and not want to allow everyone to modify the DTD. Nonetheless, the Webmaster may want to allow authors to make minor adjustments to individual pages. Use this scheme with restraint, however, or you’ll soon find yourself back in the HTML hell XML was supposed to save us from, where formatting is freely intermixed with meaning and documents are no longer maintainable.

    The final reason to use attributes is to maintain compatibility with HTML. To the extent that you’re using tags that at least look similar to their HTML counterparts, you might as well employ the standard HTML attributes for these tags. This has the double advantage of enabling legacy browsers to at least partially parse and display your document, and of being more familiar to the people writing the documents.

    Empty Tags

    Last chapter’s no-attribute approach was an extreme position. It’s also possible to swing to the other extreme: storing all the information in the attributes and none in the content. In general, I don’t recommend this approach. Storing all the information in element content, while equally extreme, is much easier to work with in practice. However, this section entertains the possibility of using only attributes for the sake of elucidation.

    As long as you know the element will have no content, you can use empty tags as a shortcut. Rather than including both a start and an end tag you can include one empty tag. Empty tags are distinguished from start tags by a closing /> instead of simply a closing >. For instance, instead of <PLAYER></PLAYER> you would write <PLAYER/>. Empty tags may contain attributes. For example, here’s an empty tag for Joe Girardi with several attributes:

    XML parsers treat this identically to the non-empty equivalent. This PLAYER element is precisely equal (though not identical) to the previous PLAYER element formed with an empty tag.

    The difference between <PLAYER/> and <PLAYER></PLAYER> is syntactic sugar, and nothing more. If you don’t like the empty tag syntax, or find it hard to read, you don’t have to use it.
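
    The equivalence can be sketched as follows. The attribute names match those this chapter uses for PLAYER elements; the statistics are omitted here rather than guessed, and the POSITION value is an assumption.

```xml
<!-- An empty tag: one tag instead of a start/end pair. -->
<PLAYER GIVEN_NAME="Joe" SURNAME="Girardi" POSITION="Catcher"/>

<!-- The same element written with separate start and end tags.
     A parser reports the same element either way. -->
<PLAYER GIVEN_NAME="Joe" SURNAME="Girardi" POSITION="Catcher"></PLAYER>
```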

    XSL

    Attributes are visible in an XML source view of the document as shown in Figure 5-1. However, once a CSS style sheet is applied the attributes disappear. Figure 5-3 shows Listing 5-1 once the baseball stats style sheet from the previous chapter is applied. It looks like a blank document because CSS styles only apply to element content, not to attributes. If you use CSS, any data you want to display to the reader should be part of an element’s content rather than one of its attributes.

    Figure 5-3: A blank document is displayed when CSS is applied to an XML document whose elements do not contain any character data.


    However, there is an alternative style sheet language that does allow you to access and display attribute data. This language is the Extensible Stylesheet Language (XSL), and it is also supported by Internet Explorer 5.0, at least in part.

    XSL is divided into two sections, transformations and formatting. The transformation part of XSL enables you to replace one tag with another. You can define rules that replace your XML tags with standard HTML tags, or with HTML tags plus CSS attributes. You can also do a lot more, including reordering the elements in the document and adding additional content that was never present in the XML document.

    The formatting part of XSL defines an extremely powerful view of documents as pages. XSL formatting enables you to specify the appearance and layout of a page including multiple columns, text flow around objects, line spacing, assorted font properties, and more. It’s designed to be powerful enough to handle automated layout tasks for both the Web and print from the same source document. For instance, XSL formatting would allow one XML document containing show times and advertisements to generate both the print and online editions of a local newspaper’s television listings. However, IE 5.0 and most other tools do not yet support XSL formatting. Therefore, in this section I’ll focus on XSL transformations.

    Cross-Reference: XSL formatting is discussed in Chapter 15, XSL Formatting Objects.

    XSL Style Sheet Templates

    An XSL style sheet contains templates into which data from the XML document is poured. For example, one template might look something like this:



    XSL Instructions to get the title

    XSL Instructions to get the title

    XSL Instructions to get the statistics

    The italicized sections will be replaced by particular XSL elements that copy data from the underlying XML document into this template. You can apply this template to many different data sets. For instance, if the template is designed to work with the baseball example, then the same style sheet can display statistics from different seasons.


    This may remind you of some server-side include schemes for HTML. In fact, this is very much like server-side includes. However, the actual transformation of the source XML document and XSL style sheet takes place on the client rather than on the server. Furthermore, the output document does not have to be HTML. It can be any well-formed XML.

    XSL instructions can retrieve any data stored in the elements of the XML document. This includes element content, element names, and, most importantly for our example, element attributes. Particular elements are chosen by a pattern that considers the element’s name, its value, its attributes’ names and values, its absolute and relative position in the tree structure of the XML document, and more. Once the data is extracted from an element, it can be moved, copied, and manipulated in a variety of ways. We won’t cover everything you can do with XSL transformations in this brief introduction. However, you will learn to use XSL to write some pretty amazing documents that can be viewed on the Web right away.

    Cross-Reference: Chapter 14, XSL Transformations, covers XSL transformations in depth.

    The Body of the Document

    Let’s begin by looking at a simple example and applying it to the XML document with baseball statistics shown in Listing 5-1. Listing 5-2 is an XSL style sheet. This style sheet provides the HTML mold into which XML data will be poured.

    Listing 5-2: An XSL style sheet



    Major League Baseball Statistics

    Major League Baseball Statistics

    Copyright 1999

    Elliotte Rusty Harold



    [email protected]



    It resembles an HTML file included inside an xsl:template element. In other words, its structure looks like this:



    HTML file goes here

    Listing 5-2 is not only an XSL style sheet; it’s also a well-formed XML document. It begins with an XML declaration. The root element of this document is xsl:stylesheet. This style sheet contains a single template for the XML data encoded as an xsl:template element. The xsl:template element has a match attribute with the value / and its content is a well-formed HTML document. It’s not a coincidence that the output HTML is well-formed. Because the HTML must first be part of an XSL style sheet, and because XSL style sheets are well-formed XML documents, all the HTML in an XSL style sheet must be well-formed.

    The Web browser tries to match parts of the XML document against each xsl:template element. The / template matches the root of the document; that is, the entire document itself. The browser reads the template and inserts data from the XML document where indicated by XSL instructions. However, this particular template contains no XSL instructions, so its contents are merely copied verbatim into the Web browser, producing the output you see in Figure 5-4. Notice that Figure 5-4 does not display any data from the XML document, only from the XSL template.
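
    The structure just described can be sketched as a skeleton style sheet. This is an illustration rather than a copy of Listing 5-2: the HTML inside the template is cut down to a title and a header, and the namespace is the working-draft one this chapter uses. The later XSLT 1.0 Recommendation uses a different namespace and requires a version attribute on xsl:stylesheet.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  <!-- "/" matches the root of the source document -->
  <xsl:template match="/">
    <!-- No XSL instructions inside, so this HTML is copied
         to the output exactly as written. -->
    <HTML>
      <HEAD>
        <TITLE>Major League Baseball Statistics</TITLE>
      </HEAD>
      <BODY>
        <H1>Major League Baseball Statistics</H1>
      </BODY>
    </HTML>
  </xsl:template>
</xsl:stylesheet>
```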


    Attaching the XSL style sheet of Listing 5-2 to the XML document in Listing 5-1 is straightforward. Simply add a processing instruction with a type attribute with value text/xsl and an href attribute that points to the style sheet between the XML declaration and the root element. For example:



    ...

    This is the same way a CSS style sheet is attached to a document. The only difference is that the type attribute is text/xsl instead of text/css.

    Figure 5-4: The data from the XML document, not the XSL template, is missing after application of the XSL style sheet in Listing 5-2.
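
    A minimal sketch of such a document prolog follows. The style sheet file name in href is hypothetical, and the body of the SEASON element is reduced to a comment.

```xml
<?xml version="1.0"?>
<!-- href is a hypothetical file name for the style sheet -->
<?xml-stylesheet type="text/xsl" href="baseball.xsl"?>
<SEASON YEAR="1998">
  <!-- leagues, divisions, teams, and players go here -->
</SEASON>
```

    Changing type to text/css and pointing href at a CSS file would attach a CSS style sheet in the same position.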

    The Title

    Of course there was something rather obvious missing from Figure 5-4: the data! Although the style sheet in Listing 5-2 displays something (unlike the CSS style sheet of Figure 5-3) it doesn’t show any data from the XML document. To add this, you need to use XSL instruction elements to copy data from the source XML document into the XSL template. Listing 5-3 adds the necessary XSL instructions to extract the YEAR attribute from the SEASON element and insert it in the TITLE and H1 header of the resulting document. Figure 5-5 shows the rendered document.


    Listing 5-3: An XSL style sheet with instructions to extract the SEASON element and YEAR attribute





    Major League Baseball Statistics



    Major League Baseball Statistics

    Copyright 1999

    Elliotte Rusty Harold




    [email protected]



    The new XSL instructions that extract the YEAR attribute from the SEASON element are:




    Figure 5-5: Listing 5-1 after application of the XSL style sheet in Listing 5-3

    These instructions appear twice because we want the year to appear twice in the output document, once in the H1 header and once in the TITLE. Each time they appear, these instructions do the same thing. <xsl:for-each select="SEASON"> finds all SEASON elements. <xsl:value-of select="@YEAR"/> inserts the value of the YEAR attribute of the SEASON element (that is, the string “1998”) found by <xsl:for-each select="SEASON">.

    This is important, so let me say it again: xsl:for-each selects a particular XML element in the source document (Listing 5-1 in this case) from which data will be read. xsl:value-of copies a particular part of the element into the output document. You need both XSL instructions. Neither alone is sufficient.

    XSL instructions are distinguished from output elements like HTML and H1 because the instructions are in the xsl namespace. That is, the names of all XSL elements begin with xsl:. The namespace is identified by the xmlns:xsl attribute of the root element of the style sheet. In Listings 5-2, 5-3, and all other examples in this book, the value of that attribute is http://www.w3.org/TR/WD-xsl.

    Cross-Reference: Namespaces are covered in depth in Chapter 18, Namespaces.

    Leagues, Divisions, and Teams

    Next, let’s add some XSL instructions to pull out the two LEAGUE elements. We’ll map these to H2 headers. Listing 5-4 demonstrates. Figure 5-6 shows the document rendered with this style sheet.
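
    As a sketch of how the pair of instructions fits into the H1 header, consider the fragment below. It assumes it sits inside an xsl:template in a style sheet that declares the xsl prefix, and that SEASON is the root element of the source document, as in Listing 5-1.

```xml
<H1>
  <!-- select the SEASON element from the source document -->
  <xsl:for-each select="SEASON">
    <!-- copy the value of its YEAR attribute into the output -->
    <xsl:value-of select="@YEAR"/>
  </xsl:for-each>
  Major League Baseball Statistics
</H1>
```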


    Listing 5-4: An XSL style sheet with instructions to extract LEAGUE elements





    Major League Baseball Statistics



    Major League Baseball Statistics





    Copyright 1999

    Elliotte Rusty Harold




    [email protected]



    Figure 5-6: The league names are displayed as H2 headers when the XSL style sheet in Listing 5-4 is applied.

    The key new materials are the nested xsl:for-each instructions:



    Major League Baseball Statistics





    The outermost instruction says to select the SEASON element. With that element selected, we then find the YEAR attribute of that element and place it between <H1> and </H1> along with the extra text Major League Baseball Statistics. Next, the browser loops through each LEAGUE child of the selected SEASON and places the value of its NAME attribute between <H2> and </H2>. Although there’s only one xsl:for-each matching a LEAGUE element, it loops over all the LEAGUE elements that are immediate children of the SEASON element. Thus, this template works for anywhere from zero to an indefinite number of leagues.

    The same technique can be used to assign H3 headers to divisions and H4 headers to teams. Listing 5-5 demonstrates the procedure and Figure 5-7 shows the document rendered with this style sheet. The names of the divisions and teams are read from the XML data.
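
    The nesting described above might be sketched like this. The fragment is assumed to live inside the template of a style sheet that declares the xsl prefix; it is an illustration of the pattern, not a copy of Listing 5-4.

```xml
<xsl:for-each select="SEASON">
  <H1>
    <xsl:value-of select="@YEAR"/>
    Major League Baseball Statistics
  </H1>
  <!-- runs once per LEAGUE child of the selected SEASON -->
  <xsl:for-each select="LEAGUE">
    <H2><xsl:value-of select="@NAME"/></H2>
  </xsl:for-each>
</xsl:for-each>
```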


    Listing 5-5: An XSL style sheet with instructions to extract DIVISION and TEAM elements





    Major League Baseball Statistics



    Major League Baseball Statistics













    Copyright 1999


    Elliotte Rusty Harold




    [email protected]



    Figure 5-7: Divisions and team names are displayed after application of the XSL style sheet in Listing 5-5.

    In the case of the TEAM elements, the values of both its CITY and NAME attributes are used as contents for the H4 header. Also notice that the nesting of the xsl:for-each elements that select seasons, leagues, divisions, and teams mirrors the hierarchy of the document itself. That’s not a coincidence. While other schemes are possible that don’t require matching hierarchies, this is the simplest, especially for highly structured data like the baseball statistics of Listing 5-1.
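
    A sketch of how the loops mirror the document hierarchy is shown below. It assumes the fragment sits inside the xsl:for-each that selects SEASON, and that DIVISION elements carry a NAME attribute in the same way LEAGUE and TEAM elements do.

```xml
<xsl:for-each select="LEAGUE">
  <H2><xsl:value-of select="@NAME"/></H2>
  <xsl:for-each select="DIVISION">
    <!-- the NAME attribute on DIVISION is an assumption -->
    <H3><xsl:value-of select="@NAME"/></H3>
    <xsl:for-each select="TEAM">
      <H4>
        <xsl:value-of select="@CITY"/>
        <xsl:value-of select="@NAME"/>
      </H4>
    </xsl:for-each>
  </xsl:for-each>
</xsl:for-each>
```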


    Players

    The next step is to add statistics for individual players on a team. The most natural way to do this is in a table. Listing 5-6 shows an XSL style sheet that arranges the players and their stats in a table. No new XSL elements are introduced. The same xsl:for-each and xsl:value-of elements are used on the PLAYER element and its attributes. The output is standard HTML table tags. Figure 5-8 displays the results.

    Listing 5-6: An XSL style sheet that places players and their statistics in a table





    Major League Baseball Statistics



    Major League Baseball Statistics
















    Player  P  G  GS  AB  R  H  D  T  HR  RBI  S  CS  SH  SF  E  BB  SO  HBP



















    Copyright 1999

    Elliotte Rusty Harold




    [email protected]



    Separation of Pitchers and Batters

    One discrepancy you may have noted in Figure 5-8 is that the pitchers aren’t handled properly. Throughout this chapter and Chapter 4, we’ve always given the pitchers a completely different set of statistics, whether those stats were stored in element content or attributes. Therefore, the pitchers really need a table that is separate from the other players. Before putting a player into the table, you must test whether he is or is not a pitcher. If his POSITION attribute contains the string “pitcher” then omit him. Then reverse the procedure in a second table that only includes pitchers, that is, PLAYER elements whose POSITION attribute contains the string “pitcher”.

    To do this, you have to add additional code to the xsl:for-each element that selects the players. You don’t select all players. Instead, you select those players whose POSITION attribute is not pitcher. The syntax looks like this:

    But because the XML document distinguishes between starting and relief pitchers, the true answer must test both cases:


    Figure 5-8: Player statistics are displayed after applying the XSL style sheet in Listing 5-6.

    For the table of pitchers, you logically reverse this to the position being equal to either “Starting Pitcher” or “Relief Pitcher”. (It is not sufficient to just change not equal to equal. You also have to change and to or.) The syntax looks like this:

    Note: Only a single equals sign is used to test for equality rather than the double equals sign used in C and Java. That’s because there’s no equivalent of an assignment operator in XSL.
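
    The two selections might be sketched as follows. Treat the operator spelling as an assumption: $and$ and $or$ are the forms used by the 1998 working-draft pattern syntax that Internet Explorer 5.0 implements, whereas the later XSLT 1.0 Recommendation writes plain and and or. The table cells are cut down to a single column for brevity.

```xml
<!-- Batters: players who are neither kind of pitcher -->
<xsl:for-each select="PLAYER[(@POSITION != 'Starting Pitcher') $and$ (@POSITION != 'Relief Pitcher')]">
  <TR><TD><xsl:value-of select="@SURNAME"/></TD></TR>
</xsl:for-each>

<!-- Pitchers: the logical reverse, so != becomes = and $and$ becomes $or$ -->
<xsl:for-each select="PLAYER[(@POSITION = 'Starting Pitcher') $or$ (@POSITION = 'Relief Pitcher')]">
  <TR><TD><xsl:value-of select="@SURNAME"/></TD></TR>
</xsl:for-each>
```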

    Listing 5-7 shows an XSL style sheet separating the batters and pitchers into two different tables. The pitchers’ table adds columns for all the usual pitcher statistics that Listing 5-1 encodes in attributes: wins, losses, saves, shutouts, and so on. Abbreviations are used for the column labels to keep the table to a manageable width. Figure 5-9 shows the results.


    Listing 5-7: An XSL style sheet that separates batters and pitchers





    Major League Baseball Statistics



    Major League Baseball Statistics









    Batters






    Player  P  G  GS  AB  R  H  D  T  HR  RBI  S  CS  SH  SF  E  BB  SO  HBP
















    Pitchers




    Player  P  G  GS  W  L  S  CG  SO  ERA  IP  HR  R  ER  HB  WP  B  BB  K

























    Copyright 1999

    Elliotte Rusty Harold




    [email protected]



    Figure 5-9: Pitchers are distinguished from other players after applying the XSL style sheet in Listing 5-7.


    Element Contents and the select Attribute

    In this chapter, I focused on using XSL to format data stored in the attributes of an element because it isn’t accessible when using CSS. However, XSL works equally well when you want to include an element’s character data rather than (or in addition to) its attributes. To indicate that an element’s text is to be copied into the output document, simply use the element’s name as the value of the select attribute of the xsl:value-of element. For example, consider, once again, Listing 5-8:

    Listing 5-8: greeting.xml

    Hello XML!

    Let’s suppose you want to copy the greeting “Hello XML!” into an H1 header. First, you use xsl:for-each to select the GREETING element:



    This alone is enough to copy the two H1 tags into the output. To place the text of the GREETING element between them, use xsl:value-of with no select attribute. Then, by default, the contents of the current element (GREETING) are selected. Listing 5-9 shows the complete style sheet.

    Listing 5-9: greeting.xsl
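
    A style sheet along these lines might look like the sketch below. It assumes the working-draft namespace used throughout this chapter, in which xsl:value-of with no select attribute copies the contents of the current element; the HTML and BODY wrapper is an assumption as well.

```xml
<?xml version="1.0"?>
<xsl:stylesheet xmlns:xsl="http://www.w3.org/TR/WD-xsl">
  <xsl:template match="/">
    <HTML>
      <BODY>
        <!-- select the GREETING element -->
        <xsl:for-each select="GREETING">
          <!-- no select attribute: the current element's text is used -->
          <H1><xsl:value-of/></H1>
        </xsl:for-each>
      </BODY>
    </HTML>
  </xsl:template>
</xsl:stylesheet>
```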









    You can also use select to choose the contents of a child element. Simply make the name of the child element the value of the select attribute of xsl:value-of. For instance, consider the baseball example from the previous chapter in which each player’s statistics were stored in child elements rather than in attributes. Given this structure of the document (which is actually far more likely than the attribute-based structure of this chapter) the XSL for the batters’ table looks like this:

    Batters








    Player  P  G  GS  AB  R  H  D  T  HR  RBI  S  CS  SH  SF  E  BB  SO  HBP














    In this case, within each PLAYER element, the contents of that element’s GIVEN_NAME, SURNAME, POSITION, GAMES, GAMES_STARTED, AT_BATS, RUNS, HITS, DOUBLES, TRIPLES, HOME_RUNS, RBI, STEALS, CAUGHT_STEALING, SACRIFICE_HITS, SACRIFICE_FLIES, ERRORS, WALKS, STRUCK_OUT, and HIT_BY_PITCH children are extracted and copied to the output. Since we used the same names for the attributes in this chapter as we did for the PLAYER child elements in the last chapter, this example is almost identical to the equivalent section of Listing 5-7. The main difference is that the @ signs are missing. They indicate an attribute rather than a child.

    You can do even more with the select attribute. You can select elements by position (for example the first, second, last, seventeenth element, and so forth); with particular contents; with specific attribute values; or whose parents or children have certain contents or attribute values. You can even apply a complete set of Boolean logical operators to combine different selection conditions. We will explore more of these possibilities when we return to XSL in Chapters 14 and 15.
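
    The difference the @ sign makes can be sketched with a single table row. The fragment assumes it sits inside a TABLE within a template that has already selected a TEAM, and it shows only three of the columns.

```xml
<xsl:for-each select="PLAYER">
  <TR>
    <TD>
      <!-- child elements: the select value is the bare element name -->
      <xsl:value-of select="GIVEN_NAME"/>
      <xsl:value-of select="SURNAME"/>
    </TD>
    <TD><xsl:value-of select="POSITION"/></TD>
    <TD><xsl:value-of select="GAMES"/></TD>
    <!-- one TD per remaining statistic follows the same pattern -->
  </TR>
</xsl:for-each>
```

    With the attribute-based document of this chapter, the same row would use select="@GIVEN_NAME", select="@SURNAME", and so on.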

    CSS or XSL?

    CSS and XSL overlap to some extent. XSL is certainly more powerful than CSS. However, XSL’s power is matched by its complexity. This chapter only touched on the basics of what you can do with XSL. XSL is more complicated, and harder to learn and use than CSS, which raises the question, “When should you use CSS and when should you use XSL?”

    CSS is more broadly supported than XSL. Parts of CSS Level 1 are supported for HTML elements by Netscape 4 and Internet Explorer 4 (although annoying differences exist). Furthermore, most of CSS Level 1 and some of CSS Level 2 is likely to be well supported by Internet Explorer 5.0 and Mozilla 5.0 for both XML and HTML. Thus, choosing CSS gives you more compatibility with a broader range of browsers.

    Additionally, CSS is more stable. CSS Level 1 (which covers all the CSS you’ve seen so far) and CSS Level 2 are W3C recommendations. XSL is still a very early working draft, and probably won’t be finalized until after this book is printed. Early adopters of XSL have already been burned once, and will be burned again before standards gel. Choosing CSS means you’re less likely to have to rewrite your style sheets from month to month just to track evolving software and standards. Eventually, however, XSL will settle down to a usable standard.

    Furthermore, since XSL is so new, different software implements different variations and subsets of the draft standard. At the time of this writing (spring 1999) there are at least three major variants of XSL in widespread use. Before this book is published, there will be more. If the incomplete and buggy implementations of CSS in current browsers bother you, the varieties of XSL will drive you insane.

    However, XSL is definitely more powerful than CSS. CSS only allows you to apply formatting to element contents. It does not allow you to change or reorder those contents; choose different formatting for elements based on their contents or attributes; or add simple, extra text like a signature block. XSL is far more appropriate when the XML documents contain only the minimum of data and none of the HTML frou-frou that surrounds the data.

    With XSL, you can separate the crucial data from everything else on the page, like mastheads, navigation bars, and signatures. With CSS, you have to include all these pieces in your data documents. XML+XSL allows the data documents to live separately from the Web page documents. This makes XML+XSL documents more maintainable and easier to work with.

    In the long run XSL should become the preferred choice for real-world, data-intensive applications. CSS is more suitable for simple pages like grandparents use to post pictures of their grandchildren. But for these uses, HTML alone is sufficient. If you’ve really hit the wall with HTML, XML+CSS doesn’t take you much further before you run into another wall. XML+XSL, by contrast, takes you far past the walls of HTML. You still need CSS to work with legacy browsers, but long-term XSL is the way to go.

    Summary

    In this chapter, you saw examples of creating an XML document from scratch. Specifically, you learned:

    ✦ Information can also be stored in an attribute of an element.

    ✦ An attribute is a name-value pair included in an element’s start tag.

    ✦ Attributes typically hold meta-information about the element rather than the element’s data.

    ✦ Attributes are less convenient to work with than the contents of an element.


    ✦ Attributes work well for very simple information that’s unlikely to change its form as the document evolves. In particular, style and linking information works well as an attribute.

    ✦ Empty tags offer syntactic sugar for elements with no content.

    ✦ XSL is a powerful style language that enables you to access and display attribute data and transform documents.

    In the next chapter, we’ll specify the exact rules that well-formed XML documents must adhere to. We’ll also explore some additional means of embedding information in XML documents including comments and processing instructions.







    7

    C H A P T E R

    Foreign Languages and Non-Roman Text









    In This Chapter

    Understanding the effects of non-Roman scripts on the Web
    Using scripts, character sets, fonts, and glyphs
    Legacy character sets
    Using the Unicode Character Set
    Writing XML in Unicode

    The Web is international, yet most of the text you’ll find on it is English. XML is starting to change this. XML provides full support for the double-byte Unicode character set, as well as its more compact representations. This is good news for Web authors because Unicode supports almost every character commonly used in every modern script on Earth. In this chapter, you’ll learn how international text is represented in computer applications, how XML understands text, and how you can take advantage of the software you have to read and write in languages other than English.



    Non-Roman Scripts on the Web

    Although the Web is international, much of its text is in English. Because of the Web’s expansiveness, however, you can still surf through Web pages in French, Spanish, Chinese, Arabic, Hebrew, Russian, Hindi, and other languages. Most of the time, though, these pages come out looking less than ideal. Figure 7-1 shows the October 1998 cover page of one of the United States Information Agency’s propaganda journals, Issues of Democracy (http://www.usia.gov/journals/itdhr/1098/ijdr/ijdr1098.htm), in Russian translation viewed in an English encoding. The red Cyrillic text in the upper left is a bitmapped image file so it’s legible (if you speak Russian) and so are a few words in English such as “Adobe Acrobat.” However, the rest of the text is mostly a bunch of accented Roman vowels, not the Cyrillic letters they are supposed to be.








    The quality of Web pages deteriorates even further when complicated, non-Western scripts like Chinese and Japanese are used. Figure 7-2 shows the home page for the Japanese translation of my book JavaBeans (IDG Books, 1997, http://www.ohmsha.co.jp/data/books/contents/4-274-06271-6.htm) viewed in an English browser. Once again the bitmapped image shows the proper Japanese (and English) text, but the rest of the text on the page looks almost like a random collection of characters except for a few recognizable English words like JavaBeans. The Kanji characters you’re supposed to see are completely absent.

    Figure 7-1: The Russian translation of the October 1998 issue of Issues of Democracy viewed in a Roman script

    These pages look as they’re intended to look if viewed with the right encoding and application software, and if the correct font is installed. Figure 7-3 shows Issues of Democracy viewed with the Windows 1251 encoding of Cyrillic. As you can see, the text below the picture is now readable (if you can read Russian).

    You can select the encoding for a Web page from the View/Encoding menu in Netscape Navigator or Internet Explorer. In an ideal world, the Web server would tell the Web browser what encoding to use, and the Web browser would listen. It would also be nice if the Web server could send the Web browser the fonts it needed to display the page. In practice, however, you often need to select the encoding manually, even trying several to find exactly the right one when more than one encoding is available for a script. For instance, a Cyrillic page might be encoded in Windows 1251, ISO 8859-5, or KOI8-R. Picking the wrong encoding may make Cyrillic letters appear, but the words will be gibberish.

    Figure 7-2: The Japanese translation of JavaBeans viewed in an English browser

    Figure 7-3: Issues of Democracy viewed in a Cyrillic script
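The effect of picking the wrong encoding can be reproduced in a few lines. The following is a minimal sketch in Python (not a language this book uses); the sample word and the variable names are illustrative assumptions, and the codec names are the ones Python's standard library uses for Windows 1251, Latin-1, and KOI8-R.

```python
# A sample Russian word ("hello"), written with escapes so the source stays ASCII.
original = "\u041f\u0440\u0438\u0432\u0435\u0442"

# The bytes a server might send for a page saved in Windows 1251.
raw = original.encode("cp1251")


def view_as(data, encoding):
    """Decode the same bytes under a chosen encoding, as an encoding menu would."""
    return data.decode(encoding)


right = view_as(raw, "cp1251")           # the encoding the page was written in
western = view_as(raw, "latin_1")        # a Western default: accented Roman letters
wrong_cyrillic = view_as(raw, "koi8_r")  # another Cyrillic encoding: Cyrillic gibberish

for label, text in [("cp1251", right), ("latin_1", western), ("koi8_r", wrong_cyrillic)]:
    print(f"{label:8} {text}")
```

The bytes never change; only the table used to interpret them does. That is why trying several encodings from the browser menu can eventually make the page readable.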


    Even when you can identify the encoding, there’s no guarantee you have fonts available to display it. Figure 7-4 shows the Japanese home page for JavaBeans with Japanese encoding, but without a Japanese font installed on the computer. Most of the characters in the text are shown as a box, which indicates an unavailable character glyph. Fortunately, Netscape Navigator can recognize that some of the bytes on the page are double-byte Japanese characters rather than two one-byte Western characters.

    Figure 7-4: The Japanese translation of JavaBeans in Kanji without the necessary fonts installed

    If you do have a Japanese localized edition of your operating system that includes the necessary fonts, or additional software like Apple’s Japanese Language Kit or NJStar’s NJWin (http://www.njstar.com/) that adds Japanese-language support to your existing system, you would be able to see the text more or less as it was meant to be seen, as shown in Figure 7-5.

    Note

    Of course, the higher quality fonts you use, the better the text will look. Chinese and Japanese fonts tend to be quite large (there are over 80,000 characters in Chinese alone) and the distinctions between individual ideographs can be quite subtle. Japanese publishers generally require higher-quality paper and printing than Western publishers, so they can maintain the fine detail necessary to print Japanese letters. Regrettably a 72-dpi computer monitor can’t do justice to most Japanese and Chinese characters unless they’re displayed at almost obscenely large point sizes.


    Figure 7-5: The Japanese translation of JavaBeans in Kanji with the necessary fonts installed

    Because each page can only have a single encoding, it is difficult to write a Web page that integrates multiple scripts, such as a French commentary on a Chinese text. For reasons such as this, the Web community needs a single, universal character set to display all characters for all computers and Web browsers. We don’t have such a character set yet, but XML and Unicode get as close as is currently possible.

    XML files are written in Unicode, a double-byte character set that can represent most characters in most of the world’s languages. If a Web page is written in Unicode, as XML pages are, and if the browser understands Unicode, as XML browsers should, then it’s not a problem for characters from different languages to be included on the same page.

    Furthermore, the browser doesn’t need to distinguish between different encodings like Windows 1251, ISO 8859-5, or KOI8-R. It can just assume everything’s written in Unicode. As long as the double-byte set has the space to hold all of the different characters, there’s no need to use more than one character set. Therefore there’s no need for browsers to try to detect which character set is in use.
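As an illustration of why one universal character set helps, here is a small Python sketch (an assumption of this rewrite, not code from the book). It takes a French sentence that quotes a Chinese character and checks which encodings can hold it. The sentence and the helper name `can_encode` are hypothetical.

```python
# French commentary that quotes a Chinese character (U+7F8A, "sheep").
mixed = "Commentaire fran\u00e7ais : \u7f8a signifie \u00ab mouton \u00bb."


def can_encode(text, encoding):
    """Return True if every character of text exists in the given encoding."""
    try:
        text.encode(encoding)
        return True
    except UnicodeEncodeError:
        return False


# Three single-byte legacy encodings mentioned in this chapter.
legacy = ["latin_1", "cp1251", "koi8_r"]
results = {name: can_encode(mixed, name) for name in legacy}

# Two encodings of Unicode; both can represent the whole sentence.
utf8_bytes = mixed.encode("utf_8")
utf16_bytes = mixed.encode("utf_16")

print(results)
print(len(utf8_bytes), "bytes in UTF-8;", len(utf16_bytes), "bytes in UTF-16")
```

None of the single-byte sets can represent the whole sentence, while either Unicode encoding round-trips it unchanged.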


    Scripts, Character Sets, Fonts, and Glyphs

    Most modern human languages have written forms. The set of characters used to write a language is called a script. A script may be a phonetic alphabet, but it doesn’t have to be. For instance, Chinese, Japanese, and Korean are written with ideographic characters that represent whole words. Different languages often share scripts, sometimes with slight variations. For instance, the modern Turkish alphabet is essentially the familiar Roman alphabet with three extra letters: ğ, ş, and ı. Chinese, Japanese, and Korean, on the other hand, share essentially the same 80,000 Han ideographs, though many characters have different meanings in the different languages.

    Note

    The word script is also often used to refer to programs written in weakly typed, interpreted languages like JavaScript, Perl, and TCL. In this chapter, the word script always refers to the characters used to write a language and not to any sort of program.

    Some languages can even be written in different scripts. Serbian and Croatian are virtually identical and are generally referred to as Serbo-Croatian. However, Serbian is written in a modified Cyrillic script, and Croatian is written in a modified Roman script. As long as a computer doesn’t attempt to grasp the meaning of the words it processes, working with a script is equivalent to working with any language that can be written in that script.

    Unfortunately, XML alone is not enough to read a script. For each script a computer processes, four things are required:

    1. A character set for the script
    2. A font for the character set
    3. An input method for the character set
    4. An operating system and application software that understand the character set

    If any of these four elements is missing, you won’t be able to work easily in the script, though XML does provide a work-around that’s adequate for occasional use. If the only thing your application is missing is an input method, you’ll be able to read text written in the script. You just won’t be able to write in it.

    A Character Set for the Script

    Computers only understand numbers. Before they can work with text, that text has to be encoded as numbers in a specified character set. For example, the popular ASCII character set encodes the capital letter ‘A’ as 65. The capital letter ‘B’ is encoded as 66. ‘C’ is 67, and so on.

    These are semantic encodings that provide no style or font information. A C set in a plain, italic, or bold face is still 67. Information about how the character is drawn is stored elsewhere.
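The mapping from characters to numbers is easy to inspect. The short Python sketch below is an illustration added here, not the book's code. It uses the built-in `ord` and `chr` functions and the standard `ascii` codec; the variable names are assumptions.

```python
# Each character is stored as a number; ASCII assigns 65, 66, 67 to A, B, C.
codes = [ord(ch) for ch in "ABC"]
print(codes)

# Going the other way: the number 67 always means the character C,
# regardless of the font or style used to draw it.
letter = chr(67)
print(letter)

# Encoding a short string as ASCII yields one byte (one number) per character.
encoded = list("XML".encode("ascii"))
print(encoded)
```

Nothing in these numbers says how the letters should look. That is the job of the font, described next.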

    A Font for the Character Set

    A font is a collection of glyphs for a character set, generally in a specific size, face, and style. For example, a plain C, an italic C, and a bold C are all the same character, but they are drawn with different glyphs. Nonetheless their essential meaning is the same.

    Exactly how the glyphs are stored varies from system to system. They may be bitmaps or vector drawings; they may even consist of hot lead on a printing press. The form they take doesn’t concern us here. The key idea is that a font tells the computer how to draw each character in the character set.

    An Input Method for the Character Set

    An input method enables you to enter text. English speakers don’t think much about the need for an input method for a script. We just type on our keyboards and everything’s hunky-dory. The same is true in most of Europe, where all that’s needed is a slightly modified keyboard with a few extra umlauts, cedillas, or thorns (depending on the country).

    Radically different character sets like Cyrillic, Hebrew, Arabic, and Greek are more difficult to input. There’s a finite number of keys on the keyboard, generally not enough for Arabic and Roman letters, or Roman and Greek letters. Assuming both are needed, though, a keyboard can have a Greek lock key that shifts the keyboard from Roman to Greek and back. Both Greek and Roman letters can be printed on the keys in different colors. The same scheme works for Hebrew, Arabic, Cyrillic, and other non-Roman alphabetic character sets.

    However, this scheme really breaks down when faced with ideographic scripts like Chinese and Japanese. Japanese keyboards can have in the ballpark of 5,000 different keys, and that’s still less than 10% of the language! Syllabic, phonetic, and radical representations exist that can reduce the number of keys, but it is questionable whether a keyboard is really an appropriate means of entering text in these languages. Reliable speech and handwriting recognition have even greater potential in Asia than in the West.

    Since speech and handwriting recognition still haven’t reached the reliability of even a mediocre typist like myself, most input methods today map sequences of several keys on the keyboard to a single character. For example, to type the Chinese character for sheep, you might hold down the Alt key and type a tilde (~), then type yang, then hit the Enter key. The input method would then present you with a list of words that are pronounced more or less like yang.

    [Figure: the list of candidate characters presented by the input method]

    You would then choose the character you wanted, 羊. The exact details of both the GUI and the transliteration system used to convert typed keys like yang to ideographic characters like 羊 vary from program to program, operating system to operating system, and language to language.
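The lookup step of such an input method can be sketched in a few lines of Python. Everything here is a toy assumption: the candidate table, its contents, and the function names `candidates_for` and `choose` are hypothetical, and a real input method uses a far larger dictionary and a graphical picker.

```python
# Hypothetical candidate table mapping a typed transliteration to characters
# that are pronounced roughly that way. A real table holds thousands of entries.
CANDIDATES = {
    "yang": ["\u7f8a", "\u967d", "\u6d0b", "\u694a", "\u990a"],
    "ma": ["\u99ac", "\u5abd", "\u55ce"],
}


def candidates_for(keys):
    """Return the characters offered for a typed key sequence (empty if unknown)."""
    return CANDIDATES.get(keys.lower(), [])


def choose(keys, index):
    """Simulate the user picking one candidate from the list by position."""
    options = candidates_for(keys)
    if not 0 <= index < len(options):
        raise IndexError(f"no candidate {index} for {keys!r}")
    return options[index]


print(candidates_for("yang"))  # the list the user would be shown
print(choose("yang", 0))       # the character for sheep, U+7F8A
```

Several keystrokes plus one selection produce a single character. That is the trade-off ideographic input methods make for fitting on an ordinary keyboard.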

    Operating System and Application Software

    As of this writing, the major Web browsers (Netscape Navigator and Internet Explorer) do a surprisingly good job of displaying non-Roman scripts. Provided the underlying operating system supports a given script and has the right fonts installed, a Web browser can probably display it.

    Mac OS 7.1 and later can handle most common scripts in the world today. However, the base operating system only supports Western European languages. Chinese, Japanese, Korean, Arabic, Hebrew, and Cyrillic are available as language kits that cost about $100 apiece. Each provides fonts and input methods for languages written in those scripts. There’s also an Indian language kit, which handles the Devanagari, Gujarati, and Gurmukhi scripts common on the Indian subcontinent. Mac OS 8.5 adds optional, limited support for Unicode (which most applications don’t yet take advantage of).

    Windows NT 4.0 uses Unicode as its native character set. NT 4.0 does a fairly good job with Roman languages, Cyrillic, Greek, Hebrew, and a few others. The Lucida Sans Unicode font covers about 1300 of the most common of Unicode’s 40,000 or so characters. Microsoft Office 97 includes Chinese, Japanese, and Korean fonts that you can install to read text in these languages. (Look in the Fareast folder in the Valupack folder on your Office CD-ROM.)

    Microsoft claims Windows 2000 (previously known as NT 5.0) will also include fonts covering most of the Chinese-Japanese-Korean ideographs, as well as input methods for these scripts. However, they also promised that Windows 95 would include Unicode support, and that got dropped before shipment. Consequently, I’m not holding my breath. Certainly, it would be nice if they do provide full international support in all versions of NT rather than relying on localized systems.

    Microsoft’s consumer operating systems, Windows 3.1, 95, and 98, do not fully support Unicode. Instead they rely on localized systems that can only handle basic English characters plus the localized script.

    The major Unix variants have varying levels of support for Unicode. Solaris 2.6 supports European languages, Greek, and Cyrillic. Chinese, Japanese, and Korean are supported by localized versions using different encodings rather than Unicode. Linux has embryonic support for Unicode, which may grow to something useful in the near future.


    Legacy Character Sets

    Different computers in different locales use different default character sets. Most modern computers use a superset of the ASCII character set. ASCII encodes the English alphabet and the most common punctuation and whitespace characters.

    In the United States, Macs use the MacRoman character set, Windows PCs use a character set called Windows ANSI, and most Unix workstations use ISO Latin-1. These are all extensions of ASCII that support additional characters like ç and ¿ that are needed for Western European languages like French and Spanish. In other locales like Japan, Greece, and Israel, computers use a still more confusing hodgepodge of character sets that mostly support ASCII plus the local language.

    This doesn’t work on the Internet. It’s unlikely that while you’re reading the San Jose Mercury News you’ll turn the page and be confronted with several columns written in German or Chinese. However, on the Web it’s entirely possible a user will follow a link and end up staring at a page of Japanese. Even if the surfer can’t read Japanese, it would still be nice if they saw a correct version of the language, as seen in Figure 7-5, instead of a random collection of characters like those shown in Figure 7-2.

    XML addresses this problem by moving beyond small, local character sets to one large set that’s supposed to encompass all scripts used in all living languages (and a few dead ones) on planet Earth. This character set is called Unicode. As previously noted, Unicode is a double-byte character set that provides representations of over 40,000 different characters in dozens of scripts and hundreds of languages. All XML processors are required to understand Unicode, even if they can’t fully display it.

    As you learned in Chapter 6, an XML document is divided into text and binary entities. Each text entity has an encoding. If the encoding is not explicitly specified in the entity’s definition, then the default is UTF-8, a compressed form of Unicode that leaves pure ASCII text unchanged. Thus XML files that contain nothing but the common ASCII characters may be edited with tools that are unaware of the complications of dealing with multi-byte character sets like Unicode.
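The claim that UTF-8 leaves pure ASCII text unchanged can be checked directly. This Python sketch is an added illustration under stated assumptions: the sample document and variable names are hypothetical. It compares the ASCII and UTF-8 bytes of an ASCII-only XML fragment, then shows how many bytes UTF-8 needs for a few non-ASCII characters.

```python
# An XML fragment that uses nothing but ASCII characters.
ascii_doc = '<?xml version="1.0"?><greeting>Hello</greeting>'

# For ASCII-only text, the UTF-8 bytes are identical to the ASCII bytes.
same_bytes = ascii_doc.encode("utf_8") == ascii_doc.encode("ascii")
print("UTF-8 equals ASCII for this document:", same_bytes)

# Characters outside ASCII take two or more bytes each in UTF-8.
samples = ["A", "\u00e9", "\u042f", "\u7f8a"]  # A, e-acute, Cyrillic Ya, Han "sheep"
lengths = {ch: len(ch.encode("utf_8")) for ch in samples}
for ch, n in lengths.items():
    print(f"U+{ord(ch):04X} takes {n} byte(s) in UTF-8")
```

An ASCII-only editor therefore sees an ASCII-only XML file exactly as it expects. Only documents that use other characters need Unicode-aware tools.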

    The ASCII Character Set

    ASCII, the American Standard Code for Information Interchange, is one of the original character sets, and is by far the most common. It forms a sort of lowest common denominator for what a character set must support. It defines all the characters needed to write U.S. English, and essentially nothing else. The characters are encoded as the numbers 0 through 127. Table 7-1 presents the ASCII character set.


    Table 7-1 The ASCII Character Set

    Code  Character                               Code  Character  Code  Character  Code  Character
    0     null (Control-@)                        32    Space      64    @          96    `
    1     start of heading (Control-A)            33    !          65    A          97    a
    2     start of text (Control-B)               34    "          66    B          98    b
    3     end of text (Control-C)                 35    #          67    C          99    c
    4     end of transmission (Control-D)         36    $          68    D          100   d
    5     enquiry (Control-E)                     37    %          69    E          101   e
    6     acknowledge (Control-F)                 38    &          70    F          102   f
    7     bell (Control-G)                        39    '          71    G          103   g
    8     backspace (Control-H)                   40    (          72    H          104   h
    9     tab (Control-I)                         41    )          73    I          105   i
    10    linefeed (Control-J)                    42    *          74    J          106   j
    11    vertical tab (Control-K)                43    +          75    K          107   k
    12    formfeed (Control-L)                    44    ,          76    L          108   l
    13    carriage return (Control-M)             45    -          77    M          109   m
    14    shift out (Control-N)                   46    .          78    N          110   n
    15    shift in (Control-O)                    47    /          79    O          111   o
    16    data link escape (Control-P)            48    0          80    P          112   p
    17    device control 1 (Control-Q)            49    1          81    Q          113   q
    18    device control 2 (Control-R)            50    2          82    R          114   r
    19    device control 3 (Control-S)            51    3          83    S          115   s
    20    device control 4 (Control-T)            52    4          84    T          116   t
    21    negative acknowledge (Control-U)        53    5          85    U          117   u
    22    synchronous idle (Control-V)            54    6          86    V          118   v
    23    end of transmission block (Control-W)   55    7          87    W          119   w
    24    cancel (Control-X)                      56    8          88    X          120   x
    25    end of medium (Control-Y)               57    9          89    Y          121   y
    26    substitute (Control-Z)                  58    :          90    Z          122   z
    27    escape (Control-[)                      59    ;          91    [          123   {
    28    file separator (Control-\)              60    <          92    \          124   |
    29    group separator (Control-])             61    =          93    ]          125   }
    30    record separator (Control-^)            62    >          94    ^          126   ~
    31    unit separator (Control-_)              63    ?          95    _          127   delete

    Characters 0 through 31 are non-printing control characters. They include the carriage return, the linefeed, the tab, the bell, and similar characters. Many of these are leftovers from the days of paper-based teletype terminals. For instance, carriage return used to literally mean move the carriage back to the left margin, as you’d do on a typewriter. Linefeed moved the platen up one line. Aside from the few control characters mentioned, these aren’t used much anymore.

    Most other character sets you’re likely to encounter are supersets of ASCII. In other words, they define 0 through 127 exactly the same as ASCII, but add additional characters from 128 on up.
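The superset relationship can be verified mechanically. The Python sketch below is an added illustration; the list of encodings is an assumption chosen from the sets discussed in this chapter. It decodes bytes 0 through 127 under several encodings and confirms that each one agrees with ASCII.

```python
# Every byte value that ASCII defines.
ascii_bytes = bytes(range(128))
as_ascii = ascii_bytes.decode("ascii")

# Python codec names for Latin-1, Windows ANSI, MacRoman, and two Cyrillic sets.
supersets = ["latin_1", "cp1252", "mac_roman", "iso8859_5", "koi8_r"]

# A set is an ASCII superset (in this range) if it decodes 0-127 identically.
agreement = {name: ascii_bytes.decode(name) == as_ascii for name in supersets}
print(agreement)

# A few of the control characters that are still in everyday use.
control_codes = {"tab": ord("\t"), "linefeed": ord("\n"), "carriage return": ord("\r")}
print(control_codes)
```

The sets only begin to disagree at 128 and above, which is where the tables in the following sections differ.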


    The ISO Character Sets

    The A in ASCII stands for American, so it shouldn’t surprise you that ASCII is only adequate for writing English, and strictly American English at that. ASCII contains no £, ü, ¿, or many other characters you might want for writing in other languages or locales.

    ASCII can be extended by assigning additional characters to the numbers from 128 up. The International Organization for Standardization (ISO) has defined a number of different character sets based on ASCII that add additional characters needed for other languages and locales. The most prominent such character set is ISO 8859-1, commonly called Latin-1. Latin-1 includes enough additional characters to write essentially all Western European languages. Characters 0 through 127 are the same as they are in ASCII. Characters 128 through 255 are given in Table 7-2. Again, the first 32 of these characters (128 through 159) are mostly unused, non-printing control characters.

    Table 7-2 The ISO 8859-1 Latin-1 Character Set

    Code  Character  Code  Character             Code  Character  Code  Character
    128   Undefined  160   non-breaking space    192   À          224   à
    129   Undefined  161   ¡                     193   Á          225   á
    130   BPH        162   ¢                     194   Â          226   â
    131   NBH        163   £                     195   Ã          227   ã
    132   Undefined  164   ¤                     196   Ä          228   ä
    133   NEL        165   ¥                     197   Å          229   å
    134   SSA        166   ¦                     198   Æ          230   æ
    135   ESA        167   §                     199   Ç          231   ç
    136   HTS        168   ¨                     200   È          232   è
    137   HTJ        169   ©                     201   É          233   é
    138   VTS        170   ª                     202   Ê          234   ê
    139   PLD        171   «                     203   Ë          235   ë
    140   PLU        172   ¬                     204   Ì          236   ì
    141   RI         173   discretionary hyphen  205   Í          237   í
    142   SS2        174   ®                     206   Î          238   î
    143   SS3        175   ¯                     207   Ï          239   ï
    144   DCS        176   °                     208   Ð          240   ð
    145   PU1        177   ±                     209   Ñ          241   ñ
    146   PU2        178   ²                     210   Ò          242   ò
    147   STS        179   ³                     211   Ó          243   ó
    148   CCH        180   ´                     212   Ô          244   ô
    149   MW         181   µ                     213   Õ          245   õ
    150   SPA        182   ¶                     214   Ö          246   ö
    151   EPA        183   ·                     215   ×          247   ÷
    152   SOS        184   ¸                     216   Ø          248   ø
    153   Undefined  185   ¹                     217   Ù          249   ù
    154   SCI        186   º                     218   Ú          250   ú
    155   CSI        187   »                     219   Û          251   û
    156   ST         188   ¼                     220   Ü          252   ü
    157   OSC        189   ½                     221   Ý          253   ý
    158   PM         190   ¾                     222   Þ          254   þ
    159   APC        191   ¿                     223   ß          255   ÿ

    Latin-1 still lacks many useful characters, including those needed for Greek, Cyrillic, Chinese, and many other scripts and languages. You might think these could just be moved into the numbers from 256 up. However, there’s a catch. A single byte can only hold values from 0 to 255. To go beyond that, you need to use a multi-byte character set. For historical reasons most programs are written under the assumption that characters and bytes are identical, and they tend to break when faced with multi-byte character sets. Therefore, most current operating systems (Windows NT being the notable exception) use different, single-byte character sets rather than one large multi-byte set. Latin-1 is the most common such set, but other sets are needed to handle additional languages.

    ISO 8859 defines ten other character sets (8859-2 through 8859-10 and 8859-15) suitable for different scripts, with four more (8859-11 through 8859-14) in active development. Table 7-3 lists the ISO character sets and the languages and scripts they can be used for. All share the same ASCII characters from 0 to 127, and then each includes additional characters from 128 to 255.
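The single-byte ceiling is easy to see in practice. The following Python sketch is an added illustration with hypothetical variable names. It shows that a byte cannot hold 256, that the Greek capital omega has no slot in Latin-1, and that the Greek-specific set ISO 8859-7 fits it into one byte by reusing a number that Latin-1 spends on something else.

```python
# A byte holds 0 through 255 and nothing larger.
largest = bytes([255])
try:
    bytes([256])
    overflowed = False
except ValueError:
    overflowed = True
print("256 does not fit in one byte:", overflowed)

omega = "\u03a9"  # Greek capital letter omega

# Latin-1 has no code for omega at all.
try:
    omega.encode("latin_1")
    in_latin1 = True
except UnicodeEncodeError:
    in_latin1 = False
print("omega in Latin-1:", in_latin1)

# ISO 8859-7 (ASCII plus Greek) does, and still uses a single byte.
greek_bytes = omega.encode("iso8859_7")
print("omega in ISO 8859-7:", list(greek_bytes))

# The same byte value means a different character when read as Latin-1.
reread = greek_bytes.decode("latin_1")
print("same byte read as Latin-1:", reread)
```

Each single-byte set is a different answer to the question of which 128 extra characters are worth having. That is why so many of them exist.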


    Table 7-3 The ISO Character Sets

    Character Set  Also Known As     Languages
    ISO 8859-1     Latin-1           ASCII plus the characters required for most Western European languages including Albanian, Afrikaans, Basque, Catalan, Danish, Dutch, English, Faroese, Finnish, Flemish, Galician, German, Icelandic, Irish, Italian, Norwegian, Portuguese, Scottish, Spanish, and Swedish. However it omits the ligatures ĳ (Dutch), Œ (French), and German quotation marks.
    ISO 8859-2     Latin-2           ASCII plus the characters required for most Central European languages including Czech, English, German, Hungarian, Polish, Romanian, Croatian, Slovak, Slovene, and Sorbian.
    ISO 8859-3     Latin-3           ASCII plus the characters required for English, Esperanto, German, Maltese, and Galician.
    ISO 8859-4     Latin-4           ASCII plus the characters required for the Baltic languages Latvian, Lithuanian, German, Greenlandic, and Lappish; superseded by ISO 8859-10, Latin-6.
    ISO 8859-5                       ASCII plus Cyrillic characters required for Byelorussian, Bulgarian, Macedonian, Russian, Serbian, and Ukrainian.
    ISO 8859-6                       ASCII plus Arabic.
    ISO 8859-7                       ASCII plus Greek.
    ISO 8859-8                       ASCII plus Hebrew.
    ISO 8859-9     Latin-5           Latin-1 except that the Turkish letters İ, ı, Ş, ş, Ğ, and ğ take the place of the less commonly used Icelandic letters Ý, ý, Þ, þ, Ð, and ð.
    ISO 8859-10    Latin-6           ASCII plus characters for the Nordic languages Lithuanian, Inuit (Greenlandic Eskimo), non-Skolt Sami (Lappish), and Icelandic.
    ISO 8859-11                      ASCII plus Thai.
    ISO 8859-12                      This may eventually be used for ASCII plus Devanagari (Hindi, Sanskrit, etc.) but no proposal is yet available.
    ISO 8859-13    Latin-7           ASCII plus the Baltic Rim, particularly Latvian.
    ISO 8859-14    Latin-8           ASCII plus Gaelic and Welsh.
    ISO 8859-15    Latin-9, Latin-0  Essentially the same as Latin-1 but with a Euro sign (€) instead of the international currency sign (¤). Furthermore, the Finnish characters Š, š, Ž, and ž replace the uncommon symbols ¦, ¨, ´, and ¸. And the French Œ, œ, and Ÿ characters replace the fractions ¼, ½, and ¾.


    These sets often overlap. Several languages, most notably English and German, can be written in more than one of the character sets. To some extent the different sets are designed to allow different combinations of languages. For instance, Latin-1 can combine most Western languages and Icelandic, whereas Latin-5 combines most Western languages with Turkish instead of Icelandic. Thus if you needed a document in English, French, and Icelandic, you’d use Latin-1, while a document containing English, French, and Turkish would use Latin-5. A document that required English, Hebrew, and Turkish, however, would have to use Unicode, since no single-byte character set handles all three languages and scripts.

    A single-byte set is insufficient for Chinese, Japanese, and Korean. These languages have more than 256 characters apiece, so they must use multi-byte character sets.
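The choice described above can be automated by simply trying each candidate set. This Python sketch is an added illustration: the sample phrases and the function name `encodings_that_fit` are hypothetical, and the codec names are Python's names for Latin-1, Latin-5 (ISO 8859-9), ISO 8859-8, and UTF-8.

```python
def encodings_that_fit(text, candidates):
    """Return the candidate encodings able to represent every character of text."""
    fits = []
    for name in candidates:
        try:
            text.encode(name)
        except UnicodeEncodeError:
            continue
        fits.append(name)
    return fits


CANDIDATES = ["latin_1", "iso8859_9", "iso8859_8", "utf_8"]

# English + French + Icelandic, English + French + Turkish, English + Hebrew + Turkish.
icelandic_mix = "English, français, íslenska: þú og ég"
turkish_mix = "English, français, Türkçe: ışık"
hebrew_mix = "English, \u05e9\u05dc\u05d5\u05dd, Türkçe: ışık"

for sample in (icelandic_mix, turkish_mix, hebrew_mix):
    print(encodings_that_fit(sample, CANDIDATES))
```

Only the Unicode encoding survives the third combination, which mirrors the English, Hebrew, and Turkish case in the text.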

    The MacRoman Character Set

    The Mac OS predates Latin-1 by several years. (The ISO 8859-1 standard was first adopted in 1987. The first Mac was released in 1984.) Unfortunately this means that Apple had to define its own extended character set called MacRoman. MacRoman has most of the same extended characters as Latin-1 (except for the Icelandic letters Þ, þ, and ð), but the characters are assigned to different numbers. MacRoman is the same as ASCII and Latin-1 in codes 0 through 127; the two sets differ from 128 up. This is one reason text files that use extended characters often look funny when moved from a PC to a Mac or vice versa. Table 7-4 lists the upper half of the MacRoman character set.

    Table 7-4 The MacRoman Character Set

    Code  Character  Code  Character  Code  Character                 Code  Character
    128   Ä          160   †          192   ¿                         224   ‡
    129   Å          161   °          193   ¡                         225   ·
    130   Ç          162   ¢          194   ¬                         226   ‚
    131   É          163   £          195   √                         227   „
    132   Ñ          164   §          196   ƒ                         228   ‰
    133   Ö          165   •          197   ≈                         229   Â
    134   Ü          166   ¶          198   ∆                         230   Ê
    135   á          167   ß          199   «                         231   Á
    136   à          168   ®          200   »                         232   Ë
    137   â          169   ©          201   …                         233   È
    138   ä          170   ™          202   non-breaking space        234   Í
    139   ã          171   ´          203   À                         235   Î
    140   å          172   ¨          204   Ã                         236   Ï
    141   ç          173   ≠          205   Õ                         237   Ì
    142   é          174   Æ          206   Œ                         238   Ó
    143   è          175   Ø          207   œ                         239   Ô
    144   ê          176   ∞          208   – (en dash)               240   Apple logo
    145   ë          177   ±          209   — (em dash)               241   Ò
    146   í          178   ≤          210   “                         242   Ú
    147   ì          179   ≥          211   ”                         243   Û
    148   î          180   ¥          212   ‘                         244   Ù
    149   ï          181   µ          213   ’                         245   ı
    150   ñ          182   ∂          214   ÷                         246   ˆ
    151   ó          183   ∑          215   ◊                         247   ˜
    152   ò          184   ∏          216   ÿ                         248   ¯
    153   ô          185   π          217   Ÿ                         249   ˘
    154   ö          186   ∫          218   ⁄                         250   ˙
    155   õ          187   ª          219   ¤ (€ as of Mac OS 8.5)    251   ˚
    156   ú          188   º          220   ‹                         252   ¸
    157   ù          189   Ω          221   ›                         253   ˝
    158   û          190   æ          222   ﬁ                         254   ˛
    159   ü          191   ø          223   ﬂ                         255   ˇ
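The "funny-looking" text that results from moving files between a PC and a Mac can be reproduced with two standard Python codecs. This sketch is an added illustration; the sample word and variable names are assumptions.

```python
word = "caf\u00e9"  # "cafe" with an acute accent on the e

# The same word is stored as different bytes on each platform.
pc_bytes = word.encode("latin_1")     # e-acute is 233 (0xE9) in Latin-1
mac_bytes = word.encode("mac_roman")  # e-acute is 142 (0x8E) in MacRoman
print(list(pc_bytes), list(mac_bytes))

# A Latin-1 file opened on a Mac: byte 233 is E-grave in MacRoman.
pc_file_on_mac = pc_bytes.decode("mac_roman")
print(pc_file_on_mac)

# A MacRoman file opened as Latin-1: byte 142 is a non-printing control code.
mac_file_on_pc = mac_bytes.decode("latin_1")
print(repr(mac_file_on_pc))
```

The ASCII letters survive the trip in both directions; only the character above 127 changes. That matches the point that the two sets agree on codes 0 through 127 and diverge above them.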

    The Windows ANSI Character Set

    The first version of Windows to achieve widespread adoption followed the Mac by a few years, so it was able to adopt the Latin-1 character set. However, it replaced the non-printing control characters between 130 and 159 with more printing characters to stretch the available range a little further. This modified version of Latin-1 is generally called “Windows ANSI.” Table 7-5 lists the Windows ANSI characters.


    Table 7-5 The Windows ANSI Character Set

    Code | Character | Code | Character | Code | Character | Code | Character
    128 | Undefined | 136 | ˆ | 144 | Undefined | 152 | ˜
    129 | Undefined | 137 | ‰ | 145 | ‘ | 153 | ™
    130 | ‚ | 138 | Š | 146 | ’ | 154 | š
    131 | ƒ | 139 | ‹ | 147 | “ | 155 | ›
    132 | „ | 140 | Œ | 148 | ” | 156 | œ
    133 | … | 141 | Undefined | 149 | • | 157 | Undefined
    134 | † | 142 | Undefined | 150 | – | 158 | Undefined
    135 | ‡ | 143 | Undefined | 151 | — | 159 | Ÿ
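    The practical effect of these platform differences is that one byte value can stand for different characters depending on which character set is assumed. The following is a minimal sketch, assuming Python's built-in codecs named mac_roman, cp1252 (Windows ANSI), and latin-1; the helper name decode_byte is made up for illustration.

```python
def decode_byte(value, charset):
    """Decode a single byte value (0-255) using the named character set."""
    return bytes([value]).decode(charset)

# Code 142 is a lowercase e with an acute accent in MacRoman.
mac_char = decode_byte(142, "mac_roman")
# Code 147 is a left curly quote in Windows ANSI (code page 1252).
win_char = decode_byte(147, "cp1252")
# The same code in Latin-1 is a non-printing control character.
latin1_char = decode_byte(147, "latin-1")

print(repr(mac_char), repr(win_char), repr(latin1_char))
```

    Codes from 160 through 255 decode identically in Latin-1 and Windows ANSI, which is why text exchanged between those two sets usually breaks only on characters such as curly quotes and dashes.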

    The Unicode Character Set

    Using different character sets for different scripts and languages works well enough as long as:

    1. You don’t need to work in more than one script at once.
    2. You never trade files with anyone using a different character set.

    Since Macs and PCs use different character sets, more people fail these criteria than not. Obviously what is needed is a single character set that everyone agrees on and that encodes all characters in all the world’s scripts. Creating such a set is difficult. It requires a detailed understanding of hundreds of languages and their scripts. Getting software developers to agree to use that set once it’s been created is even harder. Nonetheless, work is ongoing to create exactly such a set, called Unicode, and the major vendors (Microsoft, Apple, IBM, Sun, Be, and many others) are slowly moving toward complying with it. XML specifies Unicode as its default character set.

    Unicode encodes each character as a two-byte unsigned number with a value between 0 and 65,535. Currently a few more than 40,000 different Unicode characters are defined. The remaining 25,000 spaces are reserved for future extensions. About 20,000 of the characters are used for the Han ideographs and another 11,000 or so are used for the Korean Hangul syllables. The remainder of the characters encode most of the rest of the world’s languages. Unicode characters 0 through 255 are identical to Latin-1 characters 0 through 255.

    I’d love to show you a table of all the characters in Unicode, but if I did, this book would consist entirely of this table and not much else. If you need to know more about the specific encodings of the different characters in Unicode, get a copy of The Unicode Standard (second edition, ISBN 0-201-48346-9, from Addison-Wesley). This 950-page book includes the complete Unicode 2.0 specification, including character charts for all the different characters defined in Unicode 2.0. You can also find information online at the Unicode Consortium Web site at http://www.unicode.org/ and http://charts.unicode.org/.

    Table 7-6 lists the different scripts encoded by Unicode, which should give you some idea of Unicode’s versatility. The characters of each script are generally encoded in a consecutive sub-range (block) of the 65,536 code points in Unicode. Most languages can be written with the characters in one of these blocks (for example, Russian can be written with the Cyrillic block), though some languages like Croatian or Turkish may need to mix and match characters from the first four Latin blocks.

    Table 7-6 Unicode Script Blocks

    Script | Range | Purpose
    Basic Latin | 0-127 | ASCII, American English.
    Latin-1 Supplement | 128-255 | Upper half of ISO Latin-1; in conjunction with the Basic Latin block, can handle Danish, Dutch, English, Faroese, Flemish, German, Hawaiian, Icelandic, Indonesian, Irish, Italian, Norwegian, Portuguese, Spanish, Swahili, and Swedish.
    Latin Extended-A | 256-383 | This block adds the characters from the ISO 8859 sets Latin-2, Latin-3, Latin-4, and Latin-5 not already found in the Basic Latin and Latin-1 blocks. In conjunction with those blocks, this block can encode Afrikaans, Breton, Basque, Catalan, Czech, Esperanto, Estonian, French, Frisian, Greenlandic, Hungarian, Latvian, Lithuanian, Maltese, Polish, Provençal, Rhaeto-Romanic, Romanian, Romany, Slovak, Slovenian, Sorbian, Turkish, and Welsh.
    Latin Extended-B | 384-591 | Mostly characters needed to extend the Latin script to handle languages not traditionally written in this script; includes many African languages, Croatian digraphs to match Serbian Cyrillic letters, the Pinyin transcription of Chinese, and the Sami characters from Latin-10.
    IPA Extensions | 592-687 | The International Phonetic Alphabet.
    Spacing Modifier Letters | 688-767 | Small symbols that somehow change (generally phonetically) the previous letter.
    Combining Diacritical Marks | 768-879 | Diacritical marks like ~ and ` that will somehow be combined with the previous character (most commonly, placed on top of it) rather than drawn as a separate character.
    Greek | 880-1023 | Modern Greek, based on ISO 8859-7; also provides characters for Coptic.
    Cyrillic | 1024-1279 | Russian and most other Slavic languages (Ukrainian, Byelorussian, and so forth), and many non-Slavic languages of the former Soviet Union (Azerbaijani, Ossetian, Kabardian, Chechen, Tajik, and so forth); based on ISO 8859-5. A few languages (Kurdish, Abkhazian) require both Latin and Cyrillic characters.
    Armenian | 1328-1423 | Armenian.
    Hebrew | 1424-1535 | Hebrew (classical and modern), Yiddish, Judezmo, early Aramaic.
    Arabic | 1536-1791 | Arabic, Persian, Pashto, Sindhi, Kurdish, and classical Turkish.
    Devanagari | 2304-2431 | Sanskrit, Hindi, Nepali, and other languages of the Indian subcontinent including Awadhi, Bagheli, Bhatneri, Bhili, Bihari, Braj Bhasha, Chhattisgarhi, Garhwali, Gondi, Harauti, Ho, Jaipuri, Kachchhi, Kanauji, Konkani, Kului, Kumaoni, Kurku, Kurukh, Marwari, Mundari, Newari, Palpa, and Santali.
    Bengali | 2432-2559 | A North Indian script used in India’s West Bengal state and Bangladesh; used for Bengali, Assamese, Daphla, Garo, Hallam, Khasi, Manipuri, Mizo, Naga, Munda, Rian, Santali.
    Gurmukhi | 2560-2687 | Punjabi.
    Gujarati | 2688-2815 | Gujarati.
    Oriya | 2816-2943 | Oriya, Khondi, Santali.
    Tamil | 2944-3071 | Tamil and Badaga, used in south India, Sri Lanka, Singapore, and parts of Malaysia.
    Telugu | 3072-3199 | Telugu, Gondi, Lambadi.
    Kannada | 3200-3327 | Kannada, Tulu.
    Malayalam | 3328-3455 | Malayalam.
    Thai | 3584-3711 | Thai, Kuy, Lavna, Pali.
    Lao | 3712-3839 | Lao.
    Tibetan | 3840-4031 | Himalayan languages including Tibetan, Ladakhi, and Lahuli.
    Georgian | 4256-4351 | Georgian, the language of the former Soviet Republic of Georgia on the Black Sea.
    Hangul Jamo | 4352-4607 | The alphabetic components of the Korean Hangul syllabary.
    Latin Extended Additional | 7680-7935 | Normal Latin letters like E and Y combined with diacritical marks; rarely used except for Vietnamese vowels.
    Greek Extended | 7936-8191 | Greek letters combined with diacritical marks; used in polytonic and classical Greek.
    General Punctuation | 8192-8303 | Assorted punctuation marks.
    Superscripts and Subscripts | 8304-8351 | Common subscripts and superscripts.
    Currency Symbols | 8352-8399 | Currency symbols not already present in other blocks.
    Combining Marks for Symbols | 8400-8447 | Used to make a diacritical mark span two or more characters.
    Letterlike Symbols | 8448-8527 | Symbols that look like letters, such as ™.
    Number Forms | 8528-8591 | Fractions and Roman numerals.
    Arrows | 8592-8703 | Arrows.
    Mathematical Operators | 8704-8959 | Mathematical operators that don’t already appear in other blocks.
    Miscellaneous Technical | 8960-9039 | Cropping marks, bra-ket notation from quantum mechanics, symbols needed for the APL programming language, and assorted other technical symbols.
    Control Pictures | 9216-9279 | Pictures of the ASCII control characters; generally used in debugging and network-packet sniffing.
    Optical Character Recognition | 9280-9311 | OCR-A and the MICR (magnetic ink character recognition) symbols on printed checks.
    Enclosed Alphanumerics | 9312-9471 | Letters and numbers in circles and parentheses.
    Box Drawing | 9472-9599 | Characters for drawing boxes on monospaced terminals.
    Block Elements | 9600-9631 | Monospaced terminal graphics as used in DOS and elsewhere.
    Geometric Shapes | 9632-9727 | Squares, diamonds, triangles, and the like.
    Miscellaneous Symbols | 9728-9983 | Cards, chess, astrology, and more.
    Dingbats | 9984-10175 | The Zapf Dingbat characters.
    CJK Symbols and Punctuation | 12288-12351 | Symbols and punctuation used in Chinese, Japanese, and Korean.
    Hiragana | 12352-12447 | A cursive syllabary for Japanese.
    Katakana | 12448-12543 | A non-cursive syllabary used to write words imported from the West in Japanese, especially modern words like “keyboard”.
    Bopomofo | 12544-12591 | A phonetic alphabet for Chinese used primarily for teaching.
    Hangul Compatibility Jamo | 12592-12687 | Korean characters needed for compatibility with the KSC 5601 encoding.
    Kanbun | 12688-12703 | Marks used in Japanese to indicate the reading order of classical Chinese.
    Enclosed CJK Letters and Months | 12800-13055 | Hangul and Katakana characters enclosed in circles and parentheses.
    CJK Compatibility | 13056-13311 | Characters needed only to encode KSC 5601 and CNS 11643.
    CJK Unified Ideographs | 19968-40959 | The Han ideographs used for Chinese, Japanese, and Korean.
    Hangul Syllables | 44032-55203 | A Korean syllabary.
    Surrogates | 55296-57343 | Currently unused, but will eventually allow the extension of Unicode to over one million different characters.
    Private Use | 57344-63743 | Software developers can include their custom characters here; not compatible across implementations.
    CJK Compatibility Ideographs | 63744-64255 | A few extra Han ideographs needed only to maintain compatibility with existing standards like KSC 5601.
    Alphabetic Presentation Forms | 64256-64335 | Ligatures and variants sometimes used in Latin, Armenian, and Hebrew.
    Arabic Presentation Forms | 64336-65023 | Variants of assorted Arabic characters.
    Combining Half Marks | 65056-65071 | Combining multiple diacritical marks into a single diacritical mark that spans multiple characters.
    CJK Compatibility Forms | 65072-65103 | Mostly vertical variants of Han ideographs used in Taiwan.
    Small Form Variants | 65104-65135 | Smaller versions of ASCII punctuation mostly used in Taiwan.
    Additional Arabic Presentation Forms | 65136-65279 | More variants of assorted Arabic characters.
    Halfwidth and Fullwidth Forms | 65280-65519 | Characters that allow conversion between different Chinese and Japanese encodings of the same characters.
    Specials | 65520-65535 | The byte order mark and the zero-width, nonbreaking space often used to start Unicode files.
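    Because each script occupies a consecutive range of code points, finding the block a character belongs to is a simple range lookup. The sketch below is illustrative only: BLOCKS holds a handful of the ranges from the Unicode block layout described in Table 7-6 rather than the full list, and block_of is a hypothetical helper name.

```python
# A few blocks as (first code point, last code point, block name).
BLOCKS = [
    (0, 127, "Basic Latin"),
    (128, 255, "Latin-1 Supplement"),
    (880, 1023, "Greek"),
    (1024, 1279, "Cyrillic"),
    (1424, 1535, "Hebrew"),
    (12352, 12447, "Hiragana"),
    (19968, 40959, "CJK Unified Ideographs"),
    (44032, 55203, "Hangul Syllables"),
]

def block_of(char):
    """Return the name of the block containing char, or None if it is not listed."""
    code = ord(char)
    for first, last, name in BLOCKS:
        if first <= code <= last:
            return name
    return None

for sample in ["A", "\u00e9", "\u03c0", "\u0416", "\u4e2d", "\ud55c"]:
    print(ord(sample), block_of(sample))

# Unicode characters 0 through 255 match Latin-1 characters 0 through 255.
latin1_matches = bytes(range(256)).decode("latin-1") == "".join(map(chr, range(256)))
```

    A character outside the listed ranges, such as a currency symbol, simply returns None here; a complete table would cover every row of Table 7-6.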

    UTF-8

    Since Unicode uses two bytes for each character, files of English text are about twice as large in Unicode as they would be in ASCII or Latin-1. UTF-8 is a compressed version of Unicode that uses only a single byte for the most common characters, that is, the ASCII characters 0-127, at the expense of having to use three bytes for the less common characters, particularly the Hangul syllables and Han ideographs. If you’re writing mostly in English, UTF-8 can reduce your file sizes by as much as 50 percent. On the other hand, if you’re writing mostly in Chinese, Korean, or Japanese, UTF-8 can increase your file size by as much as 50 percent, so it should be used with caution. UTF-8 has mostly no effect on non-Roman, non-CJK scripts like Greek, Arabic, Cyrillic, and Hebrew.

    XML processors assume text data is in the UTF-8 format unless told otherwise. This means they can read ASCII files, but other formats like MacRoman or Latin-1 cause them trouble. You’ll learn how to fix this problem shortly.
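    The size trade-off is easy to check by encoding a few short strings both ways. This is a rough sketch, assuming Python's utf-16-be codec as a stand-in for two-byte Unicode (big-endian, with no byte order mark); the sample strings and the function name encoded_sizes are illustrative.

```python
def encoded_sizes(text):
    """Return (size in bytes as two-byte Unicode, size in bytes as UTF-8)."""
    two_byte = len(text.encode("utf-16-be"))  # two bytes per character here
    utf8 = len(text.encode("utf-8"))          # one to three bytes per character here
    return two_byte, utf8

samples = {
    "English": "Hello XML!",
    "Greek": "\u03b1\u03b2\u03b3\u03b4",
    "Chinese": "\u4f60\u597d\u4e16\u754c",
}
for name, text in samples.items():
    print(name, encoded_sizes(text))
```

    The English sample halves in size under UTF-8, the Greek sample stays the same, and the Chinese sample grows by half, which matches the proportions described above.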

    The Universal Character Set

    Unicode has been criticized for not encompassing enough, especially in regard to East Asian languages. It only defines about 20,000 of the 80,000 Han ideographs used amongst Chinese, Japanese, Korean, and historical Vietnamese. (Modern Vietnamese uses a Roman alphabet.)

    UCS (Universal Character Set), also known as ISO 10646, uses four bytes per character (more precisely, 31 bits) to provide space for over two billion different characters. This easily covers every character ever used in any language in any script on the planet Earth. Among other things, this enables a full set of characters to be assigned to each language, so that the French “e” is not the same as the English “e,” which is not the same as the German “e,” and so on.

    Like Unicode, UCS defines a number of different variants and compressed forms. Pure Unicode is sometimes referred to as UCS-2, which is two-byte UCS. UTF-16 is a special encoding that maps some of the UCS characters into byte strings of varying length in such a fashion that Unicode (UCS-2) data is unchanged.

    At this point, the advantage of UCS over Unicode is mostly theoretical. The only characters that have actually been defined in UCS are precisely those already in Unicode. However, it does provide more room for future expansion.
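    The way UTF-16 leaves two-byte Unicode data unchanged while still reaching larger code points can be seen by looking at the 16-bit units it produces. The sketch below assumes Python's utf-16-be and utf-32-be codecs; the code point 0x1D11E is only an example of a value above 65,535 (it was assigned in a later version of the standard than the one discussed here), and utf16_units is a hypothetical helper.

```python
def utf16_units(char):
    """Return the big-endian UTF-16 encoding of char as a list of 16-bit units."""
    data = char.encode("utf-16-be")
    return [int.from_bytes(data[i:i + 2], "big") for i in range(0, len(data), 2)]

# A character inside the two-byte range is stored as its own code point.
bmp = utf16_units("\u03c0")

# A code point above 65,535 becomes two units taken from the Surrogates block.
beyond = utf16_units(chr(0x1D11E))

# The four-byte form always uses four bytes per character.
four_byte_size = len(chr(0x1D11E).encode("utf-32-be"))

print(bmp, beyond, four_byte_size)
```

    Both units of the pair fall in the range 55296-57343, which is why that block is kept free of ordinary characters.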

    How to Write XML in Unicode

    Unicode is the native character set of XML, and XML browsers will probably do a pretty good job of displaying it, at least to the limits of the available fonts. Nonetheless, there simply aren’t many, if any, text editors that support the full range of Unicode. Consequently, you’ll probably have to tackle this problem in one of a couple of ways:

    1. Write in a localized character set like Latin-3; then convert your file to Unicode.
    2. Include Unicode character references in the text that numerically identify particular characters.

    The first option is preferable when you’ve got a large amount of text to enter in essentially one script, or one script plus ASCII. The second works best when you need to mix small portions of multiple scripts into your document.

    Inserting Characters in XML Files with Character References

    Every Unicode character is a number between 0 and 65,535. If you do not have a text editor that can write in Unicode, you can always use a character reference to insert the character in your XML file instead.

    A Unicode character reference consists of the two characters &# followed by the character code, followed by a semicolon. For instance, the Greek letter π has Unicode value 960, so it may be inserted in an XML file as &#960;. The Cyrillic character Ҷ has Unicode value 1206, so it can be included in an XML file with the character reference &#1206;.

    Unicode character references may also be specified in hexadecimal (base 16). Although most people are more comfortable with decimal numbers, the Unicode Specification gives character values as two-byte hexadecimal numbers. It’s often easier to use hex values directly rather than converting them to decimal. All you need to do is include an x after the &# to signify that you’re using a hexadecimal value. For example, π has hexadecimal value 3C0, so it may be inserted in an XML file as &#x03C0;. The Cyrillic character Ҷ has hexadecimal value 4B6, so it can be included in an XML file with the character reference &#x04B6;. Because two bytes always produce exactly four hexadecimal digits, it’s customary (though not required) to include leading zeros in hexadecimal character references so they are rounded out to four digits.

    Unicode character references, both hexadecimal and decimal, may be used to embed characters that would otherwise be interpreted as markup. For instance, the ampersand (&) is encoded as &#38; or &#x0026;. The less-than sign (<) is encoded as &#60; or &#x003C;.

    For example, the native2ascii program converts a file written in the platform’s default encoding to Unicode:

    C:> native2ascii myfile.txt myfile.uni

    You can specify other encodings with the -encoding option:

    C:> native2ascii -encoding Big5 chinese.txt chinese.uni

    You can also reverse the process to go from Unicode to a local encoding with the -reverse option:

    C:> native2ascii -encoding Big5 -reverse chinese.uni chinese.txt

    If the output file name is left off, the converted file is printed out. The native2ascii program also processes Java-style Unicode escapes, which are characters embedded as \u09E3. These are not in the same format as XML numeric character references, though they’re similar. If you convert to Unicode using native2ascii, you can still use XML character references; the viewer will still recognize them.
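    Character references are mechanical enough to generate with a few lines of code. The following is a minimal sketch, not a replacement for native2ascii: the function names are invented for illustration, and to_ascii_xml only replaces non-ASCII characters (it does not escape markup characters such as & and <).

```python
import xml.etree.ElementTree as ET

def decimal_ref(char):
    """Decimal XML character reference, for example &#960; for the Greek letter pi."""
    return f"&#{ord(char)};"

def hex_ref(char):
    """Hexadecimal XML character reference padded to four digits."""
    return f"&#x{ord(char):04X};"

def java_escape(char):
    """Java-style Unicode escape, which is similar to but not the same as an XML reference."""
    return f"\\u{ord(char):04X}"

def to_ascii_xml(text):
    """Replace every non-ASCII character in text with a decimal character reference."""
    return "".join(c if ord(c) < 128 else decimal_ref(c) for c in text)

# Build a pure-ASCII document and confirm that a parser restores the characters.
doc = "<GREETING>" + to_ascii_xml("\u03c0 \u2248 3.14") + "</GREETING>"
parsed = ET.fromstring(doc)
print(doc)
print(parsed.text)
```

    The resulting file contains only ASCII characters, so it can be written in any editor, yet a parser that reads it hands the application the original Unicode text.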


    How to Write XML in Other Character Sets

    Unless told otherwise, an XML processor assumes that text entity characters are encoded in UTF-8. Since UTF-8 includes ASCII as a subset, ASCII text is easily parsed by XML processors as well. The only character set other than UTF-8 that an XML processor is required to understand is raw Unicode. If you cannot convert your text into either UTF-8 or raw Unicode, you can leave the text in its native character set and tell the XML processor which set that is. This should be a last resort, though, because there’s no guarantee an arbitrary XML processor can process other encodings. Nonetheless, Netscape Navigator and Internet Explorer both do a pretty good job of interpreting the common character sets.

    To warn the XML processor that you’re using a non-Unicode encoding, you include an encoding attribute in the XML declaration at the start of the file. For example, to specify that the entire document uses Latin-1 by default (unless overridden by another processing instruction in a nested entity), you would use this XML declaration:

    <?xml version="1.0" encoding="ISO-8859-1"?>

    You can also include the encoding declaration as part of a separate processing instruction after the XML declaration but before any character data appears.
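    The effect of the encoding attribute can be demonstrated with any parser that accepts raw bytes. The sketch below assumes Python's xml.etree.ElementTree module; the sample text and variable names are illustrative. It parses the same Latin-1 bytes twice, once with an encoding declaration and once without.

```python
import xml.etree.ElementTree as ET

# Latin-1 bytes for a document containing one non-ASCII character.
text = "<GREETING>caf\u00e9</GREETING>"
latin1_bytes = text.encode("iso-8859-1")

# With the declaration, the parser knows how to interpret the bytes.
declared = b'<?xml version="1.0" encoding="ISO-8859-1"?>' + latin1_bytes
root = ET.fromstring(declared)

# Without it, the parser assumes UTF-8 and rejects the byte sequence.
try:
    ET.fromstring(latin1_bytes)
    undeclared_ok = True
except ET.ParseError:
    undeclared_ok = False

print(root.text, undeclared_ok)
```

    A pure-ASCII document would parse either way, which is why missing encoding declarations often go unnoticed until the first accented character appears.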

    Table 7-7 lists the official names of the most common character sets used today, as they would be given in XML encoding attributes. For encodings not found in this list, consult the official list maintained by the Internet Assigned Numbers Authority (IANA) at http://www.isi.edu/in-notes/iana/assignments/character-sets.

    Table 7-7 Names of Common Character Sets

    Character Set Name | Languages/Countries
    US-ASCII | English
    UTF-8 | Compressed Unicode
    UTF-16 | Compressed UCS
    ISO-10646-UCS-2 | Raw Unicode
    ISO-10646-UCS-4 | Raw UCS
    ISO-8859-1 | Latin-1, Western Europe
    ISO-8859-2 | Latin-2, Eastern Europe
    ISO-8859-3 | Latin-3, Southern Europe
    ISO-8859-4 | Latin-4, Northern Europe
    ISO-8859-5 | ASCII plus Cyrillic
    ISO-8859-6 | ASCII plus Arabic
    ISO-8859-7 | ASCII plus Greek
    ISO-8859-8 | ASCII plus Hebrew
    ISO-8859-9 | Latin-5, Turkish
    ISO-8859-10 | Latin-6, ASCII plus the Nordic languages
    ISO-8859-11 | ASCII plus Thai
    ISO-8859-13 | Latin-7, ASCII plus the Baltic Rim languages, particularly Latvian
    ISO-8859-14 | Latin-8, ASCII plus Gaelic and Welsh
    ISO-8859-15 | Latin-9, Latin-0; Western Europe
    ISO-2022-JP | Japanese
    Shift_JIS | Japanese, Windows
    EUC-JP | Japanese, Unix
    Big5 | Chinese, Taiwan
    GB2312 | Chinese, mainland China
    KOI8-R | Russian
    ISO-2022-KR | Korean
    EUC-KR | Korean, Unix
    ISO-2022-CN | Chinese


    Summary

    In this chapter you learned:

    ✦ Web pages should identify the encoding they use.

    ✦ What a script is, how it relates to languages, and the four things a script requires.

    ✦ How scripts are used in computers with character sets, fonts, glyphs, and input methods.

    ✦ What character sets are commonly used on different platforms and that most are based on ASCII.

    ✦ How to write XML in Unicode without a Unicode editor (write the document in ASCII and include Unicode character references).

    ✦ When writing XML in other encodings, include an encoding attribute in the XML declaration.

    In the next chapter, you’ll begin exploring DTDs and how they enable you to define and enforce a vocabulary, syntax, and grammar for your documents.








    8

    C H A P T E R

    Document Type Definitions and Validity









    In This Chapter

    Document Type Definitions (DTDs)
    Document type declarations
    Validation against a DTD
    The list of elements
    Element declarations
    Comments in DTDs
    Common DTDs that can be shared among documents

    XML has been described as a meta-markup language, that is, a language for describing markup languages. In this chapter you begin to learn how to document and describe the new markup languages you create. Such markup languages (also known as tag sets) are defined via a document type definition (DTD), which is what this chapter is all about. Individual documents can be compared against DTDs in a process known as validation. If the document matches the constraints listed in the DTD, then the document is said to be valid. If it doesn’t, the document is said to be invalid.

    Document Type Definitions

    The acronym DTD stands for document type definition. A document type definition provides a list of the elements, attributes, notations, and entities contained in a document, as well as their relationships to one another. DTDs specify a set of rules for the structure of a document. For example, a DTD may dictate that a BOOK element have exactly one ISBN child, exactly one TITLE child, and one or more AUTHOR children, and it may or may not contain a single SUBTITLE. The DTD accomplishes this with a list of markup declarations for particular elements, entities, attributes, and notations.

    Cross-Reference

    This chapter focuses on element declarations. Chapters 9, 10, and 11 introduce entities, attributes, and notations, respectively.

    DTDs can be included in the file that contains the document they describe, or they can be linked from an external URL.










    Part II ✦ Document Type Definitions

    Such external DTDs can be shared by different documents and Web sites. DTDs provide a means for applications, organizations, and interest groups to agree upon, document, and enforce adherence to markup standards.

    For example, a publisher may want an author to adhere to a particular format because it makes it easier to lay out a book. An author may prefer writing words in a row without worrying about matching up each bullet point in the front of the chapter with a subhead inside the chapter. If the author writes in XML, it’s easy for the publisher to check whether the author adhered to the predetermined format specified by the DTD, and even to find out exactly where and how the author deviated from the format. This is much easier than having editors read through documents with the hope that they spot all the minor deviations from the format, based on style alone.

    DTDs also help ensure that different people and programs can read each other’s files. For instance, if chemists agree on a single DTD for basic chemical notation, possibly via the intermediary of an appropriate professional organization such as the American Chemical Society, then they can be assured that they can all read and understand one another’s papers. The DTD defines exactly what is and is not allowed to appear inside a document. The DTD establishes a standard for the elements that viewing and editing software must support. Even more importantly, it establishes that extensions beyond those the DTD declares are invalid. Thus, it helps prevent software vendors from embracing and extending open protocols in order to lock users into their proprietary software.

    Furthermore, a DTD shows how the different elements of a page are arranged without actually providing their data. A DTD enables you to see the structure of your document separate from the actual data. This means you can slap a lot of fancy styles and formatting onto the underlying structure without destroying it, much as you paint a house without changing its basic architectural plan. The reader of your page may not see or even be aware of the underlying structure, but as long as it’s there, human authors and JavaScripts, CGIs, servlets, databases, and other programs can use it.

    There’s more you can do with DTDs. You can use them to define glossary entities that insert boilerplate text such as a signature block or an address. You can ascertain that data entry clerks are adhering to the format you need. You can migrate data to and from relational and object databases. You can even use XML as an intermediate format to convert different formats with suitable DTDs. So let’s get started and see what DTDs really look like.

    Document Type Declarations

    A document type declaration specifies the DTD a document uses. The document type declaration appears in a document’s prolog, after the XML declaration but before the root element. It may contain the document type definition or a URL identifying the file where the document type definition is found. It may even contain both, in which case the document type definition has two parts, the internal and external subsets.

    Caution

    A document type declaration is not the same thing as a document type definition. Only the document type definition is abbreviated DTD. A document type declaration must contain or refer to a document type definition, but a document type definition never contains a document type declaration. I agree that this is unnecessarily confusing. Unfortunately, XML seems stuck with this terminology. Fortunately, most of the time the difference between the two is not significant.

    Recall Listing 3-2 (greeting.xml) from Chapter 3. It is shown below:

    <?xml version="1.0" standalone="yes"?>
    <GREETING>
    Hello XML!
    </GREETING>

    This document contains a single element, GREETING. (Remember, <?xml version="1.0" standalone="yes"?> is a processing instruction, not an element.) Listing 8-1 shows this document, but now with a document type declaration. The document type declaration declares that the root element is GREETING. The document type declaration also contains a document type definition, which declares that the GREETING element contains parsed character data.

    Listing 8-1: Hello XML with DTD

    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE GREETING [
      <!ELEMENT GREETING (#PCDATA)>
    ]>
    <GREETING>
    Hello XML!
    </GREETING>

    The only difference between Listing 3-2 and Listing 8-1 is the three new lines added to Listing 8-1:

    <!DOCTYPE GREETING [
      <!ELEMENT GREETING (#PCDATA)>
    ]>

    These lines are Listing 8-1’s document type declaration. The document type declaration comes between the XML declaration and the document itself. The XML declaration and the document type declaration together are called the prolog of the document. In this short example, <?xml version="1.0" standalone="yes"?> is the XML declaration; <!DOCTYPE GREETING [ <!ELEMENT GREETING (#PCDATA)> ]> is the document type declaration; <!ELEMENT GREETING (#PCDATA)> is the document type definition; and <GREETING> Hello XML! </GREETING> is the document or root element.

    A document type declaration begins with <!DOCTYPE and ends with ]>. It’s customary to place the beginning and end on separate lines, but line breaks and extra whitespace are not significant. The same document type declaration could be written on a single line:

    <!DOCTYPE GREETING [ <!ELEMENT GREETING (#PCDATA)> ]>

    The name of the root element (GREETING in this example) follows <!DOCTYPE. The document type definition appears between the square brackets. Here it is the single line <!ELEMENT GREETING (#PCDATA)>, which (case-sensitive as most things are in XML) is an element type declaration. In this case, the name of the declared element is GREETING. It is the only element. This element may contain parsed character data (or #PCDATA). Parsed character data is essentially any text that’s not markup text. This also includes entity references, such as &amp;, that are replaced by text when the document is parsed.

    You can load this document into an XML browser as usual. Figure 8-1 shows Listing 8-1 in Internet Explorer 5.0. The result is probably what you’d expect, a collapsible outline view of the document source. Internet Explorer indicates that a document type declaration is present by adding a line beginning with <!DOCTYPE GREETING in blue.

    Figure 8-1: Hello XML with DTD displayed in Internet Explorer 5.0
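    A non-validating parser reads the document type declaration but hands the application the same element either way. The sketch below assumes Python's xml.etree.ElementTree module, which does not validate; the two strings are illustrative documents shaped like the greeting example with and without an internal DTD.

```python
import xml.etree.ElementTree as ET

WITHOUT_DTD = (
    '<?xml version="1.0" standalone="yes"?>\n'
    "<GREETING>\nHello XML!\n</GREETING>"
)

WITH_DTD = (
    '<?xml version="1.0" standalone="yes"?>\n'
    "<!DOCTYPE GREETING [\n"
    "  <!ELEMENT GREETING (#PCDATA)>\n"
    "]>\n"
    "<GREETING>\nHello XML!\n</GREETING>"
)

plain = ET.fromstring(WITHOUT_DTD)
with_dtd = ET.fromstring(WITH_DTD)

# The prolog differs, but the root element and its text are the same.
print(plain.tag, with_dtd.tag)
print(plain.text == with_dtd.text)
```

    This is consistent with what a browser shows: the declaration changes what may be checked about the document, not the content that gets displayed.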


    Of course, the document can be combined with a style sheet just as it was in Listing 3-6 in Chapter 3. In fact, you can use the same style sheet. Just add the usual <?xml-stylesheet?> processing instruction to the prolog as shown in Listing 8-2.

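    Listing 8-2: Hello XML with a DTD and style sheet

    <?xml version="1.0" standalone="yes"?>
    <?xml-stylesheet type="text/css" href="greeting.css"?>
    <!DOCTYPE GREETING [
      <!ELEMENT GREETING (#PCDATA)>
    ]>
    <GREETING>
    Hello XML!
    </GREETING>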

    Figure 8-2 shows the resulting Web page. This is exactly the same as it was in Figure 3-3 in Chapter 3 without the DTD. Formatting generally does not consider the DTD.

    Figure 8-2: Hello XML with a DTD and style sheet displayed in Internet Explorer 5.0

    Validating Against a DTD

    A valid document must meet the constraints specified by the DTD. Furthermore, its root element must be the one specified in the document type declaration. What the document type declaration and DTD in Listing 8-1 say is that a valid document must look like this:

    <GREETING>
    various random text but no markup
    </GREETING>

    A valid document may not look like this:

    <GREETING>
    <sometag>various random text</sometag>
    <someEmptyTag/>
    </GREETING>

    Nor may it look like this:

    <GREETING>
    <GREETING>various random text</GREETING>
    </GREETING>

    This document must consist of nothing more and nothing less than parsed character data between an opening <GREETING> tag and a closing </GREETING> tag. Unlike a merely well-formed document, a valid document does not allow arbitrary tags. Any tags used must be declared in the document’s DTD. Furthermore, they must be used only in the way permitted by the DTD. In Listing 8-1, the <GREETING> tag can be used only to start the root element, and it may not be nested.

    Suppose we make a simple change to Listing 8-2 by replacing the <GREETING> and </GREETING> tags with <foo> and </foo>, as shown in Listing 8-3. Listing 8-3 is invalid. It is a well-formed XML document, but it does not meet the constraints specified by the document type declaration and the DTD it contains.

    Listing 8-3: Invalid Hello XM L does not meet DTD rules



    Hello XML!
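A minimal sketch of such an invalid document follows. It assumes the same one-line DTD as before and omits the style sheet processing instruction for brevity; the root element name foo is the one reported in the parser output later in this section.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE GREETING [
  <!ELEMENT GREETING (#PCDATA)>
]>
<!-- Well-formed, but foo is neither the declared root nor declared at all -->
<foo>
Hello XML!
</foo>
```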

    Note

    Not all documents have to be valid, and not all parsers check documents for validity. In fact, most Web browsers including IE5 and Mozilla do not check documents for validity.

    A validating parser reads a DTD and checks whether a document adheres to the rules specified by the DTD. If it does, the parser passes the data along to the XML application (such as a Web browser or a database). If the parser finds a mistake, then it reports the error. If you’re writing XML by hand, you’ll want to validate your documents before posting them so you can be confident that readers won’t encounter errors.

    There are about a dozen different validating parsers available on the Web. Most of them are free. Most are libraries intended for programmers to incorporate into their own, more finished products, and they have minimal (if any) user interfaces. Parsers in this class include IBM’s alphaWorks’ XML for Java, Microsoft and DataChannel’s XJParser, and Silfide’s SXP.

    XML for Java: http://www.alphaworks.ibm.com/tech/xml
    XJParser: http://www.datachannel.com/xml_resources/
    SXP: http://www.loria.fr/projets/XSilfide/EN/sxp/

    Some libraries also include stand-alone parsers that run from the command line. These are programs that read an XML file and report any errors found but do not display the document. For example, XJParse is a Java program included with IBM’s XML for Java 1.1.16 class library in the samples.XJParse package. To run this program, you first have to add the XML for Java jar files to your Java class path. You can then validate a file by opening a DOS window or a shell prompt and passing the local name or remote URL of the file you want to validate to the XJParse program, like this:

    C:\xml4j>java samples.XJParse.XJParse -d D:\XML\08\invalid.xml

    Note

    At the time of this writing, IBM’s alphaWorks had released version 2.0.6 of XML for Java. In this version you invoke only XJParse instead of samples.XJParse. However, version 1.1.16 provides more features for stand-alone validation.

    You can use a URL instead of a file name, as shown below:

    C:\xml4j>java samples.XJParse.XJParse -d http://metalab.unc.edu/books/bible/examples/08/invalid.xml

    In either case, XJParse responds with a list of the errors found, followed by a tree form of the document. For example:

    D:\XML\07\invalid.xml: 6, 4: Document root element, “foo”, must match DOCTYPE root, “GREETING”.
    D:\XML\07\invalid.xml: 8, 6: Element “<foo>” is not valid in this context.



    Hello XML!


    This is not especially attractive output. However, the purpose of a validating parser such as XJParse isn’t to display XML files. Instead, the parser’s job is to divide the document into a tree structure and pass the nodes of the tree to the program that will display the data. This might be a Web browser such as Netscape Navigator or Internet Explorer. It might be a database. It might even be a custom program you’ve written yourself. You use XJParse, or another command-line validating parser, to verify that you’ve written good XML that other programs can handle. In essence, this is a proofreading or quality assurance phase, not finished output.

    Because XML for Java and most other validating parsers are written in Java, they share all the disadvantages of cross-platform Java programs. First, before you can run the parser you must have the Java Development Kit (JDK) or Java Runtime Environment installed. Second, you need to add the XML for Java jar files to your class path. Neither of these tasks is as simple as it should be. None of these tools were designed with an eye toward nonprogrammer end-users; they tend to be poorly designed and frustrating to use.

    If you’re writing documents for Web browsers, the simplest way to validate them is to load them into the browser and see what errors it reports. However, not all Web browsers validate documents. Some may merely accept well-formed documents without regard to validity. Internet Explorer 5.0 beta 2 validated documents, but the release version did not.

    On the CD-ROM

    The JRE for Windows and Unix is included on the CD-ROM in the misc/jre folder.

    Web-based validators are an alternative if the documents are placed on a Web server and aren’t particularly private. These parsers only require that you enter the URL of your document in a simple form. They have the distinct advantage of not requiring you to muck around with Java runtime software, class paths, and environment variables.

    Richard Tobin’s RXP-based, Web-hosted XML well-formedness checker and validator is shown in Figure 8-3. You’ll find it at http://www.cogsci.ed.ac.uk/%7Erichard/xml-check.html. Figure 8-4 shows the errors displayed as a result of using this program to validate Listing 8-3.

    Brown University’s Scholarly Technology Group provides a validator at http://www.stg.brown.edu/service/xmlvalid/ that’s notable for allowing you to upload files from your computer instead of placing them on a public Web server. This is shown in Figure 8-5. Figure 8-6 shows the results of using this program to validate Listing 8-3.

    Figure 8-3: Richard Tobin’s RXP-based, Web-hosted XML well-formedness checker and validator

    Figure 8-4: The errors with Listing 8-3, as reported by Richard Tobin’s XML validator

    Figure 8-5: Brown University’s Scholarly Technology Group’s Web-hosted XML validator

    Figure 8-6: The errors with Listing 8-3, as reported by Brown University’s Scholarly Technology Group’s XML validator

    Listing the Elements

    The first step to creating a DTD appropriate for a particular document is to understand the structure of the information you’ll encode using the elements defined in the DTD. Sometimes information is quite structured, as in a contact list. Other times it is relatively free-form, as in an illustrated short story or a magazine article.

    Let’s use a relatively structured document as an example. In particular, let’s return to the baseball statistics first shown in Chapter 4. Adding a DTD to that document enables us to enforce constraints that were previously adhered to only by convention. For instance, we can require that a SEASON contain exactly two LEAGUE children, every TEAM have a TEAM_CITY and a TEAM_NAME, and the TEAM_CITY always precede the TEAM_NAME.

    Recall that a complete baseball statistics document contains the following elements:

    SEASON           RBI
    YEAR             STEALS
    LEAGUE           CAUGHT_STEALING
    LEAGUE_NAME      SACRIFICE_HITS
    DIVISION         SACRIFICE_FLIES
    DIVISION_NAME    ERRORS
    TEAM             WALKS
    TEAM_CITY        STRUCK_OUT
    TEAM_NAME        HIT_BY_PITCH
    PLAYER           COMPLETE_GAMES
    SURNAME          SHUT_OUTS
    GIVEN_NAME       ERA
    POSITION         INNINGS
    GAMES            HOME_RUNS
    GAMES_STARTED    RUNS
    AT_BATS          EARNED_RUNS
    RUNS             HIT_BATTER
    HITS             WILD_PITCHES
    DOUBLES          BALK
    TRIPLES          WALKED_BATTER
    HOME_RUNS        STRUCK_OUT_BATTER
    WINS             COMPLETE_GAMES
    LOSSES           SHUT_OUTS
    SAVES

    The DTD you write needs element declarations for each of these. Each element declaration lists the name of an element and the children the element may have. For instance, a DTD can require that a LEAGUE have exactly three DIVISION children. It can also require that the SURNAME element be inside a PLAYER element, never outside. It can insist that a DIVISION have an indefinite number of TEAM elements but never less than one. A DTD can require that a PLAYER have exactly one each of the GIVEN_NAME, SURNAME, POSITION, and GAMES elements, but make it optional whether a PLAYER has an RBI or an ERA. Furthermore, it can require that the GIVEN_NAME, SURNAME, POSITION, and GAMES elements be used in a particular order. A DTD can also require that elements occur in a particular context. For instance, the GIVEN_NAME, SURNAME, POSITION, and GAMES may be used only inside a PLAYER element.

    It’s often easier to begin if you have a concrete, well-formed example document in mind that uses all the elements you want in your DTD. The examples in Chapter 4 serve that purpose here. Listing 8-4 is a trimmed-down version of Listing 4-1 in Chapter 4. Although it has only two players, it demonstrates all the essential elements.

    Listing 8-4: A well-formed XML document for which a DTD will be written

    1998

    National

    East

    Florida Marlins

    Ludwick Eric Starting Pitcher 1 4 0 13


    6 0 0 7.44 32.2 46 7 31 27 0 2 0 17

    Daubach Brian First Base 10 3 15 0 3 1 0 0 3 0 0 0 0 0 1 5 1

    Montreal Expos

    New York Mets

    Philadelphia Phillies


    Central

    Chicago Cubs

    West

    Arizona Diamondbacks



    American

    East

    Baltimore Orioles

    Central

    Chicago White Sox

    West

    Anaheim Angels



    Table 8-1 lists the different elements in this particular listing, as well as the conditions they must adhere to. Each element has a list of the other elements it must contain, the other elements it may contain, and the element in which it must be contained. In some cases, an element may contain more than one child element of the same type. A SEASON contains one YEAR and two LEAGUE elements. A DIVISION generally contains more than one TEAM. Less obviously, some batters alternate between designated hitter and the outfield from game to game. Thus, a single PLAYER element might have more than one POSITION. In the table, a requirement for a particular number of children is indicated by prefixing the element with a number (for example, 2 LEAGUE) and the possibility of multiple children is indicated by adding (s) to the end of the element’s name, such as PLAYER(s).

    Listing 8-4 adheres to these conditions. It could be shorter if the two PLAYER elements and some TEAM elements were omitted. It could be longer if many other PLAYER elements were included. However, all the other elements are required to be in the positions in which they appear.

    Note

    Elements have two basic types in XML. Simple elements contain text, also known as parsed character data, #PCDATA or PCDATA in this context. Compound elements contain other elements or, more rarely, text and other elements. There are no integer, floating point, date, or other data types in standard XML. Thus, you can’t use a DTD to say that the number of walks must be a non-negative integer, or that the ERA must be a floating point number between 0.0 and 1.0, even though doing so would be useful in examples like this one. There are some early efforts to define schemas that use XML syntax to describe information that might traditionally be encoded in a DTD, as well as data type information. As of mid-1999, these are mostly theoretical with few practical implementations.

    Now that you’ve identified the information you’re storing, and the optional and required relationships between these elements, you’re ready to build a DTD for the document that concisely, if a bit opaquely, summarizes those relationships.

    It’s often possible and convenient to cut and paste from one DTD to another. Many elements can be reused in other contexts. For instance, the description of a TEAM works equally well for football, hockey, and most other team sports.

    You can include one DTD within another so that a document draws tags from both. You might, for example, use a DTD that describes the statistics of individual players in great detail, and then nest that DTD inside the broader DTD for team sports. To change from baseball to football, simply swap out your baseball player DTD for a football player DTD.

    Cross-Reference

    To do this, the file containing the DTD is defined as an external entity. External parameter entity references are discussed in Chapter 9, Entities.


    Table 8-1 The Elements in the Baseball Statistics

    Element | Elements It Must Contain | Elements It May Contain | Element (if any) in Which It Must Be Contained
    --------|--------------------------|-------------------------|-----------------------------------------------
    SEASON | YEAR, 2 LEAGUE | |
    YEAR | Text | | SEASON
    LEAGUE | LEAGUE_NAME, 3 DIVISION | | SEASON
    LEAGUE_NAME | Text | | LEAGUE
    DIVISION | DIVISION_NAME, TEAM | TEAM(s) | LEAGUE
    DIVISION_NAME | Text | | DIVISION
    TEAM | TEAM_CITY, TEAM_NAME | PLAYER(s) | DIVISION
    TEAM_CITY | Text | | TEAM
    TEAM_NAME | Text | | TEAM
    PLAYER | SURNAME, GIVEN_NAME, POSITION, GAMES | GAMES_STARTED, AT_BATS, RUNS, HITS, DOUBLES, TRIPLES, HOME_RUNS, RBI, STEALS, CAUGHT_STEALING, SACRIFICE_HITS, SACRIFICE_FLIES, ERRORS, WALKS, STRUCK_OUT, HIT_BY_PITCH, COMPLETE_GAMES, SHUT_OUTS, ERA, INNINGS, HIT_BATTER, WILD_PITCHES, BALK, WALKED_BATTER, STRUCK_OUT_BATTER | TEAM
    SURNAME | Text | | PLAYER
    GIVEN_NAME | Text | | PLAYER
    POSITION | Text | | PLAYER
    GAMES | Text | | PLAYER
    GAMES_STARTED | Text | | PLAYER
    AT_BATS | Text | | PLAYER
    RUNS | Text | | PLAYER
    HITS | Text | | PLAYER
    DOUBLES | Text | | PLAYER
    TRIPLES | Text | | PLAYER
    HOME_RUNS | Text | | PLAYER
    RBI | Text | | PLAYER
    STEALS | Text | | PLAYER
    CAUGHT_STEALING | Text | | PLAYER
    SACRIFICE_HITS | Text | | PLAYER
    SACRIFICE_FLIES | Text | | PLAYER
    ERRORS | Text | | PLAYER
    WALKS | Text | | PLAYER
    STRUCK_OUT | Text | | PLAYER
    HIT_BY_PITCH | Text | | PLAYER
    COMPLETE_GAMES | Text | | PLAYER
    SHUT_OUTS | Text | | PLAYER
    ERA | Text | | PLAYER
    INNINGS | Text | | PLAYER
    HOME_RUNS_AGAINST | Text | | PLAYER
    RUNS_AGAINST | Text | | PLAYER
    HIT_BATTER | Text | | PLAYER
    WILD_PITCHES | Text | | PLAYER
    BALK | Text | | PLAYER
    WALKED_BATTER | Text | | PLAYER
    STRUCK_OUT_BATTER | Text | | PLAYER

    Element Declarations

    Each tag used in a valid XML document must be declared with an element declaration in the DTD. An element declaration specifies the name and possible contents of an element. The list of contents is sometimes called the content specification. The content specification uses a simple grammar to precisely specify what is and isn’t allowed in a document. This sounds complicated, but all it really means is that you add a punctuation mark such as *, ?, or + to an element name to indicate, respectively, that it may occur zero or more times, may occur at most once, or must occur at least once.

    DTDs are conservative. Everything not explicitly permitted is forbidden. However, DTD syntax does enable you to compactly specify relationships that are cumbersome to specify in sentences. For instance, DTDs make it easy to say that GIVEN_NAME must come before SURNAME, which must come before POSITION, which must come before GAMES, which must come before GAMES_STARTED, which must come before AT_BATS, which must come before RUNS, which must come before HITS, and that all of these may appear only inside a PLAYER.

    It’s easiest to build DTDs hierarchically, working from the outside in. This enables you to build a sample document at the same time you build the DTD to verify that the DTD is itself correct and actually describes the format you want.
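As a quick orientation before the baseball declarations, the sketch below shows the three occurrence suffixes side by side. The element names ARTICLE, TITLE, AUTHOR, ABSTRACT, and PARAGRAPH are hypothetical and are not part of the baseball DTD.

```xml
<!-- Hypothetical declaration showing each occurrence suffix -->
<!ELEMENT ARTICLE (TITLE, AUTHOR+, ABSTRACT?, PARAGRAPH*)>
<!-- TITLE       no suffix: exactly one
     AUTHOR+     one or more
     ABSTRACT?   zero or one
     PARAGRAPH*  zero or more -->
<!ELEMENT TITLE     (#PCDATA)>
<!ELEMENT AUTHOR    (#PCDATA)>
<!ELEMENT ABSTRACT  (#PCDATA)>
<!ELEMENT PARAGRAPH (#PCDATA)>
```

Because the children are separated by commas, they must also appear in exactly the order listed.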


    ANY

    The first thing you have to do is identify the root element. In the baseball example, SEASON is the root element. The !DOCTYPE declaration specifies this:
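A document type declaration that names SEASON as the root would look roughly like this sketch, with the internal DTD subset still empty:

```xml
<!DOCTYPE SEASON [
  <!-- element declarations go here -->
]>
```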

    However, this merely says that the root tag is SEASON. It does not say anything about what a SEASON element may or may not contain, which is why you must next declare the SEASON element in an element declaration. That’s done with this line of code:
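Given the discussion of the ANY keyword in this section, the line in question is presumably the following:

```xml
<!ELEMENT SEASON ANY>
```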

    All element type declarations begin with <!ELEMENT (case-sensitive) and end with >. They include the name of the element being declared (SEASON in this example) followed by the content specification. The ANY keyword (again case-sensitive) says that all possible elements as well as parsed character data can be children of the SEASON element.

    Using ANY is common for root elements (especially of unstructured documents) but should be avoided in most other cases. Generally it’s better to be as precise as possible about the content of each tag. DTDs are usually refined throughout their development, and tend to become less strict over time as they reflect uses and contexts unimagined in the first cut. Therefore, it’s best to start out strict and loosen things up later.

    #PCDATA

    Although any element may appear inside the document, elements that do appear must also be declared. The first one needed is YEAR. This is the element declaration for the YEAR element:
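A declaration restricting YEAR to parsed character data can be sketched as:

```xml
<!ELEMENT YEAR (#PCDATA)>
```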

    This declaration says that a YEAR may contain only parsed character data, that is, text that’s not markup. It may not contain children of its own. Therefore, this YEAR element is valid:

    1998


    These YEAR elements are also valid:

    98

    1998 C.E.

    The year of our lord one thousand, nine hundred, & ninety-eight

    Even this YEAR element is valid because XML does not attempt to validate the contents of PCDATA, only that it is text that doesn’t contain markup.

    Delicious, delicious, oh how boring

    However, this YEAR element is invalid because it contains child elements:

    January February March April May June July August September October November December
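Written out with their tags, the YEAR examples above might look like the following sketch. The child element name MONTH is hypothetical, chosen only to illustrate the invalid case, and the ampersand is escaped as &amp;amp; because a bare & is not well-formed.

```xml
<!-- Valid: each YEAR holds only character data -->
<YEAR>1998</YEAR>
<YEAR>98</YEAR>
<YEAR>1998 C.E.</YEAR>
<YEAR>The year of our lord one thousand, nine hundred, &amp; ninety-eight</YEAR>
<YEAR>Delicious, delicious, oh how boring</YEAR>

<!-- Invalid: (#PCDATA) content may not include child elements -->
<YEAR>
  <MONTH>January</MONTH>
  <MONTH>February</MONTH>
  <!-- and so on through December -->
</YEAR>
```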

    The SEASON and YEAR element declarations are included in the document type declaration, like this:

    ]>

    As usual, spacing and indentation are not significant. The order in which the element declarations appear isn’t relevant either. This next document type declaration means exactly the same thing:

    ]>
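The two equivalent forms can be sketched as follows. The exact spacing and ordering shown here are illustrative; any arrangement of the same two declarations means the same thing.

```xml
<!-- One declaration per line, SEASON first -->
<!DOCTYPE SEASON [
  <!ELEMENT SEASON ANY>
  <!ELEMENT YEAR (#PCDATA)>
]>

<!-- Same meaning: different spacing, YEAR first -->
<!DOCTYPE SEASON [<!ELEMENT YEAR (#PCDATA)> <!ELEMENT SEASON ANY>]>
```

A real document carries only one document type declaration; the two are shown together here purely for comparison.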

    Both of these say that a SEASON element may contain parsed character data and any number of any other declared elements in any order. The only other such declared element is YEAR, which may contain only parsed character data. For instance, consider the document in Listing 8-5.

    Listing 8-5: A valid document

    ]>

    1998

    Because the SEASON element may also contain parsed character data, you can add additional text outside of the YEAR. Listing 8-6 demonstrates this.

    Listing 8-6: A valid document that contains a YEAR and normal text

    ]>

    1998 Major League Baseball
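A complete document along these lines might be sketched as follows, assuming the two declarations developed so far:

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE SEASON [
  <!ELEMENT YEAR (#PCDATA)>
  <!ELEMENT SEASON ANY>
]>
<SEASON>
  <YEAR>1998</YEAR>
  Major League Baseball
</SEASON>
```

The text "Major League Baseball" sits directly inside SEASON, which is legal only because SEASON is declared ANY.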

    Eventually we’ll disallow documents such as this. However, for now it’s legal because SEASON is declared to accept ANY content. Most of the time it’s easier to start with ANY for an element until you define all of its children. Then you can replace it with the actual children you want to use.

    You can attach a simple style sheet, such as the baseballstats.css style sheet developed in Chapter 4, to Listing 8-6, as shown in Listing 8-7, and load it into a Web browser, as shown in Figure 8-7. The baseballstats.css style sheet contains style rules for elements that aren’t present in the DTD or the document part of Listing 8-7, but this is not a problem. Web browsers simply ignore any style rules for elements that aren’t present in the document.

    Listing 8-7: A valid document that contains a style sheet, a YEAR, and normal text

    ]>

    1998 Major League Baseball

    Figure 8-7: A valid document that contains a style sheet, a YEAR element, and normal text displayed in Internet Explorer 5.0

    Child Lists

    Because the SEASON element was declared to accept any element as a child, elements could be tossed in willy-nilly. This is useful when you have text that’s more or less unstructured, such as a magazine article where paragraphs, sidebars, bulleted lists, numbered lists, graphs, photographs, and subheads may appear pretty much anywhere in the document. However, sometimes you may want to exercise more discipline and control over the placement of your data. For example, you could require that every LEAGUE have one LEAGUE_NAME, that every PLAYER have a GIVEN_NAME and a SURNAME, and that the GIVEN_NAME come before the SURNAME.

    To declare that a LEAGUE must have a name, simply declare a LEAGUE_NAME element, then include LEAGUE_NAME in parentheses at the end of the LEAGUE declaration, like this:
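The pair of declarations described above can be sketched as:

```xml
<!ELEMENT LEAGUE (LEAGUE_NAME)>
<!ELEMENT LEAGUE_NAME (#PCDATA)>
```

With this content specification, a LEAGUE must contain exactly one LEAGUE_NAME and nothing else.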

    Each element should be declared in its own declaration exactly once, even if it appears as a child in other declarations. Here I’ve placed the declaration of LEAGUE_NAME after the declaration of LEAGUE that refers to it, but that doesn’t matter. XML allows these sorts of forward references. The order in which the element declarations appear is irrelevant as long as they are all contained inside the DTD.

    You can add these two declarations to the document, and then include LEAGUE and LEAGUE_NAME elements in the SEASON. Listing 8-8 demonstrates this. Figure 8-8 shows the rendered document.

    Listing 8-8: A SEASON with two LEAGUE children



    ]>

    1998

    American League

    National League

    Figure 8-8: A valid document that contains a style sheet, a YEAR element, and two LEAGUE children

    Sequences

    Let’s restrict the SEASON element as well. A SEASON contains exactly one YEAR, followed by exactly two LEAGUE elements. Instead of saying that a SEASON can contain ANY elements, specify these three children by including them in SEASON’s element declaration, enclosed in parentheses and separated by commas, as follows:

    A list of child elements separated by commas is called a sequence. With this declaration, every valid SEASON element must contain exactly one YEAR element, followed by exactly two LEAGUE elements, and nothing else. The complete document type declaration now looks like this:
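A sequence of one YEAR followed by two LEAGUE children would be declared roughly like this:

```xml
<!ELEMENT SEASON (YEAR, LEAGUE, LEAGUE)>
```

LEAGUE is simply listed twice because the basic syntax has no shorthand for "exactly two".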



    ]>

    The document part of Listing 8-8 does adhere to this DTD because its SEASON element contains one YEAR child followed by two LEAGUE children, and nothing else. However, if the document included only one LEAGUE, then the document, though well-formed, would be invalid. Similarly, if the LEAGUE came before the YEAR element instead of after it, or if the LEAGUE element had YEAR children, or if the document in any other way did not adhere to the DTD, then the document would be invalid and validating parsers would reject it.

    It’s straightforward to expand these techniques to cover divisions. As well as a LEAGUE_NAME, each LEAGUE has three DIVISION children. For example:
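Following the same pattern as SEASON, a LEAGUE with a name and exactly three divisions can be sketched as:

```xml
<!ELEMENT LEAGUE (LEAGUE_NAME, DIVISION, DIVISION, DIVISION)>
```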

    One or More Children

    Each DIVISION has a DIVISION_NAME and between four and six TEAM children. Specifying the DIVISION_NAME is easy. This is demonstrated below:

    However, the TEAM children are trickier. It’s easy to say you want four TEAM children in a DIVISION, as shown below:
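The two declarations just described, a text-only DIVISION_NAME and a DIVISION with exactly four teams, might be written as:

```xml
<!ELEMENT DIVISION_NAME (#PCDATA)>
<!ELEMENT DIVISION (DIVISION_NAME, TEAM, TEAM, TEAM, TEAM)>
```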

    Five and six are not harder. But how do you say you want between four and six inclusive? In fact, XML doesn’t provide an easy way to do this. But you can say you want one or more of a given element by placing a plus sign (+) after the element name in the child list. For example:

    This says that a DIVISION element must contain a DIVISION_NAME element followed by one or more TEAM elements.

    Tip

    There is a hard way to say that a DIVISION contains between four and six TEAM elements, but not three and not seven. However, it’s so ridiculously complex that nobody would actually use it in practice. Once you finish reading this chapter, see if you can figure out how to do it.
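Using the plus sign, the DIVISION declaration can be sketched as:

```xml
<!ELEMENT DIVISION (DIVISION_NAME, TEAM+)>
```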

    Zero or More Children

    Each TEAM should contain one TEAM_CITY, one TEAM_NAME, and an indefinite number of PLAYER elements. In reality, you need at least nine players for a baseball team. However, in the examples in this book, many teams are listed without players for reasons of space. Thus, we want to specify that a TEAM can contain zero or more PLAYER children. Do this by appending an asterisk (*) to the element name in the child list. For example:
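A TEAM with a city, a name, and any number of players (including none) can be sketched as follows. The two text-only declarations are included on the assumption that TEAM_CITY and TEAM_NAME hold plain text, as Table 8-1 indicates.

```xml
<!ELEMENT TEAM (TEAM_CITY, TEAM_NAME, PLAYER*)>
<!ELEMENT TEAM_CITY (#PCDATA)>
<!ELEMENT TEAM_NAME (#PCDATA)>
```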




    Zero or One Child

    The final elements in the document to be brought into play are the children of the PLAYER. All of these are simple elements that contain only text. Here are their declarations:

    <!ELEMENT SURNAME (#PCDATA)>
    <!ELEMENT GIVEN_NAME (#PCDATA)>
    <!ELEMENT POSITION (#PCDATA)>
    <!ELEMENT GAMES (#PCDATA)>
    <!ELEMENT GAMES_STARTED (#PCDATA)>
    <!ELEMENT AT_BATS (#PCDATA)>
    <!ELEMENT RUNS (#PCDATA)>
    <!ELEMENT HITS (#PCDATA)>
    <!ELEMENT DOUBLES (#PCDATA)>
    <!ELEMENT TRIPLES (#PCDATA)>
    <!ELEMENT HOME_RUNS (#PCDATA)>
    <!ELEMENT RBI (#PCDATA)>
    <!ELEMENT STEALS (#PCDATA)>
    <!ELEMENT CAUGHT_STEALING (#PCDATA)>
    <!ELEMENT SACRIFICE_HITS (#PCDATA)>
    <!ELEMENT SACRIFICE_FLIES (#PCDATA)>
    <!ELEMENT ERRORS (#PCDATA)>
    <!ELEMENT WALKS (#PCDATA)>
    <!ELEMENT STRUCK_OUT (#PCDATA)>
    <!ELEMENT HIT_BY_PITCH (#PCDATA)>
    <!ELEMENT COMPLETE_GAMES (#PCDATA)>
    <!ELEMENT SHUT_OUTS (#PCDATA)>
    <!ELEMENT ERA (#PCDATA)>
    <!ELEMENT INNINGS (#PCDATA)>
    <!ELEMENT EARNED_RUNS (#PCDATA)>
    <!ELEMENT HIT_BATTER (#PCDATA)>
    <!ELEMENT WILD_PITCHES (#PCDATA)>
    <!ELEMENT BALK (#PCDATA)>
    <!ELEMENT WALKED_BATTER (#PCDATA)>
    <!ELEMENT WINS (#PCDATA)>
    <!ELEMENT LOSSES (#PCDATA)>
    <!ELEMENT SAVES (#PCDATA)>
    <!ELEMENT STRUCK_OUT_BATTER (#PCDATA)>

    Now we can write the declaration for the PLAYER element. All players have one SURNAME, one GIVEN_NAME, one POSITION, and one GAMES. We could declare that each PLAYER also has one AT_BATS, RUNS, HITS, and so forth. However, I’m not sure it’s accurate to list zero runs for a pitcher who hasn’t batted. For one thing, this likely will lead to division by zero errors when you start calculating batting averages and so on. If a particular element doesn’t apply to a given player, or if it’s not available, then the more sensible thing to do is to omit the particular statistic from the player’s information. We don’t allow more than one of each element for a given player. Thus, we want zero or one element of the given type. Indicate this in a child element list by appending a question mark (?) to the element, as shown below:
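A sketch of such a PLAYER declaration follows. It covers only the children named in the surrounding text, so a full declaration would likely continue with the pitching statistics. Placing GIVEN_NAME before SURNAME is an assumption based on the ordering stated earlier in this chapter.

```xml
<!-- Sketch only: required children first, then optional statistics -->
<!ELEMENT PLAYER (GIVEN_NAME, SURNAME, POSITION, GAMES, GAMES_STARTED,
                  AT_BATS?, RUNS?, HITS?, DOUBLES?, TRIPLES?, HOME_RUNS?,
                  RBI?, STEALS?, CAUGHT_STEALING?, SACRIFICE_HITS?,
                  SACRIFICE_FLIES?, ERRORS?, WALKS?, STRUCK_OUT?,
                  HIT_BY_PITCH?)>
```

Each optional child may be omitted, but any that do appear must still follow the order given in the sequence.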

    This says that every PLAYER has a SURNAME, GIVEN_NAME, POSITION, GAMES, and GAMES_STARTED. Furthermore, each player may or may not have a single AT_BATS, RUNS, HITS, DOUBLES, TRIPLES, HOME_RUNS, RBI, STEALS, CAUGHT_STEALING, SACRIFICE_HITS, SACRIFICE_FLIES, ERRORS, WALKS, STRUCK_OUT, and HIT_BY_PITCH.

    The Complete Document and DTD

    We now have a complete DTD for baseball statistics. This DTD, along with the document part of Listing 8-4, is shown in Listing 8-9.

    On the CD-ROM

    Listing 8-9 only covers a single team and nine players. On the CD-ROM you’ll find a document containing statistics for all 1998 Major League teams and players in the examples/baseball/1998validstats.xml directory.

    Listing 8-9: A valid XML document on baseball statistics with a DTD







    <!ELEMENT GIVEN_NAME (#PCDATA)>
    <!ELEMENT POSITION (#PCDATA)>
    <!ELEMENT GAMES (#PCDATA)>
    <!ELEMENT GAMES_STARTED (#PCDATA)>
    <!ELEMENT COMPLETE_GAMES (#PCDATA)>
    <!ELEMENT WINS (#PCDATA)>
    <!ELEMENT LOSSES (#PCDATA)>
    <!ELEMENT SAVES (#PCDATA)>
    <!ELEMENT AT_BATS (#PCDATA)>
    <!ELEMENT RUNS (#PCDATA)>
    <!ELEMENT HITS (#PCDATA)>
    <!ELEMENT DOUBLES (#PCDATA)>
    <!ELEMENT TRIPLES (#PCDATA)>
    <!ELEMENT HOME_RUNS (#PCDATA)>
    <!ELEMENT RBI (#PCDATA)>
    <!ELEMENT STEALS (#PCDATA)>
    <!ELEMENT CAUGHT_STEALING (#PCDATA)>
    <!ELEMENT SACRIFICE_HITS (#PCDATA)>
    <!ELEMENT SACRIFICE_FLIES (#PCDATA)>
    <!ELEMENT ERRORS (#PCDATA)>
    <!ELEMENT WALKS (#PCDATA)>
    <!ELEMENT STRUCK_OUT (#PCDATA)>
    <!ELEMENT HIT_BY_PITCH (#PCDATA)>
    <!ELEMENT SHUT_OUTS (#PCDATA)>
    <!ELEMENT ERA (#PCDATA)>
    <!ELEMENT INNINGS (#PCDATA)>
    <!ELEMENT HOME_RUNS_AGAINST (#PCDATA)>
    <!ELEMENT RUNS_AGAINST (#PCDATA)>
    <!ELEMENT EARNED_RUNS (#PCDATA)>
    <!ELEMENT HIT_BATTER (#PCDATA)>
    <!ELEMENT WILD_PITCHES (#PCDATA)>
    <!ELEMENT BALK (#PCDATA)>
    <!ELEMENT WALKED_BATTER (#PCDATA)>
    <!ELEMENT STRUCK_OUT_BATTER (#PCDATA)>

    ]>

    1998

    National


    East

    Florida Marlins

    Eric Ludwick Starting Pitcher 13 6 1 4 0 0 0 7.44 32.2 31 27 0 2 0 17

    Brian Daubach First Base 10 3 15 0 3 1 0 0 3 0 0 0 0 0 1 5 1

    Montreal Expos


    New York Mets

    Philadelphia Phillies

    Central

    Chicago Cubs

    West

    Arizona Diamondbacks



    American

    East

    Baltimore Orioles

    Central

    Chicago White Sox

    West

    Anaheim Angels
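For orientation, the structural declarations developed over the preceding sections can be gathered into one internal DTD subset, roughly as sketched below. The PLAYER declaration and the text-only declarations of its children are left as a comment here because they are covered separately in this chapter.

```xml
<!DOCTYPE SEASON [
  <!ELEMENT SEASON        (YEAR, LEAGUE, LEAGUE)>
  <!ELEMENT YEAR          (#PCDATA)>
  <!ELEMENT LEAGUE        (LEAGUE_NAME, DIVISION, DIVISION, DIVISION)>
  <!ELEMENT LEAGUE_NAME   (#PCDATA)>
  <!ELEMENT DIVISION      (DIVISION_NAME, TEAM+)>
  <!ELEMENT DIVISION_NAME (#PCDATA)>
  <!ELEMENT TEAM          (TEAM_CITY, TEAM_NAME, PLAYER*)>
  <!ELEMENT TEAM_CITY     (#PCDATA)>
  <!ELEMENT TEAM_NAME     (#PCDATA)>
  <!-- The PLAYER declaration and the (#PCDATA) declarations
       of its children follow here -->
]>
```

Reading the content specifications from top to bottom mirrors the outside-in nesting of the document itself.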



    Listing 8-9 is not the only possible document that matches this DTD, however. Listing 8-10 is also a valid document, because it contains all required elements in their required order and does not contain any elements that aren’t declared. This is probably the smallest reasonable document that you can create that fits the DTD. The limiting factors are the requirements that each SEASON contain two LEAGUE children, that each LEAGUE contain three DIVISION children, and that each DIVISION contain at least one TEAM.

    Listing 8-10: Another XML document that’s valid according to the baseball DTD

















    <!ELEMENT STEALS (#PCDATA)>
    <!ELEMENT CAUGHT_STEALING (#PCDATA)>
    <!ELEMENT SACRIFICE_HITS (#PCDATA)>
    <!ELEMENT SACRIFICE_FLIES (#PCDATA)>
    <!ELEMENT ERRORS (#PCDATA)>
    <!ELEMENT WALKS (#PCDATA)>
    <!ELEMENT STRUCK_OUT (#PCDATA)>
    <!ELEMENT HIT_BY_PITCH (#PCDATA)>
    <!ELEMENT SHUT_OUTS (#PCDATA)>
    <!ELEMENT ERA (#PCDATA)>
    <!ELEMENT INNINGS (#PCDATA)>
    <!ELEMENT HOME_RUNS_AGAINST (#PCDATA)>
    <!ELEMENT RUNS_AGAINST (#PCDATA)>
    <!ELEMENT EARNED_RUNS (#PCDATA)>
    <!ELEMENT HIT_BATTER (#PCDATA)>
    <!ELEMENT WILD_PITCHES (#PCDATA)>
    <!ELEMENT BALK (#PCDATA)>
    <!ELEMENT WALKED_BATTER (#PCDATA)>
    <!ELEMENT STRUCK_OUT_BATTER (#PCDATA)>

    ]>

    1998

    National

    East

    Atlanta Braves

    Florida Marlins

    Montreal Expos

    New York Mets

    Philadelphia Phillies


    Central

    Chicago Cubs

    West

    Arizona Diamondbacks



    American

    East

    Baltimore Orioles

    Central

    Chicago White Sox

    West

    Anaheim Angels



    Choices In general, a single parent element has many c hildren. To indic ate that the c hildren must o c c ur in sequenc e, they are separated by c o mmas. Ho wever, eac h suc h c hild element may be suffixed with a questio n mark, a plus sign, o r an asterisk to adjust the number o f times it appears in that plac e in the sequenc e.


    So far, the assumption has been made that child elements appear or do not appear in a specific order. You may, however, wish to make your DTD more flexible, such as by allowing document authors to choose between different elements in a given place. For example, in a DTD describing a purchase by a customer, each PAYMENT element might have either a CREDIT_CARD child or a CASH child providing information about the method of payment. However, an individual PAYMENT would not have both.

    You can indicate that the document author needs to input either one or another element by separating child elements with a vertical bar (|) rather than a comma (,) in the parent’s element declaration. For example, the following says that the PAYMENT element must have a single child of type CASH or CREDIT_CARD.
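A minimal sketch of such a declaration, using only the element names mentioned in the text, might read:

```xml
<!-- PAYMENT holds exactly one child: either CASH or CREDIT_CARD -->
<!ELEMENT PAYMENT (CASH | CREDIT_CARD)>
```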

    This sort of content specification is called a choice. You can separate any number of children with vertical bars when you want exactly one of them to be used. For example, the following says that the PAYMENT element must have a single child of type CASH, CREDIT_CARD, or CHECK.

    The vertical bar is even more useful when you group elements with parentheses. You can group combinations of elements in parentheses, then suffix the parentheses with asterisks, question marks, and plus signs to indicate that particular combinations of elements must occur zero or more, zero or one, or one or more times.
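Extending a choice to three alternatives only requires another vertical bar. As a sketch with the names given in the text:

```xml
<!-- PAYMENT holds exactly one of CASH, CREDIT_CARD, or CHECK -->
<!ELEMENT PAYMENT (CASH | CREDIT_CARD | CHECK)>
```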

    Children with Parentheses

    The final thing you need to know about arranging child elements in parent element declarations is how to group elements with parentheses. Each set of parentheses combines several elements as a single element. This parenthesized element can then be nested inside other parentheses in place of a single element. Furthermore, it may then have a plus sign, an asterisk, or a question mark affixed to it. You can group these parenthesized combinations into still larger parenthesized groups to produce quite complex structures. This is a very powerful technique.

    For example, consider a list composed of two elements that must alternate with each other. This is essentially how HTML’s definition list works. Each <dt> tag should match one <dd> tag. If you replicate this structure in XML, the declaration of the dl element looks like this:

    The parentheses indicate that it’s the matched <dt><dd> pair being repeated, not <dd> alone.
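One way to write that declaration, assuming the lowercase HTML names dt and dd for the two alternating elements, is sketched here:

```xml
<!-- Zero or more (dt, dd) pairs; the asterisk applies to the whole group -->
<!ELEMENT dl (dt, dd)*>
```

Writing `(dt, dd*)` instead would attach the asterisk to dd alone, which describes a different structure: one dt followed by any number of dd elements.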


    Often elements appear in more or less random orders. News magazine articles generally have a title mostly followed by paragraphs of text, but with graphs, photos, sidebars, subheads, and pull quotes interspersed throughout, perhaps with a byline at the end. You can indicate this sort of arrangement by listing all the possible child elements in the parent’s element declaration separated by vertical bars and grouped inside parentheses. You can then place an asterisk outside the closing parenthesis to indicate that zero or more occurrences of any of the elements in the parentheses are allowed. For example:
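The following sketch shows the pattern for the magazine article. All element names here (ARTICLE, P, PHOTO, and so on) are hypothetical, chosen only to mirror the prose:

```xml
<!-- A TITLE, then any mix of body elements in any order, then an optional BYLINE -->
<!ELEMENT ARTICLE (TITLE, (P | PHOTO | GRAPH | SIDEBAR | PULLQUOTE | SUBHEAD)*, BYLINE?)>
```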

    As another example, suppose you want to say that a DOCUMENT element, rather than accepting arbitrary children, must have one TITLE followed by any number of paragraphs of text and images that may be freely intermingled, followed by an optional SIGNATURE block. Write its element declaration this way:
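A declaration matching that description could be sketched as follows, assuming the paragraph and image elements are named PARAGRAPH and IMAGE:

```xml
<!-- One TITLE, any mix of PARAGRAPH and IMAGE, then an optional SIGNATURE -->
<!ELEMENT DOCUMENT (TITLE, (PARAGRAPH | IMAGE)*, SIGNATURE?)>
```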

    This is not the only way to describe this structure. In fact, it may not even be the best way. An alternative is to declare a BODY element that contains PARAGRAPH and IMAGE elements and nest that between the TITLE and the SIGNATURE. For example:
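As a sketch of this alternative, using the element names from the text:

```xml
<!-- The freely mixed content moves into its own BODY container -->
<!ELEMENT DOCUMENT (TITLE, BODY, SIGNATURE?)>
<!ELEMENT BODY (PARAGRAPH | IMAGE)*>
```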

    The difference between these two approaches is that the second requires an additional BODY element in the document. This element provides an additional level of organization that may (or may not) be useful to the application that’s reading the document. The question to ask is whether the reader of this document (who may be another computer program) may want to consider the BODY as a single item in its own right, separate from the TITLE and the SIGNATURE and distinguished from the sum of its elements.

    For another example, consider international addresses. Addresses outside the United States don’t always follow U.S. conventions. In particular, postal codes sometimes precede the state or follow the country, as in these two examples:

    Doberman-YPPAN Box 2021 St. Nicholas QUEBEC CAN GOS-3LO

    or

    Editions Sybex 10/12 Villa Coeur-de-Vey 75685 Paris Cedex 14 France


    Although your mail will probably arrive even if pieces of the address are out of order, it’s better to allow an address to be more flexible. Here’s one address element declaration that permits this:
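A declaration of this kind, sketched with the element names used in the surrounding discussion, would be:

```xml
<!-- One or more STREET lines, then the remaining parts in any order and number -->
<!ELEMENT ADDRESS (STREET+, (CITY | STATE | POSTAL_CODE | COUNTRY)*)>
```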

    This says that an ADDRESS element must have one or more STREET children followed by any number of CITY, STATE, POSTAL_CODE, or COUNTRY elements. Even this is less than ideal if you’d like to allow for no more than one of each. Unfortunately, this is beyond the power of a DTD to enforce. By allowing a more flexible ordering of elements, you give up some ability to control the maximum number of each element.

    On the other hand, you may have a list composed of different kinds of elements, which may appear in an arbitrary order, as in a list of recordings that may contain CDs, albums, and tapes. An element declaration to differentiate between the different categories for this list would look like this:
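For the list of recordings, a sketch might look like this. The names MUSIC_LIST, CD, ALBUM, and TAPE are assumptions made for illustration:

```xml
<!-- Any number of CD, ALBUM, and TAPE children, in any order -->
<!ELEMENT MUSIC_LIST (CD | ALBUM | TAPE)*>
```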

    You could use parentheses in the baseball DTD to specify different sets of statistics for pitchers and batters. Each player could have one set or the other, but not both. The element declaration looks like this:
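The shape of such a declaration is sketched below with a deliberately shortened list of statistics. GIVEN_NAME, SURNAME, and POSITION are assumed names for the shared children; a full declaration would list every pitching and batting statistic:

```xml
<!-- Shared children first, then exactly one of two statistic groups -->
<!ELEMENT PLAYER (GIVEN_NAME, SURNAME, POSITION,
  ( (ERA, INNINGS, WILD_PITCHES, BALK)
  | (STEALS, CAUGHT_STEALING, WALKS, STRUCK_OUT) ))>
```

The first inner group holds pitching statistics and the second holds batting statistics; the vertical bar between the two groups is what prevents a player from having both.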

    There are still a few things that are difficult to handle in element declarations. For example, there’s no good way to say that a document must begin with a TITLE element and end with a SIGNATURE element, but may contain any other elements between those two. This is because ANY may not combine with other child elements. And, in general, the less precise you are about where things appear, the less control you have over how many of them there are. For example, you can’t say that a document should have exactly one TITLE element but that the TITLE may appear anywhere in the document.

    Nonetheless, using parentheses to create blocks of elements, either in sequence with a comma or in parallel with a vertical bar, enables you to create complicated structures with detailed rules for how different elements follow one another. Try not to go overboard with this though. Simpler solutions are better. The more complicated your DTD is, the harder it is to write valid files that satisfy the DTD, to say nothing of the complexity of maintaining the DTD itself.

    Mixed Content

    You may have noticed that in most of the examples shown so far, elements either contained child elements or parsed character data, but not both. The only exceptions were the root elements in early examples where the full list of tags was not yet developed. In these cases, because the root element could contain ANY data, it was allowed to contain both child elements and raw text.

    You can declare tags that contain both child elements and parsed character data. This is called mixed content. You can use this to allow an arbitrary block of text to be suffixed to each TEAM. For example:
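A mixed-content declaration for TEAM could be sketched like this, assuming TEAM_CITY, TEAM_NAME, and PLAYER are the team's child elements:

```xml
<!-- #PCDATA must come first; text and the listed children may interleave freely -->
<!ELEMENT TEAM (#PCDATA | TEAM_CITY | TEAM_NAME | PLAYER)*>
```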

    Mixing child elements with parsed character data severely restricts the structure you can impose on your documents. In particular, you can specify only the names of the child elements that can appear. You cannot constrain the order in which they appear, the number of each that appears, or whether they appear at all. In terms of DTDs, think of this as meaning that the child part of the DTD must look like this:
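The general pattern can be sketched with placeholder names (PARENT, CHILD1, CHILD2, and CHILD3 are purely illustrative):

```xml
<!-- The only permitted shape for mixed content with child elements -->
<!ELEMENT PARENT (#PCDATA | CHILD1 | CHILD2 | CHILD3)*>
```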

    Almost everything else, other than changing the number of children, is invalid. You cannot use commas, question marks, or plus signs in an element declaration that includes #PCDATA. A list of elements and #PCDATA separated by vertical bars is valid. Any other use is not. For example, the following is illegal:
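As an illustration of what a parser should reject, here is a sketch of a declaration that breaks these rules. The child names are assumed, and the declaration is intentionally invalid:

```xml
<!-- ILLEGAL: #PCDATA may not be combined with commas or per-child suffixes -->
<!ELEMENT TEAM (TEAM_CITY, TEAM_NAME, PLAYER*, #PCDATA)>
```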

    The primary reason to mix content is when you’re in the process of converting old text data to XML, and testing your DTD by validating as you add new tags rather than finishing the entire conversion and then trying to find the bugs. This is a good technique, and I do recommend you use it; after all, it is much easier to recognize a mistake in your code immediately after you made it than several hours later. However, this is only a crutch for use when developing. It should not be visible to the end user. When your DTD is finished it should not mix element children with parsed character data. You can always create a new tag that holds parsed character data.


    For example, you can include a block of text at the end of each TEAM element by declaring a new BLURB that holds only #PCDATA and adding it as the last child element of TEAM. Here’s how this looks:

    This does not significantly change the text of the document. All it does is add one more optional element with its opening and closing tags to each TEAM element. However, it does make the document much more robust. Furthermore, XML applications that receive the tree from the XML processor have an easier time handling the data when it’s in the more structured format allowed by nonmixed content.
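A sketch of this approach follows, again assuming TEAM_CITY, TEAM_NAME, and PLAYER as the existing children of TEAM:

```xml
<!-- BLURB carries the free text, so TEAM itself stays element-only -->
<!ELEMENT TEAM (TEAM_CITY, TEAM_NAME, PLAYER*, BLURB?)>
<!ELEMENT BLURB (#PCDATA)>
```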

    Empty Elements

    As discussed earlier, it’s occasionally useful to define an element that has no content. Examples in HTML include the image, horizontal rule, and break tags <IMG>, <HR>, and <BR>. In XML, such empty elements are identified by empty tags that end with />, such as <IMG/>, <HR/>, and <BR/>.

    Valid documents must declare both the empty and nonempty elements used. Because empty elements by definition don’t have children, they’re easy to declare. Use an <!ELEMENT> declaration containing the name of the empty element as normal, but use the keyword EMPTY (case-sensitive as all XML tags are) instead of a list of children. For example:
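For the three HTML-style elements mentioned above, the declarations could be sketched as:

```xml
<!-- EMPTY replaces the parenthesized content model -->
<!ELEMENT BR EMPTY>
<!ELEMENT IMG EMPTY>
<!ELEMENT HR EMPTY>
```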



    Listing 8-11 is a valid document that uses both empty and nonempty elements.

    Listing 8-11: A valid document that uses empty tags





    ]>

    Empty Tags

    1998 Elliotte Rusty Harold
    [email protected]
    Thursday, April 22, 1999

    Comments in DTDs

    DTDs can contain comments, just like the rest of an XML document. These comments cannot appear inside a declaration, but they can appear outside one. Comments are often used to organize the DTD in different parts, to document the allowed content of particular elements, and to further explain what an element is. For example, the element declaration for the YEAR element might have a comment such as this:

    As with all comments, this is only for the benefit of people reading the source code. XML processors will ignore it.

    One possible use of comments is to define abbreviations used in the markup. For example, in this and previous chapters, I’ve avoided using abbreviations for baseball terms because they’re simply not obvious to the casual fan. An alternative approach is to use abbreviations but define them with comments in the DTD. Listing 8-12 is similar to previous baseball examples, but uses DTD comments and abbreviated tags.

    Listing 8-12: A valid XML document that uses abbreviated tags defined in DTD comments
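A sketch of a commented declaration follows; the wording of the comment is only an illustration:

```xml
<!-- A four-digit year such as 1998, 1999, or 2000 -->
<!ELEMENT YEAR (#PCDATA)>
```

Note that the comment sits before the declaration, not inside the angle brackets of the declaration itself.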



























































    ]>

    1998

    National

    East

    Atlanta Braves

    Ozzie Guillen

    Shortstop

    83 59 264 35 73


    15 1 1 22 1 4 4 2 6 24 25 1

    Florida Marlins

    Montreal Expos

    New York Mets

    Philadelphia Phillies

    Central

    Chicago Cubs

    West

    Arizona Diamondbacks



    American

    East

    Baltimore Orioles

    Central

    Chicago White Sox

    West

    Anaheim Angels



    When the entire Major League is included, the resulting document shrinks from 699K with long tags to 391K with short tags, a savings of 44 percent. The information content, however, is virtually the same. Consequently, the compressed sizes of the two documents are much closer, 58K for the document with short tags versus 66K for the document with long tags.

    There’s no limit to the amount of information you can or should include in comments. Including more does make your DTDs longer (and thus both harder to scan and slower to download). However, in the next couple of chapters, you’ll learn ways to reuse the same DTD in multiple XML documents, as well as break long DTDs into more manageable pieces. Thus, the disadvantages of using comments are temporary. I recommend using comments liberally in all of your DTDs, but especially in those intended for public use.

    Sharing Common DTDs Among Documents

    Previous valid examples included the DTD in the document’s prolog. The real power of XML, however, comes from common DTDs that can be shared among many documents written by different people. If the DTD is not directly included in the document but is linked in from an external source, changes made to the DTD automatically propagate to all documents using that DTD. On the other hand, backward compatibility is not guaranteed when a DTD is modified. Incompatible changes can break documents.

    When you use an external DTD, the document type declaration changes. Instead of including the DTD in square brackets, the SYSTEM keyword is followed by an absolute or relative URL where the DTD can be found. For example:
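The general form can be sketched with two placeholders, root_element_name and DTD_URL, which stand in for real values:

```xml
<!-- Template only: substitute a real root element name and URL -->
<!DOCTYPE root_element_name SYSTEM "DTD_URL">
```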

    Here root_element_name is simply the name of the root element as before, SYSTEM is an XML keyword, and DTD_URL is a relative or an absolute URL where the DTD can be found. For example:

    Let’s convert a familiar example to demonstrate this process. Listing 8-12 includes an internal DTD for baseball statistics. We’ll convert this listing to use an external DTD. First, strip out the DTD and put it in a file of its own. This is everything between the opening <!DOCTYPE SEASON [ and the closing ]>, exclusive; the <!DOCTYPE SEASON [ and ]> delimiters themselves are not included. This can be saved in a file called baseball.dtd, as shown in Listing 8-13. The file name is not important, though the extension .dtd is conventional.
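With the placeholders filled in for the baseball example, a declaration might look like the following sketch. The URL is the one this chapter uses for the baseball DTD:

```xml
<!-- SEASON is the root element; the DTD lives at an absolute URL -->
<!DOCTYPE SEASON SYSTEM "http://metalab.unc.edu/xml/dtds/baseball.dtd">
```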

    Listing 8-13: The baseball DTD file



























































    Next, you need to modify the document itself. The document is no longer a stand-alone document because it depends on a DTD in another file. Therefore, the standalone attribute of the XML declaration must be changed to no, as follows:
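Assuming XML version 1.0, the revised XML declaration can be sketched as:

```xml
<?xml version="1.0" standalone="no"?>
```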

    Then you must change the <!DOCTYPE> tag so it points to the DTD by including the SYSTEM keyword and a URL (usually relative) where the DTD is found:

    The rest of the document is the same as before. However, now the prolog contains only the XML declaration and the document type declaration. It does not contain the DTD. Listing 8-14 shows the code.
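With baseball.dtd stored in the same directory as the document, a relative URL is enough. A sketch:

```xml
<!-- Relative URL: baseball.dtd is resolved against the document's own location -->
<!DOCTYPE SEASON SYSTEM "baseball.dtd">
```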

    Listing 8-14: Baseball statistics with an external DTD

    1998

    National

    East

    Atlanta Braves


    Ozzie Guillen

    Shortstop

    83 59 264 35 73 15 1 1 22 1 4 4 2 6 24 25 1

    Florida Marlins

    Montreal Expos

    New York Mets

    Philadelphia Phillies

    Central

    Chicago Cubs

    West

    Arizona Diamondbacks



    American

    East

    Baltimore Orioles

    Central

    Chicago White Sox

    West

    Anaheim Angels



    Make sure that both Listing 8-14 and baseball.dtd are in the same directory and then load Listing 8-14 into your Web browser as usual. If all is well, you see the same output as when you loaded Listing 8-12. You can now use this same DTD to describe other documents, such as statistics from other years.

    Once you add a style sheet, you have the three essential parts of the document stored in three different files. The data is in the document file, the structure and semantics applied to the data are in the DTD file, and the formatting is in the style sheet. This structure enables you to inspect or change any or all of these relatively independently.

    The DTD and the document are more closely linked than the document and the style sheet. Changing the DTD generally requires revalidating the document and may require edits to the document to bring it back into conformance with the DTD. Whether such edits are necessary depends on your changes; adding elements is rarely an issue, though removing elements may be problematic.

    DTDs at Remote URLs

    If a DTD is applied to multiple documents, you cannot always put the DTD in the same directory as each document for which it is used. Instead, you can use a URL to specify precisely where the DTD is found. For example, let’s suppose the baseball DTD is found at http://metalab.unc.edu/xml/dtds/baseball.dtd. You can link to it by using the following <!DOCTYPE> tag in the prolog:

    This example uses a full URL valid from anywhere. You may also wish to locate DTDs relative to the Web server’s document root or the current directory. In general, any reference that forms a valid URL relative to the location of the document is acceptable. For example, these are all valid document type declarations:
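Shown in the context of the whole prolog, and assuming XML version 1.0, the link might be sketched as:

```xml
<?xml version="1.0" standalone="no"?>
<!-- The DTD is fetched from the absolute URL, wherever the document is stored -->
<!DOCTYPE SEASON SYSTEM "http://metalab.unc.edu/xml/dtds/baseball.dtd">
```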
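The sketch below lists three alternative relative forms. The directory layouts are hypothetical, and a real document would contain only one of these declarations:

```xml
<!-- Alternative 1: path relative to the Web server's document root -->
<!DOCTYPE SEASON SYSTEM "/xml/dtds/baseball.dtd">

<!-- Alternative 2: subdirectory of the document's own directory -->
<!DOCTYPE SEASON SYSTEM "dtds/baseball.dtd">

<!-- Alternative 3: parent of the document's own directory -->
<!DOCTYPE SEASON SYSTEM "../baseball.dtd">
```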



    Note

    A document can’t have more than one document type declaration, that is, more than one <!DOCTYPE> tag. To use elements declared in more than one DTD, you need to use external parameter entity references. These are discussed in the next chapter.

    Public DTDs

    The SYSTEM keyword is intended for private DTDs used by a single author or group. Part of the promise of XML, however, is that broader organizations covering an entire industry, such as the ISO or the IEEE, can standardize public DTDs to cover their fields. This standardization saves people from having to reinvent tag sets for the same items and makes it easier for users to exchange interoperable documents.

    DTDs designed for writers outside the creating organization use the PUBLIC keyword instead of the SYSTEM keyword. Furthermore, the DTD gets a name. The syntax follows:
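In template form, with root_element_name, DTD_name, and DTD_URL as placeholders to be replaced by real values, the syntax can be sketched as:

```xml
<!-- Template only: the public name comes first, then the fallback URL -->
<!DOCTYPE root_element_name PUBLIC "DTD_name" "DTD_URL">
```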


    Once again, root_element_name is the name of the root element. PUBLIC is an XML keyword indicating that this DTD is intended for broad use and has a name. DTD_name is the name associated with this DTD. Some XML processors may attempt to use this name to retrieve the DTD from a central repository. Finally, DTD_URL is a relative or absolute URL where the DTD can be found if it cannot be retrieved by name from a well-known repository.

    DTD names are slightly different from XML names. They may contain only the ASCII alphanumeric characters, the space, the carriage return, the linefeed characters, and the following punctuation marks: -'()+,./:=?;!*#@$_%

    Furthermore, the names of public DTDs follow a few conventions. If a DTD is an ISO standard, its name begins with the string “ISO.” If a non-ISO standards body has approved the DTD, its name begins with a plus sign (+). If no standards body has approved the DTD, its name begins with a hyphen (-). These initial strings are followed by a double slash (//) and the name of the DTD’s owner, which is followed by another double slash and the type of document the DTD describes. Then there’s another double slash followed by an ISO 639 language identifier, such as EN for English. A complete list of ISO 639 identifiers is available from http://www.ics.uci.edu/pub/ietf/http/related/iso639.txt. For example, the baseball DTD can be named as follows:

    -//Elliotte Rusty Harold//DTD baseball statistics//EN

    This example says this DTD is not standards-body approved (-), belongs to Elliotte Rusty Harold, describes baseball statistics, and is written in English. A full document type declaration pointing to this DTD with this name follows:
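Combining that public name with the URL this chapter uses for the baseball DTD gives a declaration along these lines (a sketch, not necessarily the author's exact formatting):

```xml
<!DOCTYPE SEASON PUBLIC "-//Elliotte Rusty Harold//DTD baseball statistics//EN"
  "http://metalab.unc.edu/xml/dtds/baseball.dtd">
```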

    You may have noticed that many HTML editors such as BBEdit automatically place the following string at the beginning of every HTML file they create:
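The exact string varies by editor and HTML version. A representative form, assumed here because it matches the naming conventions just described, is:

```xml
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML//EN">
```

This is SGML-style HTML syntax, which allows the URL after the public name to be omitted; an XML document type declaration that uses PUBLIC must also supply the URL.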

    Now you know what this string means! It says the document follows a non-standards-body-approved (-) DTD for HTML produced by the W3C in the English language.

    Note

    Technically the W3C is not a standards organization because its membership is limited to corporations that pay its fees rather than to official government-approved bodies. It only publishes recommendations instead of standards. In practice, the distinction is irrelevant.


    Internal and External DTD Subsets

    Although most documents consist of easily defined pieces, not all documents use a common template. Many documents may need to use standard DTDs such as the HTML 4.0 DTD while adding custom elements for their own use. Other documents may use only standard elements, but need to reorder them. For instance, one HTML page may have a BODY that must contain exactly one H1 header followed by a DL definition list while another may have a BODY that contains many different headers, paragraphs, and images in no particular order. If a particular document has a different structure than other pages on the site, it can be useful to define its structure in the document itself rather than in a separate DTD. This approach also makes the document easier to edit.

    To this end, a document can use both an internal and an external DTD. The internal declarations go inside square brackets at the end of the <!DOCTYPE> tag. For example, suppose you want a page that includes baseball statistics but also has a header and a footer. Such a document might look like Listing 8-15. The baseball information is pulled from the file baseball.dtd, which forms the external DTD subset. The definitions of the root element DOCUMENT as well as the TITLE and SIGNATURE elements come from the internal DTD subset included in the document itself.

    This is a little unusual. More commonly, the more generic pieces are likely to be part of an external DTD while the internal pieces are more topic-specific.
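The prolog of such a document can be sketched as follows. The content model chosen for DOCUMENT and the children of SIGNATURE are assumptions made for illustration:

```xml
<?xml version="1.0" standalone="no"?>
<!-- External subset: baseball.dtd. Internal subset: everything in the brackets -->
<!DOCTYPE DOCUMENT SYSTEM "baseball.dtd" [
  <!ELEMENT DOCUMENT (TITLE, SEASON, SIGNATURE)>
  <!ELEMENT TITLE (#PCDATA)>
  <!ELEMENT COPYRIGHT (#PCDATA)>
  <!ELEMENT EMAIL (#PCDATA)>
  <!ELEMENT LAST_MODIFIED (#PCDATA)>
  <!ELEMENT SIGNATURE (COPYRIGHT, EMAIL, LAST_MODIFIED)>
]>
```

SEASON and its descendants are not declared here because they come from the external subset.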

    Listing 8-15: A baseball document whose DTD has both an internal and an external subset



    ]>

    1998 Major League Baseball Statistics

    1998

    National

    East

    Atlanta Braves

    Florida Marlins

    Montreal Expos

    New York Mets

    Philadelphia Phillies

    Central

    Chicago Cubs

    West

    Arizona Diamondbacks



    American

    East

    Baltimore Orioles

    Central

    Chicago


    White Sox

    West

    Anaheim Angels



    Copyright 1999 Elliotte Rusty Harold [email protected] March 10, 1999

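    In the event of a conflict between declarations of the same name in the internal and external DTD subsets, the declarations in the internal subset take precedence. This precedence provides a crude, partial inheritance mechanism. For example, suppose you want to override the definition of a PLAYER element so that it can contain only batting statistics while disallowing pitching statistics. You could use most of the same declarations in the baseball DTD, changing only the PLAYER declaration in the internal subset. Be aware that XML 1.0 defines this precedence for entity and attribute-list declarations, where the first declaration read is binding and the internal subset is read first. A validating parser treats a second declaration of the same element type as a validity error, so the external DTD must either leave PLAYER for the document to declare or expose its content model through a parameter entity, a technique covered in the next chapter.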

    Summary

    In this chapter, you learned how to use a DTD to describe the structure of a document, that is, both the required and optional elements it contains and how those elements relate to one another. In particular you learned:

    ✦ A document type definition (DTD) provides a list of the elements, tags, attributes, and entities contained in the document, and their relationships to one another.

    ✦ A document’s prolog may contain a document type declaration that specifies the root element and contains a DTD. This is placed between the XML declaration and the start of the actual document. It is delimited by <!DOCTYPE ROOT [ and ]>, where ROOT is the name of the root element.


    ✦ DTDs lay out the permissible tags and the structure of a document. A document that adheres to the rules of its DTD is said to be valid.

    ✦ Element type declarations declare the name and children of an element.

    ✦ Children separated by commas in an element type declaration must appear in the same order in that element inside the document.

    ✦ A plus sign means one or more instances of the element may appear.

    ✦ An asterisk means zero or more instances of the element may appear.

    ✦ A question mark means zero or one instances of the child may appear.

    ✦ A vertical bar means one element or another is to be used.

    ✦ Parentheses group child elements to allow for more detailed element declarations.

    ✦ Mixed content contains both elements and parsed character data but limits the structure you can impose on the parent element.

    ✦ Empty elements are declared with the EMPTY keyword.

    ✦ Comments make DTDs much more legible.

    ✦ External DTDs can be located using the SYSTEM keyword and a URL in the document type declaration.

    ✦ Standard DTDs can be located using the PUBLIC keyword in the document type declaration.

    ✦ Declarations in the internal DTD subset override conflicting declarations in the external DTD subset.

    In the next chapter, you learn more about DTDs, including how entity references provide replacement text and how to separate DTDs from the documents they describe so they can be easily shared between documents. You also learn how to use multiple DTDs to describe a single document.







    9

    C H A P T E R

    Entities and External DTD Subsets









    In This Chapter

    What is an entity?
    Internal general entities
    External general entities
    Internal parameter entities
    External parameter entities
    How to build a document from pieces
    Entities and DTDs in well-formed documents

    A single XML document may draw both data and declarations from many different sources, in many different files. In fact, some of the data may draw directly from databases, CGI scripts, or other non-file sources. The items where the pieces of an XML file are stored, in whatever form they take, are called entities. Entity references load these entities into the main XML document. General entity references load data into the root element of an XML document, while parameter entity references load data into the document’s DTD.

    What Is an Entity?

    Logically speaking, an XML document is composed of a prolog followed by a root element which strictly contains all other elements. But in practice, the actual data of an XML document can spread across multiple files. For example, each PLAYER element might appear in a separate file even though the root element contains all 900 or so players in a league. The storage units that contain particular parts of an XML document are called entities. An entity may consist of a file, a database record, or any other item that contains data. For example, all the complete XML files and style sheets in this book are entities.

    The storage unit that contains the XML declaration, the document type declaration, and the root element is called the document entity. However, the root element and its descendants may also contain entity references pointing to additional data that should be inserted into the document. A validating XML processor combines all the different referenced entities into a single logical document before it passes the document on to the end application or displays the file.










    Part II ✦ Document Type Definition

    Note

    Non-validating processors may, but do not have to, insert external entities. They must insert internal entities.

    The primary purpo se o f an entity is to ho ld c o ntent: well-fo rmed XML, o ther fo rms o f text, o r binary data. The pro lo g and the do c ument type dec laratio n are part o f the ro o t entity o f the do c ument they belo ng to . An XSL style sheet qualifies as an entity, but o nly bec ause it itself is a well-fo rmed XML do c ument. The entity that makes up the style sheet is no t o ne o f the entities that c o mpo ses the XML do c ument to whic h the style sheet applies. A CSS style sheet is no t an entity at all. Mo st entities have names by whic h yo u c an refer to them. The o nly exc eptio n is the do c ument entity-the main file c o ntaining the XML do c ument (altho ugh there’s no requirement that this be a file as o ppo sed to a database rec o rd, the o utput o f a CGI pro gram, o r so mething else). The do c ument entity is the sto rage unit, in whatever fo rm it takes, that ho lds the XML dec laratio n, the do c ument type dec laratio n (if any), and the ro o t element. Thus, every XML do c ument has at least o ne entity. There are two kinds o f entities: internal and external. Internal entities are defined c o mpletely within the do c ument entity. The do c ument itself is o ne suc h entity, so all XML do c uments have at least o ne internal entity. External entities, by c o ntrast, draw their c o ntent fro m ano ther so urc e lo c ated via a URL. The main do c ument o nly inc ludes a referenc e to the URL where the ac tual c o ntent resides. In HTML, an IMG element represents an external entity (the ac tual image data) while the do c ument itself c o ntained between the and tags is an internal entity. Entities fall into two c atego ries: parsed and unparsed. Parsed entities c o ntain wellfo rmed XML text. Unparsed entities c o ntain either binary data o r no n-XML text (like an email message). Currently, unparsed entities aren’t well suppo rted (if at all) by mo st XML pro c esso rs. In this c hapter, we fo c us o n parsed entities. 
CrossReference

    Chapter 11, Embedding Non-XML Data , covers unparsed entities.

    Entity references enable data from multiple entities to merge together to form a single document. General entity references merge data into the document content. Parameter entity references merge declarations into the document’s DTD. &lt;, &gt;, &apos;, &quot;, and &amp; are predefined entity references that refer to the text entities <, >, ', ", and &, respectively. However, you can also define new entities in your document’s DTD.

    Chapter 9 ✦ Entities and External DTD Subsets

    Internal General Entities

    You can think of an internal general entity reference as an abbreviation for commonly used text or text that’s hard to type. An <!ENTITY> tag in the DTD defines an abbreviation and the text the abbreviation stands for. For instance, instead of typing the same footer at the bottom of every page, you can simply define that text as the footer entity in the DTD and then type &footer; at the bottom of each page. Furthermore, if you decide to change the footer block (perhaps because your email address changes), you only need to make the change once in the DTD instead of on every page that shares the footer.

    General entity references begin with an ampersand (&) and end with a semicolon (;), with the entity’s name between these two characters. For instance, &lt; is a general entity reference for the less than sign (<).

    Listing 9-1 demonstrates the &ERH; general entity reference. Figure 9-1 shows this document loaded into Internet Explorer. You see that the &ERH; entity reference in the source code is replaced by Elliotte Rusty Harold in the output.
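    The footer scenario described above can be sketched as a complete document. This is only an illustration: the PAGE root element and the footer text are assumptions made up for the example, not taken from any listing in this chapter.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE PAGE [
  <!-- Declares the abbreviation "footer" and the text it stands for -->
  <!ENTITY footer "Copyright 1999 Elliotte Rusty Harold">
  <!ELEMENT PAGE (#PCDATA)>
]>
<!-- The parser replaces the reference with the declared text -->
<PAGE>Page text goes here. &footer;</PAGE>
```

    Changing the footer on every page that shares this DTD then means editing only the one ENTITY declaration.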


    Part II ✦ Document Type Definition

    Listing 9-1: The ERH internal general entity reference

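    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE DOCUMENT [
      <!ENTITY ERH "Elliotte Rusty Harold">
      <!ELEMENT DOCUMENT (TITLE, SIGNATURE)>
      <!ELEMENT TITLE (#PCDATA)>
      <!ELEMENT COPYRIGHT (#PCDATA)>
      <!ELEMENT EMAIL (#PCDATA)>
      <!ELEMENT LAST_MODIFIED (#PCDATA)>
      <!ELEMENT SIGNATURE (COPYRIGHT, EMAIL, LAST_MODIFIED)>
    ]>
    <DOCUMENT>
      <TITLE>&ERH;</TITLE>
      <SIGNATURE>
        <COPYRIGHT>1999 &ERH;</COPYRIGHT>
        <EMAIL>[email protected]</EMAIL>
        <LAST_MODIFIED>March 10, 1999</LAST_MODIFIED>
      </SIGNATURE>
    </DOCUMENT>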

    Figure 9-1: Listing 9-1 after the internal general entity reference has been replaced by the actual entity

    Notice that the general entity reference &ERH; appears inside both the COPYRIGHT and TITLE elements even though these are declared to accept only #PCDATA as children. This arrangement is legal because the replacement text of the &ERH; entity reference is parsed character data. Validation is done against the document after all entity references have been replaced by their values.


    The same thing occurs when you use a style sheet. The styles are applied to the element tree as it exists after entity values replace the entity references. You can follow the same model to declare general entity references for the copyright, the email address, or the last modified date:



    I omitted the date in the &LM; entity because it’s likely to change from document to document. There is no advantage to making it an entity reference. Now you can rewrite the document part of Listing 9-1 even more compactly:

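    <DOCUMENT>
      <TITLE>&ERH;</TITLE>
      <SIGNATURE>
        <COPYRIGHT>&COPY99; &ERH;</COPYRIGHT>
        <EMAIL>&EMAIL;</EMAIL>
        <LAST_MODIFIED>&LM; March 10, 1999</LAST_MODIFIED>
      </SIGNATURE>
    </DOCUMENT>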
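    For the compact document above to parse, the DTD needs a declaration for each of the three additional entities. A minimal sketch follows; the replacement texts, including the placeholder address author@example.com, are assumptions chosen for illustration.

```xml
<!-- Added to the internal DTD subset next to the ERH declaration -->
<!ENTITY COPY99 "Copyright 1999">
<!ENTITY EMAIL  "author@example.com">
<!-- The date is left out of LM because it differs from document to document -->
<!ENTITY LM     "Last modified:">
```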

    One of the advantages of using entity references instead of the full text is that these references make it easy to change the text. This is especially useful when a single DTD is shared between multiple documents. (You’ll learn this skill in the section on sharing common DTDs among documents.) For example, suppose I decide to use the email address [email protected] instead of [email protected]. Rather than searching and replacing through multiple files, I simply change one line of the DTD as follows:

    Using General Entity References in the DTD

    You may wonder whether it’s possible to include one general entity reference inside another as follows:

    This example is in fact valid, because the ERH entity appears as part of the COPY99 entity that itself will ultimately become part of the document’s content. You can also use general entity references in other places in the DTD that ultimately become part of the document content (such as a default attribute value), although there are restrictions. The first restriction: The statement cannot use a circular reference like this one:
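    As a sketch of the difference between legal nesting and a circular reference, the document below nests ERH inside COPY99 in one direction only. The COPYRIGHT root element and the exact replacement texts are assumptions for the example; the circular variant is shown only inside a comment because a parser would reject it.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE COPYRIGHT [
  <!ENTITY ERH    "Elliotte Rusty Harold">
  <!-- Legal nesting: COPY99 uses ERH, and ERH does not refer back -->
  <!ENTITY COPY99 "Copyright 1999 &ERH;">
  <!-- Circular variant (an error): declaring ERH as
       "&COPY99; Elliotte Rusty Harold" while COPY99 still uses ERH -->
  <!ELEMENT COPYRIGHT (#PCDATA)>
]>
<COPYRIGHT>&COPY99;</COPYRIGHT>
```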




    The second restriction: General entity references may not insert text that is only part of the DTD and will not be used as part of the document content. For example, the following attempted shortcut fails:
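    A sketch of this kind of failed shortcut, assuming an entity named PCD and an ANIMAL element chosen for illustration. The document is deliberately broken: the general entity reference sits where the content model of a declaration is expected, so a parser rejects it.

```xml
<?xml version="1.0" standalone="yes"?>
<!DOCTYPE ANIMAL [
  <!ENTITY PCD "(#PCDATA)">
  <!-- Error: a general entity reference cannot supply part of a declaration -->
  <!ELEMENT ANIMAL &PCD;>
]>
<ANIMAL>Cat</ANIMAL>
```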



    It’s often useful, however, to have entity references merge text into a document’s DTD. For this purpose, XML uses the parameter entity reference, which is discussed later in this chapter.

    The only restriction on general entity values is that they may not contain the three characters %, &, and " directly, though you can include them via character references. & and % may be included if they’re starting an entity reference rather than simply representing themselves. The lack of restrictions means that an entity may contain tags and span multiple lines. For example, the following SIGNATURE entity is valid:

    <!ENTITY SIGNATURE
     "<SIGNATURE>
        <COPYRIGHT>1999 Elliotte Rusty Harold</COPYRIGHT>
        <EMAIL>[email protected]</EMAIL>
        <LAST_MODIFIED>March 10, 1999</LAST_MODIFIED>
      </SIGNATURE>"
    >

    The next obvious question is whether it’s possible for entities to have parameters. Can you use the above SIGNATURE entity but change the date in each separate LAST_MODIFIED element on each page? The answer is no; entities are only for static replacement text. If you need to pass data to an entity, you should use a tag along with the appropriate rendering instructions in the style sheet instead.

    Predefined General Entity References

    XML predefines five general entity references, as listed in Table 9-1. These five entity references appear in XML documents in place of specific characters that would otherwise be interpreted as markup. For instance, the entity reference &lt; stands for the less-than sign (<), which could otherwise be interpreted as the beginning of a tag.

    Table 9-1

    Entity Reference    Character
    &amp;               &
    &lt;                <
    &gt;                >
    &quot;              "
    &apos;              '

    Listing 9-2: Declarations for predefined general entity references

    <!ENTITY lt   "&#38;#60;">
    <!ENTITY gt   "&#62;">
    <!ENTITY amp  "&#38;#38;">
    <!ENTITY apos "&#39;">
    <!ENTITY quot "&#34;">
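    As a small sketch of the predefined references in use, the following document carries a line of program code whose markup-significant characters are all escaped. The CODE element name and the code fragment itself are assumptions for illustration.

```xml
<?xml version="1.0" standalone="yes"?>
<!-- Without the references, < and & would be read as markup -->
<CODE>if (a &lt; b &amp;&amp; c &gt; d) print(&quot;it&apos;s true&quot;);</CODE>
```

    After parsing, the application sees the original characters: the less-than sign, the two ampersands, the greater-than sign, and the quotation marks.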

    External General Entities

    External entities are data outside the main file containing the root element/document entity. External entity references let you embed these external entities in your document and build XML documents from multiple independent files.

    Documents using only internal entities closely resemble the HTML model. The complete text of the document is available in a single file. Images, applets, sounds, and other non-HTML data may be linked in, but at least all the text is present. Of course, the HTML model has some problems. In particular, it’s quite difficult to embed dynamic information in the file. You can embed dynamic information by using CGI, Java applets, fancy database software, server side includes, and various other means, but HTML alone only provides a static document. You have to go outside HTML to build a document from multiple pieces. Frames are perhaps the simplest HTML solution to this problem, but they are a user interface disaster that consistently confuses and annoys users.

    Part of the problem is that one HTML document does not naturally fit inside another. Every HTML document should have exactly one BODY, and no more. Server side includes only enable you to embed fragments of HTML (never an entire valid document) inside a document. In addition, server side includes are server dependent and not truly part of HTML.


    XML, however, is more flexible. One document’s root element is not necessarily the same as another document’s root element. Even if two documents share the same root element, the DTD may declare that the element is allowed to contain itself. The XML standard does not prevent well-formed XML documents from being embedded in other well-formed XML documents when convenient.

    XML goes further, however, by defining a mechanism whereby an XML document can be built out of multiple smaller XML documents found either on local or remote systems. The parser is responsible for merging all the different documents together in a fixed order. Documents may contain other documents, which may contain other documents. As long as there’s no recursion (an error reported by the processor), the application only sees a single, complete document. In essence, this provides client-side includes.

    With XML, you can use an external general entity reference to embed one document in another. In the DTD, you declare the external reference with the following syntax:

    <!ENTITY name SYSTEM "URI">

    Note

    URI stands for Uniform Resource Identifier. URIs are similar to URLs but allow for more precise specification of the linked resource. In theory, URIs separate the resource from the location so a Web browser can select the nearest or least congested of several mirrors without requiring an explicit link to that mirror. URIs are an area of active research and heated debate. Therefore, in practice and certainly in this book, URIs are URLs for all purposes.

    For example, you may want to put the same signature block on almost every page of a site. For the sake of definiteness, let’s assume the signature block is the XML code shown in Listing 9-3. Furthermore, let’s assume that you can retrieve this code from the URL http://metalab.unc.edu/xml/signature.xml.

    Listing 9-3: An XML signature file

    <SIGNATURE>
      <COPYRIGHT>1999 Elliotte Rusty Harold</COPYRIGHT>
      <EMAIL>[email protected]</EMAIL>
    </SIGNATURE>


    Associate this file with the entity reference &SIG; by adding the following declaration to the DTD:

    You can also use a relative URL. For example,

    If the file to be included is in the same directory as the file doing the including, you only need to use the file name. For example,
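    The three ways of locating the signature file can be sketched side by side. The absolute URL is the one given above; the server-relative path in the second form is an assumption derived from that URL. Only one of the three declarations would appear in a real DTD.

```xml
<!-- Use exactly one of these declarations inside the DTD -->

<!-- 1. Absolute URL -->
<!ENTITY SIG SYSTEM "http://metalab.unc.edu/xml/signature.xml">

<!-- 2. Relative URL (path assumed from the absolute URL above) -->
<!ENTITY SIG SYSTEM "/xml/signature.xml">

<!-- 3. File in the same directory as the including document -->
<!ENTITY SIG SYSTEM "signature.xml">
```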

    With any of these declarations, you can include the contents of the signature file in a document at any point merely by using &SIG;, as illustrated with the simple document in Listing 9-4. Figure 9-2 shows the rendered document in Internet Explorer 5.0.

    Listing 9-4: The SIG external general entity reference



    ]>

    Entity references &SIG;

    Aside from the addition of the external entity reference, note that the standalone attribute of the XML declaration now has the value no because this file is no longer complete. Parsing the file requires additional data from the external file signature.xml.


    Figure 9-2: A document that uses an external general entity reference.

    Internal Parameter Entities

    General entities become part of the document, not the DTD. They can be used in the DTD but only in places where they will become part of the document body. General entity references may not insert text that is only part of the DTD and will not be used as part of the document content. It’s often useful, however, to have entity references in a DTD. For this purpose, XML provides the parameter entity reference. Parameter entity references are very similar to general entity references, with these two key differences:

    1. Parameter entity references begin with a percent sign (%) rather than an ampersand (&).

    2. Parameter entity references can only appear in the DTD, not the document content.

    Parameter entities are declared in the DTD like general entities with the addition of a percent sign before the name. The syntax looks like this:

    <!ENTITY % name "replacement text">

    The name is the abbreviation for the entity. The reader sees the replacement text, which must appear in quotes. For example:
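    A minimal sketch of a parameter entity declaration and a reference to it, reusing the ERH and COPY99 names from earlier in the chapter. The replacement texts are assumptions. Note the space between the percent sign and the name in the declaration, and the absence of a space in the reference.

```xml
<!-- Intended for an external DTD subset, where a parameter entity
     reference may appear inside another declaration -->
<!ENTITY % ERH "Elliotte Rusty Harold">
<!ENTITY COPY99 "Copyright 1999 %ERH;">
```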




    Our earlier failed attempt to abbreviate (#PCDATA) works when a parameter entity reference replaces the general entity reference:
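    In sketch form, assuming the abbreviation is named PCD and is applied to an ANIMAL element (both names are illustrative):

```xml
<!-- Declare the parameter entity first, then reference it with % and ; -->
<!ENTITY % PCD "(#PCDATA)">
<!ELEMENT ANIMAL %PCD;>
```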



    The real value of parameter entity references appears in sharing common lists of children and attributes between elements. The larger the block of text you’re replacing and the more times you use it, the more useful parameter entity references become. For instance, suppose your DTD declares a number of block level container elements like PARAGRAPH, CELL, and HEADING. Each of these container elements may contain an indefinite number of inline elements like PERSON, DEGREE, MODEL, PRODUCT, ANIMAL, INGREDIENT, and so forth. The element declarations for the container elements could appear as the following:



    The container elements all have the same contents. If you invent a new element like EQUATION, CD, or ACCOUNT, this element must be declared as a possible child of all three container elements. Adding it to two, but forgetting to add it to the third element, may cause trouble. This problem multiplies when you have 30 or 300 container elements instead of three. The DTD is much easier to maintain if you don’t give each container a separate child list. Instead, make the child list a parameter entity reference; then use that parameter entity reference in each of the container element declarations. For example:
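    A sketch of the shared child list, using the container and inline element names mentioned above. The entity name inlines is an assumption, and the declarations of the inline elements themselves are omitted.

```xml
<!-- One shared child list instead of three identical copies -->
<!ENTITY % inlines
  "(PERSON | DEGREE | MODEL | PRODUCT | ANIMAL | INGREDIENT)*">

<!ELEMENT PARAGRAPH %inlines;>
<!ELEMENT CELL      %inlines;>
<!ELEMENT HEADING   %inlines;>
```

    Allowing EQUATION in every container then means adding it once, inside the inlines declaration.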



    To add a new element, you only have to change a single parameter entity declaration, rather than three, 30, or 300 element declarations. Parameter entity references must be declared before they’re used. The following example is invalid because the %PCD; reference is not declared until it’s already been used twice:
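    A sketch of the ordering problem, with FOOD and ANIMAL as assumed element names. The fragment is deliberately wrong: both references come before the declaration they depend on.

```xml
<!-- Invalid order: %PCD; is referenced twice before it is declared -->
<!ELEMENT FOOD   %PCD;>
<!ELEMENT ANIMAL %PCD;>
<!ENTITY % PCD "(#PCDATA)">
```

    Moving the ENTITY declaration above the two ELEMENT declarations repairs the fragment.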




    Parameter entities can only be used to provide part of a declaration in the external DTD subset. That is, parameter entity references can only appear inside a declaration in the external DTD subset. The above examples are all invalid if they’re used in an internal DTD subset. In the internal DTD subset, parameter entity references can only be used outside of declarations. For example, the following is valid in both the internal and external DTD subsets:

    <!ENTITY % hr "<!ELEMENT HR EMPTY>">
    %hr;

    Of course, this really isn’t any easier than declaring the HR element without parameter entity references:

    <!ELEMENT HR EMPTY>

    You’ll mainly use parameter entity references in internal DTD subsets when they’re referring to external parameter entities; that is, when they’re pulling in declarations or parts of declarations from a different file. This is the subject of the next section.

    External Parameter Entities

    The preceding examples used monolithic DTDs that define all the elements used in the document. This technique becomes unwieldy with longer documents, however. Furthermore, you often want to use part of a DTD in many different places. For example, consider a DTD that describes a snail mail address. The definition of an address is quite general, and can easily be used in many different contexts. Similarly, the list of predefined entity references in Listing 9-2 is useful in most XML files, but you’d rather not copy and paste it all the time.

    External parameter entities enable you to build large DTDs from smaller ones. That is, one external DTD may link to another and in so doing pull in the elements and entities declared in the first. Although cycles are prohibited (DTD 1 may not refer to DTD 2 if DTD 2 refers to DTD 1), such nested DTDs can become large and complex. At the same time, breaking a DTD into smaller, more manageable chunks makes the DTD easier to analyze. Many of the examples in the last chapter were unnecessarily large because an entire document and its complete DTD were stored in a single file. Both the document and its DTD become much easier to understand when split into separate files.


    Furthermore, using smaller, modular DTDs that only describe one set of elements makes it easier to mix and match DTDs created by different people or organizations. For instance, if you’re writing a technical article about high temperature superconductivity, you can use a molecular sciences DTD to describe the molecules involved, a math DTD to write down your equations, a vector graphics DTD for the figures, and a basic HTML DTD to handle the explanatory text.

    Note

    In particular, you can use the mol.dtd DTD from Peter Murray-Rust’s Chemical Markup Language, the MathML DTD from the W3C’s Mathematical Markup Language, the SVG DTD for the W3C’s Scalable Vector Graphics, and the W3C’s HTML-in-XML DTD.

    You can probably think of more examples where you need to mix and match concepts (and therefore tags) from different fields. Human thought doesn’t restrict itself to narrowly defined categories. It tends to wander all over the map. The documents you write will reflect this.

    Let’s see how to organize the baseball statistics DTD as a combination of several different DTDs. This example is extremely hierarchical. One possible division is to write separate DTDs for PLAYER, TEAM, and SEASON. This is far from the only way to divide the DTD into more manageable chunks, but it will serve as a reasonable example. Listing 9-5 shows a DTD solely for a player that can be stored in a file named player.dtd:

    Listing 9-5: A DTD for the PLAYER element and its children (player.dtd)
























































    By itself, this DTD doesn’t enable you to create very interesting documents. Listing 9-6 shows a simple valid file that only uses the PLAYER DTD in Listing 9-5. By itself, this simple file is not important; however, you can build other, more complicated files out of these small parts.

    Listing 9-6: A valid document using the PLAYER DTD

    Chris Hoiles

    Catcher

    97 81 267 36 70 12 0 15 56 0 1 5 4 3 38 50 4

    What other parts of the document can have their own DTDs? Obviously, a TEAM is a big part. You could write its DTD as follows:




    On closer inspection, however, you should notice that something is missing: the definition of the PLAYER element. The definition is in the separate file player.dtd and needs to be connected to this DTD. You connect DTDs with external parameter entity references. For a private DTD, this connection takes the following form:

    <!ENTITY % name SYSTEM "URI">
    %name;

    For example:

    <!ENTITY % player SYSTEM "player.dtd">
    %player;

    This example uses a relative URL (player.dtd) and assumes that the file player.dtd will be found in the same place as the linking DTD. If that’s not the case, you can use a full URL as follows:

    %player;

    Listing 9-7 shows a completed TEAM DTD that includes a reference to the PLAYER DTD:

    Listing 9-7: The TEAM DTD (team.dtd)



    %player;
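    Putting the pieces together, a TEAM DTD of this kind might look like the following sketch. The TEAM_CITY and TEAM_NAME child elements and the content model of TEAM are assumptions made for illustration; only the PLAYER element and the player.dtd file name come from the surrounding text.

```xml
<!-- team.dtd (sketch): child element names other than PLAYER are assumed -->
<!ELEMENT TEAM (TEAM_CITY, TEAM_NAME, PLAYER*)>
<!ELEMENT TEAM_CITY (#PCDATA)>
<!ELEMENT TEAM_NAME (#PCDATA)>

<!-- Pull in the PLAYER declarations from player.dtd -->
<!ENTITY % player SYSTEM "player.dtd">
%player;
```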

    A SEASON contains LEAGUE, DIVISION, and TEAM elements. Although LEAGUE and DIVISION could each have their own DTD, it doesn’t pay to go overboard with splitting DTDs. Unless you expect you’ll have some documents that contain LEAGUE or DIVISION elements that are not part of a SEASON, you might as well include all three in the same DTD. Listing 9-8 demonstrates.


    Listing 9-8: The SEASON DTD (season.dtd)





    %team;

    Building a Document from Pieces

    The baseball examples have been quite large. Although only a truncated version with limited numbers of players appears in this book, the full document is more than half a megabyte, way too large to comfortably download or search, especially if the reader is only interested in a single team, player, or division. The techniques discussed in the previous section of this chapter allow you to split the document into many different, smaller, more manageable documents, one for each team, player, division, and league. External entity references connect the players to form teams, the teams to form divisions, the divisions to form leagues, and the leagues to form a season.

    Unfortunately, you cannot embed just any XML document as an external parsed entity. Consider, for example, Listing 9-9, ChrisHoiles.xml. This is a revised version of Listing 9-6. However, if you look closely you’ll notice that the prolog is different. Listing 9-6’s prolog is:

    Listing 9-9’s prolog is simply the XML declaration with no standalone attribute and with an encoding attribute. Furthermore, the document type declaration is completely omitted. In a file like Listing 9-9 that’s meant to be embedded in another document, this sort of XML declaration is called a text declaration, though as you can see it’s really just a legal XML declaration.


    Listing 9-9: ChrisHoiles.xml

    Chris Hoiles

    Catcher

    97 81 267 36 70 12 0 15 56 0 1 5 4 3 38 50 4

    On the CD-ROM

    I’ll spare you the other 1,200 or so players, although you’ll find them all on the accompanying CD-ROM in the examples/baseball/players folder.

    Text declarations must have an encoding attribute (unlike XML declarations, which may but do not have to have an encoding attribute) that specifies the character set the entity uses. This allows compound documents to be assembled from entities written in different character sets. For example, a document in Latin-5 might combine with a document in UTF-8. The processor/browser still has to understand all the encodings used by the different entities. The examples in this chapter are all given in ASCII. Since ASCII is a strict subset of both ISO Latin-1 and UTF-8, you could use either of these text declarations:
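    As a sketch, an external parsed entity for one player could begin with a text declaration like the one below. The child element names GIVEN_NAME, SURNAME, and POSITION are assumptions for illustration; only the PLAYER element and the player’s name and position come from the listings in this chapter.

```xml
<?xml version="1.0" encoding="UTF-8"?>
<!-- For ASCII-only content, encoding="ISO-8859-1" would work equally well -->
<PLAYER>
  <GIVEN_NAME>Chris</GIVEN_NAME>
  <SURNAME>Hoiles</SURNAME>
  <POSITION>Catcher</POSITION>
</PLAYER>
```

    There is no standalone attribute and no document type declaration, which is what distinguishes this prolog from that of a stand-alone document.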

    Listing 9-10, mets.dtd, and Listing 9-11, mets.xml, show how you can use external parsed entities to put together a complete team. The DTD defines external entity references for each player on the team. The XML document loads the DTD using an external parameter entity reference in its internal DTD subset. Then, its document entity includes many external general entity references that load in the individual players.

    Listing 9-10: The New York Mets DTD with entity references for players (mets.dtd)

    <!ENTITY ArmandoReynoso SYSTEM "mets/ArmandoReynoso.xml">
    <!ENTITY BobbyJones SYSTEM "mets/BobbyJones.xml">
    <!ENTITY BradClontz SYSTEM "mets/BradClontz.xml">
    <!ENTITY DennisCook SYSTEM "mets/DennisCook.xml">
    <!ENTITY GregMcmichael SYSTEM "mets/GregMcmichael.xml">
    <!ENTITY HideoNomo SYSTEM "mets/HideoNomo.xml">
    <!ENTITY JohnFranco SYSTEM "mets/JohnFranco.xml">
    <!ENTITY JosiasManzanillo SYSTEM "mets/JosiasManzanillo.xml">
    <!ENTITY OctavioDotel SYSTEM "mets/OctavioDotel.xml">
    <!ENTITY RickReed SYSTEM "mets/RickReed.xml">
    <!ENTITY RigoBeltran SYSTEM "mets/RigoBeltran.xml">
    <!ENTITY WillieBlair SYSTEM "mets/WillieBlair.xml">

    Figure 9-3 shows the XML document loaded into Internet Explorer. Notice that all data for all players is present even though the main document only contains references to the entities where the player data resides. Internet Explorer resolves external references; not all XML parsers/browsers do. You can find the remaining teams on the CD-ROM in the directory examples/baseball. Notice in particular how compactly external entity references enable you to embed multiple players.

    Listing 9-11: The New York Mets with players loaded from external entities (mets.xml)

    New York Mets &AlLeiter;


    &ArmandoReynoso; &BobbyJones; &BradClontz; &DennisCook; &GregMcmichael; &HideoNomo; &JohnFranco; &JosiasManzanillo; &OctavioDotel; &RickReed; &RigoBeltran; &WillieBlair;

    Figure 9-3: The XML document displays all players on the 1998 New York Mets.

    It would be nice to continue this procedure, building a division by combining team files, a league by combining divisions, and a season by combining leagues. Unfortunately, if you try this you rapidly run into a wall. The documents embedded via external entities cannot have their own DTDs. At most, their prolog can contain the text declaration. This means you can only have a single level of document embedding. This contrasts with DTD embedding, where DTDs can be nested arbitrarily deeply.

    Thus, your only likely alternative is to include all teams, divisions, leagues, and seasons in a single document which refers to the many different player documents. This requires a few more than 1,200 entity declarations (one for each player). Since DTDs can nest arbitrarily, we begin with a DTD that pulls in DTDs like Listing 9-10 containing entity definitions for all the teams. This is shown in Listing 9-12:


    Listing 9-12: The players DTD (players.dtd)

    %angels;

    %astros;

    %athletics;

    %bluejays;

    %braves;

    %brewers;

    %cubs;

    %devilrays;

    %diamondbacks;

    %dodgers;

    %expos;

    %giants;

    %indians;

    %mariners;

    %marlins;

    %mets;

    %orioles;

    %padres;

    %phillies;

    %pirates;

    %rangers;

    %redsox;

    %reds;

    %rockies;

    <!ENTITY % tigers SYSTEM "tigers.dtd">
    <!ENTITY % twins SYSTEM "twins.dtd">
    <!ENTITY % whitesox SYSTEM "whitesox.dtd">
    <!ENTITY % yankees SYSTEM "yankees.dtd">

    Listing 9-13, a master document, pulls together all the player sub-documents as well as the DTDs that define the entities for each player. Although this document is much smaller than the monolithic document developed earlier (32K vs. 628K), it’s still quite long, so not all players are included here. The full version of Listing 9-13 relies on 33 DTDs and over 1,000 XML files to produce the finished document. The largest problem with this approach is that it requires over 1,000 separate connections to the Web server before the document can be displayed.

    On the CD-ROM

    The full example is on the CD-ROM in the file examples/baseball/players/index.xml.

    Listing 9-13: Master document for the 1998 season using external entity references for players

    1998

    National

    East

    Florida Marlins

    Montreal Expos

    New York Mets &RigoBeltran; &DennisCook; &SteveDecker; &JohnFranco; &MattFranco; &ButchHuskey; &BobbyJones; &MikeKinkade; &HideoNomo; &VanceWilson;

    Philadelphia Phillies

    Central

    Chicago Cubs

    West

    Arizona Diamondbacks



    American

    East

    Baltimore Orioles

    Central

    Chicago


    White Sox &JeffAbbott; &MikeCameron; &MikeCaruso; &LarryCasian; &TomFordham; &MarkJohnson; &RobertMachado; &JimParque; &ToddRizzo;

    West

    Anaheim Angels



    You do have some flexibility in which levels you choose for your master document and embedded data. For instance, one alternative to the structure used by Listing 9-12 places the teams and all their players in individual documents, then combines those team files into a season with external entities as shown in Listing 9-14. This has the advantage of using a smaller number of XML files of more even sizes, which places less load on the Web server and would download and display more quickly. To be honest, however, the intrinsic advantage of one approach or the other is minimal. Feel free to use whichever one more closely matches the organization of your data, or simply whichever one you feel more comfortable with.

    Listing 9-14: The 1998 season using external entity references for teams

    <!ENTITY athletics SYSTEM "athletics.xml">
    <!ENTITY bluejays SYSTEM "bluejays.xml">
    <!ENTITY braves SYSTEM "braves.xml">
    <!ENTITY brewers SYSTEM "brewers.xml">
    <!ENTITY cubs SYSTEM "cubs.xml">
    <!ENTITY diamondbacks SYSTEM "diamondbacks.xml">
    <!ENTITY dodgers SYSTEM "dodgers.xml">
    <!ENTITY expos SYSTEM "expos.xml">
    <!ENTITY giants SYSTEM "giants.xml">
    <!ENTITY indians SYSTEM "indians.xml">
    <!ENTITY mariners SYSTEM "mariners.xml">
    <!ENTITY marlins SYSTEM "marlins.xml">
    <!ENTITY mets SYSTEM "mets.xml">
    <!ENTITY orioles SYSTEM "orioles.xml">
    <!ENTITY padres SYSTEM "padres.xml">
    <!ENTITY phillies SYSTEM "phillies.xml">
    <!ENTITY pirates SYSTEM "pirates.xml">
    <!ENTITY rangers SYSTEM "rangers.xml">
    <!ENTITY redsox SYSTEM "redsox.xml">
    <!ENTITY reds SYSTEM "reds.xml">
    <!ENTITY rockies SYSTEM "rockies.xml">
    <!ENTITY royals SYSTEM "royals.xml">
    <!ENTITY tigers SYSTEM "tigers.xml">
    <!ENTITY twins SYSTEM "twins.xml">
    <!ENTITY whitesox SYSTEM "whitesox.xml">
    <!ENTITY yankees SYSTEM "yankees.xml">

    ]>

    1998

    National

    East &marlins; &braves; &expos; &mets; &phillies;

    Central &cubs; &reds; &astros; &brewers; &pirates;

    West &diamondbacks; &rockies; &dodgers; &padres; &giants;




    American

    East &orioles; &redsox; &yankees; &devilrays; &bluejays;

    Central &whitesox; &indians; &tigers; &royals; &twins;

    West &angels; &athletics; &mariners; &rangers;



    A final, less likely, alternative is to actually build teams from external player entities into separate files and then combine those team files into the divisions, leagues, and seasons. The master document can define the entity references used in the child team documents. However, in this case the team documents are not usable on their own because the entity references are not defined until they’re aggregated into the master document.

    It’s truly unfortunate that only the top-level document can be attached to a DTD. This somewhat limits the utility of external parsed entities. When you learn about XLinks and XPointers, you’ll see some other ways to build large, compound documents out of small parts. Those techniques, however, are not part of the core XML standard, so unlike the techniques of this chapter they are not necessarily supported by every validating XML processor and Web browser.

    CrossReference

    Chapter 16, XLinks, covers XLinks and Chapter 17, XPointers, discusses XPointers.


    Entities and DTDs in Well-Formed Documents

    Part I of this book explored well-formed XML documents without DTDs. And Part II has been exploring documents that have DTDs and adhere to the constraints in the DTD, that is, valid documents. But there is a third level of conformance to the XML standard: documents that have DTDs and are well-formed but aren’t valid, either because the DTD is incomplete or because the document doesn’t fit the DTD’s constraints. This is the least common of the three types.

    However, not all documents need to be valid. Sometimes it suffices for an XML document to be merely well-formed. DTDs also have a place in well-formed XML documents (though they aren’t required as they are for valid documents). And some non-validating XML processors can take advantage of information in a DTD without requiring perfect conformance to it. We explore that option in this section.

    If a well-formed but invalid XML document does have a DTD, that DTD must have the same general form as explored in previous chapters. That is, it begins with a document type declaration and may contain ELEMENT, ATTLIST, and ENTITY declarations. Such a document differs from a valid document in that the processor only considers the ENTITY declarations.

    Internal Entities

    The primary advantage of using a DTD in invalid well-formed XML documents is that you may use internal general entity references other than the five predefined references &gt;, &lt;, &apos;, &quot;, and &amp;.
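    As a rough illustration of that advantage, the following sketch shows a document that is well-formed but not valid: its DTD declares one internal general entity and nothing else. The element names DOCUMENT and AUTHOR and the replacement text are assumptions chosen for the example.

    ```xml
    <?xml version="1.0" standalone="yes"?>
    <!DOCTYPE DOCUMENT [
      <!-- The only declaration: an internal general entity (hypothetical text) -->
      <!ENTITY ERH "Elliotte Rusty Harold">
    ]>
    <DOCUMENT>
      <!-- DOCUMENT and AUTHOR are never declared, so this is not valid,
           but a processor that reads the DTD can still expand &ERH; -->
      <AUTHOR>&ERH;</AUTHOR>
    </DOCUMENT>
    ```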

    &LOGO;

    This is the correct way to embed the unparsed entity LOGO in the document:



    ]>
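    One way this pattern could look is sketched below. The entity name LOGO comes from the text; the element names, the attribute name SOURCE, the file name logo.gif, and the notation identifier are hypothetical. The unparsed entity is named in an attribute of type ENTITY rather than referenced with &LOGO; in the content.

    ```xml
    <?xml version="1.0"?>
    <!DOCTYPE DOCUMENT [
      <!ELEMENT DOCUMENT (IMAGE)>
      <!ELEMENT IMAGE EMPTY>
      <!-- SOURCE holds the name of an unparsed entity -->
      <!ATTLIST IMAGE SOURCE ENTITY #REQUIRED>
      <!NOTATION GIF SYSTEM "image/gif">
      <!-- NDATA marks LOGO as unparsed; logo.gif is an assumed file name -->
      <!ENTITY LOGO SYSTEM "logo.gif" NDATA GIF>
    ]>
    <DOCUMENT>
      <IMAGE SOURCE="LOGO"/>
    </DOCUMENT>
    ```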



    [ Well-formedness Constraint: No Recursion ]

    This well-formedness constraint states that a parsed entity cannot refer to itself. For example, this open source classic is malformed:

    Circular references are a little trickier to spot, but are equally illegal:

    Note that it’s only the recursion that’s malformed, not the mere use of one entity reference inside another. The following is perfectly fine because although the COPY99 entity depends on the ERH entity, the ERH entity does not depend on the COPY99 entity.
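    The three situations described above can be sketched as follows. Only the names COPY99 and ERH come from the text; GNU, LEFT, RIGHT, and all of the replacement text are assumptions made for illustration. Read the three groups as separate cases, not as one DTD.

    ```xml
    <!-- Direct recursion: the entity refers to itself -->
    <!ENTITY GNU "&GNU;'s Not Unix">

    <!-- Circular recursion: LEFT needs RIGHT, and RIGHT needs LEFT -->
    <!ENTITY LEFT  "Turn &RIGHT;">
    <!ENTITY RIGHT "Turn &LEFT;">

    <!-- No recursion: COPY99 uses ERH, but ERH does not use COPY99 -->
    <!ENTITY ERH    "Elliotte Rusty Harold">
    <!ENTITY COPY99 "Copyright 1999 &ERH;">
    ```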

    [69] PEReference ::= ‘%’ Name ‘;’

    [ Well-formedness Constraint: No Recursion ]

    This is the same constraint that applies to Production [68]. Parameter entities can’t recurse any more than general entities can. For example, this entity declaration is also malformed:

    And this is still illegal:

    [ Well-formedness Constraint: In DTD ]

    This well-formedness constraint requires that parameter entity references appear only in the DTD. They may not appear in the content of the document or anywhere else that’s not the DTD.

    Validity Constraints

    This reference topic is designed to help you understand what is required in order for an XML document to be valid. Validity is often useful, but is not always required. You can do a lot with simply well-formed documents, and such documents are often easier to write because there are fewer rules to follow. For valid documents, you must follow the BNF grammar, the well-formedness constraints, and the validity constraints discussed in this section.

    What Is a Validity Constraint?

    A validity constraint is a rule that must be adhered to by a valid document. Not all XML documents are, or need to be, valid. It is not necessarily an error for a document to fail to satisfy a validity constraint. Validating processors must be able to report violations of these constraints as errors, but the user may switch that reporting off. All syntax (BNF) errors and well-formedness violations must still be reported, however.


    Appendixes

    Only documents with DTDs may be validated. Almost all the validity constraints deal with the relationships between the content of the document and the declarations in the DTD.

    Validity Constraints in XML 1.0

    This section lists and explains all of the validity constraints in the XML 1.0 standard. These are organized according to the BNF rule each applies to.

    [28] doctypedecl ::= ‘<!DOCTYPE’ S Name (S ExternalID)? S? (‘[’ (markupdecl | PEReference | S)* ‘]’ S?)? ‘>’

    Validity Constraint: Root Element Type

    This constraint simply states that the name given in the DOCTYPE declaration must match the name of the root element. In other words, the bold parts below have to all be the same.



    content

    It’s also true that the root element must be declared (that’s done by the line in italic); however, that declaration is required by a different validity constraint, not this one.
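    A minimal sketch of the Root Element Type constraint follows. ROOTNAME is a placeholder name invented for this example. It must be spelled identically in the DOCTYPE declaration, the start-tag, and the end-tag; the ELEMENT declaration is the part that a different constraint requires.

    ```xml
    <?xml version="1.0"?>
    <!DOCTYPE ROOTNAME [
      <!-- Declaring the root element is required by a different constraint -->
      <!ELEMENT ROOTNAME (#PCDATA)>
    ]>
    <ROOTNAME>
      content
    </ROOTNAME>
    ```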

    [29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment

    Validity Constraint: Proper Declaration/PE Nesting

    This constraint requires that a markup declaration contain or be contained in one or more parameter entities, but that it may not be split across a parameter entity. For example, consider this element declaration:

    The parameter entity declared by the following entity declaration is a valid substitute for the whole element declaration, because the parameter entity contains both the < and the >:

    ”>


    Given that entity, you can rewrite the element declaration like this:

    %PARENT_DECL;

    This is valid because the parameter entity contains both the < and the >. Another option is to include only part of the element declaration in the parameter entity. For example, if you had many elements whose content model was (FATHER | MOTHER), then it might be useful to do something like this:

    Here, neither the < nor the > is included in the parameter entity. You cannot enclose one of the angle brackets in the parameter entity without including its mate. The following, for example, is invalid, even though it appears to expand into a legal element declaration:

    ”> character. That’s legal (unlike the use of a < character, which would be illegal in an internal parameter entity declaration). The problem is how the > character is used to terminate an element declaration that began in another entity.
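    The cases discussed for this constraint might be sketched as below. PARENT_DECL and the (FATHER | MOTHER) content model come from the text; PARENT_TYPES, FINISH, GUARDIAN, and CHILD are hypothetical names. Parameter entity references inside a declaration are only allowed in the external DTD subset, so treat this as a fragment of an external DTD file.

    ```xml
    <!-- Valid: the entity contains the whole declaration, < and > included -->
    <!ENTITY % PARENT_DECL "<!ELEMENT PARENT (FATHER | MOTHER)>">
    %PARENT_DECL;

    <!-- Valid: the entity contains only the content model, neither < nor > -->
    <!ENTITY % PARENT_TYPES "(FATHER | MOTHER)">
    <!ELEMENT GUARDIAN %PARENT_TYPES;>

    <!-- Invalid: the closing > comes from an entity
         that did not also supply the opening < -->
    <!ENTITY % FINISH ">">
    <!ELEMENT CHILD (FATHER | MOTHER) %FINISH;
    ```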

    [32] SDDecl ::= S ‘standalone’ Eq ((“’” (‘yes’ | ‘no’) “’”) | (‘”’ (‘yes’ | ‘no’) ‘”’))

    Validity Constraint: Standalone Document Declaration

    In short, this constraint says that a document must have a standalone document declaration with the value no (standalone=”no”) if any other files are required to process this file and determine its validity. Mostly this affects external DTD subsets linked in through parameter entities. This is the case if any of the following are true:

    ✦ An entity used in the document is declared in an external DTD subset.

    ✦ The external DTD subset provides default values for attributes that appear in the document without values.

    ✦ The external DTD subset changes how attribute values in the document may be normalized.

    ✦ The external DTD subset declares elements whose children are only elements (no character data or mixed content) when those children may themselves contain whitespace.
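    As a sketch of the second condition in the list above, consider a document whose external DTD subset supplies an attribute default. The file name greeting.dtd and the names GREETING and LANG are assumptions. Because the document omits LANG and relies on the external default, it cannot truthfully claim standalone="yes".

    ```xml
    <!-- greeting.dtd (hypothetical external DTD subset):
           <!ELEMENT GREETING (#PCDATA)>
           <!ATTLIST GREETING LANG CDATA "en">
    -->

    <!-- The document itself, in a separate file -->
    <?xml version="1.0" standalone="no"?>
    <!DOCTYPE GREETING SYSTEM "greeting.dtd">
    <GREETING>Hello</GREETING>
    ```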


    [39] element ::= EmptyElemTag | STag content ETag

    Validity Constraint: Element Valid

    This constraint simply states that this element matches an element declaration in the DTD. More precisely, one of the following conditions must be true:

    1. The element has no content and the element declaration declares the element EMPTY.

    2. The element contains only child elements that match the regular expression in the element’s content model.

    3. The element is declared to have mixed content, and the element’s content contains character data and child elements that are declared in the mixed-content declaration.

    4. The element is declared ANY, and all child elements are declared.
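    The first three conditions can be seen together in one small document. This is only a sketch; every element name in it is invented for the example. PHOTO satisfies condition 1, PERSON satisfies condition 2, and NOTE satisfies condition 3.

    ```xml
    <?xml version="1.0"?>
    <!DOCTYPE PERSON [
      <!ELEMENT PERSON (NAME, NOTE?, PHOTO)>   <!-- element content -->
      <!ELEMENT NAME   (#PCDATA)>
      <!ELEMENT NOTE   (#PCDATA | EM)*>        <!-- mixed content -->
      <!ELEMENT EM     (#PCDATA)>
      <!ELEMENT PHOTO  EMPTY>                  <!-- no content -->
    ]>
    <PERSON>
      <NAME>Jane Doe</NAME>
      <NOTE>Prefers <EM>email</EM> contact.</NOTE>
      <PHOTO/>
    </PERSON>
    ```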

    [41] Attribute ::= Name Eq AttValue

    Validity Constraint: Attribute Value Type

    This constraint simply states that the attribute’s name must have been declared in an ATTLIST declaration in the DTD. Furthermore, the attribute value must match the declared type in the ATTLIST declaration.

    [45] elementdecl ::= ‘<!ELEMENT’ S Name S contentspec S? ‘>’

    Validity Constraint: Unique Element Type Declaration

    An element cannot be declared more than once in the DTD, whether the declarations are compatible or not. For example, this is valid:

    This, however, is not valid:

    Neither is this valid:

    This is most likely to cause problems when merging external DTD subsets from different sources that both declare some of the same elements. To a limited extent, namespaces can help resolve this.
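    The valid case and the two invalid cases might look like the following sketch. The element names are assumptions, and the three groups are alternatives, not parts of a single DTD.

    ```xml
    <!-- Valid: each element type is declared exactly once -->
    <!ELEMENT EM     (#PCDATA)>
    <!ELEMENT STRONG (#PCDATA)>

    <!-- Not valid: EM is declared twice with different content models -->
    <!ELEMENT EM (#PCDATA)>
    <!ELEMENT EM (#PCDATA | B)*>

    <!-- Not valid either: the two declarations agree, but EM is still declared twice -->
    <!ELEMENT EM (#PCDATA)>
    <!ELEMENT EM (#PCDATA)>
    ```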

    Appendix A ✦ XML Reference Material

    [49] choice ::= ‘(’ S? cp ( S? ‘|’ S? cp )* S? ‘)’

    Validity Constraint: Proper Group/PE Nesting

    This constraint states that a choice may contain or be contained in one or more parameter entities, but that it may not be split across a parameter entity. For example, consider this element declaration:

    The parameter entity declared by the following entity declaration is a valid substitute for the content model because the parameter entity contains both the ( and the ):

    That is, you can rewrite the element declaration like this:

    This is valid because the parameter entity contains both the ( and the ). Another option is to include only the child elements, but leave out both parentheses. For example:

    The advantage here is that you can easily add additional elements not defined in the parameter entity. For example:

    What you cannot do, however, is enclose one of the parentheses in the parameter entity without including its mate. The following, for example, is invalid, even though it appears to expand into a legal element declaration.

    The problem in this example is the ELEMENT declaration, not the ENTITY declarations. It is valid to declare the entities as done here. It’s their use in the context of a choice that makes them invalid.
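    The choice cases walked through above can be sketched as follows. All entity and element names here are hypothetical. As before, parameter entity references inside a declaration belong in the external DTD subset, so read this as part of an external DTD file.

    ```xml
    <!-- Valid: the entity holds the whole group, both parentheses included -->
    <!ENTITY % PARENT_CHOICE "(FATHER | MOTHER)">
    <!ELEMENT PARENT %PARENT_CHOICE;>

    <!-- Valid: the entity holds only the children; both parentheses stay outside,
         which makes it easy to add further choices such as GRANDPARENT -->
    <!ENTITY % PARENTS "FATHER | MOTHER">
    <!ELEMENT ANCESTOR (%PARENTS; | GRANDPARENT)>

    <!-- Declaring these entities is fine; using them below is not,
         because each carries one parenthesis without its mate -->
    <!ENTITY % START_CHOICE "(FATHER">
    <!ENTITY % END_CHOICE   "MOTHER)">
    <!ELEMENT RELATIVE %START_CHOICE; | %END_CHOICE;>
    ```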


    [50] seq ::= ‘(’ S? cp ( S? ‘,’ S? cp )* S? ‘)’

    Validity Constraint: Proper Group/PE Nesting

    This is exactly the same constraint as above, except now it’s applied to sequences rather than choices. It requires that a sequence may contain or be contained in one or more parameter entities, but it may not be split across a parameter entity. For example, consider this element declaration:

    The parameter entity declared by the following entity declaration is a valid substitute for the content model because the parameter entity contains both the ( and the ):

    That is, you can rewrite the element declaration like this:

    This is valid because the parameter entity contains both the ( and the ). Another option is to include only the child elements, but leave out both parentheses. For example:

    The advantage here is that you can easily add additional elements not defined in the parameter entity. For example:

    What you cannot do, however, is enclose one of the parentheses in the parameter entity without including its mate. The following, for example, is invalid, even though it appears to expand into a legal element declaration:

    The problem in this example is the ELEMENT declaration, not the ENTITY declarations. It is valid to declare the entities like this. It’s their use in the context of a sequence that makes them invalid.


    [51] Mixed ::= ‘(’ S? ‘#PCDATA’ (S? ‘|’ S? Name)* S? ‘)*’ | ‘(’ S? ‘#PCDATA’ S? ‘)’

    Validity Constraint: Proper Group/PE Nesting

    This is exactly the same constraint as above, except now it’s applied to mixed content rather than choices or sequences. It requires that a mixed-content model may contain or be contained in a parameter entity, but it may not be split across a parameter entity. For example, consider this element declaration:

    The parameter entity declared by the following entity declaration is a valid substitute for the content model because the parameter entity contains both the ( and the ):

    That is, you can rewrite the element declaration like this:

    This is valid because the parameter entity contains both the ( and the ). Another option is to include only the content particles, but leave out both parentheses. For example:

    The advantage here is that you can easily add additional elements not defined in the parameter entity. For example:

    What you cannot do, however, is enclose one of the parentheses in the parameter entity without including its mate. The following, for example, is invalid, even though it appears to expand into a legal element declaration:

    The problem in this example is the ELEMENT declaration, not the ENTITY declarations. It is valid to declare the entities as is done here. It’s their use in the context of a mixed-content model that makes them invalid.
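    For mixed content the same pattern might look like this sketch, again with hypothetical entity and element names and again as a fragment of an external DTD subset.

    ```xml
    <!-- Valid: the entity holds the complete mixed-content model -->
    <!ENTITY % INLINE_MODEL "(#PCDATA | I | EM)*">
    <!ELEMENT P %INLINE_MODEL;>

    <!-- Valid: the entity holds only element names; the parentheses stay outside,
         so a declaration can add further names such as B -->
    <!ENTITY % INLINE "I | EM">
    <!ELEMENT CAPTION (#PCDATA | %INLINE; | B)*>

    <!-- Invalid use: the opening parenthesis is inside the entity, its mate is not -->
    <!ENTITY % OPEN_MIXED "(#PCDATA | I">
    <!ELEMENT NOTE %OPEN_MIXED; | EM)*>
    ```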


    Validity Constraint: No Duplicate Types

    No element can be repeated in a mixed-content declaration. For example, the following is invalid:

    ( #PCDATA | I | EM | I | EM )

    There’s really no reason to write a mixed-content declaration like this, but at the same time, it’s not obvious what the harm is. Interestingly, pure choices do allow content models like this:

    ( I | EM | I | EM )

    It only becomes a problem when #PCDATA gets mixed in.

    Caution

    This choice is ambiguous; that is, when the parser encounters an I or an EM, it doesn’t know whether it matches the first or the second instance in the content model. So although legal, some parsers will report it as an error, and it should be avoided if possible.

    [56] TokenizedType ::= ‘ID’ | ‘IDREF’ | ‘IDREFS’ | ‘ENTITY’ | ‘ENTITIES’ | ‘NMTOKEN’ | ‘NMTOKENS’

    Validity Constraint: ID

    Attribute values of ID type must be valid XML names (Production [5]). Furthermore, a single name cannot be used more than once in the same document as the value of an ID type attribute. For example, this is invalid given that ID is declared to be ID:

    This is also invalid because XML names cannot begin with numbers:

    This is valid if NAME does not have type ID:

    On the other hand, that example is invalid if NAME does have type ID, even though the NAME attribute is different from the ID attribute. Furthermore, the following is invalid if NAME has type ID, even though two different elements are involved:

    ID attribute values must be unique across all elements and all ID attributes in the document, not just within one element type or one attribute name.
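    A compact sketch of these ID rules follows. The attribute names ID and NAME follow the text; the element names and the values are assumptions. The comments mark which instances break the constraint.

    ```xml
    <?xml version="1.0"?>
    <!DOCTYPE PEOPLE [
      <!ELEMENT PEOPLE (PERSON | DOG)*>
      <!ELEMENT PERSON EMPTY>
      <!ELEMENT DOG    EMPTY>
      <!ATTLIST PERSON ID   ID #REQUIRED>
      <!ATTLIST DOG    NAME ID #REQUIRED>
    ]>
    <PEOPLE>
      <PERSON ID="p1"/>
      <PERSON ID="p1"/>   <!-- invalid: p1 is already in use -->
      <PERSON ID="1p"/>   <!-- invalid: an XML name cannot begin with a digit -->
      <DOG NAME="p1"/>    <!-- invalid: clashes with the PERSON whose ID is p1 -->
      <DOG NAME="rex"/>   <!-- valid: rex is a name and is used only once -->
    </PEOPLE>
    ```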

    Validity Constraint: One ID per Element Type

    Each element can have at most one attribute of type ID. For example, the following is invalid:

    Validity Constraint: ID Attribute Default

    All attributes of ID type must be declared #IMPLIED or #REQUIRED. #FIXED is not allowed. For example, the following is invalid:

    The problem is that if there’s more than one PERSON element in the document, the ID validity constraint will automatically be violated.
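    Both of these constraints concern the ATTLIST declaration itself. A sketch of two declarations that break them, using the PERSON element from the text and the hypothetical attribute names ID and SSN:

    ```xml
    <!-- Invalid: two attributes of type ID on one element type -->
    <!ATTLIST PERSON ID  ID #REQUIRED
                     SSN ID #IMPLIED>

    <!-- Invalid: an ID attribute may not be #FIXED, since every PERSON
         element would then share the same ID value -->
    <!ATTLIST PERSON ID ID #FIXED "p1">
    ```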

    Validity Constraint: IDREF

    The IDREF validity constraint specifies that an attribute value of an IDREF type attribute must be the same as the value of an ID type attribute of an element in the document. Multiple IDREF attributes in the same or different elements may point to a single element. ID attribute values must be unique (at least among other ID attribute values in the same document), but IDREF attributes do not need to be. Additionally, attribute values of type IDREFS must be a whitespace-separated list of ID attribute values from elements in the document.
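    To make this concrete, here is a small sketch with invented element and attribute names. Two different attributes refer to the same ID value, which is allowed, and the IDREFS value is a whitespace-separated list of IDs that exist in the document.

    ```xml
    <?xml version="1.0"?>
    <!DOCTYPE TEAM [
      <!ELEMENT TEAM (PLAYER+, CAPTAIN, LINEUP)>
      <!ELEMENT PLAYER  (#PCDATA)>
      <!ELEMENT CAPTAIN EMPTY>
      <!ELEMENT LINEUP  EMPTY>
      <!ATTLIST PLAYER  ID     ID     #REQUIRED>
      <!ATTLIST CAPTAIN PLAYER IDREF  #REQUIRED>
      <!ATTLIST LINEUP  ORDER  IDREFS #REQUIRED>
    ]>
    <TEAM>
      <PLAYER ID="p7">Pat</PLAYER>
      <PLAYER ID="p9">Sam</PLAYER>
      <CAPTAIN PLAYER="p7"/>       <!-- refers to the first PLAYER -->
      <LINEUP ORDER="p9 p7"/>      <!-- p7 is referenced a second time -->
    </TEAM>
    ```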

    Validity Constraint: Entity Name

    The value of an attribute whose declared type is ENTITY must be the name of an unparsed general (non-parameter) entity declared in the DTD, whether in the internal or external subset. The value of an attribute whose declared type is ENTITIES must be a whitespace-separated list of the names of unparsed general (non-parameter) entities declared in the DTD, whether in the internal or external subset.


    Validity Constraint: Name Token

    The value of an attribute whose declared type is NMTOKEN must match the NMTOKEN production of XML (Production [7]). That is, it must be composed of one or more name characters. It differs from an XML name in that it may start with a digit, a period, a hyphen, a combining character, or an extender. The value of an attribute whose declared type is NMTOKENS must be a whitespace-separated list of name tokens. For example, this is a valid element with a COLORS attribute of type NMTOKENS:

    This is an invalid element with a COLORS attribute of type NMTOKENS:
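    A sketch of both cases, using the COLORS attribute from the text on a hypothetical PALETTE element with assumed values:

    ```xml
    <!ATTLIST PALETTE COLORS NMTOKENS #REQUIRED>

    <!-- Valid: three name tokens separated by whitespace;
         a name token may begin with a digit -->
    <PALETTE COLORS="red green 2blue"/>

    <!-- Invalid: commas are not name characters -->
    <PALETTE COLORS="red, green, blue"/>
    ```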

    [58] NotationType ::= ‘NOTATION’ S ‘(’ S? Name (S? ‘|’ S? Name)* S? ‘)’

    Validity Constraint: Notation Attributes

    The value of an attribute whose declared type is NOTATION must be the name of a notation that’s been declared in the DTD.

    [59] Enumeration ::= ‘(’ S? Nmtoken (S? ‘|’ S? Nmtoken)* S? ‘)’

    Validity Constraint: Enumeration

    The value of an attribute whose declared type is an enumeration must be one of the name tokens listed in the declaration, and those tokens must be separated by vertical bars, as the production shows. These name tokens do not necessarily have to be the names of anything declared in the DTD or elsewhere. They simply have to match the NMTOKEN production (Production [7]). For example, this is an invalid enumeration because commas rather than vertical bars are used to separate the name tokens:

    ( red, green, blue)

    This is an invalid enumeration because the name tokens are enclosed in quote marks:

    ( “red” “green” “blue”)

    Neither commas nor quote marks are valid name characters, so there’s no possibility for these common mistakes to be misinterpreted as a list of unusual name tokens.
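    For contrast with the two invalid forms, a correctly written enumeration following Production [59] might look like this sketch. SHIRT and COLOR are hypothetical names. The tokens are unquoted and separated by vertical bars, and an instance must use exactly one of them.

    ```xml
    <!-- Enumerated type with red as the default value -->
    <!ATTLIST SHIRT COLOR (red | green | blue) "red">

    <!-- Valid: green is one of the listed tokens -->
    <SHIRT COLOR="green"/>

    <!-- Invalid: purple is not in the enumeration -->
    <SHIRT COLOR="purple"/>
    ```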


    [60] DefaultDecl ::= ‘#REQUIRED’ | ‘#IMPLIED’ | ((‘#FIXED’ S)? AttValue)

    Validity Constraint: Required Attribute

    If an attribute of an element is declared to be #REQUIRED, then it is a validity error for any instance of the element not to provide a value for that attribute.

    Validity Constraint: Attribute Default Legal

    This common-sense validity constraint merely states that any default attribute value provided in an ATTLIST declaration must satisfy the constraints for an attribute of that type. For example, the following is invalid because the default value, UNKNOWN, is not one of the choices given by the enumeration.

    UNKNOWN would be invalid for this attribute whether it was provided as a default value or in an actual element like the following:

    Validity Constraint: Fixed Attribute Default

    This common-sense validity constraint merely states that if an attribute is declared #FIXED in its ATTLIST declaration, then that same ATTLIST declaration must also provide a default value. For example, the following is invalid:

    Here’s a corrected declaration:
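    The default-value cases in this section can be sketched as follows. The default UNKNOWN comes from the text; the element names, attribute names, and other values are assumptions made for illustration.

    ```xml
    <!-- Attribute Default Legal: UNKNOWN is not among the enumerated choices -->
    <!ATTLIST CIRCLE VISIBLE (TRUE | FALSE) "UNKNOWN">

    <!-- The same value is just as invalid in an instance -->
    <CIRCLE VISIBLE="UNKNOWN"/>

    <!-- Fixed Attribute Default: #FIXED with no value supplied is rejected -->
    <!ATTLIST AUTHOR COMPANY CDATA #FIXED>

    <!-- Corrected: #FIXED followed by the fixed value -->
    <!ATTLIST AUTHOR COMPANY CDATA #FIXED "TIC">
    ```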

    [68] EntityRef ::= ‘&’ Name ‘;’

    Validity Constraint: Entity Declared

    This constraint expands on the well-formedness constraint of the same name. In a valid document, all referenced entities must be defined by declarations in the DTD. Definitions must precede any use of the entity they define. The loophole for standalone=”no” documents that applies to merely well-formed documents is no longer available. The loophole for the five predefined entities &lt;, &gt;, &apos;, &quot;, and &amp; is still available. However, it is recommended that you declare them, even though you don’t absolutely have to. Those declarations would look like this:

    <!ENTITY lt   "&#38;#60;">
    <!ENTITY gt   "&#62;">
    <!ENTITY amp  "&#38;#38;">
    <!ENTITY apos "&#39;">
    <!ENTITY quot "&#34;">

    [69] PEReference ::= ‘%’ Name ‘;’

    Validity Constraint: Entity Declared

    This is the same constraint as the previous one, merely applied to parameter entity references instead of general entity references.

    [76] NDataDecl ::= S ‘NDATA’ S Name

    Validity Constraint: Notation Declared

    The name used in a notation data declaration (which is in turn used in an entity definition for an unparsed entity) must be the name of a notation declared in the DTD. For example, the following document is valid. However, if you take away the line declaring the GIF notation (shown in bold) it becomes invalid.



    ]>

    &LOGO;







    APPENDIX B

    The XML 1.0 Specification

    This appendix has the complete, final XML 1.0 specification as published by the World Wide Web Consortium. This document has been reviewed by W3C Members and other interested parties and has been endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited as a normative reference from another document. If any changes to XML are required in the future (as they undoubtedly will be) a new version number will be applied.

    This document isn’t always easy reading. Precision is preferred over clarity. However, when you’re banging your head against the wall, trying to decide whether the problem is with your XML processor or with your XML code, this is the deciding document. Therefore, it’s important to have at least a cursory familiarity with it, and be able to find things in it when you need to. This document was primarily written by Tim Bray and C. M. Sperberg-McQueen with assistance from many others credited at the end of the document.

    REC-xml-19980210

    W3C Recommendation 10-February-1998

    This version:

    http://www.w3.org/TR/1998/REC-xml-19980210
    http://www.w3.org/TR/1998/REC-xml-19980210.xml
    http://www.w3.org/TR/1998/REC-xml-19980210.html
    http://www.w3.org/TR/1998/REC-xml-19980210.pdf
    http://www.w3.org/TR/1998/REC-xml-19980210.ps


    Latest version: http://www.w3.org/TR/REC-xml

    Previous version: http://www.w3.org/TR/PR-xml-971208

    Editors:
    Tim Bray (Textuality and Netscape)
    Jean Paoli (Microsoft)
    C. M. Sperberg-McQueen (University of Illinois at Chicago)

    Abstract

    The Extensible Markup Language (XML) is a subset of SGML that is completely described in this document. Its goal is to enable generic SGML to be served, received, and processed on the Web in the way that is now possible with HTML. XML has been designed for ease of implementation and for interoperability with both SGML and HTML.

    Status of This Document

    This document has been reviewed by W3C Members and other interested parties and has been endorsed by the Director as a W3C Recommendation. It is a stable document and may be used as reference material or cited as a normative reference from another document. W3C’s role in making the Recommendation is to draw attention to the specification and to promote its widespread deployment. This enhances the functionality and interoperability of the Web.

    This document specifies a syntax created by subsetting an existing, widely used international text processing standard (Standard Generalized Markup Language, ISO 8879:1986(E) as amended and corrected) for use on the World Wide Web. It is a product of the W3C XML Activity, details of which can be found at http://www.w3.org/XML. A list of current W3C Recommendations and other technical documents can be found at http://www.w3.org/TR.

    This specification uses the term URI, which is defined by [Berners-Lee et al.], a work in progress expected to update [IETF RFC1738] and [IETF RFC1808].

    The list of known errors in this specification is available at http://www.w3.org/XML/xml-19980210-errata.

    Please report errors in this document to xml-editor@w3.org.

    Extensible Markup Language (XML) 1.0

    Table of Contents

    1. Introduction
        1.1 Origin and Goals
        1.2 Terminology
    2. Documents
        2.1 Well-Formed XML Documents
        2.2 Characters
        2.3 Common Syntactic Constructs
        2.4 Character Data and Markup
        2.5 Comments
        2.6 Processing Instructions
        2.7 CDATA Sections
        2.8 Prolog and Document Type Declaration
        2.9 Standalone Document Declaration
        2.10 White Space Handling
        2.11 End-of-Line Handling
        2.12 Language Identification
    3. Logical Structures
        3.1 Start-Tags, End-Tags, and Empty-Element Tags
        3.2 Element Type Declarations
            3.2.1 Element Content
            3.2.2 Mixed Content
        3.3 Attribute-List Declarations
            3.3.1 Attribute Types
            3.3.2 Attribute Defaults
            3.3.3 Attribute-Value Normalization
        3.4 Conditional Sections
    4. Physical Structures
        4.1 Character and Entity References
        4.2 Entity Declarations
            4.2.1 Internal Entities
            4.2.2 External Entities
        4.3 Parsed Entities
            4.3.1 The Text Declaration
            4.3.2 Well-Formed Parsed Entities
            4.3.3 Character Encoding in Entities
        4.4 XML Processor Treatment of Entities and References
            4.4.1 Not Recognized
            4.4.2 Included
            4.4.3 Included If Validating
            4.4.4 Forbidden
            4.4.5 Included in Literal
            4.4.6 Notify
            4.4.7 Bypassed
            4.4.8 Included as PE
        4.5 Construction of Internal Entity Replacement Text
        4.6 Predefined Entities
        4.7 Notation Declarations
        4.8 Document Entity
    5. Conformance
        5.1 Validating and Non-Validating Processors
        5.2 Using XML Processors
    6. Notation

    Appendices

    A. References
        A.1 Normative References
        A.2 Other References
    B. Character Classes
    C. XML and SGML (Non-Normative)
    D. Expansion of Entity and Character References (Non-Normative)
    E. Deterministic Content Models (Non-Normative)
    F. Autodetection of Character Encodings (Non-Normative)
    G. W3C XML Working Group (Non-Normative)

    1. Introduction

    Extensible Markup Language, abbreviated XML, describes a class of data objects called XML documents and partially describes the behavior of computer programs which process them. XML is an application profile or restricted form of SGML, the Standard Generalized Markup Language [ISO 8879]. By construction, XML documents are conforming SGML documents.

    XML documents are made up of storage units called entities, which contain either parsed or unparsed data. Parsed data is made up of characters, some of which form character data, and some of which form markup. Markup encodes a description of the document’s storage layout and logical structure. XML provides a mechanism to impose constraints on the storage layout and logical structure.

    A software module called an XML processor is used to read XML documents and provide access to their content and structure. It is assumed that an XML processor is doing its work on behalf of another module, called the application. This specification describes the required behavior of an XML processor in terms of how it must read XML data and the information it must provide to the application.

    1.1 Origin and Goals

    XML was developed by an XML Working Group (originally known as the SGML Editorial Review Board) formed under the auspices of the World Wide Web Consortium (W3C) in 1996. It was chaired by Jon Bosak of Sun Microsystems with the active participation of an XML Special Interest Group (previously known as the SGML Working Group) also organized by the W3C. The membership of the XML Working Group is given in an appendix. Dan Connolly served as the WG’s contact with the W3C.

    The design goals for XML are:

    1. XML shall be straightforwardly usable over the Internet.
    2. XML shall support a wide variety of applications.
    3. XML shall be compatible with SGML.
    4. It shall be easy to write programs which process XML documents.
    5. The number of optional features in XML is to be kept to the absolute minimum, ideally zero.
    6. XML documents should be human-legible and reasonably clear.
    7. The XML design should be prepared quickly.
    8. The design of XML shall be formal and concise.
    9. XML documents shall be easy to create.
    10. Terseness in XML markup is of minimal importance.

    This specification, together with associated standards (Unicode and ISO/IEC 10646 for characters, Internet RFC 1766 for language identification tags, ISO 639 for language name codes, and ISO 3166 for country name codes), provides all the information necessary to understand XML Version 1.0 and construct computer programs to process it.

    This version of the XML specification may be distributed freely, as long as all text and legal notices remain intact.

    1.2 Terminology

    The terminology used to describe XML documents is defined in the body of this specification. The terms defined in the following list are used in building those definitions and in describing the actions of an XML processor:

    may
    Conforming documents and XML processors are permitted to but need not behave as described.

    must
    Conforming documents and XML processors are required to behave as described; otherwise they are in error.

    error
    A violation of the rules of this specification; results are undefined. Conforming software may detect and report an error and may recover from it.

    fatal error
    An error which a conforming XML processor must detect and report to the application. After encountering a fatal error, the processor may continue processing the data to search for further errors and may report such errors to the application. In order to support correction of errors, the processor may make unprocessed data from the document (with intermingled character data and markup) available to the application. Once a fatal error is detected, however, the processor must not continue normal processing (i.e., it must not continue to pass character data and information about the document’s logical structure to the application in the normal way).

    at user option
    Conforming software may or must (depending on the modal verb in the sentence) behave as described; if it does, it must provide users a means to enable or disable the behavior described.

    validity constraint
    A rule which applies to all valid XML documents. Violations of validity constraints are errors; they must, at user option, be reported by validating XML processors.

    well-formedness constraint
    A rule which applies to all well-formed XML documents. Violations of well-formedness constraints are fatal errors.

    match
    (Of strings or names:) Two strings or names being compared must be identical. Characters with multiple possible representations in ISO/IEC 10646 (e.g. characters with both precomposed and base+diacritic forms) match only if they have the same representation in both strings. At user option, processors may normalize such characters to some canonical form. No case folding is performed. (Of strings and rules in the grammar:) A string matches a grammatical production if it belongs to the language generated by that production. (Of content and content models:) An element matches its declaration when it conforms in the fashion described in the constraint “Element Valid”.

    for compatibility
    A feature of XML included solely to ensure that XML remains compatible with SGML.

    for interoperability
    A non-binding recommendation included to increase the chances that XML documents can be processed by the existing installed base of SGML processors which predate the WebSGML Adaptations Annex to ISO 8879.

    2. Documents A data o bjec t is an XML document if it is well-fo rmed, as defined in this spec ific atio n. A well-fo rmed XML do c ument may in additio n be valid if it meets c ertain further c o nstraints. Eac h XML do c ument has bo th a lo gic al and a physic al struc ture. Physic ally, the do c ument is c o mpo sed o f units c alled entities. An entity may refer to o ther entities to c ause their inc lusio n in the do c ument. A do c ument begins in a “ro o t” o r do c ument entity. Lo gic ally, the do c ument is c o mpo sed o f dec laratio ns, elements, c o mments, c harac ter referenc es, and pro c essing instruc tio ns, all o f whic h are indic ated in the do c ument by explic it markup. The lo gic al and physic al struc tures must nest pro perly, as desc ribed in “4.3.2 Well-Fo rmed Parsed Entities”.

    2.1 Well-Formed XML Documents

    A textual object is a well-formed XML document if:

    ✦ Taken as a whole, it matches the production labeled document.

    ✦ It meets all the well-formedness constraints given in this specification.

    ✦ Each of the parsed entities which is referenced directly or indirectly within the document is well-formed.

    Document
    [1] document ::= prolog element Misc*

    Matching the document production implies that:

    ✦ It contains one or more elements.


    ✦ There is exactly one element, called the root, or document element, no part of which appears in the content of any other element. For all other elements, if the start-tag is in the content of another element, the end-tag is in the content of the same element. More simply stated, the elements, delimited by start- and end-tags, nest properly within each other.

    ✦ As a consequence of this, for each non-root element C in the document, there is one other element P in the document such that C is in the content of P, but is not in the content of any other element that is in the content of P. P is referred to as the parent of C, and C as a child of P.
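The nesting rules above can be observed with any conforming parser. The following is a minimal sketch in Python, assuming the standard library's `xml.etree.ElementTree` module (a non-validating parser); the helper name `is_well_formed` is hypothetical and chosen only for illustration.

```python
import xml.etree.ElementTree as ET


def is_well_formed(text: str) -> bool:
    """Return True if `text` parses as a well-formed XML document."""
    try:
        ET.fromstring(text)
        return True
    except ET.ParseError:
        # The parser reports a well-formedness violation as a fatal error.
        return False


print(is_well_formed("<a><b>text</b></a>"))  # one root, elements nest properly
print(is_well_formed("<a><b>text</a></b>"))  # start- and end-tags cross
print(is_well_formed("<a/><b/>"))            # more than one top-level element
```

Only the first document has exactly one root element whose descendants nest properly; the parser rejects the other two.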

    2.2 Characters

    A parsed entity contains text, a sequence of characters, which may represent markup or character data. A character is an atomic unit of text as specified by ISO/IEC 10646 [ISO/IEC 10646]. Legal characters are tab, carriage return, line feed, and the legal graphic characters of Unicode and ISO/IEC 10646. The use of “compatibility characters”, as defined in section 6.8 of [Unicode], is discouraged.

    Character Range
    [2] Char ::= #x9 | #xA | #xD | [#x20-#xD7FF] | [#xE000-#xFFFD] | [#x10000-#x10FFFF] /* any Unicode character, excluding the surrogate blocks, FFFE, and FFFF. */

    The mechanism for encoding character code points into bit patterns may vary from entity to entity. All XML processors must accept the UTF-8 and UTF-16 encodings of 10646; the mechanisms for signaling which of the two is in use, or for bringing other encodings into play, are discussed later, in “4.3.3 Character Encoding in Entities”.
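Production [2] translates directly into a range test on code points. The sketch below is one way to write that test in Python; the function name `is_xml_char` is an assumption made for illustration.

```python
def is_xml_char(ch: str) -> bool:
    """Check a single character against production [2] Char."""
    cp = ord(ch)
    return (
        cp in (0x9, 0xA, 0xD)          # tab, line feed, carriage return
        or 0x20 <= cp <= 0xD7FF        # up to, but excluding, the surrogate blocks
        or 0xE000 <= cp <= 0xFFFD      # excludes FFFE and FFFF
        or 0x10000 <= cp <= 0x10FFFF   # characters beyond the first 64K code points
    )


print(is_xml_char("A"), is_xml_char("\x00"), is_xml_char("\ufffe"))
```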

    2.3 Common Syntactic Constructs

    This section defines some symbols used widely in the grammar. S (white space) consists of one or more space (#x20) characters, carriage returns, line feeds, or tabs.

    White Space
    [3] S ::= (#x20 | #x9 | #xD | #xA)+

    Characters are classified for convenience as letters, digits, or other characters. Letters consist of an alphabetic or syllabic base character possibly followed by one or more combining characters, or of an ideographic character. Full definitions of the specific characters in each class are given in “B. Character Classes”.

    A Name is a token beginning with a letter or one of a few punctuation characters, and continuing with letters, digits, hyphens, underscores, colons, or full stops, together known as name characters. Names beginning with the string “xml”, or any string which would match (('X'|'x') ('M'|'m') ('L'|'l')), are reserved for standardization in this or future versions of this specification.


    Note: The colon character within XML names is reserved for experimentation with name spaces. Its meaning is expected to be standardized at some future point, at which point those documents using the colon for experimental purposes may need to be updated. (There is no guarantee that any name-space mechanism adopted for XML will in fact use the colon as a name-space delimiter.) In practice, this means that authors should not use the colon in XML names except as part of name-space experiments, but that XML processors should accept the colon as a name character.

    An Nmtoken (name token) is any mixture of name characters.
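The difference between a Name and an Nmtoken can be sketched with two regular expressions. This example is deliberately simplified: it assumes ASCII-only letters and digits, whereas the specification's Letter, Digit, CombiningChar, and Extender classes cover far more of Unicode. The helper names are hypothetical.

```python
import re

# ASCII-only approximation of the name characters (an assumption of this sketch).
NAME_RE = re.compile(r"[A-Za-z_:][A-Za-z0-9._:\-]*")
NMTOKEN_RE = re.compile(r"[A-Za-z0-9._:\-]+")


def is_name(s: str) -> bool:
    """A Name must start with a letter, underscore, or colon."""
    return NAME_RE.fullmatch(s) is not None


def is_nmtoken(s: str) -> bool:
    """An Nmtoken is any non-empty mixture of name characters."""
    return NMTOKEN_RE.fullmatch(s) is not None


def is_reserved_name(s: str) -> bool:
    """Names beginning with 'xml' in any letter case are reserved."""
    return is_name(s) and s[:3].lower() == "xml"


print(is_name("1abc"), is_nmtoken("1abc"), is_reserved_name("XmlThing"))
```

A string such as `1abc` is a legal Nmtoken but not a legal Name, because a Name may not begin with a digit.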

    Names and Tokens
    [4] NameChar ::= Letter | Digit | '.' | '-' | '_' | ':' | CombiningChar | Extender
    [5] Name ::= (Letter | '_' | ':') (NameChar)*
    [6] Names ::= Name (S Name)*
    [7] Nmtoken ::= (NameChar)+
    [8] Nmtokens ::= Nmtoken (S Nmtoken)*

    Literal data is any quoted string not containing the quotation mark used as a delimiter for that string. Literals are used for specifying the content of internal entities (EntityValue), the values of attributes (AttValue), and external identifiers (SystemLiteral). Note that a SystemLiteral can be parsed without scanning for markup.

    Literals
    [9] EntityValue ::= '"' ([^%&"] | PEReference | Reference)* '"' | "'" ([^%&'] | PEReference | Reference)* "'"
    [10] AttValue ::= '"' ([^<&"] | Reference)* '"' | "'" ([^<&'] | Reference)* "'"

    […]

    To allow attribute values to contain both single and double quotes, the apostrophe or single-quote character (') may be represented as “&apos;”, and the double-quote character (") as “&quot;”.

    Character Data
    [14] CharData ::= [^<&]* - ([^<&]* ']]>' [^<&]*)

    […]

    2.6 Processing Instructions

    Processing instructions (PIs) allow documents to contain instructions for applications.

    Processing Instructions
    [16] PI ::= '<?' PITarget (S (Char* - (Char* '?>' Char*)))? '?>'
    [17] PITarget ::= Name - (('X' | 'x') ('M' | 'm') ('L' | 'l'))


    PIs are not part of the document’s character data, but must be passed through to the application. The PI begins with a target (PITarget) used to identify the application to which the instruction is directed. The target names “XML”, “xml”, and so on are reserved for standardization in this or future versions of this specification. The XML Notation mechanism may be used for formal declaration of PI targets.

    2.7 CDATA Sections

    CDATA sections may occur anywhere character data may occur; they are used to escape blocks of text containing characters which would otherwise be recognized as markup. CDATA sections begin with the string “<![CDATA[” and end with the string “]]>”:

    CDATA Sections
    [18] CDSect ::= CDStart CData CDEnd
    [19] CDStart ::= '<![CDATA['
    [20] CData ::= (Char* - (Char* ']]>' Char*))
    [21] CDEnd ::= ']]>'

    Within a CDATA section, only the CDEnd string is recognized as markup, so that left angle brackets and ampersands may occur in their literal form; they need not (and cannot) be escaped using “&lt;” and “&amp;”. CDATA sections cannot nest.
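To see that text inside a CDATA section reaches the application as character data rather than markup, consider the following sketch. It assumes Python's standard `xml.etree.ElementTree` parser, and the wrapper element name `example` is invented for illustration.

```python
import xml.etree.ElementTree as ET

# The angle brackets inside the CDATA section are not treated as tags.
doc = "<example><![CDATA[<greeting>Hello, world!</greeting>]]></example>"
root = ET.fromstring(doc)

print(root.text)   # the literal text, angle brackets included
print(len(root))   # number of child elements of <example>
```

The root element has no child elements; the string `<greeting>Hello, world!</greeting>` arrives as its text content.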

    2.8 Prolog and Document Type Declaration

    XML documents may, and should, begin with an XML declaration which specifies the version of XML being used. For example, the following is a complete XML document, well-formed but not valid:

    <?xml version="1.0"?>
    <greeting>Hello, world!</greeting>

    and so is this:

    <greeting>Hello, world!</greeting>

    The version number “1.0” should be used to indicate conformance to this version of this specification; it is an error for a document to use the value “1.0” if it does not conform to this version of this specification. It is the intent of the XML working group to give later versions of this specification numbers other than “1.0”, but this intent does not indicate a commitment to produce any future versions of XML, nor, if any are produced, to use any particular numbering scheme. Since future versions are not ruled out, this construct is provided as a means to allow the possibility of automatic version recognition, should it become necessary. Processors may signal an error if they receive documents labeled with versions they do not support.

    The function of the markup in an XML document is to describe its storage and logical structure and to associate attribute-value pairs with its logical structures. XML provides a mechanism, the document type declaration, to define constraints on the logical structure and to support the use of predefined storage units. An XML document is valid if it has an associated document type declaration and if the document complies with the constraints expressed in it.

    The document type declaration must appear before the first element in the document.

    Prolog
    [22] prolog ::= XMLDecl? Misc* (doctypedecl Misc*)?
    [23] XMLDecl ::= '<?xml' VersionInfo EncodingDecl? SDDecl? S? '?>'
    [24] VersionInfo ::= S 'version' Eq ("'" VersionNum "'" | '"' VersionNum '"')
    [25] Eq ::= S? '=' S?
    [26] VersionNum ::= ([a-zA-Z0-9_.:] | '-')+
    [27] Misc ::= Comment | PI | S

    The XML document type declaration contains or points to markup declarations that provide a grammar for a class of documents. This grammar is known as a document type definition, or DTD. The document type declaration can point to an external subset (a special kind of external entity) containing markup declarations, or can contain the markup declarations directly in an internal subset, or can do both. The DTD for a document consists of both subsets taken together.

    A markup declaration is an element type declaration, an attribute-list declaration, an entity declaration, or a notation declaration. These declarations may be contained in whole or in part within parameter entities, as described in the well-formedness and validity constraints below. For fuller information, see “4. Physical Structures”.

    Document Type Definition
    [28] doctypedecl ::= '<!DOCTYPE' S Name (S ExternalID)? S? ('[' (markupdecl | PEReference | S)* ']' S?)? '>' [ VC: Root Element Type ]
    [29] markupdecl ::= elementdecl | AttlistDecl | EntityDecl | NotationDecl | PI | Comment [ VC: Proper Declaration/PE Nesting ] [ WFC: PEs in Internal Subset ]

    The markup declarations may be made up in whole or in part of the replacement text of parameter entities. The productions later in this specification for individual nonterminals (elementdecl, AttlistDecl, and so on) describe the declarations after all the parameter entities have been included.

    Validity Constraint: Root Element Type: The Name in the document type declaration must match the element type of the root element.

    Validity Constraint: Proper Declaration/PE Nesting: Parameter-entity replacement text must be properly nested with markup declarations. That is to say, if either the first character or the last character of a markup declaration (markupdecl above) is contained in the replacement text for a parameter-entity reference, both must be contained in the same replacement text.

    Well-Formedness Constraint: PEs in Internal Subset: In the internal DTD subset, parameter-entity references can occur only where markup declarations can occur, not within markup declarations. (This does not apply to references that occur in external parameter entities or to the external subset.)

    Like the internal subset, the external subset and any external parameter entities referred to in the DTD must consist of a series of complete markup declarations of the types allowed by the non-terminal symbol markupdecl, interspersed with white space or parameter-entity references. However, portions of the contents of the external subset or of external parameter entities may conditionally be ignored by using the conditional section construct; this is not allowed in the internal subset.

    External Subset
    [30] extSubset ::= TextDecl? extSubsetDecl
    [31] extSubsetDecl ::= ( markupdecl | conditionalSect | PEReference | S )*

    The external subset and external parameter entities also differ from the internal subset in that in them, parameter-entity references are permitted within markup declarations, not only between markup declarations.

    An example of an XML document with a document type declaration:

    <?xml version="1.0"?>
    <!DOCTYPE greeting SYSTEM "hello.dtd">
    <greeting>Hello, world!</greeting>

    The system identifier “hello.dtd” gives the URI of a DTD for the document. The declarations can also be given locally, as in this example:

    <?xml version="1.0" encoding="UTF-8" ?>
    <!DOCTYPE greeting [
      <!ELEMENT greeting (#PCDATA)>
    ]>
    <greeting>Hello, world!</greeting>

    If both the external and internal subsets are used, the internal subset is considered to occur before the external subset. This has the effect that entity and attribute-list declarations in the internal subset take precedence over those in the external subset.

    2.9 Standalone Document Declaration

    Markup declarations can affect the content of the document, as passed from an XML processor to an application; examples are attribute defaults and entity declarations. The standalone document declaration, which may appear as a component of the XML declaration, signals whether or not there are such declarations which appear external to the document entity.

    Standalone Document Declaration
    [32] SDDecl ::= S 'standalone' Eq (("'" ('yes' | 'no') "'") | ('"' ('yes' | 'no') '"')) [ VC: Standalone Document Declaration ]

    In a standalone document declaration, the value “yes” indicates that there are no markup declarations external to the document entity (either in the DTD external subset, or in an external parameter entity referenced from the internal subset) which affect the information passed from the XML processor to the application. The value “no” indicates that there are or may be such external markup declarations. Note that the standalone document declaration only denotes the presence of external declarations; the presence, in a document, of references to external entities, when those entities are internally declared, does not change its standalone status.

    If there are no external markup declarations, the standalone document declaration has no meaning. If there are external markup declarations but there is no standalone document declaration, the value “no” is assumed.

    Any XML document for which standalone="no" holds can be converted algorithmically to a standalone document, which may be desirable for some network delivery applications.

    Validity Constraint: Standalone Document Declaration: The standalone document declaration must have the value “no” if any external markup declarations contain declarations of:

    ✦ attributes with default values, if elements to which these attributes apply appear in the document without specifications of values for these attributes, or

    ✦ entities (other than amp, lt, gt, apos, quot), if references to those entities appear in the document, or

    ✦ attributes with values subject to normalization, where the attribute appears in the document with a value which will change as a result of normalization, or

    ✦ element types with element content, if white space occurs directly within any instance of those types.

    An example XML declaration with a standalone document declaration:

    <?xml version="1.0" standalone='yes'?>

    2.10 White Space Handling

    In editing XML documents, it is often convenient to use “white space” (spaces, tabs, and blank lines, denoted by the nonterminal S in this specification) to set apart the markup for greater readability. Such white space is typically not intended for inclusion in the delivered version of the document. On the other hand, “significant” white space that should be preserved in the delivered version is common, for example in poetry and source code.

    An XML processor must always pass all characters in a document that are not markup through to the application. A validating XML processor must also inform the application which of these characters constitute white space appearing in element content.

    A special attribute named xml:space may be attached to an element to signal an intention that in that element, white space should be preserved by applications. In valid documents, this attribute, like any other, must be declared if it is used. When declared, it must be given as an enumerated type whose only possible values are “default” and “preserve”. For example:

    <!ATTLIST poem xml:space (default|preserve) 'preserve'>

    The value “default” signals that applications’ default white-space processing modes are acceptable for this element; the value “preserve” indicates the intent that applications preserve all the white space. This declared intent is considered to apply to all elements within the content of the element where it is specified, unless overridden with another instance of the xml:space attribute.

    The root element of any document is considered to have signaled no intentions as regards application space handling, unless it provides a value for this attribute or the attribute is declared with a default value.
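The inheritance rule for xml:space can be sketched as a simple tree walk. This example assumes Python's `xml.etree.ElementTree`, which reports the attribute under the predefined XML namespace URI; the function name `effective_space` and the sample element names are hypothetical. Because this parser does not read attribute defaults from an external DTD, only values written in the document itself are considered.

```python
import xml.etree.ElementTree as ET

# ElementTree exposes xml:space under the namespace bound to the xml prefix.
XML_SPACE = "{http://www.w3.org/XML/1998/namespace}space"


def effective_space(root):
    """Map each element to the xml:space value in effect for it, or None."""
    result = {}

    def walk(elem, inherited):
        # An element's own xml:space overrides the inherited intent.
        value = elem.get(XML_SPACE, inherited)
        result[elem] = value
        for child in elem:
            walk(child, value)

    walk(root, None)  # the root has signaled no intention unless it says so
    return result


doc = (
    '<poem xml:space="preserve"><stanza><line> a </line></stanza>'
    '<note xml:space="default">n</note></poem>'
)
modes = {elem.tag: mode for elem, mode in effective_space(ET.fromstring(doc)).items()}
print(modes)
```

In this sample, `stanza` and `line` inherit “preserve” from `poem`, while `note` overrides it with “default”.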


    2.11 End-of-Line Handling

    XML parsed entities are often stored in computer files which, for editing convenience, are organized into lines. These lines are typically separated by some combination of the characters carriage-return (#xD) and line-feed (#xA).

    To simplify the tasks of applications, wherever an external parsed entity or the literal entity value of an internal parsed entity contains either the literal two-character sequence “#xD#xA” or a standalone literal #xD, an XML processor must pass to the application the single character #xA. (This behavior can conveniently be produced by normalizing all line breaks to #xA on input, before parsing.)
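The normalization suggested in the parenthetical remark can be sketched in two replacements. The function name `normalize_line_breaks` is an assumption; the order of the two steps matters, since the two-character sequence must be handled before any remaining lone carriage returns.

```python
def normalize_line_breaks(text: str) -> str:
    """Turn #xD#xA pairs and standalone #xD characters into a single #xA."""
    # Replace the two-character sequence first so it yields one line feed, not two.
    return text.replace("\r\n", "\n").replace("\r", "\n")


print(repr(normalize_line_breaks("one\r\ntwo\rthree\nfour")))
```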

    2.12 Language Identification

    In document processing, it is often useful to identify the natural or formal language in which the content is written. A special attribute named xml:lang may be inserted in documents to specify the language used in the contents and attribute values of any element in an XML document. In valid documents, this attribute, like any other, must be declared if it is used. The values of the attribute are language identifiers as defined by [IETF RFC 1766], “Tags for the Identification of Languages”:

    Language Identification
    [33] LanguageID ::= Langcode ('-' Subcode)*
    [34] Langcode ::= ISO639Code | IanaCode | UserCode
    [35] ISO639Code ::= ([a-z] | [A-Z]) ([a-z] | [A-Z])
    [36] IanaCode ::= ('i' | 'I') '-' ([a-z] | [A-Z])+
    [37] UserCode ::= ('x' | 'X') '-' ([a-z] | [A-Z])+
    [38] Subcode ::= ([a-z] | [A-Z])+

    The Langcode may be any of the following:

    ✦ a two-letter language code as defined by [ISO 639], “Codes for the representation of names of languages”

    ✦ a language identifier registered with the Internet Assigned Numbers Authority [IANA]; these begin with the prefix “i-” (or “I-”)

    ✦ a language identifier assigned by the user, or agreed on between parties in private use; these must begin with the prefix “x-” or “X-” in order to ensure that they do not conflict with names later standardized or registered with IANA.

    There may be any number of Subcode segments; if the first subcode segment exists and the Subcode consists of two letters, then it must be a country code from [ISO 3166], “Codes for the representation of names of countries.” If the first subcode consists of more than two letters, it must be a subcode for the language in question registered with IANA, unless the Langcode begins with the prefix “x-” or “X-”.

    It is customary to give the language code in lower case, and the country code (if any) in upper case. Note that these values, unlike other names in XML documents, are case insensitive.

    For example:

    <p xml:lang="en">The quick brown fox jumps over the lazy dog.</p>
    <p xml:lang="en-GB">What colour is it?</p>
    <p xml:lang="en-US">What color is it?</p>
    <sp who="Faust" desc='leise' xml:lang="de">
      <l>Habe nun, ach! Philosophie,</l>
      <l>Juristerei, und Medizin</l>
      <l>und leider auch Theologie</l>
      <l>durchaus studiert mit heißem Bemüh'n.</l>
    </sp>
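Productions [33] through [38] are regular, so the syntax of a language identifier can be checked with one regular expression. The sketch below covers syntax only; it does not consult the ISO 639, ISO 3166, or IANA registries, and the names `LANGUAGE_ID` and `is_language_id` are hypothetical.

```python
import re

# Langcode: a two-letter ISO 639 code, an "i-" IANA code, or an "x-" user code.
LANGCODE = r"(?:[A-Za-z]{2}|[iI]-[A-Za-z]+|[xX]-[A-Za-z]+)"
# LanguageID: a Langcode followed by any number of "-Subcode" segments.
LANGUAGE_ID = re.compile(LANGCODE + r"(?:-[A-Za-z]+)*")


def is_language_id(value: str) -> bool:
    """Check that `value` has the shape required by production [33]."""
    return LANGUAGE_ID.fullmatch(value) is not None


for candidate in ("en", "en-GB", "x-klingon", "eng", "en-"):
    print(candidate, is_language_id(candidate))
```

A three-letter string such as `eng` is rejected because the grammar allows only two-letter codes without an “i-” or “x-” prefix.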

    The intent declared with xml:lang is considered to apply to all attributes and content of the element where it is specified, unless overridden with an instance of xml:lang on another element within that content.

    A simple declaration for xml:lang might take the form:

    xml:lang NMTOKEN #IMPLIED

    but specific default values may also be given, if appropriate. In a collection of French poems for English students, with glosses and notes in English, the xml:lang attribute might be declared this way:

    <!ATTLIST poem xml:lang NMTOKEN 'fr'>
    <!ATTLIST gloss xml:lang NMTOKEN 'en'>
    <!ATTLIST note xml:lang NMTOKEN 'en'>

    3. Logical Structures

    Each XML document contains one or more elements, the boundaries of which are either delimited by start-tags and end-tags, or, for empty elements, by an empty-element tag. Each element has a type, identified by name, sometimes called its “generic identifier” (GI), and may have a set of attribute specifications. Each attribute specification has a name and a value.

    Element
    [39] element ::= EmptyElemTag | STag content ETag [ WFC: Element Type Match ] [ VC: Element Valid ]

    This specification does not constrain the semantics, use, or (beyond syntax) names of the element types and attributes, except that names beginning with a match to (('X'|'x')('M'|'m')('L'|'l')) are reserved for standardization in this or future versions of this specification.


    Well-Formedness Constraint: Element Type Match: The Name in an element’s end-tag must match the element type in the start-tag.

    Validity Constraint: Element Valid: An element is valid if there is a declaration matching elementdecl where the Name matches the element type, and one of the following holds:

    1. The declaration matches EMPTY and the element has no content.

    2. The declaration matches children and the sequence of child elements belongs to the language generated by the regular expression in the content model, with optional white space (characters matching the nonterminal S) between each pair of child elements.

    3. The declaration matches Mixed and the content consists of character data and child elements whose types match names in the content model.

    4. The declaration matches ANY, and the types of any child elements have been declared.

    3.1 Start-Tags, End-Tags, and Empty-Element Tags

    The beginning of every non-empty XML element is marked by a start-tag.

    Start-tag
    [40] STag ::= '<' Name (S Attribute)* S? '>' [ WFC: Unique Att Spec ]
    [41] Attribute ::= Name Eq AttValue [ VC: Attribute Value Type ] [ WFC: No External Entity References ] [ WFC: No < in Attribute Values ]

    The Name in the start- and end-tags gives the element’s type. The Name-AttValue pairs are referred to as the attribute specifications of the element, with the Name in each pair referred to as the attribute name and the content of the AttValue (the text between the ' or " delimiters) as the attribute value.

    Well-Formedness Constraint: Unique Att Spec: No attribute name may appear more than once in the same start-tag or empty-element tag.

    Validity Constraint: Attribute Value Type: The attribute must have been declared; the value must be of the type declared for it. (For attribute types, see “3.3 Attribute-List Declarations”.)

    Well-Formedness Constraint: No External Entity References: Attribute values cannot contain direct or indirect entity references to external entities.

    Well-Formedness Constraint: No < in Attribute Values: The replacement text of any entity referred to directly or indirectly in an attribute value (other than “&lt;”) must not contain a <.

    […]

    <!ELEMENT %name.para; %content.para; >
    <!ELEMENT container ANY>

    3.2.1 Element Content

    An element type has element content when elements of that type must contain only child elements (no character data), optionally separated by white space (characters matching the nonterminal S). In this case, the constraint includes a content model, a simple grammar governing the allowed types of the child elements and the order in which they are allowed to appear. The grammar is built on content particles (cps), which consist of names, choice lists of content particles, or sequence lists of content particles:

    Element-content Models
    [47] children ::= (choice | seq) ('?' | '*' | '+')?
    [48] cp ::= (Name | choice | seq) ('?' | '*' | '+')?
    [49] choice ::= '(' S? cp ( S? '|' S? cp )* S? ')' [ VC: Proper Group/PE Nesting ]
    [50] seq ::= '(' S? cp ( S? ',' S? cp )* S? ')' [ VC: Proper Group/PE Nesting ]

    where each Name is the type of an element which may appear as a child. Any content particle in a choice list may appear in the element content at the location where the choice list appears in the grammar; content particles occurring in a sequence list must each appear in the element content in the order given in the list.


    The optional character following a name or list governs whether the element or the content particles in the list may occur one or more (+), zero or more (*), or zero or one times (?). The absence of such an operator means that the element or content particle must appear exactly once. This syntax and meaning are identical to those used in the productions in this specification.

    The content of an element matches a content model if and only if it is possible to trace out a path through the content model, obeying the sequence, choice, and repetition operators and matching each element in the content against an element type in the content model. For compatibility, it is an error if an element in the document can match more than one occurrence of an element type in the content model. For more information, see “E. Deterministic Content Models”.

    Validity Constraint: Proper Group/PE Nesting: Parameter-entity replacement text must be properly nested with parenthesized groups. That is to say, if either of the opening or closing parentheses in a choice, seq, or Mixed construct is contained in the replacement text for a parameter entity, both must be contained in the same replacement text. For interoperability, if a parameter-entity reference appears in a choice, seq, or Mixed construct, its replacement text should not be empty, and neither the first nor last non-blank character of the replacement text should be a connector (| or ,).

    Examples of element-content models:

    <!ELEMENT spec (front, body, back?)>
    <!ELEMENT div1 (head, (p | list | note)*, div2*)>
    <!ELEMENT dictionary-body (%div.mix; | %dict.mix;)*>
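Because a content model is a regular expression over element types, matching a sequence of child elements against it can be sketched by translating the model into an ordinary regular expression. The following Python sketch makes several simplifying assumptions: names are ASCII-only, parameter-entity references have already been expanded, and the determinism rule mentioned above is not checked. The function names are hypothetical.

```python
import re


def content_model_to_regex(model: str) -> "re.Pattern[str]":
    """Translate an element-content model into a regex over 'name ' tokens."""
    compact = re.sub(r"\s+", "", model)
    # Each element name becomes a group that matches the name plus a separator,
    # so that "p" cannot accidentally match the start of a longer name.
    pattern = re.sub(
        r"[A-Za-z_][A-Za-z0-9._-]*",
        lambda m: "(?:" + re.escape(m.group(0)) + " )",
        compact,
    )
    # A comma means sequence, which is plain concatenation in a regex;
    # '|', '?', '*', '+' and parentheses already mean the same thing.
    return re.compile(pattern.replace(",", ""))


def children_match(model: str, children: list) -> bool:
    """Check whether the child element types, in order, satisfy the model."""
    sequence = "".join(name + " " for name in children)
    return content_model_to_regex(model).fullmatch(sequence) is not None


print(children_match("(front, body, back?)", ["front", "body"]))
print(children_match("(head, (p | list | note)*, div2*)", ["head", "p", "note", "div2"]))
print(children_match("(front, body, back?)", ["body", "front"]))
```

The translation works because the content-model operators were designed to carry the same meaning as in the grammar productions of the specification itself.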



    3.2.2 Mixed Content

    An element type has mixed content when elements of that type may contain character data, optionally interspersed with child elements. In this case, the types of the child elements may be constrained, but not their order or their number of occurrences:

    Mixed-content Declaration
    [51] Mixed ::= '(' S? '#PCDATA' (S? '|' S? Name)* S? ')*' | '(' S? '#PCDATA' S? ')' [ VC: Proper Group/PE Nesting ] [ VC: No Duplicate Types ]

    where the Names give the types of elements that may appear as children.

    Validity Constraint: No Duplicate Types: The same name must not appear more than once in a single mixed-content declaration.


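    Examples of mixed content declarations:

    <!ELEMENT p (#PCDATA|a|ul|b|i|em)*>
    <!ELEMENT p (#PCDATA | %font; | %phrase; | %special; | %form;)* >
    <!ELEMENT b (#PCDATA)>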



    3.3 Attribute-List Declarations

    Attributes are used to associate name-value pairs with elements. Attribute specifications may appear only within start-tags and empty-element tags; thus, the productions used to recognize them appear in “3.1 Start-Tags, End-Tags, and Empty-Element Tags”. Attribute-list declarations may be used:

    ✦ To define the set of attributes pertaining to a given element type.

    ✦ To establish type constraints for these attributes.

    ✦ To provide default values for attributes.

    Attribute-list declarations specify the name, data type, and default value (if any) of each attribute associated with a given element type:

    Attribute-list Declaration
    [52] AttlistDecl ::= '<!ATTLIST' S Name AttDef* S? '>'
    [53] AttDef ::= S Name S AttType S DefaultDecl

    The Name in the AttlistDecl rule is the type of an element. At user option, an XML processor may issue a warning if attributes are declared for an element type not itself declared, but this is not an error. The Name in the AttDef rule is the name of the attribute.

    When more than one AttlistDecl is provided for a given element type, the contents of all those provided are merged. When more than one definition is provided for the same attribute of a given element type, the first declaration is binding and later declarations are ignored. For interoperability, writers of DTDs may choose to provide at most one attribute-list declaration for a given element type, at most one attribute definition for a given attribute name, and at least one attribute definition in each attribute-list declaration. For interoperability, an XML processor may at user option issue a warning when more than one attribute-list declaration is provided for a given element type, or more than one attribute definition is provided for a given attribute, but this is not an error.
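The merging rule (all declarations for an element type are combined, and the first definition of each attribute is binding) can be sketched with nested dictionaries. The tuple layout and the function name `merge_attlists` are assumptions made for this illustration, not part of the specification.

```python
def merge_attlists(definitions):
    """Merge attribute definitions given in document order.

    `definitions` is an iterable of (element_type, attr_name, attr_type, default)
    tuples, one per AttDef, in the order they appear in the DTD.
    """
    merged = {}
    for element_type, attr_name, attr_type, default in definitions:
        attrs = merged.setdefault(element_type, {})
        # setdefault keeps the first definition; later ones are ignored.
        attrs.setdefault(attr_name, (attr_type, default))
    return merged


definitions = [
    ("list", "type", "CDATA", "#IMPLIED"),
    ("list", "type", "(bullets|ordered)", "ordered"),  # ignored: defined earlier
    ("list", "id", "ID", "#REQUIRED"),                 # merged into the same type
]
print(merge_attlists(definitions))
```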

    3.3.1 Attribute Types

    XML attribute types are of three kinds: a string type, a set of tokenized types, and enumerated types. The string type may take any literal string as a value; the tokenized types have varying lexical and semantic constraints, as noted:

    Attribute Types
    [54] AttType ::= StringType | TokenizedType | EnumeratedType
    [55] StringType ::= 'CDATA'
    [56] TokenizedType ::= 'ID' [ VC: ID ] [ VC: One ID per Element Type ] [ VC: ID Attribute Default ]
                         | 'IDREF' [ VC: IDREF ]
                         | 'IDREFS' [ VC: IDREF ]
                         | 'ENTITY' [ VC: Entity Name ]
                         | 'ENTITIES' [ VC: Entity Name ]
                         | 'NMTOKEN' [ VC: Name Token ]
                         | 'NMTOKENS' [ VC: Name Token ]

    Validity Constraint: ID: Values of type ID must match the Name production. A name must not appear more than once in an XML document as a value of this type; i.e., ID values must uniquely identify the elements which bear them.

    Validity Constraint: One ID per Element Type: No element type may have more than one ID attribute specified.

    Validity Constraint: ID Attribute Default: An ID attribute must have a declared default of #IMPLIED or #REQUIRED.

    Validity Constraint: IDREF: Values of type IDREF must match the Name production, and values of type IDREFS must match Names; each Name must match the value of an ID attribute on some element in the XML document; i.e. IDREF values must match the value of some ID attribute.

    Validity Constraint: Entity Name: Values of type ENTITY must match the Name production, values of type ENTITIES must match Names; each Name must match the name of an unparsed entity declared in the DTD.

    Validity Constraint: Name Token: Values of type NMTOKEN must match the Nmtoken production; values of type NMTOKENS must match Nmtokens.
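The ID and IDREF constraints amount to a uniqueness check followed by a lookup. The sketch below assumes Python's `xml.etree.ElementTree`, which does not read attribute types from a DTD, so the caller must say which attribute names are of type ID and IDREF(S); the defaults `id` and `ref` and the function name `check_ids` are purely illustrative.

```python
import xml.etree.ElementTree as ET


def check_ids(root, id_attr="id", idref_attrs=("ref",)):
    """Return a list of messages for violated ID / IDREF constraints."""
    errors = []
    seen = set()
    # First pass: every ID value must be unique within the document.
    for elem in root.iter():
        value = elem.get(id_attr)
        if value is not None:
            if value in seen:
                errors.append(f"duplicate ID {value!r}")
            seen.add(value)
    # Second pass: every IDREF (or each name in an IDREFS list) must match an ID.
    for elem in root.iter():
        for attr in idref_attrs:
            for name in (elem.get(attr) or "").split():
                if name not in seen:
                    errors.append(f"IDREF {name!r} matches no ID")
    return errors


doc = '<doc><item id="a"/><item id="a"/><link ref="a b"/></doc>'
print(check_ids(ET.fromstring(doc)))
```

Two passes are used so that a reference may point forward to an ID that appears later in the document.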

    Enumerated attributes can take one of a list of values provided in the declaration. There are two kinds of enumerated types:

    Enumerated Attribute Types
    [57] EnumeratedType ::= NotationType | Enumeration
    [58] NotationType ::= 'NOTATION' S '(' S? Name (S? '|' S? Name)* S? ')' [ VC: Notation Attributes ]
    [59] Enumeration ::= '(' S? Nmtoken (S? '|' S? Nmtoken)* S? ')' [ VC: Enumeration ]


    A NOTATION attribute identifies a notation, declared in the DTD with associated system and/or public identifiers, to be used in interpreting the element to which the attribute is attached.

    Validity Constraint: Notation Attributes: Values of this type must match one of the notation names included in the declaration; all notation names in the declaration must be declared.

    Validity Constraint: Enumeration: Values of this type must match one of the Nmtoken tokens in the declaration.

    For interoperability, the same Nmtoken should not occur more than once in the enumerated attribute types of a single element type.

    3.3.2 Attribute Defaults

    An attribute declaration provides information on whether the attribute’s presence is required, and if not, how an XML processor should react if a declared attribute is absent in a document.

    Attribute Defaults
    [60] DefaultDecl ::= '#REQUIRED' | '#IMPLIED' | (('#FIXED' S)? AttValue) [ VC: Required Attribute ] [ VC: Attribute Default Legal ] [ WFC: No < in Attribute Values ] [ VC: Fixed Attribute Default ]

    In an attribute declaration, #REQUIRED means that the attribute must always be provided, #IMPLIED that no default value is provided. If the declaration is neither #REQUIRED nor #IMPLIED, then the AttValue value contains the declared default value; the #FIXED keyword states that the attribute must always have the default value. If a default value is declared, when an XML processor encounters an omitted attribute, it is to behave as though the attribute were present with the declared default value.

    Validity Constraint: Required Attribute: If the default declaration is the keyword #REQUIRED, then the attribute must be specified for all elements of the type in the attribute-list declaration.

    Validity Constraint: Attribute Default Legal: The declared default value must meet the lexical constraints of the declared attribute type.

    Validity Constraint: Fixed Attribute Default: If an attribute has a default value declared with the #FIXED keyword, instances of that attribute must match the default value.


    Examples of attribute-list declarations:

    <!ATTLIST termdef
              id      ID      #REQUIRED
              name    CDATA   #IMPLIED>
    <!ATTLIST list
              type    (bullets|ordered|glossary)  "ordered">
    <!ATTLIST form
              method  CDATA   #FIXED "POST">
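As a rough illustration of default handling, the sketch below feeds a document with an internal DTD subset to Python's xml.parsers.expat module (a non-validating processor) and records the attributes reported for each start-tag. The helper name attributes_seen and the sample document are illustrative assumptions; the form declaration mirrors the example above.

```python
import xml.parsers.expat


def attributes_seen(document):
    """Parse document and return {element name: attributes reported}."""
    seen = {}
    parser = xml.parsers.expat.ParserCreate()

    def on_start(name, attrs):
        # attrs includes declared defaults, because the parser's
        # specified_attributes flag is left at its default of False.
        seen[name] = dict(attrs)

    parser.StartElementHandler = on_start
    parser.Parse(document, True)
    return seen


SAMPLE = """<!DOCTYPE form [
  <!ELEMENT form EMPTY>
  <!ATTLIST form method CDATA #FIXED "POST">
]>
<form/>"""
```

Calling attributes_seen(SAMPLE) should report the method attribute even though the start-tag omits it, which is the behavior the paragraph on declared defaults describes.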

    3.3.3 Attribute-Value Normalization

    Before the value of an attribute is passed to the application or checked for validity, the XML processor must normalize it as follows:

    ✦ a character reference is processed by appending the referenced character to the attribute value.

    ✦ an entity reference is processed by recursively processing the replacement text of the entity.

    ✦ a whitespace character (#x20, #xD, #xA, #x9) is processed by appending #x20 to the normalized value, except that only a single #x20 is appended for a “#xD#xA” sequence that is part of an external parsed entity or the literal entity value of an internal parsed entity.

    ✦ other characters are processed by appending them to the normalized value.

    If the declared value is not CDATA, then the XML processor must further process the normalized attribute value by discarding any leading and trailing space (#x20) characters, and by replacing sequences of space (#x20) characters by a single space (#x20) character. All attributes for which no declaration has been read should be treated by a non-validating parser as if declared CDATA.
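The whitespace steps above can be sketched as follows. This is a simplified model, not a processor implementation: it assumes character and entity references have already been expanded, and it collapses every #xD#xA pair to one space, which the specification prescribes only for pairs that come from entity text. The function name is hypothetical.

```python
import re


def normalize_attribute_value(value, declared_type="CDATA"):
    """Apply the whitespace rules of attribute-value normalization."""
    out = []
    i = 0
    while i < len(value):
        ch = value[i]
        if ch == "\r" and value[i + 1:i + 2] == "\n":
            out.append(" ")  # assumption: treat the pair as one whitespace
            i += 2
            continue
        # Each whitespace character becomes a single #x20.
        out.append(" " if ch in " \r\n\t" else ch)
        i += 1
    normalized = "".join(out)
    if declared_type != "CDATA":
        # Non-CDATA types: trim, then collapse runs of spaces.
        normalized = re.sub(" +", " ", normalized.strip(" "))
    return normalized
```

For a CDATA attribute the interior spacing survives, while any other declared type ends up with single spaces between tokens.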

    3.4 Conditional Sections

    Conditional sections are portions of the document type declaration external subset which are included in, or excluded from, the logical structure of the DTD based on the keyword which governs them.

    Conditional Section
    [61] conditionalSect ::= includeSect | ignoreSect
    [62] includeSect ::= '<![' S? 'INCLUDE' S? '[' extSubsetDecl ']]>'
    [63] ignoreSect ::= '<![' S? 'IGNORE' S? '[' ignoreSectContents* ']]>'
    [64] ignoreSectContents ::= Ignore ('<![' ignoreSectContents ']]>' Ignore)*
    [65] Ignore ::= Char* - (Char* ('<![' | ']]>') Char*)

    Like the internal and external DTD subsets, a conditional section may contain one or more complete declarations, comments, processing instructions, or nested conditional sections, intermingled with white space.

    If the keyword of the conditional section is INCLUDE, then the contents of the conditional section are part of the DTD. If the keyword of the conditional section is IGNORE, then the contents of the conditional section are not logically part of the DTD. Note that for reliable parsing, the contents of even ignored conditional sections must be read in order to detect nested conditional sections and ensure that the end of the outermost (ignored) conditional section is properly detected. If a conditional section with a keyword of INCLUDE occurs within a larger conditional section with a keyword of IGNORE, both the outer and the inner conditional sections are ignored.

    If the keyword of the conditional section is a parameter-entity reference, the parameter entity must be replaced by its content before the processor decides whether to include or ignore the conditional section.

    An example:

    <!ENTITY % draft 'INCLUDE' >
    <!ENTITY % final 'IGNORE' >

    <![%draft;[
    <!ELEMENT book (comments*, title, body, supplements?)>
    ]]>
    <![%final;[
    <!ELEMENT book (title, body, supplements?)>
    ]]>
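The remark that even ignored sections must be read to find their true end can be made concrete with a small sketch. It counts nested openers and closers in a string of DTD text; the function name and the sample text in the test are illustrative assumptions, and the sketch does not parse the declarations themselves.

```python
def end_of_conditional_section(text, start):
    """Return the index just past the ']]>' closing the section at start.

    Nested '<![' openers are counted so that an inner ']]>' does not
    end the outermost section prematurely.
    """
    if not text.startswith("<![", start):
        raise ValueError("no conditional section at this offset")
    depth = 0
    i = start
    while i < len(text):
        if text.startswith("<![", i):
            depth += 1
            i += 3
        elif text.startswith("]]>", i):
            depth -= 1
            i += 3
            if depth == 0:
                return i
        else:
            i += 1
    raise ValueError("unterminated conditional section")
```

A scanner that simply stopped at the first ]]> would resume parsing in the middle of the ignored text, which is the failure the nesting count avoids.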

    4. Physical Structures

    An XML document may consist of one or many storage units. These are called entities; they all have content and are all (except for the document entity, see below, and the external DTD subset) identified by name. Each XML document has one entity called the document entity, which serves as the starting point for the XML processor and may contain the whole document.

    Entities may be either parsed or unparsed. A parsed entity’s contents are referred to as its replacement text; this text is considered an integral part of the document.

    An unparsed entity is a resource whose contents may or may not be text, and if text, may not be XML. Each unparsed entity has an associated notation, identified by name. Beyond a requirement that an XML processor make the identifiers for the entity and notation available to the application, XML places no constraints on the contents of unparsed entities.

    Parsed entities are invoked by name using entity references; unparsed entities by name, given in the value of ENTITY or ENTITIES attributes.

    General entities are entities for use within the document content. In this specification, general entities are sometimes referred to with the unqualified term entity when this leads to no ambiguity. Parameter entities are parsed entities for use within the DTD. These two types of entities use different forms of reference and are recognized in different contexts. Furthermore, they occupy different namespaces; a parameter entity and a general entity with the same name are two distinct entities.

    4.1 Character and Entity References

    A character reference refers to a specific character in the ISO/IEC 10646 character set, for example one not directly accessible from available input devices.

    Character Reference
    [66] CharRef ::= '&#' [0-9]+ ';' | '&#x' [0-9a-fA-F]+ ';' [ WFC: Legal Character ]

    Well-Formedness Constraint: Legal Character: Characters referred to using character references must match the production for Char.

    If the character reference begins with “&#x”, the digits and letters up to the terminating ; provide a hexadecimal representation of the character’s code point in ISO/IEC 10646. If it begins just with “&#”, the digits up to the terminating ; provide a decimal representation of the character’s code point.

    An entity reference refers to the content of a named entity. References to parsed general entities use ampersand (&) and semicolon (;) as delimiters. Parameter-entity references use percent-sign (%) and semicolon (;) as delimiters.
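A minimal sketch of character-reference expansion follows. The regular expression mirrors production [66]; the is_xml_char helper encodes the Char production from section 2.2 of the specification, which lies outside this part of the appendix. Both function names are illustrative, not part of any library.

```python
import re

# Decimal form: &#NNN;   Hexadecimal form: &#xHHH;
_CHAR_REF = re.compile(r"&#(?:x([0-9a-fA-F]+)|([0-9]+));")


def is_xml_char(code_point):
    """True if code_point is allowed by the Char production."""
    return (
        code_point in (0x9, 0xA, 0xD)
        or 0x20 <= code_point <= 0xD7FF
        or 0xE000 <= code_point <= 0xFFFD
        or 0x10000 <= code_point <= 0x10FFFF
    )


def expand_char_refs(text):
    """Replace each character reference in text with its character."""
    def replace(match):
        hex_digits, dec_digits = match.groups()
        code_point = int(hex_digits, 16) if hex_digits else int(dec_digits)
        if not is_xml_char(code_point):
            # Legal Character constraint: the referenced character must match Char.
            raise ValueError(f"illegal character reference {match.group(0)}")
        return chr(code_point)

    return _CHAR_REF.sub(replace, text)
```

The sketch makes a single pass and does not rescan its output, so a reference that expands to an ampersand is not interpreted again.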

    Entity Reference
    [67] Reference ::= EntityRef | CharRef
    [68] EntityRef ::= '&' Name ';' [ WFC: Entity Declared ] [ VC: Entity Declared ] [ WFC: Parsed Entity ] [ WFC: No Recursion ]
    [69] PEReference ::= '%' Name ';' [ VC: Entity Declared ] [ WFC: No Recursion ] [ WFC: In DTD ]


    Well-Formedness Constraint: Entity Declared: In a document without any DTD, a document with only an internal DTD subset which contains no parameter entity references, or a document with “standalone='yes'”, the Name given in the entity reference must match that in an entity declaration, except that well-formed documents need not declare any of the following entities: amp, lt, gt, apos, quot. The declaration of a parameter entity must precede any reference to it. Similarly, the declaration of a general entity must precede any reference to it which appears in a default value in an attribute-list declaration. Note that if entities are declared in the external subset or in external parameter entities, a non-validating processor is not obligated to read and process their declarations; for such documents, the rule that an entity must be declared is a well-formedness constraint only if standalone='yes'.

    Validity Constraint: Entity Declared: In a document with an external subset or external parameter entities with “standalone='no'”, the Name given in the entity reference must match that in an entity declaration. For interoperability, valid documents should declare the entities amp, lt, gt, apos, quot, in the form specified in “4.6 Predefined Entities”. The declaration of a parameter entity must precede any reference to it. Similarly, the declaration of a general entity must precede any reference to it which appears in a default value in an attribute-list declaration.

    Well-Formedness Constraint: Parsed Entity: An entity reference must not contain the name of an unparsed entity. Unparsed entities may be referred to only in attribute values declared to be of type ENTITY or ENTITIES.

    Well-Formedness Constraint: No Recursion: A parsed entity must not contain a recursive reference to itself, either directly or indirectly.
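The No Recursion constraint amounts to cycle detection over entity references. The sketch below assumes the general entities are available as a plain dictionary from name to replacement text; that representation, the simplified name pattern, and the function name are illustrative assumptions rather than anything an XML processor exposes.

```python
import re

# Simplified general-entity reference pattern; it deliberately skips
# character references such as &#38; because '#' cannot start a name.
_ENTITY_REF = re.compile(r"&([A-Za-z_:][\w.\-:]*);")


def find_recursion(entities):
    """Return a list of entity names forming a cycle, or None."""
    def visit(name, path):
        if name in path:
            # The entity refers back to itself, directly or indirectly.
            return path[path.index(name):] + [name]
        for referenced in _ENTITY_REF.findall(entities.get(name, "")):
            cycle = visit(referenced, path + [name])
            if cycle:
                return cycle
        return None

    for name in entities:
        cycle = visit(name, [])
        if cycle:
            return cycle
    return None
```

The returned list names the chain of references, which makes an indirect loop such as a, b, c, a easy to report.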

    Well-Formedness Constraint: In DTD: Parameter-entity references may only appear in the DTD.

    Examples of character and entity references:

    Type less-than ( “&”> “'”> “"”>

    Note that the < and & characters in the declarations of “lt” and “amp” are doubly escaped to meet the requirement that entity replacement be well-formed.

    4.7 Notation Declarations

    Notations identify by name the format of unparsed entities, the format of elements which bear a notation attribute, or the application to which a processing instruction is addressed.

    Notation declarations provide a name for the notation, for use in entity and attribute-list declarations and in attribute specifications, and an external identifier for the notation which may allow an XML processor or its client application to locate a helper application capable of processing data in the given notation.

    Notation Declarations
    [82] NotationDecl ::= '<!NOTATION' S Name S (ExternalID | PublicID) S? '>'
    [83] PublicID ::= 'PUBLIC' S PubidLiteral

    XML processors must provide applications with the name and external identifier(s) of any notation declared and referred to in an attribute value, attribute definition, or entity declaration. They may additionally resolve the external identifier into the system identifier, file name, or other information needed to allow the application to call a processor for data in the notation described. (It is not an error, however, for XML documents to declare and refer to notations for which notation-specific applications are not available on the system where the XML processor or application is running.)

    4.8 Document Entity

    The document entity serves as the root of the entity tree and a starting-point for an XML processor. This specification does not specify how the document entity is to be located by an XML processor; unlike other entities, the document entity has no name and might well appear on a processor input stream without any identification at all.


    5. Conformance

    5.1 Validating and Non-Validating Processors

    Conforming XML processors fall into two classes: validating and non-validating.

    Validating and non-validating processors alike must report violations of this specification’s well-formedness constraints in the content of the document entity and any other parsed entities that they read.

    Validating processors must report violations of the constraints expressed by the declarations in the DTD, and failures to fulfill the validity constraints given in this specification. To accomplish this, validating XML processors must read and process the entire DTD and all external parsed entities referenced in the document.

    Non-validating processors are required to check only the document entity, including the entire internal DTD subset, for well-formedness. While they are not required to check the document for validity, they are required to process all the declarations they read in the internal DTD subset and in any parameter entity that they read, up to the first reference to a parameter entity that they do not read; that is to say, they must use the information in those declarations to normalize attribute values, include the replacement text of internal entities, and supply default attribute values. They must not process entity declarations or attribute-list declarations encountered after a reference to a parameter entity that is not read, since the entity may have contained overriding declarations.
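The obligations of a non-validating processor can be observed with Python's xml.parsers.expat module, used here only as a convenient example of such a processor. The helper name text_content and both sample documents are illustrative assumptions.

```python
import xml.parsers.expat


def text_content(document):
    """Return the character data a non-validating parser reports."""
    chunks = []
    parser = xml.parsers.expat.ParserCreate()
    # Replacement text of internal entities arrives as character data.
    parser.CharacterDataHandler = chunks.append
    parser.Parse(document, True)
    return "".join(chunks)


# The internal subset declares the entity, so its text is included.
DECLARED = '<!DOCTYPE doc [<!ENTITY who "world">]><doc>hello &who;</doc>'

# No DTD at all: the reference violates the Entity Declared constraint.
UNDECLARED = "<doc>hello &who;</doc>"
```

Parsing DECLARED yields the expanded text, while parsing UNDECLARED raises xml.parsers.expat.ExpatError, since a well-formedness violation must be reported even when no validation takes place.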

    5.2 Using XML Processors

    The behavior of a validating XML processor is highly predictable; it must read every piece of a document and report all well-formedness and validity violations. Less is required of a non-validating processor; it need not read any part of the document other than the document entity. This has two effects that may be important to users of XML processors:

    ✦ Certain well-formedness errors, specifically those that require reading external entities, may not be detected by a non-validating processor. Examples include the constraints entitled Entity Declared, Parsed Entity, and No Recursion, as well as some of the cases described as forbidden in “4.4 XML Processor Treatment of Entities and References”.

    ✦ The information passed from the processor to the application may vary, depending on whether the processor reads parameter and external entities. For example, a non-validating processor may not normalize attribute values, include the replacement text of internal entities, or supply default attribute values, where doing so depends on having read declarations in external or parameter entities.


    For maximum reliability in interoperating between different XML processors, applications which use non-validating processors should not rely on any behaviors not required of such processors. Applications which require facilities such as the use of default attributes or internal entities which are declared in external entities should use validating XML processors.

    6. Notation

    The formal grammar of XML is given in this specification using a simple Extended Backus-Naur Form (EBNF) notation. Each rule in the grammar defines one symbol, in the form

    symbol ::= expression

    Symbols are written with an initial capital letter if they are defined by a regular expression, or with an initial lower case letter otherwise. Literal strings are quoted.

    Within the expression on the right-hand side of a rule, the following expressions are used to match strings of one or more characters:

    #xN where N is a hexadecimal integer, the expression matches the character in ISO/IEC 10646 whose canonical (UCS-4) code value, when interpreted as an unsigned binary number, has the value indicated. The number of leading zeros in the #xN form is insignificant; the number of leading zeros in the corresponding code value is governed by the character encoding in use and is not significant for XML.

    [a-zA-Z], [#xN-#xN] matches any character with a value in the range(s) indicated (inclusive).

    [^a-z], [^#xN-#xN] matches any character with a value outside the range indicated.

    [^abc], [^#xN#xN#xN] matches any character with a value not among the characters given.

    "string" matches a literal string matching that given inside the double quotes.

    'string' matches a literal string matching that given inside the single quotes.


    These symbols may be combined to match more complex patterns as follows, where A and B represent simple expressions:

    (expression) expression is treated as a unit and may be combined as described in this list.

    A? matches A or nothing; optional A.

    A B matches A followed by B.

    A | B matches A or B but not both.

    A - B matches any string that matches A but does not match B.

    A+ matches one or more occurrences of A.

    A* matches zero or more occurrences of A.

    Other notations used in the productions are:

    /* ... */ comment.

    [ wfc: ... ] well-formedness constraint; this identifies by name a constraint on well-formed documents associated with a production.

    [ vc: ... ] validity constraint; this identifies by name a constraint on valid documents associated with a production.


    Appendices

    A. References

    A.1 Normative References

    IANA (Internet Assigned Numbers Authority). Official Names for Character Sets, ed. Keld Simonsen et al. See ftp://ftp.isi.edu/in-notes/iana/assignments/character-sets.

    IETF RFC 1766 IETF (Internet Engineering Task Force). RFC 1766: Tags for the Identification of Languages, ed. H. Alvestrand. 1995.

    ISO 639 (International Organization for Standardization). ISO 639:1988 (E). Code for the representation of names of languages. [Geneva]: International Organization for Standardization, 1988.

    ISO 3166 (International Organization for Standardization). ISO 3166-1:1997 (E). Codes for the representation of names of countries and their subdivisions — Part 1: Country codes. [Geneva]: International Organization for Standardization, 1997.

    ISO/IEC 10646 ISO (International Organization for Standardization). ISO/IEC 10646-1993 (E). Information technology — Universal Multiple-Octet Coded Character Set (UCS) — Part 1: Architecture and Basic Multilingual Plane. [Geneva]: International Organization for Standardization, 1993 (plus amendments AM 1 through AM 7).

    Unicode The Unicode Consortium. The Unicode Standard, Version 2.0. Reading, Mass.: Addison-Wesley Developers Press, 1996.

    A.2 Other References

    Aho/Ullman Aho, Alfred V., Ravi Sethi, and Jeffrey D. Ullman. Compilers: Principles, Techniques, and Tools. Reading: Addison-Wesley, 1986, rpt. corr. 1988.

    Berners-Lee et al. Berners-Lee, T., R. Fielding, and L. Masinter. Uniform Resource Identifiers (URI): Generic Syntax and Semantics. 1997. (Work in progress; see updates to RFC1738.)

    Brüggemann-Klein Brüggemann-Klein, Anne. Regular Expressions into Finite Automata. Extended abstract in I. Simon, Hrsg., LATIN 1992, S. 97-98. Springer-Verlag, Berlin 1992. Full Version in Theoretical Computer Science 120: 197-213, 1993.

    Brüggemann-Klein and Wood Brüggemann-Klein, Anne, and Derick Wood. Deterministic Regular Languages. Universität Freiburg, Institut für Informatik, Bericht 38, Oktober 1991.


    Clark James Clark. Comparison of SGML and XML. See http://www.w3.org/TR/NOTE-sgml-xml-971215.

    IETF RFC1738 IETF (Internet Engineering Task Force). RFC 1738: Uniform Resource Locators (URL), ed. T. Berners-Lee, L. Masinter, M. McCahill. 1994.

    IETF RFC1808 IETF (Internet Engineering Task Force). RFC 1808: Relative Uniform Resource Locators, ed. R. Fielding. 1995.

    IETF RFC2141 IETF (Internet Engineering Task Force). RFC 2141: URN Syntax, ed. R. Moats. 1997.

    ISO 8879 ISO (International Organization for Standardization). ISO 8879:1986(E). Information processing — Text and Office Systems — Standard Generalized Markup Language (SGML). First edition — 1986-10-15. [Geneva]: International Organization for Standardization, 1986.

    ISO/IEC 10744 ISO (International Organization for Standardization). ISO/IEC 10744-1992 (E). Information technology — Hypermedia/Time-based Structuring Language (HyTime). [Geneva]: International Organization for Standardization, 1992. Extended Facilities Annexe. [Geneva]: International Organization for Standardization, 1996.

    B. Character Classes

    Following the characteristics defined in the Unicode standard, characters are classed as base characters (among others, these contain the alphabetic characters of the Latin alphabet, without diacritics), ideographic characters, and combining characters (among others, this class contains most diacritics); these classes combine to form the class of letters. Digits and extenders are also distinguished.

    Characters
    [84] Letter ::= BaseChar | Ideographic

    [85] BaseChar ::= [#x0041-#x005A] | [#x0061-#x007A] | [#x00C0-#x00D6] | [#x00D8-#x00F6] | [#x00F8-#x00FF] | [#x0100-#x0131] | [#x0134-#x013E] | [#x0141-#x0148] | [#x014A-#x017E] | [#x0180-#x01C3] | [#x01CD-#x01F0] | [#x01F4-#x01F5] | [#x01FA-#x0217] | [#x0250-#x02A8] | [#x02BB-#x02C1] | #x0386 | [#x0388-#x038A] | #x038C | [#x038E-#x03A1] | [#x03A3-#x03CE] | [#x03D0-#x03D6] | #x03DA | #x03DC | #x03DE | #x03E0 | [#x03E2-#x03F3] | [#x0401-#x040C] | [#x040E-#x044F] | [#x0451-#x045C] | [#x045E-#x0481] | [#x0490-#x04C4] | [#x04C7-#x04C8] | [#x04CB-#x04CC] | [#x04D0-#x04EB] | [#x04EE-#x04F5] | [#x04F8-#x04F9] | [#x0531-#x0556] | #x0559 | [#x0561-#x0586] | [#x05D0-#x05EA] | [#x05F0-#x05F2] | [#x0621-#x063A] | [#x0641-#x064A] | [#x0671-#x06B7] | [#x06BA-#x06BE] | [#x06C0-#x06CE] | [#x06D0-#x06D3] | #x06D5 | [#x06E5-#x06E6] | [#x0905-#x0939] | #x093D | [#x0958-#x0961] | [#x0985-#x098C] | [#x098F-#x0990] | [#x0993-#x09A8] | [#x09AA-#x09B0] | #x09B2 | [#x09B6-#x09B9] | [#x09DC-#x09DD] | [#x09DF-#x09E1] | [#x09F0-#x09F1] | [#x0A05-#x0A0A] | [#x0A0F-#x0A10] | [#x0A13-#x0A28] | [#x0A2A-#x0A30] | [#x0A32-#x0A33] | [#x0A35-#x0A36] | [#x0A38-#x0A39] | [#x0A59-#x0A5C] | #x0A5E | [#x0A72-#x0A74] | [#x0A85-#x0A8B] | #x0A8D | [#x0A8F-#x0A91] | [#x0A93-#x0AA8] | [#x0AAA-#x0AB0] | [#x0AB2-#x0AB3] | [#x0AB5-#x0AB9] | #x0ABD | #x0AE0 | [#x0B05-#x0B0C] | [#x0B0F-#x0B10] | [#x0B13-#x0B28] | [#x0B2A-#x0B30] | [#x0B32-#x0B33] | [#x0B36-#x0B39] | #x0B3D | [#x0B5C-#x0B5D] | [#x0B5F-#x0B61] | [#x0B85-#x0B8A] | [#x0B8E-#x0B90] | [#x0B92-#x0B95] | [#x0B99-#x0B9A] | #x0B9C | [#x0B9E-#x0B9F] | [#x0BA3-#x0BA4] | [#x0BA8-#x0BAA] | [#x0BAE-#x0BB5] | [#x0BB7-#x0BB9] | [#x0C05-#x0C0C] | [#x0C0E-#x0C10] | [#x0C12-#x0C28] | [#x0C2A-#x0C33] | [#x0C35-#x0C39] | [#x0C60-#x0C61] | [#x0C85-#x0C8C] | [#x0C8E-#x0C90] | [#x0C92-#x0CA8] | [#x0CAA-#x0CB3] | [#x0CB5-#x0CB9] | #x0CDE | [#x0CE0-#x0CE1] | [#x0D05-#x0D0C] | [#x0D0E-#x0D10] | [#x0D12-#x0D28] | [#x0D2A-#x0D39] | [#x0D60-#x0D61] | [#x0E01-#x0E2E] | #x0E30 | [#x0E32-#x0E33] | [#x0E40-#x0E45] | [#x0E81-#x0E82] | #x0E84 | [#x0E87-#x0E88] | #x0E8A | #x0E8D | [#x0E94-#x0E97] | [#x0E99-#x0E9F] | [#x0EA1-#x0EA3] | #x0EA5 | #x0EA7 | [#x0EAA-#x0EAB] | [#x0EAD-#x0EAE] | #x0EB0 | [#x0EB2-#x0EB3] | #x0EBD | [#x0EC0-#x0EC4] | [#x0F40-#x0F47] | [#x0F49-#x0F69] | [#x10A0-#x10C5] | [#x10D0-#x10F6] | #x1100 | [#x1102-#x1103] | [#x1105-#x1107] | #x1109 | [#x110B-#x110C] | [#x110E-#x1112] | #x113C | #x113E | #x1140 | #x114C | #x114E | #x1150 | [#x1154-#x1155] | #x1159 | [#x115F-#x1161] | #x1163 | #x1165 | #x1167 | #x1169 | [#x116D-#x116E] | [#x1172-#x1173] | #x1175 | #x119E | #x11A8 | #x11AB | [#x11AE-#x11AF] | [#x11B7-#x11B8] | #x11BA | [#x11BC-#x11C2] | #x11EB | #x11F0 | #x11F9 | [#x1E00-#x1E9B] | [#x1EA0-#x1EF9] | [#x1F00-#x1F15] | [#x1F18-#x1F1D] | [#x1F20-#x1F45] | [#x1F48-#x1F4D] | [#x1F50-#x1F57] | #x1F59 | #x1F5B | #x1F5D | [#x1F5F-#x1F7D] | [#x1F80-#x1FB4] | [#x1FB6-#x1FBC] | #x1FBE | [#x1FC2-#x1FC4] | [#x1FC6-#x1FCC] | [#x1FD0-#x1FD3] | [#x1FD6-#x1FDB] | [#x1FE0-#x1FEC] | [#x1FF2-#x1FF4] | [#x1FF6-#x1FFC] | #x2126 | [#x212A-#x212B] | #x212E | [#x2180-#x2182] | [#x3041-#x3094] | [#x30A1-#x30FA] | [#x3105-#x312C] | [#xAC00-#xD7A3]

    [86] Ideographic ::= [#x4E00-#x9FA5] | #x3007 | [#x3021-#x3029]

    [87] CombiningChar ::= [#x0300-#x0345] | [#x0360-#x0361] | [#x0483-#x0486] | [#x0591-#x05A1] | [#x05A3-#x05B9] | [#x05BB-#x05BD] | #x05BF | [#x05C1-#x05C2] | #x05C4 | [#x064B-#x0652] | #x0670 | [#x06D6-#x06DC] | [#x06DD-#x06DF] | [#x06E0-#x06E4] | [#x06E7-#x06E8] | [#x06EA-#x06ED] | [#x0901-#x0903] | #x093C | [#x093E-#x094C] | #x094D | [#x0951-#x0954] | [#x0962-#x0963] | [#x0981-#x0983] | #x09BC | #x09BE | #x09BF | [#x09C0-#x09C4] | [#x09C7-#x09C8] | [#x09CB-#x09CD] | #x09D7 | [#x09E2-#x09E3] | #x0A02 | #x0A3C | #x0A3E | #x0A3F | [#x0A40-#x0A42] | [#x0A47-#x0A48] | [#x0A4B-#x0A4D] | [#x0A70-#x0A71] | [#x0A81-#x0A83] | #x0ABC | [#x0ABE-#x0AC5] | [#x0AC7-#x0AC9] | [#x0ACB-#x0ACD] | [#x0B01-#x0B03] | #x0B3C | [#x0B3E-#x0B43] | [#x0B47-#x0B48] | [#x0B4B-#x0B4D] | [#x0B56-#x0B57] | [#x0B82-#x0B83] | [#x0BBE-#x0BC2] | [#x0BC6-#x0BC8] | [#x0BCA-#x0BCD] | #x0BD7 | [#x0C01-#x0C03] | [#x0C3E-#x0C44] | [#x0C46-#x0C48] | [#x0C4A-#x0C4D] | [#x0C55-#x0C56] | [#x0C82-#x0C83] | [#x0CBE-#x0CC4] | [#x0CC6-#x0CC8] | [#x0CCA-#x0CCD] | [#x0CD5-#x0CD6] | [#x0D02-#x0D03] | [#x0D3E-#x0D43] | [#x0D46-#x0D48] | [#x0D4A-#x0D4D] | #x0D57 | #x0E31 | [#x0E34-#x0E3A] | [#x0E47-#x0E4E] | #x0EB1 | [#x0EB4-#x0EB9] | [#x0EBB-#x0EBC] | [#x0EC8-#x0ECD] | [#x0F18-#x0F19] | #x0F35 | #x0F37 | #x0F39 | #x0F3E | #x0F3F | [#x0F71-#x0F84] | [#x0F86-#x0F8B] | [#x0F90-#x0F95] | #x0F97 | [#x0F99-#x0FAD] | [#x0FB1-#x0FB7] | #x0FB9 | [#x20D0-#x20DC] | #x20E1 | [#x302A-#x302F] | #x3099 | #x309A

    [88] Digit ::= [#x0030-#x0039] | [#x0660-#x0669] | [#x06F0-#x06F9] | [#x0966-#x096F] | [#x09E6-#x09EF] | [#x0A66-#x0A6F] | [#x0AE6-#x0AEF] | [#x0B66-#x0B6F] | [#x0BE7-#x0BEF] | [#x0C66-#x0C6F] | [#x0CE6-#x0CEF] | [#x0D66-#x0D6F] | [#x0E50-#x0E59] | [#x0ED0-#x0ED9] | [#x0F20-#x0F29]

    [89] Extender ::= #x00B7 | #x02D0 | #x02D1 | #x0387 | #x0640 | #x0E46 | #x0EC6 | #x3005 | [#x3031-#x3035] | [#x309D-#x309E] | [#x30FC-#x30FE]
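A character-class production translates directly into a range lookup. The sketch below encodes production [88] Digit as a list of inclusive code-point ranges; the names DIGIT_RANGES and is_digit are illustrative, and the other classes could be handled the same way with their own tables.

```python
# Inclusive code-point ranges copied from production [88] Digit.
DIGIT_RANGES = [
    (0x0030, 0x0039), (0x0660, 0x0669), (0x06F0, 0x06F9), (0x0966, 0x096F),
    (0x09E6, 0x09EF), (0x0A66, 0x0A6F), (0x0AE6, 0x0AEF), (0x0B66, 0x0B6F),
    (0x0BE7, 0x0BEF), (0x0C66, 0x0C6F), (0x0CE6, 0x0CEF), (0x0D66, 0x0D6F),
    (0x0E50, 0x0E59), (0x0ED0, 0x0ED9), (0x0F20, 0x0F29),
]


def is_digit(character):
    """True if the single character belongs to the Digit class."""
    code_point = ord(character)
    return any(low <= code_point <= high for low, high in DIGIT_RANGES)
```

A linear scan is adequate for fifteen ranges; a table as long as BaseChar would more sensibly be searched with the bisect module.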

    The character classes defined here can be derived from the Unicode character database as follows:

    ✦ Name start characters must have one of the categories Ll, Lu, Lo, Lt, Nl.

    ✦ Name characters other than Name-start characters must have one of the categories Mc, Me, Mn, Lm, or Nd.

    ✦ Characters in the compatibility area (i.e. with character code greater than #xF900 and less than #xFFFE) are not allowed in XML names.

    ✦ Characters which have a font or compatibility decomposition (i.e. those with a “compatibility formatting tag” in field 5 of the database, marked by field 5 beginning with a “” > then the XML processor will recognize the character references when it parses the entity declaration, and resolve them before storing the following string as the value of the entity “example”:

    <p>An ampersand (&#38;) may be escaped numerically (&#38;#38;) or with a general entity (&amp;amp;).</p>

    A reference in the document to “&example;” will cause the text to be reparsed, at which time the start- and end-tags of the “p” element will be recognized and the three references will be recognized and expanded, resulting in a “p” element with the following content (all data, no delimiters or markup):

    An ampersand (&) may be escaped numerically (&#38;) or with a general entity (&amp;).

    A more complex example will illustrate the rules and their effects fully. In the following example, the line numbers are solely for reference.

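    1 <?xml version='1.0'?>
    2 <!DOCTYPE test [
    3 <!ELEMENT test (#PCDATA) >
    4 <!ENTITY % xx '&#37;zz;'>
    5 <!ENTITY % zz '&#60;!ENTITY tricky "error-prone" >' >
    6 %xx;
    7 ]>
    8 <test>This sample shows a &tricky; method.</test>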


    This produces the following: in line 4, the reference to character 37 is expanded immediately, and the parameter entity “xx” is stored in the symbol table with the value “%zz;”. Since the replacement text is not rescanned, the reference to parameter entity “zz” is not recognized. (And it would be an error if it were, since “zz” is not yet declared.) in line 5, the character reference “