.sr docfile = &sysfnam. ;.sr docversion = 'Draft';.im teigmlp1
.* UIC Document No: UIC CC DB 92-10
.* TEI Document No: TEI ED W 36
.* Title: PSGML
.* Drafted: MSM
.* Revised: MSM
.*
.* Document proper begins.
<gdoc>
<frontm>
<titlep>
<title>Poor-Folks SGML (PSGML):
<title>A Rational Subset of SGML
<author>C. M. Sperberg-McQueen
<docnum>TEI ED W 36
<date>&docdate.
</titlep>
<abstract>
<p>This paper defines a simple subset of the Standard Generalized Markup Language (SGML) designed for simple parsing and processing. The subset includes all the most useful features of SGML, but radically simplifies the legal forms of documents and omits a number of bells and whistles which make it virtually impossible to write simple ad hoc SGML parsers.
</abstract>
<!>
</frontm>
<!>
<body>
<h1>Introduction
<p>PSGML is a simple subset of SGML designed for ease of parsing; by removing much of the syntactic sugar, and many of the bells and whistles, from SGML, we make it plausible to include rudimentary SGML processing in many applications where it would otherwise have taken too much time, effort, and resources. Because many of the SGML constructs removed by PSGML are intended to make it easier to create SGML documents with standard editors, their removal in PSGML means it is better suited to interprocess communication than to human-interface applications such as editing documents using SGML-ignorant editors.
<p>The initial impetus for the definition of PSGML was the need for information servers to return structured information to clients; when the only possible data format is flat ASCII text without markup, it is very difficult to implement or imagine serious cooperative processing in distributed environments. Using PSGML, a server might offer a client a choice of straight ASCII-only text without markup, a comma-delimited row-column form (if the data are from a rectangular table), a full SGML document, or a PSGML document.
It would be plausible for a class of clients and servers to agree on a common tag set for which the normal processing is understood, but that tag set is not defined here.
<p>A second reason for defining a subset of SGML is to make the strengths of SGML more accessible to programmers and technical readers. SGML proponents have had good reason for mixed feelings as ongoing projects such as the definition of MIME and the development of the World-Wide Web software have moved publicly to embrace SGML: SGML should definitely be used in such contexts, but the public discussion has made very clear that even proficient programmers and technical people often find SGML too complex to implement (or grasp) readily. A rational subset of SGML should make it easier for technical readers to grasp the most important features of the language, and to implement either the entire language or the subset.
<p>The SGML subset offered here attempts to capture the most useful features of SGML, while eliminating all but the most trivial <soCalled>syntactic sugar</soCalled> (which can be defined as any construct provided merely as a convenient alternative notation for things which could be written in more uniform ways)<note>Following Abelson and Sussman 1985, p. 11.</note> and omitting several of the less frequently needed facilities. The formal grammar is written in strict BNF form, to simplify translation into common parser generators, and some other changes are made from the formal grammar provided by ISO 8879: where possible, productions of the form
<eg>
<![ CDATA [
a = b
]]>
</eg>
have been eliminated (<term>a</term> and <term>b</term> are merged); structural ambiguities have been removed from the syntax of attribute specifications; in some cases (e.g.
the definition of name groups and model groups) constraints expressed only in the prose of ISO 8879 have been expressed in the syntax; and an explicit distinction is made between the lexical and syntactic layers (at the cost of forbidding comments in some places where ISO 8879 allows them).
<h1>Overall Definition
<p>A PSGML document is a <term>document instance</term>, preceded by a <term>document type declaration</term> (DTD); the DTD may be omitted, in which case the processor makes certain simple assumptions about the markup found in the document.
<xmp>
psgml ::= dtd doc
        | doc
        ;
</xmp>
<h1>Document Type Declarations (DTDs)
<p>We allow a simple subset of normal SGML DTDs. A known DTD may be used, a specific DTD file may be named, or the full DTD can be given in the <term>DTD subset</term> (between the square brackets). Unlike SGML, PSGML does not allow the DTD subset to coexist with references to system files.
<xmp>
dtd ::= '<!DOCTYPE' NAME '>'
      | '<!DOCTYPE' NAME 'SYSTEM' LITERAL '>'
      | '<!DOCTYPE' NAME '[' decls ']' '>'
      ;
decls ::= /* nothing */
        | decls comment
        | decls elemdecl
        | decls attldecl
        | decls entdecl
        ;
</xmp>
<p>DTDs may contain comments, element declarations, attribute list declarations, and entity declarations. All the other declaration types of SGML (NOTATION, SHORTREF, USEMAP, etc.) are forbidden.
<h2>Comments
<p>Comments can occur in the DTD between other declarations; they can also occur within the document instance. The PSGML comment corresponds to the SGML <term>comment declaration</term>; unlike SGML, PSGML does not allow comments within other declarations.
<xmp>
comment ::= '<!--' STRING '-->'
          ;
</xmp>
<p>No double hyphen is allowed within the STRING, but newlines are legal.
<h2>Element Declarations
<xmp>
elemdecl ::= '<!ELEMENT' NAME '-' '-' 'ANY' '>'
           | '<!ELEMENT' NAME '-' '-' model '>'
           | '<!ELEMENT' NAME '-' 'O' 'EMPTY' '>'
           ;
</xmp>
<p>In the element declaration, white space is required between the two minus signs, or between the minus sign and the O.
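<p>As an illustration of how little machinery the three forms of element declaration require, the following Python sketch classifies a single declaration. It is not part of the PSGML definition: the function name <term>classify_elemdecl</term> is hypothetical, the lexical form of NAME (a letter followed by letters, digits, periods, or hyphens) is an assumption borrowed from the SGML reference concrete syntax, keywords are matched in upper case only, and the content model is accepted as an opaque parenthesized string rather than parsed.

```python
import re

# Assumed lexical form of NAME; this section does not define it.
_NAME = r"[A-Za-z][A-Za-z0-9.\-]*"

# One pattern covering the three elemdecl alternatives. White space is
# required after the first '-' so that '--' is rejected.
_ELEMDECL = re.compile(
    r"<!ELEMENT\s+(?P<name>" + _NAME + r")\s+"
    r"-\s+(?P<endmin>-|O)\s+"
    r"(?P<content>ANY|EMPTY|\(.*\)[?*+]?)\s*>\Z",
    re.DOTALL,
)


def classify_elemdecl(text):
    """Return (name, kind) for a legal declaration, else None.

    kind is 'ANY', 'EMPTY', or 'MODEL'. The model itself is not checked
    against the content-model grammar; only its outer parentheses are.
    """
    m = _ELEMDECL.match(text.strip())
    if m is None:
        return None
    name = m.group("name")
    endmin = m.group("endmin")
    content = m.group("content")
    if content == "EMPTY":
        # EMPTY is only legal with the '- O' flags.
        return (name, "EMPTY") if endmin == "O" else None
    if endmin != "-":
        # ANY and content models are only legal with the '- -' flags.
        return None
    return (name, "ANY" if content == "ANY" else "MODEL")


if __name__ == "__main__":
    for decl in ("<!ELEMENT p - - (#PCDATA)>",
                 "<!ELEMENT br - O EMPTY>",
                 "<!ELEMENT p -- ANY>"):
        print(decl, "=>", classify_elemdecl(decl))
```

<p>A single regular expression suffices here only because each alternative fixes the pairing of minimization flags and content keyword; a full SGML element declaration, with its name groups and exceptions, would not yield so easily.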
<p>If no DTD is provided, all elements found are processed as if declared by the first form (ANY), which means any element may contain any combination of other elements and parsed character data. 'Parsed' character data is character content within which start-tags, end-tags, and entity references are recognized. Tags and further entity references are recognized within entities embedded by entity reference.
<p>If a DTD is provided, each element used must be declared using the ANY form, the EMPTY form, or normal content models.
<xmp>
model ::= '(' tokgroup ')' || occ
        ;
occ ::= /* nothing */
      | '?'
      | '*'
      | '+'
      ;
tokgroup ::= seqgroup
           | andgroup
           | orgroup
           ;
seqgroup ::= token
           | seqgroup ',' token
           ;
andgroup ::= token '&' token
           | andgroup '&' token
           ;
orgroup ::= token '|' token
          | orgroup '|' token
          ;
token ::= '#PCDATA'          // == CHAR*
        | NAME || occ
        | model
        ;
</xmp>
<p>PSGML does not allow <term>exceptions</term> to the content model, so the inclusion exceptions and exclusion exceptions used by SGML do not appear.
<h2>Attribute List Declarations
<p>If attribute values are to be specified on any element, the legal attributes for that element must be declared with an ATTLIST declaration.
<xmp>
attldecl ::= '<!ATTLIST' NAME attdefs '>'
           ;
attdefs ::= attdef
          | attdefs attdef
          ;
attdef ::= NAME type default
         ;
type ::= 'CDATA'
       | 'ID'
       | 'IDREF' | 'IDREFS'
       | 'ENTITY' | 'ENTITIES'
       | 'NAME' | 'NAMES'         // Name, etc. are legal in PSGML
       | 'NMTOKEN' | 'NMTOKENS'   // for compatibility with SGML,
       | 'NUMBER' | 'NUMBERS'     // but they are mostly pointless
       | 'NUTOKEN' | 'NUTOKENS'   // and are discouraged.
       | '(' nmtokgrp ')'
       ;
nmtokgrp ::= ntseq
           | ntor
           | ntand
           ;
ntseq ::= nametoken
        | ntseq ',' nametoken
        ;
ntor ::= nametoken '|' nametoken
       | ntor '|' nametoken
       ;
ntand ::= nametoken '&' nametoken
        | ntand '&' nametoken
        ;
default ::= value
          | '#FIXED' value
          | '#REQUIRED'
          | '#CURRENT'
          | '#IMPLIED'
          ;                       // No #CONREF defaults.
value ::= LITERAL
        | NAME
        | NUMBER
        | NUMTOKEN
        ;
</xmp>
<h2>Entity Declarations
<xmp>
entdecl ::= '<!ENTITY' NAME LITERAL '>'
          | '<!ENTITY' NAME 'SYSTEM' LITERAL '>'
          | '<!ENTITY' '%' NAME LITERAL '>'
          | '<!ENTITY' '%' NAME 'SYSTEM' LITERAL '>'
          ;
</xmp>
<p>If '%' is present, the entity is a 'parameter entity' and is recognized only within the DTD, with references of the form <Q>%foo;</Q>. Otherwise, it is a 'general entity' recognized within the document.
<p>Some entities should always be recognized in PSGML, even if not declared: they include entities for the characters of ISO 8859-1 and (identically) IBM CP 1014.
<h1>Document Body
<p>The body of a PSGML document is very simple: the document is tagged as a single element (the root element or document element), whose content, like that of any other element, may include character data and other elements; if declared as EMPTY, an element has no content at all.
<xmp>
doc ::= element
      ;
element ::= empty
          | full
          ;
empty ::= starttag                 // only if element declared empty
        ;
full ::= starttag content endtag
       ;
starttag ::= '<' || NAME atts '>'
           ;
atts ::= /* nothing */
       | atts NAME '=' value       // if DTD present, NAME must
                                   // be declared
       ;
endtag ::= '</' || NAME '>'
         | '</>'
         ;
</xmp>
<h1>Differences from SGML
<p>PSGML deviates from standard SGML in a number of ways, most prominently in these:
<ul>
<li>PSGML does not allow processing instructions at all.
<li>PSGML does not allow comments before the DTD.
<li>PSGML allows no comments within declarations, only as independent declarations.
<li>PSGML allows only one <Q>-- ... --</Q> sequence within a comment.
<li>PSGML allows no notation declarations (let alone usemaps etc.).
</ul>
</gdoc>