TEI: British National Corpus Samples
The British National Corpus
This subdirectory contains some files demonstrating a particularly tricky SGML to XML conversion problem! It has been prepared for the use of the TEI Workgroup on SGML/XML Migration only; files herein are copyright and may not be re-distributed without permission. See the BNC website cited below for licensing conditions.
The British National Corpus (BNC) is a 100 million word corpus of
modern British English, in which each distinct word has a part of
speech code (POS) attached to it, as well as all the usual TEI
flummery. This makes it Big. The DTD for the current release of the
BNC is TEI compliant, inasmuch as:
- its DTD can be expressed as TEI plus a pair of TEI modification files
- it comes with documentation explaining how the DTD has been derived from P3 (and also, confusingly, how it differs from that used by the original release of the BNC which predated publication of P3)
The documentation is included here in the original XML source and may also, more easily perhaps, be read at the BNC website starting from The BNC online user guide; in both cases there is a minor error, which I leave the discerning reader to discover.
This archive also contains:
- bncMods.dtd and bncMods.ent (the extension files)
- bnc.dtd (as produced by the Pizza Chef)
- a driver file which uses the compiled DTD, and a compatibility driver file which uses the modification files plus the P4X SGML DTDs
- three sample files, used by the above drivers (F71, KB0 and KC5); you also need an extended SGML declaration to cope with the quantities needed by the DTD.
The whole directory is also available as a single zip archive.
On my system, in this directory, typing either of the following lines
nsgmls -s sgmldecl driver.sgm
nsgmls -s -c TEIcatalog sgmldecl driver.sgm
produces satisfyingly no messages other than a successful
compilation. Your mileage may, as they say, vary.
Lou Burnard, on Guy Fawkes Day, 2002
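For anyone repeating the check on another system, the two nsgmls invocations above can be wrapped in a small shell sketch. This is not part of the archive: `check_quiet` is a hypothetical helper name, and the sketch assumes that nsgmls is on the PATH and that it is run in this directory. It treats a run as successful only if the command exits with status 0 and writes nothing to standard error.

```shell
#!/bin/sh
# Hypothetical helper (not part of the archive): run a command and report
# whether it was "quiet", i.e. exited with status 0 and wrote nothing to
# standard error. Standard output is discarded.
check_quiet() {
    # Capture stderr only: stderr is pointed at the capture first,
    # then stdout is sent to /dev/null.
    errors=$("$@" 2>&1 >/dev/null)
    status=$?
    if [ "$status" -eq 0 ] && [ -z "$errors" ]; then
        echo "ok: $*"
        return 0
    fi
    echo "FAILED (exit $status): $*"
    if [ -n "$errors" ]; then
        printf '%s\n' "$errors"
    fi
    return 1
}

# Assumption: nsgmls is installed and driver.sgm is in the current directory.
# The two command lines are the ones given in the text above.
if command -v nsgmls >/dev/null 2>&1 && [ -f driver.sgm ]; then
    check_quiet nsgmls -s sgmldecl driver.sgm || :
    check_quiet nsgmls -s -c TEIcatalog sgmldecl driver.sgm || :
fi
```

If either line reports FAILED, the messages printed beneath it are whatever nsgmls wrote to standard error.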