So, reading xhtml. Dead easy, right? After all, it's just xml. Whack it into an XmlReader and Bob's your uncle.
Unless your xhtml uses an entity such as &pound; (£).
Xml has 5 predefined entities: &lt; (<), &gt; (>), &amp; (&), &apos; (') and &quot; ("). All self-respecting xml parsers will handle these. But xhtml, and its poorer cousin, html, define a whole raft more. Try to put such an xhtml file through an xml parser, and there will be problems.
What you need to do is tell the xml parser about these extra entities. Which means getting the xml parser to also read in a bunch of dtds. Again - dead easy, right?
Well, when you know the correct voodoo, yes, it's kinda easy.
This post provides sample code on how to do this in .net. The idea is that you need to tell the parser that it's reading an xhtml file, and then provide the xhtml dtd for it when it asks. The code here does just this, and also shows how to keep that dtd and associated files in your application's resources. Unfortunately, it's all a bit confusing as to how and why it actually works, so I thought I'd try and demystify it a bit. For the moment, we're going to ignore the idea of pulling content from resources, and just explain what's happening in the normal case.
This file is xhtml
Firstly, let's tell the parser that we're reading an xhtml file. This just means giving it a DocType, such as:
<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">
This should really be specified in the file itself, but if you're just parsing a fragment, you need to tell the XmlReader explicitly. This is accomplished by populating the DocTypeName and PublicId fields of XmlParserContext, as the post demonstrates.
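To make that a bit more concrete, here's a minimal sketch of setting up the context for a fragment. It's my own illustration rather than the code from the post: the class and method names are made up, DtdProcessing is the .net 4 property (earlier versions use ProhibitDtd = false instead), and I've also set SystemId from the DOCTYPE above, on the assumption that the reader wants a system id before it goes looking for an external dtd.

```csharp
using System.IO;
using System.Xml;

// Hypothetical helper, not part of the framework or the original sample.
static class XhtmlFragmentReader
{
    // Creates a reader over an xhtml fragment that carries no DOCTYPE of its own.
    public static XmlReader Create(TextReader fragment, XmlResolver resolver)
    {
        var settings = new XmlReaderSettings
        {
            // Dtds are prohibited by default, so this has to be switched on.
            DtdProcessing = DtdProcessing.Parse,
            // The resolver that will be asked to locate and open the dtd.
            XmlResolver = resolver
        };

        // The context stands in for the DOCTYPE declaration the fragment lacks.
        var context = new XmlParserContext(null, null, null, XmlSpace.None)
        {
            DocTypeName = "html",
            PublicId = "-//W3C//DTD XHTML 1.0 Strict//EN",
            // Assumption: taken from the DOCTYPE line; the post only mentions
            // DocTypeName and PublicId.
            SystemId = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"
        };

        return XmlReader.Create(fragment, settings, context);
    }
}
```

With a standard XmlUrlResolver passed in, this would try to fetch the dtd from w3.org, which is exactly the behaviour the rest of this post is about replacing.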
The next step is to get the dtd into the parser. Which is where we start looking at XmlResolver, and where things get a little confusing. The interaction and relationship between the XmlReader and the XmlResolver isn't very well documented, but it boils down to this - any time the XmlReader has to get content from a URI, it defers to the XmlResolver.
The best way to explain this is by example.
Say you're reading an xhtml file via a call to XmlReader.Create, passing in a filename - at least, that's the common usage. The filename is actually a URL and could just as easily be an http URL. The first thing XmlReader does is pass this URL into XmlResolver.ResolveUri. This gives us a hook to modify or replace the URL of the file, if we want to (e.g. instead of loading it over http, get it from a cache on the local file system). Essentially, we just return a new URI that is the actual location of the file.
Once the URL to the file has been resolved, it's passed into XmlResolver.GetEntity, which will open and return a stream to the file. Since it's a stream, the file could be anywhere - on the file system, over http or in a resource. Now the XmlReader has the file to parse, and the resolved URL is considered to be the base URI of the file.
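As a rough sketch of the "get it from a local cache" idea, something like the following would do. The class name and the cache layout (one flat directory, files named after the last segment of the URL) are my own assumptions, purely for illustration.

```csharp
using System;
using System.IO;
using System.Xml;

// Hypothetical resolver: redirects http URLs to a local copy when one exists.
class LocalCacheResolver : XmlUrlResolver
{
    private readonly string cacheDirectory;

    public LocalCacheResolver(string cacheDirectory)
    {
        this.cacheDirectory = cacheDirectory;
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // Let the standard implementation do the normal resolution first.
        Uri resolved = base.ResolveUri(baseUri, relativeUri);

        if (resolved.IsAbsoluteUri && resolved.Scheme == Uri.UriSchemeHttp)
        {
            // Assumed cache layout: the file name is the last segment of the URL.
            string cached = Path.Combine(cacheDirectory, Path.GetFileName(resolved.AbsolutePath));
            if (File.Exists(cached))
            {
                // Return a file URI; the inherited GetEntity knows how to open it.
                return new Uri(Path.GetFullPath(cached));
            }
        }

        return resolved;
    }
}
```

One thing to bear in mind with this approach is that the returned file URI becomes the base URI, so anything the cached file references relatively will also be looked for in the cache directory.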
Incidentally, if we don't use a URL to load the file into the XmlReader, we can still pass in a URI via the BaseURI field of XmlParserContext. The resolved version of this URI is then the base URI of the file.
Note that the base URI is the URI to the file itself - not to the parent "directory" of the file. This is actually the same as the base element in html, even though I was expecting it to be the directory.
If the XmlReader needs to bring in any more content from within the file (a nice example would be xlink or xinclude, except I don't think they are supported), it will pass the URI identifying the content to the XmlResolver, and then pass the resolved URI back to GetEntity to actually get the content.
If there is a DocType associated with the parser (via the file, or via the context), that will need to be resolved, so the public id is passed into ResolveUri. In the case of (strict) xhtml, this will be "-//W3C//DTD XHTML 1.0 Strict//EN". The XmlResolver needs to know about this DocType and return a URI that GetEntity will be able to open a stream on. Let's assume we've subclassed XmlUrlResolver so its ResolveUri knows that the xhtml DocType maps to the correct http URL, so we simply return "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" and let the standard implementation of GetEntity download it for us.
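A minimal sketch of such a subclass might look like this. The class name is made up, and it only knows about the one DocType; everything else falls through to the standard behaviour.

```csharp
using System;
using System.Xml;

// Hypothetical resolver that maps the strict xhtml public id to its dtd URL.
class XhtmlUrlResolver : XmlUrlResolver
{
    private const string StrictPublicId = "-//W3C//DTD XHTML 1.0 Strict//EN";

    private static readonly Uri StrictDtd =
        new Uri("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd");

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // The public id arrives here as the "relative URI".
        if (relativeUri == StrictPublicId)
        {
            return StrictDtd;
        }

        // Anything else is resolved against the base URI as usual.
        return base.ResolveUri(baseUri, relativeUri);
    }

    // GetEntity is deliberately not overridden: the inherited implementation
    // downloads whatever URI ResolveUri returned.
}
```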
Now the xhtml1-strict.dtd references other files to pull in the actual entity definitions. The XmlReader follows the same procedure - it calls ResolveUri and then GetEntity. The important part to remember here is that these references might be relative. In other words, the dtd might not have fully qualified http URLs. The base URI passed to ResolveUri is the resolved URI of the DocType - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd". Resolving "xhtml-lat1.ent" against that URL means we should return "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent", which is the correct location for the entity file. The reader will just download that file and continue.
Of course, it's entirely possible that the reference is fully qualified, in which case, we could just return it directly.
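Both cases fall out of the standard System.Uri combining rules, which is what I'd expect a resolver to lean on. A small sketch (the helper name is mine, and the example.org URL in the comment is just a placeholder):

```csharp
using System;

// Hypothetical helper showing how a reference is combined with a base URI.
static class BaseUriDemo
{
    public static Uri Resolve(string baseUri, string reference)
    {
        // Uri(Uri, string) replaces the last path segment of the base with a
        // relative reference, and returns an absolute reference unchanged.
        return new Uri(new Uri(baseUri), reference);
    }
}

// BaseUriDemo.Resolve("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd", "xhtml-lat1.ent")
//   gives http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent
// BaseUriDemo.Resolve("http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd", "http://example.org/other.ent")
//   gives http://example.org/other.ent
```

This is also why it doesn't matter that the base URI points at the file rather than its directory: the file name is simply dropped when the relative reference is applied.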
Lett's revision guide version
- Resolve the file URL and download it.
- Resolve the public id of the DocType to a fully qualified URI, using the resolved URI of the file as the "current directory" if required, and download the dtd.
- Parse the dtd and resolve any external references against the resolved URI of the dtd (if appropriate) and download them.
- Parse the references and the file and resolve and download any other external references.
So, it's actually quite straightforward, especially in the use case of file:// and http:// URIs.
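If you want to watch this sequence happen for yourself, a resolver that just records each call before handing off to the standard implementation is handy. This is a sketch of my own (the class name and log format are made up), not something from the sample code.

```csharp
using System;
using System.Collections.Generic;
using System.Xml;

// Hypothetical resolver that records every call the XmlReader makes to it.
class TracingResolver : XmlUrlResolver
{
    public readonly List<string> Log = new List<string>();

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // Record the inputs first, in case the standard resolution throws.
        Log.Add("ResolveUri base=" + baseUri + " relative=" + relativeUri);
        return base.ResolveUri(baseUri, relativeUri);
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        Log.Add("GetEntity " + absoluteUri);
        return base.GetEntity(absoluteUri, role, ofObjectToReturn);
    }
}
```

Set an instance as the XmlResolver on the XmlReaderSettings, read a document through to the end, and then dump Log to see which URIs were resolved and fetched, and in what order.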
Pulling content from resources
Now, back to the sample code. The code is absolutely fine as it stands, but I think I'd implement it differently. It currently has a list of known URIs, made up of "urn:" plus the DocType or the dtd/entity filename. In the ResolveUri method, it takes the given relative URI, appends it to "urn:" and then compares it against the list of known URIs. The GetEntity method just compares the given absolute URI against the same list and returns the relevant resource stream.
This feels a bit fragile. I'd implement it more like the file or http URI handlers. I'd have one known resource, and that's the xhtml dtd, keyed on the DocType URI. ResolveUri would match the relative URI against this DocType URI and return a resource:// URI, which would contain an assembly identifier, plus the namespaced resource name, such as "resource://sticklebackplastic.xhtml/sticklebackplastic.xhtml.resources.xhtml1-strict.dtd". The GetEntity method can then parse this to get the relevant resource stream. I think this way is better because when the dtd requires an external reference, it will call ResolveUri with "xhtml-lat1.ent" as the relative URI and "resource://sticklebackplastic.xhtml/sticklebackplastic.xhtml.resources.xhtml1-strict.dtd" as the base URI. Combining the URIs should give me "resource://sticklebackplastic.xhtml/sticklebackplastic.xhtml.resources.xhtml-lat1.ent". (One wrinkle: standard URI combination replaces everything after the last "/", so to get this result either the resolver has to do the combining itself, or the resource namespace has to be written with "/" separators in the URI.) Doing this requires that all the external references are stored in the same resource namespace, the same as with the file or http cases, and the same for the sample code. But it also means that the resolver only needs to know about the mapping between the DocType and the dtd, and doesn't have to worry about creating resource URIs for all stored files.
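Here's a rough, untested-against-real-resources sketch of how that might look. Everything in it is my own assumption rather than the sample code: the class name, the constructor parameters, and in particular the decision to write the resource namespace with "/" separators inside the URI (converting back to dots in GetEntity), so that the standard URI combining can be used for relative references.

```csharp
using System;
using System.IO;
using System.Reflection;
using System.Xml;

// Hypothetical resolver serving the xhtml dtd and its entity files from
// embedded resources that all live in one resource namespace.
class ResourceXhtmlResolver : XmlUrlResolver
{
    private const string StrictPublicId = "-//W3C//DTD XHTML 1.0 Strict//EN";

    private readonly Assembly assembly;
    private readonly Uri strictDtd;

    // resourceNamespace is the dotted prefix of the embedded resources,
    // e.g. "sticklebackplastic.xhtml.resources".
    public ResourceXhtmlResolver(Assembly assembly, string resourceNamespace)
    {
        this.assembly = assembly;

        // Assumption: namespace dots become slashes, so relative references
        // resolve against the "directory" of the dtd.
        string path = resourceNamespace.Replace('.', '/');
        strictDtd = new Uri("resource://" + assembly.GetName().Name + "/" + path + "/xhtml1-strict.dtd");
    }

    public override Uri ResolveUri(Uri baseUri, string relativeUri)
    {
        // The only mapping the resolver needs to know about.
        if (relativeUri == StrictPublicId)
        {
            return strictDtd;
        }

        // e.g. "xhtml-lat1.ent" against the resource:// URI of the dtd.
        return base.ResolveUri(baseUri, relativeUri);
    }

    public override object GetEntity(Uri absoluteUri, string role, Type ofObjectToReturn)
    {
        if (absoluteUri.Scheme != "resource")
        {
            return base.GetEntity(absoluteUri, role, ofObjectToReturn);
        }

        // Turn "/a/b/c/file.ext" back into the resource name "a.b.c.file.ext".
        string name = Uri.UnescapeDataString(absoluteUri.AbsolutePath)
            .TrimStart('/')
            .Replace('/', '.');

        Stream stream = assembly.GetManifestResourceStream(name);
        if (stream == null)
        {
            throw new FileNotFoundException("No embedded resource named " + name);
        }
        return stream;
    }
}
```

In this sketch the assembly name in the URI is only informational, since the resolver already holds the assembly it was constructed with. A fuller version could use it to look up the assembly instead.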
Now I've just got to implement it.