How to use XmlResolver. Or, reading an xhtml file in .net

by Matt 28. June 2007 06:24

So, reading xhtml. Dead easy, right? After all, it's just xml. Whack it into an XmlReader and Bob's your uncle.

Unless your xhtml uses an entity such as &pound; (£).

Xml has 5 predefined entities: &lt; (<), &gt; (>), &amp; (&), &apos; (') and &quot; ("). All self-respecting xml parsers will handle these. But xhtml, and its poorer cousin, html, define a whole raft more. Try to put such an xhtml file through an xml parser without telling it where those definitions live, and it will reject the first entity it doesn't recognise.
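To see the problem in isolation, here is a minimal sketch. It uses Python's standard library parser rather than .net, purely because it is short to run; the helper name `parses` is made up for the example. The rule it shows is part of xml itself, so any conforming parser should behave the same way.

```python
import xml.etree.ElementTree as ET

def parses(xml_text):
    """Return True if the text is accepted by a plain xml parser."""
    try:
        ET.fromstring(xml_text)
        return True
    except ET.ParseError:
        return False

# The five predefined entities are expanded without any dtd.
predefined = ET.fromstring("<p>&lt; &gt; &amp; &apos; &quot;</p>").text

# &pound; is an xhtml entity, so with no dtd the parser rejects it.
pound_ok = parses("<p>Price: &pound;5</p>")
```

Here `predefined` comes back as the five literal characters, while `pound_ok` is False because nothing has declared `pound`.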

What you need to do is tell the xml parser about these extra entities. Which means getting the xml parser to also read in a bunch of dtds. Again - dead easy, right?

Well, when you know the correct voodoo, yes, it's kinda easy.

This post provides sample code on how to do this in .net. The idea is that you need to tell the parser that it's reading an xhtml file, and then provide the xhtml dtd for it when it asks. The code there does just this, and also shows how to keep that dtd and associated files in your application's resources. Unfortunately, it's all a bit confusing as to how and why it actually works, so I thought I'd try and demystify it a bit. For the moment, we're going to ignore the idea of pulling content from resources, and just explain what's happening in the normal case.

This file is xhtml

Firstly, let's tell the parser that we're reading an xhtml file. This just means giving it a DocType, such as:

<!DOCTYPE html PUBLIC "-//W3C//DTD XHTML 1.0 Strict//EN" "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd">

This should really be specified in the file itself, but if you're just parsing a fragment, you need to tell the XmlReader explicitly. This is accomplished by populating the DocTypeName and PublicId fields of XmlParserContext, as the post demonstrates.

Define xhtml...

The next step is to get the dtd into the parser. Which is where we start looking at XmlResolver, and where things get a little confusing. The interaction and relationship between the XmlReader and the XmlResolver isn't very well documented, but it boils down to this - any time the XmlReader has to get content from a URI, it defers to the XmlResolver.

The best way to explain this is by example.

Say you're reading an xhtml file, via a call to XmlReader.Create, passing in a filename - at least, that's the common usage. The filename is actually a URL and could easily be a http URL. The first thing XmlReader does is pass this URL into XmlResolver.ResolveUri. This allows us a hook to modify or replace the URL of the file, if we want to (e.g. instead of loading it over http, get it from a cache on the local file system). Essentially, we just return back a new URI that is the actual location of the file.

Once the URL to the file has been resolved, it's passed into XmlResolver.GetEntity, which will open and return a stream to the file. Since it's a stream, the file could be anywhere - on the file system, over http or in a resource. Now the XmlReader has the file to parse, and the resolved URL is considered to be the base URI of the file.

Incidentally, if we don't use a URL to load the file into the XmlReader, we can still pass in a URI via the BaseURI field of XmlParserContext. The resolved version of this URI is then the base URI of the file.

Note that the base URI is the URI to the file itself - not to the parent "directory" of the file. This is actually the same as the base attribute in html, even though I was expecting it to be the directory.

If the XmlReader needs to bring in any more content from within the file (a nice example would be xlink or xinclude, except I don't think they are supported), it will pass the URI identifying the content to the XmlResolver, and then pass the resolved URI back to GetEntity to actually get the content.

If there is a DocType associated with the parser (via the file, or via the context), that will need to be resolved. So the public id is passed into ResolveUri. In the case of (strict) xhtml, this will be "-//W3C//DTD XHTML 1.0 Strict//EN". The XmlResolver needs to know about this DocType and return a URI that GetEntity will be able to open a stream on. Let's assume we've subclassed XmlUrlResolver so that its ResolveUri knows that the xhtml DocType maps to the correct http URL, so we simply return "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd" and let the standard implementation of GetEntity download it for us.

Now the xhtml1-strict.dtd references other files to pull in the actual entity definitions. The XmlReader follows the same procedure - it calls ResolveUri and then GetEntity. The important part to remember here is that these references might be relative. In other words, the dtd might not have fully qualified http URLs. The base URI passed to ResolveUri is the resolved URI of the DocType - "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd". Resolving "xhtml-lat1.ent" against that URL means we should return "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent", which is the correct location for the entity file. The reader will just download that file and continue.

Of course, it's entirely possible that the reference is fully qualified, in which case, we could just return it directly.
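The relative and fully qualified cases can be checked with any standard URI combiner. This is a small sketch using Python's `urljoin` as a stand-in for what the resolver does in .net; the example.com address is a made-up placeholder.

```python
from urllib.parse import urljoin

# The resolved URI of the DocType acts as the base URI.
dtd_uri = "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd"

# A relative reference replaces the last path segment of the base,
# which is why the base can be the file itself and not its directory.
relative = urljoin(dtd_uri, "xhtml-lat1.ent")

# A fully qualified reference ignores the base entirely.
absolute = urljoin(dtd_uri, "http://example.com/entities/custom.ent")
```

`relative` ends up as "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent", and `absolute` comes back unchanged.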

Lett's revision guide version

To recap:

  1. Resolve the file URL and download it.
  2. Resolve the public Id of the DocType to a fully qualified URI, using the resolved URI of the file as the "current directory" if required, and download the dtd.
  3. Parse the dtd and resolve any external references against the resolved URI of the dtd (if appropriate) and download them.
  4. Parse the references and the file and resolve and download any other external references.
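The steps above can be modelled in a few lines. This is only a toy, written in Python rather than .net: `resolve_uri` and `get_entity` are stand-ins named after ResolveUri and GetEntity, the `FAKE_WEB` dictionary replaces real downloads, and the page URL is invented for the example.

```python
from urllib.parse import urljoin

# The one mapping the resolver has to know: DocType public id to dtd URL.
PUBLIC_IDS = {
    "-//W3C//DTD XHTML 1.0 Strict//EN":
        "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd",
}

# Stand-in for the network, with placeholder content.
FAKE_WEB = {
    "http://example.com/docs/page.xhtml": "(the xhtml file)",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml1-strict.dtd": "(the dtd)",
    "http://www.w3.org/TR/xhtml1/DTD/xhtml-lat1.ent": "(entity definitions)",
}

fetched = []  # records the order in which content is requested

def resolve_uri(base_uri, reference):
    """Map a known public id to its dtd, else combine with the base URI."""
    if reference in PUBLIC_IDS:
        return PUBLIC_IDS[reference]
    return urljoin(base_uri or "", reference)

def get_entity(absolute_uri):
    """Return the content for an already resolved URI."""
    fetched.append(absolute_uri)
    return FAKE_WEB[absolute_uri]

def load(base_uri, reference):
    """What the reader does for every reference: resolve, then fetch."""
    uri = resolve_uri(base_uri, reference)
    get_entity(uri)
    return uri

file_uri = load(None, "http://example.com/docs/page.xhtml")    # step 1
dtd_uri = load(file_uri, "-//W3C//DTD XHTML 1.0 Strict//EN")   # step 2
ent_uri = load(dtd_uri, "xhtml-lat1.ent")                      # step 3
```

Each call passes the previously resolved URI as the base, which is the whole trick: the resolver never needs to know about the entity file up front.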

So, it's actually quite straightforward, especially in the use case of file:// and http:// URIs.

Pulling content from resources

Now, back to the sample code. The code is absolutely fine as it stands, but I think I'd implement it differently. It currently has a list of known URIs, made up of "urn:" plus the DocType or the dtd/entity filename. In the ResolveUri method, it takes the given relative URI and appends it to "urn:" and then compares it against the list of known URIs. The GetEntity method just compares the given absolute URI and returns the relevant resource stream.

This feels a bit fragile. I'd implement it more like the file or http URI handlers. I'd have one known resource, and that's the xhtml dtd, keyed on the DocType URI. ResolveUri would match the relative URI against this DocType URI and return back a resource:// URI, which would contain an assembly identifier, plus the namespaced resource name, such as "resource://sticklebackplastic.xhtml/sticklebackplastic.xhtml.resources.xhtml1-strict.dtd". The GetEntity method can then parse this to get the relevant resource stream. I think this way is better because when the dtd requires an external reference, it will call ResolveUri with "xhtml-lat1.ent" as the relative URI and "resource://sticklebackplastic.xhtml/sticklebackplastic.xhtml.resources.xhtml1-strict.dtd" as the base URI. Simply combining the URIs gives me "resource://sticklebackplastic.xhtml/sticklebackplastic.xhtml.resources.xhtml-lat1.ent". Doing this requires that all the external references are stored in the same resource namespace, the same as with the file or http cases, and the same for the sample code. But it also means that the resolver only needs to know about the mapping between the DocType and the dtd, and doesn't have to worry about creating resource URIs for all stored files.
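As a rough sketch of that idea (in Python, not the eventual .net implementation), the resolver might look something like this. The function names and constants are made up for illustration, and the assembly and namespace are just the ones from the example URI above. One assumption worth flagging: a standard URI combine replaces everything after the last "/", and the resource namespace here is separated by dots, so the sketch re-attaches the namespace itself rather than relying on a generic combine.

```python
from urllib.parse import urlsplit

XHTML_STRICT_PUBLIC_ID = "-//W3C//DTD XHTML 1.0 Strict//EN"

# Hypothetical assembly and resource namespace, from the example URI.
ASSEMBLY = "sticklebackplastic.xhtml"
NAMESPACE = "sticklebackplastic.xhtml.resources"
DTD_URI = f"resource://{ASSEMBLY}/{NAMESPACE}.xhtml1-strict.dtd"

def resolve_uri(base_uri, relative_uri):
    """Map the DocType to the dtd resource; keep siblings in the same namespace."""
    if relative_uri == XHTML_STRICT_PUBLIC_ID:
        return DTD_URI
    if base_uri and base_uri.startswith("resource://"):
        assembly = urlsplit(base_uri).netloc
        return f"resource://{assembly}/{NAMESPACE}.{relative_uri}"
    return relative_uri

def parse_resource_uri(absolute_uri):
    """Split a resource:// URI into (assembly identifier, resource name)."""
    parts = urlsplit(absolute_uri)
    return parts.netloc, parts.path.lstrip("/")

dtd = resolve_uri(None, XHTML_STRICT_PUBLIC_ID)
ent = resolve_uri(dtd, "xhtml-lat1.ent")
assembly, resource_name = parse_resource_uri(ent)
```

The GetEntity side would then use `assembly` and `resource_name` to open the matching resource stream. The only hard-coded knowledge is the DocType to dtd mapping plus the namespace the files live in.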

Now I've just got to implement it.

Comments (20) -

Jignesh
2/9/2009 10:56:40 AM #

RE: How to use XmlResolver. Or, reading an xhtml file in .net

Can you please post code.

