Remembering IFilters

by Matt 11. April 2008 05:34

OK. Quick bug fix to my previous post. I said that when you selected "Index Properties Only", it stripped the registered IFilter from the file type. Strictly speaking, it copies the existing "persistent handler" class ID, saves it in a value called "OriginalPersistentHandler", and then deletes the current registration.

This way, when you reselect "Index Properties and File Contents", it can copy the original value back, and use the proper IFilter and not have to default to the Plain Text filter.

Just the facts, ma'am.


Windows Desktop Search

Indexing Windows Live Writer posts

by Matt 9. April 2008 10:44

While googling for something else, I came across a post that pointed out that Windows Live Writer's saved posts aren't being indexed. Well, the contents weren't - only the file properties. Which is odd, because WLW comes with an IFilter - a plugin that exposes the contents of a .wpost file to Windows Search's index.


The article mentions that you can fix this by going to the Indexing Options in the control panel (and going to Advanced -> File Types), selecting the wpost extension, and changing the radio button from "Index Properties Only" to "Index Properties and File Contents".

This works, but not as you expect. It's not using the Windows Live Writer IFilter.

When you select "Index Properties Only", the registered filter is removed from the file type. If a file has no filter registered, the indexer will use the system provided "File Properties Filter", which extracts various properties such as filename, size, dates (and maybe the OLE DocFile structured storage properties) but doesn't touch the contents.

Selecting "Index Properties and File Contents" doesn't magically wire up the correct filter. Instead, it registers the "Plain Text Filter", which just extracts as much text out of the file as it can, and then hands it to the indexer as content. You can use it on arbitrary binary files, but it won't understand the file format, so won't be able to output more advanced properties, such as Author, Subject or Perceived Type. If you try to use the advanced search features of explorer to find blog posts with a certain subject, it will fail. Not too much of a hardship, perhaps, because the text will still match the full content search, but by missing the Perceived Type, the indexer doesn't know if it's a document, email, picture, audio, video or whatever. Bang goes your filtering.

We can fix this, but let's see why it wasn't registered in the first place. A great tool to help with this is Citeknet's IFilter Explorer.

 IFilter Explorer - Citeknet

Take a look for the .wpost extension. It's not there. Now we know why the proper filter wasn't being used - it's not registered.

You might have noticed the bewildering array of tabs across the top of the list. Windows Search shares a history with a long line of search products from Microsoft, from server side search engines such as SQL Server full text search, SharePoint and Exchange search, to the desktop, with Windows Search (3.x), Windows Desktop Search (2.x - MSN Desktop Search), Indexing Service and even the aborted WinFS.

On a hunch, check out Windows Desktop Search 2.x.

There it is. The .wpost extension has the WebPostFilter class registered against it.

And that's because despite sharing ancestry and the IFilter technology, registration between the different implementations can be subtly (and not so subtly) different. For example, the SQL Server registration needs extra data in a system table.

There does appear to be a common thread amongst registrations, though, and this is partly described in the docs for the current version of Windows Search. Namely, registration hangs off the file extension in the registry, or off the document type pointed to by the file extension. Or even from the MIME content type (which I didn't know worked, but explains why so many xml files are indexed).

Windows Desktop Search 2.x simply had some overrides that were checked before the system defined places, and the Windows Live Writer developers chose to register it there:


Now we know what the problem is, it's pretty straightforward to fix. We just need to deal with the mind-bogglingly odd way of registering IFilters.

Hanging off the file extension, the document type or the MIME type, you need to add a key called "PersistentHandler", whose default value is a GUID registered under HKCR\CLSID. That GUID's key has a subkey called PersistentAddinsRegistered, which in turn has a subkey named after the interface IID of IFilter. The default value of that subkey is the CLSID of the IFilter COM object.


I have absolutely no idea why they added that bonkers level of abstraction, but it's been there for years, so who are we to argue with tradition. To make it easy, save this as a .reg file and double click:



@="Windows Live Writer persistent handler"




Note that I've wrapped a couple of lines for legibility. Oh, and that PersistentHandler GUID? Brand new one. Never before used. ({60734...} that is. {89BCB...} is the IID for IFilter and {4DFA6...} is the CLSID of the Windows Live Writer filter).
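Pieced together, the .reg file has this shape. The all-zero GUID is a placeholder for the new persistent handler GUID and the all-one GUID a placeholder for the Windows Live Writer filter CLSID (substitute the real {60734...} and {4DFA6...} values); {89BCB740-6119-101A-BCB7-00DD010655AF} is the full IID of IFilter:

```
Windows Registry Editor Version 5.00

; Point the extension at the persistent handler (placeholder GUID)
[HKEY_CLASSES_ROOT\.wpost\PersistentHandler]
@="{00000000-0000-0000-0000-000000000000}"

[HKEY_CLASSES_ROOT\CLSID\{00000000-0000-0000-0000-000000000000}]
@="Windows Live Writer persistent handler"

; Subkey name is the IID of IFilter; default value is the filter's CLSID (placeholder)
[HKEY_CLASSES_ROOT\CLSID\{00000000-0000-0000-0000-000000000000}\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF}]
@="{11111111-1111-1111-1111-111111111111}"
```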

Advanced Options

Now you just have to get the indexer to re-index those files, and Bob's yer uncle. I took the lazy route, and just rebuilt the whole index (Control Panel -> Indexing Options -> Advanced -> Rebuild).

Painless, eh? What I want to know now is: what does the null filter do?


Windows Desktop Search

What happened to my abstraction?

by Matt 8. April 2008 17:17

Silverlight 1.0 has just had a minor update. Here's what's changed, according to Dr Sneath:

The changes are minor in nature and shouldn't affect existing applications; they include an audio bug fix for nForce 4 motherboards, an update to...

Goodness. 2008 and we're still updating *applications* to fix bugs in *motherboards*.


IoC containers finally make sense

by Matt 26. March 2008 11:27

I think I've just had a small epiphany. I've never really grokked the need for Inversion of Control containers, like Castle Windsor, StructureMap or Unity. They've always seemed like overkill. And of course, the reason I've felt like this is that they have been overkill - for the projects where I would have used them.

It's just another case of the right tool for the right job.

Most of the projects I've worked on have been loosely coupled by default, and while I've used dependency injection, it's never gone too many layers deep, so setting up the dependencies has never crossed boundaries it shouldn't have crossed. Sometimes I've put them in config, and a container would definitely have saved me some work there, but not too much. I've always made do with a poor man's solution.

This snippet of code (from an MSDN article) showed me how, working on different code, dependency injection would have led me too far down the plug hole:

// Somewhere in UI Layer 
InvoiceSubmissionPresenter presenter = new InvoiceSubmissionPresenter( 
  new InvoiceService( 
    new AuthorizationService(), 
    new InvoiceValidator(), 
    new InvoiceRepository())); 

It was always this wiring up of the dependencies that bothered me, but I was never far enough away from the deepest layer for it to be properly wrong. Fixing this is exactly what the IoC abstraction is good at. That you get goodies such as lifetime management (per thread, singletons, etc), auto-wiring of dependencies and dynamically generated decorators is just icing on the cake.
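This is exactly the wiring a container takes off your hands: you register the mappings once, then ask for the top of the graph. A sketch with a made-up minimal container API (Windsor, StructureMap and Unity each spell registration and resolution differently):

```csharp
// Hypothetical container API - not any particular product's syntax
var container = new Container();
container.Register<IAuthorizationService, AuthorizationService>();
container.Register<IInvoiceValidator, InvoiceValidator>();
container.Register<IInvoiceRepository, InvoiceRepository>();
container.Register<IInvoiceService, InvoiceService>();

// Somewhere in UI Layer - the container auto-wires the constructor dependencies
var presenter = container.Resolve<InvoiceSubmissionPresenter>();
```

The deep `new` pyramid from the snippet above collapses into a single Resolve call, and the UI layer no longer needs to know about the repository or the validator at all.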

Now it's just a case of figuring out which one to use.


More than one way to auto-update a cat

by Matt 23. March 2008 16:40

This wasn't a good idea by Apple, but whatever. Apple Software Updater falls under the "crapware" category as far as I'm concerned. It's just like the kind of software that OEMs install on new PCs that does something that's already a part of the OS (seriously, get rid of all those programs from your laptop. Just hit Windows + X).

Apple Software Updater makes sure iTunes is up-to-date. The funny thing is, so does iTunes.



Reinventing LINQ

by Matt 20. March 2008 19:54

Daniel Cazzulino made an interesting post a little while ago that I've been meaning to follow up. Ian Griffiths picked up on one aspect of it (the interesting, obscure type inference issue - and I love the idea of "mumble types") which reminded me to go have another look.

The gist of Daniel's post is to rewrite some Ruby code using some of the nice new C# features. The Ruby code is using Fibers (think iterators) to take a sequence as input, and output a new sequence of different values, or a new sequence made by skipping values.

And I really had to torture that sentence to avoid saying "map" and "filter". Remember, they're the magic words...

Daniel follows the original article very faithfully. He replaces the Fibers with iterators, adds some syntax sugar with extension methods and lambda functions, and duplicates the functionality of LINQ. He just stops short of using the nice syntax. So let's have a look at the last few examples.

Get 10 integers that are both even and multiples of three:

var evenMultiplesOfThree = from x in Enumerable.Range(1, 1000)
                           where (x % 2) == 0 && (x % 3) == 0
                           select x;
foreach (int i in evenMultiplesOfThree.Take(10))
    Console.WriteLine(i);

(OK, slightly different. Daniel's range is infinite, I've capped mine.)
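For comparison, the compiler turns that query into plain extension method calls, which is roughly the form Daniel's code stops at. The same thing without the query syntax (needs a using System.Linq; at the top):

```csharp
// Equivalent to the query expression above: Where is the filter, Take caps the results
var evenMultiplesOfThree = Enumerable.Range(1, 1000)
    .Where(x => (x % 2) == 0 && (x % 3) == 0)
    .Take(10);

foreach (int i in evenMultiplesOfThree)
    Console.WriteLine(i);
```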

The pipeline for tripling even numbers, adding one and only printing the first 5 multiples of 5 (and this one lets me use "let"):

var results = from x in Enumerable.Range(1, 1000)
              where (x % 2) == 0
              let y = (x * 3) + 1
              where (y % 5) == 0
              select y;
foreach (int i in results.Take(5))
    Console.WriteLine(i);

And the (somewhat unwieldy) palindrome finder:

var words = "Madam, the civic radar rotator is not level".Split(' ');
var palindromes = from word in words
                  let normalised = new { 
                      Original = word, 
                      Normalised = word.ToLower() 
                  }
                  select "'" + normalised.Original + "' is " + 
                    ((normalised.Normalised != new string(normalised.Normalised.Reverse().ToArray()))
                      ? "not " : "") + "a palindrome";
foreach (var palindrome in palindromes)
    Console.WriteLine(palindrome);

But this is an awkward query. Perhaps a poor example to convert into a query, and a simple loop might have served better.

Personally, I don't like the idea of having a ForEach extension method. I prefer an explicit foreach statement. I think it separates the enumeration over the results from the description of the query (pipeline).

I think Daniel's post validates the design of LINQ. Independently, he took iterators, lambdas and extension methods and built something that, apart from the syntax, is LINQ. That's good design, and it shows that LINQ is really evolutionary in the way it's built on so many smaller features in the runtime. But by not going that final step, it also shows that LINQ, and especially the query syntax, is really revolutionary. I think it's quite a mind shift to take a "pipeline", a processing concept that's explicitly procedural, and convert it into a query, something that's much more declarative.


In which I discover, again, the perils of premature optimisation

by Matt 18. March 2008 06:42

You'd think I'd only have to learn this once, wouldn't you?

For various reasons, I'm still using SharpReader as my RSS reader (I could just export my OPML and move, but I'm not happy to hit the panic button just yet. I did try to migrate to a much older version of RSS Bandit, but found lots of little bugs while importing my back catalogue, and it used just as much memory as SharpReader).

Unfortunately, SharpReader is a little, um, basic. It mostly suits my needs, but it would be really nice to have all the items grouped by published date (today, yesterday, last week, etc).

So I thought I'd add it. Hey, it's got a plugin model, it's almost asking to be hacked.

And oh boy, is this a hack. I have a class that implements IBlogExtension, but it doesn't implement any of the methods (so if you try and Blog This, it'll crash!). Instead, the plugin class subscribes to the Application.Idle event, spelunks around for the list view and subclasses it, using the lovely NativeWindow.AssignHandle. And that's just for starters.

I add the range of groups to the list view (the range is something like "next week", "later this week", "tomorrow", "today", "yesterday", "earlier this week", "last week", "two weeks ago", etc) and then loop through all the items in the list. Fortunately for me, Luke puts the RSS item class in the Tag property of the item, so I can get at it, check the date and assign the correct group.
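The date-to-group assignment itself is just bucketing against today's date. A sketch (the group names follow the list above; the method name and the exact boundaries are my own invention):

```csharp
private static string GetGroupNameForDate(DateTime published)
{
    DateTime today = DateTime.Today;
    if (published >= today.AddDays(1)) return "tomorrow";
    if (published >= today) return "today";
    if (published >= today.AddDays(-1)) return "yesterday";

    // Start of the current week (Sunday here; a real version should respect the culture)
    DateTime startOfWeek = today.AddDays(-(int)today.DayOfWeek);
    if (published >= startOfWeek) return "earlier this week";
    if (published >= startOfWeek.AddDays(-7)) return "last week";
    if (published >= startOfWeek.AddDays(-14)) return "two weeks ago";
    return "older";
}
```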

There's a little bit of jiggery pokery with comments (indented items are comments, so they should be in the same group as their parents, not the group for their own date - and of course, SharpReader is .net 1, so the IndentCount property of the ListViewItem isn't set), but I've now got a list of RSS items, grouped by date.


Where does the moral come in?

Things started to get a bit hairy when adding new items. Thanks to my pack rat mentality, I could have up to a couple of thousand items in the list view. Adding a new item shouldn't affect those I've already got grouped. And of course, when a new feed is downloaded, many items could be added one after the other. Again, we want to handle this as efficiently as we can.

Which to me meant only doing the update on application idle, and only updating the new items.

So I listened in WndProc for changes to the list (OCM_NOTIFY + LVN_INSERTITEM), looked up the corresponding ListViewItem and added it to a dictionary.

On idle, I'd spin through the items in the dictionary and set the group. Disappointingly, this means that items get added to the end of the group, rather than being added into the group in the expected order. Let's chuck the unordered group into another dictionary and keep adding. Once all the items are done, loop through all affected groups, unassign all items, sort the items and add them back.

Hairy, right? And of course, because I was trying to be efficient, and store things in lookup tables, and only do things once, it's dead slow.

What I should have done was try the naive method first. The current code simply sets a flag when a new item is inserted. When the application goes idle, it unassigns all items from all groups, and reassigns. And it's dead fast.
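The naive version barely needs any code. A sketch, using the .NET 2 ListView group API for clarity (the real plugin has to drive the underlying control via interop, since SharpReader is .NET 1, and GetGroupForItem is a hypothetical helper):

```csharp
private bool itemsAdded;

// Called from the subclassed WndProc when OCM_NOTIFY/LVN_INSERTITEM arrives
private void OnItemInserted()
{
    itemsAdded = true;
}

private void OnApplicationIdle(object sender, EventArgs e)
{
    if (!itemsAdded)
        return;
    itemsAdded = false;

    // The naive bit: regroup everything from scratch, in order
    listView.BeginUpdate();
    foreach (ListViewItem item in listView.Items)
        item.Group = GetGroupForItem(item);
    listView.EndUpdate();
}
```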

Keep it simple, stupid.

(Oh, and about the horrifying nature of the code - hacks like this have their place, as long as you know and advertise that they are hacks and could fail in the worst possible way. Don't try and pass this off as production code. I know I'm running with scissors, but I'm an adult, they're my scissors, and I'm doing this in the safety of my own home.)


IE8. Where's the Good Stuff?

by Matt 17. March 2008 06:55

...yadda yadda standards compliant yadda yadda Activities yadda yadda WebSlices yadda yadda...

Where's the interesting new stuff in IE8?

I'm all for standards compliance, but it's not exactly exciting, is it? Activities aren't bad, but they're really just fancy context menu extensions, and WebSlices are just single item RSS feeds with a terrible name. The fancy Address Bar shading has been in Firefox for ages. The Favourites Bar is ok, but it's hardly revolutionary. At least we have icons in the menus now...

Tools menu with icons

The built in Developer Tools are a step in the right direction, especially the Layout tab and the built in Javascript debugger. (But they'd better fix the memory usage from IE7's plugin.)

Developer tools layout tab

And the Data URI? Do we really need to be able to embed images inline? I think it'll be useful for HTML email and the Acid2 test.

But to get to some of the good stuff, you need to have a look at the IE8 Whitepapers. Unfortunately, these don't really go into enough technical detail, preferring to be a high level feature overview. But there's still some nice nuggets in there.

For example, proper circular memory leak detection. They've fixed it (better than IE7's fix), but they don't tell us how. I'd like to know, but I guess I'm just nosey. IE7's optical zoom is changing to an adaptive zoom model, where layout is performed after zoom. Seems to be a bit subjective as to which way is better.

Getting a bit more interesting, there's Improved Namespace Support. Previous versions of IE have had some form of XML namespace support, but now it's improved! If you declared an XML namespace on the html element, all elements in that namespace would be ignored by the parser. You could then add an object tag specifying an ActiveX object that would implement a Binary Behaviour. You'd then add an <?import> processing instruction to tie the object to the namespace. Finally, your little namespace xml island would get rendered by the Binary Behaviour. IE8 makes this a little more sensible by registering the object client side, in the registry. No more object tag, no more import PI, and you can actually put the xml namespace declaration on the xml island's root element. This is how we can get SVG support (although Adobe have just end-of-lifed their plugin).

My favourite is Loosely-coupled Internet Explorer and Automatic Crash Recovery. ACR is simply Welcomed, Just Needed and Catching Up With the Competition, so we won't dwell on that. LCIE is nothing short of staggering. IE7 on Vista introduced us to Protected Mode, a change in the OS to run IE as a low-integrity process, limiting what it had access to. This in itself is above and beyond what any other browser maker has done (hey, they don't all write Operating Systems) for security. And they're building on it in IE8.

IE7 would host separate processes for running in and out of Protected Mode. Navigate between zones with different Protected Mode requirements, and the navigation would have to complete in a separate process. A side effect of this is that a program driving IE via automation wouldn't get notified of the new process, and wouldn't get the new interface pointer. IE8 adds a new DWebBrowserEvents2 event to handle this.

LCIE also builds on this. Crazily, the IE frame window is now in one process and each tab (or set of tabs grouped by integrity level, it's not clear which) runs in another. The frame can now host tabs from any integrity level, and a crash just closes that tab, not the whole browser. ACR also remembers the navigation history and should restore things correctly. This is an impressive level of robustness. It'll be interesting to see what effect the asynchronous nature of the parsing and rendering has on performance. The IE blog has a good post on this.

So what's missing? Perhaps the biggest surprise is the lack of support for xhtml. And I haven't really seen anything about how these changes affect the Web Browser component.


Do yourself a favour - get LINQPad

by Matt 14. March 2008 05:36

I've just discovered the goodness that is LINQPad. If you're playing around with LINQ at all (LINQ to SQL, LINQ to Objects, LINQ to XML) then you really need to get this.


Added to my list of indispensable programs...


IE finally feeling some love?

by Matt 12. March 2008 18:52

It's rather nice to see the IE team getting so many positive comments for once.

I've been enjoying this debate/issue. Microsoft have clearly painted themselves into a corner with this one, but it's a prime example of a really tough design decision. It's also a great reminder that everything has consequences, and that everything is complex the deeper into it you go.

So let's look a little closer at this. How would you fix it? Default to the new "super" standards mode, or require pages to opt in? It's perhaps not quite as simple as the comments make it seem.

The original solution, adding a meta tag to "lock" a page to a particular rendering engine version, was pragmatic at best and clearly clumsy, yet it did solve the problem. But at a cost to web developers: they were now asking for standards mode twice, and they really didn't like having to clean up Microsoft's mess. And they were rather vocal about it.

But if you just fix your standards mode, then sites expecting a poorly implemented standards mode wouldn't work, and you've just Broken the Web.

And I think this is where most people have missed the point.

While the comments correctly stated that this would be easy to fix, either by updating the site or adding the meta tag, most failed to see the bigger picture.

Like, who pays for the fix? And what if the site doesn't get fixed?

So now Microsoft has just cost a lot of companies a lot of money (and that's not including those dodgy intranet sites...). And if the company doesn't bother fixing it, the problem just gets pushed down to the user, who'll blame the browser, not the site. And that's assuming the site is owned by a company.

This is not a great position for Microsoft to be in.

And this is why I think Microsoft have been bending over backwards to maintain compatibility. It's interesting that it was web developers who were against the original idea, even though it's not web developers who'll foot the bill or get penalised - and now it's web developers giving all the praise.

But that's not to say that I disagree with their latest decision.

They made an interesting post after the original meta tag solution describing the new User Agent string. Slightly hypocritically, the advice was to update your sites, in case the new UA version Broke the Web.

The big problem with Breaking the Web is one of scale. How many sites will break because the standards implementation is fixed? How badly will they be broken? Are the unmaintained sites that I'm so worried about relying on poorly implemented standards? If they're unmaintained, do they matter any more? Is it a huge problem if they have layout issues? Is it enough for the top 200 sites to work with a decent standards mode?

Firefox, Opera and Safari can implement proper standards support without the world ending. I guess we'll survive.


