Remembering IFilters

by Matt 11. April 2008 05:34

OK. Quick bug fix to my previous post. I said that when you selected "Index Properties Only", it stripped the registered IFilter from the file type. Strictly speaking, it copies the existing "persistent handler" class id, saves it in a value of "OriginalPersistentHandler" and then deletes the current registration.

This way, when you reselect "Index Properties and File Contents", it can copy the original value back, and use the proper IFilter and not have to default to the Plain Text filter.
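A rough sketch of that save-and-restore behaviour, using a plain dict as a stand-in for the registry. The function names and the PLAIN_TEXT_HANDLER placeholder are made up for illustration, and whether the saved value is cleared on restore is an assumption; only the "PersistentHandler" and "OriginalPersistentHandler" names come from what I saw.

```python
# Toy model: each file extension maps to a dict of registry values.
# PLAIN_TEXT_HANDLER is a hypothetical placeholder, not the real GUID.
PLAIN_TEXT_HANDLER = "{plain-text-persistent-handler}"


def index_properties_only(registry, ext):
    """Stash the current persistent handler, then delete the live registration."""
    key = registry.setdefault(ext, {})
    current = key.pop("PersistentHandler", None)
    if current is not None:
        key["OriginalPersistentHandler"] = current


def index_properties_and_contents(registry, ext):
    """Restore the stashed handler, or fall back to the plain text filter."""
    key = registry.setdefault(ext, {})
    if "PersistentHandler" in key:
        return  # already registered, nothing to do
    # Assumption: the stashed value is removed once it has been copied back.
    original = key.pop("OriginalPersistentHandler", None)
    key["PersistentHandler"] = original if original is not None else PLAIN_TEXT_HANDLER


registry = {".wpost": {"PersistentHandler": "{60734E5A-7C25-479f-B101-F14DEAF5ACB6}"}}
index_properties_only(registry, ".wpost")
index_properties_and_contents(registry, ".wpost")
print(registry[".wpost"]["PersistentHandler"])
```

An extension that never had a handler has nothing stashed, so it ends up with the plain text fallback.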

Just the facts, ma'am.

Tags:

Windows Desktop Search

Indexing Windows Live Writer posts

by Matt 9. April 2008 10:44

While googling for something else, I came across a post that pointed out that Windows Live Writer's saved posts aren't being indexed. Well, the contents weren't - only the file properties. Which is odd, because WLW comes with an IFilter - a plugin that exposes the contents of a .wpost file to Windows Search's index.


The article mentions that you can fix this by going to the Indexing Options in the control panel (and going to Advanced -> File Types), selecting the wpost extension, and changing the radio button from "Index Properties Only" to "Index Properties and File Contents".

This works, but not as you expect. It's not using the Windows Live Writer IFilter.

When you select "Index Properties Only", the registered filter is removed from the file type. If a file has no filter registered, the indexer will use the system provided "File Properties Filter", which extracts various properties such as filename, size, dates (and maybe the OLE DocFile structured storage properties) but doesn't touch the contents.

Selecting "Index Properties and File Contents" doesn't magically wire up the correct filter. Instead, it registers the "Plain Text Filter", which just extracts as much text out of the file as it can, and then hands it to the indexer as content. You can use it on arbitrary binary files, but it won't understand the file format, so won't be able to output more advanced properties, such as Author, Subject or Perceived Type. If you try to use the advanced search features of explorer to find blog posts with a certain subject, it will fail. Not too much of a hardship, perhaps, because the text will still match the full content search, but by missing the Perceived Type, the indexer doesn't know if it's a document, email, picture, audio, video or whatever. Bang goes your filtering.

We can fix this, but let's see why it wasn't registered in the first place. A great tool to help with this is Citeknet's IFilter Explorer.

 IFilter Explorer - Citeknet

Take a look for the .wpost extension. It's not there. Now we know why the proper filter wasn't being used - it's not registered.

You might have noticed the bewildering array of tabs across the top of the list. Windows Search shares a history with a long line of search products from Microsoft, from server side search engines such as SQL Server full text search, Sharepoint and Exchange search, to desktops, with Windows Search (3.x), Windows Desktop Search (2.x - MSN Desktop Search), Indexing Service and even the aborted WinFS.

On a hunch, check out Windows Desktop Search 2.x.

There it is. The .wpost extension has the WebPostFilter class registered against it.

And that's because despite sharing ancestry and the IFilter technology, registration between the different implementations can be subtly (and not so subtly) different. For example, the SQL Server registration needs extra data in a system table.

There does appear to be a common thread amongst registrations, though, and this is partly described in the docs for the current version of Windows Search. Namely, registration hangs off the file extension in the registry, or off the document type pointed to by the file extension. Or even from the MIME content type (which I didn't know worked, but explains why so many xml files are indexed).

Windows Desktop Search 2.x simply had some overrides that were checked before the system defined places, and the Windows Live Writer developers chose to register it there:

HKLM\SOFTWARE\Microsoft\RSSearch\ContentIndexCommon\Filters\Extension\.wpost

Now that we know what the problem is, it's pretty straightforward to fix. We just need to deal with the mind-bogglingly odd way of registering IFilters.

Hanging off the file extension, the document type or the MIME type, you need to add a key called "PersistentHandler". Its default value is a GUID that is registered under HKCR\CLSID. That GUID's key has a subkey called PersistentAddinsRegistered, which has another subkey named after the interface IID for IFilter. The default value of this is the CLSID of the IFilter COM object.

Phew.

I have absolutely no idea why they added that bonkers level of abstraction, but it's been there for years, so who are we to argue with tradition. To make it easy, save this as a .reg file and double click:

Windows Registry Editor Version 5.00

[HKEY_CLASSES_ROOT\.wpost]

[HKEY_CLASSES_ROOT\.wpost\PersistentHandler]
@="{60734E5A-7C25-479f-B101-F14DEAF5ACB6}"

[HKEY_CLASSES_ROOT\CLSID\{60734E5A-7C25-479f-B101-F14DEAF5ACB6}]
@="Windows Live Writer persistent handler"

[HKEY_CLASSES_ROOT\CLSID\{60734E5A-7C25-479f-B101-F14DEAF5ACB6}
\PersistentAddinsRegistered]

[HKEY_CLASSES_ROOT\CLSID\{60734E5A-7C25-479f-B101-F14DEAF5ACB6}
\PersistentAddinsRegistered\{89BCB740-6119-101A-BCB7-00DD010655AF}]
@="{4DFA66FF-1EE1-4BAF-A034-0023FB7372EB}"

[HKEY_CLASSES_ROOT\CLSID\{60734E5A-7C25-479f-B101-F14DEAF5ACB6}
\PersistentHandler]
@="{60734E5A-7C25-479f-B101-F14DEAF5ACB6}"

Note that I've wrapped a couple of lines for legibility. Oh, and that PersistentHandler GUID? Brand new one. Never before used. ({60734...} that is. {89BCB...} is the IID for IFilter and {4DFA6...} is the CLSID of the Windows Live Writer filter).
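To make the chain of lookups a bit more concrete, here's a small sketch that walks it over nested dicts standing in for HKEY_CLASSES_ROOT. The GUIDs are the ones from the .reg file above; the dict layout and the find_filter_clsid helper are purely illustrative, and the real indexer may well check other places (document type, MIME type) that this ignores.

```python
IID_IFILTER = "{89BCB740-6119-101A-BCB7-00DD010655AF}"

# Nested dicts stand in for HKEY_CLASSES_ROOT; "" holds a key's default value.
# Note that real registry key names are case-insensitive, unlike dict keys.
HKCR = {
    ".wpost": {
        "PersistentHandler": {"": "{60734E5A-7C25-479f-B101-F14DEAF5ACB6}"},
    },
    "CLSID": {
        "{60734E5A-7C25-479f-B101-F14DEAF5ACB6}": {
            "": "Windows Live Writer persistent handler",
            "PersistentAddinsRegistered": {
                IID_IFILTER: {"": "{4DFA66FF-1EE1-4BAF-A034-0023FB7372EB}"},
            },
        },
    },
}


def find_filter_clsid(root, extension):
    """Follow extension -> PersistentHandler -> PersistentAddinsRegistered -> CLSID."""
    try:
        handler = root[extension]["PersistentHandler"][""]
        addins = root["CLSID"][handler]["PersistentAddinsRegistered"]
        return addins[IID_IFILTER][""]
    except KeyError:
        return None  # some link in the chain is missing


print(find_filter_clsid(HKCR, ".wpost"))
print(find_filter_clsid(HKCR, ".unknown"))
```

If any link is missing, the lookup comes back empty, which is roughly the situation .wpost was in before the fix.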

Advanced Options

Now you just have to get the indexer to re-index those files, and Bob's yer uncle. I took the lazy route, and just rebuilt the whole index (Control Panel -> Indexing Options -> Advanced -> Rebuild).

Painless, eh? What I want to know now, is what does the null filter do?

Tags:

Windows Desktop Search

More reasons to move to WDS 3

by Matt 16. March 2007 19:12

Yeah, Microsoft have released the SDK for WDS3.x. They're just trying to make me move away from 2.x.

(Recap: 2.x runs as a program in your logged in context. Your RSS feeds are available from COM objects in your context. 3.x runs as a service. Not in your context. Ah. No RSS feeds for you, Mr Index.)

And they've just thoroughly updated the 3.x docs on MSDN too. Including the secret sauce for indexing stuff that's only available in the user's logged in context...

As if that's not enough to tempt you away from 2.x, they've removed the final blocker by releasing WDS 3.01 - group policy support has now been re-implemented (it's in 2.x, but didn't make the cut for 3.0). Because you enterprise types are really just dying to upgrade, aren't you?

Of course, this doesn't scare me. I've got my VPC. I don't need no steenking secret sauce. I'm sticking with 2.x.

(Simply because I know it's going to be able to get at the rss feeds, while 3.x might still go wrong.)

But the upgrade will be sooner rather than later...

Tags:

Windows Desktop Search

0.0.1 better than 2.6.5

by Matt 14. March 2007 18:48

At the end of November, Microsoft updated Windows Desktop Search from 2.6.5 to 2.6.6. Now, eagle eyed readers might remember that there was a bug in 2.6.5 - after registering your protocol handler, WDS doesn't automatically start indexing. According to the post in the MSDN forums, this was going to get fixed in 2.6.6.

Add to that the fact that I had a sneaky feeling registration wasn't working right (and because I had to do a lot of VPC messing about anyway) I thought I'd have a closer look.

The MSDN docs for registration don't mention the search manager object, and list a load of registry keys to do it all manually. Registry Monitor to the rescue, and it looks like neither 2.6.5 nor 2.6.6 writes the protocol handler details to HKLM when using the search manager. This means you'll need to install your protocol handler for each user on the system.

Of course, that's not very useful, seeing as WDS doesn't look at the documented HKLM key. It does look at HKLM\Software\Microsoft\Search\ProtocolHandlers (note Search instead of RSSearch) but the format is different. I think this is a Sharepoint registry key, so I'd be reluctant to use it for WDS (mainly because we're not writing a plugin for Sharepoint, but for WDS). Interestingly, once it is written under HKCU, WDS will also search for it explicitly in that Sharepoint key above.

And to make things a bit more confusing, the HasRequirements and HasStartPage values documented on that MSDN page don't appear to be read, either.

So, let's look at that other bug. On a clean 2.6.5 system, if I register the protocol handler, it gets queried for ISearchProtocol, and Init is called, and that's it. It's only after I restart WDS or choose to rebuild the index that ISearchProtocol::GetDefaultCrawlScope gets called, and I can set the default root URL. Some time after that, ISearchProtocol is queried, as is IUrlAccessor, and everything looks good.

Under 2.6.6, there's not much difference. The useless call to ISearchProtocol before having set up the default URL doesn't happen. Other than that, it's all exactly the same - I only get to set the default URL after restarting the program or rebuilding the index.

So, I'm not sure that we can really say that the bug is fixed.

And seeing that the registry doesn't seem to work as expected, I'll be glad to move up to version 3.

Tags:

Windows Desktop Search

It never rains...

by Matt 10. March 2007 18:49

And of course, just to demonstrate that the universe actually does understand irony (*), Microsoft have just released the SDK for WDS 3.x.

Some very interesting things here. The usual suspects for querying (OLE DB and ADO.NET), even a handy command line app.

The indexing section includes an example implementation of IFilter, and a rather useful ATL style C++ class called IFilterImpl to take a lot of work from you.

There's also a really impressive registry shell namespace extension which includes a protocol handler. Yep, with this example, you can search the registry from the Start menu. I think the nice thing about this sample is the way it's so integrated. The shell stuff leans on the new Vista property store and the protocol handler indexes the shell folder, rather than the registry itself.

(And I rather think there's a little secret in that example too. One that will come in very handy when I'm ready to migrate from the user mode 2.x to the service-based 3.x...)

I think I'm going to get a lot of use out of a couple of the samples in the management section too. One seems able to query and set what urls are included or excluded or set as roots. But the killer is a simple example that queries the index (based on the command line) and notifies the index that the urls have been deleted. Hence, they'll get removed from the index, the source will get crawled again, and the items will get re-indexed. Very useful.

(*) The irony only really being obvious if you read my post of 15 minutes ago.

Tags:

Windows Desktop Search

Back to the grindstone

by Matt 10. March 2007 18:33

Yeah, I guess it has been a little while since I last looked at the RSS protocol handler. There have been a number of reasons. Let me list them for you, I can tell you're dying to know.

  1. Real Life.
  2. Work.
  3. The release of WDS 2.6.6
  4. The release of Visual Studio 2005 SP1
  5. The release of IE7, and therefore the RSS Platform, which might mean a breaking change to the beta SDK I was going to be using
  6. The feeling that I needed to reset my VPC and start again. I'd just fiddled with too many registry settings.

I mean, who relishes the thought of having to do that much VPC maintenance? I intended to get round to it, but we all know about good intentions and roads being paved, etc. But I've had an email (thanks Sanin) that's given me a prod, so I'm updating my VPC even as we speak.

Adding more fuel to my laziness, the MSDN docs have been updated. The 3.x docs are still pitiful, but the 2.x ones are much more interesting. I expected them to be pretty much a copy of the old MSN docs, but there's more stuff. There's more info on querying the index (including the WDS Browser Helper Object, which allows a website to perform a query via script).

And then there's a whole heap of interesting goodness about "Developing Protocol Handler Add-ins". Not masses of new information, but a couple of useful snippets that mean changes to what I've already covered, such as:

Note

When you want to add a new data store, you'll need to select a name to identify it that does not conflict with current ones. We recommend this naming convention: companyName.scheme.

Which of course makes sense, but means I've got to go back and change it. This really isn't helping. But the real kicker is in the section on installing and registering protocol handlers. This version of the docs details what registry keys to write. It doesn't mention using the COM object to register at all. I don't know why, but I have a sneaky suspicion it's because the COM object doesn't register the protocol handler in HKLM, only HKCU, so other users won't get it. And looking into that was going to be awkward, seeing as I'd already registered.

See what I mean? Lots of little housekeeping jobs to do.

And perhaps more importantly, I was beginning to bore myself with those articles. They were just a little bit dry and preachy. I needed to write some code. I needed to make use of that lovely syntax highlighter plugin for Windows Live Writer. More than that, I needed a Plan.

I think I've got one now. But first, that housekeeping.

Tags:

Windows Desktop Search

Missing Vista features #4 - WinFS

by Matt 20. February 2007 19:03

Oh yes. The WinFS Post. Dare linked to a previous post of his while commenting on Microsoft focusing on vision, rather than shipping, which reminded me that I hadn't gotten around to writing about it.

Here's the link to the more interesting post, and if you've been living under a rock and don't have a scooby-doo what I'm talking about, here's a great Wikipedia article about it all.

I didn't get to play with any actual bits, so I'm not about to shoot my mouth off about the technology, but I loved the concept. I was very disappointed to see WinFS go. It was the most exciting of the original Three Pillars of Longhorn (WinFS, Indigo and Avalon).

On the surface, it was all about search. I think we've nailed that one without needing WinFS. There were some advanced searches that we were promised (as Wikipedia says: "the phone numbers of all persons who live in Acapulco and each have more than 100 appearances in my photo collection and with whom I have had e-mail within last month"), but I wouldn't bet against Windows Desktop Search being able to handle something like that, especially when you see the surprisingly-off-by-default natural language search.

I think a lot of people unfairly dismissed the more interesting part - the data store. To me, this was the killer app. WinFS stored any data you wanted it to, in a structured manner, with relationships between items and properties and allowed you to search over the lot. This was huge.

It had a bunch of built in schema, so you could store contacts, emails, IMs, documents, pictures, videos, music and more. But as Dare points out, many people felt there was a chicken and egg situation. Most apps already had massive investments in their own data store (e.g. Outlook) so why would they throw that away and migrate to WinFS?

As I understood it, the architects of WinFS had already thought of this, and you could promote metadata from your custom data store into WinFS, and get back notifications when the WinFS data store changed. Bingo. No need to rewrite your app.

The really big thing about the data store was that it totally blew open the data silos. It effectively normalised all data formats. Yes, I can search for all contacts with Windows Desktop Search, but once I've got the results, there's not a lot else I can do, because some are stored in Outlook, some in vcard files, some in Vista Contacts, some in IM, etc. With WinFS, you just have a contact. You manipulated the WinFS Contact datatype and saved the changes back. That's incredibly powerful.

This post by Brandon Paddock of the WDS team is great - read the comments. He points out that WinFS would still have to store music in different formats - WMA, MP3, etc - so is this unified API viable, or just a little too idealistic?

Even if the unified data store is a step too far, WinFS is still a killer app from a dev point of view. Just about every app you build has to have local storage, and you always have to roll it yourself. You've got a whole heap of options, none of which are ideal. XML files mean reading the whole of each file into memory (like SharpReader and RssBandit do, hence the large memory footprint). You could use a database, such as Access, SQL Server (Express), SQL Server Everywhere, SQLite, etc. But these have their own limitations: is the data you want to store suitable for a database (RSS feeds or email messages in a SQL database)? Is client/server appropriate? Is an in-process DB robust enough? You could even roll your own solution, which is fraught with peril. Or you could just let WinFS worry about it and get free searching to boot.

Ah WinFS, we hardly knew you...

Tags:

Windows Desktop Search | Vista

More than one way to skin an rss feed platform

by Matt 4. January 2007 18:07

I'm on a bit of a quest to be able to search the RSS feed platform that comes with IE7 from Windows Desktop Search. The proper way of doing this is of course to write a protocol handler, and (maybe a little too slowly) that's what I'm doing.

It's not the only way.

Check out this post by Mitch Denny. It's got nothing to do with searching, but it does have a nice little screenshot of Mitch's start menu, which is showing search results for "performance". And the icons in the results look suspiciously like RSS icons...

Has Mitch struck gold?

Not exactly.

A bit of google-fu later, and it looks like no-one yet has a protocol handler for the feed store (the best I could find was a Jon Udell interview with Microsoft's Amar Gandhi about the feed platform, in which Amar mentions that you could write a protocol handler, not that MS has done so). So I started to think a bit more laterally.

Windows Mail Live Desktop has support for the feed platform, allowing you to manage your subscriptions and read items. Perhaps they added a protocol handler? (Windows Live Writer adds an IFilter for the .wpost file type, so there's precedent.) Obviously, I installed it.

Guess what - I can search rss items. But not quite as I expected.

When you view your feeds in WMLD, it copies each item out of the feed store and into your Users folder, as individual .rss files. And since my entire Users folder is being indexed (not sure if that's the default - I just logged on as a newly created user and all of their Users folder is also indexed), these files are up for indexing too.

And it just so happens that .rss files are associated with an IFilter. Fairly oddly, the files are output as MIME formatted mail messages, and associated with Windows Mail (not WMLD!). This does mean that they pick up Windows Mail's MIME IFilter, and so, along with .eml and .news files are indexed. I guess it just saves on development (and it is still in beta). They also pick up Windows Mail's preview handler, so selecting an rss item in the search results gives a preview.

The downside is that the files are all rather cryptically named (e.g. 1CD0366B-00000006.rss) so when you do search, this is the name that appears, not the subject of the item. But at least the data is there, even if the item does open in Windows Mail rather than WMLD.

(If you rename one of the files to .eml, tooltips suddenly start working and the name in search results is suddenly something sensible. Not sure what's gone wrong there, perhaps something with Vista's property system.)

So, we have searching of the feed store. Almost. If WMLD isn't running, the file cache is going to get stale, so a protocol handler is still a better way of doing this. But it's a nice, simple way of getting your data indexed. In fact, it's so useful, the rather nice WPF based New York Times Reader also does the same thing - it has a copy of all of its items stored in your Documents folder, under New York Times/Search. They're simple text files with the content of the story, and double clicking the file opens the reader and navigates to the story within the app. Ideal.
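The "dump it to files and let the indexer find them" trick is simple enough to sketch. This is only an illustration of the idea, not what either app actually does: export_items, the item ids and the .txt layout are all made up, and a temporary folder stands in for somewhere under an indexed location.

```python
import tempfile
from pathlib import Path


def export_items(items, folder):
    """Write each item to its own text file so a file indexer can pick it up."""
    folder = Path(folder)
    folder.mkdir(parents=True, exist_ok=True)
    written = []
    for item_id, (title, body) in items.items():
        path = folder / f"{item_id}.txt"
        path.write_text(f"{title}\n\n{body}\n", encoding="utf-8")
        written.append(path)
    return written


# Hypothetical items; a real exporter would read these from the feed store.
items = {
    "0001": ("Performance tips", "Measure first, then optimise."),
    "0002": ("Search add-ins", "New plugins for desktop search."),
}

with tempfile.TemporaryDirectory() as cache:
    for path in export_items(items, cache):
        print(path.name)
```

The catch is the same one as above: the files are only as fresh as the last time the exporter ran.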

But it wasn't what Mitch was using - wrong icons.

Turns out Mitch is using Outlook 2007, which has RSS support. Feed items are stored in Outlook's MAPI backend, and of course, this is fully indexed. Proper search integration again, but it's still not searching the common feed store.

Tags:

Windows Desktop Search

New desktop search plugins

by Matt 21. December 2006 18:56

Microsoft have recently released a bunch of plugins for Windows Desktop Search. And these support WDS 2.x on XP, 3.x on XP and WDS on Vista, which is a bit nifty.

From Brandon Paddock's blog:

Today we released updated Add-ins for Windows Desktop Search.  Most notable among the changes is that they now support Windows Vista.  This includes the UNC Protocol Handler which allows you to index remote network shares (without using Offline Files).

Add-in for Files on Microsoft Networks (UNC and mapped drives)

Add-in for Internet Explorer history

Add-in for .msg files

Add-in for Lotus Notes

I'm not going to use the .msg or Lotus Notes versions, but the network share one looks interesting, and I'll definitely give the IE history one a go. I'll let you know if it turns out useful (I'll have to update my list of programs what I run).

And it appears that Adobe Reader 8 includes a PDF filter that works on Vista (or WDS 3 on XP). Unfortunately, it looks like you can only get it bundled with the reader software, and I'm not sure I'm ready to trust Adobe to have built something that isn't bloated and overweight, especially when you see how lightweight Foxit Reader is. And I'm not a big PDF user. Maybe I'll give it a go. Install it. Take one for the team.

Tags:

Windows Desktop Search

Implementing IFilter

by Matt 12. December 2006 18:54

Last time on the Windows Desktop Search show, IUrlAccessor was dishing out IFilter interfaces.

The good news is this is the last of the core interfaces a protocol handler has to implement - that's right, the theory is almost over. There'll be a test later, though.

IFilter is the work horse of desktop search - it's how WDS supports lots of different file formats. Here's the interface:

interface IFilter: IUnknown
{
    SCODE Init([in] ULONG grfFlags,
               [in] ULONG cAttributes,
               [in, size_is(cAttributes), unique] FULLPROPSPEC const * aAttributes,
               [out] ULONG *pFlags);
    SCODE GetChunk([out] STAT_CHUNK *pStat);
    SCODE GetText([in, out] ULONG *pcwcBuffer,
                  [out, size_is(*pcwcBuffer)] WCHAR *awcBuffer);
    SCODE GetValue([out] PROPVARIANT **ppPropValue);
    SCODE BindRegion([in] FILTERREGION origPos,
                     [in] REFIID riid,
                     [out] void ** ppunk);
}

BindRegion's the easiest method - it's reserved. Just return E_NOTIMPL.

Init is surprisingly complex. The flags value can modify some of the behaviour of the filter. If cAttributes is non-zero, the aAttributes array contains the list of properties to retrieve - the caller isn't interested in any others. If neither flags nor attributes are specified, the default PSGUID_STORAGE set of property attributes should be returned. These are a default set that include things such as modified time, size and contents. They're defined in stgprop.h in the Platform SDK (it looks like it's missing from the recently released Windows SDK, which replaces the Platform SDK - we might need this later). Init also returns a flag, via pFlags, to say whether or not the file has OLE properties attached to it. This is only really relevant for structured storage files (like Word documents), so we will probably always return 0 here.

The remaining methods are called in a loop. GetChunk is called to continue parsing the file until it finds the next interesting "chunk", and it returns what it's found in the STAT_CHUNK parameter. This structure is rather busy, so we'll just look at the edited highlights; there's a chunk type - text or value.

A text chunk is the main content, the body of the document. If it's one of those, GetText is called, and the text is returned back in a Unicode buffer. If it's a value chunk, the STAT_CHUNK will have a FULLPROPSPEC member which will contain a property set id and a property index. The property set id is a GUID which describes a set of properties, like a category. The index is just an integer value that represents a property within the property set. An example of a FULLPROPSPEC is the PSGUID_STORAGE property set and the PID_STG_SIZE index. No prizes for guessing what this represents. The value of the property is returned as a PROPVARIANT in the call to GetValue - this allows data types other than strings to be returned, such as dates and numbers.

Both GetText and GetValue always work on the current chunk, which means the object has state, which means it's apartment threaded. (I have a feeling we're going to have to take a look at threading soon enough). Once the whole file is parsed, GetChunk returns FILTER_E_END_OF_CHUNKS.
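The shape of that loop, seen from the caller's side, looks something like the sketch below. It's a toy model rather than COM: ToyFilter, crawl and the tuple-based chunks are invented for illustration, an exception stands in for the FILTER_E_END_OF_CHUNKS return code, and property ids are plain strings instead of a real FULLPROPSPEC.

```python
CHUNK_TEXT, CHUNK_VALUE = "text", "value"


class EndOfChunks(Exception):
    """Stands in for the FILTER_E_END_OF_CHUNKS return code."""


class ToyFilter:
    """Hypothetical filter over a pre-parsed document; not a real COM object."""

    def __init__(self, chunks):
        self._chunks = iter(chunks)
        self._current = None  # the state that get_text and get_value rely on

    def get_chunk(self):
        try:
            self._current = next(self._chunks)
        except StopIteration:
            raise EndOfChunks from None
        kind, prop, _ = self._current
        return kind, prop

    def get_text(self):
        return self._current[2]

    def get_value(self):
        return self._current[2]


def crawl(flt):
    """Drive the filter: get a chunk, then fetch its text or its value."""
    text, props = [], {}
    while True:
        try:
            kind, prop = flt.get_chunk()
        except EndOfChunks:
            return " ".join(text), props
        if kind == CHUNK_TEXT:
            text.append(flt.get_text())
        else:
            props[prop] = flt.get_value()


doc = [
    (CHUNK_VALUE, ("PSGUID_STORAGE", "PID_STG_SIZE"), 2048),
    (CHUNK_TEXT, None, "Hello from the body."),
]
print(crawl(ToyFilter(doc)))
```

The _current field is the state mentioned above: get_text and get_value only make sense relative to the last chunk handed out.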

There's actually quite a bit more to this interface than I've just described. Each method has a couple of different return codes, and GetText and GetValue can be called multiple times, depending on the size of the content or the number of properties (e.g. keywords). I don't intend this to be an exhaustive guide to writing an IFilter, just an overview. Pay MSDN a visit and Google is always your friend. And then you need to know what the standard property sets are. You can find these in the Windows SDK (WDS v3) and the Platform SDK (WDS v2) as defines beginning with PSGUID or FMTID - shlguid.h has loads.
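As a small illustration of the "called multiple times" point, here's a sketch of a caller draining one text chunk through a fixed-size buffer. Again this is a model, not the real interface: TextChunk and read_all are made-up names, and an exception stands in for the FILTER_E_NO_MORE_TEXT return code that signals the chunk is exhausted.

```python
class NoMoreText(Exception):
    """Stands in for the FILTER_E_NO_MORE_TEXT return code."""


class TextChunk:
    """Hypothetical text chunk that hands out its content one buffer at a time."""

    def __init__(self, content):
        self._content = content
        self._pos = 0

    def get_text(self, buffer_size):
        if self._pos >= len(self._content):
            raise NoMoreText
        piece = self._content[self._pos:self._pos + buffer_size]
        self._pos += len(piece)
        return piece


def read_all(chunk, buffer_size=8):
    """Keep calling get_text until the chunk reports it has nothing left."""
    pieces = []
    while True:
        try:
            pieces.append(chunk.get_text(buffer_size))
        except NoMoreText:
            return pieces


print(read_all(TextChunk("The quick brown fox")))
```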

And then there's a great big question that especially relates to protocol handlers - what if the item you're trying to parse isn't a document? What if it's a directory?

Tags:

Windows Desktop Search
