Last time on the Windows Desktop Search show, IUrlAccessor was dishing out IFilter interfaces.
The good news is this is the last of the core interfaces a protocol handler has to implement - that's right, the theory is almost over. There'll be a test later, though.
IFilter is the workhorse of desktop search - it's how WDS supports lots of different file formats. Here's the interface:
interface IFilter : IUnknown
{
    SCODE Init([in] ULONG grfFlags,
               [in] ULONG cAttributes,
               [in, size_is(cAttributes), unique] FULLPROPSPEC const * aAttributes,
               [out] ULONG *pFlags);
    SCODE GetChunk([out] STAT_CHUNK *pStat);
    SCODE GetText([in, out] ULONG *pcwcBuffer,
                  [out, size_is(*pcwcBuffer)] WCHAR *awcBuffer);
    SCODE GetValue([out] PROPVARIANT **ppPropValue);
    SCODE BindRegion([in] FILTERREGION origPos,
                     [in] REFIID riid,
                     [out] void ** ppunk);
}
BindRegion's the easiest method - it's reserved. Just return E_NOTIMPL.
Init is surprisingly complex. The flags value can modify some of the behaviour of the filter. If cAttributes is non-zero, the aAttributes array contains the list of properties to retrieve - the caller isn't interested in any others. If neither flags nor attributes are specified, the default PSGUID_STORAGE set of property attributes should be returned. This is a default set that includes things such as modified time, size and contents. They're defined in stgprop.h in the Platform SDK (it looks like it's missing from the recently released Windows SDK, which replaces the Platform SDK - we might need this later). Through pFlags, Init returns a flag to say whether or not the file has OLE properties attached to it. This is only really relevant for structured storage files (like Word documents), so we will probably always return 0 here.
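To make the attribute selection in Init a bit more concrete, here's a rough sketch of the decision it has to make. It's deliberately not real COM code: PropSpec is a simplified stand-in for FULLPROPSPEC, SelectAttributes and DefaultStorageAttributes are hypothetical helper names, and the property ids are values I'm assuming, so check them against the SDK header before relying on them. The flag bits are ignored entirely, which a real filter can't get away with.

```cpp
#include <cstdint>
#include <vector>

// Stand-ins so the sketch builds without the Windows SDK headers.
using ULONG = std::uint32_t;

// Simplified stand-in for FULLPROPSPEC: a property set plus an index.
struct PropSpec {
    int   propSet;  // stands in for the property set GUID
    ULONG propId;   // index of the property within the set
};

constexpr int   kStorageSet      = 1;   // stands in for PSGUID_STORAGE
constexpr ULONG kPidStgSize      = 12;  // assumed values; check the SDK header
constexpr ULONG kPidStgWriteTime = 14;
constexpr ULONG kPidStgContents  = 19;

// The default set described above: modified time, size and contents.
std::vector<PropSpec> DefaultStorageAttributes() {
    return {{kStorageSet, kPidStgWriteTime},
            {kStorageSet, kPidStgSize},
            {kStorageSet, kPidStgContents}};
}

// Works out which properties the caller wants, mirroring Init's parameters.
std::vector<PropSpec> SelectAttributes(ULONG grfFlags, ULONG cAttributes,
                                       const PropSpec* aAttributes,
                                       ULONG* pFlags) {
    // No OLE properties to report, so the out flag is always 0.
    if (pFlags != nullptr) *pFlags = 0;

    // An explicit list wins: the caller isn't interested in anything else.
    if (cAttributes > 0 && aAttributes != nullptr)
        return std::vector<PropSpec>(aAttributes, aAttributes + cAttributes);

    // Assumption of this sketch: the flag bits are ignored and we fall
    // back to the default set whenever no attributes are listed.
    (void)grfFlags;
    return DefaultStorageAttributes();
}
```

A filter would stash the returned list during Init and consult it in GetChunk to decide which value chunks are worth emitting.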
The remaining methods are called in a loop. GetChunk is called to continue parsing the file until it finds the next interesting "chunk", and it returns what it's found in the STAT_CHUNK parameter. This structure is rather busy, so we'll just look at the edited highlights; there's a chunk type - text or value. A text chunk is the main content, the body of the document. If it's one of those, GetText is called, and the text is returned in a Unicode buffer.
If it's a value chunk, the STAT_CHUNK will have a FULLPROPSPEC member which will contain a property set id and a property index. The property set id is a GUID which describes a set of properties, like a category. The index is just an integer value that represents a property within the property set. An example of a FULLPROPSPEC is the PSGUID_STORAGE property set and the PID_STG_SIZE index. No prizes for guessing what this represents. The value of the property is returned as a PROPVARIANT in the call to GetValue - this allows data types other than strings to be returned, such as dates and numbers.
Both GetText and GetValue always work on the current chunk, which means the object has state, which means it's apartment threaded. (I have a feeling we're going to have to take a look at threading soon enough). Once the whole file is parsed, GetChunk returns FILTER_E_END_OF_CHUNKS.
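The calling pattern is easier to see in code, so here's a minimal sketch of the loop from the caller's side. Everything in it is a stand-in: FakeFilter, Chunk, Harvest and DrainFilter are hypothetical names, the Result enum replaces the real SCODE values (the comments say which FILTER_E_ code each one stands for), and a plain long plays the part of the PROPVARIANT. The tiny four character buffer is only there to force GetText to be called more than once per chunk.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Placeholder result codes, not the real SCODE values.
enum class Result {
    Ok,
    EndOfChunks,   // stands in for FILTER_E_END_OF_CHUNKS
    NoMoreText,    // stands in for FILTER_E_NO_MORE_TEXT
    NoMoreValues   // stands in for FILTER_E_NO_MORE_VALUES
};
enum class ChunkType { Text, Value };

struct Chunk {
    ChunkType    type;
    std::wstring text;   // used by text chunks
    long         value;  // used by value chunks (a PROPVARIANT in real life)
};

// A toy filter that serves pre-parsed chunks. Note the state it carries.
class FakeFilter {
public:
    explicit FakeFilter(std::vector<Chunk> chunks) : chunks_(std::move(chunks)) {}

    Result GetChunk(ChunkType* type) {
        if (next_ >= chunks_.size()) return Result::EndOfChunks;
        current_ = next_++;
        hasCurrent_ = true;
        textPos_ = 0;
        valueRead_ = false;
        *type = chunks_[current_].type;
        return Result::Ok;
    }

    // Copies up to *count characters of the current chunk, then updates *count.
    Result GetText(std::size_t* count, wchar_t* buffer) {
        if (!hasCurrent_) return Result::NoMoreText;
        const std::wstring& t = chunks_[current_].text;
        if (textPos_ >= t.size()) return Result::NoMoreText;
        const std::size_t n = std::min(*count, t.size() - textPos_);
        t.copy(buffer, n, textPos_);
        textPos_ += n;
        *count = n;
        return Result::Ok;
    }

    Result GetValue(long* value) {
        if (!hasCurrent_ || valueRead_) return Result::NoMoreValues;
        *value = chunks_[current_].value;
        valueRead_ = true;
        return Result::Ok;
    }

private:
    std::vector<Chunk> chunks_;
    std::size_t next_ = 0, current_ = 0, textPos_ = 0;
    bool hasCurrent_ = false, valueRead_ = false;
};

struct Harvest {
    std::wstring      text;
    std::vector<long> values;
};

// The caller's loop: GetChunk, then GetText or GetValue on the current chunk.
Harvest DrainFilter(FakeFilter& filter) {
    Harvest out;
    ChunkType type;
    while (filter.GetChunk(&type) == Result::Ok) {
        if (type == ChunkType::Text) {
            wchar_t buffer[4];
            std::size_t count = 4;
            while (filter.GetText(&count, buffer) == Result::Ok) {
                out.text.append(buffer, count);
                count = 4;  // reset: GetText overwrote it with the length copied
            }
        } else {
            long v = 0;
            if (filter.GetValue(&v) == Result::Ok) out.values.push_back(v);
        }
    }
    return out;
}
```

The members current_, textPos_ and valueRead_ are exactly the per-chunk state mentioned above: GetText and GetValue only make sense relative to the last GetChunk call.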
There's actually quite a bit more to this interface than I've just described. Each method has a couple of different return codes, and GetText and GetValue can be called multiple times, depending on the size of the content or the number of properties (e.g. keywords). I don't intend this to be an exhaustive guide to writing an IFilter, just an overview. Pay MSDN a visit, and remember Google is always your friend. You'll also need to know what the standard property sets are. You can find these in the Windows SDK (WDS v3) and the Platform SDK (WDS v2) as defines beginning with PSGUID or FMTID - shlguid.h has loads.
And then there's a great big question that especially relates to protocol handlers - what if the item you're trying to parse isn't a document? What if it's a directory?