So, if you tilt your head, squint a little bit and use just a dash of imagination, you can say that ISearchProtocol::CreateAccessor is kind of analogous to CreateFile - it abstracts away access to whatever the url refers to, just as CreateFile abstracts away access to the file system and the hard disk.
But while CreateFile gives you a handle you can pass to other API functions, CreateAccessor returns an instance of the IUrlAccessor interface.
And this is where the whole CreateFile analogy breaks down somewhat. Didn't last long, really, did it? IUrlAccessor is not intended to be an equivalent file system API for a url protocol. Instead, it's about getting access to any of the url's data that's required for indexing - "file" metadata (size, last modified date), security data and actual "file" contents. (When I say "file" it's really just shorthand for "resource referred to by the url passed to CreateAccessor".)
This simple metadata is available directly from the interface (GetSize, GetLastModified, GetSecurityDescriptor), but getting at the contents is a bit more work.
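Even the "simple" metadata can need a little massaging. GetLastModified wants a FILETIME (100-nanosecond ticks since 1 January 1601), whereas a date parsed out of a feed is more likely to arrive as a Unix timestamp. Here's a minimal sketch of that conversion. FileTime is a stand-in for the real Windows struct so the snippet builds without the SDK, and UnixToFileTime and ToTicks are names I've made up for illustration.

```cpp
#include <cassert>
#include <cstdint>

// Stand-in for the Windows FILETIME struct, so the sketch builds without the SDK.
struct FileTime {
    std::uint32_t dwLowDateTime;
    std::uint32_t dwHighDateTime;
};

// Seconds between 1601-01-01 (the FILETIME epoch) and 1970-01-01 (the Unix epoch).
constexpr std::uint64_t kEpochDifferenceSeconds = 11644473600ULL;
// FILETIME counts 100-nanosecond ticks, so there are ten million per second.
constexpr std::uint64_t kTicksPerSecond = 10000000ULL;

// Hypothetical helper: converts a non-negative Unix timestamp to a FileTime.
inline FileTime UnixToFileTime(std::uint64_t unixSeconds) {
    // Guard against overflow in the multiplication below (absurdly distant dates).
    assert(unixSeconds < 1000000000000ULL);
    const std::uint64_t ticks =
        (unixSeconds + kEpochDifferenceSeconds) * kTicksPerSecond;
    FileTime ft;
    ft.dwLowDateTime = static_cast<std::uint32_t>(ticks & 0xFFFFFFFFULL);
    ft.dwHighDateTime = static_cast<std::uint32_t>(ticks >> 32);
    return ft;
}

// Hypothetical helper: recombines the two halves into a single 64-bit tick count.
inline std::uint64_t ToTicks(const FileTime& ft) {
    return (static_cast<std::uint64_t>(ft.dwHighDateTime) << 32) | ft.dwLowDateTime;
}
```

A real GetLastModified would copy the result into the caller's FILETIME and return S_OK.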
Obviously, the indexer cannot know about the format of all "files" it's asked to index. Especially when you consider that some files contain just content (such as plain text), some contain only metadata (such as mp3 files) and some contain both (e.g. Word files). We need another layer of abstraction. And that's where IFilter comes in.
The IFilter interface is called by the indexer to retrieve content and metadata from the underlying data source (file/url). The primary purpose of IUrlAccessor is to retrieve an IFilter for the resource represented by the url. So let's take a closer look at IUrlAccessor:
interface IUrlAccessor: IUnknown
{
    HRESULT AddRequestParameter([in] PROPSPEC *pSpec,
                                [in] PROPVARIANT *pVar);
    HRESULT GetDocFormat([out, length_is(*pdwLength), size_is(dwSize)] WCHAR wszDocFormat[],
                         [in] DWORD dwSize,
                         [out] DWORD *pdwLength);
    HRESULT GetCLSID([out] CLSID *pClsid);
    HRESULT GetHost([out, length_is(*pdwLength), size_is(dwSize)] WCHAR wszHost[],
                    [in] DWORD dwSize,
                    [out] DWORD *pdwLength);
    HRESULT IsDirectory();
    HRESULT GetSize([out] ULONGLONG *pllSize);
    HRESULT GetLastModified([out] FILETIME *pftLastModified);
    HRESULT GetFileName([out, length_is(*pdwLength), size_is(dwSize)] WCHAR wszFileName[],
                        [in] DWORD dwSize,
                        [out] DWORD *pdwLength);
    HRESULT GetSecurityDescriptor([out, size_is(dwSize)] BYTE *pSD,
                                  [in] DWORD dwSize,
                                  [out] DWORD *pdwLength);
    HRESULT GetRedirectedURL([out, length_is(*pdwLength), size_is(dwSize)] WCHAR wszRedirectedURL[],
                             [in] DWORD dwSize,
                             [out] DWORD *pdwLength);
    HRESULT GetSecurityProvider([out] CLSID *pSPClsid);
    HRESULT BindToStream([out] IStream **ppStream);
    HRESULT BindToFilter([out] IFilter **ppFilter);
};
It's a bit of an odd interface, really - you're actually not expected to implement all of it. Methods that don't make sense for your implementation should return E_NOTIMPL.
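One way to keep that manageable is a base class where everything answers E_NOTIMPL, leaving the real accessor to override only what it can genuinely answer. This is just a sketch of the idea, not how any shipping protocol handler does it: the HRESULT constants are stand-ins for the Windows definitions, only three of the interface's methods are shown, there's no IUnknown plumbing, and AccessorDefaults and RssItemAccessor are hypothetical names.

```cpp
#include <cassert>
#include <cstdint>

// Stand-ins for the Windows definitions, so the sketch builds without the SDK.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_NOTIMPL = static_cast<HRESULT>(0x80004001u);
constexpr HRESULT E_POINTER = static_cast<HRESULT>(0x80004003u);

// Hypothetical base class: every method defaults to "not implemented".
// Only a few IUrlAccessor methods are sketched, and IUnknown is left out.
class AccessorDefaults {
public:
    virtual ~AccessorDefaults() = default;
    virtual HRESULT GetSize(std::uint64_t* /*pllSize*/) { return E_NOTIMPL; }
    virtual HRESULT GetHost(wchar_t* /*wszHost*/, std::uint32_t /*dwSize*/,
                            std::uint32_t* /*pdwLength*/) { return E_NOTIMPL; }
    virtual HRESULT GetCLSID(void* /*pClsid*/) { return E_NOTIMPL; }
};

// Hypothetical accessor for a single RSS item: all it knows is its size.
class RssItemAccessor : public AccessorDefaults {
public:
    explicit RssItemAccessor(std::uint64_t sizeInBytes) : size_(sizeInBytes) {}

    HRESULT GetSize(std::uint64_t* pllSize) override {
        if (pllSize == nullptr) return E_POINTER;  // reject a missing out parameter
        *pllSize = size_;
        return S_OK;
    }

private:
    std::uint64_t size_;
};
```

Calling GetSize on an RssItemAccessor fills in the size and returns S_OK, while GetHost and GetCLSID fall through to the base class and return E_NOTIMPL.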
There are a number of methods that aren't used - AddRequestParameter, GetHost and GetSecurityProvider. The simple metadata methods are pretty much self-explanatory - GetSize, GetLastModified and GetSecurityDescriptor (although this last one will need investigating a bit more closely). The rest are all about getting an IFilter.
When the indexer is indexing the file system, the IFilter is selected based on file extension. When indexing via IUrlAccessor, there are more interesting things to take into account, and IUrlAccessor allows you to customise this simple file extension mapping. Remember that if you don't need this flexibility, you can just return E_NOTIMPL. Also note that the docs don't give an order in which these methods are called - I've listed them here in fairly random order:
- GetCLSID allows you to return a class ID that can handle this file type (such as Microsoft Word). I'm guessing this is to do with ActiveDocuments? The main purpose for this is to be able to have a url such as (http://example.org/wordfile.file) actually be a Word file without having to have a .doc extension.
- GetDocFormat allows you to specify a MIME type that takes precedence over the url's extension.
- GetFileName covers the case where your url scheme just happens to map UNC accessible files to urls - you can just return the file name here, and it'll get indexed the same as file system files.
- BindToStream allows you to provide a stream over your data. The indexer can then read the file contents from the stream and either save them to a temporary file and bind an IFilter to that file, or bind the IFilter directly to the stream.
- If none of those methods suit and you want to take complete control of hooking up the IFilter or if the data represented by your url isn't a normal desktop file format (such as a row in a database, or, as in our case, an RSS item), you can return your own IFilter implementation from BindToFilter.
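The string-returning methods in that list (GetDocFormat, GetFileName) all share the same buffer contract: the caller hands over a buffer of dwSize characters and you report how much you wrote via pdwLength. Here's a rough sketch of that pattern. The HRESULT constants are stand-ins for the Windows ones, CopyToCallerBuffer and GetDocFormatSketch are hypothetical names, "text/html" is just an example value, and two things are assumptions on my part: that the reported length excludes the terminating null, and that a too-small buffer should produce the "insufficient buffer" error.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cwchar>

// Stand-ins for the Windows definitions, so the sketch builds without the SDK.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT E_POINTER = static_cast<HRESULT>(0x80004003u);
// Value of HRESULT_FROM_WIN32(ERROR_INSUFFICIENT_BUFFER).
// Assumption: this is a sensible error for a buffer that is too small.
constexpr HRESULT E_BUFFER_TOO_SMALL = static_cast<HRESULT>(0x8007007Au);

// Hypothetical helper: copies `value` into a caller-supplied buffer that can
// hold dwSize characters, and reports the number of characters written.
inline HRESULT CopyToCallerBuffer(const wchar_t* value, wchar_t* buffer,
                                  std::uint32_t dwSize, std::uint32_t* pdwLength) {
    if (value == nullptr || buffer == nullptr || pdwLength == nullptr) {
        return E_POINTER;
    }
    const std::size_t length = std::wcslen(value);
    // The buffer needs room for the characters plus the terminating null.
    if (length + 1 > dwSize) {
        return E_BUFFER_TOO_SMALL;
    }
    std::wmemcpy(buffer, value, length + 1);
    // Assumption: the reported length excludes the terminating null.
    *pdwLength = static_cast<std::uint32_t>(length);
    return S_OK;
}

// What a GetDocFormat body might boil down to for content served as HTML.
inline HRESULT GetDocFormatSketch(wchar_t* wszDocFormat, std::uint32_t dwSize,
                                  std::uint32_t* pdwLength) {
    return CopyToCallerBuffer(L"text/html", wszDocFormat, dwSize, pdwLength);
}
```

GetFileName could reuse the same helper with a UNC path in place of the MIME type.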
The final two methods alter how the indexing occurs - GetRedirectedURL and IsDirectory.
GetRedirectedURL allows you to return the actual url that should be used while indexing. In other words, if you have a document at a url that gets redirected, this allows you to tell the indexer that a) it's been redirected, and b) any relative links that your IFilter emits are to be resolved against the redirected url. I don't know if this causes the previously stored url to be updated.
IsDirectory tells the indexer that the current url represents a directory. Surprising that. This means the indexer will treat any emitted child urls as being in this folder. Handy for using the "in:" and "under:" search syntax. (Think of searching in Outlook - "in:Inbox", "in:myfolder", "under:trash").
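One wrinkle with IsDirectory is that it has no out parameter, so the answer has to travel in the HRESULT itself. As I read the docs, that means S_OK for "yes, it's a directory" and S_FALSE for "no". The sketch below shows the shape of it. The HRESULT values are stand-ins for the Windows ones, and the rss:// urls and the trailing-slash rule are purely hypothetical - I haven't settled on a url scheme yet.

```cpp
#include <cassert>
#include <cstdint>
#include <string>

// Stand-ins for the Windows definitions, so the sketch builds without the SDK.
using HRESULT = std::int32_t;
constexpr HRESULT S_OK = 0;
constexpr HRESULT S_FALSE = 1;

// Hypothetical rule: a url ending in '/' names a folder (say, a whole feed),
// anything else names a single item.
inline bool LooksLikeFolder(const std::wstring& url) {
    return !url.empty() && url.back() == L'/';
}

// The yes/no answer is carried by the return code: S_OK or S_FALSE.
inline HRESULT IsDirectorySketch(const std::wstring& url) {
    return LooksLikeFolder(url) ? S_OK : S_FALSE;
}

// Mirrors the SUCCEEDED() macro: any non-negative HRESULT counts as success.
inline bool Succeeded(HRESULT hr) {
    return hr >= 0;
}
```

Note that S_FALSE is still a success code, so a SUCCEEDED-style check can't tell the two answers apart - the caller has to compare against S_OK explicitly.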
So that's IUrlAccessor. Doesn't look too tricky. I think the next thing to look at will be returning links from IFilter - this is how we're going to crawl the whole store. Then it'll have to be how to represent the RSS feed store as urls. Hopefully then I'll be able to get at some code - although threading might rear its ugly head...