Indexing files through the IFilter interface is a pretty old Windows technology, but it is still an important, integral part of several search products such as MOSS and Desktop Search. The IFilter interface scans documents for text and properties (also called attributes). It extracts chunks of text from these documents, filtering out embedded formatting and retaining information about the position of the text. More about IFilter can be found here. My intention in this post is to show how to obtain the plain text from any document. This use case may sound a bit strange, but it is very useful in scenarios where you are responsible for providing the text to be indexed, for example when you are acting as a web service in a BDC LOB application. In this case the BDC engine invokes your web service method (here and here are some interesting examples).
The idea is to write a method which receives the name of the file as an input parameter and returns the plain text. This method could look like:
public string DoIndexing(string file);
To make this work, it is necessary to find out which IFilter is responsible for the target file extension. For example, if file = "someFile.doc", the method has to look in the registry for the filter responsible for files with the .doc extension. To do this, we first have to look for the extension in the registry, as shown in the picture below:
After that, the Default value of the key PersistentHandler contains the GUID of the related COM component. The next picture shows the registration of this component with the UUID = {98de59a0 * 2794}. Under this component, open the key PersistentAddinsRegistered, click on the key below it, and look at its Default value, which is F07F3920-78BC-11CF-9BE8-00AA004B9986. This value is exactly what we are looking for: the CLSID of the IFilter responsible for .DOC documents.
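The two registry hops described above can be sketched roughly as follows. This is only a minimal illustration, not the getFilterUuid used in this post: the names FilterLookup, ResolveFilterClsid and readDefault are hypothetical, and the registry access is passed in as a callback (assumed to return the Default value of a key path below HKEY_CLASSES_ROOT, or null if the key is missing) so that the lookup logic itself stays independent of the machine it runs on. The sketch also assumes that the key below PersistentAddinsRegistered is named after the interface ID of IFilter.

```csharp
using System;

public static class FilterLookup
{
    // Interface ID of IFilter. The sketch assumes the subkey below
    // PersistentAddinsRegistered carries this name.
    public const string IFilterIid = "{89BCB740-6119-101A-BCB7-00DD010655AF}";

    // Hypothetical helper: resolves the CLSID of the filter registered for an extension.
    // readDefault is an assumed callback that returns the Default value of a key path
    // below HKEY_CLASSES_ROOT, or null when the key does not exist.
    public static Guid? ResolveFilterClsid(string extension, Func<string, string> readDefault)
    {
        if (string.IsNullOrEmpty(extension)) return null;
        if (!extension.StartsWith(".", StringComparison.Ordinal)) extension = "." + extension;

        // Step 1: <extension>\PersistentHandler -> GUID of the persistent handler.
        // The value is assumed to be stored with braces, as it appears in the registry.
        string handler = readDefault(extension + @"\PersistentHandler");
        if (string.IsNullOrEmpty(handler)) return null;

        // Step 2: CLSID\<handler>\PersistentAddinsRegistered\<IID of IFilter> -> filter CLSID.
        string clsid = readDefault(@"CLSID\" + handler + @"\PersistentAddinsRegistered\" + IFilterIid);

        Guid result;
        if (Guid.TryParse(clsid, out result)) return result;
        return null;
    }
}
```

On Windows, the callback could for example be backed by Microsoft.Win32.Registry.ClassesRoot, opening the given subkey and reading its unnamed (Default) value. Returning null instead of throwing leaves it to the caller to decide what to do with extensions that have no registered filter.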
The following code snippet shows how to create an instance of the filter once the UUID of the IFilter has been found.
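// Look up the filter CLSID registered for the file's extension (e.g. ".doc").
Guid filterUuid = getFilterUuid(new FileInfo(file).Extension);
// Resolve the COM type for that CLSID.
Type testFilterType = Type.GetTypeFromCLSID(filterUuid);
// Create the COM object; with BindingFlags.CreateInstance the member name is ignored.
IFilter iFilter = (IFilter)testFilterType.InvokeMember("_ctor",
    System.Reflection.BindingFlags.CreateInstance,
    null, null, null);
// Cast the same object to IPersistFile, which is needed for indexing.
System.Runtime.InteropServices.ComTypes.IPersistFile ipf =
    (System.Runtime.InteropServices.ComTypes.IPersistFile)iFilter;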
This code looks up the UUID by calling getFilterUuid. Then it creates the Type instance for the filter and finally invokes the constructor on that type.
To perform indexing, it is necessary to cast the IFilter object to IPersistFile.
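As a rough sketch of what this cast is used for: IPersistFile offers a Load method that hands the document to the filter before any text is read from it. The helper below is hypothetical (FilterLoader and TryLoadFile are not part of this post's code) and assumes the filter object supports IPersistFile at all, which may not hold for every filter.

```csharp
using System;
using System.Runtime.InteropServices.ComTypes;

public static class FilterLoader
{
    // STGM_READ from the COM storage-mode flags: open the file read-only.
    private const int StgmRead = 0x00000000;

    // Hypothetical helper: hands the file to the filter through IPersistFile.
    // Returns false when the object does not expose IPersistFile.
    public static bool TryLoadFile(object filter, string path)
    {
        // For a COM object, the 'as' cast queries for the interface
        // and yields null if it is not supported.
        IPersistFile persistFile = filter as IPersistFile;
        if (persistFile == null) return false;

        persistFile.Load(path, StgmRead);
        return true;
    }
}
```

With the iFilter instance created above, a call would look like FilterLoader.TryLoadFile(iFilter, file). Using 'as' instead of a direct cast avoids an InvalidCastException for filters that do not offer IPersistFile.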
I'm not going to describe exactly how to perform the filtering, but there is a very good post on Andrew's blog.
Posted Jan 19 2008, 04:30 PM by Damir Dobric