Working with the index programmatically

SearchUnit comes with a comprehensive and user-friendly Index Management tool which allows you to work with many aspects of the index and import documents from a variety of sources.

However, there may be times when you want to work with the index programmatically, for instance to add and remove documents incrementally.

Importing An Entire Source

'Importing' a website, file-system folder, database or DataSet means that the indexer scans for all available documents, pages and data, and indexes everything that matches the import criteria. Reimporting causes the indexer to rescan the source for changes where possible; otherwise it reindexes everything.

To import programmatically, use the appropriate Import method in DocumentIndex:

C# Code Example:

DocumentIndex documentIndex = new DocumentIndex(configuration);
try {
    //import a website
    documentIndex.ImportWebsite(startURL);
    //or like this
    documentIndex.Import(new WebsiteBasedIndexableSourceRecord(startURL, pathMatchesToBeIgnored, pathMatchesToBeIncluded));

    //or import a file system folder
    documentIndex.ImportFileSystemFolder(localFolderPath, virtualPath, targetMatchList, ignoreMatchList, recurseSubFolders);

    //or import a database
    documentIndex.ImportDatabase(sourceType, connectionString, sqlQuery, uniqueColumnName, resultUrlFormat);

    //or import a DataSet (from an assembly)
    documentIndex.ImportCustomDataSet(assemblyFilePath, fullClassName, uniqueColumnName, resultUrlFormat);
} finally {
    //always close the index, even if an import throws
    documentIndex.Close();
}

Adding One Document

Instead of importing an entire source, it is possible to add documents/data to the index incrementally. This is ideal for updating the index as documents are created/uploaded.

C# Code Example:

DocumentIndex documentIndex = new DocumentIndex(configuration);
try {
    documentIndex.AddDocument(new Document("http://some/URL/document", configuration));
} finally {
    documentIndex.Close();
}

Note that AddDocument may take a noticeable amount of time to complete. The actual time depends on many factors, including machine load, document size and type, index size, and whether the index is due for optimization. It is therefore not advisable to call it directly from a web application, because the page doing the indexing will not return to the user until AddDocument has finished.

Asynchronous Adding (.NET 2.0 and later)

Adding to the index asynchronously allows your code to return immediately (e.g. so that a web application's document upload page does not keep the user waiting), while the document is queued to be added to the index in the background as soon as possible.

To do this, use the AsynchronousQueue class (in the Keyoti.SearchEngine.Index namespace), which queues AddDocument operations and runs them in their original order.

AsynchronousQueue uses its own instance of DocumentIndex, and creates and closes that instance as necessary. It is therefore important not to have another instance of DocumentIndex open on the same index directory while there are items in the queue.

C# Code Example:

//...this code could be called in a button event handler in a web page for example

EventHandler finished = delegate(object sender, EventArgs e)
{
    //at this point the index directory is unlocked and there are no more items waiting to be added to the index.
};

AsynchronousQueue.QueueForIndexing(new Document("http://someURL/somepage.aspx", Configuration), finished);
AsynchronousQueue.QueueForIndexing(new Document("http://someURL/somepage2.aspx", Configuration), finished);

Removing One Document

Use the RemoveDocument method in DocumentIndex to remove a document from the index. The document URL must exactly match the URL already in the index. Pay particular attention to trailing slashes (e.g. http://localhost/) and make sure that any spaces are encoded as %20.
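One way to reduce mismatches is to put every URL through the same normalization step before it is added or removed. The sketch below uses only the standard System.Uri class, not a SearchUnit API. The IndexUrl class and its Normalize method are hypothetical names, and the approach assumes that the URLs in your index were stored in the canonical form that System.Uri produces, which is worth checking against a few entries in the Index Management tool.

```csharp
using System;

// Hypothetical helper (not part of SearchUnit): canonicalizes a URL with System.Uri.
public static class IndexUrl
{
    // Returns the absolute, escaped form of the URL as computed by System.Uri.
    // Spaces become %20 and a bare host gains its trailing slash.
    public static string Normalize(string url)
    {
        return new Uri(url.Trim()).AbsoluteUri;
    }

    public static void Main()
    {
        Console.WriteLine(Normalize("http://localhost"));                  // http://localhost/
        Console.WriteLine(Normalize("http://localhost/docs/my file.pdf")); // http://localhost/docs/my%20file.pdf
    }
}
```

The normalized string can then be used to build the Document that is passed to RemoveDocument. Note that System.Uri only adds a trailing slash to a bare host; it does not turn http://localhost/docs into http://localhost/docs/, so folder-style URLs still have to be written exactly as they were indexed.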

Removing a 'document' that originated in a DB

When a row is imported from a DB, we create our own URI for it. To delete that row/document, you need to recreate the URI.

C# Code Example:

IndexableSourceUri uri = new IndexableSourceUri(1, "d4", "col1");
//where 1 is the IndexableSource ID (see below)
//"d4" is the value in the unique field, that identifies the row to delete
//"col1" is the name of the unique field

documentIndex.RemoveDocument(new Document(uri.UriInstance.AbsoluteUri, Configuration));

In the above, the data was originally imported from a query that returned rows like these:

	col1	data
	-------------
	a1	blah
	b2	some
	c3	empty
	d4	more
	

so the code will remove the last row (d4) from the index. The indexable source ID can be obtained with code like this:

C# Code Example:

ArrayList recs = documentIndex.GetIndexableSourceRecords();
int sourceId = (recs[0] as IndexableSourceRecord).ID; //assumes ID is an int, as in the IndexableSourceUri call above

This assumes that the first record is the one you need. Otherwise, iterate through 'recs' and inspect the Query or Location properties to find the right one.

Adding Data Directly As Strings

It is possible to add 'documents' to the index that are defined by strings only. In other words, data can be indexed without having to reside in an actual document, page or database. This can be useful in scenarios such as the following:

  • Searchable content doesn't match actual content (e.g. extra description or meta info)
  • Indexing database content in a 'push' fashion (unlike Importing which is 'pull') - useful for incremental DB indexing
  • Indexing content from a custom parser
  • Indexing meta info for audio/video files
  • Whenever it's desirable to have full control over what is indexed and when

To do this, use the PreloadedDocument class. It is a simple class to which you pass the 'URI' that will identify the indexed data/document, along with its title, text and custom data, all as strings.

C# Code Example:

documentIndex.AddDocument(new PreloadedDocument(new Uri(uri), title, text, summary, null, null, null, customData, configuration));
Where:
  • 'uri' is the real or fictitious Uri of the 'document' - this can point to an actual document or just be used as an arbitrary identifier for the indexed data
  • 'title' is the title of the document as a string; it is searchable by the user
  • 'text' is the text body, this is searchable by the user
  • 'summary' is used for the result summary if a 'static' summary type is selected in the configuration (otherwise the result summary is generated from the text content based on hits)
  • The three null arguments (Nothing in VB.NET) are, respectively, the content category list, the location category name and the security group list (please see the API docs)
  • 'customData' is any CustomData to be added to the document record
  • 'configuration' is the usual configuration object, as was used to create DocumentIndex

Removing A PreloadedDocument

To remove a 'document' added with PreloadedDocument, use documentIndex.RemoveDocument, passing in the same Uri that the document was created with.
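Because removal depends on supplying the same Uri again, it can help to derive that Uri from the data's own key rather than storing it separately. The following is a minimal sketch using only standard .NET. The PreloadedUri class, the ForRow method and the http://data.example/ base address are all hypothetical and chosen for illustration; any stable, unique scheme would serve.

```csharp
using System;

// Hypothetical helper (not part of SearchUnit): builds a repeatable identifier
// Uri for string-only 'documents', e.g. rows pushed from a database.
public static class PreloadedUri
{
    // Arbitrary base address used purely as an identifier prefix (assumption).
    private const string BaseAddress = "http://data.example/";

    // The same table name and key always yield the same Uri, so the Uri used
    // when adding can be rebuilt later when removing.
    public static Uri ForRow(string table, string key)
    {
        return new Uri(BaseAddress + Uri.EscapeDataString(table) + "/" + Uri.EscapeDataString(key));
    }

    public static void Main()
    {
        Uri whenAdding = ForRow("products", "d4");
        Uri whenRemoving = ForRow("products", "d4");
        Console.WriteLine(whenAdding.AbsoluteUri);            // http://data.example/products/d4
        Console.WriteLine(whenAdding.Equals(whenRemoving));   // True
    }
}
```

With a scheme like this, the code that adds a row and the code that later removes it only need to agree on the table name and key value.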

More about programmatic usage

Please see the product Help documentation for further examples, including VB.NET code samples and programmatic searching.