Note: This scenario is based on SearchUnit version 3

Requirements

  • A document retrieval web-site with over 100,000 'raw' documents in multiple formats (.doc, .pdf, .rtf etc) that requires search functionality.
  • Each 'raw' document has a wrapper ASPX page (documentinfo.aspx), which uses "FlashPaper" to show the document, and contains meta information in HTML meta-tags (keywords and description). There is also a download page (download.aspx).
  • The documents are behind "Forms Authentication".
  • As documents are uploaded, the index must be updated nightly.

Implementation

Importing The Documents (PDF, DOC, RTF etc) - Choosing A Strategy

The server contains a folder holding all of the 'raw' documents. There is also an ASPX download page which takes a document ID and returns the actual document as a download, e.g. download.aspx?id=NNN

How these are imported depends on exactly what should be searchable. For example, there are three immediately obvious ways it could work:

A. The user can search inside the 'raw' document (e.g. the PDF), but when the result is shown, it has the URL for the wrapper: documentinfo.aspx?id=NNN
(This means that documentinfo.aspx is not actually indexed, but its URL is generated when the results are shown - this requires an efficient way to determine the documentinfo.aspx?id=NNN URL from the document filename.)

B. The user can search ONLY the content in documentinfo.aspx?id=NNN, which limits the search to the document description, title etc.
(This means that the original document is not indexed.)

C. The user can search both the actual document (e.g. the PDF) and documentinfo.aspx
(This means everything is indexed, and the results will have links to BOTH the original PDF and documentinfo.aspx.)

In this scenario, the user chose option C. The downside of this choice (two potential results for each document) can easily be minimized by filtering the results to change raw document URLs to the wrapper ASPX page, as sketched below.
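
For illustration, here is a minimal sketch of that filtering step. This is plain C#, not SearchUnit API code; the class name and URL pattern are assumptions, and in practice the ID-to-filename data would come from the site's document database.

// Hypothetical helper - maps raw document URLs in search results to their
// documentinfo.aspx wrapper URLs, so each document appears only once.
using System;
using System.Collections.Generic;

class ResultUrlFilter
{
    readonly Dictionary<string, string> wrapperUrlByFileName;

    // 'documents' maps each document ID to its raw file name, e.g. 123 -> "report.pdf".
    public ResultUrlFilter(IDictionary<int, string> documents)
    {
        wrapperUrlByFileName = new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase);
        foreach (KeyValuePair<int, string> pair in documents)
            wrapperUrlByFileName[pair.Value] = "documentinfo.aspx?id=" + pair.Key;
    }

    // Returns the wrapper URL for a raw document URL, or the URL unchanged if it
    // is not a raw document (e.g. it is already a documentinfo.aspx result).
    public string Rewrite(string resultUrl)
    {
        string fileName = System.IO.Path.GetFileName(new Uri(resultUrl).LocalPath);
        string wrapperUrl;
        return wrapperUrlByFileName.TryGetValue(fileName, out wrapperUrl) ? wrapperUrl : resultUrl;
    }
}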

The 'raw' documents are imported using the "FileSystemDocumentStore", which scans file folders. To import all of the wrapper (documentinfo.aspx) pages, there are two options:

A. Create a 'map' page which has links to each document, e.g. "alldocuments.aspx", containing a list of links:

<a href="documentinfo.aspx?id=ZZZ">Document ZZZ</a>
<a href="documentinfo.aspx?id=XYZ">Document XYZ</a>
.....

Then import that page as a Web-Site source. When there are new documents to index, just reimport the web-site.

As the page will have 100,000 links, the import might take a while to show any progress, but this is OK.
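
As a rough sketch of how such a map page could be generated (this is generic ASP.NET, not SearchUnit code; GetAllDocumentIds is a hypothetical data-access helper standing in for a database query):

// alldocuments.aspx.cs - hypothetical code-behind that writes one link per
// document, producing the list of anchors the indexer will follow.
using System;
using System.Collections.Generic;
using System.Web.UI;

public partial class AllDocuments : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        foreach (int id in GetAllDocumentIds())
        {
            Response.Write("<a href=\"documentinfo.aspx?id=" + id + "\">Document " + id + "</a><br/>");
        }
    }

    // Placeholder - the real site would query its document database here.
    IEnumerable<int> GetAllDocumentIds()
    {
        for (int id = 1; id <= 100000; id++) yield return id;
    }
}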

B. Write some code to add the documents programmatically:

// Point the configuration at the index files on disk.
Configuration configuration = new Configuration();
configuration.IndexDirectory = "c:\\my\\indexdirectory";

// Open the index and queue each wrapper page for indexing.
DocumentIndex d = new DocumentIndex( configuration );
d.Open();
d.AddDocument( new Document( new DocumentRecord( "http://localhost/documentinfo.aspx?id=ZZZ"), configuration));
d.AddDocument( new Document( new DocumentRecord( "http://localhost/documentinfo.aspx?id=XYZ"), configuration));
d.Close();

Updating The Index

After the initial import and build, the index will need to be updated, either periodically or as documents are added to the web-site.

Periodic Updates

Using the Pro version of the product, configure the Windows Service to reimport and rebuild at some interval (which can be any number of hours, days, weeks etc).

Update As Documents Are Added

The upside of this approach over "Periodic Updates" is that it makes new documents searchable almost immediately. The downside is that, in the current versions, each indexing operation carries an overhead (proportional to the size of the index), so it is not as efficient as "Periodic Updates". To do this, use the following code:

Configuration configuration = new Configuration();
configuration.IndexDirectory = "c:\\my\\indexdirectory";

DocumentIndex d = new DocumentIndex( configuration );
d.Open();
// The two boolean arguments: index the document now, and mark the index as valid for searching.
d.AddDocument( new Document( new DocumentRecord( "http://localhost/documentinfo.aspx?id=ZZZ"), configuration), true, true);
d.Close();

The two "true" parameters mean, respectively, index now, and mark the index as valid for searching.

In this case, the user chose to use the Windows Service to update nightly.

Indexing Meta Tag Information

Please see this knowledge base article.

Forms Authentication

For the standard approach to authenticated sites, please see the related knowledge base article.

(The indexer does work with cookies, so if one is set, the indexer sends it back on each request.)
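
To illustrate how that cookie gets set in the first place, here is a minimal sketch of a forms-authentication login page. This is standard ASP.NET (System.Web.Security), not SearchUnit-specific code, and the credential check is a placeholder:

// login.aspx.cs - hypothetical login handler. Once the forms-authentication
// cookie is issued, the indexer returns it on every subsequent request.
using System;
using System.Web.Security;
using System.Web.UI;

public partial class Login : Page
{
    protected void Page_Load(object sender, EventArgs e)
    {
        // Placeholder check - a real site validates against its user store.
        if (Request.Form["user"] == "indexer" && Request.Form["pass"] == "secret")
        {
            // Issues the authentication cookie and redirects to the page
            // originally requested; 'false' = session (non-persistent) cookie.
            FormsAuthentication.RedirectFromLoginPage("indexer", false);
        }
    }
}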

Troubleshooting

During the implementation, an issue was discovered with the user's document download page (download.aspx). It specified the MIME type of every document as "ContentType=application/octet-stream"; whilst this is OK for most browsers, the search engine requires the specific MIME type for each document, e.g. application/pdf. Correcting the download page fixed this.
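
As a sketch of the corrected page (the file lookup is a placeholder; the point is the ContentType handling), the download page can map each document's extension to its specific MIME type:

// download.aspx.cs - hypothetical corrected version, serving each document
// with its specific MIME type instead of application/octet-stream.
using System;
using System.Collections.Generic;
using System.IO;
using System.Web.UI;

public partial class Download : Page
{
    static readonly Dictionary<string, string> MimeTypes =
        new Dictionary<string, string>(StringComparer.OrdinalIgnoreCase)
        {
            { ".pdf", "application/pdf" },
            { ".doc", "application/msword" },
            { ".rtf", "application/rtf" }
        };

    protected void Page_Load(object sender, EventArgs e)
    {
        // Placeholder - the real page resolves the path from its database.
        string path = GetDocumentPath(int.Parse(Request.QueryString["id"]));

        // The fix: send the real MIME type, falling back to octet-stream
        // only for types the indexer is not expected to parse.
        string mime;
        Response.ContentType = MimeTypes.TryGetValue(Path.GetExtension(path), out mime)
            ? mime : "application/octet-stream";
        Response.WriteFile(path);
    }

    string GetDocumentPath(int id) { return "c:\\documents\\" + id + ".pdf"; }
}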
