Lucene Web Service Documentation
Here are some useful Lucene resources:
- Introduction to Text Indexing with Apache Jakarta Lucene
- Advanced Text Indexing with Lucene
The rest of this document assumes that you have read the docs on how to use Lucene.
Creating a Lucene Index
There are three ways to create a Lucene index for use in this web service:
- Create the index on the fly by sending one or more documents to the documentindexer resource. See below for more details. Please note that this approach is NOT recommended for indexing a large body of documents because of the extra overhead created by the web service. If you're going to index a body of documents, you should use one of the approaches below.
- If you're going to create a Lucene index from an SQL database, you should use the CreateIndex.java program in the utils directory of the source distribution. Please read the comments in the code for how to use that program.
- Use Lucene directly from your program to create the index and then have the web service provide access to the index.
Using the Lucene Web Service
The Lucene Web Service implements a set of REST resources that give web-enabled access to creating and searching Lucene text indexes. The web service receives XML documents via HTTP POST commands and returns XML documents in response to HTTP GET commands. Any command that has a side effect (that is, it modifies the index) is done via a POST command -- all others are implemented using GET.
The following resources are supported by the Lucene Web Service:
Searching (resource searchresult)
GET URL: /lucene/[index name]/searchresult?maxhits=[max hits]&defaultfield=[default field]&query=[query]
This resource performs a Lucene search and returns the search results.
- [index name] -- The name of the index to search -- must be the name of an existing index.
- [max hits] -- The maximum number of search hits to return.
- [default field] -- The default field to use while searching this index. Please see the Lucene resources/documentation on the meaning of this field.
- [query] -- The Lucene query string. For details on this query string, please see the Lucene Query Syntax.
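As a minimal sketch, the GET URL described above could be assembled like this in Python. The host http://localhost:8080 and the helper name build_search_url are assumptions made for illustration; only the path layout and the maxhits, defaultfield and query parameters come from this document.

```python
from urllib.parse import quote, urlencode


def build_search_url(base_url, index_name, query, max_hits=10, default_field="artist"):
    """Return the GET URL for the searchresult resource (hypothetical helper)."""
    # Percent-encode every parameter value so that characters such as ':'
    # and spaces in the Lucene query survive the trip through the URL.
    params = urlencode(
        {"maxhits": max_hits, "defaultfield": default_field, "query": query},
        quote_via=quote,
    )
    return f"{base_url}/lucene/{quote(index_name, safe='')}/searchresult?{params}"


# The base URL is an assumption; substitute the address of your own deployment.
search_url = build_search_url(
    "http://localhost:8080", "track", "artist:portishead dummy",
    max_hits=5, default_field="album",
)
print(search_url)
```

The resulting URL can then be fetched with any HTTP client, for example urllib.request.urlopen.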
This performs a Lucene search on the given index. The XML returned will look something like this:
<?xml version="1.0"?>
<searchresults>
  <index>track</index>
  <hits total="787">1</hits>
  <searchresult>
    <document>
      <score>100</score>
      <field name="artistId">8f6bd1e4-fbe1-4f50-aa9b-94c450ec0f11</field>
      <field name="artist">Portishead</field>
      <field name="albumId">8f468f36-8c7e-4fc1-9166-50664d267127</field>
      <field name="album">Dummy</field>
    </document>
  </searchresult>
</searchresults>
The index element gives the name of the index this search was performed on. The hits element shows how many hits were returned, and the total attribute on the hits element shows how many matches were found for this search. The searchresult element can have one or more document elements, and each document element gives information for a specific document that matched the search.
For each document element, there will be a score element that gives the score for this match, from 0 to 100. This is followed by one or more field elements, each of which gives the name of the field in its name attribute and the value of the field as its content.
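The response structure described above could be read with Python's standard xml.etree.ElementTree module, roughly as follows. This is a sketch only: the function name parse_search_results and the shape of the returned dictionary are illustrative choices, and the sample text is the response shown in this section.

```python
import xml.etree.ElementTree as ET

SAMPLE_RESPONSE = """<?xml version="1.0"?>
<searchresults>
  <index>track</index>
  <hits total="787">1</hits>
  <searchresult>
    <document>
      <score>100</score>
      <field name="artist">Portishead</field>
      <field name="album">Dummy</field>
    </document>
  </searchresult>
</searchresults>"""


def parse_search_results(xml_text):
    """Turn a searchresults document into plain Python data (hypothetical helper)."""
    root = ET.fromstring(xml_text)
    hits = root.find("hits")
    documents = []
    for doc in root.iterfind("searchresult/document"):
        # Note: a dict keeps only the last value if a field name repeats.
        fields = {field.get("name"): field.text for field in doc.iterfind("field")}
        documents.append({"score": float(doc.findtext("score")), "fields": fields})
    return {
        "index": root.findtext("index"),
        "returned": int(hits.text),        # hits contained in this response
        "total": int(hits.get("total")),   # matches found for the whole search
        "documents": documents,
    }


result = parse_search_results(SAMPLE_RESPONSE)
print(result["total"], result["documents"][0]["fields"]["artist"])
```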
Indexing documents (resource documentindexer)
POST URL: /lucene/[index name]/documentindexer
To add a document to an index, POST the document to the URL above. If the index does not exist, it will be created.
- [index name] -- The name of the index to add this document to -- must be a valid directory name.
The document POSTed to the URL must be formatted like this:
<?xml version="1.0"?>
<documents analyzer="WithStopAnalyzer">
  <document>
    <field name="artist">U2</field>
    <field name="album">October</field>
    <field name="track">Gloria</field>
  </document>
</documents>
For the very first document indexed, the documents element should have the analyzer attribute defined. This defines the type of analyzer to use for this index. Once the index has been created, all subsequent calls to documentindexer will ignore the analyzer attribute. For more info on analyzers, please refer to the Lucene documentation. The following values for the analyzer attribute are valid:
- SimpleAnalyzer - Simply lowercases all of the input.
- StopAnalyzer - Lowercases the input and removes stop words.
- StandardAnalyzer - Lowercases the input, removes stop words and removes some punctuation.
- WithStopAnalyzer - Just like StandardAnalyzer, but it does not remove stop words.
The document element follows the same form as described in the searching section.
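A client could build the documents XML and the POST request along these lines. Treat this as a sketch under stated assumptions: the host http://localhost:8080, the text/xml content type and both helper names are not taken from this document, and the request is only constructed here, not sent.

```python
import urllib.request
import xml.etree.ElementTree as ET


def build_documents_xml(documents, analyzer=None):
    """Serialize a list of {field name: value} dicts as a documents element."""
    root = ET.Element("documents")
    if analyzer is not None:
        # Only honored by the service when the index is first created.
        root.set("analyzer", analyzer)
    for fields in documents:
        doc = ET.SubElement(root, "document")
        for name, value in fields.items():
            ET.SubElement(doc, "field", name=name).text = value
    # xml_declaration requires Python 3.8 or newer.
    return ET.tostring(root, encoding="utf-8", xml_declaration=True)


def build_index_request(base_url, index_name, body):
    """Prepare (but do not send) the POST to the documentindexer resource."""
    url = f"{base_url}/lucene/{index_name}/documentindexer"
    # The Content-Type header is an assumption; the document does not name one.
    return urllib.request.Request(
        url, data=body, headers={"Content-Type": "text/xml"}, method="POST"
    )


body = build_documents_xml(
    [{"artist": "U2", "album": "October", "track": "Gloria"}],
    analyzer="WithStopAnalyzer",
)
request = build_index_request("http://localhost:8080", "track", body)
print(request.get_method(), request.full_url)
```

Passing the prepared request to urllib.request.urlopen would submit it to a running service.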
Updating documents (resource documentupdater)
POST URL: /lucene/[index name]/documentupdater
This service removes one or more documents from the index and re-adds one or more documents in one operation.
<?xml version="1.0"?>
<documents>
  <document>
    <removequery defaultfield="album">artist:portishead dummies</removequery>
    <field name="artist">Portishead</field>
    <field name="album">Dummy</field>
  </document>
</documents>
This document is structured like the documentindexer XML, but it adds the new removequery element that specifies which documents to remove before adding the new document. The removequery element has a defaultfield attribute that specifies the default field to use for the query. For more details on this field, please see the searchresult resource.
Once the documents specified by removequery have been removed, the document described by the field elements will be added to the index. A documentupdater request can also contain multiple document elements inside the documents element.
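The update payload could be generated as sketched below, including the case of several document elements in one request. The helper name build_update_xml and the tuple layout of its input are illustrative assumptions; the element and attribute names follow the sample in this section.

```python
import xml.etree.ElementTree as ET


def build_update_xml(updates):
    """Build a documentupdater payload.

    updates is an iterable of (remove_query, default_field, fields) tuples,
    where fields maps field names to the values of the replacement document.
    """
    root = ET.Element("documents")
    for remove_query, default_field, fields in updates:
        doc = ET.SubElement(root, "document")
        # The removequery selects the documents to delete before re-adding.
        remove = ET.SubElement(doc, "removequery", defaultfield=default_field)
        remove.text = remove_query
        for name, value in fields.items():
            ET.SubElement(doc, "field", name=name).text = value
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


update_xml = build_update_xml([
    ("artist:portishead dummies", "album", {"artist": "Portishead", "album": "Dummy"}),
    ("artist:u2 october", "album", {"artist": "U2", "album": "October"}),
])
print(update_xml)
```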
Removing Documents (resource documentremover)
POST URL: /lucene/[index name]/documentremover
Use the documentremover resource to remove a document from an index. The XML format used for this query is almost the same as the one used for documentupdater:
<?xml version="1.0"?>
<documents>
  <document>
    <removequery defaultfield="album">artist:portishead dummies</removequery>
  </document>
</documents>
In this case, the XML has no field elements, unlike the XML used for the documentupdater resource.
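A removal payload contains only removequery elements, which a short sketch can make concrete. The helper name build_remove_xml and its list-of-pairs input are assumptions for illustration only.

```python
import xml.etree.ElementTree as ET


def build_remove_xml(queries):
    """Build a documentremover payload from (query, default_field) pairs."""
    root = ET.Element("documents")
    for query, default_field in queries:
        doc = ET.SubElement(root, "document")
        # No field elements here: the request only deletes matching documents.
        ET.SubElement(doc, "removequery", defaultfield=default_field).text = query
    return '<?xml version="1.0"?>\n' + ET.tostring(root, encoding="unicode")


remove_xml = build_remove_xml([("artist:portishead dummies", "album")])
print(remove_xml)
```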
Optimizing an Index (resource optimizer)
POST URL: /lucene/[index name]/optimizer
To optimize a Lucene index after altering it, POST an empty document to the URL above. The contents of the POST are ignored.
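Since the body is ignored, an empty POST is enough. The sketch below only prepares such a request; the host http://localhost:8080 and the helper name build_optimize_request are assumptions.

```python
import urllib.request


def build_optimize_request(base_url, index_name):
    """Prepare (but do not send) an empty POST to the optimizer resource."""
    url = f"{base_url}/lucene/{index_name}/optimizer"
    # An empty byte string gives the request a zero-length body.
    return urllib.request.Request(url, data=b"", method="POST")


optimize_request = build_optimize_request("http://localhost:8080", "track")
print(optimize_request.get_method(), optimize_request.full_url)
```

Calling urllib.request.urlopen(optimize_request) would trigger the optimization on a running service.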