GSA-JAPI: Java API for interacting with the Google Search Appliance™

Please take a moment to read the license before using this library.
This project is neither affiliated with nor sponsored by Google®.
Google® and Google Search Appliance™ are registered trademarks of Google Inc.

On this page the term "GSA" may be used informally to refer to the target "Google Search Appliance™". If you find any inaccuracies in this document, please post a message on the forum set up on SourceForge for this project.
Last updated: May 18 2006  
GSAQuery and GSAQueryTerm

Understanding why a query is modelled with these two classes helps you quickly locate the methods you need when writing a Java application that uses the GSA-JAPI library.

GSAQuery encapsulates all the parameters of an entire "Request" for a certain query. Some parameters, such as "number of results to fetch", cannot be entered in the search input field of a typical GSA HTML interface. Instead, they are held in hidden parameters of the simple search form, or set through fields that some advanced search forms provide. Methods in the GSAQuery class let you set values for such parameters.

GSAQueryTerm, on the other hand, models the "search input field" of a typical GSA HTML interface. Thus, methods that let you set qualifications such as "intitle:" and "inurl:" are in the GSAQueryTerm class.

GSAQuery query = new GSAQuery();
GSAQueryTerm term = null;
String[] termStrings = new String[]{"java", "sdk"};

// 1. equiv to 'java sdk' in search input
query.setAndQueryTerms(termStrings);

// 1. Alternative to above using GSAQueryTerm
term = new GSAQueryTerm("java sdk");
query.setQueryTerm(term);

// 2. equiv to 'java OR sdk' in search input
query.setOrQueryTerms(termStrings);

// 2. Alternative to above using GSAQueryTerm
term = new GSAQueryTerm("java OR sdk");
query.setQueryTerm(term);

// 3. equiv to '"java sdk"' in search input
query.setExactPhraseQueryTerm("java sdk");

// 3. Alternative to above using GSAQueryTerm
term = new GSAQueryTerm("\"java sdk\"");
query.setQueryTerm(term);
Advanced query terms
Some of the methods in the GSAQueryTerm class come in triplets:
setXyz, setNotXyz/setExcludeXyz, addXyz
Respectively, these set a field for inclusion, set a field for exclusion, and add a field with a boolean parameter indicating inclusion or exclusion.
  • FileType: file extension of page included in the search results
  • InTitle: term in title of a search result
  • AllInTitle: all specified terms should exist in titles of search results. When using this option, no other option should be specified on the GSAQueryTerm instance.
  • Url: term in url of a search result
  • AllInUrl: all specified terms should exist in url of search results. When using this option, no other option should be specified on the GSAQueryTerm instance.
  • Site: restrict the search to pages under this domain
  • CachedDocument: This is a special field that causes the query to return the cached version of the document at the specified url. This option should not be used for searching. No other options can be specified if this option is being used. Also, when using this option you should be using one of the search(..) methods in the GSAClient class and not any of the getGSAResponse(..) methods.
  • WebDocument: This is a special field that causes the query to return at most a single result which corresponds to the specified url. No other options can be specified if this option is being used.
Frontend
This is the value of the "client" query parameter.

(Note: My understanding is that this value can be configured by a GSA admin, so if you don't know the value for your target GSA, you should probably contact the GSA admin. Often, the value of this parameter can be found in the link to the HTML search/results page for the target GSA.)

Site collections
A GSA allows the admin to specify "collections", which are subsets of the crawled document space. When searching a GSA, you need to specify the list of target collections to search using the setSiteCollections method.

(Note: Since this value can be configured by a GSA admin, if you don't know the value for your target GSA, you should probably contact the GSA admin. Often, the value of this parameter can be found in a dropdown or radio selector on the HTML search page for the target GSA.)

Get N results instead of 10 (default)
By default, a search returns 10 results. If you need more or fewer than 10 results from a single querying method call, set the number of requested results using setMaxResults. The maximum value for this parameter is 100. To get more than 100 results, use this method together with setScrollAhead iteratively:
// fetch as many results as possible
int allResults = GSAQuery.MAX_RESULTS;

// number of iterations needed to fetch allResults (ceiling division)
int iterations = (allResults + GSAQuery.MAX_RESULTS_PER_QUERY - 1)
    / GSAQuery.MAX_RESULTS_PER_QUERY;

// request at most MAX_RESULTS_PER_QUERY results per call
query.setMaxResults(Math.min(allResults, GSAQuery.MAX_RESULTS_PER_QUERY));

// 'query' and 'client' are previously created GSAQuery and GSAClient instances
for (int i = 0; i < iterations; i++) {
    // skip the results already fetched by earlier iterations
    query.setScrollAhead(GSAQuery.MAX_RESULTS_PER_QUERY * i);
    GSAResponse response = client.getGSAResponse(query);
    // add response to list or do something with it
}
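The offsets passed to setScrollAhead can be worked out independently of the library. The following is a minimal, self-contained sketch of that arithmetic. The class ScrollPlan and its plan method are hypothetical and not part of GSA-JAPI; the page size of 100 is simply the per-query maximum stated above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GSA-JAPI: plans the (offset, count) pairs
// needed to page through a result set in chunks of at most pageSize.
final class ScrollPlan {

    private ScrollPlan() {}

    // Returns one int[]{offset, count} entry per request that has to be made.
    static List<int[]> plan(int totalWanted, int pageSize) {
        if (totalWanted < 0 || pageSize <= 0) {
            throw new IllegalArgumentException("totalWanted must be >= 0 and pageSize > 0");
        }
        List<int[]> pages = new ArrayList<>();
        for (int offset = 0; offset < totalWanted; offset += pageSize) {
            // the last page may be shorter than pageSize
            int count = Math.min(pageSize, totalWanted - offset);
            pages.add(new int[] {offset, count});
        }
        return pages;
    }

    public static void main(String[] args) {
        // 250 results at 100 per query: offsets 0, 100, 200 with counts 100, 100, 50
        for (int[] page : plan(250, 100)) {
            System.out.println("scrollAhead=" + page[0] + " maxResults=" + page[1]);
        }
    }
}
```

Each planned offset would be handed to setScrollAhead and each count to setMaxResults before the corresponding query call.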
Number of KeyMatches
KeyMatches are results pre-configured by a GSA administrator for specific queries. The XML API allows you to request 0-5 KeyMatch results for a query. To set the maximum number of KeyMatches desired, use the GSAQuery.setNumKeyMatches(byte b) method. If you do not specify the number of KeyMatches, the default is 3.
Search scope
Search scope indicates which part of the page the GSA considers when matching the query. Search scopes are defined using the Java2 enumeration pattern in the class net.sf.gsaapi.constants.SearchScope. To set the value, use the GSAQuery.setSearchScope(SearchScope s) method.
Language
A language filter returns results that match the query and are, or are not, in the specified language(s). To set a language filter, use the GSAQuery.setLanguage(String s) method. The language filter is a string, e.g. -lang_fr to mean "exclude all results in French". You can also combine language filters in an expression such as lang_en|lang_fr, meaning "include all results in English or French". Note that the GSA-JAPI does not currently provide a way to build these expressions; you need to build the expression yourself and pass the resulting string to the setLanguage method. For more details on how to write suitable language expressions, see: http://code.google.com/gsa_apis/xml_reference.html#request_subcollections
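Since the expression string has to be assembled by hand, a small helper can keep the syntax in one place. The sketch below covers only the two forms shown above (a "|"-separated inclusion list and a single "-" exclusion). The class LanguageFilters and its methods are hypothetical names, not part of GSA-JAPI, and the full expression grammar is described in the XML reference rather than here.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of GSA-JAPI: builds the two language filter
// forms shown in this section.
final class LanguageFilters {

    private LanguageFilters() {}

    // anyOf("en", "fr") yields "lang_en|lang_fr": results in any of the languages.
    static String anyOf(String... languageCodes) {
        if (languageCodes.length == 0) {
            throw new IllegalArgumentException("at least one language code is required");
        }
        List<String> parts = new ArrayList<>();
        for (String code : languageCodes) {
            parts.add("lang_" + code);
        }
        return String.join("|", parts);
    }

    // exclude("fr") yields "-lang_fr": results that are not in the language.
    static String exclude(String languageCode) {
        return "-lang_" + languageCode;
    }
}
```

The resulting string would then be passed along, for example query.setLanguage(LanguageFilters.anyOf("en", "fr")).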
Meta fields
You can request that the returned results also include meta information related to each document. To create such a request, use the GSAQuery.setFetchMetaField(String[] mfs) method. You can specify multiple meta fields by building an expression of the form author.title, which returns the values of the meta fields named "author" and "title" in a document if such fields exist.
It is also possible to filter based on meta fields. (Note: This feature is not available in release 1.0 of the GSA-JAPI.)
To do so, use the setRequiredMetaFields and setPartialMetaFields methods. Both methods take a java.util.Properties instance as an argument. The Properties instance is expected to contain key-value pairs that are matched against the meta fields in the document. The only difference between the two methods is how the value specified for a key is matched: setRequiredMetaFields performs an exact match, whereas setPartialMetaFields performs a subphrase match. For further details, refer to http://code.google.com/gsa_apis/xml_reference.html#request_meta
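As an illustration, the Properties argument for these methods might be prepared as in the sketch below. The class MetaFilterExample and the field names "author" and "department" with their values are made up for the example; which meta fields exist depends entirely on the crawled documents.

```java
import java.util.Properties;

// Illustrative only: the field names and values are invented for this example.
final class MetaFilterExample {

    private MetaFilterExample() {}

    // Builds the key-value pairs to be matched against document meta fields.
    static Properties authorAndDepartment() {
        Properties fields = new Properties();
        fields.setProperty("author", "Jane Doe");
        fields.setProperty("department", "engineering");
        return fields;
    }
}
```

Passing this instance to setRequiredMetaFields would ask for exact matches on both values, while passing it to setPartialMetaFields would ask for subphrase matches.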
Sorting results
Results can be sorted by Relevance or by Date. Unless specified otherwise, results are always sorted by relevance (most relevant first). To specify sorting by date, you need to consider two factors viz. sort-direction and mode. Sort-direction can be ascending or descending. Mode can take one of 3 values to mean one of "sort relevant results", "sort all results" and "don't sort, but return date associated with each result (document)".
These can be specified using the GSAQuery.setSortByDate(boolean asc, char mode) method. The first parameter is true if sort direction is ascending, false otherwise. The second param can be one of 'S', 'R' or 'L' to mean "sort relevant results", "sort all results" and "don't sort, but return date associated with each result (document)" respectively.
Post release 1.2 only: You can also specify that results be sorted by relevance by calling the method GSAQuery.setSortByRelevance(). The method takes no arguments since sorting by relevance does not take any options. It is provided mainly as a way to "undo" the effect of setSortByDate(..). For further details, refer to http://code.google.com/gsa_apis/xml_reference.html#request_sort
Encoding
Input and output encoding can be modified by the GSAQuery.setInputEncoding and GSAQuery.setOutputEncoding methods respectively. To better understand this, please refer to: http://code.google.com/gsa_apis/xml_reference.html#request_i18n.
Automatic Filtering
(Note: this is not available in release 1.0 of the GSA-JAPI)
Results can be filtered using a combination of 2 types of filters: "Duplicate Directory" and "Duplicate Snippet". You can specify filtering based on any, both or none of these using the GSAQuery.setFilter(Filter f) method. net.sf.gsaapi.constants.Filter is a class that follows the Java2 enumeration pattern. It has predefined values for all possible combinations of filtering.
A related issue to understand is "Directory crowding". For this and other details related to automatic filtering please refer to: http://code.google.com/gsa_apis/xml_reference.html#request_filtering.
Output Format
(Note: this is not available in release 1.1 and prior of the GSA-JAPI)
The "output" query parameter can take one of two values: xml and xml_no_dtd. The default is "xml". To set the OutputFormat, use the GSAQuery.setOutputFormat(OutputFormat f) method. net.sf.gsaapi.constants.OutputFormat is a class that follows the Java2 enumeration pattern. It has predefined values for specifying valid OutputFormat values.
Access
(Note: this is not available in release 1.1 and prior of the GSA-JAPI)
The "access" query parameter can take one of the following values: p, s and a, meaning "search public sources", "search secure sources" and "search both public and secure sources" respectively. The default is "p", or "search public sources only". To set the Access, use the GSAQuery.setAccess(Access a) method. net.sf.gsaapi.constants.Access is a class that follows the Java2 enumeration pattern. It has predefined values for specifying valid access values.
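For readers unfamiliar with the "Java2 enumeration pattern" (the typesafe enum idiom used before Java 5 introduced the enum keyword), the sketch below shows its general shape using the three access values listed above. AccessSketch is a stand-in written for this page; it is not the source of net.sf.gsaapi.constants.Access, and its constant and method names are assumptions.

```java
// Illustrative stand-in for the typesafe enum idiom; NOT the library's Access class.
final class AccessSketch {

    static final AccessSketch PUBLIC = new AccessSketch("p");
    static final AccessSketch SECURE = new AccessSketch("s");
    static final AccessSketch ALL = new AccessSketch("a");

    private final String value;

    // Private constructor: the constants above are the only instances that can exist.
    private AccessSketch(String value) {
        this.value = value;
    }

    // Value to be sent in the "access" query parameter.
    String getValue() {
        return value;
    }

    // Looks up the constant that corresponds to a raw parameter value.
    static AccessSketch fromValue(String value) {
        if ("p".equals(value)) {
            return PUBLIC;
        }
        if ("s".equals(value)) {
            return SECURE;
        }
        if ("a".equals(value)) {
            return ALL;
        }
        throw new IllegalArgumentException("unknown access value: " + value);
    }

    @Override
    public String toString() {
        return value;
    }
}
```

Because the constructor is private, a setter that accepts this type can only ever receive one of the predefined constants, which is the point of the pattern.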
Proxystylesheet and related
(Note: this is not available in release 1.2 and prior of the GSA-JAPI)
The GSA accepts a proxystylesheet parameter that lets the client specify which of a pre-installed set of XSL stylesheets should be applied to the results XML before it is delivered to the user. This is the approach used when generating the HTML results page. Normally, when using the GSA-JAPI, you should NOT have to use this parameter or the related ones. This is because the XML parser included in this API relies on the results response being in the standard GSA XML format. However, two possible uses of a server-side stylesheet are:
  1. Advanced users of the GSA may deploy sophisticated XSL stylesheets on the server to include additional information on the results in the form of custom fields.
  2. You want to use the query formulation capabilities of this API, but need the results in the standard HTML format.
Both of these uses are uncommon, so you should confirm that you really need a custom proxystylesheet before using this setter.
Other related parameters are: proxyreload and proxycustom. proxyreload is a boolean param that forces the GSA to reload the stylesheet specified by the proxystylesheet parameter. Please review the information at http://code.google.com/gsa_apis/xml_reference.html#results_xslt for more information on these parameters.
The following setters can be used to set these params using the API:
setProxystylesheet(String proxystylesheet)
setProxycustom(String proxycustom)
setProxyreload(boolean force)
GSAResponse and GSAResult
The built-in XML parser provides a Java binding for the returned XML results. A typical usage of GSAResponse and GSAResult is shown below:
TODO: CODE SAMPLE
Details of the GSAResult accessor methods (getter methods) are explained below.
Summary, Title & Url
These are the most widely used fields of the GSAResult object. They are retrieved with the getter methods getSummary(), getTitle() and getUrl().
Meta fields
The GSA response XML has a built-in extensibility mechanism that allows arbitrary name-value pairs (or meta-fields) to be associated with each result (see the <FS> tag in the GSA reference documentation). Currently, the only known meta-field is "date", which could be empty or contain the document date in YYYY-MM-DD format. Thus, to retrieve the value of the "date" meta-field, use the method GSAResult.getMeta("date"). Sample code:
TODO: CODE SAMPLE
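Because the "date" meta-field may be empty, code that consumes it needs to handle both cases. The sketch below parses a YYYY-MM-DD value defensively. It assumes Java 8 or later (for java.time) and assumes that getMeta returns the value as a String; the class MetaDates is a hypothetical helper, not part of GSA-JAPI.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of GSA-JAPI: parses the "date" meta-field value.
final class MetaDates {

    private MetaDates() {}

    // Returns the parsed date, or Optional.empty() for null, blank or malformed input.
    static Optional<LocalDate> parse(String rawDate) {
        if (rawDate == null || rawDate.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            // LocalDate.parse expects the ISO-8601 form yyyy-MM-dd
            return Optional.of(LocalDate.parse(rawDate.trim()));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

Under those assumptions, a call might look like MetaDates.parse(result.getMeta("date")), where result is a GSAResult.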
Result rating
GSA results are "ranked" or "scored" by the GSA to suggest the relative relevance of a result to the specified query term. This is returned as an integer in the range 0-10 with 10 indicating high relevance and 0 indicating low relevance.
Language & Mime-type
Language indicates/suggests the language of the content of the page as inferred by the GSA. This is returned as a two letter language code. For the complete list of language codes and their interpretations, please refer to documentation at: http://code.google.com/enterprise/documentation/xml_reference.html#request_subcollections_auto
Cached document info
Depending on the admin configuration, the GSA may cache a crawled document. The cached document, if available, can be retrieved from the GSA using a specific request format. The parameters required for this request are returned with the search results on a per-result basis. The GSAResult class provides access to these parameters.
TODO: CODE SAMPLE
Using Client delegate
The GSA-JAPI does not provide a way to perform client authorization with the Google Search Appliance (in case it is required). However, a hook is provided to allow developers to handle the client authorization but use the query formulation and response processing (Java binding) features of the API. To use this approach, you must use the GSAClient.setClientDelegate(GSAClientDelegate d) method. The argument is an interface with one method public InputStream getResponseStream(String requestUrl); that must be implemented by the developer to perform the authorization request.
Some sample code to illustrate the usage:
GSAClient client = new GSAClient("my.host.com");
GSAClientDelegate delegate = new CustomAuthDelegate();
client.setClientDelegate(delegate);

GSAQuery query = new GSAQuery();
GSAQueryTerm term = new GSAQueryTerm("java sdk");
query.setQueryTerm(term);

GSAResponse response = client.getGSAResponse(query);

Note in the above example that CustomAuthDelegate is a developer supplied class that implements the GSAClientDelegate interface. An outline implementation:
import java.io.InputStream;

class CustomAuthDelegate implements GSAClientDelegate {
    public InputStream getResponseStream(String requestUrl) {
        InputStream is = null;
        // 1. Attempt to connect to the url
        // 2. Detect if authorization is required
        // 3. Send credentials...
        // ...return resulting inputstream
        return is;
    }
}

The returned InputStream is expected to contain the standard GSA response XML.
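How the credentials are sent depends on how the target GSA is secured, which this page does not specify. Purely as an illustration, the sketch below assumes HTTP Basic authentication and uses only JDK classes; BasicAuthFetcher is a hypothetical helper that a delegate implementation could call from getResponseStream, and it is not part of GSA-JAPI.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.net.URLConnection;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Hypothetical helper, not part of GSA-JAPI. Assumes the appliance accepts
// HTTP Basic authentication, which may not match your deployment.
final class BasicAuthFetcher {

    private final String username;
    private final String password;

    BasicAuthFetcher(String username, String password) {
        this.username = username;
        this.password = password;
    }

    // Value of the HTTP "Authorization" header for Basic authentication.
    String authorizationHeader() {
        String credentials = username + ":" + password;
        return "Basic " + Base64.getEncoder()
                .encodeToString(credentials.getBytes(StandardCharsets.UTF_8));
    }

    // Opens requestUrl with the credentials attached and returns the response body.
    InputStream open(String requestUrl) throws IOException {
        URLConnection connection = URI.create(requestUrl).toURL().openConnection();
        connection.setRequestProperty("Authorization", authorizationHeader());
        return connection.getInputStream();
    }
}
```

Since getResponseStream, as declared above, does not list any checked exceptions, a delegate that calls open would need to catch the IOException itself and decide how to report the failure.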