GSA-JAPI: Java API for interacting with the Google Search Appliance™
Please take a moment to read the license before using this library.
This project is neither affiliated with nor sponsored by Google®.
Google® and Google Search Appliance™ are registered trademarks of Google Inc.
On this page the term "GSA" may be used to informally refer to the target "Google Search Appliance™". If you find any inaccuracies in this document, please post a message on the forum set up on SourceForge for this project.
Last updated: May 18 2006
GSAQuery and GSAQueryTerm
Understanding why a query is modelled using these two classes helps you quickly locate the methods you need when writing a Java application that uses the GSA-JAPI library.
GSAQuery encapsulates all the parameters of an entire "Request" for a certain query. Some parameters, like "number of results to fetch", cannot be indicated in the search input field on a typical GSA HTML interface. (These are instead held in hidden parameters in the simple search form, or some advanced search forms may allow you to set values for them.) Methods in the GSAQuery class allow you to set values for such parameters.
GSAQueryTerm, on the other hand, models the "search input field" on a typical GSA HTML interface. Thus, methods that enable you to set qualifications like "intitle:", "inurl:" etc. belong to GSAQueryTerm.
Advanced query terms
Some of the methods in the GSAQueryTerm class come in triplets. The general meaning of the three methods in a triplet is, respectively: set a field for inclusion, set a field for exclusion, and add a field with a boolean parameter indicating inclusion or exclusion.
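The triplet pattern can be illustrated with a small stand-in class. This is only a sketch: QueryTermSketch and its method names are hypothetical and are not the actual GSAQueryTerm methods, and the use of a leading "-" to mark exclusion is an assumption based on the usual search-box syntax.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in, NOT the real GSAQueryTerm class.
class QueryTermSketch {
    private final List<String> parts = new ArrayList<>();

    // 1st of the triplet: mark a term for inclusion.
    void setInTitle(String term) {
        addInTitle(term, true);
    }

    // 2nd of the triplet: mark a term for exclusion.
    void setNotInTitle(String term) {
        addInTitle(term, false);
    }

    // 3rd of the triplet: the boolean selects inclusion (true) or exclusion (false).
    void addInTitle(String term, boolean include) {
        // Assumption: exclusion is written with a leading "-".
        parts.add((include ? "" : "-") + "intitle:" + term);
    }

    // The text that would go into the search input field.
    String getValue() {
        return String.join(" ", parts);
    }
}
```

The first two methods are thin wrappers around the third, which is one plausible reason for offering all three: the boolean form is convenient when inclusion or exclusion is decided at run time.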
This is the value of the "client" query parameter.
(Notes: My understanding is that this value can be configured by a GSA admin, so if you don't know what the value for your target GSA is, you should probably contact the GSA admin. Often, it is possible to find the value of this parameter from the link to the HTML search/results page for the target GSA.)
A GSA allows the admin to specify "collections", which are subsets of the crawled document space. When searching a GSA you need to specify the list of target collections across which you want to search, using the setSiteCollections method.
(Notes: Since this value can be configured by a GSA admin, if you don't know what the value for your target GSA is, you should probably contact the GSA admin. Often, it is possible to find the value of this parameter in a dropdown or radio selector on the HTML search page for the target GSA.)
Get N results instead of 10 (default)
By default, a search returns 10 results. If you need more or fewer than 10 results from a single querying method call, set the number of requested results using setMaxResults. The maximum value for this parameter is 100. To get more than 100 results, you need to use this method together with setScrollAhead, iteratively.
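The iteration described above can be planned ahead of time. The following sketch computes the sequence of (offset, page size) pairs needed to fetch a given total. It assumes that setScrollAhead takes the number of results to skip, which this page does not state explicitly; ResultPaging is a hypothetical helper and not part of the GSA-JAPI.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of the GSA-JAPI.
class ResultPaging {
    // Maximum value accepted by setMaxResults, as stated above.
    static final int MAX_PER_REQUEST = 100;

    // Returns {offset, pageSize} pairs that together cover `total` results.
    static List<int[]> plan(int total) {
        List<int[]> pages = new ArrayList<>();
        for (int offset = 0; offset < total; offset += MAX_PER_REQUEST) {
            int pageSize = Math.min(MAX_PER_REQUEST, total - offset);
            pages.add(new int[] {offset, pageSize});
        }
        return pages;
    }
}
```

For a total of 250, the plan is three requests: (0, 100), (100, 100) and (200, 50). Under the stated assumption, each pair would be applied with setScrollAhead(offset) and setMaxResults(pageSize) before issuing the query.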
Number of KeyMatches
KeyMatches are results that a GSA administrator has pre-configured for specific queries. The XML API allows you to request 0-5 KeyMatch results for a query. To set the maximum number of KeyMatches desired, use the GSAQuery.setNumKeyMatches(byte b) method. If you do not specify the number of KeyMatches, the default is 3.
Search scope indicates what part of the page will be considered by the GSA to match the query. Search scopes are defined using the Java2 enumeration pattern in the class net.sf.gsaapi.constants.SearchScope. To set the value, use the GSAQuery.setSearchScope(SearchScope s) method.
A language filter returns results that match the query and are, or are not, in the specified language(s). To set a language filter, use the GSAQuery.setLanguage(String s) method. The language filter is actually a string, e.g. -lang_fr to mean "exclude all results in French". You can even combine language filters using expressions such as lang_en|lang_fr to mean "include all results in English or French". Note that the GSA-JAPI does not provide a way to build these expressions at present. You will need to build the expression yourself and pass the expression string to the setLanguage method. For more details on how to write suitable language expressions, see: http://code.google.com/gsa_apis/xml_reference.html#request_subcollections
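Since the expression has to be built by hand, a small helper can keep the string handling in one place. This is a minimal sketch covering only the two forms mentioned above; LanguageFilterSketch is a hypothetical class, not part of the GSA-JAPI, and more complex expressions are left to the reference documentation.

```java
// Hypothetical helper, not part of the GSA-JAPI.
class LanguageFilterSketch {
    // include("en", "fr") -> "lang_en|lang_fr" (results in any of the languages)
    static String include(String... codes) {
        StringBuilder sb = new StringBuilder();
        for (String code : codes) {
            if (sb.length() > 0) {
                sb.append('|');
            }
            sb.append("lang_").append(code);
        }
        return sb.toString();
    }

    // exclude("fr") -> "-lang_fr" (results not in that language)
    static String exclude(String code) {
        return "-lang_" + code;
    }
}
```

The resulting string is what would be passed to GSAQuery.setLanguage(String s).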
You can request that the returned results also include any meta information related to each document. To create such a request, use the GSAQuery.setFetchMetaField(String mfs) method. You can specify multiple meta fields by combining them into a single expression string.
It is also possible to filter based on meta fields. (Note: This feature is not available in release 1.0 of the GSA-JAPI.)
To do so, use the setRequiredMetaFields and setPartialMetaFields methods. Both methods take a java.util.Properties instance as an argument. The Properties instance is expected to contain key-value pairs that are matched against the meta fields in the document. The only difference between the two methods is how the value specified for a key is matched: setRequiredMetaFields performs an exact match, whereas setPartialMetaFields performs a subphrase match. For further details, refer to http://code.google.com/gsa_apis/xml_reference.html#request_meta
Results can be sorted by relevance or by date. Unless specified otherwise, results are always sorted by relevance (most relevant first). To specify sorting by date, you need to consider two factors: sort direction and mode. Sort direction can be ascending or descending. Mode can take one of 3 values, meaning "sort relevant results", "sort all results" and "don't sort, but return date associated with each result (document)".
These can be specified using the GSAQuery.setSortByDate(boolean asc, char mode) method. The first parameter is true if the sort direction is ascending, false otherwise. The second parameter can be one of 'S', 'R' or 'L' to mean "sort relevant results", "sort all results" and "don't sort, but return date associated with each result (document)" respectively.
Post release 1.2 only: You can also specify that results be sorted by relevance by calling the method GSAQuery.setSortByRelevance(). The method takes no arguments, since sort by relevance does not take any options. It is provided mainly as a way to "undo" the effect of setSortByDate(..). For further details refer to http://code.google.com/gsa_apis/xml_reference.html#request_sort
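To make the two setSortByDate arguments concrete, the sketch below validates the mode character and shows the request value it would correspond to. The "date:<direction>:<mode>:d1" layout is my understanding of the "sort" parameter from the XML reference and should be treated as an assumption; SortByDateSketch is a hypothetical class and does not show how the GSA-JAPI builds the value internally.

```java
// Hypothetical helper, not part of the GSA-JAPI.
class SortByDateSketch {
    // Builds a value for the "sort" request parameter.
    // Assumed layout: date:<A|D>:<S|R|L>:d1
    static String sortValue(boolean ascending, char mode) {
        // Only the three mode characters described above are accepted.
        if (mode != 'S' && mode != 'R' && mode != 'L') {
            throw new IllegalArgumentException("mode must be 'S', 'R' or 'L', got: " + mode);
        }
        char direction = ascending ? 'A' : 'D';
        return "date:" + direction + ":" + mode + ":d1";
    }
}
```

Rejecting an unknown mode character early gives a clearer error than sending a request the appliance may silently ignore.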
Input and output encoding can be modified by the GSAQuery.setInputEncoding and GSAQuery.setOutputEncoding methods respectively. To better understand this, please refer to: http://code.google.com/gsa_apis/xml_reference.html#request_i18n.
(Note: this is not available in release 1.0 of the GSA-JAPI)
Results can be filtered using a combination of 2 types of filters: "Duplicate Directory" and "Duplicate Snippet". You can specify filtering based on either, both or neither of these using the GSAQuery.setFilter(Filter f) method. net.sf.gsaapi.constants.Filter is a class that follows the Java2 enumeration pattern. It has predefined values for all possible combinations of filtering.
A related issue to understand is "Directory crowding". For this and other details related to automatic filtering please refer to: http://code.google.com/gsa_apis/xml_reference.html#request_filtering.
(Note: this is not available in release 1.1 and prior of the GSA-JAPI)
The "output" query parameter can take one of two values: xml and xml_no_dtd. The default is "xml". To set the output format, use the GSAQuery.setOutputFormat(OutputFormat f) method. net.sf.gsaapi.constants.OutputFormat is a class that follows the Java2 enumeration pattern. It has predefined values for specifying valid OutputFormat values.
(Note: this is not available in release 1.1 and prior of the GSA-JAPI)
The "access" query parameter can take one of the following values: p, s or a, meaning "search public sources", "search secure sources" and "search both public and secure sources" respectively. The default is "p", i.e. "search public sources only". To set the access, use the GSAQuery.setAccess(Access a) method. net.sf.gsaapi.constants.Access is a class that follows the Java2 enumeration pattern. It has predefined values for specifying valid access values.
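For readers unfamiliar with the Java2 enumeration pattern mentioned above, here is a minimal sketch of what such a class generally looks like, using the three access values as the example. AccessSketch and its constant names are hypothetical; the actual constants in net.sf.gsaapi.constants.Access may be named differently.

```java
// Hypothetical illustration of the typesafe enumeration pattern,
// NOT the real net.sf.gsaapi.constants.Access class.
final class AccessSketch {
    // The only instances that can ever exist (constant names are assumptions).
    static final AccessSketch PUBLIC = new AccessSketch("p");
    static final AccessSketch SECURE = new AccessSketch("s");
    static final AccessSketch ALL = new AccessSketch("a");

    private final String value;

    // Private constructor: callers cannot create values outside the list above.
    private AccessSketch(String value) {
        this.value = value;
    }

    // The string sent as the "access" query parameter.
    String getValue() {
        return value;
    }

    @Override
    public String toString() {
        return value;
    }
}
```

Because the constructor is private, a setter that accepts this type can only ever receive one of the predefined values, so an invalid parameter value cannot be passed by accident.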
Proxystylesheet and related
(Note: this is not available in release 1.2 and prior of the GSA-JAPI)
The GSA accepts a proxystylesheet parameter to allow the client to specify which of a pre-installed set of XSL stylesheets should be applied to the results XML before it is delivered to the user. This is the approach used when generating the HTML results page. Normally, when using the GSA-JAPI you should NOT have to use this (and the related) parameters. This is because the XML parser included in this API relies on the results response being in the standard GSA XML format. However, two possible uses of a server-side stylesheet are:
Other related parameters are: proxyreload and proxycustom. proxyreload is a boolean parameter that forces the GSA to reload the stylesheet specified by the proxystylesheet parameter. Please review the information at http://code.google.com/gsa_apis/xml_reference.html#results_xslt for more information on these parameters.
The following setters can be used to set these params using the API:
GSAResponse and GSAResult
The built-in XML parser provides a Java binding for the returned XML results. A typical usage of GSAResponse and GSAResult is shown below:
TODO: CODE SAMPLE
Details of the GSAResult accessor methods (getter methods) are explained below.
Summary, Title & Url
These are the most widely used fields of the GSAResult object. They are retrieved using the getter methods getSummary(), getTitle() and getUrl().
The GSA response XML has a built-in extensibility mechanism that allows arbitrary name-value pairs (or meta-fields) to be associated with each result (see the <FS> tag in the GSA reference documentation). Currently, the only known meta-field is "date", which could be empty or contain the document date in YYYY-MM-DD format. Thus, to retrieve the value of the "date" meta-field, use the method: GSAResult.getMeta("date") Sample code:
TODO: CODE SAMPLE
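Because the "date" meta-field may be empty, code that consumes it should not assume a parseable value. The sketch below handles only the string returned by getMeta("date") and makes no other assumption about the GSAResult API; MetaDateSketch is a hypothetical helper, and the use of java.time requires Java 8 or later.

```java
import java.time.LocalDate;
import java.time.format.DateTimeParseException;
import java.util.Optional;

// Hypothetical helper, not part of the GSA-JAPI.
class MetaDateSketch {
    // Parses a YYYY-MM-DD value; returns empty for null, blank or malformed input.
    static Optional<LocalDate> parseDocumentDate(String raw) {
        if (raw == null || raw.trim().isEmpty()) {
            return Optional.empty();
        }
        try {
            // LocalDate.parse expects the ISO format yyyy-MM-dd by default.
            return Optional.of(LocalDate.parse(raw.trim()));
        } catch (DateTimeParseException e) {
            return Optional.empty();
        }
    }
}
```

A caller would pass in the value obtained from GSAResult.getMeta("date") and check whether the returned Optional is present before using the date.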
GSA results are "ranked" or "scored" by the GSA to suggest the relative relevance of a result to the specified query term. This is returned as an integer in the range 0-10 with 10 indicating high relevance and 0 indicating low relevance.
Language & Mime-type
Language indicates the language of the content of the page, as inferred by the GSA. This is returned as a two-letter language code. For the complete list of language codes and their interpretations, please refer to the documentation at: http://code.google.com/enterprise/documentation/xml_reference.html#request_subcollections_auto
Cached document info
Depending on the admin configuration, the GSA may cache a crawled document. The cached document, if available, can be retrieved from the GSA using a specific request format. The parameters required for this request are returned with the search results on a per-result basis. The GSAResult class provides access to these parameters.
TODO: CODE SAMPLE
Using Client delegate
The GSA-JAPI does not provide a way to perform client authorization with the Google Search Appliance (in case it is required). However, a hook is provided that lets developers handle the client authorization themselves while still using the query formulation and response processing (Java binding) features of the API. To use this approach, you must call the GSAClient.setClientDelegate(GSAClientDelegate d) method. The argument is an interface with a single method, public InputStream getResponseStream(String requestUrl); which must be implemented by the developer to perform the authorized request.
Some sample code to illustrate the usage:
Note in the above example that CustomAuthDelegate is a developer supplied class that implements the GSAClientDelegate interface. An outline implementation:
The returned InputStream is expected to contain the standard GSA response XML.
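As an illustration of what a delegate could look like, here is a sketch that authenticates using HTTP Basic credentials. Several things here are assumptions: the choice of Basic authentication (your GSA may require a different scheme), the constructor arguments, and the local GSAClientDelegate declaration, which is only a stand-in copied from the method signature quoted above so that the sketch compiles without the library.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.HttpURLConnection;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.util.Base64;

// Stand-in for the library interface; remove when compiling against GSA-JAPI.
interface GSAClientDelegate {
    InputStream getResponseStream(String requestUrl);
}

// Developer-supplied delegate; HTTP Basic authentication is an assumption.
class CustomAuthDelegate implements GSAClientDelegate {
    private final String user;
    private final String password;

    CustomAuthDelegate(String user, String password) {
        this.user = user;
        this.password = password;
    }

    // Builds the value of the HTTP "Authorization" header.
    static String basicAuthHeader(String user, String password) {
        byte[] credentials = (user + ":" + password).getBytes(StandardCharsets.UTF_8);
        return "Basic " + Base64.getEncoder().encodeToString(credentials);
    }

    @Override
    public InputStream getResponseStream(String requestUrl) {
        try {
            HttpURLConnection connection =
                    (HttpURLConnection) URI.create(requestUrl).toURL().openConnection();
            connection.setRequestProperty("Authorization", basicAuthHeader(user, password));
            // The caller reads the GSA response XML from this stream.
            return connection.getInputStream();
        } catch (IOException e) {
            // The interface method declares no checked exceptions, so wrap.
            throw new UncheckedIOException(e);
        }
    }
}
```

With the library on the classpath, an instance of such a class would be handed to GSAClient.setClientDelegate(GSAClientDelegate d), after which the client obtains its response stream from the delegate.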