Search Overview

I’ll be blogging on search and Solr-related issues.  This first post reviews how search is currently implemented in OGP.  Future posts will discuss new spatial features in Solr such as the spatial data types, advanced spatial query functions and heatmaps.

Searching a set of documents consists of two separate operations: filtering and ranking.  First, we apply filters that exclude documents that are not relevant.  Second, we compute a score for each remaining document that indicates how well it matches the query.  The results are then ordered by score.  With Solr you generate a single query that incorporates both filters and scoring information.
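As a rough illustration of one request carrying both parts, here is a minimal Python sketch.  The parameter names q, fq, fl, sort and rows are standard Solr query parameters; the field names title and institution and the values passed in are assumptions made up for this example, not OGP's schema.

```python
from urllib.parse import urlencode


def build_search_params(text, institution):
    """Build one Solr request that both filters and ranks.

    'title' and 'institution' are hypothetical field names.
    """
    return {
        "q": f"title:({text})",                # scored part: drives ranking
        "fq": f"institution:{institution}",    # filter part: excludes documents, no effect on score
        "fl": "id,title,score",                # fields to return, including the score
        "sort": "score desc",                  # best match first
        "rows": 20,
    }


params = build_search_params("roads", "ExampleU")
query_string = urlencode(params)
```

The sketch does no escaping of user input, which a real client would need before placing text inside a query.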

In OGP, the simplest searches come from just moving the map.  Every zoom or pan event by the user defines a new map extent and so requires a new query incorporating both filtering and ranking.  Since it was written several years ago, OGP uses old Solr features (mostly 1.4) rather than the newer spatial features.  The query first filters out all layers that do not intersect the map extent.  This leaves potentially thousands of layers that do intersect the map.  They need to be ranked so the most appropriate layers can be presented to the user first.  Several separate factors are computed and combined to produce the spatial score.  The query ranks each relevant layer’s center and area based on how they compare to the map’s center and area.  It also generates a grid of nine points on the map and ranks each layer based on how many of these points it contains.  Finally, it gives layers completely contained within the map a small ranking boost.  Essentially, the query ranks the bounding box of each layer based on how similar it is to the bounding box of the map.
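The geometry behind these steps can be restated in plain Python.  This is only a sketch: OGP does this work inside the Solr query, and the class and function names here are hypothetical.  It assumes the nine points sit at the centers of a 3x3 division of the map extent (the post does not say where they are placed), treats box edges as inclusive, and ignores extents that cross the antimeridian.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BBox:
    min_x: float
    min_y: float
    max_x: float
    max_y: float

    def intersects(self, other):
        # Filter step: keep a layer only if its box overlaps the map extent.
        return (self.min_x <= other.max_x and self.max_x >= other.min_x and
                self.min_y <= other.max_y and self.max_y >= other.min_y)

    def contains_point(self, x, y):
        return self.min_x <= x <= self.max_x and self.min_y <= y <= self.max_y

    def contains_box(self, other):
        # Used for the small boost given to layers fully inside the map.
        return (self.min_x <= other.min_x and self.max_x >= other.max_x and
                self.min_y <= other.min_y and self.max_y >= other.max_y)


def grid_points(extent, n=3):
    """Return n*n points at the centers of an n-by-n division of the extent (assumed layout)."""
    width = extent.max_x - extent.min_x
    height = extent.max_y - extent.min_y
    return [(extent.min_x + width * (2 * i + 1) / (2 * n),
             extent.min_y + height * (2 * j + 1) / (2 * n))
            for i in range(n) for j in range(n)]


def grid_hits(map_extent, layer):
    """Count how many of the map's grid points fall inside the layer's box."""
    return sum(layer.contains_point(x, y) for x, y in grid_points(map_extent))


map_extent = BBox(0, 0, 30, 30)
layer = BBox(0, 0, 20, 20)
hits = grid_hits(map_extent, layer)  # 4 of the 9 points fall inside this layer
```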

Spatial properties drive OGP search but there are other components.  Entering one or more terms in the basic search box results in a typical Solr text search against fields such as layer name, ISO, theme and place keywords, publisher, originator and institution.  Layers that do not match the text search terms are filtered out.  Layers that do match get a score based on the quality of the match.  For each layer, this text-based score combines with the spatial scores yielding a total score.  The results are ranked by total score and presented to the user.

Combining the spatial and textual scores for each layer is basically a weighting problem.  How important, for instance, is the score based on the layer’s center compared to the score based on the layer’s area?  Alternatively, how important is matching a search term in the layer title field versus matching the term in the publisher field?  In OGP the spatial weights, in decreasing order of importance, are area match, number of grid points in layer, and center match.  Spatial terms are weighted more heavily than text-based terms.  A text match against a layer title is worth twice as much as a match in the other keyword fields.
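A weighted sum is one simple way to picture this.  The numbers below are invented for illustration and are not OGP's actual weights; they only preserve the ordering described above (area, then grid points, then center, with text below spatial and title worth twice the other keyword fields).

```python
# Hypothetical weights: only their relative order reflects the description above.
WEIGHTS = {
    "area": 3.0,      # area match: most important spatial term
    "grid": 2.0,      # number of grid points inside the layer
    "center": 1.0,    # center match
    "title": 0.5,     # text match in the layer title
    "keyword": 0.25,  # text match in other keyword fields (half of title)
}


def total_score(components):
    """Combine per-factor scores (each assumed to be in [0, 1]) into one total."""
    return sum(weight * components.get(name, 0.0)
               for name, weight in WEIGHTS.items())


perfect = total_score({name: 1.0 for name in WEIGHTS})
spatial_only = total_score({"area": 1.0, "grid": 1.0, "center": 1.0})
```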

Advanced Search lets the user apply additional filters to the document set.  For example, one can filter on the type of data (raster, vector, etc.), the institution publishing the data, or a date range.  These filters remove documents from the list of relevant documents; they do not change the scores of the documents that remain.
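Such optional filters could be turned into Solr filter clauses along the following lines.  The field names data_type, institution and year are assumptions for this sketch, not OGP's schema; the field:[a TO b] range form, with * for an open end, is standard Solr query syntax.

```python
from typing import List, Optional


def build_filters(data_type: Optional[str] = None,
                  institution: Optional[str] = None,
                  year_from: Optional[int] = None,
                  year_to: Optional[int] = None) -> List[str]:
    """Return one filter clause per option the user actually set."""
    filters = []
    if data_type:
        filters.append(f'data_type:"{data_type}"')
    if institution:
        filters.append(f'institution:"{institution}"')
    if year_from is not None or year_to is not None:
        low = "*" if year_from is None else str(year_from)
        high = "*" if year_to is None else str(year_to)
        filters.append(f"year:[{low} TO {high}]")
    return filters


clauses = build_filters(data_type="raster", year_from=1990)
```

Each clause would be sent as its own fq parameter, so every added filter narrows the result set further.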

In OGP, the Solr query is constructed on the client.  The code resides in the file solr.js.  A JSONP query is sent directly to the Solr instance.  No server-side intermediary is used, nor are there any Solr plugins.  Instead a long, complex query is sent for every search request.  Because browsers talk to Solr directly, the Solr server is exposed and it is important to protect it with a firewall.  This firewall should allow select requests from any IP address but limit update URLs to your admin machine’s IP addresses.
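The access rule itself is simple enough to state as code.  The sketch below is not a firewall configuration; it only spells out the policy in Python.  The admin address, the helper name is_allowed and the /solr/ path prefix are assumptions for the example.

```python
from ipaddress import ip_address, ip_network
from urllib.parse import urlsplit

# Hypothetical admin address (from a documentation-only range).
ADMIN_NETWORKS = [ip_network("192.0.2.10/32")]


def is_allowed(client_ip, url_path):
    """Allow select from anyone, update only from admin addresses, nothing else."""
    handler = urlsplit(url_path).path.rstrip("/").rsplit("/", 1)[-1]
    if handler == "select":
        return True
    if handler == "update":
        address = ip_address(client_ip)
        return any(address in network for network in ADMIN_NETWORKS)
    return False  # deny every other handler by default
```

Denying unknown handlers by default is a choice made for this sketch; the post only specifies the select and update rules.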