Larger Scale Search

Can OpenGeoPortal’s search scale to meet the community’s future needs? To answer this question, I investigated scaling both vertically and horizontally.

What if a Solr instance had to handle many more layers? Most OGP Solr instances currently provide access to fewer than 10,000 GIS layers, and searches typically execute in under 10 milliseconds. To test with more layers, I wrote code that creates new layers by copying my existing layers and changing each copy's LayerId field. With this approach I created a Solr instance containing over 60,000 layers. Running on a development VM with 1.5 GB of RAM and only a single core, search times increased to 200 milliseconds, which is still fast enough for interactive searching.
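The layer-cloning approach described above can be sketched roughly as follows. This is a minimal illustration, not the code actually used for the test: the `clone_layers` helper, the `_copyN` suffix scheme, and the sample documents are all assumptions. Only the LayerId field name comes from the text; the other field is included purely to show that the rest of the document is carried over unchanged.

```python
import copy


def clone_layers(layers, copies, id_field="LayerId"):
    """Return `copies` synthetic variants of every layer, each with a unique id.

    Hypothetical helper: the suffix scheme is an assumption made for this sketch.
    """
    clones = []
    for n in range(1, copies + 1):
        for layer in layers:
            clone = copy.deepcopy(layer)
            # Change only the id so each clone becomes a distinct Solr document.
            clone[id_field] = f"{layer[id_field]}_copy{n}"
            clones.append(clone)
    return clones


# Illustrative sample documents (made up for this example).
existing = [
    {"LayerId": "Tufts.Roads", "LayerDisplayName": "Roads"},
    {"LayerId": "Tufts.Rivers", "LayerDisplayName": "Rivers"},
]

synthetic = clone_layers(existing, copies=3)
print(len(synthetic), "synthetic layers created")
```

The resulting documents would then be posted to the test Solr instance alongside the originals. Because the clones share all searchable text with their source layers, a test built this way measures index size more than variety of content.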

Eventually, a single Solr instance will become a performance limitation. When there are too many layers for one machine to search quickly, the search query must be distributed across multiple machines. Fortunately, this functionality is built into Solr. The metadata is broken into multiple pieces called “shards”; each shard might hold, for example, 150,000 layers. Each shard runs on its own Solr instance, on its own server. A query is sent to one Solr instance, which shares the query with all the other shards, combines their search results, and returns them. I have run tests where OGP connects to two Solr instances. To maximize network latency, I used one server at Tufts and one at Berkeley. Although the time to perform a search slowed to 200 milliseconds, that is still fast enough.
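In Solr's classic distributed search, the instance that receives the query is told about the other shards through the `shards` request parameter, a comma-separated list of host:port/path entries. The sketch below only builds such a request URL; the host names, the query string, and the `distributed_query_url` helper are hypothetical placeholders, not the actual Tufts or Berkeley endpoints.

```python
from urllib.parse import urlencode


def distributed_query_url(coordinator, shards, query, rows=10):
    """Build a select URL asking one Solr instance to fan the query out to all shards.

    `coordinator` is the base URL of the instance that receives the request;
    `shards` lists every shard as host:port/path, without the scheme.
    """
    params = {
        "q": query,
        "rows": rows,
        "wt": "json",
        "shards": ",".join(shards),
    }
    return f"{coordinator}/select?{urlencode(params)}"


# Hypothetical hosts standing in for the two test servers.
url = distributed_query_url(
    "http://solr.tufts.example:8983/solr",
    ["solr.tufts.example:8983/solr", "solr.berkeley.example:8983/solr"],
    "roads",
)
print(url)
```

The receiving instance queries each listed shard, merges the ranked results, and returns a single response, so the client code does not need to know how many servers took part.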

Based on these tests, I estimate a single, very powerful server can probably handle over 100,000 layers. Perhaps it could handle twice that. When the limit of a single machine is reached, we can scale out to multiple shards running on multiple servers. If I had to, I imagine I could build a Solr repository to handle half a million layers.
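As a rough back-of-the-envelope check on these estimates, the number of shards needed is the total layer count divided by the per-server capacity, rounded up. The capacities below are the estimates given above, not measured limits, and `shards_needed` is a hypothetical helper.

```python
import math


def shards_needed(total_layers, layers_per_shard):
    """Smallest number of shards that can hold `total_layers`."""
    return math.ceil(total_layers / layers_per_shard)


# Half a million layers at the two estimated per-server capacities.
for capacity in (100_000, 200_000):
    print(capacity, "layers per shard ->", shards_needed(500_000, capacity), "shards")
```

Under these assumptions, a half-million-layer repository would need somewhere between three and five servers.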