
Crawling Rockland Site

Data from the Rockland County (NY) GIS site is now available at  The following image shows some of the search results after a crawl based ingest of the Rockland data site:


Even though no web services may be available, these search results can still be previewed. The following image shows the Rockland boundaries for State Senate:


Here’s a screenshot with two separate bus routes previewed.


On the Rockland County site, many bus routes are stored in a single zip file.  These are individually searchable and previewable on WorldWideGeoWeb.

Directed crawling of new sites presents new challenges and reveals limitations in the existing code.  Several changes were made to successfully crawl Rockland, NY data at .

The Rockland site does not contain links to zip files.  A typical link to a data file is  The crawl code was changed to support links to servlets rather than simple zip files.

The Rockland metadata files contain minimal, often cryptic titles; for example, “monsey2” and “TZX”.  These titles are not sufficient.  Fortunately, they can be augmented with information scraped from the crawled web page.  Specifically, the text from the anchor tag linking to the DownloadData.jsp servlet is appended to the title field in the xml metadata file.  This creates user-friendly titles, for example, “monsey2: TOR Bus Routes” and “TZX: Tappan Zee Express Bus Route”.
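The title augmentation can be sketched in a few lines of JavaScript.  This is only an illustration, not the ingest code itself: the function name augmentTitle and the ": " separator are assumptions inferred from the example titles above.

```javascript
// Hypothetical helper: append the link text scraped from the crawled page
// to the short title found in the metadata file.
function augmentTitle(metadataTitle, anchorText) {
    var title = (metadataTitle || "").trim();
    // Collapse line breaks and runs of spaces inside the scraped link text.
    var extra = (anchorText || "").replace(/\s+/g, " ").trim();
    if (extra === "") {
        return title;   // nothing was scraped, keep the original title
    }
    if (title === "") {
        return extra;   // no metadata title, use the link text alone
    }
    return title + ": " + extra;
}

console.log(augmentTitle("monsey2", "TOR Bus Routes"));
console.log(augmentTitle("TZX", " Tappan Zee\n Express Bus Route "));
```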

The Rockland site contains zip files that hold multiple shapefiles.  For example, the file contains 6 separate bus routes, each in a separate shapefile.  Each shapefile is ingested as a separate entity so it can be independently searched and previewed.  The 28 links on the Rockland data page expand into 47 searchable, previewable spatial resources.  Note that the OpenGeoPortal download operation pulls down entire zip files from the Rockland server, not the individual shapefiles.

Since the Rockland site only supports secure connections, the ingest code was enhanced to support https.

Crawling Westchester Site

At the NYC OGP meeting, I demoed my thesis site. Since there isn’t a video of it, here’s a write-up covering the same material. There are also some slides available.

The goal of my Master’s thesis is to make all the world’s spatial data accessible. This goal is accomplished by expanding OpenGeoPortal in two significant ways. First, spatial data files are discovered via web crawls and then ingested. Second, the ability to preview and download layers without needing OGC protocols was developed. This expanded version of OpenGeoPortal is on the web at

Data on was discovered by crawling the web, relying exclusively on HTTP GET requests. This is the same technique used by Google and other search engines. The WorldWideGeoWeb crawler can be instructed to crawl a specific site. Sites are searched for links to zip files. The ingest code retrieves and unzips these files. If they contain a shapefile, the bounding box is determined using the shp and prj files. Any metadata file is also parsed. Information about each discovered layer is ingested into WorldWideGeoWeb’s Solr instance.
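As an illustration of how a bounding box can be read from a .shp file, the sketch below parses the fixed 100-byte shapefile header in JavaScript. The header layout (file code 9994 at byte 0, shape type at byte 32, bounding box as four little-endian doubles starting at byte 36) comes from the ESRI shapefile specification. The function name readShpBoundingBox and the sample coordinates are hypothetical, and this is not the thesis ingest code. Reprojection using the .prj file is not shown.

```javascript
// Sketch: read the bounding box from the 100-byte header of a .shp file.
// Takes an ArrayBuffer holding at least the first 100 bytes of the file.
function readShpBoundingBox(buffer) {
    if (buffer.byteLength < 100) {
        throw new Error("a .shp header is 100 bytes long");
    }
    var view = new DataView(buffer);
    // The file code is stored big-endian and is always 9994.
    if (view.getInt32(0, false) !== 9994) {
        throw new Error("not a shapefile");
    }
    // Shape type and bounding box are stored little-endian.
    return {
        shapeType: view.getInt32(32, true),
        minX: view.getFloat64(36, true),
        minY: view.getFloat64(44, true),
        maxX: view.getFloat64(52, true),
        maxY: view.getFloat64(60, true)
    };
}

// Usage: build a synthetic header with made-up coordinates and read it back.
var header = new ArrayBuffer(100);
var writer = new DataView(header);
writer.setInt32(0, 9994, false);      // file code
writer.setInt32(32, 5, true);         // shape type 5 = Polygon
writer.setFloat64(36, -74.25, true);  // minX
writer.setFloat64(44, 41.0, true);    // minY
writer.setFloat64(52, -73.875, true); // maxX
writer.setFloat64(60, 41.5, true);    // maxY
console.log(readShpBoundingBox(header));
```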

After ingest, OpenGeoPortal’s powerful search interface allows users to quickly and easily find spatial data layers. Preview of shapefiles on the map is based on parsing and rendering shapefiles entirely in JavaScript. It does not use image tiles from GeoServer or ArcGIS Server. When the user selects a layer to preview, the browser sends a request to the server to create a temporary, server-side copy of the zip file. The URL of the zip file, which was stored in the Solr record during ingest, is used to issue an HTTP GET request that creates a local copy of the zip file. Then the file is unzipped. At this point the browser requests the .shp, .shx, .prj and .dbf elements of the shapefile. They are processed in JavaScript as binary data streams. If the data is not in a suitable projection, it is reprojected in the browser. Then the features in the shapefile are parsed and rendered on OpenGeoPortal’s map. Attributes in the .dbf file are displayed as features are moused over.

The following screenshot shows The search results were discovered by crawling Westchester County’s data web site at The map shows a preview for the layer titled “County Legislative Districts”. The browser debug panel at the bottom of the screenshot shows the network traffic generated by the preview request. The “cacheShapeFile.jsp” Ajax call told the server to copy the shapefile from using an HTTP GET and unzip the results. After the Ajax request completes, the .shx, .shp, .dbf and .prj files are requested by the browser and parsed in JavaScript. Transferring this 220 kilobyte layer first to the WorldWideGeoWeb server and then to the browser took just under 2 seconds.


The user can add any of these Westchester layers to the cart and download them. The zip files are transferred directly from the Westchester server to the browser. Clipping the data or converting it to another format is not supported.

WorldWideGeoWeb shows it is possible to build a powerful, interactive portal without requiring data holders to create web services. Data only available on web sites designed for people can be ingested using a web crawl and previewed using advanced JavaScript techniques that weren’t available when the OGC protocols were created. Since WorldWideGeoWeb is built on OpenGeoPortal, data available with web services can also be supported.


There are significant limitations with the current version of the software. Most notable is its inability to deal with large shapefiles. Currently, shapefiles over one megabyte can cause the browser to hang. Search results are color-coded to advise the user. Green layers are small and should preview quickly. Yellow layers are larger but should preview without too much delay. Layers in red represent shapefiles over a megabyte and should not be previewed. Even these large layers can be easily downloaded, just not previewed.
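The color coding can be thought of as a simple threshold on file size. In the sketch below, only the one-megabyte limit for red comes from the description above. The boundary between green and yellow (assumed here to be 250 kilobytes) and the function name previewColor are illustrative assumptions, not values taken from the thesis code.

```javascript
var ONE_MEGABYTE = 1024 * 1024;
// Assumption for illustration only: the real green/yellow boundary is not stated.
var ASSUMED_GREEN_LIMIT = 250 * 1024;

// Hypothetical helper: map a shapefile size in bytes to a preview color.
function previewColor(sizeInBytes) {
    if (sizeInBytes > ONE_MEGABYTE) {
        return "red";      // may hang the browser, download only
    }
    if (sizeInBytes > ASSUMED_GREEN_LIMIT) {
        return "yellow";   // previewable, with some delay
    }
    return "green";        // small, should preview quickly
}

console.log(previewColor(220 * 1024));
console.log(previewColor(3 * ONE_MEGABYTE));
```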

My thesis code is not production quality.

Future Directions

During a crawl, only spatial resources in shapefiles are discovered, and their associated metadata must be in FGDC or ISO 19115 format. It would be trivial to add support for KML and KMZ files. Support for other file formats and metadata standards could also be integrated.

Crawling based on OGC protocols such as GetCapabilities and CSW could be added.

The ranking of the search results could be based on the “Page Rank” of the page that linked to the zip file.

Semi-spatial data such as web pages about places could be ingested and searched spatially.

Other Notes

I now work for Voyager Search. We are investigating how some of these ideas could be incorporated into their existing products. The code created for my thesis has been released under the GPL.

Some data an organization provides may be more critical or widely used. This data could be available via an OGC compliant server while other, less critical data is made available only via HTTP GET.

AWS Tip – Server Name

There’s a problem running OGP and its associated Solr instance under Amazon Web Services (AWS).  Every time you restart or reboot your server it gets a new IP address and a new DNS name.  The OGP client and the server code both need the DNS name of the Solr instance.  This is stored in ogpConfig.json.  Here’s the Solr server information from my config file:

"search":{"serviceType": "solr", 
          "serviceAddress": "http://localhost:8983/solr/collection1/select"},

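Since the name of the Solr server is in the config file, you must change it every time you reboot or restart.  Here’s a simple change to the getServerName function in solr.js that falls back to the current host when the configured address does not start with http:// or https://: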

org.OpenGeoPortal.Solr.prototype.getServerName = function getServerName() {
    var configInfo = org.OpenGeoPortal.InstitutionInfo.getSearch().serviceAddress;
    // the configured address may be a comma separated list, use the first entry
    var elements = configInfo.split(",");
    var primaryServer = elements[0];
    if (!(primaryServer.indexOf("http://") == 0 || primaryServer.indexOf("https://") == 0)) {
        // mostly for aws where ip address changes on every reboot
        // support for no domain name in solr specification, just use current host
        primaryServer = "http://" + window.location.hostname + primaryServer;
    }
    var select = "select";
    if (primaryServer.substring(primaryServer.length - select.length) != select) {
        // here if the server name does not end in select
        primaryServer = primaryServer + "select";
    }
    console.log("solr using " + primaryServer);
    return primaryServer;
};

That takes care of the client, but another change is probably needed for the server.  Unfortunately, I don’t have a fix for that yet.
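To see what the client-side change does, here is a standalone sketch of the same logic that takes the configured address and the page’s host name as parameters, so it can be run outside the browser.  The function name resolveSolrAddress and the sample host name are made up for illustration.  With this logic, serviceAddress in ogpConfig.json can hold just a port and path, and the current host name is filled in at run time.

```javascript
// Standalone version of the host fallback, for experimentation only.
function resolveSolrAddress(serviceAddress, hostname) {
    // Use the first entry if several addresses are listed.
    var primaryServer = serviceAddress.split(",")[0];
    var hasScheme = primaryServer.indexOf("http://") === 0 ||
                    primaryServer.indexOf("https://") === 0;
    if (!hasScheme) {
        // No scheme or host in the config: assume Solr runs on the current host.
        primaryServer = "http://" + hostname + primaryServer;
    }
    var select = "select";
    if (primaryServer.substring(primaryServer.length - select.length) !== select) {
        // Append the request handler name if the address does not end with it.
        primaryServer = primaryServer + select;
    }
    return primaryServer;
}

// Hypothetical host name standing in for window.location.hostname.
console.log(resolveSolrAddress(":8983/solr/collection1/select", "ec2-host.example.com"));
```

Note that the configured value must start with the port (or a slash) and, if it does not end in "select", must end with a slash, because the pieces are joined by plain string concatenation.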

Thesis Update

One of the goals of my Master’s thesis is to make the web’s spatial data resources discoverable from a single portal.  It builds on the earlier work of the entire OpenGeoPortal team.  Two ideas supporting the thesis are discussed in the following paragraphs: crawling the web for spatial data, and parsing and rendering shapefiles in a web browser.  It should be noted that during the development of the OGC protocols, the approaches discussed below would not have been possible.  The early web had many more limitations.  Only recent improvements in browser performance and advances in the HTML5 standard allow these new approaches to be prototyped and analyzed.

If you would like to talk about or criticize these ideas, you can contact me at

Crawling the web for spatial data

Research universities employ OpenGeoPortal (OGP) to make their GIS repositories easily accessible.  OGP’s usefulness is partially a function of the data sets it makes available.  OGP, like most GIS portals, relies on a somewhat manual ingest process.  Metadata must be created, verified and ingested into an OGP instance.  This process helps assure very high quality metadata.  However this manual nature limits the rate at which data can be ingested.  If the goal is to build an index of the world’s spatial resources, a partially manual ingest process will not scale.  Instead it must be extremely automated.  To achieve this goal, OGP Ingest now supports data discovery via web crawling.

Ingest via small-scale crawling

The easiest way for an organization to make its spatial data available on the web is with zip files.  Typically, each zip file contains an individual shp file, an shx file, an xml file with metadata and any other files related to the data layer.  For example, the UN’s Office for the Coordination of Humanitarian Affairs puts its data on the web in this fashion.  OGP’s ingest process can automatically harvest this kind of spatial data site.  Via the OGP Ingest administration user interface, an OGP administrator enters the URL of a web site holding zipped shapefiles.  The OGP crawler processes the site, searching all its pages for HTML anchor links to zip files.  Each individual zip file is analyzed for spatial resources: the OGP ingest code iterates over the elements of each zip file to determine if it contains a spatial resource.  The user-specified metadata quality standard is applied to the discovered resources to determine if they should be ingested.  This allows an information-rich site with potentially thousands of data layers to be completely and easily ingested simply by providing the URL of the site.
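The link discovery step can be illustrated with a small JavaScript sketch.  The actual crawler is built on Crawler4j, so this is not the OGP code: the function name findZipLinks, the regular expression and the example URLs are all assumptions, and a regular expression is only a rough stand-in for a real HTML parser.

```javascript
// Sketch: find anchor links to zip files in a page of HTML.
// Returns absolute URLs along with the visible link text.
function findZipLinks(html, pageUrl) {
    var anchorPattern = /<a\b[^>]*\bhref\s*=\s*["']([^"']+)["'][^>]*>([\s\S]*?)<\/a>/gi;
    var links = [];
    var match;
    while ((match = anchorPattern.exec(html)) !== null) {
        // Resolve relative links against the URL of the page they were found on.
        var absolute = new URL(match[1], pageUrl);
        if (/\.zip$/i.test(absolute.pathname)) {
            links.push({
                url: absolute.href,
                // Strip any nested tags from the link text.
                text: match[2].replace(/<[^>]*>/g, "").trim()
            });
        }
    }
    return links;
}

// Made-up page content and URL, for illustration only.
var page = '<p><a href="data/roads.zip">Roads</a> ' +
           '<a href="about.html">About</a> ' +
           '<a href="/gis/Parcels.ZIP"><b>Parcels</b></a></p>';
console.log(findZipLinks(page, "http://example.org/gis/index.html"));
```

Keeping the link text alongside the URL is useful because, as with the Rockland crawl, it can later be used to enrich a cryptic metadata title.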

OGP’s ingest crawl code is built on Crawler4j.  This crawl library follows robots.txt directives.  The number of http requests per second is limited so as to not significantly increase the load on the data server.  The crawler identifies itself as OpenGeoPortal (Tufts,

Ingest via large-scale crawling

Using Crawler4j works well for crawling a single site with a few hundred pages.  However, the web is an extremely large data store and well beyond the crawl capabilities of a single server.  Instead, it must be crawled using a large cluster of collaborating servers.  A proof-of-concept system capable of crawling large portions of the web was developed using Hadoop, Amazon Web Services and’s database.  It successfully demonstrated the ability to crawl a massive number of web pages for spatial data.

Crawling based ingest versus traditional ingest

Crawling for data and metadata has several advantages over a more traditional ingest strategy.  Since crawling often finds both data and metadata, the URL for the actual data is known and it can be used for data preview and downloading.  The bounding box and other attributes of the metadata can be verified against the actual data during ingest.  A hashcode of the shapefile can be computed and used to identify duplicate data.  A hashcode of the different elements in the shapefile can be used to identify derivative data that share a common geometry file.  Crawling builds on an extensive technology base.  Small-scale crawling builds on Crawler4j; large-scale crawling can economically leverage large clusters via AWS Elastic MapReduce and’s database of web pages.  Data providers need only stand up an Apache web server; they can limit crawling of data by maintaining a robots.txt file.  Under appropriate circumstances, metadata can be enhanced using information scraped from HTML pages.  The importance of a layer can be estimated in part by the page rank of the page linking to the zip file and the overall site rank of the data server.
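The hashing idea can be sketched as follows.  This is a hedged illustration that runs under Node.js and uses its built-in crypto module.  SHA-256, the function names and the shape of the layer objects are assumptions, since nothing above says which hash function or data structures the ingest code uses.

```javascript
var nodeCrypto = require("crypto");

// Hash the raw bytes of one shapefile element (shp, shx, dbf, ...).
function hashBytes(bytes) {
    return nodeCrypto.createHash("sha256").update(bytes).digest("hex");
}

// A layer here is a plain object mapping an element name to a Buffer
// holding that file's contents (hypothetical structure).
function fingerprint(layer) {
    var result = {};
    Object.keys(layer).sort().forEach(function (name) {
        result[name] = hashBytes(layer[name]);
    });
    return result;
}

// Classify two layers as duplicates, derivatives sharing geometry, or different.
function compareLayers(a, b) {
    var fa = fingerprint(a);
    var fb = fingerprint(b);
    var names = Object.keys(fa);
    var sameFiles = names.length === Object.keys(fb).length &&
        names.every(function (name) { return fa[name] === fb[name]; });
    if (sameFiles) {
        return "duplicate";
    }
    if (fa.shp !== undefined && fa.shp === fb.shp) {
        return "shared geometry";
    }
    return "different";
}

// Tiny stand-in contents instead of real shapefile bytes.
var original = { shp: Buffer.from("geometry"), dbf: Buffer.from("attributes v1") };
var copy = { shp: Buffer.from("geometry"), dbf: Buffer.from("attributes v1") };
var derived = { shp: Buffer.from("geometry"), dbf: Buffer.from("attributes v2") };
console.log(compareLayers(original, copy));
console.log(compareLayers(original, derived));
```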

Browser based rendering of spatial data

Crawling for spatial data makes it searchable in OGP.  However, that is not enough.  OGP also provides the ability to preview and download data layers.  Traditionally, these preview functions are provided by the OGP client code making OGC WMS requests to a map data server like GeoServer.  GeoServer processes the shapefiles and delivers the requested image tiles.  Each organization providing GIS data layers serves them using an instance of an OGC-compliant data server (GeoServer), often with some additional supporting spatial data infrastructure (e.g., ESRI SDE or PostGIS).  However, data discovered by an OGP crawl resides on an HTTP server and is not known to be accessible via OGC protocols.  To make these layers previewable, code was developed to completely parse and render shapefiles in the browser.  This works in coordination with a server-side proxy.  When the user requests a layer preview, the URL for the discovered zip file is retrieved from the repository of ingested layers.  OGP browser code makes an Ajax request to the OGP server, passing the URL of the zip file.  The OGP server uses an HTTP GET request to create a local cache copy of the zip file from the original data server.  Then it responds to the Ajax request with the name of the shapefile.  The client browser then requests the shp and shx files from the OGP server, parses them as binary data and uses OpenLayers to render the result.  The parsing is provided by Thomas Lahoda’s ShapeFile JavaScript library.  It makes XMLHttpRequests for the shapefile elements and parses the result as a binary data stream.  After parsing, a shapefile is rendered by dynamically creating a new OpenLayers layer object and populating it with features.  Note that the cache copy on the OGP server is needed because JavaScript code cannot make a cross-domain XMLHttpRequest for a data file.

Advantages of browser based parsing and rendering

Browser based rendering of shapefiles has very different performance characteristics from traditional image tile rendering.  This performance has not yet been analyzed.  It is anticipated that performance for files over a couple of megabytes will be slower because the entire zip file must be copied first to the OGP server acting as a proxy and then on to the client’s web browser.  For very large data files, from dozens to hundreds of megabytes, the data transfer time would make client-side parsing and rendering impractical.  Naturally, as networks and browsers get faster, the time for this will decrease.  There are a variety of approaches that might improve performance for large files.  For example, one could use a persistent cache, creating the local cache copy during the discovery crawl phase or keeping the cache file after an initial request instead of deleting it.  One could increase performance for especially large shapefiles by generating a pyramid of shapefiles at different levels of detail when the file is put in the cache and having the client request the appropriate level.

There are several advantages to moving the parsing and rendering of shapefiles to the client.  First, the technology stack of the data server is simplified.  Traditionally, an institution serving spatial data might use GeoServer as well as ESRI SDE on top of MS SQL, all running on Tomcat and Apache.  With client-side parsing and rendering, an institution can serve spatial data using only Apache.  This simplified server infrastructure leads to reduced operational overhead.  The traditional approach requires publishing layers in GeoServer and keeping GeoServer synchronized with the spatial data infrastructure behind it.  The client parsing and rendering solution requires only keeping an HTML page with links to the zip files up to date.  A second advantage is scalability.  The traditional approach requires an application data server to convert the shapefiles to image tiles.  This centralizes a computationally significant task.  Whenever the data is needed at a different scale or with different color attributes, OGC requests are created and server load is generated.  When all parsing and rendering is done on the client, the data server is simply providing zip files.  This allows it to serve many more users.  When OGP needs to scale a layer or change its color, it does not need to contact the server.  Since the browser has the entire shapefile, it can locally compute any needed transform.  For efficiency, these transformations can leverage the client’s GPU.  Finally, since any transform is possible, it becomes possible to create a fully featured map building application in a browser.

Performance Measurement

Preliminary tests were run using the “Ground Cover Of East Timor” layer from the UN’s Food And Agriculture Organization.  The zipped shape file is at  This file was used for initial testing because it is small and also easy to find using OGP.  During this test, a Tomcat server was run on a MacBook Pro laptop connected to the Tufts University wireless network in Somerville, MA, USA.  A Chrome browser instance connected to the OGP server using localhost.  The FAO server is located in Rome, Italy.

Zip file size: 702 KB
Time to create local cache copy (on localhost): 2.9 to 3.9 seconds
Unzipped shp file size: 920 KB
Unzipped shx file size: 828 bytes
Time to transfer shp and shx files to browser and parse them: 0.3 seconds
Time to render in OpenLayers: 1.3 seconds
Total time: 4.5 to 5.5 seconds

The following screenshot shows the “Ground Cover Of East Timor” layer previewed in OpenGeoPortal.  The Chrome Developer Tools panel at the bottom shows the requests made by the browser.  The first is an Ajax request to create a local copy of the shapefile on the OGP server.  The server code makes an HTTP GET request to FAO’s server in Rome for the zip file, unzips the contents and returns the name(s) of the shapefile(s).  The next two requests in the network tab show the XMLHttpRequest calls from Thomas Lahoda’s shapefile library.  This demonstrates the client is directly processing shapefiles rather than making WMS requests.

When parsing and rendering spatial data such as shapefiles in the browser, there is an initial time penalty for transferring all the original, raw data to the browser.  This initial time is potentially offset by handling all future scaling events within the browser instead of making tile requests over the network.  The trade-off between a potentially higher initial load time and repeated requests for tiles may change as crawl-based ingest makes spatial resources from servers around the globe available in a single portal.  In the above example, only a single HTTP GET request was issued to FAO’s Rome-based server rather than repeated WMS requests to it.

MIT releasing ArcGIS tool for searching OpenGeoPortal data

MIT is releasing an ArcGIS toolbar for searching the Solr indexes and loading OGP layers directly into users’ maps.  The toolbar is a standard ArcGIS toolbar available in ArcMap.  The code constructs a Solr query based on the user’s search criteria and populates a Windows form.  MIT will release the code to the large OpenGeoPortal community later in the spring.

The toolbar only searches the MIT repository.  A new version is being developed that will search other repositories.  This version will be able to add any public data as a WFS layer.  This version should be available in May, 2013.