See http://geojson.org for the GeoJSON specification. A handy reason for converting to GeoJSON as an intermediate format is that GitHub provides native map rendering of GeoJSON files, so you get a simple map visualization without doing any extra work (and the markers are clickable, so you can explore the attributes). These files can also be downloaded and easily opened in desktop GIS clients like QGIS. http://geojson.io is another nice online tool that enables interactive creation and examination of GeoJSON data. (Note: vizer doesn't consume GeoJSON directly; it uses a different, custom JSON format.)
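For reference, the GeoJSON structure involved is small: a FeatureCollection containing Point features with a properties dict. Here's a minimal sketch in Python; the property names (igsn, material) and the coordinate values are made-up placeholders, not taken from the actual data.

```python
import json

# Minimal GeoJSON FeatureCollection with one Point feature.
# Note coordinates are [longitude, latitude], per the GeoJSON spec.
feature_collection = {
    "type": "FeatureCollection",
    "features": [
        {
            "type": "Feature",
            "geometry": {
                "type": "Point",
                "coordinates": [-77.9056, 40.6644],  # [lon, lat]
            },
            # Placeholder attributes; real files would carry sample metadata
            "properties": {"igsn": "EXAMPLE001", "material": "soil"},
        }
    ],
}

# Serialize to a .geojson file that GitHub or geojson.io can render
print(json.dumps(feature_collection, indent=2))
```

A file with this structure, committed with a .geojson extension, is what GitHub renders as a clickable map.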
- IGSN samples GeoJSON: I've examined the web service responses (requested by IGSN number; here's an example) for the Shale Hills CZO IGSNs that Megan sent me in November. A little less than half of the ~1,700 IGSNs generated XML parsing errors with the standard Python XML parser I used; my limited testing suggests these are due to invalid (not properly escaped) "<" and ">" characters in the XML responses, so I ignored those. Of the remaining ~950 IGSNs, ~400 had no latitude and longitude entries ("Not Provided"). To examine the remaining IGSN sample responses, I first converted them into a standard GeoJSON structure, before converting and subsetting to what we'll use in the initial visualization portal test/pilot. The GeoJSON file available here has all ~550 IGSNs with lat & lon entries.
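The parse-or-skip step above can be sketched roughly as follows. This is a minimal illustration, not the actual notebook code; the element names (latitude, longitude, igsn) are assumptions about the response schema, and the real responses are more deeply nested.

```python
import xml.etree.ElementTree as ET

def igsn_response_to_feature(xml_text):
    """Convert one IGSN XML response into a GeoJSON Feature dict.

    Returns None for responses that fail to parse (e.g. unescaped
    '<'/'>' characters) or that lack latitude/longitude entries,
    mirroring the skip-on-error processing described above.
    """
    try:
        root = ET.fromstring(xml_text)
    except ET.ParseError:
        return None  # invalid XML: ignore this record
    lat = root.findtext(".//latitude")
    lon = root.findtext(".//longitude")
    if lat in (None, "Not Provided") or lon in (None, "Not Provided"):
        return None  # no usable coordinates
    return {
        "type": "Feature",
        "geometry": {"type": "Point",
                     "coordinates": [float(lon), float(lat)]},
        "properties": {"igsn": root.findtext(".//igsn")},
    }
```

Collecting the non-None results into a "features" list yields the FeatureCollection that gets written out.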
- MG-RAST metagenomic GeoJSON: Started with an MG-RAST metagenome MIxS request, issued in the browser with a limit of 10,000 records (Folker had mentioned in November that there were probably ~20,000 total records from that request; the request was http://api.metagenomics.anl.gov//metagenome?verbosity=mixs&limit=100, except using limit=10000. Looks like `curl` on the shell would've worked, but I tried `wget` and didn't succeed). The MIxS sequence metadata provides a manageable amount of metadata for initial exploration. I then eliminated records with invalid latitude or longitude values (372 records), and kept only records with `country in ('USA', 'United States of America')` (5,393). The geographical distribution of locations (points all over the world, but mostly in the USA) showed that country was not the geographical location of the sample, but more likely the home base of the project PI; so I applied a rough bounding-box filter to retain only sites within the USA lower 48: `(lon > -125.68 and lon < -65.04) and (lat > 24.53 and lat < 50.06)`. The MG-RAST collection includes all kinds of genomic sequences, including ones from sources that are not of interest to the BiGCZ project (e.g., from human tissue); after exploring source, material, biome and similar "type" metadata vocabularies in the responses, I further applied a filter based on env_package_type: `env_package_type in ('air', 'built environment', 'microbial mat|biofilm', 'plant-associated', 'sediment', 'soil', 'water')`. These last two filters (bounding box and environment package type) greatly reduced the number of records, to a final total of 883. They include a substantial number of marine sites.
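The filtering sequence above (valid coordinates, country, lower-48 bounding box, environment package type) can be sketched as a small Python pipeline over the parsed records. This is an illustration under the assumption that each record is a dict with keys latitude, longitude, country, and env_package_type; the actual MG-RAST MIxS keys may differ.

```python
# Bounding box for the contiguous USA, as used in the filtering above
LOWER_48 = {"lon_min": -125.68, "lon_max": -65.04,
            "lat_min": 24.53, "lat_max": 50.06}

# Environment package types of interest to the BiGCZ project
KEEP_PACKAGES = {"air", "built environment", "microbial mat|biofilm",
                 "plant-associated", "sediment", "soil", "water"}

def valid_coords(rec):
    """True if the record has parseable, in-range lat/lon values."""
    try:
        lat, lon = float(rec["latitude"]), float(rec["longitude"])
    except (KeyError, TypeError, ValueError):
        return False
    return -90 <= lat <= 90 and -180 <= lon <= 180

def in_lower_48(rec):
    """True if the record's point falls inside the lower-48 bounding box."""
    lat, lon = float(rec["latitude"]), float(rec["longitude"])
    return (LOWER_48["lon_min"] < lon < LOWER_48["lon_max"]
            and LOWER_48["lat_min"] < lat < LOWER_48["lat_max"])

def filter_records(records):
    """Apply the four filters in the order described in the text."""
    recs = [r for r in records if valid_coords(r)]
    recs = [r for r in recs
            if r.get("country") in ("USA", "United States of America")]
    recs = [r for r in recs if in_lower_48(r)]
    return [r for r in recs if r.get("env_package_type") in KEEP_PACKAGES]
```

Applying the filters in this order reproduces the narrowing described above: invalid coordinates first, then country, then bounding box, then environment package type.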
I did these requests and processing in IPython notebooks. I can share those eventually, after I've cleaned them up; right now they're very messy.