To begin geospatial analysis, we need to understand some key underlying concepts that are unique to the field. The list isn’t long, but nearly every aspect of analysis traces back to one of these ideas.
Thematic maps
As its name suggests, a thematic map portrays a specific theme. A general reference map visually represents features as they relate geographically to navigation or planning. A thematic map goes beyond location to provide the geographic context for information around a central idea. Usually, a thematic map is designed for a targeted audience to answer specific questions. The value of thematic maps lies in what they do not show. A thematic map uses minimal geographic features to avoid distracting the reader from the theme. Most thematic maps include political boundaries such as country or state borders but omit navigational features, such as street names or points of interest beyond major landmarks that orient the reader.
The cholera map by Dr. John Snow, which was shown earlier in this chapter, is a perfect example of a thematic map. Common uses for thematic maps include visualizing health issues such as disease outbreaks, election results, and environmental phenomena such as rainfall. These maps are also the most common output of geospatial analysis. The following map from the United States Census Bureau shows monthly business applications by state:
Figure 1.12 – A modern example of a thematic map from the US Census Bureau showing the distribution of business applications by state
Thematic maps tell a story and are very useful. However, it is important to remember that, while thematic maps are models of reality just like any other map, they are also generalizations of information. Two different analysts using the same source of information will often come up with very different thematic maps, depending on how they analyze and summarize the data. They may also choose to focus on different aspects of the dataset. The technical nature of thematic maps often leads people to treat them as if they are scientific evidence. However, geospatial analysis is often inconclusive. While the analysis may be based on scientific data, the analyst does not always follow the rigor of the scientific method.
In his classic book, How to Lie with Maps (University of Chicago Press), Mark Monmonier demonstrates in detail how maps are easily manipulated models of reality that are commonly abused. This fact doesn’t degrade the value of these tools. The legendary statistician, George Box, wrote the following in his 1987 book, Empirical Model-Building and Response Surfaces:
“Essentially, all models are wrong, but some are useful.”
Thematic maps have been used as guides to start (and end) wars, stop deadly diseases in their tracks, win elections, feed nations, fight poverty, protect endangered species, and rescue those impacted by a disaster. Thematic maps may be the most useful models ever created.
Spatial databases
In its purest form, a database is simply an organized collection of information. A database management system (DBMS) is a suite of software for interacting with a database. People often use the word database as a catch-all term that refers to both the DBMS and the underlying data structure. Databases typically contain alphanumeric data and, in some cases, binary large objects (blobs), which can store binary data such as images. Most databases also support a relational structure in which entries in normalized tables can reference each other to create many-to-one and one-to-many relationships among data.
Spatial databases, also known as geodatabases, use specialized software to extend a traditional relational database management system (RDBMS) to store and query data defined in a two- or three-dimensional space. Some systems also account for a series of data over time. In a spatial database, attributes about geographic features are stored and queried as traditional relational database structures. These spatial extensions allow you to query geometries using Structured Query Language (SQL) in a similar way to traditional database queries. Spatial queries and attribute queries can also be combined to aid with selecting results based on both location and attributes.
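To make the idea of combining spatial and attribute queries concrete, here is a minimal sketch using Python's built-in sqlite3 module. It is only a stand-in for a real spatial database: the stations table, its columns, and the sample rows are hypothetical, and the location test is a plain bounding-box comparison on longitude and latitude columns. A true geodatabase would store a geometry column and expose spatial SQL functions (PostGIS, for example, provides ST_Intersects).

```python
import sqlite3

# In-memory database with a hypothetical table of monitoring stations.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE stations (name TEXT, kind TEXT, lon REAL, lat REAL)")
conn.executemany(
    "INSERT INTO stations VALUES (?, ?, ?, ?)",
    [
        # Made-up sample rows for illustration only.
        ("station_a", "rain_gauge", -90.1, 30.0),
        ("station_b", "rain_gauge", -95.4, 29.8),
        ("station_c", "tide_gauge", -90.0, 29.9),
        ("station_d", "rain_gauge", -91.2, 30.4),
    ],
)


def stations_in_box(conn, kind, min_lon, min_lat, max_lon, max_lat):
    """Return names of stations of one kind inside a bounding box."""
    sql = """
        SELECT name FROM stations
        WHERE kind = ?                 -- attribute condition
          AND lon BETWEEN ? AND ?      -- location condition (X range)
          AND lat BETWEEN ? AND ?      -- location condition (Y range)
        ORDER BY name
    """
    rows = conn.execute(sql, (kind, min_lon, max_lon, min_lat, max_lat))
    return [name for (name,) in rows]


result = stations_in_box(conn, "rain_gauge", -92.0, 29.0, -88.0, 31.0)
print(result)
```

The single WHERE clause filters on both what a feature is and where it is, which is the pattern a spatial extension generalizes to arbitrary geometries.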
Spatial indexing
Spatial indexing is a process that organizes geospatial vector data for faster retrieval. It is a way of prefiltering the data for common queries or rendering. Indexing is commonly used in large databases to speed up queries, and spatial data is no different. Even a moderately sized geodatabase can contain millions of points or objects. Without an index, a spatial query forces the system to consider every point in the database to decide whether to include it in the results or eliminate it. Spatial indexing groups data in ways that allow large portions of the dataset to be eliminated from consideration through computationally simpler checks, leaving the detailed and slower analysis for the remaining items.
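The prefiltering idea can be sketched with a simple uniform grid. This is an illustrative assumption rather than how any particular database does it (production systems commonly use tree structures such as R-trees), and the GridIndex class and its method names are hypothetical. Points are bucketed by grid cell, so a query only runs the exact test on points in cells that overlap the search box.

```python
from collections import defaultdict


class GridIndex:
    """A toy spatial index that buckets points into square grid cells."""

    def __init__(self, cell_size):
        self.cell_size = cell_size
        self.cells = defaultdict(list)

    def _cell(self, x, y):
        # Floor division maps a coordinate to the cell that contains it.
        return (int(x // self.cell_size), int(y // self.cell_size))

    def insert(self, x, y, item):
        self.cells[self._cell(x, y)].append((x, y, item))

    def query(self, min_x, min_y, max_x, max_y):
        """Return items whose point lies inside the bounding box."""
        cx0, cy0 = self._cell(min_x, min_y)
        cx1, cy1 = self._cell(max_x, max_y)
        hits = []
        # Cheap check: only visit cells that overlap the query box.
        for cx in range(cx0, cx1 + 1):
            for cy in range(cy0, cy1 + 1):
                # Detailed check: test each remaining point exactly.
                for x, y, item in self.cells.get((cx, cy), []):
                    if min_x <= x <= max_x and min_y <= y <= max_y:
                        hits.append(item)
        return hits


index = GridIndex(cell_size=10.0)
for name, x, y in [("a", 1, 1), ("b", 5, 5), ("c", 25, 25), ("d", 95, 95)]:
    index.insert(x, y, name)

found = index.query(0, 0, 10, 10)
print(sorted(found))
```

In this query, the points in the cells around (25, 25) and (95, 95) are never examined at all, which is where the time savings come from on large datasets.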
Metadata
Metadata is defined as data about data. Accordingly, geospatial metadata is data about geospatial datasets that provides traceability for the source and history of a dataset, as well as a summary of the technical details. Metadata also provides long-term preservation of data by way of documenting the asset over time.
Geospatial metadata can be represented by several possible standards. One of the most prominent is the international standard ISO 19115-1, which includes hundreds of potential fields to describe a single geospatial dataset. Some example fields include spatial representation, temporal extent, and lineage. Additionally, the ISO 19115-2 standard includes extensions for geospatial imagery and gridded data. ISO 19115-3 is the standard for describing the procedure to generate an XML schema from ISO geographic metadata. Dublin Core is another international standard. It was developed for digital data in general and has been extended for geospatial data, along with the associated DCAT vocabulary for building catalogs of data from a single source.
The primary use of metadata is for cataloging datasets. Modern metadata can be ingested by geographic search engines, making it potentially discoverable by other systems automatically. It also lists points of contact for a dataset if you have questions.
Python and metadata
Metadata is an important support tool for geospatial analysts and adds credibility and accessibility to your work. The Open Geospatial Consortium (OGC) created the Catalog Service for the Web (CSW) standard, which is used to manage metadata. The pycsw Python library implements the CSW standard.
Map projections
Map projections have entire books devoted to them and can be a challenge for new analysts. If you take any 3D object and flatten it on a plane, such as your screen or a sheet of paper, the object will be distorted. Many grade school geography classes demonstrate this concept by having students peel an orange and then attempt to lay the peel flat on their desks to understand the resulting distortion. The same effect occurs when you take the round shape of the Earth and project it onto a computer screen.
In geospatial analysis, you can manipulate this distortion to preserve common properties, such as area, scale, bearing, distance, or shape. There is no one-size-fits-all solution to map projections. The choice of projection is always a compromise of gaining accuracy in one dimension in exchange for errors in another. Projections are typically represented as a set of over 40 parameters, either in XML or in a text format called Well-Known Text (WKT). These parameters define the transformation algorithm.
The International Association of Oil and Gas Producers (IOGP) maintains a registry of the most well-known projections. The registry was originally created by the European Petroleum Survey Group (EPSG), which was later absorbed into IOGP, and its entries are still known as EPSG codes. The EPSG maintained the registry as a common benefit for the oil and gas industry, which is a prolific user of geospatial analysis for energy exploration. At the last count, this registry contained over 5,000 entries.
As recently as 10 years ago, map projections were of primary concern for a geospatial analyst. Data storage was expensive, high-speed internet was rare, and cloud computing didn’t really exist. Geospatial data was typically exchanged among small groups working in separate areas of interest. The technology constraints at the time meant that geospatial analysis was highly localized. Analysts would use the best projection for their area of interest.
Data in different projections could not be displayed on the same map because they represent two different models of the Earth. Any time an analyst received data from a third party, it had to be reprojected before they could use it with the existing data. This process was tedious and time-consuming.
At that time, most geospatial data formats did not provide a way to store the projection information. This information was stored in an ancillary file, usually as text or XML. Since analysts didn’t exchange data often, many people wouldn’t bother defining projection information. Every analyst’s nightmare was to come across an extremely valuable dataset that was missing the projection information, which rendered the dataset useless. The coordinates in a file are just numbers and offer no clue about the projection. With over 5,000 choices, it was nearly impossible to guess.
Now, thanks to modern software and the internet, data exchange is easier and more common. Nearly every data format has added a way to define the projection, either in accompanying metadata or in the file header, if supported. Advances in technology have also made global base maps possible, which has led to a few projections being shared widely. The best-known example is the Google Mercator projection, which is used for Google Maps. This projection is also known as Web Mercator and uses the code EPSG:3857 (or the older, unofficial code EPSG:900913).
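As an illustration of what a projection actually computes, the forward Web Mercator transformation can be written in a few lines. This is a minimal sketch of the spherical formula, assuming input in WGS 84 longitude and latitude degrees; the function name is hypothetical, and real projects would normally rely on a projection library rather than hand-written math.

```python
import math

# Sphere radius used by Web Mercator (the WGS 84 semi-major axis), in meters.
EARTH_RADIUS_M = 6378137.0


def lonlat_to_web_mercator(lon_deg, lat_deg):
    """Project longitude/latitude in degrees to Web Mercator meters."""
    x = EARTH_RADIUS_M * math.radians(lon_deg)
    y = EARTH_RADIUS_M * math.log(
        math.tan(math.pi / 4 + math.radians(lat_deg) / 2)
    )
    return x, y


x, y = lonlat_to_web_mercator(-90.07, 29.95)
print(x, y)
```

Note how longitude scales linearly while latitude is stretched by the logarithmic term. That stretching grows toward the poles, which is the area distortion this projection trades for preserving local shape.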
Geospatial portal projects such as OpenStreetMap (https://www.openstreetmap.org) have consolidated datasets for much of the world in common projections. Modern geospatial software can also reproject data on the fly, saving the analyst the trouble of preprocessing the data before using it.
Closely related to map projections are geodetic datums. A datum is a model of the Earth’s surface that’s used to match the location of features on the Earth to a coordinate system. One common datum is called WGS 84 and is used by GPS.
Rendering
The exciting part of geospatial analysis is visualization. Since geospatial analysis is a computer-based process, it is good to be aware of how geographic data appears on a computer screen.
Geographic data, including points, lines, and polygons, is stored numerically as one or more coordinates, which come in (X, Y) pairs or (X, Y, Z) tuples. The X represents the horizontal axis on a graph, while the Y represents the vertical axis. The Z typically represents terrain elevation. In computer graphics, a computer screen is represented by an X- and Y-axis. The Z-axis is not used because the computer screen is treated as a two-dimensional plane by most graphics software APIs. However, because desktop computing power continues to improve, three-dimensional maps are starting to become more common.
Another important factor is screen coordinates versus world coordinates. Geographic data is stored in a coordinate system representing a grid overlaid on the Earth, which is three-dimensional and round. Screen coordinates, also known as pixel coordinates, represent a grid of pixels on a flat, two-dimensional computer screen. Mapping X and Y world coordinates to pixel coordinates is fairly straightforward and involves a simple scaling algorithm. However, if a Z coordinate exists, then a more complicated transformation must be performed to map coordinates from a three-dimensional space to a two-dimensional plane. These transformations can be computationally costly and therefore slow if not handled correctly.
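The two-dimensional scaling step can be sketched as follows. This is a minimal example under stated assumptions: the function name and the bounding-box tuple order (min_x, min_y, max_x, max_y) are hypothetical, and the pixel origin is taken to be the top-left corner of the screen, which is the usual convention in graphics APIs.

```python
def world_to_pixel(x, y, bbox, width, height):
    """Scale a world coordinate to a pixel position on a width x height canvas."""
    min_x, min_y, max_x, max_y = bbox
    # Pixels per world unit along each axis.
    x_ratio = width / (max_x - min_x)
    y_ratio = height / (max_y - min_y)
    px = int((x - min_x) * x_ratio)
    # Screen Y grows downward, so measure from the top (max_y) of the extent.
    py = int((max_y - y) * y_ratio)
    return px, py


world_extent = (-180.0, -90.0, 180.0, 90.0)
center = world_to_pixel(0.0, 0.0, world_extent, 360, 180)
print(center)
```

The Y-axis flip is the detail most often missed: world coordinates increase northward, while pixel rows increase toward the bottom of the screen. Points lying exactly on the right or bottom edge of the extent map to width or height, one past the last pixel index, so a real renderer would clamp them.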
In the case of remote sensing data, the challenge is typically the file size. Even a moderately sized satellite image that is compressed can be tens, if not hundreds, of megabytes. Images can be compressed using two methods:
- Lossless methods: They reduce the file size by encoding the data more compactly, without discarding any of it
- Lossy methods: They reduce the file size by reducing the amount of data in the image while avoiding a significant change in the appearance of the image
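The difference between the two methods can be demonstrated with the standard library. In this sketch, zlib provides the lossless step, and dropping the three lowest bits of each pixel stands in for a lossy step. That quantization is only an illustrative assumption; real lossy image codecs such as JPEG are far more sophisticated. The pixel values are synthetic.

```python
import zlib

# Synthetic 8-bit "image" band: a slow gradient with a small repeating ripple.
pixels = bytes((i // 40 + (i * 31) % 4) % 256 for i in range(10000))

# Lossless: the compressed bytes decompress to exactly the original data.
lossless = zlib.compress(pixels)
restored = zlib.decompress(lossless)

# Lossy stand-in: discard the 3 lowest bits of every pixel, then compress.
# Each value changes by at most 7, but the original can never be recovered.
quantized = bytes(p & 0xF8 for p in pixels)
lossy = zlib.compress(quantized)
max_error = max(abs(a - b) for a, b in zip(pixels, quantized))

print(len(pixels), len(lossless), len(lossy), max_error)
```

Discarding detail leaves fewer distinct values for the compressor to encode, which is why lossy output is usually smaller than lossless output for the same image.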
Rendering an image on the screen can be computationally intensive. Most remote sensing file formats allow you to store multiple lower-resolution versions of the image (called overviews or pyramids) for the sole purpose of faster rendering at different scales. When you zoom out from the image to a scale where you can no longer see the detail of the full-resolution image, a preprocessed, lower-resolution version of the image is displayed quickly and seamlessly.
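The overview idea can be sketched in pure Python. This is a simplified illustration, assuming a single-band image stored as a list of rows with even dimensions at each level; the function names are hypothetical, and real formats and libraries build overviews with optimized resampling routines rather than code like this.

```python
def downsample(image):
    """Halve width and height by averaging each 2x2 block of pixels."""
    height, width = len(image), len(image[0])
    return [
        [
            (image[r][c] + image[r][c + 1]
             + image[r + 1][c] + image[r + 1][c + 1]) // 4
            for c in range(0, width - 1, 2)
        ]
        for r in range(0, height - 1, 2)
    ]


def build_overviews(image):
    """Return [full resolution, 1/2, 1/4, ...] until one side reaches 1 pixel."""
    levels = [image]
    while len(levels[-1]) >= 2 and len(levels[-1][0]) >= 2:
        levels.append(downsample(levels[-1]))
    return levels


def pick_level(levels, zoom_out_factor):
    """Pick the coarsest level whose reduction does not exceed the zoom-out."""
    level = 0
    while level + 1 < len(levels) and 2 ** (level + 1) <= zoom_out_factor:
        level += 1
    return level


# An 8x8 test image whose pixel values simply count upward.
image = [[r * 8 + c for c in range(8)] for r in range(8)]
levels = build_overviews(image)
print([len(level) for level in levels])
print(pick_level(levels, 4))
```

When the viewer is zoomed out by a factor of four, the renderer reads the quarter-resolution level and touches one sixteenth of the pixels it would otherwise need to process.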