Aggregation is the process of grouping spatial data at a level of detail or resolution that is coarser than the level at which the data were collected. For example, a national census collects sociodemographic and socioeconomic information for households. However, to ensure confidentiality during dissemination, such information, by necessity, is aggregated to various census geographies that differ in size. These geographies include, among others, census tracts (or “districts,” as they are called in some countries), municipalities (or “shires”), and provinces or states.
The outcome from aggregation is always the same: there is a loss of spatial and attribute detail through the creation of coarser spatial data consisting of fewer observations. While such data may be a desired outcome for some tasks, this is not always the case. In many instances, it is necessary to work with aggregated spatial data simply because they are the only data available for the task at hand; in other words, there is no choice in the matter. This is especially true when relying on governmental data products such as a national census. Geographic information systems (GIS) not only facilitate aggregation through a variety of techniques but can also be used to evaluate issues pertaining to the use of aggregate data.
Reasons for Aggregating Spatial Data
Spatial data are aggregated for a variety of reasons. This section briefly describes, with examples, some of the more common ones: to ensure the confidentiality of individual records, to generate data, to generalize/summarize data, to update spatial databases, to simplify maps, and to partition space into various spatial units consistent with some underlying meaning/process (e.g., zones, districts, regions, service areas).
As noted above, spatial data disseminated by government agencies are more often than not aggregates of individual records. Two related reasons account for this. First, when responses to detailed questionnaires are solicited from individual entities, such as persons, households, or business establishments, confidentiality of the responses is paramount. This implies that the agency conducting the survey must guarantee that individual entities cannot be identified by users of the data. Aggregation is the traditional means for ensuring such confidentiality. Second, due to the sheer volume of individual records, an agency may simply find it necessary to compute summary statistics (e.g., counts, sums, averages) on the data for release to the public. This is indeed the case for international trade data (i.e., imports and exports), which are based on cross-border shipment records.
Solutions to countless problems, both simple and complex, require aggregate spatial data. In fact, many indices (e.g., accessibility indices, location quotients, excess commute, segregation index D) and models/algorithms (e.g., user equilibrium traffic assignment model, location-allocation problems) are based on aggregate spatial data. If such data are not readily available, then they must be created by the analyst. For example, school-age children within a school board’s jurisdiction could be assigned to demand locations along streets, based on their home addresses. Such aggregate spatial data are a necessary input to location-allocation problems seeking to assign children to schools, while meeting very specific criteria such as a maximum travel time criterion.
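As one illustration of an index built on aggregate spatial data, the segregation index D (the index of dissimilarity) can be computed from zone-level counts of two groups. The sketch below assumes the commonly used definition, D = 0.5 × Σ |a_i/A − b_i/B|, where a_i and b_i are the counts of each group in zone i and A and B are the group totals. The function name and the tract counts are hypothetical and chosen only for illustration.

```python
def dissimilarity_index(group_a, group_b):
    """Index of dissimilarity D for two groups counted over the same zones."""
    if len(group_a) != len(group_b):
        raise ValueError("both groups must be counted over the same zones")
    total_a = sum(group_a)
    total_b = sum(group_b)
    if total_a == 0 or total_b == 0:
        raise ValueError("each group needs at least one member")
    # Half the sum of absolute differences between the zone shares of each group.
    return 0.5 * sum(
        abs(a / total_a - b / total_b) for a, b in zip(group_a, group_b)
    )


# Hypothetical counts for four census tracts (illustrative values only).
tract_group_a = [10, 20, 30, 40]
tract_group_b = [40, 30, 20, 10]
d = dissimilarity_index(tract_group_a, tract_group_b)
```

With these counts D works out to 0.4. Identical distributions give 0, and complete separation gives 1. Because the index depends only on zone totals, its value is tied to the zoning system used to produce those totals.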
A rather mundane, yet necessary, reason for aggregating spatial data is to ensure that spatial databases are current. Such is the case in many municipal planning departments, which must maintain an up-to-date inventory of land parcels. On occasion, for any number of reasons, two or more adjacent parcels may be merged to form one larger parcel.
Thematic maps are an effective means of communication only if geographic information is conveyed accurately, in an easily understood manner, such that any underlying spatial pattern is obvious. In many cases, this implies that the cartographer must decide upon an appropriate level of detail for portraying the phenomenon of interest. More detail is not necessarily better. This is especially true today given the ease with which individual-level spatial data can be created from analog sources (e.g., business directories) via geocoding, a core feature of GIS software. Although it might be tempting to create a thematic map from such data, it may not be appropriate, particularly when there are numerous observations. Instead, a more effective map can be created by aggregating the data to some form of zoning system (e.g., postal/ZIP codes in the case of business directories) and portraying the result via proportional symbol or choropleth maps.
Partitioning space into spatial units consistent with some underlying meaning/process is the goal of many projects. In virtually all cases, spatial data at one level of detail are aggregated to a coarser level of detail corresponding to the derived spatial units. Examples of spatial partitioning abound. They include, to name but a few, the derivation of traffic analysis zones from finer census geography such as enumeration areas or block groups, the delineation of metropolitan areas (e.g., census metropolitan areas in Canada, metropolitan statistical areas in the United States) based on commuting flows between an urban core and adjacent municipalities, and even the delineation of watersheds based on spatial data derived from digital elevation models.
GIS Techniques for Aggregating Spatial Data
GIS offer several possibilities for aggregating spatial data. However, the techniques employed are directly related to the data model used for digital representation, namely, the vector data model, which represents real-world entities as points, lines, and areas, and the raster data model, which divides space into an array of regularly spaced square cells, sometimes called pixels. Together, these cells form a lattice, or grid, which covers space.
GIS software packages typically offer three basic methods for generating aggregate vector data. Two techniques, dissolve and merge, operate on objects of the same layer. Dissolve groups objects based on whether they share the same value of an attribute. For instance, land parcels could be grouped according to land use type (e.g., residential, commercial, industrial, other), thus producing a new land use layer. The one caveat to using dissolve is that objects sharing an attribute value need not be adjacent, so the analyst must decide whether multipart objects are allowed in the output. Merge, on the other hand, is an interactive technique that allows the analyst to group objects during an editing session.
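The attribute side of a dissolve can be sketched in a few lines. The example below is a simplified illustration and not the implementation of any particular GIS package: geometry is omitted, each output object simply records which input parcels form its parts, and the function name, field names, and parcel values are all hypothetical. A real dissolve would also union the geometries of the grouped objects.

```python
from collections import defaultdict


def dissolve(parcels, key, sum_fields=()):
    """Group records sharing the same value of `key` into one output object."""
    groups = defaultdict(list)
    for parcel in parcels:
        groups[parcel[key]].append(parcel)

    dissolved = []
    for value, members in groups.items():
        record = {
            key: value,
            # Source objects that make up this (possibly multipart) output object.
            "parts": [m["id"] for m in members],
            "count": len(members),
        }
        # Summarize the requested numeric attributes by summing them.
        for field in sum_fields:
            record[field] = sum(m[field] for m in members)
        dissolved.append(record)
    return dissolved


# Hypothetical parcel layer (illustrative values only).
parcels = [
    {"id": 1, "land_use": "residential", "area_ha": 0.4},
    {"id": 2, "land_use": "commercial", "area_ha": 1.2},
    {"id": 3, "land_use": "residential", "area_ha": 0.6},
    {"id": 4, "land_use": "industrial", "area_ha": 3.0},
]
land_use_layer = dissolve(parcels, "land_use", sum_fields=["area_ha"])
```

Here parcels 1 and 3 are combined into a single residential object even though nothing establishes that they touch, which is exactly the situation in which the multipart question arises.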
Unlike dissolve and merge, spatial join operates on objects from two layers that are related based on their locations. Furthermore, almost any combination of the three vector data types (i.e., points, lines, and areas) can be joined. Through a spatial join, spatial data from one layer can be aggregated and added to objects of the other layer, which is often referred to as the destination layer. Aggregation is accomplished via a distance criterion or containment, both of which are based on objects found in the destination layer. As with dissolve and merge, the analyst must decide how existing attributes will be summarized during aggregation (e.g., averages, sums, weighted averages). By default, counts are generated automatically.
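A spatial join by containment can be illustrated with points aggregated to areas. The sketch below makes two simplifying assumptions that are not part of any GIS package: destination zones are axis-aligned rectangles given as (xmin, ymin, xmax, ymax), and a point lying on a shared boundary is assigned to only one zone. The function name, zone names, and business records are hypothetical.

```python
def spatial_join_count(zones, points, sum_field=None):
    """Aggregate points to the zones (destination layer) that contain them."""
    results = {name: {"count": 0, "sum": 0.0} for name in zones}
    for point in points:
        for name, (xmin, ymin, xmax, ymax) in zones.items():
            # Half-open test so a point on a shared boundary joins only one zone.
            if xmin <= point["x"] < xmax and ymin <= point["y"] < ymax:
                results[name]["count"] += 1  # counts are always generated
                if sum_field is not None:
                    results[name]["sum"] += point[sum_field]
                break
    return results


# Hypothetical destination layer and point layer (illustrative values only).
zones = {"west": (0, 0, 5, 10), "east": (5, 0, 10, 10)}
businesses = [
    {"x": 1.0, "y": 2.0, "employees": 12},
    {"x": 4.5, "y": 9.0, "employees": 3},
    {"x": 5.0, "y": 5.0, "employees": 40},  # on the shared boundary, joins "east"
    {"x": 12.0, "y": 1.0, "employees": 7},  # outside every zone, not aggregated
]
summary = spatial_join_count(zones, businesses, sum_field="employees")
```

The west zone receives two businesses with 15 employees in total, and the east zone receives one business with 40 employees. Swapping the sum for an average or a weighted average changes only the summary step, not the containment test.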
Aggregation of raster data always involves a decrease in resolution; that is, cell size increases. This is accomplished by multiplying the cell size of the input raster by a cell factor, which must be an integer greater than 1. For instance, a cell factor of 4 means that the cell size of the output raster would be 4 times that of the input raster (e.g., an input resolution of 10 m multiplied by 4 equals an output resolution of 40 m). The cell factor also determines how many input cells are used to derive a value for each output cell. In the example given, a cell factor of 4 requires 4 × 4, or 16, input cells. The value of each output cell is calculated as the sum, mean, median, minimum, or maximum of the input cells that fall within the output cell.
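The block-wise calculation described above can be sketched as follows. This is a minimal illustration under stated assumptions, not the behavior of a specific GIS tool: the raster is a list of rows, its dimensions are assumed to be exact multiples of the cell factor, and missing (NoData) cells are not handled. The function name and the sample values are hypothetical.

```python
from statistics import mean, median

# Summary statistics named in the text, mapped to Python callables.
STATISTICS = {"sum": sum, "mean": mean, "median": median, "min": min, "max": max}


def aggregate_raster(raster, cell_factor, statistic="mean"):
    """Coarsen a raster by summarizing cell_factor x cell_factor blocks."""
    if not isinstance(cell_factor, int) or cell_factor <= 1:
        raise ValueError("cell_factor must be an integer greater than 1")
    rows, cols = len(raster), len(raster[0])
    if rows % cell_factor or cols % cell_factor:
        raise ValueError("this sketch requires dimensions divisible by cell_factor")
    summarize = STATISTICS[statistic]

    output = []
    for r in range(0, rows, cell_factor):
        out_row = []
        for c in range(0, cols, cell_factor):
            # Collect the input cells that fall within this output cell.
            block = [
                raster[i][j]
                for i in range(r, r + cell_factor)
                for j in range(c, c + cell_factor)
            ]
            out_row.append(summarize(block))
        output.append(out_row)
    return output


# Hypothetical 4 x 4 input raster with 10 m cells (illustrative values only).
elevation = [
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
]
coarse_sum = aggregate_raster(elevation, 2, "sum")    # 2 x 2 output, 20 m cells
coarse_max = aggregate_raster(elevation, 2, "max")
single_cell = aggregate_raster(elevation, 4, "sum")   # 16 input cells, 40 m cell
```

With a cell factor of 2, each output cell summarizes 4 input cells; with a cell factor of 4, all 16 input cells collapse into one output cell, mirroring the 10 m to 40 m example.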
Issues Concerning Aggregation
A discussion of aggregation would not be complete without mention of issues concerning the use of aggregate spatial data. Thus, this discussion concludes with brief explanations of the modifiable areal unit problem (MAUP), the ecological fallacy, and cross-area aggregation.
The MAUP occurs when the zoning system used to collect aggregate spatial data is arbitrary in the sense that it is not designed to capture the underlying process giving rise to the data. In turn, this implies that the results from any analysis using the system may be arbitrary. In other words, the results may simply be artifacts of the zoning system itself. MAUP effects can be divided into two components: scale effects and zoning effects. The former relate to different levels of aggregation (i.e., spatial resolution), whereas the latter relate to the configuration of the zoning system given a fixed level of aggregation. MAUP effects have been documented in a wide variety of analytical contexts, including, among others, the computation of correlation coefficients, regression analysis, spatial interaction modeling, location-allocation modeling, the derivation of various indices (e.g., segregation index D, excess commute), and regional economic forecasting.
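A small numerical experiment can make these effects concrete. The sketch below uses invented values for eight units arranged along a line and computes a Pearson correlation three ways: on the individual units, on one four-zone system, and on a second four-zone system with different boundaries. The data, the zoning systems, and the function names are hypothetical and chosen only to show that the result depends on how units are grouped.

```python
from math import sqrt


def pearson(x, y):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def zone_means(values, zoning):
    """Aggregate unit values to zone means; each zone is a tuple of unit indices."""
    return [sum(values[i] for i in zone) / len(zone) for zone in zoning]


# Hypothetical values for eight contiguous units (illustrative only).
x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]

# Two zoning systems with the same number of zones but different boundaries.
zoning_a = [(0, 1), (2, 3), (4, 5), (6, 7)]
zoning_b = [(0, 1, 2), (3,), (4,), (5, 6, 7)]

r_units = pearson(x, y)
r_zoning_a = pearson(zone_means(x, zoning_a), zone_means(y, zoning_a))
r_zoning_b = pearson(zone_means(x, zoning_b), zone_means(y, zoning_b))
```

The correlation is about 0.90 for the individual units, exactly 1.0 under the first zoning system, and about 0.92 under the second. The change from units to zones illustrates a scale effect, and the difference between the two four-zone systems illustrates a zoning effect.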
The MAUP is closely related to the ecological fallacy, which arises when a statistical relationship observed using aggregate spatial data is attributed to individuals. In fact, one cannot make any inference concerning the cause of the relationship without further analysis. Finally, cross-area aggregation refers to the transfer of aggregate spatial data from one zoning system to another. The most common approach for this task is to use area weighting, which assumes that data are distributed uniformly within zones. While GIS can facilitate this procedure, the analyst should be cautioned that the resulting estimates are unlikely to match reality wherever the uniformity assumption does not hold.
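Area weighting can be written as a simple formula: the estimate for a target zone t is the sum, over source zones s, of the source value multiplied by the share of the source zone's area that falls inside t, that is, Σ_s value_s × area(s ∩ t) / area(s). The sketch below applies this formula under a strong simplifying assumption that zones are axis-aligned rectangles given as (xmin, ymin, xmax, ymax), so that overlaps are easy to compute; real zones would require polygon overlay. All names and values are hypothetical.

```python
def rect_area(rect):
    """Area of a rectangle given as (xmin, ymin, xmax, ymax)."""
    xmin, ymin, xmax, ymax = rect
    return (xmax - xmin) * (ymax - ymin)


def overlap_area(a, b):
    """Area shared by two axis-aligned rectangles (0.0 if they do not overlap)."""
    width = min(a[2], b[2]) - max(a[0], b[0])
    height = min(a[3], b[3]) - max(a[1], b[1])
    return width * height if width > 0 and height > 0 else 0.0


def area_weighted_transfer(source_zones, target_zones):
    """Estimate target-zone totals, assuming uniform density within source zones."""
    estimates = {}
    for target_name, target_rect in target_zones.items():
        total = 0.0
        for source_rect, count in source_zones.values():
            # Share of the source zone's area that lies inside the target zone.
            share = overlap_area(source_rect, target_rect) / rect_area(source_rect)
            total += count * share
        estimates[target_name] = total
    return estimates


# Hypothetical source zones with population counts, and two target zones.
source_zones = {
    "tract_1": ((0, 0, 2, 2), 100),
    "tract_2": ((2, 0, 4, 2), 200),
}
target_zones = {
    "service_area": (1, 0, 3, 2),
    "remainder_west": (0, 0, 1, 2),
}
estimates = area_weighted_transfer(source_zones, target_zones)
```

The service area covers half of each tract and is therefore assigned 50 + 100 = 150 people, while the western remainder is assigned 50. If the population of tract_1 were in fact concentrated in its western half, both estimates would be wrong, which is the limitation of the uniformity assumption.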
