Manipulating Geographical Data with OpenStreetMap and QGIS
In a previous blog post titled "Can bullets fired into the air kill?", we had examined the potential of a misfired rifle to cause injuries and death. In order to perform the probabilistic calculations near the end of the article, it was necessary to extract and analyse large amounts of local geographical data. In this post we would see in fine detail how the data extraction and processing was done.
Before we begin, I would like to point out some excellent guides on this topic. "Extracting building footprints from OpenStreetMap" by Stelios Vitalis is the primary source that we'll be referencing; a different approach to downloading and extracting data is detailed in "Using OSM Data in QGIS" by LearnOSM. Stack exchange articles, such as this one, also offer instructions on how to navigate QGIS. My aim here is to piece these various pieces of information together into a contiguous narrative, which I hope would be useful to those looking to perform similar operations.
OpenStreetMap and QGIS
First of all, we have to find a source of geographical data. OpenStreetMap is a crowd-sourced mapping service with publicly available data, making it a convenient avenue for our purposes. We may, of course, choose to obtain our data through a different mean—WikiMapia Map, for instance, is another open-content mapping service, and OneMap offers a convenient API for downloading geographical information. In this post we'll be using OpenStreetMap.
We'll also need a software to process the data that we extract. QGIS is an excellent open-source geographical information processing software, which we'll be using for our work. There are a multitude of reasons why QGIS has become popular over the years, one being that it is free to use, and another being that it is somewhat more lightweight than its counterparts. Those with programming expertise and a lot of time on their hands may, of course, choose to deal with the data with code of their own; but we shall not join this party today.
Data Extraction & Filtering
The most obvious method of extracting data from OpenStreetMap is to click the export option right on its website, which would download an OSM file containing the geographical data within the selected bounds. The problem with this approach is that there is a hard limit on how much data one can download. And, rather unfortunately, this limit falls short of what we require, to extract Singapore's geographical data in one go—and so we look for another way.
There are numerous repositories of OpenStreetMap data available online, most of them containing packages for different regions and countries. BBBike, for instance, offers a compiled package for Singapore—and best of all, it seems to be updated frequently. This is the repository we shall use to obtain our data.
After downloading the OSM PBF file for Singapore, we can launch QGIS and import the data. The way we do this is by assessing Layers → Add Layer → Add Vector Layer. We then select our OSM file and load it in.
For those wondering about the weird-looking double file extension of OSM and PBF, the additional PBF specifier is to flag that the data in the file is encoded in Protocolbuffer Binary Format, an alternative to XML. OSM files encoded in XML would have the shorter file extension *.osm, whereas OSM files encoded in PBF would have the extension *.osm.pbf. Both types of files can be used with QGIS.
QGIS would then ask which data layer we wish to import. We should understand that data is organised in five distinct layers inside the OSM file—lines, multilinestrings, multipolygons, other relations, and points. Small objects such as landmarks, trees, and markers are stored as points. Geographical borders are stored as lines. In multipolygons we have the footprint of buildings, the outlines of rivers and water bodies, and land features such as plateaus and raised terrains.
We can choose to import everything, in which case the map would look something like this, with the default colour palette.
But we're actually only interested in building footprints, so a more efficient approach is to only import the multipolygons layer. Doing so cuts down on the amount of data our computer has to process dramatically. The map would then look like this:
A slight digression here. Why call them multipolygons? The difference between a multipolygon and a polygon is subtle, and has got to do with the allowed intersections between exterior and interior boundaries, and the way that the boundaries are stored. The rationale for using multipolygons for modelling building footprints is to allow for greater flexibility and generality. Some buildings might have footprints made up of several non-intersecting polygons, each belonging to a different block; and some buildings might have footprints with interior and exterior boundaries that are touching, because its facade is too thin to be captured. Modelling them as multipolygons would be much simpler than using polygons in these cases.
Anyway, now we have a set of multipolygons, which might be building footprints, or rivers, or reservoirs. We require only building footprints. To filter the data, we go to Layer → Filter. This opens up a filter panel, and we can type in the condition "buildings is not null". Tagged with each multipolygon is a buildings attribute field, which would contain some information, for instance "bridge" or "apartment" or "skyscraper", if the multipolygon is indeed a building; else the field would be empty. This is the schema adopted by OpenStreetMap, and we leverage upon this to build our filter.
Running the filter clears up everything that is not a building. But we have another problem—not all buildings are in Singapore. The map we had downloaded had a rectangular bound, and that means parts of Malaysia are captured as well. Naïvely attempting to select features on our rendered map and deleting them will not work—QGIS will state that the current data provider does not support deletion.
This is to be expected, because we have imported our data from an OSM file, which in QGIS is immutable. We have to first convert the data to a form QGIS can freely manipulate, say a shapefile or a database. An easy way of doing this is to right-click on our multipolygons vector layer and save it as an ERSI shapefile. We then relaunch QGIS clean, and import the shapefile as a vector layer, using the same procedure as before.
With the data imported from a shapefile, we can freely select and delete features. Remember to toggle edit mode on the toolbar before attempting to modify any geographical features, and to toggle it back off when you're done. When edit mode is deactivated, QGIS will automatically save the edits to the shapefile. After deleting all buildings that are not in Singapore, we obtain the following map.
At this juncture we have two options. The first is to export the data and use some other program to do post-processing. This might provide more flexibility, and would also allow us to combine the data with other datasets extracted from elsewhere. Mathematica, for instance, natively supports the import of ERSI shapefiles; and there are similarly libraries for Java and Python to process formatted geographical data.
The second option is to the post-processing within QGIS. We would be sticking to this option.
Calculating Area and Perimeter
Calculating the area and perimeter of the various building footprint multipolygons is almost trivial. We simply open the attributes table of our vector layer. It will look something like this:
And we now launch the field calculator from the attributes table. We change the value type to decimal number (real), and we type in $area for the expression. Running this would add a new column to the attributes table containing the evaluated area. To calculate perimeter, we type in $perimeter.
And we're done. If we seek the distribution of building footprint areas and perimeters, we can export the data and build the distributions using a symbolic processing software, say Mathematica. If we seek the total area or total perimeter, we simply do a sum. Exporting the data in the attributes table is simple—right-click on our vector layer, and save it to a CSV or text file for easy parsing later on.
For the sake of completeness, I figured I will present some of the results that I had obtained. As already mentioned in my "Can bullets fired into the air kill?" blog post, the total building footprint area in Singapore was evaluated to be around 84,178,390 square meters.
Perhaps more interesting is the footprint and perimeter distribution of the buildings in Singapore. The data from QGIS was imported into Mathematica, and histograms were constructed from these datasets. On the left we present the histogram for building footprint area, and on the right we present that for building perimeter.
And of course one may proceed to do more processing and analysis, but I shall call it a day. This is only meant to be an introduction, not an extensive guide. There are a lot of resources online for those wishing to explore deeper; besides, I am as far from an expert on this as you could possibly get, and I would be hard-pressed to elaborate further.
Till next time, folks, goodbye!