Extracting Data from Websites with Mathematica
In my blog post titled "Can bullets fired into the air kill?", we had examined the physics of bullets and their potential to cause lethal injuries—and towards the end of our investigation, we had attempted a calculation of the approximate probability of a lethal impact for a misfired rifle. In that process, we needed the height distribution of the buildings in Singapore, and we had explained that we were able to extract the necessary data from the website of Emporis, a real-estate data-mining company.
In this follow-up post, we would see exactly how this was done. We would be using Mathematica for our work, a mathematical symbolic computation software by Wolfram Research.
Importing Webpages
The first step is to realize that the Import function of Mathematica is capable of much more than reading files and images—it is able to import entire websites as well. And so we shall begin our experimentation with importing a relevant Emporis website, say, the first page of their catalogue on the central district of Singapore. In the middle of the website is a table containing the names and heights of the catalogued buildings, which is what we would like to extract. When viewed through a browser, it looks something like this:
Let us first attempt naively importing the website using Mathematica's Import function. We simply paste the entire URL of the website into our notebook and run our code.
And we see immediately that it's not something that we can use. Such an import returns plain text as downloaded from the website, and we would need a significant amount of effort to construct a program that can extract our data intelligently. The trouble is that, while we can visually find the data and write appropriate string manipulation algorithms to retrieve the data at specific locations, there is no guarantee that the preceding text in every page of the website is the same—and therefore, there would be no guarantee that our program would be generically applicable to all of Emporis's pages.
We require a different approach. We know that the table is formatted with HTML; and we also know that the data to be displayed in the table has to be communicated to our computer, through some way or another. So perhaps reading the HTML source of the webpage would be a viable method.
Thinking a step ahead, we would subsequently have to go through the trouble of parsing our downloaded HTML, possibly into objects, and afterwards we would have to sieve this mess of objects to find our buildings data. We can simplify our work a lot by requesting that Mathematica parse the HTML source into XML objects, which we can then natively manipulate.
And we're disappointed to see that it is still a huge mess. Mathematica could not manage to parse much of the returned HTML. And at this point we might curse and swear and leave this for another day.
It is rather surprising that Mathematica does, in fact, already have a built-in function to extract data out of webpages, but that the method of doing so had largely evaded documentation. Even Stack Exchange contains little information on this; it was by sheer chance that I came across this functionality of Import. We can include a Data specifier, and as if by magic all the clutter disappears.
And we can now easily extract the information we seek. We find, through trial-and-error, that the data for our table is encoded in the (9, 1, 2) sublist of the data source.
We might think we are close to having cracked this, but we are not. Closer inspection reveals that the lengths of the sublists here are not identical. Take the first two elements for instance—the first on Tanjong Pagar Centre contains 7 elements, but the second on Republic Plaza contains 8. If we cross-reference this to the webpage, we see that this is because there is no data on the number of images available for Tanjong Pagar Centre. In other words, fields with no data are entirely overlooked in the import—there isn't even a placeholder or a flag signalling their absence!
How do we parse this data correctly, then, with no guaranteed knowledge of which data belongs to which field?
We don't. We go all the way back and start from square one. It turns out, an even more obscure functionality of Import can solve all our problems—we can pass a FullData specifier, instead of Data. I do not know why something as useful as this had remained so hard to dig out, but it works perfectly.
With the FullData import specifier, empty fields are no longer completely overlooked, but are represented as empty strings in their respective lists. The position of the data had also shifted to the (15, 1, 2) sublist, but that is easily corrected for. Extracting the data from Emporis's webpages has now been effectively reduced to parsing this list, which is easily achieved.
Data Parsing and Extraction
With a direct reference to the FullData import shown above, we can deduce straightaway the correspondence between the various fields in the lists and on the webpage. We have summarized it in the following table.
Parsing the data now becomes a matter of writing functions to extract and convert the various elements. We shall write a total of five such functions, to extract building name, building height, the number of floors, building type, and the year of construction. The first function to extract building name, which we coin ExtractName, is trivial to construct.
The next function to extract building height is not so trivial. There is a possibility that there is no height data for a particular building, in which case the field would be populated with an empty string. If that is the case, we would like the function to return a no-data flag, so that we can treat such cases appropriately during post-processing. We would also like to know whether the building height is exact or approximate—Emporis does reflect this information, by prepending approximate building heights by an approximation symbol. Again, this is so that we can handle exact and approximate building heights separately during post-processing, if there is such a need.
And so we define four flags to go along with the function: the no-data flag, the exact-value flag, the approximate-value flag, and the not-applicable flag. To extract the numerical data from a string field potentially populated by non-digit characters, we use StringCases with a DigitCharacter specifier. And to check whether the extracted value is exact or approximate, we test for the approximation symbol. The complete function looks something like this:
The remaining three desired fields are relatively easy to parse. We use IntegerQ to check if there is data, returning the no-data flag if there is none; and we use ToExpression to parse the strings into numbers.
At this point we have five functions defined, each capable of extracting a particular field from a data entry. What remains is to construct a primary function leveraging upon these. We coin this function ExtractData, and its input shall be the URL prefix of that particular group of Emporis webpages, the page number, and the number of data entries on that page. Its output shall be a list of the form {building name, building height, number of floors, building type, year of construction}, which will make post-processing very convenient for us.
And now all that is left to do is to run this function over all of Emporis's webpages. We should understand that the buildings catalogue is split into five major geographical regions—central, east, west, north, and northeast. A good way of downloading the data is therefore to have a separate data list for each region, and run a separate Module for each; as a bonus, this also allows us to visualize the building distributions in each region separately.
The iteration for the central region can be written as follows. We use a ParellelTable as our looping construct, in order to speed things up via parallelization; and as always, we use Monitor coupled with ProgressIndicator to monitor progress as the loop runs. To support Monitor across the parallelized operation, we define a counter variable, to be shared across all kernels. The remainder of the code is rudimentary arithmetic, to determine how many data entries there would be on that particular page.
This particular piece of code is not very computationally intensive; it takes ten minutes at most. The data for the remaining east, west, north, and northeast regions can be downloaded in the exact same way, by repeating the above function.
We're pretty much done here. We may, if we prefer, export the data into files for storage and call it a day. However, in the context of my "Can bullets fired into the air kill?" blog post, we had wanted the building height distribution in Singapore, and that would require some post-processing. I'll proceed to show how this was done in the following section.
Post-Processing of Data
In order to visualize the large amounts of data we had just extracted, perhaps it would be good to generate some graphs. In particular, we might be interested in the building height histograms in the various regions in Singapore, as well as the overall building height distribution. Before we proceed to plot those histograms, we need first filter our data to exclude any marked with the no-data flag. This can be done easily using Select.
And then we use the Histogram function to conveniently plot our graphics. The code below would generate histograms for each individual district, and also for the combined dataset; and these plots would be arranged in a horizontal scrollable strip. Much of the code presented below concerns the styling of the Histograms, and are not, strictly speaking, necessary.
We present some of the plots we have generated below.
And to construct the probability density function that we have used in our blog post, we use BinCounts to evaluate the number of buildings with heights falling between each histogram bin. In order to satisfy the unit-area constraint of probability density functions, we have to normalize the counts by the bin width, as shown in the code below. This counting process is applied across our datasets, and the resulting empirical counts are used to construct a piecewise interpolating function, with a value of zero everywhere outside our sampled domain.
We can go beyond this, and analyze the distribution of building types in Singapore. We use DeleteDuplicates to create a list of distinct building types, and then do a count of the number of buildings of each type. We then plot the major categories in a pie chart, for visualization. The end results looks something like this:
And we can, of course, also analyze the number of floors in the buildings of Singapore. Here's a histogram of the distribution of number of floors, as computed from our data. It can be seen that most buildings don't exceed thirty stories or so; and only a select few reaches 50 stories and above.
Lastly, we can analyze the construction dates of the buildings. This might meaningfully reflect the historical development of Singapore. It appears that an increasing number of construction projects had been undertaken over the years, with this trend continuing into the twenty-first century; but we must bear in mind that this representation might be biased, because data might be missing for the older buildings.
This is where I will call it a day and finish this off. We had seen, in this post, how we were able to programatically download and extract data from webpages, and had also seen how we had post-processed our data to yield the distribution function for building heights, and histogram visualizations for various building statistics. I hope such an elaboration had been interesting, and I would be extremely glad if this had proved insightful or useful to you, if even a tiny bit.
Thank you for reading, and till next time, goodbye!