Yujin Kim, PhD
Prediction using ArcGIS Pro (v.3.2)
Aim
The aim of this machine learning task is to predict annual cooling degree days (CDDs) in the UK using a linear regression model in ArcGIS Pro (v. 3.2) [1], based on the correlations between a dependent variable and a set of independent variables. Annual cooling degree days are the “annual sum of the number of degrees the daily average temperature is above 22°C each day [2]”, and they are one of the indicators of the energy demand (e.g. power consumption) for cooling in hot weather [3]. Predicting cooling degree days can therefore help estimate future energy demand [4].
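The quoted definition can be made concrete with a minimal sketch. The function name and the sample temperatures below are hypothetical and are not taken from the Met Office data; only the 22°C base temperature comes from the definition above.

```python
def degree_days_above(daily_mean_temps_c, base_c=22.0):
    """Sum the degrees by which each daily mean temperature exceeds base_c.

    Days at or below the base temperature contribute nothing.
    """
    return sum(max(0.0, t - base_c) for t in daily_mean_temps_c)


# Hypothetical daily mean temperatures (°C) for a short warm spell.
sample_days = [19.5, 22.0, 23.5, 25.0, 21.0, 24.5]
cdd = degree_days_above(sample_days)
print(cdd)
```

Summing the same quantity over every day of a year gives the annual cooling degree days for one location.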
The annual cooling degree days (the dependent variable) for each 12 km cell of the British National Grid (BNG) were estimated from the annual count of hot summer days and latitude (the independent variables) using a continuous (Gaussian) model.
Before implementing a linear regression process, variables were set up, and pre-analysis was conducted to choose appropriate independent variables so that a robust linear regression model could be built. A total of 13 variables [5] were set up:
1) Annual cooling degree days (indicated as COOLING in graphs and tables)
2) Annual count of hot summer days (Hot_summer): “annual number of days where the maximum daily temperature is above 30°C [6]”.
3) Annual count of tropical nights (Tropical_nights): days when “the minimum daily temperature does not fall below 20°C [7]”.
4) Annual growing degree days (Growing): “annual sum of the number of degrees the daily average temperature is above 5.5°C each day [8].”
5) Annual count of icing days (Icing): “annual number of days where the maximum daily temperature is below 0°C [9].”
6) Annual count of frost days (Frost): “annual number of days where the minimum daily temperature is below 0°C [10].”
7) Annual heating degree days (Heating): “annual sum of the number of degrees the daily average temperature is below 15.5°C each day [11].”
8) Summer average air temperature (Summer_ave_temp) in °C [12].
9) Winter average air temperature (Winter_ave_temp) in °C [13].
10) Average summer precipitation (Summer_prec) in mm/day [14].
11) Average winter precipitation (Winter_prec) in mm/day [15].
12) Latitude
13) Longitude
Variable 1) was the dependent variable, and 2)–13) were the candidate independent variables. The data for 1)–11) were provided on a 12 km BNG (Fig. 1.2) for the historical period 2001–2020. Note that the UK Living Atlas provides the data for 1)–11) as median, lower and upper values; for the current task, the median values of the historical baseline data were used. A total of 26,975 data points were analysed.
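The day-count indicators in the list above (hot summer days, tropical nights, icing days and frost days) all follow the same pattern: count the days on which a daily temperature meets a threshold condition. A minimal sketch follows, with a hypothetical helper name and made-up daily values drawn from different seasons; only the thresholds come from the quoted definitions.

```python
def count_days(values, predicate):
    """Count the days whose value satisfies the given threshold condition."""
    return sum(1 for v in values if predicate(v))


# Hypothetical daily maximum and minimum temperatures (°C), not real data.
daily_max_c = [28.0, 30.0, 31.5, 33.0, -1.0]
daily_min_c = [18.0, 20.0, 21.0, 19.5, -4.0]

hot_summer_days = count_days(daily_max_c, lambda t: t > 30.0)   # maximum above 30°C
tropical_nights = count_days(daily_min_c, lambda t: t >= 20.0)  # minimum not below 20°C
icing_days = count_days(daily_max_c, lambda t: t < 0.0)         # maximum below 0°C
frost_days = count_days(daily_min_c, lambda t: t < 0.0)         # minimum below 0°C

print(hot_summer_days, tropical_nights, icing_days, frost_days)
```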
Setup and Pre-Analysis

Fig. 1.2. 12 km British National Grid: the data for 1)–11) were provided on the BNG. The centres of the violet circles indicate the centre of each 12 km grid cell.
To select independent variables that were not multicollinear, the relationships between the dependent variable (annual cooling degree days = COOLING) and the 12 candidate independent variables were analysed and presented as a scatter plot matrix, with graphs in Fig. 1.3 and R-squared values (R2, coefficient of determination) in Fig. 1.4. Multicollinearity occurs when two or more independent variables are closely correlated with each other. Multicollinear independent variables make the coefficient estimates of a linear regression model unstable and can weaken its predictions, so they should not be used together.
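For a pair of variables, the R2 shown in a scatter plot matrix is the squared Pearson correlation. The sketch below computes it for every pair of columns. The function names and the five sample values per column are hypothetical; this illustrates the statistic and is not the ArcGIS Pro implementation.

```python
from itertools import combinations


def r_squared(x, y):
    """Squared Pearson correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy * sxy / (sxx * syy)


def r_squared_matrix(columns):
    """Return {(name_a, name_b): R2} for every unordered pair of columns."""
    return {
        (a, b): r_squared(columns[a], columns[b])
        for a, b in combinations(columns, 2)
    }


# Hypothetical values for five grid cells (not the report's data).
columns = {
    "COOLING": [0.0, 10.0, 20.0, 30.0, 40.0],
    "Hot_summer": [0.0, 1.0, 2.0, 3.0, 4.0],
    "Latitude": [58.0, 56.0, 55.0, 52.0, 51.0],
}
matrix = r_squared_matrix(columns)
for pair, value in matrix.items():
    print(pair, round(value, 3))
```

A high R2 between two independent variables is the signal of multicollinearity described above; a high R2 between an independent variable and COOLING is what makes it a useful predictor.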

Fig. 1.3. Scatter plot matrix: relationships between the dependent variable (COOLING) and independent variables represented in graphs
Fig. 1.4. Scatter plot matrix: relationships between the dependent variable (COOLING) and independent variables represented by R-squared values.

Analysis of Figs. 1.3 and 1.4 revealed that Hot_summer, Growing, Summer_ave_temp and Latitude had relatively strong linear relationships with COOLING, each with an R2 over 0.5. Of these, Hot_summer and Latitude were chosen as the independent variables because the other two showed multicollinearity. Importantly, Hot_summer and Latitude each had a strong relationship with the dependent variable COOLING (R2 = 0.95 and 0.54, respectively, both greater than 0.5), while the relationship between the two of them was weaker (R2 = 0.42, smaller than 0.5). Thus, a linear regression model was built with these two variables and the dependent variable, COOLING.
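The selection reasoning can be written as a simple screening rule: keep a variable if its R2 with COOLING exceeds 0.5 and its R2 with every variable already kept is below 0.5. The sketch below is one possible formalisation and not necessarily the exact procedure followed in the pre-analysis; the function name is hypothetical. It uses only R2 values reported in this document, so Growing and Summer_ave_temp, whose values are not stated in the text, are left out.

```python
def screen_predictors(r2_with_target, r2_between, threshold=0.5):
    """Greedy screen: strong link to the target, weak link to kept predictors."""
    selected = []
    # Consider the candidates from strongest to weakest relationship.
    ranked = sorted(r2_with_target, key=r2_with_target.get, reverse=True)
    for name in ranked:
        if r2_with_target[name] <= threshold:
            continue  # too weakly related to the dependent variable
        if all(r2_between[frozenset((name, kept))] < threshold for kept in selected):
            selected.append(name)
    return selected


# R2 values reported in this document.
r2_with_cooling = {"Hot_summer": 0.95, "Latitude": 0.54, "Tropical_nights": 0.18}
r2_between = {frozenset(("Hot_summer", "Latitude")): 0.42}

chosen = screen_predictors(r2_with_cooling, r2_between)
print(chosen)
```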
Generalised Linear Regression (GLR) – Implementation, Analysis and Results
The Gaussian model was chosen because annual cooling degree days are derived from the daily average temperature, which is a continuous value. As explained in the previous section, the independent variables Hot_summer and Latitude were used to predict the dependent variable, COOLING. Note that when the linear regression model was implemented, a third independent variable, Tropical_nights (which had a weak relationship with COOLING), was included for comparison purposes, i.e. to contrast the stronger linear relationships (COOLING with Hot_summer and Latitude) with a weak one (COOLING with Tropical_nights). Fig. 1.5 shows the standardised residuals from the linear regression model. A standardised residual is “a measure of the strength of the difference between observed and expected values [16]”; in regression, it is the residual (observed minus predicted value) divided by an estimate of its standard deviation.
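A Gaussian (continuous) linear model of this kind is fitted by ordinary least squares. The sketch below fits such a model in plain Python and standardises the residuals by dividing them by the residual standard error. This is a simplification: ArcGIS Pro may compute standardised residuals differently. All function names and sample values are hypothetical.

```python
from math import sqrt


def solve_linear_system(a, b):
    """Solve a x = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    m = [row[:] + [rhs] for row, rhs in zip(a, b)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x


def fit_ols(rows, y):
    """Least-squares fit via the normal equations; returns [intercept, b1, ...]."""
    design = [[1.0] + list(r) for r in rows]
    k = len(design[0])
    xtx = [[sum(d[i] * d[j] for d in design) for j in range(k)] for i in range(k)]
    xty = [sum(d[i] * t for d, t in zip(design, y)) for i in range(k)]
    return solve_linear_system(xtx, xty)


def standardised_residuals(rows, y, beta):
    """Residuals divided by the residual standard error (simplified form)."""
    predicted = [beta[0] + sum(b * v for b, v in zip(beta[1:], r)) for r in rows]
    residuals = [obs - pred for obs, pred in zip(y, predicted)]
    dof = len(y) - len(beta)  # observations minus fitted parameters
    sigma = sqrt(sum(e * e for e in residuals) / dof)
    return [e / sigma for e in residuals]


# Hypothetical (Hot_summer, Latitude) rows and COOLING values, not real data.
rows = [(0, 58.0), (1, 52.0), (2, 56.0), (3, 51.0), (4, 54.0), (1, 57.0), (2, 53.0)]
cooling = [2.0, 21.0, 24.0, 44.0, 47.0, 13.0, 30.0]

beta = fit_ols(rows, cooling)
z = standardised_residuals(rows, cooling, beta)
print([round(b, 3) for b in beta])
print([round(v, 2) for v in z])
```

A positive standardised residual means the observed value is higher than the predicted one (under-prediction), and a negative value means over-prediction.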


Fig. 1.5. Standardised residuals of the linear regression results on the 12 km BNG
Fig. 1.6 shows the relationship between COOLING and Hot_summer alongside a specific grid location. For example, in the grid cell highlighted in map (a), the annual count of hot summer days (Hot_summer) is 2 days and the annual cooling degree days (COOLING) are 35 CDDs; the R2 value of 0.95 applies to the COOLING–Hot_summer relationship across all grid cells.

Fig. 1.6. Relationship between COOLING and Hot_summer at a specific location in the map
Fig. 1.7 shows the relationships between the variables in a scatter plot matrix. The pair COOLING and Hot_summer clearly has the strongest relationship, and the pair COOLING and Latitude also has a strong relationship (less strong than COOLING and Hot_summer). The relationship between COOLING and Tropical_nights is weak, as shown by the widely scattered points. The histogram of COOLING in the first column is a quasi-Gaussian distribution. It is right-skewed (positively skewed): most grid cells have COOLING values below the average, with a long tail towards higher CDDs.
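The direction of skew in a histogram can be checked numerically with the sample skewness. It is positive when most values sit below the mean and a long tail extends to the right, and negative in the mirrored case. The sketch below uses a hypothetical function name and made-up values.

```python
def sample_skewness(values):
    """Moment-based sample skewness: m3 / m2**1.5."""
    n = len(values)
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values) / n
    m3 = sum((v - mean) ** 3 for v in values) / n
    return m3 / m2 ** 1.5


# Hypothetical CDD-like values: many small values and a few large ones.
values = [0.0, 0.0, 1.0, 1.0, 2.0, 3.0, 5.0, 12.0, 30.0]
mean_value = sum(values) / len(values)
below_mean = sum(1 for v in values if v < mean_value)
above_mean = sum(1 for v in values if v > mean_value)

skew = sample_skewness(values)
print(round(skew, 3), below_mean, above_mean)
```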

Fig. 1.7. Relationships between the variables in a scatter plot matrix (refer to Fig. 1.5 for the legend of the standardised residuals).
Fig. 1.8 shows the relationships between the variables as R2 values. The pairs COOLING and Hot_summer and COOLING and Latitude each have an R2 value over 0.5, indicating a stronger relationship. The pair COOLING and Tropical_nights has a weaker relationship, with a smaller R2 value of 0.18 (less than 0.5). These R2 values are the same as those found in the pre-analysis.

Fig. 1.8. Relationships between the variables with an R2 value matrix
Fig. 1.9 presents the distribution of the standardised residuals. The standardised residuals are shown as bars, and the curve represents a normal distribution. Because the bars follow a pattern close to the normal curve, the linear regression model estimated COOLING with similar amounts of under- and over-prediction.

Fig. 1.9. Distribution of standardised residuals
Fig. 1.10 shows the standardised residuals plotted against the predicted values. When the predicted values were relatively low (less than 15 CDDs), the linear regression model tended to estimate the results better (white dots). When the predicted values were between 20 and 50 CDDs, the results tended to be underestimated (dark green). When they were lower than 21 or higher than 67 CDDs, the results tended to be overestimated (dark red).

Fig. 1.10. Standardised residuals versus predicted values

Summary and Assessment of the Results
Table 1.1 presents the summary of the Generalised Linear Regression results; refer to the explanations in Table 1.3. The coefficients of Hot_summer and Tropical_nights were positive, meaning that the predicted COOLING increases when either of them increases. A coefficient is the linear weight of a variable: the change in the predicted COOLING per unit increase in that variable, with the other variables held constant. Although Hot_summer and Tropical_nights had similar coefficients (Table 1.1), their relationships with COOLING differed (Fig. 1.7) because they had different R2 values: R2 measures how closely the data points lie along the regression line. All coefficients were also statistically significant, as indicated by Probability and Robust_Pr (Robust Probability) (see the descriptions in Table 1.3).
The intercept is 74.261215 (first column of Table 1.1), which means that the predicted COOLING is about 74 CDDs when all three predictors are 0.
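The role of the intercept is easiest to see in the prediction equation, which adds the intercept to a weighted sum of the predictors. In the sketch below only the intercept comes from this document. The three coefficient values are hypothetical placeholders, not the values in Table 1.1, and the function name is also an assumption.

```python
INTERCEPT = 74.261215  # intercept reported in this document


def predict_cooling(hot_summer, latitude, tropical_nights, coefficients,
                    intercept=INTERCEPT):
    """Linear prediction: intercept plus one weighted term per predictor."""
    b_hot, b_lat, b_trop = coefficients
    return (intercept
            + b_hot * hot_summer
            + b_lat * latitude
            + b_trop * tropical_nights)


# Hypothetical placeholder coefficients (NOT taken from Table 1.1).
placeholder_coefficients = (10.0, -1.3, 9.0)

# With every predictor at 0, the prediction reduces to the intercept.
baseline = predict_cooling(0, 0, 0, placeholder_coefficients)
print(baseline)
```

If Latitude is expressed in degrees, a value of 0 lies far outside the UK, so the intercept is best read as an anchoring constant for the fitted plane and not as a realistic prediction.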
The Variance Inflation Factor (VIF) measures the amount of multicollinearity among the independent variables. The VIFs were in the range 1.18–1.81 for the three variables (Table 1.1), indicating little multicollinearity (see the notes in Table 1.3).
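The VIF of a predictor is 1 / (1 - R2), where R2 comes from regressing that predictor on all the other predictors. The sketch below applies the formula to the simplest case of exactly two predictors, where that R2 equals the pairwise R2 of 0.42 found in the pre-analysis for Hot_summer and Latitude. The fitted model has three predictors, so the VIFs in Table 1.1 are not expected to match this number exactly. The function name is hypothetical.

```python
def variance_inflation_factor(r2_on_other_predictors):
    """VIF = 1 / (1 - R2), with R2 from regressing one predictor on the rest."""
    return 1.0 / (1.0 - r2_on_other_predictors)


# Two-predictor case: pairwise R2 between Hot_summer and Latitude.
vif_two_predictors = variance_inflation_factor(0.42)
print(round(vif_two_predictors, 2))

# An uncorrelated predictor has the minimum possible VIF of 1.
print(variance_inflation_factor(0.0))
```

A VIF of 1 means no inflation, and the value grows without bound as R2 approaches 1.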
Table 1.1. Summary: Results from the Generalised Linear Regression model (Gaussian model)

The adjusted R-squared value in Table 1.2 represents the proportion of the variance in observed COOLING that the model explains, adjusted for the number of independent variables. At 0.973606 (out of a maximum of 1.0), it shows a strong relationship between observed and predicted COOLING.
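The adjustment penalises R2 for the number of predictors, so adding a weak variable does not automatically improve the score. A minimal sketch of the standard formula follows. The function name and the example inputs are hypothetical and are not the values behind Table 1.2.

```python
def adjusted_r_squared(r2, n_observations, n_predictors):
    """Adjusted R2 = 1 - (1 - R2) * (n - 1) / (n - p - 1)."""
    return 1.0 - (1.0 - r2) * (n_observations - 1) / (n_observations - n_predictors - 1)


# Hypothetical example: R2 = 0.9 from 101 observations and 3 predictors.
example = adjusted_r_squared(0.9, 101, 3)
print(round(example, 4))
```

With many observations and only a few predictors the penalty is small, so the adjusted value stays close to the unadjusted R2.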
Table 1.2. Diagnostics: Generalised Linear Regression results

Table 1.3. Notes

Note that explanatory variable means independent variable.
References
[1] ArcGIS Pro. https://pro.arcgis.com/en/pro-app/latest/get-started/get-started.htm.
[2, 3 and 4] Annual Cooling Degree Days - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=a84ecd403c294caeba395816fa7614ee.
[5] The original data were from the Met Office Climate Data Portal service at
https://climatedataportal.metoffice.gov.uk/ and were quoted from UK Living Atlas at:
https://arch2c8ba6b08cab.maps.arcgis.com/apps/dashboards/82b5dd3298d54c8d9291082de9fc3871.
[6] Annual Count of Hot Summer Days - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=1a89ff97e169482291ed49ff29ce1120.
[7] Annual Count of Tropical Nights - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=910c74d1c87b407cbe0bd36e6c79c954.
[8] Annual Growing Degree Days - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=6e11461360b542b19aba7c480c96ded9.
[9] Annual Count of Icing Days - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=372edf1d69d54cb7908000e01fafdf20.
[10] Annual Count of Frost Days - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=49aaacb0c4ed498bb7c9677798341777.
[11] Annual Heating Degree Days - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=726accfe94f04313a8c2221a73ae865d.
[12] Summer Average Temperature Change - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=7714ab0b35bd43a6a7ffcfe548687ea8.
[13] Winter Average Temperature Change - Projections (12 km): only historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=4baa4ecb3b2942e5a31a244292735373.
[14] Summer Precipitation Change - Projections (12 km): only average historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=e3ae850b0dc04b1883879a6ba66a2b5b.
[15] Winter Precipitation Change - Projections (12 km): only average historical baseline data in the period 2001–2020 were used for this task,
https://www.arcgis.com/home/item.html?id=c6ddd044fb794315a50709caa463202c.
[16] Standardized Residual in Statistics,
https://www.statisticshowto.com/what-is-a-standardized-residuals/#:~:text=The%20standardized%20residual%20is%20a,to%20the%20chi%2Dsquare%20value.