Spatial Data

Spatial statistics is a huge area with many methods. This is a short introduction.

Examples of various maps created in R and published on R-bloggers.

IQ Data

First a simple lm model

We will use country IQ data from Hassall and Sherratt’s (2011) paper. Use this simplified form of the data. Save in your data folder in your R project.

Read in the data:

IQData <- read.table(file = "data/IQData.txt", header = TRUE)
head(IQData)

  Continent      Country IQ  Disease Nutrition Temperature   GDP DistEEA
1    Africa      Algeria 83  1974.29    439.38       13.14  7100 4397.88
2    Africa       Angola 68 19078.39   2142.56       18.79  8400 1154.33
3    Africa        Benin 70 10870.93   1143.10       25.21  1500 2991.75
4    Africa     Botswana 70 32483.12    532.08        3.90 12800 1915.18
5    Africa Burkina_Faso 68 15706.29   1405.28       25.61  1200 3526.84
6    Africa      Burundi 69 18706.93   1439.56       18.92   300  578.49
  Longitude Latitude
1     2.630   28.159
2    17.541  -12.312
3     2.338    9.628
4    23.806  -22.185
5    -1.765   12.265
6    29.942   -3.336

Note that there are countries, their corresponding IQ, possible explanatory variables for IQ and longitude and latitude.

We can run a lm model to see if nutrition and disease predict the IQ of a country.

model1 <- lm(IQ ~ log(Disease) + log(Nutrition), data=IQData)
summary(model1)


Call:
lm(formula = IQ ~ log(Disease) + log(Nutrition), data = IQData)

Residuals:
     Min       1Q   Median       3Q      Max 
-21.0732  -4.1154  -0.1215   3.9238  24.0464 

Coefficients:
               Estimate Std. Error t value Pr(>|t|)    
(Intercept)    128.6854     3.0463  42.244   <2e-16 ***
log(Disease)    -6.3655     0.6525  -9.755   <2e-16 ***
log(Nutrition)   0.6057     1.0594   0.572    0.568    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 6.544 on 171 degrees of freedom
Multiple R-squared:  0.6915,    Adjusted R-squared:  0.6879 
F-statistic: 191.6 on 2 and 171 DF,  p-value: < 2.2e-16

Why disease and nutrition is logged

During data exploration, plots reveal Disease and Nutrition to be skewed:

hist(IQData$Disease)

hist(IQData$Nutrition)

Taking the log of the data makes it more normally distributed:

hist(log(IQData$Disease))

hist(log(IQData$Nutrition))

This may improve the fit of the model.

Challenge

Identify the parts of the summary output that tell you what the effect of Disease and Nutrition is on IQ.

Answer

You could look at the p values under Pr(>|t|). For log(Disease) it is <2e-16 suggesting significance. For log(Nutrition) it is 0.568 suggesting non significance. You may also look at the Estimate and Adjusted R-squared.

So we conclude IQ is predicted by disease but not nutrition! Or is it…? Since this dataset is spatial it is likely to have the problem of autocorrelation!

Assessing Autocorrelation

Spatial autocorrelation can be measured using Moran’s I (I the letter not 1 the number).

Correlograms

We can use the function spline.correlog() in the package ncf to create a correlogram. This helps us decide if we have autocorrelation.

Install and library load the ncf package.

install.packages("ncf")

library(ncf)

Make an object called correlog_object containing information to plot a correlogram.

correlog_object <- spline.correlog(IQData$Longitude, IQData$Latitude, residuals(model1), resamp=0, latlon=TRUE)
plot(correlog_object)

Explanation of code

spine.correlog() is the function
IQData$Longitude and IQData$Latitude are the numbers the function used to calculate distances between countries
residuals(model1) is a list of the residuals from the lm model. This represents the data.
resamp=0 tells R to do the calculation once. Try increasing this number (for example, resamp = 100) and plotting to see a confidence interval.
latlon=TRUE tells R the co-ordinates are latitude and longitude.
plot(correlog_object) passes the newly created object through the plot() function to make a correlogram graph.

Interpreting the correlogram

On the x axis, are the distances between countries (calculated from latitude and longitude). Moran’s I values are on the y axis labelled Correlation. There are high values (positive ad negative) if the data is very similar, low values closer to 0 if it is not.

there is no spatial autocorrelation if the values of Moran’s I varies no matter what the distance
there is spatial autocorrelation if Moran’s I values are higher for countries closer together (i.e. lower distance apart).

We possibly have spatial autocorrelation! (Moran’s I values are high on the left of the graph.)

Adding the error to the graph (by doing iterations with resamp=10) helps us determine where Moran’s I values really do differ from 0.

correlog_object <- spline.correlog(IQData$Longitude, IQData$Latitude, residuals(model1), resamp=10, latlon=TRUE)

plot(correlog_object)

The error on the left of the graph which represents countries close together still suggests correlation.

moran.test

In addition we can establish if there is autocorrelation by running the moran.test() in package spdep. It gives us a p value that suggests if the data is randomly dispersed.

Load the package

library(spdep)

First we must create a matrix of distances and weight them:

IQnb <- dnearneigh(as.matrix(IQData[9:10]), 0, 10000,longlat=TRUE) 

IQlistw <- nb2listw(IQnb)

Explanation of code

dnearneigh() is a function
as.matrix means our object will be a matrix
IQData[9:10] specifies columns 9 and 10 where the longitude and latitude values are.
0, 10000 specify the lower and upper bounds of the distance class of interest (1 – 10000km is nearly global)
IQlistw is an object we create ready to pass through the moran.test()
nb2listw is a function that turns our neighbourhood IQnb object into a weighted list

Now perform the moran test:

result <- moran.test(residuals(model1), listw=IQlistw) 
result


    Moran I test under randomisation

data:  residuals(model1)  
weights: IQlistw    

Moran I statistic standard deviate = 11.706, p-value < 2.2e-16
alternative hypothesis: greater
sample estimates:
Moran I statistic       Expectation          Variance 
     6.690929e-02     -5.780347e-03      3.855867e-05

The p value suggests there is a difference in the Moran I’s values as the distances change. We use this result along with the correlogram to decide that there is autocorrelation in this case.

Mapping the Data

Maps may help us understand why data is autocorrelated.

Challenge

Install and library load the package ggplot2 and ggmap containing the world dataset which is the coordinates of a simple world map. Then pass the world dataset through the map_data function and assign it as an object that you call world.

Answer

install.packages("ggmap")

library(ggmap) 
world <- map_data("world")

Now plot world using ggplot.

worldmap <- ggplot(world, aes(x=long, y=lat, group=group)) +
  geom_path()
worldmap

We can plot the residuals of our model1 on top of this simple map of the world, to see where the spatial autocorrelation occurs.

worldmap + 
  geom_point(data = IQData, aes(Longitude, Latitude, color = ifelse(resid(model1) >= 0, "blue", "red"),
                                         group = Continent, 
                                         size=abs(resid(model1))/max(resid(model1))*3)) +
  theme(legend.position = "none")

Explanation of code

aes(Longitude, Latitude, the position of the residuals. Here given as longitude and latititude.
color = tells what colours to make the points. Here positive and negative residuals are blue and red respectively. Uses the ifelse function.
size= tells R how big to make the circles. Here we use the proportional size of the residuals relative to the the biggest (max) residual.

The large blue circles on the map are where the countries are most similar. There is also negative correlation where the big red circles are.

Another way to visualise why there is a autocorrelation problem, is to plot IQ and disease according to the continent of the country (similar to Figure 1 from Hassall and Sherratt, 2011):

ggplot(IQData, aes(x=log(Disease), y=IQ, group = Continent)) + 
geom_point(aes(color=Continent))

You can see the data clusters by continent.

Models for Spatial Data

Having established that we need to account for the spatial autocorrelation, we can run a model. However, there are many models to choose from.

As an example, we will use a GLS (Generalized Least Squares) model to show how spatial data is included in the model code to account for spatial autocorrelation.

GLS models

The function gls() in the nlme package will run a GLS model.

library(nlme)

gls() uses similar code to lm but the argument correlation= allows the latitude and longitude data to be included. R will automatically create a spatial correlation structure to include in the model using the longitude and latitude values.

model2 <- gls(IQ ~ log(Disease) + log(Nutrition), data=IQData, correlation=corExp(form=~Longitude+Latitude)) 
summary(model2)

In the correlation argument we have specified the spatial structure of our data to be corExp in the correlation= argument. However, there are many options with corExp, corGaus, corLin, corRatio, and corSpher most commonly used for spatial autocorrelation.

Evaluating models with AICs

One way we could choose which spatial structure option to use, is to run many models with the different options and evaluate the fit of the models. We can evaluate them using an AIC value (this can be used on all sorts of models not just gls models).

Run the models with the different structures corExp, corGaus, corLin, corRatio, and corSpher:

model_corExp <- gls(IQ ~ log(Disease) + log(Nutrition), data=IQData, correlation=corExp(form=~Longitude+Latitude)) 
model_corGaus <- gls(IQ ~ log(Disease) + log(Nutrition), data=IQData, correlation=corGaus(form=~Longitude+Latitude)) 
model_corLin <- gls(IQ ~ log(Disease) + log(Nutrition), data=IQData, correlation=corLin(form=~Longitude+Latitude)) 
model_corRatio <- gls(IQ ~ log(Disease) + log(Nutrition), data=IQData, correlation=corRatio(form=~Longitude+Latitude))
model_corSpher <- gls(IQ ~ log(Disease) + log(Nutrition), data=IQData, correlation=corSpher(form=~Longitude+Latitude))

View the AIC values:

AIC(model_corExp, model_corGaus, model_corLin, model_corRatio, model_corSpher)

               df      AIC
model_corExp    5 1019.224
model_corGaus   5 1089.618
model_corLin    5 1031.139
model_corRatio  5 1054.716
model_corSpher  5 1031.139

model_corExp has the lowest AIC value so it fits the data best.

Challenge

View the output of the model by passing it through summary().

Challenge

Compare the results of this model where spatial autocorrelation is accounted for with the original lm model1 where it was not. What are the differences?

Answer

summary(model1)

Only Disease was significant in the lm model. Now we conclude Disease and Nutrition is significant when autocorrelation is accounted for in the model.

Cleaning seabirds

Challenge

It is useful to learn how to download datasets from various sources. Try downloading the seabird data from Kaggle

If you get stuck you can use this source but it only has a subset of the data - seabird dataset

This data is quite large. In this case you might find it easier to make a data frame only containing the variables you need lat, lon, species, max_depth.m.

The data also needs cleaning. Practice renaming the variable max_depth.m to maxDepth.

Now run an analysis to understand if the data is spatially autocorrelated.

Then decide if you need to control for spatial autocorrelation when you test the effect of the species on the maxDepth dived.

References

Hassall C, Sherratt TN (2011) Statistical inference and spatial patterns in correlates of IQ. Intelligence, 39, 303-310.

Adapted from Christopher Hassall’s Introduction to Spatial Statistics and Mike Treglia’s Landscape Analysis and Modeling