Overview & objectives

  • This post goes through how to simulate occupancy data given two inputs: ψ, the occupancy probability, and p, the detection probability.
  • The objectives of this post are to demonstrate how to simulate data for
    • sites with homogeneous ψ and p
    • sites with homogeneous ψ and heterogeneous p
    • sites with homogeneous ψ and heterogeneous p as a function of a covariate
    • sites with heterogeneous ψ and heterogeneous p as a function of a covariate
  • R script for this document

Preliminaries

Reproducibility

First we are going to set a seed so that the simulations done here match the results you get when you run the code.

set.seed(8675309)

Modeling probabilities

Probabilities are constrained between 0 and 1, which can be problematic when you try to model them with a linear model. If you model a probability using a linear model that assumes a normal distribution you can end up with values that are less than 0 or greater than 1, which is a major problem! To simulate a process and estimate parameters there are 3 parts: 1) a predictive model, 2) observations of some phenomenon, and 3) a statistical model that links the observations as a probabilistic outcome of the predictive model.

Let’s start with a simple linear regression where the outcomes of the predictive model are assumed to be normally distributed. Mathematically we specify the predictive model as

Ŷ = β0

where Ŷ is the predicted value and β0 is the intercept. Next is the statistical model, which we are assuming is normally distributed and links the predictions to the observations probabilistically as

Y ∼ Normal(Ŷ, σ)

where Y is an observation, Ŷ is the predicted value, and σ is the amount of uncertainty around the prediction, characterized as the standard deviation of a normal distribution. One thing to note here is the use of ∼ instead of = because it specifies that the observations Y are distributed as some statistical distribution related to the predictions and the associated uncertainty.

It is useful to visualize the 3 parts with some code. First we will set up the predictive model, then we will generate some realizations from the statistical model.

# predictive model
beta0<-10
# uncertainty
sigma<-10
# number of outcomes to simulate
N<-1000
# the prediction, using a design matrix
dat<- data.frame(site=rep(1,N))
design_matrix<-model.matrix(~1,dat)
Y_hat<- design_matrix %*% beta0
# some outcomes given the inputs
Y<- rnorm(N,Y_hat,sigma)

Then we can look at the simulations.

hist(Y,xlab="Values")

The simulations look good: a normal distribution, as we expect. One thing I sidestepped is that there is a link function that links the predicted values to the statistical model. In the case of the normal distribution the link function is the identity function as

Identity(Ŷ) = β0

where the identity function simply returns the value Ŷ. The link function is important because it allows us to work with linear models and exploit the convenience of design matrices, and then simply use the link function to transform the prediction to suit our needs.
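As a tiny illustration (just base R, using the Y_hat from the block above), the identity link leaves the prediction unchanged, so the linear predictor from the design matrix is used directly:

# the identity link simply returns the linear predictor unchanged
head(identity(Y_hat))
head(Y_hat)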

Before I get too much farther, let’s look at the convenience of using design matrices real quick. The design matrix includes an intercept, any factors (which are dummy coded, i.e., 0s and 1s), and continuous covariates. In the code below I am going to add a factor that has 3 levels and a continuous covariate.

# uncertainty
sigma<-10
# number of outcomes to simulate
N<-1000
# the prediction, using a design matrix
dat<- data.frame(site=rep(1,N),
    type=sample(c("low","mod","high"),N,replace=TRUE),
    cov=runif(N,-2,2))
dat$type<-as.factor(dat$type)

Using the data.frame dat we can make the design matrix and take a look at the first 10 rows.

design_matrix<-model.matrix(~type+cov,dat)
head(design_matrix,10)

##    (Intercept) typelow typemod        cov
## 1            1       0       0  1.0558865
## 2            1       0       1 -0.6921802
## 3            1       0       1 -1.8767693
## 4            1       1       0 -1.2489723
## 5            1       0       0 -1.8964887
## 6            1       0       1 -1.8850874
## 7            1       1       0 -0.8149601
## 8            1       0       0 -0.9618601
## 9            1       1       0  0.3840910
## 10           1       0       0 -1.7563359

Notice the first column is all 1s for the intercept, the next 2 columns are 0s and 1s that index whether the row was type low or type mod (type high is the reference level), and the last column is the covariate. This is where the design matrix becomes useful: with some matrix multiplication it is very easy to calculate the predicted values. This will also come in handy in JAGS later in terms of providing some flexibility in specifying the prediction model. The code below takes a vector of β and calculates the predicted values. Note the number of values in the β vector is the same as the number of columns in the design matrix.

betas<-c(10,0.2,-0.5,-0.3)
Y_hat<- design_matrix %*% betas
# some outcomes given the inputs
Y<- rnorm(N,Y_hat,sigma)

The vector β includes the intercept, which is the predicted value when type = high (the reference level) and the covariate = 0. The remaining elements of the vector are the effects of type = low, type = mod, and the covariate, where the last 2 have a negative effect.
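One way to check this interpretation (a quick sketch using a small data frame I am calling newdat, with the betas and factor levels from above) is to compute the predicted value at cov = 0 for each level of type:

# predicted values at cov = 0 for each level of type
newdat<- data.frame(type=factor(c("high","low","mod")), cov=0)
model.matrix(~type+cov,newdat) %*% betas

The first value is the intercept for type = high, and the other two are the intercept plus the type = low and type = mod effects.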

Now that the link between an observation and a predictive model is made through a statistical model, we can return to the occupancy simulation, where the outcomes are occupied or not and detected or not, coded as 0s and 1s. The first thing we need to do is think about the observed data, which is either a 0 or a 1. There are 2 statistical distributions where the outcomes can be 0 or 1: the Bernoulli and the binomial. The binomial distribution is a series of successive Bernoulli trials, but when you set the number of trials in the binomial to 1 you get a Bernoulli outcome of 0 or 1. The input to a Bernoulli or a binomial is a probability, which is usually what we are interested in estimating. So the observation model is

Y ∼ Bernoulli(p) or Y ∼ Binomial(1, p).
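A quick way to convince yourself that a Bernoulli is just a binomial with 1 trial (a small aside using rbinom, where the size argument is the number of trials):

# 10 Bernoulli draws, i.e., binomial draws with size = 1
rbinom(10, size=1, prob=0.5)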

As we saw with the normal distribution, the prediction model is linked to the statistical model by a link function. When working with probabilities the logit is the go-to link; it connects a probability to a continuous linear predictor that can be negative or positive as

logit(p) = β0.

The code below simulates the occupancy state for 1000 sites and assumes the occupancy probability is homogeneous (i.e., the same for all sites). Note in the code below the plogis function is the inverse logit, which back-transforms the linear predictor to a probability, and the outcomes are simulated by the rbinom function.

# predictive model
beta0<-0.2
# number of outcomes to simulate
N<-1000
# the prediction, using a design matrix
dat<- data.frame(site=rep(1,N))
design_matrix<-model.matrix(~1,dat)
Y_hat<- design_matrix %*% beta0
# back-transform the linear predictor to an occupancy probability
psi<-plogis(Y_hat)
# some outcomes given the inputs
Y<- rbinom(N,1,psi)

The outcomes from the binomial are 0s and 1s and the frequencies are below.

table(Y)

## Y
##   0   1 
## 449 551
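As a quick check (a small addition using plogis and the simulated Y from above), we can compare the probability implied by β0 to the realized proportion of 1s:

# inverse logit of beta0 versus the realized proportion of occupied sites
plogis(0.2)
mean(Y)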

The value of β0 was 0.2, which is about 0.55 when you apply the inverse logit transformation. We can use the same approach to simulate some data where the occupancy probability is heterogeneous among sites.

# number of outcomes to simulate
N<-1000
# the prediction, using a design matrix
dat<- data.frame(site=rep(1,N),
    type=sample(c("low","mod","high"),N,replace=TRUE),
    cov=runif(N,-2,2))
dat$type<-as.factor(dat$type)
design_matrix<-model.matrix(~type+cov,dat)
betas<-c(0.2,0.2,-0.5,-0.3)
Y_hat<- design_matrix %*% betas
# site-specific occupancy probabilities
psi<-plogis(Y_hat)
# some outcomes given the inputs
Y<- rbinom(N,1,psi)

The outcomes from the binomial are 0s and 1s and the frequencies are below.

table(Y)

## Y
##   0   1 
## 483 517

The frequency of 1s is not quite 0.55 because the occupancy probability now varies among sites, as illustrated below.

hist(psi)
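To put a number on that heterogeneity (a small addition using the psi and Y simulated above), we can compare the average occupancy probability across sites to the realized proportion of occupied sites:

# average occupancy probability across sites versus the realized proportion occupied
mean(psi)
mean(Y)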