Estimates the generation/serial interval distribution

The function estimateSI uses the relative transmission probabilities to estimate the generation/serial interval distribution assuming a gamma distribution. It uses the PEM algorithm developed by Hens et al. 2012 extending their method to include restricting analysis to the top cluster of possible infectors.

estimateSI(
  df,
  indIDVar,
  timeDiffVar,
  pVar,
  clustMethod = c("none", "n", "kd", "hc_absolute", "hc_relative"),
  cutoffs = NULL,
  initialPars,
  shift = 0,
  epsilon = 1e-05,
  bootSamples = 0,
  alpha = 0.05,
  progressBar = TRUE
)

Arguments

df: The name of the dateset with transmission probabilities (column pVar), individual IDs (columns <indIDVar>.1 and <indIDVar>.2), and difference in time between the pair of cases (column timeDiffVar)
indIDVar: The name (in quotes) of the individual ID columns (data frame df must have variables called <indIDVar>.1 and <indIDVar>.2).
timeDiffVar: The name (in quotes) of the column with the difference in time between infection (generation interval) or symptom onset (serial interval) for the pair of cases. The units of this variable (hours, days, years) defines the units of the resulting distribution.
pVar: The column name (in quotes) of the transmission probabilities.
clustMethod: The method used to cluster the infectors; one of "none", "n", "kd", "hc_absolute", "hc_relative" where "none" or not specifying a value means use all pairs with no clustering (see clusterInfectors for detials on clustering methods).
cutoffs: A vector of cutoffs for clustering (see clusterInfectors). If more than one cutoff is provided, a pooled estimate will also be provided.
initialPars: A vector of length two with the shape and scale to initialize the gamma distribution parameters.
shift: A value in the same units as timeDiffVar that the gamma distribution should be shifted. The default value of 0 is an unmodifed gamma distribution.
epsilon: The difference between successive estimates of the shape and scale parameters that indicates convergence.
bootSamples: The number of bootstrap samples; if 0, then no confidence intervals are calculated.
alpha: The alpha level for the confidence intervals.
progressBar: A logical indicating if a progress bar should be printed (default is TRUE).

Value

A data frame with one row and the following columns:

nIndividuals - the number of infectees who have intervals included in the estimate.
pCluster - the proportion of cases who have intervals included in the estimate.
nInfectors - the average number of infectors in the top cluster.
shape - the shape of the estimated gamma distribution for the interval
scale - the scale of the estimated gamma distribution for the interval
meanSI - the mean of the estimated gamma distribution for the interval (shape * scale + shift)
medianSI - the median of the estimated gamma distribution for the interval (qgamma(0.5, shape, scale) + shift))
sdSI - the standard deviation of the estimated gamma distribution for the interval (shape * scale ^ 2)

If bootSamples > 0, then the data frame also includes the following columns:

shapeCILB and shapeCIUB - lower bound and upper bounds of the bootstrap confidence interval for the shape parameter
scaleCILB and scaleCIUB - lower bound and upper bounds of the bootstrap confidence interval for the scale parameter
meanCILB and meanCIUB - lower bound and upper bounds of the bootstrap confidence interval for the mean of the interval distribution
medianCILB and medianCIUB - lower bound and upper bounds of the bootstrap confidence interval for the median of the interval distribution
sdCILB and sdCIUB - lower bound and upper bounds of the bootstrap confidence interval for the sd of the interval distribution

Details

The PEM algorithm uses the prior probability that each pair is connected by direct transmission to estimate the generation/serial interval using estimation maximization. This code will provide an estimate of the generation interval if timeDiffVar represents the difference in infection dates and the serial interval if it represents the difference in symptom onset dates. The current generation/serial interval distribution parameters are used to update the probabilities which are then used to update the generation/serial interval distribution parameters. The process continues until the parameters converge (indicated by a change of less than epsilon) between iterations. Note: time difference between pairs should not be used to estimate the probabilities

This function acts as a wrapper around performPEM which integrates estimation of the generation/serial interval distribution with clustering the infectors and calculates derived parameters (mean, median, sd) of the distribution. Generally, this function should be called instead of performPEM directly.

All pairs of cases can be used in the estimation process by setting clustMethod = "none". However, if the probabilities are from a algorithm such as nbProbabilities, then it is recommeneded to use a clustering method and only include the top cluster of infectors for infectees which have such a cluster. This can be specified by using the clustMethod and cutoff arguments which are passed into clusterInfectors. See the details of this function for a description of the different clustering methods.

The method can be performed with any generation/serial interval distribution, but this version of this function assumes that the generation/serial interval has a gamma distribution. The function does allow for a shifted gamma distribution. The shift argument defines how much the gamma distribution should be shifted. Any observed generation/serial intervals that are less than this shift will have probability 0. This parameter should be used if there is a clinical lower bound for the possible generation/serial interval. If this argument is not specified then an unmodified gamma function is used. The units of the estimated gamma distribution will be defined by the units of the provided <timeDiffVar> column. The value of the shift should be in the same units.

The algorithm requires initial parameters which should be specified as a vector: c(<shape>, <scale>). These parameters should result in a gamma distribution that is on the desired scale, set by the <timeDiffVar> column.

If bootSamples > 0, bootstrap confidence intervals will be estimated for both the shape and scale parameters as well as the mean, median, and mode of the distribution using cluster bootstrapping.

References

Hens N, Calatayud L, Kurkela S, Tamme T, Wallinga J. Robust reconstruction and analysis of outbreak data: influenza A (H1N1) v transmission in a school-based population. American Journal of Epidemiology. 2012 Jul 12;176(3):196-203.

Examples


## First, run the algorithm without including time as a covariate.
orderedPair <- pairData[pairData$infectionDiffY > 0, ]

## Create a variable called snpClose that will define probable links
# (<3 SNPs) and nonlinks (>12 SNPs) all pairs with between 2-12 SNPs
# will not be used to train.
orderedPair$snpClose <- ifelse(orderedPair$snpDist < 3, TRUE,
                        ifelse(orderedPair$snpDist > 12, FALSE, NA))
table(orderedPair$snpClose)
#> 
#> FALSE  TRUE 
#>   881   246 

## Running the algorithm
# NOTE should run with nReps > 1.
resGen <- nbProbabilities(orderedPair = orderedPair,
                            indIDVar = "individualID",
                            pairIDVar = "pairID",
                            goldStdVar = "snpClose",
                            covariates = c("Z1", "Z2", "Z3", "Z4"),
                            label = "SNPs", l = 1,
                            n = 10, m = 1, nReps = 1)
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
                            
## Merging the probabilities back with the pair-level data
nbResultsNoT <- merge(resGen[[1]], orderedPair, by = "pairID", all = TRUE)

## Estimating the serial interval

# Using hierarchical clustering with a 0.05 absolute difference cutoff
estimateSI(nbResultsNoT, indIDVar = "individualID",
             timeDiffVar = "infectionDiffY", pVar = "pScaled",
             clustMethod = "hc_absolute", cutoff = 0.05, initialPars = c(2, 2))
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
#>   clustMethod cutoff nIndividuals nInfectors  pCluster    shape    scale
#> 1 hc_absolute   0.05           78   1.012821 0.7878788 1.190383 1.667412
#>     meanSI medianSI     sdSI
#> 1 1.984859 1.464863 1.819225
             
# \donttest{
# Using all pairs
estimateSI(nbResultsNoT, indIDVar = "individualID",
              timeDiffVar = "infectionDiffY", pVar = "pScaled",
              clustMethod = "none", initialPars = c(2, 2))
#>   clustMethod cutoff nIndividuals nInfectors pCluster    shape     scale meanSI
#> 1        none   none           99    49.9899        1 1.435021 0.7401288 1.0621
#>   medianSI      sdSI
#> 1 0.828117 0.8866176


# # Using a shifted gamma distribution:
# # not allowing serial intervals of less than 3 months (0.25 years)
estimateSI(nbResultsNoT, indIDVar = "individualID",
              timeDiffVar = "infectionDiffY", pVar = "pScaled",
              clustMethod = "hc_absolute", cutoff = 0.05,
              initialPars = c(2, 2), shift = 0.25)
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
#>   clustMethod cutoff nIndividuals nInfectors  pCluster    shape    scale
#> 1 hc_absolute   0.05           78   1.012821 0.7878788 1.273929 1.549818
#>     meanSI medianSI     sdSI
#> 1 2.224358 1.738445 1.749256


# # Using multiple cutoffs
estimateSI(nbResultsNoT, indIDVar = "individualID",
              timeDiffVar = "infectionDiffY", pVar = "pScaled",
              clustMethod = "hc_absolute", cutoff = c(0.025, 0.05), initialPars = c(2, 2))
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================================================| 100%
#>   clustMethod cutoff nIndividuals nInfectors  pCluster    shape    scale
#> 1 hc_absolute  0.025           84   1.142857 0.8484848 1.165297 1.702207
#> 2 hc_absolute   0.05           78   1.012821 0.7878788 1.190383 1.667412
#> 3 hc_absolute pooled           NA         NA        NA       NA       NA
#>     meanSI medianSI     sdSI
#> 1 1.983576 1.453673 1.837514
#> 2 1.984859 1.464863 1.819225
#> 3 1.984218 1.459268 1.828369
# }


## Adding confidence intervals
# NOTE should run with bootSamples > 2.
estimateSI(nbResultsNoT, indIDVar = "individualID",
             timeDiffVar = "infectionDiffY", pVar = "pScaled",
             clustMethod = "hc_absolute", cutoff = 0.05,
             initialPars = c(2, 2), shift = 0.25, bootSamples = 2)
#> 
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |======================================================================| 100%
  |                                                                            
  |                                                                      |   0%
  |                                                                            
  |===================================                                   |  50%
  |                                                                            
  |======================================================================| 100%
#>   clustMethod cutoff nIndividuals nInfectors  pCluster    shape    scale
#> 1 hc_absolute   0.05           78   1.012821 0.7878788 1.273929 1.549818
#>     meanSI medianSI     sdSI shapeCILB shapeCIUB scaleCILB scaleCIUB meanCILB
#> 1 2.224358 1.738445 1.749256  1.160396  1.408723  1.188316   1.63125 2.024435
#>   meanCIUB medianCILB medianCIUB   sdCILB  sdCIUB
#> 1 2.164357   1.646329   1.655623 1.460173 1.77042