The function nbProbabilities uses naive Bayes and an iterative estimation procedure to estimate relative transmission probabilities.

nbProbabilities(
  orderedPair,
  indIDVar,
  pairIDVar,
  goldStdVar,
  covariates,
  label = "",
  l = 1,
  n = 10,
  m = 1,
  nReps = 10,
  orType = "univariate",
  nBS = 100,
  pSampled = 1,
  progressBar = TRUE
)

Arguments

orderedPair

The name of the ordered pair-level dataset with the covariates.

indIDVar

The name (in quotes) of the column with the individual ID. The data frame orderedPair must have columns called <indIDVar>.1 and <indIDVar>.2.

pairIDVar

The name (in quotes) of the column with the unique pair ID variable.

goldStdVar

The name (in quotes) of the column with a logical vector defining training links (TRUE) and non-links (FALSE), with NA for pairs that cannot be used to train.

covariates

A character vector containing the covariate column names (in quotes). All covariates need to be categorical factor variables.

label

An optional label string for the run.

l

Laplace smoothing parameter that is added to each cell.

n

The number of folds for nxm cross validation (should be at least 10).

m

The number of times to create n folds in nxm cross validation.

nReps

The number of times to randomly select the "true" infector (should be at least 10).

orType

Takes value "univariate" or "adjusted". "univariate" produces contingency table odds ratios and "adjusted" produces adjusted odds ratios from a bootstrapped multivariable logistic regression.

nBS

Number of bootstrap samples to run in each cross-validation fold/iteration (default is 100). Only relevant when orType = "adjusted".

pSampled

Proportion of unlinked cases to include in the bootstrap sample (default is 1, i.e., a true bootstrap). Only relevant when orType = "adjusted".

progressBar

A logical indicating if a progress bar should be printed (default is TRUE).

Value

List containing two data frames:

  1. probabilities - a data frame of transmission probabilities. Column names:

    • label - the optional label of the run.

    • <pairIDVar> - the pair ID with the name specified.

    • pAvg - the mean transmission probability for the pair over all iterations.

    • pSD - the standard deviation of the transmission probability for the pair over all iterations.

    • pScaled - the mean relative transmission probability for the pair over all iterations: pAvg scaled so that the probabilities for all infectors per infectee add to 1.

    • pRank - the rank of the probability of the pair out of all pairs for that infectee (in case of ties, all values have the minimum rank of the group).

    • nEstimates - the number of probability estimates that contributed to pAvg. This represents the number of prediction datasets this pair was included in over the nxm cross prediction repeated nReps times.

  2. estimates - a data frame with the contribution of covariates. Column names:

    • label - the optional label of the run

    • level - the covariate name and level

    • nIter - the number of iterations included in the estimates: n*m*nReps

    • logorMean - the mean value of the log odds ratio across iterations

    • logorSE - the standard error of the log odds ratio across iterations

    • logorCILB - the lower bound of the 95% confidence interval of the log odds ratio across iterations

    • logorCIUB - the upper bound of the 95% confidence interval of the log odds ratio across iterations
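
The pScaled and pRank columns of the probabilities data frame can be derived from pAvg with a few lines of base R. The following is a minimal sketch on a made-up data frame (the infectee column and all values are hypothetical, chosen only for illustration); it is not the package's internal code.

```r
# Toy probabilities for two infectees (hypothetical values for illustration)
probs <- data.frame(
  pairID   = c("a_c", "b_c", "d_c", "a_e", "b_e"),
  infectee = c("c", "c", "c", "e", "e"),
  pAvg     = c(0.2, 0.2, 0.1, 0.3, 0.1)
)

# pScaled: pAvg divided by the sum of pAvg over all infectors of the same infectee
probs$pScaled <- probs$pAvg / ave(probs$pAvg, probs$infectee, FUN = sum)

# pRank: rank within infectee, highest probability first;
# tied values share the minimum rank of their group
probs$pRank <- ave(-probs$pAvg, probs$infectee,
                   FUN = function(x) rank(x, ties.method = "min"))

probs
```

With ties.method = "min", the two tied pairs for infectee "c" both receive rank 1 and the next pair receives rank 3, which matches the tie handling described for pRank.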

Details

This algorithm takes a dataset of ordered possible infector-infectee pairs in an infectious disease outbreak or cluster and estimates the relative probability the cases are linked by direct transmission using a classification technique called naive Bayes (NB). NB is a simple machine learning algorithm that uses Bayes rule to estimate the probability of an outcome in a prediction dataset given a set of covariates from the observed frequencies in a training dataset.
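
To make the naive Bayes step concrete, the sketch below estimates the probability of a link for a single new pair from the observed frequencies in a toy training set. The data, the covariate names Z1 and Z2, and the helper condProb are hypothetical, and the way the smoothing value l is applied (added to every cell of each frequency table) follows the description of the l argument rather than the package source.

```r
# Toy training set: logical outcome and two categorical covariates (made-up data)
train <- data.frame(
  linked = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  Z1 = factor(c("same", "same", "diff", "diff", "diff", "same", "diff", "diff")),
  Z2 = factor(c("near", "near", "near", "far", "far", "far", "near", "far"))
)
l <- 1  # Laplace smoothing value added to each cell of the frequency tables

# P(covariate level | outcome), with l added to every cell before normalising
# (condProb is a hypothetical helper, not a package function)
condProb <- function(x, y, l) {
  tab <- table(x, y) + l
  sweep(tab, 2, colSums(tab), "/")
}
pZ1 <- condProb(train$Z1, train$linked, l)
pZ2 <- condProb(train$Z2, train$linked, l)

# Prior probability of a link in the training data
prior <- mean(train$linked)

# Posterior for a new pair with Z1 = "same" and Z2 = "near", by Bayes rule
# under the naive assumption that covariates are independent given the outcome
numTrue  <- prior * pZ1["same", "TRUE"] * pZ2["near", "TRUE"]
numFalse <- (1 - prior) * pZ1["same", "FALSE"] * pZ2["near", "FALSE"]
pLink <- numTrue / (numTrue + numFalse)
pLink
```

Without smoothing, the empty cell for Z2 = "far" among links would give a conditional probability of exactly zero; adding l to each cell avoids that.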

The input dataset - orderedPair - should represent ordered pairs of cases (where the potential infector was observed before the infectee) and have a unique identifier for each pair (pairIDVar) as well as the individual IDs that are included in the pair (<indIDVar>.1 and <indIDVar>.2). If cases are concurrent (meaning the order cannot be determined), both orders can be included.

A subset of pairs should also have pathogen WGS, contact investigation, or some other 'gold standard' defined by goldStdVar, which should be a logical vector with TRUE indicating links, FALSE indicating non-links, and NA if the pair cannot be used to train (it does not have the information or is indeterminate). These pairs will be used to create a training dataset of probable links and non-links. The covariates can be any categorical variables and could represent spatial, clinical, demographic, and temporal characteristics of the case pair.
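
Because all covariates must be categorical factors, continuous or character columns need to be converted before calling the function. The sketch below shows one way to do this in base R; the data frame, the column names timeDiffY and sameCity, and the cut points are all hypothetical.

```r
# Hypothetical pair-level data with one continuous and one character covariate
pairs <- data.frame(
  pairID    = c("1_2", "1_3", "2_3", "2_4"),
  timeDiffY = c(0.5, 1.8, 3.2, 6.0),
  sameCity  = c("yes", "no", "yes", "no")
)

# Bin the continuous time difference into categories (cut points are arbitrary)
pairs$timeCat <- cut(pairs$timeDiffY,
                     breaks = c(0, 1, 2, 5, Inf),
                     labels = c("0-1y", "1-2y", "2-5y", ">5y"),
                     include.lowest = TRUE)

# Convert the character covariate to a factor with an explicit level order
pairs$sameCity <- factor(pairs$sameCity, levels = c("no", "yes"))

str(pairs[, c("timeCat", "sameCity")])
```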

Because the outcomes in the training set represent probable and not certain transmission events and a given case could have multiple probable infectors, the algorithm uses an iterative estimation procedure. This procedure randomly chooses one link of all of the possible links to include in the training dataset nReps times, and then uses nxm cross prediction to give all pairs a turn in the prediction dataset.
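
The structure of this iterative procedure can be sketched as two nested loops. The code below is an illustration on made-up data with m = 1, not the package's implementation: in particular, setting unselected probable links to NA and assigning folds at the pair level are assumptions made for this sketch, and the model fitting step is left as a comment.

```r
set.seed(1)

# Toy ordered pairs: infectees "c" and "d" each have two probable infectors
pairs <- data.frame(
  pairID   = c("a_c", "b_c", "a_d", "b_d", "c_d", "a_b"),
  infectee = c("c", "c", "d", "d", "d", "b"),
  goldStd  = c(TRUE, TRUE, TRUE, FALSE, TRUE, FALSE)
)
nReps <- 3
n <- 3  # number of folds (kept small for the toy data)

nEstimates <- integer(nrow(pairs))  # times each pair was in a prediction set
nChosen <- integer(nReps)           # links kept for training in each repetition

for (r in seq_len(nReps)) {
  # Keep one randomly chosen probable link per infectee
  linkRows <- which(pairs$goldStd %in% TRUE)
  chosen <- tapply(linkRows, pairs$infectee[linkRows], function(i) {
    i[sample.int(length(i), 1)]
  })
  nChosen[r] <- length(chosen)

  trainLabel <- pairs$goldStd
  # Assumption: unselected probable links are left out of training (NA)
  trainLabel[setdiff(linkRows, chosen)] <- NA

  # Randomly assign every pair to one of n folds
  fold <- sample(rep(seq_len(n), length.out = nrow(pairs)))
  for (k in seq_len(n)) {
    predRows  <- which(fold == k)
    trainRows <- which(fold != k & !is.na(trainLabel))
    # ... fit naive Bayes on trainRows and predict for predRows here
    nEstimates[predRows] <- nEstimates[predRows] + 1L
  }
}

nEstimates
```

In this sketch every pair appears in exactly one prediction fold per repetition, so each pair accumulates nReps estimates, which would then be averaged to give a value like pAvg.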

The output of this function is a list of two data frames: one with the estimates of the transmission probabilities (probabilities) and the other with the contribution of the covariates to the probabilities in the form of odds ratios (estimates). The 95% confidence intervals reported for these odds ratios use Rubin's Rules, a technique developed for multiple imputation, to pool the error across all iterations.
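
As an illustration of pooling with Rubin's Rules, the sketch below combines log odds ratios and their variances from several iterations into one estimate and a 95% confidence interval. The input values are made up, and the sketch uses the classical Rubin degrees of freedom; whether and how the package applies the small-sample adjustment from the cited reference is not shown here.

```r
# Made-up log odds ratios and their variances from five iterations
logOR    <- c(0.82, 0.95, 0.78, 1.02, 0.88)
varLogOR <- c(0.040, 0.050, 0.045, 0.038, 0.042)
nIter <- length(logOR)

pooledMean <- mean(logOR)      # pooled point estimate
withinVar  <- mean(varLogOR)   # average within-iteration variance
betweenVar <- var(logOR)       # variance of the estimates across iterations
totalVar   <- withinVar + (1 + 1 / nIter) * betweenVar
pooledSE   <- sqrt(totalVar)

# Classical Rubin degrees of freedom (no small-sample adjustment in this sketch)
df <- (nIter - 1) * (1 + withinVar / ((1 + 1 / nIter) * betweenVar))^2

# 95% confidence interval on the log scale, then exponentiated to odds ratios
ci <- pooledMean + c(-1, 1) * qt(0.975, df) * pooledSE
exp(c(OR = pooledMean, lower = ci[1], upper = ci[2]))
```

The total variance is always at least as large as the average within-iteration variance, so the pooled interval reflects both the uncertainty within each iteration and the variation between iterations.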

This function generates odds ratios describing the associations between covariates in the training data and the outcome defined by the gold standard variable (goldStdVar) argument. Unadjusted odds ratios are the default (orType = "univariate"). These odds ratios are produced using contingency table methods. Adjusted odds ratios (orType = "adjusted") are calculated via bootstrapped logistic regression to produce non-parametric standard errors. The bootstrap is controlled by the parameters nBS, the number of bootstrap samples to run, and pSampled, the proportion of unlinked cases to include in the bootstrap sample. Setting pSampled below 1 is recommended only for large datasets in which it is computationally infeasible to run a full bootstrap. Sensitivity analyses should be run to determine an adequate value for pSampled.
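
For the univariate case, a contingency table odds ratio for one covariate level can be sketched as follows. The counts are made up, and adding the smoothing value l to each cell before taking the ratio is an assumption of this sketch rather than a documented detail of the package.

```r
# Made-up 2x2 counts: covariate level (rows) by gold-standard outcome (columns)
tab <- matrix(c(30, 10,    # level present: links, non-links
                20, 40),   # level absent:  links, non-links
              nrow = 2, byrow = TRUE,
              dimnames = list(level = c("present", "absent"),
                              outcome = c("link", "nonlink")))
l <- 1  # assumption: the smoothing value is added to each cell before the ratio
tabS <- tab + l

# Log odds ratio: log((a * d) / (b * c))
logOR <- log((tabS[1, 1] * tabS[2, 2]) / (tabS[1, 2] * tabS[2, 1]))

# Standard error of the log odds ratio (Woolf's formula)
seLogOR <- sqrt(sum(1 / tabS))

# Odds ratio with a normal-approximation 95% confidence interval
c(OR    = exp(logOR),
  lower = exp(logOR - qnorm(0.975) * seLogOR),
  upper = exp(logOR + qnorm(0.975) * seLogOR))
```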

References

Barnard J. and Rubin D. Small-Sample Degrees of Freedom with Multiple Imputation. Biometrika. 1999 Dec;86(4):948-55.

Examples

## Use the pairData dataset which represents a TB-like outbreak
# First create a dataset of ordered pairs
orderedPair <- pairData[pairData$infectionDiffY >= 0, ]

## Create a variable called snpClose that will define probable links
# (<3 SNPs) and nonlinks (>12 SNPs); pairs with 3-12 SNPs
# will not be used to train.
orderedPair$snpClose <- ifelse(orderedPair$snpDist < 3, TRUE,
                        ifelse(orderedPair$snpDist > 12, FALSE, NA))
table(orderedPair$snpClose)
#> 
#> FALSE  TRUE 
#>   881   248 

## Running the algorithm
# NOTE: should be run with nReps > 1 (at least 10 is recommended).
resGen <- nbProbabilities(orderedPair = orderedPair,
                            indIDVar = "individualID",
                            pairIDVar = "pairID",
                            goldStdVar = "snpClose",
                            covariates = c("Z1", "Z2", "Z3", "Z4", "timeCat"),
                            label = "SNPs", l = 1,
                            n = 10, m = 1, nReps = 1)
#>   |======================================================================| 100%

## Merging the probabilities back with the pair-level data
nbResults <- merge(resGen[[1]], orderedPair, by = "pairID", all = TRUE)