The function nbProbabilities uses naive Bayes and an iterative estimation procedure to estimate relative transmission probabilities.

nbProbabilities(
  orderedPair,
  indIDVar,
  pairIDVar,
  goldStdVar,
  covariates,
  label = "",
  l = 1,
  n = 10,
  m = 1,
  nReps = 10,
  progressBar = TRUE
)

Arguments

orderedPair

The name of the ordered pair-level dataset with the covariates.

indIDVar

The name (in quotes) of the column with the individual ID. The data frame orderedPair must have columns called <indIDVar>.1 and <indIDVar>.2.

pairIDVar

The name (in quotes) of the column with the unique pair ID variable.

goldStdVar

The name (in quotes) of the column with a logical vector defining training links/non-links.

covariates

A character vector containing the covariate column names (in quotes). All covariates need to be categorical factor variables.

label

An optional label string for the run.

l

Laplace smoothing parameter that is added to each cell.

n

The number of folds for nxm cross validation (should be at least 10).

m

The number of times to create n folds in nxm cross validation.

nReps

The number of times to randomly select the "true" infector (should be at least 10).

progressBar

A logical indicating if a progress bar should be printed (default is TRUE).

Value

List containing two data frames:

  1. probabilities - a data frame of transmission probabilities. Column names:

    • label - the optional label of the run.

    • <pairIDVar> - the pair ID with the name specified.

    • pAvg - the mean transmission probability for the pair over all iterations.

    • pSD - the standard deviation of the transmission probability for the pair over all iterations.

    • pScaled - the mean relative transmission probability for the pair over all iterations: pAvg scaled so that the probabilities for all infectors per infectee add to 1.

    • pRank - the rank of the probability of the pair out of all pairs for that infectee (in case of ties all values have the minimum rank of the group).

    • nEstimates - the number of probability estimates that contributed to pAvg. This represents the number of prediction datasets this pair was included in over the nxm cross prediction repeated nReps times.

  2. estimates - a data frame with the contribution of covariates. Column names:

    • label - the optional label of the run

    • level - the covariate name and level

    • nIter - the number of iterations included in the estimates: n*m*nReps

    • logorMean - the mean value of the log odds ratio across iterations

    • logorSE - the standard error of the log odds ratio across iterations

    • logorCILB - the lower bound of the 95% confidence interval of the log odds ratio across iterations

    • logorCIUB - the upper bound of the 95% confidence interval of the log odds ratio across iterations
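
The relationship between pAvg, pScaled, and pRank described above can be illustrated with a small base R sketch. The data frame, its values, and the infectee column are hypothetical, and the ranking direction (rank 1 for the most probable infector) is an assumption; this is not the package's internal code.

```r
# Toy pair-level probabilities (hypothetical values, not package output)
probs <- data.frame(
  pairID   = c("A_C", "B_C", "A_D", "B_D", "C_D"),
  infectee = c("C", "C", "D", "D", "D"),
  pAvg     = c(0.30, 0.10, 0.20, 0.20, 0.05)
)

# pScaled: divide each pAvg by the infectee's total so that the values
# for all infectors of one infectee add to 1
probs$pScaled <- probs$pAvg / ave(probs$pAvg, probs$infectee, FUN = sum)

# pRank: rank within infectee, highest probability first (assumed direction);
# tied pairs share the minimum rank of their group
probs$pRank <- ave(-probs$pAvg, probs$infectee,
                   FUN = function(x) rank(x, ties.method = "min"))

probs
```

In this toy data the two tied infectors of infectee "D" both receive rank 1 and the remaining infector receives rank 3.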

Details

This algorithm takes a dataset of ordered possible infector-infectee pairs in an infectious disease outbreak or cluster and estimates the relative probability the cases are linked by direct transmission using a classification technique called naive Bayes (NB). NB is a simple machine learning algorithm that uses Bayes rule to estimate the probability of an outcome in a prediction dataset given a set of covariates from the observed frequencies in a training dataset.
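
As a rough illustration of the naive Bayes calculation, the base R sketch below estimates the probability of a link for one new pair from smoothed frequency tables. The training data, the helper names condProb and predictLink, and the way the smoothing parameter l enters the tables are assumptions made for this example; the function's own implementation may differ.

```r
# Toy training data (hypothetical): gold-standard link status and two
# categorical covariates
train <- data.frame(
  link = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  Z1   = factor(c("a", "a", "b", "b", "b", "a", "b", "b")),
  Z2   = factor(c("x", "y", "x", "y", "y", "y", "x", "y"))
)
covariates <- c("Z1", "Z2")
l <- 1  # Laplace smoothing value added to every cell

# P(covariate level | outcome), with l added to each cell of the table
condProb <- function(var, outcome) {
  tab <- table(train[[var]][train$link == outcome])
  p <- (as.numeric(tab) + l) / (sum(tab) + l * length(tab))
  names(p) <- names(tab)
  p
}

# Posterior probability of a link for one new pair using Bayes rule,
# treating the covariates as independent given the outcome
predictLink <- function(newPair) {
  priorLink <- mean(train$link)
  priorNon <- 1 - priorLink
  likeLink <- prod(sapply(covariates, function(v) condProb(v, TRUE)[[newPair[[v]]]]))
  likeNon <- prod(sapply(covariates, function(v) condProb(v, FALSE)[[newPair[[v]]]]))
  priorLink * likeLink / (priorLink * likeLink + priorNon * likeNon)
}

p <- predictLink(list(Z1 = "a", Z2 = "x"))
p
```

Because l is added to every cell, a covariate level that never appears among the training links still receives a small non-zero probability instead of forcing the product to zero.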

The input dataset - orderedPair - should represent ordered pairs of cases (where the potential infector was observed before the infectee) and have a unique identifier for each pair (pairIDVar) as well as the individual IDs that are included in the pair (<indIDVar>.1 and <indIDVar>.2). If cases are concurrent (meaning the order cannot be determined) both orders can be included.

A subset of pairs should also have pathogen WGS, contact investigation, or some other 'gold standard' defined by goldStdVar, which should be a logical vector with TRUE indicating links, FALSE indicating non-links, and NA if the pair cannot be used to train (it does not have the information or is indeterminate). These pairs will be used to create a training dataset of probable links and non-links. The covariates can be any categorical variables and could represent spatial, clinical, demographic, and temporal characteristics of the case pair.
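
Since the covariates must be categorical factors, continuous or logical pair-level variables need to be converted before they are passed in. A minimal sketch with base R follows; the data frame, column names, and cut points are hypothetical and chosen only for illustration.

```r
# Hypothetical pair-level data with a continuous time difference (years)
# and a logical indicator
pairs <- data.frame(
  pairID   = paste0("p", 1:6),
  timeDiff = c(0.2, 0.9, 1.5, 2.7, 3.4, 6.0),
  sameCity = c(TRUE, FALSE, TRUE, TRUE, FALSE, FALSE)
)

# Bin the continuous variable into categories: [0, 1), [1, 3), [3, Inf)
pairs$timeCat <- cut(pairs$timeDiff,
                     breaks = c(0, 1, 3, Inf),
                     labels = c("<1y", "1-3y", ">=3y"),
                     right = FALSE)

# Convert the logical covariate to a factor with explicit levels
pairs$sameCity <- factor(pairs$sameCity, levels = c(FALSE, TRUE),
                         labels = c("no", "yes"))

str(pairs[, c("timeCat", "sameCity")])
```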

Because the outcomes in the training set represent probable and not certain transmission events and a given case could have multiple probable infectors, the algorithm uses an iterative estimation procedure. This procedure randomly chooses one link of all of the possible links to include in the training dataset nReps times, and then uses nxm cross prediction to give all pairs a turn in the prediction dataset.
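
The structure of this procedure can be sketched in base R as three nested loops. The toy data, the helper pickOne, and the bookkeeping variables are hypothetical, and the sketch omits the classifier itself and any special handling of the links that were not chosen; it only illustrates the random selection and the fold rotation, not the function's actual internals.

```r
set.seed(1)  # for a reproducible illustration

# Toy gold-standard pairs (hypothetical): infectee "C" has two probable infectors
pairs <- data.frame(
  pairID   = c("A_C", "B_C", "A_D", "B_E", "C_E", "D_F"),
  infectee = c("C", "C", "D", "E", "E", "F"),
  link     = c(TRUE, TRUE, TRUE, FALSE, TRUE, NA)
)
nReps <- 3; n <- 2; m <- 1

# Pick one element at random (sample.int avoids sample()'s length-1 behaviour)
pickOne <- function(idx) idx[sample.int(length(idx), 1)]

selected <- vector("list", nReps)  # row chosen per infectee in each repetition
nPred <- setNames(integer(nrow(pairs)), pairs$pairID)

for (r in seq_len(nReps)) {
  # Randomly choose one probable link per infectee
  linkRows <- which(pairs$link %in% TRUE)
  selected[[r]] <- sapply(split(linkRows, pairs$infectee[linkRows]), pickOne)

  for (j in seq_len(m)) {
    # Assign every pair to one of n folds at random
    fold <- sample(rep(seq_len(n), length.out = nrow(pairs)))
    for (k in seq_len(n)) {
      predRows <- which(fold == k)
      # A classifier would be trained on the remaining folds here and
      # used to predict the pairs in predRows
      nPred[predRows] <- nPred[predRows] + 1L
    }
  }
}

nPred
```

In this simplified version every pair lands in a prediction set exactly m * nReps times; the nEstimates column returned by the function reports the corresponding count for the real procedure.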

The output of this function is a list of two data frames: one with the estimates of the transmission probabilities (probabilities) and the other with the contribution of the covariates to the probabilities in the form of odds ratios (estimates). The 95% confidence intervals for the odds ratios are calculated with a technique developed for multiple imputation, to pool the error across all iterations.
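
For readers unfamiliar with pooling in the multiple imputation setting, the sketch below applies the standard combining rules (often called Rubin's rules) to a set of per-iteration log odds ratios. The input values are hypothetical, and the normal quantile used for the interval is a simplifying assumption; the cited reference concerns degrees of freedom for a t reference distribution, which the function may use instead.

```r
# Hypothetical per-iteration log odds ratios and their variances
# for a single covariate level
logor <- c(0.82, 0.95, 0.78, 0.88, 0.91)
logorVar <- c(0.040, 0.036, 0.044, 0.039, 0.041)
nIter <- length(logor)

pooledMean <- mean(logor)    # pooled point estimate
withinVar <- mean(logorVar)  # average within-iteration variance
betweenVar <- var(logor)     # variance of the estimates across iterations

# Total variance combines both sources of error
totalVar <- withinVar + (1 + 1 / nIter) * betweenVar
pooledSE <- sqrt(totalVar)

# 95% interval on the log scale (normal approximation, an assumption here)
ci <- pooledMean + c(-1, 1) * qnorm(0.975) * pooledSE

# Back-transform to the odds ratio scale
exp(c(OR = pooledMean, lower = ci[1], upper = ci[2]))
```

The between-iteration term is what widens the interval relative to a single run: the more the estimates vary across iterations, the larger the pooled standard error.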

References

Barnard J. and Rubin D. Small-Sample Degrees of Freedom with Multiple Imputation. Biometrika. 1999 Dec;86(4):948-55.

Examples

## Use the pairData dataset which represents a TB-like outbreak
# First create a dataset of ordered pairs
orderedPair <- pairData[pairData$infectionDiffY >= 0, ]

## Create a variable called snpClose that will define probable links
# (<3 SNPs) and nonlinks (>12 SNPs); all pairs with 3-12 SNPs
# will not be used to train.
orderedPair$snpClose <- ifelse(orderedPair$snpDist < 3, TRUE,
                        ifelse(orderedPair$snpDist > 12, FALSE, NA))
table(orderedPair$snpClose)
#> 
#> FALSE  TRUE 
#>   881   248 

## Running the algorithm
#NOTE should run with nReps > 1.
resGen <- nbProbabilities(orderedPair = orderedPair,
                            indIDVar = "individualID",
                            pairIDVar = "pairID",
                            goldStdVar = "snpClose",
                            covariates = c("Z1", "Z2", "Z3", "Z4", "timeCat"),
                            label = "SNPs", l = 1,
                            n = 10, m = 1, nReps = 1)
#> 
                            
## Merging the probabilities back with the pair-level data
nbResults <- merge(resGen[[1]], orderedPair, by = "pairID", all = TRUE)