The function nbProbabilities
uses naive Bayes and an iterative estimation
procedure to estimate relative transmission probabilities.
nbProbabilities(
orderedPair,
indIDVar,
pairIDVar,
goldStdVar,
covariates,
label = "",
l = 1,
n = 10,
m = 1,
nReps = 10,
progressBar = TRUE
)
orderedPair
- The name of the ordered pair-level dataset with the covariates.
indIDVar
- The name (in quotes) of the column with the individual ID
(data frame orderedPair must have columns called <indIDVar>.1 and <indIDVar>.2).
pairIDVar
- The name (in quotes) of the column with the unique pair ID variable.
goldStdVar
- The name (in quotes) of the column with a logical vector defining training links/non-links.
covariates
- A character vector containing the covariate column names (in quotes). All covariates need to be categorical factor variables.
label
- An optional label string for the run.
l
- Laplace smoothing parameter that is added to each cell.
n
- The number of folds for nxm cross validation (should be at least 10).
m
- The number of times to create n folds in nxm cross validation.
nReps
- The number of times to randomly select the "true" infector (should be at least 10).
progressBar
- A logical indicating if a progress bar should be printed (default is TRUE).
List containing two data frames:
probabilities
- a data frame of transmission probabilities. Column names:
label
- the optional label of the run.
<pairIDVar>
- the pair ID with the name specified.
pAvg
- the mean transmission probability for the pair over all iterations.
pSD
- the standard deviation of the transmission probability for the pair
over all iterations.
pScaled
- the mean relative transmission probability for the pair over
all iterations: pAvg scaled so that the probabilities for all infectors per infectee add to 1.
pRank
- the rank of the probability of the pair out of all pairs for that
infectee (in case of ties all values have the minimum rank of the group).
nEstimates
- the number of probability estimates that contributed to pAvg. This
represents the number of prediction datasets this pair was included in over the nxm
cross prediction repeated nReps
times.
estimates
- a data frame with the contribution of covariates. Column names:
label
- the optional label of the run
level
- the covariate name and level
nIter
- the number of iterations included in the estimates: n*m*nReps
logorMean
- the mean value of the log odds ratio across iterations
logorSE
- the standard error of the log odds ratio across iterations
logorCILB
- the lower bound of the 95% confidence interval of the log odds ratio
across iterations
logorCIUB
- the upper bound of the 95% confidence interval of the log odds ratio
across iterations
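The pScaled and pRank columns described above can be illustrated with a small base R sketch. The data frame, its column infectee, and the values are hypothetical, and ranking the highest probability as 1 is an assumption; this is not the package's internal code.

```r
# Hypothetical averaged probabilities for the candidate infectors of two infectees
probs <- data.frame(
  pairID = c("1_3", "2_3", "4_3", "1_5", "2_5"),
  infectee = c(3, 3, 3, 5, 5),
  pAvg = c(0.30, 0.10, 0.10, 0.20, 0.60)
)

# Scale within infectee so the probabilities of all candidate infectors add to 1
probs$pScaled <- probs$pAvg / ave(probs$pAvg, probs$infectee, FUN = sum)

# Rank within infectee (assumption: rank 1 is the most probable infector);
# tied values all receive the minimum rank of their group
probs$pRank <- ave(-probs$pAvg, probs$infectee,
                   FUN = function(x) rank(x, ties.method = "min"))

print(probs)
```

Here the two tied candidates for infectee 3 share the same rank, matching the tie handling described for pRank.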
This algorithm takes a dataset of ordered possible infector-infectee pairs in an infectious disease outbreak or cluster and estimates the relative probability that the cases are linked by direct transmission using a classification technique called naive Bayes (NB). NB is a simple machine learning algorithm that uses Bayes' rule to estimate the probability of an outcome in a prediction dataset given a set of covariates from the observed frequencies in a training dataset.
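To make the naive Bayes step concrete, the following is a minimal base R sketch of a categorical naive Bayes classifier with Laplace smoothing. The function nbProb, the toy data, and the covariate names are hypothetical and are meant only to illustrate the idea, not to reproduce the package's implementation.

```r
# Toy training data: a logical link indicator and two categorical covariates
train <- data.frame(
  link = c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, FALSE),
  Z1 = factor(c("a", "a", "b", "b", "b", "a", "b", "b"), levels = c("a", "b")),
  Z2 = factor(c("x", "y", "x", "y", "y", "y", "x", "y"), levels = c("x", "y"))
)
# Pair to predict
new <- data.frame(Z1 = factor("a", levels = c("a", "b")),
                  Z2 = factor("x", levels = c("x", "y")))

nbProb <- function(train, new, covariates, l = 1) {
  isLink <- train$link
  # Unnormalised score for one class: the class prior times the product of
  # the smoothed conditional probabilities P(level | class)
  classScore <- function(rows) {
    score <- rep(sum(rows) / length(rows), nrow(new))
    for (v in covariates) {
      counts <- table(train[[v]][rows]) + l  # add l to each cell
      cond <- as.numeric(counts) / sum(counts)
      names(cond) <- names(counts)
      score <- score * cond[as.character(new[[v]])]
    }
    unname(score)
  }
  sLink <- classScore(isLink)
  sNon <- classScore(!isLink)
  # Bayes' rule: normalise over the two classes
  sLink / (sLink + sNon)
}

p <- nbProb(train, new, covariates = c("Z1", "Z2"), l = 1)
print(p)
```

Adding l to every cell keeps a covariate level that never appears among the training links (or non-links) from forcing the probability to exactly 0 or 1.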
The input dataset - orderedPair
- should represent ordered pairs of cases
(where the potential infector was observed before the infectee) and have
a unique identifier for each pair (pairIDVar
) as well as the individual ids that are
included in the pair (<indIDVar>.1
and <indIDVar>.2
). If cases are concurrent
(meaning the order cannot be determined), both orders can be included.
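As an illustration of the input layout described above, the sketch below builds an ordered pair dataset from individual-level data using only base R. The data frame ind, the column infectionDate, and the pair ID format are hypothetical choices for this example.

```r
# Hypothetical individual-level data: IDs and infection dates
ind <- data.frame(
  individualID = 1:4,
  infectionDate = as.Date(c("2020-01-01", "2020-01-10",
                            "2020-01-10", "2020-02-01"))
)

# Cross join of the individuals with themselves; shared column names
# receive the suffixes .1 (potential infector) and .2 (infectee)
pairs <- merge(ind, ind, by = NULL, suffixes = c(".1", ".2"))
pairs <- pairs[pairs$individualID.1 != pairs$individualID.2, ]

# Keep pairs where the potential infector was observed on or before the
# infectee; concurrent cases therefore remain in both orders
orderedPair <- pairs[pairs$infectionDate.2 >= pairs$infectionDate.1, ]

# Unique pair identifier
orderedPair$pairID <- paste(orderedPair$individualID.1,
                            orderedPair$individualID.2, sep = "_")
print(orderedPair)
```

Individuals 2 and 3 share an infection date, so both the pair 2_3 and the pair 3_2 are kept.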
A subset of pairs should also have pathogen WGS, contact investigation, or some other
'gold standard' defined by goldStdVar
which should be a logical vector with
TRUE
indicating links, FALSE
nonlinks, and NA
if
the pair cannot be used to train (does not have the information or is indeterminate).
These pairs will be used to create a training dataset of probable links and non-links.
The covariates can be any categorical variables and could represent
spatial, clinical, demographic, and temporal characteristics of the case pair.
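Because the covariates must be categorical factors, continuous or logical pair-level variables need to be converted first. A minimal sketch using base R follows; the column names, cut points, and labels are hypothetical.

```r
# Hypothetical pair-level data with a continuous and a logical variable
pairs <- data.frame(
  pairID = c("1_2", "1_3", "2_3", "2_4"),
  timeDiffY = c(0.4, 1.5, 2.2, 6.0),
  sameCounty = c(TRUE, FALSE, TRUE, FALSE)
)

# Bin the continuous time difference into a categorical factor;
# right = FALSE gives intervals closed on the left: [0,1), [1,2), ...
pairs$timeCat <- cut(pairs$timeDiffY,
                     breaks = c(0, 1, 2, 5, Inf),
                     labels = c("<1y", "1-2y", "2-5y", ">5y"),
                     right = FALSE)

# Convert the logical indicator into a two-level factor
pairs$county <- factor(pairs$sameCounty, levels = c(FALSE, TRUE),
                       labels = c("Different", "Same"))
str(pairs)
```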
Because the outcomes in the training set represent probable and not certain
transmission events and a given case could have multiple probable infectors,
the algorithm uses an iterative estimation procedure. This procedure randomly chooses one
link of all of the possible links to include in the training dataset nReps
times, and then uses nxm
cross prediction to give all pairs a turn
in the prediction dataset.
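The structure of this iterative procedure can be sketched in base R as follows, with m = 1 for brevity. The helper functions chooseLinks and makeFolds, the toy data, and the treatment of unselected links (dropped from training here) are assumptions made for illustration; the package's own handling may differ.

```r
set.seed(1)
# Hypothetical ordered pairs: infectee ID and gold standard link status
pairs <- data.frame(
  pairID = 1:12,
  infectee = rep(1:4, each = 3),
  snpClose = c(TRUE, TRUE, FALSE,  TRUE, FALSE, FALSE,
               TRUE, TRUE, TRUE,   FALSE, FALSE, NA)
)
nReps <- 3  # times to randomly select the "true" infector
n <- 4      # number of folds

# Keep one randomly chosen link per infectee
chooseLinks <- function(pairs) {
  linkRows <- which(pairs$snpClose %in% TRUE)
  chosen <- tapply(linkRows, pairs$infectee[linkRows], function(r) {
    r[sample.int(length(r), 1)]
  })
  status <- pairs$snpClose
  # Assumption: links that were not selected are excluded from training
  status[setdiff(linkRows, chosen)] <- NA
  status
}

# Randomly assign each pair to one of n folds of near-equal size
makeFolds <- function(nPairs, n) sample(rep_len(seq_len(n), nPairs))

timesPredicted <- integer(nrow(pairs))
for (r in seq_len(nReps)) {
  pairs$linked <- chooseLinks(pairs)
  fold <- makeFolds(nrow(pairs), n)
  for (k in seq_len(n)) {
    prediction <- pairs[fold == k, ]
    training <- pairs[fold != k & !is.na(pairs$linked), ]
    # ... fit naive Bayes on `training`, predict for `prediction` ...
    timesPredicted[fold == k] <- timesPredicted[fold == k] + 1L
  }
}
print(timesPredicted)
```

In this sketch every pair lands in the prediction set exactly once per repetition, so each pair accumulates nReps estimates, which is the kind of count that nEstimates reports.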
The output of this function is a list of two dataframes: one with the estimates of the
transmission probabilities (probabilities
) and the other with the contribution of
the covariates to the probabilities in the form of log odds ratios (estimates
). The
95% confidence intervals are calculated using Rubin's rules, a technique developed
for multiple imputation, to pool the error across all iterations.
Barnard J. and Rubin D. Small-Sample Degrees of Freedom with Multiple Imputation. Biometrika. 1999 Dec;86(4):948-55.
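Pooling error across iterations in the multiple imputation style can be sketched with the classical form of Rubin's rules. The function poolRubin and the input values are hypothetical, the output names simply mirror the columns of estimates, and the small-sample degrees of freedom adjustment from the reference above is not included; the package's exact calculation may differ.

```r
# Hypothetical per-iteration log odds ratios and their squared standard errors
logOR <- c(0.80, 1.10, 0.95, 1.05, 0.90)
varLogOR <- c(0.040, 0.050, 0.045, 0.042, 0.048)

poolRubin <- function(est, variance, level = 0.95) {
  M <- length(est)
  qBar <- mean(est)              # pooled estimate
  W <- mean(variance)            # within-iteration variance
  B <- var(est)                  # between-iteration variance
  total <- W + (1 + 1 / M) * B   # total variance
  # Classical Rubin degrees of freedom (no small-sample adjustment)
  df <- (M - 1) * (1 + W / ((1 + 1 / M) * B))^2
  half <- qt(1 - (1 - level) / 2, df) * sqrt(total)
  c(logorMean = qBar, logorSE = sqrt(total),
    logorCILB = qBar - half, logorCIUB = qBar + half)
}

pooled <- poolRubin(logOR, varLogOR)
print(pooled)
# Exponentiate to express the estimate and interval as odds ratios
print(exp(pooled[c("logorMean", "logorCILB", "logorCIUB")]))
```

The total variance combines the average uncertainty within each iteration with the variability of the estimates between iterations, so the interval widens when the iterations disagree.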
## Use the pairData dataset which represents a TB-like outbreak
# First create a dataset of ordered pairs
orderedPair <- pairData[pairData$infectionDiffY >= 0, ]
## Create a variable called snpClose that will define probable links
# (<3 SNPs) and nonlinks (>12 SNPs); all pairs with 3-12 SNPs
# will not be used to train.
orderedPair$snpClose <- ifelse(orderedPair$snpDist < 3, TRUE,
ifelse(orderedPair$snpDist > 12, FALSE, NA))
table(orderedPair$snpClose)
#>
#> FALSE TRUE
#> 881 248
## Running the algorithm
# NOTE: in practice, run with nReps > 1 (at least 10 is recommended).
resGen <- nbProbabilities(orderedPair = orderedPair,
indIDVar = "individualID",
pairIDVar = "pairID",
goldStdVar = "snpClose",
covariates = c("Z1", "Z2", "Z3", "Z4", "timeCat"),
label = "SNPs", l = 1,
n = 10, m = 1, nReps = 1)
#> |======================================================================| 100%
## Merging the probabilities back with the pair-level data
nbResults <- merge(resGen$probabilities, orderedPair, by = "pairID", all = TRUE)