Performs naive Bayes classification

The function performNB calculates the posterior probabilities of a dichotomous class variable given a set of covariates, using Bayes rule and either univariate odds ratios (the default, orType = "univariate") or bootstrapped adjusted odds ratios from logistic regression (orType = "adjusted").

performNB(
  training,
  prediction,
  obsIDVar,
  goldStdVar,
  covariates,
  l = 1,
  orType = "univariate",
  nBS = 100,
  pSampled = 1
)

Arguments

training: The training dataset name.
prediction: The prediction dataset name.
obsIDVar: The variable name (in quotes) of the observation ID variable.
goldStdVar: The variable name (in quotes) of the outcome in the training dataset. It must be a logical variable with value TRUE for observations with the outcome of interest.
covariates: A character vector containing the covariate variable names. All covariates need to be categorical factor variables.
l: Laplace smoothing parameter that is added to each cell (a value of 0 indicates no smoothing).
orType: Takes value "univariate" or "adjusted". "univariate" produces contingency table odds ratios and "adjusted" produces adjusted odds ratios from a bootstrapped multivariable logistic regression.
nBS: Number of bootstrap samples to run in each cross-validation fold/iteration (default is 100). Only relevant when orType = "adjusted".
pSampled: Proportion of unlinked cases to include in each bootstrap sample (default is 1, i.e., a true bootstrap). Only relevant when orType = "adjusted".
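
To illustrate what the smoothing parameter l does, here is a minimal sketch of a univariate log odds ratio computed from a 2x2 contingency table with l added to each cell. It is not the package's internal code: the helper name smoothedLogOR, the choice of a reference level, the standard error formula (the usual large-sample formula for a log odds ratio), and the toy data are all assumptions made for illustration.

```r
## Hypothetical helper (not part of the package): log odds ratio for one
## covariate level versus a reference level, with l added to each cell.
smoothedLogOR <- function(outcome, covariate, level, ref, l = 1) {
  n11 <- sum(covariate == level & outcome) + l   # level, outcome TRUE
  n10 <- sum(covariate == level & !outcome) + l  # level, outcome FALSE
  n01 <- sum(covariate == ref & outcome) + l     # reference, outcome TRUE
  n00 <- sum(covariate == ref & !outcome) + l    # reference, outcome FALSE
  c(est = log((n11 * n00) / (n10 * n01)),
    se = sqrt(1 / n11 + 1 / n10 + 1 / n01 + 1 / n00))
}

## Toy data, made up for illustration
outcome <- c(TRUE, TRUE, TRUE, FALSE, FALSE, FALSE, FALSE, TRUE)
covariate <- factor(c("b", "b", "b", "b", "a", "a", "a", "a"))

## No smoothing versus l = 1
smoothedLogOR(outcome, covariate, level = "b", ref = "a", l = 0)
smoothedLogOR(outcome, covariate, level = "b", ref = "a", l = 1)
```

With these toy counts the estimate moves from log(9) to log(4): smoothing pulls the odds ratio toward 1 and keeps it finite when a cell would otherwise be empty.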

Value

List containing two dataframes:

probabilities - a dataframe combining training and prediction with predicted probabilities for the prediction dataframe. Column names:
- <obsIDVar> - the observation ID with the name specified
- p - the probability that <goldStdVar> = TRUE for observations in the prediction dataset.
estimates - a dataframe with the effect estimates derived from the training dataset. Column names:
- level - the covariate name and level
- est - the log odds ratio for this covariate and level
- se - the standard error of the log odds ratio
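
Because est and se are on the log odds ratio scale, they usually need to be exponentiated before reporting. The sketch below shows one way to do this with a Wald-type 95% interval. The data frame is a hypothetical stand-in for the estimates element (its values and level labels are made up), and the interval construction is an assumption, not something the function itself returns.

```r
## Hypothetical stand-in for the `estimates` dataframe; values are made up
estimates <- data.frame(
  level = c("covA:level2", "covA:level3"),
  est = c(1.2, 3.0),
  se = c(0.5, 0.8)
)

## Convert log odds ratios to odds ratios with Wald-type 95% intervals
z <- qnorm(0.975)
estimates$or <- exp(estimates$est)
estimates$ciLower <- exp(estimates$est - z * estimates$se)
estimates$ciUpper <- exp(estimates$est + z * estimates$se)
estimates
```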

Details

The main purpose of this function is to be used by nbProbabilities to estimate the relative transmission probability between individuals in an infectious disease outbreak. However, it can be used more generally to estimate the probability of any dichotomous outcome given a set of categorical covariates, and to estimate odds ratios for that outcome.

This function also generates odds ratios describing the associations between the covariates in the training data and the outcome defined by the gold standard variable (goldStdVar) argument. Unadjusted odds ratios are the default and are produced using contingency table methods. Adjusted odds ratios are calculated via bootstrapped logistic regression to produce non-parametric standard errors. The bootstrap is controlled by the parameters nBS, the number of bootstrap samples to run, and pSampled, the proportion of unlinked cases to include in each bootstrap sample. Setting pSampled below 1 is recommended only for large datasets in which running a full bootstrap is computationally infeasible. Sensitivity analyses should be run to determine an adequate value for pSampled.
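
The bootstrap described above can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the data are simulated, "unlinked cases" are taken to mean rows where the outcome is FALSE, and the way the two groups are resampled is a guess at one reasonable scheme. Only the names nBS and pSampled come from the function's arguments.

```r
set.seed(1)

## Simulated data (hypothetical): logical outcome, one two-level covariate
n <- 200
x <- factor(sample(c("a", "b"), n, replace = TRUE))
y <- rbinom(n, 1, ifelse(x == "b", 0.6, 0.3)) == 1
dat <- data.frame(y = y, y01 = as.integer(y), x = x)

nBS <- 100     # number of bootstrap samples
pSampled <- 1  # proportion of outcome-FALSE rows drawn in each sample

linked <- dat[dat$y, ]     # assumption: outcome TRUE
unlinked <- dat[!dat$y, ]  # assumption: outcome FALSE ("unlinked")

bootCoefs <- replicate(nBS, {
  ## Resample outcome-TRUE rows fully and a proportion of outcome-FALSE rows
  bsLinked <- linked[sample(nrow(linked), replace = TRUE), ]
  nUnlinked <- round(pSampled * nrow(unlinked))
  bsUnlinked <- unlinked[sample(nrow(unlinked), nUnlinked, replace = TRUE), ]
  fit <- glm(y01 ~ x, family = binomial, data = rbind(bsLinked, bsUnlinked))
  coef(fit)[["xb"]]  # log odds ratio for level "b" versus "a"
})

## Bootstrap estimate and non-parametric standard error
c(est = mean(bootCoefs), se = sd(bootCoefs))
```

Lowering pSampled shrinks each resample and so speeds up the model fits. A sensitivity analysis would repeat this for several values and compare the resulting estimates and standard errors.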

The function needs a training dataset with the outcome variable (goldStdVar) which is TRUE for those who have the value of interest and FALSE for those who do not. The probability of having the outcome (<goldStdVar> = TRUE) is predicted in the prediction dataset.
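
For readers unfamiliar with the technique, the following is a minimal sketch of a textbook naive Bayes classifier for categorical covariates with Laplace smoothing. It is not necessarily the odds-ratio-based calculation that performNB uses internally. The helper naiveBayesPosterior and the variable names pwCat and plCat are hypothetical.

```r
## Hypothetical helper: P(outcome = TRUE | covariates) under the naive
## (conditional independence) assumption, with l added to each count.
naiveBayesPosterior <- function(train, newdata, outcomeVar, covariates, l = 1) {
  y <- train[[outcomeVar]]
  ## Start from the log prior odds of the outcome
  logOdds <- rep(log(mean(y) / (1 - mean(y))), nrow(newdata))
  for (v in covariates) {
    lev <- levels(train[[v]])
    k <- length(lev)
    ## Smoothed P(level | outcome TRUE) and P(level | outcome FALSE)
    pTrue <- (tabulate(train[[v]][y], nbins = k) + l) / (sum(y) + l * k)
    pFalse <- (tabulate(train[[v]][!y], nbins = k) + l) / (sum(!y) + l * k)
    ## Add each covariate's log likelihood ratio for the observed level
    idx <- match(as.character(newdata[[v]]), lev)
    logOdds <- logOdds + log(pTrue[idx]) - log(pFalse[idx])
  }
  plogis(logOdds)  # convert log odds to a probability
}

## Usage on iris, training and predicting on the same data
dat <- iris
dat$virginica <- dat$Species == "virginica"
dat$pwCat <- cut(dat$Petal.Width, c(0, 1, 2, Inf))
dat$plCat <- cut(dat$Petal.Length, c(0, 2, 4, 6, Inf))
p <- naiveBayesPosterior(dat, dat, "virginica", c("pwCat", "plCat"), l = 1)
tapply(p, dat$Species, mean)
```

With l = 0, a covariate level never observed together with the outcome would drive the posterior to exactly 0 or 1. Smoothing avoids that.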

Examples

## Use the iris dataset to predict whether a flower is of the species "virginica".

data(iris)
irisNew <- iris
## Creating an id variable
irisNew$id <- seq_len(nrow(irisNew))
## Creating logical variable indicating if the flower is of the species virginica
irisNew$spVirginica <- irisNew$Species == "virginica"

## Creating categorical/factor versions of the covariates
irisNew$Sepal.Length.Cat <- factor(cut(irisNew$Sepal.Length, c(0, 5, 6, 7, Inf)),
                                 labels = c("<=5.0", "5.1-6.0", "6.1-7.0", "7.1+"))

irisNew$Sepal.Width.Cat <- factor(cut(irisNew$Sepal.Width, c(0, 2.5, 3, 3.5, Inf)),
                                 labels = c("<=2.5", "2.6-3.0", "3.1-3.5", "3.6+"))

irisNew$Petal.Length.Cat <- factor(cut(irisNew$Petal.Length, c(0, 2, 4, 6, Inf)),
                                 labels = c("<=2.0", "2.1-4.0", "4.1-6.0", "6.1+"))

irisNew$Petal.Width.Cat <- factor(cut(irisNew$Petal.Width, c(0, 1, 2, Inf)),
                               labels = c("<=1.0", "1.1-2.0", "2.1+"))
## Using NB to predict if the species is virginica
## (training and predicting on same dataset)
pred <- performNB(irisNew, irisNew, obsIDVar = "id",
                    goldStdVar = "spVirginica",
                    covariates = c("Sepal.Length.Cat", "Sepal.Width.Cat",
                                   "Petal.Length.Cat", "Petal.Width.Cat"), l = 1)
irisResults <- merge(irisNew, pred$probabilities, by = "id")
tapply(irisResults$p, irisResults$Species, summary)
#> $setosa
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#> 0.00003 0.00006 0.00010 0.00010 0.00012 0.00018      50 
#> 
#> $versicolor
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#> 0.00018 0.03570 0.43028 0.43323 0.76622 0.85702      50 
#> 
#> $virginica
#>    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's 
#>  0.1180  0.8570  0.9920  0.8474  0.9956  0.9999      50 
#>
