The function clusterInfectors uses either kernel density estimation or hierarchical clustering to cluster the infectors for each infectee. This clustering provides a way to separate out the few top possible infectors for each infectee if there is such a cluster.

clusterInfectors(
  df,
  indIDVar,
  pVar,
  clustMethod = c("n", "kd", "hc_absolute", "hc_relative"),
  cutoff
)

Arguments

df

The name of the data set with transmission probabilities (column pVar) and individual IDs (columns <indIDVar>.1 and <indIDVar>.2).

indIDVar

The name (in quotes) of the individual ID columns (data frame df must have variables called <indIDVar>.1 and <indIDVar>.2).

pVar

The name (in quotes) of the column with transmission probabilities.

clustMethod

The method used to cluster the infectors; one of "n", "kd", "hc_absolute", "hc_relative" (see details).

cutoff

The cutoff for clustering (see details).

Value

The original data frame (df) with a new column called cluster, which is a factor variable with value 1 if the infector is in the high probability cluster or 2 if the infector is in the low probability cluster.
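
As a small illustration of working with the returned value, the sketch below builds a toy data frame with the same kind of columns and keeps only the pairs in the high probability cluster. The data frame res and all of its values are invented for illustration; they are not output from the package.

```r
# Toy stand-in for the value returned by clusterInfectors(); the rows and
# probabilities below are invented for illustration only.
res <- data.frame(
  individualID.1 = c("A", "B", "C", "A", "B"),
  individualID.2 = c("D", "D", "D", "E", "E"),
  pScaled = c(0.70, 0.25, 0.05, 0.50, 0.50),
  cluster = factor(c(1, 2, 2, 2, 2), levels = c(1, 2))
)

# Keep only the pairs whose infector falls in the high probability cluster
topPairs <- res[res$cluster == 1, ]

# Count the pairs in each cluster
counts <- table(res$cluster)
```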

Details

This function provides a way to find the most likely infectors for each infectee using the clustering method indicated by clustMethod. The method can be one of c("n", "kd", "hc_absolute", "hc_relative").

If clustMethod == "n" then this function simply assigns the top n possible infectors to the high probability cluster, where n is defined by the value of cutoff.
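
The top n rule can be sketched in base R for the possible infectors of a single infectee. The helper topNCluster and the toy probabilities are hypothetical and not part of the package, and the tie-breaking shown here is an assumption since the text above does not say how ties are handled.

```r
# Hypothetical helper (not part of the package): label the n largest
# probabilities as cluster 1 and the rest as cluster 2.
topNCluster <- function(p, n) {
  # rank 1 = largest probability; ties broken by order of appearance
  # (an assumption made for this sketch)
  r <- rank(-p, ties.method = "first")
  factor(ifelse(r <= n, 1, 2), levels = c(1, 2))
}

# Toy probabilities for the possible infectors of one infectee
p <- c(0.40, 0.05, 0.30, 0.02, 0.20, 0.03)
cluster <- topNCluster(p, n = 3)
data.frame(p, cluster)
```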

If clustMethod == "kd" then kernel density estimation is used to split the infectors. The density of the probabilities for all infectors is estimated using a bandwidth defined by the value of cutoff. If the density is made up of at least two separate curves (separated by a region where the density drops to 0) then the infectors with probabilities greater than the lowest 0 region are assigned to the high probability cluster. If the density of the probabilities does not drop to 0 then all infectors are assigned to the low probability cluster (indicating no real clustering).
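
The kernel density rule can be sketched with stats::density for the possible infectors of a single infectee. The helper kdCluster, the tolerance tol used to decide that the density has "dropped to 0", and the toy probabilities are all assumptions made for this sketch; the package's own way of detecting the zero region may differ.

```r
# Hypothetical helper (not part of the package): split probabilities at the
# lowest region where the estimated density is numerically zero.
kdCluster <- function(p, bw, tol = 1e-8) {
  dens <- stats::density(p, bw = bw)
  # Grid points strictly inside the data range where the density is
  # treated as zero (tol is an assumed threshold)
  zeroX <- dens$x[dens$y < tol & dens$x > min(p) & dens$x < max(p)]
  if (length(zeroX) == 0) {
    # No zero region: all infectors go to the low probability cluster
    return(factor(rep(2, length(p)), levels = c(1, 2)))
  }
  # Probabilities above the lowest zero region form the high cluster
  factor(ifelse(p > min(zeroX), 1, 2), levels = c(1, 2))
}

# Toy probabilities: two clear groups separated by a wide gap
p <- c(0.60, 0.55, 0.02, 0.03, 0.01, 0.04)
cluster <- kdCluster(p, bw = 0.01)
data.frame(p, cluster)
```

With a larger bandwidth the two curves merge and the zero region disappears, so the choice of cutoff directly controls how wide a gap must be before it counts as a separation.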

If clustMethod == "hc_absolute" or clustMethod == "hc_relative", then hierarchical clustering with minimum distance (single linkage) is used to split the possible infectors into two clusters. This method functionally splits the infectors at the largest gap in their probabilities.

Then if clustMethod == "hc_absolute", those infectees where the gap between the two clusters is less than cutoff have all of their possible infectors reassigned to the low probability cluster (indicating no real clustering). If clustMethod == "hc_relative", then all infectees where the gap between the two clusters is less than cutoff times the second largest gap in probabilities are reassigned to the low probability cluster (indicating no real clustering).
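
The two hierarchical rules can be sketched with stats::hclust using single linkage for the possible infectors of a single infectee. The helper hcCluster, its relative argument, and the toy probabilities are hypothetical and not part of the package; the sketch assumes at least three distinct probabilities so that a second largest gap exists.

```r
# Hypothetical helper (not part of the package): split at the largest gap,
# then undo the split if the gap is smaller than the threshold.
hcCluster <- function(p, cutoff, relative = FALSE) {
  # Single linkage corresponds to "minimum distance" between clusters
  tree <- stats::hclust(stats::dist(p), method = "single")
  grp <- unname(stats::cutree(tree, k = 2))
  # The group containing the largest probability is the high cluster
  high <- grp == grp[which.max(p)]
  gap <- min(p[high]) - max(p[!high])
  threshold <- cutoff
  if (relative) {
    # Compare against cutoff times the second largest gap
    gaps <- sort(diff(sort(p)), decreasing = TRUE)
    threshold <- cutoff * gaps[2]
  }
  # Gap too small: everything goes to the low probability cluster
  if (gap < threshold) high[] <- FALSE
  factor(ifelse(high, 1, 2), levels = c(1, 2))
}

# Toy probabilities: largest gap is 0.15, second largest is about 0.10
p <- c(0.40, 0.05, 0.30, 0.02, 0.20, 0.03)
hcAbs <- hcCluster(p, cutoff = 0.05)
hcRel <- hcCluster(p, cutoff = 2, relative = TRUE)
data.frame(p, hcAbs, hcRel)
```

In this toy case the absolute rule keeps the split because the gap of 0.15 exceeds 0.05, while the relative rule with cutoff = 2 discards it because 0.15 is less than twice the second largest gap.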

Examples


## Use the nbResults data frame included in the package which has the results
## of the nbProbabilities() function on a TB-like outbreak.

## Clustering using top n
# High probability cluster includes infectors with highest 3 probabilities
clust1 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "n", cutoff = 3)
table(clust1$cluster)
#> 
#>    1    2 
#>  301 4648 

## Clustering using hierarchical clustering

# Cluster all infectees, do not force gap to be certain size
clust2 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                            clustMethod = "hc_absolute", cutoff = 0)
table(clust2$cluster)
#> 
#>    1    2 
#>  298 4651 

# \donttest{
# Absolute difference: gap between top and bottom clusters is more than 0.05
clust3 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "hc_absolute", cutoff = 0.05)
table(clust3$cluster)
#> 
#>    1    2 
#>  240 4709 

# Relative difference: gap between top and bottom clusters is more than double any other gap
clust4 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "hc_relative", cutoff = 2)
table(clust4$cluster)
#> 
#>    1    2 
#>  232 4717 

## Clustering using kernel density estimation
# Using a small bandwidth of 0.01
clust5 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "kd", cutoff = 0.01)
table(clust5$cluster)
#> 
#>    1    2 
#>  261 4688 
# }