R/clusterInfectors.R
clusterInfectors.Rd
The function clusterInfectors uses either kernel density estimation or
hierarchical clustering to cluster the infectors for each infectee. This clustering
provides a way to separate out the few top possible infectors for each infectee
if there is such a cluster.
clusterInfectors(
  df,
  indIDVar,
  pVar,
  clustMethod = c("n", "kd", "hc_absolute", "hc_relative"),
  cutoff
)
df: The name of the dataset with transmission probabilities (column pVar) and
individual IDs (columns <indIDVar>.1 and <indIDVar>.2).
indIDVar: The name (in quotes) of the individual ID columns
(data frame df must have variables called <indIDVar>.1 and <indIDVar>.2).
pVar: The name (in quotes) of the column with transmission probabilities.
clustMethod: The method used to cluster the infectors;
one of "n", "kd", "hc_absolute", "hc_relative" (see details).
cutoff: The cutoff for clustering (see details).
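As a rough illustration of the input shape described above, the following sketch builds a toy data frame with the two ID columns and a probability column, then checks that the required column names are present. The data values and the object names pairs and required are made up for this example; only the column naming pattern comes from the argument descriptions.

```r
# Toy input: one row per pair of individuals (values are invented).
indIDVar <- "individualID"
pVar <- "pScaled"

pairs <- data.frame(
  individualID.1 = c("A", "B", "C", "A"),
  individualID.2 = c("D", "D", "D", "E"),
  pScaled = c(0.70, 0.25, 0.05, 1.00)
)

# Columns that df must contain, derived from indIDVar and pVar.
required <- c(paste0(indIDVar, ".1"), paste0(indIDVar, ".2"), pVar)
all(required %in% names(pairs))
```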
The original data frame (df) with a new column called cluster,
which is a factor variable with value 1 if the infector is in the
high probability cluster or 2 if the infector is in the low probability cluster.
This function provides a way to find the most likely infectors for each infectee
using the clustering method indicated by clustMethod.
The method can be one of c("n", "kd", "hc_absolute", "hc_relative").
If clustMethod == "n" then this function simply assigns the top n possible
infectors to the high probability cluster, where n is defined by the value of cutoff.
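The top n rule can be sketched for a single infectee as follows. This is a minimal illustration, not the package's internal code: top_n_cluster is a hypothetical helper, p stands for the probabilities of that infectee's possible infectors, and breaking ties by order of appearance is an assumption.

```r
# Hypothetical helper: label the n largest probabilities as cluster 1.
top_n_cluster <- function(p, n) {
  # Rank from largest to smallest; ties broken by position (an assumption).
  rk <- rank(-p, ties.method = "first")
  factor(ifelse(rk <= n, 1, 2), levels = c(1, 2))
}

top_n_cluster(c(0.10, 0.40, 0.30, 0.05), n = 2)
```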
If clustMethod == "kd" then kernel density estimation is used to split the infectors.
The density of the probabilities for all infectors is estimated using a binwidth defined
by the value of cutoff. If the density is made up of at least two separate curves
(separated by a region where the density drops to 0), then the infectors with probabilities
above the lowest such 0 region are assigned to the high probability cluster. If the density of the
probabilities does not drop to 0, then all infectors are assigned to the low probability cluster
(indicating no real clustering).
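One way to picture the kernel density rule for a single infectee is the sketch below. It is an approximation under stated assumptions, not the package's implementation: kd_cluster is a hypothetical helper, cutoff is passed to stats::density() as bw, and "drops to 0" is approximated by the density falling below a small tolerance tol, since a Gaussian kernel estimate is rarely exactly 0.

```r
# Hypothetical helper: split probabilities p at the lowest zero-density region.
kd_cluster <- function(p, cutoff, tol = 1e-8) {
  low <- factor(rep(2, length(p)), levels = c(1, 2))
  if (length(p) < 2) return(low)

  # Assumption: cutoff is used as the smoothing bandwidth.
  d <- stats::density(p, bw = cutoff)

  # Grid points strictly between the smallest and largest probability
  # where the estimated density is effectively zero.
  inside <- d$x > min(p) & d$x < max(p)
  zero_x <- d$x[inside & d$y < tol]

  # No zero region: no real clustering, everything stays in cluster 2.
  if (length(zero_x) == 0) return(low)

  # Probabilities above the lowest zero region form cluster 1.
  factor(ifelse(p > min(zero_x), 1, 2), levels = c(1, 2))
}

kd_cluster(c(0.60, 0.55, 0.02, 0.01, 0.015), cutoff = 0.01)
```

A larger cutoff smooths the density more, so gaps between groups of probabilities are less likely to register as zero regions.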
If clustMethod == "hc_absolute" or clustMethod == "hc_relative", then
hierarchical clustering with minimum distance (single linkage) is used to split the
possible infectors into two clusters. This method functionally splits the infectors by
the largest gap in their probabilities.
Then if clustMethod == "hc_absolute", those infectees
where the gap between the two clusters is less than cutoff have all of their
possible infectors reassigned to the low probability cluster (indicating no real clustering).
If clustMethod == "hc_relative", then all infectees where the gap between the two
clusters is less than cutoff times the second largest gap in probabilities
have all of their possible infectors reassigned to the low probability cluster
(indicating no real clustering).
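The two hierarchical clustering variants can be sketched for a single infectee as below. This is an illustration under assumptions rather than the package's own code: gap_cluster is a hypothetical helper, p holds the probabilities of that infectee's possible infectors, and treating a missing second gap (only two infectors) as 0 is a choice made for this example.

```r
# Hypothetical helper: split p at its largest gap, then apply the cutoff rule.
gap_cluster <- function(p, clustMethod = c("hc_absolute", "hc_relative"),
                        cutoff = 0) {
  clustMethod <- match.arg(clustMethod)
  low <- factor(rep(2, length(p)), levels = c(1, 2))
  if (length(p) < 2) return(low)

  # Single linkage on one-dimensional data splits at the largest gap.
  hc <- stats::hclust(stats::dist(p), method = "single")
  grp <- stats::cutree(hc, k = 2)

  # Identify which of the two groups holds the higher probabilities.
  grp_means <- tapply(p, grp, mean)
  is_high <- as.character(grp) == names(grp_means)[which.max(grp_means)]
  gap <- min(p[is_high]) - max(p[!is_high])

  # Second largest gap between adjacent sorted probabilities (0 if none).
  gaps <- sort(diff(sort(p)), decreasing = TRUE)
  second_gap <- if (length(gaps) >= 2) gaps[2] else 0

  threshold <- if (clustMethod == "hc_absolute") cutoff else cutoff * second_gap

  # Gap too small: no real clustering, everything goes to cluster 2.
  if (gap < threshold) return(low)
  factor(ifelse(is_high, 1, 2), levels = c(1, 2))
}

p <- c(0.50, 0.45, 0.05, 0.03, 0.01)
gap_cluster(p, "hc_absolute", cutoff = 0.05)
gap_cluster(p, "hc_relative", cutoff = 2)
```

With cutoff = 0 the absolute variant never reassigns anyone, so every infectee with at least two possible infectors keeps a two-way split.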
## Use the nbResults data frame included in the package which has the results
## of the nbProbabilities() function on a TB-like outbreak.
## Clustering using top n
# High probability cluster includes infectors with highest 3 probabilities
clust1 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "n", cutoff = 3)
table(clust1$cluster)
#>
#> 1 2
#> 301 4648
## Clustering using hierarchical clustering
# Cluster all infectees, do not force gap to be certain size
clust2 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "hc_absolute", cutoff = 0)
table(clust2$cluster)
#>
#> 1 2
#> 298 4651
# Absolute difference: gap between top and bottom clusters is more than 0.05
clust3 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "hc_absolute", cutoff = 0.05)
table(clust3$cluster)
#>
#> 1 2
#> 240 4709
# Relative difference: gap between top and bottom clusters is more than double any other gap
clust4 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "hc_relative", cutoff = 2)
table(clust4$cluster)
#>
#> 1 2
#> 232 4717
## Clustering using kernel density estimation
# Using a small binwidth of 0.01
clust5 <- clusterInfectors(nbResults, indIDVar = "individualID", pVar = "pScaled",
                           clustMethod = "kd", cutoff = 0.01)
table(clust5$cluster)
#>
#> 1 2
#> 261 4688