Title: | Distance-Based k-Medoids |
---|---|
Description: | Algorithms of distance-based k-medoids clustering: simple and fast k-medoids, ranked k-medoids, and increasing number of clusters in k-medoids. It calculates distances for mixed-variable data such as Gower, Podani, Wishart, Huang, Harikumar-PV, and Ahmad-Dey. Cluster validation applies internal and relative criteria. The internal criteria include the silhouette index and shadow values. The relative criterion applies a bootstrap procedure producing a heatmap with a flexible matrix-reordering algorithm such as complete, ward, or average linkage. The cluster result can be plotted in a marked barplot or a pca biplot. |
Authors: | Weksi Budiaji [aut, cre] |
Maintainer: | Weksi Budiaji <[email protected]> |
License: | GPL-3 |
Version: | 0.4.2 |
Built: | 2024-10-29 04:01:16 UTC |
Source: | https://github.com/cran/kmed |
This function creates a barplot from a cluster result. A barplot indicates the location and dispersion of each cluster. The x-axis of the barplot is the variable's mean, while the y-axis is the variable's name.
barplotnum(dataori, clust, nc = 1, alpha = 0.05)
dataori | An original data set. |
clust | A vector of cluster membership (see Details). |
nc | A number of columns for the plots of all clusters (see Details). |
alpha | A numeric number to set the significance level (between 0 and 0.2). |
This is a marked barplot because some markers are added, i.e. a significance test and a population mean for each (numerical) variable. The significance test applies a t-test between the population mean and the cluster mean in every variable. The alpha is set between 0 and 20%. If the population mean differs from the cluster mean, the bar shade in the barplot also differs.
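As a sketch of the test being described (illustrative base-R code, not the package's internals):

# Sketch: one-sample t-test of a cluster's values against the overall
# (population) mean of one variable; the bar shade differs when significant
x <- iris[iris$Species == "setosa", "Sepal.Length"]   # a stand-in cluster
t.test(x, mu = mean(iris$Sepal.Length))$p.value < 0.05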
clust is a vector with length equal to the number of objects (n); the function returns an error otherwise. nc controls the layout (grid) of the plot. If nc = 1, the plot of each cluster is placed in a single column. When the number of clusters is 6 and nc = 2, for example, the plot has a layout of a 3-row and 2-column grid, as the sketch below shows.
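A minimal check of that 3 x 2 layout (assuming a 6-cluster membership vector):

dat <- iris[,1:4]
memb6 <- cutree(hclust(dist(dat)), 6)   # a 6-cluster membership vector
barplotnum(dat, memb6, nc = 2)          # six panels in a 3 x 2 grid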
Function returns a barplot.
Weksi Budiaji
Contact: [email protected]
Leisch, F. (2008). Handbook of Data Visualization, Chapter Visualizing cluster analysis and finite mixture models, pp. 561-587. Springer Handbooks of Computational Statistics. Springer Verlag.
Dolnicar, S. and F. Leisch (2014). Using graphical statistics to better understand market segmentation solutions. International Journal of Market Research 56, 207-230.
dat <- iris[,1:4]
memb <- cutree(hclust(dist(dat)), 3)
barplotnum(dat, memb)
barplotnum(dat, memb, 2)
A dataset containing two variables of 300 objects and their class memberships generated by the clusterGeneration package.
clust4
A data frame with 300 rows and 3 variables:
x1: generated variable X1.
x2: generated variable X2.
class: class membership.
The data are generated via the genRandomClust function in the clusterGeneration package. The code to generate this data set is
set.seed(2016)
randclust <- clusterGeneration::genRandomClust(4, sepVal = 0.001, numNonNoisy = 2,
  numReplicate = 1, clustszind = 3,
  clustSizes = as.numeric(table(sample(1:4, 300, replace = TRUE))),
  outputDatFlag = FALSE, outputLogFlag = FALSE,
  outputEmpirical = FALSE, outputInfo = FALSE)
clust4 <- as.data.frame(randclust$datList$test_1)
clust4$class <- randclust$memList$test_1
Qiu, W., and H. Joe. 2015. ClusterGeneration: Random Cluster Generation (with Specified Degree of Separation).
Qiu, W., and H. Joe. 2006a. Generation of Random Clusters with Specified Degree of Separation. Journal of Classification 23 pp. 315-34.
Qiu, W., and H. Joe. 2006b. Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis 50 pp. 585-603.
A dataset containing two variables of 800 objects and their class memberships generated by the clusterGeneration package.
clust5
A data frame with 800 rows and 3 variables:
x1: generated variable X1.
x2: generated variable X2.
class: class membership.
The data are generated via the genRandomClust function in the clusterGeneration package. The code to generate this data set is
set.seed(2016)
randclust <- clusterGeneration::genRandomClust(5, sepVal = 0.2, numNonNoisy = 2,
  numReplicate = 1, clustszind = 3,
  clustSizes = as.numeric(table(sample(1:5, 800, replace = TRUE))),
  outputDatFlag = FALSE, outputLogFlag = FALSE,
  outputEmpirical = FALSE, outputInfo = FALSE)
clust5 <- as.data.frame(randclust$datList$test_1)
clust5$class <- randclust$memList$test_1
Qiu, W., and H. Joe. 2015. ClusterGeneration: Random Cluster Generation (with Specified Degree of Separation).
Qiu, W., and H. Joe. 2006a. Generation of Random Clusters with Specified Degree of Separation. Journal of Classification 23 pp. 315-34.
Qiu, W., and H. Joe. 2006b. Separation Index and Partial Membership for Clustering. Computational Statistics and Data Analysis 50 pp. 585-603.
This function does bootstrap replications for a clustering algorithm. Any hard clustering algorithm is valid.
clustboot(distdata, nclust = 2, algorithm = fastclust, nboot = 25, diss = TRUE)
distdata | A distance matrix (n x n)/ dist object or a data frame. |
nclust | A number of clusters. |
algorithm | A clustering algorithm function (see Details). |
nboot | A number of bootstrap replicates. |
diss | A logical whether distdata is a distance matrix/ object (TRUE) or a data frame (FALSE). |
This is a function to obtain a bootstrap evaluation of cluster results. The algorithm argument is a function with two input arguments: a distance matrix/ object or a data frame, and the number of clusters. The output of this function is a vector of cluster memberships.
The default algorithm is fastclust, which applies the fastkmed function. The code of fastclust is
fastclust <- function(x, nclust) {
res <- fastkmed(x, nclust, iterate = 50)
return(res$cluster)
}
For other examples, see Examples, which apply the ward and kmeans algorithms. When kmeans is applied, for example, diss is set to FALSE because the input of kmclust and clustboot is a data frame instead of a distance.
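Any function with the same two-argument signature works; as a further sketch (not part of the package), pam from the cluster package can be wrapped the same way:

# Hypothetical wrapper: cluster::pam as a clustboot algorithm
pamclust <- function(x, nclust) {
  res <- cluster::pam(as.dist(x), nclust)
  return(res$clustering)
}
# usage sketch: clustboot(mrwdist, nclust = 3, algorithm = pamclust, nboot = 7)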
Function returns a matrix of bootstrap replicates with a dimension of n x b, where n is the number of objects and b is the number of bootstrap replicates.
Weksi Budiaji
Contact: [email protected]
Dolnicar, S. and Leisch, F. 2010. Evaluation of structure and reproducibility of cluster solutions using the bootstrap. Marketing Letters 21 pp. 83-101.
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
ward.D2 <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "ward.D2")
  member <- cutree(res, nclust)
  return(member)
}
kmclust <- function(x, nclust) {
  res <- kmeans(x, nclust)
  return(res$cluster)
}
irisfast <- clustboot(mrwdist, nclust = 3, nboot = 7)
head(irisfast)
irisward <- clustboot(mrwdist, nclust = 3, algorithm = ward.D2, nboot = 7)
head(irisward)
iriskmeans <- clustboot(num, nclust = 3, algorithm = kmclust, nboot = 7, diss = FALSE)
head(iriskmeans)
This function creates a consensus matrix heatmap from a consensus/ agreement matrix. The values of the consensus/ agreement matrix are transformed in order to plot the heatmap.
clustheatmap(consmat, title = "")
consmat | A consensus/ agreement matrix (see Details). |
title | A title of the plot. |
This is a function to produce a consensus matrix heatmap from a consensus/ agreement matrix. A matrix produced by the consensusmatrix function can be provided directly in the consmat argument. The values of the consensus matrix, A, are then transformed via a non-linear transformation before plotting, where a_{ij} is the value of the consensus matrix in row i and column j and the transformation is applied over all values of the matrix A.
Function returns a heatmap plot.
Weksi Budiaji
Contact: [email protected]
Monti, S., P. Tamayo, J. Mesirov, and T. Golub. 2003. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52 pp. 91-118.
Hahsler, M., and Hornik, K., 2011. Dissimilarity plots: A visual exploration tool for partitional clustering. Journal of Computational and Graphical Statistics 20(2) pp. 335-354.
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
irisfast <- clustboot(mrwdist, nclust = 3, nboot = 7)
complete <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "complete")
  member <- cutree(res, nclust)
  return(member)
}
consensuscomplete <- consensusmatrix(irisfast, nclust = 3, reorder = complete)
clustheatmap(consensuscomplete)
This function creates a consensus matrix from a matrix of bootstrap replicates. It transforms an n x b matrix into an n x n matrix, where n is the number of objects and b is the number of bootstrap replicates.
consensusmatrix(bootdata, nclust, reorder = fastclust)
bootdata | A matrix of bootstrap replicates (n x b) (see Details). |
nclust | A number of clusters. |
reorder | Any distance-based clustering algorithm function (see Details). |
This is a function to obtain a consensus matrix from a matrix of bootstrap replicates to evaluate the clustering result. The bootdata argument can be supplied directly with a matrix produced by the clustboot function. The values of the consensus matrix, A, are calculated by

a_{ij} = (number of replicates in which objects i and j fall in the same cluster) / (number of replicates in which objects i and j are both sampled)

where a_{ij} is the agreement index between objects i and j. Note that because the agreement between objects i and j is equal to the agreement between objects j and i, the consensus matrix is a symmetric matrix. A sketch of this computation follows.
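A minimal sketch of the agreement computation; it assumes (an assumption about the clustboot output coding, not stated above) that objects absent from a replicate are coded 0:

# Sketch: agreement index from an n x b membership matrix 'boot';
# assumes objects not sampled in replicate b are coded 0 in column b
consensus_manual <- function(boot) {
  n <- nrow(boot)
  agree <- together <- matrix(0, n, n)
  for (b in seq_len(ncol(boot))) {
    drawn <- boot[, b] > 0
    both <- outer(drawn, drawn, "&")                  # i and j both sampled
    same <- outer(boot[, b], boot[, b], "==") & both  # ... and co-clustered
    agree <- agree + same
    together <- together + both
  }
  agree / pmax(together, 1)  # avoid 0/0 for pairs never sampled together
}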
Meanwhile, the reorder argument is a function to reorder the objects in both the rows and columns of the consensus matrix such that similar objects are close to each other. This task can be solved by applying a clustering algorithm to the consensus matrix. The reorder function has to have two input arguments: a distance matrix/ object and the number of clusters. The output is a vector of cluster memberships. Thus, the algorithm that can be applied in the reorder argument is a distance-based algorithm with a distance as its input.
The default reorder is fastclust, which applies the fastkmed function. The code of fastclust is
fastclust <- function(x, nclust) {
res <- fastkmed(x, nclust, iterate = 50)
return(res$cluster)
}
For other examples, see Examples, which apply the centroid and complete linkage algorithms.
Function returns a consensus/ agreement matrix of n x n dimension.
Weksi Budiaji
Contact: [email protected]
Monti, S., P. Tamayo, J. Mesirov, and T. Golub. 2003. Consensus clustering: A resampling-based method for class discovery and visualization of gene expression microarray data. Machine Learning 52 pp. 91-118.
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
irisfast <- clustboot(mrwdist, nclust = 3, nboot = 7)
consensusfast <- consensusmatrix(irisfast, nclust = 3)
centroid <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "centroid")
  member <- cutree(res, nclust)
  return(member)
}
consensuscentroid <- consensusmatrix(irisfast, nclust = 3, reorder = centroid)
complete <- function(x, nclust) {
  res <- hclust(as.dist(x), method = "complete")
  member <- cutree(res, nclust)
  return(member)
}
consensuscomplete <- consensusmatrix(irisfast, nclust = 3, reorder = complete)
consensusfast[c(1:5,51:55,101:105), c(1:5,51:55,101:105)]
consensuscentroid[c(1:5,51:55,101:105), c(1:5,51:55,101:105)]
consensuscomplete[c(1:5,51:55,101:105), c(1:5,51:55,101:105)]
This function calculates the co-occurrence distance proposed by Ahmad and Dey (2007).
cooccur(data)
data | A matrix or data frame of binary/ categorical variables (see Details). |
This function computes the co-occurrence distance, a binary/ categorical distance that is based on the distributions of the other variables (see Examples). In the Examples, we have the data set:
object | x | y | z |
1 | 1 | 2 | 2 |
2 | 1 | 2 | 1 |
3 | 2 | 1 | 2 |
4 | 2 | 1 | 2 |
5 | 1 | 1 | 1 |
6 | 2 | 2 | 2 |
7 | 2 | 1 | 2 |
The co-occurrence distance transforms each category of a binary/ categorical variable based on the distributions of the other variables; for example, the distance between categories 1 and 2 in the x variable can differ from the distance between categories 1 and 2 in the z variable. As an example, the transformed distance between categories 1 and 2 in the z variable is presented.
A cross tabulation between the z (columns) and x (rows) variables, with the corresponding column proportions, is

x \ z | 1 | 2 | || | 1 | 2 |
1 | 2 | 1 | || | 1.0 | 0.2 |
2 | 0 | 4 | || | 0.0 | 0.8 |
A cross tabulation between the z (columns) and y (rows) variables, with the corresponding column proportions, is

y \ z | 1 | 2 | || | 1 | 2 |
1 | 1 | 3 | || | 0.5 | 0.6 |
2 | 1 | 2 | || | 0.5 | 0.4 |
Then, the maximum proportion in each row is taken: 1.0, 0.8, 0.6, and 0.5. The new distance between categories 1 and 2 in the z variable is

\delta_z(1,2) = \frac{1}{2}\left[(1.0 + 0.8 - 1) + (0.6 + 0.5 - 1)\right] = \frac{0.8 + 0.1}{2} = 0.45

The constant 1/2 applies because the z variable depends on the distributions of the 2 other variables, i.e. the x and y variables. The new distances between the categories of the x and y variables can be calculated in a similar way. Thus, the distance between objects 1 and 2 is 0.45: only the z variable contributes to this distance because objects 1 and 2 have the same values in both the x and y variables. The code below reproduces this result.
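Reproducing the worked example directly (the matrix is the data table above):

# The 7 x 3 data set from the table above
dat <- matrix(c(1, 1, 2, 2, 1, 2, 2,    # x
                2, 2, 1, 1, 1, 2, 1,    # y
                2, 1, 2, 2, 1, 2, 2),   # z
              ncol = 3, dimnames = list(NULL, c("x", "y", "z")))
d <- cooccur(dat)
d[1, 2]   # expected 0.45, as computed above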
The data argument can be supplied with either a matrix or a data frame, in which the class of the elements has to be integer. If it is not integer, it is converted to the integer class. If the data consist of a single variable only, a simple matching distance is calculated instead; the co-occurrence distance is absent due to its dependency on the distributions of the other variables, and a warning message appears.
Function returns a distance matrix (n x n).
Weksi Budiaji
Contact: [email protected]
Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.
Harikumar, S., PV, S., 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.
set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
cooccur(a)
This function computes centroid shadow values and shadow value plots of each cluster. The plot presents the mean of the shadow values as well.
csv(distdata, idmedoid, idcluster, title = "")
distdata | A distance matrix (n x n) or dist object. |
idmedoid | A vector of id medoids (see Details). |
idcluster | A vector of cluster membership (see Details). |
title | A title of the plot. |
The origin of the centroid shadow value is the shadow function of the flexclust package, in which it is based on the first and second closest centroids. The csv function in this package modifies the centroid into the medoid such that the shadow value of object i is

sv(i) = \frac{2\, d(i, m(i))}{d(i, m(i)) + d(i, m'(i))}

where d(i, m(i)) is the distance between object i and the first closest medoid and d(i, m'(i)) is the distance between object i and the second closest medoid.
The idmedoid argument corresponds to the idcluster argument. If the length of idmedoid is 3, for example, idcluster has to have 3 unique cluster memberships, or it returns an Error otherwise. The length of idcluster also has to be equal to n (the number of objects). In contrast to the silhouette value, the centroid shadow value is interpreted such that a lower value indicates a better cluster separation; a hand computation is sketched below.
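A hand computation of the shadow value for one object, based on the formula above (sv_i is an illustrative helper, not package code):

# Sketch: shadow value of object i given a distance matrix and medoid ids
sv_i <- function(d, i, medoids) {
  dm <- sort(d[i, medoids])[1:2]   # distances to the two closest medoids
  2 * dm[1] / (dm[1] + dm[2])
}
distiris <- as.matrix(dist(iris[,1:4]))
res <- fastkmed(distiris, 3)
sv_i(distiris, 1, res$medoid)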
Function returns a list with the following components:
result is a data frame of the shadow values for all objects.
plot is the shadow value plot of each cluster.
Weksi Budiaji
Contact: [email protected]
Leisch, F. 2010. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing 20, pp. 457-469.
Budiaji, W. 2019. Medoid-based shadow value validation and visualization. International Journal of Advances in Intelligent Informatics 5(2), pp. 76-88.
distiris <- as.matrix(dist(iris[,1:4]))
res <- fastkmed(distiris, 3)
sha <- csv(distiris, res$medoid, res$cluster)
sha$result[c(1:3,70:75,101:103),]
sha$plot
This function computes a distance matrix for a mixed variable data set applying various methods.
distmix(data, method = "gower", idnum = NULL, idbin = NULL, idcat = NULL)
data | A data frame or matrix object. |
method | A method to calculate the mixed variables distance (see Details). |
idnum | A vector of column indices of the numerical variables. |
idbin | A vector of column indices of the binary variables. |
idcat | A vector of column indices of the categorical variables. |
There are six methods available to calculate the mixed-variable distance: gower, wishart, podani, huang, harikumar, and ahmad.
gower

The Gower (1971) distance is the most common distance for a mixed-variable data set. Although the Gower distance accommodates missing values, a missing value is not allowed in this function. If there is a missing value, the Gower distance from the daisy function in the cluster package can be applied. The Gower distance between objects i and j is computed by

d_{ij} = \frac{\sum_{l=1}^{p} w_{ijl}\, \delta_{ijl}}{\sum_{l=1}^{p} w_{ijl}}

where w_{ijl} is a weight in variable l that is usually 1 or 0 (for a missing value). If variable l is a numerical variable, \delta_{ijl} = |x_{il} - x_{jl}| / R_l; \delta_{ijl} \in {0, 1} if variable l is a binary/ categorical variable.
wishart

Wishart (2003) has proposed a different measure compared to Gower (1971) in the numerical variable part. Instead of the range, it applies the variance of the numerical variable such that the distance becomes

d_{ij} = \sqrt{\sum_{l=1}^{p} w_{ijl} \left(\frac{x_{il} - x_{jl}}{\delta_l}\right)^2}

where \delta_l = s_l when l is a numerical variable and the term reduces to \delta_{ijl} \in {0, 1} when l is a binary/ categorical variable.
podani

Podani (1999) has suggested a different method to compute a distance for a mixed-variable data set. The Podani distance takes the same form as the Wishart (2003) distance, with \delta_l = R_l (the range) when l is a numerical variable and \delta_{ijl} \in {0, 1} when l is a binary/ categorical variable.
huang

The Huang (1997) distance between objects i and j is computed by

d_{ij} = \sum_{r=1}^{p_n} (x_{ir} - x_{jr})^2 + \gamma \sum_{s=1}^{p_c} \delta(x_{is}, x_{js})

where p_n and p_c are the number of numerical and categorical variables, respectively, \gamma is a weight balancing the two parts (in Huang (1997), the average standard deviation of the numerical variables), and \delta(x_{is}, x_{js}) is the mismatch/ simple matching distance (see matching) between object i and object j in variable s.
harikumar

Harikumar-PV (2015) has proposed a distance for a mixed-variable data set that sums three parts:

d_{ij} = \sum_{r=1}^{p_n} |x_{ir} - x_{jr}| + \sum_{s=1}^{p_c} \delta_{co}(x_{is}, x_{js}) + \sum_{t=1}^{p_b} \delta_H(x_{it}, x_{jt})

where p_b is the number of binary variables, \delta_{co} is the co-occurrence distance (see cooccur) for the categorical variables, and \delta_H is the Hamming distance for the binary variables.
ahmad

Ahmad and Dey (2007) have computed the distance of a mixed-variable data set via

d_{ij} = \sum_{r=1}^{p_n} (x_{ir} - x_{jr})^2 + \sum_{s=1}^{p_c} \delta_{co}(x_{is}, x_{js})^2

where \delta_{co} is the co-occurrence distance (see cooccur). In the Ahmad and Dey distance, the binary and categorical variables are not separable such that the co-occurrence distance is based on the combination of these two classes, i.e. binary and categorical variables. Note that this function applies the standard version of the squared Euclidean distance, i.e. without any weight.
At least two of the idnum, idbin, and idcat arguments have to be provided because this function calculates a mixed distance. If the method is harikumar, there have to be at least two categorical variables such that the co-occurrence distance can be computed. The same applies when method = "ahmad": idbin plus idcat has to cover more than one column. It returns an Error message otherwise. The sketch below shows calls satisfying these requirements.
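A sketch of such calls (the data construction mirrors the Examples below):

set.seed(1)
mixdata <- cbind(iris[1:7, 1:3],
                 matrix(sample(1:2, 7*3, replace = TRUE), 7, 3),  # binary
                 matrix(sample(1:3, 7*3, replace = TRUE), 7, 3))  # categorical
distmix(mixdata, method = "huang", idnum = 1:3, idbin = 4:6, idcat = 7:9)
distmix(mixdata, method = "ahmad", idnum = 1:3, idbin = 4:6, idcat = 7:9)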
Function returns a distance matrix (n x n).
Weksi Budiaji
Contact: [email protected]
Ahmad, A., and Dey, L. 2007. A K-mean clustering algorithm for mixed numeric and categorical data. Data and Knowledge Engineering 63, pp. 503-527.
Gower, J., 1971. A general coefficient of similarity and some of its properties. Biometrics 27, pp. 857-871
Harikumar, S., PV, S., 2015. K-medoid clustering for heterogeneous data sets. Procedia Computer Science 70, pp. 226-237.
Huang, Z., 1997. Clustering large data sets with mixed numeric and categorical values, in: The First Pacific-Asia Conference on Knowledge Discovery and Data Mining, pp. 21-34.
Podani, J., 1999. Extending gower's general coefficient of similarity to ordinal characters. Taxon 48, pp. 331-340.
Wishart, D., 2003. K-means clustering with outlier detection, mixed variables and missing values, in: Exploratory Data Analysis in Empirical Research: Proceedings of the 25th Annual Conference of the Gesellschaft fur Klassifikation e.V., University of Munich, March 14-16, 2001, Springer Berlin Heidelberg, Berlin, Heidelberg. pp. 216-226.
set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
a1 <- matrix(sample(1:3, 7*3, replace = TRUE), 7, 3)
mixdata <- cbind(iris[1:7,1:3], a, a1)
colnames(mixdata) <- c(paste("num", 1:3, sep = ""),
                       paste("bin", 1:3, sep = ""),
                       paste("cat", 1:3, sep = ""))
distmix(mixdata, method = "gower", idnum = 1:3, idbin = 4:6, idcat = 7:9)
This function computes a pairwise numerical distance between two numerical data sets.
distNumeric(x, y, method = "mrw", xyequal = TRUE)
x | A first data matrix (see Details). |
y | A second data matrix (see Details). |
method | A method to calculate the pairwise numerical distance (see Details). |
xyequal | A logical whether x is equal to y (see Details). |
The x and y arguments have to be matrices with the same number of columns, where the rows indicate the objects and the columns are the variables. This function calculates all pairwise distances between the rows of the x and y matrices. Although it calculates a pairwise distance between two data sets, the default computes all distances within the x matrix. If the x matrix is not equal to the y matrix, xyequal has to be set to FALSE.
The methods available are mrw (Manhattan weighted by range), sev (squared Euclidean weighted by variance), ser (squared Euclidean weighted by range), ser.2 (squared Euclidean weighted by squared range), and se (squared Euclidean). Their formulas are:

mrw: d_{ij} = \sum_{r=1}^{p_n} |x_{ir} - x_{jr}| / R_r
sev: d_{ij} = \sum_{r=1}^{p_n} (x_{ir} - x_{jr})^2 / s_r^2
ser: d_{ij} = \sum_{r=1}^{p_n} (x_{ir} - x_{jr})^2 / R_r
ser.2: d_{ij} = \sum_{r=1}^{p_n} (x_{ir} - x_{jr})^2 / R_r^2
se: d_{ij} = \sum_{r=1}^{p_n} (x_{ir} - x_{jr})^2

where p_n is the number of numerical variables, R_r is the range of the r-th variable, and s_r^2 is the variance of the r-th variable. A quick hand check of mrw follows.
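A hand check of the mrw formula against the function (a sketch):

num <- as.matrix(iris[,1:4])
rng <- apply(num, 2, function(v) diff(range(v)))  # R_r for each variable
sum(abs(num[1, ] - num[2, ]) / rng)               # hand-computed d_12
distNumeric(num, num, method = "mrw")[1, 2]       # should agree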
Function returns a distance matrix with the number of rows equal to the number of objects in the x matrix and the number of columns equal to the number of objects in the y matrix.
Weksi Budiaji
Contact: [email protected]
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
mrwdist[1:6,1:6]
This function runs the simple and fast k-medoid algorithm proposed by Park and Jun (2009).
fastkmed(distdata, ncluster, iterate = 10, init = NULL)
distdata | A distance matrix (n x n) or dist object. |
ncluster | A number of clusters. |
iterate | A number of iterations for the clustering algorithm. |
init | A vector of initial objects as the cluster medoids (see Details). |
The simple and fast k-medoids, which sets a set of medoids as the cluster centers, adapts the k-means algorithm for medoid updating. The new medoids of each iteration are calculated within each cluster only, such that the algorithm gains speed.
init = NULL is required because Park and Jun (2009) have a particular method to select the initial medoids. The initial medoids are selected by

v_j = \sum_{i=1}^{n} \frac{d_{ij}}{\sum_{l=1}^{n} d_{il}}, \quad j = 1, 2, \ldots, n

where the k objects with the smallest v_j are selected if the number of clusters is k. A sketch of this selection follows.
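The selection rule sketched in base R (an illustration of the formula, not the package's internal code):

# Sketch: Park & Jun initial medoids for k = 3 on the iris distances
d <- as.matrix(dist(iris[,1:4]))
v <- colSums(d / rowSums(d))   # v_j = sum_i d_ij / sum_l d_il
order(v)[1:3]                  # ids of the k smallest v_j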
init can be provided with a vector of object ids. The length of the vector has to be equal to the number of clusters. However, when a vector is assigned to the init argument, the algorithm is no longer the simple and fast k-medoids algorithm. The inckmed function, for example, defines a different method to select the initial medoids, though it applies the fastkmed function.
Function returns a list of components:
cluster is the clustering membership result.
medoid is the id of the medoids.
minimum_distance is the distance of all objects to their cluster medoid.
Weksi Budiaji
Contact: [email protected]
Park, H., Jun, C., 2009. A simple and fast algorithm for k-medoids clustering. Expert Systems with Applications 36, pp. 3336-3341.
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
result <- fastkmed(mrwdist, ncluster = 3, iterate = 50)
table(result$cluster, iris[,5])
A dataset containing four variables of 113 countries for their food security index, based on panelists' evaluations in 2017.
globalfood
A data frame with 113 rows and 4 variables:
Index of food affordability.
Index of food availability.
Index of food quality and safety.
Index of natural resources and resilience.
The original indicators consist of 27 variables. They are summarized into four pillars of food security: affordability; availability; quality and safety; and natural resources and resilience. Food-security expert panelists score each country from 0 to 100, where 0 is the least favorable towards food security.
https://impact.economist.com/sustainability/project/food-security-index/
A mixed variable dataset containing 14 variables of 297 patients for their heart disease diagnosis.
heart
A data frame with 297 rows and 14 variables:
Age in years (numerical).
Sex: 1 = male, 0 = female (logical).
Four chest pain types: (1) typical angina, (2) atypical angina, (3) non-anginal pain, (4) asymptomatic (categorical).
Resting blood pressure (in mm Hg on admission to the hospital) (numerical).
Serum cholestoral in mg/dl (numerical).
Fasting blood sugar more than 120 mg/dl (logical).
Resting electrocardiographic results: (0) normal, (1) having ST-T wave abnormality, (2) showing probable or definite left ventricular hypertrophy by Estes' criteria (categorical).
Maximum heart rate achieved (numerical).
Exercise induced angina (logical).
ST depression induced by exercise relative to rest (numerical).
The slope of the peak exercise ST segment: (1) upsloping, (2) flat, (3) downsloping (categorical).
Number of major vessels (0-3) colored by flourosopy (numerical).
(3) normal, (6) fixed defect, (7) reversible defect (categorical).
Diagnosis of heart disease (4 classes). It can be reduced to 2 classes by setting 0 for 0 values and 1 for non-0 values.
The data set is taken from the UCI machine learning repository. The original data set consists of 303 patients including 6 NA's. The missing values are omitted such that the data reduce to 297 patients.
https://archive.ics.uci.edu/ml/datasets/Heart+Disease
Lichman, M. (2013). UCI machine learning repository.
This function runs the increasing number of clusters in the k-medoids algorithm proposed by Yu et al. (2018).
inckmed(distdata, ncluster, iterate = 10, alpha = 1)
distdata | A distance matrix (n x n) or dist object. |
ncluster | A number of clusters. |
iterate | A number of iterations for the clustering algorithm. |
alpha | A stretch factor to determine the range of initial medoid selection (see Details). |
This algorithm is claimed to cope with the weakness of the simple and fast k-medoids (fastkmed). The origin of the algorithm is a centroid-based algorithm applying the Euclidean distance. Because this function is a medoid-based algorithm, the object mean (centroid) and variance are redefined into the medoid and deviation, respectively.

The alpha argument is a stretch factor, i.e. a constant defined by the user. It is applied to determine a set of medoid candidates, namely

{ x_i | \sigma_i \le \alpha\, \sigma }

where \sigma_i is the average deviation of object i and \sigma is the average deviation of the data set. They are computed by

\sigma_i = \frac{1}{n-1} \sum_{j=1}^{n} d(x_i, x_j) \quad and \quad \sigma = \frac{1}{n} \sum_{i=1}^{n} d(x_i, v)

where n is the number of objects, x_i is object i, and v is the most centrally located object.
Function returns a list of components:
cluster is the clustering membership result.
medoid is the id of the medoids.
minimum_distance is the distance of all objects to their cluster medoid.
Weksi Budiaji
Contact: [email protected]
Yu, D., Liu, G., Guo, M., Liu, X., 2018. An improved K-medoids algorithm based on step increasing and optimizing medoids. Expert Systems with Applications 92, pp. 464-473.
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
result <- inckmed(mrwdist, ncluster = 3, iterate = 50, alpha = 1.5)
table(result$cluster, iris[,5])
This function computes the simple matching distance from two data frames/ matrices.
matching(x, y)
x | A first data frame or matrix (see Details). |
y | A second data frame or matrix (see Details). |
The x and y arguments have to be data frames/ matrices with the same number of columns, where the rows indicate the objects and the columns are the variables. This function calculates all pairwise distances between the rows of the x and y data frames/ matrices. If the x data frame/ matrix is equal to the y data frame/ matrix, it explicitly calculates all distances within the x data frame/ matrix.

The simple matching distance between objects i and j is calculated by

d_{ij} = \frac{1}{p} \sum_{s=1}^{p} \delta_{ijs}

where p is the number of variables and \delta_{ijs} \in {0, 1}: \delta_{ijs} = 0 if x_{is} = x_{js}, and \delta_{ijs} = 1 if x_{is} \neq x_{js}.
As an example, the distance between objects 1 and 2 is presented.
object | x | y | z |
1 | 1 | 2 | 2 |
2 | 1 | 2 | 1 |
The distance between objects 1 and 2 is

d_{12} = \frac{0 + 0 + 1}{3} = \frac{1}{3} \approx 0.33

which the code below verifies.
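Verifying the hand computation with the function:

obj <- rbind(c(1, 2, 2),   # object 1
             c(1, 2, 1))   # object 2
matching(obj, obj)[1, 2]   # (0 + 0 + 1) / 3 = 0.33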
Function returns a distance matrix with the number of rows equal to
the number of objects in the x
data frame/ matrix () and
the number of columns equals to the number of objects in the
y
data frame/ matrix ().
Weksi Budiaji
Contact: [email protected]
set.seed(1)
a <- matrix(sample(1:2, 7*3, replace = TRUE), 7, 3)
matching(a, a)
This function computes medoid shadow values and shadow value plots of each cluster. The plot presents the mean of the shadow values as well.
msv(distdata, idmedoid, idcluster, title = "")
distdata | A distance matrix (n x n) or dist object. |
idmedoid | A vector of id medoids (see Details). |
idcluster | A vector of cluster membership (see Details). |
title | A title of the plot. |
The origin of the shadow value is the shadow function of the flexclust package, in which it is based on the first and second closest centroids. The msv function in this package modifies the centroid into the medoid such that the medoid shadow value of object i is

msv(i) = \frac{d(i, m'(i)) - d(i, m(i))}{d(i, m'(i))}

where d(i, m(i)) is the distance between object i and the first closest medoid and d(i, m'(i)) is the distance between object i and the second closest medoid.
The idmedoid argument corresponds to the idcluster argument. If the length of idmedoid is 3, for example, idcluster has to have 3 unique cluster memberships, or it returns an Error otherwise. The length of idcluster also has to be equal to n (the number of objects). In contrast to the centroid shadow value, the medoid shadow value is interpreted like a silhouette value: the higher the value, the better the separation. A hand computation is sketched below.
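A hand computation of the medoid shadow value for one object (msv_i is an illustrative helper based on the formula above, not package code):

# Sketch: medoid shadow value of object i
msv_i <- function(d, i, medoids) {
  dm <- sort(d[i, medoids])[1:2]   # two closest medoid distances
  (dm[2] - dm[1]) / dm[2]
}
distiris <- as.matrix(dist(iris[,1:4]))
res <- fastkmed(distiris, 3)
msv_i(distiris, 1, res$medoid)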
Function returns a list with the following components:
result is a data frame of the shadow values for all objects.
plot is the shadow value plot of each cluster.
Weksi Budiaji
Contact: [email protected]
Leisch, F. 2010. Neighborhood graphs, stripes and shadow plots for cluster visualization. Statistics and Computing 20, pp. 457-469.
Budiaji, W. 2019. Medoid-based shadow value validation and visualization. International Journal of Advances in Intelligent Informatics 5(2), pp. 76-88.
distiris <- as.matrix(dist(iris[,1:4]))
res <- fastkmed(distiris, 3)
sha <- msv(distiris, res$medoid, res$cluster)
sha$result[c(1:3,70:75,101:103),]
sha$plot
This function creates a biplot from a pca object, which is generated by the prcomp function from the stats package.
pcabiplot(PC, x = "PC1", y = "PC2", var.line = TRUE, colobj = rep(1, nrow(PC$x)), o.size = 1)
PC | A pca object generated by prcomp. |
x | X axis (see Details). |
y | Y axis (see Details). |
var.line | A logical whether variable lines are plotted. |
colobj | A vector to provide colors for the objects (see Details). |
o.size | A numeric number to set the object size. |
This is a function to plot a pca biplot from a pca object. The x and y axes can be supplied with any principal component. The length of the colobj vector has to be equal to the number of objects. This argument controls the colors of the objects and is very convenient for exploring a clustering result, as the sketch below shows. The default value is that all objects have the same color.
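Coloring the objects by a k-medoids result (a sketch):

pcadat <- prcomp(iris[,1:4], scale. = TRUE)
res <- fastkmed(as.matrix(dist(iris[,1:4])), 3)
pcabiplot(pcadat, colobj = res$cluster)   # one color per cluster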
Function returns a pca biplot.
Weksi Budiaji
Contact: [email protected]
pcadat <- prcomp(iris[,1:4], scale. = TRUE)
pcabiplot(pcadat)
This function runs the rank k-medoids algorithm proposed by Zadegan et al. (2013).
rankkmed(distdata, ncluster, m = 3, iterate = 10, init = NULL)
distdata | A distance matrix (n x n) or dist object. |
ncluster | A number of clusters. |
m | A number of objects to compute hostility (see Details). |
iterate | A number of iterations for the clustering algorithm. |
init | A vector of initial objects as the cluster medoids (see Details). |
This algorithm is claimed to cope with the local optima problem of the simple and fast k-medoids algorithm (fastkmed). The m argument is defined by the user and has to be 1 < m \le n. The m objects are applied to compute the hostility measure

m_j = \sum_{x_i \in Y} r_{ij}

where x_j is object j, Y is the set of objects as many as m, and r_{ij} is the rank distance, i.e. the sorted distance, between objects i and j.
init can be provided with a vector of object ids. The length of the vector has to be equal to the number of clusters. However, when a vector is assigned to the init argument, the algorithm is no longer the rank k-medoids algorithm; see the sketch below.
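A sketch of supplying explicit initial medoids:

num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
result <- rankkmed(mrwdist, ncluster = 3, init = c(1, 51, 101))
table(result$cluster, iris[,5])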
Function returns a list of components:
cluster is the clustering membership result.
medoid is the id of the medoids.
minimum_distance is the distance of all objects to their cluster medoid.
Weksi Budiaji
Contact: [email protected]
Zadegan, S.M.R, Mirzaie M, and Sadoughi, F. 2013. Ranked k-medoids: A fast and accurate rank-based partitioning algorithm for clustering large datasets. Knowledge-Based Systems 39, 133-143.
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
result <- rankkmed(mrwdist, ncluster = 3, iterate = 50)
table(result$cluster, iris[,5])
This function creates silhouette indices and silhouette plots of each cluster. The plot also presents the mean of the silhouette indices per cluster.
sil(distdata, idmedoid, idcluster, title = "")
distdata | A distance matrix (n x n) or dist object. |
idmedoid | A vector of id medoids (see Details). |
idcluster | A vector of cluster membership (see Details). |
title | A title of the plot. |
The silhouette index of object i is calculated by

s_i = \frac{b_i - a_i}{\max(a_i, b_i)}

where a_i is the average distance of object i to all objects within its cluster, and b_i is the average distance of object i to all objects within the nearest cluster. A hand computation is sketched below.
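A hand computation of the index for one object, sketching a_i and b_i (sil_i is an illustrative helper, not the package's code):

# Sketch: silhouette index of object i from distances d and memberships memb
sil_i <- function(d, i, memb) {
  a <- mean(d[i, memb == memb[i] & seq_along(memb) != i])               # a_i
  b <- min(tapply(d[i, memb != memb[i]], memb[memb != memb[i]], mean))  # b_i
  (b - a) / max(a, b)
}
distiris <- as.matrix(dist(iris[,1:4]))
res <- fastkmed(distiris, 3)
sil_i(distiris, 1, res$cluster)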
The idmedoid argument corresponds to the idcluster argument. If the length of idmedoid is 3, for example, idcluster has to have 3 unique memberships, or it returns an Error otherwise. The length of idcluster also has to be equal to n (the number of objects).
Function returns a list with the following components:
result is a data frame of the silhouette indices for all objects.
plot is the silhouette plot of each cluster.
Weksi Budiaji
Contact: [email protected]
P. J. Rousseeuw. 1987 Silhouettes: a graphical aid to the interpretation and validation of cluster analysis. Journal of Computational and Applied Mathematics, vol. 20, pp. 53-65
distiris <- as.matrix(dist(iris[,1:4]))
res <- fastkmed(distiris, 3)
silhouette <- sil(distiris, res$medoid, res$cluster)
silhouette$result[c(1:3,70:75,101:103),]
silhouette$plot
This function runs the simple k-medoid algorithm proposed by Budiaji and Leisch (2019).
skm(distdata, ncluster, seeding = 20, iterate = 10)
distdata | A distance matrix (n x n) or dist object. |
ncluster | A number of clusters. |
seeding | A number of seedings to run the algorithm (see Details). |
iterate | A number of iterations for each seeding (see Details). |
The simple k-medoids, which sets a set of medoids as the cluster centers, adapts the simple and fast k-medoids algorithm. The best practice for running the simple and fast k-medoids algorithm is to run it several times with different random seedings, which the seeding argument automates; a sketch comparing two settings follows.
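A sketch comparing two seeding settings (whether more seedings change the solution is data-dependent):

num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
res20 <- skm(mrwdist, ncluster = 3, seeding = 20)
res50 <- skm(mrwdist, ncluster = 3, seeding = 50)
res20$medoid
res50$medoid   # medoids may or may not coincide across seeding settings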
Function returns a list of components:
cluster is the clustering membership result.
medoid is the id of the medoids.
minimum_distance is the distance of all objects to their cluster medoid.
Weksi Budiaji
Contact: [email protected]
Budiaji, W. and Leisch, F. 2019. Simple k-medoids partitioning algorithm for mixed variable data. Algorithms 12(9), 177.
num <- as.matrix(iris[,1:4])
mrwdist <- distNumeric(num, num, method = "mrw")
result <- skm(mrwdist, ncluster = 3, seeding = 50)
table(result$cluster, iris[,5])