Density-based cluster analysis engine (DBCAE) is an exploratory data analysis tool that assesses the clusterability of a given data set and performs clustering based on the results of the clusterability evaluation. DBCAE adopts the nonparametric density-based definition of a cluster, which defines clusters as high-density regions separated by low-density regions. This definition has several advantages: for example, the number of clusters is not needed as an input, and the shape of the clusters is not restricted. The key objectives in density-based cluster analysis are to estimate the underlying probability density function (or just the local densities of the observed data points) using nonparametric methods, and to estimate the cluster tree associated with that density function. The cluster tree "counts" the number of connected components at each density level. An example of the cluster tree is given below, where a one-dimensional density function and its cluster tree are visualized.
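To make the "counting" idea concrete, here is a minimal sketch that counts the connected components of the upper level set {x : f(x) >= level} for a one-dimensional density evaluated on a grid. The two-component Gaussian mixture, the grid, and the function names `density` and `count_components` are illustrative assumptions; they are not taken from DBCAE.

```python
import math


def density(x):
    """Illustrative bimodal density: equal mixture of N(-2, 0.7^2) and N(2, 1)."""
    def normal_pdf(value, mean, sd):
        z = (value - mean) / sd
        return math.exp(-0.5 * z * z) / (sd * math.sqrt(2.0 * math.pi))

    return 0.5 * normal_pdf(x, -2.0, 0.7) + 0.5 * normal_pdf(x, 2.0, 1.0)


def count_components(values, level):
    """Count maximal runs of consecutive grid points whose value is >= level."""
    count = 0
    inside = False
    for v in values:
        above = v >= level
        if above and not inside:
            count += 1  # a new connected component starts here
        inside = above
    return count


# Evenly spaced grid on [-6, 6] (assumed wide enough to cover the mass).
grid = [-6.0 + 0.01 * i for i in range(1201)]
values = [density(x) for x in grid]

for level in (0.01, 0.10, 0.25, 0.30):
    print(f"level {level:.2f}: {count_components(values, level)} component(s)")
```

For this particular density the counts are 1, 2, 1, and 0. The level set is a single interval at low levels, splits into two branches once the level rises above the valley between the modes, loses the lower branch above the smaller peak, and is empty above the higher peak. Recording where these splits and disappearances occur is what the cluster tree does.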
DBCAE returns a non-overlapping clustering solution using a state-of-the-art density-based clustering algorithm called HDBSCAN*. This algorithm can return clusters with varying densities, because the returned clusters are chosen based on the branch strengths in the cluster tree. DBCAE also evaluates the strength of the root branch in the cluster tree to estimate the clusterability of the given data set. A finite sample from a unimodal (unclusterable) distribution can often lead to an empirical cluster tree that contains more than one branch. In that case, density-based clustering algorithms typically return more than one cluster, which can lead to misleading conclusions if the data actually comes from a unimodal distribution. Evaluating the strength of the root branch makes it easier to assess whether a particular data set is suitable for clustering.
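The following sketch illustrates how a finite sample can produce spurious branches. It is not DBCAE's root-branch evaluation, which is not specified here. It only counts the modes of a Gaussian kernel density estimate for a small, hand-picked sample that stands in for draws from a unimodal distribution. The sample values, the bandwidths, and the helper names `kde` and `count_modes` are all assumptions made for illustration.

```python
import math

# Hypothetical fixed sample, imagined as draws from one unimodal distribution.
sample = [-1.2, -0.9, -0.3, 0.0, 0.2, 0.5, 1.1, 1.5]


def kde(x, points, bandwidth):
    """Gaussian kernel density estimate evaluated at x."""
    norm = len(points) * bandwidth * math.sqrt(2.0 * math.pi)
    return sum(math.exp(-0.5 * ((x - p) / bandwidth) ** 2) for p in points) / norm


def count_modes(points, bandwidth, lo=-5.0, hi=5.0, steps=2000):
    """Count strict local maxima of the KDE on an evenly spaced grid."""
    grid = [lo + (hi - lo) * i / steps for i in range(steps + 1)]
    dens = [kde(x, points, bandwidth) for x in grid]
    return sum(1 for i in range(1, steps) if dens[i - 1] < dens[i] > dens[i + 1])


# Very little smoothing: every observation becomes its own bump.
print("bandwidth 0.05:", count_modes(sample, 0.05), "mode(s)")
# Heavy smoothing: the estimate collapses to a single mode.
print("bandwidth 3.0: ", count_modes(sample, 3.0), "mode(s)")
```

With the small bandwidth the estimate has eight modes, one per observation, while the heavily smoothed estimate has one. An empirical cluster tree built from the first estimate would show many branches even though nothing in the sample suggests real cluster structure, which is why some measure of how strongly the data supports staying in a single root branch is useful before trusting a multi-cluster solution.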
If DBCAE identifies a data set as suitable for clustering, it also tries to find the most stable density estimation parameters by comparing the cluster trees obtained with different parameter values. Comparing the outputs across parameter values makes it possible to find, for example, the most stable value for the number of clusters. If the user has a desired number of clusters in mind, DBCAE can also be used to find parameter values that return the desired number of clusters.
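A rough sketch of such a parameter scan is shown below. The text does not state how DBCAE measures stability between cluster trees, so this example substitutes a simple stand-in criterion: the cluster count that persists over the longest run of consecutive parameter values. The result table `clusters_by_param` contains made-up numbers, and the function names are hypothetical.

```python
from itertools import groupby

# Hypothetical scan results: parameter value -> number of clusters returned.
clusters_by_param = {5: 7, 10: 4, 15: 3, 20: 3, 25: 3, 30: 3, 35: 2, 40: 2, 45: 1}


def most_stable_count(results):
    """Return (count, params) for the longest run of consecutive
    parameter values that all yield the same number of clusters."""
    ordered = sorted(results.items())  # sort by parameter value
    best_count, best_params = None, []
    for count, run in groupby(ordered, key=lambda item: item[1]):
        params = [param for param, _ in run]
        if len(params) > len(best_params):
            best_count, best_params = count, params
    return best_count, best_params


def params_for_count(results, desired):
    """Return the parameter values that yield the desired number of clusters."""
    return sorted(param for param, count in results.items() if count == desired)


count, params = most_stable_count(clusters_by_param)
print(f"most stable: {count} clusters for parameter values {params}")
print("values giving 2 clusters:", params_for_count(clusters_by_param, 2))
```

On the made-up table, three clusters persist over the parameter values 15 to 30, and a user who wants two clusters would be pointed to the values 35 and 40. A criterion that compares the full cluster trees, as the text describes, would use more information than the cluster count alone.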