Clustering using DBSCAN
Will Furnass
DBSCAN is an n-dimensional clustering algorithm. Its advantages over other clustering algorithms are that it
- does not require the number of clusters to be specified in advance,
- differentiates between clustered points and noise points, and
- can find clusters that are non-spherical.
Here we use the DBSCAN implementation provided by the scikit-learn package to cluster a 2D dataset. The algorithm enumerates distinct clusters using integer labels (assigning -1 to noise points); here these labels are plotted in 2D using the matplotlib library.
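As a rough illustration of the labelling scheme described above, the following minimal sketch runs scikit-learn's `DBSCAN` on a synthetic 2D dataset and counts the clusters and noise points. The dataset (`make_moons`) and the parameter values are assumptions chosen for illustration only; they are not the data or settings used in this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Illustrative 2D dataset of two crescent shapes (an assumption, not the
# dataset used in this notebook).
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# eps and min_samples are arbitrary values picked for this sketch.
labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# Clusters are numbered 0, 1, 2, ...; noise points are labelled -1.
cluster_ids = sorted(set(labels) - {-1})
n_noise = int(np.sum(labels == -1))
print(f"clusters found: {len(cluster_ids)}, noise points: {n_noise}")
```

The `labels` array has one entry per row of `X`, so it can be passed directly as the colour argument of a matplotlib scatter plot to draw each cluster in a different colour.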
Run the cell below, then use the two sliders to assess the impact of changing DBSCAN's two key parameters:
eps
: The maximum distance between two samples for them to be considered as in the same neighborhood.
min_pts
: The number of samples (or total weight) in a neighborhood for a point to be considered as a core point. This includes the point itself.
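For readers without access to the interactive sliders, the effect of the two parameters can be approximated with a small sweep. This is only a sketch: the dataset (`make_blobs`), the helper name `summarise`, and the parameter grid are all hypothetical choices. Note that scikit-learn's `DBSCAN` calls the second parameter `min_samples`; it plays the role of `min_pts` here.

```python
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_blobs

# Illustrative dataset (an assumption, not the notebook's data).
X, _ = make_blobs(n_samples=300, centers=3, cluster_std=0.6, random_state=0)


def summarise(eps, min_pts):
    """Return (number of clusters, number of noise points) for one setting.

    Hypothetical helper for this sketch; min_pts is passed to scikit-learn
    as min_samples.
    """
    labels = DBSCAN(eps=eps, min_samples=min_pts).fit_predict(X)
    n_clusters = len(set(labels) - {-1})  # ignore the noise label -1
    n_noise = int((labels == -1).sum())
    return n_clusters, n_noise


# Arbitrary grid of values chosen for illustration.
for eps in (0.1, 0.3, 1.0):
    for min_pts in (3, 10):
        n_clusters, n_noise = summarise(eps, min_pts)
        print(f"eps={eps}, min_pts={min_pts}: "
              f"{n_clusters} clusters, {n_noise} noise points")
```

In general, a larger `eps` or a smaller `min_pts` makes it easier for points to qualify as core points, so fewer points end up labelled as noise; a smaller `eps` or a larger `min_pts` has the opposite effect.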
References
Ester, M., Kriegel, H.-P., Sander, J., Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Presented at the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 226–231.
Keywords
- DBSCAN
- clustering