Clustering using DBSCAN

Will Furnass

DBSCAN is a density-based clustering algorithm for n-dimensional data. Its advantages over many other clustering algorithms are that it

  • does not require the number of clusters to be specified in advance,
  • differentiates between clustered points and noise points, and
  • can find non-spherical clusters.

Here we use the DBSCAN implementation provided by the scikit-learn package to cluster a 2D dataset. The algorithm identifies each distinct cluster with an integer label (noise points are labelled -1); these labels are then plotted in 2D using the matplotlib library.
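As a rough, non-interactive sketch of the clustering step described above, the snippet below runs scikit-learn's DBSCAN on a synthetic two-moons dataset and counts the clusters and noise points from the returned labels. The dataset and the eps and min_samples values are assumptions chosen for illustration; they are not necessarily those used in this notebook.

```python
import numpy as np
from sklearn.cluster import DBSCAN
from sklearn.datasets import make_moons

# Synthetic 2D dataset (an assumption; the notebook's own data may differ)
X, _ = make_moons(n_samples=300, noise=0.05, random_state=0)

# Parameter values picked for illustration only
model = DBSCAN(eps=0.2, min_samples=5)
labels = model.fit_predict(X)  # one integer label per sample, -1 for noise

# Noise (-1) is not a cluster, so exclude it from the cluster count
n_clusters = len(set(labels)) - (1 if -1 in labels else 0)
n_noise = int(np.sum(labels == -1))
print(f"clusters: {n_clusters}, noise points: {n_noise}")
```

The labels array can be passed directly as the colour argument of a matplotlib scatter plot to visualise the result.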

Run the cell below, then use the two sliders to assess the impact of changing DBSCAN's two key parameters:

  • eps: The maximum distance between two samples for one to be considered as in the neighborhood of the other.
  • min_pts (called min_samples in scikit-learn): The number of samples (or total weight) in a neighborhood for a point to be considered a core point. This includes the point itself.
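To make the interplay of the two parameters concrete, here is a minimal NumPy sketch of the core-point test only (not the full algorithm): a point is a core point if at least min_pts samples, itself included, lie within distance eps of it. The helper name core_point_mask and the four-point dataset are hypothetical and exist purely for illustration; scikit-learn's implementation is more efficient than this brute-force version.

```python
import numpy as np

def core_point_mask(X, eps, min_pts):
    """Mark points that have at least min_pts samples (self included) within eps.

    Hypothetical helper for illustration; brute-force pairwise distances.
    """
    diffs = X[:, None, :] - X[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    # Each point is at distance 0 from itself, so it counts towards its own total
    neighbour_counts = (dists <= eps).sum(axis=1)
    return neighbour_counts >= min_pts

# Three nearby points plus one distant outlier
X = np.array([[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [5.0, 5.0]])

# With min_pts=3 the three nearby points are core points; the outlier is not
print(core_point_mask(X, eps=0.2, min_pts=3))

# Raising min_pts to 4 leaves no core points, so every point would be noise
print(core_point_mask(X, eps=0.2, min_pts=4))
```

Increasing eps or decreasing min_pts makes it easier for points to qualify as core points, which tends to merge clusters and reduce noise; the opposite changes tend to fragment clusters and increase noise.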

References

  • Ester, M., Kriegel, H., Sander, J., Xu, X. (1996). A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise. Presented at the Second International Conference on Knowledge Discovery and Data Mining, AAAI Press, pp. 226–231.

Keywords

  • DBSCAN
  • clustering