Parallel K-Means Data Clustering
This software package
parallel-kmeans.tar.gz
(4.6 MB) of parallel K-means data clustering contains the followings:
 -  A parallel implementation using
      OpenMP and C
-  A parallel implementation using
      MPI and C
-  A sequential version in C
For large data support (more than 2 billion number of data points),
see this page for an MPI implementation 
that uses 8-byte integers.Algorithm:
 
 -  The above sequential algorithm is based on the paper by
      MacQueen, J. "Some Methods for Classification and Analysis of
      Multivariate Observations." In the 5th Berkeley Symposium on
      Mathematical Statistics and Probability, 1967.
 
-  The parallel implementation uses data parallelism. Data objects
      to be clustered are evenly partitioned among all processes while the
      cluster centers are replicated. Global-sum reduction for all cluster
      centers is performed at the end of each iteration in order to
      generate the new cluster centers.
 
To compile:
Although I used Intel C compiler, icc, version 7.1 during the code development,
there is no particular features required except for OpenMP. Thus, the
implementation should be fairly portable. Please modify
 Makefile
to change the compiler if needed.
 -  Make command for the sequential version: make seq_main 
-  Make command for the MPI version: make mpi_main 
-  Make command for the OpenMP version: make omp_main 
-  To enable the feature of using parallel netCDF to performance I/O, edit Makefile
      to change variable ENABLE_PNETCDF to yes/no and set the variable PNETCDF_DIR
      to the place where parallel netCDF is installed.
To run:
 -  The Makefile will produce executables
  - "omp_main" for OpenMP version
- "mpi_main" for MPI version
- "seq_main" for sequential version
-  The list of available command-line arguments can be obtained by running -h option
  -  For example, running command "omp_main -h" will produce:
Usage: main [switches] -i filename -n num_clusters
       -i filename    : file containing data to be clustered
       -c centers     : file containing initial centers. default: filename
       -b             : input file is in binary format (default no)
       -r             : output file in binary format (default no)
       -n num_clusters: number of clusters (K must > 1)
       -t threshold   : threshold value (default 0.0010)
       -p nproc       : number of threads (default system allocated)
       -a             : perform atomic OpenMP pragma (default no)
       -o             : output timing results (default no)
       -v var_name    : using PnetCDF for file input and output and var_name
                        is variable name in the netCDF file to be clustered
       -k var_name    : name of variable in the netCDF to be used as the
                        initial cluster centers. If skipped, the variable
                        name from the option "-v" is used
       -d             : enable debug mode
       -h             : print this help information
Input file format:
The executables read an input file that stores the data points to be clustered.
A few example files are provided in the sub-directory
./Image_data.
The input files can be in two formats: ASCII text and raw binary.
 - ASCII text format:
  -  Each line contains the ID and coordinates of a single data point
-  The number of coordinates must be equal for all data points
-  Example files: color100.txt, texture100.txt, and edge100.txt
-  Raw binary format:
  -  File header contains two 4-byte integers.
  
  -  The first 4-byte integer is the number of data points. 
-  The second integer is the number of coordinates for each data point. 
 
-  The rest of the file (file body) contains the coordinates of
       all data points and each coordinate is a 4-byte floating point of type float. 
-  There are three example files (in little Endian format) provided:
       color17695.bin, edge17695.bin, and texture17695.bin. They
       each contains 17695 data points.
-  NetCDF file format:
  -   Note: The first K elements of the input data are picked as the 
       initial K cluster center coordinates.
  
Output files:
There are two output files:
  - Coordinates of cluster centers
    -  The file name is the input file name appended with
         ".cluster_centres".
-  The default is in ASCII text format.
      -  Each line contains an integer indicating the cluster id and the
           coordinates of the cluster center.
-  Binary format can be enabled by using -r option.
      -  Header: the first 4-byte integer is the number of clusters and the
           second 4-byte integer is the number of coordinates of the 
           cluster center.
-  Body: the coordinates of all cluster centers. 
-  Membership of all data points to the clusters
    -  The file name is the input file name appended with
         ".membership".
-  The default is in ASCII text format.
      -  Each line contains two integers: data point index (from 0
           to the number of points) and the cluster id indicating the
           membership of the point.
-  Binary format can be enabled by using -r option.
      -  Header: a 4-byte integer for the number of data points.
-  Body: the list of cluster IDs (4-byte integers),
           corresponding to each data point. 
Performance results:
 
- 
Performance results on Hopper, the Cray XE6 at NERSC, 2011.
Hopper has 153,216 processors cores (6,384 compute nodes made up of 2 twelve-core AMD MagnyCours 2.1 GHz processors).
The input data contains 1 billion data points and each data point is a vector of 128 dimensions.
Limitations:
 -  Data type -- This implementation uses C float data type for all coordinates
      and other real numbers.
-  Large number of data points -- The number of data points cannot exceed 2G due
      to the 4-byte integers used in the programs.
- For using a different data type and large data support, please see
     this page
Derived Work:
Related Links:
 -  NU-MineBench -- a data mining benchmark suite
 
-  STAMP --
Stanford Transactional Applications for Multi-Processing
 
-  CUDA K-Means Clustering --
by Serban Giuroiu, a student at UC Berkeley.
 
-  NetCDF -- a set of software libraries and self-describing, machine-independent data formats that support the creation, access, and sharing of array-oriented scientific data.
 
-  Parallel netCDF -- an I/O library that supports data access to netCDF files in parallel.
 
Wei-keng Liao
Electrical Engineering and Computer Science Department
Northwestern University
Please send comments to 
Software available since Sep. 17, 2005.
Page last modified date: Dec. 5, 2013.