As discussed earlier, clustering requires a descriptor for each image in the collection. We therefore first generate a descriptor for each image and then store all descriptors in a single array that serves as the clustering input. The input to this phase is the image collection produced by the pre-cluster phase. Our focus is on finding representative images that are highly dissimilar from one another, and for this we use local image features. The local approach represents each image by a set of local feature descriptors computed at interest points inside the image [3]. We use the SIFT algorithm to detect these points and compute the descriptors of each image.
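The text does not state how the variable number of SIFT descriptors per image is turned into one entry of the array. As an illustration only, the sketch below assumes mean pooling: the descriptors of an image are averaged into a single fixed-length vector. The pooling choice, the function names `mean_pool` and `build_descriptor_array`, the file names and the 2-D toy descriptors (standard SIFT descriptors are 128-dimensional) are all assumptions and not part of the described method.

```python
from typing import Dict, List, Sequence, Tuple


def mean_pool(descriptors: Sequence[Sequence[float]]) -> List[float]:
    """Average a set of local descriptors into one vector (assumed pooling step)."""
    if not descriptors:
        raise ValueError("image has no descriptors")
    dim = len(descriptors[0])
    return [sum(d[j] for d in descriptors) / len(descriptors) for j in range(dim)]


def build_descriptor_array(
    per_image: Dict[str, Sequence[Sequence[float]]]
) -> Tuple[List[str], List[List[float]]]:
    """Return image ids and one pooled row per image, in the same order."""
    image_ids = sorted(per_image)
    rows = [mean_pool(per_image[image_id]) for image_id in image_ids]
    return image_ids, rows


# Toy input: hypothetical file names with 2-D stand-ins for SIFT descriptors.
per_image = {
    "a.jpg": [[0.0, 2.0], [2.0, 4.0]],
    "b.jpg": [[10.0, 10.0]],
}
image_ids, descriptor_images = build_descriptor_array(per_image)
```

Keeping `image_ids` alongside the rows makes it possible to map a clustered row back to its image later.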

Next, we apply the k-means algorithm to the array of image descriptors. In statistics and data mining, k-means clustering is a method of cluster analysis that aims to partition n observations into k clusters such that each observation belongs to the cluster with the nearest mean. Because the dataset is large (between 1000 and 5000 images for N=5 windows) and our goal is a precise, diverse and informative overview of the image collection with minimal redundancy, we apply k-means twice. The first pass, run with a chosen value of k, reduces the collection to a small subset of images; the second pass is run on that subset and reduces it further to a smaller, more diverse set. Each run of k-means returns the cluster assignments, the centroids, the per-cluster sums of distances, and the distances from each observation to every centroid.
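To make the four outputs concrete, the following is a minimal sketch of Lloyd's k-means iteration. It is not the implementation used in this work: the function name `kmeans`, the deterministic initialisation from the first k points, the use of plain Euclidean distance and the toy data are all assumptions made for illustration.

```python
import math
from typing import List, Sequence, Tuple


def kmeans(
    points: Sequence[Sequence[float]], k: int, max_iter: int = 100
) -> Tuple[List[int], List[List[float]], List[float], List[List[float]]]:
    """Return (labels, centroids, sumd, dist) for a simple Lloyd iteration."""
    if not 1 <= k <= len(points):
        raise ValueError("k must be between 1 and the number of points")
    # Assumption: initialise the centroids with the first k points.
    centroids = [list(p) for p in points[:k]]
    labels = [-1] * len(points)
    for _ in range(max_iter):
        # Assignment step: each point joins the cluster with the nearest mean.
        dist = [[math.dist(p, c) for c in centroids] for p in points]
        new_labels = [row.index(min(row)) for row in dist]
        if new_labels == labels:
            break  # assignments are stable, so the centroids are too
        labels = new_labels
        # Update step: move each centroid to the mean of its members.
        for j in range(k):
            members = [p for p, lab in zip(points, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    # Distances from every point to every final centroid.
    dist = [[math.dist(p, c) for c in centroids] for p in points]
    # Per-cluster sum of member-to-centroid distances.
    sumd = [
        sum(dist[i][j] for i, lab in enumerate(labels) if lab == j)
        for j in range(k)
    ]
    return labels, centroids, sumd, dist


points = [[0.0, 0.0], [0.0, 1.0], [10.0, 10.0], [10.0, 11.0]]
labels, centroids, sumd, dist = kmeans(points, k=2)
```

On this toy input the two lower-left points and the two upper-right points end up in separate clusters. The row `dist[i]` holds the distances of observation `i` to all centroids, which is the quantity needed when selecting the image nearest to each centroid.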

We then retrieve the centroid image of every cluster, that is, the image that represents the cluster. The idea is to find, for each cluster, the image with the smallest distance to the cluster centroid. The distances are taken from the k-means output; for each cluster we sort its images by distance to the centroid and select the nearest one. The selected images form the representative set of the image collection.
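The selection of the nearest image per cluster can be sketched as follows. The function name `pick_representatives`, the file names and the hard-coded assignments and distance matrix are hypothetical; the sketch only assumes that k-means supplies a cluster label per image and a matrix of image-to-centroid distances.

```python
from typing import Dict, List, Sequence, Tuple


def pick_representatives(
    image_ids: Sequence[str],
    labels: Sequence[int],
    dist: Sequence[Sequence[float]],
) -> Dict[int, str]:
    """Map each cluster to the image closest to that cluster's centroid."""
    best: Dict[int, Tuple[float, str]] = {}  # cluster -> (distance, image id)
    for image_id, cluster, row in zip(image_ids, labels, dist):
        d = row[cluster]  # distance of this image to its own centroid
        if cluster not in best or d < best[cluster][0]:
            best[cluster] = (d, image_id)
    return {cluster: image_id for cluster, (_, image_id) in sorted(best.items())}


# Hypothetical k-means output for five images and two clusters.
image_ids = ["a.jpg", "b.jpg", "c.jpg", "d.jpg", "e.jpg"]
labels = [0, 0, 1, 1, 1]
dist = [
    [0.4, 9.0],
    [0.1, 8.5],
    [7.0, 1.2],
    [7.5, 0.3],
    [6.8, 0.9],
]
representatives = pick_representatives(image_ids, labels, dist)
```

A single linear scan that keeps the running minimum gives the same result as sorting each cluster by distance and taking the first element, without the cost of the sort.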

Once the first k-means pass has produced this subset, we apply k-means a second time to the representative set, which yields a smaller and more precise representative set of the large image dataset. This phase therefore outputs the representative set, which also serves as input to the next phase, the ranking mechanism. The procedure for generating the representative set is given in Algorithm 3.
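The two-pass reduction can be illustrated with the sketch below, in which the second pass clusters only the representatives found by the first. The helper names `cluster_representatives` and `two_pass`, the values of k for each pass, the first-k initialisation, the Euclidean distance and the toy vectors are assumptions chosen for the example, not values taken from the experiments.

```python
import math
from typing import List, Optional, Sequence


def cluster_representatives(
    vectors: Sequence[Sequence[float]], k: int, max_iter: int = 100
) -> List[int]:
    """Run a simple k-means and return, per non-empty cluster, the index of
    the vector nearest to the centroid (indices sorted ascending)."""
    if not 1 <= k <= len(vectors):
        raise ValueError("k must be between 1 and the number of vectors")
    centroids = [list(v) for v in vectors[:k]]  # assumed first-k initialisation
    labels: Optional[List[int]] = None
    for _ in range(max_iter):
        new_labels = [
            min(range(k), key=lambda j: math.dist(v, centroids[j])) for v in vectors
        ]
        if new_labels == labels:
            break
        labels = new_labels
        for j in range(k):
            members = [v for v, lab in zip(vectors, labels) if lab == j]
            if members:
                centroids[j] = [sum(col) / len(members) for col in zip(*members)]
    reps = []
    for j in range(k):
        idxs = [i for i, lab in enumerate(labels) if lab == j]
        if idxs:
            reps.append(min(idxs, key=lambda i: math.dist(vectors[i], centroids[j])))
    return sorted(reps)


def two_pass(vectors: Sequence[Sequence[float]], k1: int, k2: int) -> List[int]:
    """Reduce with k1 clusters, then reduce the representatives with k2 clusters."""
    first = cluster_representatives(vectors, k1)
    subset = [vectors[i] for i in first]
    second = cluster_representatives(subset, k2)
    return [first[i] for i in second]  # map back to indices in the full set


# Toy data: four tight groups at x = 0, 50, 60 and 75.
vectors = [
    [0.0, 0.0], [50.0, 0.0], [60.0, 0.0], [75.0, 0.0],
    [0.0, 1.0], [50.0, 1.0], [60.0, 1.0], [75.0, 1.0],
    [0.0, 3.0], [50.0, 3.0], [60.0, 3.0], [75.0, 3.0],
]
final = two_pass(vectors, k1=4, k2=2)
```

In this toy case the first pass keeps one vector per group, and the second pass merges the three nearby groups into one cluster, so the final set contains one vector from the isolated group and one from the merged cluster. Returning indices into the original collection keeps the link between each representative and its image.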

Algorithm 3: Clustering and generating representative set

1: input: result image set of the pre-cluster phase
2: output: the representative set
3: for each image img do
       // get the descriptors (key points) by calling the sift function
       [image, descriptor] = sift(img)
       // save the image descriptor in an array
       descriptor_images[img] = descriptor
4: end for
5: set the number of clusters k and apply k-means to the descriptor_images array:
       [Id, C, D] = kmeans(descriptor_images, k)
   where Id is the image identification number, C is the assigned cluster number, and D holds the distances from the image to its assigned cluster and to the other clusters.
6: find the centroid image of each cluster:
   for each pair of images i and j of cluster C do
       if distance_image_i < distance_image_j then
           // store the least-distance image
           centroid = image_i
       end if
       save centroid in the output directory of the representative set
7: end for