On this page:
Thoughts on Clustering
Before you go...

10 Generative Recursion

home work!

Purpose: To solve and visualize interesting data problems using big-bang as an engine for generative recursion.


Today we are going to write a program which solves and animates the k-means problem. Given a list of points in n-dimensional space, k-means clustering categorizes those points into k clusters, based on their distance from each other. As we want to animate this with big-bang, we will only be dealing with points in 2-dimensional space.

Definition: a centroid is the center of a cluster.

k-means clustering works as follows:
  1. Begin with a list of points (Datapoints) and a positive integer k

  2. Select k points from the list; these are now the centroids, and the clusters they represent are labeled from 0 to k-1

  3. Assign every point a label based on which centroid it is closest to

  4. Recompute the centroids as the mean of all of the points assigned to each label

  5. Re-assign every point a label based on which centroid it is closest to

  6. Stop when re-assignment doesn’t switch any labels, otherwise repeat steps 4 and 5

As step 6 involves recursion, but the size of the problem does not structurally shrink, we are dealing with generative recursion. While it is outside the scope of this course to prove it, the above algorithm is guaranteed to terminate.

Starter Code: as the list of points and k never change, they do not need to be a part of the world state; they will be passed to our main function as that is what the program starts with.

What does change are the labels points are assigned to as well as what the centroids are. As such, the data definition much keep track of these two things as well as a boolean that lets big-bang know when re-assignment has not occurred.

; A KMCA (K-Means Clustering Assignment) is a
; (make-assignment [List-of Centroid] [List-of Nat] Boolean)
(define-struct assignment (centroids labels no-reassignment?))
; - where centroids is the current list of centroids (ordered from 0...k-1)
; - labels are the labels assigned to each datapoint
;   labels are indices into the centroids list
;   (the first datapoint is labeled with the first element in labels,
;    second point the second label, etc.)
; - and no-reassignment? keeps track of whether or not re-assignment has occurred
; A Centroid is a (make-centroid Number Number)
(define-struct centroid [x y])
; - where x is the x-coordinate of the centroid
; - and y is the y-coordinate of the centroid
; A Datapoint is a (make-datapoint Number Number)
(define-struct datapoint [x y])
; - where x is the x-coordinate of the data point
; - and y is the y-coordinate of the data point

Starter Code: Below is the main function which will enable this program to run; leave it commented out for now.
; main : PosInt [List-of Datapoint] -> [List-of [List-of Datapoint]]
; run k-means clustering and output the datapoints binned into their respective clusters
(define (main k lop)
    [; next-assignment/main : KMCA -> KMCA
     ; Advance to the next assignment
     (define (next-assignment/main assignment)
       (next-assignment assignment k lop))
     ; draw/main : KMCA -> KMCA
     ; Draw the current clusters and points
     (define (draw/main assignment)
       (draw assignment k lop))]
      (big-bang (initial-assignment k lop)
                [on-tick next-assignment/main 1]
                [to-draw draw/main]
                [stop-when assignment-no-reassignment?]))
     k lop)))

Exercise 1 Design a function which given a Nat n and a list outputs the first n elements of that list. If n is greater than the length of the list your function should just return the entire list.

Exercise 2 Design a function which computes the distance between a Datapoint and a Centroid. Remember that the distance between two points is the square root of the sum of the squares of the difference between their x- and y- coordinates.

Exercise 3 Design a function which, given a list of Datapoints outputs their mean, which is a Centroid containing the mean of their x-coordinates and y-coordinates.

Exercise 4 Design a function which, given a list of list of Datapoints, outputs the means (as Centroids) of each inner list.

Switch pair programming roles before continuing!

Exercise 5 Design the functon cluster which clusters a given list based on labels. The signature, purpose statement, and tests have been provided for you.
; cluster : [List-of Nat] Nat [List-of X] -> [List-of [List-of X]]
; Given a list of labels and the number of clusters, cluster lox
; (length labels) = (length lox)
(check-expect (cluster (list 0 1 3 0 1 3)
                       (list "a" "b" "c" "d" "e" "f"))
              (list (list "a" "d")
                    (list "b" "e")
                    (list "c" "f")))
(define (cluster labels k lox)
Hint: build-list will come in handy here, and foldr can take two lists of equal length, as in the example below.
(define (silly some-string some-boolean number)
  (+ (string-length some-string) (if some-boolean 1 0) number))
(check-expect (foldr silly 20 (list "a" "aa") (list #t #f)) 24)

Exercise 6 Design the function assign-new-labels which given a list of Centroids, k, and the list of Datapoints, will output a list of labels dictating which Centroid each Datapoint is closest to. The signature, purpose statement, and tests have been provided for you.

; assign-new-labels : [List-of Centroid] Nat [List-of Datapoint]  -> [List-of Nat]
; Given the current centroids and the list of data points, output the new labels
(check-expect (assign-new-labels (list (make-centroid -5 -5) (make-centroid 5 5))
                                 (list (make-datapoint -20 -20)
                                       (make-datapoint -3 1)
                                       (make-datapoint 20 20)))
              (list 0 0 1))
(define (assign-new-labels centroids k lodp)

Hint: for each Datapoint dp in lodp, we want to output the index of the centroid in centroids where the distance between dp and that centroid is smallest. So we want to map over lodp, and find out which number from 0 through k-1 (build-list) gives the index (list-ref) of the centroid with the smallest distance (argmin) to dp.

Switch pair programming roles before continuing!

Exercise 7 Design the function initial-assignment which creates the initial KMCA, given k and the list of Datapoints. The first Centroids should be at the same positions as the first k elements of the list of data points and the first labels should be given by assign-new-labels after the Centroids have been locally defined (so they can be used twice). no-reassignment? should be #f.

Exercise 8 Design the function next-assignment which, given a KMCA, k, and the list of data points, outputs the next KMCA. The Datapoints should be clustered, and this clustering should be used to compute the Centroids (the means of each clustering). Then we will assign-new-labels with these Centroids. no-reassignment? will be determined by whether or not these new labels are equal? to the old ones.

Exercise 9 Temporarily redefine draw/main to just output empty-image and run the following code to ensure it clusters posns.
; random-datapoint : ? -> Datapoint
; Make a random datapoint
(define (random-datapoint _)
  (make-datapoint (random 500) (random 500)))
(define DATAPOINTS (build-list 50 random-datapoint))

Switch pair programming roles before continuing!


Now we will actually build out draw/main.

Starter Code: The following code will be useful in drawing the world.
(define BG (empty-scene 500 500))
(define COLORS (list "red" "purple" "blue" "green" "yellow" "orange" "pink" "grey"))
; draw-centroid : Centroid Color Image -> Image
; Draw a centroid on the given image
 (draw-centroid (make-centroid 0 0) "red" (circle 10 "solid" "black"))
 (place-image (text "X" 20 "red") 0 0 (circle 10 "solid" "black")))
(define (draw-centroid ctr color bg)
  (place-image (text "X" 20 color) (centroid-x ctr) (centroid-y ctr) bg))
; draw-point : Datapoint Color Image -> Image
; Draw a datapoint onto the given image
 (draw-point (make-datapoint 10 2) "red" (circle 10 "solid" "black"))
 (place-image (circle 3 "outline" "red") 10 2 (circle 10 "solid" "black")))
(define (draw-point dp color bg)
  (place-image (circle 3 "outline" color) (datapoint-x dp) (datapoint-y dp) bg))

Exercise 10 Design a function which, given a list of X, a list of natural numbers of the same length (representing the label of each X, which will correspond with a color from COLORS), a function which takes an X, Color, Image, and draws the X in that color on that image, and a background image, draws all of the X’s in their respective colors on the image. The signature, purpose statement, header, and test have been provided for you.

Hint: Remember, foldr can take more than one list.
; draw-shapes : [List-of X] [List-of Nat] [X Color Image -> Image] Image -> Image
; Draw shapes on bg at xs using colors from labels
(check-expect (draw-shapes (list (make-centroid 0 0) (make-centroid 1 1))
                           (list 0 3)
              (draw-centroid (make-centroid 0 0) (first COLORS)
                             (draw-centroid (make-centroid 1 1) (fourth COLORS)
(define (draw-shapes xs labels x->image bg)

Exercise 11 Design a function which, given a list of Datapoints, their labels, and an image, draws a point at each of the given locations onto the given image. The color that should be used for each datapoint is given by their label.

Switch pair programming roles before continuing!

Exercise 12 Design a function which, given a list of Centroids, k, and an image, draws those centroids on the image. Hint: build-list.

Exercise 13 Design draw, which given a KMCA, k, and a list of the Datapoints draws the current clustering assignment.

Exercise 14 Run your program with draw/main correctly defined to call draw.

Thoughts on Clustering

So given a fixed k and fixed initial centroids, this algorithm will always produce the same clustering. But is it the best clustering? Think about what properties you want a clustering to have, insofar as how similar elements are within the same cluster and how different distinct clusters are. How would you measure these traits? Once you decide on these measurements, given a data set, how would you go about picking the ’correct’ k? As it turns out, evaluating the quality of clusters and picking the correct k are complex, actively researched areas of computer science.

Before you go...

If you had trouble finishing any of the exercises in the lab or homework, or just feel like you’re struggling with any of the class material, please feel free to come to office hours and talk to a TA or tutor for additional assistance.