On this page:
Instructions
Problem 1: The Registrar’s Office
6.1 Data constraints
6.2 Methods and examples to design
Problem 2: Who’s On Deque?
6.1 Data definitions and examples
6.2 Methods
Problem 3: Huffman Coding
6.1 The algorithm
6.2 Data design and methods
7.6

Assignment 6: Mutable data

Goals: Constructing cyclic data; using mutation to design data structures; using mutation to solve algorithmic problems.

Instructions

As always, be very careful with your naming conventions.

The submissions will be organized as follows:

Due Dates:
  • Problem 1 – Monday Feb 24, 9:00pm

  • Problem 2 – Friday Feb 28, 9:00pm

  • Problem 3 – Friday Feb 28, 9:00pm

Problem 1: The Registrar’s Office

The registrar’s office maintains a great deal of information about classes, instructors, and students. Your task in this problem is to model that data, and implement a few methods on it. We deliberately do not give you the class diagram for this problem: from the description below, you should properly design whatever classes you think are relevant. Please use generic lists (i.e. IList<T>) for this problem.

6.1 Data constraints
6.2 Methods and examples to design

Hint: you will at minimum need to design some kind of sameness-checking method for Students. We are not giving you any guarantees about how the classes in your program might be used or instantiated, so you must figure out what information should be used to define sameness, and justify your reasoning.

Problem 2: Who’s On Deque?

We would like to build a generic list in such a way that we can start either at the front, or at the back, and move through the list in either direction. Here is an example of such a list (shown here with Strings), drawn as an object diagram:

              +---------------+
              | Deque<String> |
              +---------------+
              | header ----------+
              +---------------+  |
                  +--------------+
                  |
                  v
            +------------------+
+---------->| Sentinel<String> |<-------------------------------------------+
|           +------------------+                                            |
|       +---- next             |                                            |
|       |   | prev ------------------------------------------------+        |
|       |   +------------------+                                   |        |
|       |                                                          |        |
|       v                                                          v        |
| +--------------+   +--------------+   +--------------+   +--------------+ |
| | Node<String> |   | Node<String> |   | Node<String> |   | Node<String> | |
| +--------------+   +--------------+   +--------------+   +--------------+ |
| | "abc"        |   | "bcd"        |   | "cde"        |   | "def"        | |
| | next ----------->| next ----------->| next ----------->| next ----------+
+-- prev         |<--- prev         |<--- prev         |<--- prev         |
  +--------------+   +--------------+   +--------------+   +--------------+

Every Deque has a header that consists of the Sentinel node. This header field does not change, but the fields within the Sentinel node (that provide the links to the first and the last item in the Deque) can and will change.

Each data Node has two links, one to the next item in the list and one to the previous item in the list.

The Sentinel node is always present. It also has next and prev fields, which are consistently updated to link to the head of the list and to the tail of the list, respectively.

The Sentinel and Node classes both inherit from a common superclass (shown below), because the next or prev item after any given node may be either a data-carrying Node or the Sentinel marking the end of the list.

The list shown above has four data-carrying nodes and the lone sentinel. The empty list has just the Sentinel, and its links to the first and to the last item just reference the Sentinel node itself. So an empty list would have the following object diagram:

     +---------------+
     | Deque<String> |
     +---------------+
     | header ---------+
     +---------------+ |
                       |
              +--------+
+----+        |     +----+
|    |        |     |    |
|    V        V     V    |
|  +------------------+  |
|  | Sentinel<String> |  |
|  +------------------+  |
+--- next             |  |
   | prev ---------------+
   +------------------+

The class diagram for these classes is as follows:
        +--------------------+
        | Deque<T>           |
        +--------------------+
        | Sentinel<T> header |-+
        +--------------------+ |
                               |
    +--------------------------+
    |
    |
    |      +---------------+
    |      | ANode<T>      |
    |      +---------------+
    |      | ANode<T> next |
    |      | ANode<T> prev |
    |      +---------------+
    |         /_\     /_\
    |          |       |
    |     +----+       |
    V     |            |
+-------------+      +---------+
| Sentinel<T> |      | Node<T> |
+-------------+      +---------+
+-------------+      | T data  |
                     +---------+

We have an abstract class ANode<T>, which can be either a Sentinel<T> node or an actual Node<T> containing data. We use an abstract class here instead of an interface because we need to share the next and prev fields.

With one caveat, below.

Moreover, because all ANodes are contained and hidden within the Deque, no external code needs to know about it: they are implementation details rather than a publicly visibly datatype definition.

6.1 Data definitions and examples
6.2 Methods

Problem 3: Huffman Coding

A character encoding is a way of representing an alphabet (set of characters) with some kind of code. In computer science, that code is often made of single bits, 0s and 1s, since that is how all data is ultimately stored.

Unicode’s industry-standard encoding is what lets us have emojis. :-)

For example, if we had the following encoding:

Letter

 

Code

a

 

0

b

 

1,0

c

 

1,1

Then "abc" would be encoded as the five bits 01011, and 1110 would be decoded as "cb".

After all, there are far fewer short codes than long codes. Why?

For this assignment, you are going to implement Huffman coding, which is a strategy for encoding an alphabet when you know how relatively popular the letters are. For example, a, e, i, o, and u are heavily used in English, whereas j, q, x, and z are not. It makes sense to try to use shorter codes to represent the popular letters, and let the less-common letters be stuck with longer codes.

While there are various properties that make Huffman codings interesting, the one we are most interested in is that it is a prefix code. This means that no encoding of one character is a prefix of any other character’s encoding. The example given above is a prefix code. The following, however, is not:

Letter

 

Code

e

 

0

f

 

1

g

 

01

What would 01 decode to, ef or g? Prefix codes help eliminate these types of ambiguities.

6.1 The algorithm

The best way to explain the Huffman coding algorithm is to illustrate it.

For example, say we begin with the following letters:

"a", "b", "c", "d", "e", "f"

Where each letter has the following relative frequencies:

12, 45, 5, 13, 9, 16

To begin, we create 6 leaves, where each leaf stores the letter and its relative frequency.
image
We then sort them in ascending order by frequency.
image
Then, we combine the two smallest leaves into a node, whose value is now the sum of its subtrees. The first leaf we place under the red branch, and the second leaf under the blue. We place this node inside of our forest (list of trees) such that it is still sorted.
image
and repeat...
image
and repeat...
image
and repeat...
image
...until there is just one tree.
image

This process is guaranteed to produce a prefix code. Why? Could we choose red as 1, and blue as 0? Could we choose differently at every level of the tree?

To encode a character, follow the path from the root to the leaf with that letter, where a red line represents a 0, and a blue a 1. So, encoding "eba" with the tree above would produce 11010100. Decoding a sequence of 0s and 1s is (essentially) the inverse operation, so decoding 101111 would yield "df".

Composing the encoding and decoding operations in either order should produce what is effectively the identity function (on either strings or bit-sequences), but they aren’t quite inverse functions. Why not?

6.2 Data design and methods

(Hint: for a change, we’re not treating Strings as primitive data: you’ll need to search around the methods on Strings to find one that helps get length-1 substrings out from the string. You should not need to work with Java chars for this assignment.)