# CS4120/6120: Natural Language Processing

## Homework 3

Assigned: Thursday, 30 March 2017
Due: 11:59pm, Friday, 21 April 2017

### Instructions

1. If you collaborated with others, you must write down with whom you worked on the assignment. If this changes from problem to problem, then you should write down this information separately with each problem.
2. Email the TA the requested written answers, code, and instructions on how to (compile and) run the code.

### Formal Semantics

For background on formal semantics see the lecture slides, Jurafsky & Martin chapters 18-20, or also the NLTK book, chapter 10.

1. [12 points] Translate the following sentences into quantified formulas of first-order logic:

1. Alice likes someone and someone likes Bob.
2. Nobody smiles at Jan.
3. Nobody coughed or sneezed.

2. [12 points] Translate the following verb phrases using $\lambda$-abstracts and quantified formulas of first-order logic:

1. feed Chester and give a compliment to Alice
2. be loved by everyone
3. be loved by everyone and detested by no-one

3. [6 points] Let $g = chase$ and $h = \lambda x.\forall y.(dog(y) \Rightarrow chase(x, y))$. If $h = f(g)$, write down a $\lambda$-abstract for $f$.

4. [7 points] Let $g = give$ and $h = \lambda z.\lambda x.\exists y.(present(y) \wedge give(x, y, z))$. If $h = f(g)$, write down a $\lambda$-abstract for $f$.

### Word Clustering

In class, we briefly discussed “Brown clustering” for the words in a corpus, named after the original formulation by Peter Brown et al. in Class-based n-gram models of natural language.

The idea is to define a deterministic mapping $C$ from words to classes and then to define a first-order hidden Markov language model in terms of those classes. For a corpus of $N$ words: \begin{eqnarray*} p(w_1, w_2, \ldots, w_N) & = & \prod_{i=1}^N p(w_i \mid C(w_i)) p(C(w_i) \mid C(w_{i-1})) \end{eqnarray*}

Since the mapping $C$ is deterministic, inference in this model is simple if the mapping is known. If we partition a vocabulary $V$ into $K$ clusters, we obviously don't want to evaluate each of the $|V|^K$ partitions. Instead, both the original paper by Brown et al. and later implementations of the idea have used bottom-up agglomerative clustering, starting with every distinct word in its own cluster and merging until a certain number of clusters $K$ is reached. This greedy algorithm, as a useful by-product, induces a tree, not just a partition, over the vocabulary, so that we can take advantage of word clusters with several levels of granularity.

Even this greedy algorithm can take too long, however. To start with each distinct word in its own cluster and then check all pairwise merges would take $O(|V|^3)$ time. Most implementations, therefore, adopt the following strategy: To produce $K$ clusters, put the $K$ most common words in their own clusters, then successively add each of the $K+1, K+2, \ldots$st most common words as its own cluster and, at each step, merge the two clusters in such a way as to most increase (or cause the least decrease to) the (log) likelihood of the data averaged over all words. In case of ties in the merge criterion, choose to produce the largest cluster with the best value. Finally, we merge the remaining $K$ clusters to finish the connected tree. In all, this approach takes $O(K^2|V|)$ time.

How do we compute the average log likelihood of the data? As for standard Markov language models, we can see that it only depends on word counts. For a given cluster mapping $C$, we will take advantage of the stationarity of the Markov process and define the average log likelihood in terms of successive word pairs $w, w'$, without reference to particular indices $i$. This also allows us to talk about cluster co-occurrences $C(w), C(w')$, or $c, c'$ for short. We thus have \begin{eqnarray*} L & = & \frac{1}{n} \sum_{i=1}^N \log p(C(w_i) \mid C(w_{i-1})) p(w_i \mid C(w_i)) \\ & = & \sum_{w,w'} \frac{\#(w,w')}{N} \log p(C(w') \mid C(w)) p(w' \mid C(w')) \\ & = & \sum_{w,w'} \frac{\#(w,w')}{N} \log \frac{\#(C(w), C(w'))}{\#(C(w))} \frac{\#(w')}{\#(C(w'))} & = & \sum_{w,w'} \frac{\#(w,w')}{N} \log \frac{\#(C(w), C(w')) N }{\#(C(w))\#(C(w'))} + \sum_{w,w'} \frac{\#(w,w')}{N} \log \frac{\#(w')}{N} \\ & = & \sum_{c,c'} \frac{\#(c,c')}{N} \log \frac{\#(c,c') N}{\#(c) \#(c')} + \sum_{w'} \frac{\#(w')}{N} \log \frac{\#(w')}{N} \\ & = & \sum_{c,c'} p(c,c') \log \frac{p(c,c')}{p(c) p(c')} + \sum_w p(w) \log p(w) \\ & = & I(C) - H(W) \end{eqnarray*} where $I(C)$ is the mutual information between successive clusters and $H(W)$ is the entropy of the unigram word distribution. The latter is a constant with respect to the clustering, so we simply have to pick the cluster merge that results in the highest mutual information.

1. [10 points] Download the NLTK version of the Brown corpus (no relation). The corpus is already tokenized with whitespace. Remove the part-of-speech tags appended to each token with a slash. Lower case all words. Replace all tokens with a count of 10 or less with the conventional out-of-vocabulary symbol UNK. Sort the vocabulary first by decreasing frequency and then alphabetically, to break ties. Submit this ranked vocabulary list, with counts. This will be useful for debugging.

2. [50 points] Write a program to implement Brown clustering on this dataset. Let the initial number of clusters $K = 40$. You should therefore give each of the 40 most frequent tokens its own cluster and proceed as described above. For purposes of the bigram model, treat each sentence on its own line as independent of the others, and assume that each sentence begins and ends with invisible START and END symbols. These symbols should not be included in the clustering, however. Turn in your code that takes the vocabulary you just produced and the corpus as input. The output will consist of two parts:

1. the vocabulary with a string of 0's and 1's for each, indicating its path from the root of the cluster merge tree to the leaves; and
2. the initial forty clusters after merging all of the vocabulary, i.e., 40 clusters with their words.
Since the clusters start out sorted by decreasing frequency, we stipulate that when merging two clusters, the earlier cluster on the list gets the code 0 and the later cluster gets the code 1.