CS6120: Natural Language Processing

Homework 2

Return to basic course information.

Assigned: Friday, 22 February 2013
Due: 11:59pm, Wednesday, 13 March 2013


Instructions

  1. If you collaborated with others, you must write down with whom you worked on the assignment. If this changes from problem to problem, then you should write down this information separately with each problem.
  2. Email the instructor the requested written answers, code, and instructions on how to (compile and) run the code.

Problems

  1. Generating sentences from a grammar [10 points]

    First, download the initial grammar: grammar1. The file contains self-explanatory comments, which are delimited by #. The basic format is:

    	1 VP	Verb NP
          
    Ignore the leading "1" for now. The first symbol is the left-hand side of a context-free rewrite rule; the remaining one or more symbols are right-hand-side terminals and non-terminals. There is no typographic distinction enforced between terminals and non-terminals; rather, any symbol that does not appear on a left-hand side is by definition a terminal.
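
One way to read this format is sketched below (the function names are illustrative, not part of the assignment; it assumes exactly the conventions just described: `#` starts a comment, the first field is a weight, the second is the left-hand side, and the rest form the right-hand side):

```python
from collections import defaultdict

def parse_rules(lines):
    """Return {lhs: [(weight, rhs_tuple), ...]}. '#' starts a comment."""
    rules = defaultdict(list)
    for line in lines:
        line = line.split('#', 1)[0].strip()   # drop comments; skip blank lines
        if not line:
            continue
        fields = line.split()
        weight, lhs, rhs = float(fields[0]), fields[1], tuple(fields[2:])
        rules[lhs].append((weight, rhs))
    return rules

def load_grammar(path):
    with open(path) as f:
        return parse_rules(f)
```

Under this representation, a symbol is a terminal exactly when it never appears as a key in the resulting dict.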

    Now let's get down to work:

    1. Write a random sentence generator in the language of your choice. When handing it in, make sure we can run it like so:
      	      ./generate grammar1 5
      	    
      In other words, print five random sentences from the grammar grammar1 to standard output. You should see output like:
      	      the president ate every sandwich !
      	    

      Your program should start with the ROOT symbol and recursively choose rewrite rules until termination. In grammar1, there are three possible expansions of ROOT; you should have an equal chance of choosing each of them. Don't hardwire this grammar into your program. Read it anew each time it runs, so that you can modify the grammar later.

      Save and hand in 10 random sentences generated by this first version of the program.
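
The recursive expansion just described can be sketched as follows, assuming the grammar has been read into a dict mapping each left-hand side to a list of right-hand-side tuples (uniform choice for now; unequal weights come in the next step):

```python
import random

def generate(rules, symbol='ROOT'):
    """Rewrite `symbol` recursively; a symbol with no rule is a terminal."""
    if symbol not in rules:
        return [symbol]                    # terminal: emit the word itself
    rhs = random.choice(rules[symbol])     # equal chance for each expansion
    words = []
    for sym in rhs:
        words.extend(generate(rules, sym))
    return words
```

Printing `' '.join(generate(rules))` in a loop would then produce the requested sentences, one per line.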

    2. Modify your program to allow for rule expansions with unequal probabilities. To try this out, copy grammar1 to grammar2 and modify it. Instead of just "1", allow any non-negative real number at the beginning of a rule. For instance, you can assert that "the" is a more common determiner than "a" or "every" like so:
      	      4 Det the
      	      1.5 Det a
      	      0.5 Det every
      	    
      In particular, your program should normalize the weights by their sum: here the weights 4, 1.5, and 0.5 sum to 6, so the three rules would be chosen with probabilities (2/3, 1/4, 1/12).

      Play around with the weights in grammar2 and see how your generated sentences change. Can you make the average sentence length much longer or shorter?

      Hand in the new probabilized generation program, your modified grammar2, and ten random sentences.
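
Switching from uniform to weighted choice is a small change. One sketch uses `random.choices`, which samples in proportion to the given weights; storing each nonterminal's rules as (weight, rhs) pairs is an assumption of this sketch, not a requirement:

```python
import random

def choose_rhs(weighted_rules):
    """Pick one right-hand side in proportion to its weight.
    weighted_rules: list of (weight, rhs) pairs for one nonterminal."""
    weights = [w for w, _ in weighted_rules]
    candidates = [rhs for _, rhs in weighted_rules]
    # random.choices normalizes the weights by their sum internally
    return random.choices(candidates, weights=weights, k=1)[0]
```

With the Det weights above, "the" should come back about two-thirds of the time over many draws.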

  2. Describing English with context-free rules [15 points]

    1. Copy grammar2 to grammar3. Modify the latter so that, in addition to the strings of grammar2, it can also generate phenomena like:
      	  Alex kissed a sandwich .
      	  that Alex kissed a sandwich perplexed the president .
      	  the president thought that Alex kissed a sandwich .
      	  the president smiled .
      	  Alex ate a sandwich and wanted a pickle .
      	  the president and the chief of staff understood that Alex smiled .
      	
      In addition to new terminals, like Alex, you'll need to add new non-terminals. Of course, you don't want to generate just the six sentences above, but others with the same constructions.

      Hand in your expanded grammar.
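
As a purely illustrative starting point (the symbol names are one possible choice, not prescribed), rules of roughly this shape can license sentential complements, clausal subjects, and coordination:

```
1 S	NP VP		# Alex kissed a sandwich
1 NP	Alex		# proper nouns take no determiner
1 NP	NP and NP	# the president and the chief of staff
1 VP	Verb		# intransitive: smiled
1 VP	VP and VP	# ate a sandwich and wanted a pickle
1 VP	VerbS SBAR	# thought that Alex kissed a sandwich
1 SBAR	that S		# a clause used as a complement ...
1 S	SBAR VP		# ... or as a subject: that Alex ... perplexed ...
```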

    2. Create a final grammar, grammar4. Modify it to model some new constructions of English that you find interesting. Explain in comments to your grammar file what you are doing.
  3. Context-free parsing [25 points]

    Using the programming language of your choice, write a recognizer using Earley's algorithm as discussed in class. The recognizer should read grammars in the format you've been writing above, and should read input sentences in the same space-separated format your generator used for output. In other words, you should be able to parse the sentences generated by your own grammar. To get sufficient speed, you may want to implement the left-corner indexing trick discussed in class. (Other optimizations may not repay their implementation time within the scope of this assignment.)

    For each input sentence, output true if it is in the language defined by the grammar, and false otherwise.

    Since this is an unweighted parser, you don't have to do the bookkeeping of weights or decide among competing alternatives when attaching completed rules.
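
A minimal version of such a recognizer can be sketched as follows. This sketch reuses the homework's rule format, does no left-corner indexing, and assumes the grammar has no empty right-hand sides (which holds for the grammars above):

```python
from collections import defaultdict

def read_rules(lines):
    """Homework rule format: weight, lhs, rhs (the weight is ignored here)."""
    rules = defaultdict(list)
    for line in lines:
        line = line.split('#', 1)[0].strip()
        if line:
            fields = line.split()
            rules[fields[1]].append(tuple(fields[2:]))
    return rules

def recognize(rules, words, start='ROOT'):
    """Earley recognizer. A state is (lhs, rhs, dot, origin).
    Assumes no empty right-hand sides."""
    n = len(words)
    chart = [set() for _ in range(n + 1)]
    chart[0] = {(start, rhs, 0, 0) for rhs in rules[start]}
    for i in range(n + 1):
        agenda = list(chart[i])
        while agenda:
            lhs, rhs, dot, origin = agenda.pop()
            if dot < len(rhs):
                nxt = rhs[dot]
                if nxt in rules:                    # PREDICT a nonterminal
                    for r in rules[nxt]:
                        new = (nxt, r, 0, i)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
                elif i < n and words[i] == nxt:     # SCAN a terminal
                    chart[i + 1].add((lhs, rhs, dot + 1, origin))
            else:                                   # COMPLETE a constituent
                for l2, r2, d2, o2 in list(chart[origin]):
                    if d2 < len(r2) and r2[d2] == lhs:
                        new = (l2, r2, d2 + 1, o2)
                        if new not in chart[i]:
                            chart[i].add(new)
                            agenda.append(new)
    return any(lhs == start and origin == 0 and dot == len(rhs)
               for (lhs, rhs, dot, origin) in chart[n])
```

A driver that prints `str(recognize(rules, line.split())).lower()` for each input line would then emit the required true/false output.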

  4. Extra credit: Competitive grammar writing

    This exercise is based on Jason Eisner and Noah Smith's article of that name. Using the grammar you've handed in as grammar4, we will generate 10 new sentences and judge whether they are in fact grammatical English. (You can't win this competition by generating Chinese or Finnish or Navajo.) Your grammatical sentences will then be parsed with everyone else's grammars, using their Earley recognizers. You get five points for each grammatical sentence you generate that no one else's grammar can parse.