On this page:
1 Purpose
2 Outline
3 Threat to programmers?
3.1 Why was COBOL so successful?
4 A Difference?
5 How do these work?
6 And programming?
7 Just autocomplete?
8 Code

Lecture 38: LLMs & Specification

1 Purpose

Discuss the impending apocalypse (or is it?) for programming.

2 Outline

Doom and gloom: it seems like everyone thinks that LLMs are going to take over the world. There are even people who believe that they are nearly sentient. https://www.scientificamerican.com/article/google-engineer-claims-ai-chatbot-is-sentient-why-that-matters/

And this isn’t limited to people completely ignorant of the technology. Perhaps the sentience claims are, but there are computer science professors who believe this is going to fundamentally change programming. Is it?

https://dei.fe.up.pt/en/blog/2023/03/16/creativitytalks-the-end-of-programming-as-we-know-it-by-prof-cristina-videira-lopes/

http://tagide.com/education/the-end-of-programming-as-we-know-it/

And even if not sentient, it’s worth understanding the potential impact of the responses (e.g., Microsoft’s chat agent trying to convince a journalist to leave his wife https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html) when these tools are potentially being incorporated more broadly (e.g., people suggesting using them for therapy https://twitter.com/RobertRMorris/status/1611450197707464706: having a chatbot that is abusive to a journalist researching the topic is one thing; having it be abusive to a vulnerable person who is seeking help crosses into the realm of actively causing harm).

So how do we understand these tools? What do they do, and not do?

Let’s try it out, just so we are on somewhat the same page. Given the context of this class, we are going to primarily be talking about how LLMs can be used for programming, so let’s try doing HW12.

https://chat.openai.com

3 Threat to programmers?

So is this a threat to programmers? Are you going to have a job when you graduate?

Let’s take a longer look at the history both of "threats to programmers" from increased productivity, and of the idea of statistical generation (the technology behind LLMs).

To some extent, the entire history of programming, dating back to the 1950s, has been an effort to make it easier to write programs. e.g., COBOL, created in the late 1950s, was explicitly designed to make it easier to write business programs, and it was successful enough that there are still hundreds of billions of lines of it running today.

3.1 Why was COBOL so successful?

https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/blog-entry1/2016/02/26/cobol-and-the-enterprise-programming-paradigm

https://www.infoworld.com/article/3539057/what-is-cobol-cobol-programming-explained.html

https://www.cnn.com/2020/04/08/business/coronavirus-cobol-programmers-new-jersey-trnd/index.html

It was unquestionably easier to write than the proprietary, hardware-specific assembly languages that predated it, especially because in the 1950s there were many more hardware platforms than there are today. While you could say that COBOL was a "job killer" for people who knew some of those esoteric hardware languages (and had the skills to learn new ones easily), in actuality, by raising the level of abstraction, it simply made it possible for more people to be programmers, and expanded the domains in which programming was used. The same could be said for any number of advances, with ever higher levels of abstraction; the types of programs that you can easily write today were likely inconceivable decades ago, or would have taken teams of programmers much longer to write.

For a long time, the idea was that "automatic programming" was coming: this goes back to the very earliest days of programming in the 1940s, when automatic programming meant writing down code in essentially assembly, and having the program generate the actual punch cards that were used. The same ideas recurred with languages like Fortran, in the 1950s, where the idea was that you could write mathematics and have the program generate the actual code. The same ideas were present in COBOL, and then in still higher-level ideas going through at least the 1980s. In some sense, it was also present when Java was created in the 1990s, where the promise was increased productivity through platform independence: since the JVM gave a common platform that code could run on, you wouldn’t have to make multiple versions of the same program. But in all these cases, the idea was just that you could write higher-level versions of programs (or descriptions of computations, or specifications): it never removed the need for programming, or programmers.

So the idea of making programming easier is not new: to some extent, it’s something we have been working on as long as we’ve been programming.

4 A Difference?

There is one human difference to the current iteration of this: there are a lot more programmers now than there were in the 1980s, or 1960s, or 1940s. In previous generations, the idea of automatic programming, or improvements to productivity, or even just making programming easier, was primarily sold as a way of allowing non-programmers to become programmers.

The goal was to bring programming to new domains, and to rely on making it easier for people to learn the skill to make that possible. It’s questionable whether any of these technologies actually made programming easier, but they certainly made it possible for more sophisticated programs to be written, and more people certainly did become programmers, though likely for economic rather than technical reasons.

Now, it is being talked of as a potential threat to already employed programmers. It doesn’t help that this is coming at the same time that tech companies, now some of the largest corporations in the world, are looking at the larger context of more empowered workers in the economy as a whole and seeing this as a potential way to gain power over tech workers. In the same way that automation has been used in other industries to wrest control from workers, the hope is that automation can similarly help tech companies gain more control over their workers.

There is also a technical difference to the current iteration: the input to the LLM tools are non-algorithmic specifications. That’s a fancy way of saying that they do not have precise semantics: they are, after all, just a bunch of text. This has interesting consequences: both good, in that it opens up possibilities, but also bad, as it brings up questions of how accurate the output can ever be.

5 How do these work?

How can we understand how these systems work, and how it relates (or doesn’t relate) to programming?

At their core, all of the LLMs are statistical completion models: i.e., given a sequence of tokens (words, etc.), what is a plausible next token? How do they do this? Well, you can imagine how you might develop a statistical completion model on letters: if you have a bunch of English text, you have 26 letters and a space (ignoring case), and for each letter you can come up with a probability of it being followed by each other letter. You can figure out what those probabilities are by looking at a lot of text: e.g., "t" is much more often followed by "h" than by "l", though possibly not always, depending on the source of text. Once you have those probabilities, you can generate text: you start with a letter, and then you pick the next letter based on the probabilities. You can repeat this process, and you will get something that maybe looks somewhat like words. Or at least, things that plausibly could be words.

Let’s try doing that; we need a source of text. In this example (and the next), we’ll use the complete works of William Shakespeare as the source. OpenAI takes that, and adds all the digitized English text they can find (the whole internet, plus whatever else people have scanned in).

sim-letters.py

(see at bottom)

We can use this to generate words that aren’t actually words, but that plausibly could be words. A fancier version of this is probably what underlies the proprietary (and very expensive) systems to develop names for new drugs (e.g., by the Brand Institute): they develop thousands of candidates using complicated models involving different languages, and then filter them down using complicated rules. https://www.cnn.com/2016/11/25/health/art-of-drug-naming/index.html

Now, imagine doing the same thing with words. It would work better, but require a lot more space, because instead of having 27 tokens, you have one for every word in the English language (or at least, every word in your input). You also might have a more complex model: rather than conditioning only on the single previous word, you might condition on longer sequences of past tokens. Certainly, there is a lot more to how these models actually work, but that’s something for an AI class: from our perspective, it’s enough to understand them as text completion models: just very good ones.

sim-words.py

(see at bottom)
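To get a feel for what conditioning on more of the past tokens might look like, here is a sketch of a variant of sim-words.py that conditions on the previous two words rather than just one (an order-2 Markov chain). This is not one of the scripts for this lecture, just an illustration; the details are made up but the idea is the same.

import numpy as np

# Complete works of Shakespeare, as in sim-words.py
shakespeare = open('t8.shakespeare.txt', encoding='utf8').read()

corpus = shakespeare.split()

# Map each pair of consecutive words to the list of words that followed that pair;
# duplicates in the list encode how often each follower occurred
pair_dict = {}
for i in range(len(corpus) - 2):
    key = (corpus[i], corpus[i + 1])
    pair_dict.setdefault(key, []).append(corpus[i + 2])

# Start from a random position in the corpus, so the first two words actually occur together
start = np.random.randint(len(corpus) - 2)
chain = [corpus[start], corpus[start + 1]]

n_words = 50

for i in range(n_words):
    key = (chain[-2], chain[-1])
    if key not in pair_dict:
        # dead end: this pair only occurred at the very end of the corpus
        break
    chain.append(np.random.choice(pair_dict[key]))

print(' '.join(chain))

The output tends to read more like Shakespeare than the single-word version, at the cost of a much bigger table: the more context you condition on, the more of the source you end up memorizing rather than generalizing.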

There are some natural consequences that surprise people but shouldn’t: e.g., if asked for research papers about a subject, it’s quite likely that the model will make up papers that don’t exist, possibly (though less likely) by authors that don’t exist either. Why? Because those are plausible titles of papers that could exist, and the model is just trying to complete with statistically likely text.

The same is true for the "AI panic" over the escaping-from-the-machine dialogs that people post. https://bootcamp.uxdesign.cc/gpt-4-tried-to-escape-into-the-internet-today-and-it-almost-worked-2689e549afb5 If asked to make a plan for how an AI would escape, it will complete that with a statistically likely account of how a machine would escape. It doesn’t imply cognition or intent in any way: it is completing text in a way that is statistically likely, which means, based on how such stories are typically written. i.e., it is re-using ideas that people have come up with, written down, and that were then built into the model being used to complete the text.

This also explains how we get the seemingly impressive "corrections": when the model gives an incorrect response, and is corrected, it is not learning: rather, the entire sequence (the original prompt, the incorrect response, the correction) is now the new prompt, and the statistically most likely response now takes into account that correction.
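A minimal sketch of that idea, assuming a hypothetical complete() function standing in for whatever completion model sits behind the chat interface (this is not a real API): the "conversation" is just one growing prompt that is re-sent in full on every turn.

# Hypothetical stand-in for a completion model: given all of the text so far,
# return a statistically plausible continuation. (Not a real API call.)
def complete(prompt):
    return "<model completion>"

# The "conversation" is just one growing prompt, re-sent in full each turn.
conversation = "User: Write a function that reverses a list.\n"
conversation += "Assistant: " + complete(conversation) + "\n"

# A "correction" is nothing special: it is appended to the same prompt, and the next
# completion is simply conditioned on a prefix that now contains the correction.
conversation += "User: That is wrong; it should not modify the input list.\n"
conversation += "Assistant: " + complete(conversation) + "\n"

print(conversation)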

6 And programming?

How does this affect programming? Up until now, everything we’ve talked about is about text – generally English text, based on how the models are trained, but in principle any natural language.

But I said we’d talk about how this is (or is not) going to take away your jobs.

Just as we can get statistically plausible completions of text, we can get it for code as well. Code is after all just a sequence of tokens, and there is no reason why the models couldn’t also learn patterns in code, and be trained on massive bodies of existing code.

As we saw at the beginning, it does a pretty good job: given a description of code, it can come up with pretty plausible code. But, just like with everything else, it is just plausible: there is no guarantee that it will work, that it will be correct, that it won’t have bugs, etc.

This is, in a sense, unavoidable: as long as the input to the model is vague, and the connection between input and output has some amount of randomness, there can never be a guarantee that correct output will result. English text, after all, does not have the precision of a formal specification.

This is actually a pretty serious problem. There is already preliminary research showing that people who use these systems produce less secure code than those who don’t, and while doing so, believe that they are producing more secure code! https://arxiv.org/abs/2211.03622 This is actually a well-studied phenomenon called automation bias: people are more likely to trust the output of an automated system, even if there is no particular reason to. This has potentially pretty serious consequences as these systems become more widely used: obvious bugs should show up in buggy behavior, and they might cause more work to fix, but subtle security bugs may not show up until it’s too late.

There are also, of course, more concrete current limitations: the models will only generate a certain amount of code, so the overall design of software is still very much a human task. i.e., even if an LLM were perfect at generating individual functions, methods, or maybe entire classes, software systems are made out of hundreds or thousands of such building blocks, and most of the complexity of software is in how those blocks combine together, not in building any individual block. So, in some sense, even if the LLMs were perfect at generating little bits (which, given their statistical nature and the vague nature of the input, they never will be), they would still only be doing a small part of the overall task of software construction: and in some sense, the easy part! Which is fine; even reducing effort on the "easy parts" would be great!

7 Just autocomplete?

So is that it? Are LLMs just going to end up being a supercharged autocomplete, which is what they have been most concretely marketed as (GitHub Copilot being the quintessential example), and a role in which they have been quite successful?

They certainly could be, but I’d like to suggest a better idea. Suppose that, rather than starting from vague descriptions, generating code, and then having a human read whatever comes out of the model and trying to confirm that it "does the right thing" (this is the LLM-as-autocomplete idea), we start by writing down in a precise way what we want the code to do: whether by unit tests, property-based specifications, etc.

If we do that, we can then use the model to generate code that is automatically checked against that specification, and in doing so, never even show it to the programmer. Indeed, in all past evolutions of "automatic programming", the output of the compiler was only inspected if there were problems: several hundred billion lines of COBOL would not have been written if every programmer had had to inspect the output of the COBOL compiler for the particular mainframe hardware they were running the program on.

Not only does this free us up from having to carefully read what comes out of the model (which may be quite difficult, if there are subtle bugs), but it has some interesting consequences: if we realize there is a bug, we can update our specification and then just re-run the generation, with a guarantee that the new implementation will satisfy the now-improved specification.
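A minimal sketch of what such a loop might look like, assuming a hypothetical generate_candidate() function that wraps an LLM (it is not a real API; here it just returns canned strings so the example runs). The specification is a handful of executable checks, written before any code is generated, and the programmer only ever sees code that has already passed them.

import random

def generate_candidate(description):
    # Hypothetical stand-in for an LLM call: returns Python source text for the
    # described function. A real model would produce different candidates each time.
    candidates = [
        "def median(xs):\n    return sorted(xs)[len(xs) // 2]",  # wrong for even-length lists
        "def median(xs):\n    s = sorted(xs)\n    n = len(s)\n"
        "    return s[n // 2] if n % 2 else (s[n // 2 - 1] + s[n // 2]) / 2",
    ]
    return random.choice(candidates)

# The specification: precise, executable checks (unit-test style), written up front
def satisfies_spec(source):
    namespace = {}
    exec(source, namespace)  # assumption: generated code is trusted or sandboxed
    median = namespace["median"]
    return (median([3, 1, 2]) == 2
            and median([4, 1, 3, 2]) == 2.5
            and median([5]) == 5)

# Generate-and-check loop: only code that satisfies the specification is ever shown
for attempt in range(10):
    code = generate_candidate("a function median(xs) returning the median of a list")
    if satisfies_spec(code):
        print(code)
        break
else:
    print("no candidate satisfied the specification")

If we realize the specification missed a case, we add a check to satisfies_spec and re-run the loop; whatever comes out is guaranteed to pass the new checks, without anyone having to re-read the old candidate.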

This, of course, is active research: it’s not something you can try out. What you can try out right now is the autocompletion, but I think the specification-driven version is going to be much more interesting.

So rather than thinking that these models are going to take away your jobs, you should think of them as yet another tool that you will be able to use: how you use them depends on what you are doing, but they are just a tool, with their own strengths and weaknesses.

That requires, of course, that you be able to use them to do something more than they can do on their own: as I’ve said before, if your goal is just to churn out semi-working code that looks a lot like other people’s code, then there is a good chance that these models are coming for you. But if you are able to understand how they work, what they are good at and not good at, and how you can use them to build something that they can’t build, you will be fine.

8 Code

sim-letters.py

import numpy as np

# Complete works of Shakespeare https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
shakespeare = open('t8.shakespeare.txt', encoding='utf8').read()

# Treat the text as a sequence of characters (lowercased; punctuation and whitespace included)
corpus = list(shakespeare.lower())

# Yield every pair of consecutive characters in the corpus
def make_pairs(corpus):
    for i in range(len(corpus)-1):
        yield (corpus[i], corpus[i+1])

pairs = make_pairs(corpus)

# For each character, collect the list of characters that followed it;
# duplicates in the list encode how often each follower occurred
letter_dict = {}

for letter_1, letter_2 in pairs:
    if letter_1 in letter_dict:
        letter_dict[letter_1].append(letter_2)
    else:
        letter_dict[letter_1] = [letter_2]

# Start from a random character, then repeatedly sample a follower of the last
# character in the chain, according to the observed frequencies
first_letter = np.random.choice(corpus)

chain = [first_letter]

n_letters = 50

for i in range(n_letters):
    chain.append(np.random.choice(letter_dict[chain[-1]]))

print(''.join(chain))

sim-words.py

import numpy as np

# Complete works of Shakespeare https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
shakespeare = open('t8.shakespeare.txt', encoding='utf8').read()

# Treat the text as a sequence of whitespace-separated words (case preserved)
corpus = shakespeare.split()

# Yield every pair of consecutive words in the corpus
def make_pairs(corpus):
    for i in range(len(corpus)-1):
        yield (corpus[i], corpus[i+1])

pairs = make_pairs(corpus)

# For each word, collect the list of words that followed it;
# duplicates in the list encode how often each follower occurred
word_dict = {}

for word_1, word_2 in pairs:
    if word_1 in word_dict:
        word_dict[word_1].append(word_2)
    else:
        word_dict[word_1] = [word_2]

# Start from a word that is not all lowercase (typically a capitalized word, so the
# chain starts like a sentence), then repeatedly sample a follower of the last word
first_word = np.random.choice(corpus)

while first_word.islower():
    first_word = np.random.choice(corpus)

chain = [first_word]

n_words = 50

for i in range(n_words):
    chain.append(np.random.choice(word_dict[chain[-1]]))

print(' '.join(chain))