Lecture 38: LLMs & Specification
1 Purpose
Discuss the impending apocalypse for programming (or is it?).
2 Outline
Doom and gloom: it seems like everyone thinks that LLMs are going to take over the world. There are even people who believe that they are nearly sentient. https://www.scientificamerican.com/article/google-engineer-claims-ai-chatbot-is-sentient-why-that-matters/
There are red teams testing whether ChatGPT can "escape" human control.
(Though the setup of these "experiments" is pretty ludicrous.)
And this isn’t limited to people completely ignorant of the technology. Perhaps the sentience claims are, but there are computer science professors who believe this is going to fundamentally change programming. Is it?
And even if they are not sentient, it’s worth understanding the potential impact of the responses (e.g., Microsoft’s chat agent trying to convince a journalist to leave his wife https://www.nytimes.com/2023/02/16/technology/bing-chatbot-microsoft-chatgpt.html) as these tools are incorporated more broadly (e.g., people suggesting using them for therapy https://twitter.com/RobertRMorris/status/1611450197707464706). Having a chatbot that is abusive to a journalist researching the topic is one thing; having it be abusive to a vulnerable person who is seeking help crosses into the realm of actively causing harm.
So how do we understand these tools? What do they do, and not do?
Let’s try it out, just so we are on somewhat the same page. Given the context of this class, we are going to primarily be talking about how LLMs can be used for programming, so let’s try doing our last exam :)
3 Threat to programmers?
So is this a threat to programmers? Are you going to have a job when you graduate?
Let’s take a longer look at the history both of the "threats to programmers" posed by increased productivity, and of the idea of statistical generation (the technology behind LLMs).
To some extent, the entire history of programming, dating back to the 1950s, has been an effort to make it easier to write programs. e.g., COBOL, created in the late 1950s, was explicitly designed to make it easier to write business programs, and it was successful enough that there are still hundreds of billions of lines of it running today.
3.1 Why was COBOL so successful?
https://community.ibm.com/community/user/ibmz-and-linuxone/blogs/blog-entry1/2016/02/26/cobol-and-the-enterprise-programming-paradigm
https://www.infoworld.com/article/3539057/what-is-cobol-cobol-programming-explained.html
https://www.cnn.com/2020/04/08/business/coronavirus-cobol-programmers-new-jersey-trnd/index.html
https://www.marketplace.org/shows/marketplace-tech/the-65-year-old-computer-system-at-the-heart-of-american-business/
It was unquestionably easier to write than the proprietary, hardware-specific assembly languages that predated it, especially because in the 1950s there were many more hardware platforms than there are today. While you could say that COBOL was a "job killer" for people who knew some of those esoteric hardware languages (and had the skills to learn new ones easily), in actuality, by raising the level of abstraction, it simply made it possible for more people to be programmers, and expanded the domains in which programming was used. The same could be said for any number of advances, at ever higher levels of abstraction; the types of programs that you can easily write today were likely inconceivable decades ago, or would have taken teams of programmers much longer to write.
For a long time, the idea was that "automatic programming" was coming. This goes back to the very earliest days of programming in the 1940s, when "automatic programming" meant writing down code in essentially assembly and having a program generate the actual punch cards that were used. The same idea recurred with languages like Fortran, in the 1950s, where you could write down mathematics and have the program generate the actual code. It was present in COBOL, and in ever higher-level ideas running through at least the 1980s (for much more on this, see https://ieeexplore.ieee.org/document/75). In some sense, it was also present when Java was created in the 1990s, where the promise was increased productivity through platform independence: since the JVM gave a common platform that code could be run on, you wouldn’t have to make multiple versions of the same program. But in all these cases, the idea was just that you could write higher-level versions of programs (or descriptions of computations, or specifications): it never removed the need for programming, or programmers.
Indeed, there is interesting research, going back to the 1980s (https://dl.acm.org/doi/10.1145/214956.214961), arguing that there is a fundamental complexity to programming that seems unlikely to go away. If we distill systems down to the states they can take, digital systems are discrete systems with combinatorially massive numbers of states: a system with just a few hundred bits of state already has more possible states than there are atoms in the observable universe. Unlike in other branches of engineering, where many processes are continuous and can therefore be composed and approximated smoothly, discrete systems have no such structure, and that makes them extremely hard to reason about. So while we have unquestionably built more advanced digital systems, to some extent the fundamental task of programming has not gotten any easier: we’ve just built more and faster hardware that has enabled us to better re-use the techniques we already knew at the very beginning.
So the idea of making programming easier is not new: to some extent, it’s something we have been working on as long as we’ve been programming.
4 A Difference?
There is one human difference to the current iteration of this: there are a lot more programmers now than there were in the 1980s, or 1960s, or 1940s. In previous generations, the idea of automatic programming, or improvements to productivity, or even just making programming easier, was primarily sold as a way of allowing non-programmers to become programmers.
Now, it is being talked of as a potential threat to already employed programmers. It doesn’t help that this is coming at a time when tech companies, now some of the largest corporations in the world, are looking at the larger context of increasingly empowered workers across the economy and seeing this as a potential way to gain power over tech workers. In the same way that automation has been used in other industries to wrest control from workers, the hope is that automation can help tech companies gain more control over theirs.
There is also a technical difference to the current iteration: the inputs to the LLM tools are non-algorithmic specifications. That’s a fancy way of saying that they do not have precise semantics: they are, after all, just a bunch of text. This has interesting consequences: good, in that it opens up possibilities, but also bad, as it raises questions about how accurate the output can ever be.
5 How do these work?
How can we understand how these systems work, and how they relate (or don’t relate) to programming?
At their core, all of the LLMs are statistical completion models: given a sequence of tokens (words, etc.), what is a plausible next token? How do they do this? Well, you can imagine how you might develop a statistical completion model on letters: if you have a bunch of English text, you have 26 letters and a space (ignoring case), and for each letter you can come up with a probability of it being followed by each other letter. You can estimate those probabilities by looking at a lot of text: e.g., "t" is much more often followed by "h" than by "l", though the exact probabilities depend on the source of the text. Once you have those probabilities, you can generate text: you start with a letter, and then you pick the next letter based on the probabilities. You can repeat this process, and you will get something that maybe looks somewhat like words. Or at least, things that plausibly could be words.
Let’s try doing that; we need a source of text. In this example (and the next), we’ll use the complete works of William Shakespeare as a source. OpenAI, by comparison, trains on vastly more: essentially all the digitized English text they can find.
sim-letters.py (listing below, in Section 8)
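The demo samples uniformly from lists of observed successor letters, which is equivalent to sampling from the bigram probabilities; as a minimal sketch, the probability table itself can also be computed explicitly from the same t8.shakespeare.txt file:

from collections import Counter, defaultdict

# Estimate letter-bigram probabilities from the Shakespeare text.
text = open('t8.shakespeare.txt', encoding='utf8').read().lower()
# Keep only the 26 letters and space, as in the description above.
text = ''.join(c for c in text if c == ' ' or 'a' <= c <= 'z')

counts = defaultdict(Counter)
for a, b in zip(text, text[1:]):
    counts[a][b] += 1          # how often letter a is followed by letter b

def next_letter_probs(letter):
    total = sum(counts[letter].values())
    return {b: n / total for b, n in counts[letter].items()}

# P(next letter | 't'): 'h' should come out far more likely than 'l'.
print(sorted(next_letter_probs('t').items(), key=lambda kv: -kv[1])[:5])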
Now, imagine doing the same thing with words. It would work better, but require a lot more space, because instead of 27 tokens you have one for every word in the English language. You might also use a more complex model: rather than just the probability of the next word given the current one, you might condition on a longer sequence of past tokens. Certainly, there is a lot more to how these models actually work, but that’s something for an AI class: from our perspective, it’s enough to understand them as text completion models, just very good ones.
sim-words.py (listing below, in Section 8)
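To illustrate conditioning on more of the past, here is a sketch of the same construction keyed on the previous two words instead of one (again assuming the t8.shakespeare.txt file; this is not how real LLMs extend their context, just the same counting idea with a longer key):

import numpy as np

shakespeare = open('t8.shakespeare.txt', encoding='utf8').read()
corpus = shakespeare.split()

# For each pair of adjacent words, collect the words observed to follow that pair.
pair_dict = {}
for w1, w2, w3 in zip(corpus, corpus[1:], corpus[2:]):
    pair_dict.setdefault((w1, w2), []).append(w3)

# Start from a random position so the initial two-word context actually occurs in the text.
i = np.random.randint(len(corpus) - 2)
chain = [corpus[i], corpus[i + 1]]
for _ in range(50):
    followers = pair_dict.get((chain[-2], chain[-1]))
    if not followers:  # the context only occurs at the very end of the corpus
        break
    chain.append(np.random.choice(followers))
print(' '.join(chain))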
There are some natural consequences that surprise people but shouldn’t: e.g., if asked for research papers about a subject, it’s quite likely that the model will make up papers that don’t exist, possibly (though less likely) by authors that don’t exist either. Why? Because those are plausible titles of papers that could exist, and the model is just trying to complete the prompt with statistically likely text.
The same is true for the "AI panic" escaping-from-human-control dialogs that people post. If asked to make a plan of how the AI would escape, the model will complete that with a statistically likely explanation of how a machine would escape. It doesn’t imply cognition or intent in any way: it is completing text in a way that is statistically likely, which means, based on how such stories are typically written. i.e., it is re-using the ideas that people have come up with, written down, and that were then built into the model being used to complete the text.
This also explains the seemingly impressive "corrections": when the model gives an incorrect response, and is corrected, it is not learning. Rather, the entire sequence (the original prompt, the incorrect response, the correction) is now the new prompt, and the statistically most likely response now takes that correction into account.
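A toy sketch of this (the "model" here is a hard-coded lookup table standing in for statistical completion, purely to show that the correction changes the prompt, not the model):

# The "model": given a transcript, return a canned continuation keyed on its last line.
# A real LLM would instead sample statistically likely next tokens.
def complete(transcript, continuations):
    return continuations.get(transcript[-1], "(no idea)")

continuations = {
    "What is 2 + 2?": "5",                              # an incorrect response
    "No, 2 + 2 is 4.": "You are right, 2 + 2 is 4.",    # a likely continuation given the correction
}

transcript = ["What is 2 + 2?"]
transcript.append(complete(transcript, continuations))  # the wrong answer is appended
transcript.append("No, 2 + 2 is 4.")                    # the user "corrects" the model
transcript.append(complete(transcript, continuations))  # nothing was learned: the longer
                                                        # transcript simply makes a different
                                                        # continuation the most likely one
print("\n".join(transcript))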
6 And programming?
How does this affect programming? Up until now, everything we’ve talked about is about text – generally English text, based on how the models are trained, but in principle any natural language.
But I said we’d talk about how this is (or is not) going to take away your jobs.
Just as we can get statistically plausible completions of text, we can get it for code as well. Code is after all just a sequence of tokens, and there is no reason why the models couldn’t also learn patterns in code, and be trained on massive bodies of existing code.
As we saw at the beginning, it does a pretty good job: given a description of what the code should do, it can come up with pretty plausible code. But, just like with everything else, it is just plausible: there is no guarantee that it will work, that it will be correct, that it won’t have bugs, etc.
This is, in a sense, unavoidable: as long as the input to the model is vague, and the connection between input and output involves some amount of randomness, there can never be a guarantee that correct output will result. English text, after all, does not have the precision of a formal specification.
There are also, of course, more concrete current limitations: the models will only generate a certain amount of code, so the overall design of software is still very much a human task. i.e., even if an LLM were perfect at generating individual functions, or methods, or maybe entire classes, software systems are made out of hundreds or thousands of such building blocks, and most of the complexity of software is in how those blocks combine, not in building any individual block. So, in some sense, even if the LLMs were perfect at generating little bits (which, given their statistical nature and the vague nature of the input, they never will be), they would still only be doing a small part of the overall task of software construction.
7 Just autocomplete?
So is that it? Are LLMs going to just end up being a supercharged autocomplete?
They certainly could be, but I’d like to suggest a better idea. Rather than starting from vague descriptions, generating code, and then having a human read whatever comes out of the model and try to confirm that it "does the right thing" (the LLM-as-autocomplete idea), we can start by writing down, in a precise way, what we want the code to do: as unit tests, property-based specifications, etc. If we do that, we can then use the model to generate code that is automatically checked against that specification, and never even shown to the programmer. Indeed, in all past evolutions of "automatic programming", the output of the compiler was only inspected if there were problems: several hundred billion lines of COBOL would not have been written if every programmer had to inspect the code the COBOL compiler emitted for the particular mainframe hardware they were running on.
Not only does this free us from having to carefully read what comes out of the model (which may be quite difficult, if there are subtle bugs), but it has some interesting consequences: if we realize there is a bug, we can update our specification and then just re-run the generation, with a guarantee that the new implementation will satisfy the now-improved specification.
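A minimal sketch of that generate-and-check loop (generate_candidate here is a hypothetical stand-in for a call to an LLM, returning canned candidates; the point is only that candidates are filtered by the specification before anyone reads them):

# The specification: properties any acceptable implementation must satisfy.
def spec(sort):
    cases = [[], [1], [3, 1, 2], [2, 2, 1], list(range(10, 0, -1))]
    return all(sort(list(c)) == sorted(c) for c in cases)

# Hypothetical stand-in for an LLM: returns plausible-looking implementations,
# not all of them correct.
def generate_candidate(prompt, attempt):
    candidates = [
        lambda xs: xs,                  # plausible but wrong
        lambda xs: list(reversed(xs)),  # plausible but wrong
        lambda xs: sorted(xs),          # happens to satisfy the spec
    ]
    return candidates[attempt % len(candidates)]

def synthesize(prompt, spec, max_attempts=10):
    for attempt in range(max_attempts):
        candidate = generate_candidate(prompt, attempt)
        if spec(candidate):
            return candidate            # only checked code ever reaches the programmer
    raise RuntimeError("no candidate satisfied the specification")

sort = synthesize("sort a list of integers", spec)
print(sort([3, 1, 2]))  # [1, 2, 3]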
This is something we can build – and indeed, with students I’ve been building a prototype of this over the last year.
DEMO.
So rather than thinking that these models are going to take away your jobs, you should think of them as yet another tool that you will be able to use: how you use them depends on what you are doing, but they are just a tool, with their own strengths and weaknesses.
That requires, of course, that you be able to use them to do something more than they can do on their own: as I’ve said before, if your goal is just to churn out semi-working code that looks a lot like other people’s code, then there is a good chance that these models are going to be coming for you. But if you understand how they work, what they are good at and not good at, and how you can use them to build something that they can’t build on their own, you will be fine.
8 Code
sim-letters.py:
import numpy as np

# Complete works of Shakespeare: https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
shakespeare = open('t8.shakespeare.txt', encoding='utf8').read()
corpus = list(shakespeare.lower())

# Yield each adjacent pair of characters in the text.
def make_pairs(corpus):
    for i in range(len(corpus) - 1):
        yield (corpus[i], corpus[i + 1])

pairs = make_pairs(corpus)

# For each character, collect the list of characters observed to follow it;
# sampling uniformly from that list is equivalent to sampling from the bigram probabilities.
letter_dict = {}
for letter_1, letter_2 in pairs:
    if letter_1 in letter_dict.keys():
        letter_dict[letter_1].append(letter_2)
    else:
        letter_dict[letter_1] = [letter_2]

# Start from a random character and repeatedly sample a plausible next character.
first_letter = np.random.choice(corpus)
chain = [first_letter]
n_letters = 50
for i in range(n_letters):
    chain.append(np.random.choice(letter_dict[chain[-1]]))
print(''.join(chain))
sim-words.py:
import numpy as np

# Complete works of Shakespeare: https://ocw.mit.edu/ans7870/6/6.006/s08/lecturenotes/files/t8.shakespeare.txt
shakespeare = open('t8.shakespeare.txt', encoding='utf8').read()
corpus = shakespeare.split()

# Yield each adjacent pair of words in the text.
def make_pairs(corpus):
    for i in range(len(corpus) - 1):
        yield (corpus[i], corpus[i + 1])

pairs = make_pairs(corpus)

# For each word, collect the list of words observed to follow it.
word_dict = {}
for word_1, word_2 in pairs:
    if word_1 in word_dict.keys():
        word_dict[word_1].append(word_2)
    else:
        word_dict[word_1] = [word_2]

# Start from a capitalized word (so the chain begins somewhere sentence-like),
# then repeatedly sample a plausible next word.
first_word = np.random.choice(corpus)
while first_word.islower():
    first_word = np.random.choice(corpus)
chain = [first_word]
n_words = 50
for i in range(n_words):
    chain.append(np.random.choice(word_dict[chain[-1]]))
print(' '.join(chain))