From Regular Expression Puzzles and AI Coding Assistants by David Mertz

Regular Expression Puzzles and AI Coding Assistants is the perfect starting point for programmers of any experience level who want to understand the capabilities—and the limitations—of exciting new AI tools. Author David Mertz presents 24 challenging regex puzzles, their traditional human-made solutions, and the fascinating answers given by popular AI assistants.


The map and the territory

This book winds through a combination of two quite different things, both of which I believe will be largely novel to most of my readers. On the one hand, this is a puzzle book intended to be more “quirky” and “fun”, than to serve as a tutorial or reference text. However, the puzzles I have chosen should make both beginners and experienced users of regular expressions question what is and is not possible within them—and perhaps what should and should not be done using them—and burrow into readers’ brains, as if they are some sort of parasitic, eidetic worm. This book is not, however, without an overriding pedagogical subtext: I expect you to think differently, and indeed more productively, if you can solve these puzzles (or at least reflect upon my discussions of how one might solve them).

That is only the one hand, however. Many of us have a second hand, though few a third. Another curious puzzle has arisen in the last years, or even only over the last months, which is similarly ubiquitous in the minds of us computer programmers. A class of software that I call “AI coding assistants” can often be made to write programming code on our behalf, which is at once dumbfounding and very often just plainly dumb. I have chosen two of the most popular (at present) such tools—Copilot and ChatGPT—and I hope that what I discuss more generally will be informative in our approach to any future such tools, however rebranded, refreshed, or enhanced.

AI coding assistants—which you are likely to see named in a variety of other ways in other books, articles, press releases, and so on—are software tools which can assist software developers while they write programming language code.

A typical operation of an AI coding assistant allows a developer to write comments describing what they would like a function, class, structure, or module to do, then have the AI write code that tries to fulfill that stated goal. I will use the phrase “unit of functionality” generically for a collection of code devoted to a particular purpose (and usually textually adjacent to the rest of that unit).

Comments or prompts can, and should, simply be written in natural language (in English in this book), rather than in a special domain specific language. In fact, a good comment for an AI coding assistant should look exactly like a good comment written for human programmers who will later work with your code. A related mode of operation of these tools allows a human developer to write a portion of the code they wish to create, but have the AI fill in the missing pieces in that unit of functionality. To some degree, these AI coding assistants can also take functioning code and provide the missing documentation that humans failed to write to describe the purpose of a given unit of functionality.

At the time of this writing, these AI coding assistants are very large neural networks which live on remote servers maintained and controlled by the organizations that created them. The underlying engines that power these AIs do not reside on local developer workstations for multiple reasons: the large size of the models; the mostly proprietary and confidential details of how the models are trained; licensing and subscription terms the developers wish to enforce, and even simply the computational power and specialized hardware needed to enable their effective function. When you use one of these AI coding assistants, your programming editor, web page, or other interface, makes requests over the internet to these servers, and integrates the responses into familiar local interfaces via plugins. This does tend to mean that you need an internet connection in order to utilize these tools.

About regular expressions

Regular expressions—sometimes given the playful back formation regexen or more neutrally regex—are a powerful and compact way of describing patterns in text.

The appendix to this book contains a brief tutorial aimed at users and programmers who have begun to work with tools that use regular expressions, but who are not yet quite comfortable with the intricacies of them. Even users who may have used regular expressions in the past, but have forgotten some of their details can benefit from this refresher.

After completing that tutorial—should you feel such is relevant—you will not yet be an expert in using regular expressions to best advantage. But the tutorial, combined with lots of practice with varying cases, is about all you need to be an expert. The concepts of regular expressions are extremely simple and powerful. It is their application that takes some work.

The puzzles in this book begin at a certain point where the formal descriptions leave off. As you work with regexen, you will find subtle pitfalls. A pattern that seems like it should obviously match one thing actually matches something slightly different than you intended. Or perhaps a match pattern has “pathological” behavior and takes far too long. Or sometimes it is simply that a more concise pattern would be clearer in describing what you actually wish to match.

A great many programming languages, libraries, and tools support regular expressions, with relatively minor variations in the syntax used. Such software includes [efr]?grep, sed, awk, Perl, Java, .NET, JavaScript, Julia, Go, Rust, XML Schema, or indeed, pretty much every other programming language via a library.

For this book, we will use Python to pose these puzzles. In particular, we will use the standard library module re. Often code samples are used in puzzles and in explanation; where I wish to show the output from code, the example emulates the Python shell with lines starting with >>> (or continuing with …). Outputs are echoed without a prompt. Where code defines a function that is not necessarily executed in the mention, only the plain code is shown.

While you are reading this book, I strongly encourage you to keep open an interactive Python environment. Many tools enable this, such as the Python REPL (read-evaluate-print-loop) itself, or the IPython enhanced REPL, or Jupyter notebooks, or the IDLE editor that comes with Python, or indeed most modern code editors and IDEs (integrated development environments). A number of online regular expression testers are also available, although those will not capture the the Python calling details. Explanations will follow each puzzle, but trying to work it out in code before reading it is worthwhile.

Rise of the programming machines

You do not need to understand the underlying (complex) mathematics and design of deep neural networks to use AI coding assistants to read this book. Machine learning is an intricate topic, and the subject of many other longer books than this one. For those who are interested in such machine learning arcana, we can note a few (but only a few) details of how they work, and a hint of how they might work in the future.

With the rise of large language models (LLMs), the ability of coding tools to suggest both code and documentation has become—as of January 2023—fairly remarkable. I personally first tried a system called Tabnine in 2019 (still available, and significantly updated, at the time of writing). More recently, since 2021, GitHub Copilot has become widespread and sophisticated. In late 2022, OpenAI released ChatGPT. Just released at time of writing is an open source effort called PaLM+RLHF-Pytorch. More, similar products or projects will certainly be released in the next months and years, and those mentioned may undergo rebranding and changes in underlying core technologies.

Many of these tools are based on GPT-n, OpenAI’s Generative Pre-trained Transformer series, which are trained on many billions of texts, and utilize hundreds of billions of coefficients (connection weights), in order to produce “human-like” textual responses to textual inputs. In particular, for these Artificial Intelligence (AI) coding assistants, these neural networks models—the latest generation existing at time of writing being GPT-3—are specialized and tweaked, via reinforcement learning with human evaluators, and by layering on substantial bodies of programming language code as “fine-tuning” of these underlying LLMs.

For the current generation of AI coding assistants, a great deal of their fundamental technology can be traced to a 2017 academic paper that shifted a great deal of research focus to transformer deep neural networks.[1] This book is not in a position to predict whether new AI technologies will use different techniques, but we can be confident that future machines will generally continue to improve upon existing ones.

Regular expressions provide an interesting challenge for AI coding assistants which this book will partially address. Compared to other types of programming code, regular expressions are extremely dense and complex expressions, and ones where very subtle differences in their implied state machines can dramatically alter the function of the regex. A single changed character within a regular expression might produce a syntactically valid regular expression which does something meaningful—even something concretely useful in some contexts—but does not achieve the precise purpose at hand.

Tokenization strategies

Some—perhaps many— of the failures we see discussed within this book reflect the tokenization strategy used by GPT-3.5. Specifically, it is (most likely) a variant of byte-pair encoding (https://en.wikipedia.org/wiki/Byte_pair_encoding) which has the effect of creating a dictionary composed primarily of word roots, or even of whole words, rather than of single character transitions. For normal prose, this is exactly what one would want. For a dense character-based encoding such as regular expressions—or probably similarly for dense programing languages like APL, J, K, A+, or Q; and for many “esoteric programming languages” (https://en.wikipedia.org/wiki/Esoteric_programming_language) the tokenization model works against the effectiveness of the AI coding assistant. It is possible that future large language models, perhaps those based on GPT-4 (when it is created), will remedy some of these limitations.

The puzzles in this book are generally precisely the kinds of traps where slightly wrong approaches might seem to work, but fail in edge cases that require a nuanced understanding of regular expressions. We shall see, and discuss, where AIs are able to capture this nuance, and where they fail. I shall try to understand why these successes and failures occur, and share my thoughts with you readers.

Caveats

The future is already here—it’s just not evenly distributed. –William Gibson (The Economist, December 4, 2003).

Three caveats are needed in approaching the “robot regexen” that I present. One is that, by the time you read this, the machines will almost surely be “better” than they are at the time of writing, even if you read it mere days or weeks after I have written this. All the companies and organizations behind these technologies are continually retraining and refining their AI coding assistants.

The second caveat is that, I myself (your humble author), may simply fail to think of the best prompts to solicit improved responses from the AIs. While writing, I tried a variety of ways of phrasing my prompts, but I have certainly not tried all possible prompts. Results produced by these AIs can vary dramatically based on quite small changes to prompt phrasing.

A third important caveat is that these AI coding assistants are often context sensitive. If the code file you are working on already contains some related functionality, or even simply a choice of variable and function names previously defined, the AI will modify its results. Or similarly, within the “online chat” interface of ChatGPT, responses to previous prompts will affect future responses (sometimes subtly, sometimes dramatically).

Within my discussions of AI coding assistant suggestions, I often omit boilerplate such as “import re” or variable names which the AIs suggest. These are certainly useful when working as a developer, but are less relevant to evaluating the underlying ability of the AIs to “find the right regex.” In many cases, I have modified code to fit the scope of this book, which may involve syntactic—but never semantic—changes from the literal suggestions of the AIs.

Everywhere in the text, unless otherwise explicitly noted, where Copilot is shown as the AI coding assistant used, all comments above the created code were typed by me while the function bodies (or bare regular expression) that follow are created by Copilot. For readers familiar with code editors with non-AI code completion (for example, via lookup of method definitions), this type of automatic completion is very familiar and convenient to work with.

Intentional software development

One of my numerous favorite philosophers tells a well-known parable about intentionality:

“An ant is crawling on a patch of sand. As it crawls, it traces a line in the sand. By pure chance the line that it traces curves and re-crosses itself in such a way that it ends up looking like a recognizable caricature of Winston Churchill. Has the ant traced a picture of Winston Churchill, a picture that depicts Churchill?” –Hilary Putnam, Reason, Truth and History (1981).

Putnam’s question is one that readers might well keep in mind while reading this book. It pertains, in fact, to both of the “hands” I mention. First, regular expressions, in their sometimes bewildering nuances, can sometimes match the right things for the wrong reason; if anything, this “sometimes” is perhaps the norm. While not completely unique among the techniques we programmers use, regular expressions are somewhat special in not consisting merely of recipes to “do this, then do that” in a directly composable way. Sure, parser grammars are probably similar in this regard, but less well known to a broad range of programmers. Pure functional languages also have something of this quality of “non-composable composition” but are again not as widely used as procedural and object-oriented programming languages and styles.

In having such a curiously dependent structure among small parts, regular expressions, I believe, form a particularly deep challenge for AI coding assistants to create units of functionality productively (and correctly). In most of this book, the unit of concern is a single regular expression, often fitting on a single line, rather than, for example, a function definition that occupies several tens of line. Our understanding of the “intention” of the AIs intertwines with the similarly murky intentions of human programmers of regular expressions, and provides a particularly useful glimpse into understanding the utility and limits of these AI coding assistants.

In this book, I provide suggestions about how to understand well and how to misunderstand poorly exactly what it is that AI coding assistants do, and can do. I do this by putting each discussion of “AI thoughts” after both an initial puzzle and some “Author’s thoughts” (which sometimes approach “solutions” generically but calling them such would be an overreach).

As you read

Authors cannot and should not control books once readers obtain them. However, I would gently recommend that readers approach these puzzles as detailed below.

  • Read the puzzle description, which will involve doing something with regular expressions. Before reading further, think carefully about how to solve it, and play around with possible answers in your favorite coding environment (the Python shell is a great choice).
  • Compare what you came up with to the “Author’s thoughts” that follow. Maybe you missed something that I noticed. Or maybe I missed something you thought of. But I certainly hope my thoughts are illustrative of some intricacies of regexen.
  • With a good grasp of the puzzle and approaches to it, look at the “AI thoughts” which try to illustrate and discuss where the AI coding assistants succeed and where they fail. If you have access to these tools—either the two I discuss explicitly or others—maybe try out your own prompts and comments to see if you can get better AI answers than I did.

The puzzles in this book are arranged (approximately) in order of increasing difficulty. More features of regular expressions are often needed for later puzzles but, more importantly, understanding the nuance of edge cases is needed as well as you progress through the puzzles. As well as this general progression, in many cases sequences of puzzles play off a similar theme or topic, and become a bit more difficult with each variation.

The lessons you will learn in the “AI thoughts” sections following each puzzle are myriad and varied. Only occasionally can the virtues and errors of an AI “solution” be boiled down to a single “takeaway.” Instead, I reflect upon the numerous lessons we might learn from each (partial) success or failure.

At times, the AI coding assistant might fail to solve an “easy” puzzle but succeed in a “difficult” one within the same general sequence. However, in a very broad way, the AIs tend to get worse as the puzzles become more subtle. This is not surprising, of course, but the particular modes of failure are hopefully illuminating for developers who might use these tools.

What will you learn in this book?

This book will teach you to:

  1. Use AI code assistants to improve coding.
  2. Think more deeply about edge cases of regular expressions.
  3. Apply proper skepticism to code that “seems right”.
  4. Understand the overall capabilities of regular expressions.
  5. Think flexibly about open ended and ambiguous problems.

If you want to learn more about the book, check it out here.