From Tiny Python Projects by Ken Youens-Clark
Everyone loves Mad Libs! And everyone loves Python. This article shows you how to have fun with both and learn some programming skills along the way.
When I was a wee lad, we used to play at Mad Libs for hours and hours. This was before computers, mind you, before televisions or radio or even paper! No, scratch that, we had paper. Anyway, the point is we only had Mad Libs to play, and we loved it! And now you must play!
We’ll write a program called mad.py which reads a file given as a positional argument and finds all the placeholders noted in angle brackets like <verb> or <adjective>. For each placeholder, we’ll prompt the user for the part of speech being requested like “Give me a verb” and “Give me an adjective.” (Notice that you’ll need to use the correct article.) Each value from the user replaces the placeholder in the text, so if the user says “drive” for “verb,” then <verb> in the text is replaced with drive. When all the placeholders have been replaced with inputs from the user, print out the new text.
For instance, here’s a version of the “fox” text:
$ cat inputs/fox.txt
The quick <adjective> <noun> jumps <preposition> the lazy <noun>.
When the program runs with this file as the input, it asks for each of the placeholders and then prints the silliness:
$ ./mad.py inputs/fox.txt
Give me an adjective: surly
Give me a noun: car
Give me a preposition: under
Give me a noun: bicycle
The quick surly car jumps under the lazy bicycle.
By default, this is an interactive program that uses the input prompt to ask the user for their answers, but, for testing purposes, you have an option for -i or --inputs so that the test suite can pass in all the answers and bypass the interactive input calls:
$ ./mad.py inputs/fox.txt -i surly car under bicycle
The quick surly car jumps under the lazy bicycle.
In this exercise, you will:
- Learn about greedy matching
- Use re.findall to find all matches for a regex
- Use re.sub to substitute found patterns with new text
- Explore ways to write the program without using regular expressions
Writing mad.py
To start off, use new.py mad.py to create the program or copy template/template.py to mad_libs/mad.py. You should define the positional file argument as type=argparse.FileType('r'). The -i or --inputs option should use nargs='*' to define a list of zero or more str values.
First modify your mad.py until it produces the following when given no arguments or the -h or --help flag:
$ ./mad.py -h
usage: mad.py [-h] [-i [input [input ...]]] FILE

Mad Libs

positional arguments:
  FILE                  Input file

optional arguments:
  -h, --help            show this help message and exit
  -i [input [input ...]], --inputs [input [input ...]]
                        Inputs (for testing) (default: None)
If the given file argument doesn’t exist, the program should error out:
$ ./mad.py blargh
usage: mad.py [-h] [-i [input [input ...]]] FILE
mad.py: error: argument FILE: can't open 'blargh': [Errno 2] No such file or directory: 'blargh'
If the text of the file contains no <> placeholders, it should print a message and exit with an error value. Note this doesn’t need to print a usage, and you don’t have to use parser.error as in previous exercises:
$ cat no_blanks.txt
This text has no placeholders.
$ ./mad.py no_blanks.txt
"no_blanks.txt" has no placeholders.
Here’s a string diagram to help you visualize the program:
Using regular expressions to find the pointy bits
The first thing we need to do is read the input file:
>>> text = open('inputs/fox.txt').read().rstrip()
>>> text
'The quick <adjective> <noun> jumps <preposition> the lazy <noun>.'
We need to find all the <…> bits; let’s use a regular expression. We can find a literal < character:
>>> import re
>>> re.search('<', text)
<re.Match object; span=(10, 11), match='<'>
Now let’s find that bracket’s mate. The . means “anything,” and we can add a + after it to mean “one or more.” I’ll capture the match to make it easier to see:
>>> match = re.search('(<.+>)', text)
>>> match.group(1)
'<adjective> <noun> jumps <preposition> the lazy <noun>'
Hmm, that matched all the way to the end of the string instead of stopping at the first available >. When we use * or + for “zero or more” or “one or more,” the regex engine is “greedy” on the “or more” part. The pattern matches beyond where we want it to, but it is technically matching exactly what we described. Remember that . means anything, and a right angle bracket is anything. The pattern matches as many characters as possible, stopping only at the last right angle bracket, which is why it is called “greedy.”
We can make the regex “non-greedy” by changing + to +?:
>>> re.search('<.+?>', text)
<re.Match object; span=(10, 21), match='<adjective>'>
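To see the difference over the whole string, here is a small sketch that runs both patterns through re.findall (a function covered a little further on). The variable names greedy and lazy are my own for illustration.

```python
import re

text = 'The quick <adjective> <noun> jumps <preposition> the lazy <noun>.'

# Greedy: .+ runs to the last '>' in the string, so there is only one match.
greedy = re.findall('<.+>', text)

# Non-greedy: .+? stops at the first '>' it can reach, one match per placeholder.
lazy = re.findall('<.+?>', text)

print(len(greedy), greedy)
print(len(lazy), lazy)
```

The greedy pattern finds one long match, while the non-greedy pattern finds each of the four placeholders separately.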
Rather than using . for “anything,” it’s more accurate to say that we want to match one or more of anything which is neither of the angle brackets. The character class [<>] matches either bracket. We can negate (or complement) the class by putting a caret (^) as the first character, and we have [^<>]. This matches anything which isn’t a left or right angle bracket:
>>> re.search('<[^<>]+>', text)
<re.Match object; span=(10, 21), match='<adjective>'>
Why do we have both brackets inside the negated class? Wouldn’t the right bracket be enough? Well, I’m guarding against unbalanced brackets. With only the right bracket, it matches this text:
>>> re.search('<[^>]+>', 'foo <<bar> baz')
<re.Match object; span=(4, 10), match='<<bar>'>
But with both brackets in the negated class, it finds the correct, balanced pair:
>>> re.search('<[^<>]+>', 'foo <<bar> baz')
<re.Match object; span=(5, 10), match='<bar>'>
We’ll add two sets of parentheses (), one to capture the entire placeholder pattern and another for the string inside the <>:
>>> match = re.search('(<([^<>]+)>)', text)
>>> match.groups()
('<adjective>', 'adjective')
A handy function called re.findall returns all matching text groups as a list of tuple values:
>>> from pprint import pprint
>>> matches = re.findall('(<([^<>]+)>)', text)
>>> pprint(matches)
[('<adjective>', 'adjective'),
 ('<noun>', 'noun'),
 ('<preposition>', 'preposition'),
 ('<noun>', 'noun')]
Note that the capture groups are returned in the order of their opening parentheses, so the entire placeholder is the first member of each tuple and the contained text is the second. We can iterate over this list, unpacking each tuple into variables:
>>> for placeholder, name in matches:
... print(f'Give me {name}')
...
Give me adjective
Give me noun
Give me preposition
Give me noun
Figure 1. Because the list contains 2-tuples, we can unpack them into two variables in the for loop.
You should insert the correct article (“a” or “an”, see the “Crow’s Nest” exercise) to use as the prompt for input.
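One way to pick the article is sketched below. The helper name choose_article is hypothetical, and the rule is deliberately simple: it looks only at the first letter, which is enough for part-of-speech names like “adjective” and “noun.”

```python
def choose_article(word):
    """Return 'an' for words starting with a vowel, else 'a' (hypothetical helper)."""
    # Assumes word is non-empty, which holds for anything the
    # placeholder regex captures, since it requires one or more characters.
    return 'an' if word[0].lower() in 'aeiou' else 'a'


for name in ['adjective', 'noun', 'preposition']:
    print(f'Give me {choose_article(name)} {name}: ')
```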
Halting and printing errors
If you find no placeholders in the text, you need to print an error message. It’s common to print error messages to STDERR (standard error), and the print function allows us to specify a file argument. We’ll use sys.stderr which is like an already open file handle (no need to open it):
print('This is an error!', file=sys.stderr)
If there are no placeholders, then we should exit the program with an error value to indicate to the operating system that our program failed to run properly. In the Unix world, the normal exit value is 0 as in “zero errors,” and we need to exit with some int value which isn’t 0. I always use 1:
sys.exit(1)
One of the tests checks if your program can detect missing placeholders and if your program exits correctly.
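The two steps can be bundled together if you like. Here is a minimal sketch; the die function is a hypothetical helper of my own naming, not something the tests require.

```python
import sys


def die(message, status=1):
    """Print message to STDERR and exit with a non-zero status (hypothetical helper)."""
    print(message, file=sys.stderr)
    sys.exit(status)  # raises SystemExit, so nothing after this call runs


# Example call (commented out so that importing this sketch doesn't exit):
# die('"no_blanks.txt" has no placeholders.')
```

Because sys.exit works by raising SystemExit, a test can catch that exception and inspect its code attribute to confirm the exit value.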
Getting the values
For each one of those parts of speech, you need a value that comes either from the --inputs argument or directly from the user. If we have nothing for --inputs, then you can use the input function to get some answer from the user. The function takes a str value to use as a prompt:
>>> value = input('Give me an adjective: ')
Give me an adjective: blue
It returns a str value of whatever the user typed before hitting the Return key:
>>> value
'blue'
If you have values for the inputs, use those and don’t bother with the input function. Assume that you’re always given the correct number of inputs for the number of placeholders in the text.
The inputs are provided in the same order as the placeholders they replace.
Assume this:
>>> inputs = ['surly', 'car', 'under', 'bicycle']
You need to remove and return the first string, “surly,” from inputs. The list.pop method is what you need, but it wants to remove the last element by default:
>>> inputs.pop()
'bicycle'
The list.pop method takes an optional argument to indicate the index of the element you want to remove. Can you figure out how to make that work?
Substituting the text
When you have values for each of the placeholders, you need to substitute them into the text. I suggest you look into the re.sub function that replaces text matching a given regular expression with some given text. I recommend you read help(re.sub):
sub(pattern, repl, string, count=0, flags=0)
    Return the string obtained by replacing the leftmost
    non-overlapping occurrences of the pattern in string by the
    replacement repl.
I don’t want to give away the ending, but you need to use a pattern similar to the one above to replace each <placeholder> with each value.
Note that it’s not a requirement to use the re functions to solve this. I challenge you, in fact, to try writing a manual solution that doesn’t use the re module at all! Now go write the program and use the tests to guide you!
Solution
#!/usr/bin/env python3
"""Mad Libs"""

import argparse
import re
import sys


# --------------------------------------------------
def get_args():
    """Get command-line arguments"""

    parser = argparse.ArgumentParser(
        description='Mad Libs',
        formatter_class=argparse.ArgumentDefaultsHelpFormatter)

    parser.add_argument('file',  # ❶
                        metavar='FILE',
                        type=argparse.FileType('r'),
                        help='Input file')

    parser.add_argument('-i',  # ❷
                        '--inputs',
                        help='Inputs (for testing)',
                        metavar='input',
                        type=str,
                        nargs='*')

    return parser.parse_args()


# --------------------------------------------------
def main():
    """Make a jazz noise here"""

    args = get_args()
    inputs = args.inputs
    text = args.file.read().rstrip()  # ❸
    blanks = re.findall('(<([^<>]+)>)', text)  # ❹

    if not blanks:  # ❺
        print(f'"{args.file.name}" has no placeholders.', file=sys.stderr)  # ❻
        sys.exit(1)  # ❼

    tmpl = 'Give me {} {}: '  # ❽
    for placeholder, pos in blanks:  # ❾
        article = 'an' if pos.lower()[0] in 'aeiou' else 'a'  # ❿
        answer = inputs.pop(0) if inputs else input(tmpl.format(article, pos))  # ⓫
        text = re.sub(placeholder, answer, text, count=1)  # ⓬

    print(text)  # ⓭


# --------------------------------------------------
if __name__ == '__main__':
    main()
❶ The file argument should be a readable file.
❷ The --inputs option may have zero or more strings.
❸ Read the input file, stripping off the trailing newline.
❹ Use a regex to find all the matches for a left angle bracket followed by one or more of anything which isn’t a left or right angle bracket followed by a right angle bracket. Use two capture groups to capture the entire expression and the text inside the brackets.
❺ If there are no placeholders…
❻ Print a message to STDERR that the given file name contains no placeholders.
❼ Exit the program with a non-zero status to indicate an error to the operating system.
❽ Create a string template for the prompt to ask for input from the user.
❾ Iterate through the blanks, unpacking each tuple into variables.
❿ Choose the correct article based on the first letter of the name of the part of speech (pos), “an” for those starting with a vowel and “a” otherwise.
⓫ If there are inputs, remove the first one for the answer; otherwise use input to prompt the user for a value.
⓬ Replace the current placeholder text with the answer from the user. Use count=1 to ensure that only the first value is replaced. Overwrite the existing value of text to replace all the placeholders by the end of the loop.
⓭ Print the resulting text to STDOUT.
Discussion
Defining the arguments
If you define the file with type=argparse.FileType('r'), then argparse verifies that the value is a file, creating an error and usage if it isn’t, and then opens it for you. Quite the time saver. I also define --inputs with nargs='*' to get any number of strings as a list. If nothing is provided, the default value is None; be sure you don’t assume it’s a list and try doing list operations on a None.
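To make the None case concrete, here is a small sketch that builds a parser with only the --inputs option and hands parse_args an explicit argument list, which is convenient for experimenting in the REPL. The variable names absent, empty, and given are mine.

```python
import argparse

parser = argparse.ArgumentParser(description='Mad Libs')
parser.add_argument('-i', '--inputs', metavar='input', type=str, nargs='*')

# Passing a list to parse_args stands in for the real command line.
absent = parser.parse_args([]).inputs                     # option not given
empty = parser.parse_args(['-i']).inputs                  # option given, no values
given = parser.parse_args(['-i', 'surly', 'car']).inputs  # option given with values

print(absent, empty, given)
```

Both None and an empty list are “falsey,” which is why an expression like inputs.pop(0) if inputs else input(...) is safe in either case.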
Substituting with regular expressions
A subtle bug awaits you when you use re.sub. Suppose we replaced the first <adjective> with “blue” and we have this:
text = 'The quick blue <noun> jumps <preposition> the lazy <noun>.'
Now we want to replace <noun> with “dog,” and try this:
>>> text = re.sub('<noun>', 'dog', text)
Let’s check on the value of text now:
>>> text
'The quick blue dog jumps <preposition> the lazy dog.'
Because there were two instances of the string <noun>, both got replaced with “dog.”
We must use count=1 to ensure that only the first occurrence changes:
>>> text = 'The quick blue <noun> jumps <preposition> the lazy <noun>.'
>>> text = re.sub('<noun>', 'dog', text, count=1)
>>> text
'The quick blue dog jumps <preposition> the lazy <noun>.'
And now we can keep moving to replace the other placeholders.
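One more wrinkle worth knowing about: re.sub treats its first argument as a regular expression and processes backslash escapes in the replacement string. That is harmless for placeholders like <noun>, but a placeholder containing regex metacharacters such as parentheses would not match itself literally. The sketch below is one way to guard against that; replace_first is a hypothetical helper of my own, not part of the solution above.

```python
import re


def replace_first(text, placeholder, answer):
    """Replace the first literal occurrence of placeholder (hypothetical helper)."""
    # re.escape neutralizes regex metacharacters in the placeholder;
    # passing a function as repl returns the answer with no backslash processing.
    return re.sub(re.escape(placeholder), lambda match: answer, text, count=1)


print(replace_first('I <verb (past)> the <noun>.', '<verb (past)>', 'ate'))
```

For purely literal replacement, text.replace(placeholder, answer, 1) does the same job without the re module.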
Finding the placeholders without regular expressions
I trust the explanation of the regex solution in the introduction was sufficient. I find that solution fairly elegant, but it’s certainly possible to solve this without using regexes. Here’s how I might solve it manually.
First I need a way to search the text for <…>. I start off by writing a test that helps me imagine what I might give to my function and what I might expect in return for both good and bad values. I decided to return None when the pattern is missing and to return a tuple of (start, stop) indices when the pattern is present:
def test_find_brackets():
    """Test for finding angle brackets"""

    assert find_brackets('') is None  # ❶
    assert find_brackets('<>') is None  # ❷
    assert find_brackets('<x>') == (0, 2)  # ❸
    assert find_brackets('foo <bar> baz') == (4, 8)  # ❹
❶ Because there’s no text, it should return None.
❷ Angle brackets lack any text inside, and this should return None.
❸ The pattern should be found at the beginning of a string.
❹ The pattern should be found further into the string.
Now to write the code that satisfies that test. Here’s what I wrote:
def find_brackets(text):
    """Find angle brackets"""

    start = text.index('<') if '<' in text else -1  # ❶
    # Search for '>' beginning two positions after '<' so an earlier '>' is ignored
    stop = text.index('>', start + 2) if start >= 0 and '>' in text[start + 2:] else -1  # ❷
    return (start, stop) if start >= 0 and stop >= 0 else None  # ❸
❶ Find the index of the left bracket if one is found in the text.
❷ Find the index of the right bracket if one is found starting two positions after the left.
❸ If both brackets were found, return a tuple of their start and stop positions, otherwise return None.
This function works well enough to pass the given tests, but it’s not quite correct because it returns a region that contains unbalanced brackets:
>>> text = 'foo <<bar> baz'
>>> find_brackets(text)
(4, 9)
>>> text[4:10]
'<<bar>'
That may seem unlikely, but I chose angle brackets to make you think of HTML tags like <head> and <img>. HTML is notorious for being incorrect, maybe because it was hand-generated by a human who messed up a tag or because some tool that generated the HTML had a bug. The point is that most web browsers have to be fairly relaxed in parsing HTML, and it’s not unexpected to see a malformed tag like <<head> instead of the correct <head>.
The regex version, on the other hand, specifically guards against matching internal brackets by using the class [^<>] to define text that can’t contain any angle brackets. I could write a version of find_brackets that finds only balanced brackets, but, honestly, it’s not worth it. This function points out that one of the strengths of the regex engine is that it can find a partial match (the first left bracket), see that it’s unable to make a complete match, and start over (at the next left bracket). Writing this is tedious and, frankly, not that interesting.
Still, this function works for all the given test inputs. Note that it only returns one set of brackets at a time. This is because I’ll alter the text after I find each set of brackets, which is likely to change the start and stop positions of any following brackets, so it’s best to handle one set at a time.
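If you’re curious what a balanced version might involve, here is a minimal sketch. The name find_balanced_brackets is hypothetical and this is not part of the solution; it mimics the “start over at the next left bracket” behavior by remembering only the most recent < it has seen.

```python
def find_balanced_brackets(text):
    """Return (start, stop) of the first <...> with no inner brackets, else None."""
    start = None
    for i, char in enumerate(text):
        if char == '<':
            start = i  # start over at the most recent left bracket
        elif char == '>':
            # Require at least one character between the brackets
            if start is not None and i - start > 1:
                return (start, i)
            start = None  # stray or empty pair, keep looking
    return None


print(find_balanced_brackets('foo <<bar> baz'))
```

On the unbalanced example this returns (5, 9), the same region the regex version matches, and it still agrees with the four cases in test_find_brackets.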
Here’s how I’d incorporate it into the main function:
def main():
    args = get_args()
    inputs = args.inputs
    text = args.file.read().rstrip()
    had_placeholders = False  # ❶
    tmpl = 'Give me {} {}: '  # ❷

    while True:  # ❸
        brackets = find_brackets(text)  # ❹
        if not brackets:  # ❺
            break  # ❻

        start, stop = brackets  # ❼
        placeholder = text[start:stop + 1]  # ❽
        pos = placeholder[1:-1]  # ❾
        article = 'an' if pos.lower()[0] in 'aeiou' else 'a'  # ❿
        answer = inputs.pop(0) if inputs else input(tmpl.format(article, pos))  # ⓫
        text = text[0:start] + answer + text[stop + 1:]  # ⓬
        had_placeholders = True  # ⓭

    if had_placeholders:  # ⓮
        print(text)  # ⓯
    else:
        print(f'"{args.file.name}" has no placeholders.', file=sys.stderr)  # ⓰
        sys.exit(1)  # ⓱
❶ Create a variable to track whether we find placeholders. Assume the worst.
❷ Create a template for the input prompt.
❸ Start an infinite loop. The while continues as long as it has a “truthy” value as True will always be.
❹ Call the find_brackets function with the current value of text.
❺ If the return is None, then this is “falsey.”
❻ If there are no brackets found, use break to exit the infinite while loop.
❼ Now that we know brackets isn’t None, unpack the start and stop values.
❽ Find the entire <placeholder> value by using a string slice with the start and stop values, adding one to the stop to include that index.
❾ The “part of speech” is the bit inside, and this extracts adjective from <adjective>.
❿ Choose the correct article for the part of speech.
⓫ Get the answer from the inputs or from an input call.
⓬ Overwrite the text using a string slice up to the start, the answer, and then the rest of the text from the stop.
⓭ Note that we saw a placeholder.
⓮ We exit the loop when we no longer find placeholders. Check if we ever saw one.
⓯ If we did see a placeholder, print the new value of the text.
⓰ If we never saw a placeholder, print an error message to STDERR.
⓱ Exit with a non-zero value to indicate an error.
Review
- Regular expressions are almost like functions where we describe the patterns we want to find. The regex engine does the work of trying to find the patterns, handling mismatches, and starting over to find the pattern in the text.
- Regex patterns with * or + are “greedy” in that they match as many characters as possible. Adding a ? after them makes them “not greedy” to match as few characters as possible.
- The re.findall function returns a list of all the matching strings or capture groups for a given pattern.
- The re.sub function substitutes a pattern in some text with new text.
Going Further
- Extend your code to find all the HTML tags enclosed in <…> and </…> in a web page you download from the Internet.
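As a starting point, here is a rough sketch that works on a small HTML string rather than a downloaded page. The sample html value and the pattern are my own assumptions for illustration; real-world HTML is messy enough that a proper parser is usually the safer tool.

```python
import re

# A tiny stand-in for a downloaded page (assumed sample, not real data)
html = '<html><head><title>Hi</title></head><body><p class="x">Text</p></body></html>'

# Group 1 is '/' for a closing tag (else empty); group 2 is the tag name.
tags = re.findall(r'<(/?)([A-Za-z][A-Za-z0-9]*)[^<>]*>', html)

opening = [name for slash, name in tags if not slash]
closing = [name for slash, name in tags if slash]

print(opening)
print(closing)
```

The [^<>]* part reuses the negated character class from earlier to skip over any attributes without running past the end of the tag.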