An excerpt from Publishing Python Packages by Dane Hillard
This article delves into what packages are exactly.
Read it if you want to learn more about Python packages.
Imagine that you work for a company called CarCorp, and they have given you a project. You have to come back in a few weeks with an overhauled process that will help customers install your software in a snap. You know that some of your favorite Python projects, like requests, are available as packages online, and you want to provide the same ease of installation to your own consumers.
Packaging is the act of archiving software along with metadata that describes those files. Developers usually create these archives, or packages, with the intent of sharing or publishing them.
The Python ecosystem uses the word “package” for two distinct concepts. The Python Packaging Authority (PyPA) differentiates the terms in the Python Packaging User Guide (https://packaging.python.org):
- Import packages organize multiple Python modules into a directory for discovery purposes (https://packaging.python.org/glossary/#term-Import-Package).
- Distribution packages archive Python projects to be published for others to install (https://packaging.python.org/glossary/#term-Distribution-Package).
Import packages aren’t always distributed in an archive, though distribution packages often contain one or more import packages. Distribution packages are the main subject of this article, and will be disambiguated from import packages where necessary to avoid confusion.
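The import-package half of this distinction is easy to see in code. The following sketch builds a minimal import package on the fly; the `greetings` package name and its contents are made up for illustration:

```python
import importlib
import sys
import tempfile
from pathlib import Path

# An import package is just a directory of Python code containing an
# __init__.py, placed somewhere the interpreter can discover it.
root = Path(tempfile.mkdtemp())
pkg = root / "greetings"
pkg.mkdir()
(pkg / "__init__.py").write_text("message = 'hello from an import package'\n")

# Putting the parent directory on sys.path makes the package importable.
sys.path.insert(0, str(root))
greetings = importlib.import_module("greetings")
print(greetings.message)  # hello from an import package
```

A distribution package, by contrast, is an archive (such as a wheel or sdist) that ships a directory like this one, plus metadata, to other people's machines.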
With a practically infinite number of ways to roll software and its metadata together, how do maintainers and users of that software manage expectations and reduce manual work? That’s where package management systems come in.
Standardizing packaging for automation
Package management systems, or package managers, standardize the archive and metadata format for software packages in a particular domain. Package managers provide tools to help consumers install dependencies at the project, programming language, framework, or operating system levels. Most package managers ship with a familiar set of instructions to install, uninstall, or update packages. You may have used some of the following package managers:
- pip (https://pip.pypa.io)
- conda (https://docs.conda.io)
- Homebrew (https://brew.sh/)
- NPM (https://www.npmjs.com/)
- asdf (https://asdf-vm.com/)
The early days of package management
Although developers had been packaging their code informally for some time, it wasn’t until package management systems became widely available in the early 1990s that this approach took off (see Jeremy Katz, “A brief history of package management,” Tidelift, https://blog.tidelift.com/a-brief-history-of-package-management).
The ability to declaratively define project dependencies proved a boon to developer productivity by abstracting away a major area of legwork in managing software projects.
Software repositories standardize packaging further by acting as centralized marketplaces to publish and host packages others can install (figure 1). Many programming language communities provide an official or de facto standard repository for installing packages. PyPI (https://pypi.org), RubyGems (https://rubygems.org/), and Docker Hub (https://hub.docker.com/) are a few popular software repositories.
Figure 1. Packages, package managers, and software repositories are all critical in sharing software.
If you own a smartphone, tablet, or desktop computer and you’ve installed apps from an app store, that’s packaging at work. Packages are software bundled together with metadata about that software, and that’s precisely what an app is. Software repositories host software that people can install, and that’s what the app store is.
So, packages are software and metadata rolled together in an agreed-upon format codified in the relevant package management system. At a more granular level, packages also typically include a way to build the software on a user’s system, or they may provide several prebuilt versions of the software for a variety of target systems.
The contents of a distribution package
Figure 2 shows some of the files you might choose to put in a distribution package. Developers often include the source code files in a package, but they can also provide compiled artifacts, test data, and whatever else a consumer or colleague might need. By distributing a package, you give your consumers a one-stop shop to grab all the pieces they need to get started with your software.
Figure 2. A package often includes source code, a Makefile for compiling the code, metadata about the code, and instructions for the consumer.
Distributing non-code files is an important capability. Although the code is often the reason to distribute anything in the first place, many users and tools depend on the metadata about the code to differentiate it from other code. Developers usually specify the name of a software project, its creator(s), the license under which it can be reused, and so on in the metadata. Importantly, the metadata often includes the version of the archive to distinguish it from previous and future publications of the project.
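In modern Python packaging, this kind of metadata typically lives in a `pyproject.toml` file. A minimal sketch might look like the following (the project name, author, and license here are hypothetical):

```toml
[project]
name = "carcorp-nav"                  # hypothetical project name
version = "1.0.0"                     # distinguishes this publication from others
description = "Navigation utilities for CarCorp vehicles"
authors = [{ name = "Your Name" }]
license = { text = "MIT" }
requires-python = ">=3.8"
```

Tools like pip read these fields to display, index, and resolve the package without ever executing its code.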
The early days of sharing software
For more than a decade after the Unix operating system first became available, sharing software between teams and individuals remained a largely manual process. Downloading source code, compiling it, and contending with the artifacts of the compilation were all left up to the person trying to use the code. Each step in this process introduced opportunities for failure due to human error and architectural or environmental differences between systems. Tools like Make (https://www.gnu.org/software/make/) removed some of this variation from the process, but stopped shy of package version, dependency, and installation management.
Now that you’re familiar with what goes into a package, you’ll learn how this approach to sharing software solves specific problems in practice.
The challenges of sharing software
The first iteration of your project has been delivered and your boss drags you into a meeting and demands to know why it isn’t working. You realize you forgot to have them install all your project’s dependencies first. You back up a few steps and navigate them through the dependency installation. Unfortunately, you forgot to check which version you’ve been using for one of your major dependencies and the latest version doesn’t seem to work. You walk them through installing each previous version until you finally find one that works. Crisis narrowly averted.
As you develop increasingly complex systems, the effort to make sure you’ve installed the required version of each dependency correctly grows quickly. In the worst cases you might reach a point where you need two different versions of the same dependency, and they can’t coexist. This is affectionately known as “dependency hell.” Detangling a project from this point can prove challenging.
Even without running into dependency hell, sharing software without a standardized approach to packaging makes it difficult for anyone, anywhere to know which dependencies they need to install for your project. Software communities create conventions and standards for managing packages, codifying those practices into the package management systems you use to get your work done.
Now that you understand why packaging is good for sharing software, read on to learn some of the advantages packaging can provide even if you aren’t always making your software publicly available.
How packaging helps you
If you’re new to packaging, so far it may seem like it’s mainly useful for sharing software with people across the globe. Although that’s certainly a good reason to package your code, you may also like some of the benefits packaging brings when developing software:
- Stronger cohesion and encapsulation
- Clearer definition of ownership
- Looser coupling between areas of the code
- More opportunity for composition
The following sections cover these benefits in detail.
Enforcing cohesion and encapsulation through packaging
A particular area of code should generally have one job. Cohesion measures how dutifully the code sticks to that job. The more stray functionality floating around, the less cohesive the code is.
You’ve probably used functions, classes, modules, and import packages to organize your Python code (see Hillard, Dane. “The hierarchy of separation in Python,” Practices of the Python Pro, Manning Publications, 2020, pp. 25–39, https://manning.com/books/practices-of-the-python-pro). These constructs each place a kind of named boundary around areas of code that have a particular job. When done well, naming communicates to developers what belongs inside the boundary and, importantly, what doesn’t.
Despite best efforts, names and people are rarely perfect. If you put all your Python code in a single application, chances are some code will eventually seep into areas it doesn’t belong. Think about some of the larger projects you’ve developed. How many times did you create a
helpers.py module containing a grab bag of functionality? The boundaries you create with a function or a module are readily overcome. These “utility” areas of the code tend to attract new “utilities,” with the cohesion trending down over time.
Imagine that your self-driving car system can use lidar (https://oceanservice.noaa.gov/facts/lidar.html) as one type of input. CarCorp’s vehicles don’t include lidar sensors. Being the diligent developer you are, you create a lidar-specific part of the code base to separate it from other concerns. Although assessing naming and regularly refactoring the code base can keep cohesion higher, it’s also a maintenance burden. Distribution packages increase the barrier to adding code where it may not belong in the first place. Because updating a package necessitates going through a cycle of packaging, publishing, and installing the update, it prompts developers to think more deeply about the changes they make: you’ll be less likely to add code to a package unless the change is worth the investment of the update cycle.
Creating cohesion and packaging a cohesive area of code is a gateway into encapsulation. Encapsulation helps you build the right expectations with your consumers about how to interact with your code by defining if and how the code’s behavior is exposed. Think of a project you built and shared with someone to use. Now think about how many times you changed your code, and how many times they had to change their code in turn. How frustrating was it for them? How about for you? Encapsulation can reduce this kind of churn by defining an API contract that’s less subject to change. Figure 3 shows how you might create multiple packages out of cohesive areas of code.
Figure 3. Packaging can reduce unexpected interdependence between areas of code by introducing stronger boundaries.
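In Python, one common encapsulation convention is to mark internal details with a leading underscore so that only the intended API reads as public. A minimal sketch, using a hypothetical `NavigationClient`:

```python
class NavigationClient:
    """Public API: consumers should rely only on plan_route()."""

    def plan_route(self, start, end):
        # The stable, documented contract consumers depend on
        return self._optimize([start, end])

    def _optimize(self, waypoints):
        # The leading underscore signals an internal detail that may
        # change between releases without notice
        return list(waypoints)

client = NavigationClient()
print(client.plan_route("A", "B"))  # ['A', 'B']
```

Consumers who stick to `plan_route` are insulated from any rewrite of `_optimize`, which is exactly the kind of boundary a package can enforce at a larger scale.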
You might’ve felt frustration in the past when you found out a piece of code meant only for internal use within a module was being used widely throughout the code. Each time you update that “internal” code, you need to update its usages elsewhere. This high-churn environment can lead to bugs when you don’t propagate a change everywhere, leaving you or your team that much less productive.
Well-encapsulated, highly cohesive code will change rarely even when used widely. This kind of code is sometimes labeled “mature.” Mature code is a great candidate for distributing as a package because you won’t need to republish it frequently. You can get a start in packaging by extracting some of the more mature code from your code base, then use what you know about cohesion and encapsulation to bring less mature code up to snuff.
Promoting clear ownership of code
Teams benefit from clear ownership over areas of code. Ownership often goes beyond maintaining the behavior of the code itself. Teams build automation to streamline unit testing, deployment, integration testing, performance testing, and more. That’s a lot of plates to keep spinning at once. Keeping the scope of a bounded area of code small so that a team can own all these aspects will ensure the code’s longevity. Packaging is one tool for managing scope.
The encapsulation you create through packaging code enables you to develop automation independent of other code. As an example, automation for a code base with little structure may require you to write conditional logic to determine which tests to run based on which files changed. Alternatively, you might run all the tests for every change, which can be slow. Creating packages that you can test and publish independently of other code will result in clearer mappings from source code to test code to publication code (figure 4).
Figure 4. Teams can take full ownership over individual packages, defining how they want to manage the development, testing, and publishing lifecycle.
A clear delineation of purpose for a package makes it likelier to have a clear delineation of ownership. If a team isn’t sure what they’re committing to by taking ownership of some code, they’re going to be wary. Try providing a package with a clear scope, story, and operator’s manual to see how the mood shifts.
Decoupling implementation from usage
You may have heard the term loose coupling to describe the level of interdependence between areas of code.
The cohesion and encapsulation practices you read about earlier in this chapter are a way to reduce the likelihood of tight coupling due to poor code organization. Highly cohesive code will have tight coupling within itself, and loose coupling to anything outside its boundary. Encapsulation exposes an intentional API, limiting any coupling to that API. Your choices about packaging and encapsulation, then, help you decouple your consumers from implementation details in your code. Packaging also makes it possible to decouple consumers from implementation through versioning, namespacing, and even the programming language in which software is written.
In a big ball of mud, you’re stuck running whatever code is in each module. If you or someone on your team updates a module, all code using that module needs to accommodate the change immediately. If the update changes a call signature or a return value, it may have a wide blast radius. Packaging significantly reduces this restriction (figure 5).
Figure 5. Packaging provides flexibility so two areas of code can evolve at different rates
Imagine if each update to the
requests package required you to react immediately by updating your own code. That would be a nightmare! Because packages version the code they contain, and because consumers can specify which version they want to install, a package can be updated many times without impacting consuming code. Developers can choose precisely when to incur the effort of updating their code to accommodate a change in a more recent version of the package.
Another point at which you can decouple code is namespacing. Namespaces attach values and behavior to human-readable names. When you install a package, you make it available at the namespace it specifies. As an example, installing the requests package makes it available in the requests namespace, so consumers access it with import requests.
Different packages can have the same namespace. This means they could conflict if you install more than one of them, but it also makes something interesting possible: packages can act as drop-in alternatives to one another. If a developer creates an alternative to a popular package that’s faster, safer, or more maintainable, you can install it in place of the original as long as the API is the same. As an example, the following packages all provide roughly equivalent MySQL (https://www.mysql.com) client functionality (specifically, they implement some level of compatibility with PEP 249, https://www.python.org/dev/peps/pep-0249/):
- mysqlclient (https://github.com/PyMySQL/mysqlclient)
- PyMySQL (https://github.com/PyMySQL/PyMySQL)
- mysql-python (https://github.com/arnaudsj/mysql-python)
- oursql (https://github.com/python-oursql/oursql)
Finally, Python packaging can even decouple usage in Python from the language in which a package is written! Many Python packages are written in C and even Fortran for improved performance or integration with legacy systems. Package authors can provide pre-compiled versions of these packages alongside versions that can be built from source by the consumer if needed. This also makes packages more portable, decoupling developers somewhat from the details of the computer or server they’re using. You’ll learn more about packaging build targets in a later chapter.
If for no other reason, you might like to package some of your code to experiment with the freedom of version decoupling. See how your versioned packages evolve over time. Those that change quickly may point to low cohesion because the code has many reasons to change. On the other hand, it may indicate only that the code is still maturing. At the least, these data points will be observable! You’ll learn more about versioning in a later chapter.
Filling roles by composing small packages
The act of extracting code into multiple packages is a bit like decomposition. Successful decomposition requires a good handle on loose coupling. Decomposing code is an art that separates pieces of code so they can be recombined in new ways (for a wonderfully concise rundown of decomposition and coupling, see Josh Justice, “Breaking Up Is Hard To Do: How to Decompose Your Code”, Big Nerd Ranch, https://www.bignerdranch.com/blog/breaking-up-is-hard-to-do-how-to-decompose-your-code/).
By packaging smaller areas of your code, you’ll start to identify code that accomplishes a very specific goal that can be generalized or broadened to fulfill a role. As an example, you can create one-off HTTP requests using a built-in Python utility like
urllib.request.urlopen. Once you’ve done this a few times, you can see commonalities between the use cases and generalize the concept into a higher-level utility. So the
requests package isn’t built to make just one, specific HTTP request; it fills a general role as an HTTP client. Some of your code may be very specific now, but as you find new areas where you need similar behavior, you may see an opportunity to identify the role it fills, generalize a bit, and create a package that can fill that role.
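The generalization step described above can be sketched with the standard library alone. The `json_request` helper and the URL below are hypothetical illustrations, not part of any real API:

```python
from urllib.request import Request

# One-off: each call site repeats the same boilerplate
req = Request("https://example.com/api/cars",
              headers={"Accept": "application/json"})

# Generalized: a tiny helper that fills the role of
# "JSON API request builder" for every call site
def json_request(url, token=None):
    headers = {"Accept": "application/json"}
    if token:
        headers["Authorization"] = f"Bearer {token}"
    return Request(url, headers=headers)

req = json_request("https://example.com/api/cars")
print(req.get_header("Accept"))  # application/json
```

Packages like requests take this same progression much further, turning repeated one-off code into a reusable role-filling library.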
As you work on revamping your software for CarCorp, you remember that a major portion of the code deals with the car’s navigation systems. You realize that with a bit of tweaking, the navigation code will also work for Acme Auto’s vehicles. This code could fill the role of communicating with vehicle navigation systems. Because you’ve learned that packages can depend on other packages, and because your navigation system code is already fairly cohesive, you commit yourself to creating not one but two packages before your next CarCorp meeting.
A composition success story
You can see great examples of composition at play in packaging through Python frameworks like Django (https://www.djangoproject.com). Django is itself a package, and because it’s built as a plugin-based architecture, you can extend its functionality by installing and configuring additional packages. Peruse the hundreds of packages listed on Django Packages (https://djangopackages.org) to see the kind of wide adoption the packaging approach enjoys.
Thinking about composition and decomposition highlights the fact that distribution packages can exist at any size, just as functions, classes, modules, and import packages do. Look to cohesion and decoupling as guiding lights to strike the right balance. One hundred distribution packages that each provide a single function would be a maintenance burden, and one distribution package that provides one hundred import packages would be about the same as having no package at all. If all else fails, always ask yourself, “What role do I want this code to fill?”
Now that you’ve learned that packaging can help you write cohesive, loosely coupled code with clear ownership that you can deliver to consumers in an accessible way, I hope you’re rolling up your sleeves to dive into the details.
That’s all for this article. If you want to learn more about the book, check it out on Manning’s liveBook platform here.