|
From Python How-To by Yong Cui. This article discusses alternatives to lists and tuples for use a data structures. |
Among the various sequence data types, it’s no doubt that the versatility of their features makes lists and tuples satisfactory data containers in many common situations. However, when you move onto more specific projects, you’ll find out that lists and tuples may become less ideal. Thus, you should be open-minded about alternative data structures that can be the correct choice depending on the use cases. In this section, let’s review some common use scenarios and the recommended alternatives.
Using sets when membership is concerned
We often need to check if the data container has the specific item under examination, a functionality that is termed membership checking. With lists and tuples, we’ve learned that we can use either item in the list
to check the membership or the index method as an indirect way to determine if a list contains a specific item. Please note that using index is less desired because when the item isn’t in the list, the ValueError
exception is raised.
Although lists support membership testing, you should consider using sets when membership is concerned in your application. Python requires that all the items in a set are unique because under the hood, sets are implemented using a hash table, which offers a significant benefit of constant item look-up time, known as the O(1) time complexity. By contrast, membership testing look-up time is linear with the length of the list, because Python needs to traverse the sequence to find a potential match. The more items a list has, the longer time it costs for the traverse. Thus, you should use sets when membership testing is concerned in your application.
Using deques if you care about first-in-first-out
In certain applications, we want our data to have the first-in, first-out (FIFO) capability. The FIFO emphasizes that the items that are added to the sequence first (first-in) are removed from the sequence first (first-out).
Suppose that we’re building an online customer chat system for an enterprise. Throughout business hours, clients check in and we use a list to track the order of the check-in sequence. It’s reasonable to connect those who check in first to the customer support associates, which represents a FIFO need in the application. Let’s first see a possible solution using lists, as shown in the code snippet below.
Listing 1 Using lists to create the client queue system
clients = list() def check_in(client): clients.append(client) #A print(f"in: New client {client} joined the queue.") def connect_to_associate(associate): if clients: #B client_to_connect = clients.pop(0) #C print(f"out: Remove {client_to_connect}, connecting to {associate}.") else: print("No more clients are waiting.")
#A Append a new item to the end of the list.
#B Examine whether there are items in the list. When the list is empty, pop will result in an IndexError.
#C Remove the first item in the list.
In the snippet, the check_in
function adds a new client to the end of the waiting queue, which is a list object named clients
. Once an associate becomes available, we connect the first client in the queue to the associate. To retrieve the first client, we use the pop
method of list objects. This method doesn’t only return the first item in the list, but it also removes it.
The removal of the item at the beginning of a list object is significant. Because under the hood, Python shifts each of the items in the list to adjust the vacancy of the first item in the memory, which is an expensive operation with a time complexity of O(n). Given its considerable complexity, we should consider an alternative solution — using deques.
The deque data type is a double-ended queue, and its pronunciation is the same as “deck.” Because of its double-ended feature, it supports insertion and removal from both ends, making it a perfect data type for implementing the client chat management system which requires FIFO. As mentioned previously, the invocation of pop
method on a list object is an expensive operation time- and memory-wise. By contrast, because deques’ both ends are open, it’s a computationally trivial operation to remove the leftmost item from a deque. Refer to figure 1 for an illustration of the contrast between lists and deques.
Figure 1. Removing the first item in a list vs. a deque. Removing the leftmost item of a list requires the shifting of all remaining item, making it an O(n) operation, while removing the leftmost item of a deque requires no actions on the remaining items, making it an O(1) operation.
Let’s make a direct comparison between lists and deques for this operation. Consider the following simplified set-up about removing the first item in the waiting queue. Please note that the following code includes the usage of a lambda function.
Listing 2 Comparing the performance between deques and lists
from collections import deque #A from timeit import timeit #B def time_fifo_testing(n): integer_l = list(range(n)) integer_d = deque(range(n)) t_l = timeit(lambda : integer_l.pop(0), number=n) t_d = timeit(lambda : integer_d.popleft(), number=n) #C return f"{n: >9} list: {t_l:.6e} | deque: {t_d:.6e}" numbers = (100, 1000, 10000, 100000) for number in numbers: print(time_fifo_testing(number)) # output something like the following lines: 100 list: 6.470000e-05 | deque: 3.790000e-05 1000 list: 7.637000e-04 | deque: 3.435000e-04 10000 list: 1.805050e-02 | deque: 2.134700e-03 100000 list: 1.641030e+00 | deque: 1.336000e-02
#A The deque data type is available in the collections module in the standard library.
#B The timeit function calculates the average execution time of an expression.
#C The popleft method pops the first item from the beginning of the deque.
The performance gain in this trivial example using deques over lists is significant with two orders of magnitude for 100,000 items. For enterprise applications, such improvement in a single aspect can be essential for improving overall user experiences. Importantly, using the deque data type doesn’t involve any complicated implementations. So why not enjoy the performance gain without any cost other than using a built-in data type? Here’s the modified implementation using deques.
Listing 3 Using lists to create the client queue system
from collections import deque clients = deque() def check_in(client): clients.append(client) print(f"in: New client {client} joined the queue.") def connect_to_associate(associate): if clients: client_to_connect = clients.popleft() print(f"out: Remove {client_to_connect}, connecting to {associate}.") else: print("No more clients are waiting.")
Processing multi-dimensional data with NumPy and Pandas
So far, we’ve been focused on linear sequence data structures, such as lists, tuples, and strings. However, in real-life, many data resume a multi-dimensional shape, such as images and videos. For instance, images can be mathematically represented as three layers (red, green, and blue) of two-dimensional pixel panels. With these high dimensional data, it can be a nightmare if you try to use these basic data models to represent them. Fortunately, Python’s open-source nature has bolstered the development of many third-party libraries and packages for processing multi-dimensional large-scale datasets. Thus, instead of using lists, we should consider alternatives that are designed for computationally heavy jobs.
For example, if you need to work on a large amount of numeric data, you should consider using NumPy arrays, which are the core data type implemented in the NumPy package. Importantly, lots of related manipulations are available in the package, such as reshaping, transformation, and various arithmetic operations.
If you need to work on a spreadsheet-like data with mixed data types (e.g. strings, dates, numbers), you should consider using Pandas DataFrame, one of the core data types implemented in the Pandas packages. If you do machine learning, you need to use tensors, which are the most important data types in major machine learning frameworks, such as TensorFlow and PyTorch.
If your applications deal with a large amount of multi-dimensional data, especially in the form of numeric values, you should take advantage of these third-party libraries, which have specialized data types and their associated methods to ease your life.
Summary and discussion
In this section, we reviewed the essential alternative sequential data models to lists or tuples. Certainly, we’re not trying to exhaust all available data models. Instead, I just want to convey the message that you should be open-minded about the data model choices. The decision must be driven by the specific business need.
When you’re concerned about membership testing, you should use sets, which have the advantage of fast and constant look-up time for membership checking. In addition, by using sets, you make it clear to the readers that you don’t care about the order of the items in the container. When your applications require FIFO operations, you should use the built-in deque data type. As a double-ended sequence type, deques allow you to effectively perform insertion and removal operations at both ends. With the trend of data science and machine learning, more applications have become centered at multi-dimensional data. Manipulating these data is computationally heavy. You should consider specialized data structures implemented in third-party packages, such as NumPy and Pandas.
Challenges
I’ve mentioned that lists are not the desired data model for manipulating multi-dimensional numeric data. As a challenge, can you use two list objects to hold two 3-dimensional datasets, respectively? As a hint, you need to embed lists in another list object.
Can you also write some code to add these two list objects at the corresponding index? Just for a comparison, would you please find out how you can easily get these two jobs done with NumPy? If you want to find out, check out the book here!