Description

From Seriously Good Software by Marco Faella

This article delves into using memory storage space efficiently, thus making your software more efficient and resilient.


Take 37% off Seriously Good Software. Just enter fccfaella into the discount code box at checkout at manning.com.


Sometimes programmers need to store their data in as little space as possible. Contrary to intuition, this rarely happens because the device they’re targeting comes with little memory. Rather, it happens because the amount of data’s huge. For example, video games are a type of software that often push the limits of the hardware. No matter how many GB of memory the next console boasts, games will eventually run out of memory and start packing data in weird ways.

In this article, assume that your water-management program deals with millions, perhaps even billions of containers, and you want to keep as many of them as possible in main memory. Clearly, you’ll want to shrink the memory footprint of each container as much as possible. On the other hand, you don’t need to worry about the memory used by temporary local variables, because those live only the short time span of a method call.

For each implementation in this article, you’ll compare its memory footprint with the one of Reference. Reference is the baseline implementation for water containers, where each container knows its water amount (a doublefield), and the group of connected containers is represented by a HashSet field.

In the meantime, recall the fields used in that class:

 
 public class Container {
    private Set<Container> group;         private double amount;          

Containers connected to this one

 Amount of water in this container

The Water Container API

This section describes the desired API for the water containers. At the least, you’re going to build a Container class, endowed with a public constructor that takes no arguments and creates an empty container, and the following three methods:

  • public double getAmount()Returns the amount of water currently held in this container.
  • public void addWater(double amount)Pours amount units of water into this container. Water’s automatically and equally distributed among all containers which are connected, directly or indirectly, to this one. This method can also be used with a negative amount, to remove water from this container. In that case, there should be enough water in the group of connected containers to satisfy the request (we don’t want to leave a negative amount of water in a container).
  • public void connectTo(Container other)Connects this container with other, using a (permanent) pipe.

A connection between two containers is symmetric: water can flow in both directions. A set of containers connected by symmetric links form what is known in computer science as an undirected graph. The following box recalls the basic notions about such graphs.

Undirected graphs

In computer science, networks of pairwise connected items are called graphs. In this context, items are also known as nodes and their connections as edges. If connections are symmetric, the graph is called undirected, because the connections don’t have a specific direction. A set of items that are connected, directly or indirectly, is called a connected component. In this book, a maximal connected component is simply called a group.



 

A proper implementation of addWater in our container scenario requires connected components to be known, because water must be spread (or removed) evenly among all connected containers. In fact, the main algorithmic problem underlying the proposed scenario consists in maintaining knowledge of the connected components under node creation (new Container) and edge insertion (connectTo method), a type of dynamic graph connectivity problem.

Such problems are central to many applications involving networks of items: in a social network, connected components represent groups of people linked by friendship; in image processing, connected (in the sense of adjacent) regions of same-color pixels help identify objects in a scene; in computer networks, discovering and maintaining connected components is a basic step in routing.

Gently squeezing [Memory1]

You can do better than Reference with a few simple tricks. First, it’s unlikely that you need the resolution or range of double-precision numbers to represent the amount of water in a container, and we can save four bytes per container by downgrading the amount field from double to float. You need to downgrade the argument of addWater and the return type of getAmount accordingly, and you’re slightly modifying the public API.

Space-saving data types

Reduced-size data types play a limited role in the main Java API, but they’re well-supported in more specialized contexts where memory may be an issue. For example, Android provides a FloatMath class with common mathematical operations performed on floats instead of doubles. Also, in the Java specification for smart cards (aka Java Card), most integers occurring in the API are encoded as either shorts or bytes.

POP QUIZ 1  If your program contains ten occurrences of the string literal “Hello World”, how much memory is devoted to those strings?

Regarding the group field, its Set type was chosen in the reference implementation because it clearly expresses the intent that groups are unordered and contain no duplicates. By giving up this clarity of intent and switching to an ArrayList, you can save a significant amount of memory. After all, an ArrayList is a thin wrapper around a plain array, and the net memory cost of an extra element is four bytes. Your new Container class, nicknamed Memory1, should start as follows.

Listing 1. Memory 1: fields and method getAmount; no constructor is needed

public class Container {
   private List<Container> group; 
   private float amount;

   public float getAmount() { return amount; }

  It will be initialized with an ArrayList

Additionally, if many containers are never connected to a group, you can save space by instantiating the list only when needed (a technique called lazy initialization). An isolated container is represented by the group field equal to null. This choice allows you to provide no explicit constructor, although it also means that connectTo and addWater need to treat isolated containers as special cases, as you’ll see in a minute.

In general, you should be careful when migrating from a Set to a List, because you’ lose the ability to automatically reject duplicate elements. Luckily, you aren’t using that ability in Reference, because the groups merged by the connectTo method is guaranteed to be disjoint in the first place. You obtain the following implementation for connectTo.

Listing 2. Memory 1: method connectTo

   public void connectTo(Container other) {
      if (group==null) {              
         group = new ArrayList<>();
         group.add(this);
      }
      if (other.group==null) {        
         other.group = new ArrayList<>();
         other.group.add(other);
      }
      if (group==other.group) return; 

      int size1 = group.size(),       
          size2 = other.group.size();
      float tot1 = amount * size1,
            tot2 = other.amount * size2,
            newAmount = (tot1 + tot2) / (size1 + size2);

      group.addAll(other.group);      
      for (Container x: other.group) x.group = group;
      for (Container x: group) x.amount = newAmount;
   }

  If this is isolated, initialize its group

  If other is isolated, initialize its group

  Check if they are already connected

  Compute the new water amount

  Merge the two groups

Finally, the addWater method also needs to take into account the special case of an isolated container, to avoid dereferencing a null pointer.

Listing 3. Memory 1: method addWater

   public void addWater(double amount) {
      if (group==null) { 
         this.amount += amount;
      } else {
         double amountPerContainer = amount / group.size();
         for (Container c: group)
            c.amount += amountPerContainer;
      }
   }

  If this is isolated, update locally

We end this section by taking a look at the memory layout of this implementation. Assume you run the first three parts of UseCase, which consist in creating four containers (a to d) and running the following lines:

 
 a.addWater(12);
 d.addWater(8);
 a.connectTo(b);
 b.connectTo(c);
  

The scenario is illustrated in figure 1, and the corresponding memory layout of Memory1 is depicted in figure 2. The layout is similar to Reference, except for the ArrayList instead of HashSet, and for the null value in container d, instead of a reference to a one-element HashSet. The third difference with Reference, water amounts of type float instead of double, doesn’t show in the diagram.


Figure 1 The situation after executing the first three parts of Usecase. Containers a through c have been connected together, and water has been poured into a and d.


Figure 2. Memory layout of Memory 1 after executing the first three parts of Usecase.


Space and time complexity

To estimate the memory footprint of a Memory1 container, start by evaluating the size of an ArrayList, which is internally implemented as an array and a couple of bookkeeping fields. The length of the internal array is called the capacity of the ArrayList, to distinguish it from its size, which is the number of elements stored in it. The memory requirements of an ArrayList come from the following features:

  • 12 bytes for the standard object overhead;
  • 4 bytes for an integral field counting the number of structural modifications (insertions and deletions) ever performed on the list; this field is used to raise an exception if the list is modified during an iteration;
  • 4 bytes for the integral size field;
  • 4 bytes for the reference to the array;
  • 16 bytes for the standard array overhead;
  • 4 bytes for each array cell.

The memory layout of an ArrayList is sketched in figure 3. Because an ArrayList with n elements contains at least n array cells, it occupies at least 40+4n bytes.


Figure 3  Memory layout of an ArrayList.


In reality, the capacity of an ArrayList is often larger than its size. If you add an extra element to an ArrayList which is at full capacity, the class creates a larger array and copies the old one onto the new one. To improve its overall performance, the enlarged array won’t be tight—it’s one cell longer than the old one. The capacity increases by 50%. At any given time the capacity of an ArrayList is between 100% and 150% of its size. In the following estimates, assume the average value of 125%.

You should obtain the estimates in table 1. An isolated container (first scenario) carries no ArrayList. Only the container objects take memory: 4 bytes for the group reference (which holds null) and 4 bytes for the amount field, plus the usual 12-byte object overhead. When containers are organized in one hundred groups of ten (second scenario), one hundred ArrayLists must be added to the footprint of one thousand container objects.

Table 1. Memory requirements of [Memory 1]

Scenario

Size (calculations)

Size (bytes)

% of reference

1000 isolated

1000  (12 + 4 + 4)

20 000

19%

100 groups of 10

1000  (12 + 4 + 4) + 100 (40 + 10  1.25  4)

29 000

47%

As you can see from table 1, with a few simple changes to Reference, you can save a significant amount of space. In particular, the idea of allocating the lists when they’re first needed obviously brings about great savings in our first scenario, where all containers are isolated, and no list’s ever allocated. The 50% savings in the second scenario’s entirely due to having replaced HashSet with ArrayList.

Notice that the memory savings achieved in this section comes at no performance cost because the three operations keep the same complexity they have in Reference, as reported in table 2. On the other hand, the class is somewhat less readable than Reference. First, declaring the group field of type List hides the fact that groups are, in fact, unordered collections with no duplicates. Secondly, treating isolated containers as special cases is an unnecessary complication, whose only aim is to save some space.

Table 2. Time complexities for Memory 1, where n is the total number of containers. These complexities coincide with those of Reference.

Method

Time complexity

getAmount

O(1)

connectTo

O(n)

addWater

O(n)

The high memory overhead of HashSet and other standard collections is a well-known fact, enough that several frameworks provide more space-efficient alternatives. For example, Android provides a class called SparseArray, representing a map with integer keys and reference values, which is implemented based on two same-length arrays. The first array stores the keys in ascending order and the second array stores the corresponding values. With this data structure, one pays the improved memory efficiency with a worse time complexity: finding the value corresponding to a given key requires a binary search over the key array, which in turn requires logarithmic time. Exercise 3 at the end of this article invites you to further analyze the SparseArray class.

When only primitive values need to be stored, several libraries, such as GNU Trove[1], provide specialized set and map implementations that avoid wrapping each value in the corresponding class.

Plain arrays    [Memory2]

In your second attempt at saving memory, nicknamed Memory2, you’re going to replace the ArrayList representing a group with a plain array, and keep its length exactly equal to the size of the group. Here’s what the beginning of the class should look like:

Listing 4                Memory2: fields and method getAmount; no constructor is needed

 
 public class Container {
    private Container[] group;
    private float amount;
  
    public float getAmount() { return amount; }
  

As for Memory1, you may want to allocate the group array only when necessary; when this container’s connected to at least other one. The resulting memory layout in the usual scenario is shown in figure 4.


Figure 4 Memory layout of Memory 2 after executing the first three parts of Usecase.


The connectTo method’s similar to the one in Memory1, slightly more cumbersome, due to its lower level of abstraction. For example, merging two ArrayLists is a simple matter of invoking the addAll method, whereas merging two arrays requires reallocating one of them and then iterating over the other (lines 5 and 9 in the following listing).

Listing 5. Memory 2: method connectTo

 

   public void connectTo(Container other) {

      if (group==null)                                     
         group = new Container[] { this };
      if (other.group==null)                               
        other.group = new Container[] { other };
      if (group == other.group) return;                    

      int size1 =       group.length,                      
          size2 = other.group.length;
      float amount1   =       amount * size1,
            amount2   = other.amount * size2,
            newAmount = (amount1 + amount2) / (size1 + size2);

      Container[] newGroup = new Container[size1 + size2]; 

      int i=0;
      for (Container a: group) {                           
         a.group = newGroup;                               
         a.amount = newAmount;                             
         newGroup[i++] = a;                                
      }
      for (Container b: other.group) {                     
         b.group = newGroup;
         b.amount = newAmount;
         newGroup[i++] = b;
      }
   }

  If this is isolated, initialize its group

  If other is isolated, initialize its group

  Check if they are already connected

  Compute the new water amount

  Allocate new group

  For each container in 1st group…

  …update its group

  …update its amount

  …and append it to newGroup

  Do the same for 2nd group

Finally, the addWater method is almost identical to the one from Memory1, except for the type of the water amount variables—float instead of double.

Listing 6   Memory2: method addWater

 
 public void addWater(float amount) {
       if (group==null) {
         this.amount += amount;
       } else {
          float amountPerContainer = amount / group.length;
          for (Container c: group)
             c.amount += amountPerContainer;
       }
    }
   
  

Space and time complexity

A plain array containing references to n containers takes 16+4 n bytes, leading to the estimates in table 3 for our standard scenarios. An isolated container (first scenario) allocates no group array, and its memory footprint’s exactly as large as in Memory1: 20 bytes. When containers are organized in one hundred groups of ten (second scenario), each group’s represented by a 10-cell array, taking 16 bytes of array overhead and 4*10 bytes for its actual content.

Table 3. Memory requirements of [Memory 2]

Scenario

Size (calculations)

Size (bytes)

% of reference

1000 isolated

1000  (12 + 4 + 4)

20 000

19%

100 groups of 10

1000  (12 + 4 + 4) + 100  (16 + 10  4)

25 600

42%

The bad news is that the memory savings achieved by Memory2 are insignificant compared with the previous version Memory1. Isolated containers occupy the same amount of memory, and groups of containers, now represented by arrays, are only marginally more compact than ArrayLists. In fact, the bigger savings come from keeping the arrays tight, which is exactly as long as they need to be, rather than relatively loose as an ArrayList.

It’s a good time to recall and compare the memory requirements of the three standard data structures used to represent groups of containers: HashSet (Reference and Speed1), ArrayList (Memory1) and plain arrays (Memory2). Classes Speed2 and Speed3 aren’t included in the comparison because they’re based on custom datastructures, which aren’t immediately available for general use.

Table 4 summarizes these memory requirements. The size estimate for a plain array of object references is easily done: 16 bytes of overhead and 4 bytes for each reference.  Here, I’m assuming that its capacity’s equal to its size (in reality, the first can be up to 50% larger than the latter). Similar simplifying assumptions apply to the size analysis for HashSet.

As you can see from the table, an array and an ArrayList are close in terms of memory requirements, but ArrayList is more useful, supporting automatic resizing and other utilities, and leads to more readable code, due to its higher level of abstraction. Moreover, ArrayList plays nicely with generics, whereas arrays are at odds with them.

Table 4 Memory requirements of common collections, assuming that the capacity of the ArrayList and the HashSet is equal to their size. The second column’s tagged “barebone” instead of “empty”, because it doesn’t take into account the default initial capacity of that collection.

Type

Size (barebone)

Size (each extra element)

array

16

4

ArrayList

40

4

LinkedList

24

24

HashSet

52

32

POPQUIZ 2 For a type parameter T, why isn’t “new T[10]” a legal Java expression?

HashSet, instead, is in a different, much bulkier league, particularly because it wraps each inserted element into a new object. It provides unique services in optimal time:

  • Membership query in constant time (method contains).
  • Rejection of duplicate elements in constant time (method add).
  • Removal of an arbitrary element in constant time (method remove).

If those services are required by your application, a HashSet generally more than repays its larger memory footprint.

Regarding performance, the time complexity of Memory2 turns out to be the same as Memory1 and Reference. After all, Memory2 is a variation of Memory1 with plain arrays instead of ArrayLists.

You may be wondering why the smart resizing policy of ArrayList, which guarantees amortized constant-time insertions, doesn’t provide any advantage to Memory2. The explanation is that it does provide some advantage in the performance of connectTo, but that advantage is hidden by the other operations performed by that method. In detail, Memory1 merges two groups through the line group.addAll(other.group); where group and other.group are two ArrayLists. Memory2 instead executes the line Container[] newGroup = new Container[size1 + size2];

The first is generally more efficient than the second, because the extra capacity of the first ArrayList may be enough to insert all the elements from the second ArrayList without any new allocation. Both versions of connectTo proceed to iterate over all the elements from both of the old groups. Asymptotically speaking, the early savings are overruled by the later loops, leading to the same big-O complexity of O(n) for both versions of connectTo.

Table 5. Time complexities for Memory 2, where n is the total number of containers. These complexities coincide with those of Reference.

Method

Time complexity

getAmount

O(1)

connectTo

O(n)

addWater

O(n)

One more reason to like Memory2 is that it’s entirely self-contained, in that it doesn’t mention any other class from the Java API. In special circumstances, this could be beneficial, because it means that this class can function in a context with a limited runtime environment, as allowed by the module system introduced with Java 9.

That’s all for now. If you want to learn more about the book, check it out on liveBook here and see this slide deck.