By Tryggvi Björgvinsson

This article discusses quality management in terms of data and data projects. When you know what your users need or want, you’ll have to manage those expectations somehow, but how do you manage quality? It turns out that you approach it similarly to how you would try to properly answer a question.

Save 37% off The Art of Data Usability with code fccbjorgvinsson at manning.com.

Find the first part of this article here.

Let’s start with a question we all ask ourselves at some point in our lives (if you haven’t already, you’re about to do it now): why does chocolate melt in your mouth but not in your hands?

Maybe you know the answer thanks to your background. Maybe not, but you may have an idea why. Let’s talk through the process you’d follow to answer the chocolate question. You start with your idea, which may or may not be the answer to the question. Let’s say that your idea is that the melting point of chocolate lies somewhere between the temperature on the surface of hands and the temperature inside mouths. That sounds plausible.

Next, you do some experiments, which may measure the melting points of various brands of chocolate and the temperatures of the surface of hands and inside of mouths. After collecting all temperatures and melting points you analyze the results and find out that the melting point of processed chocolate is around 34°C, which is slightly lower than the temperature inside a mouth (35-37°C) and just higher than the average temperature of the skin surface of hands at room temperature (32-33°C). Eureka! Your idea seems to have been correct, but chocolate still melts a little bit in your hands, which poses a new question. You can then test another idea (that surface temperature is raised above the melting point of chocolate by closing your hands or applying pressure). From this you get ideas like how to sugarcoat chocolate to raise the melting point and create a better-quality chocolate.

This exact process of answering questions is quite old—most often referred to as the scientific method—and is the basis of a lot of academic research. You start with a question (the research question). You put forward an idea (a hypothesis). Next, you test your idea (with experiments). You analyze the results of the experiments (evaluation). Last, you let everyone know how it turned out (publication).

Note: I added the last step because, although it’s not considered to be a part of the scientific method, it’s how results are established within the academic community.

The scientific method is also the foundation of quality management, and there are complete quality management systems that revolve around how to set up and manage quality, based on an approach similar to the scientific method. All of these quality management systems, and quality management in general, are based on a management method usually known by its abbreviation: PDCA. The whole idea, which resembles the steps in the scientific method, is to repeat the following steps (also shown in figure 1) to reach and maintain quality:

  • Plan your changes (hypothesize about what will improve quality)

  • Do the changes (implement the hypothesis to experiment)

  • Check your outcome (evaluate the results)

  • Act on it (publish it to make it known)


Figure 1. The Plan – Do – Check – Act quality cycle.


PDCA has been called a great many things over the years, but I just call it the quality cycle. To me, there is only one quality cycle; it usually involves the steps plan, do, check, and act, but you’ll also come across variations of those steps.

To better understand how to work with the quality cycle, we can walk through one iteration of the cycle for a simple project. You’ll follow a similar iterative process in some form whenever you work on quality. It doesn’t matter if you’re working on data quality or any other form of quality: this is the foundation. Imagine you’re managing a dataset of UFO sightings around the world; you’ve been trying to improve the dataset and are just starting a new iteration of the quality cycle. We need something to work with, so let’s start by generating a dataset we can use. We can create an example dataset using Python.

Are you all set up to create a Python project? Great, let’s continue. First, we create a directory to work from. We create it directly in our home directory and call it art_of_data_usability. Type this into a terminal (bash, PowerShell, or whatever you fancy):

  
 $ cd                          
 $ mkdir art_of_data_usability 
 $ cd art_of_data_usability    
  

 Typing cd and nothing else makes sure we are in our home directory (on bash)

 This command creates a directory called art_of_data_usability.

Then we traverse into the newly created directory.

Next, we create a virtual environment. We don’t actually need it to generate the dataset but we’re going to need it later in our example. Type this into the terminal:

  
 $ python -m venv venv         
 $ source venv/bin/activate    
  

 Creates a new virtual environment in the current directory. If your Python 3 executable is installed as python3, use that instead of python.

 Activates the virtual environment (this is for a bash shell; in PowerShell on Microsoft Windows you would run .\venv\Scripts\Activate.ps1, but before that you may have to allow running scripts in PowerShell).

Now we should have our environment ready, and we can move on to creating a small Python script that generates a csv file with a UFO sighting every day from 1956 through 2017, always at a fixed location and always reported by the same made-up newslet (news outlet). The first few lines of the resulting csv file (called ufos.csv) will be these:

 date,location,reporter
 1956-01-01,Area-52,NEVADAta
 1956-01-02,Area-52,NEVADAta
 1956-01-03,Area-52,NEVADAta
 1956-01-04,Area-52,NEVADAta
  

Open a file called generate_example_ufo_dataset.py and write the code in listing 1 into it, save, and close. This is our data-generating Python script.

Listing 1. generate_example_ufo_dataset.py

  
 import csv 
 from datetime import date, timedelta
  
 with open('ufos.csv', 'w') as ufos:                
     ufo_csvfile = csv.writer(ufos)                 
  
     headers = ['date', 'location', 'reporter']     
     ufo_csvfile.writerow(headers)
  
     start_date = date(1956,1,1) 
     end_date = date(2018,1,1)
     days = (end_date-start_date).days
  
     same_location = 'Area-52'   
     same_reporter = 'NEVADAta'
  
     for day in range(days):     
         sighting_day = start_date + timedelta(day)  
         ufo_csvfile.writerow([sighting_day, same_location, same_reporter]) 
  

 Import the libraries we need: csv and two components from datetime.

 Open a file (hardcoded file name: ufos.csv) for writing.

 Create a csv writer object out for the file.

 We create three headers and write the header row to the file.

 We set the start and end dates, then we compute the number of days in between.

 We hardcode the same location and reporting newslet for our file.

 Loop over the number of days; this will create the sequence 0, 1, 2, 3, 4, …

 Set the date for the sighting, if we’re in the first iteration of our for loop (where day is 0), this will be equal to the start_date. If we’re in the next iteration (where day is 1) this will be the day after the start_date and so on.

 Write a row to our csv file with the sighting day and the hardcoded location and reporting newslet.

Now you can run the data-generating script by running this command in the terminal:

  
 (venv) $ python generate_example_ufo_dataset.py
  

This creates the desired csv file (called ufos.csv) in our working directory (which we created earlier as art_of_data_usability). Now that we have something to work with in our example, let’s move on to the quality cycle.

Planning and designing metrics

This phase of the quality cycle starts with an idea—an idea for an improvement, a new method, or something else. With that idea, you start planning and designing your change. You do this in four steps, those shown in figure 2.


Figure 2. You plan the quality improvement and how you will measure it


The first planning step is to define the objective of your idea. You base your work on a single quality attribute you want to improve; for example, the size of a dataset. There are, of course, many quality attributes to work on, but, for each iteration, you choose only one. It’s best to prioritize the quality attributes and pick the one at the top of the list. For our UFO dataset, there can be many different quality attributes you’d like to improve on (standardization of locations, completeness in the UFO sightings reports, the ability to aggregate), but for this example let’s say the highest-priority attribute now comes from your system administrators. You’ve collected so many sightings that the dataset file size is too large. Your system administrators have assigned you a quota of 640KB because they’re firm believers in the old (and incorrectly appropriated) Bill Gates quote that “640K ought to be enough for anybody.”

Next, you propose a small change and write down the predicted outcome. It’s important to focus on a small change. There can be multiple viable changes, but don’t try to do too many things in one go; if you want to reduce the size of a dataset and you think it might get smaller with a new data format, by splitting the sightings up into years, or by compressing it, pick one of those changes, not all of them.

Tip: If you don’t know what contributed to the improvement, you risk institutionalizing unnecessary behavior. So, something might actually reduce quality, even if in your books it’s recorded as something that improves quality.

The new data format may be worse than the old one, but you don’t see that because, thanks to compression, the dataset is smaller. Or the compression may not be as effective when you split the dataset into multiple files. For our example, let’s say we propose compression, and we think we can reduce the file size sufficiently because we get so many reports from the same place that the compression can take advantage of that. You don’t implement the change at this stage; you just write down the change and its expected outcome beforehand. Doing this allows you to focus your efforts: you know what you’re doing and why. It also means you’ll spend less time in the analysis step, which could turn into a treasure hunt if you don’t plan properly, and treasure hunts are rarely productive.

Writing down the expected outcome also allows you to design your metrics before you make the change, and that’s your next step. Designing your metrics before doing anything means you won’t end up in a position where you’ve made a change, but, when asked whether that change helped, can only answer, “I can’t say. It turns out we can’t really measure it.” It’s better to know, before you start, how you’ll compare the actual outcome to the expected outcome. Our goal is to get the file size to less than 640KB (but we hope to exceed those expectations). Another goal could be to aim for a specific size reduction where the metric would be to compare the original size against the outcome. This all depends on what we’re trying to achieve.

The last step is to write down the plan. It may seem like you don’t need one for our example (after all, you’re just going to compress a file), but it’s good practice: real-world projects won’t be as simple, and you need to document what you’re doing. If you do it correctly, there’s also a side benefit, which I’ll point out after describing what should be in the plan:

  • Who is involved in the iteration?

  • What will they be doing?

  • Where will they do it?

  • When can you expect an outcome?

These questions are pretty straightforward. It may seem weird to ask where the involved people will do what they’ll be doing, but sometimes a quality improvement isn’t performed at a desk, it may be somewhere in the field. If your quality attribute is understandability and the change you propose is giving a presentation to the target group, you won’t do that at your desk: you’ll have to think about a lecture hall, a meeting room, or some other place where you’ll give the presentation.

The last question, about when you can expect an outcome, is important to think about in terms of both implementation and measurements. Gathering measurements for analysis can take a lot longer than the actual implementation. If you were changing a work process at a bakery, you couldn’t expect a great outcome in a single day; the bakers would have to get used to the new process. You’d let it run for maybe a month and then see whether things improved. You need to know what period, the reference period, to compare your measurements against. That’s your plan; it doesn’t have to be complicated. If you’ve restricted yourself to one quality attribute and a small change, it should fit on a single piece of paper (something similar to the example in table 1).

Table 1. Example of a simple quality plan

Subject             ufos.csv dataset
Quality attribute   Size
Proposed change     Implement compression using the DEFLATE algorithm
Expected outcome    File size less than 640KB
Metric              Dataset file size
People              Jane Doe implements DEFLATE algorithm
Special locations   None
Reference period    Immediate

Tip: Here’s the side benefit I promised you. I like to use these plans as cards on a Kanban board (for more on Kanban, see Kanban in Action). A Kanban board consists of swim lanes representing different stages like Planned, Doing, and Done. You put each card into a swim lane to visually show its progress, and then you move the cards around: when you start work on a planned change, you move its card to the doing stage. This allows you to quickly gauge progress. If you’re doing this physically (with real cards you write on and stick to the wall), you can use both sides of the card: the front side would contain the more important information like subject, attribute, people, and reference period, while the back side would have more details, like proposed change, expected outcome, metric, and locations.

If you like this approach, you can organize the Kanban board around the four steps of the quality cycle, and it’ll give you a good overview of your quality work:

  • Planning

  • Doing

  • Analysis

  • Done (baseline established)

Whatever you use, you should create a simple template you can use for all your quality adventures, just to speed up your planning phase.
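
One way to do that, purely as an illustration (not something the book prescribes), is to keep the template as a small Python dictionary that mirrors the fields of table 1 and print it as a card:

ufo_size_plan = {                      # one "card" per iteration of the quality cycle
    'Subject': 'ufos.csv dataset',
    'Quality attribute': 'Size',
    'Proposed change': 'Implement compression using the DEFLATE algorithm',
    'Expected outcome': 'File size less than 640KB',
    'Metric': 'Dataset file size',
    'People': 'Jane Doe implements DEFLATE algorithm',
    'Special locations': 'None',
    'Reference period': 'Immediate',
}

for field, value in ufo_size_plan.items():  # print the card, one field per line
    print('{field}: {value}'.format(field=field, value=value))

Copy the dictionary, change the values, and you have the next card for your board.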

Implementing controls and changes

After setting up your plan, you just go ahead and do it. Doing the change involves one step, but at the same time you also document problems and begin analysis, as shown in figure 3.


Figure 3. Carry out the plan while documenting and analyzing problems and behavior


Even though implementing the change (do) is a one-step process, carrying out the plan, it’s not necessarily simple. What you have to do in this step depends on which iteration you’re on (you will probably do many iterations for each quality attribute). If this is the first iteration, you have to create your baseline, which describes the current level of the quality attribute. Before you implement any compression algorithms, you need to know the size of the uncompressed dataset. To do that, you design a test, normally referred to as a quality control.

Even though I referred to a quality control as a test you design, it’s not a simple yes-or-no test; rather, it’s a way to gauge the quality level for comparison. It only leads to a yes or no answer when you ask yourself whether you have the desired level of quality. The quality control for the size of a dataset is not “does the dataset have size X?” but “what is the size of the dataset?” The latter allows you to examine the size after implementing a change. In our case, there are many ways to measure file size. We could just use what’s available on the operating system (right-click the file, choose properties, and look at the size), or we could write a small program that does this for us. Let’s go the hard way and write a small Python script to gauge the file size. We can use the same virtual environment we created when we generated the example dataset. If we don’t have it activated, we can activate it by running the following command:

  
 $ source venv/bin/activate
  

For this script, we’ll use the fantastic Click library to quickly turn the code into a command-line script that can take a file name as an argument. Let’s install Click into the virtual environment using Pip. At the command prompt in a terminal, run the following command:

  
 (venv) $ pip install click
  

After running that install you should see a few lines of output, one of which should confirm that the installation was successful, something along these lines:

  
 Successfully installed click-6.7
  

Then we can write a small script to gauge file size. Create a file named get_ufo_dataset_size.py and add the code from listing 2 to it.

Listing 2. get_ufo_dataset_size.py

  
 import os.path   
 import click
 import math
  
 @click.command() 
 @click.argument('filename')  
 def get_ufo_filesize(filename):
     size_in_bytes = os.path.getsize(filename)                 
     size_in_kilobytes = math.ceil(size_in_bytes / 1024)       
  
     print('Size is: {size}KB'.format(size=size_in_kilobytes)) 
  
 if __name__ == '__main__':   
     get_ufo_filesize()
  

 We import the three libraries we need: Click and two standard libraries (os.path and math).

 We create a command-line interface using Click

 Our command-line interface should take one argument, called filename

 We use os.path.getsize to get the file size in bytes

 Because humans rarely talk about file sizes in bytes (except for small files) we convert it to kilobytes. Why do we divide it by 1024? Because the kilo in computers is 1024 (2 to the power of 10). We also round it up to get a nice number.

 Print out the file size as a human readable text.

  Then we invoke our Click command by calling it when we run the Python file (this business with __name__ and ‘__main__’ is a Python convention).

To run this script, we can just execute it and pass in the name of the file we want to know the size of (which is the Click filename argument). In our case that would be the ufos.csv file. Run the following command at the prompt:

  
 (venv) $ python get_ufo_dataset_size.py ufos.csv
  

This should output the following text:

  
 Size is: 642KB
  

This is our baseline: 642KB (actually, the baseline is uncompressed data, but the size is the metric we’re interested in). The file is clearly too large. The system administrators don’t want it to surpass 640KB. Obviously, quality is lacking for the size attribute. The system administrators won’t be happy. Let’s implement the proposed change to see if we can improve the quality. Again we turn to Python to compress the data. We don’t have to disturb work processes because we can just copy the contents of the file into a compressed file.

Activate the virtual environment, if it isn’t already activated, and create the file in listing 3 to compress the dataset.

Listing 3. gzip_ufo_dataset.py

  
 import gzip                                              
 import shutil
  
 with open('ufos.csv', 'rb') as ufos_csvfile:             
     with gzip.open('ufos.csv.gz', 'wb') as ufos_gzipped: 
         shutil.copyfileobj(ufos_csvfile, ufos_gzipped)   
  

 Import the libraries we need. One of them, gzip, is the compression library we use. This library complies with our DEFLATE algorithm requirement, so this is all according to plan.

 Open the original csv file for reading. We’re not using Click for this but instead hardcoding the file name for simplicity. If you want practice with Click this is a good script to convert.

 Use gzip to open a compressed version for writing. This will automatically compress everything we write to the file. Again we hardcode the file name for simplicity.

 Use the shutil library to copy the contents of the original file to the compressed version of the file.
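
As the annotation above notes, this script is a good candidate for practicing with Click. A minimal sketch of the same script with the file names turned into command-line arguments might look like this (the script name gzip_dataset.py and the argument names are my own, not from the book):

import gzip
import shutil

import click

@click.command()
@click.argument('source')
@click.argument('destination')
def gzip_dataset(source, destination):
    """Copy SOURCE into a gzip-compressed DESTINATION file."""
    with open(source, 'rb') as original:                  # read the original bytes
        with gzip.open(destination, 'wb') as compressed:  # gzip everything we write
            shutil.copyfileobj(original, compressed)

if __name__ == '__main__':
    gzip_dataset()

You could run that as python gzip_dataset.py ufos.csv ufos.csv.gz, but for the rest of this example we’ll stick with the hardcoded version in listing 3.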

To run listing 3 and generate a file called ufos.csv.gz (the compressed version of ufos.csv), we only have to run the following command at the prompt:

  
 (venv) $ python gzip_ufo_dataset.py
  

Compressing the dataset like this is what we do in the do step of our cycle. Irrespective of how you implement the quality controls or the changes, you should always record all problems and unexpected incidents that occur during the implementation and reference periods. That makes the upcoming analysis of how well the implementation went much easier. In our example, we probably wouldn’t run into a problem because the reference period is immediate, meaning we make the change and analyze the output right away. Also, we compressed the data into a different file to postpone any work-process problems. Recording incidents is more likely to be something you do during a longer reference period.

There might still be incidents or concerns raised by something as simple as the compression. For example, we’re using gzip, which holds only a single file; regular zip archives can contain many files and use the same DEFLATE algorithm, but updating a file inside a zip archive is more complex. That’s something you might flag if you had started out with a zip archive instead of the gzip format. You should document problems and behaviors in real time, and you should start analyzing those incidents when they come up. It’s best to gather the evidence during an event rather than sometime later. This documentation complements the metrics you have designed and implemented, and it acts as a very basic quality control that catches everything you didn’t think about. Even with good intentions, you can’t plan and design your metrics for all situations. If you make a habit of documenting problems and incidents, you’ll at least know something you can focus on improving in future iterations. By starting the analysis of those incidents as soon as they happen, you’re more likely to be able to collect the data you need while the problem is happening, instead of being stuck with a problem and no means of analyzing it later because the necessary data was never recorded.
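
If you want to make that habit concrete, one option (my own illustration, not something from the book) is a tiny helper that appends timestamped notes to an incident log; the note below is the gzip-versus-zip concern just mentioned:

import csv
from datetime import datetime

def log_incident(note, logfile='quality_incidents.csv'):
    """Append a timestamped note to the incident log (hypothetical helper)."""
    with open(logfile, 'a', newline='') as incidents:
        csv.writer(incidents).writerow([datetime.now().isoformat(), note])

log_incident('gzip holds a single file; switching to a zip archive would make '
             'updating individual files more complex')

Anything that ends up in the log is a candidate for analysis now and an improvement idea for a later iteration.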

When you begin, this phase may take some time because you’re creating the quality controls, but in future iterations, once you’ve put all the metrics into play, you won’t have to spend time on this again. In fact, you should try to avoid changing quality controls as much as possible, because that affects the comparability of results between iterations. If you need to change a quality control, you must approach it as a first iteration again and begin by re-establishing your baseline.

Analyzing the implementation

Armed with the expected outcomes from your plan and the actual outcome after making the changes, you should have an easy time checking the measurements and analyzing them against the outcome (figure 4).


Figure 4. Analyze, compare, and summarize when checking the outcome


By this time, you should have the bulk of the work done: you’ve created a plan, you’ve carried out that plan, and you have your baseline. Now it’s time to check whether you’ve improved the quality. First, you must complete the analysis that you should have started during the implementation. It’s even possible that you’ve already finished your analysis, but sometimes that’s not possible until after the reference period (the time interval between the process start and when you expect the outcome). If your change was a new work process, you analyze individual incidents while the staff works according to the new process, but you should still continue to monitor for the remainder of the complete reference period. Afterward, you’ll be able to analyze all of the incidents as a whole. The analysis of individual incidents might reveal problems with the wording of the work process steps, but when you analyze the incidents as a whole, you notice that the incidents rose only in the first few weeks while the staff was getting accustomed to the new process. Once they settled in, the new work process actually resulted in happier staff, more productivity, or whatever quality you were after. In our example, we can just run our little program to check the file size of the compressed dataset file with the following command (remember to activate your virtual environment):

  
 (venv) $ python get_ufo_dataset_size.py ufos.csv.gz
  

The output, if you run it on the dataset immediately after compressing it, should be the following:

  
 Size is: 55KB
  

This isn’t the only analysis you need to do. You’ll also have to check how fast the dataset grows; this is where you try to exceed expectations. Even if you’re able to reduce the size now, that won’t be good enough if the dataset surpasses the limit again in a couple of days.

If we have more than 60 years of daily records stored in 55KB, we can expect that we won’t see much change (in kilobytes) by adding one more record. Let’s try to add 100 years, or a little bit more than 36,500 records, to see how that affects our compressed dataset. To do that, we need to modify our initial data-generating code to work with compressed files. Luckily, that’s a pretty simple change. Create a file called append_to_gzipped_ufo_dataset.py and add the code in listing 4.

Listing 4. append_to_gzipped_ufo_dataset.py

  
 import csv  
 import gzip
 from datetime import date, timedelta
  
 with gzip.open('ufos.csv.gz', 'at') as ufos:       
     ufo_csvfile = csv.writer(ufos) 
  
     start_date = date(2018,1,1)    
     end_date = date(2118,1,1)
     days = (end_date-start_date).days              
  
     same_location = 'Area-52'      
     same_reporter = 'NEVADAta'
  
     for day in range(days):        
         sighting_day = start_date + timedelta(day) 
         ufo_csvfile.writerow([sighting_day, same_location, same_reporter]) 
  

 Import the libraries we need, csv, gzip, and two components of datetime

 We open the gzipped file with gzip but we don’t open it for writing, we open it for appending (the ‘at’ bit in the open call) because we want to add to the file, not overwrite it. This line is actually the only thing that’s really different from our original data-generating example.

 We create a csv writer object out of our file so we can write our csv rows.

 We hardcode the start and end dates to cover 100 years

 We’ll loop over all of the days, so we need to count how many days there are in those 100 years.

 We use the same location and reporter (hardcoded) because we want this to be similar to the last 60 years.

 Here we loop over the number of days. In essence this counts from zero up to the value (in our case days): 0,1,2,3,4,…

 Then to get each date, we take the start date and add the days counter to it (start_date plus 0, start_date plus 1, start_date plus 2, …).

 For that day we write (append) a row to our csv file.

To run this script and append data to our compressed csv file, run the following command:

  
 (venv) $ python append_to_gzipped_ufo_dataset.py
  

Now let’s see how large the file has become; run the following command:

  
 (venv) $ python get_ufo_dataset_size.py ufos.csv.gz
  

The output should be as follows:

  
 Size is: 143KB
  

That’s an increase of 88KB or an annual increase of around 0.88KB if the frequency and contents don’t change.
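
If you want to sanity-check that growth rate, and see how long it would take the compressed file to creep back up to the limit, the arithmetic is simple enough to script. This throwaway sketch just re-does the calculation with the sizes we measured:

# Back-of-the-envelope check of the growth estimate
baseline_kb = 55               # compressed size after 62 years of daily sightings
after_100_more_years_kb = 143  # compressed size after appending 100 more years
limit_kb = 640                 # the system administrators' quota

growth_per_year_kb = (after_100_more_years_kb - baseline_kb) / 100
years_until_limit = (limit_kb - baseline_kb) / growth_per_year_kb

print('Growth per year: {0:.2f}KB'.format(growth_per_year_kb))      # 0.88KB
print('Years until the limit: {0}'.format(int(years_until_limit)))  # 664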

After you finish the analysis, you compare it with your predictions of the outcome and draw conclusions. Be honest and objective in your analysis. It’s better to prove that the change didn’t improve the status quo than to fake the results to show that you were right. Be proud of making mistakes. You learn a lot more from an honest analysis than from building up a web of lies to maintain the façade that you’re perfect. Remember this: you make mistakes to build up a career, but you hide your mistakes for a short-lived hobby. When planning the compression to reduce size, maybe you wrote down an expected outcome stating that the file would be less than 640KB, but after implementing the compression, the dataset is smaller yet still more than 640KB. That’s still a valid comparison; it tells you that you’ve made progress but need to look at other options, like different compression algorithms or a different data format. And if the compressed size had come out at 639KB, you’d have to point that out in the same way: you met the expected outcome, but you can obviously expect it to exceed 640KB soon.

After the comparison, you should always write down the results and summarize your key takeaways. Is there a better compression algorithm? Why did the algorithm you used not meet your predictions? You have to summarize your findings because quality is a team effort. You might have learned from your mistakes or successes, but you want the whole team to learn from them. You might not be the one who implements a different compression algorithm (if the one you tried didn’t work), so it’s good to write down what made your choice not so good. Write it all down and make the reasons known. If not, the whole team is doomed to repeat your experiments and failures in the future, until someone else writes them down.

Our example shows us that we greatly reduced the file size. We went from 642KB to 55KB (less than 9% of the original size). Given a 0.88KB annual increase, file size won’t get back up to 640KB until after about 664 years. This is because there isn’t that much change per row in the dataset (only the date changes). If our dataset had been more random, we probably wouldn’t have achieved this amount of compression. So, for our example, we can definitely recommend compression of the dataset, which brings us to the last step in the quality cycle.

Establishing a new baseline

After analyzing the changes, you act if the output from making the changes is promising, and, if you do, your improvements become your new standard for going forward. This is a relatively simple last step in the PDCA cycle, as you can see in figure 5.


Figure 5. Decide how you’re going to act and think about the next iteration


Originally, when Shewhart proposed the cycle, this step was excluded. Still, it’s important to include it because it’s the culmination of your work. Everything you’ve worked on was done to allow you to make the decision of whether to adopt the change or not.

If your analysis shows that the change actually contributed to higher quality, you can establish the change as your baseline for the future. The change is no longer an idea, it’s what you will use from now on (which is why honesty in your analysis is so important). If the change results in lower quality, it’s pretty obvious that it shouldn’t be adopted. In those cases, you don’t adopt it, which is still an action and a perfectly acceptable result. It’s still good to know that the change didn’t improve the quality (if you document it).

Status quo, or no change, is slightly trickier to decide on. If you’ve made a change and it has no effect, should you keep it or not? In those cases, you’ll just have to estimate whether it will cost you more to adopt the change or to reverse it. It’s highly situational and can be very difficult to say. Most often, this comes down to cost or time. If it’s too costly or time-consuming to reject the change, keep it. For example, if you’ve installed a cheap motion sensor and it turns out that it’s never triggered, it may cost you more to get an electrician to dismount it than the refund you’d get. Sometimes, though, it’s worth spending the time and money to reject the change. If a software change doesn’t do anything, it may end up costing you more in the long run to keep the code change than to remove it, because more code is more likely to have bugs or confuse future programmers.

With our UFO sighting dataset, we can safely say that we improved the quality of the dataset; we’re going to satisfy the system administrators for a good chunk of the next millennium. We can accept the change and establish the new baseline: we will use compression for the dataset. At this point, you’d swap out the datasets in production and start working with compressed datasets. In many cases you would have already implemented this into the work processes, but we were able to leverage the immediate reference period and a copy of the data to perform our analysis without disrupting work processes. I recommend that you try, where you can, to update work processes as part of the quality change, because then you’ll get a better feeling for the effects of the change.

That depends on the context, however. You always learn something new in each iteration, and based on what you learn, you can think about the next cycle:

  • Do you want to re-prioritize the user requirements for quality?

  • Do you want to continue improvements in the same area (continue to focus on dataset size)?

  • Do you want to move on to another quality attribute?

Even if act is the last thing you do with an idea for a change, it’s also never the last stop because it’s a cycle. You just move on to the planning step of the next cycle. Quality improvement is a continuous process.

Document everything

You may have noticed that I encouraged you to write a lot down in each iteration of the PDCA cycle, including the following:

  • Write down your plan

  • Write down problems with your changes and unexpected behaviors

  • Write down your key findings

There are a lot of things you have to document, and for good reason, which I mentioned briefly: quality is a team effort. You work with your data users to identify the requirements, and you work with a team of people on the product or service to fulfill those requirements. They all need to know why you’re making a change, when you’re going to make it, and whether it paid off.

Here’s the real juicy part: Quality management is a data project. Data and quality go hand in hand. Chances are you’re interested in the book that spawned this article because it’s about data. If quality is new to you, then fear not. It’s what you’ve been doing (working with data), it’s just framed differently.

When working on quality, all you do is collect data about the current status, the impact of a change, the time and date when some unexpected behavior occurred, and so on. This data is then processed and transformed into information in the analysis step, so that it can be better understood by humans. The newly created information can be put into context with how you understand the situation to become knowledge, which allows you to decide to adopt or not.

The goal of all this data collection and documentation is to manage the quality process and know whether you’re fulfilling the desired quality levels (or iteratively getting there). That’s the question you’re collecting data to answer. Management of the quality process is called quality assurance, and it’s what allows us to provide confidence that the desired quality is fulfilled. Quality assurance is not only about collecting and processing the data and documenting what you’re doing, but also other aspects of the quality process that need to be managed, like training people and selecting the right tools.

Quality assurance and quality controls are two of four parts of the broader term, quality management. The other two components are quality planning and quality improvements. Quality planning is the input to quality assurance and controls. Quality improvements are the results of quality assurance and controls. I consider quality planning to be a part of quality assurance because it’s the first step in the process, the input, where you identify users, specify their needs, and analyze them. Out of that process, you create a plan for the quality you want to improve and the level of quality you’re after. Quality improvements are what you do when you measure the level of quality and see that it isn’t where you would like it to be. Then you make some changes to what you’re working on in an attempt to increase the quality, and send it through the quality assurance process again and hope for the best. Let’s recap these terms, because they’re important, especially if you are a data quality manager. These terms are basically what you’re responsible for:

  • Quality planning: Identify what you’ll be doing and create a plan

  • Quality assurance: Follow procedures to be able to show that you’re making progress on quality

  • Quality controls: Create measurements that tell you where you are currently

  • Quality improvements: Make changes to increase the quality, if the level isn’t satisfactory yet

The difference of data quality

Everything we’ve gone through can be applied to quality management in general, irrespective of the subject. If you’re working on organizational quality, you follow the same steps and the same management principles as those responsible for managing quality for a brand of tea.

The same can be said about data quality. Generally speaking, there is nothing really different about managing the quality of data compared to managing the quality of any other product or service. You follow the same steps to manage quality as a whole and go through iterations of the same steps of the quality cycle.

Well, it’s almost the same but there is one important distinction between data quality and most other quality subjects: Automation is easier.

Data is usually digitized and stored on computers. Think back to our dataset size example. Comparing the size of a dataset can be automated. That’s different from the majority of the other subjects of quality management, where quality controls emphasize inspections and reviews over automation. For example, how would you gauge the quality of a magazine? You wouldn’t be able to automate that—you’d have to survey your readers. If you’re making fire escapes, you don’t automatically set buildings on fire to make sure your fire escape is of good quality. You run simulations manually: for example, with fire drills.

Because data is digital, we’re in a much better position to automate quality. We can automatically gauge the data quality level. That doesn’t mean inspections and reviews aren’t used: if you want to gauge the quality level of the data clarity attribute, you’re going to have to survey your users, but in most cases data quality controls can be automated. Just be aware that our response to a lack of quality is not automated. We’re automating the bureaucracy of quality management so we can focus on the fun stuff.

This means that we can focus on improvements and let automation take care of the quality controls and, perhaps more importantly, the quality process as a whole. Bob Dylan’s well-known protest song, The Times They Are a-Changin’, tells the listener, verse after verse, how the times are changing. The song was written as an anthem of change, and indeed, the world is constantly changing. We have to look constantly at where we are and adapt to changes over time. You rarely achieve quality and then just keep that quality from then until eternity. This is especially applicable to data quality, because data is an abstraction of a constantly changing world.

Put yourself in the shoes of someone tasked with counting the number of birds of a specific species. The quality attribute you have is completeness of the data, meaning that the requirement is to know the exact number of birds of that species in a particular region. It’s not enough to give an estimate; you have to count them all. If, by some miracle, you’re able to count all of the birds and record them, your victory dance won’t be a long one. Some of the birds die, new ones hatch, or they all migrate for the season. Unless you’re tracking the number of dodos in the wild (or some other extinct species), the quality of your count is constantly affected by the world, and you don’t always control how the world affects your data.

That’s why it’s important to set data quality management up as a cycle. You follow the four steps, and then you rinse and repeat. You should always check the level of all quality attributes, even those that have already reached the desired level of quality. This can, of course, in most cases be automated as well. A good way to ensure quality is to constantly check the status of the attributes you’re interested in. Wouldn’t it be a good idea to just run a program that periodically checks whether the file size is OK, getting close to the limit, or already over the limit? It’s simple: we just write small scripts like check_ufo_filesize.py in listing 5 and run them from time to time.

Listing 5. check_ufo_filesize.py

  
 import os.path 
 import math
  
 sysadmin_limit_in_kilobytes = 640 
  
 size_in_bytes = os.path.getsize('ufos.csv.gz')       
 size_in_kilobytes = math.ceil(size_in_bytes / 1024)
 if size_in_kilobytes > sysadmin_limit_in_kilobytes:  
     print('Critical: How dare you go over the limit!?', end=' ') 
 elif size_in_kilobytes > sysadmin_limit_in_kilobytes - 100:      
     print('Warning: You are dangerously close to the limit', end=' ')
 else:          
     print('All is OK: The sysadmins love you!', end=' ')
  
 print('(Size is: {size}KB)'.format(size=size_in_kilobytes))      
  

 We import the libraries we need, for this we only need the built-in os.path and math libraries.

 We set the file size limit to 640.

 We compute the size of the compressed file in kilobytes.

 If the file size is more than the limit we print out a message claiming that we’ve surpassed the limit. Our situation is now critical!

 For each of these print statements (messages), we end with a space (' ') instead of the default newline character, because we want the whole output on a single line, which is easier to read if you’re periodically writing it to the screen.

 If we’re getting close to the limit, in this case within 100KB of it, we print out a warning statement. After a few hundred years, when our compressed dataset reaches 540KB, we’ll be thankful for this warning message because it gives us plenty of time to improve the quality before we get to a critical state. The system administrators need never again see a file larger than 640KB.

 This should be our regular state. We’re way under the limit and sysadmins love us.

 Then for completeness and to be informative we print out the actual file size like before.

This is what we’ll be doing to leverage the difference of data quality: writing the quality controls as scripts that run from time to time and monitor the quality level. All those scripts have to do is notify us when things are OK, when they’re getting bad, and after they’ve gone bad.

There’s a whole range of software out there to help you constantly check the quality level. We can hook our scripts into these systems to make use of all the different features they have. I like to repurpose monitoring software that’s used for system administration (probably because I know it very well). Monitoring software like Nagios, Zabbix, and others allows you to create your own little scripts that serve as your quality controls. The monitoring software is designed to periodically run the scripts you want, based on configurations you determine, and notify staff when something bad has happened (usually through warnings and critical alerts).
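
Monitoring systems generally expect check scripts to report through their exit status; in the Nagios plugin convention, for example, an exit status of 0 means OK, 1 means warning, and 2 means critical. A minimal sketch of how check_ufo_filesize.py could be adapted to that convention (using the file name and thresholds from our example) might look like this:

import math
import os.path
import sys

LIMIT_KB = 640           # the system administrators' quota
WARNING_MARGIN_KB = 100  # start warning 100KB before the limit

size_in_bytes = os.path.getsize('ufos.csv.gz')
size_in_kilobytes = math.ceil(size_in_bytes / 1024)
message = 'dataset is {0}KB (limit {1}KB)'.format(size_in_kilobytes, LIMIT_KB)

if size_in_kilobytes > LIMIT_KB:
    print('CRITICAL: ' + message)
    sys.exit(2)  # critical in the plugin convention
elif size_in_kilobytes > LIMIT_KB - WARNING_MARGIN_KB:
    print('WARNING: ' + message)
    sys.exit(1)  # warning
else:
    print('OK: ' + message)
    sys.exit(0)  # all is well

The monitoring system then schedules the script and turns those exit codes into notifications, so you don’t have to remember to run the checks yourself.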

Setting up your work environment is all part of quality planning, but that’s a topic for another day.

You’re hopefully wiser for having read this article, and, more importantly, more interested in data, data quality, and data usability. For more, download the free first chapter of The Art of Data Usability and see this slide deck on slideshare.net.