From Grokking Deep Learning for Computer Vision by Mohamed Elgendy

In this article, we’ll zoom in on the interpreting device component (of a computer vision system) to take a look at the pipeline it uses to process and understand images.

Are you ready to start exploring computer vision systems? Let’s dig right in! Vision systems are composed of two main components: 1) a sensing device and 2) an interpreting device.

Figure 1

Applications of computer vision vary, but a typical vision system uses a similar sequence of distinct steps to process and analyze image data. This sequence is referred to as a vision pipeline. Many vision applications start off by acquiring images and data, then processing that data, performing some analysis and recognition steps, and finally making a prediction based on the extracted information.

Figure 2

Let’s apply the above pipeline to an image classifier example. Suppose we have an image of a motorcycle, and we want the model to predict the probability that the object belongs to each of the following classes: car, motorcycle, and dog.


An image classifier is an algorithm that takes in an image as input and outputs a label or “class” that identifies that image.

In machine learning, a class is the output category that a piece of data is assigned to. Classes are also called categories.

Figure 3

Here’s how the image flows through the classification pipeline:

  1. First, a computer receives visual input from an imaging device like a camera. This is typically captured as an image or a sequence of images forming a video.
  2. Each image is sent through some preprocessing steps whose purpose is to standardize it. Common preprocessing steps include resizing, blurring, rotating, changing the image’s shape, or converting it from one color space to another, such as from color to grayscale. Only by standardizing the images (for example, making them all the same size) can you compare them and analyze them in the same way.
  3. Next, we extract features. Features are what help us define certain objects, and they’re usually information about object shape or color. For example, some features that distinguish a motorcycle are the shapes of its wheels, headlights, mudguards, and so on. The output of this process is a feature vector, which is a list of unique shapes that identify the object.
  4. Finally, these features are fed into a classification model. This step looks at the feature vector from the previous step and predicts the class of the image. Pretend that you’re the classifier model and let’s go through the classification process: you look at the features in the feature vector one by one and try to determine what’s in the image.
    1. First, you see a feature of a wheel. Could this be a car, a motorcycle, or a dog? Clearly it isn’t a dog, because dogs don’t have wheels (at least normal dogs, not robots!). So this could be an image of a car or a motorcycle.
    2. Then you move on to the next feature, the headlights. This shape is more likely to belong to a motorcycle than to a typical car.
    3. The next feature is the rear mudguard. Again, there’s a higher probability that it’s a motorcycle.
    4. The object has only two wheels. Hmm, this is closer to a motorcycle.
    5. You keep going through all the features, like the body shape, pedal, and so on, until you have arrived at a better guess of the object in the image.
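
Steps 2 and 3 of the list above can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the book's implementation: an image is assumed to be a list of rows of (R, G, B) tuples, all function names are hypothetical, and an intensity histogram stands in for the shape features described in the text because it is simple enough to compute by hand.

```python
# Illustrative sketch only. An image is a list of rows; each pixel is an
# (R, G, B) tuple with values in 0..255. All function names are hypothetical.

def to_grayscale(image):
    """Convert RGB pixels to single intensity values (BT.601 luma weights)."""
    return [
        [round(0.299 * r + 0.587 * g + 0.114 * b) for (r, g, b) in row]
        for row in image
    ]

def resize_nearest(gray, new_height, new_width):
    """Resize with nearest-neighbor sampling so every image has the same shape."""
    old_height, old_width = len(gray), len(gray[0])
    return [
        [
            gray[y * old_height // new_height][x * old_width // new_width]
            for x in range(new_width)
        ]
        for y in range(new_height)
    ]

def intensity_histogram(gray, bins=4):
    """Feature vector: the fraction of pixels that fall in each intensity bin."""
    counts = [0] * bins
    for row in gray:
        for value in row:
            counts[min(value * bins // 256, bins - 1)] += 1
    total = sum(counts)
    return [count / total for count in counts]

def preprocess_and_extract(image, size=(2, 2), bins=4):
    """Chain preprocessing (step 2) and feature extraction (step 3)."""
    gray = to_grayscale(image)
    standardized = resize_nearest(gray, *size)
    return intensity_histogram(standardized, bins)

# A tiny 2x4 test image: left half black, right half white.
image = [
    [(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 255, 255)],
    [(0, 0, 0), (0, 0, 0), (255, 255, 255), (255, 255, 255)],
]
features = preprocess_and_extract(image)
print(features)  # [0.5, 0.0, 0.0, 0.5]
```

Whatever the input size, the feature vector always has `bins` entries, which is the point of standardizing: the classifier in step 4 can then treat every image the same way.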

The output of this process is the probability of each class. As you can see in the above example, the dog has the lowest probability, 1%, whereas there’s an 85% probability that this is a motorcycle. Although the model predicted the right class with high probability, it’s still a little confused in distinguishing between cars and motorcycles: it predicted a 14% chance that this is an image of a car. Because we know that it’s a motorcycle, we can say that our ML classification algorithm assigned 85% confidence to the correct class. Not too bad! (Accuracy, strictly speaking, is measured over many images: it’s the fraction of them that the model classifies correctly.)

To improve this performance, we may need to do more of step 1 (acquire more training images), step 2 (more processing to remove noise), step 3 (extract better features), or step 4 (change the classifier algorithm, tune some hyperparameters, or train for longer). Many different approaches can improve the performance of our model, and they all lie in one or more of the pipeline steps. This is what we’ll be working on in this article.
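
To make the numbers above concrete, here is a minimal sketch of how a classifier might turn raw scores into class probabilities, and how accuracy is commonly computed over a set of labeled images. The article doesn't say how its model produces probabilities. Softmax is assumed here because it is a common choice, the scores are made up so that the result lands close to the 85% / 14% / 1% split, and the function names are hypothetical.

```python
import math

def softmax(scores):
    """Turn raw class scores into probabilities that sum to 1."""
    highest = max(scores.values())  # subtract the max for numerical stability
    exps = {label: math.exp(s - highest) for label, s in scores.items()}
    total = sum(exps.values())
    return {label: e / total for label, e in exps.items()}

def predict(probabilities):
    """Return the class with the highest probability."""
    return max(probabilities, key=probabilities.get)

def accuracy(predicted_labels, true_labels):
    """Fraction of images whose predicted class matches the true class."""
    correct = sum(p == t for p, t in zip(predicted_labels, true_labels))
    return correct / len(true_labels)

# Made-up raw scores for one motorcycle image (assumption, not from the book).
scores = {"car": 2.0, "motorcycle": 3.8, "dog": -0.6}
probabilities = softmax(scores)
for label, p in probabilities.items():
    print(f"{label}: {p:.0%}")
# car: 14%
# motorcycle: 85%
# dog: 1%

print(predict(probabilities))  # motorcycle

# Accuracy needs many labeled images; here 3 of 4 hypothetical predictions match.
predicted = ["motorcycle", "car", "dog", "car"]
actual = ["motorcycle", "car", "dog", "motorcycle"]
print(accuracy(predicted, actual))  # 0.75
```

The 85% is the model's confidence for this one image, while the 0.75 summarizes how often its top prediction is right across a set of images.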

That was the big picture of how images flow through the computer vision pipeline. In the next section, we will dive one level deeper into each one of the pipeline steps:

  1. Input image
  2. Image preprocessing
  3. Feature extraction
  4. Classification

That’s all for now. Keep a look out for part 2. If you’re interested in learning more about the book, check it out on liveBook.