Computer vision is the field of artificial intelligence that enables machines to “see”.

Humans have the gift of vision, and the organ that makes it possible is complex. Although it’s incomparable with the long-distance vision of eagles or the eyes of a bluebottle butterfly, which can see in the UV spectrum, it still does an excellent job.

A part of seeing is understanding what you’re seeing. Otherwise, it’s just receiving the light being reflected from objects in front of you. This is what happens if you have a pair of eyes but not the visual cortex inside the occipital lobe (the part of the brain responsible for visual processing).

For computers, cameras are their eyes. And computer vision acts as the occipital lobe and processes the thousands of pixels on images. In short, computer vision enables machines to comprehend what they’re seeing.

Computer vision is critical for several technological innovations, including self-driving cars, facial recognition, and augmented reality. The increasing amount of image data we generate is one reason why this field of artificial intelligence is growing exponentially. This increase also makes it easier for data scientists to train algorithms.

Simply put, the two main tasks of computer vision are identifying the objects of an image and understanding what they mean as a whole.

Humans take visual perception, a product of millions of years of evolution, for granted. A 5-year-old could easily name the items placed on a table and comprehend that the entire setup is a dining table. For machines, it’s a Herculean task, and this is what computer vision is trying to solve.

Artificial general intelligence, if ever possible, wouldn’t be feasible without computer vision. That’s because accurately identifying and reacting to objects around us is one of the notable traits of our intelligence. In other words, to teach machines to think, you must give them the ability to see.

Along with the exponential growth in the number of digital photographs and videos available, advancements in deep learning and artificial neural networks also contribute to the current glory of computer vision.

A brief history of computer vision

The first experiments in the field of computer vision began in the 1950s with the help of some of the early forms of artificial neural networks. They were used to detect the edges of objects and could sort simple objects like circles and squares.

Computer vision was seen as a stepping stone towards artificial intelligence, as mimicking the human visual system is a prerequisite for attaining human intelligence. Therefore, in the 1960s, universities exploring AI were also involved in computer vision.

In 1963, Larry Roberts, considered the founding father of the internet, described the process of deriving 3D information about solid objects from 2D photos. His thesis “Machine Perception of Three-Dimensional Solids” is still recognized as one of the foundational works of the computer vision field.

Later in 1966, Marvin Minsky, one of the founding fathers of AI, believed that computer vision could be achieved with a summer project. But we all know what happened. Fast forward to the 1970s, computer vision technology was used for commercial applications such as optical character recognition (OCR), which can identify handwritten text or printed characters in images.

The internet, which became mainstream in the 1990s, played a crucial role in computer vision’s rapid development. Large sets of images became easily accessible, which made the training of algorithms easier.

Cheap and abundant computing power also added to the ease of training algorithms. This was also the point when the interactions between computer graphics and computer vision increased.

Here are some notable milestones in computer vision that made it the robust technology it is today.

1959: The first digital image scanner was invented. It converted images into grids of numbers.

1963: Larry Roberts described the process of deriving 3D information of solid objects from 2D pictures.

1966: Marvin Minsky instructed a graduate student to attach a camera to a computer and describe what it saw.

1980: Kunihiko Fukushima created the neocognitron. It’s considered the precursor of the modern convolutional neural network (CNN).

2001: Paul Viola and Michael Jones, two researchers at MIT, created the first face detection framework that works in real time.

2009: Google started the self-driving car project.

2010: Google released Google Goggles, an image recognition app useful for searches based on pictures captured by mobile devices. The same year, Facebook started using facial recognition to tag people in photos.

2011: Facial recognition technology was used to confirm the identity of Osama Bin Laden after he was killed.

2012: Google Brain created a neural network consisting of 16,000 computer processors that could recognize the pictures of cats with the help of a deep learning algorithm. The same year, AlexNet, a convolutional neural network, attained a top-5 error of 15.3% in the ImageNet 2012 Challenge.

2014: Tesla introduced Autopilot in its Model S electric cars. The self-driving system not only worked offline but also parked with precision.

2015: Google launched TensorFlow, which is an open-source and free software library for machine learning. The same year, Google introduced FaceNet for facial recognition.

2016: Pokémon GO, the famous AR-based mobile game, was introduced.

2017: Apple released the iPhone X with the face recognition feature.

2019: The UK High Court permitted the use of automated facial recognition technology to search for people in crowds.

How does computer vision work?

Computer vision starts small and ends big.

It follows a layered processing technique in which it begins with identifying and analyzing low-level features such as pixels and colors. Gradually, it works its way up to analyze higher-level features such as lines and objects.

Suppose you see an image of people running. Even though it’s a still image, in most cases, you’ll be able to understand the context; people are running away from something, running towards something, or running leisurely.

It’s simple for us to understand the emotion and context of images. Computers are still learning the trade, but their pace is impressive for non-biological entities.

For machines, images are just a collection of pixels. Unlike humans, they can’t understand an image’s semantic meaning and can only detect pixels. The goal of computer vision is to bridge that semantic gap.

When light rays hit the retina of our eyes, special cells, called photoreceptors, transform the light into electrical signals. These electrical signals are then sent to the brain through the optic nerve. The brain then converts these signals into the images we see.

This process, up until the electrical signals reach the brain, seems straightforward. How exactly the brain processes these signals and converts them into images isn’t yet fully understood. More precisely, the brain is a black box; so is computer vision.

There are neural networks and other machine learning algorithms that try to mimic the human brain. They make computer vision feasible and help comprehend what the images are about. Even in the case of algorithms, ML researchers aren’t fully aware of how they work. However, since their results are quantifiable, we can judge the accuracy of each algorithm.
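To make "quantifiable" a little more concrete, here is a minimal sketch of the simplest such measure: accuracy, the share of images an algorithm labels correctly. The function name and the sample labels are made up for illustration, and real evaluations usually look at more than one metric.

```python
def accuracy(predictions, ground_truth):
    """Fraction of predictions that match the known correct labels."""
    if not ground_truth or len(predictions) != len(ground_truth):
        raise ValueError("need two non-empty lists of equal length")
    correct = sum(p == t for p, t in zip(predictions, ground_truth))
    return correct / len(ground_truth)


# Hypothetical outputs of a model on four test images.
predicted = ["dog", "cat", "dog", "dog"]
actual = ["dog", "cat", "cat", "dog"]
score = accuracy(predicted, actual)  # 3 of 4 correct
```

Comparing this number across algorithms on the same test images is what lets us rank them, even when we can't fully explain their inner workings.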

Computer vision as a process is explainable, just like human vision. But nobody’s quite sure how neural networks work to comprehend images or whether they’re remotely close to how humans process visual information.

That said, in a simple sense, computer vision is all about pattern recognition. Using machine learning techniques like unsupervised learning, algorithms are trained to recognize patterns in visual data. If you’re wondering how many images are required, it’s millions, or thousands at the very least.

Suppose you want the algorithm to identify dogs in images. If you’re following the unsupervised learning technique, you don’t have to label any images as dogs. Instead, after analyzing thousands or millions of images, the machine learns the specific characteristics of dogs.

In short, a computer can perceive the specific features that make an animal (or object) a dog. It still wouldn’t know that the particular animal is called a “dog”. But it’ll have enough information and experience to determine whether an unlabeled image contains a dog.

If you want the learning process to be faster, you can go for supervised learning. In supervised learning, the images are labeled, which makes the job easier for the algorithms.
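The difference between the two approaches can be sketched with a toy example. Everything here is an assumption made for illustration: each "image" is reduced to a made-up pair of numbers, the unsupervised part is a bare-bones two-group k-means, and the supervised part is a nearest-centroid classifier. Real systems work on pixels and use far larger models.

```python
def distance_sq(p, q):
    """Squared straight-line distance between two feature vectors."""
    return sum((a - b) ** 2 for a, b in zip(p, q))


def centroid(points):
    """Average position of a group of points."""
    n = len(points)
    return tuple(sum(coords) / n for coords in zip(*points))


def nearest(point, centers):
    """Index of the center closest to point."""
    return min(range(len(centers)), key=lambda i: distance_sq(point, centers[i]))


def two_means(points, iterations=10):
    """Unsupervised: split points into two groups without any labels."""
    centers = [points[0], points[-1]]  # naive, deterministic starting guess
    for _ in range(iterations):
        groups = [[], []]
        for p in points:
            groups[nearest(p, centers)].append(p)
        # Keep the old center if a group ends up empty.
        centers = [centroid(g) if g else c for g, c in zip(groups, centers)]
    return [nearest(p, centers) for p in points]


def train_supervised(labeled_points):
    """Supervised: learn one centroid per label from labeled examples."""
    by_label = {}
    for point, label in labeled_points:
        by_label.setdefault(label, []).append(point)
    return {label: centroid(pts) for label, pts in by_label.items()}


def predict(model, point):
    """Return the label whose centroid is closest to point."""
    return min(model, key=lambda label: distance_sq(point, model[label]))


# Hypothetical two-number feature vectors standing in for whole images.
features = [(1.0, 1.2), (0.9, 1.0), (1.1, 0.8), (5.0, 5.2), (5.1, 4.9), (4.8, 5.0)]

# Unsupervised: the groups come out as anonymous numbers (0 and 1).
clusters = two_means(features)

# Supervised: the same data with names attached.
labeled = list(zip(features, ["cat", "cat", "cat", "dog", "dog", "dog"]))
model = train_supervised(labeled)
guess = predict(model, (4.9, 5.1))
```

Note how the unsupervised version separates the two groups but has no name for either, while the supervised version can answer with "dog" because a human supplied that word up front.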

Examining images at the pixel level

When talking about algorithms analyzing images, they aren’t examining the picture as a whole like humans. Instead, they look at individual pixels, which are the smallest addressable elements of a raster image.

For the sake of simplicity, let’s consider a grayscale image. The brightness of each pixel, called its pixel value, is represented by an 8-bit integer with a range of possible values from 0 to 255. Zero is black, and 255 is white. If we’re studying a colored image, things get more intricate.
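As a rough sketch, a grayscale image can be written down as a grid of such integers. The tiny images below are invented for illustration. The color part assumes each pixel is a (red, green, blue) triple and collapses it to one brightness value using the common ITU-R BT.601 luma weights, which is one convention among several.

```python
# A tiny 4x4 grayscale "image": each entry is one pixel's brightness (0-255).
image = [
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [0, 64, 128, 255],
    [0, 64, 128, 255],
]


def is_valid_grayscale(img):
    """Check that every pixel fits in an unsigned 8-bit integer."""
    return all(isinstance(p, int) and 0 <= p <= 255 for row in img for p in row)


def rgb_to_gray(r, g, b):
    """Collapse one RGB pixel to a single brightness value.

    Uses the ITU-R BT.601 luma weights; other weightings exist.
    """
    return round(0.299 * r + 0.587 * g + 0.114 * b)


def to_grayscale(rgb_img):
    """Convert a grid of (r, g, b) triples into a grid of brightness values."""
    return [[rgb_to_gray(*px) for px in row] for row in rgb_img]


# A 2x2 color image: red, green, blue, and white pixels.
color = [
    [(255, 0, 0), (0, 255, 0)],
    [(0, 0, 255), (255, 255, 255)],
]
gray = to_grayscale(color)
```

A color image simply carries three numbers per pixel instead of one, which is where the extra intricacy comes from.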

When we say an algorithm analyzes and learns, it’s actually learning these pixel values. In other words, a computer sees and recognizes images based on such numerical values. This also means that algorithms find patterns in images by looking at their numerical values and compare pictures in a similar way.

In short, for machines, comprehending an image is a mathematical process that involves arrays of integers.
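Here is a minimal sketch of what "comparing pictures numerically" can look like: the average brightness gap between matching pixels of two same-sized grayscale images. This particular measure and the sample images are chosen for illustration only. Trained models compare learned features rather than raw pixels.

```python
def mean_absolute_difference(img_a, img_b):
    """Average per-pixel brightness gap between two same-sized grayscale images."""
    same_shape = len(img_a) == len(img_b) and all(
        len(row_a) == len(row_b) for row_a, row_b in zip(img_a, img_b)
    )
    if not same_shape:
        raise ValueError("images must have the same dimensions")
    total = sum(
        abs(a - b)
        for row_a, row_b in zip(img_a, img_b)
        for a, b in zip(row_a, row_b)
    )
    count = sum(len(row) for row in img_a)
    return total / count


original = [[10, 200], [10, 200]]
slightly_brighter = [[12, 202], [12, 202]]
inverted = [[245, 55], [245, 55]]

small_gap = mean_absolute_difference(original, slightly_brighter)
large_gap = mean_absolute_difference(original, inverted)
```

A small result means the two grids of numbers are nearly the same. A large one means they differ a lot, at least at the pixel level.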

Then there are convolutional neural networks

A convolutional neural network (CNN or ConvNet) is a deep learning algorithm that can extract features from image datasets. CNNs are a category of neural networks and have impressive capabilities for image recognition and classification. Most modern computer vision systems use convolutional neural nets.

Although CNNs were invented back in the 1980s, they weren’t exactly feasible until the introduction of graphics processing units (GPUs). GPUs can significantly accelerate convolutional neural nets and other neural networks. In 2004, a GPU implementation of a neural network was shown to run 20 times faster than an equivalent CPU implementation.

How do CNNs do it?

ConvNets learn from input images and adjust their parameters (weights and biases) to make better predictions. CNNs treat images like matrices and extract spatial information from them, such as edges, depth, and texture. ConvNets do this by using convolutional layers and pooling.
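The two operations named above, convolution and pooling, can be sketched in a few lines of plain Python. This is an illustration under simplifying assumptions: the kernel is hand-picked to respond to vertical edges, whereas a real CNN learns its kernel values during training, and the function names are invented here.

```python
def convolve2d(image, kernel):
    """Slide kernel over image (no padding, stride 1) and sum the products.

    Like most deep learning libraries, this does not flip the kernel,
    so strictly speaking it computes a cross-correlation.
    """
    kh, kw = len(kernel), len(kernel[0])
    out_h = len(image) - kh + 1
    out_w = len(image[0]) - kw + 1
    return [
        [
            sum(
                image[y + i][x + j] * kernel[i][j]
                for i in range(kh)
                for j in range(kw)
            )
            for x in range(out_w)
        ]
        for y in range(out_h)
    ]


def max_pool(feature_map, size=2):
    """Keep the largest value in each non-overlapping size x size window."""
    return [
        [
            max(feature_map[y + i][x + j] for i in range(size) for j in range(size))
            for x in range(0, len(feature_map[0]) - size + 1, size)
        ]
        for y in range(0, len(feature_map) - size + 1, size)
    ]


# 6x6 image: dark left half, bright right half.
image = [[0, 0, 0, 255, 255, 255] for _ in range(6)]

# Hand-picked kernel that responds to dark-to-bright vertical edges.
vertical_edge = [
    [-1, 0, 1],
    [-1, 0, 1],
    [-1, 0, 1],
]

feature_map = convolve2d(image, vertical_edge)  # 4x4 grid of edge responses
pooled = max_pool(feature_map)                  # 2x2 summary of the strongest responses
```

The feature map is zero where the image is flat and large where the dark half meets the bright half. Pooling then shrinks the map while keeping the strongest responses.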

The architecture of a CNN is analogous to the connectivity pattern of neurons in our brains. CNNs were created by taking inspiration from the organization of the visual cortex, which is the region of the brain that receives and processes visual information.

A CNN consists of multiple layers of artificial neurons called perceptrons, which are the mathematical counterparts of our brain’s biological neurons. Perceptrons roughly imitate the workings of their biological counterparts as well.
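A single perceptron is simple enough to sketch directly: it multiplies each input by a weight, adds a bias, and "fires" if the total is positive. The weights below are hand-picked so that the unit behaves like a logical AND. That choice is only for illustration, since a real network learns its weights and typically uses smoother activation functions than this step.

```python
def perceptron(inputs, weights, bias):
    """Weighted sum of the inputs plus a bias, passed through a step function."""
    total = sum(x * w for x, w in zip(inputs, weights)) + bias
    return 1 if total > 0 else 0


# Hand-picked parameters: the unit fires only when both inputs are 1.
and_weights = [1.0, 1.0]
and_bias = -1.5

outputs = {
    (a, b): perceptron([a, b], and_weights, and_bias)
    for a in (0, 1)
    for b in (0, 1)
}
```

Stacking many such units in layers, and letting training adjust the weights and biases, is what turns this tiny building block into a network.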

A convolutional neural net comprises an input layer, multiple hidden layers, and an output layer.

The hidden layers contain:

  • Convolutional layers
  • Rectified linear activation function (ReLU) layers
  • Normalization layers
  • Pooling layers
  • Fully connected layers

Here’s a simple explanation of what they do.

When a CNN processes an image, each of its layers extracts distinct features from the image pixels. The first layer is responsible for detecting basic characteristics such as horizontal and vertical edges.

As you go deeper into the neural network, the layers start detecting complex features such as shapes and corners. The final layers of the convolutional neural network are capable of detecting specific features such as faces, buildings, and places.

The output layer of the convolutional neural net offers a table containing numerical information. This table represents the probability that a particular object was identified in the image.
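One common way to produce such a table of probabilities is the softmax function, sketched below. The article doesn't say which function a given network uses, so treat softmax as an assumption here. The class names and raw scores are also invented for illustration.

```python
import math


def softmax(scores):
    """Turn raw scores into probabilities that are positive and sum to 1."""
    largest = max(scores)
    # Subtracting the largest score avoids overflow in exp() without
    # changing the result.
    exps = [math.exp(s - largest) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


labels = ["cat", "dog", "car"]  # hypothetical class names
raw_scores = [1.0, 3.0, 0.5]    # hypothetical final-layer outputs

probabilities = dict(zip(labels, softmax(raw_scores)))
best_guess = max(probabilities, key=probabilities.get)
```

The label with the highest probability is the network's best guess, and the remaining entries show how much weight it gave the alternatives.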

Examples of computer vision tasks

Computer vision is a field of computer science and AI that enables computers to see. There are numerous methods by which computers can take advantage of this field. These attempts to identify objects or activities in images are called computer vision tasks.

Here are some of the common computer vision tasks.

Image recognition software applications may use just one of these computer vision techniques. Advanced applications like self-driving cars will use several techniques at the same time.

Real-world computer vision applications

Computer vision is already fused into many of the products we use today. Facebook automatically tags people using CV. Google Photos uses it to group images, and software applications like Adobe Lightroom use it to enhance the details of zoomed images. It’s also extensively used for quality control in manufacturing processes that rely on automation.

Here are some more real-world applications of computer vision you might have come across.

Facial recognition

One of the best use cases of computer vision is in the field of facial recognition. It hit the mainstream in 2017 with Apple’s iPhone X model and is now a standard feature in most smartphones.

Facial recognition technology is used as an authentication feature on multiple occasions. Otherwise, it’s used to identify the person, like in the case of Facebook. Law enforcement agencies are known to use facial recognition technology to identify law-breakers in video feeds.

Self-driving cars

Self-driving cars rely heavily on computer vision for real-time image analysis. It helps autonomous vehicles make sense of their surroundings. However, the technology behind such cars is still in its infancy and requires further development before it can be confidently deployed on traffic-filled roads.

Self-driving vehicles are virtually impossible without computer vision. This technology helps autonomous vehicles process visual data in real time. One example of its application is the creation of 3D maps. Along with object identification and classification, computer vision can help create 3D maps to give vehicles a sense of the surroundings.

Vehicle and lane line detection are two other important use cases. Then there’s free space detection, which is well known in the self-driving car realm. As the name suggests, it’s used to determine obstacle-free space around the vehicle. Free space detection is useful when the autonomous vehicle approaches a slow-moving vehicle and needs to change lanes.

Medical imaging

Computer vision is used in the healthcare industry to make faster and more accurate diagnoses and monitor the progression of diseases. Using pattern recognition, doctors can detect early symptoms of diseases like cancer, which might not be visible to the human eye.

Medical imaging is another critical application with a plethora of benefits. Medical imaging analysis cuts down the time it takes for medical professionals to analyze images. Endoscopy, X-ray radiography, ultrasound, and magnetic resonance imaging (MRI) are some of the medical imaging disciplines that use computer vision.

By pairing CNNs with medical imaging, medical professionals can observe internal organs, detect anomalies, and understand the cause and impact of specific diseases. It also helps doctors to monitor the development of diseases and the progress of treatments.

Content moderation

Social media networks like Facebook have to review millions of new posts every day. It’s impractical to have a content moderation team that goes through every image or video posted, and so, computer vision systems are used for automating the process.

Computer vision can help such social media platforms analyze uploaded content and flag those containing banned content. Companies can also use deep learning algorithms for text analysis to identify and block offensive content.


Surveillance

Surveillance video feeds are a solid form of evidence. They can help discover law-breakers and also help security professionals to act before minor concerns become catastrophic.

It’s practically impossible for humans to keep an eye on surveillance footage from multiple sources. But with computer vision, this task is simplified. CV-powered surveillance systems can scan live footage and detect people with suspicious behavior.

Facial recognition can be used to identify wanted criminals and thereby prevent crimes. Image recognition technology can be employed to detect individuals carrying dangerous objects in crowded areas. The same is also used to determine the number of free parking spaces available in malls.

Challenges in computer vision

Helping computers see is more challenging than we thought.

Marvin Minsky was confident that computer vision could be solved by connecting a camera to a computer. Even after decades of research, we’re nowhere near solving the problem. For humans, vision is so effortless. That’s the reason why computer vision was seen as a trivially simple problem and was supposed to be solved over a summer.

Our knowledge is limited

One reason why we aren’t able to fully crack the computer vision problem is our limited knowledge of ourselves. We don’t have a complete understanding of how the human visual system works. Of course, rapid strides are being made in the study of biological vision, but there’s still a long way to go.

The visual world is complex

A challenging problem in the field of CV is the natural complexity of the visual world. An object could be viewed from any angle, under any lighting conditions, and from varying distances. The human optical system is ordinarily capable of viewing and comprehending objects in all such infinite variations, but the capability of machines is still quite limited.

Another limitation is the lack of common sense. Even after years of research, we’re yet to recreate common sense in AI systems. Humans can apply common sense and background knowledge about specific objects to make sense of them. This also allows us to understand the relationship between different entities of an image with ease.

Humans are good at guesswork, at least when compared to computers. It’s easier for us to make a not-so-bad decision, even if we haven’t faced a specific problem before. But the same isn’t true for machines. If they encounter a situation that doesn’t resemble their training examples, they’re prone to act irrationally.

Computer vision algorithms get notably better if you train them with newer visual datasets. But at their core, they’re trying to match pixel patterns. In other words, apart from the knowledge of pixels, they don’t exactly understand what’s happening in the images. But it’s fascinating to think of the wonders CV-powered systems do in self-driving cars.

CV is hardware bound

In computer vision, latency is evil.

In real-world applications like self-driving cars, image processing and analysis must happen almost instantaneously. For example, if an autonomous vehicle traveling at 30 mph detects an obstacle a hundred meters away, it has only about seven seconds to stop or turn safely.

For the car to act on time, the AI system will have to understand the surroundings and make decisions in milliseconds. Since computer vision systems are heavily dependent on hardware components like the camera, a delay of even a fraction of a second in data transmission or computation can cause catastrophic accidents.
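The arithmetic behind that example is easy to check. The sketch below assumes constant speed and ignores braking distance and reaction mechanics, so it is a back-of-the-envelope estimate rather than a safety model. The function names and the 0.1-second delay are illustrative choices.

```python
MPH_TO_MPS = 1609.344 / 3600  # metres per second in one mile per hour


def seconds_to_reach(distance_m, speed_mph):
    """Time until a vehicle at constant speed covers distance_m."""
    return distance_m / (speed_mph * MPH_TO_MPS)


def metres_travelled_during(latency_s, speed_mph):
    """Distance covered while the system is still processing a frame."""
    return latency_s * speed_mph * MPH_TO_MPS


time_budget = seconds_to_reach(100, 30)          # about 7.5 seconds
blind_distance = metres_travelled_during(0.1, 30)  # about 1.3 metres
```

At 30 mph, every tenth of a second of delay means the car has moved well over a metre before it can react, and the gap grows in proportion to speed.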

Narrow AI isn’t enough

Some AI researchers feel that 20/20 computer vision can be achieved only if we unlock artificial general intelligence (AGI). That’s because consciousness seems to play a critical role in the human visual system. We imagine just as much as we see and observe. Our imagination augments the visuals we see and brings a better meaning to them.

Also, visual intelligence is inseparable from intelligence. The ability to process complex thoughts complements our ability to see and comprehend our surroundings.

According to many researchers, learning from millions of images or video feeds downloaded from the internet wouldn’t help much to attain true computer vision. Instead, the AI entity will have to experience it like humans. In other words, narrow AI, the level of artificial intelligence we currently have, isn’t enough.

The timeframe within which we’ll achieve general intelligence is still debatable. Some feel that AGI can be achieved in a few decades. Others suggest it’s a thing of the next century. But the majority of researchers think that AGI is unattainable and will only exist in the science fiction genre.

Achievable or not, there are numerous other ways we can try to unlock true computer vision. Feeding quality and diverse data is one way to do it. This will make sure that systems relying on computer vision technology steer clear of biases.

Finding better ways to magnify the strengths of artificial neural nets, creating powerful GPUs and other needed hardware components, and understanding the human visual system are some ways to advance toward true computer vision.

Gifting vision to machines

The error rates of image recognition models are dramatically dropping. We’ve come a long way from just detecting printed letters to identifying human faces with precision. But there’s a long way to go and many new milestones to conquer. Achieving true computer vision will most likely be one of the keys to creating robots that are as sophisticated and intelligent as humans.

If a process can be digitally executed, machine learning will eventually become a part of it.