Crate #5: Computer Vision - Teaching Machines to See
Pixels, patterns, and why Snapchat filters work
Prerequisites
How Computers See
You see a sunset. A computer sees a grid of numbers.
Every digital image is a grid of pixels. Each pixel stores three numbers: how much Red, Green, and Blue light to mix (0-255 each). A 1920x1080 HD image is 1920 x 1080 pixels x 3 channels = about 6.2 million numbers. A computer doesn't "see" the sunset; it processes those 6 million numbers.
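You can check this arithmetic yourself with a few lines of NumPy. This is a minimal sketch: the `(height, width, channels)` array shape and `uint8` type used here are one common convention for storing images, not the only one.

```python
import numpy as np

# A full-HD "image": 1080 rows x 1920 columns x 3 color channels (R, G, B).
# Every value is 0-255, so uint8 (one byte per number) is enough.
image = np.zeros((1080, 1920, 3), dtype=np.uint8)

print(image.size)   # total numbers the computer processes: 6,220,800
print(image[0, 0])  # one pixel: its red, green, and blue values
```

That single `image.size` number is what "a computer sees a grid of numbers" means in practice.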
The hard part isn't reading pixels. It's understanding what the pixels MEAN. Your brain does this effortlessly: you recognize your friend's face from any angle, in any lighting, even if they got a haircut. This is incredibly hard for computers.
Computer vision is the field of making computers understand images and video. It's everywhere: unlocking your phone with your face, self-driving cars seeing the road, Instagram filters tracking your face, doctors using AI to spot tumors in X-rays.
Convolutional Neural Networks (CNNs)
The breakthrough for computer vision was a special type of neural network called a Convolutional Neural Network, or CNN.
Instead of looking at the entire image at once (way too many numbers), CNNs slide a small window across the image, looking at tiny patches. Imagine holding a magnifying glass and scanning it across a photo, inch by inch.
Each filter (window) detects one specific pattern:
- Early filters detect simple things: edges, corners, color gradients
- Middle filters combine those into textures and shapes
- Later filters recognize complex things: eyes, wheels, text
This mirrors how your visual cortex works! Your brain processes vision in stages too, from simple shapes to complex objects.
The word "convolution" sounds scary but it just means "sliding a small filter across the image and computing a value at each position." It's like running a stamp across a page.
The breakthrough moment: In 2012, a CNN called AlexNet crushed the competition in the ImageNet challenge (1.2 million images, 1000 categories) by a huge margin. Before AlexNet, computer vision had been improving by tiny increments for years. AlexNet improved accuracy by more than all the previous years combined. This was the moment deep learning went mainstream.
Beyond Just Recognizing: What CV Can Do
IMAGE CLASSIFICATION - "This is a photo of a golden retriever." (What is it?)
OBJECT DETECTION - "There are 3 cars, 2 pedestrians, and 1 stop sign in this image, and here's where each one is." (What is it AND where is it?)
SEGMENTATION - Color-coding every single pixel: "These pixels are sky, these are road, these are car." (What is every part of the image?)
FACE RECOGNITION - "This face matches Person #4392 in the database." (Who is it?)
POSE ESTIMATION - "This person's left arm is raised, right knee is bent." (How is the body positioned?) This is how Snapchat knows where to put dog ears on your face.
IMAGE GENERATION - "Here's a new image of a cat wearing a top hat on the moon that never existed before." (Create something new.) This is where AI art tools like DALL-E and Midjourney come in. We'll cover this more later.
Think About It
- Self-driving cars need to work in rain, snow, night, and fog. Why is this much harder than recognizing objects in clear daylight?
- Facial recognition technology can identify people in crowds. What are the benefits AND risks of this technology?
- If an AI can generate fake photos of real people that look completely real, what problems could this cause?
Try This
- Take the same photo of an object in 5 different lighting conditions. See how different the pixels look; this is why computer vision is hard!
- Try Google Lens on your phone (or Google Image search on desktop). Upload unusual objects and see if it can identify them.
- Draw a simple 5x5 grid on paper. Fill each cell with a number (0 = white, 1 = black). Have a friend try to guess what you drew. This is how low-resolution images work.
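The last exercise works just as well in code. Here's a sketch of it in Python (the particular grid values, which spell out a letter, are my own example): the numbers mean nothing on their own, but rendering them as light and dark cells makes the picture pop out.

```python
# A 5x5 "image" as numbers: 0 = white, 1 = black
grid = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

def render(grid):
    """Turn the number grid back into a picture using characters."""
    return "\n".join("".join("█" if px else "·" for px in row) for row in grid)

print(render(grid))  # can your friend spot the letter "E"?
```

Scale the same idea up to 1920x1080 with three color channels per cell and you have a real photo.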
Go Deeper
Fun Fact
The ImageNet dataset was created by Professor Fei-Fei Li and her team, who used Amazon Mechanical Turk to get 49,000 workers from 167 countries to hand-label 14 million images. The project took 2.5 years. When she first proposed the idea, many colleagues said it was pointless. It ended up being one of the most important datasets in AI history.
Quick Quiz
1. How does a computer 'see' an image?
2. What was special about AlexNet in 2012?
3. What's the difference between object detection and image classification?
