Crate #5: Computer Vision - Teaching Machines to See
Pixels, patterns, and why Snapchat filters work
Prerequisites
How Computers See
You see a sunset. A computer sees a grid of numbers.
Every digital image is a grid of pixels. Each pixel stores three numbers: how much Red, Green, and Blue light to mix (0-255 each). A 1920x1080 HD image is 1920 x 1080 pixels x 3 channels = about 6.2 million numbers. A computer doesn't "see" the sunset; it processes those 6 million numbers.
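You can check this arithmetic yourself with a few lines of NumPy. This is a minimal sketch: the `(height, width, channels)` array shape and `uint8` type used here are one common convention for storing images, not the only one.

```python
import numpy as np

# A full-HD "image": 1080 rows x 1920 columns x 3 color channels (R, G, B).
# Every value is 0-255, so uint8 (one byte per number) is enough.
image = np.zeros((1080, 1920, 3), dtype=np.uint8)

print(image.size)   # total numbers the computer processes: 6,220,800
print(image[0, 0])  # one pixel: its red, green, and blue values
```

That single `image.size` number is what "a computer sees a grid of numbers" means in practice.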
The hard part isn't reading pixels. It's understanding what the pixels MEAN. Your brain does this effortlessly: you recognize your friend's face from any angle, in any lighting, even if they got a haircut. This is incredibly hard for computers.
Computer vision is the field of making computers understand images and video. It's everywhere: unlocking your phone with your face, self-driving cars seeing the road, Instagram filters tracking your face, doctors using AI to spot tumors in X-rays.
Convolutional Neural Networks (CNNs)
The breakthrough for computer vision was a special type of neural network called a Convolutional Neural Network, or CNN.
Instead of looking at the entire image at once (way too many numbers), CNNs slide a small window across the image, looking at tiny patches. Imagine holding a magnifying glass and scanning it across a photo, inch by inch.
Each filter (window) detects one specific pattern:
- Early filters detect simple things: edges, corners, color gradients
- Middle filters combine those into textures and shapes
- Later filters recognize complex things: eyes, wheels, text
This mirrors how your visual cortex works! Your brain processes vision in stages too, from simple shapes to complex objects.
The word "convolution" sounds scary but it just means "sliding a small filter across the image and computing a value at each position." It's like running a stamp across a page.
The breakthrough moment: In 2012, a CNN called AlexNet crushed the competition in the ImageNet challenge (1.2 million images, 1000 categories) by a huge margin. Before AlexNet, computer vision had been improving by tiny increments for years. AlexNet improved accuracy by more than all the previous years combined. This was the moment deep learning went mainstream.
Beyond Just Recognizing: What CV Can Do
IMAGE CLASSIFICATION - "This is a photo of a golden retriever." (What is it?)
OBJECT DETECTION - "There are 3 cars, 2 pedestrians, and 1 stop sign in this image, and here's where each one is." (What is it AND where is it?)
SEGMENTATION - Color-coding every single pixel: "These pixels are sky, these are road, these are car." (What is every part of the image?)
FACE RECOGNITION - "This face matches Person #4392 in the database." (Who is it?)
POSE ESTIMATION - "This person's left arm is raised, right knee is bent." (How is the body positioned?) This is how Snapchat knows where to put dog ears on your face.
IMAGE GENERATION - "Here's a new image of a cat wearing a top hat on the moon that never existed before." (Create something new.) This is where AI art tools like DALL-E and Midjourney come in. We'll cover this more later.
Think About It
- Self-driving cars need to work in rain, snow, night, and fog. Why is this much harder than recognizing objects in clear daylight?
- Facial recognition technology can identify people in crowds. What are the benefits AND risks of this technology?
- If an AI can generate fake photos of real people that look completely real, what problems could this cause?
Try This
- Take the same photo of an object in 5 different lighting conditions. See how different the pixels look; this is why computer vision is hard!
- Try Google Lens on your phone (or Google Image search on desktop). Upload unusual objects and see if it can identify them.
- Draw a simple 5x5 grid on paper. Fill each cell with a number (0 = white, 1 = black). Have a friend try to guess what you drew. This is how low-resolution images work.
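The last exercise works just as well in code. Here's a sketch of it in Python (the particular grid values, which spell out a letter, are my own example): the numbers mean nothing on their own, but rendering them as light and dark cells makes the picture pop out.

```python
# A 5x5 "image" as numbers: 0 = white, 1 = black
grid = [
    [1, 1, 1, 1, 1],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 0],
    [1, 0, 0, 0, 0],
    [1, 1, 1, 1, 1],
]

def render(grid):
    """Turn the number grid back into a picture using characters."""
    return "\n".join("".join("█" if px else "·" for px in row) for row in grid)

print(render(grid))  # can your friend spot the letter "E"?
```

Scale the same idea up to 1920x1080 with three color channels per cell and you have a real photo.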
Go Deeper
Fun Fact
The ImageNet dataset was created by Professor Fei-Fei Li and her team, who used Amazon Mechanical Turk to get 49,000 workers from 167 countries to hand-label 14 million images. The project took 2.5 years. When she first proposed the idea, many colleagues said it was pointless. It ended up being one of the most important datasets in AI history.
Quick Quiz
1. How does a computer 'see' an image?
2. What was special about AlexNet in 2012?
3. What's the difference between object detection and image classification?
