Crate #3: Data – The Fuel of AI
Garbage in, garbage out (seriously)
Prerequisites
Why Data Is Everything
If AI is a race car, data is the fuel. The best engine in the world is useless without fuel, and the best algorithm is useless without good data.
Here's the uncomfortable truth: most AI breakthroughs in the last decade weren't because someone invented a brilliant new algorithm. They happened because someone got access to MORE DATA and FASTER COMPUTERS. The algorithms we use today are often surprisingly similar to ones invented in the 1980s – they just didn't work back then because we didn't have enough data or computing power.
How much data? The GPT models were trained on hundreds of billions of words – essentially a significant chunk of the internet. Image recognition models train on millions of labeled photos. Self-driving cars use petabytes (millions of gigabytes) of driving footage.
Good Data vs. Bad Data
GOOD DATA is:
- Accurate – the labels are correct (that photo really IS a cat)
- Representative – it covers the full range of what you'll encounter
- Clean – no duplicates, no corruption, no irrelevant junk
- Sufficient – enough examples to learn meaningful patterns
- Ethical – collected with consent, respecting privacy

BAD DATA is:
- Biased – only represents some groups or viewpoints
- Noisy – full of errors and mislabeled examples
- Stale – outdated information presented as current
- Sparse – not enough examples for the AI to learn from
- Invasive – scraped without permission, violating privacy
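What "cleaning" actually means can be made concrete. Here's a minimal sketch of a cleaning pass that drops duplicates, empty examples, and unknown labels – the records, field names, and label set are all invented for illustration:

```python
# Toy cleaning pass: deduplicate, drop empty or mislabeled examples.
# The records and label set below are made up for illustration.
raw = [
    {"text": "A photo of a cat", "label": "cat"},
    {"text": "A photo of a cat", "label": "cat"},   # exact duplicate
    {"text": "", "label": "dog"},                   # empty -> junk
    {"text": "A photo of a dog", "label": "dgo"},   # typo in the label
    {"text": "A photo of a dog", "label": "dog"},
]
VALID_LABELS = {"cat", "dog"}

def clean(records):
    seen = set()
    kept = []
    for r in records:
        if not r["text"]:                    # noisy: no content
            continue
        if r["label"] not in VALID_LABELS:   # mislabeled
            continue
        key = (r["text"], r["label"])
        if key in seen:                      # duplicate
            continue
        seen.add(key)
        kept.append(r)
    return kept

print(len(clean(raw)))  # 2 examples survive out of 5
```

Real cleaning pipelines are far messier (near-duplicates, conflicting labels, corrupted files), but the shape is the same: a series of filters between the raw data and the model.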
The classic example: an AI system trained to screen job resumes was found to discriminate against women. Why? Because it was trained on 10 years of hiring data from a male-dominated industry. The AI learned that male-sounding resumes were "better" because historically, more men were hired. The data was technically accurate โ it just reflected decades of human bias.
Where Does Training Data Come From?
SCRAPED FROM THE INTERNET – Common Crawl is a nonprofit that regularly downloads the entire public web. Most large language models train on some version of this. Yes, this includes Reddit posts, Wikipedia articles, blog rants, and probably some truly terrible fan fiction.
CURATED DATASETS – Researchers carefully collect and label data. ImageNet contains 14 million hand-labeled images. MNIST has 70,000 handwritten digits. These are the "benchmarks" that AI researchers use to compare their models.
SYNTHETIC DATA – Sometimes you generate fake data that looks real. Need 100,000 X-ray images but only have 5,000? An AI can generate more. This sounds circular (using AI to train AI), and honestly, it kind of is.
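One simple flavor of synthetic data is augmentation: take the real examples you have and generate plausible variations of them. A minimal sketch – the measurements and noise level are invented, not from any real dataset:

```python
import random

# Augmentation sketch: expand a small set of real measurements into a
# larger synthetic set by adding small random jitter. Values are invented.
real_samples = [36.6, 37.1, 36.9, 38.2, 36.5]  # e.g. body temperatures (C)

def augment(samples, copies_per_sample=20, noise=0.1, seed=0):
    rng = random.Random(seed)
    synthetic = []
    for value in samples:
        for _ in range(copies_per_sample):
            # Each synthetic point is a real point plus a small perturbation.
            synthetic.append(value + rng.uniform(-noise, noise))
    return synthetic

data = real_samples + augment(real_samples)
print(len(data))  # 5 real + 100 synthetic = 105
```

The circularity warning from the paragraph above applies here too: the synthetic points can only ever reflect patterns already present in the 5 real ones.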
USER-GENERATED – Every time you solve a CAPTCHA ("click all the traffic lights"), you're labeling data for Google. Every time you correct a voice assistant, you're improving its training data. You are an unpaid AI trainer. You're welcome, Silicon Valley.
YOUR OWN DATA – For specific tasks, companies collect their own. A hospital might use its own patient records (anonymized!) to train a diagnostic AI.
Think About It
- If you were training an AI to recognize different types of pizza, what problems might you run into with your training data?
- Should companies be allowed to scrape your social media posts to train AI? Why or why not?
- How would you test whether your dataset is biased?
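The last question has a mechanical first step: count. A minimal sketch (toy labels and a made-up threshold) that flags any class making up too small a share of the dataset:

```python
from collections import Counter

# Toy bias check: flag classes that are badly under-represented.
# The labels and the 10% threshold are invented for illustration.
labels = ["cat"] * 90 + ["dog"] * 8 + ["bird"] * 2

def underrepresented(labels, min_share=0.10):
    counts = Counter(labels)
    total = len(labels)
    return sorted(cls for cls, n in counts.items() if n / total < min_share)

print(underrepresented(labels))  # ['bird', 'dog']
```

Counting class balance is only a first pass: the resume example above had plenty of data, and the bias was in what the examples contained, not in how many there were of each label.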
Try This
- Take 30 photos of things in your room. Try to label them into categories. How hard is it to make consistent labels?
- Look at the MNIST dataset online (it's just handwritten numbers). Can YOU tell what some of the messy ones say? That's the labeling problem.
- Count how many times in one day you generate data that could theoretically be used to train AI.
Go Deeper
Fun Fact
In 2016, Microsoft launched a chatbot called Tay on Twitter. Within 24 hours, internet trolls had taught it to post incredibly offensive things by flooding it with toxic data. Microsoft shut it down in less than a day. The lesson: your AI is only as good (or as terrible) as the data people feed it.
Quick Quiz
1. What does 'garbage in, garbage out' mean in the context of AI?
2. Why did an AI hiring tool discriminate against women?
3. What is 'synthetic data'?
