freshcrate
⛽

Crate #3: Data, the Fuel of AI

Garbage in, garbage out (seriously)

🌱 Starter · ⏱ ~12 min
data · datasets · bias · privacy

Why Data Is Everything

If AI is a race car, data is the fuel. The best engine in the world is useless without fuel, and the best algorithm is useless without good data.

Here's the uncomfortable truth: most AI breakthroughs in the last decade weren't because someone invented a brilliant new algorithm. They happened because someone got access to MORE DATA and FASTER COMPUTERS. The algorithms we use today are often surprisingly similar to ones invented in the 1980s. They just didn't work back then because we didn't have enough data or computing power.

How much data? The GPT models were trained on hundreds of billions of words, essentially a significant chunk of the internet. Image recognition models train on millions of labeled photos. Self-driving cars use petabytes (millions of gigabytes) of driving footage.

Good Data vs. Bad Data

GOOD DATA is:
• Accurate: the labels are correct (that photo really IS a cat)
• Representative: it covers the full range of what you'll encounter
• Clean: no duplicates, no corruption, no irrelevant junk
• Sufficient: enough examples to learn meaningful patterns
• Ethical: collected with consent, respecting privacy

BAD DATA is:
• Biased: only represents some groups or viewpoints
• Noisy: full of errors and mislabeled examples
• Stale: outdated information presented as current
• Sparse: not enough examples for the AI to learn from
• Invasive: scraped without permission, violating privacy
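To make "clean" and "sufficient" a little more concrete, here is a minimal Python sketch of a data check. Everything in it is made up for illustration: the file names, the labels, and the `quality_report` helper are hypothetical, and a real project would check far more than this.

```python
from collections import Counter

# Hypothetical toy dataset of (image_file, label) pairs. All names are invented.
rows = [
    ("cat_01.jpg", "cat"),
    ("cat_02.jpg", "cat"),
    ("cat_02.jpg", "cat"),  # exact duplicate of the row above
    ("dog_01.jpg", "dog"),
    ("dog_02.jpg", ""),     # label is missing
    ("cat_03.jpg", "cat"),
]

def quality_report(rows):
    """Count exact duplicates, missing labels, and usable examples per label."""
    seen = set()
    duplicates = 0
    missing_labels = 0
    per_label = Counter()
    for row in rows:
        if row in seen:
            duplicates += 1
            continue  # skip repeats so they are not counted twice
        seen.add(row)
        _, label = row
        if not label.strip():
            missing_labels += 1
        else:
            per_label[label] += 1
    return {
        "duplicates": duplicates,
        "missing_labels": missing_labels,
        "per_label": dict(per_label),
    }

report = quality_report(rows)
print(report)
```

On this toy list the report finds one duplicate, one missing label, and three usable cats against a single usable dog. That lopsided count is a small hint that the data may be too sparse for dogs.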

The classic example: an AI system trained to screen job resumes was found to discriminate against women. Why? Because it was trained on 10 years of hiring data from a male-dominated industry. The AI learned that male-sounding resumes were "better" because historically, more men were hired. The data was technically accurate; it just reflected decades of human bias.
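One rough way to spot this kind of problem before training is to compare outcomes across groups in the historical data. The sketch below assumes a tiny invented hiring history. The numbers, the group names, and the `hire_rates` helper are all hypothetical and are not taken from the real case.

```python
# Hypothetical hiring history as (group, was_hired) pairs. The numbers are invented.
history = (
    [("men", True)] * 80 + [("men", False)] * 120
    + [("women", True)] * 5 + [("women", False)] * 45
)

def hire_rates(records):
    """Return the fraction of applicants hired in each group."""
    totals = {}
    hired = {}
    for group, was_hired in records:
        totals[group] = totals.get(group, 0) + 1
        hired[group] = hired.get(group, 0) + int(was_hired)
    return {group: hired[group] / totals[group] for group in totals}

rates = hire_rates(history)
for group, rate in rates.items():
    print(f"{group}: {rate:.0%} of applicants hired")

# How much of the data comes from each group at all?
share_women = sum(1 for group, _ in history if group == "women") / len(history)
print(f"women are {share_women:.0%} of all records")
```

In this made-up history, women are only a fifth of the records and were hired at a quarter of the rate of men. A model trained to imitate these decisions would have every reason to copy that gap. A check like this does not prove the data is fair or unfair, but it tells you where to look.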

Where Does Training Data Come From?

SCRAPED FROM THE INTERNET: Common Crawl is a nonprofit that regularly crawls the public web and publishes what it collects, billions of pages at a time. Many large language models train on some filtered version of this. Yes, this includes Reddit posts, Wikipedia articles, blog rants, and probably some truly terrible fan fiction.

CURATED DATASETS: Researchers carefully collect and label data. ImageNet contains 14 million hand-labeled images. MNIST has 70,000 handwritten digits. These are the "benchmarks" that AI researchers use to compare their models.

SYNTHETIC DATA: Sometimes you generate fake data that looks real. Need 100,000 X-ray images but only have 5,000? An AI can generate more. This sounds circular (using AI to train AI), and honestly, it kind of is.
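Real synthetic-data tools use generative models, which are far too big for a short example. The sketch below swaps in a much simpler stand-in: it makes new data points by adding a little random noise to real ones. The `make_synthetic` function, the sample points, and the noise level are all assumptions chosen for illustration.

```python
import random

def make_synthetic(real_points, copies_per_point, noise=0.05, seed=0):
    """Create jittered copies of real (x, y) points.

    This is a simple noise-based stand-in, not a generative AI model.
    """
    rng = random.Random(seed)  # fixed seed so repeated runs give the same points
    synthetic = []
    for x, y in real_points:
        for _ in range(copies_per_point):
            synthetic.append((x + rng.gauss(0, noise), y + rng.gauss(0, noise)))
    return synthetic

# Hypothetical measurements standing in for "real" data.
real = [(1.0, 2.0), (1.5, 2.5), (3.0, 1.0)]
fake = make_synthetic(real, copies_per_point=4)

print(len(real), "real points ->", len(fake), "synthetic points")
```

The catch is visible even here: every fake point sits right next to a real one, so the synthetic set cannot contain anything the original data missed. If the real data is biased or sparse, the copies inherit the same problem.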

USER-GENERATED: Every time you solve a CAPTCHA ("click all the traffic lights"), you're labeling data for Google. Every time you correct a voice assistant, you're improving its training data. You are an unpaid AI trainer. You're welcome, Silicon Valley.

YOUR OWN DATA: For specific tasks, companies collect their own. A hospital might use its own patient records (anonymized!) to train a diagnostic AI.

🤔 Think About It

  1. If you were training an AI to recognize different types of pizza, what problems might you run into with your training data?
  2. Should companies be allowed to scrape your social media posts to train AI? Why or why not?
  3. How would you test whether your dataset is biased?

🔬 Try This

  1. Take 30 photos of things in your room. Try to label them into categories. How hard is it to make consistent labels?
  2. Look at the MNIST dataset online (it's just handwritten numbers). Can YOU tell what some of the messy ones say? That's the labeling problem.
  3. Count how many times in one day you generate data that could theoretically be used to train AI.

🎯 Fun Fact

In 2016, Microsoft launched a chatbot called Tay on Twitter. Within hours, internet trolls had taught it to post incredibly offensive things by flooding it with toxic messages, and Microsoft shut it down less than a day after launch. The lesson: your AI is only as good (or as terrible) as the data people feed it.

📝 Quick Quiz

1. What does 'garbage in, garbage out' mean in the context of AI?

2. Why did an AI hiring tool discriminate against women?

3. What is 'synthetic data'?
