Crate #7: Training Your Own AI
From zero to model in (many) easy steps
📋 Prerequisites
The ML Pipeline
Building an AI model follows a standard pipeline. Think of it as a recipe:
1. DEFINE THE PROBLEM – What exactly are you trying to predict or classify? "Is this email spam?" is a clear problem. "Make AI do stuff" is not.
2. COLLECT DATA – Gather examples. The more, the better (usually). If you're building a spam detector, you need thousands of labeled emails: spam and not-spam.
3. PREPARE THE DATA – Clean it up. Remove duplicates, fix errors, handle missing values. This is the boring part. It's also 80% of the work. Seriously. Data scientists spend most of their time cleaning data, not building fancy models.
4. CHOOSE A MODEL – Pick the right architecture. Simple problem? Use a simple model (like logistic regression). Complex images? Use a CNN. Text? Use a transformer. Using a nuclear reactor to toast bread is wasteful. Using a toaster to power a city won't work.
5. TRAIN – Feed the data through the model, adjust weights, repeat. This can take minutes (small model, small data) or months (massive model, massive data).
6. EVALUATE – Test on data the model has NEVER seen before. This is critical. A student who only studies the answers to specific test questions gets 100% on those questions but fails everything else. Same with AI.
7. DEPLOY – Put it in the real world. Monitor it. Things break.
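The whole pipeline fits in a screenful of scikit-learn. Here's a minimal sketch of steps 2 through 6, using the built-in wine dataset as stand-in "collected" data and logistic regression as the simple model (the dataset choice and parameters here are just illustrative):

```python
# Steps 2-6 of the pipeline, compressed: get data, hold out a test set,
# train a simple model, and evaluate on data the model has never seen.
from sklearn.datasets import load_wine
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

X, y = load_wine(return_X_y=True)   # step 2: "collect" (already-clean) data

# step 6 starts here: set aside 25% that the model never trains on
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

model = LogisticRegression(max_iter=5000)   # step 4: choose a simple model
model.fit(X_train, y_train)                 # step 5: train

print(accuracy_score(y_test, model.predict(X_test)))   # step 6: evaluate
```

The `train_test_split` line is the one beginners skip, and it's the one that makes the final accuracy number mean anything.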
Tools You Can Actually Use
You don't need a PhD or a supercomputer to start. Here are tools for beginners:
GOOGLE TEACHABLE MACHINE (teachablemachine.withgoogle.com) No code required. Train image, sound, or pose classifiers in your browser. You can train a model to tell your cat from your dog in about 5 minutes.
SCRATCH + ML extensions If you use Scratch (the block-based programming language), there are ML extensions that let you train classifiers and use them in your projects.
PYTHON + scikit-learn If you're ready for real code, Python is THE language for ML. scikit-learn is a library that makes it easy:
# This trains a spam classifier in 5 lines
from sklearn.naive_bayes import MultinomialNB
model = MultinomialNB()
model.fit(training_emails, training_labels)
predictions = model.predict(new_emails)
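Those five lines assume the emails have already been turned into numbers. A fuller runnable sketch, using `CountVectorizer` for that step on a tiny made-up dataset (the example emails and labels are invented for illustration):

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Tiny invented dataset: 1 = spam, 0 = not spam
emails = [
    "win a free prize now", "claim your free money",
    "meeting moved to 3pm", "lunch tomorrow?",
]
labels = [1, 1, 0, 0]

# Turn text into word-count vectors (models need numbers, not strings)
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(emails)

model = MultinomialNB()
model.fit(X, labels)

new = vectorizer.transform(["free prize money"])
print(model.predict(new))   # prints [1]: looks like spam
```

With only four training emails this is a toy, but the structure — vectorize, fit, predict — is the same at any scale.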
GOOGLE COLAB (colab.research.google.com) Free Jupyter notebooks in the cloud – with free GPU access. You can train neural networks without installing anything.
KAGGLE (kaggle.com) A platform with thousands of datasets and competitions. Great for practice. Some competitions have cash prizes, and newcomers regularly place well.
Common Mistakes Beginners Make
TESTING ON TRAINING DATA – If you train AND test on the same data, your accuracy looks amazing but means nothing. It's like memorizing a test and then taking that exact test. Always keep separate data for testing.
OVERFITTING – Your model learns the training data TOO well. It memorizes noise and quirks instead of learning general patterns. It's like studying only one textbook so hard that you can recite it word-for-word but can't answer questions phrased differently.
UNDERFITTING – Your model is too simple to capture the patterns. Like trying to draw a circle with only straight lines – you'll get a polygon, but it won't be a circle.
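You can watch both failure modes happen with a decision tree, whose `max_depth` setting controls how complex the model is allowed to get. A sketch on scikit-learn's noisy `make_moons` toy data (the sample sizes and noise level are arbitrary choices):

```python
from sklearn.datasets import make_moons
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Noisy two-class toy data; the noise is exactly what an overfit model memorizes
X, y = make_moons(n_samples=400, noise=0.3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

for depth, label in [(1, "underfit"), (None, "overfit")]:
    tree = DecisionTreeClassifier(max_depth=depth, random_state=0)
    tree.fit(X_train, y_train)
    print(label,
          "train:", tree.score(X_train, y_train),
          "test:", tree.score(X_test, y_test))
# Typical result: the unlimited-depth tree scores ~100% on its training data
# but noticeably worse on the test set; the depth-1 tree is mediocre on both.
```

The gap between the training score and the test score is the telltale sign of overfitting; two low scores close together suggest underfitting.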
IGNORING DATA QUALITY – Throwing more data at a bad model doesn't help. Neither does throwing more computing power at bad data. Fix the data first.
OVER-ENGINEERING – Starting with the most complex model instead of the simplest. Always try a simple model first. Sometimes linear regression (basically drawing a line through points) is all you need. If it works, don't complicate it.
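"Drawing a line through points" really is one line of NumPy. A sketch on made-up data generated from y = 2x + 1 (the slope and intercept are invented for illustration):

```python
import numpy as np

# Made-up data that lies exactly on the line y = 2x + 1
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = 2 * x + 1

# Fit a degree-1 polynomial, i.e. a straight line;
# it recovers slope 2 and intercept 1 (up to floating-point rounding)
slope, intercept = np.polyfit(x, y, 1)
print(slope, intercept)
```

If a model this small solves your problem, there is no reason to reach for a neural network.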
NOT UNDERSTANDING THE PROBLEM – Building a model before clearly defining what "success" looks like. 95% accuracy sounds great until you realize the dataset is 95% one class (like 95% of emails are NOT spam) – a model that always predicts "not spam" gets 95% accuracy while being completely useless.
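You can see that trap in a few lines, with no machine learning at all (the numbers are invented to match the 95/5 split described above):

```python
# A dataset that is 95% "not spam" (0) and 5% "spam" (1)
labels = [0] * 95 + [1] * 5

# A "model" that ignores its input and always says "not spam"
predictions = [0] * len(labels)

correct = sum(p == t for p, t in zip(predictions, labels))
print(correct / len(labels))   # prints 0.95, yet it catches zero spam
```

This is why, on imbalanced data, people look at metrics like precision and recall instead of accuracy alone.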
🤔 Think About It
- Why is it important to test your model on data it has never seen before?
- If you were building an AI to grade essays, what would be your training data? What problems might arise?
- What's the simplest AI model you could build with items in your room right now?
🔬 Try This
- Go to Google Teachable Machine and train a model to recognize 3 hand gestures (like thumbs up, peace sign, and wave). How many examples do you need before it works well?
- If you know any Python, try Google Colab. Load the Iris flower dataset (it's built into scikit-learn) and train a classifier. The whole thing is ~10 lines of code.
- Create a paper-based 'decision tree' classifier. Write down yes/no questions that lead to identifying different animals. That's literally a machine learning algorithm!
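That paper exercise translates directly to code. Here's a sketch of a hand-built animal classifier (the questions and animals are invented for illustration; a real decision-tree learner would pick questions like these automatically from data):

```python
def classify_animal(has_feathers: bool, lives_in_water: bool, has_fur: bool) -> str:
    # Each if-statement is one yes/no question on the paper tree
    if has_feathers:
        return "bird"
    if lives_in_water:
        return "fish"
    if has_fur:
        return "mammal"
    return "reptile"

print(classify_animal(has_feathers=False, lives_in_water=False, has_fur=True))
# prints: mammal
```

The whole art of decision-tree learning is choosing which question to ask first so the tree stays short.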
📚 Go Deeper
🎯 Fun Fact
One of the first commercial uses of a neural network was reading handwritten zip codes on mail envelopes for the US Postal Service, starting in 1989. The network was designed by Yann LeCun, who later became the chief AI scientist at Meta, and it could sort mail faster and more accurately than manual reading.
📝 Quick Quiz
1. What is 'overfitting'?
2. Why should you NEVER test a model on the same data you trained it on?
3. What percentage of a data scientist's time is typically spent on data cleaning?
