Why Training Data Matters
Your classifier is only as good as its training data. Garbage in, garbage out. But "good" training data isn't just about quantity—it's about quality, diversity, and relevance.
The Fundamentals
Clear Label Definitions
Before collecting any data, define exactly what each label means.
Bad definitions:
- Positive: Good reviews
- Negative: Bad reviews
Better definitions:
- Positive: Customer expresses satisfaction, would recommend, or explicitly praises the product
- Negative: Customer expresses dissatisfaction, regret, or explicitly criticizes the product
- Neutral: Factual description without clear positive/negative sentiment, or mixed opinions that balance out
Mutually Exclusive Categories
Each example should belong to exactly one category (for single-label classification). If examples regularly fit multiple categories, do one of the following:
- Merge overlapping categories
- Create a new combined category
- Switch to multi-label classification
Representative Examples
Training data should mirror what your classifier will see in production.
If you're classifying customer emails:
- Include emails of varying lengths
- Cover different writing styles (formal, casual, angry, polite)
- Include typos and grammatical errors
- Represent all customer segments
How Much Data?
Minimum Viable Dataset
| Classification Type | Examples per Class |
|---|---|
| Binary (2 classes) | 50-100 |
| Multi-class (3-5) | 100-200 |
| Multi-class (6-10) | 200-500 |
| Fine-grained (10+) | 500+ |
These are minimums. More data generally helps, but with diminishing returns.
When Synthetic Data Helps
TinyModels generates training data using Claude. This works well when:
- You have clear category definitions but limited real examples
- You need to bootstrap a classifier quickly
- Real data is expensive or slow to collect
- You want to augment minority classes
Synthetic data works best when:
- Combined with even small amounts of real data
- Generated with specific domain context
- Reviewed before training
Data Quality Checklist
Consistency
Multiple labelers should agree on labels. If two people would label the same example differently, your definitions need work.
To test this:
- Have 2-3 people label the same 50 examples independently
- Calculate the agreement rate
- Discuss disagreements to refine definitions
Target: 90%+ agreement on clear-cut examples.
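To put a number on it, here is a small sketch using pandas (the file name and labeler column names are placeholders for your own pilot labels):

```python
import pandas as pd

# One row per example, one column per labeler
# (file name and column names are placeholders for your own pilot data).
df = pd.read_csv("labeling_pilot.csv")

# Simple pairwise agreement rate between two labelers.
agreement = (df["labeler_a"] == df["labeler_b"]).mean()
print(f"Agreement rate: {agreement:.1%}")

# The disagreements are where your label definitions need refining.
disagreements = df[df["labeler_a"] != df["labeler_b"]]
print(disagreements[["text", "labeler_a", "labeler_b"]])
```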
Coverage
Your training data should cover the full range of inputs your classifier will see.
Check for:
- Edge cases (very short/long texts)
- Different formats (questions, statements, lists)
- Various tones (formal, casual, urgent)
- All relevant topics/products/domains
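A quick way to surface obvious gaps is to audit label counts and text lengths before training. This won't catch every item on the list above, but it is a cheap first pass; a minimal pandas sketch (assumes a CSV with text and label columns, and the file name is a placeholder):

```python
import pandas as pd

# Assumes a CSV with "text" and "label" columns (file name is a placeholder).
df = pd.read_csv("training_data.csv")

# Class distribution: are any labels badly underrepresented?
print(df["label"].value_counts())

# Text length distribution: are very short and very long inputs covered?
lengths = df["text"].str.len()
print(lengths.describe())
print("Shortest example:", df.loc[lengths.idxmin(), "text"])
print("Longest length:", lengths.max())
```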
Balance
Class imbalance is common but manageable.
Example: Support ticket urgency
- Normal: 850 examples
- High: 120 examples
- Critical: 30 examples
Solutions:
- Oversample minority classes: Duplicate critical examples 5-10x (see the sketch after this list)
- Generate synthetic examples: Ask TinyModels to generate more critical ticket examples
- Weighted training: TinyModels automatically weights rare classes higher
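If you want to oversample yourself rather than rely on the automatic weighting, a minimal pandas sketch (the file name and floor value are placeholders to tune for your own data):

```python
import pandas as pd

df = pd.read_csv("tickets.csv")  # columns: text, label (placeholder file name)

# Bring rare classes up to a floor; 200 is roughly 5-10x the "critical"
# count in the example above, and is a placeholder to adjust per dataset.
floor = 200
parts = []
for label, group in df.groupby("label"):
    if len(group) < floor:
        group = group.sample(floor, replace=True, random_state=42)
    parts.append(group)

balanced = pd.concat(parts).sample(frac=1, random_state=42)  # shuffle
print(balanced["label"].value_counts())
```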
Cleanliness
Remove or fix:
- Duplicate examples
- Mislabeled examples (spot-check your data)
- Irrelevant content (headers, signatures, boilerplate)
- Extreme outliers that don't represent real usage
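A minimal cleanup pass along these lines, sketched with pandas (the file name and length thresholds are placeholders; spot-checking labels still has to be done by hand):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # columns: text, label (placeholder)

# Drop exact duplicates (same text and label).
df = df.drop_duplicates(subset=["text", "label"])

# Flag conflicting labels: identical text labeled two different ways
# usually means a labeling mistake or an unclear definition.
conflicts = df[df.duplicated(subset=["text"], keep=False)]
print(f"{len(conflicts)} rows share text but disagree on label")

# Drop extreme outliers that don't represent real usage.
lengths = df["text"].str.len()
df = df[(lengths >= 5) & (lengths <= 5000)]
```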
Preparing Your Data
CSV Format
TinyModels accepts CSV with two columns:
text,label
"This product is amazing! Best purchase ever.",positive
"Broke after one week. Total waste of money.",negative
"Arrived on time. Works as described.",neutral
Text Preprocessing
Generally, keep preprocessing minimal. Modern models handle:
- Capitalization variations
- Punctuation differences
- Common misspellings
Do preprocess (see the sketch after these lists):
- Remove personally identifiable information (PII)
- Strip email headers/signatures if not relevant
- Normalize extreme whitespace
Don't preprocess:
- Don't lowercase (capitalization carries meaning)
- Don't remove punctuation ("!!!!!" means something different from ".")
- Don't stem/lemmatize (modern models understand word forms)
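A light preprocessing pass might look like the sketch below. The regexes are illustrative, not a complete PII scrubber, and the signature delimiter is an assumption about your email format; adapt both to your own data:

```python
import re

def light_clean(text: str) -> str:
    # Mask obvious PII patterns; illustrative, not exhaustive.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", text)

    # Strip a trailing signature block if one is clearly delimited by "--".
    text = re.split(r"\n--\s*\n", text)[0]

    # Normalize extreme whitespace, but keep casing and punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(light_clean("Contact me at jane@example.com!!!   Thanks,\n-- \nJane"))
```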
Iterating on Your Data
The Training Loop
1. Train with the initial dataset
2. Evaluate on held-out test data
3. Analyze errors (where does the model fail? see the sketch after these steps)
4. Improve the data (add examples that address failures)
5. Retrain and repeat
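The analyze-errors step usually comes down to a confusion matrix plus reading the misclassified examples. A sketch with scikit-learn (the texts, labels, and predictions shown are placeholders for your own evaluation output):

```python
from sklearn.metrics import classification_report, confusion_matrix

# y_true: labels from your held-out test set; y_pred: model predictions.
# These lists are placeholders for your own evaluation output.
y_true = ["urgent", "normal", "urgent", "normal", "normal"]
y_pred = ["normal", "normal", "urgent", "normal", "urgent"]
texts = ["Payment failed!", "How do I export?", "Site is down",
         "Feature request", "Please reset my password"]

print(confusion_matrix(y_true, y_pred, labels=["normal", "urgent"]))
print(classification_report(y_true, y_pred))

# The misclassified examples tell you which data to add next.
for text, t, p in zip(texts, y_true, y_pred):
    if t != p:
        print(f"expected {t}, got {p}: {text}")
```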
Common Issues and Fixes
Model confuses two classes:
- Add more examples that highlight the distinction
- Clarify label definitions
- Consider merging if truly indistinguishable
Model fails on specific input type:
- Add more examples of that type
- Check if training data includes that format/style
High accuracy in evaluation but poor results in production:
- Training data doesn't match production distribution
- Collect more representative real-world examples
Using TinyModels' AI Generation
Describe, Don't Just Label
When asking TinyModels to generate training data, provide context:
Basic: "Generate examples for 'urgent' support tickets"
Better: "Generate 'urgent' support ticket examples. Urgent tickets involve: payment failures, account lockouts, security concerns, or service outages affecting business operations. Include varied writing styles and urgency indicators."
Review Generated Data
Always preview synthetic data before training:
- Check that examples match your mental model of each category
- Look for patterns that might cause overfitting
- Edit or remove examples that seem off
- Request regeneration if quality is low
Combine Real and Synthetic
The best results come from mixing:
- 20-30% real labeled examples (for calibration)
- 70-80% synthetic examples (for coverage and volume)
Real data anchors the model; synthetic data provides diversity.
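If you assemble the mix yourself, here is a rough sketch of hitting that ratio with pandas (file names are placeholders, and the exact 25/75 split isn't critical):

```python
import pandas as pd

real = pd.read_csv("real_labeled.csv")    # placeholder file names,
synthetic = pd.read_csv("synthetic.csv")  # both with text,label columns

# Aim for roughly 25% real / 75% synthetic by capping the synthetic side
# at three times the real data.
target_synth = min(len(synthetic), 3 * len(real))
mixed = pd.concat([real, synthetic.sample(target_synth, random_state=42)])
mixed = mixed.sample(frac=1, random_state=42)  # shuffle before training
mixed.to_csv("training_data.csv", index=False)
print(mixed.shape, f"{len(real) / len(mixed):.0%} real")
```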
Final Checklist
Before training, verify:
- Labels are clearly defined and mutually exclusive
- Each class has sufficient examples
- Data represents real-world input distribution
- Class imbalance is addressed
- Data is clean (no duplicates, consistent formatting)
- Synthetic data has been reviewed
- A held-out test set exists for evaluation
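For the last item, a stratified split keeps rare classes represented on both sides. A sketch with scikit-learn (file names are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")  # columns: text, label (placeholder)

# Hold out 20% for evaluation; stratify so rare classes appear in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
print(len(train_df), "train /", len(test_df), "test")
```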
Good training data is an investment. The time you spend on data quality pays dividends in model accuracy.


