Why Training Data Matters
Your classifier is only as good as its training data. Garbage in, garbage out. But "good" training data isn't just about quantity—it's about quality, diversity, and relevance.
The Fundamentals
Clear Label Definitions
Before collecting any data, define exactly what each label means.
Bad definitions:
- Positive: Good reviews
- Negative: Bad reviews
Better definitions:
- Positive: Customer expresses satisfaction, would recommend, or explicitly praises the product
- Negative: Customer expresses dissatisfaction, regret, or explicitly criticizes the product
- Neutral: Factual description without clear positive/negative sentiment, or mixed opinions that balance out
Mutually Exclusive Categories
Each example should belong to exactly one category (for single-label classification). If examples regularly fit multiple categories, do one of the following:
- Merge overlapping categories
- Create a new combined category
- Switch to multi-label classification
Representative Examples
Training data should mirror what your classifier will see in production.
If you're classifying customer emails:
- Include emails of varying lengths
- Cover different writing styles (formal, casual, angry, polite)
- Include typos and grammatical errors
- Represent all customer segments
How Much Data?
Minimum Viable Dataset
| Classification Type | Examples per Class |
|---|---|
| Binary (2 classes) | 50-100 |
| Multi-class (3-5) | 100-200 |
| Multi-class (6-10) | 200-500 |
| Fine-grained (10+) | 500+ |
These are minimums. More data generally helps, but with diminishing returns.
When Synthetic Data Helps
TinyModels generates training data using Claude. This works well when:
- You have clear category definitions but limited real examples
- You need to bootstrap a classifier quickly
- Real data is expensive or slow to collect
- You want to augment minority classes
Synthetic data works best when:
- Combined with even small amounts of real data
- Generated with specific domain context
- Reviewed before training
Data Quality Checklist
Consistency
Multiple labelers should agree on labels. If two people would label the same example differently, your definitions need work.
To test this:
- Have 2-3 people label the same 50 examples independently
- Calculate the agreement rate
- Discuss disagreements to refine definitions
Target: 90%+ agreement on clear-cut examples.
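To put a number on it, here is a small sketch using pandas (the file name and labeler column names are placeholders for your own pilot labels):

```python
import pandas as pd

# One row per example, one column per labeler
# (file name and column names are placeholders for your own pilot data).
df = pd.read_csv("labeling_pilot.csv")

# Simple pairwise agreement rate between two labelers.
agreement = (df["labeler_a"] == df["labeler_b"]).mean()
print(f"Agreement rate: {agreement:.1%}")

# The disagreements are where your label definitions need refining.
disagreements = df[df["labeler_a"] != df["labeler_b"]]
print(disagreements[["text", "labeler_a", "labeler_b"]])
```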
Coverage
Your training data should cover the full range of inputs your classifier will see.
Check for:
- Edge cases (very short/long texts)
- Different formats (questions, statements, lists)
- Various tones (formal, casual, urgent)
- All relevant topics/products/domains
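A quick way to surface obvious gaps is to audit label counts and text lengths before training. This won't catch every item on the list above, but it is a cheap first pass; a minimal pandas sketch (assumes a CSV with text and label columns, and the file name is a placeholder):

```python
import pandas as pd

# Assumes a CSV with "text" and "label" columns (file name is a placeholder).
df = pd.read_csv("training_data.csv")

# Class distribution: are any labels badly underrepresented?
print(df["label"].value_counts())

# Text length distribution: are very short and very long inputs covered?
lengths = df["text"].str.len()
print(lengths.describe())
print("Shortest example:", df.loc[lengths.idxmin(), "text"])
print("Longest length:", lengths.max())
```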
Balance
Class imbalance is common but manageable.
Example: Support ticket urgency
- Normal: 850 examples
- High: 120 examples
- Critical: 30 examples
Solutions:
- Oversample minority classes: Duplicate critical examples 5-10x (see the sketch after this list)
- Generate synthetic examples: Ask TinyModels to generate more critical ticket examples
- Weighted training: TinyModels automatically weights rare classes higher
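If you want to oversample yourself rather than rely on the automatic weighting, a minimal pandas sketch (the file name and floor value are placeholders to tune for your own data):

```python
import pandas as pd

df = pd.read_csv("tickets.csv")  # columns: text, label (placeholder file name)

# Bring rare classes up to a floor; 200 is roughly 5-10x the "critical"
# count in the example above, and is a placeholder to adjust per dataset.
floor = 200
parts = []
for label, group in df.groupby("label"):
    if len(group) < floor:
        group = group.sample(floor, replace=True, random_state=42)
    parts.append(group)

balanced = pd.concat(parts).sample(frac=1, random_state=42)  # shuffle
print(balanced["label"].value_counts())
```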
Cleanliness
Remove or fix:
- Duplicate examples
- Mislabeled examples (spot-check your data)
- Irrelevant content (headers, signatures, boilerplate)
- Extreme outliers that don't represent real usage
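A minimal cleanup pass along these lines, sketched with pandas (the file name and length thresholds are placeholders; spot-checking labels still has to be done by hand):

```python
import pandas as pd

df = pd.read_csv("training_data.csv")  # columns: text, label (placeholder)

# Drop exact duplicates (same text and label).
df = df.drop_duplicates(subset=["text", "label"])

# Flag conflicting labels: identical text labeled two different ways
# usually means a labeling mistake or an unclear definition.
conflicts = df[df.duplicated(subset=["text"], keep=False)]
print(f"{len(conflicts)} rows share text but disagree on label")

# Drop extreme outliers that don't represent real usage.
lengths = df["text"].str.len()
df = df[(lengths >= 5) & (lengths <= 5000)]
```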
Preparing Your Data
CSV Format
TinyModels accepts CSV with two columns:
text,label
"This product is amazing! Best purchase ever.",positive
"Broke after one week. Total waste of money.",negative
"Arrived on time. Works as described.",neutral
Text Preprocessing
Generally, keep preprocessing minimal. Modern models handle:
- Capitalization variations
- Punctuation differences
- Common misspellings
Do preprocess (see the sketch after these lists):
- Remove personally identifiable information (PII)
- Strip email headers/signatures if not relevant
- Normalize extreme whitespace
Don't preprocess:
- Don't lowercase (capitalization carries meaning)
- Don't remove punctuation ("!!!!!" means something different from ".")
- Don't stem/lemmatize (modern models understand word forms)
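A light preprocessing pass might look like the sketch below. The regexes are illustrative, not a complete PII scrubber, and the signature delimiter is an assumption about your email format; adapt both to your own data:

```python
import re

def light_clean(text: str) -> str:
    # Mask obvious PII patterns; illustrative, not exhaustive.
    text = re.sub(r"[\w.+-]+@[\w-]+\.[\w.]+", "[EMAIL]", text)
    text = re.sub(r"\+?\d[\d\s().-]{7,}\d", "[PHONE]", text)

    # Strip a trailing signature block if one is clearly delimited by "--".
    text = re.split(r"\n--\s*\n", text)[0]

    # Normalize extreme whitespace, but keep casing and punctuation.
    text = re.sub(r"\s+", " ", text).strip()
    return text

print(light_clean("Contact me at jane@example.com!!!   Thanks,\n-- \nJane"))
```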
Iterating on Your Data
The Training Loop
1. Train with the initial dataset
2. Evaluate on held-out test data
3. Analyze errors (where does the model fail? see the sketch after these steps)
4. Improve the data (add examples that address failures)
5. Retrain and repeat
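The analyze-errors step usually comes down to a confusion matrix plus reading the misclassified examples. A sketch with scikit-learn (the texts, labels, and predictions shown are placeholders for your own evaluation output):

```python
from sklearn.metrics import classification_report, confusion_matrix

# y_true: labels from your held-out test set; y_pred: model predictions.
# These lists are placeholders for your own evaluation output.
y_true = ["urgent", "normal", "urgent", "normal", "normal"]
y_pred = ["normal", "normal", "urgent", "normal", "urgent"]
texts = ["Payment failed!", "How do I export?", "Site is down",
         "Feature request", "Please reset my password"]

print(confusion_matrix(y_true, y_pred, labels=["normal", "urgent"]))
print(classification_report(y_true, y_pred))

# The misclassified examples tell you which data to add next.
for text, t, p in zip(texts, y_true, y_pred):
    if t != p:
        print(f"expected {t}, got {p}: {text}")
```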
Common Issues and Fixes
Model confuses two classes:
- Add more examples that highlight the distinction
- Clarify label definitions
- Consider merging if truly indistinguishable
Model fails on specific input type:
- Add more examples of that type
- Check if training data includes that format/style
High accuracy in evaluation but poor results in production:
- Training data doesn't match production distribution
- Collect more representative real-world examples
Using TinyModels' AI Generation
Describe, Don't Just Label
When asking TinyModels to generate training data, provide context:
Basic: "Generate examples for 'urgent' support tickets"
Better: "Generate 'urgent' support ticket examples. Urgent tickets involve: payment failures, account lockouts, security concerns, or service outages affecting business operations. Include varied writing styles and urgency indicators."
Review Generated Data
Always preview synthetic data before training:
- Check that examples match your mental model of each category
- Look for patterns that might cause overfitting
- Edit or remove examples that seem off
- Request regeneration if quality is low
Combine Real and Synthetic
The best results come from mixing:
- 20-30% real labeled examples (for calibration)
- 70-80% synthetic examples (for coverage and volume)
Real data anchors the model; synthetic data provides diversity.
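If you assemble the mix yourself, here is a rough sketch of hitting that ratio with pandas (file names are placeholders, and the exact 25/75 split isn't critical):

```python
import pandas as pd

real = pd.read_csv("real_labeled.csv")    # placeholder file names,
synthetic = pd.read_csv("synthetic.csv")  # both with text,label columns

# Aim for roughly 25% real / 75% synthetic by capping the synthetic side
# at three times the real data.
target_synth = min(len(synthetic), 3 * len(real))
mixed = pd.concat([real, synthetic.sample(target_synth, random_state=42)])
mixed = mixed.sample(frac=1, random_state=42)  # shuffle before training
mixed.to_csv("training_data.csv", index=False)
print(mixed.shape, f"{len(real) / len(mixed):.0%} real")
```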
Final Checklist
Before training, verify:
- Labels are clearly defined and mutually exclusive
- Each class has sufficient examples
- Data represents real-world input distribution
- Class imbalance is addressed
- Data is clean (no duplicates, consistent formatting)
- Synthetic data has been reviewed
- A held-out test set exists for evaluation
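For the last item, a stratified split keeps rare classes represented on both sides. A sketch with scikit-learn (file names are placeholders):

```python
import pandas as pd
from sklearn.model_selection import train_test_split

df = pd.read_csv("training_data.csv")  # columns: text, label (placeholder)

# Hold out 20% for evaluation; stratify so rare classes appear in both splits.
train_df, test_df = train_test_split(
    df, test_size=0.2, stratify=df["label"], random_state=42
)
train_df.to_csv("train.csv", index=False)
test_df.to_csv("test.csv", index=False)
print(len(train_df), "train /", len(test_df), "test")
```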
Good training data is an investment. The time you spend on data quality pays dividends in model accuracy.


