Training Data Best Practices: How to Get High-Accuracy Classifiers

What You'll Learn

Understand what makes good training data
Learn optimal dataset sizes for different use cases
Handle class imbalance effectively
Use AI-generated data to bootstrap your classifier
Iterate and improve model accuracy over time

Why Training Data Matters

Your classifier is only as good as its training data. Garbage in, garbage out. But "good" training data isn't just about quantity—it's about quality, diversity, and relevance.

The Fundamentals

Clear Label Definitions

Before collecting any data, define exactly what each label means.

Bad definitions:

  • Positive: Good reviews
  • Negative: Bad reviews

Better definitions:

  • Positive: Customer expresses satisfaction, would recommend, or explicitly praises the product
  • Negative: Customer expresses dissatisfaction, regret, or explicitly criticizes the product
  • Neutral: Factual description without clear positive/negative sentiment, or mixed opinions that balance out

Mutually Exclusive Categories

Each example should belong to exactly one category (for single-label classification). If examples regularly fit multiple categories, take one of these approaches:

  1. Merge overlapping categories
  2. Create a new combined category
  3. Switch to multi-label classification

Representative Examples

Training data should mirror what your classifier will see in production.

If you're classifying customer emails:

  • Include emails of varying lengths
  • Cover different writing styles (formal, casual, angry, polite)
  • Include typos and grammatical errors
  • Represent all customer segments

How Much Data?

Minimum Viable Dataset

Classification Type    Examples per Class
Binary (2 classes)     50-100
Multi-class (3-5)      100-200
Multi-class (6-10)     200-500
Fine-grained (10+)     500+

These are minimums. More data generally helps, but with diminishing returns.
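
If you want a quick sanity check, a few lines of pandas can compare your per-class counts against these floors. This is a sketch: the file name is a placeholder, and it assumes the two-column (text,label) CSV format described later in this guide.

# Sketch: check per-class counts against the minimums in the table above.
# "training_data.csv" is a placeholder path.
import pandas as pd

df = pd.read_csv("training_data.csv")
counts = df["label"].value_counts()
print(counts)

# Pick the floor from the table based on the number of classes
n = counts.size
floor = 50 if n == 2 else 100 if n <= 5 else 200 if n <= 10 else 500

for label, count in counts.items():
    if count < floor:
        print(f"'{label}': {count} examples (recommended minimum: {floor})")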

When Synthetic Data Helps

TinyModels generates training data using Claude. This works well when:

  • You have clear category definitions but limited real examples
  • You need to bootstrap a classifier quickly
  • Real data is expensive or slow to collect
  • You want to augment minority classes

Synthetic data works best when:

  • Combined with even small amounts of real data
  • Generated with specific domain context
  • Reviewed before training

Data Quality Checklist

Consistency

Multiple labelers should agree on labels. If two people would label the same example differently, your definitions need work.

To test this:

  1. Have 2-3 people label the same 50 examples independently
  2. Calculate the agreement rate
  3. Discuss disagreements to refine your definitions

Target: 90%+ agreement on clear-cut examples.
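
One way to compute this is scikit-learn's Cohen's kappa alongside the raw agreement rate; kappa corrects for agreement that would happen by chance. The sketch below assumes two labelers' label lists aligned by example (the sample values are illustrative).

# Sketch: measure inter-labeler agreement on a shared batch of examples.
from sklearn.metrics import cohen_kappa_score

labels_a = ["positive", "negative", "neutral", "positive"]   # labeler 1
labels_b = ["positive", "negative", "positive", "positive"]  # labeler 2

raw_agreement = sum(a == b for a, b in zip(labels_a, labels_b)) / len(labels_a)
kappa = cohen_kappa_score(labels_a, labels_b)  # corrects for chance agreement

print(f"Raw agreement: {raw_agreement:.0%}")   # target: 90%+ on clear-cut cases
print(f"Cohen's kappa: {kappa:.2f}")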

Coverage

Your training data should cover the full range of inputs your classifier will see.

Check for:

  • Edge cases (very short/long texts)
  • Different formats (questions, statements, lists)
  • Various tones (formal, casual, urgent)
  • All relevant topics/products/domains
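
A length distribution is a cheap first check for the short/long edge cases. The sketch below uses pandas; the file name is a placeholder and the thresholds are illustrative, not recommendations.

# Sketch: surface length edge cases in the training set.
import pandas as pd

df = pd.read_csv("training_data.csv")  # placeholder path
lengths = df["text"].str.len()
print(lengths.describe())  # min/max reveal very short or very long texts

# Illustrative thresholds; adjust to your domain
print(f"{(lengths < 20).sum()} texts under 20 chars, {(lengths > 2000).sum()} over 2000")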

Balance

Class imbalance is common but manageable.

Example: Support ticket urgency

  • Normal: 850 examples
  • High: 120 examples
  • Critical: 30 examples

Solutions:

  1. Oversample minority classes: Duplicate critical examples 5-10x
  2. Generate synthetic examples: Ask TinyModels to generate more critical ticket examples
  3. Weighted training: TinyModels automatically weights rare classes higher
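
Option 1 can be done in a few lines of pandas: sample each class up to the majority class size, duplicating minority rows. A minimal sketch, assuming the (text,label) CSV format and a placeholder file name:

# Sketch: oversample every class up to the majority class size.
import pandas as pd

df = pd.read_csv("training_data.csv")  # placeholder path
target = df["label"].value_counts().max()

# Sampling with replacement duplicates minority-class rows as needed
balanced = pd.concat(
    [g.sample(n=target, replace=True, random_state=42) for _, g in df.groupby("label")],
    ignore_index=True,
)
print(balanced["label"].value_counts())  # every class now has `target` rows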

Cleanliness

Remove or fix:

  • Duplicate examples
  • Mislabeled examples (spot-check your data)
  • Irrelevant content (headers, signatures, boilerplate)
  • Extreme outliers that don't represent real usage
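
The first item lends itself to a script; mislabeled examples and outliers still need human spot-checks. A sketch with pandas, normalizing whitespace first so near-identical duplicates collapse (the file name is a placeholder):

# Sketch: drop duplicate examples before training.
import pandas as pd

df = pd.read_csv("training_data.csv")  # placeholder path

# Normalize whitespace so trivially different copies count as duplicates
df["text"] = df["text"].str.strip().str.replace(r"\s+", " ", regex=True)

before = len(df)
df = df.drop_duplicates(subset="text", keep="first")
print(f"Removed {before - len(df)} duplicates; {len(df)} examples remain")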

Preparing Your Data

CSV Format

TinyModels accepts CSV with two columns:

text,label
"This product is amazing! Best purchase ever.",positive
"Broke after one week. Total waste of money.",negative
"Arrived on time. Works as described.",neutral

Text Preprocessing

Generally, keep preprocessing minimal. Modern models handle:

  • Capitalization variations
  • Punctuation differences
  • Common misspellings

Do preprocess:

  • Remove personally identifiable information (PII)
  • Strip email headers/signatures if not relevant
  • Normalize extreme whitespace

Don't preprocess:

  • Don't lowercase (capitalization carries meaning)
  • Don't remove punctuation ("!!!!!" means something different from ".")
  • Don't stem/lemmatize (modern models understand word forms)
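
A minimal cleaner that applies only the "do" list might look like the sketch below. The signature pattern is an illustrative guess and should be tuned to your domain; PII removal is omitted because it is domain-specific.

# Sketch: light preprocessing that leaves case, punctuation, and word forms alone.
import re

def light_clean(text: str) -> str:
    # Strip a trailing "-- " email signature block (pattern is a guess; tune per domain)
    text = re.sub(r"\n--\s*\n.*", "", text, flags=re.DOTALL)
    # Normalize extreme whitespace without touching case or punctuation
    text = re.sub(r"[ \t]+", " ", text)
    text = re.sub(r"\n{3,}", "\n\n", text)
    return text.strip()

print(light_clean("Thanks!!!   It works.\n\n\n\n-- \nJane Doe\nACME Corp"))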

Iterating on Your Data

The Training Loop

  1. Train with initial dataset
  2. Evaluate on held-out test data
  3. Analyze errors (where does the model fail?)
  4. Improve data (add examples that address failures)
  5. Retrain and repeat
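
Steps 1 and 2 of the loop, sketched with scikit-learn as a stand-in for your training stack (this is not TinyModels' API, and the file name is a placeholder). The important part is holding out a stratified test split before training.

# Sketch: train on a split, evaluate on held-out data.
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline

df = pd.read_csv("training_data.csv")  # placeholder path
train_texts, test_texts, train_labels, test_labels = train_test_split(
    df["text"], df["label"], test_size=0.2, random_state=42, stratify=df["label"]
)

model = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
model.fit(train_texts, train_labels)
print(f"Held-out accuracy: {model.score(test_texts, test_labels):.2%}")
# Next: analyze the errors (step 3), add targeted examples (step 4), retrain.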

Common Issues and Fixes

Model confuses two classes:

  • Add more examples that highlight the distinction
  • Clarify label definitions
  • Consider merging if truly indistinguishable

Model fails on specific input type:

  • Add more examples of that type
  • Check if training data includes that format/style

High accuracy but wrong in production:

  • Training data doesn't match production distribution
  • Collect more representative real-world examples
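
For the first two failure modes, a confusion matrix shows exactly which class pairs get mixed up. This continues the scikit-learn sketch above, reusing model, test_texts, and test_labels.

# Sketch: error analysis on the held-out set from the previous snippet.
from sklearn.metrics import confusion_matrix

preds = model.predict(test_texts)
classes = sorted(set(test_labels))
print(classes)
print(confusion_matrix(test_labels, preds, labels=classes))

# Print misclassified examples to guide which data to add next
for text, true, pred in zip(test_texts, test_labels, preds):
    if true != pred:
        print(f"[{true} -> {pred}] {text[:80]}")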

Using TinyModels' AI Generation

Describe, Don't Just Label

When asking TinyModels to generate training data, provide context:

Basic: "Generate examples for 'urgent' support tickets"

Better: "Generate 'urgent' support ticket examples. Urgent tickets involve: payment failures, account lockouts, security concerns, or service outages affecting business operations. Include varied writing styles and urgency indicators."

Review Generated Data

Always preview synthetic data before training:

  1. Check that examples match your mental model of each category
  2. Look for patterns that might cause overfitting
  3. Edit or remove examples that seem off
  4. Request regeneration if quality is low

Combine Real and Synthetic

The best results come from mixing:

  • 20-30% real labeled examples (for calibration)
  • 70-80% synthetic examples (for coverage and volume)

Real data anchors the model; synthetic data provides diversity.
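
Mechanically, the mix is a concatenate-and-shuffle. The sketch below caps synthetic data at roughly four times the real data to land near a 20/80 split; both file names are placeholders, and both CSVs use the (text,label) format.

# Sketch: combine real and synthetic examples at roughly a 20/80 ratio.
import pandas as pd

real = pd.read_csv("real_examples.csv")            # placeholder path
synthetic = pd.read_csv("synthetic_examples.csv")  # placeholder path

# Cap synthetic data at ~4x the real data to keep roughly a 20/80 split
synthetic = synthetic.sample(n=min(len(synthetic), 4 * len(real)), random_state=42)

combined = pd.concat([real, synthetic], ignore_index=True)
combined = combined.sample(frac=1, random_state=42)  # shuffle
combined.to_csv("combined_training_data.csv", index=False)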

Final Checklist

Before training, verify:

  • Labels are clearly defined and mutually exclusive
  • Each class has sufficient examples
  • Data represents real-world input distribution
  • Class imbalance is addressed
  • Data is clean (no duplicates, consistent formatting)
  • Synthetic data has been reviewed
  • A held-out test set exists for evaluation

Good training data is an investment. The time you spend on data quality pays dividends in model accuracy.
