Training the Multi-class Classifier

Training teaches the NaiveBayes engine which text belongs to which category. Categories are created automatically the first time you train a sample under a new category name.

Basic training

$nb->train(string $text, string $category): void

Category names are arbitrary strings — use whatever makes sense for your domain:

$nb->train('PHP is a web programming language', 'tech');
$nb->train('The stock market fell today', 'finance');
$nb->train('The election results were announced', 'politics');

Untraining

Remove a previously trained sample from the model:

$nb->untrain(string $text, string $category): void

$nb->untrain('PHP is a web programming language', 'tech');

Token counts are decremented proportionally. If a token count reaches zero, the token is removed. If the document count for a category reaches zero, the category is removed.
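The bookkeeping described above can be pictured with a small toy model. This is only a sketch, not the library's implementation: the `ToyCounts` class, its properties, and its tokenizer are hypothetical, and it assumes "proportionally" means that each token's count drops by the number of times the token occurs in the untrained text.

```php
<?php
// Hypothetical toy model of the train/untrain bookkeeping; not the library's code.
final class ToyCounts
{
    /** @var array<string, array<string|int, int>> token counts per category */
    public array $tokens = [];

    /** @var array<string, int> document counts per category */
    public array $docs = [];

    /** Assumed tokenizer: lowercase, then split on non-word characters. */
    private function tokenize(string $text): array
    {
        return preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    }

    public function train(string $text, string $category): void
    {
        $this->docs[$category] = ($this->docs[$category] ?? 0) + 1;
        foreach (array_count_values($this->tokenize($text)) as $token => $n) {
            $this->tokens[$category][$token] = ($this->tokens[$category][$token] ?? 0) + $n;
        }
    }

    public function untrain(string $text, string $category): void
    {
        if (!isset($this->docs[$category])) {
            return; // nothing was trained under this category
        }
        foreach (array_count_values($this->tokenize($text)) as $token => $n) {
            $left = ($this->tokens[$category][$token] ?? 0) - $n;
            if ($left > 0) {
                $this->tokens[$category][$token] = $left;
            } else {
                unset($this->tokens[$category][$token]); // count reached zero
            }
        }
        if (--$this->docs[$category] <= 0) {
            // No documents left: drop the category and its tokens.
            unset($this->docs[$category], $this->tokens[$category]);
        }
    }
}

$toy = new ToyCounts();
$toy->train('PHP is a web programming language', 'tech');
$toy->train('PHP runs on the web server', 'tech');

$toy->untrain('PHP is a web programming language', 'tech');
// 'php' and 'web' drop from 2 to 1; 'programming' is removed entirely.

$toy->untrain('PHP runs on the web server', 'tech');
// 'tech' has no documents left, so the category itself is gone.
```

In this sketch, untraining a text that was never trained would still decrement counts, so callers are expected to pass the same text and category they originally trained.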

Category lifecycle

Operation	Effect
First `train()` for a new category name	Category is created automatically
All samples untrained from a category	Category disappears from `classify()` results
`classify()` with no trained categories	Returns empty array `[]`

Correcting a mistake

// $text was trained as 'tech' by mistake: remove it there, then train it under the correct category
$nb->untrain($text, 'tech');
$nb->train($text, 'science');

Training data guidelines

Principle	Guidance
Balance	Similar sample counts across all categories reduce bias
Diversity	Varied phrasing generalises better than repeated phrases
Minimum	At least a few samples per category before classifying
Text length	Longer texts provide more signal; very short texts (1-2 words) are less reliable
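One way to check the balance guideline before training is to count samples per category. The sketch below is illustrative only: the `samplesPerCategory()` helper and the `[text, category]` pair layout are assumptions made for this example, not part of the library.

```php
<?php
/**
 * Count samples per category in a list of [text, category] pairs.
 * Hypothetical helper, independent of the classifier itself.
 */
function samplesPerCategory(array $samples): array
{
    // Column 1 of each pair holds the category name.
    return array_count_values(array_column($samples, 1));
}

$samples = [
    ['PHP is a web programming language', 'tech'],
    ['The stock market fell today', 'finance'],
    ['The election results were announced', 'politics'],
    ['Rust compiles to native code', 'tech'],
];

$perCategory = samplesPerCategory($samples);

// Ratio between the largest and smallest category; values near 1 mean balanced data.
$ratio = max($perCategory) / min($perCategory);
```

What ratio counts as acceptable depends on the data set; the number is only a quick signal that one category may dominate.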

Token counting

The lexer extracts tokens from the text (see Lexer concepts), and every occurrence of a token is counted. A word that appears three times in one training text contributes a count of 3 for that token, not 1, so frequently repeated words carry more weight.
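The effect of repeated words can be seen with a stand-in tokenizer. This is a minimal sketch: the lowercase-and-split rule below is an assumption, and the library's lexer may normalize or filter tokens differently.

```php
<?php
// Assumed stand-in for the lexer: lowercase, then split on non-word characters.
$text = 'spam spam eggs spam';
$tokens = preg_split('/\W+/', strtolower($text), -1, PREG_SPLIT_NO_EMPTY);

// Count every occurrence, not just distinct tokens.
$counts = array_count_values($tokens);

print_r($counts); // 'spam' => 3, 'eggs' => 1
```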
