Training the Multi-class Classifier

Training teaches the NaiveBayes engine which text belongs to which category. Categories are created automatically the first time you train a sample under a new category name.

Basic training

$nb->train(string $text, string $category): void

Category names are arbitrary strings — use whatever makes sense for your domain:

$nb->train('PHP is a web programming language', 'tech');
$nb->train('The stock market fell today', 'finance');
$nb->train('The election results were announced', 'politics');

Untraining

Remove a previously trained sample from the model:

$nb->untrain(string $text, string $category): void
$nb->untrain('PHP is a web programming language', 'tech');

Token counts are decremented proportionally. If a token count reaches zero, the token is removed. If the document count for a category reaches zero, the category is removed.
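Because untraining reverses the bookkeeping that training performed, a matched train/untrain pair on the same text and category returns the model to its previous state. A sketch using only the documented calls:

```php
// Train and then untrain the identical text under the identical category.
$nb->train('PHP is a web programming language', 'tech');
$nb->untrain('PHP is a web programming language', 'tech');

// If 'tech' held only this one sample, its document count is now zero,
// so the 'tech' category itself has been removed from the model.
```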

Category lifecycle

Operation                              | Effect
First train() for a new category name  | Category is created automatically
All samples untrained from a category  | Category disappears from classify() results
classify() with no trained categories  | Returns empty array []
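The full lifecycle can be traced with the documented calls. The constructor name below is an assumption for illustration; the train/untrain/classify behaviour follows the table above:

```php
$nb = new NaiveBayes();                    // assumed constructor; no categories yet
$nb->classify('anything');                 // returns [] — nothing trained

$nb->train('PHP is great', 'tech');        // 'tech' is created automatically
$nb->untrain('PHP is great', 'tech');      // last 'tech' sample removed,
                                           // so the category disappears
$nb->classify('anything');                 // returns [] again
```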

Correcting a mistake

// Trained under wrong category
$nb->untrain($text, 'tech');
$nb->train($text, 'science');

Training data guidelines

Principle   | Guidance
Balance     | Similar sample counts across all categories reduce bias
Diversity   | Varied phrasing generalises better than repeated phrases
Minimum     | Train at least a few samples per category before classifying
Text length | Longer texts provide more signal; very short texts (1-2 words) are less reliable
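One simple way to follow the balance guideline is to train from equally sized sample lists per category. A sketch (the sample texts are illustrative):

```php
$samples = [
    'tech'     => ['PHP is a web programming language', 'The new laptop has a fast CPU'],
    'finance'  => ['The stock market fell today', 'Interest rates rose this quarter'],
    'politics' => ['The election results were announced', 'Parliament passed the new bill'],
];

// Two samples per category keeps the class priors balanced.
foreach ($samples as $category => $texts) {
    foreach ($texts as $text) {
        $nb->train($text, $category);
    }
}
```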

Token counting

The lexer extracts tokens from the text (see Lexer concepts). Each unique token is counted. A word that appears three times in one training text contributes count = 3 for that token, not count = 1. This means frequently repeated words carry more weight.
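A minimal sketch of per-occurrence counting. This is not the library's lexer (which may normalise case, strip punctuation, and so on); it only illustrates the counting rule:

```php
// Naive whitespace tokenisation, purely to show that repeats accumulate.
$tokens = preg_split('/\s+/', strtolower('spam spam spam eggs'));
$counts = array_count_values($tokens);
// $counts is ['spam' => 3, 'eggs' => 1]:
// 'spam' contributes count = 3 to the model, not count = 1.
```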