Classifying Text

classify() returns a ClassificationResult object on success, or a string error code on failure.

use ByJG\TextClassifier\ClassificationResult;

$result = $classifier->classify($text);

if (!($result instanceof ClassificationResult)) {
    // handle error code string
}

echo $result->choice;  // 'spam' or 'ham'
echo $result->score;   // float 0.0–1.0 (spam probability)

ClassificationResult fields

Field	Type	Description
`choice`	`string`	`'spam'` or `'ham'`
`score`	`float`	Spam probability `0.0`–`1.0` (final, after any LLM retraining)
`scores`	`array<string, float>`	`['spam' => …, 'ham' => …]` final scores
`statScores`	`array<string, float>`	Raw statistical scores before any LLM escalation
`llmDecision`	`string\|null`	LLM's label if it was consulted, otherwise `null`
`escalated`	`bool`	`true` when the LLM was invoked

Interpreting the score

Score range	Interpretation
`0.0 – 0.2`	Strong ham
`0.2 – 0.4`	Likely ham
`0.4 – 0.6`	Uncertain
`0.6 – 0.8`	Likely spam
`0.8 – 1.0`	Strong spam

0.5 means the filter has no opinion — either no training data exists, or the tokens in the text are equally distributed between ham and spam.

Choosing a threshold

use ByJG\TextClassifier\ClassificationResult;

$result = $classifier->classify($text);
if (!($result instanceof ClassificationResult)) {
    // handle error
}

// Aggressive spam filtering (minimise false negatives)
if ($result->score > 0.7) {
    rejectMessage();
}

// Conservative (minimise false positives)
if ($result->score > 0.9) {
    rejectMessage();
} elseif ($result->score > 0.7) {
    quarantineMessage();
}

Relevance filtering

BinaryClassifier does not use all tokens in the text. It selects the most "relevant" tokens — those whose spam probability deviates most from 0.5. The number of tokens considered is controlled by ConfigBinaryClassifier::setUseRelevant() (default: 15) and the minimum deviation threshold ConfigBinaryClassifier::setMinDev() (default: 0.2).

Tokens that appear multiple times in the text contribute proportionally — a token appearing three times counts three times in the probability calculation.

Degeneration fallback

When a token is not found in the database, BinaryClassifier tries degenerated variants (case variants, punctuation stripped). If none are found, it uses the neutral value robX (default: 0.5). This prevents unknown words from skewing the score.

Error codes

classify() returns a string error code instead of a ClassificationResult when something goes wrong:

Constant	Meaning
`BinaryClassifier::CLASSIFYER_TEXT_MISSING`	`$text` was `null`
`StandardLexer::LEXER_TEXT_NOT_STRING`	`$text` is not a string
`StandardLexer::LEXER_TEXT_EMPTY`	`$text` is an empty string

How B8 works — the Robinson-Fisher algorithm explained
ConfigBinaryClassifier reference — all tuning parameters
Error codes reference
LLM-Assisted Classification

ClassificationResult fields​

Interpreting the score​

Choosing a threshold​

Relevance filtering​

Degeneration fallback​

Error codes​

Related​