
Classifying Text

classify() returns a ClassificationResult object on success, or a string error code on failure.

use ByJG\TextClassifier\ClassificationResult;

$result = $classifier->classify($text);

if (!($result instanceof ClassificationResult)) {
    // $result is an error code string; handle it and stop here
    return;
}

echo $result->choice; // 'spam' or 'ham'
echo $result->score;  // float 0.0–1.0 (spam probability)

ClassificationResult fields

| Field | Type | Description |
|---|---|---|
| choice | string | 'spam' or 'ham' |
| score | float | Spam probability 0.0–1.0 (final, after any LLM retraining) |
| scores | array<string, float> | ['spam' => …, 'ham' => …] final scores |
| statScores | array<string, float> | Raw statistical scores before any LLM escalation |
| llmDecision | string\|null | LLM's label if it was consulted, otherwise null |
| escalated | bool | true when the LLM was invoked |

Interpreting the score

| Score range | Interpretation |
|---|---|
| 0.0 – 0.2 | Strong ham |
| 0.2 – 0.4 | Likely ham |
| 0.4 – 0.6 | Uncertain |
| 0.6 – 0.8 | Likely spam |
| 0.8 – 1.0 | Strong spam |

0.5 means the filter has no opinion — either no training data exists, or the tokens in the text are equally distributed between ham and spam.
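The bands in the table can be turned into a small helper. The sketch below is illustrative only: interpretScore() is a hypothetical function, not part of the library, and its treatment of boundary values (for example, whether 0.2 counts as strong or likely ham) is an assumption, since the table's ranges overlap at their edges.

```php
<?php
declare(strict_types=1);

/**
 * Map a spam probability to one of the interpretation bands.
 * Hypothetical helper; boundary handling is an assumption.
 */
function interpretScore(float $score): string
{
    if ($score < 0.0 || $score > 1.0) {
        throw new InvalidArgumentException('Score must be between 0.0 and 1.0');
    }

    // match (true) picks the first arm whose condition holds.
    return match (true) {
        $score < 0.2  => 'strong ham',
        $score < 0.4  => 'likely ham',
        $score <= 0.6 => 'uncertain',
        $score <= 0.8 => 'likely spam',
        default       => 'strong spam',
    };
}

echo interpretScore(0.5), PHP_EOL;  // uncertain
echo interpretScore(0.93), PHP_EOL; // strong spam
```

In real code the argument would be $result->score from a successful classify() call.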

Choosing a threshold

use ByJG\TextClassifier\ClassificationResult;

$result = $classifier->classify($text);
if (!($result instanceof ClassificationResult)) {
    // handle error
    return;
}

// Pick ONE of the two policies below.

// Aggressive spam filtering (minimise false negatives)
if ($result->score > 0.7) {
    rejectMessage();
}

// Conservative (minimise false positives)
if ($result->score > 0.9) {
    rejectMessage();
} elseif ($result->score > 0.7) {
    quarantineMessage();
}

Relevance filtering

BinaryClassifier does not use all tokens in the text. It selects the most "relevant" tokens: those whose spam probability deviates most from 0.5. The number of tokens considered is set with ConfigBinaryClassifier::setUseRelevant() (default: 15), and the minimum deviation a token needs in order to qualify is set with ConfigBinaryClassifier::setMinDev() (default: 0.2).

Tokens that appear multiple times in the text contribute proportionally — a token appearing three times counts three times in the probability calculation.
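The selection described above can be sketched roughly as follows. This is not the library's implementation: selectRelevant() and its parameters are hypothetical, the comparison against the minimum deviation is assumed to be inclusive, and the way repeated tokens interact with the token limit is a guess.

```php
<?php
declare(strict_types=1);

/**
 * Pick the probabilities of the tokens that deviate most from 0.5.
 * Hypothetical sketch, not the library's code.
 *
 * @param array<string, float> $probabilities token => spam probability
 * @param array<string, int>   $counts        token => occurrences in the text
 * @return list<float> probabilities that would enter the final calculation
 */
function selectRelevant(
    array $probabilities,
    array $counts,
    int $useRelevant = 15,
    float $minDev = 0.2
): array {
    // Rank tokens by distance from the neutral value 0.5, largest first.
    $deviation = array_map(fn (float $p): float => abs($p - 0.5), $probabilities);
    arsort($deviation);

    $selected = [];
    foreach ($deviation as $token => $dev) {
        if ($dev < $minDev) {
            break; // sorted descending, so the rest fall below the threshold too
        }
        // A repeated token contributes once per occurrence.
        $occurrences = $counts[$token] ?? 1;
        for ($i = 0; $i < $occurrences; $i++) {
            if (count($selected) >= $useRelevant) {
                return $selected;
            }
            $selected[] = $probabilities[$token];
        }
    }

    return $selected;
}

$probabilities = ['viagra' => 0.99, 'hello' => 0.45, 'meeting' => 0.1, 'free' => 0.8];
$counts        = ['viagra' => 2, 'hello' => 1, 'meeting' => 1, 'free' => 1];

// 'hello' is dropped (deviation 0.05); 'viagra' is counted twice.
print_r(selectRelevant($probabilities, $counts)); // [0.99, 0.99, 0.1, 0.8]
```

Lowering the token limit makes the score depend on only the strongest signals, while raising the minimum deviation discards more near-neutral tokens.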

Degeneration fallback

When a token is not found in the database, BinaryClassifier tries degenerated variants (case variants, punctuation stripped). If none are found, it uses the neutral value robX (default: 0.5). This prevents unknown words from skewing the score.
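A minimal sketch of this lookup order, under stated assumptions: degenerate() and lookupProbability() are hypothetical names, the exact set of variants the library generates is not documented here, and the token database is modelled as a plain array of token => spam probability.

```php
<?php
declare(strict_types=1);

/**
 * Build illustrative variants of a token: case variants, with and
 * without trailing punctuation. The real variant set is an assumption.
 *
 * @return list<string>
 */
function degenerate(string $token): array
{
    $stripped = rtrim($token, '.,!?;:');
    $variants = [$stripped];

    foreach ([$token, $stripped] as $base) {
        $variants[] = strtolower($base);
        $variants[] = strtoupper($base);
        $variants[] = ucfirst(strtolower($base));
    }

    // Drop empty strings, the original token, and duplicates.
    $variants = array_diff($variants, ['', $token]);

    return array_values(array_unique($variants));
}

/**
 * Look up a token, then its variants, then fall back to the neutral value.
 *
 * @param array<string, float> $database token => spam probability
 */
function lookupProbability(string $token, array $database, float $robX = 0.5): float
{
    if (isset($database[$token])) {
        return $database[$token];
    }

    foreach (degenerate($token) as $variant) {
        if (isset($database[$variant])) {
            return $database[$variant];
        }
    }

    return $robX; // unknown token: neutral, does not move the score
}

$database = ['free' => 0.9];

echo lookupProbability('FREE!!!', $database), PHP_EOL; // 0.9 (found via 'free')
echo lookupProbability('unknown', $database), PHP_EOL; // 0.5 (neutral fallback)
```

This sketch returns the first variant it finds; the library may instead weigh several matching variants, so treat the ordering as illustrative.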

Error codes

classify() returns a string error code instead of a ClassificationResult when something goes wrong:

| Constant | Meaning |
|---|---|
| BinaryClassifier::CLASSIFYER_TEXT_MISSING | $text was null |
| StandardLexer::LEXER_TEXT_NOT_STRING | $text is not a string |
| StandardLexer::LEXER_TEXT_EMPTY | $text is an empty string |
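Callers that prefer exceptions over checking the return type each time could wrap the call. The helper below is a hypothetical sketch, not part of the library: requireResult() is an invented name, and it only tests for a string versus an object so that the example stays self-contained. Real code would check instanceof ClassificationResult instead.

```php
<?php
declare(strict_types=1);

/**
 * Convert the "result object or error string" return value of classify()
 * into an exception-based flow. Hypothetical helper.
 */
function requireResult(mixed $result): object
{
    if (is_string($result)) {
        // classify() signalled failure with an error code string.
        throw new RuntimeException("Classification failed with error code: {$result}");
    }

    if (!is_object($result)) {
        throw new UnexpectedValueException('Unexpected return type from classify()');
    }

    return $result;
}

// Usage, assuming $classifier and $text exist in the calling code:
// try {
//     $result = requireResult($classifier->classify($text));
//     echo $result->choice;
// } catch (RuntimeException $e) {
//     error_log($e->getMessage());
// }
```

The error code is kept in the exception message so it can still be compared against the constants in the table when logging or debugging.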