Skip to main content

ConfigDegenerator

ByJG\TextClassifier\Degenerator\ConfigDegenerator controls the word degeneration behaviour used by the BinaryClassifier spam filter. All setters return $this for fluent chaining.

Parameters

ParameterSetterDefaultTypeDescription
multibytesetMultibyte(bool)falseboolUse multibyte string functions (mb_strtolower, mb_strtoupper, mb_substr) instead of their ASCII counterparts. Required for non-ASCII text.
encodingsetEncoding(string)'UTF-8'stringCharacter encoding for multibyte operations. Only used when multibyte = true.

Usage

ASCII text (default)

use ByJG\TextClassifier\Degenerator\ConfigDegenerator;
use ByJG\TextClassifier\Degenerator\StandardDegenerator;

$degenerator = new StandardDegenerator(new ConfigDegenerator());

Multibyte / Unicode text

$degenerator = new StandardDegenerator(
(new ConfigDegenerator())
->setMultibyte(true)
->setEncoding('UTF-8')
);

When to enable multibyte

Enable if your training texts or classified texts contain:

  • Accented characters (é, ü, ñ, ö, ...)
  • Cyrillic, Greek, Arabic, Hebrew, or other non-Latin scripts
  • CJK characters (Chinese, Japanese, Korean)
  • Any character outside the ASCII range (0x000x7F)

Without multibyte mode, strtolower('Ñoño') does not produce ñoño, which means degenerated variants will be incorrect and case-folding will fail silently for non-ASCII input.

Getters

MethodReturns
isMultibyte()bool
getEncoding()string