
ConfigLexer

ByJG\TextClassifier\Lexer\ConfigLexer controls tokenisation behaviour. It is used by both BinaryClassifier and NaiveBayes. Every setter returns $this, so calls can be chained fluently.

Parameters

| Parameter | Setter | Default | Type | Description |
| --- | --- | --- | --- | --- |
| min_size | setMinSize(int) | 3 | int | Minimum token length. Tokens shorter than this are discarded. |
| max_size | setMaxSize(int) | 30 | int | Maximum token length. Tokens longer than this are discarded. |
| allow_numbers | setAllowNumbers(bool) | false | bool | Whether to keep pure numeric tokens (e.g. 12345). |
| get_uris | setGetUris(bool) | true | bool | Extract URI-like patterns (e.g. example.com, cdn.host.net). |
| old_get_html | setOldGetHtml(bool) | true | bool | Tokenise HTML tags without removing them from the text. Legacy mode. |
| get_html | setGetHtml(bool) | false | bool | Tokenise HTML tags and remove them from the remaining text. Modern mode. |
| get_bbcode | setGetBbcode(bool) | false | bool | Extract BBCode tags (e.g. [b], [url=...]). |
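
To make the length and numeric rules concrete, here is a minimal standalone sketch of how min_size, max_size and allow_numbers could filter a token list. It is not the library's tokeniser: the function name filterTokens is hypothetical, and measuring length in bytes with strlen() is an assumption made for this illustration.

```php
<?php
// Hypothetical illustration only: NOT the library's implementation.
// Mirrors the documented defaults: min_size = 3, max_size = 30, allow_numbers = false.
function filterTokens(array $tokens, int $minSize = 3, int $maxSize = 30, bool $allowNumbers = false): array
{
    $kept = [];
    foreach ($tokens as $token) {
        // Assumption: length is measured in bytes.
        $length = strlen($token);
        if ($length < $minSize || $length > $maxSize) {
            continue; // outside the allowed length range
        }
        if (!$allowNumbers && preg_match('/^\d+\z/', $token) === 1) {
            continue; // pure numeric token, dropped unless numbers are allowed
        }
        $kept[] = $token;
    }
    return $kept;
}

$tokens = ['de', 'order', '12345', 'shipped'];

// Defaults: "de" is too short and "12345" is purely numeric.
echo implode(' ', filterTokens($tokens)), PHP_EOL;

// min_size lowered to 2 and numbers allowed: nothing is dropped.
echo implode(' ', filterTokens($tokens, 2, 30, true)), PHP_EOL;
```

With the defaults only order and shipped remain. Lowering min_size to 2 and allowing numbers keeps all four tokens.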

Usage

use ByJG\TextClassifier\Lexer\ConfigLexer;

$config = (new ConfigLexer())
    ->setMinSize(3)
    ->setMaxSize(30)
    ->setAllowNumbers(false)
    ->setGetUris(true)
    ->setOldGetHtml(false)  // turn off legacy HTML handling
    ->setGetHtml(true)      // tokenise HTML tags and remove them from the text
    ->setGetBbcode(false);

HTML mode selection

| Use case | old_get_html | get_html |
| --- | --- | --- |
| Default (legacy) | true | false |
| Recommended for HTML email | false | true |
| Ignore all HTML | false | false |
| Both modes active (not recommended) | true | true |

In "modern" mode (get_html = true), HTML tags are extracted as tokens (e.g. <b>, <a...>) and removed from the text before the raw split, preventing partial tag text from appearing as word tokens.
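
The difference between the two modes can be sketched in plain PHP. This is an assumption-laden illustration rather than the library's code: sketchTokenise is a hypothetical helper, the tag pattern is deliberately naive, and tags are kept verbatim instead of being normalised to a form like <a...>.

```php
<?php
// Hypothetical sketch, not the library's code: contrasts legacy and modern HTML handling.
function sketchTokenise(string $text, bool $removeTags): array
{
    // Collect every HTML tag as its own token (naive pattern, for illustration).
    preg_match_all('/<[^>]+>/', $text, $matches);
    $tagTokens = $matches[0];

    if ($removeTags) {
        // Modern behaviour: strip the tags before the raw split.
        $text = preg_replace('/<[^>]+>/', ' ', $text);
    }

    // Raw split on anything that is not a letter or digit.
    $words = preg_split('/[^a-z0-9]+/i', $text, -1, PREG_SPLIT_NO_EMPTY);

    return array_merge($tagTokens, $words);
}

$html = 'Click <a href="x">here</a>';

$legacy = sketchTokenise($html, false); // tags stay in the text, so "href" leaks in as a word
$modern = sketchTokenise($html, true);  // tags are removed first, so only real words remain

echo implode(' | ', $legacy), PHP_EOL;
echo implode(' | ', $modern), PHP_EOL;
```

In the legacy-style run, fragments such as a, href and x show up as word tokens. In the modern-style run, only Click and here follow the two tag tokens.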

Language detection tip

For language detection or any use case where short words carry discriminative value, lower min_size:

(new ConfigLexer())->setMinSize(2);

This allows articles and prepositions like el, la, de, in to be included as tokens.
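
As a rough illustration of why this matters, the sketch below applies a plain length filter to a short Spanish phrase. The helper keepByMinSize is hypothetical and only stands in for the lexer's min_size check. It does not call the library.

```php
<?php
// Illustrative only: a plain length filter standing in for the lexer's min_size check.
function keepByMinSize(array $words, int $minSize): array
{
    // strlen() counts bytes, which is enough for these ASCII samples.
    return array_values(array_filter(
        $words,
        fn(string $word): bool => strlen($word) >= $minSize
    ));
}

$words = ['el', 'gato', 'de', 'la', 'casa'];

echo implode(' ', keepByMinSize($words, 3)), PHP_EOL; // short function words are dropped
echo implode(' ', keepByMinSize($words, 2)), PHP_EOL; // all five words survive
```

With a minimum of 3, the language-specific words el, de and la are lost. With a minimum of 2 they remain available as features.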

Getters

| Method | Returns |
| --- | --- |
| getMinSize() | int |
| getMaxSize() | int |
| isAllowNumbers() | bool |
| isGetUris() | bool |
| isOldGetHtml() | bool |
| isGetHtml() | bool |
| isGetBbcode() | bool |