ConfigLexer

`ByJG\TextClassifier\Lexer\ConfigLexer` controls tokenisation behaviour and is used by both `BinaryClassifier` and `NaiveBayes`. All setters return `$this`, so calls can be chained fluently.

Parameters

| Parameter | Setter | Default | Type | Description |
|---|---|---|---|---|
| `min_size` | `setMinSize(int)` | `3` | `int` | Minimum token length. Tokens shorter than this are discarded. |
| `max_size` | `setMaxSize(int)` | `30` | `int` | Maximum token length. Tokens longer than this are discarded. |
| `allow_numbers` | `setAllowNumbers(bool)` | `false` | `bool` | Whether to keep pure numeric tokens (e.g. `12345`). |
| `get_uris` | `setGetUris(bool)` | `true` | `bool` | Extract URI-like patterns (e.g. `example.com`, `cdn.host.net`). |
| `old_get_html` | `setOldGetHtml(bool)` | `true` | `bool` | Tokenise HTML tags without removing them from the text. Legacy mode. |
| `get_html` | `setGetHtml(bool)` | `false` | `bool` | Tokenise HTML tags and remove them from the remaining text. Modern mode. |
| `get_bbcode` | `setGetBbcode(bool)` | `false` | `bool` | Extract BBCode tags (e.g. `[b]`, `[url=...]`). |
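
To make the effect of `min_size`, `max_size` and `allow_numbers` concrete, here is a minimal sketch in plain PHP. The `filterTokens()` helper is hypothetical and is not part of the library. It only mimics the behaviour documented in the table, and it assumes token length is counted with `strlen()`, which the library may or may not do.

```php
<?php
// Hypothetical helper, NOT part of ByJG\TextClassifier. It only imitates the
// documented effect of min_size, max_size and allow_numbers on a token list.
function filterTokens(
    array $tokens,
    int $minSize = 3,
    int $maxSize = 30,
    bool $allowNumbers = false
): array {
    $kept = [];
    foreach ($tokens as $token) {
        $length = strlen($token); // assumption: length is measured in bytes
        if ($length < $minSize || $length > $maxSize) {
            continue; // outside the allowed length range
        }
        if (!$allowNumbers && ctype_digit($token)) {
            continue; // pure numeric token
        }
        $kept[] = $token;
    }
    return $kept;
}

$tokens = ['de', 'free', '12345', 'offer', str_repeat('x', 31)];

// Default-like settings: keeps 'free' and 'offer'.
print_r(filterTokens($tokens));

// min_size = 2 and allow_numbers = true: keeps 'de', 'free', '12345', 'offer'.
print_r(filterTokens($tokens, 2, 30, true));
```

The 31-character token is dropped in both calls because it exceeds the default `max_size` of 30.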

Usage

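```php
use ByJG\TextClassifier\Lexer\ConfigLexer;

// Every setter returns $this, so the whole configuration can be chained.
$config = (new ConfigLexer())
    ->setMinSize(3)           // minimum token length
    ->setMaxSize(30)          // maximum token length
    ->setAllowNumbers(false)  // discard pure numeric tokens
    ->setGetUris(true)        // extract URI-like patterns
    ->setOldGetHtml(false)    // legacy HTML mode off
    ->setGetHtml(true)        // modern HTML mode on
    ->setGetBbcode(false);    // no BBCode extraction
```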

HTML mode selection

| Use case | `old_get_html` | `get_html` |
|---|---|---|
| Default (legacy) | `true` | `false` |
| Recommended for HTML email | `false` | `true` |
| Ignore all HTML | `false` | `false` |
| Both modes active (not recommended) | `true` | `true` |

In modern mode (`get_html = true`), HTML tags are extracted as tokens (e.g. `<b>`, `<a...>`) and removed from the text before it is split into raw word tokens, so fragments of tag text do not show up as word tokens.
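
As a small sketch of the mode table, the two non-default combinations could be configured as follows. Only the setters documented on this page are used, and the variable names are illustrative. Because `old_get_html` defaults to `true`, it is switched off explicitly in both cases.

```php
use ByJG\TextClassifier\Lexer\ConfigLexer;

// Modern mode, listed in the table as recommended for HTML email.
$htmlEmail = (new ConfigLexer())
    ->setOldGetHtml(false)
    ->setGetHtml(true);

// Ignore all HTML: both HTML options off.
$noHtml = (new ConfigLexer())
    ->setOldGetHtml(false)
    ->setGetHtml(false);
```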

Language detection tip

For language detection or any use case where short words carry discriminative value, lower min_size:

```php
$config = (new ConfigLexer())->setMinSize(2);
```

This lets two-letter articles and prepositions such as `el`, `la`, `de`, and `in` be kept as tokens.

Getters

| Method | Returns |
|---|---|
| `getMinSize()` | `int` |
| `getMaxSize()` | `int` |
| `isAllowNumbers()` | `bool` |
| `isGetUris()` | `bool` |
| `isOldGetHtml()` | `bool` |
| `isGetHtml()` | `bool` |
| `isGetBbcode()` | `bool` |
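
A brief sketch of reading a configuration back through the getters. It assumes each getter simply returns the value last set, or the default listed under Parameters when the option was never touched. That is the natural reading of this page but is not spelled out on it.

```php
use ByJG\TextClassifier\Lexer\ConfigLexer;

$config = (new ConfigLexer())->setMinSize(2);

// A value set explicitly should be readable through the matching getter.
var_dump($config->getMinSize());

// Untouched options are expected to report their documented defaults.
var_dump($config->getMaxSize());
var_dump($config->isGetHtml());
```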
