Preprocessing Utility Functions

Utility functions for text preprocessing.

pydantic model nlpurify.preprocessing.utils.WordTokenize

Tokenize text into word vectors using one of several methods, to produce cleaner text in the desired format.

Parameters: regexp, vanilla, tokenizer – Flags that select the tokenization technique. Set regexp = True to tokenize text using regular expressions, set vanilla = True to use pure Python based tokenization, or keep tokenizer = True (default) to use an external tokenizer function such as nltk.tokenize.word_tokenize(). Only one of the three flags can be true at a time; the function throws an error if all of them are set to true.
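
The "only one flag at a time" rule can be sketched in a few lines. This is an illustration only, not the library's own validation code: the helper name select_method is hypothetical, and the sketch assumes that exactly one flag must be true, so the default tokenizer = True has to be switched off when another method is chosen.

```python
def select_method(regexp: bool = False,
                  vanilla: bool = False,
                  tokenizer: bool = True) -> str:
    """Return the name of the single enabled method (hypothetical helper)."""
    flags = {"regexp": regexp, "vanilla": vanilla, "tokenizer": tokenizer}
    enabled = [name for name, value in flags.items() if value]

    # Assumption: exactly one flag may be true at a time.
    if len(enabled) != 1:
        raise ValueError(f"exactly one method must be enabled, got: {enabled}")
    return enabled[0]


print(select_method())                              # defaults select "tokenizer"
print(select_method(regexp=True, tokenizer=False))  # selects "regexp"
```

Under this assumption, enabling all three flags raises ValueError instead of silently picking one method.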

JSON schema:

{
   "title": "WordTokenize",
   "description": "Tokenize text into word vectors using different types of methods\nto achieve cleaner text in desired formats.\n\n:param regexp, vanilla, tokenizer: Selection methods for different\n    tokenization techniques. Set the value to ``regexp = True`` to\n    tokenize text using regular expressions, for using pure Python\n    based text tokenization use the ``vanilla = True`` method, and\n    ``tokenizer = True`` (default) is for using external tokenizer\n    functions like :func:`nltk.tokenize.word_tokenize` methods.\n    The function will throw error if all of the values are set to\n    true, and only one can be true at a time.",
   "type": "object",
   "properties": {
      "regexp": {
         "default": false,
         "title": "Regexp",
         "type": "boolean"
      },
      "vanilla": {
         "default": false,
         "title": "Vanilla",
         "type": "boolean"
      },
      "tokenizer": {
         "default": true,
         "title": "Tokenizer",
         "type": "boolean"
      },
      "regexp_pattern": {
         "default": "\\w+",
         "title": "Regexp Pattern",
         "type": "string"
      },
      "vanilla_split_by": {
         "default": " ",
         "title": "Vanilla Split By",
         "type": "string"
      },
      "vanilla_getalpha": {
         "default": false,
         "title": "Vanilla Getalpha",
         "type": "boolean"
      },
      "vanilla_getalnum": {
         "default": false,
         "title": "Vanilla Getalnum",
         "type": "boolean"
      },
      "tokenizer_language": {
         "default": "english",
         "title": "Tokenizer Language",
         "type": "string"
      },
      "tokenizer_preserve_line": {
         "default": false,
         "title": "Tokenizer Preserve Line",
         "type": "boolean"
      }
   }
}

Fields:

regexp (bool)
regexp_pattern (str)
tokenizer (bool)
tokenizer_language (str)
tokenizer_preserve_line (bool)
vanilla (bool)
vanilla_getalnum (bool)
vanilla_getalpha (bool)
vanilla_split_by (str)

field regexp: bool = False

field regexp_pattern: str = '\\w+'

field tokenizer: bool = True

field tokenizer_language: str = 'english'

field tokenizer_preserve_line: bool = False

field vanilla: bool = False

field vanilla_getalnum: bool = False

field vanilla_getalpha: bool = False

field vanilla_split_by: str = ' '
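
To make the field defaults concrete, here is a minimal sketch of what the regexp and vanilla modes might do with them. It uses only the standard library and does not reproduce the package's implementation: the function names are hypothetical, and the filtering behavior of vanilla_getalpha and vanilla_getalnum (keeping only tokens that pass str.isalpha() or str.isalnum()) is an assumption inferred from the field names.

```python
import re


def regexp_tokenize(text: str, pattern: str = r"\w+") -> list[str]:
    """Collect every non-overlapping match of the pattern (default: runs of word characters)."""
    return re.findall(pattern, text)


def vanilla_tokenize(text: str, split_by: str = " ",
                     getalpha: bool = False,
                     getalnum: bool = False) -> list[str]:
    """Split on a separator, optionally keeping only alphabetic or alphanumeric tokens."""
    tokens = text.split(split_by)
    if getalpha:
        # Assumed meaning of vanilla_getalpha: keep purely alphabetic tokens.
        tokens = [token for token in tokens if token.isalpha()]
    elif getalnum:
        # Assumed meaning of vanilla_getalnum: keep purely alphanumeric tokens.
        tokens = [token for token in tokens if token.isalnum()]
    return tokens


print(regexp_tokenize("Hello, world! It's 2024."))
# ['Hello', 'world', 'It', 's', '2024']
print(vanilla_tokenize("Hello, world! It's 2024."))
# ['Hello,', 'world!', "It's", '2024.']
print(vanilla_tokenize("clean text 42 only", getalpha=True))
# ['clean', 'text', 'only']
```

The default pattern drops punctuation but also splits contractions such as "It's", while a plain split on spaces keeps punctuation attached to the neighboring word.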

apply(text: str) → str
