Preprocessing Utility Functions
Utility Functions for Text Preprocessing
- pydantic model nlpurify.preprocessing.utils.WordTokenize
Tokenize text into word vectors using different types of methods to achieve cleaner text in desired formats.
- Parameters:
regexp, vanilla, tokenizer – Selects the tokenization technique. Set regexp = True to tokenize text using regular expressions, set vanilla = True for pure-Python tokenization, or set tokenizer = True (default) to use an external tokenizer function such as nltk.tokenize.word_tokenize(). Only one of the three can be true at a time, and an error is raised if all of them are set to true.
- JSON schema:
{ "title": "WordTokenize", "description": "Tokenize text into word vectors using different types of methods\nto achieve cleaner text in desired formats.\n\n:param regexp, vanilla, tokenizer: Selection methods for different\n tokenization techniques. Set the value to ``regexp = True`` to\n tokenize text using regular expressions, for using pure Python\n based text tokenization use the ``vanilla = True`` method, and\n ``tokenizer = True`` (default) is for using external tokenizer\n functions like :func:`nltk.tokenize.word_tokenize` methods.\n The function will throw error if all of the values are set to\n true, and only one can be true at a time.", "type": "object", "properties": { "regexp": { "default": false, "title": "Regexp", "type": "boolean" }, "vanilla": { "default": false, "title": "Vanilla", "type": "boolean" }, "tokenizer": { "default": true, "title": "Tokenizer", "type": "boolean" }, "regexp_pattern": { "default": "\\w+", "title": "Regexp Pattern", "type": "string" }, "vanilla_split_by": { "default": " ", "title": "Vanilla Split By", "type": "string" }, "vanilla_getalpha": { "default": false, "title": "Vanilla Getalpha", "type": "boolean" }, "vanilla_getalnum": { "default": false, "title": "Vanilla Getalnum", "type": "boolean" }, "tokenizer_language": { "default": "english", "title": "Tokenizer Language", "type": "string" }, "tokenizer_preserve_line": { "default": false, "title": "Tokenizer Preserve Line", "type": "boolean" } } }
- Fields:
- field regexp: bool = False
- field regexp_pattern: str = '\\w+'
- field tokenizer: bool = True
- field tokenizer_language: str = 'english'
- field tokenizer_preserve_line: bool = False
- field vanilla: bool = False
- field vanilla_getalnum: bool = False
- field vanilla_getalpha: bool = False
- field vanilla_split_by: str = ' '
- apply(text: str) → str
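The three selection flags and their companion fields can be illustrated with a small stand-in function. This is a minimal sketch and not the nlpurify implementation: the name tokenize_sketch is hypothetical, the way vanilla_getalpha and vanilla_getalnum filter tokens is an assumption inferred from the field names, and the sketch returns a list of tokens for readability even though apply() is documented as returning str. The defaults mirror the field defaults listed above.

```python
import re
from typing import List


def tokenize_sketch(
    text: str,
    regexp: bool = False,
    vanilla: bool = False,
    tokenizer: bool = True,
    regexp_pattern: str = r"\w+",
    vanilla_split_by: str = " ",
    vanilla_getalpha: bool = False,
    vanilla_getalnum: bool = False,
    tokenizer_language: str = "english",
    tokenizer_preserve_line: bool = False,
) -> List[str]:
    """Hypothetical stand-in for the documented selection logic."""
    # Exactly one selection flag may be active at a time.
    if sum([regexp, vanilla, tokenizer]) != 1:
        raise ValueError("exactly one of regexp, vanilla, tokenizer must be True")

    if regexp:
        # Regular-expression mode: return every match of the pattern.
        return re.findall(regexp_pattern, text)

    if vanilla:
        # Pure-Python mode: split on a separator, then optionally filter.
        tokens = text.split(vanilla_split_by)
        if vanilla_getalpha:
            # Assumed meaning: keep purely alphabetic tokens only.
            tokens = [t for t in tokens if t.isalpha()]
        elif vanilla_getalnum:
            # Assumed meaning: keep purely alphanumeric tokens only.
            tokens = [t for t in tokens if t.isalnum()]
        return tokens

    # External mode: delegate to NLTK (requires nltk and its tokenizer data).
    from nltk.tokenize import word_tokenize

    return word_tokenize(
        text,
        language=tokenizer_language,
        preserve_line=tokenizer_preserve_line,
    )


sample = "Hello, world! It is 2024."
print(tokenize_sketch(sample, regexp=True, tokenizer=False))
# ['Hello', 'world', 'It', 'is', '2024']
print(tokenize_sketch(sample, vanilla=True, tokenizer=False))
# ['Hello,', 'world!', 'It', 'is', '2024.']
```

Because tokenizer defaults to True, the sketch requires tokenizer=False whenever regexp or vanilla is selected. Whether WordTokenize itself requires the same should be checked against the library.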