Text Normalization
Text normalization is the process of converting text into a consistent, standard, or “canonical” form. The goal is to reduce randomness and variations in the text data, which helps in reducing the overall number of unique words (the vocabulary size) and ensures that different forms of the same word are treated as one.
The main goal is to provide a single function that can be used to
achieve normalization goals. Popular methods include case folding
(converting all the words to lower or upper case), stop word removal, etc.
The underlying function uses core Python string manipulation methods,
with additional third-party libraries (like nltk), to achieve text
normalization.
The core methods are kept simple, and they use generic arguments that are widely recognized and used by popular libraries.
- pydantic model nlpurify.preprocessing.normalization.WhiteSpace
A Model to Normalize White Space (space, tabs, newlines) from Text
White space at the beginning or end of a text, and runs of multiple white spaces within it, do not add any value to the text and should thus be removed to normalize it.
The model follows a modular approach and is derived from a base normalization class. The usage is as below:
    import nlpurify as nlpu
    model = nlpu.preprocessing.normalization.WhiteSpace()

    # let's define a multi-line uncleaned text
    text = '''
        This is a uncleaned text with lots of
        extra white
        space.
    '''

    print(model.apply(text)) # uses default settings
    >> "This is a uncleaned text with lots of extra white space."
The model does not accept additional arguments, and the .apply() method
is used to clean and normalize white space from text.
JSON schema:
{ "title": "WhiteSpace", "description": "A Model to Normalize White Space (space, tabs, newlines) from Text\n\nCleaning texts of white spaces like from beginning, end, and\nalso multiple white spaces does not add any value to a text and\nshould thus be removed to normalize the text.\n\nA modular approach is now enabled which is derived from a base\nnormalization class. The usage is as below:\n\n.. code-block:: python\n\n import nlpurify as nlpu\n model = nlpu.preprocessing.normalization.WhiteSpace()\n\n # let's define a multi-line uncleaned text\n text = '''\n This is a uncleaned text with lots of\n extra white\n space.\n '''\n\n print(model.apply(text)) # uses default settings\n >> \"This is a uncleaned text with lots of extra white space.\"\n\nThe model does not accept additional arguments and the function\n``.apply()`` is used to clean and normalize white space from text.", "type": "object", "properties": { "name": { "default": null, "description": "Set the model name, or default to class name", "title": "Name", "type": "string" }, "strip": { "default": true, "description": "\n Strip white spaces from both the beginning and the end of the\n string for normalization. By default, all the spaces are\n removed as they do not provide any additional information for\n a LLM/NLP based models and reduces token counts.\n\n When the attribute is set to ``True`` the alternate parameters\n :attr:`lstrip` and :attr:`rstrip` is ignored, check model\n validator for more information. This uses the Python in-built\n string function as in example below:\n\n .. 
code-block:: python\n\n text = \" this is a long text \"\n print(text.strip())\n >> 'this is a long text'\n\n Further customization like specifying alternate set of\n characters to be removed from the string is also supported by\n using the :attr:`strip_chars` attribute, for more information\n check `docs <https://docs.python.org/3/library/stdtypes.html#str.strip>`_.\n ", "title": "Strip", "type": "boolean" }, "lstrip": { "default": true, "description": "\n When set to true (default) removes the leading white characters\n from the string, or specify alternate set using\n :attr:`strip_chars` attribute.\n ", "title": "Lstrip", "type": "boolean" }, "rstrip": { "default": true, "description": "\n When set to true (default) removes the trailing white\n characters from the string, or specify alternate set using\n :attr:`strip_chars` attribute.\n ", "title": "Rstrip", "type": "boolean" }, "strip_chars": { "default": null, "description": "\n Custom set characters to be removed from the string. The\n argument is not a \"prefix\" or a \"suffix\" but a combination of\n all the values to be stripped. Check\n `docs <https://docs.python.org/3/library/stdtypes.html#str.strip>`_\n for more information.\n ", "title": "Strip Chars", "type": "string" }, "newline": { "default": true, "description": "\n Strip new line characters from a multiple line (i.e., a\n paragraph or text from \"text area\") to get one single text,\n defaults to True. 
By default, :attr:`strip` removes new lines\n from the beginning and end, while this argument using string\n replace method to remove within lines - useful when the source\n text is paragraphed and needs to be cleaned.\n ", "title": "Newline", "type": "boolean" }, "newlinesep": { "default": "\n", "description": "\n A string value which defaults to the systems' default new line\n seperator (\"\\r\\n\" `CRLF` for windows, and \"\\n\" `LF` for\n *nix based systems) to replace from string.\n ", "title": "Newlinesep", "type": "string" }, "multispace": { "default": true, "description": "\n Replace multiple spaces using regular expressions, which often\n reduces the models' performance, defaults to True.\n ", "title": "Multispace", "type": "boolean" } } }
- Fields:
- Validators:
model_validator»all fields
- field strip: bool = True
Strip white space from both the beginning and the end of the string for normalization. By default, all such spaces are removed, as they do not provide any additional information to LLM/NLP-based models and removing them reduces token counts.
When the attribute is set to True, the alternate parameters lstrip and rstrip are ignored; check the model validator for more information. This uses Python's built-in string method, as in the example below:
    text = " this is a long text "
    print(text.strip())
    >> 'this is a long text'
Further customization, like specifying an alternate set of characters to be removed from the string, is also supported through the strip_chars attribute; for more information check the docs.
- Validated by:
__set_name__
- field lstrip: bool = True
When set to true (default), removes the leading whitespace characters from the string; an alternate set can be specified using the strip_chars attribute.
- Validated by:
__set_name__
- field rstrip: bool = True
When set to true (default), removes the trailing whitespace characters from the string; an alternate set can be specified using the strip_chars attribute.
- Validated by:
__set_name__
- field strip_chars: str = None
Custom set of characters to be removed from the string. The argument is not a “prefix” or a “suffix”; rather, all combinations of its characters are stripped. Check docs for more information.
- Validated by:
__set_name__
- field newline: bool = True
Strip new line characters from a multi-line text (i.e., a paragraph or text from a “text area”) to get one single line of text; defaults to True. By default, strip removes new lines from the beginning and end, while this argument uses the string replace method to remove those within the text, which is useful when the source text is paragraphed and needs to be cleaned.
- Validated by:
__set_name__
- field newlinesep: str = '\n'
A string value which defaults to the system's default new line separator (“\r\n”, CRLF, for Windows, and “\n”, LF, for *nix based systems) to replace from the string.
- Validated by:
__set_name__
- field multispace: bool = True
Collapse runs of multiple spaces, which often reduce a model's performance, using regular expressions; defaults to True.
- Validated by:
__set_name__
- apply(text: str) → str
An abstract method .apply() that takes in any string value that needs to be converted. The apply function allows the same model to be used for n elements using a processing engine.
- Parameters:
text (str) – Any string value that needs to be normalized; the method may also extend unique properties of the child.
Return Value
- Return type:
str
- Returns:
Returns a normalized text which is as per the child class properties and attributes.
- validator model_validator » all fields
Pydantic generic model validator which validates all the fields using the self.attribute parameter and is generic to the class.
- Raises:
UserWarning – A warning is raised when a parameter does not follow the specified directive. It is recommended to check the attribute settings before using apply(), or it might generate unwanted output.
- _abc_impl = <_abc._abc_data object>
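The fields above can be pictured with a small standard-library sketch. This is not the library's implementation: the function name normalize_whitespace is hypothetical, and the order of the steps and the choice to replace each new line with a single space are assumptions made for illustration only.

```python
import re
from typing import Optional


def normalize_whitespace(
    text: str,
    strip: bool = True,
    strip_chars: Optional[str] = None,
    newline: bool = True,
    newlinesep: str = "\n",
    multispace: bool = True,
) -> str:
    """Hypothetical stand-in for WhiteSpace.apply(), stdlib only."""
    if newline:
        # assumption: each new line separator becomes one space
        text = text.replace(newlinesep, " ")
    if multispace:
        # collapse runs of two or more spaces into a single space
        text = re.sub(r" {2,}", " ", text)
    if strip:
        # strip_chars=None makes str.strip() remove leading/trailing whitespace
        text = text.strip(strip_chars)
    return text


sample = """
    This is an uncleaned text with lots of
    extra   white
    space.
"""
print(normalize_whitespace(sample))
# This is an uncleaned text with lots of extra white space.

# strip_chars is a set of characters, not a prefix or suffix
print(normalize_whitespace("--title--", strip_chars="-"))  # title
```

Running the replacement of new lines before the final strip means that spaces left over at either end are removed in one pass.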
- pydantic model nlpurify.preprocessing.normalization.CaseFolding
A Model to Normalize Case Folding from Texts
Text from a raw data source is often in title case or in mixed case, which may hinder an NLP/LLM model's performance. The general convention is to convert everything to lower case using the native Python string method lower().
JSON schema:
{ "title": "CaseFolding", "description": "A Model to Normalize Case Folding from Texts\n\nCase folding from raw data source is often in title case, or is in\na mixed case which may hinder the NLP/LLM model's performance. The\ngeneral convention is to convert all to lower cases using native\nPython function :func:`lower()` which is available for strings.", "type": "object", "properties": { "name": { "default": null, "description": "Set the model name, or default to class name", "title": "Name", "type": "string" }, "upper": { "default": false, "description": "\n Convert the text to upper case and return the text without\n altering other things. Defaults to False, the class converts\n the text to lower case which is recommended in LLM/NLP models.\n ", "title": "Upper", "type": "boolean" }, "lower": { "default": true, "description": "\n Convert the contents fof the text to lower case (default) for\n an easy forward integration with LLM/NLP based models.\n ", "title": "Lower", "type": "boolean" } } }
- Fields:
- Validators:
model_validator»all fields
- field upper: bool = False
Convert the text to upper case and return it without altering anything else. Defaults to False; by default the class converts the text to lower case, which is recommended for LLM/NLP models.
- Validated by:
__set_name__
- field lower: bool = True
Convert the contents of the text to lower case (default) for easy forward integration with LLM/NLP-based models.
- Validated by:
__set_name__
- apply(text: str) → str
Normalize the text into either all lower case or all upper case, as per the downstream model's need.
- validator model_validator » all fields
Validate all the attributes of the class, and raise an error when the validation fails for the combination below.
- Raises:
AssertionError – Raised when both the upper and lower attributes are set to True.
- _abc_impl = <_abc._abc_data object>
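As a rough illustration of the behaviour described for this model, the sketch below folds case with plain string methods and rejects the one combination the validator is documented to reject. The function name fold_case is hypothetical, and returning the text unchanged when both toggles are off is an assumption, since the documentation does not cover that case.

```python
def fold_case(text: str, upper: bool = False, lower: bool = True) -> str:
    """Hypothetical stand-in for CaseFolding.apply()."""
    # mirror the documented validator: both toggles cannot be True at once
    if upper and lower:
        raise AssertionError("set only one of `upper` or `lower` to True")
    if upper:
        return text.upper()
    if lower:
        return text.lower()
    # assumption: with both toggles off the text is left untouched
    return text


print(fold_case("My UnCleaned Text"))                           # my uncleaned text
print(fold_case("My UnCleaned Text", upper=True, lower=False))  # MY UNCLEANED TEXT
```

Because lower defaults to True, asking for upper case in this sketch means switching lower off explicitly.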
- pydantic model nlpurify.preprocessing.normalization.StopWords
Normalize Raw Texts from Stop Words using NLTK Corpus
The model uses nltk.corpus to look up the valid stopwords which, when removed from a text, improve an NLP/LLM model's performance. By default, the model is set to use the stopwords of the English language.
JSON schema:
{ "title": "StopWords", "description": "Normalize Raw Texts from Stop Words using NLTK Corpus\n\nThe model uses the :mod:`nltk.corpus` to check the valid stopwords\nthat when removed from a text improves an NLP/LLM models'\nperformance. By default, the model is set to use the stopwords in\nthe English language.", "type": "object", "properties": { "name": { "default": null, "description": "Set the model name, or default to class name", "title": "Name", "type": "string" }, "language": { "default": "english", "description": "\n A valid language name which is available and defined under\n :func:`nltk.corpus.stopwords`, defaults to the English. To see\n a valid list of languages follow below.\n\n .. code-block:: python\n\n import nltk\n\n # download the corpus if not already available\n # nltk.download(\"stopwords\")\n from nltk.corpus import stopwords\n\n # once downloaded and available, check available list:\n print(stopwords.fileids())\n\n The code block is dependent on :mod:`nltk` for more information\n check `docs <https://www.nltk.org/index.html>`_.\n ", "title": "Language", "type": "string" }, "extrawords": { "default": [], "description": "\n The model gives the flexibility to add extra words which will\n be treated as stopwords which are not already defined under\n the :func:`nltk.corpus.stopwords` function. 
This can be\n helpful in dynamic debuging and quick manipulation of text to\n check forward models performance.\n ", "items": {}, "title": "Extrawords", "type": "array" }, "excludewords": { "default": [], "description": "\n Opposite to ``extrawords`` this attribute helps in updating\n the stopwords by removing/excluding words from the already\n defined words in ``stopwords.words(self.language)`` list.\n ", "items": {}, "title": "Excludewords", "type": "array" }, "stopwords_in_uppercase": { "default": false, "title": "Stopwords In Uppercase", "type": "boolean" }, "tokenize": { "default": true, "title": "Tokenize", "type": "boolean" }, "tokenize_config": { "$ref": "#/$defs/WordTokenize", "default": { "regexp": false, "vanilla": false, "tokenizer": true, "regexp_pattern": "\\w+", "vanilla_split_by": " ", "vanilla_getalpha": false, "vanilla_getalnum": false, "tokenizer_language": "english", "tokenizer_preserve_line": false } } }, "$defs": { "WordTokenize": { "description": "Tokenize text into word vectors using different types of methods\nto achieve cleaner text in desired formats.\n\n:param regexp, vanilla, tokenizer: Selection methods for different\n tokenization techniques. 
Set the value to ``regexp = True`` to\n tokenize text using regular expressions, for using pure Python\n based text tokenization use the ``vanilla = True`` method, and\n ``tokenizer = True`` (default) is for using external tokenizer\n functions like :func:`nltk.tokenize.word_tokenize` methods.\n The function will throw error if all of the values are set to\n true, and only one can be true at a time.", "properties": { "regexp": { "default": false, "title": "Regexp", "type": "boolean" }, "vanilla": { "default": false, "title": "Vanilla", "type": "boolean" }, "tokenizer": { "default": true, "title": "Tokenizer", "type": "boolean" }, "regexp_pattern": { "default": "\\w+", "title": "Regexp Pattern", "type": "string" }, "vanilla_split_by": { "default": " ", "title": "Vanilla Split By", "type": "string" }, "vanilla_getalpha": { "default": false, "title": "Vanilla Getalpha", "type": "boolean" }, "vanilla_getalnum": { "default": false, "title": "Vanilla Getalnum", "type": "boolean" }, "tokenizer_language": { "default": "english", "title": "Tokenizer Language", "type": "string" }, "tokenizer_preserve_line": { "default": false, "title": "Tokenizer Preserve Line", "type": "boolean" } }, "title": "WordTokenize", "type": "object" } } }
- Fields:
- Validators:
- field language: str = 'english'
A valid language name which is available and defined under nltk.corpus.stopwords; defaults to English. To see the list of valid languages, run the snippet below.
    import nltk

    # download the corpus if not already available
    # nltk.download("stopwords")
    from nltk.corpus import stopwords

    # once downloaded and available, check available list:
    print(stopwords.fileids())
The code block depends on nltk; for more information check the docs.
- Validated by:
__set_name__
- field extrawords: list = []
The model gives the flexibility to add extra words, not already defined under nltk.corpus.stopwords, that will be treated as stopwords. This can be helpful for dynamic debugging and quick manipulation of text to check a downstream model's performance.
- Validated by:
__set_name__
- field excludewords: list = []
Opposite to extrawords, this attribute updates the stopwords by removing/excluding words from the already defined stopwords.words(self.language) list.
- Validated by:
__set_name__
- field stopwords_in_uppercase: bool = False
- Validated by:
__set_name__
- field tokenize: bool = True
- Validated by:
__set_name__
- field tokenize_config: WordTokenize = WordTokenize(regexp=False, vanilla=False, tokenizer=True, regexp_pattern='\\w+', vanilla_split_by=' ', vanilla_getalpha=False, vanilla_getalnum=False, tokenizer_language='english', tokenizer_preserve_line=False)
- Validated by:
__set_name__
- apply(text: str) → str
An abstract method .apply() that takes in any string value that needs to be converted. The apply function allows the same model to be used for n elements using a processing engine.
- Parameters:
text (str) – Any string value that needs to be normalized; the method may also extend unique properties of the child.
Return Value
- Return type:
str
- Returns:
Returns a normalized text which is as per the child class properties and attributes.
- property stopwords_: list
- _abc_impl = <_abc._abc_data object>
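The interplay of extrawords and excludewords can be sketched without downloading the NLTK corpus. Everything here is illustrative: BASE_STOPWORDS is a tiny hand-written subset standing in for stopwords.words("english"), remove_stopwords is a hypothetical name, and the whitespace split and case-insensitive comparison are assumptions, not the library's documented behaviour.

```python
# tiny illustrative subset; the real list comes from nltk.corpus.stopwords
BASE_STOPWORDS = {"a", "an", "and", "is", "of", "the", "this", "with"}


def remove_stopwords(text, extrawords=(), excludewords=()):
    """Hypothetical stand-in for StopWords.apply()."""
    # extend the base list first, then drop the excluded words from it
    stopwords = (BASE_STOPWORDS | set(extrawords)) - set(excludewords)
    # assumption: plain whitespace split instead of an NLTK tokenizer
    tokens = text.split()
    kept = [token for token in tokens if token.lower() not in stopwords]
    return " ".join(kept)


text = "this is the text of an example"
print(remove_stopwords(text))                          # text example
print(remove_stopwords(text, extrawords=["example"]))  # text
print(remove_stopwords(text, excludewords=["this"]))   # this text example
```

In this sketch exclusions are applied after additions, so a word that appears in both lists is kept in the text.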
- nlpurify.preprocessing.normalization.normalize(text: str, whitespace: bool = True, casefolding: bool = True, stopwords: bool = True, **kwargs) → str
The normalization function provides a one-stop solution for all types of basic text normalization (white space, case folding, and stop word removal), each of which can be toggled on/off as per the end user's need. A normalized text may have the following properties:
It may not start or end with a white space character,
It may not have multiple spaces or spaces in the beginning or end of the sentence, and
It may not be spread over multiple lines (i.e., a paragraph).
All the above properties are desired, and can improve performance when used to train a large language model. Normalization of texts may also involve a uniform case, typically via str.lower(), that can be used to create a word vector.
- Parameters:
text (str) – The base uncleaned text; all the operations are done on this text to return a cleaner version. The string can be single-line or multi-line (for example, from a “text area”) and can have any type of escape characters.
All the normalization techniques are put into one callable method, which in turn uses pydantic models for data validation and settings management of each technique. Below are the toggles:
- Parameters:
whitespace (bool) – A technique that normalizes the white space in the underlying text. A text with multiple white spaces increases the processing load of an NLP/LLM model, which can hurt performance. White space in a text includes spaces, tabs and new lines, which are the primary delimiters for an NLP/LLM model.
casefolding (bool) – Technique to normalize the case of a string to a desired format, i.e., either all upper case or all lower case. It is good practice to convert all raw text into lower case before sending it for further modeling.
stopwords (bool) – A stop word is a common high-frequency word like “the”, “and”, etc. that carries little meaning of its own. Removing the stop words can often improve model efficiency; defaults to True.
Keyword Arguments
The keyword arguments are used to toggle on/off each of the normalization techniques. Each technique is associated with an underlying dictionary which is defined under respective models.
Please refer to the underlying models for the detailed keyword arguments associated with each normalization technique, as below:
whitespace : Associated with white space removal; the function takes in arguments associated with Python's native string functions. Check WhiteSpace for more information.
casefolding : Associated with setting a uniform text case; the model converts the whole string either to upper case or to lower case using Python's native string functions. For more details check the signature of the CaseFolding class.
stopwords : Associated with stop word removal; check the underlying validation class StopWords for more details.
Code Example(s)
The default configuration is (most of the time) the best normal form of the text, which is widely used. This can be achieved using the default setting like below.
    import nlpurify as nlpu
    ...
    text = " My unCleaned text!! "
    print(nlpu.preprocessing.normalize(text, ...))
    >> "my uncleaned text" # example of a cleaned text
Return Data
- Return type:
str
- Returns:
Returns a cleaner version of the string, normalized and treated, thus providing better performance for downstream NLP/LLM-based modelling.
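To make the three toggles concrete, here is a minimal sketch of how such a function might chain the techniques. It is not the library's code: normalize_sketch and STOPWORDS are hypothetical names, the stop word list is a small illustrative subset rather than the NLTK corpus, and the order of the steps is an assumption.

```python
import re

# illustrative subset only; the library reads its list from the NLTK corpus
STOPWORDS = {"a", "an", "and", "is", "of", "the"}


def normalize_sketch(text, whitespace=True, casefolding=True, stopwords=True):
    """Hypothetical sketch of a toggle-driven normalization pipeline."""
    if whitespace:
        # collapse spaces, tabs and new lines into single spaces, then trim
        text = re.sub(r"\s+", " ", text).strip()
    if casefolding:
        text = text.lower()
    if stopwords:
        # drop stop words; splitting and re-joining also rewrites spacing
        text = " ".join(w for w in text.split() if w.lower() not in STOPWORDS)
    return text


raw = "  The Quick   brown fox\n is a  Fox "
print(normalize_sketch(raw))                   # quick brown fox fox
print(normalize_sketch(raw, stopwords=False))  # the quick brown fox is a fox
```

Each step only runs when its toggle is True, so switching one off leaves the other two unaffected, apart from the spacing side effect noted in the stop word step.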