Text Normalization

Text normalization is the process of converting text into a consistent, standard, or “canonical” form. The goal is to reduce randomness and variations in the text data, which helps in reducing the overall number of unique words (the vocabulary size) and ensures that different forms of the same word are treated as one.
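As a small illustration of the vocabulary-size claim, the sketch below counts unique tokens before and after a simple lower-casing step. It uses only the Python standard library; the sample sentence and the naive whitespace split are assumptions made for illustration and are not part of nlpurify.

```python
# Illustrative only: a naive whitespace split stands in for real tokenization.
sample = "The cat saw the Cat and THE cat ran"

raw_tokens = sample.split()
normalized_tokens = [token.lower() for token in raw_tokens]

raw_vocab = set(raw_tokens)                # distinct surface forms
normalized_vocab = set(normalized_tokens)  # distinct forms after lower-casing

print(len(raw_vocab), len(normalized_vocab))
```

The three spellings of "the" and the two spellings of "cat" each collapse into one entry, which is the effect normalization aims for.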

The package provides a single function that covers the common normalization steps, such as case conversion (setting all the words to lower or upper case) and stop word removal. The underlying function uses core Python string methods together with third-party libraries (such as nltk) to normalize text.

The core methods are kept simple, and their arguments follow names that are widely recognized and used by popular libraries.

pydantic model nlpurify.preprocessing.normalization.WhiteSpace

A Model to Normalize White Space (space, tabs, newlines) from Text

Leading, trailing, and repeated white space adds no value to a text and should therefore be removed to normalize it.

The model follows a modular approach and is derived from a base normalization class. Usage is as below:

import nlpurify as nlpu
model = nlpu.preprocessing.normalization.WhiteSpace()

# let's define a multi-line uncleaned text
text = '''
    This is a   uncleaned text    with lots of
   extra white
space.
'''

print(model.apply(text)) # uses default settings
# >> "This is a uncleaned text with lots of extra white space."

The .apply() method takes no arguments other than the text; it cleans and normalizes the white space in that text.
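The steps described for this model can be approximated with the standard library alone. The following is a minimal sketch, not the library's implementation: the helper name normalize_whitespace is hypothetical, and it assumes that new lines are replaced by a single space before repeated white space is collapsed.

```python
import re


def normalize_whitespace(text: str, newlinesep: str = "\n") -> str:
    """Hypothetical stand-in for the white space steps described above."""
    text = text.strip()                   # drop leading/trailing white space
    text = text.replace(newlinesep, " ")  # join lines (assumed: with a space)
    return re.sub(r"\s+", " ", text)      # collapse runs of white space


text = '''
    This is a   uncleaned text    with lots of
   extra white
space.
'''

print(normalize_whitespace(text))
```

Each line of the helper corresponds to one of the documented toggles (strip, newline, multispace), which is why the fields can be switched off independently.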

JSON schema:
{
   "title": "WhiteSpace",
   "description": "A Model to Normalize White Space (space, tabs, newlines) from Text\n\nCleaning texts of white spaces like from beginning, end, and\nalso multiple white spaces does not add any value to a text and\nshould thus be removed to normalize the text.\n\nA modular approach is now enabled which is derived from a base\nnormalization class. The usage is as below:\n\n.. code-block:: python\n\n    import nlpurify as nlpu\n    model = nlpu.preprocessing.normalization.WhiteSpace()\n\n    # let's define a multi-line uncleaned text\n    text = '''\n        This is a   uncleaned text    with lots of\n       extra white\n    space.\n    '''\n\n    print(model.apply(text)) # uses default settings\n    >> \"This is a uncleaned text with lots of extra white space.\"\n\nThe model does not accept additional arguments and the function\n``.apply()`` is used to clean and normalize white space from text.",
   "type": "object",
   "properties": {
      "name": {
         "default": null,
         "description": "Set the model name, or default to class name",
         "title": "Name",
         "type": "string"
      },
      "strip": {
         "default": true,
         "description": "\n        Strip white spaces from both the beginning and the end of the\n        string for normalization. By default, all the spaces are\n        removed as they do not provide any additional information for\n        a LLM/NLP based models and reduces token counts.\n\n        When the attribute is set to ``True`` the alternate parameters\n        :attr:`lstrip` and :attr:`rstrip` is ignored, check model\n        validator for more information. This uses the Python in-built\n        string function as in example below:\n\n        .. code-block:: python\n\n            text = \" this is a long text   \"\n            print(text.strip())\n            >> 'this is a long text'\n\n        Further customization like specifying alternate set of\n        characters to be removed from the string is also supported by\n        using the :attr:`strip_chars` attribute, for more information\n        check `docs <https://docs.python.org/3/library/stdtypes.html#str.strip>`_.\n        ",
         "title": "Strip",
         "type": "boolean"
      },
      "lstrip": {
         "default": true,
         "description": "\n        When set to true (default) removes the leading white characters\n        from the string, or specify alternate set using\n        :attr:`strip_chars` attribute.\n        ",
         "title": "Lstrip",
         "type": "boolean"
      },
      "rstrip": {
         "default": true,
         "description": "\n        When set to true (default) removes the trailing white\n        characters from the string, or specify alternate set using\n        :attr:`strip_chars` attribute.\n        ",
         "title": "Rstrip",
         "type": "boolean"
      },
      "strip_chars": {
         "default": null,
         "description": "\n        Custom set characters to be removed from the string. The\n        argument is not a \"prefix\" or a \"suffix\" but a combination of\n        all the values to be stripped. Check\n        `docs <https://docs.python.org/3/library/stdtypes.html#str.strip>`_\n        for more information.\n        ",
         "title": "Strip Chars",
         "type": "string"
      },
      "newline": {
         "default": true,
         "description": "\n        Strip new line characters from a multiple line (i.e., a\n        paragraph or text from \"text area\") to get one single text,\n        defaults to True. By default, :attr:`strip` removes new lines\n        from the beginning and end, while this argument using string\n        replace method to remove within lines - useful when the source\n        text is paragraphed and needs to be cleaned.\n        ",
         "title": "Newline",
         "type": "boolean"
      },
      "newlinesep": {
         "default": "\n",
         "description": "\n        A string value which defaults to the systems' default new line\n        seperator (\"\\r\\n\" `CRLF` for windows, and \"\\n\" `LF` for\n        *nix based systems) to replace from string.\n        ",
         "title": "Newlinesep",
         "type": "string"
      },
      "multispace": {
         "default": true,
         "description": "\n        Replace multiple spaces using regular expressions, which often\n        reduces the models' performance, defaults to True.\n        ",
         "title": "Multispace",
         "type": "boolean"
      }
   }
}

field strip: bool = True

Strip white space from both the beginning and the end of the string. By default, this white space is removed because it provides no additional information to LLM/NLP based models, and removing it reduces the token count.

When the attribute is set to True, the alternate parameters lstrip and rstrip are ignored; check the model validator for more information. This uses the Python built-in string method, as in the example below:

text = " this is a long text   "
print(text.strip())
# >> 'this is a long text'

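Further customization, such as specifying an alternate set of characters to remove from the string, is supported through the strip_chars attribute. For more information, see the Python documentation for str.strip (https://docs.python.org/3/library/stdtypes.html#str.strip).

Validated by:
  • model_validator
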
field lstrip: bool = True

When set to true (default), removes the leading white space characters from the string; an alternate set of characters can be specified using the strip_chars attribute.

Validated by:
  • model_validator

field rstrip: bool = True

When set to true (default), removes the trailing white space characters from the string; an alternate set of characters can be specified using the strip_chars attribute.

Validated by:
  • model_validator

field strip_chars: str = None

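A custom set of characters to be removed from the string. The argument is not a “prefix” or a “suffix”; every character in it is stripped, in any combination. See the Python documentation for str.strip (https://docs.python.org/3/library/stdtypes.html#str.strip) for more information.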
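The difference between a character set and a prefix/suffix is easy to miss, so here is a short standard-library illustration. It only demonstrates the behaviour of str.strip itself; the sample strings are illustrative (the second one follows the example in the Python documentation).

```python
# str.strip(chars) treats `chars` as a set of characters, not a prefix/suffix.
mixed = "xxyhello worldyx".strip("xy")      # any leading/trailing x or y is removed
domain = "www.example.com".strip("cmowz.")  # same idea, with a larger set

# Only the outer characters are affected; inner ones are kept.
dashed = "--a-b--".strip("-")

print(mixed, domain, dashed)
```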

Validated by:
  • model_validator

field newline: bool = True

Remove new line characters from multi-line text (i.e., a paragraph or text from a “text area”) to produce a single line of text, defaults to True. The strip attribute already removes new lines from the beginning and end of the text, while this argument uses the string replace method to remove the new lines within it. This is useful when the source text is split into paragraphs and needs to be cleaned.

Validated by:
  • model_validator

field newlinesep: str = '\n'

A string value which defaults to the system's default new line separator (“\r\n”, CRLF, on Windows and “\n”, LF, on *nix based systems) and is replaced in the string.
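The choice of separator matters when the text comes from a different platform than the one running the code. The sketch below is a standard-library illustration under stated assumptions: join_lines is a hypothetical helper, its default of "\n" follows the field signature shown above, and replacing the separator with a single space is an assumption about the desired result.

```python
import os


def join_lines(text: str, newlinesep: str = "\n") -> str:
    """Hypothetical helper: replace each new line separator with a space."""
    return text.replace(newlinesep, " ")


windows_text = "first line\r\nsecond line"

print(repr(os.linesep))                        # separator of the current platform
print(repr(join_lines(windows_text)))          # a stray "\r" is left behind
print(repr(join_lines(windows_text, "\r\n")))  # clean single line
```

When CRLF text is processed with an LF separator, the carriage return survives, so passing the separator explicitly is the safer option for mixed sources.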

Validated by:
  • model_validator

field multispace: bool = True

Replace repeated spaces with a single space using regular expressions, defaults to True. Repeated spaces often reduce a model's performance.
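Which regular expression is used decides whether tabs are collapsed too. The two patterns below are common choices shown for comparison only; the documentation does not state which pattern the model uses, so neither should be read as its implementation.

```python
import re

text = "too   many  spaces\tand\t\ttabs"

only_spaces = re.sub(r" {2,}", " ", text)   # collapses runs of the space character
any_whitespace = re.sub(r"\s+", " ", text)  # also turns tabs into single spaces

print(repr(only_spaces))
print(repr(any_whitespace))
```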

Validated by:
  • model_validator

apply(text: str) → str

An abstract method .apply() that takes any string value that needs to be converted. The apply function allows the same model to be used on n elements through a processing engine.

Parameters:

text (str) – Any string value that needs to be normalized; the method may also extend unique properties of the child.

Return Value

Return type:

str

Returns:

A normalized text, as determined by the child class's properties and attributes.

validator model_validator  »  all fields

A generic pydantic model validator, common to the class, which validates all the fields using the self.attribute parameter.

Raises:

UserWarning – A warning is raised when a parameter does not follow the specified directive. It is recommended to check the attribute settings before using apply(), or it might generate unwanted output.
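To make the warning behaviour concrete, here is a minimal sketch that uses a standard-library dataclass in place of the pydantic model. Everything in it is an assumption: the class name WhiteSpaceSettings is hypothetical, and the rule it checks (strip=True combined with a one-sided lstrip/rstrip setting) is only one plausible reading of "does not follow specified directive".

```python
import warnings
from dataclasses import dataclass


@dataclass
class WhiteSpaceSettings:
    """Hypothetical stand-in for the model's strip-related fields."""
    strip: bool = True
    lstrip: bool = True
    rstrip: bool = True

    def __post_init__(self) -> None:
        # Assumed rule: `strip=True` overrides a one-sided setting, so warn.
        if self.strip and not (self.lstrip and self.rstrip):
            warnings.warn(
                "strip=True ignores lstrip/rstrip; set strip=False to use them.",
                UserWarning,
            )


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    WhiteSpaceSettings()              # defaults: no warning
    WhiteSpaceSettings(lstrip=False)  # conflicting setting: one warning

print(len(caught), caught[0].category.__name__)
```

A warning rather than an exception lets processing continue, which is why the settings should be checked before calling apply().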

pydantic model nlpurify.preprocessing.normalization.CaseFolding

A Model to Normalize Case Folding from Texts

Text from a raw data source is often in title case or in mixed case, which may hinder the NLP/LLM model’s performance. The general convention is to convert everything to lower case using the native Python method lower(), which is available for strings.
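For text that is not plain ASCII, lower() and Python's dedicated str.casefold() can give different results. The snippet below only illustrates these two built-in methods; the documentation states that the model uses lower(), and nothing here suggests otherwise.

```python
word = "Straße"

lowered = word.lower()     # lower-casing keeps the "ß"
folded = word.casefold()   # Unicode case folding expands it to "ss"

print(lowered, folded)
print("STRASSE".lower() == lowered, "STRASSE".casefold() == folded)
```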

JSON schema:
{
   "title": "CaseFolding",
   "description": "A Model to Normalize Case Folding from Texts\n\nCase folding from raw data source is often in title case, or is in\na mixed case which may hinder the NLP/LLM model's performance. The\ngeneral convention is to convert all to lower cases using native\nPython function :func:`lower()` which is available for strings.",
   "type": "object",
   "properties": {
      "name": {
         "default": null,
         "description": "Set the model name, or default to class name",
         "title": "Name",
         "type": "string"
      },
      "upper": {
         "default": false,
         "description": "\n        Convert the text to upper case and return the text without\n        altering other things. Defaults to False, the class converts\n        the text to lower case which is recommended in LLM/NLP models.\n        ",
         "title": "Upper",
         "type": "boolean"
      },
      "lower": {
         "default": true,
         "description": "\n        Convert the contents fof the text to lower case (default) for\n        an easy forward integration with LLM/NLP based models.\n        ",
         "title": "Lower",
         "type": "boolean"
      }
   }
}

field upper: bool = False

Convert the text to upper case and return it without altering anything else. Defaults to False; by default the class converts the text to lower case, which is recommended for LLM/NLP models.

Validated by:
  • model_validator

field lower: bool = True

Convert the contents of the text to lower case (default) for easy forward integration with LLM/NLP based models.

Validated by:
  • model_validator

apply(text: str) → str

Normalize the text into either all lower case or all upper case, as the forward model requires.

validator model_validator  »  all fields

Validate all the attributes of the class, and raise an error when the validation fails for any given combination below.

Raises:

AssertionError – Raised when both the upper and lower attributes are set to True.
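The validation rule and the resulting behaviour can be sketched with a standard-library dataclass. This is an illustration under assumptions, not the nlpurify class: CaseFoldingSketch is a hypothetical name, and returning the text unchanged when both flags are False is an assumed choice that the documentation does not cover.

```python
from dataclasses import dataclass


@dataclass
class CaseFoldingSketch:
    """Hypothetical stand-in; not the nlpurify class."""
    upper: bool = False
    lower: bool = True

    def __post_init__(self) -> None:
        # Mirrors the documented rule: both flags cannot be True together.
        if self.upper and self.lower:
            raise AssertionError("Set only one of `upper` or `lower` to True.")

    def apply(self, text: str) -> str:
        if self.upper:
            return text.upper()
        # Assumed: leave the text untouched when both flags are False.
        return text.lower() if self.lower else text


print(CaseFoldingSketch().apply("Mixed Case Text"))
print(CaseFoldingSketch(upper=True, lower=False).apply("Mixed Case Text"))
```

Because lower defaults to True, asking for upper case means setting lower=False as well; setting only upper=True triggers the error.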

pydantic model nlpurify.preprocessing.normalization.StopWords

Normalize Raw Texts from Stop Words using NLTK Corpus

The model uses nltk.corpus to look up the stop words whose removal from a text can improve an NLP/LLM model’s performance. By default, the model uses the English stop words.

JSON schema:
{
   "title": "StopWords",
   "description": "Normalize Raw Texts from Stop Words using NLTK Corpus\n\nThe model uses the :mod:`nltk.corpus` to check the valid stopwords\nthat when removed from a text improves an NLP/LLM models'\nperformance. By default, the model is set to use the stopwords in\nthe English language.",
   "type": "object",
   "properties": {
      "name": {
         "default": null,
         "description": "Set the model name, or default to class name",
         "title": "Name",
         "type": "string"
      },
      "language": {
         "default": "english",
         "description": "\n        A valid language name which is available and defined under\n        :func:`nltk.corpus.stopwords`, defaults to the English. To see\n        a valid list of languages follow below.\n\n        .. code-block:: python\n\n            import nltk\n\n            # download the corpus if not already available\n            # nltk.download(\"stopwords\")\n            from nltk.corpus import stopwords\n\n            # once downloaded and available, check available list:\n            print(stopwords.fileids())\n\n        The code block is dependent on :mod:`nltk` for more information\n        check `docs <https://www.nltk.org/index.html>`_.\n        ",
         "title": "Language",
         "type": "string"
      },
      "extrawords": {
         "default": [],
         "description": "\n        The model gives the flexibility to add extra words which will\n        be treated as stopwords which are not already defined under\n        the :func:`nltk.corpus.stopwords` function. This can be\n        helpful in dynamic debuging and quick manipulation of text to\n        check forward models performance.\n        ",
         "items": {},
         "title": "Extrawords",
         "type": "array"
      },
      "excludewords": {
         "default": [],
         "description": "\n        Opposite to ``extrawords`` this attribute helps in updating\n        the stopwords by removing/excluding words from the already\n        defined words in ``stopwords.words(self.language)`` list.\n        ",
         "items": {},
         "title": "Excludewords",
         "type": "array"
      },
      "stopwords_in_uppercase": {
         "default": false,
         "title": "Stopwords In Uppercase",
         "type": "boolean"
      },
      "tokenize": {
         "default": true,
         "title": "Tokenize",
         "type": "boolean"
      },
      "tokenize_config": {
         "$ref": "#/$defs/WordTokenize",
         "default": {
            "regexp": false,
            "vanilla": false,
            "tokenizer": true,
            "regexp_pattern": "\\w+",
            "vanilla_split_by": " ",
            "vanilla_getalpha": false,
            "vanilla_getalnum": false,
            "tokenizer_language": "english",
            "tokenizer_preserve_line": false
         }
      }
   },
   "$defs": {
      "WordTokenize": {
         "description": "Tokenize text into word vectors using different types of methods\nto achieve cleaner text in desired formats.\n\n:param regexp, vanilla, tokenizer: Selection methods for different\n    tokenization techniques. Set the value to ``regexp = True`` to\n    tokenize text using regular expressions, for using pure Python\n    based text tokenization use the ``vanilla = True`` method, and\n    ``tokenizer = True`` (default) is for using external tokenizer\n    functions like :func:`nltk.tokenize.word_tokenize` methods.\n    The function will throw error if all of the values are set to\n    true, and only one can be true at a time.",
         "properties": {
            "regexp": {
               "default": false,
               "title": "Regexp",
               "type": "boolean"
            },
            "vanilla": {
               "default": false,
               "title": "Vanilla",
               "type": "boolean"
            },
            "tokenizer": {
               "default": true,
               "title": "Tokenizer",
               "type": "boolean"
            },
            "regexp_pattern": {
               "default": "\\w+",
               "title": "Regexp Pattern",
               "type": "string"
            },
            "vanilla_split_by": {
               "default": " ",
               "title": "Vanilla Split By",
               "type": "string"
            },
            "vanilla_getalpha": {
               "default": false,
               "title": "Vanilla Getalpha",
               "type": "boolean"
            },
            "vanilla_getalnum": {
               "default": false,
               "title": "Vanilla Getalnum",
               "type": "boolean"
            },
            "tokenizer_language": {
               "default": "english",
               "title": "Tokenizer Language",
               "type": "string"
            },
            "tokenizer_preserve_line": {
               "default": false,
               "title": "Tokenizer Preserve Line",
               "type": "boolean"
            }
         },
         "title": "WordTokenize",
         "type": "object"
      }
   }
}

field language: str = 'english'

A valid language name that is available under nltk.corpus.stopwords, defaults to English. To see the list of valid languages, use the code below.

import nltk

# download the corpus if not already available
# nltk.download("stopwords")
from nltk.corpus import stopwords

# once downloaded and available, check available list:
print(stopwords.fileids())

The code block depends on nltk; for more information, see the NLTK documentation (https://www.nltk.org/index.html).

Validated by:
  • __set_name__

field extrawords: list = []

The model gives the flexibility to add extra words, not already defined under nltk.corpus.stopwords, that will be treated as stop words. This can be helpful for quick debugging and manipulation of text to check the forward model's performance.

Validated by:
  • __set_name__

field excludewords: list = []

The opposite of extrawords: this attribute updates the stop words by removing/excluding words from the list already defined in stopwords.words(self.language).
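How extrawords and excludewords combine can be pictured with plain set operations. The sketch below is an assumption-laden illustration: build_stopwords is a hypothetical helper, the short base list stands in for stopwords.words("english"), and applying the additions before the exclusions is an assumed order.

```python
def build_stopwords(base, extrawords=(), excludewords=()):
    """Hypothetical helper: extend the base list, then drop excluded words."""
    return (set(base) | set(extrawords)) - set(excludewords)


# A tiny stand-in for stopwords.words("english"); the real list is much longer.
base = ["the", "a", "is", "not", "of"]

stop = build_stopwords(base, extrawords=["foo"], excludewords=["not"])
tokens = "the result is not a foo of chance".split()
kept = [token for token in tokens if token not in stop]

print(kept)
```

Excluding a word such as "not" keeps negations in the text, which is a typical reason to adjust the default list.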

Validated by:
  • __set_name__

field stopwords_in_uppercase: bool = False
Validated by:
  • __set_name__

field tokenize: bool = True
Validated by:
  • __set_name__

field tokenize_config: WordTokenize = WordTokenize(regexp=False, vanilla=False, tokenizer=True, regexp_pattern='\\w+', vanilla_split_by=' ', vanilla_getalpha=False, vanilla_getalnum=False, tokenizer_language='english', tokenizer_preserve_line=False)
Validated by:
  • __set_name__
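The tokenize_config defaults name three tokenization modes. The sketch below illustrates the two that need only the standard library, using the default regexp_pattern ("\w+") and vanilla_split_by (" ") values from the schema. The meaning given to vanilla_getalpha and vanilla_getalnum is an assumption based on their names, and the tokenizer mode (which the schema describes as relying on nltk.tokenize.word_tokenize) is left out.

```python
import re

text = "Hello, world 42"

# regexp = True: match runs of word characters with the default pattern.
regexp_tokens = re.findall(r"\w+", text)

# vanilla = True: plain str.split on the default separator " ".
vanilla_tokens = text.split(" ")

# Assumed meaning of vanilla_getalpha / vanilla_getalnum: keep only tokens
# for which str.isalpha() / str.isalnum() is true.
alpha_tokens = [t for t in vanilla_tokens if t.isalpha()]
alnum_tokens = [t for t in vanilla_tokens if t.isalnum()]

print(regexp_tokens, vanilla_tokens, alpha_tokens, alnum_tokens)
```

Note that the plain split keeps the comma attached to "Hello,", so that token fails both the alphabetic and the alphanumeric filter.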

apply(text: str) → str

An abstract method .apply() that takes any string value that needs to be converted. The apply function allows the same model to be used on n elements through a processing engine.

Parameters:

text (str) – Any string value that needs to be normalized; the method may also extend unique properties of the child.

Return Value

Return type:

str

Returns:

A normalized text, as determined by the child class's properties and attributes.

property stopwords_: list

nlpurify.preprocessing.normalization.normalize(text: str, whitespace: bool = True, casefolding: bool = True, stopwords: bool = True, **kwargs) → str

The normalization function provides a one-stop solution for all types of basic text normalization (white space, case folding, and stop word removal), each of which can be toggled on/off as per the end user’s need. A normalized text has the following properties:

  • It does not start or end with a white space character,

  • It does not contain repeated spaces anywhere in the sentence, and

  • It is not spread over multiple lines (i.e., a paragraph).

All the above properties are desirable and can improve performance when the text is used to train a large language model. Normalization of texts may also involve a uniform case, typically through str.lower(), before the text is used to create a word vector.

Parameters:

text (str) – The base uncleaned text; all the operations are performed on this text to return a cleaner version. The string can be single-line or multi-line (for example, from a “text area”) and can contain any type of escape characters.

All the normalization techniques are put into one callable method which in turn uses pydantic models for data validation and settings management of each technique. Below are the toggles:

Parameters:
  • whitespace (bool) – A technique that normalizes the white space in the underlying text. A text with repeated white space increases the processing load of an NLP/LLM model, which can hurt performance. White space in a text includes spaces, tabs, and new lines, which are the primary delimiters for an NLP/LLM model.

  • casefolding (bool) – A technique to normalize the case of a string to a desired format, i.e., either all upper case or all lower case. It is good practice to convert all the raw text to lower case before sending it for further modeling.

  • stopwords (bool) – A stop word is a common high-frequency word like “the”, “and”, etc. that carries little meaning of its own. Removing the stop words can often improve the model efficiency, defaults to True.

Keyword Arguments

The keyword arguments are used to toggle on/off each of the normalization techniques. Each technique is associated with an underlying dictionary which is defined under respective models.

Please refer to the underlying models for the detailed keyword arguments associated with each normalization technique, as below:

  • whitespace : Associated with white space removal; the function takes arguments associated with the native string functions of Python, check WhiteSpace for more information.

  • casefolding : Associated with setting a uniform text case; the model converts the whole string to either upper case or lower case using Python native string functions, for more details check the signature of the CaseFolding class.

  • stopwords : Associated with stop word removal; the underlying validation class is StopWords, check it for more details.

Code Example(s)

The default configuration gives, most of the time, the most widely used normal form of the text. This can be achieved using the default settings as below.

import nlpurify as nlpu

...
text = " My   unCleaned text!!    "
print(nlpu.preprocessing.normalize(text, ...))
# >> "my uncleaned text" (example of a cleaned text)

Return Data

Return type:

str

Returns:

A cleaner version of the string, normalized and treated so that it performs better in forward NLP/LLM based modeling.
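To show how the three toggles could compose, here is a minimal standard-library sketch. It is not the nlpurify implementation: normalize_sketch and the STOP set are hypothetical, the order of the steps is an assumption, and the tiny stop word list replaces the NLTK corpus that the library relies on.

```python
import re

# Tiny illustrative stop word list; the library relies on NLTK's corpus instead.
STOP = {"the", "a", "is", "of"}


def normalize_sketch(text, whitespace=True, casefolding=True, stopwords=True):
    """Hypothetical pipeline showing how the three toggles could compose."""
    if whitespace:
        text = re.sub(r"\s+", " ", text.strip())
    if casefolding:
        text = text.lower()
    if stopwords:
        text = " ".join(w for w in text.split(" ") if w.lower() not in STOP)
    return text


raw = "  The   Size of\nthe  Vocabulary  "
print(normalize_sketch(raw))
print(normalize_sketch(raw, stopwords=False))
print(normalize_sketch(raw, casefolding=False))
```

Running the white space step first means the later steps can split on a single space, which is one reason to keep that toggle on when stop word removal is enabled.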