Legacy Functions

Legacy Version of NLP Utility Module

The NLP utility module has been refactored and moved to a new version with a more advanced collection of features. However, the existing code from the gist is maintained under the legacy submodule until dependent code has been migrated.

More Information: Issue #5 on code migrations and submodule details.

Caution

The documentation does not follow the PEP 8 convention and is not actively maintained. This submodule is kept only as a precaution.

NLP Utilities

A set of utility functions related to natural language processing. The code uses the nltk library along with basic string formatting to clean and process text.

Warning

The functions are not optimized and have no verified test cases. Use them with caution.

Getting Started

To use the functions, first install the required libraries:

$ pip install fuzzywuzzy
$ pip install python-Levenshtein # improve performance

The legacy code is a standalone submodule and can be imported by existing dependent modules like so:

import nlpurify.legacy as nlpu # nlp-utility functions
print(nlpu.text_process("some random string that needs cleaning"))

To use the functions, the nltk corpora (stopwords and related data) must be installed. More information is available here.

nlpurify.legacy.nlp_utils.fuzzyMatch(string: str, reference: str, method: str = 'partial_ratio') -> int

Calculate a Percentage Similarity between a String and a Reference Text

Using the fuzzywuzzy.fuzz module, the function calculates the percentage of similarity between two pieces of text. Several scoring methods are available and can be selected via the method parameter. The default, partial_ratio, works well when a text should be matched against partial data. For example, it can find all the strings that contain the word ‘anonymous’ even when the spelling or position differs in each case.
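The idea behind a partial match can be illustrated with a minimal sketch built on the standard library only. It is not the fuzzywuzzy implementation and is not what fuzzyMatch runs: the helper name partial_ratio_sketch is hypothetical, and the sliding-window scan is a simplification, so scores may differ from those returned by fuzzywuzzy.

```python
from difflib import SequenceMatcher


def partial_ratio_sketch(string: str, reference: str) -> int:
    """Hypothetical helper: best similarity (0-100) of the shorter text
    against any equally long window of the longer text."""
    shorter, longer = sorted((string, reference), key=len)
    if not shorter:
        return 0  # nothing to compare against

    window = len(shorter)
    best = 0.0
    # Slide a window of the shorter text's length across the longer text
    # and keep the highest similarity ratio found.
    for start in range(len(longer) - window + 1):
        chunk = longer[start:start + window]
        best = max(best, SequenceMatcher(None, shorter, chunk).ratio())
    return round(100 * best)


# An exact occurrence anywhere in the reference scores 100.
exact = partial_ratio_sketch("anonymous", "report filed by anonymous user")
# A misspelled variant still scores high against the best-aligned window.
fuzzy = partial_ratio_sketch("annonymous", "posted by anonymous")
print(exact, fuzzy)
```

Scoring against the best-aligned window, instead of the whole reference, is what keeps the score high when the word is surrounded by unrelated text.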

nlpurify.legacy.nlp_utils.processor(string: str, text_process: bool = False, **kwargs) -> str

A Simple Utility Function to Pre-Process a String

The function takes a string and returns a clean, formatted string that is free of English stop words, with the remaining words lemmatized, i.e. transformed to their base form.

Parameters:
  • string (str) – Base string on which various nltk functions are applied to remove unwanted information.

  • text_process (bool) – Whether to format the base string using text_process() first. Defaults to False.
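The two steps described above, stop-word removal followed by lemmatization, can be sketched without any dependencies. Everything in this block is a hypothetical stand-in: the legacy module relies on nltk for its stop-word list and lemmatizer, whereas STOP_WORDS, LEMMAS and processor_sketch below are toy substitutes written only to show the order of operations.

```python
# Toy stand-in for the English stop-word list that nltk provides.
STOP_WORDS = {"the", "a", "an", "is", "are", "was", "were", "of", "and", "to", "in", "on"}

# Toy stand-in for a lemmatizer: maps a few inflected forms to a base form.
LEMMAS = {"cats": "cat", "running": "run", "mice": "mouse"}


def processor_sketch(string: str) -> str:
    """Hypothetical helper: drop stop words, then map each word to its base form."""
    tokens = string.lower().split()
    kept = [tok for tok in tokens if tok not in STOP_WORDS]
    # Words missing from the toy table are passed through unchanged.
    return " ".join(LEMMAS.get(tok, tok) for tok in kept)


print(processor_sketch("The cats were running in the garden"))
```

A real lemmatizer decides the base form from a dictionary and, in some cases, the part of speech, so its output will not always match this lookup table.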

nlpurify.legacy.nlp_utils.text_processor(string: str, **kwargs) -> str

Uses String Methods to Clean a String

An extension of the processor function that uses Python's built-in string methods to clean string contents. The function can be called separately, or enabled by passing text_process=True to processor. More information on the built-in string methods is available here: https://www.programiz.com/python-programming/methods/string.

Attention

The function is not yet optimized for use in conjunction with processor.

Parameters:

  • string (str) – Base string which needs formatting. The string is converted to lower case. If it is passed from processor(), this step is repeated. TODO: fix when passed through the parent function.

Keyword Arguments

  • isalnum (bool): Only keep alphanumeric characters in the string. Defaults to False.

  • isalpha (bool): Only keep alphabetic characters in the string. Defaults to False.
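The effect of the two keyword arguments can be approximated with a short sketch built on str.isalnum() and str.isalpha(). The helper clean_text_sketch is hypothetical, and its behaviour rests on assumptions not stated in this documentation: filtering is done per character, whitespace is always kept so that words stay separated, and isalpha takes precedence when both flags are set.

```python
def clean_text_sketch(string: str, isalnum: bool = False, isalpha: bool = False) -> str:
    """Hypothetical helper: lower-case a string and optionally filter its characters."""
    text = string.lower()

    # Pick the character test to apply (assumption: isalpha wins over isalnum).
    if isalpha:
        keep = str.isalpha
    elif isalnum:
        keep = str.isalnum
    else:
        keep = None

    if keep is not None:
        # Assumption: whitespace is preserved so that words are not merged.
        text = "".join(ch for ch in text if keep(ch) or ch.isspace())

    # Collapse runs of whitespace left behind by removed characters.
    return " ".join(text.split())


print(clean_text_sketch("Hello, World! 2024"))                # hello, world! 2024
print(clean_text_sketch("Hello, World! 2024", isalnum=True))  # hello world 2024
print(clean_text_sketch("Hello, World! 2024", isalpha=True))  # hello world
```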