Legacy Functions
Legacy Version of NLP Utility Module
The NLP utility module has been refactored and moved to a new version with a more advanced collection of features. However, the existing code from the gist is maintained under the legacy submodule until dependent code is gradually migrated.
More Information: Issue #5 on code migrations and submodule details.
Caution
The documentation does not follow the PEP 8 convention and is not properly maintained. This submodule is kept only as a precaution.
NLP Utilities
A set of utility functions related to natural language
processing. The code uses the nltk library along with basic
string formatting to clean and process text.
Warning
The functions are not optimized and the test cases have not been verified. Use the functions with caution.
Getting Started
To use the function and its capabilities, first install the required libraries:
$ pip install fuzzywuzzy
$ pip install python-Levenshtein # improve performance
The legacy code is a standalone submodule and can be used by existing dependent modules as follows:
import nlpurify.legacy as nlpu # nlp-utility functions
print(nlpu.text_processor("some random string that needs cleaning"))
To use the function, the nltk.corpus data must be downloaded for
stopwords and related resources. More information is available
here.
- nlpurify.legacy.nlp_utils.fuzzyMatch(string: str, reference: str, method: str = 'partial_ratio') → int
Calculate a Percentage Similarity between string and reference Text
Using the fuzzywuzzy.fuzz module, the function calculates the percentage similarity between two texts. Several methods are available and can be selected via the method parameter. The default, partial_ratio, works well when matching a text against partial data. For example, we may want to find all the strings that contain the word ‘anonymous’ even though its spelling or position differs in each case.
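To illustrate the idea behind partial matching, here is a minimal sketch that uses only the standard library. It is not the fuzzywuzzy implementation and not the code of fuzzyMatch; the helper name partial_similarity is hypothetical, and the sliding-window approach is an assumption chosen for clarity.

```python
from difflib import SequenceMatcher


def partial_similarity(string: str, reference: str) -> int:
    """Hypothetical helper: score (0-100) of the shorter text against
    the best equal-length window of the longer text."""
    shorter, longer = sorted((string, reference), key=len)
    if not shorter:
        return 0  # nothing to compare against

    best = 0.0
    # Slide a window of len(shorter) across the longer text and keep
    # the highest similarity ratio found.
    for start in range(len(longer) - len(shorter) + 1):
        window = longer[start:start + len(shorter)]
        best = max(best, SequenceMatcher(None, shorter, window).ratio())
    return round(100 * best)


# The word matches fully even though it sits in the middle of the text.
print(partial_similarity("anonymous", "the anonymous author"))
# A misspelled variant still scores high against the correct word.
print(partial_similarity("annonymous", "an anonymous tip"))
```

A full-string ratio would penalize the extra words around the match, which is why a partial score is usually preferred for this kind of search.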
- nlpurify.legacy.nlp_utils.processor(string: str, text_process: bool = False, **kwargs) → str
A Simple Utility Function to Pre-Process a String
The function takes a string and returns a clean, formatted string that is free of English stop words and whose words are lemmatized, i.e. transformed to their base form.
- Parameters:
string (str) – Base string on which various nltk functions are applied to remove unwanted information.
text_process (bool) – Whether the base string should also be formatted using text_processor(). Defaults to False.
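The stop-word removal and lemmatization steps described for processor can be sketched roughly as follows. This is an illustration under stated assumptions, not the legacy implementation: clean_tokens, TOY_STOP_WORDS and TOY_LEMMAS are hypothetical names, and the tiny word lists stand in for the nltk resources so that the sketch runs without any downloads.

```python
from typing import Callable, Iterable

# Toy stand-ins (assumptions) for the nltk stop-word corpus and lemmatizer.
TOY_STOP_WORDS = frozenset({"a", "an", "the", "is", "are", "of", "on", "in", "and"})
TOY_LEMMAS = {"cats": "cat", "sitting": "sit", "mats": "mat"}


def toy_lemmatize(word: str) -> str:
    """Look the word up in the toy table; unknown words are returned as-is."""
    return TOY_LEMMAS.get(word, word)


def clean_tokens(
    string: str,
    stop_words: Iterable[str] = TOY_STOP_WORDS,
    lemmatize: Callable[[str], str] = toy_lemmatize,
) -> str:
    """Hypothetical helper: drop stop words, then lemmatize what remains."""
    stop_set = set(stop_words)
    tokens = string.lower().split()  # naive whitespace tokenization
    kept = [lemmatize(token) for token in tokens if token not in stop_set]
    return " ".join(kept)


print(clean_tokens("The cats are sitting on the mats"))  # cat sit mat
```

With nltk installed and its data downloaded, the same helper could presumably be called with nltk.corpus.stopwords.words("english") as stop_words and nltk.stem.WordNetLemmatizer().lemmatize as lemmatize; the results will differ from the toy tables above.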
- nlpurify.legacy.nlp_utils.text_processor(string: str, **kwargs) → str
Uses String Methods to Clean a String
An extension of the processor function, which uses the built-in Python string methods to clean string contents. The function can be called separately, or by passing text_process=True to processor. More information on built-in string methods is available here: https://www.programiz.com/python-programming/methods/string.
Attention
The function is not yet optimized when used in conjunction with processor().
- Parameters:
string (str) – Base string which needs formatting. The string is converted to lower case. If passed from processor(), this step is repeated. TODO: fix when passed through the parent function.
Keyword Arguments
isalnum (bool): Only keep alphanumeric characters in the string. Defaults to False.
isalpha (bool): Only keep alphabetic characters in the string. Defaults to False.
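The effect of the two keyword arguments can be sketched with the built-in str.isalnum and str.isalpha methods. This is a minimal sketch, not the legacy code: the name clean_string is hypothetical, and both the choice to keep whitespace and the precedence of isalpha over isalnum are assumptions made for the example.

```python
def clean_string(string: str, **kwargs) -> str:
    """Hypothetical helper: lower-case the text and optionally filter characters."""
    isalnum = kwargs.get("isalnum", False)
    isalpha = kwargs.get("isalpha", False)

    string = string.lower()

    if isalpha:
        keep = str.isalpha  # assumption: the stricter flag wins if both are set
    elif isalnum:
        keep = str.isalnum
    else:
        return string  # no filtering requested

    # Assumption: whitespace is preserved so that words stay separated.
    return "".join(ch for ch in string if keep(ch) or ch.isspace())


print(clean_string("Hello, World 42!"))                # hello, world 42!
print(clean_string("Hello, World 42!", isalnum=True))  # hello world 42
```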