You will need the following modules before running the code: numpy, pandas, sklearn, and nltk. For nltk, please also download the corpora and treebanks using the nltk.download() function to ensure smooth execution (you do not strictly need all of them, but the files are small, and if you plan to learn nltk you will need them anyway).
This function opens a window that lets you choose and install the nltk components you need:
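A minimal snippet, assuming nltk is already installed (for example via pip install nltk):

```python
import nltk

# Opens the interactive NLTK Downloader window, where you can select
# and install individual corpora, models, and treebanks.
nltk.download()

# Alternatively, fetch specific resources non-interactively, e.g. the
# tokenizer models and the PoS tagger used later in this pipeline:
# nltk.download('punkt')
# nltk.download('averaged_perceptron_tagger')
```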
Here are the implementation steps so you can relate them to the code structure:
- The code looks for the input text files in the designated folders.
- The code opens one input text document at a time. For batch processing, place more than one file in the designated folder; the code will iterate over and process each text file. Make sure each file has a .txt extension so it is easy to identify.
- The code parses each input text file in these seven steps (a minimal sketch of the whole pipeline follows this list):
3.1 Opens the master punctuations file (sb_golden_punctuations.txt). It contains the punctuation marks or characters you want to remove from your input files. You can add or remove characters in this file depending on what you want to delete or retain from your source data.
3.2 Removes punctuation and unwanted characters from each of the input files, using the characters listed in the master punctuations file.
3.3 Opens the master Penn PoS tags file (sb_golden_penn_pos_tags.csv) as a pandas DataFrame.
3.4 Tokenizes the input text file. Each file’s tokens are eventually stored as features together with their counts.
3.5 Conditionally joins the DataFrame of the master Penn PoS tags file with the input text file's data, using the 'filename' column as the common key.
3.6 Saves each PoS tag along with its count. Refer to the image in the later paragraphs for the file structure.
3.7 Uses sklearn's LabelEncoder to convert features into numerical data. This is needed only for the non-numerical features.
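To make these steps concrete, here is a minimal sketch of the pipeline under stated assumptions: the input folder input_texts/, the tag column in the master CSV, and the output filename pos_tag_features.csv are illustrative, not the chapter's actual names. Also, instead of a DataFrame join on 'filename', this sketch aligns each file's tag counts against the master tag list via reindex, which achieves the same alignment in fewer steps:

```python
import glob
from collections import Counter

import nltk
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# 3.1: load the master punctuation characters as one string.
with open('sb_golden_punctuations.txt', encoding='utf-8') as f:
    punctuations = f.read().strip()

# 3.3: load the master Penn PoS tags file as a pandas DataFrame
# (assumes it has a 'tag' column listing the Penn tag set).
master_tags = pd.read_csv('sb_golden_penn_pos_tags.csv')

rows = []
for path in glob.glob('input_texts/*.txt'):  # iterate over the .txt files
    with open(path, encoding='utf-8') as f:
        text = f.read()

    # 3.2: strip the unwanted punctuation characters.
    text = text.translate(str.maketrans('', '', punctuations))

    # 3.4: tokenize (requires the 'punkt' resource), then tag and
    # count the Penn PoS tags (requires 'averaged_perceptron_tagger').
    tokens = nltk.word_tokenize(text)
    tag_counts = Counter(tag for _, tag in nltk.pos_tag(tokens))

    # One row per file: the filename plus one count column per tag.
    row = {'filename': path}
    row.update(tag_counts)
    rows.append(row)

# 3.5/3.6: align the per-file counts with the master tag list so every
# file reports a count for every Penn tag, then persist the result.
features = pd.DataFrame(rows)
features = features.reindex(
    columns=['filename'] + list(master_tags['tag']), fill_value=0
).fillna(0)

# 3.7: encode the non-numerical 'filename' feature as integers.
features['filename'] = LabelEncoder().fit_transform(features['filename'])

features.to_csv('pos_tag_features.csv', index=False)
```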
Tip: You may want to use sklearn’s MinMaxScaler to map the features into the range [0, 1], which is useful for deep learning; just make sure you convert all your values to a float datatype first.
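A short sketch of that tip, assuming the features DataFrame produced above:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Cast everything to float, then scale each column into [0, 1].
scaler = MinMaxScaler(feature_range=(0, 1))
scaled_features = pd.DataFrame(
    scaler.fit_transform(features.astype(float)),
    columns=features.columns,
)
```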
You can choose to add a target or dependent variable as the last column of this final file, for example:
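For illustration (the labels here are placeholders, not part of the chapter's code):

```python
# Hypothetical example: one label per input file, appended as the last
# column before saving; replace the placeholder labels with your own.
features['target'] = ['news'] * len(features)
features.to_csv('pos_tag_features_with_target.csv', index=False)
```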
The Python code creates various CSV files at critical points during execution. This helps you track the progress of execution and the state of the data, which reduces debugging effort.
Coming back, here is a brief description of each file: