When generating synthetic reviews, I wanted the responses to expand on the genuine data while remaining strongly representative of it. To that end, I built the prefix prompts from words that were heavily represented in the genuine datasets. I wrote a Python function that organizes the genuine dataset corpus into trigrams (consecutive three-word combinations), bigrams (two-word combinations), and individual words. The function also counts these words and combinations and sorts them by frequency.
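A minimal sketch of such an n-gram counting function, using `collections.Counter` and simple whitespace tokenization (the original function's exact tokenization is not described, so this is an assumption):

```python
from collections import Counter

def ngram_counts(corpus, n):
    """Count all n-grams (consecutive n-word combinations) across a list of
    reviews, returned sorted from most to least frequent."""
    counts = Counter()
    for review in corpus:
        tokens = review.lower().split()  # naive whitespace tokenization
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts.most_common()

# Tiny illustrative corpus (not the actual Yelp data)
reviews = ["great pizza great crust", "great pizza but slow service"]
print(ngram_counts(reviews, 2)[:3])  # most frequent bigrams first
```

Running the same function with `n=1` and `n=3` yields the word and trigram frequency lists used to seed the prefix prompts.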
After creating the synthetic positive and negative datasets, I used pandas to concatenate the genuine negative and positive datasets.
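The concatenation step amounts to a single `pd.concat` call. The column names and toy rows below are hypothetical, since the original DataFrames are not shown:

```python
import pandas as pd

# Hypothetical structure: one text column and a sentiment label
positive = pd.DataFrame({"text": ["Great pizza, amazing crust!"], "label": [1]})
negative = pd.DataFrame({"text": ["Cold pizza and slow service."], "label": [0]})

# Stack the two labeled datasets into one and reset the row index
combined = pd.concat([positive, negative], ignore_index=True)
print(combined)
```

`ignore_index=True` rebuilds the row index so the combined dataset is numbered continuously rather than repeating the original indices.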
To ensure a fair and equal comparison of performance metrics, I used scikit-learn's train_test_split method to establish a single ground-truth test set of 198 observations, drawn from an entirely separate portion of the Yelp Open Dataset. I then built two baseline models, one per dataset, using the Multinomial Naive Bayes classifier algorithm. The two datasets were the genuine Yelp pizza reviews dataset (450 observations) and the combined genuine and synthetic Yelp reviews dataset (11,380 observations).
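A minimal sketch of one such baseline, assuming a bag-of-words representation via `CountVectorizer` (the original feature extraction step is not specified). The toy texts stand in for the real 450- and 11,380-review datasets:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.model_selection import train_test_split
from sklearn.naive_bayes import MultinomialNB

# Toy stand-in data; the real experiment used hundreds of Yelp reviews
texts = ["great pizza", "terrible pizza", "loved the crust", "never again"] * 10
labels = [1, 0, 1, 0] * 10

# A single fixed split keeps the test set identical across model comparisons
X_train, X_test, y_train, y_test = train_test_split(
    texts, labels, test_size=0.25, random_state=42)

# Vectorize on the training set only, then fit the Naive Bayes baseline
vectorizer = CountVectorizer()
X_train_counts = vectorizer.fit_transform(X_train)
model = MultinomialNB().fit(X_train_counts, y_train)

predictions = model.predict(vectorizer.transform(X_test))
```

Fitting the vectorizer on the training split alone prevents test-set vocabulary from leaking into the model.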
I chose precision, accuracy, recall, and F1 score as performance metrics for the two baseline models. Overall, the combined (synthetic and genuine) model outperformed the genuine model on all performance metrics.
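All four metrics are available in `sklearn.metrics`. The labels below are illustrative only; the paper's actual metric values are not reproduced here:

```python
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Illustrative ground truth and predictions (not the experiment's results)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

print(accuracy_score(y_true, y_pred))   # fraction of all predictions correct
print(precision_score(y_true, y_pred))  # TP / (TP + FP)
print(recall_score(y_true, y_pred))     # TP / (TP + FN)
print(f1_score(y_true, y_pred))         # harmonic mean of precision and recall
```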
I also performed a confusion matrix analysis. Compared with the combined synthetic-and-genuine model, the genuine model produced more true positives but fewer true negatives. Conversely, the combined model produced more false positives but fewer false negatives than the genuine model.
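The true/false positive and negative counts can be read directly off scikit-learn's confusion matrix. The example labels below are hypothetical:

```python
from sklearn.metrics import confusion_matrix

# Illustrative labels only (not the experiment's actual predictions)
y_true = [1, 1, 0, 0, 1, 0]
y_pred = [1, 0, 0, 1, 1, 0]

# For binary labels, rows are true classes and columns are predictions:
# [[TN, FP], [FN, TP]] — ravel() flattens it into the four counts
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(tn, fp, fn, tp)
```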
In conclusion, the combined synthetic-and-genuine model outperformed the genuine model on all performance metrics. This technique could allow organizations and businesses to build high-performing NLP classification models without the high cost associated with large-scale data acquisition. There are opportunities to explore the technique on datasets with larger observation counts, and to refine GPT-2 prompt design to better guide the model toward generating relevant text. This is an exciting machine learning technique that I feel deserves further exploration.
Yelp Open Dataset: https://www.yelp.com/dataset
GPT-2 Prompt Aid Tool: https://raw.githubusercontent.com/success81/Synthetic_NLP_Data_Generation_Paper/main/GPT_Prompt_Aid
- Higginbotham, S. (2020, June 29). Fake data is great data when it comes to machine learning. Retrieved December 20, 2020, from https://staceyoniot.com/fake-data-is-great-data-when-it-comes-to-machine-learning/
- Radford, A. (2020, September 03). Better Language Models and Their Implications. Retrieved December 31, 2020, from https://openai.com/blog/better-language-models/
- Sweeney, E. (2019, March 06). IAB: 78% of marketers will spend more on data in 2019. Retrieved December 31, 2020, from https://www.marketingdive.com/news/iab-78-of-marketers-will-spend-more-on-data-in-2019/549811/