NEO Share


The Magic of Synthetic Data: Using Artificial Intelligence to Train Artificial Intelligence with…

January 3, 2021 by systems

Table 1: Sample of GPT-2 Prompt Aid Tool

When generating synthetic reviews, I wanted the responses to expand on the genuine data while remaining strongly representative of it. So when I wrote the prefix prompts, I used words that were heavily represented in the genuine datasets. I wrote a Python function that organizes the genuine dataset corpus into trigrams (three consecutive words), bigrams (two consecutive words), and single words. The function also counts the occurrences of each word and combination and sorts them by frequency.
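The n-gram counting described here can be sketched in a few lines of standard-library Python. This is only an illustration, not the Prompt Aid Tool itself: the function names `ngram_counts` and `prompt_aid`, the simple regex tokenizer, and the two sample reviews are all assumptions made for the example.

```python
from collections import Counter
import re


def ngram_counts(texts, n):
    """Count n-grams (runs of n consecutive words) across a list of texts."""
    counts = Counter()
    for text in texts:
        # Assumed tokenizer: lowercase, keep letters, digits, and apostrophes.
        tokens = re.findall(r"[a-z0-9']+", text.lower())
        # Slide a window of width n over the tokens of this one review.
        for i in range(len(tokens) - n + 1):
            counts[" ".join(tokens[i:i + n])] += 1
    return counts


def prompt_aid(texts, top_k=10):
    """Return the top_k most frequent words, bigrams, and trigrams."""
    return {
        name: ngram_counts(texts, n).most_common(top_k)
        for name, n in (("words", 1), ("bigrams", 2), ("trigrams", 3))
    }


# Hypothetical reviews standing in for the genuine corpus.
reviews = [
    "The pizza was great and the crust was great",
    "Great pizza, the crust was thin",
]
aid = prompt_aid(reviews, top_k=3)
```

The frequent entries in `aid["bigrams"]` and `aid["trigrams"]` are the kind of phrases one might then work into a prefix prompt so that generated text stays close to the vocabulary of the genuine reviews.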

Figure 4: Synthetic Data concatenated with the Genuine Data

After I created the synthetic positive and negative datasets, I used pandas to concatenate them with the genuine negative and positive datasets.
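A minimal sketch of that concatenation step, assuming each dataset is a DataFrame with a text column and a label column. The column names, the `source` tag, and the toy rows are hypothetical and are not taken from the original notebook.

```python
import pandas as pd

# Toy stand-ins for the genuine and synthetic review sets (hypothetical rows).
genuine = pd.DataFrame({
    "text": ["Best pizza in town", "Cold slice and rude staff"],
    "label": ["positive", "negative"],
})
synthetic = pd.DataFrame({
    "text": ["The crust was perfectly crisp", "The pizza arrived burnt"],
    "label": ["positive", "negative"],
})

# Tag each row with its origin so the two sources stay distinguishable.
genuine["source"] = "genuine"
synthetic["source"] = "synthetic"

# Stack the frames and rebuild a clean 0..n-1 index.
combined = pd.concat([genuine, synthetic], ignore_index=True)
```

Keeping a `source` column is a design choice for this sketch: it makes it easy to filter the combined frame back down to genuine rows when training the genuine-only baseline.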

Figure 5: Baseline Model Testing

To ensure a fair and equal comparison of the performance metrics, I used the scikit-learn train_test_split function to establish a single ground-truth test set of 198 observations, drawn from a completely separate dataset taken from the Yelp Open Dataset. I then built two baseline models on two datasets using the Multinomial Naive Bayes classifier. The two datasets were the genuine Yelp Pizza Reviews dataset (450 observations) and the combined genuine and synthetic Yelp Reviews dataset (11,380 observations).
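The comparison setup can be sketched as follows: two Multinomial Naive Bayes models trained on different data and scored against one shared held-out set. This is an assumption-laden illustration, not the article's code. The bag-of-words features via `CountVectorizer`, the helper name `fit_baseline`, and the toy texts are all hypothetical, and the held-out set is hard-coded here rather than produced with `train_test_split`.

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline


def fit_baseline(train_texts, train_labels):
    """Fit a bag-of-words Multinomial Naive Bayes classifier."""
    model = make_pipeline(CountVectorizer(), MultinomialNB())
    model.fit(train_texts, train_labels)
    return model


# Hypothetical training data: 1 = positive review, 0 = negative review.
genuine_texts = ["great pizza", "terrible pizza", "great crust", "terrible service"]
genuine_labels = [1, 0, 1, 0]
synthetic_texts = ["great slice and great sauce", "terrible cold slice"]
synthetic_labels = [1, 0]

# One shared held-out test set, never seen by either model during training.
test_texts = ["great sauce", "terrible crust"]
test_labels = [1, 0]

genuine_model = fit_baseline(genuine_texts, genuine_labels)
combined_model = fit_baseline(
    genuine_texts + synthetic_texts,
    genuine_labels + synthetic_labels,
)

# Both models are evaluated on exactly the same observations.
genuine_preds = genuine_model.predict(test_texts)
combined_preds = combined_model.predict(test_texts)
```

Because each pipeline fits its own vectorizer, the combined model sees a larger vocabulary than the genuine-only model, which is one plausible route by which synthetic text could help on unseen reviews.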

Table 2: Baseline Model Performance Metrics

I chose precision, accuracy, recall, and F1 as performance metrics for the two baseline models. Overall, the Combined (Synthetic and Genuine) Model outperformed the Genuine Model on all performance metrics.
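For reference, the four metrics can be written out directly from confusion counts. The sketch below assumes a binary task with the positive class as the class of interest; the article does not say whether its figures use this convention or an averaged variant, and the counts passed in at the end are made up for illustration.

```python
def classification_metrics(tp, fp, tn, fn):
    """Compute accuracy, precision, recall, and F1 from confusion counts."""
    total = tp + fp + tn + fn
    # Guard every ratio against an empty denominator.
    accuracy = (tp + tn) / total if total else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    f1 = (
        2 * precision * recall / (precision + recall)
        if (precision + recall)
        else 0.0
    )
    return {
        "accuracy": accuracy,
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }


# Hypothetical counts, not results from the article.
metrics = classification_metrics(tp=8, fp=2, tn=7, fn=3)
```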

Figure 6: Baseline Model Confusion Matrix Analysis Results

I also performed a confusion matrix analysis. The Genuine Model had more True Positives but fewer True Negatives than the Synthetic and Genuine Model. The Synthetic and Genuine Model had more False Positives but fewer False Negatives than the Genuine Model.
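The four cells of a binary confusion matrix can be tallied with a short helper. This is a sketch under the assumption that labels are encoded as 1 (positive) and 0 (negative); the function name `confusion_counts` and the sample labels are hypothetical.

```python
def confusion_counts(y_true, y_pred, positive=1):
    """Tally true/false positives and negatives for a binary task."""
    tp = fp = tn = fn = 0
    for actual, predicted in zip(y_true, y_pred):
        if predicted == positive:
            if actual == positive:
                tp += 1  # predicted positive, actually positive
            else:
                fp += 1  # predicted positive, actually negative
        else:
            if actual == positive:
                fn += 1  # predicted negative, actually positive
            else:
                tn += 1  # predicted negative, actually negative
    return {"tp": tp, "fp": fp, "tn": tn, "fn": fn}


# Hypothetical labels, not data from the article.
y_true = [1, 1, 1, 0, 0, 0]
y_pred = [1, 0, 1, 0, 1, 0]
counts = confusion_counts(y_true, y_pred)
```

When two models are scored on the same test set, `tp + fn` always equals the number of actual positives and `tn + fp` the number of actual negatives, which gives a quick sanity check when comparing their matrices.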

In conclusion, the Synthetic and Genuine Model outperformed the Genuine Model on all performance metrics. This technique could allow organizations and businesses to build high-performing NLP classification models without the high cost associated with large-scale data acquisition. There are opportunities to explore this technique on datasets with a larger observation count, and to explore GPT-2 prompt design to better guide the model toward generating relevant text. This is an exciting machine learning technique that I feel deserves further exploration.

Yelp Open Dataset: https://www.yelp.com/dataset

GPT-2 Prompt Aid Tool: https://raw.githubusercontent.com/success81/Synthetic_NLP_Data_Generation_Paper/main/GPT_Prompt_Aid

GitHub: https://github.com/success81/Synthetic_NLP_Data_Generation_Paper

  1. Higginbotham, S. (2020, June 29). Fake data is great data when it comes to machine learning. Retrieved December 20, 2020, from https://staceyoniot.com/fake-data-is-great-data-when-it-comes-to-machine-learning/
  2. Radford, A. (2020, September 03). Better Language Models and Their Implications. Retrieved December 31, 2020, from https://openai.com/blog/better-language-models/
  3. Sweeney, E. (2019, March 06). IAB: 78% of marketers will spend more on data in 2019. Retrieved December 31, 2020, from https://www.marketingdive.com/news/iab-78-of-marketers-will-spend-more-on-data-in-2019/549811/

Filed Under: Machine Learning
