Synthetic Data: What, Why, and How? — innotescus

Innotescus LLC
Apr 7, 2021

Synthetic Minority Oversampling Technique

Synthetic data is particularly useful when a dataset contains too few examples of a minority class for a model to learn the decision boundary effectively. One way to address this is to oversample the minority class by duplicating its examples in the training dataset. While this balances the class distribution, it gives the model no new information. Rather than duplicating existing examples, data scientists can synthesize new ones around the minority class using the Synthetic Minority Oversampling Technique (SMOTE).
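To make the duplication baseline concrete, here is a minimal sketch of random oversampling with NumPy. The function name `random_oversample`, the toy arrays, and the assumption of a binary problem (one minority label, everything else treated as the majority) are illustrative choices, not something prescribed by the article.

```python
import numpy as np

def random_oversample(X, y, minority_label, seed=0):
    """Duplicate randomly chosen minority rows until the classes are balanced.

    Assumes a binary problem: every row whose label differs from
    `minority_label` is counted as majority.
    """
    rng = np.random.default_rng(seed)
    minority_idx = np.flatnonzero(y == minority_label)
    majority_count = int(np.sum(y != minority_label))
    n_extra = majority_count - minority_idx.size
    if n_extra <= 0:
        # Already balanced (or the "minority" is not actually smaller).
        return X.copy(), y.copy()
    # Sample existing minority rows with replacement: pure duplication.
    extra_idx = rng.choice(minority_idx, size=n_extra, replace=True)
    X_bal = np.concatenate([X, X[extra_idx]], axis=0)
    y_bal = np.concatenate([y, y[extra_idx]], axis=0)
    return X_bal, y_bal

# Toy data: 8 majority rows (label 0) and 2 minority rows (label 1).
X = np.arange(20, dtype=float).reshape(10, 2)
y = np.array([0] * 8 + [1] * 2)
X_bal, y_bal = random_oversample(X, y, minority_label=1)
```

After balancing, the minority class has as many rows as the majority class, but it still contains only the original distinct points, which is why duplication adds no new information.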

SMOTE is a method of synthesizing data to bolster datasets that include rare events or scenarios whose detection is crucial, such as cancer detection. In its basic form, SMOTE selects two data points: one from the minority class and one of its nearest neighbors in feature space that also belongs to the minority class. It then creates a synthetic data point at a random position along the line segment between the two in feature space. Using SMOTE to create more samples in and around the minority class allows the model to better define the minority class and the crucial boundaries around it.
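The interpolation step can be sketched in a few lines of NumPy. This is a simplified illustration rather than a reference implementation: it assumes the commonly used formulation in which the second point is drawn from the first point's k nearest neighbors within the minority class, and the names `smote_sketch`, `X_min`, and `n_synthetic` are hypothetical.

```python
import numpy as np

def smote_sketch(X_min, n_synthetic, k=5, seed=0):
    """Create synthetic minority samples by interpolating between neighbors.

    X_min holds only the minority-class rows, shape (n_samples, n_features).
    """
    rng = np.random.default_rng(seed)
    n = X_min.shape[0]
    if n < 2:
        raise ValueError("need at least two minority samples to interpolate")
    k = min(k, n - 1)

    # Pairwise Euclidean distances between minority samples.
    diffs = X_min[:, None, :] - X_min[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)  # a point is not its own neighbor

    # Indices of the k nearest minority neighbors of each sample.
    neighbors = np.argsort(dists, axis=1)[:, :k]

    # Pick a base sample and one of its neighbors for every synthetic point.
    base = rng.integers(0, n, size=n_synthetic)
    pick = rng.integers(0, k, size=n_synthetic)
    neigh = neighbors[base, pick]

    # Place each synthetic point at a random position on the segment.
    gap = rng.random((n_synthetic, 1))
    return X_min[base] + gap * (X_min[neigh] - X_min[base])

# Toy minority class: 6 points in a 2-D feature space.
X_min = np.random.default_rng(42).normal(size=(6, 2))
X_syn = smote_sketch(X_min, n_synthetic=10, k=3)
```

Because every synthetic point is a convex combination of two real minority samples, it always falls between existing examples instead of on top of one. In practice, a maintained implementation such as the `SMOTE` class in the imbalanced-learn package is usually preferable to hand-rolled code like this.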

Generative Adversarial Networks

Generative Adversarial Networks (GANs) generate synthetic data by training two networks against each other: a generative model, which creates synthetic data, and a discriminator model, which is trained to classify data as real or fake. The two are trained together until the discriminator can no longer distinguish the generator's synthetic data from real data, at which point it classifies samples as real or fake with only about 50% accuracy, no better than chance. Once trained, the generative model can be used to create synthetic data for the intended application.
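The adversarial training loop can be illustrated with a deliberately tiny example. The sketch below is a toy under stated assumptions, not a practical GAN: the "networks" are a one-dimensional linear generator G(z) = a*z + b and a logistic discriminator D(x) = sigmoid(w*x + c), the real data is assumed to come from a normal distribution with mean 4, gradients are written out by hand, and the generator uses the common non-saturating loss. All names and hyperparameters are hypothetical, and a toy like this may oscillate rather than converge cleanly.

```python
import numpy as np

def sigmoid(s):
    # Clip the logit to avoid overflow in exp for extreme values.
    return 1.0 / (1.0 + np.exp(-np.clip(s, -50.0, 50.0)))

def train_toy_gan(steps=2000, batch=64, lr=0.05, seed=0):
    rng = np.random.default_rng(seed)
    a, b = 1.0, 0.0  # generator parameters: G(z) = a*z + b
    w, c = 0.0, 0.0  # discriminator parameters: D(x) = sigmoid(w*x + c)
    for _ in range(steps):
        x_real = rng.normal(4.0, 1.0, size=batch)  # assumed "real" data
        z = rng.normal(size=batch)                 # generator noise input
        x_fake = a * z + b

        # Discriminator step: minimize -log D(real) - log(1 - D(fake)).
        d_real = sigmoid(w * x_real + c)
        d_fake = sigmoid(w * x_fake + c)
        grad_w = np.mean((d_real - 1.0) * x_real) + np.mean(d_fake * x_fake)
        grad_c = np.mean(d_real - 1.0) + np.mean(d_fake)
        w -= lr * grad_w
        c -= lr * grad_c

        # Generator step: minimize -log D(fake) (non-saturating loss).
        d_fake = sigmoid(w * x_fake + c)
        grad_a = np.mean((d_fake - 1.0) * w * z)
        grad_b = np.mean((d_fake - 1.0) * w)
        a -= lr * grad_a
        b -= lr * grad_b
    return (a, b), (w, c)

def discriminator_accuracy(gen, disc, n=10_000, seed=1):
    """Fraction of fresh real and fake samples the discriminator labels correctly."""
    rng = np.random.default_rng(seed)
    (a, b), (w, c) = gen, disc
    x_real = rng.normal(4.0, 1.0, size=n)
    x_fake = a * rng.normal(size=n) + b
    real_ok = np.mean(sigmoid(w * x_real + c) > 0.5)
    fake_ok = np.mean(sigmoid(w * x_fake + c) <= 0.5)
    return 0.5 * (real_ok + fake_ok)

gen, disc = train_toy_gan()
acc = discriminator_accuracy(gen, disc)
# Once trained, the generator alone produces synthetic samples.
samples = gen[0] * np.random.default_rng(2).normal(size=5) + gen[1]
```

The value of `acc` is the quantity the paragraph above refers to: an accuracy near 0.5 means the discriminator is doing no better than guessing. Real GANs replace both linear models with deep networks and rely on an automatic differentiation framework instead of hand-written gradients.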

Though the techniques outlined above leverage real data, they go a step further than augmentation in creating new data rather than simply altering existing data. Synthetic data can be a powerful way to address the shortcomings of data collection, which can become too slow or costly for certain types of data. Though technically more challenging than data augmentation, synthesizing new data can help train networks to address a greater variety of scenarios that, although less common or harder to record, are crucial to the successful deployment of a machine learning model.

Originally published at https://innotescus.io on April 7, 2021.
