Synthetic Data for Better and Safer AI Models

As modern machine learning algorithms grow in complexity and sophistication, so does their need for larger sets of training data. In practice these sets may be smaller than required, of poor quality, or unrepresentative of the actual domain of the real-life problem being modelled. Simulation is a useful tool for countering these issues: it can augment the training input and speed up the training process of the algorithm, and it can also help with subsequent validation and robustification by producing original samples and scenarios intended to enhance the algorithm's predictability and trustworthiness. Our aim in the STAR project is to provide a simulated reality module that works as an add-on for predictive methods, while also interacting with the human operator in order to produce verifiably realistic scenarios that augment the automated decision-making process.

Simulated reality can be applied in a variety of manufacturing scenarios, such as visual quality inspection and autonomous robotic control. In a supervised setting such as defect detection, it takes the form of synthetic data generation. In this use case, data augmentation is often necessary because of the class skew caused by the rarity of defective artifacts. Underrepresented classes can be supplemented with synthetic images constructed from the existing data. Recent works have achieved good prediction rates on skewed training sets using state-of-the-art methods such as Convolutional Variational Autoencoders and Generative Adversarial Networks. A step beyond this is the synthesis of novel defects, usually based on prior knowledge from different products. Neural Style Transfer has been used successfully to fuse defect snippets with non-defective images. Such images can be used to test, and also enhance, the algorithm's predictive capacity.
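To make the class-skew idea concrete, the sketch below oversamples a rare defect class by perturbing existing images (random flips plus mild noise). This is a deliberately minimal stand-in for the heavier generative models mentioned above (VAEs, GANs); the function name, array shapes, and perturbation choices are illustrative assumptions, not the project's actual pipeline.

```python
import numpy as np

def augment_minority(images, target_count, rng=None, noise_std=0.02):
    """Oversample a minority class by perturbing existing images.

    images: (N, H, W) float array with values in [0, 1].
    target_count: desired total number of samples after augmentation.
    Each synthetic sample is a randomly flipped copy of a real image
    with mild Gaussian noise added, then clipped back into [0, 1].
    """
    rng = rng or np.random.default_rng(0)
    n_needed = target_count - len(images)
    synthetic = []
    for _ in range(n_needed):
        img = images[rng.integers(len(images))].copy()
        if rng.random() < 0.5:
            img = img[:, ::-1]  # horizontal flip
        img = img + rng.normal(0.0, noise_std, img.shape)
        synthetic.append(np.clip(img, 0.0, 1.0))
    return np.concatenate([images, np.stack(synthetic)], axis=0)

# Toy skewed setting: only 5 "defect" images of size 8x8 are available,
# but we want 100 minority-class samples to balance the training set.
defects = np.random.default_rng(1).random((5, 8, 8))
balanced = augment_minority(defects, target_count=100)
print(balanced.shape)  # (100, 8, 8)
```

In a real pipeline the perturbation step would be replaced by samples drawn from a trained generative model, but the bookkeeping (how many samples are needed, keeping the originals intact) stays the same.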

In our work in STAR we are tackling such a characteristic supervised defect detection use case, where the dataset is small and heavily imbalanced, with images of defects being rare and hard to obtain. When choosing a suitable data augmentation method it was therefore important to take into account its ability to handle small datasets, something that training large autoencoders or GANs from scratch cannot offer. We are currently examining and evaluating different state-of-the-art approaches that address this issue, such as selectively utilizing complex open-source GANs trained on large datasets such as ImageNet. Lack of data is not the only challenge we have come across; high image fidelity is also an important requirement, as the differences between defects and non-defects can be minuscule. If this requirement is not met, the use of synthetic data can backfire and confuse the classifier further. To overcome this issue we chose to purposefully fuse synthetic with real data and apply filtering and post-processing, which has led to interesting results. Our next steps are to improve the quality and usefulness of the synthetic data even further, and to produce novel examples that will help assess and increase the robustness of the underlying AI models.
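One simple way to realise the "filter before fusing" idea is to reject synthetic samples that stray too far from the real data. The sketch below, a hedged illustration rather than the STAR implementation, keeps only synthetic feature vectors whose nearest real neighbour lies within a distance threshold; the function name, feature representation, and threshold are assumptions for the example.

```python
import numpy as np

def filter_synthetic(real, synthetic, max_dist):
    """Keep only synthetic samples close enough to some real sample.

    real, synthetic: (N, D) feature arrays. A synthetic sample whose
    nearest real neighbour is farther than max_dist is discarded, so
    low-fidelity generations never reach the classifier.
    """
    # Pairwise Euclidean distances, shape (n_synthetic, n_real).
    d = np.linalg.norm(synthetic[:, None, :] - real[None, :, :], axis=-1)
    keep = d.min(axis=1) <= max_dist
    return synthetic[keep]

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, (50, 16))
# 10 plausible generations near the real data, 5 off-distribution ones.
good = real[:10] + rng.normal(0.0, 0.05, (10, 16))
bad = rng.normal(5.0, 1.0, (5, 16))
synth = np.concatenate([good, bad])
kept = filter_synthetic(real, synth, max_dist=1.0)
print(len(kept))  # 10: the off-distribution samples are rejected
```

In practice the distance would be computed in a learned feature space (e.g. embeddings from the classifier itself) rather than raw pixels, and the surviving synthetic samples would then be mixed with the real training data.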