Synthetic data#
This module allows the user to generate synthetic data and add it to the existing dataset, thereby increasing the sample size. The synthetic data is generated within the cross-validation framework, prior to preprocessing, from the original, ‘raw’ data of the current cross-validation fold. It is then used, together with the original data, to train the models. Importantly, the synthetic data undergoes the same procedures as the real data, i.e., all preprocessing steps.
The user can decide between different data generation methods (see below), whether the original data should be standardized prior to data generation, how many synthetic observations to generate, and whether the synthetic data should be saved to disk.
There are different methods available to generate the artificial data:
kNN-based data generation#
The kNN-based method interpolates new data points based on the k nearest neighbors of randomly selected real observations. The user can define the number of nearest neighbors used for interpolation and the distance measure (Euclidean, Manhattan, Cosine, Mahalanobis) used to find the k nearest neighbors.
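The sketch below illustrates the general idea under stated assumptions; the function name, interpolation factor, and default parameters are illustrative and do not reflect NeuroMiner's actual implementation. A random real observation is chosen, one of its k nearest neighbors is selected, and a new point is interpolated between the two.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def knn_synthetic(X, n_synthetic, k=5, metric="euclidean", rng=None):
    """Illustrative sketch: interpolate synthetic points between randomly
    chosen real observations and one of their k nearest neighbors."""
    rng = np.random.default_rng(rng)
    nn = NearestNeighbors(n_neighbors=k + 1, metric=metric).fit(X)
    # indices of the k nearest neighbors (column 0 is the point itself)
    _, idx = nn.kneighbors(X)
    synthetic = np.empty((n_synthetic, X.shape[1]))
    for i in range(n_synthetic):
        base = rng.integers(len(X))                    # random real observation
        neighbor = idx[base, rng.integers(1, k + 1)]   # one of its k neighbors
        gap = rng.random()                             # interpolation factor in [0, 1)
        synthetic[i] = X[base] + gap * (X[neighbor] - X[base])
    return synthetic
```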
ADASYN (Adaptive Synthetic Sampling)#
ADASYN (Adaptive Synthetic Sampling) is an oversampling technique commonly used in machine learning to address class imbalance in datasets. It focuses on the minority class by generating synthetic examples that are similar to the existing minority class instances. The beta parameter specifies how much class balancing is applied. In addition, the user can define the k of the density algorithm, which determines the neighborhoods in which data is generated, and the k value used by SMOTE for data interpolation within these neighborhoods.
What sets ADASYN apart is its adaptability: it increases the number of synthetic samples for minority class observations that are more challenging to classify correctly, helping to balance the class distribution effectively. This adaptiveness makes ADASYN a powerful tool for improving the performance of classifiers, particularly in situations where one class is significantly underrepresented.
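As an illustration of this mechanism, the following sketch allocates more synthetic samples to minority observations whose neighborhoods contain many majority observations and then interpolates SMOTE-style between minority neighbors. The function and parameter names (adasyn, k_density, k_smote) are illustrative assumptions, not NeuroMiner's internal names.

```python
import numpy as np
from sklearn.neighbors import NearestNeighbors

def adasyn(X, y, minority_label, beta=1.0, k_density=5, k_smote=5, rng=None):
    """Minimal ADASYN sketch: generate more synthetic samples around minority
    points that are surrounded by many majority points."""
    rng = np.random.default_rng(rng)
    X_min = X[y == minority_label]
    n_min, n_maj = len(X_min), len(X) - len(X_min)
    G = int((n_maj - n_min) * beta)          # total number of synthetic samples

    # density ratio: fraction of majority points among the k_density neighbors
    nn_all = NearestNeighbors(n_neighbors=k_density + 1).fit(X)
    _, idx = nn_all.kneighbors(X_min)
    r = np.array([(y[idx[i, 1:]] != minority_label).mean() for i in range(n_min)])
    if r.sum() == 0:
        return np.empty((0, X.shape[1]))
    n_per_point = np.round(r / r.sum() * G).astype(int)

    # SMOTE-style interpolation within the minority class
    nn_min = NearestNeighbors(n_neighbors=k_smote + 1).fit(X_min)
    _, idx_min = nn_min.kneighbors(X_min)
    synthetic = []
    for i, g in enumerate(n_per_point):
        for _ in range(g):
            j = idx_min[i, rng.integers(1, k_smote + 1)]
            gap = rng.random()
            synthetic.append(X_min[i] + gap * (X_min[j] - X_min[i]))
    return np.array(synthetic)
```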
Gaussian random interpolation with PCA-based data generation (PCAGauss)#
Gaussian random interpolation with PCA is useful when you want to create synthetic data that captures the statistical characteristics of the original data. First, a PCA is performed on the original dataset to identify the principal components (PCs) and their corresponding eigenvectors and eigenvalues. Next, the original data is projected onto the principal components (dimensionality = n features - 1). For each to-be-generated synthetic data point, values are randomly sampled from a Gaussian distribution for each of the PCs and added to the projected representation of the original data. Finally, the perturbed representation is projected back from PCA space into the original feature space to obtain the synthetic data point.
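A minimal sketch of this procedure is shown below, assuming scikit-learn's PCA and a component-wise noise scale derived from the eigenvalues; the noise scaling and function name are assumptions for illustration, not the exact values used by NeuroMiner.

```python
import numpy as np
from sklearn.decomposition import PCA

def pca_gauss_synthetic(X, n_synthetic, rng=None):
    """Sketch of Gaussian random interpolation in PCA space: project the data
    onto the PCs, perturb randomly chosen projected observations with
    component-wise Gaussian noise, and project back."""
    rng = np.random.default_rng(rng)
    pca = PCA(n_components=min(X.shape[1] - 1, X.shape[0] - 1))
    scores = pca.fit_transform(X)                      # projection onto the PCs
    base = rng.integers(len(X), size=n_synthetic)      # random real observations
    # Noise scaled by the standard deviation along each PC
    # (assumed scaling; NeuroMiner's exact noise variance may differ)
    noise = rng.normal(size=(n_synthetic, scores.shape[1])) * np.sqrt(pca.explained_variance_)
    return pca.inverse_transform(scores[base] + noise)
```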
Gaussian Mixture Models (GMM-based data generation)#
Gaussian Mixture Models (GMM) are useful for creating synthetic data that reflects the underlying distribution of a real dataset, especially when the data is thought to originate from multiple subpopulations or clusters. GMM models the data as a combination of multiple Gaussian distributions, each representing a cluster. First, the GMM learns the means, covariances, and weights of these Gaussians by fitting them to the original data with an expectation-maximization algorithm. Once the model is trained, synthetic data points can be generated by sampling from the mixture of learned Gaussian components. The GMM tool in NeuroMiner uses the scikit-learn implementation (GaussianMixture). If the GMM does not fit into your computer’s RAM due to the dimensionality of your input data, NeuroMiner will use PCA to reduce the dimensionality before running the GMM algorithm. If you choose to find the optimal GMM across a range of possible component numbers, the optimal number of components will be identified by the minimum BIC or AIC.
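The following sketch mirrors this behavior using scikit-learn's GaussianMixture, selecting the number of components by the minimum BIC (or AIC) and sampling synthetic observations from the best model. The function signature and component range are illustrative, and the PCA pre-reduction step for high-dimensional data is omitted.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

def gmm_synthetic(X, n_synthetic, component_range=range(1, 11), criterion="bic", seed=0):
    """Fit GaussianMixture models over a range of component numbers, keep the
    one with the lowest BIC (or AIC), and sample synthetic observations from it."""
    best_model, best_score = None, np.inf
    for n_components in component_range:
        gmm = GaussianMixture(n_components=n_components, random_state=seed).fit(X)
        score = gmm.bic(X) if criterion == "bic" else gmm.aic(X)
        if score < best_score:
            best_model, best_score = gmm, score
    X_synth, _ = best_model.sample(n_synthetic)   # sample from the selected mixture
    return X_synth
```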
Important
Use data permanently by writing it to disk: it is very important to save the generated data to your computer if you want to do further analyses beyond training the model, such as model visualization, interpretation, and external validation.