Learning algorithm parameters#
Parameters for a learning algorithm determine how the model is fitted to the data. A common example is the “C” or “slack” parameter of the soft-margin SVM, which defines the penalty associated with misclassifying individuals – i.e., whether the model allows no errors within the CV1 folds and fits the decision boundary very tightly, or whether it allows errors in order to improve generalizability. Parameters such as these can greatly affect model performance, and it is common practice in machine learning to optimize them for your specific problem and data.
NeuroMiner was developed to optimize performance across pre-defined parameter ranges within the nested cross-validation framework using a grid search, which protects against overfitting because the trained models are applied to held-out data. Default parameter ranges are provided in NeuroMiner based on the literature and empirical testing, but we strongly recommend that the parameters are defined for your study and problem.
As previously stated, NeuroMiner uses a dynamic menu configuration that changes based on previous input. This is also true for the learning algorithm parameters, whereby the menu options will change based on what you have selected in the “Classification algorithm” section.
Optimization strategies#
NeuroMiner offers several strategies for finding the optimal hyperparameters for your chosen learning algorithm. While a simple grid search is thorough, it can be computationally expensive. For complex models or large search spaces, more intelligent strategies can find better solutions in less time.
Brute-force optimization#
Brute-force optimization, also known as a grid search, is a straightforward and exhaustive method for tuning a model’s hyperparameters. Hyperparameters are the “settings” of a learning algorithm that are not learned from the data itself but are set by the user beforehand (e.g., the cost C of an SVM).
The goal of this method is to find the combination of hyperparameters that results in the best model performance.
How it works: Think of tuning a radio by methodically turning the frequency and volume knobs. Brute-force optimization does something similar:
Define a Grid: The user specifies a list, or “grid,” of possible values for each hyperparameter they want to tune. For example, for an SVM, you might test C = [0.1, 1, 10, 100] and gamma = [0.01, 0.1].
Exhaustive Search: The algorithm then systematically trains and evaluates a model for every single combination of these values. In the example above, it would test 8 different models (4 values for C × 2 values for gamma).
Select the Best: The combination of hyperparameters that achieves the best performance within the inner cross-validation loop (CV1) is selected as the “optimal” set.
This optimal set of parameters is then used to train the final model that will be evaluated on the held-out test data (CV2). While this method is computationally expensive, its “brute-force” nature guarantees that the best combination within the specified grid will be found.
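To make the procedure concrete, here is a minimal sketch of an exhaustive grid search in Python using scikit-learn. It is an illustration only, not NeuroMiner’s own (MATLAB) implementation, and the data and parameter values are arbitrary examples.

```python
from sklearn.svm import SVC
from sklearn.model_selection import GridSearchCV
from sklearn.datasets import make_classification

# Toy data standing in for a CV1 training partition
X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# The grid from the example above: 4 values of C x 2 values of gamma = 8 models
param_grid = {"C": [0.1, 1, 10, 100], "gamma": [0.01, 0.1]}

# A 5-fold inner cross-validation plays the role of the CV1 loop
search = GridSearchCV(SVC(kernel="rbf"), param_grid, cv=5, scoring="balanced_accuracy")
search.fit(X, y)

print(search.best_params_)   # combination with the best inner-CV performance
print(search.best_score_)    # its mean CV1 score
```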
Simulated Annealing (SA)#
Simulated Annealing is a sophisticated optimization algorithm inspired by the process of annealing in metallurgy, where a material is heated and then slowly cooled to strengthen it. In machine learning, it’s a powerful method for navigating large and complex hyperparameter spaces to find a global optimum without getting stuck in local optima.
How it Works: The algorithm starts at a random point in the hyperparameter space. At each step, it considers a new, nearby point. If the new point yields better performance, it’s accepted. If it’s worse, it might still be accepted with a certain probability. This probability is controlled by a “temperature” parameter:
At a high temperature, the algorithm is very likely to accept worse solutions, allowing it to “jump” out of local optima and explore the search space broadly.
As the temperature slowly cools, the algorithm becomes less likely to accept worse solutions, eventually converging on a high-performing region.
When to Use It: SA is highly recommended when your hyperparameter search space is very large, non-convex, or when you suspect there are many local optima where a simple search could get stuck. It is more efficient than a grid search for exploring high-dimensional spaces.
Parameters:#
Initial temperature: Controls the initial probability of accepting a worse-performing set of hyperparameters. A higher temperature encourages broader exploration at the start.
Max initial step size: Defines how “far” the algorithm can jump in the hyperparameter grid at the beginning of the search.
Cooling factor (alpha): A value typically between 0.8 and 0.99 that determines the rate of cooling. The temperature at each step is multiplied by this factor (e.g., T_new = T_old * 0.95). A value closer to 1.0 results in a slower, more thorough search.
Random jump probability: The probability (e.g., 0.1 for 10%) of making a completely random jump to a new part of the search space, which helps prevent the search from getting permanently trapped.
Maximum number of iterations: The total “budget” of model evaluations the algorithm is allowed to perform.
Reheating iterations & factor: If performance does not improve for a specified number of iterations, the algorithm can “reheat” (increase the temperature) to start exploring more broadly again.
Bayesian Optimization (BO)#
Bayesian Optimization is an intelligent, sample-efficient strategy for finding the maximum of an objective function—in this case, the model’s performance. It’s particularly effective when each model evaluation is computationally expensive (e.g., training a deep neural network).
How it Works: Instead of searching blindly, BO builds a probabilistic model (a “surrogate”) of the relationship between hyperparameters and performance. Think of it like a smart geologist drilling for oil:
Initial Sampling: It starts by evaluating a few random hyperparameter combinations (Number of seed iterations).
Build a Map: It uses these initial results to build a surrogate model (typically a Gaussian Process) that predicts performance and uncertainty across the entire search space.
Intelligent Selection: It then uses an “acquisition function” to decide the most promising point to evaluate next. This function balances exploitation (checking areas the model predicts will be good) and exploration (checking areas where the model is most uncertain).
Update and Repeat: After evaluating the new point, it updates its surrogate model with the new information and repeats the process, getting progressively more accurate.
When to Use It: BO is the preferred method when model training is time-consuming. It can often find a near-optimal set of hyperparameters in significantly fewer iterations than both grid search and simulated annealing.
Parameters:#
Maximum number of iterations: The total budget of model evaluations.
Max iterations without change: An early-stopping criterion. The optimization will halt if the best-found performance does not improve for this many iterations.
Number of seed iterations: The number of random points to evaluate before the intelligent Bayesian search begins. A value between 10 and 20 is often a good starting point to build a reasonable initial model.
Kernel function: This defines the behavior of the Gaussian Process surrogate model.
Squared Exponential / Matern: These are common, powerful choices. Matern kernels are often more flexible for complex, less smooth functions.
ARD (Automatic Relevance Determination): Modifies standard kernels (such as the squared exponential) into versions that automatically learn the relative importance of each input dimension. ARD-enabled kernels are very useful in high-dimensional spaces.
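For illustration, the sketch below runs a Bayesian optimization over an SVM’s C and gamma using the scikit-optimize package (gp_minimize). The choice of library, data, and search ranges are our own assumptions for the example; NeuroMiner’s implementation relies on MATLAB instead. Note that gp_minimize minimizes, so the cross-validated performance is negated.

```python
from skopt import gp_minimize
from skopt.space import Real
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Search space: C and gamma on a log-uniform scale
space = [Real(1e-2, 1e2, prior="log-uniform", name="C"),
         Real(1e-3, 1e0, prior="log-uniform", name="gamma")]

def objective(params):
    C, gamma = params
    score = cross_val_score(SVC(C=C, gamma=gamma), X, y,
                            cv=5, scoring="balanced_accuracy").mean()
    return -score          # gp_minimize minimizes, so negate the performance

result = gp_minimize(objective, space,
                     n_calls=30,            # maximum number of iterations
                     n_initial_points=10,   # seed iterations before the GP model kicks in
                     random_state=0)
print(result.x, -result.fun)                # best (C, gamma) and its CV performance
```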
Note
You’ll need the MATLAB Statistics and Machine Learning Toolbox to use Bayesian Optimization or Simulated Annealing.
SUPPORT VECTOR MACHINES (LIBSVM)#
Here we show an example for an RBF (Gaussian) kernel SVM classifier with LIBSVM. The options may vary depending on the configuration of the model.
1 | Define Slack/Regularization parameter(s)
2 | Define RBF/Gaussian kernel parameter(s)
3 | Enable regularization of model selection
4 | Criterion for cross-parameter model selection
5 | Define weight (lambda) of SV ratio
6 | Define non-linearity (big gamma) of SV ratio
7 | Specify cross-parameter model selection process
1 | Define Slack/Regularization parameter(s) The Slack/Regularization parameter, also known as the C parameter, is a hyperparameter specific to SVM. In NeuroMiner, a range of values can be specified here and it will be optimized throughout model training within the defined cross-validation structure.
2 | Define RBF/Gaussian kernel parameter(s) See link.
3 | Enable regularization of model selection This option enables control over model selection by considering the variability of performance across the CV1 test folds. This approach aims to improve model selection by factoring in the consistency of performance metrics obtained from cross-validation.
4 | Criterion for cross-parameter model selection Criterion used to choose the optimal model when multiple hyperparameters are being tested. The options are model complexity, ensemble diversity, model performance, or complexity and ensemble diversity.
5 | Define weight (lambda) of SV ratio The weight controls the trade-off between maximizing the margin and minimizing the classification errors of the SVM.
6 | Define non-linearity (big gamma) of SV ratio Determines the shape of the decision boundary. Higher gamma values create a more complex (highly non-linear) decision boundary that can fit intricate patterns in the data. This can be helpful for highly non-linear datasets but might also lead to overfitting.
7 | Specify cross-parameter model selection process This option controls how many models (parameter combinations) are selected: either one optimal model based on the criteria and regularization defined previously, or an ensemble of the top-performing models. You can select between: (1) select a single optimum parameter node, which returns one optimal model based on the criteria; (2) generate a cross-node ensemble by aggregating base learners above a predefined percentile, which results in an ensemble of the top-performing models; or (3) automatically determine the optimal percentile for optimum cross-node ensemble performance, which depends on the regularization.
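As an illustration of options 1–4, the sketch below evaluates a logarithmic range of C and gamma values with scikit-learn’s SVC and selects the combination whose CV1 performance is both high and stable across folds, using a simple mean-minus-standard-deviation criterion. The ranges and the regularization rule are our own example choices, not NeuroMiner defaults.

```python
import numpy as np
from itertools import product
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=20, random_state=0)

# Logarithmic ranges, similar in spirit to typical LIBSVM parameter grids
C_range = 2.0 ** np.arange(-5, 6, 2)       # slack/regularization parameter
gamma_range = 2.0 ** np.arange(-9, 0, 2)   # RBF kernel parameter

results = []
for C, gamma in product(C_range, gamma_range):
    scores = cross_val_score(SVC(kernel="rbf", C=C, gamma=gamma), X, y,
                             cv=5, scoring="balanced_accuracy")
    # Favour parameter nodes that are accurate AND consistent across CV1 folds
    results.append((scores.mean() - scores.std(), C, gamma))

best_score, best_C, best_gamma = max(results)
print(f"Selected C={best_C}, gamma={best_gamma} (regularized score {best_score:.3f})")
```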
SUPPORT VECTOR MACHINES (LIBLINEAR)#
Here we show an example of a linear SVM using the LIBLINEAR library.
1 | Define Slack/Regularization parameter(s)
2 | Define Weighting exponents
3 | Enable regularization of model selection using CV1 test performance variance
4 | Specify cross-parameter model selection process
1 | Define Slack/Regularization parameter(s) The Slack/Regularization parameter, also known as the C parameter, is a hyperparameter specific to SVM. In NeuroMiner, a range of values can be specified here and it will be optimized throughout model training within the defined cross-validation structure.
2 | Define Weighting exponents Configures the exponents of the weighting coefficients applied to samples from different classes in imbalanced data.
3 | Enable regularization of model selection This option enables control over model selection by considering the variability of performance across the CV1 test folds. This approach aims to improve model selection by factoring in the consistency of performance metrics obtained from cross-validation.
4 | Specify cross-parameter model selection process This option controls how many models (parameter combinations) are selected: either one optimal model based on the criteria and regularization defined previously, or an ensemble of the top-performing models. You can select between: (1) select a single optimum parameter node, which returns one optimal model based on the criteria; (2) generate a cross-node ensemble by aggregating base learners above a predefined percentile, which results in an ensemble of the top-performing models; or (3) automatically determine the optimal percentile for optimum cross-node ensemble performance, which depends on the regularization.
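For illustration, the sketch below evaluates a range of C values for a linear SVM on imbalanced toy data using scikit-learn’s LinearSVC; the class_weight="balanced" option is used here as a loose analogue of class weighting, since NeuroMiner’s weighting-exponent parameterization is specific to its own implementation.

```python
from sklearn.svm import LinearSVC
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

# Imbalanced toy data (roughly 80% / 20%)
X, y = make_classification(n_samples=300, n_features=50, weights=[0.8, 0.2], random_state=0)

for C in [0.01, 0.1, 1, 10]:
    clf = LinearSVC(C=C, class_weight="balanced", max_iter=5000)
    score = cross_val_score(clf, X, y, cv=5, scoring="balanced_accuracy").mean()
    print(f"C={C:<5} balanced accuracy={score:.3f}")
```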
GLMNET (Hastie’s library for LASSO/Elastic-net regularized GLMs)#
The configuration options for the GLMNET models are the following:
1 | Define GLMNET parameters
1 | Define GLMNET parameters This option allows the user to access another menu where 5 GLMNET parameters can be configured.
Inside option 1, the following parameters appear:
1 | Define mixing factor (alpha) range (0 =ridge <-> 1 =lasso)
2 | Define minimum lambda range (e.g. if N_cases>N_feats: 0.0001, otherwise 0.01)
3 | Define number of lambda optimization steps
4 | Define maximum number(s) of variables in the model
5 | Standardize input matrix to unit variance prior to training the elastic net
1 | Define mixing factor (alpha) range
Alpha is the elastic net mixing parameter, with range alpha ∈ [0, 1]: alpha = 1 is lasso regression (default) and alpha = 0 is ridge regression.
2 | Define minimum lambda range
Specifies the minimum value of lambda for regularization. A smaller value applies lighter regularization, while a larger value applies stronger regularization, particularly useful in high-dimensional settings.
3 | Define number of lambda optimization steps
Determines the number of lambda values over which the model will be optimized. A larger number gives a finer-grained search over the regularization path but increases computation time.
4 | Define maximum number(s) of variables in the model Sets a limit on the number of variables that can be selected by the model. This controls the model’s complexity by constraining the number of features it can use.
5 | Standardize input matrix to unit variance prior to training the elastic net Specifies whether to standardize the input data before fitting the model. When ‘Yes’, the features are scaled to unit variance.
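GLMNET itself is a separate library; as a loose scikit-learn analogue (our own illustration, not NeuroMiner’s call path), an elastic-net-penalized logistic regression can be set up as below, with l1_ratio playing the role of the mixing factor alpha and C acting as the (approximate) inverse of lambda. The data and parameter values are placeholders.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import cross_val_score
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=150, n_features=500, n_informative=20, random_state=0)

alphas = [0.1, 0.5, 1.0]                 # mixing factor: 0 = ridge, 1 = lasso
lambdas = np.logspace(-4, 0, 5)          # lambda path from the minimum lambda upwards

results = []
for alpha in alphas:
    for lam in lambdas:
        model = make_pipeline(
            StandardScaler(),            # "standardize input matrix" option
            LogisticRegression(penalty="elasticnet", solver="saga",
                               l1_ratio=alpha, C=1.0 / lam, max_iter=5000))
        score = cross_val_score(model, X, y, cv=3).mean()
        results.append((score, alpha, lam))

best_score, best_alpha, best_lambda = max(results)
print(f"best CV accuracy {best_score:.3f} at alpha={best_alpha}, lambda={best_lambda:.4f}")
```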
RANDOM FORESTS (sklearn RandomForestClassifier)#
The configuration options for the RF classifier are originally defined in the scikit-learn documentation of the RandomForestClassifier class. We will give a brief description of each parameter here.
1 | Define Number of decision trees
2 | Define Maximum number of features
3 | Define Function to measure the quality of a split
4 | Define Maximum depth of the tree
5 | Define Minimum number of samples to split
6 | Define Minimum number of samples to be at a leaf
7 | Define Minimum weighted fraction of the sum total of weights at a leaf
8 | Define Maximum number of leaf nodes
9 | Define Minimum decrease of impurity
10 | Define Bootstrap samples yes/no
11 | Define Out-of-bag samples yes/no
12 | Define Class weights
13 | Define Complexity parameter for Minimal Cost-Complexity Pruning
14 | Define Number of samples to draw from X to train base estimators (if bootstrap)
1 | Define Number of decision trees
Sets the number of decision trees (estimators) to be used in the random forest model. A higher number of trees generally improves accuracy but increases computation time.
2 | Define Maximum number of features
Specifies the maximum number of features to consider when looking for the best split at each node. Common settings include ‘sqrt’ (square root of the total number of features), ‘log2’, or an integer value.
3 | Define Function to measure the quality of a split
Chooses the criterion for splitting nodes in the trees. Options include ‘Gini impurity’, which measures the likelihood of misclassification, or ‘Entropy’, which measures the information gain.
4 | Define Maximum depth of the tree
Limits the maximum depth of each tree in the forest. Setting this prevents trees from growing too deep and potentially overfitting, while ‘No max. depth defined’ allows trees to grow until all leaves are pure or contain fewer than the minimum samples.
5 | Define Minimum number of samples to split
Determines the minimum number of samples required to split an internal node. A higher number can prevent overfitting by requiring larger segments of data to create a split.
6 | Define Minimum number of samples to be at a leaf
Specifies the minimum number of samples required to be at a leaf node. Increasing this value makes the model more conservative and prevents overly specific rules in the trees.
7 | Define Minimum weighted fraction of the sum total of weights at a leaf
Sets the minimum weighted fraction of the input data required to be at a leaf node. This parameter is particularly useful in datasets with varying sample weights.
8 | Define Maximum number of leaf nodes
Limits the maximum number of leaf nodes per tree. A smaller number results in simpler models, while ‘No max. N defined’ allows trees to grow fully based on the other criteria.
9 | Define Minimum decrease of impurity
Specifies the minimum decrease in impurity required for a node to be split. Setting this threshold helps prevent splitting nodes that do not significantly improve the model.
10 | Define Bootstrap samples yes/no
Indicates whether bootstrap samples (random samples with replacement) are used when building trees. ‘Yes’ enables bootstrapping, which is a common approach in random forests.
11 | Define Out-of-bag samples yes/no
Determines whether out-of-bag samples (samples not included in the bootstrap sample) are used to estimate the model’s performance. ‘Yes’ allows for internal performance evaluation without the need for a separate validation set.
12 | Define Class weights
Assigns weights to classes to handle imbalanced datasets. If ‘All equal’, each class is treated equally, but you can adjust weights to give more importance to underrepresented classes.
13 | Define Complexity parameter for Minimal Cost-Complexity Pruning
Specifies the threshold used for pruning the tree by considering the cost complexity. Higher values lead to more aggressive pruning.
14 | Define Number of samples to draw from X to train base estimators (if bootstrap)
Determines the number of samples to draw from the input dataset when bootstrapping is enabled. If set to ‘0’, each bootstrap sample is drawn with the same size as the full dataset.
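For reference, options 1–14 correspond to the arguments of scikit-learn’s RandomForestClassifier, as sketched below; the values shown are arbitrary examples or scikit-learn defaults, not NeuroMiner defaults.

```python
from sklearn.ensemble import RandomForestClassifier

rf = RandomForestClassifier(
    n_estimators=500,             # 1  | number of decision trees
    max_features="sqrt",          # 2  | maximum number of features per split
    criterion="gini",             # 3  | split-quality function ("gini" or "entropy")
    max_depth=None,               # 4  | maximum tree depth (None = no max. depth)
    min_samples_split=2,          # 5  | minimum number of samples to split a node
    min_samples_leaf=1,           # 6  | minimum number of samples at a leaf
    min_weight_fraction_leaf=0.0, # 7  | minimum weighted fraction at a leaf
    max_leaf_nodes=None,          # 8  | maximum number of leaf nodes
    min_impurity_decrease=0.0,    # 9  | minimum decrease of impurity for a split
    bootstrap=True,               # 10 | use bootstrap samples
    oob_score=True,               # 11 | use out-of-bag samples for evaluation
    class_weight="balanced",      # 12 | class weights for imbalanced data
    ccp_alpha=0.0,                # 13 | minimal cost-complexity pruning parameter
    max_samples=None,             # 14 | samples drawn per tree (None = full size)
)
```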
GRADIENT BOOSTING (sklearn GradientBoostingClassifier)#
The configuration options for the Gradient boosting classifier are originally defined in the scikit-learn documentation of the GradientBoostingClassifier class. We will give a brief description of each configuration parameter included in NeuroMiner here.
1 | Define maximum number of boosting iterations
2 | Select type of loss (exponential/deviance for classification, squared/absolute/huber/quantile for regression)
3 | Define learning rate (0<->1)
4 | Define subsampling factor (0<->1)
5 | Define maximum tree depth
1 | Define maximum number of boosting iterations
Sets the number of boosting stages (iterations) to be run. More iterations can improve model performance but may also lead to overfitting.
2 | Select type of loss (exponential/deviance for classification, squared/absolute/huber/quantile for regression)
Specifies the loss function used to measure model performance. For classification, options include ‘log_loss’ (deviance), while for regression, options include squared_error, absolute_error, huber, and quantile.
3 | Define learning rate (0<->1)
Sets the step size for each boosting iteration. A smaller learning rate requires more boosting iterations but can lead to a more accurate model.
4 | Define subsampling factor (0<->1)
Specifies the fraction of samples used for fitting each individual tree. Values less than 1 can help prevent overfitting by introducing randomness.
5 | Define maximum tree depth
Limits the depth of each individual tree in the boosting process. Shallower trees are less likely to overfit, while deeper trees can capture more complex patterns.
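For reference, options 1–5 map onto the arguments of scikit-learn’s GradientBoostingClassifier as sketched below; the values are arbitrary examples, not NeuroMiner defaults.

```python
from sklearn.ensemble import GradientBoostingClassifier

gb = GradientBoostingClassifier(
    n_estimators=200,       # 1 | maximum number of boosting iterations
    loss="log_loss",        # 2 | loss type (deviance; older scikit-learn versions use "deviance")
    learning_rate=0.05,     # 3 | learning rate (0 < rate <= 1)
    subsample=0.8,          # 4 | subsampling factor (< 1 introduces randomness)
    max_depth=3,            # 5 | maximum tree depth
)
```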
MLP PERCEPTRON (sklearn MLPClassifier)#
The configuration options for the MLP classifier are originally defined in the scikit-learn documentation of the MLPClassifier class. Please refer to the original scikit-learn documentation for an exhaustive definition of each parameter. We will provide a brief description of each parameter here.
1 | Define MLPERC parameters
2 | Enable regularization of model selection using CV1 test performance variance
3 | Specify cross-parameter model selection process
1 | Define MLPERC parameters This option allows the user to access another menu where MLP parameters can be configured. By default, NeuroMiner allows tuning of 4 MLP parameters (see below). Using expert mode (initialize NeuroMiner using “nm expert”), 21 MLP parameters can be configured.
2 | Enable regularization of model selection This option enables control over model selection by considering the variability of performance across the CV1 test folds. This approach aims to improve model selection by factoring in the consistency of performance metrics obtained from cross-validation.
3 | Specify cross-parameter model selection process This option controls how many models (parameter combinations) are selected: either one optimal model based on the criteria and regularization defined previously, or an ensemble of the top-performing models. You can select between: (1) select a single optimum parameter node, which returns one optimal model based on the criteria; (2) generate a cross-node ensemble by aggregating base learners above a predefined percentile, which results in an ensemble of the top-performing models; or (3) automatically determine the optimal percentile for optimum cross-node ensemble performance, which depends on the regularization.
Inside option 1, the following parameters appear:
1 | Define MLP structures: hidden layers (length of row) and sizes (values)
2 | Select activation function
3 | Select solver method
4 | Define alpha parameter
5 | Select batch size (expert mode)
6 | Select learning rate strategy (expert mode)
7 | Define initial learning rate (expert mode)
8 | Define power parameter for learning rate decay (expert mode)
9 | Set maximum number of iterations (expert mode)
10 | Set random seed (expert mode)
11 | Define tolerance for optimization (expert mode)
12 | Use warm start (expert mode)
13 | Set momentum parameter (expert mode)
14 | Use Nesterov’s momentum (expert mode)
15 | Set early stopping (expert mode)
16 | Define validation data fraction (expert mode)
17 | Define beta_1 parameter (expert mode)
18 | Define beta_2 parameter (expert mode)
19 | Define epsilon parameter for numerical stability (expert mode)
20 | Set number of iterations with no improvement before stopping (expert mode)
21 | Set maximum function evaluations (expert mode)
1 | Define MLP structures: hidden layers (length of row) and sizes (values)
Specifies the architecture(s) of the neural network including the number and size of hidden layers. Each row of the defined matrix will be used as a different network architecture.
For example, ‘[100 100; 200 100]’ will train two neural networks: one with a structure of 100×100 (two hidden layers, each with 100 neurons) and one with a structure of 200×100 (two hidden layers, the first with 200 neurons and the second with 100 neurons). If you need to define structures with different numbers of hidden layers, pad the shorter rows with ‘0’. For example, to test a 100×100×100 and a 100×100 structure, input ‘[100, 100, 100; 100, 100, 0]’. (A scikit-learn mapping of this and the other MLP options is sketched after this parameter list.)
2 | Select activation function
Chooses the activation function used in the neurons. Common options include ‘relu’ (Rectified Linear Unit), ‘tanh’, and ‘logistic’ (the sigmoid function).
3 | Select solver method
Determines the optimization algorithm used for training. Options include ‘adam’ (Adaptive Moment Estimation), ‘sgd’ (Stochastic Gradient Descent), and ‘lbfgs’ (Limited-memory Broyden-Fletcher-Goldfarb-Shanno).
4 | Define alpha parameter
Sets the regularization strength to prevent overfitting. A higher ‘alpha’ value increases regularization, helping to control model complexity.
5 | Select batch size
Specifies the number of samples used in each iteration of training. Options include ‘auto’ (which chooses a default value) or a specific integer value.
6 | Select learning rate strategy
Defines the learning rate schedule. Options include ‘constant’ (fixed learning rate), ‘invscaling’ (decreases learning rate as ‘1 / pow(t, power_t)’), and ‘adaptive’ (learning rate adapts based on performance).
7 | Define initial learning rate
Sets the starting learning rate for the training process. This parameter controls the step size during gradient descent.
8 | Define power parameter for learning rate decay
Specifies the power of the inverse scaling learning rate decay. Used when ‘learning_rate’ is set to ‘invscaling’.
9 | Set maximum number of iterations
Limits the number of iterations for training. This parameter helps control the time and resources spent on training.
10 | Set random seed
Specifies the seed for the random number generator to ensure reproducibility of results.
11 | Define tolerance for optimization
Sets the tolerance for stopping criteria. Training will stop when the improvement of the optimization is less than this tolerance.
12 | Use warm start
Indicates whether to reuse the solution of the previous call to fit and add more estimators. Setting this to ‘yes’ allows incremental training.
13 | Set momentum parameter
Specifies the momentum for the ‘sgd’ solver. Momentum helps accelerate convergence and smooth out training dynamics.
14 | Use Nesterov’s momentum
Enables or disables Nesterov’s Accelerated Gradient, which improves the convergence speed by considering the gradient of the anticipated future position.
15 | Set early stopping
Determines whether to use early stopping to halt training when the validation score is not improving. This helps prevent overfitting.
16 | Define validation data fraction
Sets the fraction of training data used for validation in early stopping. This helps monitor performance during training.
17 | Define beta_1 parameter
Specifies the exponential decay rate for the first moment estimates in the ‘adam’ optimizer. Typically set to a value like ‘0.9’.
18 | Define beta_2 parameter
Specifies the exponential decay rate for the second moment estimates in the ‘adam’ optimizer. Typically set to a value like ‘0.999’.
19 | Define epsilon parameter for numerical stability
Sets a small constant added to prevent division by zero in the ‘adam’ optimizer and other algorithms.
20 | Set number of iterations with no improvement before stopping
Specifies the number of iterations with no improvement in the validation score before stopping the training (early stopping).
21 | Set maximum function evaluations
Limits the number of function evaluations during the optimization process. This helps control the computational cost of training.
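For reference, the parameters above map onto the arguments of scikit-learn’s MLPClassifier as sketched below. Each row of the structure matrix from option 1 becomes one hidden_layer_sizes tuple (with ‘0’ placeholders dropped); the remaining values shown are scikit-learn defaults or arbitrary examples, not NeuroMiner defaults.

```python
from sklearn.neural_network import MLPClassifier

mlp = MLPClassifier(
    hidden_layer_sizes=(100, 100),  # 1  | network structure (one row of the matrix)
    activation="relu",              # 2  | activation function
    solver="adam",                  # 3  | solver method
    alpha=1e-4,                     # 4  | regularization strength
    batch_size="auto",              # 5  | batch size
    learning_rate="constant",       # 6  | learning rate strategy (sgd only)
    learning_rate_init=1e-3,        # 7  | initial learning rate
    power_t=0.5,                    # 8  | power for 'invscaling' learning rate decay
    max_iter=200,                   # 9  | maximum number of iterations
    random_state=42,                # 10 | random seed
    tol=1e-4,                       # 11 | tolerance for optimization
    warm_start=False,               # 12 | warm start
    momentum=0.9,                   # 13 | momentum (sgd only)
    nesterovs_momentum=True,        # 14 | Nesterov's momentum
    early_stopping=True,            # 15 | early stopping
    validation_fraction=0.1,        # 16 | validation data fraction
    beta_1=0.9,                     # 17 | adam first-moment decay rate
    beta_2=0.999,                   # 18 | adam second-moment decay rate
    epsilon=1e-8,                   # 19 | numerical stability constant
    n_iter_no_change=10,            # 20 | iterations with no improvement before stopping
    max_fun=15000,                  # 21 | maximum function evaluations (lbfgs only)
)
```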
SeqNN (TensorFlow Sequential Model)#
The configuration options for the Sequential Neural Network (SeqNN) are based on the Sequential Model class of the Tensorflow API. Please refer to the official TensorFlow documentation for a detailed explanation of all available parameters and functionalities. Below we provide a summary of the configurable options within NeuroMiner for SeqNN.
1 | Define SeqNN structures: hidden layers (length of row) and sizes (values)
2 | Select activation function
3 | Select solver method
4 | Define alpha parameter (L2 regularization coefficient)
5 | Select batch size
6 | Define initial learning rate
7 | Set maximum number of iterations
8 | Set random seed
9 | Set loss class weighting
10 | Enable early stopping
11 | Define early stopping patience
12 | Select training loss function
13 | Enable internal validation
14 | Define validation data fraction
1 | Define SeqNN structures: hidden layers (length of row) and sizes (values) Specifies the architecture(s) of the sequential neural network, including the number and size of hidden layers. Each row in the defined matrix represents a different network configuration. For example, [100 100; 200 100] will train two networks: one with two hidden layers of 100 neurons each, and another with layers of 200 and 100 neurons, respectively. If you need to define structures with different numbers of layers, use 0 as a placeholder. For example, [100, 100, 100; 100, 100, 0] will test two structures — one with three hidden layers (100×100×100) and one with two (100×100).
2 | Select activation function Chooses the activation function applied to each layer’s neurons. Common options include ‘relu’ (Rectified Linear Unit), ‘tanh’, ‘sigmoid’, ‘linear’ and ‘softmax’. The activation function introduces non-linearity, enabling the network to model complex input–output relationships.
3 | Select solver method Determines the optimization algorithm used to minimize the loss function during training. Common solvers include ‘adam’ (Adaptive Moment Estimation), ‘sgd’ (Stochastic Gradient Descent) and ‘adagrad’ (Adaptive Gradient Algorithm). The choice of solver affects training dynamics and convergence stability.
4 | Define alpha parameter (L2 regularization coefficient) Sets the L2 regularization strength to penalize large weights and reduce overfitting. A higher alpha value increases the regularization effect, promoting simpler, more generalizable models.
5 | Select batch size Specifies the number of samples processed before model weights are updated. Smaller batch sizes provide more stochastic updates and may help generalization, while larger batches lead to smoother convergence.
6 | Define initial learning rate Sets the initial step size for weight updates. This parameter controls how much the model’s parameters are adjusted at each iteration based on the gradient. Typical values are between 0.0001 and 0.01.
7 | Set maximum number of iterations Defines the total number of epochs (full passes through the training data). This determines how long the model is trained and directly affects both accuracy and computation time.
8 | Set random seed Specifies the random seed used for initialization and data shuffling. Ensures reproducibility of results across runs.
9 | Set loss class weighting Determines whether to apply class weights during training to handle imbalanced datasets. When enabled (‘yes’), minority classes are given greater influence on the loss function to balance model learning.
10 | Enable early stopping Enables early stopping to automatically halt training when validation performance ceases to improve. This prevents overfitting and unnecessary computation.
11 | Define early stopping patience Specifies the number of epochs to wait without improvement before triggering early stopping. Larger patience allows longer training before stopping.
12 | Select training loss function Defines the loss function used to optimize the model. The default is ‘categorical_crossentropy’, which measures the dissimilarity between the predicted class probabilities and the true class labels. It is commonly used in multi-class classification tasks, where the goal is to maximize the predicted probability of the correct class while minimizing that of the incorrect ones. Alternatively, kl_divergence (Kullback–Leibler divergence) measures how one probability distribution diverges from a second, expected probability distribution; in the context of neural networks, it quantifies how much the predicted probability distribution deviates from the true label distribution. In addition, NeuroMiner allows importing TensorFlow model weights using a soft loss defined according to the same mathematical formulation as the optimization criterion selected (e.g., BAC, ACC, F1 score). This enables consistency between model optimization and TensorFlow internal parameter updates.
13 | Enable internal validation Indicates whether to reserve part of the training data for internal validation. When enabled, this subset is used to monitor validation loss and guide early stopping.
14 | Define validation data fraction Sets the proportion of training data used for internal validation (e.g., 0.1 = 10%). The remaining data is used for model training.
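As an illustration, options 1–14 correspond roughly to the following Keras Sequential-model setup. This is our own sketch using the public TensorFlow/Keras API, not NeuroMiner’s internal call path; the data and parameter values are placeholders.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, regularizers

tf.random.set_seed(42)                                    # 8 | random seed

# Toy data: 200 samples, 20 features, 2 classes (one-hot encoded)
X = np.random.rand(200, 20).astype("float32")
y = tf.keras.utils.to_categorical(np.random.randint(0, 2, 200), 2)

alpha = 1e-4                                              # 4 | L2 regularization coefficient
model = tf.keras.Sequential([                             # 1 | structure: two hidden layers of 100
    layers.Dense(100, activation="relu",                  # 2 | activation function
                 kernel_regularizer=regularizers.l2(alpha)),
    layers.Dense(100, activation="relu",
                 kernel_regularizer=regularizers.l2(alpha)),
    layers.Dense(2, activation="softmax"),
])

model.compile(
    optimizer=tf.keras.optimizers.Adam(learning_rate=1e-3),  # 3, 6 | solver and initial learning rate
    loss="categorical_crossentropy",                          # 12   | training loss function
    metrics=["accuracy"],
)

early_stop = tf.keras.callbacks.EarlyStopping(            # 10, 11 | early stopping and patience
    monitor="val_loss", patience=10, restore_best_weights=True)

model.fit(
    X, y,
    epochs=100,                      # 7  | maximum number of iterations (epochs)
    batch_size=32,                   # 5  | batch size
    class_weight={0: 1.0, 1: 1.0},   # 9  | loss class weighting
    validation_split=0.1,            # 13, 14 | internal validation and its fraction
    callbacks=[early_stop],
    verbose=0,
)
```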
15 | Predefined models from files (leave none if using parameters above) Only available in NeuroMiner expert mode (initialize with nm expert). This configuration allows the user to upload a Python function defining a TensorFlow model directly in Python code, using either the Sequential API or the Functional API, the latter allowing for the creation of more flexible models. If files are selected, the rest of the parameters (1 to 14) are ignored, and the hyperparameter space will include the models defined in the Python files.
This feature also supports transfer learning, enabling users to load previously trained model weights into the defined architecture in Python. By doing so, it allows reusing or fine-tuning existing models on new datasets while maintaining the underlying learned representations. For more guidance on how to define models with files, see these Example files.
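A minimal sketch of such a Python file is shown below, using the Functional API and a commented-out weight-loading step for transfer learning. The function name, its signature, and the weight-file path are illustrative placeholders only; the exact interface NeuroMiner expects is documented in the linked example files.

```python
# hypothetical_model_file.py -- function name, signature, and paths are illustrative only
import tensorflow as tf
from tensorflow.keras import layers

def build_model():
    # Functional API: more flexible than the Sequential API
    inputs = tf.keras.Input(shape=(20,))
    x = layers.Dense(100, activation="relu")(inputs)
    x = layers.Dense(100, activation="relu")(x)
    outputs = layers.Dense(2, activation="softmax")(x)
    model = tf.keras.Model(inputs, outputs)

    # Transfer learning: reuse weights from a previously trained model (path is a placeholder)
    # model.load_weights("pretrained_weights.h5")
    return model
```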