Classification

class TwinAPI.SimuLearn.MLibrary.ClassificationML

A classification problem is one in which the output is a set of binary values or one of several classes. For example, the output may be the operating state of a machine, i.e. “ON” or “OFF”, while the inputs are several variables ranging from categorical to numerical values.

setup(data: Optional[Union[list, numpy.ndarray, pandas.core.frame.DataFrame, str]] = None, x_features: Optional[Union[list, numpy.ndarray, pandas.core.frame.DataFrame]] = None, y_labels: Optional[Union[list, numpy.ndarray, pandas.core.frame.DataFrame]] = None, verbose: int = 0, header: bool = False)

This function reads the input data and selects the x features and y labels used for the experiment. All parameters are optional.

Example

from TwinAPI.SimuLearn.MLibrary import ClassificationML
model_classification = ClassificationML()
exp_x, exp_y = model_classification.setup(data = 'input.csv', x_features = '[0:5, 4, x1:x5]', y_labels = '[12]')

data: Union[list, np.ndarray, pd.DataFrame, str], default = None

Input dataset for the training experiment. Accepts a list, a numpy ndarray or a pandas DataFrame. The user can also provide a ‘csv’, ‘xls’ or ‘xlsx’ file as an input string. Generally, a ‘csv’ file or a DataFrame works better.

x_features: Union[list, np.ndarray, pd.DataFrame], default = None

Names of the input x features. If the data parameter is set to None, the user can provide a list, a numpy ndarray or a pandas DataFrame as input. If the data parameter is set to a pandas DataFrame or a ‘csv’ file, this parameter takes column names or column indices in the form of a list. Column names are case-sensitive and should match the original dataset. Example: ‘[0:5, 4, x1:x5]’.

y_labels: Union[list, np.ndarray, pd.DataFrame], default = None

Names of the output y labels. If the data parameter is set to None, the user can provide a list, a numpy ndarray or a pandas DataFrame as input. If the data parameter is set to a pandas DataFrame or a ‘csv’ file, this parameter takes column names or column indices in the form of a list. Column names are case-sensitive and should match the original dataset. Example: ‘[0:5, 4, y1:y5]’.

verbose: int, default = 0

Verbosity of the results. Accepts integer values from 0 to 2 and defaults to 0.

  • ‘0’ provides only the training model name and the prediction score.

  • ‘1’ provides the training model name and different prediction scores.

  • ‘2’ provides a json file with all of the above information.

header: bool, default = False

When set to False, the input header is ignored; otherwise it is removed. Setting it to False is generally better, as the model automatically removes the header when training.

Returns

Tuple of selected X and Y values as dictionary.
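
The selection strings accepted by x_features and y_labels mix index ranges, single indices and name ranges. How the library resolves them is not documented here, so the following is only a minimal sketch of one plausible interpretation. The helper name resolve_columns is hypothetical, and the sketch assumes that ranges are inclusive at both ends, that integers are column positions and that any other token is a header name.

```python
def resolve_columns(spec, header):
    """Resolve a selection string such as '[0:1, 3, x4:x5]' to column indices.

    Assumptions (not confirmed by the library): ranges are inclusive at both
    ends, integers are positions, and any other token is a header name.
    """
    def position(token):
        token = token.strip()
        # Integers are treated as positions, other tokens as column names.
        return int(token) if token.isdigit() else header.index(token)

    selected = []
    for part in spec.strip().strip("[]").split(","):
        if ":" in part:
            start, stop = (position(t) for t in part.split(":"))
            indices = range(start, stop + 1)
        else:
            indices = [position(part)]
        for i in indices:
            if i not in selected:  # keep the first occurrence, drop duplicates
                selected.append(i)
    return selected


header = ["x1", "x2", "x3", "x4", "x5", "y1"]
x_columns = resolve_columns("[0:1, 3, x4:x5]", header)
y_columns = resolve_columns("[y1]", header)
print(x_columns, y_columns)
```

A name that is missing from the header raises a ValueError in this sketch, which mirrors the requirement that names match the original dataset exactly.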

preprocess(normalize: bool = True, normalize_method: Optional[Union[list, str]] = 'standardizer', transformer: bool = False, transformer_method: str = 'onehot encoder', fix_imbalance: bool = False, imbalance_method: str = 'SMOTE', preprocess_arguments: Optional[dict] = None)

This function preprocesses the selected x features and y labels by applying the preprocessing steps chosen by the user. All parameters are optional. By default only normalization is enabled, using the ‘standardizer’ normalize method.

Example

from TwinAPI.SimuLearn.MLibrary import ClassificationML
model_classification = ClassificationML()
preprocess_x, preprocess_y = model_classification.preprocess()

normalize: bool, default = True

When set to True (the default), normalization is applied in the training pipeline.

normalize_method: Optional[Union[list, str]], default = “standardizer”

When normalize is True, allows for selecting from a set of preprocessing modules provided by sklearn. Accepted values are:

  • ‘binarizer’

  • ‘minmax’

  • ‘normalizer’

  • ‘standardizer’

  • ‘PCA’

  • ‘truncated SVD’

  • ‘select KBest’
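
To make the two most common choices concrete, here is a minimal sketch of the arithmetic behind standardization and min-max scaling on a single numeric column. It assumes that ‘standardizer’ and ‘minmax’ correspond to sklearn's StandardScaler and MinMaxScaler, which the source does not state explicitly. The function names are illustrative, and the sketch assumes a non-constant column.

```python
from statistics import fmean, pstdev


def standardize(values):
    """Scale to zero mean and unit variance (population standard deviation)."""
    mean, std = fmean(values), pstdev(values)
    return [(v - mean) / std for v in values]


def minmax(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    low, high = min(values), max(values)
    return [(v - low) / (high - low) for v in values]


temperatures = [20.0, 30.0, 40.0, 50.0]
print(standardize(temperatures))
print(minmax(temperatures))
```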

transformer: bool, default = False

When set to True, allows for selecting a transformer method. Used only when the data contains string or categorical values, in classification or regression.

transformer_method: str, default = “onehot encoder”

When transformer is True, allows for selecting from a set of preprocessing modules provided by sklearn. Accepted values are:

  • ‘label encoder’

  • ‘onehot encoder’
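
The difference between the two transformer methods can be sketched in plain Python. This is an illustration of the general techniques, not the library's implementation. The function names are hypothetical, and categories are ordered alphabetically, as sklearn's LabelEncoder does.

```python
def label_encode(values):
    """Map each category to an integer, using sorted category order."""
    classes = sorted(set(values))
    lookup = {c: i for i, c in enumerate(classes)}
    return [lookup[v] for v in values], classes


def onehot_encode(values):
    """Map each category to a 0/1 indicator vector, one column per category."""
    codes, classes = label_encode(values)
    rows = [[1 if code == col else 0 for col in range(len(classes))] for code in codes]
    return rows, classes


states = ["ON", "OFF", "ON", "IDLE"]
labels, classes = label_encode(states)
onehot, _ = onehot_encode(states)
print(classes)  # ['IDLE', 'OFF', 'ON']
print(labels)   # [2, 1, 2, 0]
print(onehot)   # [[0, 0, 1], [0, 1, 0], [0, 0, 1], [1, 0, 0]]
```

Label encoding imposes an artificial ordering on the categories, whereas one-hot encoding avoids it at the cost of one extra column per category.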

fix_imbalance: bool, default = False

When set to True, allows for selecting an imbalance method to correct an uneven distribution of classes in the dataset.

imbalance_method: str, default = “SMOTE”

When fix_imbalance is True, allows for selecting from a set of preprocessing modules provided by imblearn.

  • ‘SMOTE’

  • ‘random undersampling’
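
As an illustration of what random undersampling does, the sketch below drops samples from the larger classes until every class has as many samples as the smallest one. It is a simplified stand-in written in plain Python, not the imblearn implementation, and the function name and seed argument are assumptions. SMOTE works in the opposite direction by synthesizing new minority-class samples through interpolation between neighbours.

```python
import random
from collections import Counter, defaultdict


def random_undersample(rows, labels, seed=0):
    """Randomly drop samples from larger classes until all match the smallest."""
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)
    target = min(len(indices) for indices in by_class.values())
    rng = random.Random(seed)  # fixed seed so the selection is reproducible
    kept = sorted(i for indices in by_class.values() for i in rng.sample(indices, target))
    return [rows[i] for i in kept], [labels[i] for i in kept]


rows = [[0.1], [0.2], [0.3], [0.4], [0.5], [0.6]]
labels = ["ON", "ON", "ON", "ON", "OFF", "OFF"]
balanced_rows, balanced_labels = random_undersample(rows, labels)
print(Counter(balanced_labels))
```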

preprocess_arguments: Optional[dict], default = None

Allows users to pass parameters to the selected normalize, transformer or imbalance method, provided that the parameter is accepted by that method. Accepts only a dictionary with parameter names as keys and parameter values as values.

Returns

Tuple of preprocessed X features and Y labels.

train(user_model: Optional[str] = None, scoring_method: str = 'accuracy', split_mode: str = 'kfold split', test_size: float = 0.25, cross_validation_mode: str = 'kfold', n_iter_cv: int = 5, optimize: bool = False, optimizer_method: str = 'grid_search', n_jobs: int = 3, turbo_mode: bool = True, fit_arguments: Optional[dict] = None)

This function trains the model for a given set of parameters. For Classification and Regression, user_model is required while all other parameters are optional. A few parameters apply only to a specific ML case.

Example

from TwinAPI.SimuLearn.MLibrary import ClassificationML
model_classification = ClassificationML()
model_classification.train(user_model = 'Random Forest Classifier')

user_model: Optional[str], default = None

String ID of the estimator, based on the ML case. Irrelevant in case of Auto. Accepted values are:

  • ‘Random Forest Classifier’

  • ‘Decision Tree Classifier’

  • ‘SVM Classifier’

  • ‘ExtraTree Classifier’

  • ‘Linear Support Vector Classifier’

  • ‘Logistic Regressor’

  • ‘Stochastic Gradient Descent Classifier’

scoring_method: str, default = ‘accuracy’

Scoring method used to evaluate the predictions. Follows the sklearn scorer terminology. Accepted values are:

  • ‘accuracy’

  • ‘roc_auc’

  • ‘recall’

  • ‘precision’

  • ‘f1’

  • ‘balanced_accuracy’

  • ‘f1_weighted’
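
For a binary problem, the first few scoring methods can be computed by hand from the counts of true positives, false positives and false negatives. The following sketch uses the standard definitions. The function name and the choice of “ON” as the positive class are illustrative assumptions, and the sketch assumes at least one actual and one predicted positive.

```python
def binary_scores(y_true, y_pred, positive="ON"):
    """Compute accuracy, precision, recall and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return {
        "accuracy": sum(t == p for t, p in pairs) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": 2 * precision * recall / (precision + recall),
    }


y_true = ["ON", "ON", "ON", "OFF", "OFF", "OFF", "OFF", "ON"]
y_pred = ["ON", "ON", "OFF", "OFF", "OFF", "OFF", "OFF", "ON"]
scores = binary_scores(y_true, y_pred)
print(scores)
```

Here one “ON” sample is missed, so precision stays at 1.0 while recall drops to 0.75 and accuracy is 0.875.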

split_mode: str, default = ‘kfold split’

Method used to split the data into train and test sets for training the model. Accepted values are:

  • ‘test-train split’

  • ‘kfold split’

test_size: float, default = 0.25

Fraction of the data held out as the test set when using the test-train split. For example, a value of 0.25 gives train:test = 0.75:0.25.
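
A minimal sketch of such a split is shown below. It is not the library's code: the function is a simplified stand-in for sklearn's train_test_split, and the seed argument is an assumption added for reproducibility.

```python
import math
import random


def train_test_split(samples, test_size=0.25, seed=0):
    """Shuffle the samples and hold out a test_size fraction for testing."""
    indices = list(range(len(samples)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(test_size * len(samples))  # round the test set up
    test = [samples[i] for i in indices[:n_test]]
    train = [samples[i] for i in indices[n_test:]]
    return train, test


train, test = train_test_split(list(range(20)), test_size=0.25)
print(len(train), len(test))  # 15 5
```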

cross_validation_mode: str, default = ‘kfold’

Choice of cross validation strategy. Possible values are:

  • ‘kfold’

  • ‘stratified kfold’

  • ‘leave-one out’

  • ‘shuffle split’
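
To illustrate the default ‘kfold’ strategy, the sketch below generates train and test indices for consecutive folds, so that every sample is used for testing exactly once. It follows the usual k-fold convention and is not taken from the library. The function name is hypothetical, and whether n_iter_cv maps directly to the number of folds is an assumption.

```python
def kfold_indices(n_samples, n_splits=5):
    """Yield (train_indices, test_indices) for consecutive, unshuffled folds."""
    # The first n_samples % n_splits folds receive one extra sample.
    sizes = [
        n_samples // n_splits + (1 if i < n_samples % n_splits else 0)
        for i in range(n_splits)
    ]
    start = 0
    for size in sizes:
        test = list(range(start, start + size))
        train = [i for i in range(n_samples) if i < start or i >= start + size]
        yield train, test
        start += size


folds = list(kfold_indices(10, n_splits=5))
for train, test in folds:
    print("test:", test, "train size:", len(train))
```

A stratified k-fold additionally keeps the class proportions roughly equal across folds, which matters when the classes are imbalanced.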

n_iter_cv: int, default = 5

Number of iterations for cross-validation model selection. The higher the number, the longer the processing time.

optimize: bool, default = False

When set to True, optimization strategies are applied to the model.

optimizer_method: str, default = ‘grid_search’

When optimize is set to True, selects the hyperparameter optimization method used to train the model with the best set of parameters for the estimator.

Note

  • Optimization may not always result in best results.
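
The idea behind ‘grid_search’ can be sketched as an exhaustive loop over every combination of candidate parameter values, keeping the combination with the best score. The scoring function below is a hypothetical stand-in. In a real run each combination would be scored by training and cross-validating the estimator.

```python
from itertools import product


def grid_search(score, grid):
    """Evaluate every parameter combination and return the best with its score."""
    names = list(grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        current = score(**params)
        if current > best_score:
            best_params, best_score = params, current
    return best_params, best_score


# Hypothetical stand-in for a cross-validated score.
def toy_score(max_depth, n_estimators):
    return 1.0 - abs(max_depth - 4) * 0.1 - abs(n_estimators - 100) * 0.001


grid = {"max_depth": [2, 4, 8], "n_estimators": [50, 100, 200]}
best_params, best_score = grid_search(toy_score, grid)
print(best_params, best_score)
```

The number of evaluations is the product of the list lengths (nine here), which is why large grids combined with many cross-validation iterations can become slow.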

n_jobs: int, default = 3

The number of jobs to run in parallel (for functions that support parallel processing). -1 means using all processors. To run all functions on a single processor, the usual sklearn convention is n_jobs = 1.

turbo_mode: bool, default = True

When set to False, another iteration of optimization is carried out in order to avoid overfitting the model.

fit_arguments: dict, default = None

Allows users to pass parameters to the selected estimator, provided that the parameter is accepted by that estimator. Accepts only a dictionary with parameter names as keys and parameter values as values.

Returns

Tuple of Trained model ID, trained model function and test scores.

Warning

  • Setting turbo_mode to False may result in very high training times.

  • For multi-class classification, only the ‘accuracy’, ‘balanced_accuracy’ and ‘f1_weighted’ scoring methods are accepted. Any other value results in an error.
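
One likely reason for the multi-class restriction is that sklearn scorers such as ‘precision’, ‘recall’ and ‘f1’ default to binary averaging, which is undefined for more than two classes. The sketch below shows how the two multi-class variants are defined, using plain Python rather than the library's own code. The function names mirror the scorer names for readability only.

```python
from collections import Counter


def balanced_accuracy(y_true, y_pred):
    """Mean of the per-class recall values."""
    recalls = []
    for c, support in Counter(y_true).items():
        hits = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        recalls.append(hits / support)
    return sum(recalls) / len(recalls)


def f1_weighted(y_true, y_pred):
    """Per-class F1, averaged with weights proportional to class support."""
    total = 0.0
    for c, support in Counter(y_true).items():
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        predicted = sum(p == c for p in y_pred)
        # F1 = 2*TP / (actual positives + predicted positives); 0 when TP is 0.
        f1 = 2 * tp / (support + predicted) if tp else 0.0
        total += f1 * support
    return total / len(y_true)


y_true = ["ON", "ON", "OFF", "OFF", "IDLE", "IDLE"]
y_pred = ["ON", "ON", "OFF", "IDLE", "IDLE", "IDLE"]
print(balanced_accuracy(y_true, y_pred))
print(f1_weighted(y_true, y_pred))
```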

predict(load_model: bool = False, model_name: Optional[str] = None, prediction_set: Optional[Any] = None)

This function makes predictions for new data, using either the model trained in the current session or a previously saved model that is loaded by name.

Example

from TwinAPI.SimuLearn.MLibrary import ClassificationML
model_classification = ClassificationML()
prediction = model_classification.predict(load_model = True, model_name = 'trained_model', prediction_set = [1])

load_model: bool, default = False

When set to True, the function searches for a trained model provided by the user.

model_name: Optional[str], default = None

Name of the model to load. Used only when load_model is set to True.

prediction_set: Optional[Any], default = None

Prediction dataset: an integer or a list of values to be predicted.

Returns

List of predictions.

modelsave(model_name: Optional[Union[int, str]] = None)

This function saves the trained model.

Example

from TwinAPI.SimuLearn.MLibrary import ClassificationML
model_classification = ClassificationML()
model_classification.modelsave('trained_model')

model_name: Optional[Union[int, str]], default = None

Name for the trained model.

Returns

JSON file.