AIR documentation

Introduction

This document is aimed at authors, developers, and contributors of the AIR project.

The source code of AIR is located in its project repository on GitLab.

For more information, please visit our official website.

Source Code Documentation

This chapter describes all Python modules in the AIR source code.

Analysis

evaluate_balance.py

Script to evaluate the performance of using class weights.

evaluate_classification_cases.py

Script to make a baseline evaluation of classification cases using CV with different datasets and classifiers.

evaluate_gender_bias.py

Script to evaluate gender bias in the Complete case.

evaluate_preprocessing.py

Script to evaluate preprocessing strategies for cases.

class analysis.evaluate_preprocessing.BoxCoxNormalizer[source]
fit_transform(X, case=None)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

class analysis.evaluate_preprocessing.BoxCoxNormalizerNoGender[source]
fit_transform(X, case=None)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

class analysis.evaluate_preprocessing.DummyNormalizer[source]
fit_transform(X, case=None)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)

class analysis.evaluate_preprocessing.DummyScaler[source]
fit_transform(X)[source]

Fit to data, then transform it.

Fits transformer to X and y with optional parameters fit_params and returns a transformed version of X.

Parameters
  • X (array-like of shape (n_samples, n_features)) – Input samples.

  • y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations).

  • **fit_params (dict) – Additional fit parameters.

Returns

X_new – Transformed array.

Return type

ndarray array of shape (n_samples, n_features_new)
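
For orientation, here is a minimal, hypothetical sketch of the per-column Box-Cox normalization these transformer classes revolve around, using SciPy; the data and the column handling are placeholders, not the project's implementation:

   # Hypothetical sketch only: per-column Box-Cox normalization with SciPy,
   # mirroring what a BoxCoxNormalizer-style fit_transform produces.
   import numpy as np
   from scipy import stats

   rng = np.random.default_rng(0)
   X = rng.lognormal(mean=1.0, sigma=0.5, size=(100, 3))  # strictly positive

   X_new = np.empty_like(X)
   for j in range(X.shape[1]):
       # stats.boxcox fits lambda per column and returns the transformed data.
       X_new[:, j], _ = stats.boxcox(X[:, j])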

evaluate_survival_case.py

Script to make a baseline evaluation of the survival case using CV.

find_best_ats_resolution.py

Script to find the best ATS resolution for the Complete case.

find_best_features.py

Script to find the best features for cases using CV.

API

bearer.py

A class for JWT bearer token authentication.

main.py

The main FastAPI application module.

Data

load_and_clean_data.py

Script to load the raw data and then clean it.

make_survival_data.py

Script to make the dataset for the Alarm case.

make_dataset_count.py

Script to make a dataset for the Complete, Compliance, Fall and Risk cases where categorical features are encoded as an integer array whose values count the number of times a value appears in a column.

make_dataset_emb.py

Script to make a dataset for the Complete, Compliance, Fall, Risk and Alarm cases where categorical features are encoded as entity embeddings. Takes in a full dataset generated with “make_dataset_full.py” for the Complete, Compliance, Fall or Risk case, or “make_survival_data.py” for the Alarm case.

make_dataset_full.py

Script to make a dataset for the Complete, Compliance, Fall and Risk cases based on the screenings.

make_dataset_ohe.py

Script to make a dataset for the Complete, Compliance, Fall and Risk cases using one-hot encoding of categorical features.

make_dataset_ordinal.py

Script to make a dataset for the Complete, Compliance, Fall and Risk cases using ordinal encoding of categorical features.

make_screening_data.py

Script to make a dataset of screenings to be used for the cases.

Db

insert_data_into_db.py

Script used to insert test data from Aalborg into our Azure database.

Model

train_alarm_model.py

Script to train the model for the Alarm case.

train_complete_model.py

Script to train the model for the Complete case.

train_compliance_model.py

Script to train the model for the Compliance case.

Tools

classifiers.py

Module to store classifiers used for CV.

class tools.classifiers.BaseClassifer(X, y)[source]

Base class for classifiers.

evaluate(metrics: List, k: int) → Tuple[dict, numpy.ndarray][source]

This method performs cross-validation for k seeds on a given dataset X and y and outputs the results of N splits given a list of scoring metrics :param metrics: scoring metrics :param k: the number of seeds to use :return: the results from a stratified K-fold CV process

abstract make_model()[source]

This method is an abstract method to be implemented by a concrete classifier. Must return a sklearn-compatible estimator object implementing ‘fit’.
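
As a hedged sketch of how a concrete classifier can be derived from this base class, assuming BaseClassifer simply stores X and y and that evaluate() obtains its estimator from make_model(); DtClassifier is a hypothetical name, not part of the module:

   # Hypothetical subclass; only make_model() must be supplied.
   from sklearn.tree import DecisionTreeClassifier
   from tools.classifiers import BaseClassifer

   class DtClassifier(BaseClassifer):
       """Decision tree classifier (illustrative only)."""

       def make_model(self):
           # Must return a sklearn-compatible estimator implementing 'fit'.
           return DecisionTreeClassifier(random_state=0)

   # results, preds = DtClassifier(X, y).evaluate(metrics=['accuracy'], k=10)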

class tools.classifiers.KnnClassifier(X, y)[source]

KNN classifier.

make_model()[source]

This method is an abstract method to be implemented by a concrete classifier. Must return a sklearn-compatible estimator object implementing ‘fit’.

class tools.classifiers.LrClassifier(X, y)[source]

Logistic regression classifier.

make_model()[source]

This method is an abstract method to be implemented by a concrete classifier. Must return a sklearn-compatible estimator object implementing ‘fit’.

class tools.classifiers.MlpClassifier(X, y)[source]

Multi-layer Perceptron classifier.

make_model()[source]

This method is an abstract method to be implemented by a concrete classifier. Must return a sklearn-compatible estimator object implementing ‘fit’.

class tools.classifiers.RfClassifier(X, y)[source]

Random Forest classifier.

make_model()[source]

This method is an abstract method to be implemented by a concrete classifier. Must return a sklearn-compatible estimator object implementing ‘fit’.

class tools.classifiers.SvmClassifier(X, y)[source]

Support vector machine classifier.

make_model()[source]

This method is an abstract method to be implemented by a concrete classifier. Must return a sklearn-compatible estimator object implementing ‘fit’.

class tools.classifiers.XgbClassifier(X, y)[source]

XGBoost classifier.

make_model()[source]

This method is an abstract method to be implemented by a concrete classifier. Must return a sklearn-compatible estimator object implementing ‘fit’.

cleaner.py

Module to clean raw data.

class tools.cleaner.BaseCleaner[source]

Base class for cleaners.

abstract clean_assistive_aids(ats, iso_classes)[source]

Cleans the assistive aids data set.

abstract clean_patient_data(patient_data)[source]

Cleans the patient data set.

abstract clean_screening_content(screening_content, patient_data)[source]

Cleans the screening content data set.

abstract clean_status_set(status_set, patient_data)[source]

Cleans the status set data set.

abstract clean_training_cancelled(training_cancelled, patient_data)[source]

Cleans the training cancelled data set.

abstract clean_training_done(training_done, patient_data)[source]

Cleans the training done data set.

class tools.cleaner.Cleaner2021[source]

Cleaner for the 2021 dataset.

clean_assistive_aids(df: pandas.core.frame.DataFrame, iso_classes: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Cleans the assistive aids data set.

clean_patient_data(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Cleans the patient data set.

clean_screening_content(df: pandas.core.frame.DataFrame, patient_data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Cleans the screening content data set.

clean_status_set(df: pandas.core.frame.DataFrame, patient_data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Cleans the status set data set.

clean_training_cancelled(df: pandas.core.frame.DataFrame, patient_data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Cleans the training cancelled data set.

clean_training_done(df: pandas.core.frame.DataFrame, patient_data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

Cleans the training done data set.

data_loader.py

Module to load data for cases.

class tools.data_loader.AlarmDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]

Data loader for Alarm case.

load_data()[source]

Loads the data from a data set at startup

class tools.data_loader.BaseDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]

Base class for data loaders.

get_data() → Tuple[pandas.core.frame.DataFrame, numpy.ndarray][source]

This method returns the features and targets :return: X and y

get_features() → List[str][source]

This method returns the feature names :return: the columns of X as a list

abstract load_data() → None[source]

Loads the data from a data set at startup

prepare_data() → Tuple[numpy.ndarray, numpy.ndarray][source]

This method prepares data by normalizing and scaling it. :return: prepared X and y

prepare_data_split(test_size: float) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray][source]

This method prepares and splits the data from a data set :param test_size: the size of the test set :return: a split train and test dataset
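
A minimal usage sketch, assuming a processed CSV produced by the data scripts and a settings dict from the project's config; the file names and the split order (taken to follow sklearn's train_test_split convention) are assumptions:

   from pathlib import Path
   from tools.data_loader import CompleteDataLoader

   # Hypothetical path, file name and settings; substitute the real config.
   loader = CompleteDataLoader(Path("data/processed"), "complete.csv",
                               settings={})
   loader.load_data()
   X_train, X_test, y_train, y_test = loader.prepare_data_split(test_size=0.3)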

class tools.data_loader.CompleteDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]

Data loader for Complete case.

load_data()[source]

Loads the data from a data set at startup

class tools.data_loader.ComplianceDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]

Data loader for Compliance case.

load_data()[source]

Loads the data from a data set at startup

class tools.data_loader.FallDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]

Data loader for Fall case.

load_data()[source]

Loads the data from a data set at startup

class tools.data_loader.RiskDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]

Data loader for Risk case.

load_data()[source]

Loads the data from a data set at startup

file_reader.py

File reader module to read files.

tools.file_reader.read_array(infile: _io.BytesIO) → numpy.ndarray[source]

This method reads a NumPy array file as a pickle :param infile: binary input stream :return: the NumPy array object

tools.file_reader.read_csv(infile: _io.StringIO, header: str = 'infer', sep: str = ',', usecols: Optional[List[int]] = None, names: Optional[List[str]] = None, converters: Optional[dict] = None, encoding=None, skiprows=None) → pandas.core.frame.DataFrame[source]

This method reads a csv file using Pandas read_csv() method :param infile: text input stream :param header: file header :param sep: separator identifier :param names: list of column names to use :param converters: dict of converters to use :return: the csv file
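
Since the reader wraps pandas' read_csv() over a text stream, a small self-contained sketch (with illustrative column names) looks like this:

   import io
   from tools import file_reader

   csv_stream = io.StringIO("CitizenId,Score\n1,0.5\n2,0.7\n")
   df = file_reader.read_csv(csv_stream, sep=",")  # -> 2x2 DataFrame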

tools.file_reader.read_embedding(infile: _io.StringIO) → dict[source]

This method reads an embedding file :param infile: text input stream :return: the embedding as a dict

tools.file_reader.read_excelfile(infile: _io.BytesIO, converters: Optional[dict] = None) → pandas.core.frame.DataFrame[source]

This method reads an excel file :param infile: binary input stream :param converters: dict of converters to use :return: the excel file as a dataframe

tools.file_reader.read_excelfile_sheets(infile: _io.BytesIO, n_sheets: int, converters: Optional[dict] = None) → pandas.core.frame.DataFrame[source]

This method reads sheets from an excel file :param infile: binary input stream :param n_sheets: number of sheets to read :param converters: dict of converters to use :return: the full excel file as a dataframe

tools.file_reader.read_joblib(infile: _io.BytesIO) → any[source]

This method reads a joblib file :param infile: binary input stream :return: the joblib file

tools.file_reader.read_pickle(infile: _io.BytesIO) → any[source]

This method reads any file stored as a pickle :param infile: binary input stream :return: the file object

file_writer.py

File writer module to write files.

tools.file_writer.write_array(data: numpy.ndarray, outfile: _io.BytesIO) → None[source]

This method writes a NumPy array. :param data: data to write :param outfile: binary output stream :return: None

tools.file_writer.write_csv(df: pandas.core.frame.DataFrame, outfile: _io.StringIO, date_format: str = '%d-%m-%Y', index: bool = False) → None[source]

This method writes a csv file using Pandas to_csv() method. :param df: dataframe to write :param outfile: text output stream :param date_format: date format to use :param index: write row names (index) :return: None

tools.file_writer.write_cv_plot(means: List, stds: List, metric: str, num_iter: int, clf_names: List, title: str, subtitle: str, outfile: _io.BytesIO)[source]

This method writes a plot of the result from a CV process. :param means: the mean values obtained :param stds: the standard deviations obtained :param metric: the metric used :param num_iter: the number of iterations :param clf_names: names of classifiers used :param title: plot title :param subtitle: plot subtitle :param outfile: binary output stream :return: None

tools.file_writer.write_embedding(mapping: dict, outfile: _io.StringIO) → None[source]

This method writes an embedding mapping as a csv file. :param mapping: mapping dict :param outfile: text output stream :return: None

tools.file_writer.write_joblib(data: any, outfile: _io.BytesIO) → None[source]

This method writes a joblib file. :param data: data to write :param outfile: binary output stream :return: None

tools.file_writer.write_pickle(data: any, outfile: _io.BytesIO) → None[source]

This method writes a pickle file. :param data: data to write :param outfile: binary output stream :return: None
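
The writer functions pair with their file_reader counterparts; a round-trip sketch over an in-memory binary stream (the payload is illustrative):

   import io
   from tools import file_reader, file_writer

   buf = io.BytesIO()
   file_writer.write_pickle({"case": "Complete", "seed": 0}, buf)
   buf.seek(0)
   obj = file_reader.read_pickle(buf)  # -> {'case': 'Complete', 'seed': 0}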

tools.file_writer.write_shap_importance_plot(features: List[str], importances: List[float], title: str, outfile: _io.BytesIO)[source]

This method writes a SHAP importance plot. :param features: feature names :param importances: feature importances :param title: plot title :param outfile: binary output stream :return: None

inputter.py

Module to create features based on screenings.

tools.inputter.convert_date_to_datetime(date: pandas._libs.tslibs.timestamps.Timestamp, date_format: str) → pandas._libs.tslibs.timestamps.Timestamp[source]

This method converts a date to a datetime :param date: date to convert :param date_format: date format to use :return: the converted date

tools.inputter.get_ats(df: pandas.core.frame.DataFrame, end_date: pandas._libs.tslibs.timestamps.Timestamp, settings: dict) → str[source]

This method extracts a citizen’s ats from a screening :param df: a dataframe containing ats :param end_date: the screening end date :param settings: the settings to use :return: the citizen’s ats

tools.inputter.get_avg_loan_period(df: pandas.core.frame.DataFrame, end_date: pandas._libs.tslibs.timestamps.Timestamp) → int[source]

This method extracts the average ats loan period :param df: a dataframe containing ats :param end_date: the screening end date :return: the average loan period

tools.inputter.get_birth_year(sc: pandas.core.frame.DataFrame) → int[source]

This method extracts a citizen’s birth year :param sc: a tuple with a screening :return: the birth year

tools.inputter.get_cancels_week(tcw: pandas.core.frame.DataFrame) → pandas.core.series.Series[source]

This method extracts the cancellations per week :param tcw: windowed training cancellations :return: cancellations per week

tools.inputter.get_citizen_data(data: utility.data.Data, citizen_id: str) → utility.data.Data[source]

This method extracts all screening data we have on a citizen :param data: a data DTO consisting of screening data of all citizens :param citizen_id: id of the citizen :return: a data DTO with only the citizen’s data in it

tools.inputter.get_exercise_content(sc: Tuple) → str[source]

This method extracts the exercise content from a screening :param sc: a tuple with a screening :return: the exercise content

tools.inputter.get_gender(sc: pandas.core.frame.DataFrame) → int[source]

This method extracts a citizen’s gender :param sc: a tuple with a screening :return: the gender

tools.inputter.get_interval_length(start_date: pandas._libs.tslibs.timestamps.Timestamp, end_date: pandas._libs.tslibs.timestamps.Timestamp) → float[source]

This method extracts the interval length between the start and end date of a screening :param start_date: the start date :param end_date: the end date :return: the length of the interval

tools.inputter.get_max_evaluation(tdw: pandas.core.frame.DataFrame) → int[source]

This method extracts the largest evaluation score a citizen has gotten :param tdw: the windowed training data :return: citizen’s largest evaluation score

tools.inputter.get_mean_cancels_week(n_cancel: int, n_weeks: float) → int[source]

This method extracts the mean number of cancellations per week :param n_cancel: number of cancellations :param n_weeks: number of screening weeks :return: citizen’s mean number of cancellations per week

tools.inputter.get_mean_evaluation(tdw: pandas.core.frame.DataFrame) → float[source]

This method extracts the mean of a citizen’s evaluation score :param tdw: the windowed training data :return: citizen’s mean evaluation score

tools.inputter.get_mean_time_between_cancels(tcw: pandas.core.frame.DataFrame, n_decimals=2) → float[source]

This method extracts the mean time between cancellations :param tcw: the windowed cancellation data :param n_decimals: number of decimals for rounding :return: citizen’s mean time between cancellations

tools.inputter.get_min_evaluation(tdw: pandas.core.frame.DataFrame) → int[source]

This method extracts the smallest evaluation score a citizen has gotten :param tdw: the windowed training data :return: citizen’s smallest evaluation score

tools.inputter.get_n_cancel_week_min(cancelsprweek: pandas.core.series.Series) → int[source]

This method extracts a citizen’s smallest number of cancellations per week :param cancelsprweek: a series with cancellations per week :return: the smallest number of cancellations per week

tools.inputter.get_n_training_week(n_weeks: float, n_training_window: int) → int[source]

This method extracts the number of completed trainings per week :param n_weeks: number of screening weeks :param n_training_window: length of training window :return: number of trainings per week

tools.inputter.get_n_training_week_max(training_pr_week: pandas.core.series.Series) → int[source]

This method extracts the largest number of trainings a citizen has done per week :param training_pr_week: trainings per week :return: largest number of trainings done per week

tools.inputter.get_n_training_week_min(training_pr_week: pandas.core.series.Series, n_weeks_with_trainings: int, n_weeks: float) → int[source]

This method extracts the smallest number of trainings a citizen has done per week :param training_pr_week: trainings per week :param n_weeks_with_trainings: number of weeks with training :param n_weeks: number of screening weeks :return: smallest number of trainings done per week

tools.inputter.get_n_training_window(tdw: pandas.core.frame.DataFrame) → int[source]

This method extracts the number of training windows :param tdw: the windowed training data :return: number of training windows

tools.inputter.get_n_weeks_with_training(tdw: pandas.core.frame.DataFrame, start_date: pandas._libs.tslibs.timestamps.Timestamp) → int[source]

This method extracts a citizen’s number of weeks with training :param tdw: the windowed training data :param start_date: the start date :return: number of weeks with training

tools.inputter.get_n_weeks_without_training(n_weeks: float, n_weeks_with_trainings: int) → int[source]

This method extracts a citizen’s number of weeks without training :param n_weeks: number of screening weeks :param n_weeks_with_trainings: number of training weeks :return: number of weeks without training

tools.inputter.get_needs(sc: Tuple) → int[source]

This method extracts a screening’s need for help score :param sc: a tuple with a screening :return: the need for help score

tools.inputter.get_needs_reason(sc: Tuple) → str[source]

This method extracts a screening’s need for help reason :param sc: a tuple with a screening :return: the need for help reason

tools.inputter.get_number_ats(df: pandas.core.frame.DataFrame, end_date: pandas._libs.tslibs.timestamps.Timestamp) → int[source]

This method extracts the number of ats a citizen has :param df: a dataframe containing ats :param end_date: the screening end date :return: the number of ats

tools.inputter.get_number_exercises(sc: Tuple) → int[source]

This method extracts the number of exercises in a program :param sc: a tuple with a screening :return: the number of exercises

tools.inputter.get_physics(sc: Tuple) → int[source]

This method extracts a screening’s physics score :param sc: a tuple with a screening :return: the physical strength score

tools.inputter.get_physics_reason(sc: Tuple) → str[source]

This method extracts a screening’s physics reason :param sc: a tuple with a screening :return: the physical strength reason

tools.inputter.get_screening_data(td: pandas.core.frame.DataFrame, tc: pandas.core.frame.DataFrame, ss: pandas.core.frame.DataFrame, start_date: pandas._libs.tslibs.timestamps.Timestamp, end_date: pandas._libs.tslibs.timestamps.Timestamp) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]

This method extracts screening data by a start and end date :param td: training data :param tc: training cancellations :param ss: status data :param start_date: start date of the screening :param end_date: end date of the screening :return: tuple with the windowed training data, training cancellations and status data

tools.inputter.get_std_evaluation(tdw: pandas.core.frame.DataFrame) → float[source]

This method extracts the standard deviation of a citizen’s evaluation score :param tdw: the windowed training data :return: standard deviation of citizen’s evaluation score

tools.inputter.get_time_between_training_mean(tdw: pandas.core.frame.DataFrame, n_decimals: int = 2) → float[source]

This method extracts the mean time between trainings :param tdw: the windowed training data :param n_decimals: number of decimals for rounding :return: citizen’s mean time between training

tools.inputter.get_training_week(tdw: pandas.core.frame.DataFrame, start_date: pandas._libs.tslibs.timestamps.Timestamp) → pandas.core.series.Series[source]

This method extracts the training a citizen has done per week :param tdw: the windowed training data :param start_date: the start date :return: citizen’s training per week

labeler.py

Labeler module to make class labels for cases.

tools.labeler.accumulate_screenings(df: pandas.core.frame.DataFrame, settings: dict) → pandas.core.frame.DataFrame[source]

This method accumulates screenings and annotates when a citizen completes, reaches compliance or falls during a program :param df: a dataframe containing screenings :param settings: settings to use :return: dataframe with accumulated screenings

tools.labeler.annotate_falls(row, digi_db, risk_period_months)[source]

Utility method to annotate rows in a dataframe if their citizen id appears in a fall data set for a specific time period.

tools.labeler.do_citizens_complete_sessions(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

This method evaluates if citizens in a dataframe have completed the sessions they’ve started :param df: a dataframe containing screenings :return: dataframe with a label for completed

tools.labeler.do_citizens_fall(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

This method evaluates if citizens in a dataframe have fallen during the sessions they’ve started :param df: a dataframe containing screenings :return: dataframe with a label for fall

tools.labeler.do_citizens_reach_compliance(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

This method evaluates if citizens in a dataframe reach compliance during the sessions they’ve started :param df: a dataframe containing screenings :return: dataframe with a label for compliance

tools.labeler.make_complete_label(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

This method takes accumulated screenings and annotates citizens who have either completed or not completed their program based on the first screening :param df: a dataframe containing screenings :return: annotated dataframe

tools.labeler.make_compliance_label(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

This method takes accumulated screenings and annotates citizens who have either reached or not reached compliance in their program based on the first screening and assuming they completed :param df: a dataframe containing screenings :return: annotated dataframe

tools.labeler.make_fall_label(df: pandas.core.frame.DataFrame)[source]

This method takes accumulated screenings and annotates citizens who have either fallen or not fallen in their program based on the first screening :param df: a dataframe containing screenings :return: annotated dataframe

tools.labeler.make_risk_label(df: pandas.core.frame.DataFrame, risk_period_months: int)[source]

This method takes accumulated screenings and annotates citizens who have either fallen or not fallen in a risk period. :param df: a dataframe containing screenings :param risk_period_months: the length of the risk period in months :return: annotated dataframe

neural_embedder.py

Module to turn categorical features into embeddings.

class tools.neural_embedder.NetworkCategory(alias: str, unique_values: int)[source]

Used to store fields related to a given category, such as its name, count of unique values and the size of each embedding layer.

get_embedding_size(unique_values: int) → int[source]

Return the embedding size to be used on the Embedding layer. :param unique_values: the number of unique values in the given category :return: the size to be used on the embedding layer

class tools.neural_embedder.NeuralEmbedder(df: pandas.core.frame.DataFrame, target_name: str, metrics: List[str], train_ratio: float = 0.8, network_layers: List[int] = (32, 32), dropout_rate: float = 0, activation_fn: str = 'relu', kernel_initializer: str = 'glorot_uniform', regularization_factor: float = 0, loss_fn: str = 'binary_crossentropy', optimizer_fn: str = 'Adam', epochs: int = 10, batch_size: int = 32, verbose: bool = False, model_path: str = 'models')[source]

A neural embedder that can learn entity embeddings from categorical features.

fit(X_train: numpy.ndarray, y_train: numpy.ndarray, X_valid: numpy.ndarray, y_valid: numpy.ndarray, callbacks=None, class_weight: Optional[dict] = None) → tensorflow.python.keras.callbacks.History[source]

This method is used to fit given training and validation data into our entity embeddings model :param X_train: training features :param y_train: training targets :param X_valid: validation features :param y_valid: validation targets :param callbacks: any desired callbacks :param class_weight: any desired class weight :return: a History object

get_embedded_weights() → List[source]

This method extracts the weights of the embedded layers :return: a List with embedded weights

get_labels_path() → pathlib.Path[source]

Used to return the path of the stored labels :return: the path of the stored labels on disk

get_scaler_path() → pathlib.Path[source]

Used to return the path of the stored scaler :return: the path of the stored scaler on disk

get_visualizations_dir() → pathlib.Path[source]

Used to return the path of the stored visualizations :return: the path of the stored visualizations on disk

get_weights_path() → pathlib.Path[source]

Used to return the path of the stored weights :return: the path of the stored weights on disk

make_visualizations_from_network(extension: str = 'pdf') → List[matplotlib.figure.Figure][source]

This method makes visualizations of the embedded weights for each categorical variable :param extension: extension to use :return: a List with Figure objects

save_labels(labels: List) → None[source]

This method saves a list of labels :param labels: labels to save :return: None

save_model() → None[source]

This method saves the current model :return: None

save_weights(weights: List) → None[source]

This method saves a list of weights :param weights: weights to save :return: None
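
Taken together, a hedged sketch of the embedder's training cycle, assuming the inputs come from tools.preprocessor.prepare_data_for_emb, that the split order follows the train/test convention, and that fit() expects the per-feature list layout produced by transpose_to_list; the dataframe columns and target name are placeholders:

   import pandas as pd
   from tools.neural_embedder import NeuralEmbedder
   from tools.preprocessor import prepare_data_for_emb, transpose_to_list

   # Illustrative dataframe; real inputs come from the data scripts.
   df = pd.DataFrame({"Gender": ["F", "M", "F", "M"] * 25,
                      "Ats": ["cane", "walker", "bed", "cane"] * 25,
                      "Complete": [0, 1, 1, 0] * 25})

   X_train, X_test, y_train, y_test, encoders = prepare_data_for_emb(
       df, target_name="Complete", train_ratio=0.8)

   embedder = NeuralEmbedder(df=df, target_name="Complete",
                             metrics=["accuracy"], epochs=5)
   history = embedder.fit(transpose_to_list(X_train), y_train,
                          transpose_to_list(X_test), y_test)
   embedder.save_weights(embedder.get_embedded_weights())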

preprocessor.py

Preprocessor to prepare data for models.

tools.preprocessor.encode_vector_label(data: List[numpy.ndarray], n_num_cols: Optional[int] = None) → Tuple[List[numpy.ndarray], List[sklearn.preprocessing._label.LabelEncoder]][source]

This method label-encodes categorical data :param data: a List of data NumPy arrays :param n_num_cols: number of numerical columns to skip :return: a Tuple of Lists with encoded data and encoders

tools.preprocessor.extract_cat_count(df: pandas.core.frame.DataFrame, cat_values: List[str], cat_feature_names: List[str], prefix: str) → pandas.core.frame.DataFrame[source]

This method extracts the number of times a categorical value appears in a column :param df: dataframe containing the data :param cat_values: a List of unique categorical values :param cat_feature_names: a List of names of categorical features :param prefix: the prefix for new column names :return: a dataframe with categorical data represented as a count

tools.preprocessor.get_X_y(df: pandas.core.frame.DataFrame, name_target: str) → Tuple[List, List][source]

This method is used to gather the X (features) and y (targets) from a given dataframe based on a given target name :param df: the dataframe to be used as source :param name_target: the name of the target variable :return: the list of features and targets

tools.preprocessor.get_ats_list(ats: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

This method groups ats by CitizenId and returns the result as a single column dataframe :param ats: the dataframe with associated CitizenId and DevISOClass column :return: the dataframe with the grouped ats

tools.preprocessor.get_class_weight(neg: int, pos: int) → dict[source]

This method computes the class weight for a classification problem given the number of negative and positive labels :param neg: number of negative labels :param pos: number of positive labels :return: the class weight as a dictionary
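
The exact formula lives in the source; a common inverse-frequency scheme (assumed here, not confirmed as AIR's) looks like this:

   # Assumed weighting; the project's actual formula may differ.
   def class_weight_sketch(neg: int, pos: int) -> dict:
       total = neg + pos
       return {0: total / (2.0 * neg), 1: total / (2.0 * pos)}

   class_weight_sketch(900, 100)  # -> {0: 0.5555..., 1: 5.0}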

tools.preprocessor.normalize_data(df: pandas.core.frame.DataFrame, feature_names: List[str]) → pandas.core.frame.DataFrame[source]

This method normalizes data in dataframe :param df: dataframe containing the data :param feature_names: names of features to normalize :return: a dataframe with normalized features

tools.preprocessor.one_hot_encode(df: pandas.core.frame.DataFrame, feature_names: List[str]) → pandas.core.frame.DataFrame[source]

This method one-hot-encodes data in a dataframe :param df: dataframe containing the data :param feature_names: names of features to one-hot-encode :return: a dataframe with one-hot-encoded features
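
A short usage sketch with an illustrative dataframe (the column names are placeholders):

   import pandas as pd
   from tools.preprocessor import one_hot_encode

   df = pd.DataFrame({"Gender": ["F", "M", "F"], "NumberAts": [2, 5, 1]})
   df_ohe = one_hot_encode(df, feature_names=["Gender"])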

tools.preprocessor.prepare_data_for_emb(df: pandas.core.frame.DataFrame, target_name: str, train_ratio: float, n_num_cols: Optional[int] = None) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, List[sklearn.preprocessing._label.LabelEncoder]][source]

This method prepares data in a dataframe for a neural embedder by extracting the X and y variables, label-encoding the categorical variables and splitting the data into a train and test set given a split ratio. Finally it returns the split and the encoded labels :param df: dataframe containing the X and y values :param target_name: name of target label :param train_ratio: the split ratio :param n_num_cols: number of numerical columns in dataframe :return: the split and the encoded labels

tools.preprocessor.replace_cat_values(df: pandas.core.frame.DataFrame, mapping: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]

This method replaces categorical values in a dataframe by their real-world counterpart given a mapping :param df: a dataframe with the values to replace :param mapping: the mapping to use :return: a dataframe where categorical values have been replaced

tools.preprocessor.sample(X: numpy.ndarray, y: numpy.ndarray, n: int) → Tuple[numpy.ndarray, numpy.ndarray][source]

This method is used to sample n random rows between [0, X.shape[0]] :param X: the X array to sample from :param y: the y array to sample from :param n: the number of samples :return: the tuple containing a subset of samples in X and y

tools.preprocessor.scale_data(df: pandas.core.frame.DataFrame, feature_names: List[str]) → pandas.core.frame.DataFrame[source]

This method scales data in dataframe :param df: dataframe containing the data :param feature_names: names of features to scale :return: a dataframe with scaled features

tools.preprocessor.series_to_list(series: pandas.core.series.Series) → List[source]

This method is used to convert a given pd.Series object into a list :param series: the Series to be converted :return: the list containing all the elements from the Series object

tools.preprocessor.split_cat_columns(df: pandas.core.frame.DataFrame, col_to_split: str, tag: str, resolution: int) → pandas.core.frame.DataFrame[source]

This method splits a categorical column by a resolution :param df: dataframe containing the data :param col_to_split: name of column to split :param tag: name of tag for new columns :param resolution: the resolution to use :return: a dataframe with split columns

tools.preprocessor.transpose_to_list(X: numpy.ndarray) → List[numpy.ndarray][source]

This method is used to convert the ndarray X (features) to a list of features :param X: the ndarray to be used as source :return: a list of ndarrays containing the elements from the numpy array

raw_loader.py

Module to load raw data for cases.

class tools.raw_loader.BaseRawLoader2021[source]

Base class for raw loaders 2021.

abstract load_assistive_aids(aalborg_file_name, viborg_file_name, viborg_alarm_file_name, file_path)[source]

Load the DiGiRehab Assistive Aids data set.

abstract load_iso_classes(file_name, file_path)[source]

Load the ISO classes data set.

abstract load_screening_content(aalborg_file_name, viborg_file_name, file_path)[source]

Load the DiGiRehab Screening Content data set.

abstract load_status_set(aalborg_file_name, viborg_file_name, file_path)[source]

Load the DiGiRehab StatusSet data set.

abstract load_training_cancelled(aalborg_file_name, viborg_file_name, file_path)[source]

Load the DiGiRehab Training Cancelled data set.

abstract load_training_done(aalborg_file_name, viborg_file_name, file_path)[source]

Load the DiGiRehab Training Done data set.

class tools.raw_loader.RawLoader2021[source]

Raw loader for the 2021 dataset.

load_assistive_aids(aalborg_file_name: str, viborg_file_name: str, viborg_alarm_file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]

This method loads assistive aids data :param aalborg_file_name: name of the Aalborg file :param viborg_file_name: name of the Viborg file :param viborg_alarm_file_name: name of the Viborg alarm file :param file_path: path of the files :return: dataframe with loaded data

load_iso_classes(file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]

This method loads ISO classes :param file_name: name of file :param file_path: path of file :return: dataframe with loaded data

load_screening_content(aalborg_file_name: str, viborg_file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]

This method loads screening content data :param aalborg_file_name: name of the Aalborg file :param viborg_file_name: name of the Viborg file :param file_path: path of the files :return: dataframe with loaded data

load_status_set(aalborg_file_name: str, viborg_file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]

This method loads status set data :param aalborg_file_name: name of the Aalborg file :param viborg_file_name: name of the Viborg file :param file_path: path of the files :return: dataframe with loaded data

load_training_cancelled(aalborg_file_name: str, viborg_file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]

This method loads training cancellation data :param aalborg_file_name: name of the Aalborg file :param viborg_file_name: name of the Viborg file :param file_path: path of the files :return: dataframe with loaded data

load_training_done(aalborg_file_name: str, viborg_file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]

This method loads training completed data :param aalborg_file_name: name of the Aalborg file :param viborg_file_name: name of the Viborg file :param file_path: path of the files :return: dataframe with loaded data

Tuning

tune_alarm_boost_wb.py

Gradient boosting tune script for the Alarm case on WandB.

tune_alarm_rsf_wb.py

Random Survival Forest tune script for the Alarm case on WandB.

tune_complete_rf_wb.py

Random Forest tune script for the Complete case on WandB.

tune_complete_xgb_wb.py

XGBoost tune script for the Complete case on WandB.

tune_compliance_rf_wb.py

Random Forest tune script for the Compliance case on WandB.

tune_compliance_xgb_wb.py

XGBoost tune script for the Compliance case on WandB.

Utility

config.py

Utility config functions.

utility.config.load_config(file_path: pathlib.Path, file_name: str) → dict[source]

Loads a YAML config file :param file_path: file path to use :param file_name: file name to use :return: the config as a dict
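
A minimal sketch, assuming the settings live in a YAML file; the directory, file name and key are placeholders:

   from pathlib import Path
   from utility.config import load_config

   settings = load_config(Path("configs"), "settings.yaml")
   risk_period = settings.get("risk_period_months")  # hypothetical key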

data.py

Utility data functions.

class utility.data.Data(sc: pandas.core.frame.DataFrame, ss: pandas.core.frame.DataFrame, td: pandas.core.frame.DataFrame, tc: pandas.core.frame.DataFrame, ats: pandas.core.frame.DataFrame)[source]

DTO to keep track of feature data.

utility.data.read_csv(file_path: pathlib.Path, file_name: str, converters: Optional[dict] = None) → pandas.core.frame.DataFrame[source]

Reads a CSV from a file :param file_path: path of the file :param file_name: name of the file :param converters: converters to use :return: the loaded dataframe

utility.data.read_pickle(file_path: pathlib.Path, file_name: str) → any[source]

Reads a pickle from a file :param file_path: path of the file :param file_name: name of the file

utility.data.write_csv(data: pandas.core.frame.DataFrame, file_path: pathlib.Path, file_name: str) → None[source]

Writes a CSV to a file :param data: data to be written :param file_path: path of the file :param file_name: name of the file

utility.data.write_embedding(mapping: dict, file_path: pathlib.Path, file_name: str) → None[source]

Writes an embedding to a CSV file :param mapping: the embedded mapping :param file_path: path of the file :param file_name: name of the file

utility.data.write_pickle(data: any, file_path: pathlib.Path, file_name: str) → None[source]

Writes a pickle to a file :param data: data to be written :param file_path: path of the file :param file_name: name of the file

embedder.py

Utility embedder functions.

utility.embedder.check_batch_size(batch_size: int) → None[source]

Checks batch size is greater than zero :param batch_size: batch size to check

utility.embedder.check_epochs(epochs: int) → None[source]

Checks number of epochs is greater than zero :param epochs: number of epochs to check

utility.embedder.check_not_empty_dataframe(df: pandas.core.frame.DataFrame) → None[source]

Checks if a dataframe is empty :param df: dataframe to check

utility.embedder.check_target_existent_in_df(target_name: str, df: pandas.core.frame.DataFrame) → None[source]

Checks if a target name exists in a dataframe :param target_name: target name to check :param df: dataframe to check

utility.embedder.check_target_name(target_name: str) → None[source]

Checks if a target name is set :param target_name: target name to check

utility.embedder.check_train_ratio(train_ratio: float) → None[source]

Checks a train ratio is between zero and one :param train_ratio: train ratio to check

utility.embedder.encode_dataframe(df: pandas.core.frame.DataFrame, target_name: str, metrics: List[str], batch_size: int, train_ratio: float, epochs: int, optimizer: str, network_layers: List[int], verbose: bool, model_path: str, enable_emb_viz: bool) → pandas.core.frame.DataFrame[source]

Encodes the categorical features of a dataframe as entity embeddings :param df: dataframe to encode :param target_name: the label name :param metrics: a list of metrics to use :param batch_size: batch size to use :param train_ratio: the train/test split ratio :param epochs: number of epochs :param optimizer: optimizer to use :param network_layers: a list with sizes of network layers, e.g. (32, 32) :param verbose: verbose execution flag :param model_path: where to store the model :param enable_emb_viz: make viz flag

utility.embedder.get_X_y(df: pandas.core.frame.DataFrame, name_target: str) → Tuple[List, List][source]

This method is used to gather the X (features) and y (targets) from a given dataframe based on a given target name :param df: the dataframe to be used as source :param name_target: the name of the target variable :return: the list of features and targets

utility.embedder.get_all_columns_except(df: pandas.core.frame.DataFrame, columns_to_skip: List[str]) → pandas.core.frame.DataFrame[source]

Used to get all columns in a dataframe except columns to skip :param df: dataframe to select columns from :param columns_to_skip: list of columns to skip :return: a dataframe with selected columns

utility.embedder.get_embedding_size(unique_values: int) → int[source]

Return the embedding size to be used on the Embedding layer :param unique_values: the number of unique values in the given category :return: the size to be used on the embedding layer
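
The heuristic itself is defined in the source; one widely used rule (an assumption here, not necessarily AIR's) is half the cardinality, capped at 50:

   import math

   # Assumed heuristic; AIR's get_embedding_size may use a different rule.
   def embedding_size_sketch(unique_values: int) -> int:
       return min(50, math.ceil(unique_values / 2))

   embedding_size_sketch(7)    # -> 4
   embedding_size_sketch(500)  # -> 50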

utility.embedder.get_numerical_cols(df: pandas.core.frame.DataFrame, target_name: str) → List[source]

Generates a list of numerical columns from a dataframe

utility.embedder.is_not_single_embedding(label: sklearn.preprocessing._label.LabelEncoder) → bool[source]

Used to check if there is more than one class in a given LabelEncoder :param label: label encoder to be checked :return: a boolean if the embedding contains more than one class

utility.embedder.make_plot_from_history(history: tensorflow.python.keras.callbacks.History, output_path: Optional[str] = None, extension: str = 'pdf') → matplotlib.figure.Figure[source]

Used to make a Figure object containing the loss curve between the epochs. :param history: the history output from the model.fit method :param output_path: (optional) where the image will be saved :param extension: (optional) the extension of the file :return: a Figure object containing the plot

utility.embedder.make_visualizations(labels: List[sklearn.preprocessing._label.LabelEncoder], embeddings: List[numpy.array], df: pandas.core.frame.DataFrame, output_path: Optional[str] = None, extension: str = 'pdf', n_numerical_cols: Optional[int] = None) → List[matplotlib.figure.Figure][source]

Used to generate the embedding visualizations for each categorical variable :param labels: a list of the LabelEncoders of each categorical variable :param embeddings: a Numpy array containing the weights from the categorical variables :param df: the dataframe from where the weights were extracted :param output_path: (optional) where the visualizations will be saved :param extension: (optional) the extension to be used when saving the artifacts :param n_numerical_cols: number of numerical columns :return: the list of figures for each categorical variable

utility.embedder.series_to_list(series: pandas.core.series.Series) → List[source]

This method is used to convert a given pd.Series object into a list :param series: the Series to be converted :return: the list containing all the elements from the Series object

utility.embedder.transpose_to_list(X: numpy.ndarray) → List[numpy.ndarray][source]

This method is used to convert the ndarray X (features) to a list of features :param X: the ndarray to be used as source :return: a list of ndarrays containing the elements from the numpy array

metrics.py

Utility metrics functions.

utility.metrics.compute_mean(values: List) → float[source]

Computes the rounded mean of a list of values :param values: values to compute :return: the rounded mean

utility.metrics.compute_std(values: List) → float[source]

Computes the rounded std of a list of values :param values: values to compute :return: the rounded std

utility.metrics.eval_gini(y_true: numpy.array, y_prob: numpy.array) → float[source]

Computes the gini score :param y_true: the true labels :param y_prob: the predicted labels :return: the gini score
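
For binary labels the Gini score relates to ROC AUC as gini = 2 * AUC - 1; a sketch using scikit-learn (whether eval_gini is implemented this way is not confirmed here):

   import numpy as np
   from sklearn.metrics import roc_auc_score

   def gini_sketch(y_true: np.ndarray, y_prob: np.ndarray) -> float:
       return 2.0 * roc_auc_score(y_true, y_prob) - 1.0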

utility.metrics.get_cm_by_protected_variable(df: pandas.core.frame.DataFrame, protected_col_name: str, y_target_name: str, y_pred_name: str) → pandas.core.frame.DataFrame[source]

Makes a confusion matrix for each value of a protected variable :param df: dataframe with the data :param protected_col_name: name of protected variable :param y_target_name: name of the target :param y_pred_name: name of the output :return: confusion matrix as a dataframe

utility.metrics.gini_xgb(preds: numpy.array, dtrain: xgboost.core.DMatrix) → List[Tuple][source]

Computes the negated gini score :param preds: predictions to use :param dtrain: a DMatrix with the true labels :return: a list of tuples with the gini score

time.py

Utility time functions.

utility.time.assert_datetime(entry: str) → bool[source]

This method checks whether an entry can be converted to a valid datetime :param entry: entry to check :return: boolean value indicating whether the conversion is successful
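
A plausible sketch of such a check using pandas (assumed, not the verified implementation):

   import pandas as pd

   def assert_datetime_sketch(entry: str) -> bool:
       try:
           pd.to_datetime(entry)
           return True
       except (ValueError, TypeError):
           return False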
