AIR documentation¶
Introduction¶
This document is aimed at authors of, developers of, and contributors to the AIR project.
The source code of AIR is located in its project repository on GitLab.
For more information, please visit our official website.
Source Code Documentation¶
This chapter describes all Python modules in the AIR source code.
Analysis¶
evaluate_balance.py¶
Script to evaluate the performance of using class weights.
evaluate_classification_cases.py¶
Script to make a baseline evaluation of classification cases using CV with different datasets and classifiers.
evaluate_gender_bias.py¶
Script to evaluate gender bias in the Complete case.
evaluate_preprocessing.py¶
Script to evaluate preprocessing strategies for cases.
class analysis.evaluate_preprocessing.BoxCoxNormalizer[source]¶

fit_transform(X, case=None)[source]¶
Fit to data, then transform it. Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: X (array-like of shape (n_samples, n_features)) – Input samples. y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations). **fit_params (dict) – Additional fit parameters.
Returns: X_new – Transformed array.
Return type: ndarray of shape (n_samples, n_features_new)

class analysis.evaluate_preprocessing.BoxCoxNormalizerNoGender[source]¶

fit_transform(X, case=None)[source]¶
Fit to data, then transform it. Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: X (array-like of shape (n_samples, n_features)) – Input samples. y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations). **fit_params (dict) – Additional fit parameters.
Returns: X_new – Transformed array.
Return type: ndarray of shape (n_samples, n_features_new)

class analysis.evaluate_preprocessing.DummyNormalizer[source]¶

fit_transform(X, case=None)[source]¶
Fit to data, then transform it. Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: X (array-like of shape (n_samples, n_features)) – Input samples. y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations). **fit_params (dict) – Additional fit parameters.
Returns: X_new – Transformed array.
Return type: ndarray of shape (n_samples, n_features_new)

class analysis.evaluate_preprocessing.DummyScaler[source]¶

fit_transform(X)[source]¶
Fit to data, then transform it. Fits the transformer to X and y with optional parameters fit_params and returns a transformed version of X.
Parameters: X (array-like of shape (n_samples, n_features)) – Input samples. y (array-like of shape (n_samples,) or (n_samples, n_outputs), default=None) – Target values (None for unsupervised transformations). **fit_params (dict) – Additional fit parameters.
Returns: X_new – Transformed array.
Return type: ndarray of shape (n_samples, n_features_new)
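For orientation, here is a minimal sketch of a transformer exposing the fit_transform(X, case=None) interface documented above, built on a Box-Cox transform. The column-wise shifting is an assumption for illustration; the project's actual normalizers live in analysis/evaluate_preprocessing.py.

```python
import numpy as np
from scipy.stats import boxcox

class BoxCoxLikeNormalizer:
    """Hypothetical normalizer with the fit_transform(X, case=None) interface."""
    def fit_transform(self, X, case=None):
        X = np.asarray(X, dtype=float)
        X_new = np.empty_like(X)
        for j in range(X.shape[1]):
            col = X[:, j]
            # Box-Cox requires strictly positive input, so shift each column.
            shifted = col - col.min() + 1.0
            X_new[:, j], _ = boxcox(shifted)
        return X_new

X_norm = BoxCoxLikeNormalizer().fit_transform(np.random.rand(100, 5))
```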
evaluate_survival_case.py¶
Script to make a baseline evaluation of the survival case using CV.
find_best_ats_resolution.py¶
Script to find the best ATS resolution for the Complete case.
find_best_features.py¶
Script to find the best features for cases using CV.
Data¶
load_and_clean_data.py¶
Script to load the raw data and then clean it.
make_survival_data.py¶
Script to make the dataset for the Alarm case.
make_dataset_count.py¶
Script to make a dataset for the Complete, Compliance, Fall and Risk cases where categorical features are encoded as integer counts of how many times each value appears in a column.
make_dataset_emb.py¶
Script to make a dataset for the Complete, Compliance, Fall, Risk and Alarm cases where categorical features are encoded as entity embeddings. Takes in a full dataset generated with “make_dataset_full.py” for the Complete, Compliance, Fall or Risk case, or “make_survival_data.py” for the Alarm case.
make_dataset_full.py¶
Script to make a dataset for the Complete, Compliance, Fall and Risk cases based on the screenings.
make_dataset_ohe.py¶
Script to make a dataset for the Complete, Compliance, Fall and Risk cases using one-hot encoding of categorical features.
make_dataset_ordinal.py¶
Script to make a dataset for the Complete, Compliance, Fall and Risk cases using ordinal encoding of categorical features.
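The three flat encodings these scripts refer to can be illustrated with a small pandas sketch; the columns and values below are invented, not the project's schema.

```python
import pandas as pd

df = pd.DataFrame({"Gender": ["M", "F", "F"], "Ats": ["cane", "walker", "cane"]})

# One-hot encoding (make_dataset_ohe.py style): one binary column per value.
ohe = pd.get_dummies(df, columns=["Ats"])

# Ordinal encoding (make_dataset_ordinal.py style): one integer code per value.
ordinal = df.copy()
ordinal["Ats"] = ordinal["Ats"].astype("category").cat.codes

# Count encoding (make_dataset_count.py style): replace each value by the
# number of times it appears in its column.
count = df.copy()
count["Ats"] = count["Ats"].map(df["Ats"].value_counts())
```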
make_screening_data.py¶
Script to make a dataset of screenings to be used for cases.
Model¶
train_alarm_model.py¶
Script to train the model for the Alarm case.
train_complete_model.py¶
Script to train the model for the Complete case.
train_compliance_model.py¶
Script to train the model for the Compliance case.
Tools¶
classifiers.py¶
Module to store classifiers used for CV.
class tools.classifiers.BaseClassifer(X, y)[source]¶
Base class for classifiers.

evaluate(metrics: List, k: int) → Tuple[dict, numpy.ndarray][source]¶
This method performs cross validation for k seeds on a given dataset X and y and outputs the results of N splits given a list of scoring metrics. :param metrics: scoring metrics :param k: the seed to use :return: the results from a stratified K-fold CV process
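As a rough sketch of that contract, a subclass might wrap an sklearn estimator as below. The model choice and the five splits are assumptions; only the evaluate signature comes from the docs above.

```python
from typing import List, Tuple
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_validate

class RandomForestClf:
    """Hypothetical subclass following the BaseClassifer interface."""
    def __init__(self, X, y):
        self.X, self.y = X, y

    def evaluate(self, metrics: List, k: int) -> Tuple[dict, np.ndarray]:
        # Stratified K-fold CV seeded with k, scored on the given metrics.
        cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=k)
        results = cross_validate(RandomForestClassifier(random_state=k),
                                 self.X, self.y, cv=cv, scoring=metrics)
        return results, results[f"test_{metrics[0]}"]
```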
cleaner.py¶
Module to clean raw data.
class tools.cleaner.BaseCleaner[source]¶
Base class for cleaners.

abstract clean_screening_content(screening_content, patient_data)[source]¶
Cleans the screening content data set.

class tools.cleaner.Cleaner2021[source]¶
Cleaner for the 2021 dataset.

clean_assistive_aids(df: pandas.core.frame.DataFrame, iso_classes: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
Cleans the assistive aids data set.

clean_patient_data(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
Cleans the patient data set.

clean_screening_content(df: pandas.core.frame.DataFrame, patient_data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
Cleans the screening content data set.

clean_status_set(df: pandas.core.frame.DataFrame, patient_data: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
Cleans the status set data set.
data_loader.py¶
Module to load data for cases.
class tools.data_loader.AlarmDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]¶
Data loader for the Alarm case.

class tools.data_loader.BaseDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]¶
Base class for data loaders.

get_data() → Tuple[pandas.core.frame.DataFrame, numpy.ndarray][source]¶
This method returns the features and targets. :return: X and y

get_features() → List[str][source]¶
This method returns the feature names. :return: the columns of X as a list

class tools.data_loader.CompleteDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]¶
Data loader for the Complete case.

class tools.data_loader.ComplianceDataLoader(file_path: pathlib.Path, file_name: str, settings: dict, converters: Optional[dict] = None)[source]¶
Data loader for the Compliance case.
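A minimal usage sketch, assuming the case loaders inherit get_data() and get_features() from BaseDataLoader; the path, file name, and settings below are invented.

```python
from pathlib import Path
from tools.data_loader import CompleteDataLoader

# Hypothetical arguments; the real file names and settings keys are
# defined by the project's configuration.
loader = CompleteDataLoader(file_path=Path("data/processed"),
                            file_name="complete.csv",
                            settings={})
X, y = loader.get_data()          # features dataframe and target array
features = loader.get_features()  # column names of X
```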
file_reader.py¶
File reader module to read files.
tools.file_reader.read_array(infile: _io.BytesIO) → numpy.ndarray[source]¶
This method reads a NumPy array file as a pickle. :param infile: binary input stream :return: the NumPy array object

tools.file_reader.read_csv(infile: _io.StringIO, header: str = 'infer', sep: str = ',', usecols: Optional[List[int]] = None, names: Optional[List[str]] = None, converters: Optional[dict] = None, encoding=None, skiprows=None) → pandas.core.frame.DataFrame[source]¶
This method reads a CSV file using the Pandas read_csv() method. :param infile: text input stream :param header: file header :param sep: separator identifier :param names: list of column names to use :param converters: dict of converters to use :return: the CSV file as a dataframe

tools.file_reader.read_embedding(infile: _io.StringIO) → dict[source]¶
This method reads an embedding file. :param infile: text input stream :return: the embedding as a dict

tools.file_reader.read_excelfile(infile: _io.BytesIO, converters: Optional[dict] = None) → pandas.core.frame.DataFrame[source]¶
This method reads an Excel file. :param infile: binary input stream :param converters: dict of converters to use :return: the Excel file as a dataframe

tools.file_reader.read_excelfile_sheets(infile: _io.BytesIO, n_sheets: int, converters: Optional[dict] = None) → pandas.core.frame.DataFrame[source]¶
This method reads sheets from an Excel file. :param infile: binary input stream :param n_sheets: number of sheets to read :param converters: dict of converters to use :return: the full Excel file as a dataframe
file_writer.py¶
File writer module to write files.
tools.file_writer.write_array(data: numpy.ndarray, outfile: _io.BytesIO) → None[source]¶
This method writes a NumPy array. :param data: data to write :param outfile: binary output stream :return: None

tools.file_writer.write_csv(df: pandas.core.frame.DataFrame, outfile: _io.StringIO, date_format: str = '%d-%m-%Y', index: bool = False) → None[source]¶
This method writes a CSV file using the Pandas to_csv() method. :param df: dataframe to write :param outfile: text output stream :param date_format: date format to use :param index: write row names (index) :return: None

tools.file_writer.write_cv_plot(means: List, stds: List, metric: str, num_iter: int, clf_names: List, title: str, subtitle: str, outfile: _io.BytesIO)[source]¶
This method writes a plot of the results from a CV process. :param means: the mean values obtained :param stds: the standard deviations obtained :param metric: the metric used :param num_iter: the number of iterations :param clf_names: names of classifiers used :param title: plot title :param subtitle: plot subtitle :param outfile: binary output stream :return: None

tools.file_writer.write_embedding(mapping: dict, outfile: _io.StringIO) → None[source]¶
This method writes an embedding mapping as a CSV file. :param mapping: mapping dict :param outfile: text output stream :return: None

tools.file_writer.write_joblib(data: any, outfile: _io.BytesIO) → None[source]¶
This method writes a joblib file. :param data: data to write :param outfile: binary output stream :return: None
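Since the reader and writer helpers operate on in-memory streams, a simple round trip looks like this; only the documented signatures are assumed.

```python
import io
import numpy as np
from tools.file_reader import read_array
from tools.file_writer import write_array

buf = io.BytesIO()
write_array(np.arange(10), buf)  # pickle the array into the binary stream
buf.seek(0)                      # rewind before reading
arr = read_array(buf)            # read the array back
```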
inputter.py¶
Module to create features based on screenings.
tools.inputter.convert_date_to_datetime(date: pandas._libs.tslibs.timestamps.Timestamp, date_format: str) → pandas._libs.tslibs.timestamps.Timestamp[source]¶
This method converts a date to a datetime. :param date: date to convert :param date_format: date format to use :return: the converted date

tools.inputter.get_ats(df: pandas.core.frame.DataFrame, end_date: pandas._libs.tslibs.timestamps.Timestamp, settings: dict) → str[source]¶
This method extracts a citizen’s ats from a screening. :param df: a dataframe containing ats :param end_date: the screening end date :param settings: the settings to use :return: the citizen’s ats

tools.inputter.get_avg_loan_period(df: pandas.core.frame.DataFrame, end_date: pandas._libs.tslibs.timestamps.Timestamp) → int[source]¶
This method extracts the average ats loan period. :param df: a dataframe containing ats :param end_date: the screening end date :return: the average loan period

tools.inputter.get_birth_year(sc: pandas.core.frame.DataFrame) → int[source]¶
This method extracts a citizen’s birth year. :param sc: a tuple with a screening :return: the birth year

tools.inputter.get_cancels_week(tcw: pandas.core.frame.DataFrame) → pandas.core.series.Series[source]¶
This method extracts the cancellations per week. :param tcw: windowed training cancellations :return: cancellations per week

tools.inputter.get_citizen_data(data: utility.data.Data, citizen_id: str) → utility.data.Data[source]¶
This method extracts all screening data we have on a citizen. :param data: a data DTO consisting of screening data of all citizens :param citizen_id: id of the citizen :return: a data DTO with only the citizen’s data in it

tools.inputter.get_exercise_content(sc: Tuple) → str[source]¶
This method extracts the exercise content from a screening. :param sc: a tuple with a screening :return: the exercise content

tools.inputter.get_gender(sc: pandas.core.frame.DataFrame) → int[source]¶
This method extracts a citizen’s gender. :param sc: a tuple with a screening :return: the gender

tools.inputter.get_interval_length(start_date: pandas._libs.tslibs.timestamps.Timestamp, end_date: pandas._libs.tslibs.timestamps.Timestamp) → float[source]¶
This method extracts the interval length between the start date and end date of a screening. :param start_date: the start date :param end_date: the end date :return: the length of the interval

tools.inputter.get_max_evaluation(tdw: pandas.core.frame.DataFrame) → int[source]¶
This method extracts the largest evaluation score a citizen has gotten. :param tdw: the windowed training data :return: citizen’s largest evaluation score

tools.inputter.get_mean_cancels_week(n_cancel: int, n_weeks: float) → int[source]¶
This method extracts the mean number of cancellations per week. :param n_cancel: number of cancellations :param n_weeks: number of screening weeks :return: citizen’s mean number of cancellations per week
tools.inputter.get_mean_evaluation(tdw: pandas.core.frame.DataFrame) → float[source]¶
This method extracts the mean of a citizen’s evaluation scores. :param tdw: the windowed training data :return: citizen’s mean evaluation score

tools.inputter.get_mean_time_between_cancels(tcw: pandas.core.frame.DataFrame, n_decimals=2) → float[source]¶
This method extracts the mean time between cancellations. :param tcw: the windowed cancellation data :param n_decimals: number of decimals for rounding :return: citizen’s mean time between cancellations

tools.inputter.get_min_evaluation(tdw: pandas.core.frame.DataFrame) → int[source]¶
This method extracts the smallest evaluation score a citizen has gotten. :param tdw: the windowed training data :return: citizen’s smallest evaluation score

tools.inputter.get_n_cancel_week_min(cancelsprweek: pandas.core.series.Series) → int[source]¶
This method extracts a citizen’s number of cancels per week. :param cancelsprweek: a series with cancels per week :return: number of cancels per week

tools.inputter.get_n_training_week(n_weeks: float, n_training_window: int) → int[source]¶
This method extracts the number of completed trainings per week. :param n_weeks: number of screening weeks :param n_training_window: length of the training window :return: number of trainings per week

tools.inputter.get_n_training_week_max(training_pr_week: pandas.core.series.Series) → int[source]¶
This method extracts the largest number of trainings a citizen has done per week. :param training_pr_week: trainings per week :return: largest number of trainings done per week

tools.inputter.get_n_training_week_min(training_pr_week: pandas.core.series.Series, n_weeks_with_trainings: int, n_weeks: float) → int[source]¶
This method extracts the smallest number of trainings a citizen has done per week. :param training_pr_week: trainings per week :param n_weeks_with_trainings: number of weeks with training :param n_weeks: number of screening weeks :return: smallest number of trainings done per week

tools.inputter.get_n_training_window(tdw: pandas.core.frame.DataFrame) → int[source]¶
This method extracts the number of training windows. :param tdw: the windowed training data :return: number of training windows

tools.inputter.get_n_weeks_with_training(tdw: pandas.core.frame.DataFrame, start_date: pandas._libs.tslibs.timestamps.Timestamp) → int[source]¶
This method extracts a citizen’s number of weeks with training. :param tdw: the windowed training data :param start_date: the start date :return: number of weeks with training

tools.inputter.get_n_weeks_without_training(n_weeks: float, n_weeks_with_trainings: int) → int[source]¶
This method extracts a citizen’s number of weeks without training. :param n_weeks: number of screening weeks :param n_weeks_with_trainings: number of training weeks :return: number of weeks without training
tools.inputter.get_needs(sc: Tuple) → int[source]¶
This method extracts a screening’s need for help score. :param sc: a tuple with a screening :return: the need for help score

tools.inputter.get_needs_reason(sc: Tuple) → str[source]¶
This method extracts a screening’s need for help reason. :param sc: a tuple with a screening :return: the need for help reason

tools.inputter.get_number_ats(df: pandas.core.frame.DataFrame, end_date: pandas._libs.tslibs.timestamps.Timestamp) → int[source]¶
This method extracts the number of ats a citizen has. :param df: a dataframe containing ats :param end_date: the screening end date :return: the number of ats

tools.inputter.get_number_exercises(sc: Tuple) → int[source]¶
This method extracts the number of exercises in a program. :param sc: a tuple with a screening :return: the number of exercises

tools.inputter.get_physics(sc: Tuple) → int[source]¶
This method extracts a screening’s physics score. :param sc: a tuple with a screening :return: the physical strength score

tools.inputter.get_physics_reason(sc: Tuple) → str[source]¶
This method extracts a screening’s physics reason. :param sc: a tuple with a screening :return: the physical strength reason

tools.inputter.get_screening_data(td: pandas.core.frame.DataFrame, tc: pandas.core.frame.DataFrame, ss: pandas.core.frame.DataFrame, start_date: pandas._libs.tslibs.timestamps.Timestamp, end_date: pandas._libs.tslibs.timestamps.Timestamp) → Tuple[pandas.core.frame.DataFrame, pandas.core.frame.DataFrame, pandas.core.frame.DataFrame][source]¶
This method extracts screening data by a start and end date. :param td: training data :param tc: training cancellations :param ss: status data :param start_date: start date of the screening :param end_date: end date of the screening :return: tuple with the windowed training data, training cancellations and status data

tools.inputter.get_std_evaluation(tdw: pandas.core.frame.DataFrame) → float[source]¶
This method extracts the standard deviation of a citizen’s evaluation scores. :param tdw: the windowed training data :return: standard deviation of citizen’s evaluation score

tools.inputter.get_time_between_training_mean(tdw: pandas.core.frame.DataFrame, n_decimals: int = 2) → float[source]¶
This method extracts the mean time between trainings. :param tdw: the windowed training data :param n_decimals: number of decimals for rounding :return: citizen’s mean time between trainings

tools.inputter.get_training_week(tdw: pandas.core.frame.DataFrame, start_date: pandas._libs.tslibs.timestamps.Timestamp) → pandas.core.series.Series[source]¶
This method extracts the training a citizen has done per week. :param tdw: the windowed training data :param start_date: the start date :return: citizen’s training per week
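To make the window-based feature pattern concrete, here is a standalone sketch of the kind of weekly training features these helpers derive. The column names and toy data are invented; the real logic lives in tools/inputter.py.

```python
import pandas as pd

start, end = pd.Timestamp("2021-01-04"), pd.Timestamp("2021-03-29")
n_weeks = (end - start).days / 7                         # cf. get_interval_length

training = pd.DataFrame({"RatingDate": pd.date_range(start, end, freq="3D")})
window = training[(training["RatingDate"] >= start) & (training["RatingDate"] <= end)]
per_week = window.resample("W", on="RatingDate").size()  # cf. get_training_week

mean_per_week = per_week.mean()
max_per_week = per_week.max()                            # cf. get_n_training_week_max
```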
labeler.py¶
Labeler module to make class labels for cases.
tools.labeler.accumulate_screenings(df: pandas.core.frame.DataFrame, settings: dict) → pandas.core.frame.DataFrame[source]¶
This method accumulates screenings and annotates when a citizen completes, reaches compliance or falls during a program. :param df: a dataframe containing screenings :param settings: settings to use :return: dataframe with accumulated screenings

tools.labeler.annotate_falls(row, digi_db, risk_period_months)[source]¶
Utility method to annotate rows in a dataframe if their citizen id appears in a fall data set for a specific time period.

tools.labeler.do_citizens_complete_sessions(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
This method evaluates if citizens in a dataframe have completed the sessions they’ve started. :param df: a dataframe containing screenings :return: dataframe with a label for completed

tools.labeler.do_citizens_fall(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
This method evaluates if citizens in a dataframe have fallen during the sessions they’ve started. :param df: a dataframe containing screenings :return: dataframe with a label for fall

tools.labeler.do_citizens_reach_compliance(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
This method evaluates if citizens in a dataframe reach compliance during the sessions they’ve started. :param df: a dataframe containing screenings :return: dataframe with a label for compliance

tools.labeler.make_complete_label(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
This method takes accumulated screenings and annotates citizens who have either completed or not completed their program, based on the first screening. :param df: a dataframe containing screenings :return: annotated dataframe

tools.labeler.make_compliance_label(df: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
This method takes accumulated screenings and annotates citizens who have either reached or not reached compliance in their program, based on the first screening and assuming they completed. :param df: a dataframe containing screenings :return: annotated dataframe

tools.labeler.make_fall_label(df: pandas.core.frame.DataFrame)[source]¶
This method takes accumulated screenings and annotates citizens who have either fallen or not fallen in their program, based on the first screening. :param df: a dataframe containing screenings :return: annotated dataframe

tools.labeler.make_risk_label(df: pandas.core.frame.DataFrame, risk_period_months: int)[source]¶
This method takes accumulated screenings and annotates citizens who have either fallen or not fallen in a risk period. :param df: a dataframe containing screenings :param risk_period_months: the length of the risk period in months :return: annotated dataframe
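A toy sketch of the first-screening labeling pattern the make_*_label functions describe; the column names and the completion rule are invented for illustration.

```python
import pandas as pd

# Accumulated screenings: one row per screening, several per citizen.
screenings = pd.DataFrame({
    "CitizenId":    [1, 1, 2, 3],
    "NumberWeeks":  [8, 4, 1, 6],
    "HasCompleted": [True, True, False, True],
})

# In the spirit of make_complete_label: keep each citizen's first screening
# and annotate it with the completion outcome (the real rule may differ).
first = screenings.groupby("CitizenId", as_index=False).first()
first["Complete"] = first["HasCompleted"].astype(int)
```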
neural_embedder.py¶
Module to turn categorical features into embeddings.
class tools.neural_embedder.NetworkCategory(alias: str, unique_values: int)[source]¶
Used to store fields related to a given category, such as its name, count of unique values and the size of each embedding layer.

class tools.neural_embedder.NeuralEmbedder(df: pandas.core.frame.DataFrame, target_name: str, metrics: List[str], train_ratio: float = 0.8, network_layers: List[int] = (32, 32), dropout_rate: float = 0, activation_fn: str = 'relu', kernel_initializer: str = 'glorot_uniform', regularization_factor: float = 0, loss_fn: str = 'binary_crossentropy', optimizer_fn: str = 'Adam', epochs: int = 10, batch_size: int = 32, verbose: bool = False, model_path: str = 'models')[source]¶
A neural embedder that can learn entity embeddings from categorical features.

fit(X_train: numpy.ndarray, y_train: numpy.ndarray, X_valid: numpy.ndarray, y_valid: numpy.ndarray, callbacks=None, class_weight: Optional[dict] = None) → tensorflow.python.keras.callbacks.History[source]¶
This method fits given training and validation data to our entity embeddings model. :param X_train: training features :param y_train: training targets :param X_valid: validation features :param y_valid: validation targets :param callbacks: any desired callbacks :param class_weight: any desired class weight :return: a History object

get_embedded_weights() → List[source]¶
This method extracts the weights of the embedded layers. :return: a List with embedded weights

get_labels_path() → pathlib.Path[source]¶
Used to return the path of the stored labels. :return: the path of the stored labels on disk

get_scaler_path() → pathlib.Path[source]¶
Used to return the path of the stored scaler. :return: the path of the stored scaler on disk

get_visualizations_dir() → pathlib.Path[source]¶
Used to return the path of the stored visualizations. :return: the path of the stored visualizations on disk

get_weights_path() → pathlib.Path[source]¶
Used to return the path of the stored weights. :return: the path of the stored weights on disk

make_visualizations_from_network(extension: str = 'pdf') → List[matplotlib.figure.Figure][source]¶
This method makes visualizations of the embedded weights for each categorical variable. :param extension: extension to use :return: a List with Figure objects
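A hypothetical end-to-end sketch of the embedder workflow, combining the documented constructor with tools.preprocessor.prepare_data_for_emb; the dataframe, column names, and the return order of the split are assumptions.

```python
import pandas as pd
from tools.neural_embedder import NeuralEmbedder
from tools.preprocessor import prepare_data_for_emb

# Toy dataframe: two categorical features and a binary target.
df = pd.DataFrame({"Gender": [0, 1, 0, 1] * 10,
                   "Ats1": ["cane", "walker", "bed", "cane"] * 10,
                   "Complete": [1, 0, 1, 0] * 10})

# NB: the order of the four returned arrays is assumed here.
X_train, X_valid, y_train, y_valid, encoders = prepare_data_for_emb(
    df, target_name="Complete", train_ratio=0.8)

embedder = NeuralEmbedder(df=df, target_name="Complete",
                          metrics=["accuracy"], epochs=10)
history = embedder.fit(X_train, y_train, X_valid, y_valid)
weights = embedder.get_embedded_weights()
```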
preprocessor.py¶
Preprocessor to prepare data for models.
tools.preprocessor.encode_vector_label(data: List[numpy.ndarray], n_num_cols: Optional[int] = None) → Tuple[List[numpy.ndarray], List[sklearn.preprocessing._label.LabelEncoder]][source]¶
This method label-encodes categorical data. :param data: a List of data NumPy arrays :param n_num_cols: number of numerical columns to skip :return: a Tuple of Lists with encoded data and encoders

tools.preprocessor.extract_cat_count(df: pandas.core.frame.DataFrame, cat_values: List[str], cat_feature_names: List[str], prefix: str) → pandas.core.frame.DataFrame[source]¶
This method extracts the number of times a categorical value appears in a column. :param df: dataframe containing the data :param cat_values: a List of unique categorical values :param cat_feature_names: a List of names of categorical features :param prefix: the prefix to use for the new columns :return: a dataframe with categorical data represented as a count

tools.preprocessor.get_X_y(df: pandas.core.frame.DataFrame, name_target: str) → Tuple[List, List][source]¶
This method gathers the X (features) and y (targets) from a given dataframe based on a given target name. :param df: the dataframe to be used as source :param name_target: the name of the target variable :return: the list of features and targets

tools.preprocessor.get_ats_list(ats: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
This method groups ats by CitizenId and returns the result as a single-column dataframe. :param ats: the dataframe with associated CitizenId and DevISOClass column :return: the dataframe with the grouped ats

tools.preprocessor.get_class_weight(neg: int, pos: int) → dict[source]¶
This method computes the class weight for a classification problem given the number of negative and positive labels. :param neg: number of negative labels :param pos: number of positive labels :return: the class weight as a dictionary

tools.preprocessor.normalize_data(df: pandas.core.frame.DataFrame, feature_names: List[str]) → pandas.core.frame.DataFrame[source]¶
This method normalizes data in a dataframe. :param df: dataframe containing the data :param feature_names: names of features to normalize :return: a dataframe with normalized features

tools.preprocessor.one_hot_encode(df: pandas.core.frame.DataFrame, feature_names: List[str]) → pandas.core.frame.DataFrame[source]¶
This method one-hot-encodes data in a dataframe. :param df: dataframe containing the data :param feature_names: names of features to one-hot-encode :return: a dataframe with one-hot-encoded features

tools.preprocessor.prepare_data_for_emb(df: pandas.core.frame.DataFrame, target_name: str, train_ratio: float, n_num_cols: Optional[int] = None) → Tuple[numpy.ndarray, numpy.ndarray, numpy.ndarray, numpy.ndarray, List[sklearn.preprocessing._label.LabelEncoder]][source]¶
This method prepares data in a dataframe for a neural embedder by extracting the X and y variables, label-encoding the categorical variables and splitting the data into a train and test set given a split ratio. Finally it returns the split and the encoded labels. :param df: dataframe containing the X and y values :param target_name: name of target label :param train_ratio: the split ratio :param n_num_cols: number of numerical columns in dataframe :return: the split and the encoded labels

tools.preprocessor.replace_cat_values(df: pandas.core.frame.DataFrame, mapping: pandas.core.frame.DataFrame) → pandas.core.frame.DataFrame[source]¶
This method replaces categorical values in a dataframe by their real-world counterpart given a mapping. :param df: a dataframe with the values to replace :param mapping: the mapping to use :return: a dataframe where categorical values have been replaced

tools.preprocessor.sample(X: numpy.ndarray, y: numpy.ndarray, n: int) → Tuple[numpy.ndarray, numpy.ndarray][source]¶
This method samples n random rows with indices in [0, X.shape[0]]. :param X: the X array to sample from :param y: the y array to sample from :param n: the number of samples :return: the tuple containing a subset of samples in X and y

tools.preprocessor.scale_data(df: pandas.core.frame.DataFrame, feature_names: List[str]) → pandas.core.frame.DataFrame[source]¶
This method scales data in a dataframe. :param df: dataframe containing the data :param feature_names: names of features to scale :return: a dataframe with scaled features

tools.preprocessor.series_to_list(series: pandas.core.series.Series) → List[source]¶
This method converts a given pd.Series object into a list. :param series: the Series to be converted :return: the list containing all the elements from the Series object

tools.preprocessor.split_cat_columns(df: pandas.core.frame.DataFrame, col_to_split: str, tag: str, resolution: int) → pandas.core.frame.DataFrame[source]¶
This method splits a categorical column by a resolution. :param df: dataframe containing the data :param col_to_split: name of column to split :param tag: name of tag for new columns :param resolution: the resolution to use :return: a dataframe with split columns
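For intuition, one common way such a class weight is computed; this formulation is an assumption, and get_class_weight's exact formula lives in tools/preprocessor.py.

```python
def class_weight(neg: int, pos: int) -> dict:
    # Weight each class inversely to its frequency so the minority
    # class contributes equally to the loss.
    total = neg + pos
    return {0: total / (2.0 * neg), 1: total / (2.0 * pos)}

print(class_weight(900, 100))  # 9:1 imbalance -> {0: ~0.56, 1: 5.0}
```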
raw_loader.py¶
Module to load raw data for cases.
class tools.raw_loader.BaseRawLoader2021[source]¶
Base class for 2021 raw loaders.

abstract load_assistive_aids(aalborg_file_name, viborg_file_name, viborg_alarm_file_name, file_path)[source]¶
Load the DiGiRehab Assistive Aids data set.

abstract load_screening_content(aalborg_file_name, viborg_file_name, file_path)[source]¶
Load the DiGiRehab Screening Content data set.

abstract load_status_set(aalborg_file_name, viborg_file_name, file_path)[source]¶
Load the DiGiRehab StatusSet data set.

class tools.raw_loader.RawLoader2021[source]¶
Raw loader for the 2021 dataset.

load_assistive_aids(aalborg_file_name: str, viborg_file_name: str, viborg_alarm_file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]¶
This method loads assistive aids data. :param aalborg_file_name: name of the Aalborg file :param viborg_file_name: name of the Viborg file :param viborg_alarm_file_name: name of the Viborg alarm file :param file_path: path of the files :return: dataframe with loaded data

load_iso_classes(file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]¶
This method loads ISO classes. :param file_name: name of file :param file_path: path of file :return: dataframe with loaded data

load_screening_content(aalborg_file_name: str, viborg_file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]¶
This method loads screening data. :param aalborg_file_name: name of the Aalborg file :param viborg_file_name: name of the Viborg file :param file_path: path of the files :return: dataframe with loaded data

load_status_set(aalborg_file_name: str, viborg_file_name: str, file_path: pathlib.Path) → pandas.core.frame.DataFrame[source]¶
This method loads status set data. :param aalborg_file_name: name of the Aalborg file :param viborg_file_name: name of the Viborg file :param file_path: path of the files :return: dataframe with loaded data
Tuning¶
tune_alarm_boost_wb.py¶
Gradient Boosting tuning script for the Alarm case on WandB.
tune_alarm_rsf_wb.py¶
Random Survival Forest tuning script for the Alarm case on WandB.
tune_complete_rf_wb.py¶
Random Forest tuning script for the Complete case on WandB.
tune_complete_xgb_wb.py¶
XGBoost tuning script for the Complete case on WandB.
tune_compliance_rf_wb.py¶
Random Forest tuning script for the Compliance case on WandB.
tune_compliance_xgb_wb.py¶
XGBoost tuning script for the Compliance case on WandB.
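These scripts follow the usual Weights & Biases (WandB) sweep pattern; a minimal sketch, where the parameter names, ranges, and project name are invented:

```python
import wandb

sweep_config = {
    "method": "bayes",
    "metric": {"name": "accuracy", "goal": "maximize"},
    "parameters": {
        "n_estimators": {"values": [200, 400, 800]},
        "max_depth": {"min": 3, "max": 12},
    },
}

def train():
    run = wandb.init()
    # Build the model from run.config and report the tuning metric.
    wandb.log({"accuracy": 0.0})  # placeholder metric

sweep_id = wandb.sweep(sweep_config, project="air-tuning")
wandb.agent(sweep_id, function=train, count=20)
```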
Utility¶
config.py¶
Utility config functions.
data.py¶
Utility data functions.
class utility.data.Data(sc: pandas.core.frame.DataFrame, ss: pandas.core.frame.DataFrame, td: pandas.core.frame.DataFrame, tc: pandas.core.frame.DataFrame, ats: pandas.core.frame.DataFrame)[source]¶
DTO to keep track of feature data.

utility.data.read_csv(file_path: pathlib.Path, file_name: str, converters: Optional[dict] = None) → pandas.core.frame.DataFrame[source]¶
Reads a CSV from a file. :param file_path: path of the file :param file_name: name of the file :param converters: converters to use

utility.data.read_pickle(file_path: pathlib.Path, file_name: str) → any[source]¶
Reads a pickle from a file. :param file_path: path of the file :param file_name: name of the file

utility.data.write_csv(data: pandas.core.frame.DataFrame, file_path: pathlib.Path, file_name: str) → None[source]¶
Writes a CSV to a file. :param data: data to be written :param file_path: path of the file :param file_name: name of the file
embedder.py¶
Utility embedder functions.
utility.embedder.check_batch_size(batch_size: int) → None[source]¶
Checks that the batch size is greater than zero. :param batch_size: batch size to check

utility.embedder.check_epochs(epochs: int) → None[source]¶
Checks that the number of epochs is greater than zero. :param epochs: number of epochs to check

utility.embedder.check_not_empty_dataframe(df: pandas.core.frame.DataFrame) → None[source]¶
Checks if a dataframe is empty. :param df: dataframe to check

utility.embedder.check_target_existent_in_df(target_name: str, df: pandas.core.frame.DataFrame) → None[source]¶
Checks if a target name exists in a dataframe. :param target_name: target name to check :param df: dataframe to check

utility.embedder.check_target_name(target_name: str) → None[source]¶
Checks if a target name is set. :param target_name: target name to check

utility.embedder.check_train_ratio(train_ratio: float) → None[source]¶
Checks that a train ratio is between zero and one. :param train_ratio: train ratio to check

utility.embedder.encode_dataframe(df: pandas.core.frame.DataFrame, target_name: str, metrics: List[str], batch_size: int, train_ratio: float, epochs: int, optimizer: str, network_layers: List[int], verbose: bool, model_path: str, enable_emb_viz: bool) → pandas.core.frame.DataFrame[source]¶
Encodes the categorical features of a dataframe as entity embeddings. :param df: dataframe to encode :param target_name: the label name :param metrics: a list of metrics to use :param batch_size: batch size to use :param train_ratio: the train/test split ratio :param epochs: number of epochs :param optimizer: optimizer to use :param network_layers: a list with sizes of network layers, e.g. (32, 32) :param verbose: verbose execution flag :param model_path: where to store the model :param enable_emb_viz: flag to make visualizations

utility.embedder.get_X_y(df: pandas.core.frame.DataFrame, name_target: str) → Tuple[List, List][source]¶
This method gathers the X (features) and y (targets) from a given dataframe based on a given target name. :param df: the dataframe to be used as source :param name_target: the name of the target variable :return: the list of features and targets

utility.embedder.get_all_columns_except(df: pandas.core.frame.DataFrame, columns_to_skip: List[str]) → pandas.core.frame.DataFrame[source]¶
Used to get all columns in a dataframe except the columns to skip. :param df: dataframe to select columns from :param columns_to_skip: list of columns to skip :return: a dataframe with the selected columns

utility.embedder.get_embedding_size(unique_values: int) → int[source]¶
Returns the embedding size to be used on the Embedding layer. :param unique_values: the number of unique values in the given category :return: the size to be used on the embedding layer

utility.embedder.get_numerical_cols(df: pandas.core.frame.DataFrame, target_name: str) → List[source]¶
Generates a list of numerical columns from a dataframe.

utility.embedder.is_not_single_embedding(label: sklearn.preprocessing._label.LabelEncoder) → bool[source]¶
Used to check if there is more than one class in a given LabelEncoder. :param label: label encoder to be checked :return: a boolean indicating whether the embedding contains more than one class

utility.embedder.make_plot_from_history(history: tensorflow.python.keras.callbacks.History, output_path: Optional[str] = None, extension: str = 'pdf') → matplotlib.figure.Figure[source]¶
Used to make a Figure object containing the loss curve over the epochs. :param history: the history returned by the model.fit method :param output_path: (optional) where the image will be saved :param extension: (optional) the extension of the file :return: a Figure object containing the plot

utility.embedder.make_visualizations(labels: List[sklearn.preprocessing._label.LabelEncoder], embeddings: List[numpy.array], df: pandas.core.frame.DataFrame, output_path: Optional[str] = None, extension: str = 'pdf', n_numerical_cols: Optional[int] = None) → List[matplotlib.figure.Figure][source]¶
Used to generate the embedding visualizations for each categorical variable. :param labels: a list of the LabelEncoders of each categorical variable :param embeddings: a NumPy array containing the weights from the categorical variables :param df: the dataframe from which the weights were extracted :param output_path: (optional) where the visualizations will be saved :param extension: (optional) the extension to be used when saving the artifacts :param n_numerical_cols: number of numerical columns :return: the list of figures for each categorical variable
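Functions like get_embedding_size typically follow a rule of thumb that caps the embedding dimension. One common formulation, given here as an assumption since the project's exact rule may differ:

```python
def embedding_size(unique_values: int) -> int:
    # Half the category's cardinality, rounded up, capped at 50 dimensions.
    return min(50, (unique_values + 1) // 2)

assert embedding_size(3) == 2
assert embedding_size(1000) == 50
```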
metrics.py¶
Utility metrics functions.
utility.metrics.compute_mean(values: List) → float[source]¶
Computes the rounded mean of a list of values. :param values: values to compute :return: the rounded mean

utility.metrics.compute_std(values: List) → float[source]¶
Computes the rounded standard deviation of a list of values. :param values: values to compute :return: the rounded standard deviation

utility.metrics.eval_gini(y_true: numpy.array, y_prob: numpy.array) → float[source]¶
Computes the Gini score. :param y_true: the true labels :param y_prob: the predicted probabilities :return: the Gini score

utility.metrics.get_cm_by_protected_variable(df: pandas.core.frame.DataFrame, protected_col_name: str, y_target_name: str, y_pred_name: str) → pandas.core.frame.DataFrame[source]¶
Makes a confusion matrix for each value of a protected variable. :param df: dataframe with the data :param protected_col_name: name of the protected variable :param y_target_name: name of the target :param y_pred_name: name of the prediction :return: confusion matrix as a dataframe
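The Gini score relates to ROC AUC via the standard identity Gini = 2·AUC − 1. A quick sketch of that identity (eval_gini itself may be implemented differently):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

y_true = np.array([0, 1, 1, 0, 1])
y_prob = np.array([0.1, 0.8, 0.7, 0.3, 0.9])
gini = 2 * roc_auc_score(y_true, y_prob) - 1  # perfect ranking gives 1.0
```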