Experiment Manager#
ATOMMIC’s Experiment Manager leverages PyTorch Lightning for model checkpointing, TensorBoard logging, and Weights and Biases logging. The Experiment Manager is included by default in all ATOMMIC example scripts.
To use the experiment manager, simply call exp_manager and pass in the PyTorch Lightning Trainer:

exp_dir = exp_manager(trainer, cfg.get("exp_manager", None))
The Experiment Manager is configurable via YAML with Hydra:
exp_manager:
  exp_dir: /path/to/my/experiments
  name: my_experiment_name
  create_tensorboard_logger: True
  create_checkpoint_callback: True
Optionally, launch TensorBoard to view the training results in ./atommic_experiments (by default):
tensorboard --bind_all --logdir atommic_experiments
If create_checkpoint_callback is set to True, then ATOMMIC automatically creates checkpoints during training using PyTorch Lightning’s ModelCheckpoint. We can configure the ModelCheckpoint via YAML or CLI:
exp_manager:
  ...
  # configure the PyTorch Lightning ModelCheckpoint using checkpoint_callback_params
  # any ModelCheckpoint argument can be set here
  checkpoint_callback_params:
    # save the best checkpoints based on this metric
    monitor: val_loss
    # choose how many total checkpoints to save
    save_top_k: 5
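To make the interaction between monitor and save_top_k concrete, here is a minimal, illustrative sketch of the selection rule (not the actual PyTorch Lightning implementation — the real ModelCheckpoint tracks this incrementally during training):

```python
# Illustrative sketch: with monitor=val_loss (mode "min") and save_top_k=k,
# only the k checkpoints with the lowest monitored values are kept.
def top_k_checkpoints(metrics, k, mode="min"):
    """metrics maps checkpoint path -> monitored value; return the k best paths."""
    ranked = sorted(metrics.items(), key=lambda kv: kv[1], reverse=(mode == "max"))
    return [path for path, _ in ranked[:k]]

# The two lowest validation losses survive; the rest are deleted.
kept = top_k_checkpoints(
    {"epoch1.ckpt": 0.9, "epoch2.ckpt": 0.5, "epoch3.ckpt": 0.7}, k=2
)
```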
Resume Training#
We can also auto-resume training by configuring the exp_manager. Being able to auto-resume is important for long training runs that are preemptible or may be shut down before the training procedure has completed. To auto-resume training, set the following via YAML or CLI:
exp_manager:
  ...
  # resume training if checkpoints already exist
  resume_if_exists: True
  # to start training with no existing checkpoints
  resume_ignore_no_checkpoint: True
  # by default experiments will be versioned by datetime
  # we can set our own version with
  version: my_experiment_version
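The two resume flags combine into a simple decision, sketched below as hypothetical pseudo-logic (the real behavior lives inside ATOMMIC's exp_manager; this only illustrates what the flags encode):

```python
# Hypothetical sketch of the decision encoded by the two resume flags.
def resolve_resume(checkpoint_exists, resume_if_exists, resume_ignore_no_checkpoint):
    if not resume_if_exists:
        return "fresh"    # resuming disabled: always start a new run
    if checkpoint_exists:
        return "resume"   # pick up from the last checkpoint
    if resume_ignore_no_checkpoint:
        return "fresh"    # no checkpoint yet: silently start fresh
    raise FileNotFoundError("resume_if_exists is set but no checkpoint was found")
```

With both flags set to True, the first launch starts fresh and every relaunch resumes from the last checkpoint.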
Experiment Loggers#
Alongside TensorBoard, ATOMMIC also supports Weights and Biases. To use this logger, simply set the following via YAML or ExpManagerConfig.
Weights and Biases (WandB)#
exp_manager:
  ...
  create_checkpoint_callback: True
  create_wandb_logger: True
  wandb_logger_kwargs:
    name: ${name}
    project: ${project}
    entity: ${entity}
    <Add any other arguments supported by WandB logger here>
Exponential Moving Average#
ATOMMIC supports using an exponential moving average (EMA) of model parameters. This can be useful for improving model generalization and stability. To use EMA, simply set the following via YAML or ExpManagerConfig:
exp_manager:
  ...
  # use exponential moving average for model parameters
  ema:
    enabled: True  # False by default
    decay: 0.999  # decay rate
    cpu_offload: False  # If EMA parameters should be offloaded to CPU to save GPU memory
    every_n_steps: 1  # How often to update EMA weights
    validate_original_weights: False  # Whether to use original weights for validation calculation or EMA weights
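The update applied every `every_n_steps` is the standard EMA rule, ema ← decay · ema + (1 − decay) · weights. A minimal sketch on plain Python lists (the real implementation operates on model tensors):

```python
# Standard EMA update rule: ema <- decay * ema + (1 - decay) * current weights.
def ema_update(ema_weights, model_weights, decay=0.999):
    return [decay * e + (1.0 - decay) * w for e, w in zip(ema_weights, model_weights)]

# Two updates toward constant weights with decay=0.9 move the EMA 19% of the way.
ema = [0.0, 0.0]
for step_weights in ([1.0, 2.0], [1.0, 2.0]):
    ema = ema_update(ema, step_weights, decay=0.9)
```

A decay close to 1 (such as the default 0.999) means the averaged weights change slowly, smoothing out noisy late-training updates.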
Support for Preemption#
ATOMMIC adds support for a callback upon preemption while running the models on clusters. The callback takes care of saving the current state of training via a .ckpt file, followed by a graceful exit from the run. The checkpoint saved upon preemption has the *last.ckpt suffix and replaces the previously saved last checkpoints. This feature is useful for increasing utilization on clusters. The PreemptionCallback is enabled by default. To disable it, simply add create_preemption_callback: False under exp_manager in the config YAML file.
Hydra Multi-Run with ATOMMIC#
When training neural networks, it is common to perform a hyperparameter search in order to improve the performance of a model on some validation data. However, manually preparing a grid of experiments and managing all checkpoints and their metrics can be tedious. In order to simplify such tasks, ATOMMIC integrates with Hydra Multi-Run support to provide a unified way to run a set of experiments, all from the config.
There are certain limitations to this framework, which we list below:
- All experiments are assumed to run on a single GPU; multi-GPU for a single run (model-parallel models) is not supported as of now.
- ATOMMIC Multi-Run supports only grid search over a set of hyperparameters; we will eventually add support for more advanced hyperparameter search strategies.
- ATOMMIC Multi-Run only supports running on one or more GPUs and will not work if no GPU devices are present.
Config Setup#
In order to enable ATOMMIC Multi-Run, we first update our YAML configs with some information to let Hydra know we expect to run multiple experiments from this one config:
# Required for Hydra launch of hyperparameter search via multirun
defaults:
  - override hydra/launcher: atommic_launcher

# Hydra arguments necessary for hyperparameter optimization
hydra:
  # Helper arguments to ensure all hyperparameter runs launch from the directory that launches the script.
  sweep:
    dir: "."
    subdir: "."

  # Define all the hyperparameters here
  sweeper:
    params:
      # Place all the parameters you wish to search over here (corresponding to the rest of the config)
      # NOTE: Make sure there are no spaces around the commas that separate the config params!
      model.optim.lr: 0.001,0.0001
      model.encoder.dim: 32,64,96,128
      model.decoder.dropout: 0.0,0.1,0.2

  # Arguments to the process launcher
  launcher:
    num_gpus: -1  # Number of GPUs to use. Each run works on a single GPU.
    jobs_per_gpu: 1  # If each GPU has large memory, you can run multiple jobs on the same GPU for faster results (until OOM).
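Since the sweeper performs a grid search, the number of launched runs is the product of the value counts. A small sketch of the grid the example above expands to (2 × 4 × 3 = 24 runs):

```python
from itertools import product

# The comma-separated sweeper values from the config above.
lrs = [0.001, 0.0001]
dims = [32, 64, 96, 128]
dropouts = [0.0, 0.1, 0.2]

# Grid search enumerates the Cartesian product: one run per combination.
runs = [
    {"model.optim.lr": lr, "model.encoder.dim": d, "model.decoder.dropout": p}
    for lr, d, p in product(lrs, dims, dropouts)
]
```

Keep this multiplicative growth in mind when adding parameters: each new axis multiplies the total run count.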
Next, we will set up the config for the Experiment Manager. When we perform a hyperparameter search, each run may take some time to complete. We therefore want to avoid the case where a run ends (say, due to OOM or a timeout on the machine) and we need to redo all experiments. We set up the experiment manager config such that every experiment has a unique “key”, whose value corresponds to a single resumable experiment.
Let us see how to set up such a unique “key” via the experiment name. Simply attach all the hyperparameter arguments to the experiment name, as shown below:
exp_manager:
  exp_dir: null  # Can be set by the user.

  # Add a unique name for all hyperparameter arguments to allow continued training.
  # NOTE: It is necessary to add all hyperparameter arguments to the name!
  # This ensures successful restoration of model runs in case the HP search crashes.
  name: ${name}-lr-${model.optim.lr}-adim-${model.adapter.dim}-sd-${model.adapter.adapter_strategy.stochastic_depth}

  ...

  checkpoint_callback_params:
    ...
    save_top_k: 1  # Don't save too many .ckpt files during HP search
    always_save_atommic: True  # saves the checkpoints as ATOMMIC files for fast checking of results later
  ...

  # We highly recommend using an experiment tracking tool to gather all the experiments in one location
  create_wandb_logger: True
  wandb_logger_kwargs:
    project: "<Add some project name here>"

  # HP search may crash for various reasons; it is best to attempt continuation in order to
  # resume from where the last failure occurred.
  resume_if_exists: true
  resume_ignore_no_checkpoint: true
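The interpolated name acts as a deterministic key: the same hyperparameter values always produce the same experiment directory, so a relaunched sweep resumes each run instead of starting over. A hypothetical sketch of this keying scheme (the names and format here are illustrative, not ATOMMIC's actual interpolation logic):

```python
# Illustrative: build a deterministic experiment key from hyperparameter values,
# mirroring how the ${...} interpolations above compose the experiment name.
def experiment_key(base, hparams):
    return "-".join([base] + [f"{k}-{v}" for k, v in sorted(hparams.items())])

# Identical hyperparameters always map to the identical key, enabling resume.
key = experiment_key("my_experiment", {"lr": 0.001, "adim": 32})
```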
Running a Multi-Run config#
Once the config has been updated, we can now run it just like any normal Hydra script, with one special flag (-m):

atommic run -c BC -m
Tips and Tricks#
Preserving disk space for a large number of experiments
Some models may have a large number of parameters, and it may be very expensive to save a large number of checkpoints on physical storage drives. For example, if you use the Adam optimizer, each PyTorch Lightning “.ckpt” file will actually be 3x the size of just the model parameters, per ckpt file! This can be exorbitant if you have multiple runs.
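The 3x figure comes from Adam storing two extra fp32 buffers (first and second moment) per parameter alongside the weights themselves. A back-of-the-envelope estimate:

```python
# Rough estimate: an fp32 checkpoint with Adam state holds the weights plus
# two optimizer moment buffers per parameter, i.e. ~3x the bare model size.
def ckpt_size_bytes(num_params, bytes_per_param=4, optimizer_buffers=2):
    return num_params * bytes_per_param * (1 + optimizer_buffers)

# A 100M-parameter model: ~400 MB of weights, but ~1.2 GB per .ckpt file.
weights_mb = 100_000_000 * 4 / 1e6
ckpt_mb = ckpt_size_bytes(100_000_000) / 1e6
```

(This ignores small overheads such as the pickle container and any fp16 copies, so real files vary somewhat.)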
In the above config, we explicitly set save_top_k: 1 and always_save_atommic: True. This limits the number of ckpt files to just 1 and also saves an ATOMMIC file (which contains just the model parameters, without optimizer state) that can be restored immediately for further work.
We can further reduce storage space by utilizing some utility functions of ATOMMIC to automatically delete either ckpt or ATOMMIC files after a training run has finished. This is sufficient if you are collecting results in an experiment tracking tool and can simply rerun the best config after the search is finished.
# Import `clean_exp_ckpt` along with exp_manager
from atommic.utils.exp_manager import clean_exp_ckpt, exp_manager


@hydra_runner(...)
def main(cfg):
    ...
    # Keep track of the experiment directory
    exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None))

    # ... add any training code here as needed ...

    # At the end of the training script, remove the PTL ckpt file,
    # and potentially also remove the .atommic file to conserve storage space.
    clean_exp_ckpt(exp_log_dir, remove_ckpt=True, remove_atommic=False)
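For illustration, here is a hypothetical stand-in that mirrors the behavior the two flags describe — deleting ".ckpt" files (and optionally ".atommic" files) under an experiment directory. This is not ATOMMIC's implementation, just a sketch of the effect:

```python
import pathlib

# Hypothetical stand-in for clean_exp_ckpt: remove checkpoint artifacts
# under exp_dir according to the two flags.
def clean_checkpoints(exp_dir, remove_ckpt=True, remove_atommic=False):
    exp_dir = pathlib.Path(exp_dir)
    patterns = []
    if remove_ckpt:
        patterns.append("**/*.ckpt")
    if remove_atommic:
        patterns.append("**/*.atommic")
    removed = []
    for pattern in patterns:
        for path in exp_dir.glob(pattern):
            path.unlink()          # delete the file
            removed.append(path.name)
    return sorted(removed)
```

With the defaults above, the large PTL checkpoints are deleted while the compact ATOMMIC file is kept for later inspection.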
Debugging Multi-Run Scripts
When running Hydra scripts, you may sometimes face config issues that crash the program. In ATOMMIC Multi-Run, a crash in any one run will not crash the entire program; we simply take note of it and move on to the next job. Once all jobs are completed, we raise the errors in the order that they occurred (crashing the program with the first error's stack trace).
In order to debug Multi-Run, we suggest commenting out the full hyperparameter set inside sweeper.params and instead running just a single experiment with the config, which would immediately raise the error.
Experiment name cannot be parsed by Hydra
Sometimes our hyperparameters include PyTorch Lightning trainer arguments, such as the number of steps, the number of epochs, whether to use gradient accumulation, etc. When we attempt to add these as keys to the experiment manager's name, Hydra may complain that trainer.xyz cannot be resolved.
A simple solution is to finalize the Hydra config before you call exp_manager(), as follows:
from omegaconf import OmegaConf


@hydra_runner(...)
def main(cfg):
    # Make any changes as necessary to the config
    cfg.xyz.abc = uvw

    # Finalize the config (OmegaConf.resolve resolves interpolations in place)
    OmegaConf.resolve(cfg)

    # Carry on as normal by calling trainer and exp_manager
    trainer = pl.Trainer(**cfg.trainer)
    exp_log_dir = exp_manager(trainer, cfg.get("exp_manager", None))
    ...
ExpManagerConfig#
- class atommic.utils.exp_manager.ExpManagerConfig(explicit_log_dir: Optional[str] = None, exp_dir: Optional[str] = None, name: Optional[str] = None, version: Optional[str] = None, use_datetime_version: Optional[bool] = True, resume_if_exists: Optional[bool] = False, resume_past_end: Optional[bool] = False, resume_ignore_no_checkpoint: Optional[bool] = False, resume_from_checkpoint: Optional[str] = None, create_tensorboard_logger: Optional[bool] = True, summary_writer_kwargs: Optional[Dict[Any, Any]] = None, create_wandb_logger: Optional[bool] = False, wandb_logger_kwargs: Optional[Dict[Any, Any]] = None, create_checkpoint_callback: Optional[bool] = True, checkpoint_callback_params: Optional[CallbackParams] = CallbackParams(filepath=None, dirpath=None, filename=None, monitor='val_loss', verbose=True, save_last=True, save_top_k=3, save_weights_only=False, mode='min', auto_insert_metric_name=True, every_n_epochs=1, every_n_train_steps=None, train_time_interval=None, prefix=None, postfix='.atommic', save_best_model=False, always_save_atommic=False, save_atommic_on_train_end=True, model_parallel_size=None, save_on_train_epoch_end=False), create_early_stopping_callback: Optional[bool] = False, early_stopping_callback_params: Optional[EarlyStoppingParams] = EarlyStoppingParams(monitor='val_loss', mode='min', min_delta=0.001, patience=10, verbose=True, strict=True, check_finite=True, stopping_threshold=None, divergence_threshold=None, check_on_train_epoch_end=None, log_rank_zero_only=False), create_preemption_callback: Optional[bool] = True, files_to_copy: Optional[List[str]] = None, log_step_timing: Optional[bool] = True, step_timing_kwargs: Optional[StepTimingParams] = StepTimingParams(reduction='mean', sync_cuda=False, buffer_size=1), log_local_rank_0_only: Optional[bool] = False, log_global_rank_0_only: Optional[bool] = False, disable_validation_on_resume: Optional[bool] = True, ema: Optional[EMAParams] = EMAParams(enable=False, decay=0.999, cpu_offload=False, validate_original_weights=False, every_n_steps=1), max_time_per_run: Optional[str] = None, seconds_to_sleep: float = 5)[source]#
Bases:
object
Configuration for the experiment manager.
- explicit_log_dir: Optional[str] = None#
- exp_dir: Optional[str] = None#
- name: Optional[str] = None#
- version: Optional[str] = None#
- use_datetime_version: Optional[bool] = True#
- resume_if_exists: Optional[bool] = False#
- resume_past_end: Optional[bool] = False#
- resume_ignore_no_checkpoint: Optional[bool] = False#
- resume_from_checkpoint: Optional[str] = None#
- create_tensorboard_logger: Optional[bool] = True#
- summary_writer_kwargs: Optional[Dict[Any, Any]] = None#
- create_wandb_logger: Optional[bool] = False#
- wandb_logger_kwargs: Optional[Dict[Any, Any]] = None#
- create_checkpoint_callback: Optional[bool] = True#
- checkpoint_callback_params: Optional[CallbackParams] = CallbackParams(filepath=None, dirpath=None, filename=None, monitor='val_loss', verbose=True, save_last=True, save_top_k=3, save_weights_only=False, mode='min', auto_insert_metric_name=True, every_n_epochs=1, every_n_train_steps=None, train_time_interval=None, prefix=None, postfix='.atommic', save_best_model=False, always_save_atommic=False, save_atommic_on_train_end=True, model_parallel_size=None, save_on_train_epoch_end=False)#
- create_early_stopping_callback: Optional[bool] = False#
- early_stopping_callback_params: Optional[EarlyStoppingParams] = EarlyStoppingParams(monitor='val_loss', mode='min', min_delta=0.001, patience=10, verbose=True, strict=True, check_finite=True, stopping_threshold=None, divergence_threshold=None, check_on_train_epoch_end=None, log_rank_zero_only=False)#
- create_preemption_callback: Optional[bool] = True#
- files_to_copy: Optional[List[str]] = None#
- log_step_timing: Optional[bool] = True#
- step_timing_kwargs: Optional[StepTimingParams] = StepTimingParams(reduction='mean', sync_cuda=False, buffer_size=1)#
- log_local_rank_0_only: Optional[bool] = False#
- log_global_rank_0_only: Optional[bool] = False#
- disable_validation_on_resume: Optional[bool] = True#
- ema: Optional[EMAParams] = EMAParams(enable=False, decay=0.999, cpu_offload=False, validate_original_weights=False, every_n_steps=1)#
- max_time_per_run: Optional[str] = None#
- seconds_to_sleep: float = 5#