Weights & Biases (wandb)
Weights & Biases (wandb) is a "meta machine learning platform" designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By using wandb, users can track, compare, explain and reproduce their machine learning experiments.
Using wandb on Compute Canada clusters
Availability
Since it requires an internet connection, wandb has restricted availability on compute nodes, depending on the cluster:
Cluster | Availability | Note |
---|---|---|
Béluga | No ❌ | Wandb requires access to Google Cloud Storage, which is not accessible from the compute nodes |
Cedar | Yes ✅ | Internet access is enabled |
Graham | No ❌ | Internet access is disabled on compute nodes |
Béluga
Basic metrics can be uploaded to Weights & Biases during a job on Béluga. However, the wandb package also automatically uploads information about the user's environment to a Google Cloud Storage bucket, and this upload fails on Béluga's compute nodes, causing a crash during or at the very end of a training run. It is not currently possible to disable this behaviour. Uploading artifacts to W&B with wandb.save() also requires access to Google Cloud Storage, so it fails on Béluga's compute nodes for the same reason.
Users can still use wandb on Béluga by enabling the offline or dryrun mode. In either mode, wandb writes all metrics, logs and artifacts to the local disk and does not attempt to sync anything to the Weights & Biases service over the internet. After their jobs finish, users can upload their wandb content to the online service by running the command wandb sync on a login node.
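As an illustration of the mode selection described above, here is a minimal Python sketch that picks a wandb mode from the cluster name and exports it through the WANDB_MODE environment variable, which wandb reads at start-up. The helper names choose_wandb_mode and configure_wandb are hypothetical, and falling back to offline for every cluster other than Cedar (including Graham) is an assumption of this sketch, not a documented policy.

```python
import os
import unicodedata

# Clusters whose compute nodes have internet access, per the table above.
ONLINE_CLUSTERS = {"cedar"}


def _normalize(cluster: str) -> str:
    """Lowercase the name and strip accents, so 'Béluga' becomes 'beluga'."""
    decomposed = unicodedata.normalize("NFKD", cluster.strip())
    return decomposed.encode("ascii", "ignore").decode("ascii").lower()


def choose_wandb_mode(cluster: str) -> str:
    """Return 'online' for clusters with internet access, 'offline' otherwise.

    Assumption: any cluster not listed as online falls back to offline mode.
    """
    return "online" if _normalize(cluster) in ONLINE_CLUSTERS else "offline"


def configure_wandb(cluster: str) -> str:
    """Export the chosen mode via WANDB_MODE and return it.

    This must run before wandb.init() is called in the same process.
    """
    mode = choose_wandb_mode(cluster)
    os.environ["WANDB_MODE"] = mode
    return mode
```

Calling configure_wandb("Béluga") at the top of a training script, before wandb.init(), would have roughly the same effect as running wandb offline in the job script.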
Note that Comet.ml is a product very similar to Weights & Biases, and works on Béluga.
Example
The following is an example of how to use wandb to track experiments on Béluga. To reproduce this on Cedar, it is not necessary to enable the offline mode.
#!/bin/bash
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --cpus-per-task=1
#SBATCH --mem=4G
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out
module load python/3.8
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index wandb
# Save your wandb API key in your .bash_profile or replace $API_KEY with your actual API key.
# If running on Cedar, uncomment the line below and comment out 'wandb offline'.
#wandb login $API_KEY
wandb offline
python wandb-test.py
The script wandb-test.py does simple metric logging. See W&B's full documentation for more options.
import wandb
wandb.init(project="wandb-pytorch-test")
for my_metric in range(10):
    wandb.log({'my_metric': my_metric})
After a training run in offline mode, there will be a new folder ./wandb/offline-run*. You can send the metrics to the server using the command wandb sync ./wandb/offline-run*. Note that using * will sync all runs.
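Because the * wildcard syncs every offline run at once, it can help to list the run folders first and sync only the ones you want. The following is a minimal sketch that finds the offline-run* folders and prints one wandb sync command per run, to be reviewed and executed on a login node. The helper names find_offline_runs and sync_commands are hypothetical, and the sketch assumes the default ./wandb output directory.

```python
import glob
import os
import shlex


def find_offline_runs(wandb_dir="./wandb"):
    """Return the sorted list of offline run folders under wandb_dir."""
    pattern = os.path.join(wandb_dir, "offline-run*")
    # Keep directories only; ignore stray files that match the pattern.
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def sync_commands(run_dirs):
    """Build one 'wandb sync <dir>' command string per run folder."""
    return ["wandb sync " + shlex.quote(d) for d in run_dirs]


if __name__ == "__main__":
    # Print the commands instead of running them, so they can be reviewed
    # and executed one at a time on a login node.
    for command in sync_commands(find_offline_runs()):
        print(command)
```

Printing the commands rather than executing them keeps the sketch free of side effects. Copy the lines for the runs you want to upload and run them from the login node.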