Weights & Biases (wandb): Difference between revisions

(for python 3.8 need StdEnv/2020)
 
(24 intermediate revisions by 5 users not shown)
Line 1: Line 1:
<languages />
<languages />
[[Category:AI and Machine Learning]]
<translate>
<translate>
<!--T:1-->
<!--T:1-->
[https://wandb.ai Weights & Biases (wandb)] is a "meta machine learning platform" designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By using wandb, users can track, compare, explain and reproduce their machine learning experiments.
[https://wandb.ai Weights & Biases (wandb)] is a <i>meta machine learning platform</i> designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By using wandb, you can track, compare, explain and reproduce machine learning experiments.


== Using wandb on Compute Canada clusters == <!--T:2-->
== Using wandb on Alliance clusters == <!--T:2-->


=== Availability === <!--T:3-->  
=== Availability on compute nodes === <!--T:3-->  




<!--T:4-->
<!--T:4-->
Since it requires an internet connection, wandb has restricted availability on compute nodes, depending on the cluster:
Since it requires an internet connection, wandb has restricted availability on compute nodes, depending on the cluster:


<!--T:5-->
<!--T:5-->
Line 18: Line 18:
! Cluster !! Availability !! Note
! Cluster !! Availability !! Note
|-
|-
| Béluga || No ❌  || Wandb requires access to Google Cloud Storage, which is not available on Béluga
| Béluga || rowspan="2"| No ❌  || rowspan="2"| wandb requires access to Google Cloud Storage, which is not accessible from the compute nodes
|-
| Narval
|-
|-
| Cedar || Yes ✅ || Internet access is enabled
| Cedar || Yes ✅ || internet access is enabled
|-
|-
| Graham || No ❌ || Internet access is disabled on compute nodes
| Graham || No ❌ || internet access is disabled on compute nodes
|}
|}


=== Béluga ===
=== Béluga and Narval === <!--T:40-->
 
<!--T:41-->
While it is possible to upload basic metrics to Weights & Biases during a job on Béluga, the wandb package also automatically uploads information about your environment to a Google Cloud Storage bucket, which causes a crash during or at the very end of a training run. It is not currently possible to disable this behaviour. Uploading artifacts to W&B with <code>wandb.save()</code> also requires access to Google Cloud Storage, which is not available on Béluga's compute nodes.


While it is possible to upload basic metrics to Weights&Biases during a job on Béluga, the wandb package automatically uploads information about the user's environment to a Google Cloud Storage bucket. It is not currently possible to disable this behaviour. Uploading artifacts to W&B with <tt>wandb.save()</tt> also requires access to Google Cloud Storage, which is not available on Béluga's compute nodes.
<!--T:42-->
You can still use wandb on Béluga by enabling the [https://docs.wandb.ai/library/cli#wandb-offline <code>offline</code>] mode. In this mode, wandb writes all metrics, logs and artifacts to the local disk and does not attempt to sync anything to the Weights & Biases service on the internet. After your jobs finish running, you can sync their wandb content to the online service by running the command [https://docs.wandb.ai/ref/cli#wandb-sync <code>wandb sync</code>] on the login node.


Users can still use wandb on Béluga by enabling the [https://docs.wandb.ai/library/cli#wandb-offline <tt>offline</tt>] or [https://docs.wandb.ai/library/init#save-logs-offline <tt>dryrun</tt>] modes. In these two modes, wandb will write all metrics, logs and artifacts to the local disk and will not attempt to sync anything to the Weights&Biases service on the internet. After their jobs finish running, users can sync their wandb content to the online service by running the command [https://docs.wandb.ai/ref/cli#wandb-sync <tt>wandb sync</tt>] on the login node.
<!--T:46-->
Note that [[Comet.ml]] is a product very similar to Weights & Biases, and works on Béluga.


=== Example === <!--T:6-->
=== Example === <!--T:6-->
Line 42: Line 49:
   |contents=
   |contents=
#!/bin/bash
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --mem=2G        
#SBATCH --cpus-per-task=2 # At least two CPUs are recommended: one for the main process and one for the wandb process
#SBATCH --mem=4G        
#SBATCH --time=0-03:00
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out
#SBATCH --output=%N-%j.out
Line 49: Line 57:


<!--T:9-->
<!--T:9-->
module load StdEnv/2020 python/3.8
virtualenv --no-download $SLURM_TMPDIR/env
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
source $SLURM_TMPDIR/env/bin/activate
pip install torchvision wandb --no-index
pip install --no-index wandb


### Save your wandb API key in your .bash_profile or replace $API_KEY with your actual API key. Uncomment the line below and comment out 'wandb offline'. if running on Cedar ###
<!--T:43-->
### Save your wandb API key in your .bash_profile or replace $API_KEY with your actual API key. Uncomment the line below and comment out "wandb offline" if running on Cedar ###


<!--T:44-->
#wandb login $API_KEY  
#wandb login $API_KEY  


<!--T:45-->
wandb offline
wandb offline


Line 64: Line 76:


<!--T:13-->
<!--T:13-->
The script wandb-test.py uses the <tt>watch()</tt> method to log default metrics to Weights & Biases. See their [https://docs.wandb.ai full documentation] for more options.
The script wandb-test.py is a simple example of metric logging. See [https://docs.wandb.ai W&B's full documentation] for more options.


<!--T:14-->
<!--T:14-->
Line 71: Line 83:
   |lang="python"
   |lang="python"
   |contents=
   |contents=
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.backends.cudnn as cudnn
<!--T:15-->
import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
<!--T:16-->
import argparse
<!--T:17-->
import wandb
import wandb


<!--T:47-->
wandb.init(project="wandb-pytorch-test", settings=wandb.Settings(start_method="fork"))


<!--T:18-->
<!--T:48-->
parser = argparse.ArgumentParser(description='cifar10 classification models, wandb test')
for my_metric in range(10):
parser.add_argument('--lr', default=0.1, help='')
    wandb.log({'my_metric': my_metric})
parser.add_argument('--batch_size', type=int, default=768, help='')
parser.add_argument('--max_epochs', type=int, default=4, help='')
parser.add_argument('--num_workers', type=int, default=0, help='')


<!--T:19-->
<!--T:39-->
def main():
}}
   
    args = parser.parse_args()
 
    <!--T:20-->
print("Starting Wandb...")
 
    <!--T:21-->
wandb.init(project="wandb-pytorch-test", config=args)
 
    <!--T:22-->
class Net(nn.Module):
 
      <!--T:23-->
def __init__(self):
          super(Net, self).__init__()
 
          <!--T:24-->
self.conv1 = nn.Conv2d(3, 6, 5)
          self.pool = nn.MaxPool2d(2, 2)
          self.conv2 = nn.Conv2d(6, 16, 5)
          self.fc1 = nn.Linear(16 * 5 * 5, 120)
          self.fc2 = nn.Linear(120, 84)
          self.fc3 = nn.Linear(84, 10)
 
      <!--T:25-->
def forward(self, x):
          x = self.pool(F.relu(self.conv1(x)))
          x = self.pool(F.relu(self.conv2(x)))
          x = x.view(-1, 16 * 5 * 5)
          x = F.relu(self.fc1(x))
          x = F.relu(self.fc2(x))
          x = self.fc3(x)
          return x
 
    <!--T:26-->
net = Net()
 
    <!--T:27-->
transform_train = transforms.Compose([transforms.ToTensor(),transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])
 
    <!--T:28-->
dataset_train = CIFAR10(root='./data', train=True, download=False, transform=transform_train)
 
    <!--T:29-->
train_loader = DataLoader(dataset_train, batch_size=args.batch_size, num_workers=args.num_workers)
 
    <!--T:30-->
criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=args.lr)
 
    <!--T:31-->
wandb.watch(net)
 
    <!--T:32-->
for epoch in range(args.max_epochs):
 
        <!--T:33-->
train(epoch, net, criterion, optimizer, train_loader)


<!--T:49-->
After a training run in offline mode, there will be a new folder <code>./wandb/offline-run*</code>. You can send the metrics to the server using the command <code>wandb sync ./wandb/offline-run*</code>. Note that the <code>*</code> wildcard matches every offline run in that folder, so all of them will be synced.


<!--T:34-->
def train(epoch, net, criterion, optimizer, train_loader):
    <!--T:35-->
for batch_idx, (inputs, targets) in enumerate(train_loader):
      <!--T:36-->
outputs = net(inputs)
      loss = criterion(outputs, targets)
      <!--T:37-->
optimizer.zero_grad()
      loss.backward()
      optimizer.step()
<!--T:38-->
if __name__=='__main__':
  main()
<!--T:39-->
}}
</translate>
</translate>

Latest revision as of 17:57, 12 July 2024

Weights & Biases (wandb) is a meta machine learning platform designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By using wandb, you can track, compare, explain and reproduce machine learning experiments.

Using wandb on Alliance clusters

Availability on compute nodes

Since it requires an internet connection, wandb has restricted availability on compute nodes, depending on the cluster:

{|
! Cluster !! Availability !! Note
|-
| Béluga || rowspan="2"| No ❌ || rowspan="2"| wandb requires access to Google Cloud Storage, which is not accessible from the compute nodes
|-
| Narval
|-
| Cedar || Yes ✅ || internet access is enabled
|-
| Graham || No ❌ || internet access is disabled on compute nodes
|}

Béluga and Narval

While it is possible to upload basic metrics to Weights & Biases during a job on Béluga, the wandb package also automatically uploads information about your environment to a Google Cloud Storage bucket, which causes a crash during or at the very end of a training run. It is not currently possible to disable this behaviour. Uploading artifacts to W&B with wandb.save() also requires access to Google Cloud Storage, which is not available on Béluga's compute nodes.

You can still use wandb on Béluga by enabling the offline mode. In this mode, wandb writes all metrics, logs and artifacts to the local disk and does not attempt to sync anything to the Weights & Biases service on the internet. After your jobs finish running, you can sync their wandb content to the online service by running the command wandb sync on the login node.
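
As an alternative to running wandb offline in the job script, the mode can also be chosen from Python before wandb is initialized. The sketch below is only an illustration: the dictionary transcribes the availability table above, the helper names select_wandb_mode and configure_wandb are hypothetical, and it assumes the installed wandb version honours the WANDB_MODE environment variable.

```python
import os

# Compute-node availability, transcribed from the table above.
# Keys are plain ASCII, lowercase cluster names (an assumption of this sketch).
WANDB_ONLINE_ON_COMPUTE_NODES = {
    "beluga": False,
    "narval": False,
    "cedar": True,
    "graham": False,
}


def select_wandb_mode(cluster):
    """Return "online" if the cluster's compute nodes can reach W&B, else "offline"."""
    # Unknown clusters fall back to offline, which never needs internet access.
    online = WANDB_ONLINE_ON_COMPUTE_NODES.get(cluster.lower(), False)
    return "online" if online else "offline"


def configure_wandb(cluster):
    """Export WANDB_MODE for the current process and return the chosen mode."""
    mode = select_wandb_mode(cluster)
    # Must be set before wandb.init() is called for it to take effect.
    os.environ["WANDB_MODE"] = mode
    return mode
```

Calling configure_wandb("narval") at the top of a training script, before wandb.init(), would keep the run offline; the run can then be synced from a login node as described above.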

Note that Comet.ml is a product very similar to Weights & Biases, and works on Béluga.

Example

The following is an example of how to use wandb to track experiments on Béluga. To reproduce this on Cedar, it is not necessary to enable the offline mode.


File : wandb-test.sh

#!/bin/bash
#SBATCH --account=YOUR_ACCOUNT
#SBATCH --cpus-per-task=2 # At least two CPUs are recommended: one for the main process and one for the wandb process
#SBATCH --mem=4G       
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out


module load StdEnv/2020 python/3.8
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install --no-index wandb

### Save your wandb API key in your .bash_profile or replace $API_KEY with your actual API key. Uncomment the line below and comment out "wandb offline" if running on Cedar ###

#wandb login $API_KEY 

wandb offline

python wandb-test.py


The script wandb-test.py is a simple example of metric logging. See W&B's full documentation for more options.


File : wandb-test.py

import wandb

wandb.init(project="wandb-pytorch-test", settings=wandb.Settings(start_method="fork"))

for my_metric in range(10):
    wandb.log({'my_metric': my_metric})


After a training run in offline mode, there will be a new folder ./wandb/offline-run*. You can send the metrics to the server using the command wandb sync ./wandb/offline-run*. Note that the * wildcard matches every offline run in that folder, so all of them will be synced.
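
If you would rather review or sync runs one at a time instead of relying on the wildcard, a small helper can list the offline run folders first. This is a minimal sketch, not part of wandb itself: the function names are hypothetical, and it assumes the runs live under ./wandb and that the wandb command is on your PATH on the login node.

```python
import glob
import os
import subprocess


def find_offline_runs(wandb_dir="./wandb"):
    """Return the offline run folders under wandb_dir, sorted by name."""
    pattern = os.path.join(wandb_dir, "offline-run*")
    return sorted(p for p in glob.glob(pattern) if os.path.isdir(p))


def sync_commands(run_dirs):
    """Build one 'wandb sync <folder>' command per run folder."""
    return [["wandb", "sync", run_dir] for run_dir in run_dirs]


def sync_runs(run_dirs, dry_run=True):
    """Print the sync commands, or execute them when dry_run is False."""
    for cmd in sync_commands(run_dirs):
        if dry_run:
            print(" ".join(cmd))
        else:
            # Needs internet access, so run this on a login node.
            subprocess.run(cmd, check=True)


if __name__ == "__main__":
    sync_runs(find_offline_runs())
```

With dry_run=True (the default here) nothing is uploaded; the commands are only printed so you can check which runs would be synced.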