Weights & Biases (wandb)
Weights & Biases (wandb) is a "meta machine learning platform" designed to help AI practitioners and teams build reliable machine learning models for real-world applications by streamlining the machine learning model lifecycle. By using wandb, users can track, compare, explain and reproduce their machine learning experiments.
Using wandb on Compute Canada clusters
Availability
Since it requires an internet connection, wandb has restricted availability on compute nodes, depending on the cluster:
| Cluster | Availability | Note |
|---------|--------------|------|
| Béluga | No ❌ | wandb requires access to Google Cloud Storage, which is not available on Béluga |
| Cedar | Yes ✅ | Internet access is enabled |
| Graham | No ❌ | Internet access is disabled on compute nodes |
Béluga
While it is possible to upload basic metrics to Weights & Biases during a job on Béluga, the wandb package automatically uploads information about the user's environment to a Google Cloud Storage bucket. It is not currently possible to disable this behaviour. Uploading artifacts to W&B with wandb.save() also requires access to Google Cloud Storage, which is not available on Béluga.
Users can still use wandb on Béluga by enabling the offline or dryrun modes. In these two modes, wandb writes all metrics, logs and artifacts to the local disk and does not attempt to sync anything to the Weights & Biases service over the internet. After their jobs finish running, users can sync their wandb content to the online service by running the command wandb sync on a login node.
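As a sketch of this workflow (with default settings, offline runs are written under ./wandb/; the exact directory names vary per run):

# In the job script, on a compute node: disable syncing before training.
wandb offline                    # alternatively: export WANDB_MODE=offline
python wandb-test.py             # wandb writes everything under ./wandb/offline-run-*

# Later, on a login node with internet access: upload the stored runs.
wandb sync wandb/offline-run-*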
Example
The following is an example of how to use wandb to track experiments on Béluga. To reproduce it on Cedar, where compute nodes have internet access, enabling the offline mode is not necessary.
#!/bin/bash
#SBATCH --cpus-per-task=1
#SBATCH --mem=2G
#SBATCH --time=0-03:00
#SBATCH --output=%N-%j.out
module load python/3.6 httpproxy
virtualenv --no-download $SLURM_TMPDIR/env
source $SLURM_TMPDIR/env/bin/activate
pip install torchvision wandb --no-index
wandb offline  # write logs and artifacts to local disk instead of syncing over the internet
python wandb-test.py
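Assuming the job script above is saved as wandb-test.sh (the file name here is arbitrary), it can be submitted from a login node in the usual way; once the job finishes, sync the run to the online service as described above.

sbatch wandb-test.sh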
The script wandb-test.py uses the wandb.watch() method to automatically log the model's gradients and parameters to Weights & Biases. See the full wandb documentation for more options.
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
import torch.backends.cudnn as cudnn
import torchvision
import torchvision.transforms as transforms
from torchvision.datasets import CIFAR10
from torch.utils.data import DataLoader
import argparse
import wandb
parser = argparse.ArgumentParser(description='cifar10 classification models, wandb test')
parser.add_argument('--lr', type=float, default=0.1, help='')
parser.add_argument('--batch_size', type=int, default=768, help='')
parser.add_argument('--max_epochs', type=int, default=4, help='')
parser.add_argument('--num_workers', type=int, default=0, help='')
def main():
    args = parser.parse_args()

    print("Starting Wandb...")
    # Start a new wandb run; the argparse namespace is saved as the run's config.
    wandb.init(project="wandb-pytorch-test", config=args)

    class Net(nn.Module):

        def __init__(self):
            super(Net, self).__init__()
            self.conv1 = nn.Conv2d(3, 6, 5)
            self.pool = nn.MaxPool2d(2, 2)
            self.conv2 = nn.Conv2d(6, 16, 5)
            self.fc1 = nn.Linear(16 * 5 * 5, 120)
            self.fc2 = nn.Linear(120, 84)
            self.fc3 = nn.Linear(84, 10)

        def forward(self, x):
            x = self.pool(F.relu(self.conv1(x)))
            x = self.pool(F.relu(self.conv2(x)))
            x = x.view(-1, 16 * 5 * 5)
            x = F.relu(self.fc1(x))
            x = F.relu(self.fc2(x))
            x = self.fc3(x)
            return x

    net = Net()

    transform_train = transforms.Compose([transforms.ToTensor(), transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5))])

    # The CIFAR-10 data must be downloaded beforehand, since compute nodes have no internet access.
    dataset_train = CIFAR10(root='./data', train=True, download=False, transform=transform_train)
    train_loader = DataLoader(dataset_train, batch_size=args.batch_size, num_workers=args.num_workers)

    criterion = nn.CrossEntropyLoss()
    optimizer = optim.SGD(net.parameters(), lr=args.lr)

    # Log the model's gradients and parameters during training.
    wandb.watch(net)

    for epoch in range(args.max_epochs):
        train(epoch, net, criterion, optimizer, train_loader)


def train(epoch, net, criterion, optimizer, train_loader):
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        outputs = net(inputs)
        loss = criterion(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()


if __name__ == '__main__':
    main()
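Besides the automatic logging from watch(), you can record your own metrics explicitly with wandb.log(). As a minimal sketch, the train() function above could be extended as follows; the metric names epoch and train_loss are arbitrary choices for this example.

def train(epoch, net, criterion, optimizer, train_loader):
    for batch_idx, (inputs, targets) in enumerate(train_loader):
        outputs = net(inputs)
        loss = criterion(outputs, targets)

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Record the batch loss; in offline mode this is written to local
        # disk and uploaded later by 'wandb sync'.
        wandb.log({"epoch": epoch, "train_loss": loss.item()})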