Ray

To run this example, you can use one of the job submission templates provided [[#Job_submission | above]], depending on whether you require one or multiple nodes. As you will see in the code that follows, the amount of resources required by your job depends mainly on two factors: the number of samples you wish to draw from the search space and the size of your model in memory. Knowing these two things, you can work out how many trials you will run in total and how many of them can run in parallel using as few resources as possible. For example, how many copies of your model fit inside the memory of a single GPU? That is the number of trials you can run in parallel using just one GPU.
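As a back-of-the-envelope illustration of this reasoning, consider the sketch below. The 16 GB of GPU memory is an assumed figure chosen purely for illustration; check the GPU type available on your cluster.

<syntaxhighlight lang="python">
# Rough estimate: how many trials can share one GPU at the same time?
gpu_memory_gb = 16    # assumed GPU memory; depends on your cluster's GPU type
model_memory_gb = 1   # approximate memory footprint of one copy of the model

# Each concurrent trial holds its own copy of the model in GPU memory.
max_parallel_trials = gpu_memory_gb // model_memory_gb

# Ray Tune accepts fractional GPU requests, so each trial asks for 1/N of the GPU.
gpus_per_trial = 1 / max_parallel_trials

print(max_parallel_trials, gpus_per_trial)  # 16 trials, 0.0625 GPU each
</syntaxhighlight>

In practice you should leave headroom for activations, the CUDA context, and the framework itself; the example below uses a more conservative 10 parallel trials (0.1 GPU each) for a model of about 1 GB.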

In the example, our model takes up about 1 GB in memory. We will run 20 trials in total, 10 in parallel at a time on the same GPU, and we will give one CPU to each trial to be used as a <code>DataLoader</code> worker. So we will pick the single-node job submission template, replace the number of CPUs per task with <code>#SBATCH --cpus-per-task=10</code>, and replace the Python call with <code>python ray-tune-example.py --num_samples=20 --cpus-per-trial=1 --gpus-per-trial=0.1</code>. We will also need to install the packages <code>ray[tune]</code> and <code>torchvision</code> in our virtualenv.
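For orientation, here is a minimal sketch of how <code>ray-tune-example.py</code> could wire these command-line arguments into Ray Tune, assuming the long-standing <code>tune.run</code> API. The trainable and search space are placeholders for illustration, not the model from the full example, which follows in the file below.

<syntaxhighlight lang="python">
import argparse

from ray import tune

parser = argparse.ArgumentParser()
parser.add_argument("--num_samples", type=int, default=20)        # total trials to draw
parser.add_argument("--cpus-per-trial", type=int, default=1)      # CPUs per trial
parser.add_argument("--gpus-per-trial", type=float, default=0.1)  # fraction of a GPU per trial
args = parser.parse_args()


def train_fn(config):
    # Placeholder trainable: a real script would build the model, train it,
    # and report validation metrics each epoch. Returning a dict reports it
    # as the trial's final result.
    return {"loss": config["lr"]}


analysis = tune.run(
    train_fn,
    config={"lr": tune.loguniform(1e-4, 1e-1)},  # hypothetical search space
    num_samples=args.num_samples,                # 20 trials in total
    resources_per_trial={
        "cpu": args.cpus_per_trial,  # one CPU per trial for the DataLoader worker
        "gpu": args.gpus_per_trial,  # 0.1 GPU, so 10 trials share a single GPU
    },
)
print("Best config:", analysis.get_best_config(metric="loss", mode="min"))
</syntaxhighlight>

Note that Ray's fractional GPU requests are bookkeeping only: Ray does not enforce memory isolation between trials, so a request of 0.1 GPU works only if 10 copies of your model actually fit in the GPU's memory.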


{{File