AI and Machine Learning

On a distributed filesystem, data should be stored in large single-file archives. On this subject, please refer to [[Handling large collections of files]].
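As a minimal sketch of this idea, the Python snippet below (the archive name and the <code>data/</code> directory are hypothetical) bundles many small files into a single tar archive and then reads samples directly from that archive, so the filesystem only has to handle one large file.

<syntaxhighlight lang="python">
import tarfile

# Done once, e.g. in a data-preparation job: bundle the many small files
# under the (hypothetical) data/ directory into one archive.
with tarfile.open("dataset.tar", "w") as archive:
    archive.add("data/")

# During training, read samples straight from the single archive file
# instead of opening thousands of individual files.
with tarfile.open("dataset.tar", "r") as archive:
    for member in archive.getmembers():
        if member.isfile():
            sample_bytes = archive.extractfile(member).read()
            # ... decode sample_bytes and feed it to the training pipeline ...
</syntaxhighlight>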
= Long running computations =
If your computations are long, you should use checkpointing. For example, if your training takes 3 days, you could split it into three chunks of 24 hours each. This prevents you from losing all your work in case of an outage, and improves your scheduling priority, since more nodes are available for short jobs. Most machine learning libraries natively support checkpointing. Please see our suggestions about [[Running jobs#Resubmitting_jobs_for_long_running_computations|resubmitting jobs for long running computations]]. If your program does not natively support checkpointing, we provide a [[Points de contrôle|general checkpointing solution]].
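As an illustration only, here is a minimal checkpointing sketch with PyTorch; the model, optimizer and checkpoint path are placeholders, and higher-level libraries (e.g. PyTorch Lightning or Keras) provide built-in equivalents. The key point is that the script resumes from the last saved state when it is restarted, so a requeued job loses at most one epoch of work.

<syntaxhighlight lang="python">
import os
import torch
from torch import nn, optim

CHECKPOINT_PATH = "checkpoint.pt"  # placeholder path on the cluster filesystem

model = nn.Linear(10, 1)           # placeholder model
optimizer = optim.SGD(model.parameters(), lr=0.01)
start_epoch = 0

# Resume from the last checkpoint if one exists (e.g. after the job was requeued).
if os.path.exists(CHECKPOINT_PATH):
    state = torch.load(CHECKPOINT_PATH)
    model.load_state_dict(state["model"])
    optimizer.load_state_dict(state["optimizer"])
    start_epoch = state["epoch"] + 1

for epoch in range(start_epoch, 100):
    # Placeholder training step on random data; a real epoch would iterate a DataLoader.
    inputs, targets = torch.randn(32, 10), torch.randn(32, 1)
    loss = nn.functional.mse_loss(model(inputs), targets)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

    # Save a checkpoint after every epoch so at most one epoch of work can be lost.
    torch.save(
        {"epoch": epoch,
         "model": model.state_dict(),
         "optimizer": optimizer.state_dict()},
        CHECKPOINT_PATH,
    )
</syntaxhighlight>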
= Running many similar jobs =
If you are in one of these situations:
* Hyperparameter search
* Training many variants of the same method
* Running many optimization processes of similar duration
... you should consider grouping many jobs into one. [[GLOST]] and [[GNU Parallel]] are available to help you with this.
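As a sketch of what this looks like in practice, the hypothetical script below (<code>train_variant.py</code>, along with its argument names) takes its hyperparameters on the command line, so a single job can launch one invocation per configuration through GNU Parallel, as shown in the comment.

<syntaxhighlight lang="python">
# train_variant.py -- hypothetical example: one invocation trains one variant.
# From a single job, GNU Parallel can launch every combination, for example:
#   parallel python train_variant.py --lr {1} --batch-size {2} ::: 0.1 0.01 0.001 ::: 32 64
import argparse

def main():
    parser = argparse.ArgumentParser(description="Train one variant of the model.")
    parser.add_argument("--lr", type=float, required=True, help="learning rate")
    parser.add_argument("--batch-size", type=int, required=True, help="mini-batch size")
    args = parser.parse_args()

    # ... build the model and data pipeline, then train with these hyperparameters ...
    print(f"Training variant with lr={args.lr}, batch_size={args.batch_size}")

    # Write results to a per-variant file so concurrent runs do not overwrite each other.
    with open(f"result_lr{args.lr}_bs{args.batch_size}.txt", "w") as results:
        results.write("final metrics for this variant would be recorded here\n")

if __name__ == "__main__":
    main()
</syntaxhighlight>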