Translations:AI and Machine Learning/16/en: Difference between revisions
Latest revision as of 17:10, 8 October 2019
If your computations are long, you should use checkpointing. For example, if your training takes 3 days, split it into 3 chunks of 24 hours. This prevents you from losing all your work in case of an outage, and gives you an edge in terms of priority, since more nodes are available for short jobs. Most machine learning libraries natively support checkpointing; the typical case is covered in our tutorial. If your program does not natively support this, we provide a general checkpointing solution.
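The idea of splitting a long run into shorter resumable jobs can be illustrated with a minimal sketch. It is not the method from the tutorial or the general checkpointing solution mentioned above; it only uses the Python standard library, and the names `save_checkpoint`, `load_checkpoint`, `train`, and the `epochs_per_job` parameter are hypothetical, as is the placeholder "training" step. A real job would store model and optimizer state using its library's own checkpoint functions.

```python
import os
import pickle
import tempfile


def save_checkpoint(path, state):
    """Write state to a temporary file, then rename it over the checkpoint.

    The rename is atomic, so an outage during the write cannot leave a
    half-written checkpoint behind.
    """
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory)
    with os.fdopen(fd, "wb") as f:
        pickle.dump(state, f)
    os.replace(tmp_path, path)


def load_checkpoint(path):
    """Return the saved state, or a fresh state if no checkpoint exists yet."""
    if not os.path.exists(path):
        return {"epoch": 0, "weights": 0.0}
    with open(path, "rb") as f:
        return pickle.load(f)


def train(path, total_epochs, epochs_per_job):
    """Resume from the checkpoint and run at most epochs_per_job more epochs."""
    state = load_checkpoint(path)
    stop = min(state["epoch"] + epochs_per_job, total_epochs)
    while state["epoch"] < stop:
        state["weights"] += 1.0  # hypothetical stand-in for one real training epoch
        state["epoch"] += 1
        save_checkpoint(path, state)  # checkpoint after every completed epoch
    return state
```

Under these assumptions, each short job calls `train("checkpoint.pkl", total_epochs, epochs_per_job)` and exits; submitting the same job again continues from the last saved epoch, and a job submitted after training has finished does no further work.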