}}
==Creating Model Checkpoints==
Whether or not you expect your code to run for a long time, it is a good habit to create checkpoints during training. A checkpoint is a snapshot of your model at a given point in the training process (after a certain number of iterations or epochs) that is saved to disk and can be loaded later. Checkpoints are a handy way of breaking a job that is expected to run for a very long time into multiple shorter jobs, which may get allocated on the cluster more quickly. They also help you avoid losing progress in case of unexpected errors in your code or node failures.
===With Keras===
To create a checkpoint when training with Keras, we recommend using the <code>callbacks</code> parameter of the <code>model.fit()</code> method. The following example shows how to instruct TensorFlow to create a checkpoint at the end of every training epoch:
import tensorflow as tf
# Make sure the directory where the checkpoint will be written exists.
callbacks = [tf.keras.callbacks.ModelCheckpoint(filepath="./ckpt", save_freq="epoch")]
model.fit(dataset, epochs=10, callbacks=callbacks)
For more information, please refer to the [https://www.tensorflow.org/api_docs/python/tf/keras/callbacks/ModelCheckpoint official TensorFlow documentation].
===With a Custom Training Loop===
Please refer to the [https://www.tensorflow.org/guide/checkpoint#writing_checkpoints official TensorFlow documentation].
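As a rough illustration of what checkpointing in a custom training loop can look like, here is a minimal sketch built on <code>tf.train.Checkpoint</code> and <code>tf.train.CheckpointManager</code>. The toy model, the random data, the <code>./ckpt_custom</code> directory and the <code>epoch_counter</code> variable are all assumptions made for this example and would need to be replaced with your own. Details may also vary between TensorFlow versions, so treat this as a starting point rather than a reference implementation.

```python
import tensorflow as tf

# Hypothetical toy model and random data, only here to make the sketch self-contained.
model = tf.keras.Sequential([tf.keras.Input(shape=(4,)), tf.keras.layers.Dense(1)])
optimizer = tf.keras.optimizers.SGD(learning_rate=0.01)
features = tf.random.normal((32, 4))
labels = tf.random.normal((32, 1))
dataset = tf.data.Dataset.from_tensor_slices((features, labels)).batch(8)

# Track the objects whose state should be saved, plus a counter of completed epochs.
epoch_counter = tf.Variable(0, dtype=tf.int64)
checkpoint = tf.train.Checkpoint(model=model, optimizer=optimizer, epoch=epoch_counter)

# Assumed checkpoint directory; keep only the three most recent checkpoints.
manager = tf.train.CheckpointManager(checkpoint, directory="./ckpt_custom", max_to_keep=3)

# Resume from the latest checkpoint if one exists; with no checkpoint this is a no-op.
checkpoint.restore(manager.latest_checkpoint)

total_epochs = 3
for epoch in range(int(epoch_counter.numpy()), total_epochs):
    for x, y in dataset:
        with tf.GradientTape() as tape:
            loss = tf.reduce_mean(tf.square(model(x, training=True) - y))
        grads = tape.gradient(loss, model.trainable_variables)
        optimizer.apply_gradients(zip(grads, model.trainable_variables))
    epoch_counter.assign(epoch + 1)
    manager.save()  # write one checkpoint at the end of each epoch
```

Because the epoch counter is stored in the checkpoint, resubmitting the same script as a new job continues from the last completed epoch instead of starting over.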
==Custom Operators==