Epoch limit reached #9
Comments
@rakibhasan48 what is your OS, and how many GB of RAM do you have?
I am also getting the same error. What is the solution?
I tested on AWS with a Titan Xp, so specs shouldn't be a problem.
check out the
I am running the following:
python train.py --slices 55 --width 12 --stride 1 --Bwidth 350 --vocabulary_size 29 \
--height 25 --train_data_pattern ./tf-data/handwritten-test-{}.tfrecords --train_dir models-feds \
--test_data_pattern ./tf-data/handwritten-test-{}.tfrecords --max_steps 20 --batch_size 20 --beam_size 1 \
--input_chanels 1 --start_new_model --rnn_cell LSTM --model LSTMCTCModel --num_epochs 6000
Output
FutureWarning: Conversion of the second argument of issubdtype from `float` to `np.floating` is deprecated. In future, it will be treated as `np.float64 == np.dtype(float).type`.
from ._conv import register_converters as _register_converters
INFO:tensorflow:/job:master/task:0: Tensorflow version: 1.1.0.
(8750, '', 25, 350, 1)
[20, 25, 350, 1]
0
[20, 25, None, 1]
INFO:tensorflow:/job:master/task:0: Removing existing train directory.
INFO:tensorflow:/job:master/task:0: Flag 'start_new_model' is set. Building a new model.
INFO:tensorflow:Using batch size of 20 for training.
tf-data/handwritten-test-{}.tfrecords
INFO:tensorflow:Number of training files: 3.
(8750, '', 25, 350, 1)
(8750, '', 25, 350, 1)
INFO:tensorflow:Using batch size of 20 for testing.
tf-data/handwritten-test-{}.tfrecords
INFO:tensorflow:Number of testing files: 3.
(8750, '', 25, 350, 1)
(8750, '********************', 25, 350, 1)
Tensor("Reshape:0", shape=(20, 25, 350, 1), dtype=float32)
[20, 25, 350, 1]
0
[20, 25, None, 1]
INFO:tensorflow:/job:master/task:0: Starting managed session.
2018-03-19 18:25:13.630111: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.1 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630151: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use SSE4.2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630159: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630174: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use AVX2 instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:13.630180: W tensorflow/core/platform/cpu_feature_guard.cc:45] The TensorFlow library wasn't compiled to use FMA instructions, but these are available on your machine and could speed up CPU computations.
2018-03-19 18:25:16.371931: I tensorflow/stream_executor/cuda/cuda_gpu_executor.cc:901] successful NUMA node read from SysFS had negative value (-1), but there must be at least one NUMA node, so returning NUMA node zero
2018-03-19 18:25:16.372284: I tensorflow/core/common_runtime/gpu/gpu_device.cc:887] Found device 0 with properties:
name: Tesla K80
major: 3 minor: 7 memoryClockRate (GHz) 0.8235
pciBusID 0000:00:1e.0
Total memory: 11.17GiB
Free memory: 11.10GiB
2018-03-19 18:25:16.372318: I tensorflow/core/common_runtime/gpu/gpu_device.cc:908] DMA: 0
2018-03-19 18:25:16.372332: I tensorflow/core/common_runtime/gpu/gpu_device.cc:918] 0: Y
2018-03-19 18:25:16.372347: I tensorflow/core/common_runtime/gpu/gpu_device.cc:977] Creating TensorFlow device (/gpu:0) -> (device: 0, name: Tesla K80, pci bus id: 0000:00:1e.0)
INFO:tensorflow:/job:master/task:0: Entering training loop.
INFO:tensorflow:global_step/sec: 0
INFO:tensorflow:models-feds/model.ckpt-0 is not in all_model_checkpoint_paths. Manually adding it.
2018-03-19 18:25:19.788531: W tensorflow/core/kernels/queue_base.cc:302] _3_test_input/shuffle_batch_join/random_shuffle_queue: Skipping cancelled dequeue attempt with queue not closed
2018-03-19 18:25:19.789465: W tensorflow/core/kernels/queue_base.cc:302] _3_test_input/shuffle_batch_join/random_shuffle_queue: Skipping cancelled dequeue attempt with queue not closed
INFO:tensorflow:/job:master/task:0: Done training -- epoch limit reached.
INFO:tensorflow:/job:master/task:0: Exited training loop.
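One thing worth ruling out here: in TF 1.x queue-based input pipelines, the training loop stops as soon as the input queue raises an out-of-range error, and the "epoch limit reached" message is presumably logged at that point. That can happen right away if the matched .tfrecords files contain no readable records. Below is a minimal sketch for counting records without TensorFlow. It only walks the TFRecord framing (8-byte length, 4-byte length CRC, payload, 4-byte payload CRC) and does not verify checksums. The helper names `count_tfrecords` and `count_pattern` are hypothetical, and treating `{}` in the pattern as a wildcard is an assumption about how train.py expands it.

```python
import glob
import struct


def count_tfrecords(path):
    """Count records in one TFRecord file by walking its length-prefixed framing."""
    count = 0
    with open(path, "rb") as f:
        while True:
            header = f.read(8)  # little-endian uint64: payload length
            if not header:
                break  # clean end of file
            if len(header) < 8:
                raise ValueError("truncated length header in %s" % path)
            (length,) = struct.unpack("<Q", header)
            # Read past the 4-byte length CRC, the payload, and the 4-byte payload CRC.
            rest = f.read(4 + length + 4)
            if len(rest) < length + 8:
                raise ValueError("truncated record in %s" % path)
            count += 1
    return count


def count_pattern(pattern):
    """Map each file matching the pattern to its record count.

    Assumption: '{}' in the pattern stands for a shard index, so it is
    replaced with '*' for globbing.
    """
    paths = sorted(glob.glob(pattern.replace("{}", "*")))
    return {path: count_tfrecords(path) for path in paths}


if __name__ == "__main__":
    # Pattern taken from the command in this issue.
    for path, n in count_pattern("./tf-data/handwritten-test-{}.tfrecords").items():
        print(path, n)
```

If every file reports zero records, or the script raises on a truncated record, the data files are the likely culprit rather than the machine specs.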