Groggy4 said:
It happens sometimes when you get a slow cpu/ram assigned. If you terminate session and start fresh with a "high ram" runtime, it should work. Might work with a regular runtime as well. 90% of the time it does. At least with P100 assigned. Whenever you start a regular runtime with T4, it fails every time.
Unfortunately, it happened again, even though I had selected a high-RAM runtime. But it is weird:
Starting. Press "Enter" to stop training and save model.
Saving: 80% 4/5 [00:03<00:01, 1.16s/it]tcmalloc: large alloc 1402912768 bytes == 0x55966e1ba000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e8397541 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
tcmalloc: large alloc 2132000768 bytes == 0x5596c1ba6000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:18:25][#3159393][0995ms][0.1394][0.3914]
Saving: 80% 4/5 [00:02<00:00, 1.46it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:28:18][#3159959][1024ms][0.1387][0.3966]
Saving: 80% 4/5 [00:02<00:00, 1.50it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:38:19][#3160532][0993ms][0.1388][0.3988]
Saving: 80% 4/5 [00:02<00:00, 1.45it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:48:19][#3161105][1022ms][0.1406][0.3975]
[02:58:19][#3161679][1014ms][0.1423][0.3950]
[03:08:20][#3162253][0990ms][0.1393][0.3960]
[03:16:20][#3162721][1013ms][0.1693][0.4662]
It started with the same issue, but after a while the tcmalloc message stopped showing up. Will that be a problem? The model file it saved seems to be as large as the original one.
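To put the numbers in the log in perspective, here is a minimal sketch that pulls the sizes out of the tcmalloc lines and converts them to GiB. The function names and the shortened sample text are only for illustration and are not part of any library or of the training script.

```python
import re

# Matches the size field in lines such as
# "tcmalloc: large alloc 2132000768 bytes == 0x..."
TCMALLOC_RE = re.compile(r"tcmalloc: large alloc (\d+) bytes")


def large_alloc_sizes(log_text):
    """Return the size in bytes of every tcmalloc large-alloc notice in log_text."""
    return [int(m.group(1)) for m in TCMALLOC_RE.finditer(log_text)]


def to_gib(num_bytes):
    """Convert a byte count to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


# Two lines from the log above, with the long address lists cut off for brevity.
sample = (
    "Saving: 80% 4/5 [00:03<00:01, 1.16s/it]"
    "tcmalloc: large alloc 1402912768 bytes == 0x55966e1ba000\n"
    "tcmalloc: large alloc 2132000768 bytes == 0x5596c1ba6000\n"
)

for size in large_alloc_sizes(sample):
    print(f"{size} bytes = {to_gib(size):.2f} GiB")
```

So the allocations reported during saving are roughly 1.3 GiB and 2 GiB each.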