[GUIDE] DeepFaceLab 2.0 - Google Colab Guide

forVenice · Feb 2, 2022

Code:

Saving:  80% 4/5 [00:08<00:02,  2.31s/it]tcmalloc: large alloc 1362739200 bytes == 0x561cfef92000   0x7f8c1e3c82a4 0x561b6e2904cc 0x561b6e34c1a2 0x561b6e345034 0x561b6e233255 0x561b6e3463ac 0x561b6e344fda 0x561b6e345781 0x561b6e343f3c 0x561b6e3d9a49 0x561b6e29346c 0x561b6e293240 0x561b6e3070f3 0x561b6e3019ee 0x561b6e294bda 0x561b6e302c0d 0x561b6e294afa 0x561b6e302c0d 0x561b6e294afa 0x561b6e302c0d 0x561b6e301ced 0x561b6e294bda 0x561b6e302915 0x561b6e301ced 0x561b6e1d3e2b 0x561b6e303fe4 0x561b6e294afa 0x561b6e302c0d 0x561b6e294afa 0x561b6e302c0d 0x561b6e295039
tcmalloc: large alloc 2350432256 bytes == 0x561d5032e000   0x7f8c1e3c82a4 0x561b6e2904cc 0x561b6e34c1a2 0x561b6e345034 0x561b6e233255 0x561b6e3463ac 0x561b6e344fda 0x561b6e345781 0x561b6e343f3c 0x561b6e3d9a49 0x561b6e29346c 0x561b6e293240 0x561b6e3070f3 0x561b6e3019ee 0x561b6e294bda 0x561b6e302c0d 0x561b6e294afa 0x561b6e302c0d 0x561b6e294afa 0x561b6e302c0d 0x561b6e301ced 0x561b6e294bda 0x561b6e302915 0x561b6e301ced 0x561b6e1d3e2b 0x561b6e303fe4 0x561b6e294afa 0x561b6e302c0d 0x561b6e294afa 0x561b6e302c0d 0x561b6e295039

I started to get tcmalloc errors on autosaving (saving 80%) the model, which stops the process and it won't continue to train after. Anybody know how to deal with this? This didn't happen last week on the same exact model.

Groggy4 · Feb 2, 2022

forVenice said:
I started to get tcmalloc errors on autosaving (saving 80%) the model, which stops the process and it won't continue to train after. Anybody know how to deal with this? This didn't happen last week on the same exact model.

It happens sometimes when you get a slow cpu/ram assigned. If you terminate session and start fresh with a "high ram" runtime, it should work. Might work with a regular runtime as well. 90% of the time it does. At least with P100 assigned. Whenever you start a regular runtime with T4, it fails every time.
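
To confirm what kind of runtime was assigned before starting a long training run, a minimal Python sketch like the one below reads the total RAM from /proc/meminfo, a standard Linux file that should be present on a Colab VM. The function names and the 20 GiB cutoff for calling a runtime "high RAM" are assumptions made for illustration. They are not values defined by Colab or DeepFaceLab.

```python
import os
import re

# Hypothetical cutoff: treat anything at or above this as a "high RAM" runtime.
# This number is an assumption for illustration, not an official Colab value.
HIGH_RAM_CUTOFF_GIB = 20.0


def total_ram_gib(meminfo_text):
    """Return total RAM in GiB parsed from /proc/meminfo-style text."""
    match = re.search(r"^MemTotal:\s+(\d+)\s+kB", meminfo_text, re.MULTILINE)
    if match is None:
        raise ValueError("MemTotal line not found")
    # /proc/meminfo labels the unit "kB", but the value is in KiB (1024 bytes).
    return int(match.group(1)) / (1024 * 1024)


def looks_like_high_ram(meminfo_text, cutoff_gib=HIGH_RAM_CUTOFF_GIB):
    """True if the reported total RAM meets the (assumed) cutoff."""
    return total_ram_gib(meminfo_text) >= cutoff_gib


if __name__ == "__main__" and os.path.exists("/proc/meminfo"):
    with open("/proc/meminfo") as f:
        text = f.read()
    print(f"Total RAM: {total_ram_gib(text):.1f} GiB")
    print("High-RAM runtime (by the assumed cutoff)?", looks_like_high_ram(text))
```

If the reported total is well below what you expected, terminating the session and requesting a new runtime, as suggested above, is the cheap thing to try first.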

forVenice · Feb 2, 2022

Groggy4 said:
forVenice said:

I started to get tcmalloc errors on autosaving (saving 80%) the model, which stops the process and it won't continue to train after. Anybody know how to deal with this? This didn't happen last week on the same exact model.

It happens sometimes when you get a slow cpu/ram assigned. If you terminate session and start fresh with a "high ram" runtime, it should work. Might work with a regular runtime as well. 90% of the time it does. At least with P100 assigned. Whenever you start a regular runtime with T4, it fails every time.

thank you! starting the session with high ram did solve it! was driving me mad...

3rdtry77 · Feb 6, 2022

I'm getting a lot of random disconnects lately, even when using "prevent random disconnects". It's not the usage block. It'll just randomly disconnect and attempt to reconnect, and I'll lose however much has processed since my last save. It sucks.

cashho1 · Feb 7, 2022

Wish I could do this, but it's too complicated for an ADHD guy.

belovick · Feb 8, 2022

I'm getting a lot of GPU usage limits after just 4 hours of use, even after stopping for a day. Does Colab Pro provide much longer GPU usage?

3rdtry77 · Feb 16, 2022

Yesterday and today I've been getting the error
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd

It still loads and runs, but I don't know if that's affecting the ability to train. Any idea what's going on and whether it's something to worry about or not?
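
For context, this message means that some compiled extension module was built against a newer NumPy C API (0xe, i.e. 14) than the one the installed NumPy provides (0xd, i.e. 13). The usual remedies are to upgrade NumPy or to reinstall the affected module so it matches the installed NumPy. Whether it matters for training depends on whether that module is actually used. The sketch below only parses the message and reports which side is newer. The function name is made up for illustration.

```python
import re


def explain_numpy_api_mismatch(message):
    """Parse a NumPy C-API mismatch message and report which side is newer.

    Returns None if the message does not match the expected format.
    """
    match = re.search(
        r"compiled against API version (0x[0-9a-fA-F]+) "
        r"but this version of numpy is (0x[0-9a-fA-F]+)",
        message,
    )
    if match is None:
        return None
    built_against = int(match.group(1), 16)
    installed = int(match.group(2), 16)
    if built_against > installed:
        return (f"module built against API {built_against}, "
                f"installed NumPy provides {installed}: NumPy is too old for it")
    if built_against < installed:
        return (f"module built against API {built_against}, "
                f"installed NumPy provides {installed}: module is the older side")
    return "API versions match"


if __name__ == "__main__":
    err = ("RuntimeError: module compiled against API version 0xe "
           "but this version of numpy is 0xd")
    print(explain_numpy_api_mismatch(err))
```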

dsyrock · Feb 17, 2022

Groggy4 said:
forVenice said:

I started to get tcmalloc errors on autosaving (saving 80%) the model, which stops the process and it won't continue to train after. Anybody know how to deal with this? This didn't happen last week on the same exact model.

It happens sometimes when you get a slow cpu/ram assigned. If you terminate session and start fresh with a "high ram" runtime, it should work. Might work with a regular runtime as well. 90% of the time it does. At least with P100 assigned. Whenever you start a regular runtime with T4, it fails every time.

OMFG! I didn't know the real reason until today! I thought it was because Colab cannot write files larger than 1 GB, damn! Thank you so much!

dsyrock · Feb 18, 2022

Groggy4 said:
It happens sometimes when you get a slow cpu/ram assigned. If you terminate session and start fresh with a "high ram" runtime, it should work. Might work with a regular runtime as well. 90% of the time it does. At least with P100 assigned. Whenever you start a regular runtime with T4, it fails every time.

Unfortunately, it happened again, even though I had set it to high RAM. But it is weird:

Starting. Press "Enter" to stop training and save model.
Saving: 80% 4/5 [00:03<00:01, 1.16s/it]tcmalloc: large alloc 1402912768 bytes == 0x55966e1ba000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e8397541 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
tcmalloc: large alloc 2132000768 bytes == 0x5596c1ba6000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:18:25][#3159393][0995ms][0.1394][0.3914]
Saving: 80% 4/5 [00:02<00:00, 1.46it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:28:18][#3159959][1024ms][0.1387][0.3966]
Saving: 80% 4/5 [00:02<00:00, 1.50it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:38:19][#3160532][0993ms][0.1388][0.3988]
Saving: 80% 4/5 [00:02<00:00, 1.45it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:48:19][#3161105][1022ms][0.1406][0.3975]
[02:58:19][#3161679][1014ms][0.1423][0.3950]
[03:08:20][#3162253][0990ms][0.1393][0.3960]
[03:16:20][#3162721][1013ms][0.1693][0.4662]

It started with the same issue, but after a while the message stopped showing up. Will that be a problem? The saved model file seems to be as large as the original one.

honor · Feb 18, 2022

dsyrock said:
Groggy4 said:

It happens sometimes when you get a slow cpu/ram assigned. If you terminate session and start fresh with a "high ram" runtime, it should work. Might work with a regular runtime as well. 90% of the time it does. At least with P100 assigned. Whenever you start a regular runtime with T4, it fails every time.

Unfortunately, it happened again, even though I had set it to high RAM. But it is weird:

Starting. Press "Enter" to stop training and save model.
Saving: 80% 4/5 [00:03<00:01, 1.16s/it]tcmalloc: large alloc 1402912768 bytes == 0x55966e1ba000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e8397541 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
tcmalloc: large alloc 2132000768 bytes == 0x5596c1ba6000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:18:25][#3159393][0995ms][0.1394][0.3914]
Saving: 80% 4/5 [00:02<00:00, 1.46it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:28:18][#3159959][1024ms][0.1387][0.3966]
Saving: 80% 4/5 [00:02<00:00, 1.50it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:38:19][#3160532][0993ms][0.1388][0.3988]
Saving: 80% 4/5 [00:02<00:00, 1.45it/s]tcmalloc: large alloc 2132000768 bytes == 0x5596d51bc000 0x7fa0e63222a4 0x5594e82e13bc 0x5594e839df62 0x5594e8396df4 0x5594e8284225 0x5594e839816c 0x5594e8396d9a 0x5594e83974ae 0x5594e8395cfc 0x5594e842b9f9 0x5594e82e434c 0x5594e82e4120 0x5594e8358679 0x5594e835302f 0x5594e82e5aba 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e835366e 0x5594e82e5aba 0x5594e8353eae 0x5594e835366e 0x5594e8224e2b 0x5594e8355633 0x5594e82e59da 0x5594e8354108 0x5594e82e59da 0x5594e8354108 0x5594e82e5f19
[02:48:19][#3161105][1022ms][0.1406][0.3975]
[02:58:19][#3161679][1014ms][0.1423][0.3950]
[03:08:20][#3162253][0990ms][0.1393][0.3960]
[03:16:20][#3162721][1013ms][0.1693][0.4662]

It started with the same issue, but after a while the message stopped showing up. Will that be a problem? The saved model file seems to be as large as the original one.

There is an environment variable:

TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD, which defaults to 1073741824 bytes (1 GiB). The 2132000768 in your log (~2 GB) is the size of the allocation, not the threshold.
Any allocation at or above the threshold prints this message. tcmalloc also raises the threshold by a factor of 1.125 after each report, which is probably why the message stops appearing after a while.

If it saves fine and continues training, it's not a problem. It is simply a warning message to inform you that the threshold has been crossed.
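
To make the sizes in those warnings easier to read, a short Python sketch can pull the byte counts out of a log and convert them to GiB. The function names are made up for illustration, and the regular expression assumes the message format shown in the logs in this thread.

```python
import re


def large_alloc_sizes(log_text):
    """Return the byte counts from 'tcmalloc: large alloc N bytes' messages."""
    return [int(n) for n in re.findall(r"tcmalloc: large alloc (\d+) bytes", log_text)]


def to_gib(num_bytes):
    """Convert bytes to GiB (1 GiB = 2**30 bytes)."""
    return num_bytes / 2**30


if __name__ == "__main__":
    # Shortened sample in the same format as the logs above (addresses omitted).
    sample = (
        "Saving: 80% 4/5 tcmalloc: large alloc 1402912768 bytes == 0x0\n"
        "tcmalloc: large alloc 2132000768 bytes == 0x0\n"
    )
    for size in large_alloc_sizes(sample):
        print(f"{size} bytes = {to_gib(size):.2f} GiB")
```

These are in-memory allocation sizes made while saving, so they do not have to match the size of the model file on disk.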

biglou994 · Feb 18, 2022

3rdtry77 said:
Yesterday and today I've been getting the error
RuntimeError: module compiled against API version 0xe but this version of numpy is 0xd

It still loads and runs, but I don't know if that's affecting the ability to train. Any idea what's going on and whether it's something to worry about or not?

I'm getting this error too. I think nothing is affected, but the process is very slow. Is it slow for you too?

dsyrock · Feb 19, 2022

honor said:
There is an environment variable:

TCMALLOC_LARGE_ALLOC_REPORT_THRESHOLD, which defaults to 1073741824 bytes (1 GiB). The 2132000768 in your log (~2 GB) is the size of the allocation, not the threshold.
Any allocation at or above the threshold prints this message. tcmalloc also raises the threshold by a factor of 1.125 after each report, which is probably why the message stops appearing after a while.

If it saves fine and continues training, it's not a problem. It is simply a warning message to inform you that the threshold has been crossed.

But the file is only about 1.32 GB. Anyway, it doesn't seem to cause a problem. Thanks!

DeliveryBoy · Feb 28, 2022

Amazing guide thank you so very much ...

DeliveryBoy · Feb 28, 2022

Is it possible to update this tutorial for the 2022 version?

I am getting this error: WARNING: Skipping tensorflow as it is not installed.

Code:

Cloning into 'DeepFaceLab'...
remote: Enumerating objects: 8007, done.
remote: Counting objects: 100% (256/256), done.
remote: Compressing objects: 100% (139/139), done.
remote: Total 8007 (delta 144), reused 175 (delta 115), pack-reused 7751
Receiving objects: 100% (8007/8007), 823.29 MiB | 28.93 MiB/s, done.
Resolving deltas: 100% (5138/5138), done.
Checking out files: 100% (211/211), done.
/content/DeepFaceLab
/content
WARNING: Skipping tensorflow as it is not installed.
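
For what it's worth, that warning is what pip prints when asked to uninstall a package that is not installed, so on its own it is harmless. What matters is whether a working TensorFlow is present afterwards. A minimal sketch for checking that from a notebook cell is below. It uses only the standard library (importlib.metadata, Python 3.8+), and the helper name is made up for illustration.

```python
from importlib import metadata


def installed_version(package_name):
    """Return the installed version string of a package, or None if absent."""
    try:
        return metadata.version(package_name)
    except metadata.PackageNotFoundError:
        return None


if __name__ == "__main__":
    for name in ("tensorflow", "numpy"):
        version = installed_version(name)
        print(f"{name}: {version if version is not None else 'not installed'}")
```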

Also, I noticed you need to run an automatic mouse-movement macro to avoid getting disconnected for idling.

1D0F4K35 · Mar 1, 2022

DeliveryBoy said:
Also, I noticed you need to run an automatic mouse-movement macro to avoid getting disconnected for idling.

If you have such a macro (that works), could you please share it? I don't know where to find or construct 'macros'.

DeliveryBoy · Mar 10, 2022

1D0F4K35 said:
DeliveryBoy said:

Also, I noticed you need to run an automatic mouse-movement macro to avoid getting disconnected for idling.

If you have such a macro (that works), could you please share it? I don't know where to find or construct 'macros'.

All you need is a little app that records your mouse movements and then replays them on the screen. You can just google it; there are thousands of free mouse macro apps.

3rdtry77 · Mar 18, 2022

Can anyone clarify how google colab uses my internet data bandwidth? Obviously it's using my personal bandwidth when I upload the files to google drive and all that, but when it's transferring from drive to the colab notebook, is that on me as well? What about the actual training? I've tried googling to figure this out and I've only found 2 responses, one that says "everything happens on the cloud, so it doesn't use your internet" and one that says "it uses a ridiculous amount of your internet". So not exactly clear.

I've got a data cap through my provider, and I've gone over it a couple of times in the last few months, and I'm wondering if google colab might be the culprit.

sm9075 · Mar 18, 2022

3rdtry77 said:
Can anyone clarify how google colab uses my internet data bandwidth? Obviously it's using my personal bandwidth when I upload the files to google drive and all that, but when it's transferring from drive to the colab notebook, is that on me as well? What about the actual training? I've tried googling to figure this out and I've only found 2 responses, one that says "everything happens on the cloud, so it doesn't use your internet" and one that says "it uses a ridiculous amount of your internet". So not exactly clear.

I've got a data cap through my provider, and I've gone over it a couple of times in the last few months, and I'm wondering if google colab might be the culprit.

Drive <-> Colab traffic stays within Google's network. When it imports/exports the workspace, that's internal to Google. I'm training now and at most it's using a few KB/s, presumably just the page updating (showing the iteration, for example).
According to it would be around 10Gigs/month if I'm reading that right.
Are you syncing your Google Drive locally while that's running? Would it constantly re-sync workspace.zip every time Colab updates it? I could see that getting bulky.
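
As a rough sanity check on how a steady trickle adds up, the arithmetic can be sketched in Python. The 4 KB/s rate is an assumed example of "a few KB/s", the month is taken as 30 days of continuous use, and units are decimal (1 GB = 1,000,000 KB). The function name is made up for illustration.

```python
SECONDS_PER_DAY = 24 * 60 * 60


def monthly_usage_gb(rate_kb_per_s, days=30):
    """Total data in GB for a constant transfer rate held for `days` days."""
    total_kb = rate_kb_per_s * SECONDS_PER_DAY * days
    return total_kb / 1_000_000  # decimal units: 1 GB = 1,000,000 KB


if __name__ == "__main__":
    # 4 KB/s is an assumed example rate, not a measured value.
    print(f"{monthly_usage_gb(4):.1f} GB per 30 days")
```

At that assumed rate the total comes to roughly 10 GB over 30 days of nonstop use, which is the same order of magnitude as the estimate above.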

3rdtry77 · Mar 18, 2022

It auto-backups every hour or so, but I'm not doing anything else with Drive that I'm aware of. But if Colab isn't the cause of my excess data usage, then that's a relief.

rincoln25 · Mar 23, 2022

Any Colab Pro users here? I recently subscribed, and I wonder whether it still disconnects. What are the chances that it will disconnect?
