
Try to train fast (grouped-conv) versions of csdarknet53 and csdarknet19 #6

Open · AlexeyAB opened this issue Jan 4, 2020 · 109 comments

AlexeyAB commented Jan 4, 2020

@WongKinYiu Hi,

Since CSPDarkNet53 is better than CSPResNeXt50 for the Detector, try to train these 4 models (the two original configs are listed for reference):

| Model | GPU | 256x256 (FPS) | 512x512 (FPS) | 608x608 (FPS) |
|---|---|---|---|---|
| darknet53.cfg (original) | RTX 2070 | 113 | 56 | 38 |
| csdarknet53.cfg (original) | RTX 2070 | 101 | 57 | 41 |
| csdarknet53g.cfg.txt | RTX 2070 | 122 | 64 | 46 |
| csdarknet53ghr.cfg.txt | RTX 2070 | 100 | 75 | 57 |
| spinenet49.cfg.txt (low priority) | RTX 2070 | 49 | 44 | 43 |
| csdarknet19-fast.cfg.txt | RTX 2070 | 213 | 149 | 116 |

csdarknet19-fast.cfg contains DropBlock, so use the latest version of Darknet, which uses fast random functions for DropBlock.

@WongKinYiu

@AlexeyAB Thanks,

I will have free GPUs after I finish training the local_avgpool models.

AlexeyAB commented Jan 30, 2020

@WongKinYiu Hi,

So you should combine the two networks: more layers and more parameters (CSPDarknet-53) + more outputs per layer (CSPResNeXt-50).


  1. It seems that CSPResNeXt50 has higher Top1/Top5 because it has more outputs per layer out_w * out_h * out_c (i.e. it has higher filters= in its [conv] layers): 258 291 (1.1x) for CSPResNeXt50 vs 233 348 (1.0x) for CSPDarknet53.

  2. It seems that although a small number of parameters, ~21M (CSPResNet50/CSPResNeXt50, [conv] groups=32, 84 layers), is sufficient for the 256x256 Classifier, a much larger number of parameters, ~27M (CSPDarkNet53, [conv] groups=1, 108 layers), is needed for the 512x512 Detector.

Suggestion:

  • Either increase the number of filters in CSPDarknet53 (and also increase groups= from 1 to 2-8 for layers with a high filters= value),
  • or increase the number of layers in CSPResNeXt50 (and also decrease groups= from 32 to 1-8).
| Model | Layers | Groups | Parameters (M) | Average outputs (out_w x out_h x out_c) | RTX 2070 FPS | BFLOPs | Top1 / Top5 | Top1 / Top5 (mosaic + label smooth + mish) | AP (Detector) 512x512 |
|---|---|---|---|---|---|---|---|---|---|
| CSPDarkNet53 256x256 | 108 | 1 | 27 | 233 348 (1.0x) | 125 | 13 | 77.2% / 93.6% | 78.7% / 94.8% | 38.7% |
| CSPResNeXt50 256x256 | 84 | 32 | 20 | 258 291 (1.1x) | 72 | 8 | 77.9% / 94.0% | 79.8% / 95.2% | 38.0% |
| CSPResNet50 256x256 | 84 | 1 | 21 | 203 665 (0.87x) | 168 | 9 | 76.6% / 93.3% | 78.1% / 94.2% | 38.0% |

Did you try training with DropBlock? Does it work well?

@WongKinYiu

@AlexeyAB

Yes. If I change the output channels of CSPResNet50 and CSPDarknet53 to 2048, I think they can achieve better results, but with a large amount of computation.

Do you need an ImageNet pre-trained model which has more layers + more parameters + more outputs? If yes, I can train a model. Or if you have a cfg file, I will get 2 free GPUs tomorrow for training it.

The DropBlock models are still training. Currently they get slightly lower accuracy than the models without DropBlock at the same epoch, but that may be because DropBlock needs more epochs to converge.

@WongKinYiu

@AlexeyAB Hello,

The model with DropBlock gets lower accuracy than the one without it (79.1 vs 79.8).
I think we need to follow what EfficientNet does: reduce the drop probability during training.

AlexeyAB commented Feb 8, 2020

@WongKinYiu Hi,

This is already done: https://github.com/AlexeyAB/darknet/blob/d51d89053afc4b7f50a30ace7b2fcf1b2ddd7598/src/dropout_layer_kernels.cu#L28-L31

  • Maybe we should increase the drop probability over the whole training run instead of only the first half of it (see the sketch below)

  • Or maybe DropBlock requires more parameters in the model, since drop-block/drop-out/drop-connect divides the model into an ensemble of many models, each of which turns out to be too small

So we should try:

  • new models with more layers + more parameters + more outputs
  • a fast CUDA implementation of DropBlock
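
For illustration, a minimal sketch of such a schedule in Python (hypothetical helper, not the actual darknet CUDA code linked above), where the drop probability ramps linearly from 0 to its target value over a configurable fraction of training:

```python
# Hypothetical sketch: linearly ramp the DropBlock drop probability from 0 to
# `target_prob` over the first `ramp_fraction` of training, then keep it constant.
# This mirrors the idea discussed above (ramping over half vs. the whole run);
# it is not the actual darknet implementation.
def drop_probability(cur_iter, max_iters, target_prob=0.1, ramp_fraction=0.5):
    ramp_iters = max(1, int(max_iters * ramp_fraction))
    return target_prob * min(1.0, cur_iter / ramp_iters)

# e.g. ramp over the whole run instead of only the first half:
# p = drop_probability(it, max_batches, target_prob=0.1, ramp_fraction=1.0)
```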

AlexeyAB commented Feb 10, 2020

@WongKinYiu Hi,

> Do you need an ImageNet pre-trained model which has more layers + more parameters + more outputs? If yes, I can train a model. Or if you have a cfg file, I will get 2 free GPUs tomorrow for training it.

Please try to train these 2 models - both use MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1 + reduced groups= for faster inference:

  1. csresnext50morelayers.cfg.txt - added more layers between the 1st and 2nd subsampling

  2. csresnext50sub.cfg.txt - added more layers between the 1st and 2nd subsampling, and concatenated 2 subsamplings: [conv] stride=2 and [maxpool] stride=2


Also did you try to train CSPResNeXt-50+Elastic with MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1 ?

Also did you try to train spinenet49.cfg.txt ?

@WongKinYiu

OK, will train these two models.

No, the inference speed of CSPResNeXt-50 with Elastic is too slow;
I think it cannot run real-time object detection.

AlexeyAB commented Feb 10, 2020

@WongKinYiu

All with MISH-activation and 608x608 network resolution on GeForce RTX 2070:

  • csresnext50.cfg - 51.2 FPS

  • csdarknet53.cfg - 53.6 FPS

  • csresnext50morelayers.cfg.txt - 44.0 FPS

  • csresnext50sub.cfg.txt - 43.3 FPS

  • spinenet49.cfg - 43.2 FPS (640x640 network resolution)

  • elastic-csresnext50.cfg - 34.7 FPS (576x576 network resolution)

So it may make sense to train spinenet49.cfg with MISH + mosaic=1 cutmix=1 label_smooth_eps=0.1: spinenet49.cfg.txt

@WongKinYiu

@AlexeyAB

I currently have only two free GPUs, so I will train csresnext50morelayers and spinenet49 first.

@WongKinYiu

@AlexeyAB

| Model | CutMix | Mosaic | Label Smoothing | Mish | Top-1 | Top-5 |
|---|---|---|---|---|---|---|
| SpineNet-49 | ✔️ | ✔️ | ✔️ | ✔️ | 78.3% | 94.6% |

AlexeyAB commented Mar 2, 2020

@WongKinYiu Thanks!

So SpineNet-49 is worse than csdarknet53 and csresnext50, at least on ImageNet:

  • spinenet49.cfg.txt - 43.2 FPS - 78.3% | 94.6%
  • csdarknet53.cfg - 53.6 FPS - 78.7% | 94.8%
  • csresnext50.cfg - 51.2 FPS - 79.8% | 95.2%

Also, I fixed label_smoothing for the Detector (not for the Classifier) in AlexeyAB/darknet@81290b0,
in the same way as here: https://github.com/david8862/keras-YOLOv3-model-set/blob/6cc297434e0604e2f6c34a8a2557b342468f083a/yolo3/loss.py#L225-L227
using this probability transformation: http://fooplot.com/#W3sidHlwZSI6MCwiZXEiOiJ4KjAuOSswLjA1IiwiY29sb3IiOiIjMDAwMDAwIn0seyJ0eXBlIjoxMDAwLCJ3aW5kb3ciOlsiMCIsIjEiLCIwIiwiMSJdfV0-

So you can try to train Detector with new label_smoothing.

Usage

[yolo]
label_smooth_eps=0.1

for each [yolo] layer

The old label_smoothing worked well for the Classifier, but worked badly for the Detector.
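
For reference, a minimal Python sketch of the plotted probability transformation (x*0.9 + 0.05, generalized to an eps parameter; illustrative only, not the darknet source):

```python
# Smooth a sigmoid target y in [0, 1] toward the middle of the range.
# With eps = 0.1 this is exactly the plotted mapping x*0.9 + 0.05,
# so y=1 -> 0.95 and y=0 -> 0.05.
def smooth_sigmoid_target(y, eps=0.1):
    return y * (1.0 - eps) + 0.5 * eps
```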

@WongKinYiu

The results from the original paper:

[image: results table from the SpineNet paper]

CSPDarkNet-53 has more parameters and FLOPs.

AlexeyAB commented Mar 2, 2020

Yes, SpineNet-49 has fewer params and FLOPs, but CSPDarkNet-53 is faster and more accurate as a Classifier.
But maybe SpineNet-49 is more accurate as a backbone for the Detector.

WongKinYiu commented Mar 7, 2020

@AlexeyAB

| Model | CutMix | Mosaic | Label Smoothing | Mish | Top-1 | Top-5 |
|---|---|---|---|---|---|---|
| CSPResNeXt-50-morelayers | ✔️ | ✔️ | ✔️ | ✔️ | 79.4% | 95.2% |

AlexeyAB commented Mar 8, 2020

@WongKinYiu Thanks! Do you mean csresnext50morelayers.cfg or CSPDarkNet-53-morelayers? #6 (comment)

@WongKinYiu

@AlexeyAB Oh, sorry, it is csresnext50morelayers.cfg.

AlexeyAB commented Mar 8, 2020

@WongKinYiu
So csresnext50morelayers.cfg is worse than csresnext50.cfg (Top1 79.4% vs 79.8%) on ImageNet. https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md

But I think csresnext50morelayers.cfg will be a better backbone for the Detector.

@WongKinYiu

@AlexeyAB

Yes, csresnext50 performs better on ImageNet.

I will get a free GPU in about 4 days.
However, I currently do not have results for a backbone with mish activation on MS COCO; could you help design the cfg for a detector with the csresnext50morelayers backbone?

Thanks.

AlexeyAB commented Mar 8, 2020

@WongKinYiu
Ok, I can make 2 cfg-files, with [net] mosaic=1 dynamic_minibatch=1 and mish-activation:

  1. csresnext50morelayers + SPP_PAN
  2. csresnext50morelayers + SPP+ASFF+BiFPN

Shall we try to test the new label_smoothing for the Detector?
Approximately when will the CBN, DropBlock, ASFF and BiFPN model trainings end?

@WongKinYiu

@AlexeyAB

For the classifier:
CBN will finish in one week,
CBN+DropBlock is still very slow; I think it needs more than one month to finish training.

For the detector:
RFB+BN needs about two weeks,
CBN needs about two weeks,
BiFPN needs about three to four weeks, but the training may stop for several days or weeks,
ASFF has not yet started.

I will also do an ablation study for dynamic_minibatch and the new label_smoothing.

glenn-jocher commented Mar 8, 2020

@AlexeyAB @WongKinYiu have you had any success with label smoothing? I just learned about it recently, but was confused about a few things:

  • Can it be applied in both classification models and object detection models?
  • Is it always applied to both negative samples (i.e. 0.1) and positive samples (i.e. 0.9), or could it be applied only to negatives, etc.?
  • For object detection, should it be applied to both the objectness loss and the classification loss?
  • Can it be applied to both CE loss and BCE loss criteria?

@WongKinYiu

@glenn-jocher

  • Yes, it can be applied to the classification head of detectors - BoF.
  • I think both are okay, because it can be applied to both YOLOv3 and FasterRCNN.
  • In the BoF paper, it seems to be applied only to the classification head.
  • I think yes; the paper mentions "In the case of sigmoid outputs of range 0 to 1.0 as in YOLOv3 [16], label smoothing is even simpler by correcting the upper and lower limit of the range of targets as in Eq. 3."

[image: Eq. 3 from the BoF paper]

But unfortunately, mixup, cosine lr, and label smoothing all gave worse results in my experiments.

@glenn-jocher

@WongKinYiu ah thanks, that's super informative!

That solves a big mystery for me then. I tried to apply it to both obj loss and class loss at the same time, and it destroyed my NMS because every single anchor was above the threshold (of 0.001).

I implemented a cosine lr scheduler a couple of weeks ago; it worked well (+0.3 mAP), though I noticed it worked better if I raised the initial LR. Before, with the traditional step scheduler, I was using about lr0=0.006; now, with the cosine scheduler, I use lr0=0.010 to get that +0.3 increase on COCO.

| Name | mAP@0.5 | mAP@0.5:0.95 | Comments |
|---|---|---|---|
| (288-640)-608 to 273 bs16a4 yolov3-spp.cfg | 61.6 | 41.6 | step lr |
| (288-640)-608 to 273 bs16a4 yolov3-spp.cfg | 61.8 | 41.9 | cos lr0=0.01 |
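
For context, a generic Python sketch of the two schedules being compared (illustrative only; the lr0 values and step points are assumptions, not the exact ultralytics implementation):

```python
import math

# Cosine decay from lr0 down to lr_final over `epochs` epochs (generic form).
def cosine_lr(epoch, epochs, lr0=0.010, lr_final=0.0):
    return lr_final + 0.5 * (lr0 - lr_final) * (1.0 + math.cos(math.pi * epoch / epochs))

# Step schedule for comparison: constant lr0, then /10 at ~80% and /100 at ~90% of training.
def step_lr(epoch, epochs, lr0=0.006):
    if epoch < 0.8 * epochs:
        return lr0
    return lr0 * (0.1 if epoch < 0.9 * epochs else 0.01)
```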

@glenn-jocher

@WongKinYiu see ultralytics/yolov3#238 (comment) for the cosine scheduler implementation. These are the training plots for the two runs (step and cos lr). Interestingly, the val losses are better at the end with step, and you can see the cos obj loss is starting to overtrain at the end, but the cos final mAP is still slightly higher. I'm not quite sure what that means.

[image: results.png training plots]

glenn-jocher commented Mar 8, 2020

@WongKinYiu do you know what the value of epsilon should be in Eq. 3 of the BoF paper? If I assume epsilon=0.1, the classification target values (after a sigmoid) would be:

  • positive: (1 - 0.1) = 0.9
  • negative: 0.1/(80-1) = 0.0013

Does that seem right?

[screenshot: Eq. 3 from the BoF paper]

@glenn-jocher

In their case they seem to be using epsilon as smooth_weight, with a constraint to keep it from getting too large if the class count is low. OK, I'll start from there.
smooth_weight = min(1. / self._num_class, 1. / 40)
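
A quick numeric check of those targets in Python (illustrative; assumes the Eq. 3 style on/off values discussed above and num_classes=80 as in COCO):

```python
# Positive/negative classification targets under label smoothing:
# on = 1 - eps, off = eps / (num_classes - 1).
def smoothed_targets(eps, num_classes=80):
    return 1.0 - eps, eps / (num_classes - 1)

print(smoothed_targets(0.1))       # (0.9, ~0.00127) -- the numbers above
eps = min(1.0 / 80, 1.0 / 40)      # gluoncv-style smooth_weight = 0.0125 for 80 classes
print(smoothed_targets(eps))       # (~0.9875, ~0.000158)
```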

@WongKinYiu

It seems only YOLOv3 can apply label smoothing.
None of SSD, CenterNet, FasterRCNN, or MaskRCNN has a label smoothing function.

AlexeyAB commented Mar 19, 2020

@WongKinYiu

> update: still gets all zero iou.

After how many iterations?

Try to train without CBN. I noticed that CBN worsens accuracy on most of my models.

I trained this cfg-file for 2300 iterations on MS COCO and didn't get iou=0 or NaN loss: csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt (note that it uses max_batches=50050, steps=40000,45000 instead of max_batches=500500, steps=400000,450000)

label_smooth_eps=0.1, dynamic_minibatch=1, mosaic=1, BiFPN, ASFF, RFB, DropBlock - do not cause problems.

[image: training chart]

WongKinYiu commented Mar 19, 2020

After about 40 iterations. Now I have changed 608/64/64 back to 416/64/32, and it still performs normally at 1500 iterations.

Update: it becomes all zero at 3xxx iterations.

@AlexeyAB

@WongKinYiu Nice! Are you currently training CSResNext50-PANet and CSDarknet53-PANet with Mosaic, Genetic, Mish... based on the best of these models https://github.com/WongKinYiu/CrossStagePartialNetworks/blob/master/imagenet/results.md ?

@WongKinYiu

Yes, the training of CSPResNext50-PANet-SPP with the csresnext50-gamma.cfg pretrained model will finish in 1~2 weeks.

AlexeyAB commented Mar 20, 2020

@WongKinYiu Try to train with 608/64/64 + mosaic=1 dynamic_minibatch=1 label_smooth_eps=0.1 but without CBN, i.e. with batch_normalize=1

I successfully trained such a model without NaN or zero IoU: csresnext50sub-spp-asff-bifpn-rfb-db.cfg.txt

[images: chart_csresnext50sub-spp-asff-bifpn-rfb-db]

@WongKinYiu

@AlexeyAB Started training.

@AlexeyAB

@WongKinYiu

Is the BiFPN+ASFF+RFB+DB training going well, without NaN/IoU=0?

@WongKinYiu

@AlexeyAB

I resumed training from 2k iterations several times when NaN/IoU=0 occurred; now the training has reached 7k iterations without NaN/IoU=0.

@AlexeyAB

@WongKinYiu This is strange, since I didn't get NaN/IoU=0 at all.

@WongKinYiu

@AlexeyAB Hmm... I got IoU=0 three times with this cfg. I already tested the previous cfg on CUDA 9.0/10.0/10.1/10.2, and all of those trainings hit the same situation.

@AlexeyAB

@WongKinYiu Maybe this is a temporary phenomenon that will correct itself and can be ignored until you reach ~10,000 iterations?

@syjeon121

@AlexeyAB @WongKinYiu Hi, I want to use cspdarknet53-panet-spp from this repo's readme for custom object training.

[screenshot: model table from the repo readme]

How many layers should I extract from the weights file using partial?

@WongKinYiu

@sctrueew

@WongKinYiu Hi,

Which pre-trained weights should I use for the CSPDarknet53-PANet-SPP model?

Thanks

WongKinYiu commented Apr 12, 2020

Hello,

Which cfg do you want to use?
And is your dataset larger than MS COCO?

sctrueew commented Apr 12, 2020

@WongKinYiu Hi,

> which cfg do you want to use?

I already used CSPResNeXt50-PANet-SPP and got a good result, but the training time is high, so I am going to use CSPDarknet53-PANet-SPP.

> and is your dataset larger than mscoco?

Yes, I have a big dataset: about 300 classes and 1M images. My dataset contains traffic signs.

Which cfg is good for this case? Accuracy is important to me.

Thanks

WongKinYiu commented Apr 12, 2020

For this case, you can use:

  1. CSPDarknet53-PANet-SPP: 512x512 input/42.4 AP/64.5 AP50
    [imagenet pretrained] [coco pretrained]

  2. CSPDarknet53-PANet-SPP(Mish): 512x512 input/43.0 AP/64.9 AP50
    [imagenet pretrained] [coco pretrained]

If your dataset is larger than MS COCO, you can consider using the ImageNet pretrained model (partial 104). If you hope the model converges quickly, you can use the MS COCO pretrained model (partial 135).

@sctrueew

@WongKinYiu Hi,

My dataset is larger than mscoco. Can I use a 608 network size to get higher accuracy?

Thanks

@WongKinYiu

Does your dataset contain many small objects?
If yes, training with a 608 network size will definitely give higher accuracy.

@sctrueew

@WongKinYiu Hi,

Yes, some objects are small. Where can I download the pre-trained weights for CSPDarknet53-PANet-SPP(Mish)?

@WongKinYiu

here #6 (comment)

@sctrueew

@WongKinYiu Hi,

Thanks for the reply. Are these commands right for generating the pre-trained weights?

ImageNet dataset

darknet partial cfg/cd53paspp-omega.cfg cd53paspp-omega_final.weights cd53paspp-omega_final.weights.conv.104 104

COCO dataset

darknet partial cfg/cd53paspp-omega.cfg cd53paspp-omega_final.weights cd53paspp-omega_final.weights.conv.135 135

@WongKinYiu

No, the weights file of the ImageNet pretrained model is csdarknet53-omega_final.weights.

@sctrueew

@WongKinYiu Hi,

Sorry, I thought I had to generate the pre-trained weights with that command.

Thanks a lot

@sctrueew

@WongKinYiu Hi,

I have a problem: sometimes some pictures are not detected, or are detected incorrectly. I attached my model and some images for testing. Could you please check it and guide me? I have about 2K images per class. Please give me some information about the hyperparameters for my case.

[attached file]

Thanks in advance

@WongKinYiu

Hello, how do you calculate anchors?
Could you show the object count for each class?

sctrueew commented Apr 21, 2020

@WongKinYiu Hi, Thanks for the reply

Did you test it?

> hello, how u calculate anchors?

darknet detector calc_anchors a.obj -num_of_clusters 9 -width 608 -height 608

> could u show object number of each classes?

1.txt

please rename 1.txt to 1.zip.
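
As an aside, a rough Python sketch of IoU-based k-means anchor clustering in the spirit of the calc_anchors command above (illustrative, under assumed behavior; not the actual darknet implementation):

```python
import numpy as np

def iou_wh(boxes, anchors):
    # boxes: (N, 2), anchors: (K, 2), both (w, h) scaled to the network resolution
    inter = np.minimum(boxes[:, None, 0], anchors[None, :, 0]) * \
            np.minimum(boxes[:, None, 1], anchors[None, :, 1])
    union = boxes[:, 0] * boxes[:, 1]
    union = union[:, None] + anchors[:, 0] * anchors[:, 1] - inter
    return inter / union

def kmeans_anchors(wh, k=9, iters=100, seed=0):
    rng = np.random.default_rng(seed)
    anchors = wh[rng.choice(len(wh), size=k, replace=False)].astype(float)
    for _ in range(iters):
        assign = np.argmax(iou_wh(wh, anchors), axis=1)      # best-matching anchor per box
        for j in range(k):
            if np.any(assign == j):
                anchors[j] = wh[assign == j].mean(axis=0)    # move anchor to cluster mean
    return anchors[np.argsort(anchors.prod(axis=1))]         # sort by area, small to large

# wh: ground-truth (w, h) pairs already scaled to 608x608, e.g. np.array([[w1, h1], ...])
# print(kmeans_anchors(wh, k=9).round(0))
```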

@Shraddha767

Where can we add DropBlock in the yolov4 cfg?
