
Test models with good hyperparameters - +4.8% mAP@0.5 on MS COCO test-dev #4430

Closed · AlexeyAB opened this issue Dec 2, 2019 · 22 comments

AlexeyAB (Owner) commented Dec 2, 2019

Test models with good hyperparameters: #3114 (comment) and #4147 (comment)

batch=64
subdivisions=8
width=608
height=608
channels=3
momentum=0.949
decay=0.0005
angle=0
saturation = 1.5
exposure = 1.5
hue=.1

learning_rate=0.00261 # or 0.122 (so iou~=3.29 and cls & obj ~= 47) as in @glenn-jocher yolov3
burn_in=1000
max_batches = 500500
policy=steps
steps=400000,450000
scales=.1,.1

mosaic=1
[yolo] # or  [Gaussian_yolo]
...

jitter=.3
ignore_thresh = .7
truth_thresh = 1
random=1
scale_x_y = 1.05   # 1.05, 1.10, 1.20
iou_thresh=0.213
cls_normalizer=1.0
iou_normalizer=0.07
uc_normalizer=0.07
iou_loss=ciou
nms_kind=greedynms
beta_nms=0.6
beta1=0.6

  • set iou_normalizer=1 for [yolo]
  • set iou_normalizer=0.07 for [yolo] + C/D/GIoU
  • set iou_normalizer=0.1 and uc_normalizer=0.1 for [Gaussian_yolo]
  • and set iou_normalizer=0.07 and uc_normalizer=0.07 for [Gaussian_yolo] + C/D/GIoU
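For example, a sketch (not a verified config) of a [Gaussian_yolo] section combining the CIoU settings listed above:

[Gaussian_yolo]
...
iou_thresh=0.213
scale_x_y = 1.05
cls_normalizer=1.0
iou_normalizer=0.07
uc_normalizer=0.07
iou_loss=ciou
nms_kind=greedynms
beta_nms=0.6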
sctrueew commented Dec 3, 2019

Hi @AlexeyAB

So we have to change uc_normalizer from 1.0 to 0.1 in Gaussian_yolov3_BDD.cfg - is that right?

AlexeyAB (Owner) commented Dec 3, 2019

@zpmmehrdad Yes.

AlexeyAB (Owner) commented Dec 3, 2019

@glenn-jocher Hi,

Do you use a static learning rate = 0.00261?
Or do you use SGDR (cosine) - a constantly decreasing learning rate?

learning_rate=0.00261
momentum=0.949

glenn-jocher commented
@AlexeyAB I use the original darknet LR scheduler, with drops of *=0.1 at 80% and 90% of total epochs. It's true that a smoother drop may have a slight benefit (I think the BoF paper showed this), but it's likely a very minimal effect. See ultralytics/yolov3#238
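For reference, a minimal PyTorch sketch of that step schedule (the model and total-epoch count are placeholders, not ultralytics code):

import torch

model = torch.nn.Linear(10, 10)   # placeholder model
epochs = 300                      # placeholder total epochs

# SGD with the cfg values above; LR drops of *=0.1 at 80% and 90% of training
optimizer = torch.optim.SGD(model.parameters(), lr=0.00261,
                            momentum=0.949, weight_decay=0.0005)
scheduler = torch.optim.lr_scheduler.MultiStepLR(
    optimizer, milestones=[int(0.8 * epochs), int(0.9 * epochs)], gamma=0.1)

for epoch in range(epochs):
    ...                           # train one epoch
    scheduler.step()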

AlexeyAB (Owner) commented Dec 8, 2019

At least for some datasets, it seems that iou_normalizer=0.07 is too low a value for GIoU, and iou_normalizer=0.5 is much better: #3874 (comment)

glenn-jocher commented Dec 8, 2019

@AlexeyAB yes, you should definitely examine the value of each loss component to ensure that the balancing parameters produce roughly equal loss between the 3 components. In ultralytics/yolov3 they produce magnitudes of about 5, 5, 5 for GIoU, obj, cls on COCO epoch 0. If they produce different magnitudes here, you should adjust accordingly.

BTW, one thing that has always bothered me about the ultralytics/yolov3 loss function is that each of the yolo layers is treated equally (because we take the mean of all the elements in each layer), and I think here you sum all the elements in each layer instead. Is this correct?

In all of the papers I always see mAP_small underperform mAP_large and mAP_medium, and the small-object output grid points far outnumber the large-object output grid points, so it makes sense to me that the small-object layer should generate more loss (yet this is not currently the case at ultralytics). Unfortunately, I experimented with this change in the past without success. What do you think?

AlexeyAB (Owner) commented Dec 8, 2019

@glenn-jocher

> BTW, one thing that has always bothered me about the ultralytics/yolov3 loss function is that each of the yolo layers is treated equally (because we take the mean of all the elements in each layer), and I think here you sum all the elements in each layer instead. Is this correct?

What do you mean?
Each final activation produces a separate delta, which is backpropagated without changes.

> In all of the papers I always see mAP_small underperform mAP_large and mAP_medium, and the small-object output grid points far outnumber the large-object output grid points, so it makes sense to me that the small-object layer should generate more loss (yet this is not currently the case at ultralytics). Unfortunately, I experimented with this change in the past without success. What do you think?

It is just because smaller objects have fewer pixels, especially after resizing to the network size of 416x416.
I think for small objects we should use more anchors, routes to the lower layers, and special blocks (which have many layers but don't lose detailed information).

AlexeyAB (Owner) commented Dec 8, 2019

@glenn-jocher Also, have you thought about rotation/scale-invariant features like SIFT/SURF (rotation/scale-invariant conv layers or something else)?

glenn-jocher commented Dec 8, 2019

@AlexeyAB yes, I've worked a lot with SURF and SIFT, but don't confuse them with object detection. SURF is a faster version of SIFT; they are not AI algorithms. Their purpose is to match points in one image to points in a second image by comparing feature vectors between possible point pairs. This is useful in Structure from Motion (SfM) applications like AR, where it's necessary to know the camera motion between frames to reconstruct a 3D scene, or simply to find an object in a second image that exists in a first image.

But it does not generalize at all: SURF points from a blue car will never match SURF points on a red car, so in this sense it is completely separate from object detection.

Yes, a more targeted strategy for the lower layers is a good idea. But the point I was making is that I think the darknet loss function (if there are no balancers) treats each element the same, whereas the ultralytics loss treats each layer the same (i.e. for 416 there would be 507 + 2028 + 8112 = 10647 loss elements in the 3 layers).

The current ultralytics loss reduces the value of the lower layer elements because it takes a mean() of each layer for the total loss:
loss = mean(layer0_loss) + mean(layer1_loss) + mean(layer2_loss)

I'm thinking that if I take the mean over the entire 10647 anchors instead, this would result in more effective training of the smaller-object layers. I tried this before with poor effect, but maybe I should try again. The current COCO results are:

 Average Precision  (AP) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.243 <--
 Average Precision  (AP) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.450
 Average Precision  (AP) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.514

 Average Recall     (AR) @[ IoU=0.50:0.95 | area= small | maxDets=100 ] = 0.422 <--
 Average Recall     (AR) @[ IoU=0.50:0.95 | area=medium | maxDets=100 ] = 0.640
 Average Recall     (AR) @[ IoU=0.50:0.95 | area= large | maxDets=100 ] = 0.707
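
A minimal PyTorch sketch of the two reductions being compared (layer sizes from the 416 example above; random stand-ins for the per-element losses):

import torch

sizes = [507, 2028, 8112]                # 13x13x3, 26x26x3, 52x52x3 -> 10647 total
losses = [torch.rand(n) for n in sizes]  # stand-in per-element losses

# Per-layer mean: each layer contributes equally, so one small-object
# element carries ~1/8112 of its layer's share vs ~1/507 for large objects.
loss_a = sum(l.mean() for l in losses)

# Global mean over all 10647 elements: the small-object layer now
# contributes in proportion to its element count.
loss_b = torch.cat(losses).mean()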

AlexeyAB (Owner) commented Dec 9, 2019

@glenn-jocher

> The current ultralytics loss reduces the value of the lower layer elements because it takes a mean() of each layer for the total loss:
> loss = mean(layer0_loss) + mean(layer1_loss) + mean(layer2_loss)

Does this calculation only affect the display of the total loss on the screen? Or does it somehow affect the value of each delta that will be backpropagated?

For example:

  • if output[i] = 0.2 (for x=2, y=3, anchor=1, yolo-layer-3)
  • and delta_class[i] = 1 - p = 1 - 0.2 = 0.8 (for the same x=2, y=3, anchor=1, yolo-layer-3),
    then after this loss calculation, loss = mean(layer0_loss) + mean(layer1_loss) + mean(layer2_loss), what value will be backpropagated in ultralytics-yolo - is it 0.8, or what? (See the sketch below.)
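
A minimal sketch of why the reduction matters for the backpropagated value: a mean() over N elements scales each element's gradient by 1/N, so a raw delta of 0.8 would not reach the weights unchanged (toy loss, hypothetical layer size):

import torch

n = 8112                            # e.g. the 52x52x3 output layer
x = torch.zeros(n, requires_grad=True)
loss = (x - 0.8).pow(2).mean()      # toy per-element loss with mean reduction
loss.backward()
print(x.grad[0])                    # -2 * 0.8 / n, not -2 * 0.8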

  • SURF: translation/scale/rotation invariant
  • CONV filter: translation invariant

Both are not color invariant.

A DNN can achieve color/scale/rotation invariance only due to a large number of filters (millions of parameters).
SURF is very demanding on resources, so we will not be able to build the network entirely out of millions of SURFs. But perhaps we can apply a certain number of them in some layers, for example with subsampling. Or something else - there are many algorithms: SIFT, SURF, BRIEF, ORB...

> But it does not generalize at all: SURF points from a blue car will never match SURF points on a red car, so in this sense it is completely separate from object detection.

It depends on whether we want to detect only red cars or cars of any color. Otherwise, we will either have to add SURF descriptors for the blue car etc., or use color-invariant SURF descriptors: https://link.springer.com/chapter/10.1007/978-3-642-35740-4_6


SURF is not an object detection method; it is a method of matching areas in an image that can be used for detecting/tracking objects, with rotation/scale invariance.

In all SURF tutorials, SURF is demonstrated as a method for comparing whole images rather than individual objects. The reason is just that SURF has different efficiencies for different key points in the image: a separate object may or may not contain good points, but the whole image is much more likely to.

Many years ago I successfully used the SURF extractor (Ptr<SurfDescriptorExtractor> extractor = new SurfDescriptorExtractor();) to track an object with rotation and scale invariance, with occlusion and long disappearances, and with instant training - because we can calculate SURF descriptors for any point in any area of the image (including where the object is) and save them to a file.

After detection using the SURF extractor, the area with the object was rotated and scaled, and then other refinement algorithms for object recognition/detection/comparison were applied asynchronously, like similarity checks (PSNR and SSIM) on the GPU, Haar cascades / Viola-Jones object detection, ...
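
For reference, a minimal SURF-matching sketch with OpenCV-Python (SURF is patented and lives in opencv-contrib; the image paths are placeholders):

import cv2

img1 = cv2.imread("object.jpg", cv2.IMREAD_GRAYSCALE)  # placeholder paths
img2 = cv2.imread("scene.jpg", cv2.IMREAD_GRAYSCALE)

surf = cv2.xfeatures2d.SURF_create(hessianThreshold=400)
kp1, des1 = surf.detectAndCompute(img1, None)           # keypoints + descriptors
kp2, des2 = surf.detectAndCompute(img2, None)

# Match descriptors and keep pairs that pass Lowe's ratio test
matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des1, des2, k=2)
good = [m for m, n in matches if m.distance < 0.7 * n.distance]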

nyj-ocean commented

In my case, iou_normalizer=0.07 seems better than iou_normalizer=0.5 in yolov3+Gaussian+CIoU.

iou_normalizer=0.07: [training chart]

iou_normalizer=0.5: [training chart]

AlexeyAB (Owner) commented

@nyj-ocean

  • Can you share both cfg-files in a zip-archive?
  • Did you use the same other params?
  • How many classes are in your dataset?

glenn-jocher commented Dec 14, 2019

@nyj-ocean oh, that's an impressive difference! @AlexeyAB I don't think we should get too hung up on exactly the best normalizer for every situation, because they will all be different depending on many factors, including the custom data, number of classes, class frequency, etc.

I think a robust balancing method would probably sacrifice epoch 0 simply to see what the default balancers produce (i.e. 1, 1, 1), and then restart training using those results to balance the loss components. The steps would roughly be:

  1. Set balancers to 1, 1, 1 for box, obj, cls
  2. Train up to 1 epoch / 10 minutes / 1000 iterations, saving loss component means.
  3. Set balancers to the inverse loss component means (see the sketch after this list).
  4. Train normally.
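
A sketch of that procedure in Python (the measured means are made-up numbers for illustration):

# Steps 1-2: train briefly with balancers 1, 1, 1 and record component means
means = {"box": 5.2, "obj": 0.9, "cls": 12.4}       # made-up measurements

# Step 3: set each balancer to the inverse of its observed mean, rescaled
# so the three components start with roughly equal magnitudes
balancers = {k: 1.0 / v for k, v in means.items()}
scale = len(means) / sum(balancers.values())
balancers = {k: v * scale for k, v in balancers.items()}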

AlexeyAB (Owner) commented

@glenn-jocher Yes, though I think we should keep 1, 0.1, 0.1 for box, obj, cls - at least for high AP@75, and maybe for AP@50 too.

tuteming commented

From your cfg.zip, both cfgs have [yolo] layers rather than [Gaussian_yolo] layers.
Please confirm, thanks.

AlexeyAB (Owner) commented

@nyj-ocean Thanks. In your cfg-file there are [yolo] layers instead of [Gaussian_yolo].

@AlexeyAB changed the title from "Test models with good hyperparameters" to "Test models with good hyperparameters - +4.8% mAP@0.5 on MS COCO test-dev" Jan 2, 2020
@AlexeyAB added the "enhancement" label and removed the "want enhancement (Want to improve accuracy, speed or functionality)" label Jan 2, 2020
nyj-ocean commented

@AlexeyAB

  • Is there a need to add an SE (Squeeze-and-Excitation) module to YOLOv3?

[attached images: Squeeze-and-Excitation block diagrams]
Squeeze-and-Excitation Networks.pdf
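
For reference, a minimal PyTorch sketch of the SE block from the paper (reduction ratio 16 is the paper's default):

import torch
import torch.nn as nn

class SEBlock(nn.Module):
    # Squeeze-and-Excitation: global-average-pool, bottleneck MLP,
    # sigmoid gate, then channel-wise rescaling of the input.
    def __init__(self, channels, reduction=16):
        super().__init__()
        self.fc = nn.Sequential(
            nn.Linear(channels, channels // reduction),
            nn.ReLU(inplace=True),
            nn.Linear(channels // reduction, channels),
            nn.Sigmoid(),
        )

    def forward(self, x):
        b, c, _, _ = x.shape
        w = self.fc(x.mean(dim=(2, 3))).view(b, c, 1, 1)  # squeeze + excite
        return x * w                                      # rescale channels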

AlexeyAB (Owner) commented Jan 8, 2020

@nyj-ocean
Squeeze-and-Excitation blocks were already implemented in enet-coco.cfg (EfficientNetB0-Yolov3) 4 months ago, but it is very slow: https://github.com/AlexeyAB/darknet#pre-trained-models

Open a new issue. Maybe I will benchmark the SE-module and check whether I can improve its speed.

nyj-ocean commented

@AlexeyAB

> Squeeze-and-Excitation blocks were already implemented in enet-coco.cfg (EfficientNetB0-Yolov3) 4 months ago

I added Squeeze-and-Excitation blocks to yolov3.cfg, then trained on my dataset:

  model           mAP
  yolov3          86.03
  yolov3+senet    85.78

The mAP of yolov3+senet is lower than that of plain yolov3.
The result is strange.

AlexeyAB (Owner) commented

@nyj-ocean
If you add SE-blocks to the darknet53 backbone, then you should retrain the classifier to produce a new pre-trained weights file.

becauseofAI commented

> In my case, iou_normalizer=0.07 seems better than iou_normalizer=0.5 in yolov3+Gaussian+CIoU.
>
> iou_normalizer=0.07: [training chart]
>
> iou_normalizer=0.5: [training chart]

@AlexeyAB @nyj-ocean The "good hyperparameters" are effective. But why does the loss function not converge normally?
