Problems decoding the predicted bboxes in TSV #9

Open
AndresPMD opened this issue Jun 11, 2020 · 7 comments

@AndresPMD

Hello @shilrley6,
Congrats on your repo, it is quite useful.
Nonetheless, it seems I have found an error when decoding the boxes from the output TSV file. The image features are decoded correctly, but the bounding boxes are not.
I was testing the generate_tsv.py file, and even though the predicted bboxes are correct, the data gets changed at the moment of encoding and storing.

Have you encountered this issue? Any suggestions?
In my case, I could recompute another TSV and avoid storing the bboxes with this encoding... but recomputing would take a lot of time.
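For reference, this is a minimal sketch of the read-side pattern I am using, assuming the FIELDNAMES layout from generate_tsv.py (the TSV path is a placeholder):

```python
import base64
import csv
import sys

import numpy as np

csv.field_size_limit(sys.maxsize)
FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

with open('test.tsv') as f:  # placeholder path
    for item in csv.DictReader(f, delimiter='\t', fieldnames=FIELDNAMES):
        num_boxes = int(item['num_boxes'])
        # The features decode correctly as float32 ...
        features = np.frombuffer(base64.b64decode(item['features']),
                                 dtype=np.float32).reshape(num_boxes, -1)
        # ... but the boxes, decoded the same way, come out wrong:
        # frombuffer with the wrong dtype silently reinterprets the bytes.
        boxes = np.frombuffer(base64.b64decode(item['boxes']),
                              dtype=np.float32).reshape(num_boxes, -1)
```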

@titaiwangms

Have you solved this bug? I am facing the same problem...

@AndresPMD

Yes, I did...I wrote a modified script for it.

#!/usr/bin/env python
"""Generate bottom-up attention features as a tsv file. Can use cuda and multiple GPUs.
Modify the load_image_ids script as necessary for your data location.

Example:
    python generate_tsv.py --net res101 --dataset vg --out test.csv --cuda
"""

from __future__ import absolute_import
from __future__ import division
from __future__ import print_function

import _init_paths
import os
import sys
import numpy as np
import argparse
import pprint
import pdb
import time
import cv2
import csv
import torch
import base64
import json
from utils.timer import Timer
from torch.autograd import Variable
import torch.nn as nn
import torch.optim as optim

import torchvision.transforms as transforms
import torchvision.datasets as dset

# from scipy.misc import imread  # deprecated in newer scipy; imageio is used instead
from imageio import imread
from roi_data_layer.roidb import combined_roidb
from roi_data_layer.roibatchLoader import roibatchLoader
from model.utils.config import cfg, cfg_from_file, cfg_from_list, get_output_dir
from model.rpn.bbox_transform import clip_boxes

# from model.nms.nms_wrapper import nms  # old wrapper; replaced by roi_layers nms
from model.roi_layers import nms
from model.rpn.bbox_transform import bbox_transform_inv
from model.utils.net_utils import save_net, load_net, vis_detections
from model.utils.blob import im_list_to_blob
from model.faster_rcnn.vgg16 import vgg16
from model.faster_rcnn.resnet import resnet
import pdb
from tqdm import tqdm

try:
    xrange          # Python 2
except NameError:
    xrange = range  # Python 3

csv.field_size_limit(sys.maxsize)

FIELDNAMES = ['image_id', 'image_w', 'image_h', 'num_boxes', 'boxes', 'features']

# Settings for the number of features per image. To re-create pretrained features
# with 36 features per image, set both values to 36.
MIN_BOXES = 36
MAX_BOXES = 36
# MIN_BOXES = 10
# MAX_BOXES = 100

def parse_args():
    """
    Parse input arguments
    """
    parser = argparse.ArgumentParser(description='Generate bbox output from a Fast R-CNN network')
    parser.add_argument('--dataset', dest='dataset',
                        help='training dataset',
                        default='vg', type=str)
    parser.add_argument('--net', dest='net',
                        help='vgg16, res50, res101, res152',
                        default='res101', type=str)
    parser.add_argument('--load_dir', dest='load_dir',
                        help='directory to load models',
                        default="models")
    parser.add_argument('--cuda', dest='cuda',
                        help='whether use CUDA',
                        action='store_true')
    parser.add_argument('--mGPUs', dest='mGPUs',
                        help='whether use multiple GPUs',
                        action='store_true')
    parser.add_argument('--image_dir', dest='image_dir',
                        help='directory to load images',
                        default="images")
    parser.add_argument('--classes_dir', dest='classes_dir',
                        help='directory to load object classes for classification',
                        default="data/genome/1600-400-20")
    parser.add_argument('--out', dest='outfile',
                        help='output filepath',
                        default=None, type=str)
    parser.add_argument('--cfg', dest='cfg_file',
                        help='optional config file',
                        default='cfgs/res101.yml', type=str)
    parser.add_argument('--set', dest='set_cfgs',
                        help='set config keys', default=None,
                        nargs=argparse.REMAINDER)
    parser.add_argument('--cag', dest='class_agnostic',
                        help='whether perform class_agnostic bbox regression',
                        action='store_true')
    parser.add_argument('--split', dest='data_split',
                        help='dataset to use',
                        default='stacmr', type=str)

    if len(sys.argv) == 1:
        parser.print_help()
        sys.exit(1)

    args = parser.parse_args()

    return args

lr = cfg.TRAIN.LEARNING_RATE
momentum = cfg.TRAIN.MOMENTUM
weight_decay = cfg.TRAIN.WEIGHT_DECAY

def _get_image_blob(im):
    """Converts an image into a network input.
    Arguments:
        im (ndarray): a color image in BGR order
    Returns:
        blob (ndarray): a data blob holding an image pyramid
        im_scale_factors (list): list of image scales (relative to im) used
            in the image pyramid
    """
    im_orig = im.astype(np.float32, copy=True)
    im_orig -= cfg.PIXEL_MEANS

    im_shape = im_orig.shape
    im_size_min = np.min(im_shape[0:2])
    im_size_max = np.max(im_shape[0:2])

    processed_ims = []
    im_scale_factors = []

    for target_size in cfg.TEST.SCALES:
        im_scale = float(target_size) / float(im_size_min)
        # Prevent the biggest axis from being more than MAX_SIZE
        if np.round(im_scale * im_size_max) > cfg.TEST.MAX_SIZE:
            im_scale = float(cfg.TEST.MAX_SIZE) / float(im_size_max)
        im = cv2.resize(im_orig, None, None, fx=im_scale, fy=im_scale,
                        interpolation=cv2.INTER_LINEAR)
        im_scale_factors.append(im_scale)
        processed_ims.append(im)

    # Create a blob to hold the input images
    blob = im_list_to_blob(processed_ims)

    return blob, np.array(im_scale_factors)

# build a [image_path, image_id] list for the dataset; you can create your own
def load_image_ids(split_name):
    ''' Load a list of (path, image_id) tuples. Modify this to suit your data locations. '''
    split = []
    if split_name == 'coco_test2014':
        with open('/data/coco/annotations/image_info_test2014.json') as f:
            data = json.load(f)
        for item in data['images']:
            image_id = int(item['id'])
            filepath = os.path.join('/data/test2014/', item['file_name'])
            split.append((filepath, image_id))
    elif split_name == 'coco_test2015':
        with open('/data/coco/annotations/image_info_test2015.json') as f:
            data = json.load(f)
        for item in data['images']:
            image_id = int(item['id'])
            filepath = os.path.join('/data/test2015/', item['file_name'])
            split.append((filepath, image_id))
    elif split_name == 'genome':
        with open('/data/visualgenome/image_data.json') as f:
            for item in json.load(f):
                image_id = int(item['image_id'])
                filepath = os.path.join('/data/visualgenome/', item['url'].split('rak248/')[-1])
                split.append((filepath, image_id))
    # ADDED FOR STACMR SUPPORT
    elif split_name == 'stacmr':
        with open("/SSD/Datasets/Coco-Text/ST_CMR_testdataset/New_Split/dataset_cocotext_captioned_full.json", 'r') as f:
            data = json.load(f)
        for item in data['images']:
            image_id = int(item['imgid'])
            filepath = os.path.join('/SSD/Datasets/Coco-Text/ST_CMR_testdataset/Full_STACMR_dataset/images/', item['filename'])
            split.append((filepath, image_id))
    # ADDED FOR FLICKR SUPPORT
    # elif split_name == 'flickr':
    #     with open("/SSD/Datasets/Flickr30K/flickr30k_img.txt", 'r') as f:
    #         data = f.readlines()
    #     for image_id, image_name in enumerate(data):
    #         filepath = os.path.join('/SSD/Datasets/Flickr30K/images/', image_name.strip())
    #         split.append((filepath, image_id))
    # ORIG KARPATHY SPLIT
    elif split_name == 'flickr':
        with open("/SSD/Datasets/Flickr30K/dataset.json", 'r') as f:
            data = json.load(f)
            for item in data['images']:
                image_id = int(item['imgid'])
                filepath = os.path.join('/SSD/Datasets/Flickr30K/images/', item['filename'])
                split.append((filepath, image_id))
    elif split_name == 'context':
        context_img_path = '/SSD/Datasets/Context/data/JPEGImages/'
        context_imgs = os.listdir(context_img_path)
        # Debug only:
        # test_list = []
        # test_list.append(context_imgs[0])
        # test_list.append(context_imgs[1])
        # test_list.append(context_imgs[2])
        # for idx, img in enumerate(test_list):
        for idx, img in enumerate(context_imgs):
            image_id = idx
            filepath = context_img_path + img
            split.append((filepath, image_id))
        with open('./context_image_ids.txt', 'w') as fp:
            for item in split:
                fp.write(str(item) + '\n')
    elif split_name == 'bottles':
        bottles_img_path = '/SSD/Datasets/Drink_Bottle/images/'
        folders = os.listdir(bottles_img_path)
        image_id = 0
        for folder in folders:
            full_path = os.path.join(bottles_img_path, folder)
            bottle_imgs = os.listdir(full_path + '/')
            for img in bottle_imgs:
                filepath = full_path + '/' + img
                split.append((filepath, image_id))
                image_id += 1
        with open('./bottles_image_ids.txt', 'w') as fp:
            for item in split:
                fp.write(str(item) + '\n')
    elif split_name == 'textcaps':
        # MODIFIED TO EXTRACT ONLY TRAINING FEATURES
        textcaps_img_path = '/SSD/Datasets/TextCaps/train_images/'
        with open('/SSD/Datasets/TextCaps/TextCaps_0.1_train.json', 'r') as fp:
            train_data = json.load(fp)
        # with open('/SSD/Datasets/TextCaps/TextCaps_0.1_val.json', 'r') as fp:
        #     val_data = json.load(fp)
        # data = train_data['data'] + val_data['data']
        data = train_data['data']

        processed_ims = set()
        for idx, item in enumerate(data):
            if item['image_id'] not in processed_ims:
                processed_ims.add(item['image_id'])
                image_id = idx
                filepath = textcaps_img_path + item['image_id'] + '.jpg'
                split.append((filepath, image_id))
        # with open('./textcaps_image_ids.txt', 'w') as fp:
        with open('./textcaps_image_ids_train_only.txt', 'w') as fp:
            for item in split:
                fp.write(str(item) + '\n')
    else:
        print('Unknown split')
    return split

def get_detections_from_im(fasterRCNN, classes, im_file, image_id, args, conf_thresh=0.2):
    """Obtain the image_info for each image.
    im_file: the path of the image

    return: dict of {'image_id', 'image_h', 'image_w', 'num_boxes', 'boxes', 'features'}
    boxes: the coordinates of each box
    """
    # initialize the tensor holders here.
    im_data = torch.FloatTensor(1)
    im_info = torch.FloatTensor(1)
    num_boxes = torch.LongTensor(1)
    gt_boxes = torch.FloatTensor(1)

    # ship to cuda
    if args.cuda > 0:
        im_data = im_data.cuda()
        im_info = im_info.cuda()
        num_boxes = num_boxes.cuda()
        gt_boxes = gt_boxes.cuda()

    # make variable
    with torch.no_grad():
        im_data = Variable(im_data)
        im_info = Variable(im_info)
        num_boxes = Variable(num_boxes)
        gt_boxes = Variable(gt_boxes)

    if args.cuda > 0:
        cfg.CUDA = True
        fasterRCNN.cuda()

    fasterRCNN.eval()

    # load images
    # im = cv2.imread(im_file)
    im_in = np.array(imread(im_file))
    if len(im_in.shape) == 2:
        im_in = im_in[:, :, np.newaxis]
        im_in = np.concatenate((im_in, im_in, im_in), axis=2)
    # rgb -> bgr
    im = im_in[:, :, ::-1]

    vis = True

    blobs, im_scales = _get_image_blob(im)
    assert len(im_scales) == 1, "Only single-image batch implemented"
    im_blob = blobs
    im_info_np = np.array([[im_blob.shape[1], im_blob.shape[2], im_scales[0]]], dtype=np.float32)

    im_data_pt = torch.from_numpy(im_blob)
    im_data_pt = im_data_pt.permute(0, 3, 1, 2)
    im_info_pt = torch.from_numpy(im_info_np)

    with torch.no_grad():
        im_data.resize_(im_data_pt.size()).copy_(im_data_pt)
        im_info.resize_(im_info_pt.size()).copy_(im_info_pt)
        gt_boxes.resize_(1, 1, 5).zero_()
        num_boxes.resize_(1).zero_()
    det_tic = time.time()

    # the region features [box_num * 2048] are required.
    rois, cls_prob, bbox_pred, \
    rpn_loss_cls, rpn_loss_box, \
    RCNN_loss_cls, RCNN_loss_bbox, \
    rois_label, pooled_feat = fasterRCNN(im_data, im_info, gt_boxes, num_boxes, pool_feat=True)

    scores = cls_prob.data
    boxes = rois.data[:, :, 1:5]

    if cfg.TEST.BBOX_REG:
        # Apply bounding-box regression deltas
        box_deltas = bbox_pred.data
        if cfg.TRAIN.BBOX_NORMALIZE_TARGETS_PRECOMPUTED:
            # Optionally normalize targets by a precomputed mean and stdev
            if args.class_agnostic:
                if args.cuda > 0:
                    box_deltas = box_deltas.view(-1, 4) * torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_STDS).cuda() \
                               + torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_MEANS).cuda()
                else:
                    box_deltas = box_deltas.view(-1, 4) * torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_STDS) \
                               + torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_MEANS)
                box_deltas = box_deltas.view(1, -1, 4)
            else:
                if args.cuda > 0:
                    box_deltas = box_deltas.view(-1, 4) * torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_STDS).cuda() \
                               + torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_MEANS).cuda()
                else:
                    box_deltas = box_deltas.view(-1, 4) * torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_STDS) \
                               + torch.FloatTensor(cfg.TRAIN.BBOX_NORMALIZE_MEANS)
                box_deltas = box_deltas.view(1, -1, 4 * len(classes))

        pred_boxes = bbox_transform_inv(boxes, box_deltas, 1)
        pred_boxes = clip_boxes(pred_boxes, im_info.data, 1)
    else:
        # Simply repeat the boxes, once for each class
        pred_boxes = np.tile(boxes, (1, scores.shape[1]))

    pred_boxes /= im_scales[0]

    scores = scores.squeeze()
    pred_boxes = pred_boxes.squeeze()

    det_toc = time.time()
    detect_time = det_toc - det_tic
    misc_tic = time.time()

    max_conf = torch.zeros((pred_boxes.shape[0]))
    if args.cuda > 0:
        max_conf = max_conf.cuda()

    if vis:
        im2show = np.copy(im)
    for j in xrange(1, len(classes)):
        inds = torch.nonzero(scores[:, j] > conf_thresh).view(-1)
        # if there is det
        if inds.numel() > 0:
            cls_scores = scores[:, j][inds]
            _, order = torch.sort(cls_scores, 0, True)
            if args.class_agnostic:
                cls_boxes = pred_boxes[inds, :]
            else:
                cls_boxes = pred_boxes[inds][:, j * 4:(j + 1) * 4]

            cls_dets = torch.cat((cls_boxes, cls_scores.unsqueeze(1)), 1)
            # cls_dets = torch.cat((cls_boxes, cls_scores), 1)
            cls_dets = cls_dets[order]
            # keep = nms(cls_dets, cfg.TEST.NMS, force_cpu=not cfg.USE_GPU_NMS)
            keep = nms(cls_boxes[order, :], cls_scores[order], cfg.TEST.NMS)
            cls_dets = cls_dets[keep.view(-1).long()]
            index = inds[order[keep]]
            max_conf[index] = torch.where(scores[index, j] > max_conf[index], scores[index, j], max_conf[index])
            if vis:
                im2show = vis_detections(im2show, classes[j], cls_dets.cpu().numpy(), 0.5)

    if args.cuda > 0:
        keep_boxes = torch.where(max_conf >= conf_thresh, max_conf, torch.tensor(0.0).cuda())
    else:
        keep_boxes = torch.where(max_conf >= conf_thresh, max_conf, torch.tensor(0.0))
    keep_boxes = torch.squeeze(torch.nonzero(keep_boxes))
    try:
        len_keep_boxes = len(keep_boxes)
    except TypeError:
        len_keep_boxes = 0
    if len_keep_boxes < MIN_BOXES:
        keep_boxes = torch.argsort(max_conf, descending=True)[:MIN_BOXES]
    elif len_keep_boxes > MAX_BOXES:
        keep_boxes = torch.argsort(max_conf, descending=True)[:MAX_BOXES]

    # For each kept box, pick the coordinates of its highest-scoring class
    objects = torch.argmax(scores[keep_boxes][:, 1:], dim=1)
    box_dets = np.zeros((len(keep_boxes), 4), dtype=np.float32)
    boxes = pred_boxes[keep_boxes]
    for i in range(len(keep_boxes)):
        kind = objects[i] + 1
        bbox = boxes[i, kind * 4: (kind + 1) * 4]
        box_dets[i] = np.array(bbox.cpu())

    return {
        'image_id': image_id,
        'image_h': np.size(im, 0),
        'image_w': np.size(im, 1),
        'num_boxes': len(keep_boxes),
        'boxes': box_dets,
        'features': base64.b64encode(pooled_feat[keep_boxes].cpu().detach().numpy())
    }

def load_model(args):
    # set cfg according to the dataset used to train the pre-trained model
    if args.dataset == "pascal_voc":
        args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]']
    elif args.dataset == "pascal_voc_0712":
        args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]']
    elif args.dataset == "coco":
        args.set_cfgs = ['ANCHOR_SCALES', '[4, 8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]']
    elif args.dataset == "imagenet":
        args.set_cfgs = ['ANCHOR_SCALES', '[8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]']
    elif args.dataset == "vg":
        args.set_cfgs = ['ANCHOR_SCALES', '[4, 8, 16, 32]', 'ANCHOR_RATIOS', '[0.5,1,2]']

    if args.cfg_file is not None:
        cfg_from_file(args.cfg_file)
    if args.set_cfgs is not None:
        cfg_from_list(args.set_cfgs)

    cfg.USE_GPU_NMS = args.cuda

    print('Using config:')
    pprint.pprint(cfg)
    np.random.seed(cfg.RNG_SEED)

    # Load classes
    classes = ['__background__']
    with open(os.path.join(args.classes_dir, 'objects_vocab.txt')) as f:
        for object in f.readlines():
            classes.append(object.split(',')[0].lower().strip())

    if not os.path.exists(args.load_dir):
        raise Exception('There is no input directory for loading network from ' + args.load_dir)
    load_name = os.path.join(args.load_dir, 'faster_rcnn_{}_{}.pth'.format(args.net, args.dataset))

    # initialize the network here. the network used to train the pre-trained model
    if args.net == 'vgg16':
        fasterRCNN = vgg16(classes, pretrained=False, class_agnostic=args.class_agnostic)
    elif args.net == 'res101':
        fasterRCNN = resnet(classes, 101, pretrained=False, class_agnostic=args.class_agnostic)
    elif args.net == 'res50':
        fasterRCNN = resnet(classes, 50, pretrained=False, class_agnostic=args.class_agnostic)
    elif args.net == 'res152':
        fasterRCNN = resnet(classes, 152, pretrained=False, class_agnostic=args.class_agnostic)
    else:
        print("network is not defined")

    fasterRCNN.create_architecture()

    print("load checkpoint %s" % (load_name))
    if args.cuda > 0:
        checkpoint = torch.load(load_name)
    else:
        checkpoint = torch.load(load_name, map_location=(lambda storage, loc: storage))
    fasterRCNN.load_state_dict(checkpoint['model'])
    if 'pooling_mode' in checkpoint.keys():
        cfg.POOLING_MODE = checkpoint['pooling_mode']

    print('load model successfully!')
    print("load model %s" % (load_name))

    return classes, fasterRCNN

def generate_npy(outfile, image_ids, args):
    # First check if file exists, and if it is complete
    # image_ids: [image_path, image_id]
    classes, fasterRCNN = load_model(args)
    _t = {'misc': Timer()}
    count = 0
    dataset_bboxes = np.zeros((len(image_ids), 36, 4), dtype="float32")
    for im_file, image_id in tqdm(image_ids):
        predictions = get_detections_from_im(fasterRCNN, classes, im_file, image_id, args)

        image_h = predictions['image_h']
        image_w = predictions['image_w']
        # Normalize box coordinates by image width/height
        temp = predictions['boxes']
        temp[:, 0] = temp[:, 0] / image_w
        temp[:, 2] = temp[:, 2] / image_w
        temp[:, 1] = temp[:, 1] / image_h
        temp[:, 3] = temp[:, 3] / image_h
        dataset_bboxes[image_id] = temp

    print('Saving into NUMPY')
    np.save(outfile + '.npy', dataset_bboxes)
    # dataset_bboxes.tolist()
    # with open(outfile + '.json', 'w') as fp:
    #     json.dump(dataset_bboxes, fp)
    print('Complete!')


if __name__ == '__main__':
    args = parse_args()

    print('Called with args:')
    print(args)

    image_ids = load_image_ids(args.data_split)
    # image_ids = [['images/COCO_train2014_000000000151.jpg', 0], ['images/COCO_train2014_000000000368.jpg', 1]]

    generate_npy(args.outfile, image_ids, args)

@titaiwangms

I see. You didn't encode it. Thanks!

@zhuocai

zhuocai commented Jul 30, 2020

(quoting the original issue above)

When decoding 'bbox':

buf = base64.b64decode(data[1:])
temp = np.frombuffer(buf, dtype=np.float32)

I think it should be replaced by dtype=np.float64 to get the correct bounding boxes, though I am not sure about the meaning of the 4 numbers in the bounding box.
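To make the mismatch concrete, here is a self-contained sketch (the box values are just illustrative):

```python
import base64

import numpy as np

box = np.array([[12.5, 30.0, 200.0, 150.0]])   # NumPy defaults to float64
buf = base64.b64decode(base64.b64encode(box))  # round-trip through base64

wrong = np.frombuffer(buf, dtype=np.float32)   # 8 meaningless float32 values
right = np.frombuffer(buf, dtype=np.float64)   # the original 4 coordinates

print(wrong.size)  # 8
print(right)       # [ 12.5  30. 200. 150.]
```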

@cheerss

cheerss commented Dec 7, 2020

(quoting the original issue and zhuocai's float64 suggestion above)

That's really useful for me!

@xiyu0407

(quoting AndresPMD's modified script above)

Hello @AndresPMD, thanks for your work; I can now run generate.py on my computer. However, I encountered a problem: when I run this script on the MSCOCO dataset, I get the error IndexError: index 523573 is out of bounds for axis 0 with size 40775. Can you help me? Thanks a lot.

@vlb9ae

vlb9ae commented Apr 7, 2022

Hi! I had the same issue and came to report it, but I was able to solve it with a really small change to the code base. I'm assuming this bug didn't get caught because the bounding boxes retrieved in convert_data.py are never output; I had added a command line option to choose whether to output features or bounding boxes, and the bounding box outputs looked really odd. It turns out the values that are encoded are of type float64, but the buffer reader is looking for dtype float32! I changed "np.float32" to "np.float64" in convert_data.py, and the bounding boxes retrieved looked good with no other changes :D Hope this is helpful if anyone else needs to use the bounding boxes in the future!

UPDATE: just found the other people saying the same thing, which I missed my first time through since they were hidden between long responses I scrolled past; sorry for the repetition!
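For anyone looking for the exact spot, it is the decode line zhuocai quoted above (the surrounding code in convert_data.py may differ slightly):

```python
buf = base64.b64decode(data[1:])
temp = np.frombuffer(buf, dtype=np.float64)  # was dtype=np.float32
```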
