How can an image be input into a model to output its scene graph information and bounding box information for visualization? #7
Hi :). I was also deeply impressed by the EGTR paper, and I am facing the same issue. As I am not very knowledgeable in the SGG domain, I only understand that it involves predicting relationships between objects. I need to perform SGG as a preliminary step, to use the scene graph embedding for downstream tasks. I would like to use EGTR as the SGG module, but could you please let me know how the output scene graph is generated?
First of all, I have not attempted to extract the scene graph itself, other than computing the evaluation metrics, but the following should work:

```python
from glob import glob

import torch
from PIL import Image

from model.deformable_detr import DeformableDetrConfig, DeformableDetrFeatureExtractor
from model.egtr import DetrForSceneGraphGeneration

# config
architecture = "SenseTime/deformable-detr"
min_size = 800
max_size = 1333
artifact_path = YOUR_ARTIFACT_PATH

# feature extractor
feature_extractor = DeformableDetrFeatureExtractor.from_pretrained(
    architecture, size=min_size, max_size=max_size
)

# inference image
image = Image.open(YOUR_IMAGE_PATH)
image = feature_extractor(image, return_tensors="pt")

# model
config = DeformableDetrConfig.from_pretrained(artifact_path)
model = DetrForSceneGraphGeneration.from_pretrained(
    architecture, config=config, ignore_mismatched_sizes=True
)
ckpt_path = sorted(
    glob(f"{artifact_path}/checkpoints/epoch=*.ckpt"),
    key=lambda x: int(x.split("epoch=")[1].split("-")[0]),
)[-1]
state_dict = torch.load(ckpt_path, map_location="cpu")["state_dict"]
for k in list(state_dict.keys()):
    state_dict[k[6:]] = state_dict.pop(k)  # strip the "model." prefix
model.load_state_dict(state_dict)
model.cuda()
model.eval()

# output
outputs = model(
    pixel_values=image["pixel_values"].cuda(),
    pixel_mask=image["pixel_mask"].cuda(),
    output_attention_states=True,
)

pred_logits = outputs["logits"][0]
obj_scores, pred_classes = torch.max(pred_logits.softmax(-1), -1)
pred_boxes = outputs["pred_boxes"][0]

pred_connectivity = outputs["pred_connectivity"][0]
pred_rel = outputs["pred_rel"][0]
pred_rel = torch.mul(pred_rel, pred_connectivity)

# get valid objects and triplets
obj_threshold = YOUR_OBJ_THRESHOLD
valid_obj_indices = (obj_scores >= obj_threshold).nonzero()[:, 0]

valid_obj_classes = pred_classes[valid_obj_indices]  # [num_valid_objects]
valid_obj_boxes = pred_boxes[valid_obj_indices]  # [num_valid_objects, 4]

rel_threshold = YOUR_REL_THRESHOLD
valid_triplets = (
    pred_rel[valid_obj_indices][:, valid_obj_indices] >= rel_threshold
).nonzero()  # [num_valid_triplets, 3]
```

You can generate a scene graph based on valid_obj_classes, valid_obj_boxes, and valid_triplets.
I built the scene graph using thresholds in this example, but it could also be implemented by selecting the top-k objects or triplets.
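For reference, a minimal sketch of turning valid_triplets into human-readable (subject, predicate, object) tuples. The `obj_id2label` and `rel_id2label` mappings below are hypothetical placeholders, not part of the EGTR API; substitute your dataset's object and predicate vocabularies (e.g., from Visual Genome):

```python
# Hypothetical label mappings; replace with your dataset's actual vocabularies.
obj_id2label = {0: "person", 1: "dog", 2: "frisbee"}  # placeholder
rel_id2label = {0: "holding", 1: "throwing"}  # placeholder

# valid_triplets has shape [num_valid_triplets, 3]:
# (subject index, object index, relation class), where the first two
# columns index into the filtered valid_obj_* arrays.
for subj_idx, obj_idx, rel_idx in valid_triplets.tolist():
    subj_label = obj_id2label[valid_obj_classes[subj_idx].item()]
    obj_label = obj_id2label[valid_obj_classes[obj_idx].item()]
    rel_label = rel_id2label[rel_idx]
    print(f"({subj_label}, {rel_label}, {obj_label})")
```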
(1) obj boxes

As I mentioned before, pred_boxes are in cxcywh format.

(2) obj scores

We used Deformable DETR rather than DETR, and Deformable DETR uses focal loss for object detection instead of cross-entropy loss. Therefore, it would be more natural to use sigmoid instead of softmax (https://github.com/huggingface/transformers/blob/409fcfdfccde77a14b7cc36972b774cabc371ae1/src/transformers/models/deformable_detr/image_processing_deformable_detr.py#L1555), but we used softmax to get obj_scores.
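For visualization, the normalized cxcywh boxes need to be converted to absolute corner coordinates first. A minimal sketch, assuming `valid_obj_boxes` from the snippet above and the original PIL image; the helper name `cxcywh_to_xyxy` is mine, not part of the repo:

```python
import torch

def cxcywh_to_xyxy(boxes: torch.Tensor, width: int, height: int) -> torch.Tensor:
    """Convert normalized (cx, cy, w, h) boxes to absolute (x1, y1, x2, y2) pixels."""
    cx, cy, w, h = boxes.unbind(-1)
    x1 = (cx - 0.5 * w) * width
    y1 = (cy - 0.5 * h) * height
    x2 = (cx + 0.5 * w) * width
    y2 = (cy + 0.5 * h) * height
    return torch.stack([x1, y1, x2, y2], dim=-1)

# Example usage (width/height come from the original image, not the resized tensor):
# width, height = Image.open(YOUR_IMAGE_PATH).size
# abs_boxes = cxcywh_to_xyxy(valid_obj_boxes.cpu(), width, height)
```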
@Aoihigashi Yes. The obj_threshold and rel_threshold for me are 0.3 and 1e-4. But based on my tests, I strongly suggest directly selecting the top-k triplets based on their scores. You can then read the function where the author calculates the loss in train_egtr.py, which demonstrates how to select the top-k triplets and perform the transformation.
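In case it helps others, a minimal sketch of that top-k alternative, assuming `pred_rel` is the connectivity-masked [num_queries, num_queries, num_rel] tensor from the earlier snippet; the value of k is arbitrary here:

```python
import torch

k = 20  # number of triplets to keep; tune for your downstream task

# Flatten the [num_queries, num_queries, num_rel] relation scores and take
# the k highest-scoring entries globally.
flat_scores = pred_rel.flatten()
topk_scores, topk_flat_indices = flat_scores.topk(k)

# Recover (subject query, object query, relation class) from the flat indices.
num_queries, _, num_rel = pred_rel.shape
subj_idx = torch.div(topk_flat_indices, num_queries * num_rel, rounding_mode="floor")
rest = topk_flat_indices % (num_queries * num_rel)
obj_idx = torch.div(rest, num_rel, rounding_mode="floor")
rel_idx = rest % num_rel

topk_triplets = torch.stack([subj_idx, obj_idx, rel_idx], dim=-1)  # [k, 3]
```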
First of all, the work is very instructive, thank you! As a novice to scene graph generation, my current research requires extracting information from images using scene graph generation. So I'm curious about how to use the model's output to generate a scene graph, and I'm confused about how to visualize that output. Any guidance would be greatly appreciated.
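Not an official answer, but a rough visualization sketch along these lines may help. It assumes boxes already converted to absolute xyxy pixels, one string label per box, and decoded triplet strings from the snippets above; matplotlib is not a dependency of this repo:

```python
import matplotlib.patches as patches
import matplotlib.pyplot as plt
from PIL import Image

def draw_scene_graph(image_path, abs_boxes, labels, triplet_strings):
    """Draw labeled boxes on the image and list the predicted triplets above it."""
    image = Image.open(image_path)
    fig, ax = plt.subplots(figsize=(10, 8))
    ax.imshow(image)
    for (x1, y1, x2, y2), label in zip(abs_boxes.tolist(), labels):
        rect = patches.Rectangle(
            (x1, y1), x2 - x1, y2 - y1, fill=False, edgecolor="red", linewidth=2
        )
        ax.add_patch(rect)
        ax.text(x1, y1, label, color="white", backgroundcolor="red", fontsize=9)
    # Show the predicted relations as the title.
    ax.set_title("\n".join(triplet_strings), fontsize=9)
    ax.axis("off")
    plt.show()
```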