
Similarities computed using motion and text embeddings are incorrect #27

Open · sohananisetty opened this issue Sep 25, 2023 · 2 comments

sohananisetty commented Sep 25, 2023

Following CLIP, where image and text embeddings are compared by similarity to retrieve the best-matching text, I tried the same with motion and text embeddings, but it does not work.

E.g., using the AMASS dataset with batch size bs = 2 and texts 'jump' and 'dancing':

import torch
import clip

# Encode the motions and L2-normalize, as CLIP does for images.
emb = enc.encode_motions(batch["x"]).float().to(device)
emb /= emb.norm(dim=-1, keepdim=True)

# Encode the corresponding texts with CLIP and L2-normalize.
text_inputs = torch.cat([clip.tokenize(c) for c in batch["clip_text"]]).to(device)
text_features = clip_model.encode_text(text_inputs).float()
text_features /= text_features.norm(dim=-1, keepdim=True)

# Scaled cosine similarities, softmaxed over the candidate texts.
logit_scale = clip_model.logit_scale.exp()
similarity = (logit_scale * emb @ text_features.T).softmax(dim=-1)

# Rank all candidate texts for the first motion in the batch.
values, indices = similarity[0].topk(len(batch["clip_text"]))

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{batch['clip_text'][index]:>16s}: {100 * value.item():.2f}%")

Expected output for similarity[0]: a high "jump" probability. Instead, I get a high "dancing" probability. I have tested this with multiple batches, and the correct text does not get the highest similarity the majority of the time. Am I running inference incorrectly?

GuyTevet (Owner) commented Oct 1, 2023

That's weird. Your code looks good to me, but we do know that cosine similarity should work to some extent, based on the action classification experiment. Did you try using it as a reference?
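
For reference, the evaluation mentioned above amounts to ranking each motion's similarity against all class texts and counting top-k hits. Below is a minimal sketch, assuming precomputed, L2-normalized motion_embs (N×D) and text_embs (C×D) tensors plus per-motion ground-truth labels; these names are hypothetical and not taken from the repo:

import torch

def topk_accuracy(motion_embs, text_embs, labels, k=5):
    # motion_embs: (N, D) normalized motion embeddings (hypothetical input)
    # text_embs:   (C, D) normalized text embeddings, one per action class
    # labels:      (N,) ground-truth class index for each motion
    sims = motion_embs @ text_embs.T           # (N, C) cosine similarities
    topk = sims.topk(k, dim=-1).indices        # (N, k) best-matching classes
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# e.g. print(f"Top-5 Acc. : {100 * topk_accuracy(m, t, y, k=5):.2f}%")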

sohananisetty (Author) commented
I ran the action classification script with the general model and got:

Top-5 Acc. : 29.86%  (637/2133)
Top-1 Acc. : 13.41%  (286/2133)

Using the finetuned model:

Top-5 Acc. : 63.72%  (1354/2125)
Top-1 Acc. : 44.99%  (956/2125)

I assumed the zero-shot nature of CLIP would at least provide some generalizability, but that does not seem to be the case.
