
Similarities computed using motion and text embeddings are incorrect #27

Open · sohananisetty opened this issue Sep 25, 2023 · 2 comments

sohananisetty commented Sep 25, 2023

Following CLIP, where image and text embeddings are compared by similarity to retrieve the best-matching text, I tried the same with motion and text embeddings, but it does not work.

E.g., using the AMASS dataset with batch size bs = 2 and texts 'jump' and 'dancing':

import torch
import clip

# Encode the motions and L2-normalize, as CLIP does for images.
emb = enc.encode_motions(batch["x"]).float().to(device)
emb /= emb.norm(dim=-1, keepdim=True)

# Encode the corresponding texts with CLIP and L2-normalize.
text_inputs = torch.cat([clip.tokenize(c) for c in batch["clip_text"]]).to(device)
text_features = clip_model.encode_text(text_inputs).float()
text_features /= text_features.norm(dim=-1, keepdim=True)

# Scaled cosine similarities, softmaxed over the candidate texts.
logit_scale = clip_model.logit_scale.exp()
similarity = (logit_scale * emb @ text_features.T).softmax(dim=-1)

# Rank all candidate texts for the first motion in the batch.
values, indices = similarity[0].topk(len(batch["clip_text"]))

# Print the result
print("\nTop predictions:\n")
for value, index in zip(values, indices):
    print(f"{batch['clip_text'][index]:>16s}: {100 * value.item():.2f}%")

Expected output for similarity[0]: a high "jump" probability. Instead, I get a high "dancing" probability. I have tested this with multiple batches, and the correct text does not get the highest similarity the majority of the time. Am I running inference incorrectly?

GuyTevet (Owner) commented Oct 1, 2023

That's weird. Your code looks good to me, but we do know that cosine similarity should work to some extent, based on the action classification experiment. Did you try using it as a reference?
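
For reference, the evaluation mentioned above amounts to ranking each motion's similarity against all class texts and counting top-k hits. Below is a minimal sketch, assuming precomputed, L2-normalized motion_embs (N×D) and text_embs (C×D) tensors plus per-motion ground-truth labels; these names are hypothetical and not taken from the repo:

import torch

def topk_accuracy(motion_embs, text_embs, labels, k=5):
    # motion_embs: (N, D) normalized motion embeddings (hypothetical input)
    # text_embs:   (C, D) normalized text embeddings, one per action class
    # labels:      (N,) ground-truth class index for each motion
    sims = motion_embs @ text_embs.T           # (N, C) cosine similarities
    topk = sims.topk(k, dim=-1).indices        # (N, k) best-matching classes
    hits = (topk == labels.unsqueeze(-1)).any(dim=-1)
    return hits.float().mean().item()

# e.g. print(f"Top-5 Acc. : {100 * topk_accuracy(m, t, y, k=5):.2f}%")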

sohananisetty (Author) commented
I ran the action classification script with the general model and got:

Top-5 Acc. : 29.86%  (637/2133)
Top-1 Acc. : 13.41%  (286/2133)

Using the finetuned model:

Top-5 Acc. : 63.72%  (1354/2125)
Top-1 Acc. : 44.99%  (956/2125)

I assumed the zero-shot nature of CLIP would at least provide some generalizability, but that does not seem to be the case.
