
TensorflowPredictVGGish #1333

Closed
seunggookim opened this issue May 16, 2023 · 5 comments
@seunggookim commented May 16, 2023
I would like to map shallow-layer activations (MaxPool) back to the patches they came from, but this is difficult because the output is flattened. For example, if I feed a short audio segment covering a single patch (15600 samples at 16 kHz), model/vggish/pool4/MaxPool has shape (24, 512); for a longer audio file it is (24 * n_patches, 512). It is unclear to me whether the flattened first dimension interleaves patches ([p1f1, p2f1, p3f1, ..., p1f2, p2f2, p3f2, ...]) or keeps each patch contiguous ([p1f1, p1f2, p1f3, ..., p2f1, p2f2, p2f3, ...]).
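In NumPy terms, the question is whether a plain C-order reshape recovers the per-patch tensors (patch-contiguous rows) or whether the rows are interleaved across patches. A small sketch of the two candidate layouts, using synthetic data:

```python
import numpy as np

# Hypothetical 3-patch activation: patch p, pooled row r, feature f
n_patches, rows, feats = 3, 24, 512
patches = np.arange(n_patches * rows * feats, dtype=np.float32).reshape(
    n_patches, rows, feats
)

# Patch-contiguous flattening ([p1 rows..., p2 rows..., ...]) is a plain
# C-order reshape:
patch_contiguous = patches.reshape(-1, feats)              # (72, 512)

# Interleaved flattening ([p1r1, p2r1, p3r1, ...]) would require moving the
# patch axis inward first:
interleaved = patches.transpose(1, 0, 2).reshape(-1, feats)

# The two layouts have the same shape but different row order
print(np.array_equal(patch_contiguous, interleaved))
```

If the output is patch-contiguous, `reshape` alone undoes the flattening; if it were interleaved, a transpose would be needed as well.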

To figure this out myself, I manually segmented the audio into 15600-sample patches before feeding them to the model, as follows:

import numpy as np
from essentia.standard import MonoLoader, TensorflowPredictVGGish

patchSize_smp = 15600      # 96 frames (95 frame hops + 1 frame)
patchHopSize_smp = 14880   # 93 frames (93 frame hops)
frameSize_smp = 400
frameHopSize_smp = 160

audio = MonoLoader(filename=audio_path, sampleRate=16000)()
audioSize_smp = len(audio)
# Start points of every full patch (0-indexed)
startPoints_smp = np.arange(0, audioSize_smp - patchSize_smp + 1, patchHopSize_smp)

# Extract final-layer embeddings patch by patch
model = TensorflowPredictVGGish(graphFilename=model_path, output="model/vggish/embeddings")
activation = []
for start_smp in startPoints_smp:
    activation.append(model(audio[start_smp:start_smp + patchSize_smp]).copy())
activation = np.array(activation)

Since the final layer ("embeddings") has an unambiguous shape, I compared results for that layer. My approach produced the same number of patches but slightly different activation values (r = 0.6349 to 1; mean r = 0.9743), which is still disturbing. Should I change anything to make these identical to the original function's output?
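For reference, per-patch correlations like the ones quoted above can be computed with `np.corrcoef`; this sketch uses synthetic arrays as stand-ins for the two embedding sets (the actual comparison data is not shown in the thread):

```python
import numpy as np

# Synthetic stand-ins for the two sets of per-patch embeddings
rng = np.random.default_rng(0)
ref = rng.standard_normal((5, 128)).astype(np.float32)
test = ref + 0.1 * rng.standard_normal((5, 128)).astype(np.float32)

# Pearson r between corresponding patches
r = np.array([np.corrcoef(a, b)[0, 1] for a, b in zip(ref, test)])
print(r.min(), r.mean())
```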

@palonso (Contributor) commented May 17, 2023

Hi @seunggookim,
The unsqueezed output of model/vggish/pool4/MaxPool should be (-1, 6, 4, 512); I found this by inspecting this (unofficial) implementation of VGGish.
I don't see why Essentia is not reshaping the output tensor this way, so my guess is that we are not correctly parsing the graph's output nodes to get the shapes of the intermediate output tensors. I will look into that this week.

For now, you can reconstruct the patches with:

np.reshape(activation, (-1, 6, 4, 512))

Regarding your experiment, I would expect slightly different activations: we generate patches in a slightly different way, which can result in different amounts of padding being applied.

@seunggookim (Author)

Thanks @palonso!
I saw mel_features.py in the repo you linked and in another one (perhaps an official version?).

The input image is 96 frames x 64 mel bands, not 64 x 96. With an input of (96, 64), pool4 should be (-1, 6, 4, 512), not (-1, 4, 6, 512). This is not critical for my purpose, but I wonder whether the input image is transposed in what Essentia uses.

@palonso (Contributor) commented May 22, 2023

Right @seunggookim, I confused the axes.

Essentia generates data as (batch, channel, timestamps, features), so the only transposition done internally is putting the channel as the last dimension.
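That transposition corresponds to a single axis permutation; a minimal sketch, assuming the VGGish graph expects channel-last input:

```python
import numpy as np

# Essentia's internal layout for model inputs: (batch, channel, timestamps, features)
x = np.zeros((2, 1, 96, 64), dtype=np.float32)

# The only internal transposition is moving the channel to the last axis,
# the layout the TensorFlow graph expects:
x_channel_last = np.transpose(x, (0, 2, 3, 1))
print(x_channel_last.shape)
```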

@palonso (Contributor) commented May 22, 2023

I updated my previous answer according to your comment.

@palonso (Contributor) commented May 26, 2023

Hi @seunggookim,
I looked into this issue today and found the reason. When we created the model-specific algorithms (e.g., TensorflowPredictVGGish) we assumed the most common use case would be extracting predictions, so we limited the output to 2D: when there are more than two dimensions, all but the last are flattened into the first.
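Since that flattening is a plain C-order reshape, it is lossless as long as the intermediate shape is known; a small sketch of the behavior with synthetic data:

```python
import numpy as np

# Hypothetical 4D intermediate output: (patches, 6, 4, 512)
out4d = np.arange(2 * 6 * 4 * 512, dtype=np.float32).reshape(2, 6, 4, 512)

# The model-specific algorithms flatten all leading axes into the first one:
out2d = out4d.reshape(-1, out4d.shape[-1])   # (48, 512)

# Because this is a C-order reshape, the original shape is recoverable:
restored = out2d.reshape(-1, 6, 4, 512)
print(out2d.shape, restored.shape)
```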

If you want to retrieve outputs with an arbitrary shape, you can use the generic TensorflowPredict algorithm, which requires extra configuration in exchange for the added flexibility. For your case:

from essentia.standard import TensorflowInputVGGish, TensorflowPredict, FrameGenerator
from essentia import Pool
import numpy as np

vggish_path = (
    "/media/data/models/essentia-models/feature-extractors/vggish/audioset-vggish-3.pb"
)
output_node = "model/vggish/pool4/MaxPool"
input_node = "model/Placeholder"

frame_size = 400
hop_size = 160
patch_size = 96
n_melbands = 64

vggish = TensorflowPredict(
    graphFilename=vggish_path, outputs=[output_node], inputs=[input_node]
)
melband_extractor = TensorflowInputVGGish()
pool = Pool()

audio = np.ones(2 * 16000).astype("float32")

# Compute mel bands
melbands = np.array([
    melband_extractor(frame)
    for frame in FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size)
])

# Reshape to patches, discarding trailing frames that do not fill a full patch
trim = len(melbands) % patch_size
if trim:
    melbands = melbands[:-trim]
melbands = np.reshape(melbands, (-1, patch_size, n_melbands))
# Add the channel dimension: (batch, channel, timestamps, features)
melbands = np.expand_dims(melbands, axis=1)
# Put the data into a pool and run the graph
pool.set(input_node, melbands)
output = vggish(pool)[output_node]

print(output.shape)
> (2, 6, 4, 512)
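The patch count in that printed shape follows from the framing arithmetic, assuming frames are taken from sample 0 with no end padding (as the trimming step implies):

```python
frame_size, hop_size, patch_size = 400, 160, 96
n_samples = 2 * 16000  # the 2-second dummy signal above

# Number of full frames, then number of full 96-frame patches
n_frames = (n_samples - frame_size) // hop_size + 1
n_patches = n_frames // patch_size
print(n_frames, n_patches)
```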

@palonso palonso closed this as completed May 26, 2023
palonso added a commit to palonso/essentia that referenced this issue Oct 19, 2023
In other TensorflowPredict algorithms we have preferred to return 2D
outputs, since they typically fit the schema (timestamps, embeddings)
or (timestamps, activations). However, in some cases more dimensions
are required; for example, we need 3D outputs to return attention
layers (batch, tokens, dimensions).
A similar problem has happened before when trying to retrieve the
internal representations of VGGish:
MTG#1333