
TensorflowPredictVGGish #1333

Closed
seunggookim opened this issue May 16, 2023 · 5 comments
@seunggookim commented May 16, 2023
I would like to map shallow-layer activations (MaxPool) back to the patches they came from, but this is difficult because the output is flattened. For example, if I feed a short audio segment covering a single patch (15600 samples at 16 kHz), model/vggish/pool4/MaxPool has shape (24, 512); for a longer audio file it is (24 * n_patches, 512). It is unclear to me whether the flattened first dimension interleaves patches ([p1f1, p2f1, p3f1, ..., p1f2, p2f2, p3f2, ...]) or keeps each patch contiguous ([p1f1, p1f2, p1f3, ..., p2f1, p2f2, p2f3, ...]).
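In NumPy terms, the question is whether a plain C-order reshape recovers the per-patch tensors (patch-contiguous rows) or whether the rows are interleaved across patches. A small sketch of the two candidate layouts, using synthetic data:

```python
import numpy as np

# Hypothetical 3-patch activation: patch p, pooled row r, feature f
n_patches, rows, feats = 3, 24, 512
patches = np.arange(n_patches * rows * feats, dtype=np.float32).reshape(
    n_patches, rows, feats
)

# Patch-contiguous flattening ([p1 rows..., p2 rows..., ...]) is a plain
# C-order reshape:
patch_contiguous = patches.reshape(-1, feats)              # (72, 512)

# Interleaved flattening ([p1r1, p2r1, p3r1, ...]) would require moving the
# patch axis inward first:
interleaved = patches.transpose(1, 0, 2).reshape(-1, feats)

# The two layouts have the same shape but different row order
print(np.array_equal(patch_contiguous, interleaved))
```

If the output is patch-contiguous, `reshape` alone undoes the flattening; if it were interleaved, a transpose would be needed as well.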

To figure this out myself, I manually segmented the audio into 15600-sample patches before feeding them to the model, as follows:

import numpy as np
from essentia.standard import MonoLoader, TensorflowPredictVGGish

patchSize_smp = 15600      # 96 frames (95 frame hops + 1 frame)
patchHopSize_smp = 14880   # 93 frames (93 frame hops)
frameSize_smp = 400
frameHopSize_smp = 160

audio = MonoLoader(filename=audio_path, sampleRate=16000)()
audioSize_smp = len(audio)
# Start points of every full patch (0-indexed)
startPoints_smp = np.arange(0, audioSize_smp - patchSize_smp + 1, patchHopSize_smp)

# Extract final-layer embeddings patch by patch
model = TensorflowPredictVGGish(graphFilename=model_path, output="model/vggish/embeddings")
activation = []
for start_smp in startPoints_smp:
    activation.append(model(audio[start_smp:start_smp + patchSize_smp]).copy())
activation = np.array(activation)

Since the final layer ("embeddings") has an unambiguous shape, I compared results for that layer. My approach produced the same number of patches but slightly different activation values (r = 0.6349 to 1; mean r = 0.9743), which is still disturbing. Should I change anything to make these identical to the original function's output?
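For reference, per-patch correlations like the ones quoted above can be computed with `np.corrcoef`; this sketch uses synthetic arrays as stand-ins for the two embedding sets (the actual comparison data is not shown in the thread):

```python
import numpy as np

# Synthetic stand-ins for the two sets of per-patch embeddings
rng = np.random.default_rng(0)
ref = rng.standard_normal((5, 128)).astype(np.float32)
test = ref + 0.1 * rng.standard_normal((5, 128)).astype(np.float32)

# Pearson r between corresponding patches
r = np.array([np.corrcoef(a, b)[0, 1] for a, b in zip(ref, test)])
print(r.min(), r.mean())
```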

@palonso (Contributor) commented May 17, 2023

Hi @seunggookim,
The unsqueezed output of model/vggish/pool4/MaxPool should be (-1, 6, 4, 512); I found this by inspecting this (unofficial) implementation of VGGish.
I don't see why Essentia is not reshaping the output tensor this way, so my guess is that we are not correctly parsing the graph's output nodes to get the shapes of the intermediate output tensors. I will look into that this week.

For now, you can reconstruct the patches with:

np.reshape(activation, (-1, 6, 4, 512))

Regarding your experiment, I would expect slightly different activations: we generate patches in a slightly different way, which can result in different amounts of padding being applied.

@seunggookim (Author)

Thanks @palonso!
I saw mel_features.py in the repo you linked and in another one (perhaps an official version?).

The input image is 96 frames x 64 mel bands, not 64 x 96. With an input of (96, 64), pool4 should be (-1, 6, 4, 512), not (-1, 4, 6, 512). This is not critical for my purpose, but I wonder whether the input image is transposed in what Essentia uses.

@palonso (Contributor) commented May 22, 2023

Right @seunggookim, I confused the axes.

Essentia generates data as (batch, channel, timestamps, features), so the only transposition done internally is putting the channel as the last dimension.
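That transposition corresponds to a single axis permutation; a minimal sketch, assuming the VGGish graph expects channel-last input:

```python
import numpy as np

# Essentia's internal layout for model inputs: (batch, channel, timestamps, features)
x = np.zeros((2, 1, 96, 64), dtype=np.float32)

# The only internal transposition is moving the channel to the last axis,
# the layout the TensorFlow graph expects:
x_channel_last = np.transpose(x, (0, 2, 3, 1))
print(x_channel_last.shape)
```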

@palonso (Contributor) commented May 22, 2023

I updated my previous answer according to your comment.

@palonso (Contributor) commented May 26, 2023

Hi @seunggookim,
I looked into this issue today and found the reason. When we created the model-specific algorithms (e.g., TensorflowPredictVGGish) we assumed the most common use case would be extracting predictions, so we limited the output to 2D: when there are more than two dimensions, all but the last are flattened into the first.
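Since that flattening is a plain C-order reshape, it is lossless as long as the intermediate shape is known; a small sketch of the behavior with synthetic data:

```python
import numpy as np

# Hypothetical 4D intermediate output: (patches, 6, 4, 512)
out4d = np.arange(2 * 6 * 4 * 512, dtype=np.float32).reshape(2, 6, 4, 512)

# The model-specific algorithms flatten all leading axes into the first one:
out2d = out4d.reshape(-1, out4d.shape[-1])   # (48, 512)

# Because this is a C-order reshape, the original shape is recoverable:
restored = out2d.reshape(-1, 6, 4, 512)
print(out2d.shape, restored.shape)
```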

If you want to retrieve outputs with an arbitrary shape, you can use the generic TensorflowPredict algorithm, which requires extra configuration in exchange for the added flexibility. For your case:

from essentia.standard import TensorflowInputVGGish, TensorflowPredict, FrameGenerator
from essentia import Pool
import numpy as np

vggish_path = (
    "/media/data/models/essentia-models/feature-extractors/vggish/audioset-vggish-3.pb"
)
output_node = "model/vggish/pool4/MaxPool"
input_node = "model/Placeholder"

frame_size = 400
hop_size = 160
patch_size = 96
n_melbands = 64

vggish = TensorflowPredict(
    graphFilename=vggish_path, outputs=[output_node], inputs=[input_node]
)
melband_extractor = TensorflowInputVGGish()
pool = Pool()

audio = np.ones(2 * 16000).astype("float32")

# Compute mel bands
melbands = np.array([
    melband_extractor(frame)
    for frame in FrameGenerator(audio, frameSize=frame_size, hopSize=hop_size)
])

# Reshape to patches, discarding trailing frames that do not fill a full patch
trim = len(melbands) % patch_size
if trim:
    melbands = melbands[:-trim]
melbands = np.reshape(melbands, (-1, patch_size, n_melbands))
# Add the channel dimension: (batch, channel, timestamps, features)
melbands = np.expand_dims(melbands, axis=1)
# Put the data into a pool and run the graph
pool.set(input_node, melbands)
output = vggish(pool)[output_node]

print(output.shape)
> (2, 6, 4, 512)
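The patch count in that printed shape follows from the framing arithmetic, assuming frames are taken from sample 0 with no end padding (as the trimming step implies):

```python
frame_size, hop_size, patch_size = 400, 160, 96
n_samples = 2 * 16000  # the 2-second dummy signal above

# Number of full frames, then number of full 96-frame patches
n_frames = (n_samples - frame_size) // hop_size + 1
n_patches = n_frames // patch_size
print(n_frames, n_patches)
```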

@palonso palonso closed this as completed May 26, 2023
palonso added a commit to palonso/essentia that referenced this issue Oct 19, 2023
In other TensorflowPredict algorithms we have preferred to return 2D
outputs, since they typically fit the schema (timestamps, embeddings)
or (timestamps, activations). However, in some cases more dimensions
are required; for example, we need 3D outputs to return attention
layers (batch, tokens, dimensions).
A similar problem has happened before when trying to retrieve the
internal representations of VGGish:
MTG#1333