Unexpected behaviour change after download worker refactor #302

Open
pabl0 opened this issue Jan 31, 2024 · 9 comments

pabl0 commented Jan 31, 2024

Previously, video2dataset would create and download distribution.process_count shards at a time. So if I had 10 processes and, say, 32 threads configured, it would create ten .tar and .parquet files, and after finishing, the _stats.json file would be created (triggering my automation to upload the downloaded data elsewhere), and video2dataset would move to processing a new shard.

After the recent changes, it seems to be processing hundreds of shards at a time. Also, with process_count 4 and thread_count 16, I see some 300-400 threads running instead of the expected 64 or so. I don't think this is working as designed, but if it is, how can I limit the number of shards processed at a time so that my download pipeline works smoothly?
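(For anyone who wants to reproduce the thread-count observation: a small diagnostic along these lines, not part of video2dataset and assuming psutil is installed, can compare the live thread count of the main process and its workers against processes_count × thread_count.)

```python
# Illustrative diagnostic only (assumes `pip install psutil`).
# Usage: python count_threads.py <pid-of-main-video2dataset-process>
import sys
import psutil

main = psutil.Process(int(sys.argv[1]))
procs = [main] + main.children(recursive=True)   # main process plus all workers
total_threads = sum(p.num_threads() for p in procs)
print(f"{len(procs)} processes, {total_threads} threads in total")
```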

rom1504 (Collaborator) commented Jan 31, 2024

@MattUnderscoreZhang could you please take a look?

Indeed this is definitely not expected

pabl0 (Author) commented Jan 31, 2024

Reverting commit 83afef0 restored the old behavior for me. There still seems to be a much higher number of threads than processes_count × thread_count (I hadn't checked this before), but at least it creates the correct number of shards at a time.

Oh, BTW, for anyone grepping the source tree: the setting is called processes_count, not process_count (which might sound more correct alongside thread_count). It confused me at first, since the only occurrence was in the no_distributor() function.

rom1504 (Collaborator) commented Jan 31, 2024

Yeah I might have merged that a bit too fast. Will look more and we can probably revert if there's no easy fix

MattUnderscoreZhang (Contributor) commented Jan 31, 2024

I’ll take a look as soon as I get home. It’s likely a change I made to the semaphores.

I’ll commit a fix in a new pull request, and see if I can add a relevant unit test.
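(For context, the intended throttling can be illustrated with a minimal sketch, plain Python rather than the actual video2dataset code, where a shared semaphore caps how many shards are in flight at processes_count.)

```python
# Minimal sketch of semaphore-based shard throttling; not video2dataset's implementation.
import multiprocessing as mp

def init_worker(sem):
    # Make the shared semaphore visible inside each worker process.
    global _sem
    _sem = sem

def process_shard(shard):
    try:
        pass  # download and write this shard's .tar / .parquet here
    finally:
        _sem.release()  # free a slot so the feeder can hand out the next shard

def feed(shards, sem):
    for shard in shards:
        sem.acquire()  # blocks once processes_count shards are in flight
        yield shard

if __name__ == "__main__":
    processes_count = 4
    sem = mp.Semaphore(processes_count)
    with mp.Pool(processes_count, initializer=init_worker, initargs=(sem,)) as pool:
        for _ in pool.imap_unordered(process_shard, feed(range(50), sem)):
            pass
```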

MattUnderscoreZhang (Contributor) commented Feb 1, 2024

@rom1504 I created a pull request that should limit processing to the correct number of shards at a time (as tested on my local environment). However, I see the same thing that @pabl0 does: the number of threads created is higher than expected from processes_count × thread_count, and this goes far back in the commit history. It seems to be a bug that has existed for a long time.

Testing with a 5k sample dataset, I set number_sample_per_shard=100, processes_count=5, thread_count=2:
[screenshot]

As a result, my config requires 50 shards, but I only download 5 at a time:
[screenshot]

As you mentioned, the number of threads is higher than expected:
[screenshot]
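(Spelling out the expected numbers from that config: 5,000 samples at number_sample_per_shard=100 gives the 50 shards mentioned above, processes_count=5 means at most 5 shards in flight at once, and processes_count × thread_count = 5 × 2 suggests roughly 10 download threads, plus whatever helper threads Python and the underlying libraries spawn, which is why the observed thread count stands out.)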

pabl0 (Author) commented Feb 1, 2024

Thanks @rom1504 @MattUnderscoreZhang. Your fix does indeed seem to improve things in my quick testing. However, the current behavior seems to be that it creates batches of N tar files but only writes to one at a time, as if only one process is active while the rest do no work, and their tar files remain empty.

BTW, is the ".../_tmp/X.feather" output caused by the print statement added by the worker refactor PR? It feels like unnecessary debug output that should be dropped.

MattUnderscoreZhang (Contributor) commented Feb 3, 2024

@pabl0 I can't seem to replicate your observation. What environment are you running on, and what configs are you using? For reference, here's what I see: with five processes, 5 tar files are written simultaneously.

[screenshots]
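(One way to check how many shards are actually being written at once, an illustrative check that assumes one .tar per shard in a local output folder, is to sample the tar sizes twice and see how many grew in between.)

```python
# Illustrative check, not part of video2dataset: report how many shard archives
# grew over a short interval, i.e. how many are being written concurrently.
# Usage: python watch_shards.py ./output
import sys
import time
from pathlib import Path

out = Path(sys.argv[1])
before = {p: p.stat().st_size for p in out.glob("*.tar")}
time.sleep(5)
growing = [p.name for p in out.glob("*.tar") if p.stat().st_size > before.get(p, 0)]
print(f"{len(growing)} tar file(s) grew in the last 5s: {growing}")
```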

MattUnderscoreZhang (Contributor) commented

Also, I removed the print statement on the threading_fix path. Nice catch.

pabl0 (Author) commented Feb 4, 2024

I see. IIRC, the WebVid dataset consists of plain HTTP links, which use a different download backend than YouTube datasets. Is it possible they behave differently?
