Unexpected behaviour change after download worker refactor #302

Open
pabl0 opened this issue Jan 31, 2024 · 9 comments

pabl0 commented Jan 31, 2024

Previously, video2dataset would create and download distribution.process_count shards at a time. So if I had 10 processes and, say, 32 threads configured, it would create ten .tar and .parquet files, and after finishing, the _stats.json file would be created (triggering my automation to upload the downloaded data elsewhere), and video2dataset would move to processing a new shard.

After the recent changes, it seems to be processing hundreds of shards at a time. Also, with process_count 4 and thread_count 16, I see some 300-400 threads running instead of the expected 64 or so. I don't think this is working as designed, but if it is, how can I limit the number of shards processed at a time so that my download pipeline works smoothly?
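(For anyone who wants to reproduce the thread-count observation: a small diagnostic along these lines, not part of video2dataset and assuming psutil is installed, can compare the live thread count of the main process and its workers against processes_count × thread_count.)

```python
# Illustrative diagnostic only (assumes `pip install psutil`).
# Usage: python count_threads.py <pid-of-main-video2dataset-process>
import sys
import psutil

main = psutil.Process(int(sys.argv[1]))
procs = [main] + main.children(recursive=True)   # main process plus all workers
total_threads = sum(p.num_threads() for p in procs)
print(f"{len(procs)} processes, {total_threads} threads in total")
```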

rom1504 (Collaborator) commented Jan 31, 2024

@MattUnderscoreZhang could you please take a look?

Indeed this is definitely not expected

pabl0 (Author) commented Jan 31, 2024

Reverting commit 83afef0 restored the old behavior for me. There still seems to be a much higher number of threads than processes_count × thread_count (I hadn't checked this before), but at least it creates the correct number of shards at a time.

Oh, BTW, for anyone grepping the source tree: the setting is called processes_count, not process_count (which might sound more correct alongside thread_count). It confused me at first, since the only occurrence was in the no_distributor() function.

rom1504 (Collaborator) commented Jan 31, 2024

Yeah I might have merged that a bit too fast. Will look more and we can probably revert if there's no easy fix

MattUnderscoreZhang (Contributor) commented Jan 31, 2024

I’ll take a look as soon as I get home. It’s likely a change I made to the semaphores.

I’ll commit a fix in a new pull request, and see if I can add a relevant unit test.
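(For context, the intended throttling can be illustrated with a minimal sketch, plain Python rather than the actual video2dataset code, where a shared semaphore caps how many shards are in flight at processes_count.)

```python
# Minimal sketch of semaphore-based shard throttling; not video2dataset's implementation.
import multiprocessing as mp

def init_worker(sem):
    # Make the shared semaphore visible inside each worker process.
    global _sem
    _sem = sem

def process_shard(shard):
    try:
        pass  # download and write this shard's .tar / .parquet here
    finally:
        _sem.release()  # free a slot so the feeder can hand out the next shard

def feed(shards, sem):
    for shard in shards:
        sem.acquire()  # blocks once processes_count shards are in flight
        yield shard

if __name__ == "__main__":
    processes_count = 4
    sem = mp.Semaphore(processes_count)
    with mp.Pool(processes_count, initializer=init_worker, initargs=(sem,)) as pool:
        for _ in pool.imap_unordered(process_shard, feed(range(50), sem)):
            pass
```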

MattUnderscoreZhang (Contributor) commented Feb 1, 2024

@rom1504 I created a pull request that should limit processing to the correct number of shards at a time (as tested on my local environment). However, I see the same thing that @pabl0 does: the number of threads created is higher than expected from processes_count × thread_count, and this goes far back in the commit history. It seems to be a bug that has existed for a long time.

Testing with a 5k sample dataset, I set number_sample_per_shard=100, processes_count=5, thread_count=2:
[screenshot]

As a result, my config requires 50 shards, but I only download 5 at a time:
[screenshot]

As you mentioned, the number of threads is higher than expected:
[screenshot]
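(Spelling out the expected numbers from that config: 5,000 samples at number_sample_per_shard=100 gives the 50 shards mentioned above, processes_count=5 means at most 5 shards in flight at once, and processes_count × thread_count = 5 × 2 suggests roughly 10 download threads, plus whatever helper threads Python and the underlying libraries spawn, which is why the observed thread count stands out.)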

pabl0 (Author) commented Feb 1, 2024

Thanks @rom1504 @MattUnderscoreZhang. Your fix does indeed seem to improve things in my quick testing. However, the current behavior seems to be that it creates batches of N tar files but only writes to one at a time, as if only one process is active while the rest do no work, and their tar files remain empty.

BTW, is the ".../_tmp/X.feather" output caused by the print statement added by the worker refactor PR? It feels like unnecessary debug output that should be dropped.

MattUnderscoreZhang (Contributor) commented Feb 3, 2024

@pabl0 I can't seem to replicate your observation. What environment are you running on, and what configs are you using? For reference, here's what I see: with five processes, 5 tar files are written simultaneously.

[screenshots]
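(One way to check how many shards are actually being written at once, an illustrative check that assumes one .tar per shard in a local output folder, is to sample the tar sizes twice and see how many grew in between.)

```python
# Illustrative check, not part of video2dataset: report how many shard archives
# grew over a short interval, i.e. how many are being written concurrently.
# Usage: python watch_shards.py ./output
import sys
import time
from pathlib import Path

out = Path(sys.argv[1])
before = {p: p.stat().st_size for p in out.glob("*.tar")}
time.sleep(5)
growing = [p.name for p in out.glob("*.tar") if p.stat().st_size > before.get(p, 0)]
print(f"{len(growing)} tar file(s) grew in the last 5s: {growing}")
```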

MattUnderscoreZhang (Contributor) commented

Also, I removed the print statement on the threading_fix path. Nice catch.

pabl0 (Author) commented Feb 4, 2024

I see. IIRC, the WebVid dataset consists of plain HTTP links, which use a different download backend than YouTube datasets. Is it possible they behave differently?
