API: failures and race conditions with concurrent publish update operations #1125
Comments
This was probably introduced in the background API changes: #971

Despite all your awesome preparations 🎉 I was not able to reproduce your issues. However I encountered

Would you mind testing this version and seeing if it fixes the problem?
I could try it, but I suspect the ease of reproducing this depends on factors like I/O speed. Our setup has storage on an NFS mount and the database is huge (several mirrors of various Ubuntu releases). Deploying a development build on our production setup isn't really an option though, and if I try it in a fake setup, it will probably not be reproducible either. So we're sort of trusting your judgment that the fix will work. To avoid this bug in the meantime, we have wrapped some duct tape around it by implementing a mutex server and having each build script acquire this mutex before performing the publish update.
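To illustrate the shape of that duct tape, here is a minimal sketch in Python. It stands in for our actual mutex server with a simple flock-based lock file; the path, the helper name, and the idea of using a lock file at all are only assumptions for the example (flock over NFS would be unreliable, which is exactly why we run a dedicated server instead):

    #!/usr/bin/env python3
    """Illustration only: serialize publish updates across build scripts.
    Our real setup uses a small mutex server; this sketch uses a local lock file."""
    import fcntl
    from contextlib import contextmanager

    LOCK_PATH = "/var/lock/aptly-publish.lock"  # hypothetical path, not on NFS

    @contextmanager
    def publish_mutex(path=LOCK_PATH):
        # Block until no other build script holds the lock, then hold it
        # for the duration of the publish update.
        with open(path, "w") as fh:
            fcntl.flock(fh, fcntl.LOCK_EX)
            try:
                yield
            finally:
                fcntl.flock(fh, fcntl.LOCK_UN)

    # Usage in a build script:
    # with publish_mutex():
    #     trigger_publish_update()   # the PUT /api/publish/... call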
This message is logged both when doing a
@DrLex0 After digging into the async implementation a bit more, I do have a suspicion. As far as I understand, there is some resource management implemented in the async API; however
I got this issue today, and using the binary provided by @randombenj the problem is solved.
Using _async=true on publish updates seems to have fixed this, and we no longer see the apt hash sum mismatch errors. When this is triggered, the package can be found with
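For anyone who wants to try the same thing, here is roughly what our publish-update call looks like with the async flag. This is a sketch rather than our exact code: the server URL, prefix, and distribution are placeholders, I am assuming the requests package, and the empty JSON body is simply the smallest request I expect the endpoint to accept:

    #!/usr/bin/env python3
    # Sketch: trigger a publish update as a background task via _async=true.
    import requests

    APTLY = "http://your-aptly-server:8080"   # placeholder

    resp = requests.put(
        f"{APTLY}/api/publish/test-aptly-race-bug/jammy",
        params={"_async": "true"},   # run the update asynchronously
        json={},                     # no special publish options
    )
    resp.raise_for_status()
    print(resp.json())               # should describe the queued task rather than block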
@jmunson If it's the same race condition, then another ticket wouldn't be necessary.
Just an update: we're now hitting both

Concurrent operations seem less of a problem now, but we do still see complaints of missing packages, and my current suspicion is that the script that registers the file and then hits the publish endpoint is publishing before the file is done being added. I'm still not entirely confident this is what is happening and haven't had a clean reproduction yet, but I do wonder if what we really need now is to actually utilize the TaskIDs, so that after we add a file we call

Does this idea seem worthwhile to you, or is there any other approach I might be missing here?
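A rough sketch of what I have in mind, in case it makes the idea clearer. The endpoint paths, the _async parameter on the repo-add call, the task ID field, and the task state values are all assumptions on my part based on the background API changes, not verified against the code:

    #!/usr/bin/env python3
    # Idea sketch: add files to the repo as a background task, wait for that
    # task to finish by polling its ID, and only then hit the publish endpoint.
    import time
    import requests

    APTLY = "http://your-aptly-server:8080"   # placeholder

    def wait_for_task(task_id, poll=1.0, timeout=600):
        deadline = time.time() + timeout
        while time.time() < deadline:
            task = requests.get(f"{APTLY}/api/tasks/{task_id}").json()
            if task.get("State") in (2, 3):   # assumed: 2 = succeeded, 3 = failed
                return task
            time.sleep(poll)
        raise TimeoutError(f"task {task_id} did not finish within {timeout}s")

    # Register previously uploaded files with the repo as a background task...
    resp = requests.post(
        f"{APTLY}/api/repos/test-aptly-race-bug/file/my-upload-dir",   # hypothetical upload dir
        params={"_async": "true"},
    )
    task_id = resp.json()["ID"]   # assumed field name for the TaskID

    # ...and publish only once that task has completed.
    wait_for_task(task_id)
    requests.put(f"{APTLY}/api/publish/test-aptly-race-bug/jammy", json={})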
I stumbled on this issue in a use case that I guess is outside the typical usage. The setup I'm working with is a number of CIs that publish debs at any time, a CI that consumes debs for building a distro, and a number of devboxes that at any time could install a given version of a package or the latest. There are probably more elegant solutions than publishing on every package added, but for what I'm doing that is certainly the simplest. Since I do a lot of concurrent publishes (over the API), I get a lot of HTTP error responses. This can be worked around by just looping the HTTP PUT until it succeeds, but that is not so elegant. In an effort to be a good citizen (and a lazy developer who would rather not have to maintain workarounds), I've opened #1261, which adds a failing test and a really ugly workaround. The value of the MR is pretty much the test case as is, but I'm committed to pushing a better implementation/workaround (given that this is actually a use case that is desired to be supported). As far as I can tell there is no reason for not queuing up operations in the code (even though my submission is blocking rather than queuing). Is the issue or the MR the correct place for further discussion?
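For reference, the workaround I mentioned is nothing more than retrying the PUT until aptly stops returning an error. A minimal sketch, with arbitrary retry counts, placeholder arguments, and the requests package assumed:

    #!/usr/bin/env python3
    # Workaround sketch: retry the publish update until it succeeds, since
    # concurrent updates can make individual calls fail. Not elegant, but it works.
    import time
    import requests

    def publish_with_retries(base_url, prefix, distribution, attempts=10, delay=2.0):
        url = f"{base_url}/api/publish/{prefix}/{distribution}"
        for _ in range(attempts):
            resp = requests.put(url, json={})
            if resp.ok:
                return resp.json()
            time.sleep(delay)
        resp.raise_for_status()   # give up and surface the last error

    # publish_with_retries("http://your-aptly-server:8080", "my-prefix", "jammy")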
I updated the MR, see #1271; this should fix it :) Please test and confirm, thank you!
Detailed Description
Using Aptly 1.5.0.
Running 2 API calls

    PUT {api_url}/api/publish/{path}/{distribution}

on the same path at the same time will lead to various failures and potentially an unpredictable end result. It seems as if there is no locking at all on this operation? This looks like a regression introduced in the latest Aptly release: I cannot remember having encountered this while we were using version 1.4.0.
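The race can be demonstrated with nothing more than two of these calls fired at nearly the same time. The attached script (see below) does this more thoroughly; a minimal sketch of the idea, with a placeholder URL and the requests package assumed, would look like this:

    #!/usr/bin/env python3
    # Minimal illustration of the race: two publish updates against the same
    # prefix/distribution at almost the same time. One call, or the published
    # result, tends to end up broken.
    import threading
    import requests

    URL = "http://your-aptly-server:8080/api/publish/test-aptly-race-bug/jammy"

    def update(label):
        resp = requests.put(URL, json={})
        print(label, resp.status_code, resp.text[:200])

    threads = [threading.Thread(target=update, args=(f"call-{i}",)) for i in range(2)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()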
Context
We run builds on a Jenkins server that publish Debian packages to repositories. Often there are multiple builds that want to publish their different products to the same repository. Since there are multiple workers, it is not unusual that 2 builds try to upload their products and do a publish update at the same time. Often the build will fail in this case, which is already annoying enough on its own. But it gets worse: sometimes the API calls don't even seem to return an error and everything seems to be fine, but one of the packages is not in the repository. This will then lead to weird errors later on and a lot of wasted time figuring out what exactly happened.
Your Environment
Attached is a Python3 script with some Debian packages randomly taken from Ubuntu jammy that reproduced this problem on the first try for me.
aptly-publish-bug.tar.gz
For the test I created a repository called test-aptly-race-bug and published it. I have also added a random package to it, just to ensure the following steps will start from a similar state:

    aptly repo create test-aptly-race-bug
    aptly repo add test-aptly-race-bug vim_8.2.3995-1ubuntu2_amd64.deb
    aptly publish repo -distribution jammy test-aptly-race-bug test-aptly-race-bug

1. In one terminal, cd to the gnome-shell directory:

       ./publish_pkgs.py -a 'http://your-aptly-server:8080' test-aptly-race-bug jammy gnome-shell_42.4-0ubuntu0.22.04.1.dsc gnome-shell_42.4-0ubuntu0.22.04.1_source.changes

   (Adjust the -a argument to point to your aptly test server, or edit DEFAULT_APTLY_URI in the script itself and omit this parameter.)

2. In a second terminal, cd to the tzdata directory:

       ./publish_pkgs.py -a 'http://your-aptly-server:8080' test-aptly-race-bug jammy tzdata_2022e-0ubuntu0.22.04.0.dsc tzdata_2022e-0ubuntu0.22.04.0_source.changes

   (Again, update or omit the -a argument.)

3. Run the command in the gnome-shell terminal, and then as quickly as possible in the tzdata terminal as well, such that the 2 instances of the script start almost at the same time.

Journalctl on the aptly server during the failure:
It is obvious that the 2 instances of the API call are just running concurrently and get in each other's way because one instance has already moved/deleted files that the other is also using. The specific failure may vary depending on the timing of starting both instances of the script. Sometimes the API call does not break entirely, but the final result lacks one of the packages because one instance overwrote the modified Packages file from the other.